Synthesis Generation Pipeline — Full Algorithm

Initialization

Load user settings from DB (categories, provider, models, max_items, etc.)
Cleanup — delete old article history entries (>N days, dropped only) + truncate old LLM call logs
Validate — if no categories are configured, fall back to the single default category "Autre".
Load user sources (personalized URLs like https://openai.com/blog)
Resolve LLM provider — decrypt user's API key, create provider instance (Arc<dyn LlmProvider>)
Resolve models — research model + writing model (user override or admin default)
Setup rate limiter — per-user or global provider limiter
Prepare LLM scraping option — if use_llm_for_article_extraction enabled, clone provider+model for concurrent use
Initialize tracking structures — article_scraped (category→articles), source_counts (per-source article count), url_soucre (per-article source), filled_counts (per-category article count), seen_urls (cross-phase dedup), classification categories (user categories + "Autre")
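The tracking structures listed above could be grouped roughly as follows. This is a minimal sketch, not the project's actual definition: the `Tracking` and `Article` types, their field types, and the `mark_seen` helper are assumptions made for illustration; only the field names come from the list above.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical article record; the real fields are not given in this document.
#[allow(dead_code)]
#[derive(Debug, Clone)]
struct Article {
    title: String,
    url: String,
}

/// Assumed container for the per-run tracking state.
#[allow(dead_code)]
#[derive(Debug, Default)]
struct Tracking {
    article_scraped: HashMap<String, Vec<Article>>, // category -> articles
    source_counts: HashMap<String, usize>,          // source -> article count
    url_soucre: HashMap<String, String>,            // article URL -> source
    filled_counts: HashMap<String, usize>,          // category -> article count
    seen_urls: HashSet<String>,                     // cross-phase dedup
    classification_categories: Vec<String>,         // user categories + "Autre"
}

impl Tracking {
    /// Start with empty maps and the user categories, adding "Autre" if absent.
    fn new(user_categories: &[String]) -> Self {
        let mut categories = user_categories.to_vec();
        if !categories.iter().any(|c| c.as_str() == "Autre") {
            categories.push("Autre".to_string());
        }
        Tracking {
            classification_categories: categories,
            ..Default::default()
        }
    }

    /// Returns true the first time a URL is seen, false on any repeat.
    fn mark_seen(&mut self, url: &str) -> bool {
        self.seen_urls.insert(url.to_string())
    }
}
```

Keeping `seen_urls` as a set means a cross-phase duplicate check is a single `insert` call whose return value says whether the URL is new.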

Phase 1: User Sources

Skipped entirely if the user has 0 sources.

Query article_history for the last source used, then reorder the personalized sources so that the list starts with the source following it (rolling window).
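The rolling-window reordering can be sketched as a slice rotation. The function name `reorder_sources` and the choice to leave the order unchanged when the last-used source is unknown are assumptions, not behavior stated in this document.

```rust
/// Rotate `sources` so the list starts with the entry after `last_used`.
/// If `last_used` is None or is no longer in the list, the order is unchanged
/// (an assumed fallback).
fn reorder_sources(mut sources: Vec<String>, last_used: Option<&str>) -> Vec<String> {
    if let Some(last) = last_used {
        if let Some(pos) = sources.iter().position(|s| s.as_str() == last) {
            // Everything up to and including `last` moves to the back.
            let shift = (pos + 1) % sources.len();
            sources.rotate_left(shift);
        }
    }
    sources
}
```

With sources `[a, b, c]` and `a` as the last source used, the next run would visit `b, c, a`, so no single source is always processed first.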
For each source, fetch the source page HTML:
- If use_llm_for_source_links is enabled: send the HTML <head> plus the first 8000 chars of <body> to the LLM → extract up to 10 article URLs, most recent first. If the LLM call fails, fall back to the HTML parsing described below.
  - LLM call logged with full prompt/response/timing
- Otherwise: parse HTML <a href> links, keep same-domain links with a non-homepage path, exclude /tag/, /login/, /contact/, /presentation/, /newsletter/, static assets, etc., and keep only the first 10 links found
- Deduplicate candidate URLs
- Hash each URL (normalized: lowercase, strip fragments/UTM params/trailing slashes)
- Query article_history for existing hashes → remove matches
- Trace dropped articles with status: filtered_history
- Record the source of each remaining URL in url_soucre
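The normalization and hashing steps above might look like the sketch below. It uses only the standard library. The names `normalize_url` and `url_hash` are illustrative, and `DefaultHasher` stands in for whatever hash the pipeline actually uses: its output is not guaranteed stable across Rust releases, so a hash persisted in article_history would more likely be a fixed digest.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Lowercase the URL, drop the fragment, drop utm_* query parameters,
/// and strip trailing slashes from the path.
fn normalize_url(url: &str) -> String {
    let lower = url.trim().to_lowercase();
    // Everything after '#' is the fragment.
    let no_fragment = lower.split('#').next().unwrap_or("");
    let (base, query) = match no_fragment.split_once('?') {
        Some((b, q)) => (b, q),
        None => (no_fragment, ""),
    };
    // Keep every query parameter except tracking ones.
    let kept: Vec<&str> = query
        .split('&')
        .filter(|p| !p.is_empty() && !p.starts_with("utm_"))
        .collect();
    let mut out = base.trim_end_matches('/').to_string();
    if !kept.is_empty() {
        out.push('?');
        out.push_str(&kept.join("&"));
    }
    out
}

/// Hash of the normalized URL (placeholder hasher, see the note above).
fn url_hash(url: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    normalize_url(url).hash(&mut hasher);
    hasher.finish()
}
```

Two spellings of the same article, such as one with a trailing slash and a `utm_source` parameter and one without, normalize to the same string and therefore to the same hash, so the history lookup treats them as one entry.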

Phase 2: LLM Search

Skipped if all user-defined categories are already filled to max_items_per_category.

Build search prompt with theme, categories, gap counts ("find N articles for category_1, M for category_2")
Send search prompt to LLM. LLM returns structured JSON: {category_0: [{title, url, summary}], category_1: [...]}
- LLM call logged with full prompt/response/timing
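As an illustration of the gap counts fed into the search prompt, the sketch below computes how many articles each category still needs and renders them into a prompt string. The function names and the prompt wording are hypothetical; the real prompt template is not shown in this document.

```rust
use std::collections::HashMap;

/// For each category, how many articles are still missing to reach `max_items`.
/// Categories that are already full are omitted.
fn gap_counts(
    filled_counts: &HashMap<String, usize>,
    categories: &[String],
    max_items: usize,
) -> Vec<(String, usize)> {
    categories
        .iter()
        .filter_map(|category| {
            let have = filled_counts.get(category).copied().unwrap_or(0);
            let gap = max_items.saturating_sub(have);
            (gap > 0).then(|| (category.clone(), gap))
        })
        .collect()
}

/// Render the gaps as "find N articles for X" requests (wording is an assumption).
fn build_search_prompt(theme: &str, gaps: &[(String, usize)]) -> String {
    let wants: Vec<String> = gaps
        .iter()
        .map(|(category, n)| format!("{n} articles for {category}"))
        .collect();
    format!("Theme: {theme}. Find {}.", wants.join(", "))
}
```

An empty result from `gap_counts` corresponds to the case where every category is already filled, which is when this phase is skipped.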
Filter homepage URLs — drop articles with path / or empty
Cross-phase dedup — drop URLs already seen in Phase 1
Dedup by URL — drop duplicate URLs within Phase 2 (case-insensitive)
Limit articles per source — enforce max_articles_per_source per domain (spread across categories first, then fill)
Filter against article history — drop already-seen URLs before scraping, which saves HTTP requests
Each drop is traced in article_history with the appropriate status
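One possible reading of "spread across categories first, then fill" for the per-source limit is a round-robin pass: take the first candidate of every category, then the second of every category, and so on, accepting an article only while its domain is under the cap. The sketch below follows that reading. The `Article` type, `domain_of`, and `limit_per_source` are illustrative names, and the domain extraction is deliberately naive.

```rust
use std::collections::HashMap;

/// Hypothetical candidate record for this sketch.
#[derive(Debug, Clone, PartialEq)]
struct Article {
    category: String,
    url: String,
}

/// Naive host extraction: text between "://" and the next '/'.
fn domain_of(url: &str) -> &str {
    let rest = url.split_once("://").map_or(url, |(_, r)| r);
    rest.split('/').next().unwrap_or(rest)
}

/// Round-robin over categories, keeping at most `max_per_source` articles per
/// domain. Returns (kept, dropped) so the drops can be traced afterwards.
fn limit_per_source(
    by_category: &[Vec<Article>],
    max_per_source: usize,
) -> (Vec<Article>, Vec<Article>) {
    let mut counts: HashMap<String, usize> = HashMap::new();
    let mut kept = Vec::new();
    let mut dropped = Vec::new();
    let longest = by_category.iter().map(Vec::len).max().unwrap_or(0);
    for round in 0..longest {
        for list in by_category {
            if let Some(article) = list.get(round) {
                let count = counts.entry(domain_of(&article.url).to_string()).or_insert(0);
                if *count < max_per_source {
                    *count += 1;
                    kept.push(article.clone());
                } else {
                    dropped.push(article.clone());
                }
            }
        }
    }
    (kept, dropped)
}
```

Visiting categories in rounds keeps one prolific domain from spending its whole quota inside the first category before the others get a turn.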

Same scraping as Phase 1 (bounded concurrency, SSRF check, optional LLM extraction)
Filter empty content (scrape failures, soft 404, too old)
Trace drops
Merge results into all_scraped
Move to synthesis generation

Sanitize — strip \u0000 null bytes from JSON (PostgreSQL rejects them in JSONB)
Save synthesis — insert into syntheses table with job_id, week (ISO week), sections (JSONB), status: completed
Record used articles — insert each article URL into article_history with status: used, synthesis_id, job_id, and category name (for future dedup + provenance)
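The sanitize step can be sketched as removing NUL characters from every text field before the sections are serialized to JSONB. The `Section` type and its fields are assumptions for illustration; the document does not describe the actual section schema.

```rust
/// Hypothetical section shape; the real JSONB schema is not described here.
#[derive(Debug, Clone, PartialEq)]
struct Section {
    title: String,
    body: String,
}

/// Remove U+0000 characters, which PostgreSQL rejects inside JSONB text.
fn strip_nul(text: &str) -> String {
    text.chars().filter(|&c| c != '\0').collect()
}

/// Sanitize every text field of every section in place.
fn sanitize_sections(sections: &mut [Section]) {
    for section in sections.iter_mut() {
        section.title = strip_nul(&section.title);
        section.body = strip_nul(&section.body);
    }
}
```

Cleaning the strings before serialization avoids editing the serialized JSON text, where a naive search for the `\u0000` escape could also match inside an escaped backslash sequence.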