Synthesis Generation Pipeline — Full Algorithm

Initialization

  1. Load user settings from DB (categories, provider, models, max_items, etc.)
  2. Cleanup — delete old article history entries (>N days, dropped only) + truncate old LLM call logs
  3. Validate: if no categories are configured, only the default category "Autre" is used.
  4. Load user sources (personalized URLs like https://openai.com/blog)
  5. Resolve LLM provider — decrypt user's API key, create provider instance (Arc<dyn LlmProvider>)
  6. Resolve models — research model + writing model (user override or admin default)
  7. Setup rate limiter — per-user or global provider limiter
  8. Prepare LLM scraping option — if use_llm_for_article_extraction enabled, clone provider+model for concurrent use
  9. Initialize tracking structures: article_scraped (category→articles), source_counts (per-source article count), url_source (per-article source), filled_counts (per-category article count), seen_urls (cross-phase dedup), classification categories (user categories + "Autre")
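
As a rough illustration of step 9, the tracking structures could be grouped into one per-run state value. This is a minimal sketch, not the project's actual types: `Article`, `RunState`, `RunState::new`, and `RunState::record` are hypothetical names, the field names loosely follow the list above, and updating every counter in a single `record` call is an assumption made for brevity.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical article record; the real pipeline may carry more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Article {
    pub url: String,
    pub title: String,
    pub summary: String,
}

/// Per-run tracking state (illustrative grouping of the step 9 structures).
#[derive(Debug, Default)]
pub struct RunState {
    pub article_scraped: HashMap<String, Vec<Article>>, // category -> articles
    pub source_counts: HashMap<String, usize>,          // source -> article count
    pub url_source: HashMap<String, String>,            // article URL -> source
    pub filled_counts: HashMap<String, usize>,          // category -> article count
    pub seen_urls: HashSet<String>,                     // cross-phase dedup
    pub categories: Vec<String>,                        // user categories + "Autre"
}

impl RunState {
    /// Build the state from the user's categories, appending "Autre" if missing.
    pub fn new(user_categories: &[&str]) -> Self {
        let mut categories: Vec<String> =
            user_categories.iter().map(|c| c.to_string()).collect();
        if !categories.iter().any(|c| c == "Autre") {
            categories.push("Autre".to_string());
        }
        RunState { categories, ..Default::default() }
    }

    /// Record an accepted article. Returns false (and changes nothing else)
    /// when the URL was already seen in this run.
    pub fn record(&mut self, category: &str, source: &str, article: Article) -> bool {
        if !self.seen_urls.insert(article.url.clone()) {
            return false;
        }
        *self.source_counts.entry(source.to_string()).or_insert(0) += 1;
        *self.filled_counts.entry(category.to_string()).or_insert(0) += 1;
        self.url_source.insert(article.url.clone(), source.to_string());
        self.article_scraped
            .entry(category.to_string())
            .or_default()
            .push(article);
        true
    }
}
```

Keeping the counters next to the article map makes it harder for them to drift apart, at the cost of coupling Phase 1 and Phase 2 to one struct.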

Phase 1: Personalized Sources

Skipped entirely if user has 0 sources.

1a. Extract article links from source pages and filter against article history

  • Query article_history for the last source used, then reorder the personalized sources so that the list starts with the source that follows it (rolling window)
  • For each source, fetch the source page HTML:
    • If use_llm_for_source_links is enabled: send the HTML <head> + first 8000 chars of <body> to the LLM → extract up to 10 article URLs, most recent first. If the LLM call fails, fall back to the HTML parsing described below.
      • LLM call logged with full prompt/response/timing
    • Otherwise: parse HTML <a href> links, keep same-domain links with a non-homepage path, exclude /tag/, /login/, /contact/, /presentation/, /newsletter/, static assets, etc., and keep only the first 10 links found
    • Deduplicate candidate URLs
    • Hash each URL (normalized: lowercase, strip fragments/UTM params/trailing slashes)
    • Query article_history for existing hashes → remove matches
    • Trace dropped articles with status: filtered_history
    • Add each remaining URL to url_source
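
The rolling-window reorder, URL normalization, and history filter from step 1a can be sketched with the standard library alone. Everything here is an assumption for illustration: the function names are hypothetical, the normalization is a simple string-level approximation of "lowercase, strip fragments/UTM params/trailing slashes", and `DefaultHasher` stands in for whatever stable hash the real article_history table uses (its output is not guaranteed to be stable across Rust releases, so it would not be suitable for persisted hashes).

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Rotate the source list so it starts right after the last source used.
pub fn rotate_after_last_used(sources: &[String], last_used: Option<&str>) -> Vec<String> {
    let mut out = sources.to_vec();
    let pos = last_used.and_then(|l| sources.iter().position(|s| s.as_str() == l));
    if let Some(pos) = pos {
        out.rotate_left((pos + 1) % sources.len());
    }
    out
}

/// Lowercase, drop the fragment, drop utm_* query parameters, trim trailing slashes.
pub fn normalize_url(raw: &str) -> String {
    let lower = raw.trim().to_lowercase();
    let no_fragment = lower.split('#').next().unwrap_or("");
    let (base, query) = match no_fragment.split_once('?') {
        Some((b, q)) => (b, q),
        None => (no_fragment, ""),
    };
    let kept: Vec<&str> = query
        .split('&')
        .filter(|p| !p.is_empty() && !p.starts_with("utm_"))
        .collect();
    let mut out = base.trim_end_matches('/').to_string();
    if !kept.is_empty() {
        out.push('?');
        out.push_str(&kept.join("&"));
    }
    out
}

/// Hash of the normalized URL (stand-in for a stable, persisted hash).
pub fn url_hash(raw: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    normalize_url(raw).hash(&mut hasher);
    hasher.finish()
}

/// Deduplicate candidates, then split them into (kept, filtered_history).
pub fn filter_against_history(
    candidates: &[&str],
    history: &HashSet<u64>,
) -> (Vec<String>, Vec<String>) {
    let mut seen = HashSet::new();
    let mut kept = Vec::new();
    let mut dropped = Vec::new();
    for url in candidates {
        let h = url_hash(url);
        if !seen.insert(h) {
            continue; // duplicate candidate within this batch
        }
        if history.contains(&h) {
            dropped.push((*url).to_string()); // would be traced as filtered_history
        } else {
            kept.push((*url).to_string());
        }
    }
    (kept, dropped)
}
```

Hashing the normalized form rather than the raw URL is what lets `https://example.com/a/#frag` and `https://EXAMPLE.com/a` count as the same article.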

1b. Scrape, classify and summarize articles

  • For each URL from step 1a:
    • If the number of articles in source_counts for the source of the current URL exceeds max_articles_per_source:
      • Trace dropped article with status: filtered_diversity
      • Move to next URL
    • Fetch the URL (bounded concurrency: 5 with LLM extraction, 10 without).
    • SSRF check (no private IPs), 15s timeout, 5MB body limit.
    • HTML parsing heuristics for title (<title>, og:title), date (meta tags, JSON-LD, <time>), body (strip scripts/nav), soft-404 detection
    • If article scraped body text is empty (scrape failure, soft 404, too old):
      • Trace the dropped article in article_history with status: filtered_empty
      • Move to next url
    • Send the article (title + first 500 chars of body) + categories + "Autre" to the LLM. The LLM returns {title, summary, category}, mapping the article to a category. It generates the summary, and also a title if the provided title is empty
      • LLM call logged with full prompt/response/timing
    • Add the article to article_scraped and increase filled_counts
    • If the number of articles in this article's category exceeds max_items_per_category: change the article category to "Autre"
    • If the total number of articles in article_scraped exceeds the number of categories (including "Autre") × max_items_per_category, exit the loop and move to synthesis generation
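
The three limits in step 1b (per-source diversity, per-category overflow into "Autre", and the global stop condition) can be sketched as small pure functions over the counters. This is a sketch under stated assumptions: the function names are hypothetical, and the per-source and global caps are treated as inclusive upper bounds (a source or the run is considered full once its count equals the cap), which is one possible reading of "exceeds" in the steps above.

```rust
use std::collections::HashMap;

pub const FALLBACK_CATEGORY: &str = "Autre";

/// True while `source` may still contribute another article.
/// Assumption: the source is full once its count reaches `max_per_source`.
pub fn source_has_room(
    source_counts: &HashMap<String, usize>,
    source: &str,
    max_per_source: usize,
) -> bool {
    source_counts.get(source).copied().unwrap_or(0) < max_per_source
}

/// Choose the final category for an article and bump its counter.
/// If the LLM-assigned category is already full, the article goes to "Autre".
pub fn place_article(
    filled_counts: &mut HashMap<String, usize>,
    llm_category: &str,
    max_per_category: usize,
) -> String {
    let current = filled_counts.get(llm_category).copied().unwrap_or(0);
    let category = if current >= max_per_category {
        FALLBACK_CATEGORY
    } else {
        llm_category
    };
    *filled_counts.entry(category.to_string()).or_insert(0) += 1;
    category.to_string()
}

/// Global stop condition: total articles vs. categories x max_items_per_category.
/// `category_count` is expected to include "Autre".
pub fn phase1_is_full(
    filled_counts: &HashMap<String, usize>,
    category_count: usize,
    max_per_category: usize,
) -> bool {
    let total: usize = filled_counts.values().sum();
    total >= category_count * max_per_category
}
```

Note that in this reading "Autre" itself has no per-category cap: overflow keeps landing there until the global condition ends the loop.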

Phase 2: Web Search Fallback

Skipped if all user-defined categories are already filled to max_items_per_category.

2a. Compute category gaps

  • For each user category: needed = max_items_per_category - already_filled
  • Only proceed if any category needs more
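
The gap computation is a simple subtraction per category. A minimal sketch, assuming the per-category counts live in a map keyed by category name; `category_gaps` is a hypothetical helper, and a `BTreeMap` is used only to keep the output order deterministic. "Autre" is deliberately not passed in, since the step covers user categories only.

```rust
use std::collections::BTreeMap;

/// For each user category, compute how many more articles are needed.
/// Categories that are already full (or over-full) are left out of the result.
pub fn category_gaps(
    user_categories: &[&str],
    filled_counts: &BTreeMap<String, usize>,
    max_items_per_category: usize,
) -> BTreeMap<String, usize> {
    user_categories
        .iter()
        .filter_map(|c| {
            let already_filled = filled_counts.get(*c).copied().unwrap_or(0);
            // saturating_sub avoids underflow if a category is over its cap.
            let needed = max_items_per_category.saturating_sub(already_filled);
            if needed > 0 {
                Some(((*c).to_string(), needed))
            } else {
                None
            }
        })
        .collect()
}
```

Phase 2 would then proceed only when the returned map is non-empty.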

2b. LLM web search pass

  • Build search prompt with theme, categories, gap counts ("find N articles for category_1, M for category_2")
  • Send search prompt to LLM. LLM returns structured JSON: {category_0: [{title, url, summary}], category_1: [...]}
    • LLM call logged with full prompt/response/timing
  • Filter homepage URLs — drop articles with path / or empty
  • Cross-phase dedup — drop URLs already seen in Phase 1
  • Dedup by URL — drop duplicate URLs within Phase 2 (case-insensitive)
  • Limit articles per source — enforce max_articles_per_source per domain (spread across categories first, then fill)
  • Filter against article history — BEFORE scraping (saves HTTP requests), drop already-seen URLs
  • Each drop traced in article_history with appropriate status
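
The filter chain in step 2b can be sketched as a single pass that assigns each URL either a place in the kept list or a drop reason. This is a simplified illustration: `DropReason` and `filter_search_results` are hypothetical names, URLs are split with plain string operations rather than a full URL parser, the input is a flat list (so the "spread across categories first" part of the per-source limit is not modelled), the history filter is omitted, and `seen_phase1` is assumed to hold lowercased URLs.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq)]
pub enum DropReason {
    Homepage,
    SeenInPhase1,
    Duplicate,
    SourceLimit,
}

/// Split a URL into (host, path), ignoring scheme, query, and fragment.
fn host_and_path(url: &str) -> (&str, &str) {
    let rest = url.split_once("://").map_or(url, |(_, r)| r);
    let end = rest.find(|c: char| c == '?' || c == '#').unwrap_or(rest.len());
    let rest = &rest[..end];
    match rest.find('/') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, ""),
    }
}

/// Apply the Phase 2 filters in order; returns (kept, dropped-with-reason).
pub fn filter_search_results<'a>(
    urls: &[&'a str],
    seen_phase1: &HashSet<String>,
    max_articles_per_source: usize,
) -> (Vec<&'a str>, Vec<(&'a str, DropReason)>) {
    let mut kept = Vec::new();
    let mut dropped = Vec::new();
    let mut seen_here: HashSet<String> = HashSet::new();
    let mut per_domain: HashMap<String, usize> = HashMap::new();

    for &url in urls {
        let key = url.to_lowercase(); // case-insensitive comparison key
        let (host, path) = host_and_path(&key);
        let reason = if path.is_empty() || path == "/" {
            Some(DropReason::Homepage)
        } else if seen_phase1.contains(&key) {
            Some(DropReason::SeenInPhase1)
        } else if !seen_here.insert(key.clone()) {
            Some(DropReason::Duplicate)
        } else {
            let count = per_domain.entry(host.to_string()).or_insert(0);
            if *count >= max_articles_per_source {
                Some(DropReason::SourceLimit)
            } else {
                *count += 1;
                None
            }
        };
        match reason {
            Some(r) => dropped.push((url, r)),
            None => kept.push(url),
        }
    }
    (kept, dropped)
}
```

Returning the reason alongside each dropped URL keeps the tracing step separate from the filtering logic.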

2c. Scrape web search results

  • Same scraping as Phase 1 (bounded concurrency, SSRF check, optional LLM extraction)
  • Filter empty content (scrape failures, soft 404, too old)
  • Trace drops
  • Merge results into all_scraped
  • Move to synthesis generation
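
The SSRF check shared by both scraping phases can be sketched with `std::net` alone. The source only says "no private IPs"; also rejecting loopback, link-local, unspecified, broadcast, unique-local IPv6, and IPv4-mapped IPv6 addresses is an assumption made here, as are the names `is_forbidden_ip` and `all_public`. The constants restate the 15s timeout and 5MB body limit, with "5MB" read as 5 × 1024 × 1024 bytes.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};
use std::time::Duration;

/// Fetch limits from the scraping step ("5MB" interpreted as MiB here).
pub const FETCH_TIMEOUT: Duration = Duration::from_secs(15);
pub const MAX_BODY_BYTES: usize = 5 * 1024 * 1024;

fn forbidden_v4(ip: Ipv4Addr) -> bool {
    ip.is_private()
        || ip.is_loopback()
        || ip.is_link_local()
        || ip.is_unspecified()
        || ip.is_broadcast()
}

fn forbidden_v6(ip: Ipv6Addr) -> bool {
    // IPv4-mapped addresses (::ffff:a.b.c.d) are checked as IPv4.
    if let Some(v4) = ip.to_ipv4_mapped() {
        return forbidden_v4(v4);
    }
    let first = ip.segments()[0];
    ip.is_loopback()
        || ip.is_unspecified()
        || (first & 0xfe00) == 0xfc00 // unique local, fc00::/7
        || (first & 0xffc0) == 0xfe80 // link-local, fe80::/10
}

/// True if the address must not be contacted by the scraper.
pub fn is_forbidden_ip(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => forbidden_v4(v4),
        IpAddr::V6(v6) => forbidden_v6(v6),
    }
}

/// A host is acceptable only if it resolved to at least one address
/// and none of its addresses is forbidden.
pub fn all_public(resolved: &[IpAddr]) -> bool {
    !resolved.is_empty() && resolved.iter().all(|ip| !is_forbidden_ip(*ip))
}
```

For the check to be meaningful, it would need to run on the addresses the HTTP client actually connects to, including after redirects, rather than on a separate earlier DNS lookup.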

Save + Record

  • Sanitize — strip \u0000 null bytes from JSON (PostgreSQL rejects them in JSONB)
  • Save synthesis — insert into syntheses table with job_id, week (ISO week), sections (JSONB), status: completed
  • Record used articles — insert each article URL into article_history with status: used, synthesis_id, job_id, and category name (for future dedup + provenance)
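
The sanitize step can be illustrated on the decoded strings before they are serialized, which avoids pattern-matching on escaped JSON text. This is a sketch only: `SectionItem`, `strip_nul`, and `sanitize_item` are hypothetical names, and the real pipeline may instead sanitize the serialized JSON directly.

```rust
/// Hypothetical shape of one entry in a synthesis section.
#[derive(Debug, Clone, PartialEq)]
pub struct SectionItem {
    pub title: String,
    pub url: String,
    pub summary: String,
}

/// Remove U+0000 characters, which PostgreSQL does not accept in JSONB text.
pub fn strip_nul(s: &str) -> String {
    s.chars().filter(|&c| c != '\0').collect()
}

/// Return a copy of the item with every text field sanitized.
pub fn sanitize_item(item: &SectionItem) -> SectionItem {
    SectionItem {
        title: strip_nul(&item.title),
        url: strip_nul(&item.url),
        summary: strip_nul(&item.summary),
    }
}
```

Sanitizing at the field level means any JSON serializer can be used afterwards without emitting a `\u0000` escape.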