Synthesis Generation Pipeline — Full Algorithm

Initialization

  1. Load user settings from DB (categories, provider, models, max_items, etc.)
  2. Cleanup — delete old article history entries (>N days, dropped only) + truncate old LLM call logs
  3. Validate — fail if no categories configured
  4. Load user sources (personalized URLs like https://openai.com/blog)
  5. Resolve LLM provider — decrypt user's API key, create provider instance (Arc<dyn LlmProvider>)
  6. Resolve models — research model + writing model (user override or admin default)
  7. Setup rate limiter — per-user or global provider limiter
  8. Prepare LLM scraping option — if use_llm_for_article_extraction enabled, clone provider+model for concurrent use
  9. Initialize tracking structures: filled_counts (per-category article count), all_scraped (category→articles), all_overflow (dropped overflow), seen_urls (cross-phase dedup), classification categories (user categories + "Autre")
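
The tracking state from step 9 could be grouped roughly as follows. This is only a sketch: the struct names `PipelineState` and `Article`, their field types, and the `mark_seen` helper are assumptions for illustration, not the project's actual definitions.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical article record; the field names are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct Article {
    pub url: String,
    pub title: String,
    pub body: String,
}

/// Hypothetical container for the per-run tracking state listed in step 9.
#[derive(Debug, Default)]
pub struct PipelineState {
    /// Number of articles accepted so far, per category.
    pub filled_counts: HashMap<String, usize>,
    /// Accepted articles, grouped by category.
    pub all_scraped: HashMap<String, Vec<Article>>,
    /// Articles that fit neither their category nor "Autre".
    pub all_overflow: Vec<Article>,
    /// URLs already seen, shared by Phase 1 and Phase 2.
    pub seen_urls: HashSet<String>,
    /// User categories followed by the catch-all "Autre".
    pub classification_categories: Vec<String>,
}

impl PipelineState {
    pub fn new(user_categories: &[String]) -> Self {
        let mut classification_categories = user_categories.to_vec();
        classification_categories.push("Autre".to_string());
        Self {
            classification_categories,
            ..Default::default()
        }
    }

    /// Returns true the first time a URL is seen, false on repeats.
    pub fn mark_seen(&mut self, url: &str) -> bool {
        self.seen_urls.insert(url.to_string())
    }
}
```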

Phase 1: Personalized Sources

Skipped entirely if user has 0 sources.

1a. Extract article links from source pages

  • For each source (max 10), fetch the source page HTML
  • If use_llm_for_source_links enabled: send HTML <head> + first 8000 chars of <body> to LLM → extract article URLs (falls back to heuristic if LLM fails)
  • Otherwise: parse HTML <a href> links, filter by same-domain, non-homepage path, exclude /tag/, /login/, static assets, etc.
  • Over-fetch: 2 × max_articles_per_source candidates per source
  • Deduplicate candidate URLs
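
The heuristic (non-LLM) branch of step 1a might look like the sketch below. It assumes absolute http(s) links, uses a deliberately simplified std-only URL split instead of a real URL parser, and the excluded path segments and static-asset extensions are illustrative stand-ins for the "etc." in the list above. The function names are hypothetical, and deduplicating before applying the 2 × cap is also an assumption.

```rust
use std::collections::HashSet;

/// Splits an absolute http(s) URL into (lowercased host, path).
/// Simplified on purpose; a real implementation would use a URL parser.
fn host_and_path(url: &str) -> Option<(String, String)> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    // Drop the query string and fragment before splitting host from path.
    let rest = rest
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or("");
    let (host, path) = match rest.find('/') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, "/"),
    };
    if host.is_empty() {
        return None;
    }
    Some((host.to_ascii_lowercase(), path.to_string()))
}

// Illustrative exclusion lists (assumptions, not the project's real lists).
const EXCLUDED_SEGMENTS: [&str; 2] = ["/tag/", "/login/"];
const STATIC_EXTENSIONS: [&str; 6] = [".css", ".js", ".png", ".jpg", ".svg", ".ico"];

/// Same domain, non-homepage path, no excluded segment, no static asset.
pub fn is_candidate_link(source_host: &str, href: &str) -> bool {
    let (host, path) = match host_and_path(href) {
        Some(parts) => parts,
        None => return false,
    };
    let path_lower = path.to_ascii_lowercase();
    host == source_host.to_ascii_lowercase()
        && path != "/"
        && !EXCLUDED_SEGMENTS.iter().any(|s| path_lower.contains(s))
        && !STATIC_EXTENSIONS.iter().any(|e| path_lower.ends_with(e))
}

/// Keeps up to 2 x max_articles_per_source unique candidates, in page order.
pub fn select_candidates(
    source_host: &str,
    hrefs: &[&str],
    max_articles_per_source: usize,
) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for href in hrefs {
        if out.len() >= 2 * max_articles_per_source {
            break;
        }
        if is_candidate_link(source_host, href) && seen.insert(href.to_string()) {
            out.push(href.to_string());
        }
    }
    out
}
```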

1b. Scrape candidate articles

  • Fetch each URL (bounded concurrency: 5 with LLM extraction, 10 without)
  • SSRF check (no private IPs), 15s timeout, 5MB body limit
  • If use_llm_for_article_extraction enabled: send <head> + body text to LLM → extract title, date, body, error detection (falls back to heuristic if LLM fails)
  • Otherwise: HTML parsing heuristics for title (<title>, og:title), date (meta tags, JSON-LD, <time>), body (strip scripts/nav), soft-404 detection
  • Capture final URL after redirects (canonical URL)
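
A minimal sketch of the scraping guards in step 1b, using only the standard library. The exact set of blocked ranges, the reading of "5MB" as 5 × 1024 × 1024 bytes, and all function and constant names are assumptions; the source only states "no private IPs", a 15s timeout, and a 5MB limit.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

pub const FETCH_TIMEOUT_SECS: u64 = 15;
/// Assumes "5MB" means 5 MiB; the source does not say which.
pub const MAX_BODY_BYTES: usize = 5 * 1024 * 1024;

/// Concurrency bound from step 1b: 5 with LLM extraction, 10 without.
pub fn scrape_concurrency(use_llm_for_article_extraction: bool) -> usize {
    if use_llm_for_article_extraction { 5 } else { 10 }
}

fn is_blocked_v4(ip: Ipv4Addr) -> bool {
    ip.is_private()
        || ip.is_loopback()
        || ip.is_link_local()
        || ip.is_unspecified()
        || ip.is_broadcast()
}

fn is_blocked_v6(ip: Ipv6Addr) -> bool {
    let first = ip.segments()[0];
    ip.is_loopback()
        || ip.is_unspecified()
        || (first & 0xfe00) == 0xfc00 // fc00::/7, unique local
        || (first & 0xffc0) == 0xfe80 // fe80::/10, link-local
}

/// Returns true when a resolved address must not be fetched.
pub fn is_blocked_ip(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => is_blocked_v4(v4),
        // IPv4-mapped IPv6 addresses are judged by their embedded IPv4 address.
        IpAddr::V6(v6) => match v6.to_ipv4_mapped() {
            Some(v4) => is_blocked_v4(v4),
            None => is_blocked_v6(v6),
        },
    }
}
```

The check is meant to run on the resolved address, after DNS resolution and again after each redirect, since a public hostname can point at a private IP.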

1c. Filter empty content

  • Remove articles where scraped body text is empty (scrape failure, soft 404, too old)
  • Trace dropped articles in article_history with status: filtered_empty

1d. Filter against article history

  • Hash each URL (normalized: lowercase, strip fragments/UTM params/trailing slashes)
  • Query article_history for existing hashes → remove matches
  • Trace dropped articles with status: filtered_history
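
The normalization and history lookup in step 1d can be sketched as below. The function names are hypothetical, and `DefaultHasher` is used only to keep the example std-only: its output is not guaranteed stable across Rust releases, so a persisted article_history table would need a stable digest instead (which one the project uses is not stated).

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Lowercases the URL, strips the fragment, utm_* parameters and trailing slashes.
pub fn normalize_url(url: &str) -> String {
    let lower = url.trim().to_lowercase();
    let without_fragment = lower.split('#').next().unwrap_or("");
    let (base, query) = match without_fragment.split_once('?') {
        Some((b, q)) => (b, q),
        None => (without_fragment, ""),
    };
    // Keep every query parameter except the UTM tracking ones.
    let kept: Vec<&str> = query
        .split('&')
        .filter(|p| !p.is_empty() && !p.starts_with("utm_"))
        .collect();
    let base = base.trim_end_matches('/');
    if kept.is_empty() {
        base.to_string()
    } else {
        format!("{}?{}", base, kept.join("&"))
    }
}

/// Illustrative hash of the normalized URL (not stable across Rust versions).
pub fn url_hash(url: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    normalize_url(url).hash(&mut hasher);
    hasher.finish()
}

/// Splits URLs into (kept, dropped) according to the hashes already in history.
pub fn filter_history(urls: &[String], known: &HashSet<u64>) -> (Vec<String>, Vec<String>) {
    let mut kept = Vec::new();
    let mut dropped = Vec::new();
    for url in urls {
        if known.contains(&url_hash(url)) {
            dropped.push(url.clone()); // would be traced as filtered_history
        } else {
            kept.push(url.clone());
        }
    }
    (kept, dropped)
}
```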

1e. Retry if under-filled

  • Triggered when valid articles < categories × max_items_per_category and history is enabled
  • Re-scrape source pages for NEW links (exclude already-fetched URLs)
  • Scrape + filter empty + filter history on retry candidates
  • Merge with existing valid articles
  • Only 1 retry attempt
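
The retry decision in step 1e reduces to a small predicate plus a set difference. The sketch below uses hypothetical names (`needs_retry`, `new_links_only`, `MAX_RETRIES`) and takes the retry counter as a parameter, which is an assumption about how the single-attempt limit is tracked.

```rust
use std::collections::HashSet;

/// Step 1e allows a single retry.
pub const MAX_RETRIES: u32 = 1;

/// True when Phase 1 is under-filled, history is enabled, and no retry has run yet.
pub fn needs_retry(
    valid_articles: usize,
    category_count: usize,
    max_items_per_category: usize,
    history_enabled: bool,
    retries_done: u32,
) -> bool {
    history_enabled
        && retries_done < MAX_RETRIES
        && valid_articles < category_count * max_items_per_category
}

/// Keeps only the links that were not fetched during the first pass.
pub fn new_links_only(candidates: &[String], already_fetched: &HashSet<String>) -> Vec<String> {
    candidates
        .iter()
        .filter(|u| !already_fetched.contains(*u))
        .cloned()
        .collect()
}
```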

1f. LLM classification

  • Send articles (title + first 500 chars of body) + categories + "Autre" to LLM
  • LLM returns {assignments: [{index, category}]} mapping each article to a category
  • Overflow: articles that exceed both target category AND "Autre" limits → collected in all_overflow
  • LLM call logged with full prompt/response/timing
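
Once the LLM has returned its assignments, placing each article follows the overflow rule above: try the assigned category, then "Autre", otherwise overflow. In this sketch the `Placement` enum and `place` function are hypothetical, and "Autre" is assumed to share the same max_items_per_category limit as user categories, which the source does not state.

```rust
use std::collections::HashMap;

pub const AUTRE: &str = "Autre";

#[derive(Debug, PartialEq)]
pub enum Placement {
    /// Article accepted into the named category (possibly "Autre").
    Category(String),
    /// Both the target category and "Autre" are full.
    Overflow,
}

/// Tries the assigned category first, then "Autre", and updates filled_counts.
pub fn place(
    assigned_category: &str,
    filled_counts: &mut HashMap<String, usize>,
    max_items_per_category: usize,
) -> Placement {
    for candidate in [assigned_category, AUTRE] {
        let count = filled_counts.entry(candidate.to_string()).or_insert(0);
        if *count < max_items_per_category {
            *count += 1;
            return Placement::Category(candidate.to_string());
        }
    }
    Placement::Overflow
}
```

Because filled_counts is passed in and mutated, the same function can serve the Phase 2 classification, where counts carry over from Phase 1.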

1g. Enforce source diversity

  • Count domains across all categories
  • Remove articles where domain exceeds max_articles_per_source
  • Trace dropped articles with status: filtered_diversity
  • Recount category fill levels
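
Source diversity enforcement can be sketched as a single pass with a per-domain counter. The assumptions here: articles are represented by their URL only, the domain is the lowercased host, categories are visited in key order (a `BTreeMap` keeps the example deterministic), and the first articles encountered for a domain are the ones kept. Function names are hypothetical.

```rust
use std::collections::{BTreeMap, HashMap};

/// Lowercased host of a URL; simplified std-only extraction.
fn domain_of(url: &str) -> String {
    let rest = url.split("://").nth(1).unwrap_or(url);
    rest.split(|c: char| c == '/' || c == '?' || c == '#')
        .next()
        .unwrap_or("")
        .to_ascii_lowercase()
}

/// Removes articles whose domain already reached the limit; returns the dropped URLs.
pub fn enforce_diversity(
    all_scraped: &mut BTreeMap<String, Vec<String>>,
    max_articles_per_source: usize,
) -> Vec<String> {
    let mut per_domain: HashMap<String, usize> = HashMap::new();
    let mut dropped = Vec::new();
    for urls in all_scraped.values_mut() {
        urls.retain(|url| {
            let count = per_domain.entry(domain_of(url)).or_insert(0);
            if *count < max_articles_per_source {
                *count += 1;
                true
            } else {
                dropped.push(url.clone()); // would be traced as filtered_diversity
                false
            }
        });
    }
    dropped
}

/// Recomputes the per-category fill levels after the drop.
pub fn recount(all_scraped: &BTreeMap<String, Vec<String>>) -> HashMap<String, usize> {
    all_scraped
        .iter()
        .map(|(category, urls)| (category.clone(), urls.len()))
        .collect()
}
```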

Phase 2: Web Search Fallback

Skipped if all user-defined categories are already filled to max_items_per_category.

2a. Compute category gaps

  • For each user category: needed = max - already_filled
  • Only proceed if any category needs more
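
The gap computation is a saturating subtraction per user category. A minimal sketch, with a hypothetical function name and `max` standing for max_items_per_category:

```rust
use std::collections::HashMap;

/// Returns (category, needed) for every user category that is still under-filled.
/// An empty result means Phase 2 can be skipped.
pub fn category_gaps(
    user_categories: &[String],
    filled_counts: &HashMap<String, usize>,
    max_items_per_category: usize,
) -> Vec<(String, usize)> {
    user_categories
        .iter()
        .filter_map(|category| {
            let filled = filled_counts.get(category).copied().unwrap_or(0);
            // saturating_sub guards against a category that is somehow over-filled.
            let needed = max_items_per_category.saturating_sub(filled);
            if needed > 0 {
                Some((category.clone(), needed))
            } else {
                None
            }
        })
        .collect()
}
```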

2b. Load recent domains for diversity

  • If source_diversity_window > 0: extract domains from last N syntheses' JSONB sections
  • Used as soft "avoid if possible" instruction in search prompt

2c. LLM web search pass

  • Build search prompt with theme, categories, gap counts ("find N articles for AI News, M for Cybersecurity"), recent domains to avoid, personalized source URLs
  • Call provider.generate_search_pass() with web search tool enabled
  • LLM call logged with full prompt/response/timing
  • Returns structured JSON: {category_0: [{title, url, summary}], category_1: [...]}
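
As an illustration of how the gap counts might be turned into the instruction quoted in step 2c, the sketch below formats a list of (category, needed) pairs. The function name and the exact wording are hypothetical; the real prompt text is not shown in this document.

```rust
/// Formats gaps as "find N articles for A, M for B" (wording is illustrative).
pub fn gap_instruction(gaps: &[(String, usize)]) -> String {
    let parts: Vec<String> = gaps
        .iter()
        .enumerate()
        .map(|(i, (name, needed))| {
            if i == 0 {
                format!("find {} articles for {}", needed, name)
            } else {
                format!("{} for {}", needed, name)
            }
        })
        .collect();
    parts.join(", ")
}
```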

2d. Filter pipeline on search results

  • Parse LLM output into (category_key, Vec<NewsItem>)
  • Filter homepage URLs — drop articles with path / or empty
  • Cross-phase dedup — drop URLs already seen in Phase 1
  • Dedup by URL — drop duplicate URLs within Phase 2 (case-insensitive)
  • Limit articles per source — enforce max_articles_per_source per domain (spread across categories first, then fill)
  • Filter against article history — BEFORE scraping (saves HTTP requests), drop already-seen URLs
  • Each drop traced in article_history with appropriate status
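
The first three filters of step 2d (homepage, cross-phase dedup, in-phase dedup) can be sketched as one ordered pass. The `DropReason` enum is a hypothetical stand-in for the article_history statuses, whose names are not listed for these steps, and the sketch assumes the Phase 1 set already stores lowercased URLs.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
pub enum DropReason {
    Homepage,
    SeenInPhase1,
    DuplicateInPhase2,
}

/// True when the URL path is "/" or empty.
fn is_homepage(url: &str) -> bool {
    let rest = url.split("://").nth(1).unwrap_or(url);
    let rest = rest
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or("");
    match rest.find('/') {
        None => true,
        Some(i) => &rest[i..] == "/",
    }
}

/// Applies the filters in order and returns (kept, dropped-with-reason).
pub fn filter_search_results(
    urls: &[&str],
    phase1_seen: &HashSet<String>,
) -> (Vec<String>, Vec<(String, DropReason)>) {
    let mut phase2_seen = HashSet::new();
    let mut kept = Vec::new();
    let mut dropped = Vec::new();
    for url in urls {
        let key = url.to_lowercase(); // case-insensitive comparison
        if is_homepage(url) {
            dropped.push((url.to_string(), DropReason::Homepage));
        } else if phase1_seen.contains(&key) {
            dropped.push((url.to_string(), DropReason::SeenInPhase1));
        } else if !phase2_seen.insert(key) {
            dropped.push((url.to_string(), DropReason::DuplicateInPhase2));
        } else {
            kept.push(url.to_string());
        }
    }
    (kept, dropped)
}
```

The per-source limit and the history filter would follow on the kept list, before any HTTP request is made.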

2e. Scrape web search results

  • Same scraping as Phase 1 (bounded concurrency, SSRF check, optional LLM extraction)
  • Filter empty content (scrape failures, soft 404, too old)
  • Trace drops

2f. LLM classification

  • Same as Phase 1 classification but with Phase 2 articles
  • filled_counts carries over from Phase 1 — categories already partially filled
  • Overflow collected
  • LLM call logged
  • Merge results into all_scraped

"Autre" Fill-Up

  • Count total articles across all categories
  • Target = 75% × (categories × max_items_per_category) (user categories only, "Autre" excluded from denominator)
  • If shortfall > 0 and overflow exists:
    • For each overflow article: check if domain is under max_articles_per_source limit
    • Add to all_scraped["category_autre"] up to the shortfall
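
The fill-up arithmetic, as a sketch: with 3 user categories and max_items_per_category = 4, the target is 75% × 12 = 9 articles, so a run holding 5 articles has a shortfall of 4. The function names are hypothetical, rounding the target down is an assumption, and articles are again represented by their URL only.

```rust
use std::collections::HashMap;

/// Lowercased host of a URL; simplified std-only extraction.
fn domain_of(url: &str) -> String {
    let rest = url.split("://").nth(1).unwrap_or(url);
    rest.split(|c: char| c == '/' || c == '?' || c == '#')
        .next()
        .unwrap_or("")
        .to_ascii_lowercase()
}

/// Shortfall against 75% of (user categories x max_items_per_category).
pub fn autre_shortfall(
    total_articles: usize,
    user_categories: usize,
    max_items_per_category: usize,
) -> usize {
    // Integer division rounds the target down (an assumption).
    let target = user_categories * max_items_per_category * 3 / 4;
    target.saturating_sub(total_articles)
}

/// Picks overflow articles for "Autre", respecting the per-domain limit.
pub fn fill_autre(
    overflow: &[String],
    domain_counts: &mut HashMap<String, usize>,
    max_articles_per_source: usize,
    shortfall: usize,
) -> Vec<String> {
    let mut added = Vec::new();
    for url in overflow {
        if added.len() >= shortfall {
            break;
        }
        let count = domain_counts.entry(domain_of(url)).or_insert(0);
        if *count < max_articles_per_source {
            *count += 1;
            added.push(url.clone());
        }
    }
    added
}
```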

Combined Rewrite Pass

  • Fail if no articles — return error if all categories are empty
  • Build rewrite prompt — serialize all scraped articles with body content, instruct LLM to rewrite title + summary (4-5 lines) faithfully based on scraped content
  • Build rewrite schema: minItems/maxItems set to the ACTUAL count per category (not the user max), empty categories omitted, "Autre" included if non-empty
  • LLM rewrite pass — call provider.generate_rewrite_pass() with writing model
  • LLM call logged with full prompt/response/timing
  • Build final sections — map category_N keys to user category names, add "Autre" section if present, omit empty categories
  • Restore scraped URLs — replace any hallucinated URLs from LLM rewrite with the validated scraped URLs (matched by category + position)
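
Two of the steps above lend themselves to a short sketch: deriving the per-category minItems/maxItems from the actual article counts, and restoring scraped URLs by category and position after the rewrite. The `NewsItem` field layout is assumed from the {title, url, summary} shape mentioned in Phase 2, and the function names are hypothetical.

```rust
use std::collections::BTreeMap;

/// Assumed layout, based on the {title, url, summary} shape of search results.
#[derive(Debug, Clone, PartialEq)]
pub struct NewsItem {
    pub title: String,
    pub url: String,
    pub summary: String,
}

/// (minItems, maxItems) per category, both set to the actual count.
/// Empty categories are omitted from the schema.
pub fn item_bounds(
    scraped_urls: &BTreeMap<String, Vec<String>>,
) -> BTreeMap<String, (usize, usize)> {
    scraped_urls
        .iter()
        .filter(|(_, urls)| !urls.is_empty())
        .map(|(category, urls)| (category.clone(), (urls.len(), urls.len())))
        .collect()
}

/// Overwrites each rewritten item's URL with the scraped URL at the same
/// category and position, discarding anything the LLM may have invented.
pub fn restore_urls(
    rewritten: &mut BTreeMap<String, Vec<NewsItem>>,
    scraped_urls: &BTreeMap<String, Vec<String>>,
) {
    for (category, items) in rewritten.iter_mut() {
        if let Some(urls) = scraped_urls.get(category) {
            for (item, url) in items.iter_mut().zip(urls) {
                item.url = url.clone();
            }
        }
    }
}
```

Pinning minItems and maxItems to the same value is what makes position-based matching safe: the rewrite cannot add or drop items within a category.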

Save + Record

  • Sanitize — strip \u0000 null bytes from JSON (PostgreSQL rejects them in JSONB)
  • Save synthesis — insert into syntheses table with job_id, week (ISO week), sections (JSONB), status: completed
  • Record used articles — insert each article URL into article_history with status: used, synthesis_id, job_id, and category name (for future dedup + provenance)
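
The sanitize step can be sketched on decoded string values, before they are serialized to JSON. This is an assumption about where the stripping happens; the function names are hypothetical.

```rust
/// Removes NUL characters, which PostgreSQL rejects inside JSONB text.
pub fn strip_nul(text: &str) -> String {
    text.chars().filter(|&c| c != '\u{0000}').collect()
}

/// Applies strip_nul in place to a batch of string fields.
pub fn sanitize_all(values: &mut [String]) {
    for value in values.iter_mut() {
        if value.contains('\u{0000}') {
            *value = strip_nul(value);
        }
    }
}
```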

Summary of LLM Calls (up to 4 per generation)

# | Call                   | When                   | Model
1 | Classification Phase 1 | After Phase 1 scraping | research
2 | Web Search             | Phase 2 start          | research
3 | Classification Phase 2 | After Phase 2 scraping | research
4 | Rewrite                | After both phases      | writing

Plus, optionally, per-source calls for LLM link extraction and per-article calls for LLM article extraction (when those settings are enabled).

Summary of Filtering Steps

Step              | Phase | What's dropped
Empty content     | 1 & 2 | Scrape failures, soft 404s, too old
Article history   | 1 & 2 | Already used in previous syntheses
Homepage URLs     | 2     | Path is / or empty
Cross-phase dedup | 2     | URLs already found in Phase 1
URL dedup         | 2     | Duplicate URLs within Phase 2
Source diversity  | 1 & 2 | Domain exceeds max_articles_per_source
Category overflow | 1 & 2 | Category + "Autre" both full