Synthesis Generation Pipeline — Full Algorithm
Initialization
- Load user settings from DB (categories, provider, models, max_items, etc.)
- Cleanup — delete old article history entries (>N days, dropped only) + truncate old LLM call logs
- Validate — fail if no categories configured
- Load user sources (personalized URLs like `https://openai.com/blog`)
- Resolve LLM provider: decrypt user's API key, create provider instance (`Arc<dyn LlmProvider>`)
- Resolve models: research model + writing model (user override or admin default)
- Setup rate limiter: per-user or global provider limiter
- Prepare LLM scraping option: if `use_llm_for_article_extraction` enabled, clone provider + model for concurrent use
- Initialize tracking structures: `filled_counts` (per-category article count), `all_scraped` (category→articles), `all_overflow` (dropped overflow), `seen_urls` (cross-phase dedup), classification categories (user categories + "Autre")
Phase 1: Personalized Sources
Skipped entirely if user has 0 sources.
1a. Extract article links from source pages
- For each source (max 10), fetch the source page HTML
- If `use_llm_for_source_links` enabled: send HTML `<head>` + first 8000 chars of `<body>` to LLM → extract article URLs (falls back to heuristic if LLM fails)
- Otherwise: parse HTML `<a href>` links, filter by same-domain, non-homepage path, exclude `/tag/`, `/login/`, static assets, etc.
- Over-fetch: `2 × max_articles_per_source` candidates per source
- Deduplicate candidate URLs
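The heuristic branch of step 1a could look roughly like the sketch below. It is an illustration under stated assumptions, not the pipeline's actual code: `host_and_path`, `looks_like_article`, `candidate_links`, and both constant lists are hypothetical names, and URL parsing is done with plain string operations rather than a URL library.

```rust
use std::collections::HashSet;

// Path fragments and extensions to reject; both lists are illustrative.
const EXCLUDED_SEGMENTS: [&str; 2] = ["/tag/", "/login/"];
const STATIC_EXTENSIONS: [&str; 6] = [".css", ".js", ".png", ".jpg", ".svg", ".ico"];

/// Splits an absolute URL into (host, path), ignoring query and fragment.
fn host_and_path(url: &str) -> Option<(String, String)> {
    let (_, rest) = url.split_once("://")?;
    let rest = rest.split(|c: char| c == '?' || c == '#').next().unwrap_or("");
    let (host, path) = match rest.find('/') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, "/"),
    };
    Some((host.to_lowercase(), path.to_lowercase()))
}

/// Same domain, non-homepage path, no excluded segment, no static asset.
fn looks_like_article(link: &str, source_host: &str) -> bool {
    let (host, path) = match host_and_path(link) {
        Some(parts) => parts,
        None => return false,
    };
    host == source_host.to_lowercase()
        && !path.trim_end_matches('/').is_empty()
        && !EXCLUDED_SEGMENTS.iter().any(|s| path.contains(*s))
        && !STATIC_EXTENSIONS.iter().any(|e| path.ends_with(*e))
}

/// Keeps deduplicated candidates, over-fetching 2 × max_articles_per_source.
fn candidate_links<'a>(
    links: &[&'a str],
    source_host: &str,
    max_articles_per_source: usize,
) -> Vec<&'a str> {
    let limit = 2 * max_articles_per_source;
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for &link in links {
        if out.len() == limit {
            break;
        }
        if looks_like_article(link, source_host) && seen.insert(link) {
            out.push(link);
        }
    }
    out
}
```

Over-fetching leaves headroom for the later empty-content and history filters, which are expected to drop some candidates.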
1b. Scrape candidate articles
- Fetch each URL (bounded concurrency: 5 with LLM extraction, 10 without)
- SSRF check (no private IPs), 15s timeout, 5MB body limit
- If `use_llm_for_article_extraction` enabled: send `<head>` + body text to LLM → extract title, date, body, error detection (falls back to heuristic if LLM fails)
- Otherwise: HTML parsing heuristics for title (`<title>`, `og:title`), date (meta tags, JSON-LD, `<time>`), body (strip scripts/nav), soft-404 detection
- Capture final URL after redirects (canonical URL)
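As an illustration of the SSRF check in step 1b, the sketch below rejects resolved addresses in private or otherwise internal ranges using only the standard library. `is_blocked_ip` is a hypothetical helper; the exact set of ranges the real check covers is not stated above, so the list here is an assumption.

```rust
use std::net::IpAddr;

/// Returns true when a resolved address should not be fetched.
/// The covered ranges are an assumption, not the pipeline's actual list.
fn is_blocked_ip(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => {
            v4.is_private()
                || v4.is_loopback()
                || v4.is_link_local()
                || v4.is_unspecified()
                || v4.is_broadcast()
        }
        IpAddr::V6(v6) => {
            let first = v6.segments()[0];
            v6.is_loopback()
                || v6.is_unspecified()
                || (first & 0xfe00) == 0xfc00 // unique local, fc00::/7
                || (first & 0xffc0) == 0xfe80 // link-local, fe80::/10
        }
    }
}
```

A check like this is only effective when applied to the address that is actually connected to, including after each redirect, since a public hostname can redirect or resolve to an internal address.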
1c. Filter empty content
- Remove articles where scraped body text is empty (scrape failure, soft 404, too old)
- Trace dropped articles in `article_history` with `status: filtered_empty`
1d. Filter against article history
- Hash each URL (normalized: lowercase, strip fragments/UTM params/trailing slashes)
- Query `article_history` for existing hashes → remove matches
- Trace dropped articles with `status: filtered_history`
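The normalization in step 1d can be sketched as follows. `normalize_url` and `url_hash` are hypothetical names, and `DefaultHasher` is used only to keep the example dependency-free: its output is not guaranteed stable across Rust releases, so hashes persisted in `article_history` would more plausibly use a fixed digest such as SHA-256.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Lowercases the URL and strips the fragment, `utm_*` query parameters,
/// and trailing slashes on the path.
fn normalize_url(url: &str) -> String {
    let lower = url.trim().to_lowercase();
    let no_fragment = lower.split('#').next().unwrap_or("");
    let (base, query) = match no_fragment.split_once('?') {
        Some((b, q)) => (b, q),
        None => (no_fragment, ""),
    };
    // Keep every query parameter that is not a UTM tracking parameter.
    let kept: Vec<&str> = query
        .split('&')
        .filter(|p| !p.is_empty() && !p.starts_with("utm_"))
        .collect();
    let mut out = base.trim_end_matches('/').to_string();
    if !kept.is_empty() {
        out.push('?');
        out.push_str(&kept.join("&"));
    }
    out
}

/// Hashes the normalized form (illustrative, non-persistent hasher).
fn url_hash(url: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    normalize_url(url).hash(&mut hasher);
    hasher.finish()
}
```

Two spellings of the same article, for example with and without a trailing slash or tracking parameters, then map to the same history entry.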
1e. Retry if under-filled
- If valid articles < `categories × max_items_per_category` and history is enabled:
- Re-scrape source pages for NEW links (exclude already-fetched URLs)
- Scrape + filter empty + filter history on retry candidates
- Merge with existing valid articles
- Only 1 retry attempt
1f. LLM classification
- Send articles (title + first 500 chars of body) + categories + "Autre" to LLM
- LLM returns `{assignments: [{index, category}]}` mapping each article to a category
- Overflow: articles that exceed both target category AND "Autre" limits → collected in `all_overflow`
- LLM call logged with full prompt/response/timing
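One way to apply the returned assignments, including the overflow rule, is sketched below. The `Assignment` struct mirrors the `{index, category}` shape described above; `place` is a hypothetical function, and using the same `max_items` limit for "Autre" as for user categories is an assumption.

```rust
use std::collections::HashMap;

const AUTRE: &str = "Autre";

/// One entry of the `assignments` array returned by the LLM.
struct Assignment {
    index: usize,
    category: String,
}

/// Places each article index in its target category, falls back to "Autre"
/// when the target is full, and sends it to overflow when both are full.
fn place(
    assignments: &[Assignment],
    max_items: usize,
    filled_counts: &mut HashMap<String, usize>,
) -> (HashMap<String, Vec<usize>>, Vec<usize>) {
    let mut placed: HashMap<String, Vec<usize>> = HashMap::new();
    let mut overflow = Vec::new();
    for a in assignments {
        // First of (target, "Autre") that still has room, if any.
        let target = [a.category.as_str(), AUTRE]
            .iter()
            .copied()
            .find(|c| filled_counts.get(*c).copied().unwrap_or(0) < max_items);
        match target {
            Some(c) => {
                *filled_counts.entry(c.to_string()).or_insert(0) += 1;
                placed.entry(c.to_string()).or_default().push(a.index);
            }
            None => overflow.push(a.index),
        }
    }
    (placed, overflow)
}
```

Because `filled_counts` is passed in mutably, the same function can be reused for the Phase 2 classification with counts carried over from Phase 1.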
1g. Enforce source diversity
- Count domains across all categories
- Remove articles where domain exceeds `max_articles_per_source`
- Trace dropped articles with `status: filtered_diversity`
- Recount category fill levels
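A minimal sketch of the diversity cap in step 1g follows. `domain_of` and `enforce_diversity` are hypothetical helpers; treating `www.example.com` and `example.com` as the same domain is an assumption, and ports or userinfo in the URL are not handled.

```rust
use std::collections::HashMap;

/// Extracts a lowercase host from a URL without external crates.
fn domain_of(url: &str) -> String {
    let rest = url.split_once("://").map(|(_, r)| r).unwrap_or(url);
    let host = rest
        .split(|c: char| matches!(c, '/' | '?' | '#'))
        .next()
        .unwrap_or("");
    host.to_lowercase().trim_start_matches("www.").to_string()
}

/// Keeps at most `max_articles_per_source` URLs per domain, in input order.
/// Returns (kept, dropped); dropped entries would be traced as
/// `filtered_diversity`.
fn enforce_diversity(
    urls: Vec<String>,
    max_articles_per_source: usize,
) -> (Vec<String>, Vec<String>) {
    let mut per_domain: HashMap<String, usize> = HashMap::new();
    let mut kept = Vec::new();
    let mut dropped = Vec::new();
    for url in urls {
        let count = per_domain.entry(domain_of(&url)).or_insert(0);
        if *count < max_articles_per_source {
            *count += 1;
            kept.push(url);
        } else {
            dropped.push(url);
        }
    }
    (kept, dropped)
}
```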
Phase 2: Web Search Fallback
Skipped if all user-defined categories are already filled to max_items_per_category.
2a. Compute category gaps
- For each user category: `needed = max - already_filled`
- Only proceed if any category needs more
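The gap computation can be expressed in a few lines. The function name `category_gaps` and its signature are illustrative; only the formula `needed = max - already_filled` comes from the description above.

```rust
use std::collections::HashMap;

/// Returns (category, needed) for every user category that is not yet full.
/// An empty result means Phase 2 can be skipped.
fn category_gaps(
    categories: &[String],
    filled_counts: &HashMap<String, usize>,
    max_items_per_category: usize,
) -> Vec<(String, usize)> {
    categories
        .iter()
        .map(|c| {
            let filled = filled_counts.get(c).copied().unwrap_or(0);
            // saturating_sub guards against a count above the maximum.
            (c.clone(), max_items_per_category.saturating_sub(filled))
        })
        .filter(|(_, needed)| *needed > 0)
        .collect()
}
```

The resulting pairs are what the search prompt turns into per-category requests such as "find N articles for AI News".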
2b. Load recent domains for diversity
- If `source_diversity_window > 0`: extract domains from last N syntheses' JSONB sections
- Used as soft "avoid if possible" instruction in search prompt
2c. LLM web search pass
- Build search prompt with theme, categories, gap counts ("find N articles for AI News, M for Cybersecurity"), recent domains to avoid, personalized source URLs
- Call `provider.generate_search_pass()` with web search tool enabled
- LLM call logged with full prompt/response/timing
- Returns structured JSON: `{category_0: [{title, url, summary}], category_1: [...]}`
2d. Filter pipeline on search results
- Parse LLM output into `(category_key, Vec<NewsItem>)`
- Filter homepage URLs: drop articles with path `/` or empty
- Cross-phase dedup: drop URLs already seen in Phase 1
- Dedup by URL: drop duplicate URLs within Phase 2 (case-insensitive)
- Limit articles per source: enforce `max_articles_per_source` per domain (spread across categories first, then fill)
- Filter against article history: BEFORE scraping (saves HTTP requests), drop already-seen URLs
- Each drop traced in `article_history` with appropriate status
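The first three filters of step 2d (homepage, cross-phase dedup, and case-insensitive URL dedup) can be sketched together. `is_homepage` and `filter_search_results` are hypothetical names, and the sketch assumes `seen_urls` already holds the lowercased Phase 1 URLs.

```rust
use std::collections::HashSet;

/// True when the URL path is `/` or empty.
fn is_homepage(url: &str) -> bool {
    let rest = url.split_once("://").map(|(_, r)| r).unwrap_or(url);
    let rest = rest.split(|c: char| c == '?' || c == '#').next().unwrap_or("");
    match rest.find('/') {
        Some(i) => rest[i..].trim_matches('/').is_empty(),
        None => true,
    }
}

/// Drops homepage URLs, URLs already in `seen_urls` (Phase 1), and
/// case-insensitive duplicates within the Phase 2 batch.
fn filter_search_results(urls: Vec<String>, seen_urls: &mut HashSet<String>) -> Vec<String> {
    urls.into_iter()
        .filter(|u| !is_homepage(u))
        // insert() returns false when the lowercased URL was already present.
        .filter(|u| seen_urls.insert(u.to_lowercase()))
        .collect()
}
```

Sharing one set for both dedup steps keeps the two rules consistent: a URL counts as seen regardless of which phase found it first.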
2e. Scrape web search results
- Same scraping as Phase 1 (bounded concurrency, SSRF check, optional LLM extraction)
- Filter empty content (scrape failures, soft 404, too old)
- Trace drops
2f. LLM classification
- Same as Phase 1 classification but with Phase 2 articles
- `filled_counts` carries over from Phase 1: categories already partially filled
- Overflow collected
- LLM call logged
- Merge results into `all_scraped`
"Autre" Fill-Up
- Count total articles across all categories
- Target = `75% × (categories × max_items_per_category)` (user categories only, "Autre" excluded from denominator)
- If shortfall > 0 and overflow exists:
  - For each overflow article: check if domain is under `max_articles_per_source` limit
  - Add to `all_scraped["category_autre"]` up to the shortfall
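As a worked example of the target: with 4 user categories and `max_items_per_category = 5`, the target is 75% of 20, that is 15 articles, so a synthesis holding 12 articles has a shortfall of 3. The sketch below computes this with integer arithmetic; the function name is hypothetical, and rounding down is an assumption since the rounding rule is not stated above.

```rust
/// Number of overflow articles that may still be added to "Autre".
/// Rounding down the 75% target is an assumption.
fn autre_shortfall(
    num_user_categories: usize,
    max_items_per_category: usize,
    total_articles: usize,
) -> usize {
    let target = num_user_categories * max_items_per_category * 3 / 4;
    target.saturating_sub(total_articles)
}
```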
Combined Rewrite Pass
- Fail if no articles — return error if all categories are empty
- Build rewrite prompt — serialize all scraped articles with body content, instruct LLM to rewrite title + summary (4-5 lines) faithfully based on scraped content
- Build rewrite schema: `minItems`/`maxItems` set to ACTUAL count per category (not user max), empty categories omitted, "Autre" included if non-empty
- LLM rewrite pass: call `provider.generate_rewrite_pass()` with writing model
- LLM call logged with full prompt/response/timing
- Build final sections: map `category_N` keys to user category names, add "Autre" section if present, omit empty categories
- Restore scraped URLs: replace any hallucinated URLs from LLM rewrite with the validated scraped URLs (matched by category + position)
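The URL restoration step relies on the schema pinning each category to its actual item count, so rewritten items line up with scraped ones by position. A minimal sketch, with a hypothetical `Item` type and `restore_urls` function:

```rust
use std::collections::HashMap;

/// A rewritten article as returned by the LLM (fields are illustrative).
struct Item {
    title: String,
    url: String,
}

/// Overwrites every rewritten URL with the scraped URL found at the same
/// position in the same category, discarding whatever the LLM produced.
fn restore_urls(
    rewritten: &mut HashMap<String, Vec<Item>>,
    scraped_urls: &HashMap<String, Vec<String>>,
) {
    for (category, items) in rewritten.iter_mut() {
        if let Some(urls) = scraped_urls.get(category) {
            for (item, url) in items.iter_mut().zip(urls) {
                item.url = url.clone();
            }
        }
    }
}
```

Overwriting unconditionally, rather than comparing first, means the final output can only contain URLs that passed the scraping and filtering stages.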
Save + Record
- Sanitize: strip `\u0000` null bytes from JSON (PostgreSQL rejects them in JSONB)
- Save synthesis: insert into `syntheses` table with `job_id`, `week` (ISO week), `sections` (JSONB), `status: completed`
- Record used articles: insert each article URL into `article_history` with `status: used`, `synthesis_id`, `job_id`, and category name (for future dedup + provenance)
Summary of LLM Calls (up to 4 per generation)
| # | Call | When | Model |
|---|---|---|---|
| 1 | Classification Phase 1 | After Phase 1 scraping | research |
| 2 | Web Search | Phase 2 start | research |
| 3 | Classification Phase 2 | After Phase 2 scraping | research |
| 4 | Rewrite | After both phases | writing |
Plus, optionally, one call per source page for LLM link extraction and one call per article for LLM article extraction (when those settings are enabled).
Summary of Filtering Steps
| Step | Phase | What's dropped |
|---|---|---|
| Empty content | 1 & 2 | Scrape failures, soft 404s, too old |
| Article history | 1 & 2 | Already used in previous syntheses |
| Homepage URLs | 2 | Path is / or empty |
| Cross-phase dedup | 2 | URLs already found in Phase 1 |
| URL dedup | 2 | Duplicate URLs within Phase 2 |
| Source diversity | 1 & 2 | Domain exceeds max_articles_per_source |
| Category overflow | 1 & 2 | Category + "Autre" both full |