Synthesis Generation Pipeline — Full Algorithm

Initialization

Load user settings from DB (categories, provider, models, max_items, etc.)
Cleanup — delete old article history entries (>N days, dropped only) + truncate old LLM call logs
Validate — fail if no categories configured
Load user sources (personalized URLs like https://openai.com/blog)
Resolve LLM provider — decrypt user's API key, create provider instance (Arc<dyn LlmProvider>)
Resolve models — research model + writing model (user override or admin default)
Setup rate limiter — per-user or global provider limiter
Prepare LLM scraping option — if use_llm_for_article_extraction enabled, clone provider+model for concurrent use
Initialize tracking structures: filled_counts (per-category article count), all_scraped (category→articles), all_overflow (dropped overflow), seen_urls (cross-phase dedup), classification categories (user categories + "Autre", French for "Other", which serves as the catch-all category)
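The tracking structures listed above could be grouped roughly as follows. This is a minimal sketch under stated assumptions: the `Tracking` and `ScrapedArticle` types, their field types, and the `new` constructor are hypothetical names chosen for illustration; only the field names `filled_counts`, `all_scraped`, `all_overflow`, and `seen_urls` come from this document.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical article record; the real fields are not given in this document.
#[derive(Debug, Default)]
struct ScrapedArticle {
    url: String,
    title: String,
    body: String,
}

/// Tracking state shared by both phases (assumed layout).
#[derive(Debug, Default)]
struct Tracking {
    /// Per-category article count.
    filled_counts: HashMap<String, usize>,
    /// Category name mapped to the articles kept for it.
    all_scraped: HashMap<String, Vec<ScrapedArticle>>,
    /// Articles that fit neither their category nor "Autre".
    all_overflow: Vec<ScrapedArticle>,
    /// URLs already seen, used for cross-phase dedup.
    seen_urls: HashSet<String>,
    /// User categories plus the catch-all "Autre".
    classification_categories: Vec<String>,
}

impl Tracking {
    fn new(user_categories: &[&str]) -> Self {
        let mut classification_categories: Vec<String> =
            user_categories.iter().map(|c| c.to_string()).collect();
        classification_categories.push("Autre".to_string());
        // Every classification category starts with zero articles.
        let filled_counts = classification_categories
            .iter()
            .map(|c| (c.clone(), 0))
            .collect();
        Tracking {
            filled_counts,
            classification_categories,
            ..Default::default()
        }
    }
}
```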

Phase 1: Personalized Sources

Skipped entirely if user has 0 sources.

1a. Extract article links from source pages

For each source (max 10), fetch the source page HTML
If use_llm_for_source_links enabled: send HTML <head> + first 8000 chars of <body> to LLM → extract article URLs (falls back to heuristic if LLM fails)
Otherwise: parse HTML <a href> links, filter by same-domain, non-homepage path, exclude /tag/, /login/, static assets, etc.
Over-fetch: 2 × max_articles_per_source candidates per source
Deduplicate candidate URLs
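The heuristic branch of step 1a can be sketched as below, using only the standard library. The helper names (`split_url`, `is_candidate_link`, `select_candidates`) and the list of static-asset extensions are assumptions made for illustration; the document only states the rules (same domain, non-homepage path, excluded segments such as /tag/ and /login/, no static assets, 2 × max_articles_per_source candidates, deduplication). A real implementation would more likely rely on a URL-parsing crate.

```rust
use std::collections::HashSet;

/// Split an absolute http(s) URL into (lowercased host, path).
/// Hypothetical helper, deliberately simplified.
fn split_url(url: &str) -> Option<(String, String)> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    // Ignore query string and fragment.
    let rest = rest
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or("");
    let (host, path) = match rest.find('/') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, "/"),
    };
    if host.is_empty() {
        return None;
    }
    Some((host.to_ascii_lowercase(), path.to_string()))
}

/// Same domain, non-homepage path, no excluded segment, no static asset.
fn is_candidate_link(source_host: &str, link: &str) -> bool {
    const EXCLUDED_SEGMENTS: [&str; 2] = ["/tag/", "/login/"];
    // Assumed extension list; the document only says "static assets, etc."
    const STATIC_EXTENSIONS: [&str; 6] = [".css", ".js", ".png", ".jpg", ".svg", ".ico"];

    let Some((host, path)) = split_url(link) else {
        return false;
    };
    let path_lower = path.to_ascii_lowercase();
    host == source_host.to_ascii_lowercase()
        && path != "/"
        && !EXCLUDED_SEGMENTS.iter().any(|s| path_lower.contains(s))
        && !STATIC_EXTENSIONS.iter().any(|e| path_lower.ends_with(e))
}

/// Keep at most 2 × max_articles_per_source unique candidates, in page order.
fn select_candidates(
    source_host: &str,
    links: &[&str],
    max_articles_per_source: usize,
) -> Vec<String> {
    let limit = 2 * max_articles_per_source;
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for &link in links {
        if out.len() >= limit {
            break;
        }
        if is_candidate_link(source_host, link) && seen.insert(link) {
            out.push(link.to_string());
        }
    }
    out
}
```

Over-fetching twice the per-source limit leaves headroom for the later empty-content and history filters to discard some candidates.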

1b. Scrape candidate articles

Fetch each URL (bounded concurrency: 5 with LLM extraction, 10 without)
SSRF check (no private IPs), 15s timeout, 5MB body limit
If use_llm_for_article_extraction enabled: send <head> + body text to LLM → extract title, date, body, error detection (falls back to heuristic if LLM fails)
Otherwise: HTML parsing heuristics for title (<title>, og:title), date (meta tags, JSON-LD, <time>), body (strip scripts/nav), soft-404 detection
Capture final URL after redirects (canonical URL)
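The SSRF check and fetch limits from step 1b might look like the sketch below. The function name `is_blocked_ip`, the constant names, and the exact set of rejected ranges are assumptions; the document only says "no private IPs", a 15s timeout, and a 5MB body limit (interpreted here as 5 × 1024 × 1024 bytes). IPv4-mapped IPv6 addresses are not handled in this sketch.

```rust
use std::net::IpAddr;
use std::time::Duration;

/// Assumed constants mirroring the limits stated above.
const FETCH_TIMEOUT: Duration = Duration::from_secs(15);
const MAX_BODY_BYTES: usize = 5 * 1024 * 1024;

/// Returns true when a resolved address should not be fetched.
fn is_blocked_ip(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => {
            v4.is_private()
                || v4.is_loopback()
                || v4.is_link_local()
                || v4.is_unspecified()
                || v4.is_broadcast()
        }
        IpAddr::V6(v6) => {
            let first = v6.segments()[0];
            v6.is_loopback()
                || v6.is_unspecified()
                || (first & 0xfe00) == 0xfc00 // unique local, fc00::/7
                || (first & 0xffc0) == 0xfe80 // link-local, fe80::/10
        }
    }
}
```

For the check to be effective it has to run against the address the HTTP client actually connects to, including after each redirect, since step 1b follows redirects to capture the final URL.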

1c. Filter empty content

Remove articles where scraped body text is empty (scrape failure, soft 404, too old)
Trace dropped articles in article_history with status: filtered_empty

1d. Filter against article history

Hash each URL (normalized: lowercase, strip fragments/UTM params/trailing slashes)
Query article_history for existing hashes → remove matches
Trace dropped articles with status: filtered_history
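The URL normalization described in step 1d can be illustrated as follows. This is a sketch only: `normalize_url` and `url_hash` are hypothetical names, and `DefaultHasher` is a stand-in whose output is not guaranteed to be stable across Rust releases, so a hash persisted in article_history would need a fixed algorithm instead (the document does not say which one is used).

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Lowercase, strip the fragment, drop utm_* query parameters,
/// and remove trailing slashes from the path.
fn normalize_url(url: &str) -> String {
    let lower = url.trim().to_lowercase();
    let without_fragment = lower.split('#').next().unwrap_or("");
    let (base, query) = match without_fragment.split_once('?') {
        Some((b, q)) => (b, q),
        None => (without_fragment, ""),
    };
    // Keep every query parameter except tracking ones.
    let kept: Vec<&str> = query
        .split('&')
        .filter(|p| !p.is_empty() && !p.starts_with("utm_"))
        .collect();
    let base = base.trim_end_matches('/');
    if kept.is_empty() {
        base.to_string()
    } else {
        format!("{}?{}", base, kept.join("&"))
    }
}

/// Hash of the normalized URL (illustrative, not a persistent format).
fn url_hash(url: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    normalize_url(url).hash(&mut hasher);
    hasher.finish()
}
```

With this normalization, two links that differ only by case, tracking parameters, fragment, or a trailing slash map to the same hash and are treated as the same article.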

1e. Retry if under-filled

If valid articles < categories × max_items_per_category and history is enabled
Re-scrape source pages for NEW links (exclude already-fetched URLs)
Scrape + filter empty + filter history on retry candidates
Merge with existing valid articles
Only 1 retry attempt

1f. LLM classification

Send articles (title + first 500 chars of body) + categories + "Autre" to LLM
LLM returns {assignments: [{index, category}]} mapping each article to a category
Overflow: articles that exceed both target category AND "Autre" limits → collected in all_overflow
LLM call logged with full prompt/response/timing
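The way assignments spill from a full category into "Autre" and then into overflow could be sketched like this. The `Assignment` struct mirrors the `{index, category}` shape returned by the LLM; `apply_assignments` is a hypothetical name, and the sketch assumes "Autre" shares the same limit as the user categories, which the document does not state explicitly.

```rust
use std::collections::HashMap;

const AUTRE: &str = "Autre";

/// One entry of the LLM's `assignments` array.
struct Assignment {
    index: usize,
    category: String,
}

/// Place each article index in its category, fall back to "Autre" when the
/// category is full, and collect the remainder as overflow.
/// Returns (category -> article indices, overflow indices).
fn apply_assignments(
    assignments: &[Assignment],
    max_items_per_category: usize,
    filled_counts: &mut HashMap<String, usize>,
) -> (HashMap<String, Vec<usize>>, Vec<usize>) {
    let mut placed: HashMap<String, Vec<usize>> = HashMap::new();
    let mut overflow = Vec::new();
    for a in assignments {
        // Try the target category first, then "Autre".
        let target = [a.category.as_str(), AUTRE]
            .iter()
            .copied()
            .find(|c| filled_counts.get(*c).copied().unwrap_or(0) < max_items_per_category);
        match target {
            Some(c) => {
                *filled_counts.entry(c.to_string()).or_insert(0) += 1;
                placed.entry(c.to_string()).or_default().push(a.index);
            }
            None => overflow.push(a.index),
        }
    }
    (placed, overflow)
}
```

Because `filled_counts` is passed in mutably, the same function can serve the Phase 2 classification, where counts carry over from Phase 1.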

1g. Enforce source diversity

Count domains across all categories
Remove articles where domain exceeds max_articles_per_source
Trace dropped articles with status: filtered_diversity
Recount category fill levels
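A minimal sketch of the per-domain cap in step 1g follows. The function name and the first-come, first-kept ordering are assumptions; the document does not say which articles of an over-represented domain are removed.

```rust
use std::collections::HashMap;

/// Walk articles in order, keeping at most `max_articles_per_source` per domain.
/// Returns (kept, dropped) as indices into `domains`.
fn enforce_source_diversity(
    domains: &[&str],
    max_articles_per_source: usize,
) -> (Vec<usize>, Vec<usize>) {
    let mut per_domain: HashMap<&str, usize> = HashMap::new();
    let mut kept = Vec::new();
    let mut dropped = Vec::new();
    for (i, &domain) in domains.iter().enumerate() {
        let count = per_domain.entry(domain).or_insert(0);
        if *count < max_articles_per_source {
            *count += 1;
            kept.push(i);
        } else {
            // These would be traced with status filtered_diversity.
            dropped.push(i);
        }
    }
    (kept, dropped)
}
```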

Phase 2: Web Search Fallback

Skipped if all user-defined categories are already filled to max_items_per_category.

2a. Compute category gaps

For each user category: needed = max - already_filled
Only proceed if any category needs more
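The gap computation is simple enough to show directly. In this sketch `compute_gaps` is a hypothetical name; an empty result corresponds to the case where Phase 2 is skipped.

```rust
use std::collections::HashMap;

/// For each user category, needed = max - already_filled.
/// Categories that are already full are left out of the result.
fn compute_gaps(
    categories: &[&str],
    filled_counts: &HashMap<String, usize>,
    max_items_per_category: usize,
) -> Vec<(String, usize)> {
    categories
        .iter()
        .filter_map(|&c| {
            let filled = filled_counts.get(c).copied().unwrap_or(0);
            // saturating_sub guards against a count above the maximum.
            let needed = max_items_per_category.saturating_sub(filled);
            (needed > 0).then(|| (c.to_string(), needed))
        })
        .collect()
}
```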

2b. Load recent domains for diversity

If source_diversity_window > 0: extract domains from last N syntheses' JSONB sections
Used as soft "avoid if possible" instruction in search prompt

2c. LLM web search pass

Build search prompt with theme, categories, gap counts ("find N articles for AI News, M for Cybersecurity"), recent domains to avoid, personalized source URLs
Call provider.generate_search_pass() with web search tool enabled
LLM call logged with full prompt/response/timing
Returns structured JSON: {category_0: [{title, url, summary}], category_1: [...]}

2d. Filter pipeline on search results

Parse LLM output into (category_key, Vec<NewsItem>)
Filter homepage URLs — drop articles with path / or empty
Cross-phase dedup — drop URLs already seen in Phase 1
Dedup by URL — drop duplicate URLs within Phase 2 (case-insensitive)
Limit articles per source — enforce max_articles_per_source per domain (spread across categories first, then fill)
Filter against article history — BEFORE scraping (saves HTTP requests), drop already-seen URLs
Each drop traced in article_history with appropriate status
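The first three filters of step 2d (homepage URLs, cross-phase dedup, in-phase dedup) can be sketched together, since their order matters. The `DropReason` enum and both function names are hypothetical; the document does not list the status strings used for these drops. The sketch also assumes `seen_urls` stores lowercased URLs.

```rust
use std::collections::HashSet;

/// Why a search result was discarded (illustrative names only).
#[derive(Debug, Clone, Copy, PartialEq)]
enum DropReason {
    Homepage,
    CrossPhase,
    Duplicate,
}

/// Path of an absolute URL, ignoring query and fragment ("" when absent).
fn url_path(url: &str) -> &str {
    let after_scheme = url.split_once("://").map_or(url, |(_, rest)| rest);
    let no_suffix = after_scheme
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or("");
    match no_suffix.find('/') {
        Some(i) => &no_suffix[i..],
        None => "",
    }
}

/// Apply the filters in order and report why each dropped URL was dropped.
fn filter_search_urls<'a>(
    urls: &[&'a str],
    seen_urls: &HashSet<String>,
) -> (Vec<&'a str>, Vec<(&'a str, DropReason)>) {
    let mut kept = Vec::new();
    let mut dropped = Vec::new();
    let mut seen_in_phase2 = HashSet::new();
    for &url in urls {
        let key = url.to_lowercase(); // case-insensitive comparison
        let path = url_path(url);
        if path.is_empty() || path == "/" {
            dropped.push((url, DropReason::Homepage));
        } else if seen_urls.contains(&key) {
            dropped.push((url, DropReason::CrossPhase));
        } else if !seen_in_phase2.insert(key) {
            dropped.push((url, DropReason::Duplicate));
        } else {
            kept.push(url);
        }
    }
    (kept, dropped)
}
```

Running these cheap string checks before the per-source limit and the history lookup keeps the later steps, and the scraping that follows, from doing work on URLs that would be discarded anyway.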

2e. Scrape web search results

Same scraping as Phase 1 (bounded concurrency, SSRF check, optional LLM extraction)
Filter empty content (scrape failures, soft 404, too old)
Trace drops

2f. LLM classification

Same as Phase 1 classification but with Phase 2 articles
filled_counts carries over from Phase 1 — categories already partially filled
Overflow collected
LLM call logged
Merge results into all_scraped

"Autre" Fill-Up

Count total articles across all categories
Target = 75% × (categories × max_items_per_category) (user categories only, "Autre" excluded from denominator)
If shortfall > 0 and overflow exists:
- For each overflow article: check if domain is under max_articles_per_source limit
- Add to all_scraped["category_autre"] up to the shortfall
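The fill-up arithmetic and the per-domain check can be sketched as below. Both function names are hypothetical, and the integer floor used for the 75% target is an assumption, since the document does not specify the rounding.

```rust
use std::collections::HashMap;

/// Number of overflow articles that may be moved into "Autre".
fn autre_shortfall(
    total_articles: usize,
    user_categories: usize,
    max_items_per_category: usize,
) -> usize {
    // "Autre" is excluded from the capacity used as denominator.
    let capacity = user_categories * max_items_per_category;
    let target = capacity * 3 / 4; // 75%, rounded down (assumed)
    target.saturating_sub(total_articles)
}

/// Pick overflow articles, given as (url, domain), up to `shortfall`,
/// skipping domains already at `max_articles_per_source`.
fn fill_autre<'a>(
    overflow: &[(&'a str, &'a str)],
    domain_counts: &mut HashMap<String, usize>,
    max_articles_per_source: usize,
    shortfall: usize,
) -> Vec<&'a str> {
    let mut added = Vec::new();
    for &(url, domain) in overflow {
        if added.len() >= shortfall {
            break;
        }
        let count = domain_counts.entry(domain.to_string()).or_insert(0);
        if *count < max_articles_per_source {
            *count += 1;
            added.push(url);
        }
    }
    added
}
```

For example, with 3 user categories and max_items_per_category = 3, the capacity is 9 and the target is 6; a synthesis holding 4 articles has a shortfall of 2.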

Combined Rewrite Pass

Fail if no articles — return error if all categories are empty
Build rewrite prompt — serialize all scraped articles with body content, instruct LLM to rewrite title + summary (4-5 lines) faithfully based on scraped content
Build rewrite schema — minItems/maxItems set to ACTUAL count per category (not user max), empty categories omitted, "Autre" included if non-empty
LLM rewrite pass — call provider.generate_rewrite_pass() with writing model
LLM call logged with full prompt/response/timing
Build final sections — map category_N keys to user category names, add "Autre" section if present, omit empty categories
Restore scraped URLs — replace any hallucinated URLs from LLM rewrite with the validated scraped URLs (matched by category + position)
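The URL restoration step can be illustrated with a short sketch. The `NewsItem` fields follow the `{title, url, summary}` shape shown for search results; the actual struct definition and the function name `restore_scraped_urls` are assumptions.

```rust
use std::collections::HashMap;

/// Assumed shape of a rewritten item.
struct NewsItem {
    title: String,
    url: String,
    summary: String,
}

/// Overwrite each rewritten item's URL with the scraped URL found at the
/// same category key and position, whatever the LLM returned.
fn restore_scraped_urls(
    rewritten: &mut HashMap<String, Vec<NewsItem>>,
    scraped_urls: &HashMap<String, Vec<String>>,
) {
    for (category, items) in rewritten.iter_mut() {
        if let Some(urls) = scraped_urls.get(category) {
            for (item, url) in items.iter_mut().zip(urls) {
                item.url = url.clone();
            }
        }
    }
}
```

Matching by position only works because the rewrite schema pins minItems/maxItems to the actual article count per category, so the LLM cannot add or drop entries without violating the schema.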

Save + Record

Sanitize — strip \u0000 null bytes from JSON (PostgreSQL rejects them in JSONB)
Save synthesis — insert into syntheses table with job_id, week (ISO week), sections (JSONB), status: completed
Record used articles — insert each article URL into article_history with status: used, synthesis_id, job_id, and category name (for future dedup + provenance)

Summary of LLM Calls (up to 4 per generation)

| # | Call | When | Model |
|---|------|------|-------|
| 1 | Classification Phase 1 | After Phase 1 scraping | research |
| 2 | Web Search | Phase 2 start | research |
| 3 | Classification Phase 2 | After Phase 2 scraping | research |
| 4 | Rewrite | After both phases | writing |

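In addition, when the corresponding settings are enabled, there are optional calls per source page for LLM link extraction (use_llm_for_source_links) and per scraped article for LLM article extraction (use_llm_for_article_extraction).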

Summary of Filtering Steps

| Step | Phase | What's dropped |
|------|-------|----------------|
| Empty content | 1 & 2 | Scrape failures, soft 404s, too old |
| Article history | 1 & 2 | Already used in previous syntheses |
| Homepage URLs | 2 | Path is `/` or empty |
| Cross-phase dedup | 2 | URLs already found in Phase 1 |
| URL dedup | 2 | Duplicate URLs within Phase 2 |
| Source diversity | 1 & 2 | Domain exceeds `max_articles_per_source` |
| Category overflow | 1 & 2 | Category + "Autre" both full |
