Synthesis Generation Pipeline — Full Algorithm
Initialization
- Load user settings from DB (categories, provider, models, max_items, etc.)
- Cleanup: delete old article history entries (older than N days, dropped entries only) and truncate old LLM call logs
- Validate: if no categories are configured, only the default category "Autre" is used
- Load user sources (personalized URLs like `https://openai.com/blog`)
- Resolve LLM provider: decrypt the user's API key, create a provider instance (`Arc<dyn LlmProvider>`)
- Resolve models: research model + writing model (user override or admin default)
- Setup rate limiter: per-user or global provider limiter
- Prepare LLM scraping option: if `use_llm_for_article_extraction` is enabled, clone provider + model for concurrent use
- Initialize tracking structures: `article_scraped` (category→articles), `source_counts` (per-source article count), `url_soucre` (per-article source), `filled_counts` (per-category article count), `seen_urls` (cross-phase dedup), classification categories (user categories + "Autre")
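To make the tracking structures concrete, here is a minimal sketch of how they might be grouped. The `Article` and `Tracking` types and the `Tracking::new` constructor are hypothetical names introduced for illustration; only the field names come from the list above (with `url_soucre` spelled as in the text), and the real types may well differ.

```rust
use std::collections::{HashMap, HashSet};

/// Minimal stand-in for a scraped article (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Article {
    pub url: String,
    pub title: String,
    pub summary: String,
}

/// Tracking state created once per pipeline run (hypothetical grouping).
#[derive(Debug, Default)]
pub struct Tracking {
    /// category -> articles kept for that category
    pub article_scraped: HashMap<String, Vec<Article>>,
    /// source -> number of articles taken from it
    pub source_counts: HashMap<String, usize>,
    /// article URL -> source it came from (name spelled as in the text)
    pub url_soucre: HashMap<String, String>,
    /// category -> number of articles kept
    pub filled_counts: HashMap<String, usize>,
    /// URLs seen in any phase (cross-phase dedup)
    pub seen_urls: HashSet<String>,
    /// user categories plus the "Autre" fallback
    pub categories: Vec<String>,
}

impl Tracking {
    /// Build empty tracking state; "Autre" is appended if it is missing.
    pub fn new(user_categories: &[&str]) -> Self {
        let mut categories: Vec<String> =
            user_categories.iter().map(|c| c.to_string()).collect();
        if !categories.iter().any(|c| c == "Autre") {
            categories.push("Autre".to_string());
        }
        // Start every category at zero so later gap computations see it.
        let filled_counts: HashMap<String, usize> =
            categories.iter().map(|c| (c.clone(), 0)).collect();
        Tracking { categories, filled_counts, ..Default::default() }
    }
}
```

With no user categories, `Tracking::new(&[])` yields a single `"Autre"` category, which matches the validation rule above.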
Phase 1: Personalized Sources
Skipped entirely if user has 0 sources.
1a. Extract article links from source pages and filter against article history
- Query `article_history` for the last source used. Reorder the personalized sources so that the first source is the one following the last source used (rolling window)
- For each source, fetch the source page HTML:
  - If `use_llm_for_source_links` is enabled: send the HTML `<head>` + first 8000 chars of `<body>` to the LLM → extract all article URLs up to a maximum of 10, with the most recent first. If the LLM call fails, fall back to HTML parsing as described below.
    - LLM call logged with full prompt/response/timing
  - Otherwise: parse HTML `<a href>` links, filter by same-domain, non-homepage path, exclude `/tag/`, `/login/`, `/contact/`, `/presentation/`, `/newsletter/`, static assets, etc. and keep only the first 10 links found
  - Deduplicate candidate URLs
  - Hash each URL (normalized: lowercase, strip fragments/UTM params/trailing slashes)
  - Query `article_history` for existing hashes → remove matches
  - Trace dropped articles with `status: filtered_history`
  - Add the URL to `url_soucre`
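The normalization, hashing and history filter in step 1a can be sketched with the standard library alone. The function names `normalize_url`, `url_hash` and `filter_history` are hypothetical, and `DefaultHasher` is only a stand-in: its output is not guaranteed stable across Rust releases, so hashes persisted in `article_history` would need a stable digest instead.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Normalize a URL for dedup: lowercase, drop the fragment, drop `utm_*`
/// query parameters, and trim trailing slashes from the path.
pub fn normalize_url(url: &str) -> String {
    let lower = url.trim().to_lowercase();
    // Everything after '#' is the fragment.
    let without_fragment = lower.split('#').next().unwrap_or("");
    // Separate the query string, if any.
    let (base, query) = match without_fragment.split_once('?') {
        Some((b, q)) => (b, Some(q)),
        None => (without_fragment, None),
    };
    let kept: Vec<&str> = query
        .unwrap_or("")
        .split('&')
        .filter(|p| !p.is_empty() && !p.starts_with("utm_"))
        .collect();
    let mut out = base.trim_end_matches('/').to_string();
    if !kept.is_empty() {
        out.push('?');
        out.push_str(&kept.join("&"));
    }
    out
}

/// Hash of the normalized URL (stand-in hasher, see the note above).
pub fn url_hash(url: &str) -> u64 {
    let mut h = DefaultHasher::new();
    normalize_url(url).hash(&mut h);
    h.finish()
}

/// Returns (kept, dropped): duplicates among the candidates are skipped
/// silently, URLs whose hash is in `history` are reported as dropped so the
/// caller can trace them with `status: filtered_history`.
pub fn filter_history(
    candidates: &[String],
    history: &HashSet<u64>,
) -> (Vec<String>, Vec<String>) {
    let mut seen = HashSet::new();
    let mut kept = Vec::new();
    let mut dropped = Vec::new();
    for url in candidates {
        let h = url_hash(url);
        if !seen.insert(h) {
            continue; // same article listed twice on the source page
        }
        if history.contains(&h) {
            dropped.push(url.clone());
        } else {
            kept.push(url.clone());
        }
    }
    (kept, dropped)
}
```

Under this sketch, `https://OpenAI.com/blog/Post/?utm_source=x&id=3#top` normalizes to `https://openai.com/blog/post?id=3`, so tracking parameters and letter case no longer defeat the history lookup.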
1b. Scrape, classify and summarize articles
- For each URL from step 1a:
  - If the number of articles in `source_counts` for the source of the current URL exceeds `max_articles_per_source`:
    - Trace dropped article with `status: filtered_diversity`
    - Move to next URL
  - Fetch each URL (bounded concurrency: 5 with LLM extraction, 10 without).
  - SSRF check (no private IPs), 15s timeout, 5MB body limit.
  - HTML parsing heuristics for title (`<title>`, `og:title`), date (meta tags, JSON-LD, `<time>`), body (strip scripts/nav), soft-404 detection
  - If the scraped article body text is empty (scrape failure, soft 404, too old):
    - Trace dropped articles in `article_history` with `status: filtered_empty`
    - Move to next URL
  - Send article (title + first 500 chars of body) + categories + "Autre" to the LLM. The LLM returns `{title, summary, category}`, mapping the article to a category. The LLM generates the summary, and also a title if the provided title is empty
    - LLM call logged with full prompt/response/timing
  - Add the article to `article_scraped` and increase `filled_counts`
  - If the number of articles in the category of this article exceeds `max_items_per_category`: change the article category to "Autre"
  - If the total number of articles in `article_scraped` exceeds `number of categories (including Autre) × max_items_per_category`, then exit the for loop and move to synthesis generation
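The counting rules in step 1b (per-source cap, overflow into "Autre", and the stop condition) can be illustrated as below. This is a simplified sketch rather than the pipeline's code: `Placement`, `Limits`, `Counters`, `place` and `is_full` are hypothetical names, the fetch and LLM steps are left out, and "exceeds" is read here as "has already reached the cap", which is an interpretation of the text.

```rust
use std::collections::HashMap;

pub const FALLBACK: &str = "Autre";

#[derive(Debug, PartialEq)]
pub enum Placement {
    /// Dropped; the caller traces it with `status: filtered_diversity`.
    FilteredDiversity,
    /// Stored under this category (possibly the fallback).
    Stored(String),
}

pub struct Limits {
    pub max_articles_per_source: usize,
    pub max_items_per_category: usize,
}

pub struct Counters {
    pub source_counts: HashMap<String, usize>,
    pub filled_counts: HashMap<String, usize>,
}

/// Decide where one classified article goes and update both counters.
pub fn place(c: &mut Counters, limits: &Limits, source: &str, category: &str) -> Placement {
    let from_source = c.source_counts.get(source).copied().unwrap_or(0);
    if from_source >= limits.max_articles_per_source {
        return Placement::FilteredDiversity;
    }
    // A full category overflows into the fallback category.
    let filled = c.filled_counts.get(category).copied().unwrap_or(0);
    let target = if filled >= limits.max_items_per_category {
        FALLBACK
    } else {
        category
    };
    *c.source_counts.entry(source.to_string()).or_insert(0) += 1;
    *c.filled_counts.entry(target.to_string()).or_insert(0) += 1;
    Placement::Stored(target.to_string())
}

/// Stop condition: total stored articles has reached
/// (number of categories, including "Autre") x max_items_per_category.
pub fn is_full(c: &Counters, limits: &Limits, categories_incl_fallback: usize) -> bool {
    let total: usize = c.filled_counts.values().sum();
    total >= categories_incl_fallback * limits.max_items_per_category
}
```

For instance, with `max_articles_per_source = 2` and `max_items_per_category = 1`, three articles from one source classified as "AI" would be stored as "AI", then "Autre", then dropped for diversity.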
Phase 2: Web Search Fallback
Skipped if all user-defined categories are already filled to max_items_per_category.
2a. Compute category gaps
- For each user category: `needed = max - already_filled`
- Only proceed if any category needs more
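The gap computation is small enough to show in full. This sketch assumes a hypothetical `category_gaps` function that returns only the categories still needing articles, so an empty result means Phase 2 is skipped.

```rust
use std::collections::HashMap;

/// For each user category, compute `needed = max - already_filled`.
/// Categories that are already full (or over-full) are omitted.
pub fn category_gaps(
    user_categories: &[String],
    filled_counts: &HashMap<String, usize>,
    max_items_per_category: usize,
) -> Vec<(String, usize)> {
    user_categories
        .iter()
        .filter_map(|cat| {
            let filled = filled_counts.get(cat).copied().unwrap_or(0);
            // saturating_sub avoids underflow if a count ever exceeds the cap.
            let needed = max_items_per_category.saturating_sub(filled);
            (needed > 0).then(|| (cat.clone(), needed))
        })
        .collect()
}
```

The resulting pairs map directly onto the gap counts used in the search prompt ("find N articles for category_1, M for category_2").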
2b. LLM web search pass
- Build search prompt with theme, categories, gap counts ("find N articles for category_1, M for category_2")
- Send search prompt to LLM. LLM returns structured JSON: `{category_0: [{title, url, summary}], category_1: [...]}`
  - LLM call logged with full prompt/response/timing
- Filter homepage URLs: drop articles with path `/` or empty
- Cross-phase dedup: drop URLs already seen in Phase 1
- Dedup by URL: drop duplicate URLs within Phase 2 (case-insensitive)
- Limit articles per source: enforce `max_articles_per_source` per domain (spread across categories first, then fill)
- Filter against article history: done BEFORE scraping (saves HTTP requests), drop already-seen URLs
- Each drop traced in `article_history` with appropriate status
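The first four filters of step 2b can be sketched as a single pass over the returned URLs. Everything named here (`DropReason`, `host_and_path`, `filter_results`) is hypothetical. The sketch assumes `seen_urls` holds lowercased URLs, uses a hand-rolled host/path split where a real implementation would use a URL parser, and does not model the "spread across categories first, then fill" ordering or the article-history lookup.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq)]
pub enum DropReason {
    Homepage,
    SeenInPhase1,
    Duplicate,
    Diversity,
}

/// Split "scheme://host/path" into (lowercased host, path).
/// Simplified helper for illustration only.
fn host_and_path(url: &str) -> (String, String) {
    let rest = url.split_once("://").map(|(_, r)| r).unwrap_or(url);
    let rest = rest.split(|c: char| c == '?' || c == '#').next().unwrap_or("");
    match rest.split_once('/') {
        Some((host, path)) => (host.to_lowercase(), format!("/{}", path)),
        None => (rest.to_lowercase(), String::new()),
    }
}

/// Returns (kept URLs, dropped URLs with the reason for each drop).
pub fn filter_results(
    urls: &[String],
    seen_urls: &HashSet<String>,
    max_articles_per_source: usize,
) -> (Vec<String>, Vec<(String, DropReason)>) {
    let mut kept = Vec::new();
    let mut dropped = Vec::new();
    let mut in_phase: HashSet<String> = HashSet::new();
    let mut per_domain: HashMap<String, usize> = HashMap::new();
    for url in urls {
        let (host, path) = host_and_path(url);
        let key = url.to_lowercase(); // case-insensitive comparison
        let reason = if path.is_empty() || path == "/" {
            Some(DropReason::Homepage)
        } else if seen_urls.contains(&key) {
            Some(DropReason::SeenInPhase1)
        } else if !in_phase.insert(key) {
            Some(DropReason::Duplicate)
        } else {
            let n = per_domain.entry(host).or_insert(0);
            if *n >= max_articles_per_source {
                Some(DropReason::Diversity)
            } else {
                *n += 1;
                None
            }
        };
        match reason {
            Some(r) => dropped.push((url.clone(), r)),
            None => kept.push(url.clone()),
        }
    }
    (kept, dropped)
}
```

Returning the drop reason alongside each URL keeps the tracing step separate: the caller can map each reason to whatever status it records in `article_history`.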
2c. Scrape web search results
- Same scraping as Phase 1 (bounded concurrency, SSRF check, optional LLM extraction)
- Filter empty content (scrape failures, soft 404, too old)
- Trace drops
- Merge results into `all_scraped`
- Move to synthesis generation
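The scraping limits shared by both phases (concurrency 5 or 10, 15s timeout, 5MB body limit, no private IPs) can be written down as constants plus an address check. The names below are hypothetical, and the IP check is a minimal sketch: it has to run on the resolved address rather than the hostname, and it does not cover IPv4-mapped IPv6 addresses or every reserved range.

```rust
use std::net::IpAddr;
use std::time::Duration;

/// Per-request timeout stated in the text.
pub const FETCH_TIMEOUT: Duration = Duration::from_secs(15);
/// Maximum response body size stated in the text (5 MB).
pub const MAX_BODY_BYTES: usize = 5 * 1024 * 1024;

/// Bounded concurrency: 5 with LLM extraction, 10 without.
pub fn scrape_concurrency(llm_extraction: bool) -> usize {
    if llm_extraction { 5 } else { 10 }
}

/// True if a resolved address must not be fetched (SSRF guard).
pub fn is_forbidden_ip(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => {
            v4.is_private()            // 10/8, 172.16/12, 192.168/16
                || v4.is_loopback()    // 127/8
                || v4.is_link_local()  // 169.254/16
                || v4.is_unspecified() // 0.0.0.0
                || v4.is_broadcast()   // 255.255.255.255
        }
        IpAddr::V6(v6) => {
            let first = v6.segments()[0];
            v6.is_loopback()
                || v6.is_unspecified()
                || (first & 0xfe00) == 0xfc00 // unique local, fc00::/7
                || (first & 0xffc0) == 0xfe80 // link-local, fe80::/10
        }
    }
}
```

Link-local addresses are included because `169.254.169.254` is a common cloud metadata endpoint, which is a typical SSRF target.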
Save + Record
- Sanitize: strip `\u0000` null bytes from JSON (PostgreSQL rejects them in JSONB)
- Save synthesis: insert into `syntheses` table with `job_id`, `week` (ISO week), `sections` (JSONB), `status: completed`
- Record used articles: insert each article URL into `article_history` with `status: used`, `synthesis_id`, `job_id`, and category name (for future dedup + provenance)
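One way to do the sanitize step is to remove U+0000 from each string before the sections are serialized, which avoids editing escaped JSON text by hand. The `Section` struct and both function names are hypothetical; the actual pipeline may sanitize the serialized JSON instead.

```rust
/// One section of a synthesis (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Section {
    pub category: String,
    pub title: String,
    pub summary: String,
}

/// Remove U+0000 from a string; every other character is kept.
pub fn strip_nul(s: &str) -> String {
    s.chars().filter(|&c| c != '\0').collect()
}

/// Sanitize every text field in place before serializing to JSONB.
pub fn sanitize_sections(sections: &mut [Section]) {
    for s in sections.iter_mut() {
        s.category = strip_nul(&s.category);
        s.title = strip_nul(&s.title);
        s.summary = strip_nul(&s.summary);
    }
}
```

Working on the values rather than on the serialized text means a literal backslash followed by `u0000` in an article body is left alone, since it is not a null byte.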