Synthesis Generation Pipeline — Full Algorithm
Startup & Background Tasks
- Session cleanup: an hourly background task deletes expired DB sessions (db::sessions::delete_expired).
- Job store TTL: expired job entries (older than 1 hour) are cleaned up via JobStore::cleanup_expired.
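The TTL cleanup can be pictured with a small in-memory sketch. The struct layout, the insert method, and the explicit now parameter are assumptions made for illustration; only the name cleanup_expired and the 1-hour TTL come from the description above.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical in-memory job store; the real JobStore's fields are not shown here.
pub struct JobStore {
    ttl: Duration,
    created_at: HashMap<String, Instant>, // job id -> creation time
}

impl JobStore {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, created_at: HashMap::new() }
    }

    pub fn insert(&mut self, job_id: &str, created: Instant) {
        self.created_at.insert(job_id.to_string(), created);
    }

    /// Removes every job older than the TTL and returns how many were dropped.
    /// `now` is passed in (an assumption) so the logic can be tested without sleeping.
    pub fn cleanup_expired(&mut self, now: Instant) -> usize {
        let before = self.created_at.len();
        let ttl = self.ttl;
        self.created_at
            .retain(|_, created| now.saturating_duration_since(*created) <= ttl);
        before - self.created_at.len()
    }

    pub fn len(&self) -> usize {
        self.created_at.len()
    }
}
```

Passing the clock in as a parameter keeps the expiry rule a pure function of its inputs, which is convenient when the task itself runs on a timer.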
Generation Lifecycle
POST /api/v1/syntheses/generate creates a job in the JobStore, then spawns two nested tasks:
- Inner task: wraps run_generation in a 15-minute tokio::time::timeout. If the timeout fires, sends an Error progress event and releases the user lock.
- Outer task: monitors the inner task's JoinHandle for panics. If the inner task panics, sends an Error progress event and releases the user lock.
Progress is streamed to clients via a tokio::sync::watch channel (SSE endpoint subscribes to it).
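The two failure modes handled by the nested tasks (timeout and panic) can be illustrated with a standard-library analogue. This is not the tokio implementation: it swaps tokio tasks for std::thread and tokio::time::timeout for mpsc::recv_timeout, and the Outcome enum and supervise function are hypothetical names. In the pipeline described above, both TimedOut and Panicked would lead to an Error progress event and a release of the user lock.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical result of supervising one generation run.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    Completed,
    TimedOut,
    Panicked,
}

/// Runs `work` on a worker thread and classifies how it ended.
pub fn supervise<F>(work: F, limit: Duration) -> Outcome
where
    F: FnOnce() + Send + 'static,
{
    let (done_tx, done_rx) = mpsc::channel::<()>();
    let handle = thread::spawn(move || {
        work();
        // Only reached when `work` returns normally.
        let _ = done_tx.send(());
    });

    match done_rx.recv_timeout(limit) {
        Ok(()) => Outcome::Completed,
        // The worker is still running. Unlike a dropped tokio future, a std
        // thread cannot be cancelled here; it is simply left detached.
        Err(mpsc::RecvTimeoutError::Timeout) => Outcome::TimedOut,
        // The sender was dropped without sending: the worker unwound.
        Err(mpsc::RecvTimeoutError::Disconnected) => match handle.join() {
            Err(_) => Outcome::Panicked,
            Ok(()) => Outcome::Completed,
        },
    }
}
```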
Initialization
- Load user settings from DB (categories, provider, models, max_items, batch_size, etc.)
- Cleanup: delete old article history entries (>N days, dropped only) and truncate old LLM call logs
- Validate: if no categories are configured, the only available category will be "Autre".
- Load user sources (personalized URLs like https://openai.com/blog)
- Resolve LLM provider: decrypt user's API key, create provider instance (Arc<dyn LlmProvider>)
- Resolve models: research model + web-search model (user override or admin default)
- Setup rate limiter: per-user or global provider limiter
- Initialize tracking structures: article_scraped (category→articles), source_counts (per-domain article count), url_source (per-article source), filled_counts (per-category article count), seen_urls (cross-phase dedup), classification_categories (user categories + "Autre")
- Batch trace buffer: pending_traces (Vec<ArticleHistoryEntry>) accumulates all article history writes; flushed with db::article_history::batch_insert_entries at phase boundaries (not per article).
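The batch trace buffer amounts to "accumulate in memory, write once per phase". A minimal sketch follows, assuming a two-field stand-in for ArticleHistoryEntry and a closure in place of db::article_history::batch_insert_entries; TraceBuffer and its methods are hypothetical names.

```rust
/// Simplified stand-in for ArticleHistoryEntry (the real struct has more fields).
#[derive(Debug, Clone, PartialEq)]
pub struct ArticleHistoryEntry {
    pub url: String,
    pub status: String,
}

/// Accumulates traces in memory and hands them to a sink in a single batch.
#[derive(Default)]
pub struct TraceBuffer {
    pending_traces: Vec<ArticleHistoryEntry>,
}

impl TraceBuffer {
    /// Records one trace without touching the database.
    pub fn push(&mut self, url: &str, status: &str) {
        self.pending_traces.push(ArticleHistoryEntry {
            url: url.to_string(),
            status: status.to_string(),
        });
    }

    /// Drains the buffer into `batch_insert` (a stand-in for the DB call) and
    /// returns the number of entries flushed. An empty buffer triggers no call.
    pub fn flush<F>(&mut self, batch_insert: F) -> usize
    where
        F: FnOnce(Vec<ArticleHistoryEntry>),
    {
        if self.pending_traces.is_empty() {
            return 0;
        }
        let batch = std::mem::take(&mut self.pending_traces);
        let flushed = batch.len();
        batch_insert(batch);
        flushed
    }
}
```

Calling flush at a phase boundary then costs one insert regardless of how many articles were traced during the phase.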
Phase 1: Personalized Sources
Skipped entirely if user has 0 sources.
1a. Extract article links from source pages and filter against article history
- Query article_history for the last source used. Reorder the personalized sources so that the first source is the one following the last source used (rolling window).
- Fetch source pages with bounded concurrency of 5 (hardcoded max_concurrent = 5):
  - If use_llm_for_source_links is enabled: send the HTML <head> + first 8000 chars of <body> to the LLM, which extracts all article URLs up to a maximum of 15, with the most recent first. If the LLM call fails, fall back to HTML parsing as described below.
    - LLM call logged with full prompt/response/timing.
  - Otherwise: parse HTML <a href> links, filter by same-domain, non-homepage path, exclude /tag/, /login/, /contact/, /presentation/, /newsletter/, static assets, etc., and keep only the first 15 links found.
  - SSRF check performed on each source URL before fetching (rejects private/loopback IPs).
- Deduplicate candidate URLs (case-insensitive, cross-source via seen_urls).
- Filter against article history: hash each URL (normalized: lowercase, strip fragments/UTM params/trailing slashes), batch-query article_history, and remove matches.
  - Trace dropped articles as status: filtered_history (flushed immediately after this filter step).
- Shuffle remaining candidates to interleave articles from different sources.
- Track url → source in url_source.
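The rolling-window reordering in the first step can be sketched as a slice rotation. The function name rotate_sources and the choice to leave the order untouched when the last-used source is unknown are assumptions.

```rust
/// Rotates `sources` so that the entry right after `last_used` comes first.
/// If `last_used` is None or no longer in the list, the order is unchanged
/// (an assumption; the real fallback is not described).
pub fn rotate_sources(sources: &mut [String], last_used: Option<&str>) {
    if let Some(last) = last_used {
        if let Some(idx) = sources.iter().position(|s| s.as_str() == last) {
            // Wrap around when the last-used source was the final entry.
            let start = (idx + 1) % sources.len();
            sources.rotate_left(start);
        }
    }
}
```

With sources [a, b, c] and last-used source b, the next run starts at c and then continues with a and b, so each source gets its turn at the front across runs.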
1b. Scrape, classify, and summarize articles (batched)
Processing happens in batches of settings.batch_size (minimum 1). For each batch:
Batch assembly: pull up to batch_size candidates from the iterator, skipping any where source_counts[domain] >= max_articles_per_source (traced as filtered_diversity).
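Batch assembly can be sketched as pulling from a shared iterator until the batch is full. The Candidate struct and the returned list of skipped items are assumptions made for illustration; in the pipeline the skipped items are traced as filtered_diversity.

```rust
use std::collections::HashMap;

/// Hypothetical candidate: a URL plus its already-extracted domain.
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub url: String,
    pub domain: String,
}

/// Pulls up to `batch_size` candidates from `iter`. Candidates whose domain has
/// already reached `max_articles_per_source` are returned separately so the
/// caller can trace them. Counts are read as they stand when the batch is built.
pub fn assemble_batch<I>(
    iter: &mut I,
    source_counts: &HashMap<String, usize>,
    max_articles_per_source: usize,
    batch_size: usize,
) -> (Vec<Candidate>, Vec<Candidate>)
where
    I: Iterator<Item = Candidate>,
{
    let batch_size = batch_size.max(1); // batch size has a minimum of 1
    let mut batch = Vec::new();
    let mut skipped = Vec::new();
    while batch.len() < batch_size {
        match iter.next() {
            None => break,
            Some(c) => {
                let count = source_counts.get(&c.domain).copied().unwrap_or(0);
                if count >= max_articles_per_source {
                    skipped.push(c);
                } else {
                    batch.push(c);
                }
            }
        }
    }
    (batch, skipped)
}
```

Because the iterator is borrowed rather than consumed, the next call resumes exactly where the previous batch stopped.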
Phase A — Scrape batch in parallel (JoinSet):
- SSRF check (no private IPs), 15s timeout, 5MB body limit.
- HTML parsing heuristics for title (<title>, og:title), date (meta tags, JSON-LD, <time>), body (strip scripts/nav), soft-404 detection.
- If the article body is empty, is a soft-404, or is too old: trace as filtered_empty / filtered_too_old and skip.
Phase B — Classify/summarize batch in parallel (JoinSet):
- Check rate limit before classifying (waits up to 60s, then errors).
- Send article (title + first 500 chars of body) + categories + "Autre" to LLM. LLM returns {title, summary, category}. If the title was empty, the LLM also generates one.
  - LLM call logged with full prompt/response/timing.
- assign_category() helper: validates category, falls back to "Autre" if category is unknown or full. If "Autre" is also full, drops the article.
- Add the article to article_scraped, increment filled_counts, increment source_counts[domain].
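The fallback rule of assign_category() can be written out as follows. The signature is an assumption; the behavior (accept a known category with room, otherwise fall back to "Autre", otherwise drop) follows the description above.

```rust
use std::collections::HashMap;

pub const FALLBACK: &str = "Autre";

/// Returns the category the article should be filed under, or None when the
/// article must be dropped. The parameter list is hypothetical.
pub fn assign_category(
    proposed: &str,
    classification_categories: &[String],
    filled_counts: &HashMap<String, usize>,
    max_items_per_category: usize,
) -> Option<String> {
    let has_room =
        |cat: &str| filled_counts.get(cat).copied().unwrap_or(0) < max_items_per_category;
    let known = classification_categories.iter().any(|c| c.as_str() == proposed);

    if known && has_room(proposed) {
        return Some(proposed.to_string());
    }
    // Unknown or full category: try the catch-all, otherwise drop the article.
    if has_room(FALLBACK) {
        Some(FALLBACK.to_string())
    } else {
        None
    }
}
```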
Early exit: after each batch, if total articles across all categories ≥ (num_categories + 1) × max_items_per_category, stop and move to Phase 2.
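As a worked example of the early-exit threshold: with 3 user categories and max_items_per_category = 4, Phase 1 stops once 16 articles have been kept. The function names below are hypothetical, and reading the "+ 1" as the slot for "Autre" is an interpretation rather than something the description states.

```rust
use std::collections::HashMap;

/// Article cap for Phase 1: (num_categories + 1) * max_items_per_category.
pub fn phase1_article_cap(num_categories: usize, max_items_per_category: usize) -> usize {
    (num_categories + 1) * max_items_per_category
}

/// True when the total across all categories has reached the cap.
pub fn should_stop_phase1(
    filled_counts: &HashMap<String, usize>,
    num_categories: usize,
    max_items_per_category: usize,
) -> bool {
    let total: usize = filled_counts.values().sum();
    total >= phase1_article_cap(num_categories, max_items_per_category)
}
```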
Trace flush: all pending traces accumulated during Phase 1 are batch-inserted into article_history after Phase 1 completes.
Phase 2: Web Search Fallback
Skipped if all user-defined categories are already filled to max_items_per_category.
2a. Compute category gaps
- For each user category: needed = max_items_per_category - already_filled
- Only proceed if any category needs more articles.
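The gap computation can be sketched in a few lines. The function name compute_gaps is hypothetical, and the saturating subtraction is a defensive assumption so that an overfilled category cannot underflow.

```rust
use std::collections::HashMap;

/// Returns (category, needed) for every user category that still has room.
/// An empty result means Phase 2 can be skipped.
pub fn compute_gaps(
    user_categories: &[String],
    filled_counts: &HashMap<String, usize>,
    max_items_per_category: usize,
) -> Vec<(String, usize)> {
    user_categories
        .iter()
        .filter_map(|cat| {
            let already_filled = filled_counts.get(cat).copied().unwrap_or(0);
            let needed = max_items_per_category.saturating_sub(already_filled);
            (needed > 0).then(|| (cat.clone(), needed))
        })
        .collect()
}
```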
2b. Choose path: Brave Search or LLM web search
The path is selected by the settings.use_brave_search flag.
Path A: Brave Search (use_brave_search = true)
2b-A. Call Brave Search API
- Resolve and decrypt the user's Brave Search API key (error if not configured).
- Query: "{settings.theme} actualites", up to 20 results, filtered by max_age_days.
2c-A. Filter Brave results
Each result URL passes through filter_phase2_url():
- Homepage filter: drop URLs with path / or empty (filtered_homepage)
- Cross-phase dedup: drop URLs already in seen_urls (filtered_cross_phase_dedup)
- Article history: check hash in DB, drop if seen before (filtered_history)
- Source diversity: drop if source_counts[domain] >= max_articles_per_source (filtered_diversity)
Accepted URLs are added to seen_urls. All rejections are traced. Traces are batch-flushed after this filter step.
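The order of the four checks can be illustrated with a synchronous sketch. It departs from the real helper in several labeled ways: the real filter_phase2_url() is async and checks a URL hash in the database, whereas this version takes an in-memory set of lowercased URLs as a stand-in for history, and the Phase2Decision enum and host_and_path parser are hypothetical.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical decision type; rejections mirror the trace statuses above.
#[derive(Debug, PartialEq)]
pub enum Phase2Decision {
    Accept,
    FilteredHomepage,
    FilteredCrossPhaseDedup,
    FilteredHistory,
    FilteredDiversity,
}

/// Splits "scheme://host/path?query#frag" into (host, path). Deliberately
/// simplified; a real implementation would use a URL parsing crate.
fn host_and_path(url: &str) -> (String, String) {
    let rest = url.split_once("://").map(|(_, r)| r).unwrap_or(url);
    let rest = rest.split(|c: char| c == '?' || c == '#').next().unwrap_or("");
    match rest.split_once('/') {
        Some((host, path)) => (host.to_lowercase(), format!("/{}", path)),
        None => (rest.to_lowercase(), String::new()),
    }
}

pub fn filter_phase2_url(
    url: &str,
    seen_urls: &mut HashSet<String>,
    history: &HashSet<String>, // stand-in for the DB hash lookup
    source_counts: &HashMap<String, usize>,
    max_articles_per_source: usize,
) -> Phase2Decision {
    let (domain, path) = host_and_path(url);
    let key = url.to_lowercase();
    if path.is_empty() || path == "/" {
        return Phase2Decision::FilteredHomepage;
    }
    if seen_urls.contains(&key) {
        return Phase2Decision::FilteredCrossPhaseDedup;
    }
    if history.contains(&key) {
        return Phase2Decision::FilteredHistory;
    }
    if source_counts.get(&domain).copied().unwrap_or(0) >= max_articles_per_source {
        return Phase2Decision::FilteredDiversity;
    }
    // Only accepted URLs are remembered for later dedup.
    seen_urls.insert(key);
    Phase2Decision::Accept
}
```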
2d-A. Scrape + classify Brave results (batched)
Same batch loop as Phase 1b, using settings.batch_size:
- Phase A: scrape batch in parallel, trace failures as source_type: "brave_search".
- Phase B: classify/summarize in parallel (same LLM call + logging as Phase 1).
- assign_category() used identically to Phase 1.
- Source domain tracked in source_counts.
- Early exit at max_total articles.
- Traces are batch-flushed after this loop.
Path B: LLM Web Search (use_brave_search = false)
2b-B. LLM web search pass
- Check rate limit.
- Build search prompt with theme, categories, gap counts ("find N articles for category_1, M for category_2").
- Send search prompt to LLM (using model_websearch). LLM returns structured JSON: {category_0: [{title, url, summary}], category_1: [...]}
  - LLM call logged with full prompt/response/timing.
2c-B. Filter LLM search results
Same filter_phase2_url() logic as Path A (homepage, cross-phase dedup, history, diversity). Accepted URLs are added to seen_urls. Traces are batch-flushed after this filter step.
2d-B. Scrape LLM search results (sequential)
- For each accepted item: call scrape_single_article (SSRF check, 15s timeout, 5MB limit).
- If scrape fails or article is too old/empty: trace as appropriate and skip.
- Otherwise: keep the LLM-provided title and summary (no re-classification LLM call). Add to article_scraped, increment source_counts[domain].
- Traces are batch-flushed after all items are processed.
Save + Record
- Error if empty: if all scraped article lists are empty, return an error.
- Order sections: user-defined categories first (in order), then "Autre" if non-empty.
- Sanitize: strip \u0000 null bytes from JSON (PostgreSQL rejects them in JSONB).
- Save synthesis: insert into syntheses table with job_id, week (ISO week), sections (JSONB), status: completed.
- Record used articles: for each article in the final synthesis, build a trace with status: "used", synthesis_id, and the correct source_type (personalized_source, brave_search, or web_search, inferred from url_source). Batch-insert into article_history.
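The ordering and sanitizing steps can be sketched as below. Articles are reduced to plain strings for brevity, both function names are hypothetical, and two points are assumptions: user categories are kept even when empty, and null characters are stripped from string values before serialization rather than from the serialized JSON text.

```rust
use std::collections::HashMap;

/// Orders sections: user categories first (in their configured order), then
/// "Autre" only when it has at least one article.
pub fn order_sections(
    user_categories: &[String],
    article_scraped: &HashMap<String, Vec<String>>,
) -> Vec<(String, Vec<String>)> {
    let mut sections: Vec<(String, Vec<String>)> = user_categories
        .iter()
        .map(|cat| (cat.clone(), article_scraped.get(cat).cloned().unwrap_or_default()))
        .collect();
    if let Some(autre) = article_scraped.get("Autre") {
        if !autre.is_empty() {
            sections.push(("Autre".to_string(), autre.clone()));
        }
    }
    sections
}

/// Removes U+0000 characters from a string value.
pub fn strip_null_bytes(value: &str) -> String {
    value.replace('\u{0000}', "")
}
```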
Shared Helpers
- build_trace_entry(): constructs an ArticleHistoryEntry from an ArticleTrace struct (replaces the old 11-positional-parameter trace_article function). Never writes to DB directly; caller accumulates in pending_traces.
- assign_category(): validates LLM-returned category against the classification list, falls back to "Autre", drops article if "Autre" is also full.
- filter_phase2_url(): async helper applying homepage/dedup/history/diversity filters for Phase 2 (both Brave and LLM paths).
- scrape_single_article(): thin wrapper around scraper::scrape_url returning (body_text, page_title, final_url, drop_reason).
- hash_article_url(): normalizes a URL (strips fragments, UTM params, trailing slashes, lowercases) then SHA-256 hashes it for history lookup.
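The normalization half of hash_article_url() can be sketched with the standard library alone. The function name normalize_article_url and the exact order of the steps are assumptions, and the SHA-256 step is left out because it needs an external crate; the real helper would hash the string returned here.

```rust
/// Normalizes a URL for history lookup: lowercase, drop the fragment, drop
/// utm_* query parameters, and strip trailing slashes from the path.
pub fn normalize_article_url(url: &str) -> String {
    let lower = url.trim().to_lowercase();
    // Everything after '#' is the fragment.
    let no_fragment = lower.split('#').next().unwrap_or("");
    let (base, query) = match no_fragment.split_once('?') {
        Some((b, q)) => (b, q),
        None => (no_fragment, ""),
    };
    // Keep only non-tracking query parameters, in their original order.
    let kept: Vec<&str> = query
        .split('&')
        .filter(|p| !p.is_empty() && !p.starts_with("utm_"))
        .collect();

    let mut out = base.trim_end_matches('/').to_string();
    if !kept.is_empty() {
        out.push('?');
        out.push_str(&kept.join("&"));
    }
    out
}
```

Two links to the same article that differ only in case, tracking parameters, fragment, or a trailing slash then map to the same string, and therefore to the same hash.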