Synthesis Generation Pipeline — Full Algorithm

Startup & Background Tasks

Session cleanup: an hourly background task deletes expired DB sessions (db::sessions::delete_expired).
Job store TTL: expired job entries (older than 1 hour) are cleaned up via JobStore::cleanup_expired.

Generation Lifecycle

POST /api/v1/syntheses/generate creates a job in the JobStore, then spawns two nested tasks:

Inner task: wraps run_generation in a 15-minute tokio::time::timeout. If the timeout fires, sends an Error progress event and releases the user lock.
Outer task: monitors the inner task's JoinHandle for panics. If the inner task panics, sends an Error progress event and releases the user lock.

Progress is streamed to clients via a tokio::sync::watch channel (SSE endpoint subscribes to it).
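
The supervision pattern above can be sketched without external crates. This is an analogue built on std threads and channels, not the tokio code itself: `supervise` and `Outcome` are hypothetical names, and the real pipeline uses tokio::time::timeout plus a JoinHandle. On `TimedOut` or `Panicked`, the caller would send the Error progress event and release the user lock.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical result of a supervised run.
#[derive(Debug, PartialEq)]
enum Outcome {
    Completed,
    TimedOut,
    Panicked,
}

/// Runs `job` on a worker thread and waits at most `limit` for it to finish.
/// Assumption: std threads stand in for the nested tokio tasks.
fn supervise<F>(job: F, limit: Duration) -> Outcome
where
    F: FnOnce() + Send + 'static,
{
    let (done_tx, done_rx) = mpsc::channel::<()>();
    let worker = thread::spawn(move || {
        job();
        // Only reached when the job returns normally.
        let _ = done_tx.send(());
    });

    match done_rx.recv_timeout(limit) {
        Ok(()) => Outcome::Completed,
        Err(mpsc::RecvTimeoutError::Timeout) => Outcome::TimedOut,
        Err(mpsc::RecvTimeoutError::Disconnected) => {
            // The sender was dropped without sending, so the job unwound.
            let _ = worker.join();
            Outcome::Panicked
        }
    }
}
```

One difference worth noting: a std thread that times out keeps running in the background, whereas a tokio timeout drops the wrapped future.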

Initialization

Load user settings from DB (categories, provider, models, max_items, batch_size, etc.)
Cleanup — delete old article history entries (>N days, dropped only) + truncate old LLM call logs
Validate — if no categories configured, the only available category will be "Autre".
Load user sources (personalized URLs like https://openai.com/blog)
Resolve LLM provider — decrypt user's API key, create provider instance (Arc<dyn LlmProvider>)
Resolve models — research model + web-search model (user override or admin default)
Setup rate limiter — per-user or global provider limiter
Initialize tracking structures — article_scraped (category→articles), source_counts (per-domain article count), url_source (per-article source), filled_counts (per-category article count), seen_urls (cross-phase dedup), classification_categories (user categories + "Autre")
Batch trace buffer — pending_traces: Vec<ArticleHistoryEntry> accumulates all article history writes; flushed with db::article_history::batch_insert_entries at phase boundaries (not per article).
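
The batch trace buffer could look roughly like the sketch below. `TraceEntry` and `TraceBuffer` are hypothetical stand-ins for ArticleHistoryEntry and pending_traces, and the `sink` closure stands in for db::article_history::batch_insert_entries.

```rust
/// Hypothetical, reduced version of an article history entry.
#[derive(Debug, Clone, PartialEq)]
struct TraceEntry {
    url: String,
    status: String,
}

/// Accumulates traces in memory until a phase boundary.
struct TraceBuffer {
    pending: Vec<TraceEntry>,
}

impl TraceBuffer {
    fn new() -> Self {
        Self { pending: Vec::new() }
    }

    /// Records a trace without touching the database.
    fn push(&mut self, url: &str, status: &str) {
        self.pending.push(TraceEntry {
            url: url.to_string(),
            status: status.to_string(),
        });
    }

    /// Hands all buffered entries to `sink` in a single call, then empties
    /// the buffer. Returns the number of entries flushed.
    fn flush<F: FnMut(&[TraceEntry])>(&mut self, mut sink: F) -> usize {
        if self.pending.is_empty() {
            return 0;
        }
        sink(&self.pending);
        let flushed = self.pending.len();
        self.pending.clear();
        flushed
    }
}
```

Flushing once per phase rather than once per article trades a little memory for far fewer database round trips.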

Phase 1: Personalized Sources

Skipped entirely if user has 0 sources.

1a. Extract article links from source pages and filter against article history

Query article_history for the most recently used source, then rotate the personalized source list so that it starts with the source immediately after that one (a rolling window across successive runs).
Fetch source pages with bounded concurrency of 5 (hardcoded max_concurrent = 5):
- If use_llm_for_source_links enabled: send HTML <head> + first 8000 chars of <body> to LLM → extract all article URLs up to a maximum of 15, with the most recent first. If the LLM call fails, fall back to HTML parsing as described below.
  - LLM call logged with full prompt/response/timing
- Otherwise: parse HTML <a href> links, filter by same-domain, non-homepage path, exclude /tag/, /login/, /contact/, /presentation/, /newsletter/, static assets, etc., and keep only the first 15 links found.
- SSRF check performed on each source URL before fetching (rejects private/loopback IPs).
Deduplicate candidate URLs (case-insensitive, cross-source via seen_urls).
Filter against article history — hash each URL (normalized: lowercase, strip fragments/UTM params/trailing slashes), batch-query article_history → remove matches.
- Trace dropped articles as status: filtered_history (flushed immediately after this filter step).
Shuffle remaining candidates to interleave articles from different sources.
Track url → source in url_source.
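
The rolling-window reordering from the first step of 1a can be sketched as a simple rotation. `rotate_after_last_used` is a hypothetical helper; the assumption here is that an unknown or missing last source leaves the order unchanged.

```rust
/// Rotates `sources` so that the entry following `last_used` comes first.
/// If `last_used` is None or not in the list, the order is left as-is
/// (an assumption; the original does not specify this case).
fn rotate_after_last_used(mut sources: Vec<String>, last_used: Option<&str>) -> Vec<String> {
    if let Some(last) = last_used {
        if let Some(idx) = sources.iter().position(|s| s.as_str() == last) {
            // `position` found a match, so the list is non-empty here.
            let start = (idx + 1) % sources.len();
            sources.rotate_left(start);
        }
    }
    sources
}
```

For example, with sources `[a, b, c]` and `b` used last, the next run would start at `c`, then `a`, then `b`.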

1b. Scrape, classify, and summarize articles (batched)

Processing happens in batches of settings.batch_size (minimum 1). For each batch:

Batch assembly: pull up to batch_size candidates from the iterator, skipping any where source_counts[domain] >= max_articles_per_source (traced as filtered_diversity).
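
A minimal sketch of batch assembly under the diversity cap. `Candidate` and `assemble_batch` are hypothetical names; the skipped URLs returned here are the ones that would be traced as filtered_diversity.

```rust
use std::collections::HashMap;

/// Hypothetical candidate article: its URL and the domain it came from.
struct Candidate {
    url: String,
    domain: String,
}

/// Pulls candidates until the batch holds `batch_size` items (minimum 1) or
/// the iterator runs dry. Candidates whose domain already reached
/// `max_per_source` are skipped and reported separately.
fn assemble_batch<I>(
    candidates: &mut I,
    batch_size: usize,
    max_per_source: usize,
    source_counts: &HashMap<String, usize>,
) -> (Vec<Candidate>, Vec<String>)
where
    I: Iterator<Item = Candidate>,
{
    let batch_size = batch_size.max(1);
    let mut batch = Vec::new();
    let mut skipped = Vec::new();
    while batch.len() < batch_size {
        let candidate = match candidates.next() {
            Some(c) => c,
            None => break,
        };
        let count = source_counts.get(&candidate.domain).copied().unwrap_or(0);
        if count >= max_per_source {
            skipped.push(candidate.url);
        } else {
            batch.push(candidate);
        }
    }
    (batch, skipped)
}
```

In this sketch the counts are read-only during assembly, so several candidates from one domain can land in the same batch; the counters only move once articles are accepted.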

Phase A — Scrape batch in parallel (JoinSet):

SSRF check (no private IPs), 15s timeout, 5MB body limit.
HTML parsing heuristics for title (<title>, og:title), date (meta tags, JSON-LD, <time>), body (strip scripts/nav), soft-404 detection.
If article body is empty, is a soft-404, or is too old: trace as filtered_empty / filtered_too_old and skip.
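
The SSRF check rejects private and loopback addresses. A rough sketch of the address test using only std::net follows. `is_forbidden_ip` is a hypothetical name, and the IPv6 unique-local range (fc00::/7) is an extra assumption not stated above. A full check would also resolve the hostname and test every resolved address, which is omitted here.

```rust
use std::net::IpAddr;

/// Returns true for addresses a scraper should refuse to contact.
/// Covers IPv4 private and loopback ranges, IPv6 loopback, and (as an
/// assumption) the IPv6 unique-local range fc00::/7.
fn is_forbidden_ip(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => v4.is_private() || v4.is_loopback(),
        IpAddr::V6(v6) => {
            let unique_local = (v6.segments()[0] & 0xfe00) == 0xfc00;
            v6.is_loopback() || unique_local
        }
    }
}
```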

Phase B — Classify/summarize batch in parallel (JoinSet):

Check rate limit before classifying (waits up to 60s, then errors).
Send article (title + first 500 chars of body) + categories + "Autre" to LLM. LLM returns {title, summary, category}. If the title was empty, the LLM also generates one.
- LLM call logged with full prompt/response/timing.
assign_category() helper: validates category, falls back to "Autre" if category is unknown or full. If "Autre" is also full, drops the article.
Add the article to article_scraped, increment filled_counts, increment source_counts[domain].
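
The category fallback described above can be sketched as follows. The signature is an assumption; only the behavior (validate, fall back to "Autre", drop when "Autre" is full) comes from the description.

```rust
use std::collections::HashMap;

const FALLBACK: &str = "Autre";

/// Returns the category an article should be filed under, or None if it
/// must be dropped. `categories` is the classification list (user
/// categories plus "Autre"); `filled` holds per-category article counts.
fn assign_category(
    proposed: &str,
    categories: &[&str],
    filled: &HashMap<String, usize>,
    max_items: usize,
) -> Option<String> {
    let has_room = |c: &str| filled.get(c).copied().unwrap_or(0) < max_items;
    let known = categories.iter().any(|c| *c == proposed);

    if known && has_room(proposed) {
        return Some(proposed.to_string());
    }
    // Unknown or full category: try the catch-all bucket.
    if has_room(FALLBACK) {
        Some(FALLBACK.to_string())
    } else {
        None
    }
}
```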

Early exit: after each batch, if the total number of articles across all categories reaches (num_categories + 1) × max_items_per_category (the + 1 accounts for "Autre"), stop and move to Phase 2.

Trace flush: all pending traces accumulated during Phase 1 are batch-inserted into article_history after Phase 1 completes.

Phase 2: Web Search Fallback

Skipped if all user-defined categories are already filled to max_items_per_category.

2a. Compute category gaps

For each user category: needed = max_items_per_category - already_filled
Only proceed if any category needs more articles.
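
A small sketch of the gap computation, assuming counts are kept in a map keyed by category name. `category_gaps` is a hypothetical helper; an empty result means Phase 2 is skipped.

```rust
use std::collections::HashMap;

/// For each user category, computes how many articles are still needed.
/// Categories that are already full are left out of the result.
fn category_gaps(
    categories: &[&str],
    filled: &HashMap<String, usize>,
    max_items: usize,
) -> Vec<(String, usize)> {
    categories
        .iter()
        .filter_map(|c| {
            let have = filled.get(*c).copied().unwrap_or(0);
            // saturating_sub guards against a count above the cap.
            let needed = max_items.saturating_sub(have);
            if needed > 0 {
                Some(((*c).to_string(), needed))
            } else {
                None
            }
        })
        .collect()
}
```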

2b. Choose path: Brave Search or LLM web search

The path is selected by the settings.use_brave_search flag.

Path A: Brave Search (`use_brave_search = true`)

2b-A. Call Brave Search API

Resolve and decrypt the user's Brave Search API key (error if not configured).
Query: "{settings.theme} actualites", up to 20 results, filtered by max_age_days.

2c-A. Filter Brave results

Each result URL passes through filter_phase2_url():

Homepage filter — drop URLs with path / or empty (filtered_homepage)
Cross-phase dedup — drop URLs already in seen_urls (filtered_cross_phase_dedup)
Article history — check hash in DB, drop if seen before (filtered_history)
Source diversity — drop if source_counts[domain] >= max_articles_per_source (filtered_diversity)

Accepted URLs are added to seen_urls. All rejections are traced. Traces are batch-flushed after this filter step.
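
The filter order can be illustrated with a synchronous, in-memory sketch. The real helper is async and checks hashed URLs in the database, so `FilterState`, `Rejection`, and `split_url` are hypothetical simplifications, and the `history` set stands in for the article_history lookup.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq)]
enum Rejection {
    Homepage,
    CrossPhaseDedup,
    History,
    Diversity,
}

struct FilterState {
    seen_urls: HashSet<String>,
    history: HashSet<String>, // stand-in for the DB history lookup
    source_counts: HashMap<String, usize>,
    max_per_source: usize,
}

/// Naive split of "scheme://host/path" into (host, path).
fn split_url(url: &str) -> (&str, &str) {
    let rest = url.split_once("://").map(|(_, r)| r).unwrap_or(url);
    match rest.find('/') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, ""),
    }
}

/// Applies the four filters in order; accepted URLs are added to seen_urls.
fn filter_phase2_url(url: &str, st: &mut FilterState) -> Result<(), Rejection> {
    let key = url.to_lowercase();
    let (domain, path) = split_url(&key);
    if path.is_empty() || path == "/" {
        return Err(Rejection::Homepage);
    }
    if st.seen_urls.contains(&key) {
        return Err(Rejection::CrossPhaseDedup);
    }
    if st.history.contains(&key) {
        return Err(Rejection::History);
    }
    if st.source_counts.get(domain).copied().unwrap_or(0) >= st.max_per_source {
        return Err(Rejection::Diversity);
    }
    st.seen_urls.insert(key);
    Ok(())
}
```

Running the cheap in-memory checks (homepage, dedup) before the history lookup keeps database queries to URLs that could still be accepted.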

2d-A. Scrape + classify Brave results (batched)

Same batch loop as Phase 1b, using settings.batch_size:

Phase A: scrape batch in parallel, trace failures as source_type: "brave_search".
Phase B: classify/summarize in parallel (same LLM call + logging as Phase 1).
assign_category() used identically to Phase 1.
Source domain tracked in source_counts.
Early exit once the total article count reaches max_total.
Traces are batch-flushed after this loop.

Path B: LLM Web Search (`use_brave_search = false`)

2b-B. LLM web search pass

Check rate limit.
Build search prompt with theme, categories, gap counts ("find N articles for category_1, M for category_2").
Send search prompt to LLM (using model_websearch). LLM returns structured JSON: {category_0: [{title, url, summary}], category_1: [...]}
- LLM call logged with full prompt/response/timing.

2c-B. Filter LLM search results

Same filter_phase2_url() logic as Path A (homepage, cross-phase dedup, history, diversity). Accepted URLs are added to seen_urls. Traces are batch-flushed after this filter step.

2d-B. Scrape LLM search results (sequential)

For each accepted item: call scrape_single_article (SSRF check, 15s timeout, 5MB limit).
If scrape fails or article is too old/empty: trace as appropriate and skip.
Otherwise: keep the LLM-provided title and summary (no re-classification LLM call). Add to article_scraped, increment source_counts[domain].
Traces are batch-flushed after all items are processed.

Save + Record

Error if empty — if all scraped article lists are empty, return an error.
Order sections — user-defined categories first (in order), then "Autre" if non-empty.
Sanitize — strip \u0000 null bytes from JSON (PostgreSQL rejects them in JSONB).
Save synthesis — insert into syntheses table with job_id, week (ISO week), sections (JSONB), status: completed.
Record used articles — for each article in the final synthesis, build a trace with status: "used", synthesis_id, and the correct source_type (personalized_source, brave_search, or web_search inferred from url_source). Batch-insert into article_history.
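
The ordering and sanitizing steps can be sketched as below. Both helpers are hypothetical. The sketch assumes user categories are kept even when empty (the description only makes "Autre" conditional) and strips raw NUL characters from text before it is serialized to JSON.

```rust
use std::collections::HashMap;

/// Orders sections: user categories in their configured order, then
/// "Autre" only when it holds at least one article.
fn order_sections(
    user_categories: &[&str],
    mut scraped: HashMap<String, Vec<String>>,
) -> Vec<(String, Vec<String>)> {
    let mut sections: Vec<(String, Vec<String>)> = user_categories
        .iter()
        .map(|name| ((*name).to_string(), scraped.remove(*name).unwrap_or_default()))
        .collect();
    if let Some(other) = scraped.remove("Autre") {
        if !other.is_empty() {
            sections.push(("Autre".to_string(), other));
        }
    }
    sections
}

/// Removes NUL characters, which PostgreSQL rejects in JSONB text.
fn strip_nul(text: &str) -> String {
    text.replace('\u{0000}', "")
}
```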

Shared Helpers

build_trace_entry() — constructs an ArticleHistoryEntry from an ArticleTrace struct (replaces the old 11-positional-parameter trace_article function). Never writes to DB directly; caller accumulates in pending_traces.
assign_category() — validates LLM-returned category against the classification list, falls back to "Autre", drops article if "Autre" is also full.
filter_phase2_url() — async helper applying homepage/dedup/history/diversity filters for Phase 2 (both Brave and LLM paths).
scrape_single_article() — thin wrapper around scraper::scrape_url returning (body_text, page_title, final_url, drop_reason).
hash_article_url() — normalizes a URL (strips fragments, UTM params, trailing slashes, lowercases) then SHA-256 hashes it for history lookup.
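
The normalization half of hash_article_url() can be sketched with std only. `normalize_url` is a hypothetical name, and the SHA-256 step is left out because it needs an external crate. Treating any query parameter starting with `utm_` as a UTM parameter is an assumption.

```rust
/// Normalizes a URL for history lookups: lowercases it, drops the fragment,
/// removes utm_* query parameters, and strips trailing slashes from the path.
fn normalize_url(url: &str) -> String {
    let lowered = url.trim().to_lowercase();
    // Everything after '#' is the fragment.
    let no_fragment = lowered.split('#').next().unwrap_or("");
    let (base, query) = match no_fragment.split_once('?') {
        Some((b, q)) => (b, q),
        None => (no_fragment, ""),
    };
    let kept: Vec<&str> = query
        .split('&')
        .filter(|p| !p.is_empty() && !p.starts_with("utm_"))
        .collect();

    let mut out = base.trim_end_matches('/').to_string();
    if !kept.is_empty() {
        out.push('?');
        out.push_str(&kept.join("&"));
    }
    out
}
```

Two URLs that differ only in case, tracking parameters, fragment, or a trailing slash normalize to the same string, so they would hash to the same history key.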
