Synthesis Generation Pipeline — Full Algorithm

Startup & Background Tasks

  • Session cleanup: an hourly background task deletes expired DB sessions (db::sessions::delete_expired).
  • Job store TTL: expired job entries (older than 1 hour) are cleaned up via JobStore::cleanup_expired.

Generation Lifecycle

POST /api/v1/syntheses/generate creates a job in the JobStore, then spawns two nested tasks:

  • Inner task: wraps run_generation in a 15-minute tokio::time::timeout. If the timeout fires, sends an Error progress event and releases the user lock.
  • Outer task: monitors the inner task's JoinHandle for panics. If the inner task panics, sends an Error progress event and releases the user lock.

Progress is streamed to clients via a tokio::sync::watch channel (SSE endpoint subscribes to it).
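The two-layer guard described above (a time limit around the work, plus panic detection around the task) can be illustrated without an async runtime. The following is a minimal sketch that swaps tokio::time::timeout and the JoinHandle for a std thread and a channel; run_guarded and Outcome are hypothetical names, and the real pipeline would also send the Error progress event and release the user lock in the non-success branches.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// How a guarded job ended (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
pub enum Outcome {
    Completed,
    Failed(String),
    TimedOut,
    Panicked,
}

/// Runs `work` on a separate thread and waits at most `limit` for its result.
/// This is a std-only analogue of the timeout + panic monitoring pattern.
pub fn run_guarded<F>(work: F, limit: Duration) -> Outcome
where
    F: FnOnce() -> Result<(), String> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // If `work` panics, `tx` is dropped during unwinding without sending.
        let _ = tx.send(work());
    });
    match rx.recv_timeout(limit) {
        Ok(Ok(())) => Outcome::Completed,
        Ok(Err(msg)) => Outcome::Failed(msg),
        // No result within the limit: report an error and release the lock here.
        Err(RecvTimeoutError::Timeout) => Outcome::TimedOut,
        // Sender dropped without a message: the worker panicked.
        Err(RecvTimeoutError::Disconnected) => Outcome::Panicked,
    }
}
```

Unlike a tokio timeout, which drops the future, this std version does not cancel the worker thread when the limit expires; it only stops waiting for it.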


Initialization

  1. Load user settings from DB (categories, provider, models, max_items, batch_size, etc.)
  2. Cleanup — delete old article history entries (>N days, dropped only) + truncate old LLM call logs
  3. Validate: if no categories are configured, the only available category is "Autre".
  4. Load user sources (personalized URLs like https://openai.com/blog)
  5. Resolve LLM provider — decrypt user's API key, create provider instance (Arc<dyn LlmProvider>)
  6. Resolve models — research model + web-search model (user override or admin default)
  7. Setup rate limiter — per-user or global provider limiter
  8. Initialize tracking structures: article_scraped (category→articles), source_counts (per-domain article count), url_source (per-article source), filled_counts (per-category article count), seen_urls (cross-phase dedup), classification_categories (user categories + "Autre")
  9. Batch trace buffer: pending_traces (a Vec<ArticleHistoryEntry>) accumulates all article history writes; it is flushed with db::article_history::batch_insert_entries at phase boundaries (not per article).
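The accumulate-then-flush behaviour of the trace buffer can be sketched as follows. TraceEntry is a simplified stand-in for ArticleHistoryEntry (its fields are assumptions), and the insert_batch closure stands in for the database batch insert; none of these names come from the actual codebase.

```rust
/// Simplified stand-in for ArticleHistoryEntry (fields are assumptions).
#[derive(Debug, Clone, PartialEq)]
pub struct TraceEntry {
    pub url: String,
    pub status: String,
}

/// Accumulates trace entries in memory and hands them over in one batch.
#[derive(Default)]
pub struct TraceBuffer {
    pending: Vec<TraceEntry>,
}

impl TraceBuffer {
    /// Queues one entry; nothing is written until `flush` is called.
    pub fn push(&mut self, url: &str, status: &str) {
        self.pending.push(TraceEntry {
            url: url.to_string(),
            status: status.to_string(),
        });
    }

    /// Number of entries waiting to be flushed.
    pub fn len(&self) -> usize {
        self.pending.len()
    }

    /// Passes all pending entries to `insert_batch` (a stand-in for the DB
    /// batch insert) and clears the buffer. Skips the call when the buffer
    /// is empty. Returns the number of entries flushed.
    pub fn flush<F: FnMut(Vec<TraceEntry>)>(&mut self, mut insert_batch: F) -> usize {
        if self.pending.is_empty() {
            return 0;
        }
        let batch = std::mem::take(&mut self.pending);
        let count = batch.len();
        insert_batch(batch);
        count
    }
}
```

Calling flush once per phase boundary turns many single-row writes into one batch write, which is the point of buffering in the first place.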

Phase 1: Personalized Sources

Skipped entirely if user has 0 sources.

1a. Extract article links from source pages and filter against article history

  • Query article_history for the last source used. Reorder the personalized sources so that the first source is the one following the last source used (rolling window).
  • Fetch source pages with bounded concurrency of 5 (hardcoded max_concurrent = 5):
    • If use_llm_for_source_links enabled: send HTML <head> + first 8000 chars of <body> to LLM → extract all article URLs up to a maximum of 15, with the most recent first. If the LLM call fails, fall back to HTML parsing as described below.
      • LLM call logged with full prompt/response/timing
    • Otherwise: parse HTML <a href> links, filter by same-domain, non-homepage path, exclude /tag/, /login/, /contact/, /presentation/, /newsletter/, static assets, etc., and keep only the first 15 links found.
    • SSRF check performed on each source URL before fetching (rejects private/loopback IPs).
  • Deduplicate candidate URLs (case-insensitive, cross-source via seen_urls).
  • Filter against article history — hash each URL (normalized: lowercase, strip fragments/UTM params/trailing slashes), batch-query article_history → remove matches.
    • Trace dropped articles as status: filtered_history (flushed immediately after this filter step).
  • Shuffle remaining candidates to interleave articles from different sources.
  • Track url → source in url_source.
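Two of the steps above lend themselves to a short illustration: the rolling-window reordering of sources and the case-insensitive deduplication against seen_urls. This is a minimal sketch; the function names and the choice to leave the order untouched when the last-used source is unknown are assumptions.

```rust
use std::collections::HashSet;

/// Rotates `sources` so that the entry after `last_used` comes first.
/// Leaves the order unchanged if `last_used` is None or not in the list.
pub fn rotate_sources(sources: &mut [String], last_used: Option<&str>) {
    if let Some(last) = last_used {
        if let Some(pos) = sources.iter().position(|s| s.as_str() == last) {
            // `pos` exists, so the slice is non-empty and the modulo is safe.
            let start = (pos + 1) % sources.len();
            sources.rotate_left(start);
        }
    }
}

/// Keeps the first occurrence of each URL, comparing case-insensitively and
/// recording accepted URLs in `seen_urls` so later phases can reuse the set.
pub fn dedup_candidates(candidates: Vec<String>, seen_urls: &mut HashSet<String>) -> Vec<String> {
    candidates
        .into_iter()
        .filter(|url| seen_urls.insert(url.to_lowercase()))
        .collect()
}
```

With sources [a, b, c] and b as the last source used, the rotated order is [c, a, b], so each run starts one position further along the list.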

1b. Scrape, classify, and summarize articles (batched)

Processing happens in batches of settings.batch_size (minimum 1). For each batch:

Batch assembly: pull up to batch_size candidates from the iterator, skipping any where source_counts[domain] >= max_articles_per_source (traced as filtered_diversity).
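As a rough illustration of batch assembly, the sketch below pulls candidates from an iterator until the batch is full, setting aside those whose domain already hit the per-source cap. Candidate, next_batch, and the skipped list (standing in for filtered_diversity traces) are hypothetical names.

```rust
use std::collections::HashMap;

/// A candidate article URL with its source domain (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub url: String,
    pub domain: String,
}

/// Pulls up to `batch_size` candidates from `iter`, skipping (and recording
/// in `skipped`) any whose domain already reached `max_per_source`.
pub fn next_batch<I>(
    iter: &mut I,
    batch_size: usize,
    source_counts: &HashMap<String, usize>,
    max_per_source: usize,
    skipped: &mut Vec<Candidate>,
) -> Vec<Candidate>
where
    I: Iterator<Item = Candidate>,
{
    let batch_size = batch_size.max(1); // the batch size has a minimum of 1
    let mut batch = Vec::with_capacity(batch_size);
    while batch.len() < batch_size {
        match iter.next() {
            None => break, // no candidates left
            Some(candidate) => {
                let count = source_counts.get(&candidate.domain).copied().unwrap_or(0);
                if count >= max_per_source {
                    skipped.push(candidate); // would be traced as filtered_diversity
                } else {
                    batch.push(candidate);
                }
            }
        }
    }
    batch
}
```

Because the iterator is borrowed mutably rather than consumed, the caller can invoke next_batch repeatedly and each call resumes where the previous one stopped; an empty result signals that the candidates are exhausted.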

Phase A — Scrape batch in parallel (JoinSet):

  • SSRF check (no private IPs), 15s timeout, 5MB body limit.
  • HTML parsing heuristics for title (<title>, og:title), date (meta tags, JSON-LD, <time>), body (strip scripts/nav), soft-404 detection.
  • If article body is empty, is a soft-404, or is too old: trace as filtered_empty / filtered_too_old and skip.
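The address part of the SSRF check can be sketched with the standard library alone. The document only states that private and loopback IPs are rejected; the additional ranges below (link-local, unspecified, broadcast, IPv6 unique-local, IPv4-mapped IPv6) and the function names are assumptions about what such a check would plausibly cover.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// True for IPv4 addresses that should never be fetched.
fn is_forbidden_v4(ip: Ipv4Addr) -> bool {
    ip.is_private()
        || ip.is_loopback()
        || ip.is_link_local()
        || ip.is_unspecified()
        || ip.is_broadcast()
}

/// True for IPv6 addresses that should never be fetched.
fn is_forbidden_v6(ip: Ipv6Addr) -> bool {
    // IPv4-mapped addresses (::ffff:a.b.c.d) are judged by their IPv4 part.
    if let Some(v4) = ip.to_ipv4_mapped() {
        return is_forbidden_v4(v4);
    }
    let first = ip.segments()[0];
    ip.is_loopback()
        || ip.is_unspecified()
        || (first & 0xfe00) == 0xfc00 // unique local, fc00::/7
        || (first & 0xffc0) == 0xfe80 // link-local, fe80::/10
}

/// Returns true if a resolved address must be rejected before fetching.
pub fn is_forbidden_ip(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => is_forbidden_v4(v4),
        IpAddr::V6(v6) => is_forbidden_v6(v6),
    }
}
```

A check like this is only meaningful when applied to the addresses a hostname actually resolves to, not to the hostname string itself.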

Phase B — Classify/summarize batch in parallel (JoinSet):

  • Check rate limit before classifying (waits up to 60s, then errors).
  • Send article (title + first 500 chars of body) + categories + "Autre" to LLM. LLM returns {title, summary, category}. If the title was empty, the LLM also generates one.
    • LLM call logged with full prompt/response/timing.
  • assign_category() helper: validates category, falls back to "Autre" if category is unknown or full. If "Autre" is also full, drops the article.
  • Add the article to article_scraped, increment filled_counts, increment source_counts[domain].
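The fallback rule of assign_category() can be sketched as below. The signature is an assumption (the real helper may take different arguments), and category names are compared with exact, case-sensitive matching here, which the document does not specify.

```rust
use std::collections::HashMap;

/// Name of the catch-all category used by the pipeline.
pub const FALLBACK: &str = "Autre";

/// Returns the category an article should be stored under, or None if the
/// article must be dropped. `categories` is the classification list,
/// including "Autre".
pub fn assign_category(
    llm_category: &str,
    categories: &[String],
    filled_counts: &HashMap<String, usize>,
    max_items: usize,
) -> Option<String> {
    // A category has room while its count is below the per-category cap.
    let has_room = |name: &str| filled_counts.get(name).copied().unwrap_or(0) < max_items;
    let known = categories.iter().any(|c| c.as_str() == llm_category);

    if known && has_room(llm_category) {
        Some(llm_category.to_string())
    } else if has_room(FALLBACK) {
        // Unknown or full category: fall back to "Autre".
        Some(FALLBACK.to_string())
    } else {
        // "Autre" is full as well: drop the article.
        None
    }
}
```

The caller would increment filled_counts for the returned category; the helper itself only reads the counts.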

Early exit: after each batch, if total articles across all categories ≥ (num_categories + 1) × max_items_per_category, stop and move to Phase 2.

Trace flush: all pending traces accumulated during Phase 1 are batch-inserted into article_history after Phase 1 completes.


Phase 2: Web Search Fallback

Skipped if all user-defined categories are already filled to max_items_per_category.

2a. Compute category gaps

  • For each user category: needed = max_items_per_category - already_filled
  • Only proceed if any category needs more articles.
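The gap computation is simple enough to show directly. In this sketch, category_gaps is a hypothetical name; it returns only the categories that still need articles, so an empty result means Phase 2 can be skipped.

```rust
use std::collections::HashMap;

/// Returns (category, needed) pairs for every user category that is not yet
/// filled to `max_items`, in the order the categories were given.
pub fn category_gaps(
    categories: &[String],
    filled_counts: &HashMap<String, usize>,
    max_items: usize,
) -> Vec<(String, usize)> {
    categories
        .iter()
        .filter_map(|category| {
            let filled = filled_counts.get(category).copied().unwrap_or(0);
            // saturating_sub guards against counts above the cap.
            let needed = max_items.saturating_sub(filled);
            (needed > 0).then(|| (category.clone(), needed))
        })
        .collect()
}
```

For example, with max_items_per_category = 3 and filled counts of 3 for one category and 1 for another, only the second category appears in the result, with needed = 2.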

The path is selected by the settings.use_brave_search flag.


Path A: Brave Search (use_brave_search = true)

2b-A. Call Brave Search API

  • Resolve and decrypt the user's Brave Search API key (error if not configured).
  • Query: "{settings.theme} actualites", up to 20 results, filtered by max_age_days.

2c-A. Filter Brave results

Each result URL passes through filter_phase2_url():

  1. Homepage filter — drop URLs with path / or empty (filtered_homepage)
  2. Cross-phase dedup — drop URLs already in seen_urls (filtered_cross_phase_dedup)
  3. Article history — check hash in DB, drop if seen before (filtered_history)
  4. Source diversity — drop if source_counts[domain] >= max_articles_per_source (filtered_diversity)

Accepted URLs are added to seen_urls. All rejections are traced. Traces are batch-flushed after this filter step.
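The four checks can be sketched as a single function that returns the rejection reason. This is a synchronous simplification: the real helper is async and checks a URL hash in the database, whereas here history is an in-memory set of lowercased URLs, and host_and_path is a hypothetical, deliberately naive parser.

```rust
use std::collections::{HashMap, HashSet};

/// Splits "scheme://host/path?query" into (host, path). Naive parsing for
/// the sketch; a real implementation would use a URL parsing library.
fn host_and_path(url: &str) -> (String, String) {
    let rest = url.split_once("://").map_or(url, |(_, r)| r);
    let rest = rest.split(|c: char| c == '?' || c == '#').next().unwrap_or("");
    match rest.split_once('/') {
        Some((host, path)) => (host.to_lowercase(), format!("/{}", path)),
        None => (rest.to_lowercase(), String::new()),
    }
}

/// Applies the Phase 2 filters in order. Returns Err with the trace status
/// on rejection; on acceptance, records the URL in `seen_urls`.
pub fn filter_phase2_url(
    url: &str,
    seen_urls: &mut HashSet<String>,
    history: &HashSet<String>, // stand-in for the article_history lookup
    source_counts: &HashMap<String, usize>,
    max_per_source: usize,
) -> Result<(), &'static str> {
    let (host, path) = host_and_path(url);
    let key = url.to_lowercase();
    if path.is_empty() || path == "/" {
        return Err("filtered_homepage");
    }
    if seen_urls.contains(&key) {
        return Err("filtered_cross_phase_dedup");
    }
    if history.contains(&key) {
        return Err("filtered_history");
    }
    if source_counts.get(&host).copied().unwrap_or(0) >= max_per_source {
        return Err("filtered_diversity");
    }
    seen_urls.insert(key);
    Ok(())
}
```

Ordering the checks from cheapest to most expensive means the database lookup in the real helper is only reached for URLs that survive the in-memory filters.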

2d-A. Scrape + classify Brave results (batched)

Same batch loop as Phase 1b, using settings.batch_size:

  • Phase A: scrape batch in parallel, trace failures as source_type: "brave_search".
  • Phase B: classify/summarize in parallel (same LLM call + logging as Phase 1).
  • assign_category() used identically to Phase 1.
  • Source domain tracked in source_counts.
  • Early exit at max_total articles.
  • Traces are batch-flushed after this loop.

Path B: LLM Web Search (use_brave_search = false)

2b-B. LLM web search pass

  • Check rate limit.
  • Build search prompt with theme, categories, gap counts ("find N articles for category_1, M for category_2").
  • Send search prompt to LLM (using model_websearch). LLM returns structured JSON: {category_0: [{title, url, summary}], category_1: [...]}
    • LLM call logged with full prompt/response/timing.

2c-B. Filter LLM search results

Same filter_phase2_url() logic as Path A (homepage, cross-phase dedup, history, diversity). Accepted URLs are added to seen_urls. Traces are batch-flushed after this filter step.

2d-B. Scrape LLM search results (sequential)

  • For each accepted item: call scrape_single_article (SSRF check, 15s timeout, 5MB limit).
  • If scrape fails or article is too old/empty: trace as appropriate and skip.
  • Otherwise: keep the LLM-provided title and summary (no re-classification LLM call). Add to article_scraped, increment source_counts[domain].
  • Traces are batch-flushed after all items are processed.

Save + Record

  • Error if empty — if all scraped article lists are empty, return an error.
  • Order sections — user-defined categories first (in order), then "Autre" if non-empty.
  • Sanitize — strip \u0000 null bytes from JSON (PostgreSQL rejects them in JSONB).
  • Save synthesis — insert into syntheses table with job_id, week (ISO week), sections (JSONB), status: completed.
  • Record used articles — for each article in the final synthesis, build a trace with status: "used", synthesis_id, and the correct source_type (personalized_source, brave_search, or web_search inferred from url_source). Batch-insert into article_history.
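The ordering and sanitizing steps might look like the sketch below. Articles are reduced to plain strings, the function names are hypothetical, and two details are assumptions: user categories are emitted whenever they have an entry in the map, and sanitizing is shown as removing NUL characters from text rather than operating on serialized JSON.

```rust
use std::collections::HashMap;

/// Orders sections with user-defined categories first (in their configured
/// order), followed by "Autre" only if it contains at least one article.
pub fn order_sections(
    user_categories: &[String],
    mut scraped: HashMap<String, Vec<String>>,
) -> Vec<(String, Vec<String>)> {
    let mut ordered = Vec::new();
    for name in user_categories {
        if let Some(articles) = scraped.remove(name) {
            ordered.push((name.clone(), articles));
        }
    }
    if let Some(other) = scraped.remove("Autre") {
        if !other.is_empty() {
            ordered.push(("Autre".to_string(), other));
        }
    }
    ordered
}

/// Removes NUL characters, which PostgreSQL does not accept in JSONB text.
pub fn strip_nul(text: &str) -> String {
    text.chars().filter(|c| *c != '\0').collect()
}
```

Iterating over the configured category list, rather than over the map, is what makes the section order deterministic, since HashMap iteration order is unspecified.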

Shared Helpers

  • build_trace_entry() — constructs an ArticleHistoryEntry from an ArticleTrace struct (replaces the old 11-positional-parameter trace_article function). Never writes to DB directly; caller accumulates in pending_traces.
  • assign_category() — validates LLM-returned category against the classification list, falls back to "Autre", drops article if "Autre" is also full.
  • filter_phase2_url() — async helper applying homepage/dedup/history/diversity filters for Phase 2 (both Brave and LLM paths).
  • scrape_single_article() — thin wrapper around scraper::scrape_url returning (body_text, page_title, final_url, drop_reason).
  • hash_article_url() — normalizes a URL (strips fragments, UTM params, trailing slashes, lowercases) then SHA-256 hashes it for history lookup.
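The normalization half of hash_article_url() can be sketched with the standard library. The SHA-256 step is left out because it needs an external crate; normalize_article_url is a hypothetical name, and the exact order of the steps, as well as the choice to lowercase the whole URL including the query, are assumptions.

```rust
/// Normalizes a URL for history lookup: drops the fragment, removes utm_*
/// query parameters, strips trailing slashes from the path, and lowercases
/// the result. The output would then be hashed (not shown here).
pub fn normalize_article_url(url: &str) -> String {
    // 1. Drop the fragment.
    let without_fragment = url.split('#').next().unwrap_or("");

    // 2. Separate the base from the query and drop utm_* parameters.
    let (base, query) = match without_fragment.split_once('?') {
        Some((b, q)) => (b, q),
        None => (without_fragment, ""),
    };
    let kept: Vec<&str> = query
        .split('&')
        .filter(|param| !param.is_empty() && !param.to_lowercase().starts_with("utm_"))
        .collect();

    // 3. Strip trailing slashes from the base.
    let base = base.trim_end_matches('/');

    // 4. Reassemble and lowercase.
    let mut out = base.to_string();
    if !kept.is_empty() {
        out.push('?');
        out.push_str(&kept.join("&"));
    }
    out.to_lowercase()
}
```

Under these rules, https://Example.com/Blog/Post/?utm_source=x&id=3#top normalizes to https://example.com/blog/post?id=3, so tracking parameters and anchors no longer make an already-seen article look new.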