From 0b71215ddc9cd9d036fbecfb76dfea3319dd84cd Mon Sep 17 00:00:00 2001
From: oabrivard
Date: Thu, 26 Mar 2026 09:49:37 +0100
Subject: [PATCH] docs: update algorithm.md to reflect current pipeline state

Documents the Brave Search path in Phase 2, batch_size setting for
parallelism, batched article tracing with
build_trace_entry/batch_insert_entries, 15-minute pipeline timeout,
panic handling, session cleanup background task, SSRF checks on source
pages, and the max_links change from 10 to 15.

Co-Authored-By: Claude Sonnet 4.6
---
 docs/algorithm.md | 172 ++++++++++++++++++++++++++++++++--------------
 1 file changed, 119 insertions(+), 53 deletions(-)

diff --git a/docs/algorithm.md b/docs/algorithm.md
index 6829034..2926c75 100644
--- a/docs/algorithm.md
+++ b/docs/algorithm.md
@@ -1,16 +1,31 @@
 # Synthesis Generation Pipeline — Full Algorithm
 
+## Startup & Background Tasks
+
+- **Session cleanup**: an hourly background task deletes expired DB sessions (`db::sessions::delete_expired`).
+- **Job store TTL**: expired job entries (older than 1 hour) are cleaned up via `JobStore::cleanup_expired`.
+
+## Generation Lifecycle
+
+`POST /api/v1/syntheses/generate` creates a job in the `JobStore`, then spawns two nested tasks:
+- Inner task: wraps `run_generation` in a **15-minute `tokio::time::timeout`**. If the timeout fires, sends an `Error` progress event and releases the user lock.
+- Outer task: monitors the inner task's `JoinHandle` for panics. If the inner task panics, sends an `Error` progress event and releases the user lock.
+
+Progress is streamed to clients via a `tokio::sync::watch` channel (SSE endpoint subscribes to it).
+
+---
+
 ## Initialization
 
-1. **Load user settings** from DB (categories, provider, models, max_items, etc.)
+1. **Load user settings** from DB (categories, provider, models, max_items, batch_size, etc.)
 2. **Cleanup** — delete old article history entries (>N days, dropped only) + truncate old LLM call logs
-3. **Validate** — if no categories configured, there will just be the default category "Autre".
+3. **Validate** — if no categories configured, the only available category will be "Autre".
 4. **Load user sources** (personalized URLs like `https://openai.com/blog`)
 5. **Resolve LLM provider** — decrypt user's API key, create provider instance (`Arc`)
-6. **Resolve models** — research model + writing model (user override or admin default)
+6. **Resolve models** — research model + web-search model (user override or admin default)
 7. **Setup rate limiter** — per-user or global provider limiter
-8. **Prepare LLM scraping option** — if `use_llm_for_article_extraction` enabled, clone provider+model for concurrent use
-9. **Initialize tracking structures** — `article_scraped` (category→articles), `source_counts` (per-source article count), `url_soucre` (per-article source), `filled_counts` (per-category article count), `seen_urls` (cross-phase dedup), classification categories (user categories + "Autre")
+8. **Initialize tracking structures** — `article_scraped` (category→articles), `source_counts` (per-domain article count), `url_source` (per-article source), `filled_counts` (per-category article count), `seen_urls` (cross-phase dedup), `classification_categories` (user categories + "Autre")
+9. **Batch trace buffer** — `pending_traces: Vec` accumulates all article history writes; flushed with `db::article_history::batch_insert_entries` at phase boundaries (not per article).
 
 ---
 
@@ -20,34 +35,39 @@
 
 ### 1a. Extract article links from source pages and filter against article history
 
-- Query `article_history` for the last source used. Reorder the personalized source so that the first source is the one following the last source used (rolling window)
-- For each source, fetch the source page HTML:
-  - If `use_llm_for_source_links` enabled: send HTML `` + first 8000 chars of `` to LLM → extract all article URLs up to a maximum of 10, with the most recent first. If LLM call fails, fall back to HTML parsing as described below.
+- Query `article_history` for the last source used. Reorder the personalized sources so that the first source is the one following the last source used (rolling window).
+- Fetch source pages with bounded concurrency of **5** (hardcoded `max_concurrent = 5`):
+  - If `use_llm_for_source_links` enabled: send HTML `` + first 8000 chars of `` to LLM → extract all article URLs up to a maximum of **15**, with the most recent first. If the LLM call fails, fall back to HTML parsing as described below.
     - **LLM call logged** with full prompt/response/timing
-  - Otherwise: parse HTML `` links, filter by same-domain, non-homepage path, exclude `/tag/`, `/login/`, `/contact/`,`/presentation/`,`/newsletter/`, static assets, etc. and keep only the first 10 links found
-  - Deduplicate candidate URLs
-  - Hash each URL (normalized: lowercase, strip fragments/UTM params/trailing slashes)
-  - Query `article_history` for existing hashes → remove matches
-  - Trace dropped articles with `status: filtered_history`
-  - Add the url to `url_soucre`
-
-### 1b. Scrape, classify and summarize articles
-
-- For each url from step 1a:
-  - if the number of articles in `source_counts` for the source of the current url exceeds `max_articles_per_source`:
-    - Trace dropped article with `status: filtered_diversity`
-    - Move to next url
-  - Fetch each URL (bounded concurrency: 5 with LLM extraction, 10 without).
-    - SSRF check (no private IPs), 15s timeout, 5MB body limit.
-    - HTML parsing heuristics for title (``, `og:title`), date (meta tags, JSON-LD, `<time>`), body (strip scripts/nav), soft-404 detection
-  - If article scraped body text is empty (scrape failure, soft 404, too old):
-    - Trace dropped articles in `article_history` with `status: filtered_empty`
-    - Move to next url
-  - Send article (title + first 500 chars of body) + categories + "Autre" to LLM. LLM returns `{title, summary, category}` mapping the article to a category. The LLM generates the summary and a also a title if the provided title is empty
-    - **LLM call logged** with full prompt/response/timing
-  - Add the article to `article_scraped` and increase `filled_counts`
-  - if number of articles in the category of this artcile exceeds `max_items_per_category`: change the article catgeory to "Autre"
-  - If the total number of articles in `article_scraped` exceeds `number of categories (including Autre) × max_items_per_category` then exit for loop and move to synthesis generation
+  - Otherwise: parse HTML `<a href>` links, filter by same-domain, non-homepage path, exclude `/tag/`, `/login/`, `/contact/`, `/presentation/`, `/newsletter/`, static assets, etc., and keep only the first **15** links found.
+  - **SSRF check** performed on each source URL before fetching (rejects private/loopback IPs).
+- Deduplicate candidate URLs (case-insensitive, cross-source via `seen_urls`).
+- **Filter against article history** — hash each URL (normalized: lowercase, strip fragments/UTM params/trailing slashes), batch-query `article_history` → remove matches.
+  - Trace dropped articles as `status: filtered_history` (flushed immediately after this filter step).
+- **Shuffle** remaining candidates to interleave articles from different sources.
+- Track url → source in `url_source`.
+
+### 1b. Scrape, classify, and summarize articles (batched)
+
+Processing happens in batches of `settings.batch_size` (minimum 1). For each batch:
+
+**Batch assembly**: pull up to `batch_size` candidates from the iterator, skipping any where `source_counts[domain] >= max_articles_per_source` (traced as `filtered_diversity`).
+
+**Phase A — Scrape batch in parallel** (`JoinSet`):
+- SSRF check (no private IPs), 15s timeout, 5MB body limit.
+- HTML parsing heuristics for title (`<title>`, `og:title`), date (meta tags, JSON-LD, `<time>`), body (strip scripts/nav), soft-404 detection.
+- If article body is empty, is a soft-404, or is too old: trace as `filtered_empty` / `filtered_too_old` and skip.
+
+**Phase B — Classify/summarize batch in parallel** (`JoinSet`):
+- Check rate limit before classifying (waits up to 60s, then errors).
+- Send article (title + first 500 chars of body) + categories + "Autre" to LLM. LLM returns `{title, summary, category}`. If the title was empty, the LLM also generates one.
+  - **LLM call logged** with full prompt/response/timing.
+- **`assign_category()`** helper: validates category, falls back to "Autre" if category is unknown or full. If "Autre" is also full, drops the article.
+- Add the article to `article_scraped`, increment `filled_counts`, increment `source_counts[domain]`.
+
+**Early exit**: after each batch, if total articles across all categories ≥ `(num_categories + 1) × max_items_per_category`, stop and move to Phase 2.
+
+**Trace flush**: all pending traces accumulated during Phase 1 are batch-inserted into `article_history` after Phase 1 completes.
 
 ---
 
@@ -57,34 +77,80 @@
 
 ### 2a. Compute category gaps
 
-- For each user category: `needed = max - already_filled`
-- Only proceed if any category needs more
+- For each user category: `needed = max_items_per_category - already_filled`
+- Only proceed if any category needs more articles.
+
+### 2b. Choose path: Brave Search or LLM web search
 
-### 2b. LLM web search pass
+The path is selected by the `settings.use_brave_search` flag.
+
+---
 
-- Build search prompt with theme, categories, gap counts ("find N articles for category_1, M for category_2")
-- Send search prompt to LLM. LLM returns structured JSON: `{category_0: [{title, url, summary}], category_1: [...]}`
-  - **LLM call logged** with full prompt/response/timing
-- **Filter homepage URLs** — drop articles with path `/` or empty
-- **Cross-phase dedup** — drop URLs already seen in Phase 1
-- **Dedup by URL** — drop duplicate URLs within Phase 2 (case-insensitive)
-- **Limit articles per source** — enforce `max_articles_per_source` per domain (spread across categories first, then fill)
-- **Filter against article history** — BEFORE scraping (saves HTTP requests), drop already-seen URLs
-- Each drop traced in `article_history` with appropriate status
+### Path A: Brave Search (`use_brave_search = true`)
 
-### 2c. Scrape web search results
+#### 2b-A. Call Brave Search API
 
-- Same scraping as Phase 1 (bounded concurrency, SSRF check, optional LLM extraction)
-- Filter empty content (scrape failures, soft 404, too old)
-- Trace drops
-- Merge results into `all_scraped`
-- Move to synthesis generation
+- Resolve and decrypt the user's Brave Search API key (error if not configured).
+- Query: `"{settings.theme} actualites"`, up to 20 results, filtered by `max_age_days`.
+
+#### 2c-A. Filter Brave results
+
+Each result URL passes through **`filter_phase2_url()`**:
+1. **Homepage filter** — drop URLs with path `/` or empty (`filtered_homepage`)
+2. **Cross-phase dedup** — drop URLs already in `seen_urls` (`filtered_cross_phase_dedup`)
+3. **Article history** — check hash in DB, drop if seen before (`filtered_history`)
+4. **Source diversity** — drop if `source_counts[domain] >= max_articles_per_source` (`filtered_diversity`)
+
+Accepted URLs are added to `seen_urls`. All rejections are traced. Traces are **batch-flushed** after this filter step.
+
+#### 2d-A. Scrape + classify Brave results (batched)
+
+Same batch loop as Phase 1b, using `settings.batch_size`:
+- **Phase A**: scrape batch in parallel, trace failures as `source_type: "brave_search"`.
+- **Phase B**: classify/summarize in parallel (same LLM call + logging as Phase 1).
+- **`assign_category()`** used identically to Phase 1.
+- Source domain tracked in `source_counts`.
+- **Early exit** at `max_total` articles.
+- Traces are batch-flushed after this loop.
+
+---
+
+### Path B: LLM Web Search (`use_brave_search = false`)
+
+#### 2b-B. LLM web search pass
+
+- Check rate limit.
+- Build search prompt with theme, categories, gap counts ("find N articles for category_1, M for category_2").
+- Send search prompt to LLM (using `model_websearch`). LLM returns structured JSON: `{category_0: [{title, url, summary}], category_1: [...]}`
+  - **LLM call logged** with full prompt/response/timing.
+
+#### 2c-B. Filter LLM search results
+
+Same **`filter_phase2_url()`** logic as Path A (homepage, cross-phase dedup, history, diversity). Accepted URLs are added to `seen_urls`. Traces are **batch-flushed** after this filter step.
+
+#### 2d-B. Scrape LLM search results (sequential)
+
+- For each accepted item: call `scrape_single_article` (SSRF check, 15s timeout, 5MB limit).
+- If scrape fails or article is too old/empty: trace as appropriate and skip.
+- Otherwise: keep the LLM-provided title and summary (no re-classification LLM call). Add to `article_scraped`, increment `source_counts[domain]`.
+- Traces are batch-flushed after all items are processed.
 
 ---
 
 ## Save + Record
 
-- **Sanitize** — strip `\u0000` null bytes from JSON (PostgreSQL rejects them in JSONB)
-- **Save synthesis** — insert into `syntheses` table with `job_id`, `week` (ISO week), `sections` (JSONB), `status: completed`
-- **Record used articles** — insert each article URL into `article_history` with `status: used`, `synthesis_id`, `job_id`, and category name (for future dedup + provenance)
+- **Error if empty** — if all scraped article lists are empty, return an error.
+- **Order sections** — user-defined categories first (in order), then "Autre" if non-empty.
+- **Sanitize** — strip `\u0000` null bytes from JSON (PostgreSQL rejects them in JSONB).
+- **Save synthesis** — insert into `syntheses` table with `job_id`, `week` (ISO week), `sections` (JSONB), `status: completed`.
+- **Record used articles** — for each article in the final synthesis, build a trace with `status: "used"`, `synthesis_id`, and the correct `source_type` (`personalized_source`, `brave_search`, or `web_search` inferred from `url_source`). Batch-insert into `article_history`.
+
+---
+
+## Shared Helpers
+- **`build_trace_entry()`** — constructs an `ArticleHistoryEntry` from an `ArticleTrace` struct (replaces the old 11-positional-parameter `trace_article` function). Never writes to DB directly; caller accumulates in `pending_traces`.
+- **`assign_category()`** — validates LLM-returned category against the classification list, falls back to "Autre", drops article if "Autre" is also full.
+- **`filter_phase2_url()`** — async helper applying homepage/dedup/history/diversity filters for Phase 2 (both Brave and LLM paths).
+- **`scrape_single_article()`** — thin wrapper around `scraper::scrape_url` returning `(body_text, page_title, final_url, drop_reason)`.
+- **`hash_article_url()`** — normalizes a URL (strips fragments, UTM params, trailing slashes, lowercases) then SHA-256 hashes it for history lookup.
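
Reviewer note: the fallback rule documented for `assign_category()` (keep the LLM's category if it is known and has room, otherwise fall back to "Autre", otherwise drop) can be sketched roughly as below. This is only an illustration of the rule as the patch describes it. The signature, the parameter names, and the use of `Option<String>` to signal a drop are assumptions, not the project's actual code.

```rust
use std::collections::HashMap;

/// Name of the catch-all category, as used throughout the document.
const FALLBACK: &str = "Autre";

/// Hypothetical sketch of the documented rule (signature is an assumption).
/// Returns the category the article should be filed under, or `None` if the
/// article must be dropped because both the proposed category and the
/// fallback are unavailable.
fn assign_category(
    proposed: &str,
    classification_categories: &[String],
    filled_counts: &HashMap<String, usize>,
    max_items_per_category: usize,
) -> Option<String> {
    // A category is "full" once it already holds the per-category maximum.
    let is_full = |name: &str| {
        filled_counts.get(name).copied().unwrap_or(0) >= max_items_per_category
    };

    // Keep the LLM's answer only if it is a known category with room left.
    let known = classification_categories.iter().any(|c| c == proposed);
    if known && !is_full(proposed) {
        return Some(proposed.to_string());
    }

    // Unknown or full: fall back to "Autre" while it still has room.
    if !is_full(FALLBACK) {
        return Some(FALLBACK.to_string());
    }

    // "Autre" is full as well: the caller drops the article.
    None
}
```

Under these assumptions the caller would increment `filled_counts` for the returned category, and treat `None` as a drop to be traced.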