From 7835725fe8602ce260dd6662b3d92ccaf9d6e907 Mon Sep 17 00:00:00 2001
From: oabrivard <olivier@abrivard.fr>
Date: Fri, 27 Mar 2026 15:07:29 +0100
Subject: [PATCH] docs: merge algorithm.md into technical_specs.md Section 5

The full pipeline algorithm is now in technical_specs.md with added
details: preferred source ordering, windowed waves, is_article filter,
date fallback, "Articles sans date" category, cancellation support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 docs/algorithm.md       | 156 ----------------------------------------
 docs/technical_specs.md | 141 +++++++++++++++++++++++-------------
 2 files changed, 93 insertions(+), 204 deletions(-)
 delete mode 100644 docs/algorithm.md
diff --git a/docs/algorithm.md b/docs/algorithm.md
deleted file mode 100644
index 2926c75..0000000
--- a/docs/algorithm.md
+++ /dev/null
@@ -1,156 +0,0 @@
-# Synthesis Generation Pipeline — Full Algorithm
-
-## Startup & Background Tasks
-
-- **Session cleanup**: an hourly background task deletes expired DB sessions (`db::sessions::delete_expired`).
-- **Job store TTL**: expired job entries (older than 1 hour) are cleaned up via `JobStore::cleanup_expired`.
-
-## Generation Lifecycle
-
-`POST /api/v1/syntheses/generate` creates a job in the `JobStore`, then spawns two nested tasks:
-- Inner task: wraps `run_generation` in a **15-minute `tokio::time::timeout`**. If the timeout fires, sends an `Error` progress event and releases the user lock.
-- Outer task: monitors the inner task's `JoinHandle` for panics. If the inner task panics, sends an `Error` progress event and releases the user lock.
-
-Progress is streamed to clients via a `tokio::sync::watch` channel (SSE endpoint subscribes to it).
-
----
-
-## Initialization
-
-1. **Load user settings** from DB (categories, provider, models, max_items, batch_size, etc.)
-2. **Cleanup** — delete old article history entries (>N days, dropped only) + truncate old LLM call logs
-3. **Validate** — if no categories configured, the only available category will be "Autre".
-4. **Load user sources** (personalized URLs like `https://openai.com/blog`)
-5. **Resolve LLM provider** — decrypt user's API key, create provider instance (`Arc<dyn LlmProvider>`)
-6. **Resolve models** — research model + web-search model (user override or admin default)
-7. **Setup rate limiter** — per-user or global provider limiter
-8. **Initialize tracking structures** — `article_scraped` (category→articles), `source_counts` (per-domain article count), `url_source` (per-article source), `filled_counts` (per-category article count), `seen_urls` (cross-phase dedup), `classification_categories` (user categories + "Autre")
-9. **Batch trace buffer** — `pending_traces: Vec<ArticleHistoryEntry>` accumulates all article history writes; flushed with `db::article_history::batch_insert_entries` at phase boundaries (not per article).
-
----
-
-## Phase 1: Personalized Sources
-
-**Skipped entirely if user has 0 sources.**
-
-### 1a. Extract article links from source pages and filter against article history
-
-- Query `article_history` for the last source used. Reorder the personalized sources so that the first source is the one following the last source used (rolling window).
-- Fetch source pages with bounded concurrency of **5** (hardcoded `max_concurrent = 5`):
-  - If `use_llm_for_source_links` enabled: send HTML `<head>` + first 8000 chars of `<body>` to LLM → extract all article URLs up to a maximum of **15**, with the most recent first. If the LLM call fails, fall back to HTML parsing as described below.
-    - **LLM call logged** with full prompt/response/timing
-  - Otherwise: parse HTML `<a href>` links, filter by same-domain, non-homepage path, exclude `/tag/`, `/login/`, `/contact/`, `/presentation/`, `/newsletter/`, static assets, etc., and keep only the first **15** links found.
-  - **SSRF check** performed on each source URL before fetching (rejects private/loopback IPs).
-- Deduplicate candidate URLs (case-insensitive, cross-source via `seen_urls`).
-- **Filter against article history** — hash each URL (normalized: lowercase, strip fragments/UTM params/trailing slashes), batch-query `article_history` → remove matches.
-  - Trace dropped articles as `status: filtered_history` (flushed immediately after this filter step).
-- **Shuffle** remaining candidates to interleave articles from different sources.
-- Track url → source in `url_source`.
-
-### 1b. Scrape, classify, and summarize articles (batched)
-
-Processing happens in batches of `settings.batch_size` (minimum 1). For each batch:
-
-**Batch assembly**: pull up to `batch_size` candidates from the iterator, skipping any where `source_counts[domain] >= max_articles_per_source` (traced as `filtered_diversity`).
-
-**Phase A — Scrape batch in parallel** (`JoinSet`):
-- SSRF check (no private IPs), 15s timeout, 5MB body limit.
-- HTML parsing heuristics for title (`<title>`, `og:title`), date (meta tags, JSON-LD, `<time>`), body (strip scripts/nav), soft-404 detection.
-- If article body is empty, is a soft-404, or is too old: trace as `filtered_empty` / `filtered_too_old` and skip.
-
-**Phase B — Classify/summarize batch in parallel** (`JoinSet`):
-- Check rate limit before classifying (waits up to 60s, then errors).
-- Send article (title + first 500 chars of body) + categories + "Autre" to LLM. LLM returns `{title, summary, category}`. If the title was empty, the LLM also generates one.
-  - **LLM call logged** with full prompt/response/timing.
-- **`assign_category()`** helper: validates category, falls back to "Autre" if category is unknown or full. If "Autre" is also full, drops the article.
-- Add the article to `article_scraped`, increment `filled_counts`, increment `source_counts[domain]`.
-
-**Early exit**: after each batch, if total articles across all categories ≥ `(num_categories + 1) × max_items_per_category`, stop and move to Phase 2.
-
-**Trace flush**: all pending traces accumulated during Phase 1 are batch-inserted into `article_history` after Phase 1 completes.
-
----
-
-## Phase 2: Web Search Fallback
-
-**Skipped if all user-defined categories are already filled to `max_items_per_category`.**
-
-### 2a. Compute category gaps
-
-- For each user category: `needed = max_items_per_category - already_filled`
-- Only proceed if any category needs more articles.
-
-### 2b. Choose path: Brave Search or LLM web search
-
-The path is selected by the `settings.use_brave_search` flag.
-
----
-
-### Path A: Brave Search (`use_brave_search = true`)
-
-#### 2b-A. Call Brave Search API
-
-- Resolve and decrypt the user's Brave Search API key (error if not configured).
-- Query: `"{settings.theme} actualites"`, up to 20 results, filtered by `max_age_days`.
-
-#### 2c-A. Filter Brave results
-
-Each result URL passes through **`filter_phase2_url()`**:
-1. **Homepage filter** — drop URLs with path `/` or empty (`filtered_homepage`)
-2. **Cross-phase dedup** — drop URLs already in `seen_urls` (`filtered_cross_phase_dedup`)
-3. **Article history** — check hash in DB, drop if seen before (`filtered_history`)
-4. **Source diversity** — drop if `source_counts[domain] >= max_articles_per_source` (`filtered_diversity`)
-
-Accepted URLs are added to `seen_urls`. All rejections are traced. Traces are **batch-flushed** after this filter step.
-
-#### 2d-A. Scrape + classify Brave results (batched)
-
-Same batch loop as Phase 1b, using `settings.batch_size`:
-- **Phase A**: scrape batch in parallel, trace failures as `source_type: "brave_search"`.
-- **Phase B**: classify/summarize in parallel (same LLM call + logging as Phase 1).
-- **`assign_category()`** used identically to Phase 1.
-- Source domain tracked in `source_counts`.
-- **Early exit** at `max_total` articles.
-- Traces are batch-flushed after this loop.
-
----
-
-### Path B: LLM Web Search (`use_brave_search = false`)
-
-#### 2b-B. LLM web search pass
-
-- Check rate limit.
-- Build search prompt with theme, categories, gap counts ("find N articles for category_1, M for category_2").
-- Send search prompt to LLM (using `model_websearch`). LLM returns structured JSON: `{category_0: [{title, url, summary}], category_1: [...]}`
-  - **LLM call logged** with full prompt/response/timing.
-
-#### 2c-B. Filter LLM search results
-
-Same **`filter_phase2_url()`** logic as Path A (homepage, cross-phase dedup, history, diversity). Accepted URLs are added to `seen_urls`. Traces are **batch-flushed** after this filter step.
-
-#### 2d-B. Scrape LLM search results (sequential)
-
-- For each accepted item: call `scrape_single_article` (SSRF check, 15s timeout, 5MB limit).
-- If scrape fails or article is too old/empty: trace as appropriate and skip.
-- Otherwise: keep the LLM-provided title and summary (no re-classification LLM call). Add to `article_scraped`, increment `source_counts[domain]`.
-- Traces are batch-flushed after all items are processed.
-
----
-
-## Save + Record
-
-- **Error if empty** — if all scraped article lists are empty, return an error.
-- **Order sections** — user-defined categories first (in order), then "Autre" if non-empty.
-- **Sanitize** — strip `\u0000` null bytes from JSON (PostgreSQL rejects them in JSONB).
-- **Save synthesis** — insert into `syntheses` table with `job_id`, `week` (ISO week), `sections` (JSONB), `status: completed`.
-- **Record used articles** — for each article in the final synthesis, build a trace with `status: "used"`, `synthesis_id`, and the correct `source_type` (`personalized_source`, `brave_search`, or `web_search` inferred from `url_source`). Batch-insert into `article_history`.
-
----
-
-## Shared Helpers
-
-- **`build_trace_entry()`** — constructs an `ArticleHistoryEntry` from an `ArticleTrace` struct (replaces the old 11-positional-parameter `trace_article` function). Never writes to DB directly; caller accumulates in `pending_traces`.
-- **`assign_category()`** — validates LLM-returned category against the classification list, falls back to "Autre", drops article if "Autre" is also full.
-- **`filter_phase2_url()`** — async helper applying homepage/dedup/history/diversity filters for Phase 2 (both Brave and LLM paths).
-- **`scrape_single_article()`** — thin wrapper around `scraper::scrape_url` returning `(body_text, page_title, final_url, drop_reason)`.
-- **`hash_article_url()`** — normalizes a URL (strips fragments, UTM params, trailing slashes, lowercases) then SHA-256 hashes it for history lookup.
diff --git a/docs/technical_specs.md b/docs/technical_specs.md
index c4ee605..1eec7e0 100644
--- a/docs/technical_specs.md
+++ b/docs/technical_specs.md
@@ -602,76 +602,121 @@ All admin endpoints require `AdminUser` extractor (role = admin).
 
 ---
 
-## 5. Generation Pipeline Technical Flow
+## 5. Generation Pipeline — Full Algorithm
 
-### Overview
+### Startup & Background Tasks
 
-The pipeline runs as a background tokio task spawned by `POST /syntheses/generate`. It has a 15-minute global timeout and supports cooperative cancellation via `AtomicBool`.
+- **Session cleanup**: an hourly background task deletes expired DB sessions (`db::sessions::delete_expired`).
+- **Job store TTL**: expired job entries (older than 1 hour) are cleaned up via `JobStore::cleanup_expired`.
+
+### Generation Lifecycle
+
+`POST /api/v1/syntheses/generate` creates a job in the `JobStore`, then spawns two nested tasks:
+- Inner task: wraps `run_generation` in a **15-minute `tokio::time::timeout`**. If the timeout fires, sends an `Error` progress event and releases the user lock.
+- Outer task: monitors the inner task's `JoinHandle` for panics. If the inner task panics, sends an `Error` progress event and releases the user lock.
+
+Progress is streamed to clients via a `tokio::sync::watch` channel (SSE endpoint subscribes to it).
 
 ### Initialization
 
-1. Load `UserSettings` from DB (or create defaults)
-2. Cleanup old article history (entries older than `article_history_days` with dropped status) and truncate old LLM call logs
-3. Load the target `Theme` (categories, max_items, max_age_days, summary_length)
-4. Load user `Sources` for the theme
-5. Decrypt user's LLM API key, create `Arc<dyn LlmProvider>` via factory
-6. Resolve models: `ai_model` (for scraping/classification) and `ai_model_websearch` (for web search); user override or admin default fallback
-7. Initialize per-user rate limiter (from settings or admin defaults)
-8. Initialize tracking structures: `article_scraped` (category -> Vec<NewsItem>), `source_counts`, `url_source`, `filled_counts`, `seen_urls`, `pending_traces`
+1. **Load user settings** from DB (provider, models, batch_size, rate limits, etc.)
+2. **Cleanup** — delete old article history entries (>N days, dropped only) + truncate old LLM call logs
+3. **Validate** — if no categories configured, the only available category will be "Divers".
+4. **Load theme** — categories, max_items_per_category, max_age_days, summary_length
+5. **Load user sources** (personalized URLs filtered by theme_id)
+6. **Resolve LLM provider** — decrypt user's API key, create provider instance (`Arc<dyn LlmProvider>`)
+7. **Resolve models** — research model + web-search model (user override or admin default)
+8. **Setup rate limiter** — per-user or global provider limiter
+9. **Initialize tracking structures** — `article_scraped` (category→articles), `source_counts` (per-domain article count), `url_source` (per-article source), `filled_counts` (per-category article count), `seen_urls` (cross-phase dedup), `classification_categories` (user categories + "Divers")
+10. **Batch trace buffer** — `pending_traces: Vec<ArticleHistoryEntry>` accumulates all article history writes; flushed with `db::article_history::batch_insert_entries` at phase boundaries.
 
 ### Phase 1: Personalized Sources
 
-Skipped if user has 0 sources for the theme.
+**Skipped entirely if user has 0 sources.**
+
+#### 1a. Windowed source extraction
+
+- Query `article_history` for the last source used. Reorder sources so the first source follows the last one used (rolling window).
+- Separate preferred sources (processed first) from non-preferred, preserving rotation order within each group.
+- Process sources in waves of `source_extraction_window` size:
+  - For each source in the wave: fetch page HTML, extract up to `max_links_per_source` article URLs via HTML parsing (same-domain, non-homepage, no static assets).
+  - **SSRF check** performed on each source URL before fetching.
+  - Deduplicate candidate URLs (case-insensitive, cross-source via `seen_urls`).
+  - **Filter against article history** — hash each URL (normalized: lowercase, strip fragments/UTM params/trailing slashes), batch-query `article_history` → remove matches. Trace dropped articles as `status: filtered_history`.
+  - **Preferred-first shuffle** — shuffle preferred URLs separately from non-preferred, then concatenate (preferred first).
+  - Track url → source in `url_source`.
+
+#### 1b. Scrape, classify, and summarize articles (batched)
+
+Processing happens in batches of `settings.batch_size` (minimum 1). For each batch:
 
-**1a. Windowed source extraction**
+**Batch assembly**: Pull up to `batch_size` candidates, skipping any where `source_counts[domain] >= max_articles_per_source` (traced as `filtered_diversity`).
 
-- Query article_history for the last source used; reorder sources in a rolling window starting after that source
-- Select up to `source_extraction_window` sources per generation
-- For each source (bounded concurrency of 5): fetch page HTML, extract up to `max_links_per_source` article URLs via HTML parsing (same-domain, non-homepage, no static assets)
-- Deduplicate URLs cross-source via `seen_urls`
-- Batch-check `article_history` for already-seen URL hashes; filter matches (traced as `filtered_history`)
-- Shuffle remaining candidates to interleave sources
-- Track url -> source in `url_source`
+**Phase A — Scrape batch in parallel** (`JoinSet`):
+- SSRF check (no private IPs), 15s timeout, 5MB body limit.
+- HTML parsing for title (`<title>`, `og:title`), date (meta tags, JSON-LD, `<time>`), body (strip scripts/nav), soft-404 detection.
+- If article body is empty, is a soft-404, or is too old: trace as `filtered_empty` / `filtered_too_old` and skip.
 
-**1b. Batch scrape + classify**
+**Phase B — Classify/summarize batch in parallel** (`JoinSet`):
+- Check rate limit before classifying (waits up to 60s, then errors).
+- Send article (title + body snippet based on `summary_length`: 500/2000/4000 chars) + categories + "Divers" to LLM.
+- LLM returns `{title, summary, category, date, is_article}`.
+- **`is_article` check**: if false, trace as `filtered_not_article` and skip.
+- **Date fallback**: if the LLM returned a date and the article is older than `max_age_days`, trace as `filtered_too_old` and skip.
+- **No-date routing**: if neither the scraper nor the LLM found a date, route the article to the "Articles sans date" category.
+- **`assign_category()`** helper: validates category, falls back to "Divers" if unknown or full. If "Divers" is also full, drops the article.
+- **LLM call logged** with full prompt/response/timing.
+- Add article to `article_scraped`, increment `filled_counts` and `source_counts`.
 
-Processing in batches of `settings.batch_size`:
+**Early exit**: After each batch, if total articles ≥ `(num_categories + 1) × max_items_per_category`, stop.
 
-- **Batch assembly**: Pull up to batch_size candidates, skip if `source_counts[domain] >= max_articles_per_source` (traced as `filtered_diversity`)
-- **Scrape** (JoinSet, parallel): SSRF check, 15s timeout, 5MB limit, HTML parsing, title/date/body extraction, soft-404 detection. Skip empty/too-old articles.
-- **Classify** (JoinSet, parallel): Rate limit check (60s wait), send title + first 500 chars to LLM with categories list. LLM returns `{title, summary, category}`. Validate category via `assign_category()` (fallback to "Autre", drop if full).
-- **LLM call logging**: Every LLM call is logged with full prompt, response, timing, and article URL.
-- **Early exit**: Stop when total articles >= `(num_categories + 1) * max_items_per_category`.
-- Batch-flush pending traces to `article_history`.
+**Wave check**: After each wave, if the synthesis is full, skip the remaining waves.
+
+**Trace flush**: Pending traces are batch-inserted into `article_history` between waves.
 
 ### Phase 2: Web Search Fallback
 
-Skipped if all categories are filled to `max_items_per_category`.
+**Skipped if all user-defined categories are already filled.**
+
+#### 2a. Compute category gaps
+
+For each user category: `needed = max_items_per_category - already_filled`. Only proceed if any category needs more.
+
+#### 2b. Choose path: Brave Search or LLM web search
+
+Selected by `settings.use_brave_search`.
+
+#### Path A: Brave Search (`use_brave_search = true`)
+
+1. Resolve and decrypt the user's Brave Search API key (error if not configured).
+2. Query: `"{theme} actualites"`, up to 20 results, freshness mapped from `max_age_days` (pd/pw/pm/py).
+3. Filter results through **`filter_phase2_url()`**: homepage filter → cross-phase dedup → article history → source diversity.
+4. Batch scrape + classify (same as Phase 1b, `source_type = "brave_search"`).
 
-**2a. Compute gaps**: For each category, `needed = max_items - filled`.
+#### Path B: LLM Web Search (`use_brave_search = false`)
 
-**2b. Path selection** based on `settings.use_brave_search`:
+1. Build search prompt with theme, categories, gap counts.
+2. Call LLM with `model_websearch`. Returns `{category_0: [{title, url, summary}], ...}`.
+3. Filter URLs through **`filter_phase2_url()`**.
+4. Scrape each result sequentially. Keep LLM-provided title/summary (no re-classification).
+5. Record results with `source_type = "web_search"`.
 
-**Path A -- Brave Search** (`use_brave_search = true`):
-- Decrypt user's Brave Search API key
-- Query: `"{theme} actualites"`, up to 20 results, freshness mapped from `max_age_days` (pd/pw/pm/py)
-- Filter results through `filter_phase2_url()`: homepage filter, cross-phase dedup, article history check, source diversity check
-- Batch scrape + classify (same logic as Phase 1b, source_type = "brave_search")
+### Save + Record
 
-**Path B -- LLM Web Search** (`use_brave_search = false`):
-- Build search prompt with theme, categories, and gap counts
-- Call LLM with `ai_model_websearch` model; returns structured JSON: `{category_0: [{title, url, summary}], ...}`
-- Filter URLs through `filter_phase2_url()`
-- Scrape each result sequentially to validate; keep LLM-provided title/summary (no re-classification)
-- source_type = "web_search"
+1. **Error if empty** — if all article lists are empty and generation was not cancelled, return an error.
+2. **Order sections** — user-defined categories first (in order), then "Divers" if non-empty, then "Articles sans date" if non-empty.
+3. **Sanitize** — strip `\u0000` null bytes from JSON (PostgreSQL JSONB requirement).
+4. **Save synthesis** — insert into `syntheses` table with `job_id`, `week` (ISO week), `sections` (JSONB), `status: completed`, `theme_id`.
+5. **Record used articles** — for each article in the final synthesis, build trace with `status: "used"`, `synthesis_id`, and correct `source_type` (inferred from `url_source`). Batch-insert into `article_history`.
 
-### Save & Record
+### Shared Helpers
 
-1. Error if all article lists are empty
-2. Order sections: user-defined categories first (in order), then "Autre" if non-empty
-3. Sanitize: strip `\u0000` null bytes from JSON (PostgreSQL JSONB requirement)
-4. Insert synthesis row: job_id, week (ISO week string), sections (JSONB), status "completed", theme_id
-5. Record used articles: batch-insert `article_history` entries with status "used", synthesis_id, and correct source_type
+- **`build_trace_entry()`** — constructs an `ArticleHistoryEntry` from an `ArticleTrace` struct. Never writes to DB directly; caller accumulates in `pending_traces`.
+- **`scrape_and_classify_batch()`** — shared batch processing logic used by Phase 1 and by the Phase 2 Brave Search path.
+- **`assign_category()`** — validates the LLM-returned category, falls back to "Divers", and drops the article if "Divers" is also full.
+- **`filter_phase2_url()`** — async helper applying homepage/dedup/history/diversity filters for Phase 2.
+- **`scrape_single_article()`** — thin wrapper around `scraper::scrape_url` returning `(body_text, page_title, final_url, drop_reason)`.
+- **`hash_article_url()`** — normalizes URL (strips fragments, UTM params, trailing slashes, lowercases) then SHA-256 hashes.
 
 ---