The full pipeline algorithm is now in technical_specs.md with added
details: preferred source ordering, windowed waves, is_article filter,
date fallback, "Articles sans date" category, cancellation support.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- **Session cleanup**: an hourly background task deletes expired DB sessions (`db::sessions::delete_expired`).
- **Job store TTL**: expired job entries (older than 1 hour) are cleaned up via `JobStore::cleanup_expired`.
## Generation Lifecycle
`POST /api/v1/syntheses/generate` creates a job in the `JobStore`, then spawns two nested tasks:
- Inner task: wraps `run_generation` in a **15-minute `tokio::time::timeout`**. If the timeout fires, sends an `Error` progress event and releases the user lock.
- Outer task: monitors the inner task's `JoinHandle` for panics. If the inner task panics, sends an `Error` progress event and releases the user lock.
Progress is streamed to clients via a `tokio::sync::watch` channel (SSE endpoint subscribes to it).
---
## Initialization
1. **Load user settings** from DB (categories, provider, models, max_items, batch_size, etc.)
2. **Cleanup** — delete old article history entries (>N days, dropped only) + truncate old LLM call logs
3. **Validate** — if no categories configured, the only available category will be "Autre".
4. **Load user sources** (personalized URLs like `https://openai.com/blog`)
9. **Batch trace buffer** — `pending_traces: Vec<ArticleHistoryEntry>` accumulates all article history writes; flushed with `db::article_history::batch_insert_entries` at phase boundaries (not per article).
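
The batch trace buffer in the last step can be sketched roughly as follows. This is a minimal illustration, not the project's code: the `ArticleHistoryEntry` fields, the `FakeDb` type, and the `flush_traces` helper are all assumptions standing in for the real struct and for `db::article_history::batch_insert_entries`.

```rust
/// Simplified stand-in for the real `ArticleHistoryEntry` (fields are assumptions).
#[derive(Debug, Clone, PartialEq)]
struct ArticleHistoryEntry {
    url: String,
    status: String,
}

/// Hypothetical in-memory sink standing in for the database layer.
#[derive(Default)]
struct FakeDb {
    rows: Vec<ArticleHistoryEntry>,
    insert_calls: usize,
}

impl FakeDb {
    /// Plays the role of `db::article_history::batch_insert_entries`.
    fn batch_insert_entries(&mut self, entries: Vec<ArticleHistoryEntry>) {
        self.insert_calls += 1;
        self.rows.extend(entries);
    }
}

/// Flush the buffer in a single batch and leave it empty for the next phase.
fn flush_traces(pending: &mut Vec<ArticleHistoryEntry>, db: &mut FakeDb) {
    if pending.is_empty() {
        return; // nothing to write, so skip the round trip
    }
    // `mem::take` moves the entries out and resets the buffer to an empty Vec.
    db.batch_insert_entries(std::mem::take(pending));
}
```

Calling `flush_traces` once per phase boundary keeps the number of insert calls independent of the number of articles traced.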
---
## Phase 1: Personalized Sources
**Skipped entirely if user has 0 sources.**
### 1a. Extract article links from source pages and filter against article history
- Query `article_history` for the last source used. Reorder the personalized sources so that the first source is the one following the last source used (rolling window).
- Fetch source pages with bounded concurrency of **5** (hardcoded `max_concurrent = 5`):
  - If `use_llm_for_source_links` enabled: send HTML `<head>` + first 8000 chars of `<body>` to LLM → extract all article URLs up to a maximum of **15**, with the most recent first. If the LLM call fails, fall back to HTML parsing as described below.
    - **LLM call logged** with full prompt/response/timing
  - Otherwise: parse HTML `<a href>` links, filter by same-domain, non-homepage path, exclude `/tag/`, `/login/`, `/contact/`, `/presentation/`, `/newsletter/`, static assets, etc., and keep only the first **15** links found.
- **SSRF check** performed on each source URL before fetching (rejects private/loopback IPs).
- Deduplicate candidate URLs (case-insensitive, cross-source via `seen_urls`).
- **Filter against article history** — hash each URL (normalized: lowercase, strip fragments/UTM params/trailing slashes), batch-query `article_history` → remove matches.
- Trace dropped articles as `status: filtered_history` (flushed immediately after this filter step).
- **Shuffle** remaining candidates to interleave articles from different sources.
- Track url → source in `url_source`.
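
The URL normalization used by the history filter above could look something like the sketch below. The function names `normalize_url` and `url_hash` are hypothetical, and the hash function is an assumption: the text does not say which hash the pipeline uses, and `DefaultHasher` is not guaranteed stable across Rust releases, so a persisted key would more plausibly use a stable digest.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Lowercase the URL, drop the fragment, drop `utm_*` query parameters,
/// and strip trailing slashes from the path part.
fn normalize_url(url: &str) -> String {
    let lower = url.trim().to_lowercase();
    // Everything after '#' is the fragment and is ignored.
    let without_fragment = lower.split('#').next().unwrap_or("");
    let (base, query) = match without_fragment.split_once('?') {
        Some((b, q)) => (b, q),
        None => (without_fragment, ""),
    };
    // Keep only query parameters that are not UTM tracking parameters.
    let kept: Vec<&str> = query
        .split('&')
        .filter(|p| !p.is_empty() && !p.starts_with("utm_"))
        .collect();
    let base = base.trim_end_matches('/');
    if kept.is_empty() {
        base.to_string()
    } else {
        format!("{}?{}", base, kept.join("&"))
    }
}

/// Hash of the normalized URL (illustrative only, see the note on stability).
fn url_hash(url: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    normalize_url(url).hash(&mut hasher);
    hasher.finish()
}
```

With this normalization, `https://Example.com/a/#top` and `https://example.com/a` map to the same key, so an article seen under either form is filtered only once.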
### 1b. Scrape, classify, and summarize articles (batched)
Processing happens in batches of `settings.batch_size` (minimum 1). For each batch:
**Batch assembly**: pull up to `batch_size` candidates from the iterator, skipping any where `source_counts[domain] >= max_articles_per_source` (traced as `filtered_diversity`).
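
As a rough sketch of batch assembly, assuming candidates are plain URL strings and that the domain is taken naively from the URL: `assemble_batch` and `domain_of` are hypothetical names, and the `filtered_diversity` vector stands in for the real trace entries.

```rust
use std::collections::HashMap;

/// Naive host extraction (an assumption; the real code may use a URL parser).
fn domain_of(url: &str) -> String {
    let rest = url.split_once("://").map(|(_, r)| r).unwrap_or(url);
    rest.split('/').next().unwrap_or("").to_lowercase()
}

/// Pull up to `batch_size` candidates, skipping sources that already hit the cap.
/// Skipped URLs are pushed to `filtered_diversity` so the caller can trace them.
fn assemble_batch<I: Iterator<Item = String>>(
    candidates: &mut I,
    batch_size: usize,
    source_counts: &HashMap<String, usize>,
    max_articles_per_source: usize,
    filtered_diversity: &mut Vec<String>,
) -> Vec<String> {
    let batch_size = batch_size.max(1); // enforce the minimum of 1
    let mut batch = Vec::with_capacity(batch_size);
    while batch.len() < batch_size {
        let Some(url) = candidates.next() else { break };
        let count = source_counts.get(&domain_of(&url)).copied().unwrap_or(0);
        if count >= max_articles_per_source {
            filtered_diversity.push(url);
        } else {
            batch.push(url);
        }
    }
    batch
}
```

Because the iterator is borrowed rather than consumed, the next call resumes where the previous batch stopped.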
**Phase A — Scrape batch in parallel** (`JoinSet`):
- SSRF check (no private IPs), 15s timeout, 5MB body limit.
- HTML parsing heuristics for title (`<title>`, `og:title`), date (meta tags, JSON-LD, `<time>`), body (strip scripts/nav), soft-404 detection.
- If article body is empty, is a soft-404, or is too old: trace as `filtered_empty` / `filtered_too_old` and skip.
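
The SSRF check on IP addresses can be approximated with the standard library alone. This is a sketch under assumptions: `is_forbidden_ip` is a hypothetical name, the text only mentions private and loopback addresses (link-local, unspecified, broadcast, and IPv6 unique-local ranges are added here as a guess), and a real check would also need to resolve the hostname and test every resolved address.

```rust
use std::net::IpAddr;

/// Returns true when the address must not be fetched by the scraper.
fn is_forbidden_ip(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => {
            v4.is_private()          // 10/8, 172.16/12, 192.168/16
                || v4.is_loopback()  // 127/8
                || v4.is_link_local() // 169.254/16
                || v4.is_unspecified()
                || v4.is_broadcast()
        }
        IpAddr::V6(v6) => {
            let first = v6.segments()[0];
            v6.is_loopback()
                || v6.is_unspecified()
                || (first & 0xfe00) == 0xfc00 // fc00::/7 unique local
                || (first & 0xffc0) == 0xfe80 // fe80::/10 link-local
        }
    }
}
```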
**Phase B — Classify/summarize batch in parallel** (`JoinSet`):
- Check rate limit before classifying (waits up to 60s, then errors).
- Send article (title + first 500 chars of body) + categories + "Autre" to LLM. LLM returns `{title, summary, category}`. If the title was empty, the LLM also generates one.
- **LLM call logged** with full prompt/response/timing.
- **`assign_category()`** helper: validates category, falls back to "Autre" if category is unknown or full. If "Autre" is also full, drops the article.
- Add the article to `article_scraped`, increment `filled_counts`, increment `source_counts[domain]`.
**Early exit**: after each batch, if total articles across all categories ≥ `(num_categories + 1) × max_items_per_category`, stop and move to Phase 2.
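
The early-exit test is plain arithmetic; a minimal sketch, assuming `filled_counts` maps category names (including "Autre") to the number of accepted articles. The function name `should_stop_phase1` is hypothetical.

```rust
use std::collections::HashMap;

/// True once the total reaches (num_categories + 1) * max_items_per_category,
/// where the "+ 1" accounts for the "Autre" fallback category.
fn should_stop_phase1(
    filled_counts: &HashMap<String, usize>,
    num_categories: usize,
    max_items_per_category: usize,
) -> bool {
    let total: usize = filled_counts.values().sum();
    total >= (num_categories + 1) * max_items_per_category
}
```

For example, with 2 user categories and `max_items_per_category = 3`, the threshold is 9 articles.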
**Trace flush**: all pending traces accumulated during Phase 1 are batch-inserted into `article_history` after Phase 1 completes.
---
## Phase 2: Web Search Fallback
**Skipped if all user-defined categories are already filled to `max_items_per_category`.**
### 2a. Compute category gaps
- For each user category: `needed = max_items_per_category - already_filled`
- Only proceed if any category needs more articles.
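
The gap computation could be sketched as below. `compute_gaps` is a hypothetical name; the saturating subtraction is an assumption that guards against a category somehow holding more than `max_items_per_category`.

```rust
use std::collections::HashMap;

/// For each user category, return how many more articles it needs.
/// Categories that are already full are omitted from the result.
fn compute_gaps(
    categories: &[String],
    filled_counts: &HashMap<String, usize>,
    max_items_per_category: usize,
) -> Vec<(String, usize)> {
    categories
        .iter()
        .filter_map(|category| {
            let filled = filled_counts.get(category).copied().unwrap_or(0);
            let needed = max_items_per_category.saturating_sub(filled);
            (needed > 0).then(|| (category.clone(), needed))
        })
        .collect()
}
```

An empty result means every user category is full, which corresponds to skipping Phase 2.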
### 2b. Choose path: Brave Search or LLM web search
The path is selected by the `settings.use_brave_search` flag.
- **LLM call logged** with full prompt/response/timing.
#### 2c-B. Filter LLM search results
Same **`filter_phase2_url()`** logic as Path A (homepage, cross-phase dedup, history, diversity). Accepted URLs are added to `seen_urls`. Traces are **batch-flushed** after this filter step.
#### 2d-B. Scrape LLM search results (sequential)
- For each accepted item: call `scrape_single_article` (SSRF check, 15s timeout, 5MB limit).
- If scrape fails or article is too old/empty: trace as appropriate and skip.
- Otherwise: keep the LLM-provided title and summary (no re-classification LLM call). Add to `article_scraped`, increment `source_counts[domain]`.
- Traces are batch-flushed after all items are processed.
---
## Save + Record
- **Error if empty** — if all scraped article lists are empty, return an error.
- **Order sections** — user-defined categories first (in order), then "Autre" if non-empty.
- **Sanitize** — strip `\u0000` null bytes from JSON (PostgreSQL rejects them in JSONB).
- **Save synthesis** — insert into `syntheses` table with `job_id`, `week` (ISO week), `sections` (JSONB), `status: completed`.
- **Record used articles** — for each article in the final synthesis, build a trace with `status: "used"`, `synthesis_id`, and the correct `source_type` (`personalized_source`, `brave_search`, or `web_search` inferred from `url_source`). Batch-insert into `article_history`.
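
The first three steps of this list (error if empty, ordering, sanitizing) might be sketched as follows. All names here are hypothetical, articles are reduced to plain strings, and two details are assumptions: NUL characters are stripped from the strings before serialization rather than from the serialized JSON, and user-defined categories are kept even when empty.

```rust
use std::collections::HashMap;

const FALLBACK: &str = "Autre";

/// Remove NUL characters, which PostgreSQL rejects inside JSONB strings.
fn strip_nul(text: &str) -> String {
    text.replace('\0', "")
}

/// Order sections: user categories in their configured order, then "Autre"
/// only when it contains at least one article.
fn order_sections(
    user_categories: &[&str],
    mut by_category: HashMap<String, Vec<String>>,
) -> Result<Vec<(String, Vec<String>)>, String> {
    if by_category.values().all(|items| items.is_empty()) {
        return Err("no articles scraped".to_string());
    }
    let mut sections = Vec::new();
    for name in user_categories {
        let items = by_category.remove(*name).unwrap_or_default();
        let clean = items.iter().map(|s| strip_nul(s)).collect();
        sections.push(((*name).to_string(), clean));
    }
    if let Some(items) = by_category.remove(FALLBACK) {
        if !items.is_empty() {
            let clean = items.iter().map(|s| strip_nul(s)).collect();
            sections.push((FALLBACK.to_string(), clean));
        }
    }
    Ok(sections)
}
```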
---
## Shared Helpers
- **`build_trace_entry()`** — constructs an `ArticleHistoryEntry` from an `ArticleTrace` struct (replaces the old 11-positional-parameter `trace_article` function). Never writes to DB directly; caller accumulates in `pending_traces`.
- **`assign_category()`** — validates LLM-returned category against the classification list, falls back to "Autre", drops article if "Autre" is also full.
- **`filter_phase2_url()`** — async helper applying homepage/dedup/history/diversity filters for Phase 2 (both Brave and LLM paths).
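
To make the `assign_category()` fallback rule concrete, here is a minimal sketch. The signature is an assumption (the real helper may take different arguments); only the behavior described above is modeled: unknown or full categories fall back to "Autre", and the article is dropped when "Autre" is full too.

```rust
use std::collections::HashMap;

const FALLBACK: &str = "Autre";

/// Returns the category to file the article under, or `None` to drop it.
fn assign_category(
    llm_category: &str,
    categories: &[String],
    filled_counts: &HashMap<String, usize>,
    max_items_per_category: usize,
) -> Option<String> {
    let is_full =
        |name: &str| filled_counts.get(name).copied().unwrap_or(0) >= max_items_per_category;
    let known = categories.iter().any(|c| c == llm_category);
    if known && !is_full(llm_category) {
        return Some(llm_category.to_string());
    }
    // Unknown or full category: try the fallback bucket.
    if is_full(FALLBACK) {
        None
    } else {
        Some(FALLBACK.to_string())
    }
}
```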
The pipeline runs as a background tokio task spawned by `POST /api/v1/syntheses/generate`. It has a 15-minute global timeout and supports cooperative cancellation via `AtomicBool`.
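
Cooperative cancellation with an `AtomicBool` generally means the worker polls the flag at safe points, for instance between batches. The sketch below uses only the standard library and a synchronous loop for brevity; `run_batches` and its callback are hypothetical, and where exactly the real pipeline checks the flag is an assumption.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Run `total_batches` units of work, checking the flag before each one.
/// Work already in progress is never interrupted; it just stops at the next check.
fn run_batches(
    total_batches: usize,
    cancelled: &AtomicBool,
    mut on_batch: impl FnMut(usize),
) -> Result<usize, &'static str> {
    for index in 0..total_batches {
        if cancelled.load(Ordering::Relaxed) {
            return Err("cancelled");
        }
        on_batch(index);
    }
    Ok(total_batches)
}
```

In practice the flag would be shared through an `Arc<AtomicBool>` so that the request handler can set it while the background task reads it.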
- **Session cleanup**: an hourly background task deletes expired DB sessions (`db::sessions::delete_expired`).
- **Job store TTL**: expired job entries (older than 1 hour) are cleaned up via `JobStore::cleanup_expired`.
### Generation Lifecycle
`POST /api/v1/syntheses/generate` creates a job in the `JobStore`, then spawns two nested tasks:
- Inner task: wraps `run_generation` in a **15-minute `tokio::time::timeout`**. If the timeout fires, sends an `Error` progress event and releases the user lock.
- Outer task: monitors the inner task's `JoinHandle` for panics. If the inner task panics, sends an `Error` progress event and releases the user lock.
Progress is streamed to clients via a `tokio::sync::watch` channel (SSE endpoint subscribes to it).
### Initialization
1. Load `UserSettings` from DB (or create defaults)
1. **Load user settings** from DB (provider, models, batch_size, rate limits, etc.)
2. Cleanup old article history (entries older than `article_history_days` with dropped status) and truncate old LLM call logs
2. **Cleanup** — delete old article history entries (>N days, dropped only) + truncate old LLM call logs
3. Load the target `Theme` (categories, max_items, max_age_days, summary_length)
3. **Validate** — if no categories configured, the only available category will be "Divers".
10. **Batch trace buffer** — `pending_traces: Vec<ArticleHistoryEntry>` accumulates all article history writes; flushed with `db::article_history::batch_insert_entries` at phase boundaries.
### Phase 1: Personalized Sources
Skipped if user has 0 sources for the theme.
**Skipped entirely if user has 0 sources.**
#### 1a. Windowed source extraction
- Query `article_history` for the last source used. Reorder sources so the first source follows the last one used (rolling window).
- Separate preferred sources (processed first) from non-preferred, preserving rotation order within each group.
- Process sources in waves of `source_extraction_window` size:
  - For each source in the wave: fetch page HTML, extract up to `max_links_per_source` article URLs via HTML parsing (same-domain, non-homepage, no static assets).
- **SSRF check** performed on each source URL before fetching.
- Deduplicate candidate URLs (case-insensitive, cross-source via `seen_urls`).
- **Filter against article history** — hash each URL (normalized: lowercase, strip fragments/UTM params/trailing slashes), batch-query `article_history` → remove matches. Trace dropped articles as `status: filtered_history`.
- **Preferred-first shuffle** — shuffle preferred URLs separately from non-preferred, then concatenate (preferred first).
- Track url → source in `url_source`.
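
The source ordering described in this step (rolling window, preferred sources first, then waves) can be sketched as three small functions. The `Source` struct and the function names are assumptions made for illustration; only the ordering rules come from the text above.

```rust
/// Simplified source record (fields are assumptions).
#[derive(Debug, Clone, PartialEq)]
struct Source {
    url: String,
    preferred: bool,
}

/// Rotate so the first source is the one right after the last source used.
fn rotate_after_last_used(mut sources: Vec<Source>, last_used: Option<&str>) -> Vec<Source> {
    if let Some(last) = last_used {
        if let Some(pos) = sources.iter().position(|s| s.url == last) {
            let start = (pos + 1) % sources.len();
            sources.rotate_left(start);
        }
    }
    sources
}

/// Move preferred sources to the front, keeping rotation order within each group.
fn preferred_first(sources: Vec<Source>) -> Vec<Source> {
    let (preferred, others): (Vec<Source>, Vec<Source>) =
        sources.into_iter().partition(|s| s.preferred);
    preferred.into_iter().chain(others).collect()
}

/// Split the ordered sources into waves of `window` sources (at least 1 per wave).
fn waves(sources: &[Source], window: usize) -> std::slice::Chunks<'_, Source> {
    sources.chunks(window.max(1))
}
```

`partition` preserves the relative order of the elements it receives, which is what keeps the rotation order intact inside the preferred and non-preferred groups.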
#### 1b. Scrape, classify, and summarize articles (batched)
Processing in batches of `settings.batch_size` (minimum 1). For each batch:
**1a. Windowed source extraction**
**Batch assembly**: Pull up to `batch_size` candidates, skipping any where `source_counts[domain] >= max_articles_per_source` (traced as `filtered_diversity`).
- Query article_history for the last source used; reorder sources in a rolling window starting after that source
**Phase A — Scrape batch in parallel** (`JoinSet`):
- Select up to `source_extraction_window` sources per generation
- SSRF check (no private IPs), 15s timeout, 5MB body limit.
- For each source (bounded concurrency of 5): fetch page HTML, extract up to `max_links_per_source` article URLs via HTML parsing (same-domain, non-homepage, no static assets)
- HTML parsing for title (`<title>`, `og:title`), date (meta tags, JSON-LD, `<time>`), body (strip scripts/nav), soft-404 detection.
- Deduplicate URLs cross-source via `seen_urls`
- If article body is empty, is a soft-404, or is too old: trace as `filtered_empty` / `filtered_too_old` and skip.
- Batch-check `article_history` for already-seen URL hashes; filter matches (traced as `filtered_history`)
- Shuffle remaining candidates to interleave sources
- Track url -> source in `url_source`
**1b. Batch scrape + classify**
**Phase B — Classify/summarize batch in parallel** (`JoinSet`):
- Check rate limit before classifying (waits up to 60s, then errors).
- Send article (title + body snippet based on `summary_length`: 500/2000/4000 chars) + categories + "Divers" to LLM.
- **Classify** (JoinSet, parallel): Rate limit check (60s wait), send title + first 500 chars to LLM with categories list. LLM returns `{title, summary, category}`. Validate category via `assign_category()` (fallback to "Autre", drop if full).
**Trace flush**: Pending traces batch-inserted into `article_history` between waves.
- **LLM call logging**: Every LLM call is logged with full prompt, response, timing, and article URL.
- **Early exit**: Stop when total articles >= `(num_categories + 1) * max_items_per_category`.
- Batch-flush pending traces to `article_history`.
### Phase 2: Web Search Fallback
Skipped if all categories are filled to `max_items_per_category`.
**Skipped if all user-defined categories are already filled.**
#### 2a. Compute category gaps
For each user category: `needed = max_items_per_category - already_filled`. Only proceed if any category needs more.
#### 2b. Choose path: Brave Search or LLM web search
4. **Save synthesis** — insert into `syntheses` table with `job_id`, `week` (ISO week), `sections` (JSONB), `status: completed`, `theme_id`.
- Scrape each result sequentially to validate; keep LLM-provided title/summary (no re-classification)
5. **Record used articles** — for each article in the final synthesis, build trace with `status: "used"`, `synthesis_id`, and correct `source_type` (inferred from `url_source`). Batch-insert into `article_history`.
- source_type = "web_search"
### Save & Record
### Shared Helpers
1. Error if all article lists are empty
- **`build_trace_entry()`** — constructs an `ArticleHistoryEntry` from an `ArticleTrace` struct. Never writes to DB directly; caller accumulates in `pending_traces`.
2. Order sections: user-defined categories first (in order), then "Autre" if non-empty
- **`scrape_and_classify_batch()`** — shared batch processing logic used by Phase 1 and Phase 2 Brave paths.