# Synthesis Generation Pipeline — Full Algorithm
## Startup & Background Tasks
- **Session cleanup**: an hourly background task deletes expired DB sessions (`db::sessions::delete_expired`).
- **Job store TTL**: expired job entries (older than 1 hour) are cleaned up via `JobStore::cleanup_expired`.
## Generation Lifecycle
`POST /api/v1/syntheses/generate` creates a job in the `JobStore`, then spawns two nested tasks:
- Inner task: wraps `run_generation` in a **15-minute `tokio::time::timeout`**. If the timeout fires, sends an `Error` progress event and releases the user lock.
- Outer task: monitors the inner task's `JoinHandle` for panics. If the inner task panics, sends an `Error` progress event and releases the user lock.
Progress is streamed to clients via a `tokio::sync::watch` channel (SSE endpoint subscribes to it).
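
The timeout-plus-panic guard described above can be illustrated without an async runtime. The sketch below is a std-only analogue, not the pipeline's code: it swaps the Tokio tasks and `tokio::time::timeout` for an OS thread and `recv_timeout`, and `run_with_guard` and `Outcome` are hypothetical names. Unlike a Tokio timeout, it cannot cancel the worker; it only stops waiting for it.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical outcome type for the guarded run.
#[derive(Debug, PartialEq)]
enum Outcome {
    Completed(String),
    TimedOut,
    Panicked,
}

/// Runs `work` on a worker thread and waits at most `limit` for its result.
/// In the real pipeline, both `TimedOut` and `Panicked` would lead to an
/// `Error` progress event and the release of the user lock.
fn run_with_guard<F>(work: F, limit: Duration) -> Outcome
where
    F: FnOnce() -> String + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        let result = work();
        // The receiver may already have given up; ignore a failed send.
        let _ = tx.send(result);
    });

    match rx.recv_timeout(limit) {
        Ok(value) => Outcome::Completed(value),
        Err(mpsc::RecvTimeoutError::Timeout) => Outcome::TimedOut,
        Err(mpsc::RecvTimeoutError::Disconnected) => {
            // The sender was dropped without sending: the worker panicked.
            let _ = handle.join();
            Outcome::Panicked
        }
    }
}
```
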
---
## Initialization
1. **Load user settings** from DB (categories, provider, models, max_items, batch_size, etc.)
2. **Cleanup** — delete old article history entries (>N days, dropped only) + truncate old LLM call logs
3. **Validate** — if no categories configured, the only available category will be "Autre".
4. **Load user sources** (personalized URLs like `https://openai.com/blog`)
5. **Resolve LLM provider** — decrypt user's API key, create provider instance (`Arc<dyn LlmProvider>`)
6. **Resolve models** — research model + web-search model (user override or admin default)
7. **Setup rate limiter** — per-user or global provider limiter
8. **Initialize tracking structures**: `article_scraped` (category→articles), `source_counts` (per-domain article count), `url_source` (per-article source), `filled_counts` (per-category article count), `seen_urls` (cross-phase dedup), `classification_categories` (user categories + "Autre")
9. **Batch trace buffer**: `pending_traces: Vec<ArticleHistoryEntry>` accumulates all article history writes; flushed with `db::article_history::batch_insert_entries` at phase boundaries (not per article).
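
As a rough picture of steps 8 and 9, the tracking state could be grouped as below. This is a sketch under assumptions: the `GenerationState` and `Article` types, the concrete map and key types, and the `take_pending_traces` helper are all hypothetical, and `Vec<String>` stands in for `Vec<ArticleHistoryEntry>`.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical article record; the real struct likely carries more fields.
#[derive(Debug, Clone, PartialEq)]
struct Article {
    url: String,
    title: String,
    summary: String,
}

/// Hypothetical container for the per-run tracking structures.
#[derive(Debug, Default)]
struct GenerationState {
    article_scraped: HashMap<String, Vec<Article>>, // category -> articles
    source_counts: HashMap<String, usize>,          // domain -> article count
    url_source: HashMap<String, String>,            // article URL -> source
    filled_counts: HashMap<String, usize>,          // category -> article count
    seen_urls: HashSet<String>,                     // cross-phase dedup
    classification_categories: Vec<String>,         // user categories + "Autre"
    pending_traces: Vec<String>,                    // stand-in for Vec<ArticleHistoryEntry>
}

impl GenerationState {
    /// Builds empty tracking state; "Autre" is always appended last.
    fn new(user_categories: &[&str]) -> Self {
        let mut classification_categories: Vec<String> =
            user_categories.iter().map(|c| c.to_string()).collect();
        classification_categories.push("Autre".to_string());
        GenerationState {
            classification_categories,
            ..Default::default()
        }
    }

    /// Empties the buffer and returns its contents, ready for one batch insert.
    fn take_pending_traces(&mut self) -> Vec<String> {
        std::mem::take(&mut self.pending_traces)
    }
}
```
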
---
## Phase 1: Personalized Sources
**Skipped entirely if user has 0 sources.**
### 1a. Extract article links from source pages and filter against article history
- Query `article_history` for the last source used. Reorder the personalized sources so that the first source is the one following the last source used (rolling window).
- Fetch source pages with bounded concurrency of **5** (hardcoded `max_concurrent = 5`):
  - If `use_llm_for_source_links` enabled: send HTML `<head>` + first 8000 chars of `<body>` to LLM → extract all article URLs up to a maximum of **15**, with the most recent first. If the LLM call fails, fall back to HTML parsing as described below.
    - **LLM call logged** with full prompt/response/timing
  - Otherwise: parse HTML `<a href>` links, filter by same-domain, non-homepage path, exclude `/tag/`, `/login/`, `/contact/`, `/presentation/`, `/newsletter/`, static assets, etc., and keep only the first **15** links found.
- **SSRF check** performed on each source URL before fetching (rejects private/loopback IPs).
- Deduplicate candidate URLs (case-insensitive, cross-source via `seen_urls`).
- **Filter against article history** — hash each URL (normalized: lowercase, strip fragments/UTM params/trailing slashes), batch-query `article_history` → remove matches.
- Trace dropped articles as `status: filtered_history` (flushed immediately after this filter step).
- **Shuffle** remaining candidates to interleave articles from different sources.
- Track url → source in `url_source`.
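
The rolling-window reordering at the start of 1a amounts to a rotation of the source list. A minimal sketch, assuming sources are plain URL strings and that the last source used has already been read from `article_history`; `rotate_after_last_used` is a hypothetical name:

```rust
/// Rotates `sources` in place so that the entry following `last_used`
/// comes first. Leaves the order unchanged if `last_used` is `None` or
/// is no longer in the list.
fn rotate_after_last_used(sources: &mut [String], last_used: Option<&str>) {
    let position = last_used.and_then(|last| sources.iter().position(|s| s == last));
    if let Some(index) = position {
        // `position` only returns Some for a non-empty list, so the
        // modulo never divides by zero.
        let shift = (index + 1) % sources.len();
        sources.rotate_left(shift);
    }
}
```

With sources `[a, b, c]` and `b` as the last source used, the next run starts at `c` and the order becomes `[c, a, b]`.
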
### 1b. Scrape, classify, and summarize articles (batched)
Processing happens in batches of `settings.batch_size` (minimum 1). For each batch:
**Batch assembly**: pull up to `batch_size` candidates from the iterator, skipping any where `source_counts[domain] >= max_articles_per_source` (traced as `filtered_diversity`).
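
Batch assembly can be sketched as a loop over the shared candidate iterator. The function name `next_batch`, the naive `domain_of` helper, and the returned `(batch, skipped)` pair are assumptions made for illustration; the skipped URLs are the ones that would be traced as `filtered_diversity`.

```rust
use std::collections::HashMap;

/// Naive domain extraction, good enough for the sketch (no URL crate).
fn domain_of(url: &str) -> &str {
    let rest = url.split_once("://").map(|(_, r)| r).unwrap_or(url);
    rest.split(|c: char| c == '/' || c == '?' || c == '#')
        .next()
        .unwrap_or(rest)
}

/// Pulls candidates until the batch holds `batch_size` URLs or the iterator
/// is exhausted. URLs whose domain already reached `max_per_source` are
/// returned separately instead of being added to the batch.
fn next_batch<I: Iterator<Item = String>>(
    candidates: &mut I,
    source_counts: &HashMap<String, usize>,
    batch_size: usize,
    max_per_source: usize,
) -> (Vec<String>, Vec<String>) {
    let batch_size = batch_size.max(1); // the batch size is at least 1
    let mut batch = Vec::new();
    let mut skipped = Vec::new();
    while batch.len() < batch_size {
        match candidates.next() {
            Some(url) => {
                let count = source_counts.get(domain_of(&url)).copied().unwrap_or(0);
                if count >= max_per_source {
                    skipped.push(url);
                } else {
                    batch.push(url);
                }
            }
            None => break,
        }
    }
    (batch, skipped)
}
```

Because the iterator is borrowed rather than consumed, the next call resumes where the previous batch stopped.
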
**Phase A — Scrape batch in parallel** (`JoinSet`):
- SSRF check (no private IPs), 15s timeout, 5MB body limit.
- HTML parsing heuristics for title (`<title>`, `og:title`), date (meta tags, JSON-LD, `<time>`), body (strip scripts/nav), soft-404 detection.
- If article body is empty, is a soft-404, or is too old: trace as `filtered_empty` / `filtered_too_old` and skip.
**Phase B — Classify/summarize batch in parallel** (`JoinSet`):
- Check rate limit before classifying (waits up to 60s, then errors).
- Send article (title + first 500 chars of body) + categories + "Autre" to LLM. LLM returns `{title, summary, category}`. If the title was empty, the LLM also generates one.
- **LLM call logged** with full prompt/response/timing.
- **`assign_category()`** helper: validates category, falls back to "Autre" if category is unknown or full. If "Autre" is also full, drops the article.
- Add the article to `article_scraped`, increment `filled_counts`, increment `source_counts[domain]`.
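
The category fallback performed by `assign_category()` can be sketched as follows. The signature is an assumption (the real helper may take different arguments); only the decision logic is taken from the description above.

```rust
use std::collections::HashMap;

const FALLBACK_CATEGORY: &str = "Autre";

/// Returns the category the article should be filed under, or `None` if it
/// must be dropped. Signature is hypothetical.
fn assign_category(
    llm_category: &str,
    classification_categories: &[String],
    filled_counts: &HashMap<String, usize>,
    max_items_per_category: usize,
) -> Option<String> {
    let has_room =
        |cat: &str| filled_counts.get(cat).copied().unwrap_or(0) < max_items_per_category;
    let known = classification_categories.iter().any(|c| c == llm_category);

    if known && has_room(llm_category) {
        return Some(llm_category.to_string());
    }
    // Unknown or full category: try "Autre", drop the article if it is full too.
    if has_room(FALLBACK_CATEGORY) {
        Some(FALLBACK_CATEGORY.to_string())
    } else {
        None
    }
}
```

The caller would then push the article into `article_scraped` under the returned category and increment `filled_counts` and `source_counts[domain]`.
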
**Early exit**: after each batch, if total articles across all categories ≥ `(num_categories + 1) × max_items_per_category`, stop and move to Phase 2.
**Trace flush**: all pending traces accumulated during Phase 1 are batch-inserted into `article_history` after Phase 1 completes.
---
## Phase 2: Web Search Fallback
**Skipped if all user-defined categories are already filled to `max_items_per_category`.**
### 2a. Compute category gaps
- For each user category: `needed = max_items_per_category - already_filled`
- Only proceed if any category needs more articles.
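
The gap computation is a per-category subtraction. A minimal sketch, with `category_gaps` as a hypothetical name and a saturating subtraction added defensively in case a count ever exceeds the cap:

```rust
use std::collections::HashMap;

/// Returns `(category, needed)` for every user category that still has room,
/// in the user's order. An empty result means Phase 2 can be skipped.
fn category_gaps(
    user_categories: &[String],
    filled_counts: &HashMap<String, usize>,
    max_items_per_category: usize,
) -> Vec<(String, usize)> {
    user_categories
        .iter()
        .filter_map(|category| {
            let filled = filled_counts.get(category).copied().unwrap_or(0);
            let needed = max_items_per_category.saturating_sub(filled);
            (needed > 0).then(|| (category.clone(), needed))
        })
        .collect()
}
```

"Autre" is deliberately not part of the input: only user-defined categories drive the decision to run Phase 2.
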
### 2b. Choose path: Brave Search or LLM web search
The path is selected by the `settings.use_brave_search` flag.
---
### Path A: Brave Search (`use_brave_search = true`)
#### 2b-A. Call Brave Search API
- Resolve and decrypt the user's Brave Search API key (error if not configured).
- Query: `"{settings.theme} actualites"`, up to 20 results, filtered by `max_age_days`.
#### 2c-A. Filter Brave results
Each result URL passes through **`filter_phase2_url()`**:
1. **Homepage filter** — drop URLs with path `/` or empty (`filtered_homepage`)
2. **Cross-phase dedup** — drop URLs already in `seen_urls` (`filtered_cross_phase_dedup`)
3. **Article history** — check hash in DB, drop if seen before (`filtered_history`)
4. **Source diversity** — drop if `source_counts[domain] >= max_articles_per_source` (`filtered_diversity`)
Accepted URLs are added to `seen_urls`. All rejections are traced. Traces are **batch-flushed** after this filter step.
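
The four checks can be sketched as a single function that returns the first rejection reason. This is a synchronous simplification under stated assumptions: the real helper is async and checks a URL hash in the DB, whereas here `history` is an in-memory set of lowercased URLs; the `Rejection` enum and the signature are hypothetical.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq)]
enum Rejection {
    Homepage,
    CrossPhaseDedup,
    History,
    Diversity,
}

impl Rejection {
    /// Trace status recorded for the rejected URL.
    fn status(&self) -> &'static str {
        match self {
            Rejection::Homepage => "filtered_homepage",
            Rejection::CrossPhaseDedup => "filtered_cross_phase_dedup",
            Rejection::History => "filtered_history",
            Rejection::Diversity => "filtered_diversity",
        }
    }
}

/// Naive split into (domain, path); query string and fragment are ignored.
fn domain_and_path(url: &str) -> (&str, &str) {
    let rest = url.split_once("://").map(|(_, r)| r).unwrap_or(url);
    let end = rest.find(|c: char| c == '?' || c == '#').unwrap_or(rest.len());
    let rest = &rest[..end];
    match rest.find('/') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, ""),
    }
}

/// Applies the checks in order; accepted URLs are recorded in `seen_urls`.
fn filter_phase2_url(
    url: &str,
    seen_urls: &mut HashSet<String>,
    history: &HashSet<String>,
    source_counts: &HashMap<String, usize>,
    max_per_source: usize,
) -> Result<(), Rejection> {
    let key = url.to_lowercase();
    let (domain, path) = domain_and_path(&key);
    if path.is_empty() || path == "/" {
        return Err(Rejection::Homepage);
    }
    if seen_urls.contains(&key) {
        return Err(Rejection::CrossPhaseDedup);
    }
    if history.contains(&key) {
        return Err(Rejection::History);
    }
    if source_counts.get(domain).copied().unwrap_or(0) >= max_per_source {
        return Err(Rejection::Diversity);
    }
    seen_urls.insert(key);
    Ok(())
}
```

Returning the reason as a value lets the caller build the trace entry and push it to `pending_traces`, so the filter itself never touches the trace buffer.
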
#### 2d-A. Scrape + classify Brave results (batched)
Same batch loop as Phase 1b, using `settings.batch_size`:
- **Phase A**: scrape batch in parallel, trace failures as `source_type: "brave_search"`.
- **Phase B**: classify/summarize in parallel (same LLM call + logging as Phase 1).
- **`assign_category()`** used identically to Phase 1.
- Source domain tracked in `source_counts`.
- **Early exit** at `max_total` articles.
- Traces are batch-flushed after this loop.
---
### Path B: LLM Web Search (`use_brave_search = false`)
#### 2b-B. LLM web search pass
- Check rate limit.
- Build search prompt with theme, categories, gap counts ("find N articles for category_1, M for category_2").
- Send search prompt to LLM (using `model_websearch`). LLM returns structured JSON: `{category_0: [{title, url, summary}], category_1: [...]}`
- **LLM call logged** with full prompt/response/timing.
#### 2c-B. Filter LLM search results
Same **`filter_phase2_url()`** logic as Path A (homepage, cross-phase dedup, history, diversity). Accepted URLs are added to `seen_urls`. Traces are **batch-flushed** after this filter step.
#### 2d-B. Scrape LLM search results (sequential)
- For each accepted item: call `scrape_single_article` (SSRF check, 15s timeout, 5MB limit).
- If scrape fails or article is too old/empty: trace as appropriate and skip.
- Otherwise: keep the LLM-provided title and summary (no re-classification LLM call). Add to `article_scraped`, increment `source_counts[domain]`.
- Traces are batch-flushed after all items are processed.
---
## Save + Record
- **Error if empty** — if all scraped article lists are empty, return an error.
- **Order sections** — user-defined categories first (in order), then "Autre" if non-empty.
- **Sanitize** — strip `\u0000` null bytes from JSON (PostgreSQL rejects them in JSONB).
- **Save synthesis** — insert into `syntheses` table with `job_id`, `week` (ISO week), `sections` (JSONB), `status: completed`.
- **Record used articles** — for each article in the final synthesis, build a trace with `status: "used"`, `synthesis_id`, and the correct `source_type` (`personalized_source`, `brave_search`, or `web_search` inferred from `url_source`). Batch-insert into `article_history`.
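
The first three steps (empty check, section ordering, null-byte stripping) could look like the sketch below. Articles are reduced to plain strings, `order_sections` and `strip_null_bytes` are hypothetical names, and keeping empty user categories in the output is an assumption the description does not settle.

```rust
use std::collections::HashMap;

/// Removes NUL characters, which PostgreSQL rejects in JSONB text.
fn strip_null_bytes(text: &str) -> String {
    text.replace('\0', "")
}

/// Orders sections: user categories first (in their configured order),
/// then "Autre" only if it has articles. Errors if nothing was collected.
fn order_sections(
    user_categories: &[String],
    article_scraped: &HashMap<String, Vec<String>>,
) -> Result<Vec<(String, Vec<String>)>, String> {
    if article_scraped.values().all(|articles| articles.is_empty()) {
        return Err("no articles collected".to_string());
    }
    let clean = |articles: &Vec<String>| -> Vec<String> {
        articles.iter().map(|a| strip_null_bytes(a)).collect()
    };
    let mut sections = Vec::new();
    for category in user_categories {
        // Assumption: a user category with no articles is kept as an empty section.
        let articles = article_scraped.get(category).map(|a| clean(a)).unwrap_or_default();
        sections.push((category.clone(), articles));
    }
    if let Some(other) = article_scraped.get("Autre") {
        if !other.is_empty() {
            sections.push(("Autre".to_string(), clean(other)));
        }
    }
    Ok(sections)
}
```
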
---
## Shared Helpers
- **`build_trace_entry()`** — constructs an `ArticleHistoryEntry` from an `ArticleTrace` struct (replaces the old 11-positional-parameter `trace_article` function). Never writes to DB directly; caller accumulates in `pending_traces`.
- **`assign_category()`** — validates LLM-returned category against the classification list, falls back to "Autre", drops article if "Autre" is also full.
- **`filter_phase2_url()`** — async helper applying homepage/dedup/history/diversity filters for Phase 2 (both Brave and LLM paths).
- **`scrape_single_article()`** — thin wrapper around `scraper::scrape_url` returning `(body_text, page_title, final_url, drop_reason)`.
- **`hash_article_url()`** — normalizes a URL (strips fragments, UTM params, trailing slashes, lowercases) then SHA-256 hashes it for history lookup.
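
The normalization half of `hash_article_url()` can be sketched with the standard library alone. The function name `normalize_article_url` is hypothetical, the handling of non-UTM query parameters (kept as-is) is an assumption, and the SHA-256 step is omitted because it needs an external crate.

```rust
/// Normalizes a URL for history lookup: lowercases it, drops the fragment,
/// removes `utm_*` query parameters, and strips trailing slashes.
/// The real helper would then SHA-256 hash the result.
fn normalize_article_url(url: &str) -> String {
    let lower = url.trim().to_lowercase();
    // Everything after '#' is the fragment.
    let without_fragment = lower.split('#').next().unwrap_or("");
    let (base, query) = match without_fragment.split_once('?') {
        Some((base, query)) => (base, query),
        None => (without_fragment, ""),
    };
    // Keep every query parameter except tracking ones.
    let kept: Vec<&str> = query
        .split('&')
        .filter(|param| !param.is_empty() && !param.starts_with("utm_"))
        .collect();
    let mut normalized = base.trim_end_matches('/').to_string();
    if !kept.is_empty() {
        normalized.push('?');
        normalized.push_str(&kept.join("&"));
    }
    normalized
}
```

Two URLs that differ only in case, tracking parameters, fragment, or a trailing slash normalize to the same string and would therefore share one history hash.
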