# Synthesis Generation Pipeline — Full Algorithm
## Initialization
1. **Load user settings** from DB (categories, provider, models, max_items, etc.)
2. **Cleanup** — delete old article history entries (>N days, dropped only) + truncate old LLM call logs
3. **Validate** — fail if no categories configured
4. **Load user sources** (personalized URLs like `https://openai.com/blog`)
5. **Resolve LLM provider** — decrypt user's API key, create provider instance (`Arc<dyn LlmProvider>`)
6. **Resolve models** — research model + writing model (user override or admin default)
7. **Setup rate limiter** — per-user or global provider limiter
8. **Prepare LLM scraping option** — if `use_llm_for_article_extraction` enabled, clone provider+model for concurrent use
9. **Initialize tracking structures**: `filled_counts` (per-category article count), `all_scraped` (category → articles), `all_overflow` (dropped overflow), `seen_urls` (cross-phase dedup), and the classification categories (user categories + "Autre")
---
## Phase 1: Personalized Sources
**Skipped entirely if user has 0 sources.**
### 1a. Extract article links from source pages
- For each source (max 10), fetch the source page HTML
- If `use_llm_for_source_links` enabled: send HTML `<head>` + first 8000 chars of `<body>` to LLM → extract article URLs (falls back to heuristic if LLM fails)
- Otherwise: parse HTML `<a href>` links, filter by same-domain, non-homepage path, exclude `/tag/`, `/login/`, static assets, etc.
- Over-fetch: `2 × max_articles_per_source` candidates per source
- Deduplicate candidate URLs
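
The heuristic branch above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the pipeline's actual code: `split_url`, `is_candidate_link`, `collect_candidates`, and the excluded segment and extension lists are hypothetical names and values chosen for the example.

```rust
use std::collections::HashSet;

/// Splits an absolute http(s) URL into (host, path), ignoring query and fragment.
fn split_url(url: &str) -> Option<(&str, &str)> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    let rest = rest.split(|c: char| c == '?' || c == '#').next().unwrap_or(rest);
    match rest.find('/') {
        Some(i) => Some((&rest[..i], &rest[i..])),
        None => Some((rest, "/")),
    }
}

/// Same-domain, non-homepage, no excluded segment, no static asset.
/// The two lists below are illustrative, not the real exclusion rules.
fn is_candidate_link(source_url: &str, link: &str) -> bool {
    const EXCLUDED_SEGMENTS: [&str; 2] = ["/tag/", "/login/"];
    const STATIC_EXTENSIONS: [&str; 6] = [".css", ".js", ".png", ".jpg", ".svg", ".ico"];

    let (source_host, (host, path)) = match (split_url(source_url), split_url(link)) {
        (Some((h, _)), Some(parts)) => (h, parts),
        _ => return false,
    };
    let path = path.to_ascii_lowercase();
    host.eq_ignore_ascii_case(source_host)
        && path != "/"
        && !EXCLUDED_SEGMENTS.iter().any(|s| path.contains(*s))
        && !STATIC_EXTENSIONS.iter().any(|e| path.ends_with(*e))
}

/// Over-fetches up to 2 × max_articles_per_source unique candidates.
fn collect_candidates(
    source_url: &str,
    links: &[&str],
    max_articles_per_source: usize,
) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for &link in links {
        if out.len() >= 2 * max_articles_per_source {
            break;
        }
        if is_candidate_link(source_url, link) && seen.insert(link.to_string()) {
            out.push(link.to_string());
        }
    }
    out
}
```

With `max_articles_per_source = 1`, `collect_candidates` stops after two unique candidates. The sketch compares hosts literally, so `www.` variants count as different domains; a real implementation would likely use a proper URL parser.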
### 1b. Scrape candidate articles
- Fetch each URL (bounded concurrency: 5 with LLM extraction, 10 without)
- SSRF check (no private IPs), 15s timeout, 5MB body limit
- If `use_llm_for_article_extraction` enabled: send `<head>` + body text to LLM → extract title, date, body, error detection (falls back to heuristic if LLM fails)
- Otherwise: HTML parsing heuristics for title (`<title>`, `og:title`), date (meta tags, JSON-LD, `<time>`), body (strip scripts/nav), soft-404 detection
- Capture final URL after redirects (canonical URL)
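
A hedged sketch of the fetch guards listed above, using only `std::net`. The function and constant names (`is_blocked_ip`, `scrape_concurrency`, `FETCH_TIMEOUT`, `MAX_BODY_BYTES`) are hypothetical, the set of blocked ranges is an assumption about what "no private IPs" covers, and reading 5MB as 5 × 1024 × 1024 bytes is also an assumption.

```rust
use std::net::IpAddr;
use std::time::Duration;

// Limits taken from the list above; the constant names are illustrative.
const FETCH_TIMEOUT: Duration = Duration::from_secs(15);
const MAX_BODY_BYTES: usize = 5 * 1024 * 1024;

/// Bounded concurrency: 5 with LLM extraction, 10 without.
fn scrape_concurrency(use_llm_for_article_extraction: bool) -> usize {
    if use_llm_for_article_extraction { 5 } else { 10 }
}

/// Returns true for addresses the scraper should refuse to contact.
fn is_blocked_ip(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => {
            v4.is_private()
                || v4.is_loopback()
                || v4.is_link_local()
                || v4.is_unspecified()
                || v4.is_broadcast()
        }
        IpAddr::V6(v6) => {
            // An IPv4-mapped address (::ffff:a.b.c.d) is judged by its IPv4 part.
            if let Some(v4) = v6.to_ipv4_mapped() {
                return is_blocked_ip(IpAddr::V4(v4));
            }
            let first = v6.segments()[0];
            v6.is_loopback()
                || v6.is_unspecified()
                || (first & 0xfe00) == 0xfc00 // unique local, fc00::/7
                || (first & 0xffc0) == 0xfe80 // link-local, fe80::/10
        }
    }
}
```

A check like this is only effective when applied to the address the HTTP client actually connects to, including after each redirect; checking the hostname once before the request leaves room for DNS rebinding.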
### 1c. Filter empty content
- Remove articles where scraped body text is empty (scrape failure, soft 404, too old)
- Trace dropped articles in `article_history` with `status: filtered_empty`
### 1d. Filter against article history
- Hash each URL (normalized: lowercase, strip fragments/UTM params/trailing slashes)
- Query `article_history` for existing hashes → remove matches
- Trace dropped articles with `status: filtered_history`
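
The normalization and lookup can be illustrated with the sketch below. `normalize_url`, `url_hash`, and `filter_history` are hypothetical names, and `DefaultHasher` is used only to keep the example dependency-free: its output is not guaranteed stable across Rust releases, so hashes persisted in `article_history` would need a stable function instead (the document does not say which one is used).

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Lowercases, strips the fragment, drops utm_* parameters and trailing slashes.
fn normalize_url(url: &str) -> String {
    let lower = url.trim().to_lowercase();
    let without_fragment = lower.split('#').next().unwrap_or("");
    let (base, query) = match without_fragment.split_once('?') {
        Some((b, q)) => (b, q),
        None => (without_fragment, ""),
    };
    let kept: Vec<&str> = query
        .split('&')
        .filter(|p| !p.is_empty() && !p.starts_with("utm_"))
        .collect();
    let base = base.trim_end_matches('/');
    if kept.is_empty() {
        base.to_string()
    } else {
        format!("{}?{}", base, kept.join("&"))
    }
}

/// Hash of the normalized URL (illustrative hasher, see the note above).
fn url_hash(url: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    normalize_url(url).hash(&mut hasher);
    hasher.finish()
}

/// Splits URLs into (kept, dropped) given the hashes already in history.
fn filter_history(urls: Vec<String>, history: &HashSet<u64>) -> (Vec<String>, Vec<String>) {
    urls.into_iter()
        .partition(|u| !history.contains(&url_hash(u)))
}
```

In this sketch the `dropped` half is what would be traced with `status: filtered_history`.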
### 1e. Retry if under-filled
- Triggered when the number of valid articles < `categories × max_items_per_category` and history is enabled
- Re-scrape source pages for NEW links (exclude already-fetched URLs)
- Scrape + filter empty + filter history on retry candidates
- Merge with existing valid articles
- Only 1 retry attempt
### 1f. LLM classification
- Send articles (title + first 500 chars of body) + categories + "Autre" to LLM
- LLM returns `{assignments: [{index, category}]}` mapping each article to a category
- Overflow: articles that fit neither their target category nor "Autre" (both already at their limits) are collected in `all_overflow`
- **LLM call logged** with full prompt/response/timing
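
One way to read the overflow rule is sketched below. It assumes that "Autre" shares the same `max_items_per_category` limit as user categories and that an article whose target category is full falls back to "Autre" before overflowing; neither detail is spelled out above. `Assignment`, `has_room`, and `apply_assignments` are hypothetical names.

```rust
use std::collections::HashMap;

const AUTRE: &str = "Autre";

/// One entry of the LLM's `{assignments: [{index, category}]}` response.
struct Assignment {
    index: usize,
    category: String,
}

fn has_room(counts: &HashMap<String, usize>, category: &str, max: usize) -> bool {
    counts.get(category).copied().unwrap_or(0) < max
}

/// Places article indices into categories and returns the overflow indices.
fn apply_assignments(
    assignments: &[Assignment],
    max_items_per_category: usize,
    filled_counts: &mut HashMap<String, usize>,
    placed: &mut HashMap<String, Vec<usize>>,
) -> Vec<usize> {
    let mut overflow = Vec::new();
    for a in assignments {
        // Try the assigned category first, then fall back to "Autre".
        let target = if has_room(filled_counts, &a.category, max_items_per_category) {
            Some(a.category.as_str())
        } else if has_room(filled_counts, AUTRE, max_items_per_category) {
            Some(AUTRE)
        } else {
            None
        };
        match target {
            Some(c) => {
                *filled_counts.entry(c.to_string()).or_insert(0) += 1;
                placed.entry(c.to_string()).or_default().push(a.index);
            }
            None => overflow.push(a.index),
        }
    }
    overflow
}
```

Because `filled_counts` is passed in mutably, the same function could serve both phases, with Phase 2 starting from the counts Phase 1 left behind.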
### 1g. Enforce source diversity
- Count domains across all categories
- For any domain with more than `max_articles_per_source` articles, remove the excess articles
- Trace dropped articles with `status: filtered_diversity`
- Recount category fill levels
---
## Phase 2: Web Search Fallback
**Skipped if all user-defined categories are already filled to `max_items_per_category`.**
### 2a. Compute category gaps
- For each user category: `needed = max_items_per_category - already_filled`
- Only proceed if any category needs more
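
The gap computation is small enough to sketch directly. `compute_gaps` is a hypothetical name; the saturating subtraction is a defensive choice for the case where a category somehow holds more than its maximum.

```rust
use std::collections::HashMap;

/// Returns (category, needed) for every user category that is not yet full.
fn compute_gaps(
    categories: &[String],
    filled_counts: &HashMap<String, usize>,
    max_items_per_category: usize,
) -> Vec<(String, usize)> {
    categories
        .iter()
        .filter_map(|c| {
            let filled = filled_counts.get(c).copied().unwrap_or(0);
            let needed = max_items_per_category.saturating_sub(filled);
            if needed > 0 {
                Some((c.clone(), needed))
            } else {
                None
            }
        })
        .collect()
}
```

An empty result corresponds to the "skip Phase 2" case; a non-empty one supplies the per-category counts for the search prompt.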
### 2b. Load recent domains for diversity
- If `source_diversity_window > 0`: extract domains from last N syntheses' JSONB sections
- Used as soft "avoid if possible" instruction in search prompt
### 2c. LLM web search pass
- Build search prompt with theme, categories, gap counts ("find N articles for AI News, M for Cybersecurity"), recent domains to avoid, personalized source URLs
- Call `provider.generate_search_pass()` with web search tool enabled
- **LLM call logged** with full prompt/response/timing
- Returns structured JSON: `{category_0: [{title, url, summary}], category_1: [...]}`
### 2d. Filter pipeline on search results
- **Parse** LLM output into `(category_key, Vec<NewsItem>)`
- **Filter homepage URLs**: drop articles whose path is `/` or empty
- **Cross-phase dedup**: drop URLs already seen in Phase 1
- **Dedup by URL**: drop duplicate URLs within Phase 2 (case-insensitive)
- **Limit articles per source**: enforce `max_articles_per_source` per domain (spread across categories first, then fill)
- **Filter against article history**: runs BEFORE scraping (saves HTTP requests) and drops already-seen URLs
- Each drop traced in `article_history` with appropriate status
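
The first three URL-level filters can be sketched as a single pass. `DropReason`, `is_homepage`, and `filter_search_results` are hypothetical names, the status strings recorded for these drops are not given above, and comparing Phase 1 URLs case-insensitively is an assumption.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum DropReason {
    Homepage,
    CrossPhase,
    Duplicate,
}

/// True when the URL path is `/` or empty (query and fragment ignored).
fn is_homepage(url: &str) -> bool {
    let rest = url.split("://").nth(1).unwrap_or(url);
    let rest = rest.split(|c: char| c == '?' || c == '#').next().unwrap_or("");
    match rest.find('/') {
        Some(i) => rest[i..].trim_matches('/').is_empty(),
        None => true,
    }
}

/// Applies the homepage, cross-phase, and within-phase dedup filters in order.
/// `seen_in_phase1` is expected to hold lowercased URLs.
fn filter_search_results(
    urls: Vec<String>,
    seen_in_phase1: &HashSet<String>,
) -> (Vec<String>, Vec<(String, DropReason)>) {
    let mut seen_in_phase2 = HashSet::new();
    let mut kept = Vec::new();
    let mut dropped = Vec::new();
    for url in urls {
        let key = url.to_lowercase();
        if is_homepage(&url) {
            dropped.push((url, DropReason::Homepage));
        } else if seen_in_phase1.contains(&key) {
            dropped.push((url, DropReason::CrossPhase));
        } else if !seen_in_phase2.insert(key) {
            dropped.push((url, DropReason::Duplicate));
        } else {
            kept.push(url);
        }
    }
    (kept, dropped)
}
```

Returning the reason alongside each dropped URL keeps the tracing step separate from the filtering logic.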
### 2e. Scrape web search results
- Same scraping as Phase 1 (bounded concurrency, SSRF check, optional LLM extraction)
- Filter empty content (scrape failures, soft 404, too old)
- Trace drops
### 2f. LLM classification
- Same as Phase 1 classification but with Phase 2 articles
- `filled_counts` carries over from Phase 1 (some categories are already partially filled)
- Overflow collected
- **LLM call logged**
- Merge results into `all_scraped`
---
## "Autre" Fill-Up
- Count total articles across all categories
- Target = `75% × (categories × max_items_per_category)` (user categories only, "Autre" excluded from denominator)
- If shortfall (`target - total`) > 0 and overflow exists:
  - For each overflow article, check that its domain is under the `max_articles_per_source` limit
  - Add it to `all_scraped["category_autre"]`, up to the shortfall
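
The fill-up arithmetic can be sketched as follows. It assumes the 75% target is rounded down and that per-domain counts are tracked in a map; `OverflowArticle`, `fill_up_target`, and `autre_fill_up` are hypothetical names. For example, 4 user categories with 3 items each give a target of 9.

```rust
use std::collections::HashMap;

struct OverflowArticle {
    url: String,
    domain: String,
}

/// 75% of the user-category capacity; integer division rounds down (assumption).
fn fill_up_target(category_count: usize, max_items_per_category: usize) -> usize {
    category_count * max_items_per_category * 3 / 4
}

/// Picks overflow articles for "Autre" until the shortfall is covered,
/// skipping domains already at `max_articles_per_source`.
fn autre_fill_up(
    total_articles: usize,
    category_count: usize,
    max_items_per_category: usize,
    overflow: Vec<OverflowArticle>,
    domain_counts: &mut HashMap<String, usize>,
    max_articles_per_source: usize,
) -> Vec<OverflowArticle> {
    let target = fill_up_target(category_count, max_items_per_category);
    let mut shortfall = target.saturating_sub(total_articles);
    let mut added = Vec::new();
    for article in overflow {
        if shortfall == 0 {
            break;
        }
        let count = domain_counts.entry(article.domain.clone()).or_insert(0);
        if *count < max_articles_per_source {
            *count += 1;
            shortfall -= 1;
            added.push(article);
        }
    }
    added
}
```
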
---
## Combined Rewrite Pass
- **Fail if no articles** — return error if all categories are empty
- **Build rewrite prompt** — serialize all scraped articles with body content, instruct LLM to rewrite title + summary (4-5 lines) faithfully based on scraped content
- **Build rewrite schema** — `minItems`/`maxItems` set to ACTUAL count per category (not user max), empty categories omitted, "Autre" included if non-empty
- **LLM rewrite pass** — call `provider.generate_rewrite_pass()` with writing model
- **LLM call logged** with full prompt/response/timing
- **Build final sections** — map `category_N` keys to user category names, add "Autre" section if present, omit empty categories
- **Restore scraped URLs** — replace any hallucinated URLs from LLM rewrite with the validated scraped URLs (matched by category + position)
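
The URL restoration step might look like the sketch below, which assumes the rewrite output and the scraped articles are both keyed by `category_N` and that the LLM preserved item order within each category. `Item` and `restore_scraped_urls` are hypothetical names.

```rust
use std::collections::HashMap;

struct Item {
    title: String,
    url: String,
}

/// Overwrites each rewritten item's URL with the scraped URL at the same
/// position in the same category, discarding whatever URL the LLM returned.
fn restore_scraped_urls(
    rewritten: &mut HashMap<String, Vec<Item>>,
    scraped_urls: &HashMap<String, Vec<String>>,
) {
    for (category, items) in rewritten.iter_mut() {
        if let Some(urls) = scraped_urls.get(category) {
            for (item, url) in items.iter_mut().zip(urls) {
                item.url = url.clone();
            }
        }
    }
}
```
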
---
## Save + Record
- **Sanitize** — strip `\u0000` null bytes from JSON (PostgreSQL rejects them in JSONB)
- **Save synthesis** — insert into `syntheses` table with `job_id`, `week` (ISO week), `sections` (JSONB), `status: completed`
- **Record used articles** — insert each article URL into `article_history` with `status: used`, `synthesis_id`, `job_id`, and category name (for future dedup + provenance)
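
A minimal sketch of the sanitize step, assuming it is applied to the decoded string fields before serialization rather than to the serialized JSON text. `strip_null_bytes` and `sanitize_item` are hypothetical names; the `NewsItem` fields mirror the `{title, url, summary}` shape returned by the search pass.

```rust
struct NewsItem {
    title: String,
    url: String,
    summary: String,
}

/// Removes U+0000 characters, which PostgreSQL cannot store in JSONB strings.
fn strip_null_bytes(value: &str) -> String {
    value.chars().filter(|c| *c != '\0').collect()
}

/// Sanitizes every text field of an item in place.
fn sanitize_item(item: &mut NewsItem) {
    item.title = strip_null_bytes(&item.title);
    item.url = strip_null_bytes(&item.url);
    item.summary = strip_null_bytes(&item.summary);
}
```
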
---
## Summary of LLM Calls (up to 4 per generation)
| # | Call | When | Model |
|---|---|---|---|
| 1 | Classification Phase 1 | After Phase 1 scraping | research |
| 2 | Web Search | Phase 2 start | research |
| 3 | Classification Phase 2 | After Phase 2 scraping | research |
| 4 | Rewrite | After both phases | writing |

Optional additional calls: one per source page for LLM link extraction and one per article for LLM article extraction (when those settings are enabled).

## Summary of Filtering Steps
| Step | Phase | What's dropped |
|---|---|---|
| Empty content | 1 & 2 | Scrape failures, soft 404s, too old |
| Article history | 1 & 2 | Already used in previous syntheses |
| Homepage URLs | 2 | Path is `/` or empty |
| Cross-phase dedup | 2 | URLs already found in Phase 1 |
| URL dedup | 2 | Duplicate URLs within Phase 2 |
| Source diversity | 1 & 2 | Domain exceeds `max_articles_per_source` |
| Category overflow | 1 & 2 | Category + "Autre" both full |