# Synthesis Generation Pipeline — Full Algorithm

## Initialization

1. **Load user settings** from the DB (categories, provider, models, max_items, etc.)
2. **Cleanup**: delete article history entries older than N days (dropped entries only) and truncate old LLM call logs
3. **Validate**: fail if no categories are configured
4. **Load user sources** (personalized URLs like `https://openai.com/blog`)
5. **Resolve LLM provider**: decrypt the user's API key, create a provider instance (`Arc<dyn LlmProvider>`)
6. **Resolve models**: research model + writing model (user override or admin default)
7. **Set up rate limiter**: per-user or global provider limiter
8. **Prepare LLM scraping option**: if `use_llm_for_article_extraction` is enabled, clone provider + model for concurrent use
9. **Initialize tracking structures**: `filled_counts` (per-category article count), `all_scraped` (category → articles), `all_overflow` (dropped overflow), `seen_urls` (cross-phase dedup), classification categories (user categories + "Autre")
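
The tracking structures from step 9 could be grouped roughly as below. This is only a sketch: the `Article` and `PipelineState` types, their field types, and the `new` constructor are assumptions made for illustration, and the actual code may organize this state differently.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical article record; the real type likely carries more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Article {
    pub url: String,
    pub title: String,
    pub body: String,
}

/// Hypothetical container for the state shared by Phase 1, Phase 2 and the "Autre" fill-up.
#[derive(Debug, Default)]
pub struct PipelineState {
    /// Per-category article count.
    pub filled_counts: HashMap<String, usize>,
    /// Category -> accepted articles.
    pub all_scraped: HashMap<String, Vec<Article>>,
    /// Articles that fit neither their category nor "Autre".
    pub all_overflow: Vec<Article>,
    /// URLs already handled, used for cross-phase dedup.
    pub seen_urls: HashSet<String>,
    /// User categories plus the catch-all "Autre".
    pub classification_categories: Vec<String>,
}

impl PipelineState {
    pub fn new(user_categories: &[String]) -> Self {
        let mut classification_categories = user_categories.to_vec();
        classification_categories.push("Autre".to_string());
        // Every category starts with zero accepted articles.
        let filled_counts = classification_categories
            .iter()
            .map(|c| (c.clone(), 0))
            .collect();
        Self {
            filled_counts,
            classification_categories,
            ..Default::default()
        }
    }
}
```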

---

## Phase 1: Personalized Sources

**Skipped entirely if the user has 0 sources.**

### 1a. Extract article links from source pages

- For each source (max 10), fetch the source page HTML
- If `use_llm_for_source_links` is enabled: send the HTML `<head>` + the first 8000 chars of `<body>` to the LLM to extract article URLs (falls back to the heuristic if the LLM fails)
- Otherwise: parse HTML `<a href>` links; keep same-domain links with a non-homepage path; exclude `/tag/`, `/login/`, static assets, etc.
- Over-fetch: `2 × max_articles_per_source` candidates per source
- Deduplicate candidate URLs
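
The heuristic link filter above might look something like the following. It is a minimal sketch using only the standard library: `split_url`, `is_candidate_link`, and the exact exclusion lists are hypothetical, hrefs are assumed to be already resolved to absolute URLs, and "same-domain" is simplified to an exact host match.

```rust
/// Splits an absolute http(s) URL into (host, path), both lowercased.
/// Returns None for anything else (relative links, mailto:, etc.).
fn split_url(url: &str) -> Option<(String, String)> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    // Drop the query string and fragment before splitting host from path.
    let rest = rest.split(|c| c == '?' || c == '#').next().unwrap_or("");
    let (host, path) = match rest.find('/') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, "/"),
    };
    if host.is_empty() {
        return None;
    }
    Some((host.to_lowercase(), path.to_lowercase()))
}

/// Returns true if `href` looks like an article link on `source_host`.
pub fn is_candidate_link(source_host: &str, href: &str) -> bool {
    // Illustrative lists; the real filter excludes more patterns ("etc." in the text).
    const EXCLUDED_SEGMENTS: [&str; 2] = ["/tag/", "/login/"];
    const STATIC_EXTENSIONS: [&str; 6] = [".css", ".js", ".png", ".jpg", ".svg", ".ico"];

    let (host, path) = match split_url(href) {
        Some(parts) => parts,
        None => return false,
    };
    if host != source_host.to_lowercase() {
        return false; // different domain
    }
    if path.trim_matches('/').is_empty() {
        return false; // homepage
    }
    // Append '/' so that "/login" is matched by "/login/".
    let with_slash = format!("{}/", path.trim_end_matches('/'));
    if EXCLUDED_SEGMENTS.iter().any(|s| with_slash.contains(s)) {
        return false;
    }
    !STATIC_EXTENSIONS.iter().any(|ext| path.ends_with(ext))
}
```

With these assumptions, `is_candidate_link("openai.com", "https://openai.com/blog/new-model")` is accepted, while the homepage, `/tag/...` pages, and stylesheet URLs on the same host are rejected.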

### 1b. Scrape candidate articles

- Fetch each URL (bounded concurrency: 5 with LLM extraction, 10 without)
- SSRF check (no private IPs), 15s timeout, 5MB body limit
- If `use_llm_for_article_extraction` is enabled: send `<head>` + body text to the LLM to extract title, date, and body and to detect error pages (falls back to the heuristic if the LLM fails)
- Otherwise: HTML parsing heuristics for title (`<title>`, `og:title`), date (meta tags, JSON-LD, `<time>`), body (strip scripts/nav), soft-404 detection
- Capture the final URL after redirects (canonical URL)
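
For the SSRF check, one plausible shape is a predicate over an already-resolved IP address. The sketch below is an assumption about how "no private IPs" could be enforced with the standard library; the function name `is_ssrf_safe` and the exact set of blocked ranges are illustrative, not taken from the pipeline.

```rust
use std::net::IpAddr;

/// Returns true if `ip` is acceptable as a fetch target.
/// Blocks private, loopback, link-local and unspecified addresses.
pub fn is_ssrf_safe(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => {
            !(v4.is_private()
                || v4.is_loopback()
                || v4.is_link_local()
                || v4.is_unspecified()
                || v4.is_broadcast())
        }
        IpAddr::V6(v6) => {
            // IPv4-mapped addresses (::ffff:a.b.c.d) are judged by their IPv4 part.
            if let Some(v4) = v6.to_ipv4_mapped() {
                return is_ssrf_safe(IpAddr::V4(v4));
            }
            let first = v6.segments()[0];
            let unique_local = (first & 0xfe00) == 0xfc00; // fc00::/7
            let link_local = (first & 0xffc0) == 0xfe80; // fe80::/10
            !(v6.is_loopback() || v6.is_unspecified() || unique_local || link_local)
        }
    }
}
```

A check like this is only meaningful when it runs on the address the HTTP client actually connects to, including after redirects.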

### 1c. Filter empty content

- Remove articles whose scraped body text is empty (scrape failure, soft 404, too old)
- Trace dropped articles in `article_history` with `status: filtered_empty`

### 1d. Filter against article history

- Hash each normalized URL (lowercased; fragments, UTM params, and trailing slashes stripped)
- Query `article_history` for existing hashes and remove matches
- Trace dropped articles with `status: filtered_history`
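
The normalization and hashing step can be sketched as follows. The function names are hypothetical, and `DefaultHasher` is used only to keep the example dependency-free: its output is not guaranteed to be stable across Rust releases, so a persisted history table would more plausibly use a fixed digest such as SHA-256.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Lowercases the URL and strips the fragment, `utm_*` query parameters
/// and trailing slashes.
pub fn normalize_url(url: &str) -> String {
    let lower = url.trim().to_lowercase();
    // Everything after '#' is the fragment.
    let without_fragment = lower.split('#').next().unwrap_or("");
    let (base, query) = match without_fragment.split_once('?') {
        Some((b, q)) => (b, q),
        None => (without_fragment, ""),
    };
    let base = base.trim_end_matches('/');
    // Keep only non-tracking query parameters, in their original order.
    let kept: Vec<&str> = query
        .split('&')
        .filter(|p| !p.is_empty() && !p.starts_with("utm_"))
        .collect();
    if kept.is_empty() {
        base.to_string()
    } else {
        format!("{}?{}", base, kept.join("&"))
    }
}

/// Hash of the normalized URL (illustrative, see the note on `DefaultHasher`).
pub fn url_hash(url: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    normalize_url(url).hash(&mut hasher);
    hasher.finish()
}
```

Under this sketch, `https://Example.com/Post/?utm_source=x&id=3#top` normalizes to `https://example.com/post?id=3`, so two links that differ only in tracking parameters or fragments share a hash.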

### 1e. Retry if under-filled

- Triggered when valid articles < `categories × max_items_per_category` and history is enabled
- Re-scrape source pages for NEW links (exclude already-fetched URLs)
- Scrape, filter empty content, and filter against history on the retry candidates
- Merge with the existing valid articles
- Only 1 retry attempt

### 1f. LLM classification

- Send articles (title + first 500 chars of body) + categories + "Autre" to the LLM
- The LLM returns `{assignments: [{index, category}]}`, mapping each article to a category
- Overflow: articles that exceed both the target category limit AND the "Autre" limit are collected in `all_overflow`
- **LLM call logged** with full prompt/response/timing
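
How the returned assignments turn into category placements and overflow could be sketched like this. The `Assignment` struct mirrors the `{index, category}` shape above, but `apply_assignments`, `has_room`, and the choice to cap "Autre" at the same `max_per_category` limit are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// One entry of the LLM's `assignments` array.
pub struct Assignment {
    pub index: usize,
    pub category: String,
}

fn has_room(counts: &HashMap<String, usize>, category: &str, max: usize) -> bool {
    counts.get(category).copied().unwrap_or(0) < max
}

/// Places each article index in its target category, then in "Autre" if the
/// target is full. Returns the indices that fit in neither (the overflow).
pub fn apply_assignments(
    assignments: &[Assignment],
    max_per_category: usize,
    filled_counts: &mut HashMap<String, usize>,
    placed: &mut HashMap<String, Vec<usize>>,
) -> Vec<usize> {
    let mut overflow = Vec::new();
    for a in assignments {
        let target = if has_room(filled_counts, &a.category, max_per_category) {
            Some(a.category.as_str())
        } else if has_room(filled_counts, "Autre", max_per_category) {
            Some("Autre")
        } else {
            None
        };
        match target {
            Some(category) => {
                *filled_counts.entry(category.to_string()).or_insert(0) += 1;
                placed.entry(category.to_string()).or_default().push(a.index);
            }
            None => overflow.push(a.index),
        }
    }
    overflow
}
```

Because `filled_counts` is passed in mutably, the same function can be reused in a later phase with the counts left over from this one.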

### 1g. Enforce source diversity

- Count domains across all categories
- Remove articles whose domain exceeds `max_articles_per_source`
- Trace dropped articles with `status: filtered_diversity`
- Recount category fill levels
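
A per-domain cap of this kind can be sketched as a single pass that keeps the first `max_per_source` URLs seen for each domain. The helper names, the "first come, first kept" ordering, and the treatment of `www.` as part of the same domain are assumptions; the real step works on articles grouped by category and also records the drops.

```rust
use std::collections::HashMap;

/// Extracts a lowercased host, ignoring a leading "www.".
fn domain_of(url: &str) -> String {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))
        .unwrap_or(url);
    let host = rest
        .split(|c| c == '/' || c == '?' || c == '#')
        .next()
        .unwrap_or("");
    host.to_lowercase().trim_start_matches("www.").to_string()
}

/// Keeps at most `max_per_source` URLs per domain, in input order.
/// Returns (kept, dropped).
pub fn enforce_diversity(urls: &[String], max_per_source: usize) -> (Vec<String>, Vec<String>) {
    let mut per_domain: HashMap<String, usize> = HashMap::new();
    let mut kept = Vec::new();
    let mut dropped = Vec::new();
    for url in urls {
        let count = per_domain.entry(domain_of(url)).or_insert(0);
        if *count < max_per_source {
            *count += 1;
            kept.push(url.clone());
        } else {
            // In the pipeline this is where `filtered_diversity` would be traced.
            dropped.push(url.clone());
        }
    }
    (kept, dropped)
}
```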

---

## Phase 2: Web Search Fallback

**Skipped if all user-defined categories are already filled to `max_items_per_category`.**

### 2a. Compute category gaps

- For each user category: `needed = max_items_per_category - filled_counts[category]`
- Only proceed if at least one category needs more articles
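
The gap computation is small enough to sketch directly. The function name and the return shape (a list of `(category, needed)` pairs) are assumptions; `saturating_sub` guards against a category that somehow holds more than the maximum.

```rust
use std::collections::HashMap;

/// Returns the user categories that still need articles, with the number needed.
/// An empty result means Phase 2 can be skipped.
pub fn compute_gaps(
    user_categories: &[String],
    filled_counts: &HashMap<String, usize>,
    max_items_per_category: usize,
) -> Vec<(String, usize)> {
    user_categories
        .iter()
        .filter_map(|category| {
            let filled = filled_counts.get(category).copied().unwrap_or(0);
            let needed = max_items_per_category.saturating_sub(filled);
            if needed > 0 {
                Some((category.clone(), needed))
            } else {
                None
            }
        })
        .collect()
}
```

For example, with a maximum of 3, a full "AI News" category and a "Cybersecurity" category holding 1 article, the only gap is 2 articles for "Cybersecurity".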

### 2b. Load recent domains for diversity

- If `source_diversity_window > 0`: extract domains from the JSONB sections of the last N syntheses
- Used as a soft "avoid if possible" instruction in the search prompt

### 2c. LLM web search pass

- Build the search prompt with theme, categories, gap counts ("find N articles for AI News, M for Cybersecurity"), recent domains to avoid, and personalized source URLs
- Call `provider.generate_search_pass()` with the web search tool enabled
- **LLM call logged** with full prompt/response/timing
- Returns structured JSON: `{category_0: [{title, url, summary}], category_1: [...]}`

### 2d. Filter pipeline on search results

- **Parse** LLM output into `(category_key, Vec<NewsItem>)`
- **Filter homepage URLs**: drop articles whose path is `/` or empty
- **Cross-phase dedup**: drop URLs already seen in Phase 1
- **Dedup by URL**: drop duplicate URLs within Phase 2 (case-insensitive)
- **Limit articles per source**: enforce `max_articles_per_source` per domain (spread across categories first, then fill)
- **Filter against article history**: done BEFORE scraping (saves HTTP requests); drop already-seen URLs
- Each drop is traced in `article_history` with the appropriate status
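
The first three URL filters (homepage, cross-phase dedup, in-phase dedup) can be combined in one pass, sketched below. The function names are hypothetical, and the sketch assumes `seen_urls` stores lowercased URLs so that one set lookup covers both dedup steps; the per-source limit and history filter are left out.

```rust
use std::collections::HashSet;

/// True when the URL has no path or only "/".
fn is_homepage(url: &str) -> bool {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))
        .unwrap_or(url);
    // Ignore the query string and fragment when looking at the path.
    let rest = rest.split(|c| c == '?' || c == '#').next().unwrap_or("");
    let path = match rest.find('/') {
        Some(i) => &rest[i..],
        None => "",
    };
    path.is_empty() || path == "/"
}

/// Drops homepage URLs and URLs already present in `seen_urls`
/// (from Phase 1 or earlier in this list), comparing case-insensitively.
pub fn filter_search_urls(urls: &[String], seen_urls: &mut HashSet<String>) -> Vec<String> {
    let mut kept = Vec::new();
    for url in urls {
        if is_homepage(url) {
            continue;
        }
        // `insert` returns false when the key was already present.
        if seen_urls.insert(url.to_lowercase()) {
            kept.push(url.clone());
        }
    }
    kept
}
```

Running the cheap string-level filters first keeps the later, more expensive steps (history lookup, scraping) working on a smaller candidate list.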

### 2e. Scrape web search results

- Same scraping as Phase 1 (bounded concurrency, SSRF check, optional LLM extraction)
- Filter empty content (scrape failures, soft 404s, too old)
- Trace drops

### 2f. LLM classification

- Same as the Phase 1 classification, applied to Phase 2 articles
- `filled_counts` carries over from Phase 1, so some categories are already partially filled
- Overflow is collected in `all_overflow`
- **LLM call logged**
- Merge results into `all_scraped`

---

## "Autre" Fill-Up

- Count total articles across all categories
- Target = `75% × (categories × max_items_per_category)` (user categories only; "Autre" is excluded from the denominator)
- If shortfall > 0 and overflow exists:
  - For each overflow article: check that its domain is under the `max_articles_per_source` limit
  - Add to `all_scraped["category_autre"]` up to the shortfall
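
As a worked instance of the target formula: with 4 user categories and `max_items_per_category = 3`, capacity is 12 and the 75% target is 9, so a synthesis holding 5 articles has a shortfall of 4. The sketch below computes this in integer arithmetic; the function name is hypothetical, and rounding the target up is an assumption, since the text does not say how fractional targets are handled.

```rust
/// Number of overflow articles the "Autre" fill-up may still add.
pub fn autre_shortfall(
    total_articles: usize,
    user_category_count: usize,
    max_items_per_category: usize,
) -> usize {
    // "Autre" is excluded from the capacity, as in the formula above.
    let capacity = user_category_count * max_items_per_category;
    // 75% of capacity, rounded up (assumption): ceil(3c / 4) = (3c + 3) / 4.
    let target = (capacity * 3 + 3) / 4;
    target.saturating_sub(total_articles)
}
```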

---

## Combined Rewrite Pass

- **Fail if no articles**: return an error if all categories are empty
- **Build rewrite prompt**: serialize all scraped articles with body content; instruct the LLM to rewrite title + summary (4-5 lines) faithfully, based on the scraped content
- **Build rewrite schema**: `minItems`/`maxItems` set to the ACTUAL count per category (not the user max); empty categories omitted; "Autre" included if non-empty
- **LLM rewrite pass**: call `provider.generate_rewrite_pass()` with the writing model
- **LLM call logged** with full prompt/response/timing
- **Build final sections**: map `category_N` keys to user category names, add an "Autre" section if present, omit empty categories
- **Restore scraped URLs**: replace any hallucinated URLs from the LLM rewrite with the validated scraped URLs (matched by category + position)
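
The URL restoration step relies on the schema pinning each category to its actual item count, so position `i` in the rewrite corresponds to position `i` in the scraped list. A minimal sketch of that matching, with a hypothetical `RewrittenItem` type and function name:

```rust
use std::collections::HashMap;

/// Hypothetical shape of one rewritten item returned by the LLM.
#[derive(Debug, Clone, PartialEq)]
pub struct RewrittenItem {
    pub title: String,
    pub url: String,
    pub summary: String,
}

/// Overwrites every rewritten URL with the scraped URL at the same
/// category and position, discarding whatever URL the LLM produced.
pub fn restore_scraped_urls(
    rewritten: &mut HashMap<String, Vec<RewrittenItem>>,
    scraped_urls: &HashMap<String, Vec<String>>,
) {
    for (category, items) in rewritten.iter_mut() {
        if let Some(urls) = scraped_urls.get(category) {
            // `zip` stops at the shorter list, so extra items are left untouched.
            for (item, url) in items.iter_mut().zip(urls) {
                item.url = url.clone();
            }
        }
    }
}
```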

---

## Save + Record

- **Sanitize**: strip `\u0000` null bytes from the JSON (PostgreSQL rejects them in JSONB)
- **Save synthesis**: insert into the `syntheses` table with `job_id`, `week` (ISO week), `sections` (JSONB), `status: completed`
- **Record used articles**: insert each article URL into `article_history` with `status: used`, `synthesis_id`, `job_id`, and category name (for future dedup + provenance)
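
The sanitize step can be illustrated on a single decoded string value. This is a sketch under the assumption that sanitization is applied to each text field (title, summary, URL) before serialization; the function name is hypothetical, and the real code may instead walk the whole JSON value.

```rust
/// Removes NUL characters (U+0000) from a decoded string value.
pub fn strip_nul(text: &str) -> String {
    text.chars().filter(|&c| c != '\0').collect()
}
```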

---

## Summary of LLM Calls (up to 4 per generation)

| # | Call | When | Model |
|---|---|---|---|
| 1 | Classification Phase 1 | After Phase 1 scraping | research |
| 2 | Web Search | Phase 2 start | research |
| 3 | Classification Phase 2 | After Phase 2 scraping | research |
| 4 | Rewrite | After both phases | writing |

Plus, optionally, per-source calls for LLM link extraction and per-article calls for LLM article extraction (when those settings are enabled).

## Summary of Filtering Steps

| Step | Phase | What's dropped |
|---|---|---|
| Empty content | 1 & 2 | Scrape failures, soft 404s, too old |
| Article history | 1 & 2 | Already used in previous syntheses |
| Homepage URLs | 2 | Path is `/` or empty |
| Cross-phase dedup | 2 | URLs already found in Phase 1 |
| URL dedup | 2 | Duplicate URLs within Phase 2 |
| Source diversity | 1 & 2 | Domain exceeds `max_articles_per_source` |
| Category overflow | 1 & 2 | Category + "Autre" both full |