Design: Source Priority Pipeline — Personalized Sources First, Web Search Fallback
Date: 2026-03-24
Scope: Redesign the synthesis generation pipeline to prioritize personalized sources with scraping, and fall back to web search for gaps
Context
The current pipeline sends a single LLM call that mixes personalized sources and web search together. There is no prioritization, no retry when articles fail validation, and no fallback mechanism. The LLM decides freely which sources to use, often ignoring personalized ones.
New Pipeline (Two-Phase)
Phase 1: Personalized Sources (scrape-based, no LLM for discovery)
Skipped entirely if the user has 0 configured sources. Proceeds directly to Phase 2.
- For each user source URL (e.g., https://openai.com/blog), scrape the page and extract article links (max 10 sources processed, to bound scraping work)
- Filter links: same domain only, non-empty path (not just /), exclude non-article patterns
- Normalize and deduplicate URLs, fetch up to 2 × max_articles_per_source candidates per source (over-fetch to compensate for validation failures)
- Scrape each candidate article (existing scraper: validate date, soft 404, content)
- Filter out articles with empty scraped content (too old, failed, soft 404)
- LLM classification call: send articles (title + first 500 chars of body) + user categories + "Autre" → LLM returns article-to-category mapping
- Fill categories from the mapping, respecting max_items_per_category per category (including "Autre")
- Trim excess: after classification, enforce max_articles_per_source per domain across all categories
If all source scrapes fail (network errors, JS-rendered sites, etc.), Phase 1 produces 0 articles. Pipeline falls through to Phase 2 cleanly.
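The final trimming step of Phase 1 (at most max_articles_per_source articles per domain, counted across all categories) can be sketched as follows. This is a minimal sketch, not the pipeline's code: the `Classified` struct, `host_of`, and `trim_per_source` are hypothetical names, the host is extracted with plain string handling rather than a URL parser, and keeping the earliest articles in input order is an assumption, since the design does not say which excess articles are dropped.

```rust
use std::collections::HashMap;

/// Hypothetical minimal article record; the real pipeline has its own types.
#[derive(Debug, Clone, PartialEq)]
struct Classified {
    category: String,
    url: String,
}

/// Extracts the host part of an absolute URL with plain string handling.
/// Simplification: a real implementation would likely use a URL parser.
fn host_of(url: &str) -> &str {
    let rest = url.split_once("://").map(|(_, r)| r).unwrap_or(url);
    rest.split(|c: char| c == '/' || c == '?' || c == '#')
        .next()
        .unwrap_or("")
}

/// Keeps at most `max_per_source` articles per host, in input order,
/// counting across all categories (assumption: earliest articles win).
fn trim_per_source(articles: Vec<Classified>, max_per_source: usize) -> Vec<Classified> {
    let mut seen: HashMap<String, usize> = HashMap::new();
    articles
        .into_iter()
        .filter(|a| {
            let count = seen
                .entry(host_of(&a.url).to_ascii_lowercase())
                .or_insert(0);
            *count += 1;
            *count <= max_per_source
        })
        .collect()
}
```

Because the count is kept per host rather than per category, a domain that already contributed its quota to one category cannot contribute further articles to another.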
Phase 2: Web Search Fallback (LLM-based)
Only runs if any user-defined category is still under max_items_per_category after Phase 1. ("Autre" does not trigger Phase 2 — it only collects overflow.)
- Compute category gaps: for each user-defined category, needed = max_items_per_category - already_filled
- Run the LLM search pass with a modified prompt: include the gap counts per category ("find N articles for AI News, M articles for Cybersecurity")
- Apply existing filters: filter_homepage_urls, dedup_by_url (cross-phase: dedup against Phase 1 URLs), limit_articles_per_source (cross-phase: count Phase 1 domains)
- Scrape + validate web search results (existing scraper)
- Filter out articles with empty scraped content
- LLM classification call (same function as Phase 1): classify web search articles into remaining category gaps (including "Autre" for overflow)
- Fill remaining category slots, respecting limits
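The gap computation and the gap-aware prompt fragment might look like the sketch below. The function names `category_gaps` and `gap_instruction` are hypothetical, and the prompt wording is illustrative only; the actual change is described as a modification of build_search_prompt.

```rust
/// Computes how many articles each user-defined category still needs.
/// `filled` holds (category name, count already filled in Phase 1).
/// Categories that are already full are left out of the result.
fn category_gaps(
    filled: &[(String, usize)],
    max_items_per_category: usize,
) -> Vec<(String, usize)> {
    filled
        .iter()
        // "Autre" never triggers Phase 2, so it has no gap.
        .filter(|(name, _)| name.as_str() != "Autre")
        .map(|(name, count)| {
            (name.clone(), max_items_per_category.saturating_sub(*count))
        })
        .filter(|(_, needed)| *needed > 0)
        .collect()
}

/// Renders the gaps as a prompt fragment (illustrative wording only).
fn gap_instruction(gaps: &[(String, usize)]) -> String {
    let parts: Vec<String> = gaps
        .iter()
        .map(|(name, needed)| format!("{} articles for {}", needed, name))
        .collect();
    format!("Find {}.", parts.join(", "))
}
```

Under this sketch, Phase 2 runs exactly when `category_gaps` returns a non-empty list.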
Combined Rewrite Pass
After both phases, merge all classified articles into a single HashMap<String, Vec<ScrapedNewsItem>> keyed by category. Run the rewrite pass on the combined set. The rewrite schema uses actual item counts per category. Categories with 0 articles are omitted from the schema (no hallucinated articles).
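A minimal sketch of the merge step, assuming each phase produces its own per-category map. `ScrapedNewsItem` is reduced here to a stand-in with only a URL, and `merge_phases` and `item_counts` are hypothetical helper names, not functions from the codebase.

```rust
use std::collections::HashMap;

/// Stand-in for the pipeline's `ScrapedNewsItem`; only the URL is modeled.
#[derive(Debug, Clone, PartialEq)]
struct ScrapedNewsItem {
    url: String,
}

/// Merges the per-category results of both phases and drops empty
/// categories, so a rewrite schema is only built for real articles.
fn merge_phases(
    phase1: HashMap<String, Vec<ScrapedNewsItem>>,
    phase2: HashMap<String, Vec<ScrapedNewsItem>>,
) -> HashMap<String, Vec<ScrapedNewsItem>> {
    let mut merged = phase1;
    for (category, items) in phase2 {
        merged.entry(category).or_default().extend(items);
    }
    // Categories with 0 articles are omitted entirely.
    merged.retain(|_, items| !items.is_empty());
    merged
}

/// Actual item counts per category, sorted by name for stable output.
fn item_counts(merged: &HashMap<String, Vec<ScrapedNewsItem>>) -> Vec<(String, usize)> {
    let mut counts: Vec<(String, usize)> = merged
        .iter()
        .map(|(category, items)| (category.clone(), items.len()))
        .collect();
    counts.sort();
    counts
}
```

Dropping empty categories before the schema is built means the rewrite pass is never asked to produce an item for a category that has nothing to rewrite.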
"Autre" Default Category
- Always exists as a fallback classification category, regardless of user settings
- Articles that don't fit any user-defined category are assigned to "Autre"
- Capped at max_items_per_category (same limit as user categories)
- Only included in the final synthesis if it has articles (not shown when empty)
- Not a user setting; hardcoded in the pipeline
- Uses category key category_autre in the internal data structures
- Included in build_rewrite_schema, build_final_sections, and restore_scraped_urls when it has articles
- limit_articles_per_source and dedup_by_url treat "Autre" articles the same as any other category
Source Page Scraping (new module: source_scraper.rs)
Fetches a source URL and extracts article links:
- Fetch page HTML (reuse existing scraper HTTP client with 15s timeout)
- Extract all <a href> links using the scraper crate (already a dependency)
- Filter:
  - Same domain only (no external links)
  - Path must be non-empty and not just / (allows single-segment paths like /my-article)
  - Exclude patterns: /tag/, /category/, /author/, /page/, /login, /signup, /privacy, /terms, /search, /contact
  - Exclude static assets: .css, .js, .png, .jpg, .gif, .svg, .pdf, .zip, .xml
- Normalize URLs (resolve relative paths against base URL, deduplicate)
- Limit to 2 × max_articles_per_source per source (over-fetch)
- Return Vec<String> of candidate article URLs
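The filtering, deduplication, and over-fetch limit above can be sketched with the standard library alone. All function names here are hypothetical, and the sketch rests on several assumptions: links are already resolved to absolute URLs, the host must match exactly (no www. normalization), exclusion patterns are matched as substrings of the path, and deduplication only ignores a trailing slash. HTML parsing and relative URL resolution are left out.

```rust
use std::collections::HashSet;

/// Path fragments that indicate non-article pages.
const EXCLUDED_PATTERNS: [&str; 10] = [
    "/tag/", "/category/", "/author/", "/page/", "/login",
    "/signup", "/privacy", "/terms", "/search", "/contact",
];

/// Static asset extensions that are never articles.
const ASSET_EXTENSIONS: [&str; 9] = [
    ".css", ".js", ".png", ".jpg", ".gif", ".svg", ".pdf", ".zip", ".xml",
];

/// Splits an absolute http(s) URL into (host, path), dropping query and
/// fragment. Simplified: ports and userinfo are not treated specially.
fn split_url(url: &str) -> Option<(String, String)> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    let end = rest.find(|c: char| c == '?' || c == '#').unwrap_or(rest.len());
    let rest = &rest[..end];
    let (host, path) = match rest.find('/') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, "/"),
    };
    Some((host.to_ascii_lowercase(), path.to_string()))
}

/// True if `url` looks like an article on the same domain as `source_host`.
fn is_candidate_article(url: &str, source_host: &str) -> bool {
    let (host, path) = match split_url(url) {
        Some(parts) => parts,
        None => return false,
    };
    let lower = path.to_ascii_lowercase();
    host == source_host.to_ascii_lowercase()
        && path != "/"
        && !EXCLUDED_PATTERNS.iter().any(|p| lower.contains(*p))
        && !ASSET_EXTENSIONS.iter().any(|ext| lower.ends_with(*ext))
}

/// Filters, deduplicates (ignoring a trailing slash), and over-fetches
/// up to 2 × max_articles_per_source candidates, preserving page order.
fn select_candidates(
    links: &[&str],
    source_host: &str,
    max_articles_per_source: usize,
) -> Vec<String> {
    let mut seen = HashSet::new();
    links
        .iter()
        .copied()
        .filter(|l| is_candidate_article(l, source_host))
        .filter(|l| seen.insert(l.trim_end_matches('/').to_string()))
        .take(2 * max_articles_per_source)
        .map(|l| l.to_string())
        .collect()
}
```

Substring matching is deliberately coarse: a path such as /blog/login-flow would also be excluded by the /login pattern, so a real implementation may prefer matching on whole path segments.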
Known limitations:
- JavaScript-rendered pages (React/Next.js SPAs) will return empty or navigation-only content. The pipeline degrades gracefully — Phase 2 web search fills the gaps.
- RSS/Atom feeds are not used in v1. Could be added as a future enhancement for more reliable article discovery.
Classification LLM Call
A lightweight LLM request for assigning articles to categories.
Input:
- List of articles: [{index, title, url, body_snippet (first 500 chars)}]
- List of categories: user categories + "Autre"
- Already-filled category counts (for Phase 2: "AI News already has 3/4")
- Max items per category
Prompt: "Classify each article into the most appropriate category. Each category, including 'Autre', accepts at most N articles. Return a JSON mapping."
Output schema:
{
  "type": "object",
  "properties": {
    "assignments": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "index": { "type": "integer" },
          "category": { "type": "string" }
        },
        "required": ["index", "category"],
        "additionalProperties": false
      }
    }
  },
  "required": ["assignments"],
  "additionalProperties": false
}
Error handling:
- Invalid article index → ignored (skip that assignment)
- Category name not matching any user category or "Autre" → assign to "Autre"
- Missing assignments (not all articles classified) → unclassified articles assigned to "Autre"
- Case-insensitive category matching
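The error-handling rules above can be sketched as a single resolution function. `Assignment` and `resolve_assignments` are hypothetical names, JSON deserialization is omitted, and two behaviors are assumptions rather than part of the design: when the same index is assigned twice the last assignment wins, and the per-category cap is enforced in a separate step.

```rust
/// One entry of the LLM's `assignments` array (already deserialized).
struct Assignment {
    index: i64,
    category: String,
}

const AUTRE: &str = "Autre";

/// Resolves raw LLM assignments into one category name per article.
/// The result has exactly `article_count` entries, in article order.
fn resolve_assignments(
    article_count: usize,
    user_categories: &[&str],
    assignments: &[Assignment],
) -> Vec<String> {
    // Every article starts out unclassified.
    let mut resolved: Vec<Option<String>> = vec![None; article_count];

    for a in assignments {
        // Invalid index: skip this assignment.
        if a.index < 0 || a.index as usize >= article_count {
            continue;
        }
        // Case-insensitive match against the user's categories;
        // anything unknown (or "Autre" itself) maps to "Autre".
        let wanted = a.category.to_lowercase();
        let canonical = user_categories
            .iter()
            .find(|c| c.to_lowercase() == wanted)
            .map(|c| c.to_string())
            .unwrap_or_else(|| AUTRE.to_string());
        resolved[a.index as usize] = Some(canonical);
    }

    // Articles the LLM did not classify fall back to "Autre".
    resolved
        .into_iter()
        .map(|c| c.unwrap_or_else(|| AUTRE.to_string()))
        .collect()
}
```

Returning the canonical spelling from the user's category list, rather than the LLM's spelling, keeps the category keys stable for the later fill and rewrite steps.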
Model: Uses model_research (same as search pass).
LLM dispatch: Reuse generate_rewrite_pass (Chat Completions API, no web search needed). The classification call uses model_research even though it goes through the "rewrite" method — the method is provider-agnostic and just sends a structured prompt.
source_diversity_window Interaction
- Phase 1 (personalized sources): The diversity window does NOT apply. Personalized sources are explicitly chosen by the user and always scraped, even if their domain appeared in recent syntheses.
- Phase 2 (web search): The diversity window applies as today — recent domains are injected as a soft "avoid if possible" instruction in the search prompt.
Bug Fixes Included
- build_rewrite_schema forcing minItems: 1 for empty categories: categories with 0 articles are omitted from the rewrite schema entirely. No hallucinated articles.
- Dead code removal: url_quality_sufficient, URL_QUALITY_THRESHOLD removed.
Files to Modify
- Create: backend/src/services/source_scraper.rs (source page scraping + article link extraction)
- Modify: backend/src/services/mod.rs (register source_scraper module)
- Modify: backend/src/services/synthesis.rs (rewrite run_generation_inner with two-phase pipeline, classification response parsing, category filling logic, "Autre" handling in build_rewrite_schema and build_final_sections)
- Modify: backend/src/services/prompts.rs (add build_classification_prompt, modify build_search_prompt to accept category gaps, i.e. how many items are still needed per category)
- Modify: backend/src/services/llm/schema.rs (add build_classification_schema)
- Modify: backend/tests/api_syntheses_test.rs (update generation pipeline integration test)
- Modify: e2e/tests/generation-live.spec.ts (update settings, add assertions for personalized source articles and "Autre" category)
- Add: unit tests in source_scraper.rs (link extraction, filtering, deduplication, edge cases)
- Add: unit tests in prompts.rs (classification prompt generation)
- Add: unit tests in synthesis.rs (classification parsing, category filling, two-phase integration, "Autre" handling)
What Does NOT Change
- Frontend: no UI changes
- Database/migrations: no schema changes
- User settings: no new fields
- Individual article scraper (scraper.rs): reused as-is
- LLM provider trait and implementations: reused as-is (classification uses generate_rewrite_pass)
- restore_scraped_urls, sanitize_json_null_bytes: reused as-is
- filter_empty_scraped_articles: reused as-is