Design: Source Priority Pipeline — Personalized Sources First, Web Search Fallback

Date: 2026-03-24
Scope: Redesign the synthesis generation pipeline to prioritize personalized sources with scraping, falling back to web search for gaps

Context

The current pipeline sends a single LLM call that mixes personalized sources and web search together. There is no prioritization, no retry when articles fail validation, and no fallback mechanism. The LLM decides freely which sources to use, often ignoring personalized ones.

New Pipeline (Two-Phase)

Phase 1: Personalized Sources (scrape-based, no LLM for discovery)

Phase 1 is skipped entirely if the user has 0 configured sources; the pipeline then proceeds directly to Phase 2.

For each user source URL (e.g., https://openai.com/blog), scrape the page and extract article links (max 10 sources processed, to bound scraping work)
Filter links: same domain only, non-empty path (not just /), exclude non-article patterns
Normalize and deduplicate URLs, then fetch up to 2 × max_articles_per_source candidates per source (over-fetching to compensate for validation failures)
Scrape each candidate article with the existing scraper, which validates the date, soft-404 status, and content
Filter out articles with empty scraped content (too old, failed, soft 404)
LLM classification call: send articles (title + first 500 chars of body) + user categories + "Autre" → LLM returns article-to-category mapping
Fill categories from the mapping, respecting max_items_per_category for every category (including "Autre")
Trim excess: after classification, enforce max_articles_per_source per domain across all categories
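
The final trim step can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the Article struct, its domain field, and the trim_per_domain function are hypothetical names, and categories are visited in sorted key order so the result is deterministic.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical stand-in for a classified article; the real pipeline type may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct Article {
    pub url: String,
    pub domain: String,
}

/// Keep at most `max_articles_per_source` articles per domain across ALL categories.
/// Categories are visited in key order (BTreeMap); articles keep their existing order.
pub fn trim_per_domain(
    by_category: &mut BTreeMap<String, Vec<Article>>,
    max_articles_per_source: usize,
) {
    // Running count of kept articles per domain, shared across categories.
    let mut kept: HashMap<String, usize> = HashMap::new();
    for articles in by_category.values_mut() {
        articles.retain(|article| {
            let count = kept.entry(article.domain.clone()).or_insert(0);
            if *count < max_articles_per_source {
                *count += 1;
                true
            } else {
                false
            }
        });
    }
}
```

Because the counter is shared across categories, a domain that already reached its limit in an earlier category contributes nothing to later ones. Which category gets priority is a design choice the text above leaves open.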

If all source scrapes fail (network errors, JS-rendered sites, etc.), Phase 1 produces 0 articles and the pipeline falls through to Phase 2 cleanly.

Phase 2: Web Search Fallback (LLM-based)

Runs only if at least one user-defined category is still under max_items_per_category after Phase 1. ("Autre" does not trigger Phase 2; it only collects overflow.)

Compute category gaps: for each user-defined category, needed = max_items_per_category - already_filled
Run the LLM search pass with a modified prompt: include the gap counts per category ("find N articles for AI News, M articles for Cybersecurity")
Apply existing filters: filter_homepage_urls, dedup_by_url (cross-phase — dedup against Phase 1 URLs), limit_articles_per_source (cross-phase — count Phase 1 domains)
Scrape + validate web search results (existing scraper)
Filter out articles with empty scraped content
LLM classification call (same function as Phase 1): classify web search articles into remaining category gaps (including "Autre" for overflow)
Fill remaining category slots, respecting limits
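
The gap computation and the Phase 2 trigger can be sketched as below. The function names compute_gaps and should_run_phase2 are hypothetical, and the sketch assumes the filled counts are available as a map from category name to count. "Autre" never appears in the result because only user-defined categories are iterated.

```rust
use std::collections::BTreeMap;

/// For each user-defined category, compute needed = max_items_per_category - already_filled.
/// Categories that are already full are left out of the result.
pub fn compute_gaps(
    user_categories: &[String],
    filled: &BTreeMap<String, usize>,
    max_items_per_category: usize,
) -> BTreeMap<String, usize> {
    user_categories
        .iter()
        .filter_map(|category| {
            let already_filled = filled.get(category).copied().unwrap_or(0);
            // saturating_sub guards against a category that is somehow over the cap.
            let needed = max_items_per_category.saturating_sub(already_filled);
            (needed > 0).then(|| (category.clone(), needed))
        })
        .collect()
}

/// Phase 2 runs only when at least one user-defined category still has a gap.
pub fn should_run_phase2(gaps: &BTreeMap<String, usize>) -> bool {
    !gaps.is_empty()
}
```

The resulting map is what the modified search prompt would consume to produce the per-category counts ("find N articles for AI News, M articles for Cybersecurity").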

Combined Rewrite Pass

After both phases, merge all classified articles into a single HashMap<String, Vec<ScrapedNewsItem>> keyed by category. Run the rewrite pass on the combined set. The rewrite schema uses actual item counts per category. Categories with 0 articles are omitted from the schema (no hallucinated articles).
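
A minimal sketch of the merge, assuming each phase yields its own map of classified articles. The fields of ScrapedNewsItem shown here and the helper names merge_phases and rewrite_counts are assumptions for illustration; only the map type comes from the design.

```rust
use std::collections::HashMap;

/// Stand-in for the pipeline's scraped article type; the fields are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct ScrapedNewsItem {
    pub title: String,
    pub url: String,
}

pub type Classified = HashMap<String, Vec<ScrapedNewsItem>>;

/// Merge Phase 2 results into Phase 1 results, category by category.
/// Phase 1 articles stay first within each category.
pub fn merge_phases(mut phase1: Classified, phase2: Classified) -> Classified {
    for (category, items) in phase2 {
        phase1.entry(category).or_default().extend(items);
    }
    // Drop empty categories so the rewrite schema never asks for articles that do not exist.
    phase1.retain(|_, items| !items.is_empty());
    phase1
}

/// Actual item counts per category, sorted by category key for stable output.
pub fn rewrite_counts(merged: &Classified) -> Vec<(String, usize)> {
    let mut counts: Vec<(String, usize)> = merged
        .iter()
        .map(|(category, items)| (category.clone(), items.len()))
        .collect();
    counts.sort();
    counts
}
```

Dropping empty categories at merge time means every later consumer of the map sees only categories that have at least one article.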

"Autre" Default Category

Always exists as a fallback classification category, regardless of user settings
Articles that don't fit any user-defined category are assigned to "Autre"
Capped at max_items_per_category (same limit as user categories)
Only included in the final synthesis if it has articles (not shown when empty)
Not a user setting — hardcoded in the pipeline
Uses category key category_autre in the internal data structures
Included in build_rewrite_schema, build_final_sections, and restore_scraped_urls when it has articles
limit_articles_per_source and dedup_by_url treat "Autre" articles the same as any other category

Source Page Scraping (new module: `source_scraper.rs`)

Fetches a source URL and extracts article links:

Fetch page HTML (reuse existing scraper HTTP client with 15s timeout)
Extract all <a href> links using the scraper crate (already a dependency)
Filter:
- Same domain only (no external links)
- Path must be non-empty and not just / (allows single-segment paths like /my-article)
- Exclude patterns: /tag/, /category/, /author/, /page/, /login, /signup, /privacy, /terms, /search, /contact
- Exclude static assets: .css, .js, .png, .jpg, .gif, .svg, .pdf, .zip, .xml
Normalize URLs (resolve relative paths against base URL, deduplicate)
Limit to 2 × max_articles_per_source per source (over-fetch)
Return Vec<String> of candidate article URLs
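
The filtering and over-fetch limit above can be sketched with the standard library only. Several things here are assumptions: the function names are hypothetical, href extraction and relative-URL resolution are taken to have happened upstream (so every input is an absolute URL), exclusion patterns are matched as case-insensitive substrings of the path, and normalization is reduced to dropping the fragment and any trailing slash.

```rust
use std::collections::HashSet;

/// Path fragments that mark non-article pages (from the exclusion list above).
const EXCLUDED_PATTERNS: [&str; 10] = [
    "/tag/", "/category/", "/author/", "/page/", "/login",
    "/signup", "/privacy", "/terms", "/search", "/contact",
];

/// Static-asset extensions to skip.
const EXCLUDED_EXTENSIONS: [&str; 9] = [
    ".css", ".js", ".png", ".jpg", ".gif", ".svg", ".pdf", ".zip", ".xml",
];

/// Split an absolute http(s) URL into (lowercased host, path); query and fragment are dropped.
/// Returns None for anything that is not an absolute http(s) URL.
fn split_url(url: &str) -> Option<(String, String)> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    let rest = rest.split(|c: char| c == '?' || c == '#').next().unwrap_or("");
    let (host, path) = match rest.find('/') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, "/"),
    };
    if host.is_empty() {
        return None;
    }
    Some((host.to_ascii_lowercase(), path.to_string()))
}

/// True if `url` is on `source_host`, has a non-root path, and matches no exclusion rule.
pub fn is_candidate_link(source_host: &str, url: &str) -> bool {
    let (host, path) = match split_url(url) {
        Some(parts) => parts,
        None => return false,
    };
    let lower_path = path.to_ascii_lowercase();
    host == source_host.to_ascii_lowercase()
        && path != "/"
        && !EXCLUDED_PATTERNS.iter().any(|p| lower_path.contains(*p))
        && !EXCLUDED_EXTENSIONS.iter().any(|ext| lower_path.ends_with(*ext))
}

/// Simplified normalization (an assumption): drop the fragment and any trailing slash.
fn normalize(url: &str) -> String {
    let without_fragment = url.split('#').next().unwrap_or(url);
    without_fragment.trim_end_matches('/').to_string()
}

/// Filter, normalize, deduplicate, and cap at 2 × max_articles_per_source, keeping page order.
pub fn select_candidates(
    source_host: &str,
    links: &[&str],
    max_articles_per_source: usize,
) -> Vec<String> {
    let limit = 2 * max_articles_per_source;
    let mut seen = HashSet::new();
    let mut candidates = Vec::new();
    for link in links {
        if candidates.len() >= limit {
            break;
        }
        if !is_candidate_link(source_host, link) {
            continue;
        }
        let normalized = normalize(link);
        if seen.insert(normalized.clone()) {
            candidates.push(normalized);
        }
    }
    candidates
}
```

Substring matching is deliberately coarse: it would also reject an article whose slug happens to start with /login or /search. A stricter variant could match whole path segments instead. Hosts are compared exactly, so www.example.com and example.com count as different domains in this sketch.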

Known limitations:

JavaScript-rendered pages (React/Next.js SPAs) will return empty or navigation-only content. The pipeline degrades gracefully — Phase 2 web search fills the gaps.
RSS/Atom feeds are not used in v1. Could be added as a future enhancement for more reliable article discovery.

Classification LLM Call

A lightweight LLM request for assigning articles to categories.

Input:

List of articles: [{index, title, url, body_snippet (first 500 chars)}]
List of categories: user categories + "Autre"
Already-filled category counts (for Phase 2: "AI News already has 3/4")
Max items per category

Prompt: "Classify each article into the most appropriate category. Each category, including 'Autre', accepts at most N articles. Return a JSON mapping."
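
One way the prompt could be assembled from these inputs is sketched below. The signature of build_classification_prompt, the ArticleInput struct, the body_snippet helper, and the text layout are all assumptions; only the instruction sentence and the 500-character snippet come from the design. Already-filled counts are omitted for brevity.

```rust
/// Hypothetical input type: one scraped article to classify.
pub struct ArticleInput {
    pub title: String,
    pub url: String,
    pub body: String,
}

/// First `max_chars` characters of `body`. Counting chars (not bytes) means the cut
/// never lands inside a multi-byte UTF-8 sequence.
pub fn body_snippet(body: &str, max_chars: usize) -> String {
    body.chars().take(max_chars).collect()
}

/// Build the classification prompt. `categories` is expected to already include "Autre".
pub fn build_classification_prompt(
    articles: &[ArticleInput],
    categories: &[String],
    max_items_per_category: usize,
) -> String {
    let mut prompt = format!(
        "Classify each article into the most appropriate category. \
         Each category, including 'Autre', accepts at most {} articles. \
         Return a JSON mapping.\n\nCategories: {}\n\nArticles:\n",
        max_items_per_category,
        categories.join(", "),
    );
    // The index printed here is the one the LLM echoes back in its assignments.
    for (index, article) in articles.iter().enumerate() {
        prompt.push_str(&format!(
            "[{}] {} ({})\n{}\n\n",
            index,
            article.title,
            article.url,
            body_snippet(&article.body, 500),
        ));
    }
    prompt
}
```

Slicing the body by byte offset instead would panic on a non-ASCII boundary, which matters for French-language articles.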

Output schema:

{
  "type": "object",
  "properties": {
    "assignments": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "index": { "type": "integer" },
          "category": { "type": "string" }
        },
        "required": ["index", "category"],
        "additionalProperties": false
      }
    }
  },
  "required": ["assignments"],
  "additionalProperties": false
}

Error handling:

Invalid article index → ignored (skip that assignment)
Category name not matching any user category or "Autre" → assign to "Autre"
Missing assignments (not all articles classified) → unclassified articles assigned to "Autre"
Case-insensitive category matching
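
These rules can be sketched as a single resolution function. The Assignment struct mirrors the output schema, while resolve_assignments is a hypothetical name. The sketch assumes indices are 0-based and that the first assignment wins when the LLM returns the same index twice; neither point is specified above. Per-category caps are not enforced here and are left to the category-filling step.

```rust
use std::collections::HashMap;

pub const AUTRE: &str = "Autre";

/// One entry of the LLM's `assignments` array (mirrors the output schema).
pub struct Assignment {
    pub index: i64,
    pub category: String,
}

/// Resolve raw LLM assignments into one category name per article (by position).
pub fn resolve_assignments(
    article_count: usize,
    user_categories: &[String],
    assignments: &[Assignment],
) -> Vec<String> {
    // Lowercased name -> canonical name, for case-insensitive matching.
    let mut canonical: HashMap<String, String> = user_categories
        .iter()
        .map(|c| (c.to_lowercase(), c.clone()))
        .collect();
    canonical.insert(AUTRE.to_lowercase(), AUTRE.to_string());

    let mut resolved: Vec<Option<String>> = vec![None; article_count];
    for assignment in assignments {
        // Invalid article index: skip this assignment.
        if assignment.index < 0 || (assignment.index as usize) >= article_count {
            continue;
        }
        let index = assignment.index as usize;
        // Unknown category name: fall back to "Autre".
        let category = canonical
            .get(&assignment.category.trim().to_lowercase())
            .cloned()
            .unwrap_or_else(|| AUTRE.to_string());
        // Assumption: the first assignment for an index wins.
        if resolved[index].is_none() {
            resolved[index] = Some(category);
        }
    }
    // Articles the LLM did not classify go to "Autre".
    resolved
        .into_iter()
        .map(|c| c.unwrap_or_else(|| AUTRE.to_string()))
        .collect()
}
```

For three articles and the assignments (0, "ai news"), (1, "Sports"), (7, "AI News"), the result is AI News, Autre, Autre: the first matches case-insensitively, the second names an unknown category, the out-of-range index is skipped, and the unclassified third article falls back to "Autre".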

Model: Uses model_research (same as search pass).

LLM dispatch: Reuse generate_rewrite_pass (Chat Completions API, no web search needed). The classification call uses model_research even though it goes through the "rewrite" method — the method is provider-agnostic and just sends a structured prompt.

source_diversity_window Interaction

Phase 1 (personalized sources): The diversity window does NOT apply. Personalized sources are explicitly chosen by the user and always scraped, even if their domain appeared in recent syntheses.
Phase 2 (web search): The diversity window applies as today — recent domains are injected as a soft "avoid if possible" instruction in the search prompt.

Bug Fixes Included

build_rewrite_schema forcing minItems: 1 for empty categories — Categories with 0 articles are omitted from the rewrite schema entirely. No hallucinated articles.
Dead code removal — url_quality_sufficient, URL_QUALITY_THRESHOLD removed.

Files to Modify

Create: backend/src/services/source_scraper.rs — source page scraping + article link extraction
Modify: backend/src/services/mod.rs — register source_scraper module
Modify: backend/src/services/synthesis.rs — rewrite run_generation_inner with two-phase pipeline, classification response parsing, category filling logic, "Autre" handling in build_rewrite_schema and build_final_sections
Modify: backend/src/services/prompts.rs — add build_classification_prompt, modify build_search_prompt to accept category gaps (how many items still needed per category)
Modify: backend/src/services/llm/schema.rs — add build_classification_schema
Modify: backend/tests/api_syntheses_test.rs — update generation pipeline integration test
Modify: e2e/tests/generation-live.spec.ts — update settings, add assertions for personalized source articles and "Autre" category
Add: unit tests in source_scraper.rs — link extraction, filtering, deduplication, edge cases
Add: unit tests in prompts.rs — classification prompt generation
Add: unit tests in synthesis.rs — classification parsing, category filling, two-phase integration, "Autre" handling

What Does NOT Change

Frontend — no UI changes
Database/migrations — no schema changes
User settings — no new fields
Individual article scraper (scraper.rs) — reused as-is
LLM provider trait and implementations — reused as-is (classification uses generate_rewrite_pass)
restore_scraped_urls, sanitize_json_null_bytes — reused as-is
filter_empty_scraped_articles — reused as-is
