# Design: Source Priority Pipeline — Personalized Sources First, Web Search Fallback
**Date**: 2026-03-24
**Scope**: Redesign the synthesis generation pipeline to prioritize personalized sources with scraping, fall back to web search for gaps
---
## Context
The current pipeline sends a single LLM call that mixes personalized sources and web search together. There is no prioritization, no retry when articles fail validation, and no fallback mechanism. The LLM decides freely which sources to use, often ignoring personalized ones.
## New Pipeline (Two-Phase)
### Phase 1: Personalized Sources (scrape-based, no LLM for discovery)
**Skipped entirely if the user has 0 configured sources.** In that case the pipeline proceeds directly to Phase 2.
1. For each user source URL (e.g., `https://openai.com/blog`), scrape the page and extract article links (max 10 sources processed, to bound scraping work)
2. Filter links: same domain only, non-empty path (not just `/`), exclude non-article patterns
3. Normalize and deduplicate URLs, then keep up to `2 × max_articles_per_source` candidates per source (over-fetching compensates for validation failures)
4. Scrape each candidate article (existing scraper: validate date, soft 404, content)
5. Filter out articles with empty scraped content (too old, failed, soft 404)
6. **LLM classification call**: send articles (title + first 500 chars of body) + user categories + "Autre" → LLM returns article-to-category mapping
7. Fill categories from the mapping, respecting `max_items_per_category` per category (including "Autre")
8. Trim excess: after classification, enforce `max_articles_per_source` per domain across all categories
**If all source scrapes fail** (network errors, JS-rendered sites, etc.), Phase 1 produces 0 articles. Pipeline falls through to Phase 2 cleanly.
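Steps 7 and 8 can be sketched as follows. This is a minimal illustration, not the actual implementation: the `Article` struct and `fill_categories` function are hypothetical stand-ins (the real pipeline works with `ScrapedNewsItem`), and the sketch applies both caps in a single ordered pass rather than filling first and trimming afterwards.

```rust
use std::collections::HashMap;

/// Hypothetical minimal article record; the real pipeline uses `ScrapedNewsItem`.
#[derive(Debug, Clone, PartialEq)]
pub struct Article {
    pub url: String,
    pub domain: String,
}

/// Fill categories from (article, category) pairs, in input order.
/// An article is dropped if its category is already full or if its
/// domain has already contributed `max_articles_per_source` articles.
pub fn fill_categories(
    classified: Vec<(Article, String)>,
    max_items_per_category: usize,
    max_articles_per_source: usize,
) -> HashMap<String, Vec<Article>> {
    let mut filled: HashMap<String, Vec<Article>> = HashMap::new();
    let mut per_domain: HashMap<String, usize> = HashMap::new();

    for (article, category) in classified {
        // Per-domain cap, counted across all categories.
        let domain_count = per_domain.entry(article.domain.clone()).or_insert(0);
        if *domain_count >= max_articles_per_source {
            continue;
        }
        // Per-category cap (applies to "Autre" as well).
        let slot = filled.entry(category).or_default();
        if slot.len() >= max_items_per_category {
            continue;
        }
        *domain_count += 1;
        slot.push(article);
    }

    // Drop categories that ended up empty.
    filled.retain(|_, items| !items.is_empty());
    filled
}
```

If every scrape fails, `classified` is empty and the function returns an empty map, which is the "0 articles" case that falls through to Phase 2.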
### Phase 2: Web Search Fallback (LLM-based)
Only runs if any **user-defined** category is still under `max_items_per_category` after Phase 1. ("Autre" does not trigger Phase 2 — it only collects overflow.)
1. Compute category gaps: for each user-defined category, `needed = max_items_per_category - already_filled`
2. Run the LLM search pass with a modified prompt: include the gap counts per category ("find N articles for AI News, M articles for Cybersecurity")
3. Apply existing filters: `filter_homepage_urls`, `dedup_by_url` (cross-phase — dedup against Phase 1 URLs), `limit_articles_per_source` (cross-phase — count Phase 1 domains)
4. Scrape + validate web search results (existing scraper)
5. Filter out articles with empty scraped content
6. **LLM classification call** (same function as Phase 1): classify web search articles into remaining category gaps (including "Autre" for overflow)
7. Fill remaining category slots, respecting limits
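The gap computation in step 1 and the prompt fragment in step 2 could look roughly like this. The function names `compute_category_gaps` and `format_gap_instruction` are hypothetical, and the exact prompt wording is an assumption modeled on the example above.

```rust
use std::collections::HashMap;

/// For each user-defined category, compute how many items are still needed.
/// "Autre" is deliberately not part of `user_categories`, so it never
/// produces a gap and never triggers Phase 2.
pub fn compute_category_gaps(
    user_categories: &[String],
    filled: &HashMap<String, usize>,
    max_items_per_category: usize,
) -> Vec<(String, usize)> {
    user_categories
        .iter()
        .filter_map(|category| {
            let already_filled = filled.get(category).copied().unwrap_or(0);
            // saturating_sub guards against a category that is over the limit.
            let needed = max_items_per_category.saturating_sub(already_filled);
            (needed > 0).then(|| (category.clone(), needed))
        })
        .collect()
}

/// Render the gaps as a prompt fragment (wording is illustrative only).
pub fn format_gap_instruction(gaps: &[(String, usize)]) -> String {
    let parts: Vec<String> = gaps
        .iter()
        .map(|(category, needed)| format!("{needed} articles for {category}"))
        .collect();
    format!("find {}", parts.join(", "))
}
```

Phase 2 would run only when `compute_category_gaps` returns a non-empty list.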
### Combined Rewrite Pass
After both phases, merge all classified articles into a single `HashMap<String, Vec<ScrapedNewsItem>>` keyed by category. Run the rewrite pass on the combined set. The rewrite schema uses actual item counts per category. Categories with 0 articles are omitted from the schema (no hallucinated articles).
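A minimal sketch of the merge step, assuming both phases produce a map keyed by category. It is generic over the item type so it stays self-contained; in the pipeline the item type would be `ScrapedNewsItem`. The names `merge_phases` and `schema_item_counts` are hypothetical.

```rust
use std::collections::HashMap;

/// Merge Phase 1 and Phase 2 results into one map keyed by category.
/// Phase 1 items come first within each category; empty categories are dropped.
pub fn merge_phases<T>(
    phase1: HashMap<String, Vec<T>>,
    phase2: HashMap<String, Vec<T>>,
) -> HashMap<String, Vec<T>> {
    let mut merged = phase1;
    for (category, items) in phase2 {
        merged.entry(category).or_default().extend(items);
    }
    // Categories with 0 articles must not reach the rewrite schema.
    merged.retain(|_, items| !items.is_empty());
    merged
}

/// Actual item counts per category, sorted by name for stable output.
/// These counts are what the rewrite schema would be built from.
pub fn schema_item_counts<T>(merged: &HashMap<String, Vec<T>>) -> Vec<(String, usize)> {
    let mut counts: Vec<(String, usize)> = merged
        .iter()
        .map(|(category, items)| (category.clone(), items.len()))
        .collect();
    counts.sort();
    counts
}
```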
## "Autre" Default Category
- Always exists as a fallback classification category, regardless of user settings
- Articles that don't fit any user-defined category are assigned to "Autre"
- Capped at `max_items_per_category` (same limit as user categories)
- Only included in the final synthesis if it has articles (not shown when empty)
- Not a user setting — hardcoded in the pipeline
- Uses category key `category_autre` in the internal data structures
- Included in `build_rewrite_schema`, `build_final_sections`, and `restore_scraped_urls` when it has articles
- `limit_articles_per_source` and `dedup_by_url` treat "Autre" articles the same as any other category
## Source Page Scraping (new module: `source_scraper.rs`)
Fetches a source URL and extracts article links:
1. Fetch page HTML (reuse existing scraper HTTP client with 15s timeout)
2. Extract all `<a href>` links using `scraper` crate (already a dependency)
3. Filter:
- Same domain only (no external links)
- Path must be non-empty and not just `/` (allows single-segment paths like `/my-article`)
- Exclude patterns: `/tag/`, `/category/`, `/author/`, `/page/`, `/login`, `/signup`, `/privacy`, `/terms`, `/search`, `/contact`
- Exclude static assets: `.css`, `.js`, `.png`, `.jpg`, `.gif`, `.svg`, `.pdf`, `.zip`, `.xml`
4. Normalize URLs (resolve relative paths against base URL, deduplicate)
5. Limit to `2 × max_articles_per_source` per source (over-fetch)
6. Return `Vec<String>` of candidate article URLs
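The filtering, deduplication, and limiting steps can be sketched with the standard library alone. This sketch assumes the links have already been extracted and resolved to absolute `http(s)` URLs; the real module would presumably do that with the `scraper` crate and a proper URL parser. `split_url`, `is_article_path`, and `select_candidates` are hypothetical names, and treating a trailing slash as insignificant is an assumption about normalization.

```rust
use std::collections::HashSet;

const EXCLUDED_PATTERNS: [&str; 10] = [
    "/tag/", "/category/", "/author/", "/page/", "/login",
    "/signup", "/privacy", "/terms", "/search", "/contact",
];
const STATIC_EXTENSIONS: [&str; 9] = [
    ".css", ".js", ".png", ".jpg", ".gif", ".svg", ".pdf", ".zip", ".xml",
];

/// Split an absolute http(s) URL into (scheme, host, path), dropping query and fragment.
fn split_url(url: &str) -> Option<(&'static str, String, String)> {
    let (scheme, rest) = if let Some(r) = url.strip_prefix("https://") {
        ("https", r)
    } else if let Some(r) = url.strip_prefix("http://") {
        ("http", r)
    } else {
        return None;
    };
    let rest = rest.split(|c: char| c == '?' || c == '#').next().unwrap_or(rest);
    let (host, path) = match rest.find('/') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, "/"),
    };
    Some((scheme, host.to_ascii_lowercase(), path.to_string()))
}

/// True if the path is non-trivial and matches no excluded pattern or static extension.
pub fn is_article_path(path: &str) -> bool {
    let lower = path.to_ascii_lowercase();
    if lower.is_empty() || lower == "/" {
        return false;
    }
    if EXCLUDED_PATTERNS.iter().any(|p| lower.contains(*p)) {
        return false;
    }
    !STATIC_EXTENSIONS.iter().any(|ext| lower.ends_with(*ext))
}

/// Keep same-domain article links, deduplicated, up to 2 × max_articles_per_source.
pub fn select_candidates(
    source_url: &str,
    links: &[&str],
    max_articles_per_source: usize,
) -> Vec<String> {
    let Some((_, source_host, _)) = split_url(source_url) else {
        return Vec::new();
    };
    let limit = 2 * max_articles_per_source;
    let mut seen = HashSet::new();
    let mut candidates = Vec::new();
    for link in links {
        if candidates.len() >= limit {
            break;
        }
        let Some((scheme, host, path)) = split_url(link) else { continue };
        if host != source_host || !is_article_path(&path) {
            continue;
        }
        // Normalize: no query, no fragment, no trailing slash.
        let normalized = format!("{scheme}://{host}{}", path.trim_end_matches('/'));
        if seen.insert(normalized.clone()) {
            candidates.push(normalized);
        }
    }
    candidates
}
```

Substring matching on patterns such as `/contact` also rejects paths like `/contact-us`; whether that is desirable depends on the sources being scraped.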
**Known limitations:**
- JavaScript-rendered pages (React/Next.js SPAs) will return empty or navigation-only content. The pipeline degrades gracefully — Phase 2 web search fills the gaps.
- RSS/Atom feeds are not used in v1. Could be added as a future enhancement for more reliable article discovery.
## Classification LLM Call
A lightweight LLM request for assigning articles to categories.
**Input:**
- List of articles: `[{index, title, url, body_snippet (first 500 chars)}]`
- List of categories: user categories + "Autre"
- Already-filled category counts (for Phase 2: "AI News already has 3/4")
- Max items per category
**Prompt:** "Classify each article into the most appropriate category. Each category, including 'Autre', accepts at most N articles. Return a JSON mapping."
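As an illustration of how the inputs above might be assembled into a prompt, here is a rough sketch. The signature of `build_classification_prompt`, the `ArticleInput` struct, and the line layout are all assumptions. The one detail worth noting is that the 500-character snippet is cut on character boundaries, not bytes, so multi-byte text is never split mid-character.

```rust
/// Hypothetical view of an article for classification purposes.
pub struct ArticleInput<'a> {
    pub title: &'a str,
    pub url: &'a str,
    pub body: &'a str,
}

/// First `max_chars` characters of the body (character-safe, not byte-based).
fn snippet(body: &str, max_chars: usize) -> String {
    body.chars().take(max_chars).collect()
}

/// Build the classification prompt.
/// `categories` holds (name, already-filled count) and is expected to
/// include "Autre"; the caller is responsible for adding it.
pub fn build_classification_prompt(
    articles: &[ArticleInput<'_>],
    categories: &[(String, usize)],
    max_items_per_category: usize,
) -> String {
    let mut prompt = format!(
        "Classify each article into the most appropriate category. \
         Each category, including 'Autre', accepts at most {max_items_per_category} articles. \
         Return a JSON mapping.\n\nCategories:\n"
    );
    for (name, filled) in categories {
        prompt.push_str(&format!("- {name} ({filled}/{max_items_per_category} filled)\n"));
    }
    prompt.push_str("\nArticles:\n");
    for (index, article) in articles.iter().enumerate() {
        prompt.push_str(&format!(
            "[{index}] {} | {} | {}\n",
            article.title,
            article.url,
            snippet(article.body, 500)
        ));
    }
    prompt
}
```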
**Output schema:**
```json
{
"type": "object",
"properties": {
"assignments": {
"type": "array",
"items": {
"type": "object",
"properties": {
"index": { "type": "integer" },
"category": { "type": "string" }
},
"required": ["index", "category"],
"additionalProperties": false
}
}
},
"required": ["assignments"],
"additionalProperties": false
}
```
**Error handling:**
- Invalid article index → ignored (skip that assignment)
- Category name not matching any user category or "Autre" → assign to "Autre"
- Missing assignments (not all articles classified) → unclassified articles assigned to "Autre"
- Case-insensitive category matching
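The four rules above can be sketched as one resolution function. This is a minimal illustration under stated assumptions: `Assignment` mirrors the output schema but is a hypothetical type, the sketch works with the display name "Autre" and not the internal `category_autre` key, and "first assignment wins" for a duplicated index is an assumption the spec does not cover.

```rust
use std::collections::HashMap;

pub const AUTRE: &str = "Autre";

/// One entry of the `assignments` array returned by the LLM.
pub struct Assignment {
    pub index: i64,
    pub category: String,
}

/// Resolve the category of every article, by article index.
pub fn resolve_assignments(
    article_count: usize,
    user_categories: &[String],
    assignments: &[Assignment],
) -> Vec<String> {
    // Lowercased name -> canonical name, for case-insensitive matching.
    let mut canonical: HashMap<String, String> = user_categories
        .iter()
        .map(|c| (c.to_lowercase(), c.clone()))
        .collect();
    canonical.insert(AUTRE.to_lowercase(), AUTRE.to_string());

    let mut resolved: Vec<Option<String>> = vec![None; article_count];
    for assignment in assignments {
        // Invalid index (negative or out of range): skip the assignment.
        let Some(slot) = usize::try_from(assignment.index)
            .ok()
            .and_then(|i| resolved.get_mut(i))
        else {
            continue;
        };
        // Unknown category name: fall back to "Autre".
        let category = canonical
            .get(&assignment.category.trim().to_lowercase())
            .cloned()
            .unwrap_or_else(|| AUTRE.to_string());
        // Assumption: the first assignment for an index wins.
        if slot.is_none() {
            *slot = Some(category);
        }
    }

    // Articles the LLM did not classify go to "Autre".
    resolved
        .into_iter()
        .map(|c| c.unwrap_or_else(|| AUTRE.to_string()))
        .collect()
}
```

The per-category cap is not enforced here; that belongs to the category filling step that consumes this result.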
**Model:** Uses `model_research` (same as search pass).
**LLM dispatch:** Reuse `generate_rewrite_pass` (Chat Completions API, no web search needed). The classification call uses `model_research` even though it goes through the "rewrite" method — the method is provider-agnostic and just sends a structured prompt.
## source_diversity_window Interaction
- **Phase 1 (personalized sources):** The diversity window does NOT apply. Personalized sources are explicitly chosen by the user and always scraped, even if their domain appeared in recent syntheses.
- **Phase 2 (web search):** The diversity window applies as today — recent domains are injected as a soft "avoid if possible" instruction in the search prompt.
## Bug Fixes Included
1. **`build_rewrite_schema` forced `minItems: 1` for empty categories**: categories with 0 articles are now omitted from the rewrite schema entirely, so the LLM is never asked to produce articles for them.
2. **Dead code removal**: `url_quality_sufficient` and `URL_QUALITY_THRESHOLD` are removed.
## Files to Modify
- **Create:** `backend/src/services/source_scraper.rs` — source page scraping + article link extraction
- **Modify:** `backend/src/services/mod.rs` — register `source_scraper` module
- **Modify:** `backend/src/services/synthesis.rs` — rewrite `run_generation_inner` with two-phase pipeline, classification response parsing, category filling logic, "Autre" handling in `build_rewrite_schema` and `build_final_sections`
- **Modify:** `backend/src/services/prompts.rs` — add `build_classification_prompt`, modify `build_search_prompt` to accept category gaps (how many items still needed per category)
- **Modify:** `backend/src/services/llm/schema.rs` — add `build_classification_schema`
- **Modify:** `backend/tests/api_syntheses_test.rs` — update generation pipeline integration test
- **Modify:** `e2e/tests/generation-live.spec.ts` — update settings, add assertions for personalized source articles and "Autre" category
- **Add:** unit tests in `source_scraper.rs` — link extraction, filtering, deduplication, edge cases
- **Add:** unit tests in `prompts.rs` — classification prompt generation
- **Add:** unit tests in `synthesis.rs` — classification parsing, category filling, two-phase integration, "Autre" handling
## What Does NOT Change
- Frontend — no UI changes
- Database/migrations — no schema changes
- User settings — no new fields
- Individual article scraper (`scraper.rs`) — reused as-is
- LLM provider trait and implementations — reused as-is (classification uses `generate_rewrite_pass`)
- `restore_scraped_urls`, `sanitize_json_null_bytes` — reused as-is
- `filter_empty_scraped_articles` — reused as-is