docs: add spec and plan for source priority pipeline redesign

Two-phase pipeline: scrape personalized sources first, classify with LLM,
fall back to web search for gaps. Tasks 1-3 ready, Task 4 needs elaboration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
master
oabrivard 3 months ago
parent 45e5ee8a7d
commit 7c8a19616d

@ -0,0 +1,760 @@
# Source Priority Pipeline — Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Redesign the synthesis generation pipeline to scrape personalized sources first, classify articles with the LLM, then fall back to web search for unfilled categories.
**Architecture:** Two-phase pipeline. Phase 1: scrape user source pages → extract article links → scrape articles → LLM classification → fill categories. Phase 2: LLM web search for gaps → scrape → classify → fill remaining. Combined rewrite pass at the end.
**Tech Stack:** Rust (reqwest, scraper crate, serde_json), existing LLM providers
**Spec:** `docs/superpowers/specs/2026-03-24-source-priority-pipeline-design.md`
---
### Task 1: Source page scraper module
**Files:**
- Create: `backend/src/services/source_scraper.rs`
- Modify: `backend/src/services/mod.rs`
Create a new module that fetches a source page and extracts article links.
- [ ] **Step 1: Create `backend/src/services/source_scraper.rs`**
The module should contain:
```rust
//! Source page scraper: fetches a source URL and extracts article links.
//!
//! Used in Phase 1 of the generation pipeline to discover articles
//! from user-configured sources before falling back to LLM web search.
use crate::errors::AppError;
use scraper::{Html, Selector};
use url::Url;
/// Patterns in URL paths that indicate non-article pages.
const EXCLUDED_PATH_PATTERNS: &[&str] = &[
"/tag/", "/category/", "/author/", "/page/", "/login", "/signup",
"/privacy", "/terms", "/search", "/contact", "/about",
];
/// File extensions that indicate static assets, not articles.
const EXCLUDED_EXTENSIONS: &[&str] = &[
".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg",
".pdf", ".zip", ".xml", ".json", ".ico", ".woff", ".woff2",
];
/// Extract article links from a source page.
///
/// Fetches the HTML at `source_url`, extracts all `<a href>` links,
/// filters to same-domain article-like URLs, deduplicates, and returns
/// up to `max_links` candidate URLs.
pub async fn extract_article_links(
http_client: &reqwest::Client,
source_url: &str,
max_links: usize,
) -> Result<Vec<String>, AppError> {
// Parse source URL to get base domain
let base_url = Url::parse(source_url)
.map_err(|e| AppError::BadRequest(format!("Invalid source URL: {}", e)))?;
let base_domain = base_url.host_str().unwrap_or("").to_lowercase();
// Fetch the page
let response = http_client
.get(source_url)
.send()
.await
.map_err(|e| {
tracing::warn!(url = source_url, error = %e, "Failed to fetch source page");
AppError::Internal(anyhow::anyhow!("Failed to fetch source page"))
})?;
if !response.status().is_success() {
tracing::warn!(url = source_url, status = %response.status(), "Source page returned a non-success status");
return Ok(Vec::new());
}
let html_text = response.text().await.map_err(|e| {
AppError::Internal(anyhow::anyhow!("Failed to read source page body: {}", e))
})?;
let links = extract_links_from_html(&html_text, &base_url, &base_domain);
Ok(links.into_iter().take(max_links).collect())
}
/// Extract and filter article links from HTML content.
///
/// This is a pure function (no I/O) for easy testing.
pub fn extract_links_from_html(
html: &str,
base_url: &Url,
base_domain: &str,
) -> Vec<String> {
let document = Html::parse_document(html);
let selector = Selector::parse("a[href]").unwrap();
let mut seen = std::collections::HashSet::new();
let mut links = Vec::new();
for element in document.select(&selector) {
if let Some(href) = element.value().attr("href") {
// Resolve relative URLs against the base
let resolved = match base_url.join(href) {
Ok(u) => u,
Err(_) => continue,
};
// Must be http(s)
if resolved.scheme() != "http" && resolved.scheme() != "https" {
continue;
}
// Must be same domain
let link_domain = resolved.host_str().unwrap_or("").to_lowercase();
if link_domain != base_domain {
continue;
}
// Must have a non-empty path (not just "/")
let path = resolved.path();
if path.is_empty() || path == "/" {
continue;
}
// Exclude non-article patterns
let path_lower = path.to_lowercase();
if EXCLUDED_PATH_PATTERNS.iter().any(|p| path_lower.contains(p)) {
continue;
}
// Exclude static assets
if EXCLUDED_EXTENSIONS.iter().any(|ext| path_lower.ends_with(ext)) {
continue;
}
// Normalize: strip fragment, keep path+query
let mut normalized = resolved.clone();
normalized.set_fragment(None);
let url_str = normalized.to_string();
// Deduplicate
if seen.insert(url_str.clone()) {
links.push(url_str);
}
}
}
links
}
```
- [ ] **Step 2: Register module in `backend/src/services/mod.rs`**
Add `pub mod source_scraper;` after `pub mod scraper;`.
- [ ] **Step 3: Add unit tests**
Add to the bottom of `source_scraper.rs`:
```rust
#[cfg(test)]
mod tests {
use super::*;
fn base_url(s: &str) -> Url {
Url::parse(s).unwrap()
}
#[test]
fn extracts_article_links_from_html() {
let html = r#"
<html><body>
<a href="/blog/article-1">Article 1</a>
<a href="/blog/article-2">Article 2</a>
<a href="/">Home</a>
</body></html>"#;
let base = base_url("https://example.com/blog");
let links = extract_links_from_html(html, &base, "example.com");
assert_eq!(links.len(), 2);
assert!(links[0].contains("/blog/article-1"));
assert!(links[1].contains("/blog/article-2"));
}
#[test]
fn filters_external_links() {
let html = r#"<a href="https://other.com/article">External</a>"#;
let base = base_url("https://example.com");
let links = extract_links_from_html(html, &base, "example.com");
assert!(links.is_empty());
}
#[test]
fn filters_non_article_patterns() {
let html = r#"
<a href="/tag/ai">Tag</a>
<a href="/category/tech">Category</a>
<a href="/author/john">Author</a>
<a href="/login">Login</a>
"#;
let base = base_url("https://example.com");
let links = extract_links_from_html(html, &base, "example.com");
assert!(links.is_empty());
}
#[test]
fn filters_static_assets() {
let html = r#"
<a href="/style.css">CSS</a>
<a href="/script.js">JS</a>
<a href="/logo.png">Image</a>
"#;
let base = base_url("https://example.com");
let links = extract_links_from_html(html, &base, "example.com");
assert!(links.is_empty());
}
#[test]
fn deduplicates_links() {
let html = r#"
<a href="/article">Link 1</a>
<a href="/article">Link 2</a>
<a href="/article#section">Link 3</a>
"#;
let base = base_url("https://example.com");
let links = extract_links_from_html(html, &base, "example.com");
assert_eq!(links.len(), 1);
}
#[test]
fn resolves_relative_urls() {
let html = r#"<a href="my-post">Relative</a>"#;
let base = base_url("https://example.com/blog/");
let links = extract_links_from_html(html, &base, "example.com");
assert_eq!(links.len(), 1);
assert!(links[0].contains("/blog/my-post"));
}
#[test]
fn allows_single_segment_paths() {
let html = r#"<a href="/my-great-article">Article</a>"#;
let base = base_url("https://example.com");
let links = extract_links_from_html(html, &base, "example.com");
assert_eq!(links.len(), 1);
}
#[test]
fn empty_html_returns_empty() {
let links = extract_links_from_html("", &base_url("https://example.com"), "example.com");
assert!(links.is_empty());
}
}
```
- [ ] **Step 4: Run tests**
Run: `cd backend && cargo test --lib source_scraper`
Expected: 8 tests pass
- [ ] **Step 5: Commit**
```bash
git add backend/src/services/source_scraper.rs backend/src/services/mod.rs
git commit -m "feat: add source_scraper module for extracting article links from source pages"
```
---
### Task 2: Classification prompt and schema
**Files:**
- Modify: `backend/src/services/prompts.rs`
- Modify: `backend/src/services/llm/schema.rs`
- [ ] **Step 1: Add `build_classification_schema` to `schema.rs`**
```rust
/// Build a JSON Schema for the article classification response.
///
/// The LLM returns an array of assignments mapping article indices to category names.
pub fn build_classification_schema() -> Value {
    serde_json::json!({
        "type": "object",
        "properties": {
            "assignments": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "index": { "type": "integer", "description": "Article index from the input list" },
                        "category": { "type": "string", "description": "Category name to assign this article to" }
                    },
                    "required": ["index", "category"],
                    "additionalProperties": false
                }
            }
        },
        "required": ["assignments"],
        "additionalProperties": false
    })
}
```
- [ ] **Step 2: Add `build_classification_prompt` to `prompts.rs`**
```rust
use crate::models::synthesis::ScrapedNewsItem;
/// Build a prompt for classifying scraped articles into categories.
///
/// # Arguments
/// * `articles` — scraped articles to classify (title + body snippet used)
/// * `categories` — user categories + "Autre"
/// * `max_per_category` — max items allowed per category
/// * `filled_counts` — how many items already fill each category (for Phase 2)
pub fn build_classification_prompt(
articles: &[ScrapedNewsItem],
categories: &[String],
max_per_category: i32,
filled_counts: &std::collections::HashMap<String, usize>,
) -> (String, String) {
let system_prompt =
"Tu es un assistant qui classe des articles dans des categories. \
Reponds uniquement au format JSON demande."
.to_string();
let articles_json: Vec<serde_json::Value> = articles
.iter()
.enumerate()
.map(|(i, a)| {
let snippet: String = a.scraped_content.chars().take(500).collect();
serde_json::json!({
"index": i,
"title": a.title,
"url": a.url,
"snippet": snippet
})
})
.collect();
let categories_info: Vec<String> = categories
.iter()
.map(|cat| {
let filled = filled_counts.get(cat).copied().unwrap_or(0);
let remaining = (max_per_category as usize).saturating_sub(filled);
format!("- \"{}\" (encore {} places)", cat, remaining)
})
.collect();
let user_prompt = format!(
"Voici une liste d'articles :\n{articles}\n\n\
Categories disponibles :\n{categories}\n\n\
Classe chaque article dans la categorie la plus appropriee. \
Si un article ne correspond a aucune categorie, classe-le dans \"Autre\".\n\
Respecte le nombre de places restantes par categorie.",
articles = serde_json::to_string_pretty(&articles_json).unwrap_or_default(),
categories = categories_info.join("\n"),
);
(system_prompt, user_prompt)
}
```
- [ ] **Step 3: Update `test_settings()` in prompts.rs if needed**
If `ScrapedNewsItem` is not already imported in `prompts.rs`, add the import. `build_classification_prompt` takes `&[ScrapedNewsItem]`, a type from `models::synthesis`.
- [ ] **Step 4: Add tests**
```rust
#[test]
fn classification_prompt_includes_categories_and_articles() {
let articles = vec![
ScrapedNewsItem {
title: "GPT-5 Released".into(),
url: "https://openai.com/blog/gpt5".into(),
summary: "s".into(),
original_title: "t".into(),
scraped_content: "OpenAI released GPT-5 today with major improvements".into(),
},
];
let categories = vec!["AI News".to_string(), "Autre".to_string()];
let filled = std::collections::HashMap::new();
let (_, user_prompt) = build_classification_prompt(&articles, &categories, 4, &filled);
assert!(user_prompt.contains("GPT-5 Released"));
assert!(user_prompt.contains("AI News"));
assert!(user_prompt.contains("Autre"));
assert!(user_prompt.contains("encore 4 places"));
}
#[test]
fn classification_prompt_shows_reduced_capacity() {
let articles = vec![
ScrapedNewsItem {
title: "T".into(), url: "https://a.com/1".into(),
summary: "s".into(), original_title: "t".into(),
scraped_content: "Content".into(),
},
];
let categories = vec!["AI News".to_string(), "Autre".to_string()];
let mut filled = std::collections::HashMap::new();
filled.insert("AI News".to_string(), 3);
let (_, user_prompt) = build_classification_prompt(&articles, &categories, 4, &filled);
assert!(user_prompt.contains("encore 1 places"));
}
```
- [ ] **Step 5: Add a test for `build_classification_schema` in schema.rs**
```rust
#[test]
fn classification_schema_has_assignments_array() {
let schema = build_classification_schema();
assert_eq!(schema["type"], "object");
let assignments = &schema["properties"]["assignments"];
assert_eq!(assignments["type"], "array");
let item_props = &assignments["items"]["properties"];
assert!(item_props.get("index").is_some());
assert!(item_props.get("category").is_some());
assert_eq!(assignments["items"]["additionalProperties"], false);
}
```
- [ ] **Step 6: Run tests**
Run: `cd backend && cargo test --lib`
Expected: all tests pass
- [ ] **Step 7: Commit**
```bash
git add backend/src/services/prompts.rs backend/src/services/llm/schema.rs
git commit -m "feat: add classification prompt and schema for article categorization"
```
---
### Task 3: Classification response parsing + category filling
**Files:**
- Modify: `backend/src/services/synthesis.rs`
Add helper functions for parsing the LLM classification response and filling categories.
- [ ] **Step 1: Add `parse_classification_response` function**
```rust
/// Parse the LLM classification response and assign articles to categories.
///
/// Returns a HashMap of category_key → Vec<ScrapedNewsItem>.
/// Invalid indices are ignored. Unknown categories default to "Autre".
/// Case-insensitive category matching.
fn parse_classification_response(
response: &serde_json::Value,
articles: &[ScrapedNewsItem],
categories: &[String],
max_per_category: i32,
filled_counts: &mut HashMap<String, usize>,
) -> HashMap<String, Vec<ScrapedNewsItem>> {
let max = max_per_category as usize;
let mut result: HashMap<String, Vec<ScrapedNewsItem>> = HashMap::new();
// Build category name → key mapping (case-insensitive)
let mut name_to_key: HashMap<String, String> = HashMap::new();
for (i, cat) in categories.iter().enumerate() {
let key = if cat == "Autre" {
"category_autre".to_string()
} else {
format!("category_{}", i)
};
name_to_key.insert(cat.to_lowercase(), key);
}
let assignments = response
.get("assignments")
.and_then(|a| a.as_array())
.cloned()
.unwrap_or_default();
let mut assigned_indices = std::collections::HashSet::new();
for assignment in &assignments {
let index = match assignment.get("index").and_then(|i| i.as_u64()) {
Some(i) => i as usize,
None => continue,
};
if index >= articles.len() || assigned_indices.contains(&index) {
continue;
}
let cat_name = assignment
.get("category")
.and_then(|c| c.as_str())
.unwrap_or("Autre")
.to_string();
let cat_key = name_to_key
.get(&cat_name.to_lowercase())
.cloned()
.unwrap_or_else(|| "category_autre".to_string());
// Resolve the category name for counting
let cat_display = categories
.iter()
.find(|c| c.to_lowercase() == cat_name.to_lowercase())
.cloned()
.unwrap_or_else(|| "Autre".to_string());
let filled = filled_counts.get(&cat_display).copied().unwrap_or(0);
if filled >= max {
// Category full — assign to Autre if Autre has room
let autre_filled = filled_counts.get("Autre").copied().unwrap_or(0);
if autre_filled < max {
result.entry("category_autre".to_string()).or_default().push(articles[index].clone());
*filled_counts.entry("Autre".to_string()).or_insert(0) += 1;
assigned_indices.insert(index);
}
continue;
}
result.entry(cat_key).or_default().push(articles[index].clone());
*filled_counts.entry(cat_display).or_insert(0) += 1;
assigned_indices.insert(index);
}
// Unclassified articles → Autre
for (i, article) in articles.iter().enumerate() {
if !assigned_indices.contains(&i) {
let autre_filled = filled_counts.get("Autre").copied().unwrap_or(0);
if autre_filled < max {
result.entry("category_autre".to_string()).or_default().push(article.clone());
*filled_counts.entry("Autre".to_string()).or_insert(0) += 1;
}
}
}
result
}
```
- [ ] **Step 2: Add unit tests**
```rust
// ── parse_classification_response tests ─────────────────────
#[test]
fn classification_assigns_articles_to_categories() {
use crate::models::synthesis::ScrapedNewsItem;
let articles = vec![
ScrapedNewsItem { title: "A".into(), url: "https://a.com/1".into(), summary: "s".into(), original_title: "t".into(), scraped_content: "c".into() },
ScrapedNewsItem { title: "B".into(), url: "https://b.com/2".into(), summary: "s".into(), original_title: "t".into(), scraped_content: "c".into() },
];
let categories = vec!["AI News".to_string(), "Autre".to_string()];
let response = serde_json::json!({
"assignments": [
{"index": 0, "category": "AI News"},
{"index": 1, "category": "Autre"}
]
});
let mut filled = HashMap::new();
let result = parse_classification_response(&response, &articles, &categories, 4, &mut filled);
assert_eq!(result.get("category_0").map(|v| v.len()), Some(1));
assert_eq!(result.get("category_autre").map(|v| v.len()), Some(1));
}
#[test]
fn classification_unknown_category_goes_to_autre() {
use crate::models::synthesis::ScrapedNewsItem;
let articles = vec![
ScrapedNewsItem { title: "A".into(), url: "https://a.com/1".into(), summary: "s".into(), original_title: "t".into(), scraped_content: "c".into() },
];
let categories = vec!["AI News".to_string(), "Autre".to_string()];
let response = serde_json::json!({
"assignments": [{"index": 0, "category": "Unknown Category"}]
});
let mut filled = HashMap::new();
let result = parse_classification_response(&response, &articles, &categories, 4, &mut filled);
assert_eq!(result.get("category_autre").map(|v| v.len()), Some(1));
}
#[test]
fn classification_respects_max_per_category() {
use crate::models::synthesis::ScrapedNewsItem;
let articles: Vec<ScrapedNewsItem> = (0..5).map(|i| ScrapedNewsItem {
title: format!("Art{}", i), url: format!("https://a.com/{}", i),
summary: "s".into(), original_title: "t".into(), scraped_content: "c".into(),
}).collect();
let categories = vec!["AI News".to_string(), "Autre".to_string()];
let response = serde_json::json!({
"assignments": (0..5).map(|i| serde_json::json!({"index": i, "category": "AI News"})).collect::<Vec<_>>()
});
let mut filled = HashMap::new();
let result = parse_classification_response(&response, &articles, &categories, 2, &mut filled);
// AI News capped at 2, overflow goes to Autre
assert_eq!(result.get("category_0").map(|v| v.len()), Some(2));
assert!(result.get("category_autre").map(|v| v.len()).unwrap_or(0) > 0);
}
#[test]
fn classification_invalid_index_ignored() {
use crate::models::synthesis::ScrapedNewsItem;
let articles = vec![
ScrapedNewsItem { title: "A".into(), url: "https://a.com/1".into(), summary: "s".into(), original_title: "t".into(), scraped_content: "c".into() },
];
let categories = vec!["AI News".to_string(), "Autre".to_string()];
let response = serde_json::json!({
"assignments": [{"index": 99, "category": "AI News"}]
});
let mut filled = HashMap::new();
let result = parse_classification_response(&response, &articles, &categories, 4, &mut filled);
// Index 99 is invalid → article 0 is unclassified → goes to Autre
assert_eq!(result.get("category_autre").map(|v| v.len()), Some(1));
}
```
- [ ] **Step 3: Run tests**
Run: `cd backend && cargo test --lib`
Expected: all tests pass
- [ ] **Step 4: Commit**
```bash
git add backend/src/services/synthesis.rs
git commit -m "feat: add classification response parsing with category filling and Autre fallback"
```
---
### Task 4: Rewrite `run_generation_inner` with two-phase pipeline
**Files:**
- Modify: `backend/src/services/synthesis.rs`
- Modify: `backend/src/services/prompts.rs`
This is the core task — replace the existing single-pass pipeline with the two-phase approach.
- [ ] **Step 1: Modify `build_search_prompt` to accept category gaps**
In `prompts.rs`, add an optional parameter `category_gaps: Option<&[(String, i32)]>` to `build_search_prompt`. When provided, append to the prompt: "Find specifically: N articles for category X, M articles for category Y."
If `None`, use the existing behavior (all categories, max_items_per_category each).
Update the call signature and all existing callers (add `, None` to existing calls).
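The gap-specific suffix described in this step could be built by a small helper along these lines. This is a minimal sketch: `format_gap_instruction` is a hypothetical name, the English wording follows the example sentence above rather than the existing prompt text, and the real change would live inside `build_search_prompt`.

```rust
/// Hypothetical helper: renders the gap-specific sentence appended to the search prompt.
/// Returns `None` when no gaps are supplied, so the caller keeps the existing behavior.
fn format_gap_instruction(category_gaps: Option<&[(String, i32)]>) -> Option<String> {
    let gaps = category_gaps?;
    // Skip categories that are already full (needed <= 0).
    let parts: Vec<String> = gaps
        .iter()
        .filter(|(_, needed)| *needed > 0)
        .map(|(name, needed)| format!("{} articles for category \"{}\"", needed, name))
        .collect();
    if parts.is_empty() {
        return None;
    }
    Some(format!("Find specifically: {}.", parts.join(", ")))
}
```

Returning `Option<String>` keeps the `None` path identical to today's prompt, which is why existing callers only need the extra `, None` argument.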
- [ ] **Step 2: Rewrite `run_generation_inner`**
Replace the body of `run_generation_inner` with the two-phase pipeline. The new flow:
```
Steps 1-4b: Same as today (load settings, sources, resolve provider, build schema, resolve models)
Step 5: Build rate limiter (same)
PHASE 1 (if sources not empty):
5a. Scrape source pages → extract article links (source_scraper::extract_article_links)
5b. Scrape candidate articles (existing scrape_articles, but from flat URL list)
5c. Filter empty content (filter_empty_scraped_articles)
5d. LLM classification call (build_classification_prompt + generate_rewrite_pass + parse_classification_response)
5e. Enforce max_articles_per_source on classified results
PHASE 2 (if any user-defined category is under-filled):
6a. Compute category gaps
6b. Load recent domains for diversity (same as today)
6c. LLM search pass with gap-aware prompt
6d. Parse + filter_homepage_urls + dedup_by_url (cross-phase) + limit_articles_per_source (cross-phase)
6e. Scrape + filter empty
6f. LLM classification call (same function, with filled_counts from Phase 1)
6g. Fill remaining slots
COMBINED:
7. Merge Phase 1 + Phase 2 results into single HashMap
8. Build rewrite schema (actual counts, omit empty categories, include "Autre" if non-empty)
9. Build rewrite prompt + LLM rewrite pass
10. Build final sections (include "Autre" if non-empty)
11. Restore scraped URLs
12. Sanitize + save to DB
```
Key changes to existing functions:
- `build_rewrite_schema`: skip categories with 0 articles (remove `.max(1)`), add "Autre" support
- `build_final_sections`: add "Autre" section if `category_autre` has articles
- `scrape_articles`: needs a variant that takes a flat `Vec<String>` of URLs (for Phase 1 source articles) instead of `Vec<(String, Vec<NewsItem>)>`
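Step 6d in the flow above deduplicates Phase 2 candidates against URLs already kept in Phase 1. A minimal sketch of that cross-phase check follows. The function name and the light normalization (strip fragment and trailing slash) are assumptions for illustration; the existing `dedup_by_url` may normalize differently.

```rust
use std::collections::HashSet;

/// Hypothetical cross-phase deduplication: drops Phase 2 candidates whose URL was
/// already kept in Phase 1, or that appear more than once in the Phase 2 list.
fn dedup_against_phase_one(phase_one_urls: &[String], candidates: Vec<String>) -> Vec<String> {
    // Assumed normalization: ignore the fragment and any trailing slash.
    fn normalize(url: &str) -> String {
        let without_fragment = url.split('#').next().unwrap_or(url);
        without_fragment.trim_end_matches('/').to_string()
    }

    // Seed the set with Phase 1 URLs so they win over Phase 2 duplicates.
    let mut seen: HashSet<String> = phase_one_urls.iter().map(|u| normalize(u)).collect();
    candidates
        .into_iter()
        .filter(|url| seen.insert(normalize(url)))
        .collect()
}
```

Seeding the set with Phase 1 URLs means personalized-source articles always take precedence over the same article rediscovered through web search.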
- [ ] **Step 3: Remove dead code**
Delete `url_quality_sufficient`, `URL_QUALITY_THRESHOLD`, and their tests.
- [ ] **Step 4: Run tests**
Run: `cd backend && cargo test --lib`
Expected: all tests pass (some existing synthesis tests may need updates for the new pipeline shape)
- [ ] **Step 5: Commit**
```bash
git add backend/src/services/synthesis.rs backend/src/services/prompts.rs
git commit -m "feat: two-phase generation pipeline — personalized sources first, web search fallback"
```
---
### Task 5: Update integration tests
**Files:**
- Modify: `backend/tests/api_syntheses_test.rs`
- [ ] **Step 1: Update `generate_pipeline_resolves_model_from_admin_config`**
The test triggers generation with a fake API key. With the new pipeline, Phase 1 scrapes the configured source and yields no usable articles, so the run falls through to Phase 2, where the LLM search call fails because of the fake key. The test should still verify that the pipeline fails at the LLM call level, not at model resolution, and that it does not crash with a database error.
Set the test's source URL to a page that returns valid HTML but no article links (e.g., keep `https://example.com`, which returns a simple page). The pipeline should then skip Phase 1 gracefully and attempt Phase 2.
- [ ] **Step 2: Run integration tests**
Run: `cd backend && cargo test --no-run` (compile check, integration tests need Postgres)
Expected: compiles without errors
- [ ] **Step 3: Commit**
```bash
git add backend/tests/api_syntheses_test.rs
git commit -m "test: update integration tests for two-phase generation pipeline"
```
---
### Task 6: Update E2E test
**Files:**
- Modify: `e2e/tests/generation-live.spec.ts`
- [ ] **Step 1: Update settings payload**
Ensure `max_articles_per_source` and `source_diversity_window` are included (already done from previous work, but verify).
- [ ] **Step 2: Add personalized source assertion**
After generation completes, verify that at least one article URL matches the configured source domain (e.g., if source is `https://openai.com/blog`, check that at least one article URL contains `openai.com`).
- [ ] **Step 3: Verify "Autre" behavior**
Add an assertion that if "Autre" appears as a section title, it has at most `max_items_per_category` articles and each article has valid content.
- [ ] **Step 4: Run E2E test**
Start Docker stack, seed, run:
```bash
cd e2e && docker compose -f docker-compose.test.yml up --build -d
sleep 25 && npx tsx seed.ts
npx playwright test generation-live --reporter=list
```
Expected: test passes
- [ ] **Step 5: Commit**
```bash
git add e2e/tests/generation-live.spec.ts
git commit -m "test: update E2E test for two-phase pipeline with source priority"
```

@ -0,0 +1,151 @@
# Design: Source Priority Pipeline — Personalized Sources First, Web Search Fallback
**Date**: 2026-03-24
**Scope**: Redesign the synthesis generation pipeline to prioritize personalized sources with scraping, fall back to web search for gaps
---
## Context
The current pipeline sends a single LLM call that mixes personalized sources and web search together. There is no prioritization, no retry when articles fail validation, and no fallback mechanism. The LLM decides freely which sources to use, often ignoring personalized ones.
## New Pipeline (Two-Phase)
### Phase 1: Personalized Sources (scrape-based, no LLM for discovery)
**Skipped entirely if the user has 0 configured sources.** In that case the pipeline proceeds directly to Phase 2.
1. For each user source URL (e.g., `https://openai.com/blog`), scrape the page and extract article links (max 10 sources processed, to bound scraping work)
2. Filter links: same domain only, non-empty path (not just `/`), exclude non-article patterns
3. Normalize and deduplicate URLs, fetch up to `2 × max_articles_per_source` candidates per source (over-fetch to compensate for validation failures)
4. Scrape each candidate article (existing scraper: validate date, soft 404, content)
5. Filter out articles with empty scraped content (too old, failed, soft 404)
6. **LLM classification call**: send articles (title + first 500 chars of body) + user categories + "Autre" → LLM returns article-to-category mapping
7. Fill categories from the mapping, respecting `max_items_per_category` per category (including "Autre")
8. Trim excess: after classification, enforce `max_articles_per_source` per domain across all categories
**If all source scrapes fail** (network errors, JS-rendered sites, etc.), Phase 1 produces 0 articles. Pipeline falls through to Phase 2 cleanly.
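The per-domain trim in step 8 can be sketched as a single ordered pass. The `Classified` struct and `trim_per_domain` name are hypothetical stand-ins; the real pipeline works on `ScrapedNewsItem` values grouped by category and would derive the domain from the article URL.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a classified article; only the fields needed here.
#[derive(Debug, Clone, PartialEq)]
struct Classified {
    category_key: String,
    domain: String,
    url: String,
}

/// Keeps at most `max_per_source` articles per domain, counted across all categories.
/// Input order is preserved, so earlier articles win.
fn trim_per_domain(articles: Vec<Classified>, max_per_source: usize) -> Vec<Classified> {
    let mut per_domain: HashMap<String, usize> = HashMap::new();
    articles
        .into_iter()
        .filter(|a| {
            let count = per_domain.entry(a.domain.clone()).or_insert(0);
            if *count < max_per_source {
                *count += 1;
                true
            } else {
                false
            }
        })
        .collect()
}
```

Because the counter is shared across categories, three articles from one domain spread over three categories still count as three toward the limit.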
### Phase 2: Web Search Fallback (LLM-based)
Only runs if any **user-defined** category is still under `max_items_per_category` after Phase 1. ("Autre" does not trigger Phase 2 — it only collects overflow.)
1. Compute category gaps: for each user-defined category, `needed = max_items_per_category - already_filled`
2. Run the LLM search pass with a modified prompt: include the gap counts per category ("find N articles for AI News, M articles for Cybersecurity")
3. Apply existing filters: `filter_homepage_urls`, `dedup_by_url` (cross-phase — dedup against Phase 1 URLs), `limit_articles_per_source` (cross-phase — count Phase 1 domains)
4. Scrape + validate web search results (existing scraper)
5. Filter out articles with empty scraped content
6. **LLM classification call** (same function as Phase 1): classify web search articles into remaining category gaps (including "Autre" for overflow)
7. Fill remaining category slots, respecting limits
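The gap computation in step 1, together with the rule that "Autre" never triggers Phase 2, might look like the sketch below. `compute_category_gaps` is a hypothetical name, and it uses `usize` counts for simplicity, whereas the plan's `build_search_prompt` parameter uses `i32`.

```rust
use std::collections::HashMap;

/// Returns `(category, needed)` for each user-defined category still under the cap.
/// "Autre" is skipped: it only collects overflow and never triggers Phase 2.
fn compute_category_gaps(
    user_categories: &[String],
    filled_counts: &HashMap<String, usize>,
    max_items_per_category: usize,
) -> Vec<(String, usize)> {
    user_categories
        .iter()
        .filter(|cat| cat.as_str() != "Autre")
        .filter_map(|cat| {
            let filled = filled_counts.get(cat).copied().unwrap_or(0);
            // saturating_sub guards against a category that is somehow over the cap.
            let needed = max_items_per_category.saturating_sub(filled);
            (needed > 0).then(|| (cat.clone(), needed))
        })
        .collect()
}
```

Under this sketch, Phase 2 runs exactly when the returned list is non-empty.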
### Combined Rewrite Pass
After both phases, merge all classified articles into a single `HashMap<String, Vec<ScrapedNewsItem>>` keyed by category. Run the rewrite pass on the combined set. The rewrite schema uses actual item counts per category. Categories with 0 articles are omitted from the schema (no hallucinated articles).
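A minimal sketch of the merge, assuming both phases produce maps keyed by category key (`category_0`, `category_autre`, and so on). The helper names are hypothetical, and the item type is left generic so the example does not depend on `ScrapedNewsItem`.

```rust
use std::collections::HashMap;

/// Merges Phase 2 results into Phase 1 results, then drops empty categories so the
/// rewrite schema only lists categories that actually hold articles.
fn merge_phases<T>(
    mut phase_one: HashMap<String, Vec<T>>,
    phase_two: HashMap<String, Vec<T>>,
) -> HashMap<String, Vec<T>> {
    for (key, mut items) in phase_two {
        // Phase 1 articles stay first within each category.
        phase_one.entry(key).or_default().append(&mut items);
    }
    phase_one.retain(|_, items| !items.is_empty());
    phase_one
}

/// Item count per category key, sorted by key for a stable schema order.
fn item_counts<T>(merged: &HashMap<String, Vec<T>>) -> Vec<(String, usize)> {
    let mut counts: Vec<(String, usize)> =
        merged.iter().map(|(k, v)| (k.clone(), v.len())).collect();
    counts.sort();
    counts
}
```

The counts returned by `item_counts` are what the rewrite schema would use as the exact number of items per category, so the LLM is never asked to fill a slot that has no source article.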
## "Autre" Default Category
- Always exists as a fallback classification category, regardless of user settings
- Articles that don't fit any user-defined category are assigned to "Autre"
- Capped at `max_items_per_category` (same limit as user categories)
- Only included in the final synthesis if it has articles (not shown when empty)
- Not a user setting — hardcoded in the pipeline
- Uses category key `category_autre` in the internal data structures
- Included in `build_rewrite_schema`, `build_final_sections`, and `restore_scraped_urls` when it has articles
- `limit_articles_per_source` and `dedup_by_url` treat "Autre" articles the same as any other category
## Source Page Scraping (new module: `source_scraper.rs`)
Fetches a source URL and extracts article links:
1. Fetch page HTML (reuse existing scraper HTTP client with 15s timeout)
2. Extract all `<a href>` links using `scraper` crate (already a dependency)
3. Filter:
- Same domain only (no external links)
- Path must be non-empty and not just `/` (allows single-segment paths like `/my-article`)
- Exclude patterns: `/tag/`, `/category/`, `/author/`, `/page/`, `/login`, `/signup`, `/privacy`, `/terms`, `/search`, `/contact`, `/about`
- Exclude static assets: `.css`, `.js`, `.png`, `.jpg`, `.jpeg`, `.gif`, `.svg`, `.pdf`, `.zip`, `.xml`, `.json`, `.ico`, `.woff`, `.woff2`
4. Normalize URLs (resolve relative paths against base URL, deduplicate)
5. Limit to `2 × max_articles_per_source` per source (over-fetch)
6. Return `Vec<String>` of candidate article URLs
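The two limits involved here (at most 10 sources per run, and `2 × max_articles_per_source` candidates per source) could be applied up front, before any network call. This is a sketch only: `plan_source_fetches` and the `MAX_SOURCES` constant are hypothetical names, and the resulting budget is what a caller would pass as `max_links` to `extract_article_links`.

```rust
/// Cap on how many configured sources are scraped per run (from the Phase 1 description).
const MAX_SOURCES: usize = 10;

/// Hypothetical helper: pairs each source that will be scraped with its candidate budget.
fn plan_source_fetches(sources: &[String], max_articles_per_source: usize) -> Vec<(String, usize)> {
    // Over-fetch by 2x because some candidates fail date, soft-404, or content checks.
    let budget = max_articles_per_source.saturating_mul(2);
    sources
        .iter()
        .take(MAX_SOURCES)
        .map(|s| (s.clone(), budget))
        .collect()
}
```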
**Known limitations:**
- JavaScript-rendered pages (React/Next.js SPAs) will return empty or navigation-only content. The pipeline degrades gracefully — Phase 2 web search fills the gaps.
- RSS/Atom feeds are not used in v1. Could be added as a future enhancement for more reliable article discovery.
## Classification LLM Call
A lightweight LLM request for assigning articles to categories.
**Input:**
- List of articles: `[{index, title, url, snippet (first 500 chars of the body)}]`
- List of categories: user categories + "Autre"
- Already-filled category counts (for Phase 2: "AI News already has 3/4")
- Max items per category
**Prompt:** "Classify each article into the most appropriate category. Each category, including 'Autre', accepts at most N articles. Return a JSON mapping."
**Output schema:**
```json
{
  "type": "object",
  "properties": {
    "assignments": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "index": { "type": "integer" },
          "category": { "type": "string" }
        },
        "required": ["index", "category"],
        "additionalProperties": false
      }
    }
  },
  "required": ["assignments"],
  "additionalProperties": false
}
```
**Error handling:**
- Invalid article index → ignored (skip that assignment)
- Category name not matching any user category or "Autre" → assign to "Autre"
- Missing assignments (not all articles classified) → unclassified articles assigned to "Autre"
- Case-insensitive category matching
**Model:** Uses `model_research` (same as search pass).
**LLM dispatch:** Reuse `generate_rewrite_pass` (Chat Completions API, no web search needed). The classification call uses `model_research` even though it goes through the "rewrite" method — the method is provider-agnostic and just sends a structured prompt.
## source_diversity_window Interaction
- **Phase 1 (personalized sources):** The diversity window does NOT apply. Personalized sources are explicitly chosen by the user and always scraped, even if their domain appeared in recent syntheses.
- **Phase 2 (web search):** The diversity window applies as today — recent domains are injected as a soft "avoid if possible" instruction in the search prompt.
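The phase-dependent rule above can be expressed as a small guard. This is a sketch under assumptions: the `Phase` enum, the function name, and the English instruction text are all hypothetical, and the existing search prompt may phrase the hint differently.

```rust
/// Hypothetical marker for which phase is building a prompt or scrape list.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Phase {
    PersonalizedSources,
    WebSearch,
}

/// Returns the soft "avoid if possible" hint for the search prompt.
/// Phase 1 never receives it: user-chosen sources are always scraped.
fn diversity_instruction(phase: Phase, recent_domains: &[String]) -> Option<String> {
    if phase != Phase::WebSearch || recent_domains.is_empty() {
        return None;
    }
    Some(format!(
        "Avoid these recently used domains if possible: {}.",
        recent_domains.join(", ")
    ))
}
```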
## Bug Fixes Included
1. **`build_rewrite_schema` forcing `minItems: 1` for empty categories** — Categories with 0 articles are omitted from the rewrite schema entirely. No hallucinated articles.
2. **Dead code removal**: `url_quality_sufficient` and `URL_QUALITY_THRESHOLD` are removed, along with their tests.
## Files to Modify
- **Create:** `backend/src/services/source_scraper.rs` — source page scraping + article link extraction
- **Modify:** `backend/src/services/mod.rs` — register `source_scraper` module
- **Modify:** `backend/src/services/synthesis.rs` — rewrite `run_generation_inner` with two-phase pipeline, classification response parsing, category filling logic, "Autre" handling in `build_rewrite_schema` and `build_final_sections`
- **Modify:** `backend/src/services/prompts.rs` — add `build_classification_prompt`, modify `build_search_prompt` to accept category gaps (how many items still needed per category)
- **Modify:** `backend/src/services/llm/schema.rs` — add `build_classification_schema`
- **Modify:** `backend/tests/api_syntheses_test.rs` — update generation pipeline integration test
- **Modify:** `e2e/tests/generation-live.spec.ts` — update settings, add assertions for personalized source articles and "Autre" category
- **Add:** unit tests in `source_scraper.rs` — link extraction, filtering, deduplication, edge cases
- **Add:** unit tests in `prompts.rs` — classification prompt generation
- **Add:** unit tests in `synthesis.rs` — classification parsing, category filling, two-phase integration, "Autre" handling
## What Does NOT Change
- Frontend — no UI changes
- Database/migrations — no schema changes
- User settings — no new fields
- Individual article scraper (`scraper.rs`) — reused as-is
- LLM provider trait and implementations — reused as-is (classification uses `generate_rewrite_pass`)
- `restore_scraped_urls`, `sanitize_json_null_bytes` — reused as-is
- `filter_empty_scraped_articles` — reused as-is