docs: add algorithm rewrite implementation plan (7 tasks)

3 months ago · d3b63295f6
parent 1d5dc0596c
commit d3b63295f6
1 changed files with 688 additions and 0 deletions
--- a/docs/superpowers/plans/2026-03-25-algorithm-rewrite.md
+++ b/docs/superpowers/plans/2026-03-25-algorithm-rewrite.md
@@ -0,0 +1,688 @@
+# Algorithm Rewrite — Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Rewrite the synthesis generation pipeline: per-article LLM classify/summarize, source rotation, no rewrite pass, remove deprecated settings.
+
+**Architecture:** Complete rewrite of `synthesis.rs` with a simpler two-phase pipeline. Phase 1: scrape personalized sources sequentially, classify/summarize each article with one LLM call. Phase 2: LLM search for gaps, scrape for validation. No batch classification, no rewrite pass.
+
+**Tech Stack:** Rust (sqlx, reqwest, scraper), existing LLM providers
+
+**Spec:** `docs/superpowers/specs/2026-03-25-algorithm-rewrite-design.md`
+**Algorithm:** `docs/algorithm.md`
+
+---
+
+### Task 1: Migration — drop deprecated settings columns
+
+**Files:**
+- Create: `backend/migrations/20260325000018_drop_deprecated_settings.sql`
+- Modify: `backend/src/models/settings.rs`
+- Modify: `backend/src/db/settings.rs`
+- Modify: `backend/src/services/prompts.rs` (test fixture)
+- Modify: `CLAUDE.md`
+
+- [ ] **Step 1: Create migration**
+
+```sql
+ALTER TABLE settings DROP COLUMN source_diversity_window;
+ALTER TABLE settings DROP COLUMN use_llm_for_article_extraction;
+```
+
+- [ ] **Step 2: Remove from settings model**
+
+In `models/settings.rs`, remove `source_diversity_window: i32` and `use_llm_for_article_extraction: bool` from `UserSettings`, `SettingsResponse`, `UpdateSettingsRequest`, `From` impl, `Default` impl, and validation.
+
+- [ ] **Step 3: Remove from DB queries**
+
+In `db/settings.rs`, remove both fields from `SettingsRow`, `TryFrom`, and both SQL queries (column lists, VALUES, RETURNING, ON CONFLICT SET, `.bind()` calls). Renumber the remaining `$N` placeholders so they stay contiguous and still match the order of the `.bind()` calls.
+
+- [ ] **Step 4: Update test fixtures**
+
+Remove both fields from `valid_request()` in settings tests and `test_settings()` in prompts tests. Remove any validation tests for these fields.
+
+- [ ] **Step 5: Update CLAUDE.md migration count to 18**
+
+- [ ] **Step 6: Verify + commit**
+
+```bash
+cd backend && cargo test --lib
+git add backend/migrations/20260325000018_drop_deprecated_settings.sql backend/src/models/settings.rs backend/src/db/settings.rs backend/src/services/prompts.rs CLAUDE.md
+git commit -m "feat: drop source_diversity_window and use_llm_for_article_extraction settings"
+```
+
+---
+
+### Task 2: New prompt + schema for per-article classify/summarize
+
+**Files:**
+- Modify: `backend/src/services/prompts.rs`
+- Modify: `backend/src/services/llm/schema.rs`
+
+- [ ] **Step 1: Add `build_article_classify_prompt` to prompts.rs**
+
+```rust
+/// Build a prompt for per-article classification and summarization.
+///
+/// The LLM classifies the article into a category and generates a title + summary.
+pub fn build_article_classify_prompt(
+    title: &str,
+    body_snippet: &str,
+    categories: &[String], // includes "Autre"
+) -> (String, String) {
+    let system_prompt =
+        "Tu es un assistant qui analyse des articles d'actualite. \
+         Tu dois classer l'article dans une categorie et generer un titre et un resume. \
+         Reponds uniquement au format JSON demande."
+            .to_string();
+
+    let categories_list = categories
+        .iter()
+        .map(|c| format!("- \"{}\"", c))
+        .collect::<Vec<_>>()
+        .join("\n");
+
+    let user_prompt = format!(
+        "Voici un article d'actualite.\n\n\
+         Titre : {title}\n\n\
+         Contenu (extrait) :\n{body}\n\n\
+         Categories disponibles :\n{categories}\n\n\
+         Classe cet article dans la categorie la plus appropriee.\n\
+         Si aucune categorie ne correspond, utilise \"Autre\".\n\
+         Genere un titre clair et un resume de 4 a 5 lignes.\n\
+         Si le titre fourni est vide, genere un titre a partir du contenu.",
+        title = if title.is_empty() { "(pas de titre)" } else { title },
+        body = body_snippet,
+        categories = categories_list,
+    );
+
+    (system_prompt, user_prompt)
+}
+```
+
+- [ ] **Step 2: Add `build_article_classify_schema` to schema.rs**
+
+```rust
+/// Build a JSON Schema for per-article classification and summarization.
+pub fn build_article_classify_schema() -> Value {
+    serde_json::json!({
+        "type": "object",
+        "properties": {
+            "title": { "type": "string", "description": "Article title" },
+            "summary": { "type": "string", "description": "4-5 line summary of the article" },
+            "category": { "type": "string", "description": "Category name from the provided list" }
+        },
+        "required": ["title", "summary", "category"],
+        "additionalProperties": false
+    })
+}
+```
+
+- [ ] **Step 3: Add tests**
+
+In prompts.rs tests:
+```rust
+    #[test]
+    fn article_classify_prompt_includes_content() {
+        let (sys, user) = build_article_classify_prompt("GPT-5 Released", "OpenAI released GPT-5", &["AI News".into(), "Autre".into()]);
+        assert!(user.contains("GPT-5 Released"));
+        assert!(user.contains("AI News"));
+        assert!(user.contains("Autre"));
+        assert!(sys.contains("classer"));
+    }
+
+    #[test]
+    fn article_classify_prompt_handles_empty_title() {
+        let (_, user) = build_article_classify_prompt("", "Some content", &["Tech".into(), "Autre".into()]);
+        assert!(user.contains("(pas de titre)"));
+    }
+```
+
+In schema.rs tests:
+```rust
+    #[test]
+    fn article_classify_schema_has_all_fields() {
+        let schema = build_article_classify_schema();
+        let props = schema["properties"].as_object().unwrap();
+        assert!(props.contains_key("title"));
+        assert!(props.contains_key("summary"));
+        assert!(props.contains_key("category"));
+        assert_eq!(schema["additionalProperties"], false);
+    }
+```
+
+- [ ] **Step 4: Verify + commit**
+
+```bash
+cd backend && cargo test --lib
+git add backend/src/services/prompts.rs backend/src/services/llm/schema.rs
+git commit -m "feat: add per-article classify/summarize prompt and schema"
+```
+
+---
+
+### Task 3: Add `get_last_source_url` to article_history DB + simplify ScrapedContent
+
+**Files:**
+- Modify: `backend/src/db/article_history.rs`
+- Modify: `backend/src/services/scraper.rs`
+
+- [ ] **Step 1: Add `get_last_source_url`**
+
+```rust
+/// Get the source_url from the most recent 'used' entry for source rotation.
+pub async fn get_last_source_url(
+    pool: &PgPool,
+    user_id: Uuid,
+) -> Result<Option<String>, AppError> {
+    let result = sqlx::query_scalar::<_, String>(
+        "SELECT source_url FROM article_history WHERE user_id = $1 AND status = 'used' AND source_url IS NOT NULL ORDER BY created_at DESC LIMIT 1",
+    )
+    .bind(user_id)
+    .fetch_optional(pool)
+    .await?;
+    Ok(result)
+}
+```
+
+- [ ] **Step 2: Remove `head_html` from `ScrapedContent`**
+
+In `scraper.rs`, remove `pub head_html: String` from the `ScrapedContent` struct. Remove the `head_html` extraction code in `scrape_url` (the block that finds `<head>...</head>`). Remove `head_html` from the return struct construction.
+
+Removing the field breaks every remaining reader of `content.head_html`. In `source_scraper.rs`, `extract_article_links_with_llm` should rely on its own `extract_head_and_body` function rather than on `ScrapedContent.head_html`; confirm this and fix any reference that remains.
+
+`scrape_single_article_with_llm` in `synthesis.rs` also reads `content.head_html`. That function is deleted in Task 5 but must still compile here: either replace `content.head_html` with `String::new()` as a temporary measure, or delete the function now.
+
+- [ ] **Step 3: Verify + commit**
+
+```bash
+cd backend && cargo test --lib
+git add backend/src/db/article_history.rs backend/src/services/scraper.rs backend/src/services/synthesis.rs
+git commit -m "feat: add get_last_source_url + remove head_html from ScrapedContent"
+```
+
+---
+
+### Task 4: Remove old prompts, schemas, and unused code
+
+**Files:**
+- Modify: `backend/src/services/prompts.rs`
+- Modify: `backend/src/services/llm/schema.rs`
+
+- [ ] **Step 1: Remove old prompts from prompts.rs**
+
+Remove these functions and their tests:
+- `build_rewrite_prompt`
+- `build_classification_prompt`
+- `build_article_extraction_prompt`
+
+Keep `build_link_extraction_prompt`: the source scraper still uses it for LLM link extraction.
+
+Keep the `category_gaps: Option<&[(String, i32)]>` parameter of `build_search_prompt`: Phase 2 still uses gap-aware search.
+
+Remove `use crate::models::synthesis::ScrapedNewsItem;` if nothing else needs it (check whether `build_classification_prompt` was its only user).
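+
+For orientation, the `category_gaps` value that Phase 2 passes to `build_search_prompt` is a list of `(category, missing_count)` pairs. A minimal, std-only sketch of how such a list can be derived is shown here; `compute_category_gaps` is a hypothetical helper name used only for illustration (the Task 5 listing computes the same thing inline).
+
+```rust
+use std::collections::HashMap;
+
+/// Compute, for each user category, how many items are still missing.
+/// `filled` maps a category name to the number of items already placed.
+/// Categories that are already full are left out of the result.
+/// NOTE: hypothetical helper, not part of the existing codebase.
+fn compute_category_gaps(
+    categories: &[String],
+    filled: &HashMap<String, usize>,
+    max_items_per_category: usize,
+) -> Vec<(String, i32)> {
+    categories
+        .iter()
+        .filter_map(|cat| {
+            let have = filled.get(cat).copied().unwrap_or(0);
+            // saturating_sub avoids underflow if a category is over-filled.
+            let needed = max_items_per_category.saturating_sub(have);
+            (needed > 0).then(|| (cat.clone(), needed as i32))
+        })
+        .collect()
+}
+```
+
+An empty result means Phase 1 filled every user category, so the search call can be skipped entirely.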
+
+- [ ] **Step 2: Remove old schemas from schema.rs**
+
+Remove: `build_classification_schema`, `build_article_extraction_schema`
+Keep: `build_category_schema` (Phase 2 search), `build_link_extraction_schema` (source scraper), `build_article_classify_schema` (new)
+
+- [ ] **Step 3: Verify + commit**
+
+```bash
+cd backend && cargo test --lib
+git add backend/src/services/prompts.rs backend/src/services/llm/schema.rs
+git commit -m "refactor: remove old classification, rewrite, and article extraction prompts/schemas"
+```
+
+---
+
+### Task 5: Rewrite `synthesis.rs` — the core pipeline
+
+**Files:**
+- Modify: `backend/src/services/synthesis.rs`
+
+This is the largest task. The entire `run_generation_inner` function is rewritten. Many helper functions are removed.
+
+- [ ] **Step 1: Remove dead helper functions**
+
+Delete these functions and their tests from `synthesis.rs`:
+- `scrape_single_article_with_llm`
+- `scrape_flat_urls`
+- `scrape_articles`
+- `filter_empty_scraped_articles`
+- `build_rewrite_schema`
+- `build_final_sections`
+- `restore_scraped_urls`
+- `parse_classification_response`
+- `limit_articles_per_source`
+- `dedup_by_url`
+- `filter_homepage_urls`
+- `SYNTHESIS_MIN_FILL_RATIO` constant
+- All associated tests for these functions
+
+Keep:
+- `scrape_single_article` (used for Phase 1 per-article scraping)
+- `emit_progress`
+- `trace_article`
+- `log_llm_call`
+- `normalize_article_url` / `hash_article_url`
+- `extract_domain`
+- `resolve_provider_and_key` / `resolve_model`
+- `check_rate_limit` / `get_user_rate_limiter`
+- `sanitize_json_null_bytes`
+- `sanitize_error_message`
+- `get_iso_week_string`
+- `parse_llm_output` (used in Phase 2)
+
+- [ ] **Step 2: Add `rotate_sources` helper**
+
+```rust
+/// Rotate the sources list so that the source after the last-used source comes first.
+fn rotate_sources(sources: Vec<Source>, last_source_url: Option<&str>) -> Vec<Source> {
+    let Some(last_url) = last_source_url else {
+        return sources;
+    };
+
+    let pos = sources.iter().position(|s| s.url == last_url);
+    match pos {
+        Some(idx) => {
+            let next = (idx + 1) % sources.len();
+            let mut rotated = sources[next..].to_vec();
+            rotated.extend_from_slice(&sources[..next]);
+            rotated
+        }
+        None => sources, // Last source not in list, don't rotate
+    }
+}
+```
+
+- [ ] **Step 3: Rewrite `run_generation_inner`**
+
+Replace the entire function body with the new algorithm. The new flow:
+
+```rust
+async fn run_generation_inner(
+    job_id: Uuid,
+    state: &AppState,
+    user_id: Uuid,
+    tx: &watch::Sender<ProgressEvent>,
+) -> Result<Uuid, AppError> {
+    // === INITIALIZATION ===
+    emit_progress(tx, "settings", "Chargement des parametres...", 5);
+    let settings = db::settings::get_or_create_default(&state.pool, user_id).await?;
+
+    // Cleanup
+    if settings.article_history_days > 0 {
+        db::article_history::cleanup_old(&state.pool, user_id, settings.article_history_days).await.unwrap_or(0);
+        db::llm_call_log::truncate_old(&state.pool, user_id, settings.article_history_days).await.ok();
+    }
+
+    // Categories: the user's list may be empty, in which case only "Autre" is available
+    let user_categories = settings.categories.clone();
+    let mut classification_categories = user_categories.clone();
+    classification_categories.push("Autre".to_string());
+
+    // Load sources
+    emit_progress(tx, "sources", "Chargement des sources...", 10);
+    let sources = db::sources::list_for_user(&state.pool, user_id).await?;
+
+    // Resolve provider
+    emit_progress(tx, "provider", "Configuration du fournisseur IA...", 12);
+    let (provider_name, api_key) = resolve_provider_and_key(state, user_id, &settings).await?;
+    let provider = create_provider(&provider_name, api_key)?;
+    let model_research = if !settings.ai_model.is_empty() { settings.ai_model.clone() } else { resolve_model(state, &provider_name).await? };
+    let model_writing = if !settings.ai_model_writing.is_empty() { settings.ai_model_writing.clone() } else { model_research.clone() };
+    let user_rate_limiter = get_user_rate_limiter(state, &settings, user_id);
+
+    // Tracking structures
+    let mut article_scraped: HashMap<String, Vec<NewsItem>> = HashMap::new();
+    let mut source_counts: HashMap<String, usize> = HashMap::new();
+    let mut url_source: HashMap<String, String> = HashMap::new(); // url → source_url
+    let mut filled_counts: HashMap<String, usize> = HashMap::new();
+    let mut seen_urls: std::collections::HashSet<String> = std::collections::HashSet::new();
+    let max_total = (user_categories.len() + 1) * settings.max_items_per_category as usize;
+    let classify_schema = build_article_classify_schema();
+
+    // === PHASE 1: Personalized Sources ===
+    if !sources.is_empty() {
+        emit_progress(tx, "sources_scrape", "Analyse des sources personnalisees...", 15);
+
+        // 1a. Rotate sources
+        let last_source = db::article_history::get_last_source_url(&state.pool, user_id).await.unwrap_or(None);
+        let rotated_sources = rotate_sources(sources.clone(), last_source.as_deref());
+        let max_sources = rotated_sources.len().min(10);
+        let max_links = 10usize;
+
+        let mut candidate_urls: Vec<(String, String)> = Vec::new(); // (article_url, source_url)
+
+        for source in rotated_sources.iter().take(max_sources) {
+            let links = if settings.use_llm_for_source_links {
+                source_scraper::extract_article_links_with_llm(
+                    &state.http_client, &source.url, max_links, &provider, &model_research,
+                ).await
+            } else {
+                source_scraper::extract_article_links(
+                    &state.http_client, &source.url, max_links,
+                ).await
+            };
+
+            if let Ok(links) = links {
+                for link in links {
+                    if seen_urls.insert(link.to_lowercase()) {
+                        candidate_urls.push((link, source.url.clone()));
+                    }
+                }
+            }
+        }
+
+        // Filter against article history
+        if settings.article_history_days > 0 && !candidate_urls.is_empty() {
+            let hashes: Vec<String> = candidate_urls.iter().map(|(url, _)| hash_article_url(url)).collect();
+            let existing = db::article_history::check_urls_exist(&state.pool, user_id, &hashes).await.unwrap_or_default();
+            if !existing.is_empty() {
+                // Trace filtered articles
+                for (url, source_url) in &candidate_urls {
+                    if existing.contains(&hash_article_url(url)) {
+                        trace_article(&state.pool, user_id, job_id, url, "", "personalized_source", Some(source_url), None, None, "filtered_history", false).await;
+                    }
+                }
+                candidate_urls.retain(|(url, _)| !existing.contains(&hash_article_url(url)));
+            }
+        }
+
+        // Track url → source
+        for (url, source_url) in &candidate_urls {
+            url_source.insert(url.clone(), source_url.clone());
+        }
+
+        // 1b. Scrape, classify, summarize each article
+        emit_progress(tx, "processing", "Traitement des articles...", 25);
+        let total_candidates = candidate_urls.len();
+
+        for (idx, (url, source_url)) in candidate_urls.into_iter().enumerate() {
+            // Progress
+            let pct = 25 + ((idx as u32 * 40) / total_candidates.max(1) as u32).min(40);
+            emit_progress(tx, "processing", &format!("Article {}/{}...", idx + 1, total_candidates), pct as u8);
+
+            // Check source limit
+            let source_domain = extract_domain(&source_url).unwrap_or_default();
+            let source_count = source_counts.get(&source_domain).copied().unwrap_or(0);
+            if source_count >= settings.max_articles_per_source as usize {
+                trace_article(&state.pool, user_id, job_id, &url, "", "personalized_source", Some(&source_url), None, None, "filtered_diversity", false).await;
+                continue;
+            }
+
+            // Scrape
+            let (body_text, page_title, final_url) = scrape_single_article(&state.http_client, &url, settings.max_age_days as i64).await;
+
+            if body_text.trim().is_empty() {
+                trace_article(&state.pool, user_id, job_id, &final_url, &page_title, "personalized_source", Some(&source_url), None, None, "filtered_empty", false).await;
+                continue;
+            }
+
+            // LLM classify + summarize
+            check_rate_limit(state, &user_rate_limiter, &provider_name)?;
+            let body_snippet: String = body_text.chars().take(500).collect();
+            let (class_sys, class_user) = prompts::build_article_classify_prompt(&page_title, &body_snippet, &classification_categories);
+
+            let llm_start = std::time::Instant::now();
+            let class_response = provider.call_llm(&model_research, &class_sys, &class_user, &classify_schema).await?;
+            let llm_duration = llm_start.elapsed().as_millis() as u64;
+            log_llm_call(&state.pool, user_id, job_id, "classify_summarize", &model_research, &class_sys, &class_user, &class_response, llm_duration).await;
+
+            // Parse response
+            let llm_title = class_response.get("title").and_then(|t| t.as_str()).unwrap_or(&page_title).to_string();
+            let llm_summary = class_response.get("summary").and_then(|s| s.as_str()).unwrap_or("").to_string();
+            let mut llm_category = class_response.get("category").and_then(|c| c.as_str()).unwrap_or("Autre").to_string();
+
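+            // Validate category and normalize it to the configured spelling, so the
+            // `filled_counts` keys match `user_categories`; anything unknown becomes "Autre"
+            llm_category = classification_categories
+                .iter()
+                .find(|c| c.to_lowercase() == llm_category.to_lowercase())
+                .cloned()
+                .unwrap_or_else(|| "Autre".to_string());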
+
+            // Map category to key
+            let cat_key = if llm_category == "Autre" {
+                "category_autre".to_string()
+            } else {
+                user_categories.iter().position(|c| c.to_lowercase() == llm_category.to_lowercase())
+                    .map(|i| format!("category_{}", i))
+                    .unwrap_or_else(|| "category_autre".to_string())
+            };
+
+            // Check if category is full → overflow to "Autre"
+            let cat_filled = filled_counts.get(&llm_category).copied().unwrap_or(0);
+            let (final_cat_key, final_cat_name) = if cat_filled >= settings.max_items_per_category as usize && llm_category != "Autre" {
+                let autre_filled = filled_counts.get("Autre").copied().unwrap_or(0);
+                if autre_filled >= settings.max_items_per_category as usize {
+                    // Both full — skip article
+                    continue;
+                }
+                ("category_autre".to_string(), "Autre".to_string())
+            } else {
+                (cat_key, llm_category)
+            };
+
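+            // Add article
+            article_scraped.entry(final_cat_key).or_default().push(NewsItem {
+                title: llm_title,
+                url: final_url.clone(),
+                summary: llm_summary,
+            });
+            // The stored URL may differ from the candidate URL after a redirect:
+            // register it for Phase 2 dedup and for the source lookup in the SAVE step
+            seen_urls.insert(final_url.to_lowercase());
+            url_source.insert(final_url, source_url);
+            *filled_counts.entry(final_cat_name).or_insert(0) += 1;
+            *source_counts.entry(source_domain).or_insert(0) += 1;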
+
+            // Check if we've reached the maximum
+            let total: usize = article_scraped.values().map(|v| v.len()).sum();
+            if total >= max_total {
+                break;
+            }
+        }
+    }
+
+    // === PHASE 2: Web Search Fallback ===
+    let category_gaps: Vec<(String, i32)> = user_categories.iter().filter_map(|cat| {
+        let filled = filled_counts.get(cat).copied().unwrap_or(0);
+        let needed = (settings.max_items_per_category as usize).saturating_sub(filled);
+        if needed > 0 { Some((cat.clone(), needed as i32)) } else { None }
+    }).collect();
+
+    if !category_gaps.is_empty() {
+        emit_progress(tx, "search", "Recherche d'actualites complementaires...", 70);
+        check_rate_limit(state, &user_rate_limiter, &provider_name)?;
+
+        let search_schema = build_category_schema(&user_categories, settings.max_items_per_category);
+        let current_date = Utc::now().format("%A %d %B %Y").to_string();
+        let (sys_prompt, usr_prompt) = prompts::build_search_prompt(&settings, &sources, &current_date, &[], Some(&category_gaps));
+
+        let llm_start = std::time::Instant::now();
+        let raw_results = provider.call_llm(&model_research, &sys_prompt, &usr_prompt, &search_schema).await?;
+        let llm_duration = llm_start.elapsed().as_millis() as u64;
+        log_llm_call(&state.pool, user_id, job_id, "search", &model_research, &sys_prompt, &usr_prompt, &raw_results, llm_duration).await;
+
+        // Parse and filter
+        emit_progress(tx, "parsing", "Analyse des resultats...", 75);
+        let parsed = parse_llm_output(&raw_results, &user_categories)?;
+
+        // Filter: homepage, cross-phase dedup, url dedup, source limit, history
+        let mut phase2_articles: Vec<(String, NewsItem)> = Vec::new(); // (cat_key, item)
+
+        for (cat_key, items) in parsed {
+            for item in items {
+                let url_lower = item.url.to_lowercase();
+
+                // Homepage filter
+                if let Ok(parsed_url) = url::Url::parse(&item.url) {
+                    let path = parsed_url.path();
+                    if path.is_empty() || path == "/" {
+                        trace_article(&state.pool, user_id, job_id, &item.url, &item.title, "web_search", None, None, None, "filtered_homepage", false).await;
+                        continue;
+                    }
+                }
+
+                // Cross-phase dedup
+                if seen_urls.contains(&url_lower) {
+                    trace_article(&state.pool, user_id, job_id, &item.url, &item.title, "web_search", None, None, None, "filtered_cross_phase_dedup", false).await;
+                    continue;
+                }
+
+                // History dedup
+                if settings.article_history_days > 0 {
+                    let hash = hash_article_url(&item.url);
+                    let exists = db::article_history::check_urls_exist(&state.pool, user_id, &[hash.clone()]).await.unwrap_or_default();
+                    if exists.contains(&hash) {
+                        trace_article(&state.pool, user_id, job_id, &item.url, &item.title, "web_search", None, None, None, "filtered_history", false).await;
+                        continue;
+                    }
+                }
+
+                // Source limit
+                if let Some(domain) = extract_domain(&item.url) {
+                    let count = source_counts.get(&domain).copied().unwrap_or(0);
+                    if count >= settings.max_articles_per_source as usize {
+                        trace_article(&state.pool, user_id, job_id, &item.url, &item.title, "web_search", None, None, None, "filtered_diversity", false).await;
+                        continue;
+                    }
+                }
+
+                seen_urls.insert(url_lower);
+                phase2_articles.push((cat_key.clone(), item));
+            }
+        }
+
+        // Scrape Phase 2 articles for validation
+        emit_progress(tx, "scraping", "Verification des sources web...", 80);
+        for (cat_key, item) in phase2_articles {
+            let (body_text, _, final_url) = scrape_single_article(&state.http_client, &item.url, settings.max_age_days as i64).await;
+
+            if body_text.trim().is_empty() {
+                trace_article(&state.pool, user_id, job_id, &final_url, &item.title, "web_search", None, None, None, "filtered_empty", false).await;
+                continue;
+            }
+
+            // Use the LLM-provided title and summary (Phase 2 summaries are final)
+            article_scraped.entry(cat_key).or_default().push(NewsItem {
+                title: item.title,
+                url: final_url,
+                summary: item.summary,
+            });
+
+            if let Some(domain) = extract_domain(&item.url) {
+                *source_counts.entry(domain).or_insert(0) += 1;
+            }
+        }
+    }
+
+    // === SAVE ===
+    if article_scraped.values().all(|items| items.is_empty()) {
+        return Err(AppError::BadRequest("Aucun article valide trouve. Verifiez vos sources et categories.".into()));
+    }
+
+    emit_progress(tx, "saving", "Sauvegarde de la synthese...", 90);
+
+    // Build final sections
+    let mut final_sections: Vec<NewsSection> = Vec::new();
+    for (i, cat_name) in user_categories.iter().enumerate() {
+        let key = format!("category_{}", i);
+        if let Some(items) = article_scraped.get(&key) {
+            if !items.is_empty() {
+                final_sections.push(NewsSection { title: cat_name.clone(), items: items.clone() });
+            }
+        }
+    }
+    if let Some(autre_items) = article_scraped.get("category_autre") {
+        if !autre_items.is_empty() {
+            final_sections.push(NewsSection { title: "Autre".to_string(), items: autre_items.clone() });
+        }
+    }
+
+    let sections_json = serde_json::to_value(&final_sections).map_err(|e| AppError::Internal(anyhow::anyhow!("Failed to serialize: {}", e)))?;
+    let sections_json = sanitize_json_null_bytes(sections_json);
+
+    let synthesis = db::syntheses::create(&state.pool, user_id, &get_iso_week_string(Utc::now().date_naive()), &sections_json, job_id).await?;
+
+    // Record used articles
+    if settings.article_history_days > 0 {
+        for section in &final_sections {
+            for item in &section.items {
+                let source_url = url_source.get(&item.url).map(|s| s.as_str());
+                trace_article(&state.pool, user_id, job_id, &item.url, &item.title,
+                    if source_url.is_some() { "personalized_source" } else { "web_search" },
+                    source_url, Some(&section.title), Some(synthesis.id), "used", true).await;
+            }
+        }
+    }
+
+    Ok(synthesis.id)
+}
+```
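+
+The category handling in the listing above (case-insensitive match, fallback to "Autre", overflow when a category is full) can be isolated into two pure functions, which makes it straightforward to unit test. This is a sketch under assumptions: `canonical_category` and `place_article` are hypothetical names, not functions in the codebase.
+
+```rust
+use std::collections::HashMap;
+
+const AUTRE: &str = "Autre";
+
+/// Return the configured spelling of `raw` from `categories`
+/// (case-insensitive match), or "Autre" when nothing matches.
+/// NOTE: hypothetical helper for illustration.
+fn canonical_category(raw: &str, categories: &[String]) -> String {
+    categories
+        .iter()
+        .find(|c| c.to_lowercase() == raw.to_lowercase())
+        .cloned()
+        .unwrap_or_else(|| AUTRE.to_string())
+}
+
+/// Decide where an article goes: its own category if there is room,
+/// "Autre" if its category is full, or nowhere (`None`) if both are full.
+/// NOTE: hypothetical helper mirroring the overflow branch of the listing.
+fn place_article(
+    category: &str,
+    filled: &HashMap<String, usize>,
+    max_per_category: usize,
+) -> Option<String> {
+    let count = |name: &str| filled.get(name).copied().unwrap_or(0);
+    if category != AUTRE && count(category) >= max_per_category {
+        if count(AUTRE) >= max_per_category {
+            None
+        } else {
+            Some(AUTRE.to_string())
+        }
+    } else {
+        Some(category.to_string())
+    }
+}
+```
+
+As in the listing, an article classified directly as "Autre" is not capped by this check; only overflow into "Autre" is.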
+
+- [ ] **Step 4: Add `rotate_sources` unit tests**
+
+```rust
+    #[test]
+    fn rotate_sources_after_last_used() {
+        // Create mock sources — need Source struct with url field
+        // Test that rotation works correctly
+    }
+```
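+
+One way to fill in the placeholder test is sketched here with a stand-in `Source` struct, because the real model's fields are not shown in this plan. Only the `url` field is assumed; `make_sources` and `url_list` are hypothetical test helpers. In the real test module, drop the stand-in struct and the copied function, and build `Source` values however the model allows.
+
+```rust
+/// Stand-in for the real `Source` model (assumption: it has a `url: String` field).
+#[derive(Clone, Debug, PartialEq)]
+struct Source {
+    url: String,
+}
+
+/// Copy of the `rotate_sources` helper from Step 2, so the sketch is self-contained.
+fn rotate_sources(sources: Vec<Source>, last_source_url: Option<&str>) -> Vec<Source> {
+    let Some(last_url) = last_source_url else {
+        return sources;
+    };
+    let pos = sources.iter().position(|s| s.url == last_url);
+    match pos {
+        Some(idx) => {
+            let next = (idx + 1) % sources.len();
+            let mut rotated = sources[next..].to_vec();
+            rotated.extend_from_slice(&sources[..next]);
+            rotated
+        }
+        None => sources,
+    }
+}
+
+/// Hypothetical test helper: build sources from a list of URLs.
+fn make_sources(urls: &[&str]) -> Vec<Source> {
+    urls.iter().map(|u| Source { url: u.to_string() }).collect()
+}
+
+/// Hypothetical test helper: project sources back to their URLs.
+fn url_list(sources: &[Source]) -> Vec<&str> {
+    sources.iter().map(|s| s.url.as_str()).collect()
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn rotate_sources_after_last_used() {
+        let rotated = rotate_sources(make_sources(&["a", "b", "c"]), Some("a"));
+        assert_eq!(url_list(&rotated), ["b", "c", "a"]);
+    }
+
+    #[test]
+    fn rotate_sources_wraps_after_final_source() {
+        let rotated = rotate_sources(make_sources(&["a", "b", "c"]), Some("c"));
+        assert_eq!(url_list(&rotated), ["a", "b", "c"]);
+    }
+
+    #[test]
+    fn rotate_sources_keeps_order_without_match() {
+        let no_history = rotate_sources(make_sources(&["a", "b"]), None);
+        assert_eq!(url_list(&no_history), ["a", "b"]);
+        let unknown = rotate_sources(make_sources(&["a", "b"]), Some("z"));
+        assert_eq!(url_list(&unknown), ["a", "b"]);
+    }
+}
+```
+
+The wrap-around case is worth keeping: when the last-used source is the final entry, the next run starts again from the first source.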
+
+- [ ] **Step 5: Verify + commit**
+
+```bash
+cd backend && cargo test --lib
+git add backend/src/services/synthesis.rs
+git commit -m "feat: rewrite synthesis pipeline — per-article classify/summarize, no rewrite pass"
+```
+
+---
+
+### Task 6: Frontend — remove deprecated settings
+
+**Files:**
+- Modify: `frontend/src/types.ts`
+- Modify: `frontend/src/pages/Settings.tsx`
+- Modify: `frontend/src/i18n/fr.ts`
+
+- [ ] **Step 1: Remove fields from types**
+
+Remove `source_diversity_window: number` and `use_llm_for_article_extraction: boolean` from `UserSettings` and `DEFAULT_SETTINGS`.
+
+- [ ] **Step 2: Remove from Settings page**
+
+Remove the diversity window number input and the LLM extraction checkbox from `Settings.tsx`.
+
+- [ ] **Step 3: Remove i18n labels**
+
+Remove `settings.diversityWindow` and `settings.useLlmForArticleExtraction` labels.
+
+- [ ] **Step 4: Verify + commit**
+
+```bash
+cd frontend && npx tsc --noEmit && npx vitest run
+git add frontend/src/types.ts frontend/src/pages/Settings.tsx frontend/src/i18n/fr.ts
+git commit -m "feat: remove deprecated settings from frontend"
+```
+
+---
+
+### Task 7: Update E2E test
+
+**Files:**
+- Modify: `e2e/tests/generation-live.spec.ts`
+
+- [ ] **Step 1: Update settings payload**
+
+Remove `source_diversity_window` and `use_llm_for_article_extraction` from the PUT settings body.
+
+- [ ] **Step 2: Commit**
+
+```bash
+git add e2e/tests/generation-live.spec.ts
+git commit -m "test: update E2E test for new pipeline (remove deprecated settings)"
+```