# Algorithm Rewrite: Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Rewrite the synthesis generation pipeline: per-article LLM classify/summarize, source rotation, no rewrite pass, remove deprecated settings.

**Architecture:** Complete rewrite of `synthesis.rs` with a simpler two-phase pipeline. Phase 1: scrape personalized sources sequentially, classify/summarize each article with one LLM call. Phase 2: LLM search for gaps, scrape for validation. No batch classification, no rewrite pass.

**Tech Stack:** Rust (sqlx, reqwest, scraper), existing LLM providers

**Spec:** `docs/superpowers/specs/2026-03-25-algorithm-rewrite-design.md`

**Algorithm:** `docs/algorithm.md`

---

### Task 1: Migration to drop deprecated settings columns

**Files:**
- Create: `backend/migrations/20260325000018_drop_deprecated_settings.sql`
- Modify: `backend/src/models/settings.rs`
- Modify: `backend/src/db/settings.rs`
- Modify: `backend/src/services/prompts.rs` (test fixture)
- Modify: `CLAUDE.md`

- [ ] **Step 1: Create migration**

```sql
ALTER TABLE settings DROP COLUMN source_diversity_window;
ALTER TABLE settings DROP COLUMN use_llm_for_article_extraction;
```

- [ ] **Step 2: Remove from settings model**

In `models/settings.rs`, remove `source_diversity_window: i32` and `use_llm_for_article_extraction: bool` from `UserSettings`, `SettingsResponse`, `UpdateSettingsRequest`, `From` impl, `Default` impl, and validation.

- [ ] **Step 3: Remove from DB queries**

In `db/settings.rs`, remove both fields from `SettingsRow`, `TryFrom`, and both SQL queries (column lists, VALUES, RETURNING, ON CONFLICT SET, `.bind()` calls). Decrement `$N` placeholders carefully.

- [ ] **Step 4: Update test fixtures**

Remove both fields from `valid_request()` in settings tests and `test_settings()` in prompts tests.
Remove any validation tests for these fields.

- [ ] **Step 5: Update CLAUDE.md migration count to 18**

- [ ] **Step 6: Verify + commit**

```bash
# Run the tests in a subshell so the git paths below stay relative to the repo root
(cd backend && cargo test --lib)
git add backend/migrations/20260325000018_drop_deprecated_settings.sql \
  backend/src/models/settings.rs backend/src/db/settings.rs \
  backend/src/services/prompts.rs CLAUDE.md
git commit -m "feat: drop source_diversity_window and use_llm_for_article_extraction settings"
```

---

### Task 2: New prompt + schema for per-article classify/summarize

**Files:**
- Modify: `backend/src/services/prompts.rs`
- Modify: `backend/src/services/llm/schema.rs`

- [ ] **Step 1: Add `build_article_classify_prompt` to prompts.rs**

```rust
/// Build a prompt for per-article classification and summarization.
///
/// The LLM classifies the article into a category and generates a title + summary.
/// Returns `(system_prompt, user_prompt)`.
pub fn build_article_classify_prompt(
    title: &str,
    body_snippet: &str,
    categories: &[String], // includes "Autre"
) -> (String, String) {
    let system_prompt = "Tu es un assistant qui analyse des articles d'actualite. \
        Tu dois classer l'article dans une categorie et generer un titre et un resume. \
        Reponds uniquement au format JSON demande."
        .to_string();

    // One quoted bullet per category, e.g. `- "AI News"`
    let categories_list = categories
        .iter()
        .map(|c| format!("- \"{}\"", c))
        .collect::<Vec<_>>()
        .join("\n");

    let user_prompt = format!(
        "Voici un article d'actualite.\n\n\
         Titre : {title}\n\n\
         Contenu (extrait) :\n{body}\n\n\
         Categories disponibles :\n{categories}\n\n\
         Classe cet article dans la categorie la plus appropriee.\n\
         Si aucune categorie ne correspond, utilise \"Autre\".\n\
         Genere un titre clair et un resume de 4 a 5 lignes.\n\
         Si le titre fourni est vide, genere un titre a partir du contenu.",
        title = if title.is_empty() { "(pas de titre)" } else { title },
        body = body_snippet,
        categories = categories_list,
    );

    (system_prompt, user_prompt)
}
```

- [ ] **Step 2: Add `build_article_classify_schema` to schema.rs**

```rust
/// Build a JSON Schema for per-article classification and summarization.
pub fn build_article_classify_schema() -> Value {
    serde_json::json!({
        "type": "object",
        "properties": {
            "title": { "type": "string", "description": "Article title" },
            "summary": { "type": "string", "description": "4-5 line summary of the article" },
            "category": { "type": "string", "description": "Category name from the provided list" }
        },
        "required": ["title", "summary", "category"],
        "additionalProperties": false
    })
}
```

- [ ] **Step 3: Add tests**

In prompts.rs tests:

```rust
#[test]
fn article_classify_prompt_includes_content() {
    let (sys, user) = build_article_classify_prompt(
        "GPT-5 Released",
        "OpenAI released GPT-5",
        &["AI News".into(), "Autre".into()],
    );
    assert!(user.contains("GPT-5 Released"));
    assert!(user.contains("AI News"));
    assert!(user.contains("Autre"));
    assert!(sys.contains("classer"));
}

#[test]
fn article_classify_prompt_handles_empty_title() {
    let (_, user) = build_article_classify_prompt(
        "",
        "Some content",
        &["Tech".into(), "Autre".into()],
    );
    assert!(user.contains("(pas de titre)"));
}
```

In schema.rs tests:

```rust
#[test]
fn article_classify_schema_has_all_fields() {
    let schema = build_article_classify_schema();
    let props = schema["properties"].as_object().unwrap();
    assert!(props.contains_key("title"));
    assert!(props.contains_key("summary"));
    assert!(props.contains_key("category"));
    assert_eq!(schema["additionalProperties"], false);
}
```

- [ ] **Step 4: Verify + commit**

```bash
(cd backend && cargo test --lib)
git add backend/src/services/prompts.rs backend/src/services/llm/schema.rs
git commit -m "feat: add per-article classify/summarize prompt and schema"
```

---

### Task 3: Add `get_last_source_url` to article_history DB + simplify ScrapedContent

**Files:**
- Modify: `backend/src/db/article_history.rs`
- Modify: `backend/src/services/scraper.rs`
- Modify (only if Step 2 requires it): `backend/src/services/synthesis.rs`

- [ ] **Step 1: Add `get_last_source_url`**

```rust
/// Get the source_url from the most recent 'used' entry for source rotation.
/// Returns `None` when the user has no 'used' entry with a source URL yet.
pub async fn get_last_source_url(
    pool: &PgPool,
    user_id: Uuid,
) -> Result<Option<String>, AppError> {
    let result = sqlx::query_scalar::<_, String>(
        "SELECT source_url FROM article_history
         WHERE user_id = $1 AND status = 'used' AND source_url IS NOT NULL
         ORDER BY created_at DESC
         LIMIT 1",
    )
    .bind(user_id)
    .fetch_optional(pool)
    .await?;

    Ok(result)
}
```

- [ ] **Step 2: Remove `head_html` from `ScrapedContent`**

In `scraper.rs`, remove `pub head_html: String` from the `ScrapedContent` struct, the `head_html` extraction code in `scrape_url` (the block that finds `<head>...</head>`), and `head_html` from the returned struct construction.

This can break compilation wherever `content.head_html` is still read:

- `source_scraper.rs`: `extract_article_links_with_llm` is expected to use its own `extract_head_and_body` function rather than `ScrapedContent.head_html`, but check and fix any references.
- `synthesis.rs`: `scrape_single_article_with_llm` references `content.head_html`. This function is removed in Task 5, but it needs to compile now. Either temporarily replace `content.head_html` with `String::new()`, or remove the function now.

- [ ] **Step 3: Verify + commit**

```bash
(cd backend && cargo test --lib)
git add backend/src/db/article_history.rs backend/src/services/scraper.rs backend/src/services/synthesis.rs
git commit -m "feat: add get_last_source_url + remove head_html from ScrapedContent"
```

---

### Task 4: Remove old prompts, schemas, and unused code

**Files:**
- Modify: `backend/src/services/prompts.rs`
- Modify: `backend/src/services/llm/schema.rs`

- [ ] **Step 1: Remove old prompts from prompts.rs**

Remove these functions and their tests:
- `build_rewrite_prompt`
- `build_classification_prompt`
- `build_article_extraction_prompt`

Keep:
- `build_link_extraction_prompt` (used by the source_scraper LLM link extraction)
- The `category_gaps: Option<&[(String, i32)]>` parameter of `build_search_prompt` (Phase 2 still uses gap-aware search)

Remove `use crate::models::synthesis::ScrapedNewsItem;` if it's no longer needed (check if `build_classification_prompt` was the only user).

- [ ] **Step 2: Remove old schemas from schema.rs**

Remove: `build_classification_schema`, `build_article_extraction_schema`

Keep: `build_category_schema` (Phase 2 search), `build_link_extraction_schema` (source scraper), `build_article_classify_schema` (new)

- [ ] **Step 3: Verify + commit**

```bash
(cd backend && cargo test --lib)
git add backend/src/services/prompts.rs backend/src/services/llm/schema.rs
git commit -m "refactor: remove old classification, rewrite, and article extraction prompts/schemas"
```

---

### Task 5: Rewrite `synthesis.rs`, the core pipeline

**Files:**
- Modify: `backend/src/services/synthesis.rs`

This is the largest task. The entire `run_generation_inner` function is rewritten. Many helper functions are removed.

- [ ] **Step 1: Remove dead helper functions**

Delete these functions and their tests from `synthesis.rs`:
- `scrape_single_article_with_llm`
- `scrape_flat_urls`
- `scrape_articles`
- `filter_empty_scraped_articles`
- `build_rewrite_schema`
- `build_final_sections`
- `restore_scraped_urls`
- `parse_classification_response`
- `limit_articles_per_source`
- `dedup_by_url`
- `filter_homepage_urls`
- `SYNTHESIS_MIN_FILL_RATIO` constant
- All associated tests for these functions

Keep:
- `scrape_single_article` (used for Phase 1 per-article scraping)
- `emit_progress`
- `trace_article`
- `log_llm_call`
- `normalize_article_url` / `hash_article_url`
- `extract_domain`
- `resolve_provider_and_key` / `resolve_model`
- `check_rate_limit` / `get_user_rate_limiter`
- `sanitize_json_null_bytes`
- `sanitize_error_message`
- `get_iso_week_string`
- `parse_llm_output` (used in Phase 2)

- [ ] **Step 2: Add `rotate_sources` helper**

```rust
/// Rotate the sources list so that the source after the last-used source comes first.
/// Requires `Source: Clone` because the rotated list is rebuilt from slices.
fn rotate_sources(sources: Vec<Source>, last_source_url: Option<&str>) -> Vec<Source> {
    let Some(last_url) = last_source_url else {
        return sources;
    };
    let pos = sources.iter().position(|s| s.url == last_url);
    match pos {
        Some(idx) => {
            // Wrap around so the last source in the list is followed by the first
            let next = (idx + 1) % sources.len();
            let mut rotated = sources[next..].to_vec();
            rotated.extend_from_slice(&sources[..next]);
            rotated
        }
        None => sources, // Last source not in list, don't rotate
    }
}
```

- [ ] **Step 3: Rewrite `run_generation_inner`**

Replace the entire function body with the new algorithm.
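
One part of that algorithm, deciding which section an article lands in once the LLM has named a category, can be sketched in isolation. This is a minimal sketch, not the plan's implementation: `Slot`, `autre_slot`, and `place_article` are hypothetical names introduced here. The rules follow the plan's flow: match the category case-insensitively, overflow a full category into "Autre", skip the article when both are full, and accept unknown or "Autre" categories without a capacity check. Returning the user's own spelling of the category is this sketch's choice, so that per-category counts use a single key.

```rust
use std::collections::HashMap;

/// Hypothetical result type: the section key plus the name used for counting.
#[derive(Debug, PartialEq)]
struct Slot {
    key: String,
    name: String,
}

/// The fallback section, keyed the same way as in the plan.
fn autre_slot() -> Slot {
    Slot { key: "category_autre".to_string(), name: "Autre".to_string() }
}

/// Decide where an article goes; `None` means the article is skipped.
fn place_article(
    llm_category: &str,
    user_categories: &[String], // does NOT include "Autre"
    filled: &HashMap<String, usize>,
    max_per_category: usize,
) -> Option<Slot> {
    let is_full = |name: &str| filled.get(name).copied().unwrap_or(0) >= max_per_category;

    // Case-insensitive lookup that keeps the user's spelling of the category.
    let wanted = llm_category.to_lowercase();
    let matched = user_categories
        .iter()
        .enumerate()
        .find(|(_, c)| c.to_lowercase() == wanted);

    match matched {
        // Known category with room left: use it.
        Some((i, name)) if !is_full(name.as_str()) => Some(Slot {
            key: format!("category_{}", i),
            name: name.clone(),
        }),
        // Known category is full and so is "Autre": skip the article.
        Some(_) if is_full("Autre") => None,
        // Known category is full: overflow into "Autre".
        Some(_) => Some(autre_slot()),
        // "Autre" or an unknown name: always accepted, with no capacity check.
        None => Some(autre_slot()),
    }
}
```

Keeping this decision in a pure function would make the overflow rules unit-testable without an LLM or a database.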

The new flow:

```rust
// Only the body changes; keep the existing signature. The payload type of
// `watch::Sender` is omitted here, and `Uuid` is inferred from `Ok(synthesis.id)`.
async fn run_generation_inner(
    job_id: Uuid,
    state: &AppState,
    user_id: Uuid,
    tx: &watch::Sender</* existing progress payload type */>,
) -> Result<Uuid, AppError> {
    // === INITIALIZATION ===
    emit_progress(tx, "settings", "Chargement des parametres...", 5);
    let settings = db::settings::get_or_create_default(&state.pool, user_id).await?;

    // Cleanup (best effort: failures are ignored)
    if settings.article_history_days > 0 {
        db::article_history::cleanup_old(&state.pool, user_id, settings.article_history_days).await.unwrap_or(0);
        db::llm_call_log::truncate_old(&state.pool, user_id, settings.article_history_days).await.ok();
    }

    // Categories: "Autre" is always appended, so an empty list means only "Autre"
    let user_categories = settings.categories.clone();
    let mut classification_categories = user_categories.clone();
    classification_categories.push("Autre".to_string());

    // Load sources
    emit_progress(tx, "sources", "Chargement des sources...", 10);
    let sources = db::sources::list_for_user(&state.pool, user_id).await?;

    // Resolve provider
    emit_progress(tx, "provider", "Configuration du fournisseur IA...", 12);
    let (provider_name, api_key) = resolve_provider_and_key(state, user_id, &settings).await?;
    let provider = create_provider(&provider_name, api_key)?;
    let model_research = if !settings.ai_model.is_empty() {
        settings.ai_model.clone()
    } else {
        resolve_model(state, &provider_name).await?
    };
    // Note: nothing below reads `model_writing` now that the rewrite pass is gone
    let model_writing = if !settings.ai_model_writing.is_empty() {
        settings.ai_model_writing.clone()
    } else {
        model_research.clone()
    };
    let user_rate_limiter = get_user_rate_limiter(state, &settings, user_id);

    // Tracking structures
    let mut article_scraped: HashMap<String, Vec<NewsItem>> = HashMap::new(); // cat_key → items
    let mut source_counts: HashMap<String, usize> = HashMap::new(); // domain → count
    let mut url_source: HashMap<String, String> = HashMap::new(); // url → source_url
    let mut filled_counts: HashMap<String, usize> = HashMap::new(); // category name → count
    let mut seen_urls: std::collections::HashSet<String> = std::collections::HashSet::new();
    let max_total = (user_categories.len() + 1) * settings.max_items_per_category as usize;
    let classify_schema = build_article_classify_schema();

    // === PHASE 1: Personalized Sources ===
    if !sources.is_empty() {
        emit_progress(tx, "sources_scrape", "Analyse des sources personnalisees...", 15);

        // 1a. Rotate sources
        let last_source = db::article_history::get_last_source_url(&state.pool, user_id).await.unwrap_or(None);
        let rotated_sources = rotate_sources(sources.clone(), last_source.as_deref());
        let max_sources = rotated_sources.len().min(10);
        let max_links = 10usize;
        let mut candidate_urls: Vec<(String, String)> = Vec::new(); // (article_url, source_url)

        for source in rotated_sources.iter().take(max_sources) {
            let links = if settings.use_llm_for_source_links {
                source_scraper::extract_article_links_with_llm(
                    &state.http_client, &source.url, max_links, &provider, &model_research,
                ).await
            } else {
                source_scraper::extract_article_links(
                    &state.http_client, &source.url, max_links,
                ).await
            };
            // A source that fails to load is skipped silently
            if let Ok(links) = links {
                for link in links {
                    if seen_urls.insert(link.to_lowercase()) {
                        candidate_urls.push((link, source.url.clone()));
                    }
                }
            }
        }

        // Filter against article history
        if settings.article_history_days > 0 && !candidate_urls.is_empty() {
            let hashes: Vec<_> = candidate_urls.iter().map(|(url, _)| hash_article_url(url)).collect();
            let existing = db::article_history::check_urls_exist(&state.pool, user_id, &hashes).await.unwrap_or_default();
            if !existing.is_empty() {
                // Trace filtered articles
                for (url, source_url) in &candidate_urls {
                    if existing.contains(&hash_article_url(url)) {
                        trace_article(&state.pool, user_id, job_id, url, "", "personalized_source", Some(source_url), None, None, "filtered_history", false).await;
                    }
                }
                candidate_urls.retain(|(url, _)| !existing.contains(&hash_article_url(url)));
            }
        }

        // Track url → source
        for (url, source_url) in &candidate_urls {
            url_source.insert(url.clone(), source_url.clone());
        }

        // 1b. Scrape, classify, summarize each article
        emit_progress(tx, "processing", "Traitement des articles...", 25);
        let total_candidates = candidate_urls.len();

        for (idx, (url, source_url)) in candidate_urls.into_iter().enumerate() {
            // Progress: 25% to 65% across the candidates
            let pct = 25 + ((idx as u32 * 40) / total_candidates.max(1) as u32).min(40);
            emit_progress(tx, "processing", &format!("Article {}/{}...", idx + 1, total_candidates), pct as u8);

            // Check source limit
            let source_domain = extract_domain(&source_url).unwrap_or_default();
            let source_count = source_counts.get(&source_domain).copied().unwrap_or(0);
            if source_count >= settings.max_articles_per_source as usize {
                trace_article(&state.pool, user_id, job_id, &url, "", "personalized_source", Some(&source_url), None, None, "filtered_diversity", false).await;
                continue;
            }

            // Scrape
            let (body_text, page_title, final_url) =
                scrape_single_article(&state.http_client, &url, settings.max_age_days as i64).await;
            if body_text.trim().is_empty() {
                trace_article(&state.pool, user_id, job_id, &final_url, &page_title, "personalized_source", Some(&source_url), None, None, "filtered_empty", false).await;
                continue;
            }

            // LLM classify + summarize
            check_rate_limit(state, &user_rate_limiter, &provider_name)?;
            let body_snippet: String = body_text.chars().take(500).collect();
            let (class_sys, class_user) =
                prompts::build_article_classify_prompt(&page_title, &body_snippet, &classification_categories);
            let llm_start = std::time::Instant::now();
            let class_response = provider.call_llm(&model_research, &class_sys, &class_user, &classify_schema).await?;
            let llm_duration = llm_start.elapsed().as_millis() as u64;
            log_llm_call(&state.pool, user_id, job_id, "classify_summarize", &model_research, &class_sys, &class_user, &class_response, llm_duration).await;

            // Parse response
            let llm_title = class_response.get("title").and_then(|t| t.as_str()).unwrap_or(&page_title).to_string();
            let llm_summary = class_response.get("summary").and_then(|s| s.as_str()).unwrap_or("").to_string();
            let raw_category = class_response.get("category").and_then(|c| c.as_str()).unwrap_or("Autre");

            // Validate category: if not in list, use "Autre". The name is also normalized to
            // the casing used in the list, so `filled_counts` keys match `user_categories`
            // when Phase 2 computes the gaps.
            let llm_category = classification_categories
                .iter()
                .find(|c| c.to_lowercase() == raw_category.to_lowercase())
                .cloned()
                .unwrap_or_else(|| "Autre".to_string());

            // Map category to key
            let cat_key = if llm_category == "Autre" {
                "category_autre".to_string()
            } else {
                user_categories.iter().position(|c| c.to_lowercase() == llm_category.to_lowercase())
                    .map(|i| format!("category_{}", i))
                    .unwrap_or_else(|| "category_autre".to_string())
            };

            // Check if category is full → overflow to "Autre"
            let cat_filled = filled_counts.get(&llm_category).copied().unwrap_or(0);
            let (final_cat_key, final_cat_name) =
                if cat_filled >= settings.max_items_per_category as usize && llm_category != "Autre" {
                    let autre_filled = filled_counts.get("Autre").copied().unwrap_or(0);
                    if autre_filled >= settings.max_items_per_category as usize {
                        // Both full: skip article
                        continue;
                    }
                    ("category_autre".to_string(), "Autre".to_string())
                } else {
                    (cat_key, llm_category)
                };

            // Add article
            article_scraped.entry(final_cat_key).or_default().push(NewsItem {
                title: llm_title,
                url: final_url.clone(),
                summary: llm_summary,
            });
            *filled_counts.entry(final_cat_name).or_insert(0) += 1;
            *source_counts.entry(source_domain).or_insert(0) += 1;
            // The stored URL is `final_url`, which can differ from the candidate URL after a
            // redirect; track it too so the SAVE step still finds the article's source.
            url_source.insert(final_url, source_url);

            // Check if we've reached the maximum
            let total: usize = article_scraped.values().map(|v| v.len()).sum();
            if total >= max_total {
                break;
            }
        }
    }

    // === PHASE 2: Web Search Fallback ===
    // For each user category, how many items are still missing after Phase 1
    let category_gaps: Vec<(String, i32)> = user_categories.iter().filter_map(|cat| {
        let filled = filled_counts.get(cat).copied().unwrap_or(0);
        let needed = (settings.max_items_per_category as usize).saturating_sub(filled);
        if needed > 0 { Some((cat.clone(), needed as i32)) } else { None }
    }).collect();

    if !category_gaps.is_empty() {
        emit_progress(tx, "search", "Recherche d'actualites complementaires...", 70);
        check_rate_limit(state, &user_rate_limiter, &provider_name)?;
        let search_schema = build_category_schema(&user_categories, settings.max_items_per_category);
        let current_date = Utc::now().format("%A %d %B %Y").to_string();
        let (sys_prompt, usr_prompt) =
            prompts::build_search_prompt(&settings, &sources, &current_date, &[], Some(&category_gaps));
        let llm_start = std::time::Instant::now();
        let raw_results = provider.call_llm(&model_research, &sys_prompt, &usr_prompt, &search_schema).await?;
        let llm_duration = llm_start.elapsed().as_millis() as u64;
        log_llm_call(&state.pool, user_id, job_id, "search", &model_research, &sys_prompt, &usr_prompt, &raw_results, llm_duration).await;

        // Parse and filter
        emit_progress(tx, "parsing", "Analyse des resultats...", 75);
        let parsed = parse_llm_output(&raw_results, &user_categories)?;

        // Filter: homepage, cross-phase dedup, url dedup, source limit, history
        let mut phase2_articles: Vec<(String, NewsItem)> = Vec::new(); // (cat_key, item)
        for (cat_key, items) in parsed {
            for item in items {
                let url_lower = item.url.to_lowercase();

                // Homepage filter
                if let Ok(parsed_url) = url::Url::parse(&item.url) {
                    let path = parsed_url.path();
                    if path.is_empty() || path == "/" {
                        trace_article(&state.pool, user_id, job_id, &item.url, &item.title, "web_search", None, None, None, "filtered_homepage", false).await;
                        continue;
                    }
                }

                // Cross-phase dedup
                if seen_urls.contains(&url_lower) {
                    trace_article(&state.pool, user_id, job_id, &item.url, &item.title, "web_search", None, None, None, "filtered_cross_phase_dedup", false).await;
                    continue;
                }

                // History dedup
                if settings.article_history_days > 0 {
                    let hash = hash_article_url(&item.url);
                    let exists = db::article_history::check_urls_exist(&state.pool, user_id, &[hash.clone()]).await.unwrap_or_default();
                    if exists.contains(&hash) {
                        trace_article(&state.pool, user_id, job_id, &item.url, &item.title, "web_search", None, None, None, "filtered_history", false).await;
                        continue;
                    }
                }

                // Source limit
                if let Some(domain) = extract_domain(&item.url) {
                    let count = source_counts.get(&domain).copied().unwrap_or(0);
                    if count >= settings.max_articles_per_source as usize {
                        trace_article(&state.pool, user_id, job_id, &item.url, &item.title, "web_search", None, None, None, "filtered_diversity", false).await;
                        continue;
                    }
                }

                seen_urls.insert(url_lower);
                phase2_articles.push((cat_key.clone(), item));
            }
        }

        // Scrape Phase 2 articles for validation
        emit_progress(tx, "scraping", "Verification des sources web...", 80);
        for (cat_key, item) in phase2_articles {
            let (body_text, _, final_url) =
                scrape_single_article(&state.http_client, &item.url, settings.max_age_days as i64).await;
            if body_text.trim().is_empty() {
                trace_article(&state.pool, user_id, job_id, &final_url, &item.title, "web_search", None, None, None, "filtered_empty", false).await;
                continue;
            }

            // Use the LLM-provided title and summary (Phase 2 summaries are final)
            article_scraped.entry(cat_key).or_default().push(NewsItem {
                title: item.title,
                url: final_url,
                summary: item.summary,
            });
            if let Some(domain) = extract_domain(&item.url) {
                *source_counts.entry(domain).or_insert(0) += 1;
            }
        }
    }

    // === SAVE ===
    if article_scraped.values().all(|items| items.is_empty()) {
        return Err(AppError::BadRequest("Aucun article valide trouve. Verifiez vos sources et categories.".into()));
    }

    emit_progress(tx, "saving", "Sauvegarde de la synthese...", 90);

    // Build final sections: user categories in order, then "Autre"
    let mut final_sections: Vec<NewsSection> = Vec::new();
    for (i, cat_name) in user_categories.iter().enumerate() {
        let key = format!("category_{}", i);
        if let Some(items) = article_scraped.get(&key) {
            if !items.is_empty() {
                final_sections.push(NewsSection { title: cat_name.clone(), items: items.clone() });
            }
        }
    }
    if let Some(autre_items) = article_scraped.get("category_autre") {
        if !autre_items.is_empty() {
            final_sections.push(NewsSection { title: "Autre".to_string(), items: autre_items.clone() });
        }
    }

    let sections_json = serde_json::to_value(&final_sections)
        .map_err(|e| AppError::Internal(anyhow::anyhow!("Failed to serialize: {}", e)))?;
    let sections_json = sanitize_json_null_bytes(sections_json);
    let synthesis = db::syntheses::create(&state.pool, user_id, &get_iso_week_string(Utc::now().date_naive()), &sections_json, job_id).await?;

    // Record used articles
    if settings.article_history_days > 0 {
        for section in &final_sections {
            for item in &section.items {
                let source_url = url_source.get(&item.url).map(|s| s.as_str());
                trace_article(
                    &state.pool, user_id, job_id, &item.url, &item.title,
                    if source_url.is_some() { "personalized_source" } else { "web_search" },
                    source_url, Some(&section.title), Some(synthesis.id), "used", true,
                ).await;
            }
        }
    }

    Ok(synthesis.id)
}
```

- [ ] **Step 4: Add `rotate_sources` unit tests**

```rust
#[test]
fn rotate_sources_after_last_used() {
    // Create mock sources (needs the `Source` struct; only the `url` field matters).
    // Cases to cover:
    // - the source after the last-used one comes first
    // - the last source in the list wraps around to the first
    // - `None` or an unknown URL leaves the order unchanged
}
```

- [ ] **Step 5: Verify + commit**

```bash
(cd backend && cargo test --lib)
git add backend/src/services/synthesis.rs
git commit -m "feat: rewrite synthesis pipeline (per-article classify/summarize, no rewrite pass)"
```

---

### Task 6: Frontend, remove deprecated settings

**Files:**
- Modify: `frontend/src/types.ts`
- Modify: `frontend/src/pages/Settings.tsx`
- Modify: `frontend/src/i18n/fr.ts`

- [ ] **Step 1: Remove fields from types**

Remove `source_diversity_window: number` and `use_llm_for_article_extraction: boolean` from `UserSettings` and `DEFAULT_SETTINGS`.

- [ ] **Step 2: Remove from Settings page**

Remove the diversity window number input and the LLM extraction checkbox from `Settings.tsx`.

- [ ] **Step 3: Remove i18n labels**

Remove `settings.diversityWindow` and `settings.useLlmForArticleExtraction` labels.

- [ ] **Step 4: Verify + commit**

```bash
# Run the checks in a subshell so the git paths below stay relative to the repo root
(cd frontend && npx tsc --noEmit && npx vitest run)
git add frontend/src/types.ts frontend/src/pages/Settings.tsx frontend/src/i18n/fr.ts
git commit -m "feat: remove deprecated settings from frontend"
```

---

### Task 7: Update E2E test

**Files:**
- Modify: `e2e/tests/generation-live.spec.ts`

- [ ] **Step 1: Update settings payload**

Remove `source_diversity_window` and `use_llm_for_article_extraction` from the PUT settings body.

- [ ] **Step 2: Commit**

```bash
git add e2e/tests/generation-live.spec.ts
git commit -m "test: update E2E test for new pipeline (remove deprecated settings)"
```
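
For Task 5, Step 4, the `rotate_sources` test is left as a stub. Below is a minimal sketch of what it could assert, under stated assumptions: `Source` here is a stand-in with only a `url` field (the real model likely has more fields), and `sources_from` / `urls_of` are hypothetical helpers. The function body repeats the helper from Task 5, Step 2 so the block compiles on its own.

```rust
/// Stand-in for the project's `Source` model (assumption: only `url` matters here).
#[derive(Clone, Debug, PartialEq)]
struct Source {
    url: String,
}

/// Same rotation logic as the helper in Task 5, Step 2.
fn rotate_sources(sources: Vec<Source>, last_source_url: Option<&str>) -> Vec<Source> {
    let Some(last_url) = last_source_url else {
        return sources;
    };
    let pos = sources.iter().position(|s| s.url == last_url);
    match pos {
        Some(idx) => {
            let next = (idx + 1) % sources.len();
            let mut rotated = sources[next..].to_vec();
            rotated.extend_from_slice(&sources[..next]);
            rotated
        }
        None => sources,
    }
}

/// Hypothetical test helper: build sources from bare URLs.
fn sources_from(urls: &[&str]) -> Vec<Source> {
    urls.iter().map(|u| Source { url: u.to_string() }).collect()
}

/// Hypothetical test helper: project sources back to their URLs for comparison.
fn urls_of(sources: &[Source]) -> Vec<&str> {
    sources.iter().map(|s| s.url.as_str()).collect()
}

#[test]
fn rotate_sources_after_last_used() {
    let all = sources_from(&["https://a.example", "https://b.example", "https://c.example"]);

    // The source after the last-used one comes first.
    assert_eq!(
        urls_of(&rotate_sources(all.clone(), Some("https://b.example"))),
        vec!["https://c.example", "https://a.example", "https://b.example"]
    );
    // The last source in the list wraps around to the first: order unchanged.
    assert_eq!(rotate_sources(all.clone(), Some("https://c.example")), all);
    // No history, or a URL that is no longer in the list: order unchanged.
    assert_eq!(rotate_sources(all.clone(), None), all);
    assert_eq!(rotate_sources(all.clone(), Some("https://gone.example")), all);
    // An empty list stays empty and does not panic on the modulo.
    assert!(rotate_sources(Vec::new(), Some("https://a.example")).is_empty());
}
```

In the real test module the stand-in struct and the copied function would be dropped in favor of the project's own `Source` and `rotate_sources`.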