docs: add algorithm rewrite implementation plan (7 tasks)
parent 1d5dc0596c
commit d3b63295f6
@ -0,0 +1,688 @
# Algorithm Rewrite — Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Rewrite the synthesis generation pipeline: per-article LLM classify/summarize, source rotation, no rewrite pass, remove deprecated settings.

**Architecture:** Complete rewrite of `synthesis.rs` with a simpler two-phase pipeline. Phase 1: scrape personalized sources sequentially, classify/summarize each article with one LLM call. Phase 2: LLM search for gaps, scrape for validation. No batch classification, no rewrite pass.

**Tech Stack:** Rust (sqlx, reqwest, scraper), existing LLM providers

**Spec:** `docs/superpowers/specs/2026-03-25-algorithm-rewrite-design.md`
**Algorithm:** `docs/algorithm.md`

---

### Task 1: Migration — drop deprecated settings columns

**Files:**
- Create: `backend/migrations/20260325000018_drop_deprecated_settings.sql`
- Modify: `backend/src/models/settings.rs`
- Modify: `backend/src/db/settings.rs`
- Modify: `backend/src/services/prompts.rs` (test fixture)
- Modify: `CLAUDE.md`

- [ ] **Step 1: Create migration**

```sql
ALTER TABLE settings DROP COLUMN source_diversity_window;
ALTER TABLE settings DROP COLUMN use_llm_for_article_extraction;
```

- [ ] **Step 2: Remove from settings model**

In `models/settings.rs`, remove `source_diversity_window: i32` and `use_llm_for_article_extraction: bool` from `UserSettings`, `SettingsResponse`, `UpdateSettingsRequest`, the `From` impl, the `Default` impl, and validation.

- [ ] **Step 3: Remove from DB queries**

In `db/settings.rs`, remove both fields from `SettingsRow`, `TryFrom`, and both SQL queries (column lists, VALUES, RETURNING, ON CONFLICT SET, `.bind()` calls). Renumber the remaining `$N` placeholders so they stay contiguous and still match the order of the `.bind()` calls.
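
Renumbering placeholders by hand is easy to get wrong, so a small check can help while editing. The sketch below is not part of the codebase: `placeholder_indices` and `placeholders_are_contiguous` are hypothetical helper names, and the SQL strings in the usage note are illustrative rather than the real settings queries. It only verifies that a query uses exactly `$1..=$N` with no gaps.

```rust
use std::collections::BTreeSet;

/// Collect every `$N` placeholder index that appears in a SQL string.
/// Hypothetical helper for illustration only.
fn placeholder_indices(sql: &str) -> BTreeSet<usize> {
    let bytes = sql.as_bytes();
    let mut found = BTreeSet::new();
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'$' {
            // Scan the run of ASCII digits that follows the '$'.
            let start = i + 1;
            let mut end = start;
            while end < bytes.len() && bytes[end].is_ascii_digit() {
                end += 1;
            }
            // An empty digit run fails to parse and is ignored.
            if let Ok(n) = sql[start..end].parse::<usize>() {
                found.insert(n);
            }
            i = end;
        } else {
            i += 1;
        }
    }
    found
}

/// True when the query uses exactly `$1..=$expected`, with no gaps.
fn placeholders_are_contiguous(sql: &str, expected: usize) -> bool {
    placeholder_indices(sql).into_iter().eq(1..=expected)
}
```

For example, `placeholders_are_contiguous("... SET a = $2, b = $3 WHERE user_id = $1", 3)` holds, while a query that still mentions `$4` after a column was removed fails the check. `expected` would be the number of `.bind()` calls left on the query.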

- [ ] **Step 4: Update test fixtures**

Remove both fields from `valid_request()` in settings tests and `test_settings()` in prompts tests. Remove any validation tests for these fields.

- [ ] **Step 5: Update CLAUDE.md migration count to 18**

- [ ] **Step 6: Verify + commit**

```bash
cd backend && cargo test --lib
git add backend/migrations/20260325000018_drop_deprecated_settings.sql backend/src/models/settings.rs backend/src/db/settings.rs backend/src/services/prompts.rs CLAUDE.md
git commit -m "feat: drop source_diversity_window and use_llm_for_article_extraction settings"
```

---

### Task 2: New prompt + schema for per-article classify/summarize

**Files:**
- Modify: `backend/src/services/prompts.rs`
- Modify: `backend/src/services/llm/schema.rs`

- [ ] **Step 1: Add `build_article_classify_prompt` to prompts.rs**

```rust
/// Build a prompt for per-article classification and summarization.
///
/// The LLM classifies the article into a category and generates a title + summary.
pub fn build_article_classify_prompt(
    title: &str,
    body_snippet: &str,
    categories: &[String], // includes "Autre"
) -> (String, String) {
    let system_prompt =
        "Tu es un assistant qui analyse des articles d'actualite. \
         Tu dois classer l'article dans une categorie et generer un titre et un resume. \
         Reponds uniquement au format JSON demande."
            .to_string();

    let categories_list = categories
        .iter()
        .map(|c| format!("- \"{}\"", c))
        .collect::<Vec<_>>()
        .join("\n");

    let user_prompt = format!(
        "Voici un article d'actualite.\n\n\
         Titre : {title}\n\n\
         Contenu (extrait) :\n{body}\n\n\
         Categories disponibles :\n{categories}\n\n\
         Classe cet article dans la categorie la plus appropriee.\n\
         Si aucune categorie ne correspond, utilise \"Autre\".\n\
         Genere un titre clair et un resume de 4 a 5 lignes.\n\
         Si le titre fourni est vide, genere un titre a partir du contenu.",
        title = if title.is_empty() { "(pas de titre)" } else { title },
        body = body_snippet,
        categories = categories_list,
    );

    (system_prompt, user_prompt)
}
```
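
The `body_snippet` argument is meant to be a bounded excerpt; the Task 5 pipeline passes the first 500 characters of the scraped body. As a minimal sketch of how a caller might build that excerpt without splitting a multi-byte UTF-8 character, the two helpers below (`make_snippet` and `snippet_slice` are hypothetical names, not existing functions) truncate by characters rather than bytes.

```rust
/// Return at most `max_chars` characters of `body` as an owned String.
/// Hypothetical helper; truncates on character boundaries, never mid-character.
fn make_snippet(body: &str, max_chars: usize) -> String {
    body.chars().take(max_chars).collect()
}

/// Borrowing variant: returns a sub-slice of `body` without allocating.
fn snippet_slice(body: &str, max_chars: usize) -> &str {
    // `nth(max_chars)` yields the byte offset of the first character to drop.
    match body.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => &body[..byte_idx],
        None => body, // body is already short enough
    }
}
```

Byte slicing such as `&body[..500]` would panic if byte 500 falls inside an accented character, which is likely for French articles, so counting characters is the safer choice here.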

- [ ] **Step 2: Add `build_article_classify_schema` to schema.rs**

```rust
/// Build a JSON Schema for per-article classification and summarization.
pub fn build_article_classify_schema() -> Value {
    serde_json::json!({
        "type": "object",
        "properties": {
            "title": { "type": "string", "description": "Article title" },
            "summary": { "type": "string", "description": "4-5 line summary of the article" },
            "category": { "type": "string", "description": "Category name from the provided list" }
        },
        "required": ["title", "summary", "category"],
        "additionalProperties": false
    })
}
```

- [ ] **Step 3: Add tests**

In prompts.rs tests:
```rust
#[test]
fn article_classify_prompt_includes_content() {
    let (sys, user) = build_article_classify_prompt("GPT-5 Released", "OpenAI released GPT-5", &["AI News".into(), "Autre".into()]);
    assert!(user.contains("GPT-5 Released"));
    assert!(user.contains("AI News"));
    assert!(user.contains("Autre"));
    assert!(sys.contains("classer"));
}

#[test]
fn article_classify_prompt_handles_empty_title() {
    let (_, user) = build_article_classify_prompt("", "Some content", &["Tech".into(), "Autre".into()]);
    assert!(user.contains("(pas de titre)"));
}
```

In schema.rs tests:
```rust
#[test]
fn article_classify_schema_has_all_fields() {
    let schema = build_article_classify_schema();
    let props = schema["properties"].as_object().unwrap();
    assert!(props.contains_key("title"));
    assert!(props.contains_key("summary"));
    assert!(props.contains_key("category"));
    assert_eq!(schema["additionalProperties"], false);
}
```

- [ ] **Step 4: Verify + commit**

```bash
cd backend && cargo test --lib
git add backend/src/services/prompts.rs backend/src/services/llm/schema.rs
git commit -m "feat: add per-article classify/summarize prompt and schema"
```

---

### Task 3: Add `get_last_source_url` to article_history DB + simplify ScrapedContent

**Files:**
- Modify: `backend/src/db/article_history.rs`
- Modify: `backend/src/services/scraper.rs`
- Modify: `backend/src/services/synthesis.rs` (only if `head_html` references need patching, see Step 2)

- [ ] **Step 1: Add `get_last_source_url`**

```rust
/// Get the source_url from the most recent 'used' entry for source rotation.
pub async fn get_last_source_url(
    pool: &PgPool,
    user_id: Uuid,
) -> Result<Option<String>, AppError> {
    let result = sqlx::query_scalar::<_, String>(
        "SELECT source_url FROM article_history \
         WHERE user_id = $1 AND status = 'used' AND source_url IS NOT NULL \
         ORDER BY created_at DESC LIMIT 1",
    )
    .bind(user_id)
    .fetch_optional(pool)
    .await?;
    Ok(result)
}
```

- [ ] **Step 2: Remove `head_html` from `ScrapedContent`**

In `scraper.rs`, remove `pub head_html: String` from the `ScrapedContent` struct. Remove the `head_html` extraction code in `scrape_url` (the block that finds `<head>...</head>`). Remove `head_html` from the return struct construction.

This may cause compilation errors in `source_scraper.rs` if `extract_article_links_with_llm` reads `content.head_html`. source_scraper is expected to use its own `extract_head_and_body` function rather than `ScrapedContent.head_html`, so check and fix any remaining references.

Also check `scrape_single_article_with_llm` in `synthesis.rs`, which references `content.head_html`. This function will be removed in Task 5, but it needs to compile now. Either temporarily replace `content.head_html` with `String::new()` or remove the function now.

- [ ] **Step 3: Verify + commit**

```bash
cd backend && cargo test --lib
git add backend/src/db/article_history.rs backend/src/services/scraper.rs backend/src/services/synthesis.rs
git commit -m "feat: add get_last_source_url + remove head_html from ScrapedContent"
```

---

### Task 4: Remove old prompts, schemas, and unused code

**Files:**
- Modify: `backend/src/services/prompts.rs`
- Modify: `backend/src/services/llm/schema.rs`

- [ ] **Step 1: Remove old prompts from prompts.rs**

Remove these functions and their tests:
- `build_rewrite_prompt`
- `build_classification_prompt`
- `build_article_extraction_prompt`

Keep `build_link_extraction_prompt`: it is still used by the source_scraper LLM link extraction.

Keep the `category_gaps: Option<&[(String, i32)]>` parameter of `build_search_prompt`: Phase 2 still uses gap-aware search.

Remove `use crate::models::synthesis::ScrapedNewsItem;` if it's no longer needed (check if `build_classification_prompt` was the only user).

- [ ] **Step 2: Remove old schemas from schema.rs**

Remove: `build_classification_schema`, `build_article_extraction_schema`
Keep: `build_category_schema` (Phase 2 search), `build_link_extraction_schema` (source scraper), `build_article_classify_schema` (new)

- [ ] **Step 3: Verify + commit**

```bash
cd backend && cargo test --lib
git add backend/src/services/prompts.rs backend/src/services/llm/schema.rs
git commit -m "refactor: remove old classification, rewrite, and article extraction prompts/schemas"
```

---

### Task 5: Rewrite `synthesis.rs` — the core pipeline

**Files:**
- Modify: `backend/src/services/synthesis.rs`

This is the largest task. The entire `run_generation_inner` function is rewritten. Many helper functions are removed.

- [ ] **Step 1: Remove dead helper functions**

Delete these functions and their tests from `synthesis.rs`:
- `scrape_single_article_with_llm`
- `scrape_flat_urls`
- `scrape_articles`
- `filter_empty_scraped_articles`
- `build_rewrite_schema`
- `build_final_sections`
- `restore_scraped_urls`
- `parse_classification_response`
- `limit_articles_per_source`
- `dedup_by_url`
- `filter_homepage_urls`
- `SYNTHESIS_MIN_FILL_RATIO` constant
- All associated tests for these functions

Keep:
- `scrape_single_article` (used for Phase 1 per-article scraping)
- `emit_progress`
- `trace_article`
- `log_llm_call`
- `normalize_article_url` / `hash_article_url`
- `extract_domain`
- `resolve_provider_and_key` / `resolve_model`
- `check_rate_limit` / `get_user_rate_limiter`
- `sanitize_json_null_bytes`
- `sanitize_error_message`
- `get_iso_week_string`
- `parse_llm_output` (used in Phase 2)

- [ ] **Step 2: Add `rotate_sources` helper**

```rust
/// Rotate the sources list so that the source after the last-used source comes first.
fn rotate_sources(mut sources: Vec<Source>, last_source_url: Option<&str>) -> Vec<Source> {
    let Some(last_url) = last_source_url else {
        return sources;
    };

    // If the last-used source is no longer in the list, keep the original order.
    if let Some(idx) = sources.iter().position(|s| s.url == last_url) {
        let next = (idx + 1) % sources.len();
        sources.rotate_left(next);
    }
    sources
}
```

- [ ] **Step 3: Rewrite `run_generation_inner`**

Replace the entire function body with the new algorithm. The new flow:

```rust
async fn run_generation_inner(
    job_id: Uuid,
    state: &AppState,
    user_id: Uuid,
    tx: &watch::Sender<ProgressEvent>,
) -> Result<Uuid, AppError> {
    // === INITIALIZATION ===
    emit_progress(tx, "settings", "Chargement des parametres...", 5);
    let settings = db::settings::get_or_create_default(&state.pool, user_id).await?;

    // Cleanup
    if settings.article_history_days > 0 {
        db::article_history::cleanup_old(&state.pool, user_id, settings.article_history_days).await.unwrap_or(0);
        db::llm_call_log::truncate_old(&state.pool, user_id, settings.article_history_days).await.ok();
    }

    // Categories: "Autre" is always appended, so an empty user list classifies everything as "Autre"
    let user_categories = settings.categories.clone();
    let mut classification_categories = user_categories.clone();
    classification_categories.push("Autre".to_string());

    // Load sources
    emit_progress(tx, "sources", "Chargement des sources...", 10);
    let sources = db::sources::list_for_user(&state.pool, user_id).await?;

    // Resolve provider
    emit_progress(tx, "provider", "Configuration du fournisseur IA...", 12);
    let (provider_name, api_key) = resolve_provider_and_key(state, user_id, &settings).await?;
    let provider = create_provider(&provider_name, api_key)?;
    let model_research = if !settings.ai_model.is_empty() { settings.ai_model.clone() } else { resolve_model(state, &provider_name).await? };
    let model_writing = if !settings.ai_model_writing.is_empty() { settings.ai_model_writing.clone() } else { model_research.clone() };
    let user_rate_limiter = get_user_rate_limiter(state, &settings, user_id);

    // Tracking structures
    let mut article_scraped: HashMap<String, Vec<NewsItem>> = HashMap::new();
    let mut source_counts: HashMap<String, usize> = HashMap::new();
    let mut url_source: HashMap<String, String> = HashMap::new(); // url → source_url
    let mut filled_counts: HashMap<String, usize> = HashMap::new();
    let mut seen_urls: std::collections::HashSet<String> = std::collections::HashSet::new();
    let max_total = (user_categories.len() + 1) * settings.max_items_per_category as usize;
    let classify_schema = build_article_classify_schema();

    // === PHASE 1: Personalized Sources ===
    if !sources.is_empty() {
        emit_progress(tx, "sources_scrape", "Analyse des sources personnalisees...", 15);

        // 1a. Rotate sources
        let last_source = db::article_history::get_last_source_url(&state.pool, user_id).await.unwrap_or(None);
        let rotated_sources = rotate_sources(sources.clone(), last_source.as_deref());
        let max_sources = rotated_sources.len().min(10);
        let max_links = 10usize;

        let mut candidate_urls: Vec<(String, String)> = Vec::new(); // (article_url, source_url)

        for source in rotated_sources.iter().take(max_sources) {
            let links = if settings.use_llm_for_source_links {
                source_scraper::extract_article_links_with_llm(
                    &state.http_client, &source.url, max_links, &provider, &model_research,
                ).await
            } else {
                source_scraper::extract_article_links(
                    &state.http_client, &source.url, max_links,
                ).await
            };

            if let Ok(links) = links {
                for link in links {
                    if seen_urls.insert(link.to_lowercase()) {
                        candidate_urls.push((link, source.url.clone()));
                    }
                }
            }
        }

        // Filter against article history
        if settings.article_history_days > 0 && !candidate_urls.is_empty() {
            let hashes: Vec<String> = candidate_urls.iter().map(|(url, _)| hash_article_url(url)).collect();
            let existing = db::article_history::check_urls_exist(&state.pool, user_id, &hashes).await.unwrap_or_default();
            if !existing.is_empty() {
                // Trace filtered articles
                for (url, source_url) in &candidate_urls {
                    if existing.contains(&hash_article_url(url)) {
                        trace_article(&state.pool, user_id, job_id, url, "", "personalized_source", Some(source_url), None, None, "filtered_history", false).await;
                    }
                }
                candidate_urls.retain(|(url, _)| !existing.contains(&hash_article_url(url)));
            }
        }

        // Track url → source
        for (url, source_url) in &candidate_urls {
            url_source.insert(url.clone(), source_url.clone());
        }

        // 1b. Scrape, classify, summarize each article
        emit_progress(tx, "processing", "Traitement des articles...", 25);
        let total_candidates = candidate_urls.len();

        for (idx, (url, source_url)) in candidate_urls.into_iter().enumerate() {
            // Progress
            let pct = 25 + ((idx as u32 * 40) / total_candidates.max(1) as u32).min(40);
            emit_progress(tx, "processing", &format!("Article {}/{}...", idx + 1, total_candidates), pct as u8);

            // Check source limit
            let source_domain = extract_domain(&source_url).unwrap_or_default();
            let source_count = source_counts.get(&source_domain).copied().unwrap_or(0);
            if source_count >= settings.max_articles_per_source as usize {
                trace_article(&state.pool, user_id, job_id, &url, "", "personalized_source", Some(&source_url), None, None, "filtered_diversity", false).await;
                continue;
            }

            // Scrape
            let (body_text, page_title, final_url) = scrape_single_article(&state.http_client, &url, settings.max_age_days as i64).await;

            if body_text.trim().is_empty() {
                trace_article(&state.pool, user_id, job_id, &final_url, &page_title, "personalized_source", Some(&source_url), None, None, "filtered_empty", false).await;
                continue;
            }

            // LLM classify + summarize
            check_rate_limit(state, &user_rate_limiter, &provider_name)?;
            let body_snippet: String = body_text.chars().take(500).collect();
            let (class_sys, class_user) = prompts::build_article_classify_prompt(&page_title, &body_snippet, &classification_categories);

            let llm_start = std::time::Instant::now();
            let class_response = provider.call_llm(&model_research, &class_sys, &class_user, &classify_schema).await?;
            let llm_duration = llm_start.elapsed().as_millis() as u64;
            log_llm_call(&state.pool, user_id, job_id, "classify_summarize", &model_research, &class_sys, &class_user, &class_response, llm_duration).await;

            // Parse response
            let llm_title = class_response.get("title").and_then(|t| t.as_str()).unwrap_or(&page_title).to_string();
            let llm_summary = class_response.get("summary").and_then(|s| s.as_str()).unwrap_or("").to_string();
            let raw_category = class_response.get("category").and_then(|c| c.as_str()).unwrap_or("Autre");

            // Validate category: take the canonical spelling from the list, or "Autre" if not found.
            // Canonical casing keeps `filled_counts` keys aligned with `user_categories` in Phase 2.
            let llm_category = classification_categories.iter()
                .find(|c| c.to_lowercase() == raw_category.to_lowercase())
                .cloned()
                .unwrap_or_else(|| "Autre".to_string());

            // Map category to key
            let cat_key = if llm_category == "Autre" {
                "category_autre".to_string()
            } else {
                user_categories.iter().position(|c| c.to_lowercase() == llm_category.to_lowercase())
                    .map(|i| format!("category_{}", i))
                    .unwrap_or_else(|| "category_autre".to_string())
            };

            // Check if category is full → overflow to "Autre"
            let cat_filled = filled_counts.get(&llm_category).copied().unwrap_or(0);
            let (final_cat_key, final_cat_name) = if cat_filled >= settings.max_items_per_category as usize && llm_category != "Autre" {
                let autre_filled = filled_counts.get("Autre").copied().unwrap_or(0);
                if autre_filled >= settings.max_items_per_category as usize {
                    // Both full: skip article
                    continue;
                }
                ("category_autre".to_string(), "Autre".to_string())
            } else {
                (cat_key, llm_category)
            };

            // Add article. The source map is also keyed by the final URL, which may
            // differ from the candidate URL after redirects (used when recording history).
            url_source.insert(final_url.clone(), source_url.clone());
            article_scraped.entry(final_cat_key).or_default().push(NewsItem {
                title: llm_title,
                url: final_url.clone(),
                summary: llm_summary,
            });
            *filled_counts.entry(final_cat_name).or_insert(0) += 1;
            *source_counts.entry(source_domain).or_insert(0) += 1;

            // Check if we've reached the maximum
            let total: usize = article_scraped.values().map(|v| v.len()).sum();
            if total >= max_total {
                break;
            }
        }
    }

    // === PHASE 2: Web Search Fallback ===
    let category_gaps: Vec<(String, i32)> = user_categories.iter().filter_map(|cat| {
        let filled = filled_counts.get(cat).copied().unwrap_or(0);
        let needed = (settings.max_items_per_category as usize).saturating_sub(filled);
        if needed > 0 { Some((cat.clone(), needed as i32)) } else { None }
    }).collect();

    if !category_gaps.is_empty() {
        emit_progress(tx, "search", "Recherche d'actualites complementaires...", 70);
        check_rate_limit(state, &user_rate_limiter, &provider_name)?;

        let search_schema = build_category_schema(&user_categories, settings.max_items_per_category);
        let current_date = Utc::now().format("%A %d %B %Y").to_string();
        let (sys_prompt, usr_prompt) = prompts::build_search_prompt(&settings, &sources, &current_date, &[], Some(&category_gaps));

        let llm_start = std::time::Instant::now();
        let raw_results = provider.call_llm(&model_research, &sys_prompt, &usr_prompt, &search_schema).await?;
        let llm_duration = llm_start.elapsed().as_millis() as u64;
        log_llm_call(&state.pool, user_id, job_id, "search", &model_research, &sys_prompt, &usr_prompt, &raw_results, llm_duration).await;

        // Parse and filter
        emit_progress(tx, "parsing", "Analyse des resultats...", 75);
        let parsed = parse_llm_output(&raw_results, &user_categories)?;

        // Filter: homepage, cross-phase dedup, url dedup, source limit, history
        let mut phase2_articles: Vec<(String, NewsItem)> = Vec::new(); // (cat_key, item)

        for (cat_key, items) in parsed {
            for item in items {
                let url_lower = item.url.to_lowercase();

                // Homepage filter
                if let Ok(parsed_url) = url::Url::parse(&item.url) {
                    let path = parsed_url.path();
                    if path.is_empty() || path == "/" {
                        trace_article(&state.pool, user_id, job_id, &item.url, &item.title, "web_search", None, None, None, "filtered_homepage", false).await;
                        continue;
                    }
                }

                // Cross-phase dedup
                if seen_urls.contains(&url_lower) {
                    trace_article(&state.pool, user_id, job_id, &item.url, &item.title, "web_search", None, None, None, "filtered_cross_phase_dedup", false).await;
                    continue;
                }

                // History dedup
                if settings.article_history_days > 0 {
                    let hash = hash_article_url(&item.url);
                    let exists = db::article_history::check_urls_exist(&state.pool, user_id, &[hash.clone()]).await.unwrap_or_default();
                    if exists.contains(&hash) {
                        trace_article(&state.pool, user_id, job_id, &item.url, &item.title, "web_search", None, None, None, "filtered_history", false).await;
                        continue;
                    }
                }

                // Source limit
                if let Some(domain) = extract_domain(&item.url) {
                    let count = source_counts.get(&domain).copied().unwrap_or(0);
                    if count >= settings.max_articles_per_source as usize {
                        trace_article(&state.pool, user_id, job_id, &item.url, &item.title, "web_search", None, None, None, "filtered_diversity", false).await;
                        continue;
                    }
                }

                seen_urls.insert(url_lower);
                phase2_articles.push((cat_key.clone(), item));
            }
        }

        // Scrape Phase 2 articles for validation
        emit_progress(tx, "scraping", "Verification des sources web...", 80);
        for (cat_key, item) in phase2_articles {
            let (body_text, _, final_url) = scrape_single_article(&state.http_client, &item.url, settings.max_age_days as i64).await;

            if body_text.trim().is_empty() {
                trace_article(&state.pool, user_id, job_id, &final_url, &item.title, "web_search", None, None, None, "filtered_empty", false).await;
                continue;
            }

            // Use the LLM-provided title and summary (Phase 2 summaries are final)
            article_scraped.entry(cat_key).or_default().push(NewsItem {
                title: item.title,
                url: final_url,
                summary: item.summary,
            });

            if let Some(domain) = extract_domain(&item.url) {
                *source_counts.entry(domain).or_insert(0) += 1;
            }
        }
    }

    // === SAVE ===
    if article_scraped.values().all(|items| items.is_empty()) {
        return Err(AppError::BadRequest("Aucun article valide trouve. Verifiez vos sources et categories.".into()));
    }

    emit_progress(tx, "saving", "Sauvegarde de la synthese...", 90);

    // Build final sections
    let mut final_sections: Vec<NewsSection> = Vec::new();
    for (i, cat_name) in user_categories.iter().enumerate() {
        let key = format!("category_{}", i);
        if let Some(items) = article_scraped.get(&key) {
            if !items.is_empty() {
                final_sections.push(NewsSection { title: cat_name.clone(), items: items.clone() });
            }
        }
    }
    if let Some(autre_items) = article_scraped.get("category_autre") {
        if !autre_items.is_empty() {
            final_sections.push(NewsSection { title: "Autre".to_string(), items: autre_items.clone() });
        }
    }

    let sections_json = serde_json::to_value(&final_sections).map_err(|e| AppError::Internal(anyhow::anyhow!("Failed to serialize: {}", e)))?;
    let sections_json = sanitize_json_null_bytes(sections_json);

    let synthesis = db::syntheses::create(&state.pool, user_id, &get_iso_week_string(Utc::now().date_naive()), &sections_json, job_id).await?;

    // Record used articles
    if settings.article_history_days > 0 {
        for section in &final_sections {
            for item in &section.items {
                let source_url = url_source.get(&item.url).map(|s| s.as_str());
                trace_article(&state.pool, user_id, job_id, &item.url, &item.title,
                    if source_url.is_some() { "personalized_source" } else { "web_search" },
                    source_url, Some(&section.title), Some(synthesis.id), "used", true).await;
            }
        }
    }

    Ok(synthesis.id)
}
```
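
The category bookkeeping in the function above (case-insensitive matching, overflow to "Autre", and the Phase 2 gap list) can be exercised in isolation. The following is a minimal sketch under stated assumptions, not code from the repository: `canonical_category`, `place_article`, `category_gaps`, and `count` are hypothetical helper names, and unlike the listing this sketch also caps "Autre" itself at `max_per_category`.

```rust
use std::collections::HashMap;

const AUTRE: &str = "Autre";

/// Number of articles already placed in `name` (0 if none yet).
fn count(filled: &HashMap<String, usize>, name: &str) -> usize {
    filled.get(name).copied().unwrap_or(0)
}

/// Map an LLM answer to the user's spelling of the category, ignoring case.
/// Unknown names fall back to "Autre".
fn canonical_category(answer: &str, user_categories: &[String]) -> String {
    user_categories
        .iter()
        .find(|c| c.to_lowercase() == answer.to_lowercase())
        .cloned()
        .unwrap_or_else(|| AUTRE.to_string())
}

/// Place one article: its own category if there is room, else "Autre",
/// else `None` when both are full. Updates `filled` on success.
fn place_article(
    category: &str,
    filled: &mut HashMap<String, usize>,
    max_per_category: usize,
) -> Option<String> {
    let target = if count(filled, category) < max_per_category {
        category
    } else if count(filled, AUTRE) < max_per_category {
        AUTRE
    } else {
        return None;
    };
    *filled.entry(target.to_string()).or_insert(0) += 1;
    Some(target.to_string())
}

/// Phase 2 input: how many items each user category still needs.
fn category_gaps(
    user_categories: &[String],
    filled: &HashMap<String, usize>,
    max_per_category: usize,
) -> Vec<(String, i32)> {
    user_categories
        .iter()
        .filter_map(|cat| {
            let needed = max_per_category.saturating_sub(count(filled, cat));
            (needed > 0).then(|| (cat.clone(), needed as i32))
        })
        .collect()
}
```

Keying `filled` by the canonical category name matters: if the LLM answers `"ai news"` for a category stored as `"AI News"`, counting under the raw answer would make Phase 2 believe that category is still empty.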

- [ ] **Step 4: Add `rotate_sources` unit tests**

```rust
#[test]
fn rotate_sources_after_last_used() {
    // Create mock sources (needs a Source struct with a url field)
    // Test that rotation works correctly
}
```
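
The stub above leaves the test cases open. As one possible way to fill it in, the sketch below uses a stand-in `Source` struct with only a `url` field (the real model likely has more fields, so the constructor would need adapting) and a local copy of the rotation rule. `make_sources` and `urls` are hypothetical test helpers, and the `example` URLs are placeholders.

```rust
/// Stand-in for the real `Source` model; only the `url` field is assumed.
#[derive(Debug, Clone, PartialEq)]
struct Source {
    url: String,
}

/// Rotation rule from Step 2: the source after the last-used one comes first.
fn rotate_sources(mut sources: Vec<Source>, last_source_url: Option<&str>) -> Vec<Source> {
    if let Some(last_url) = last_source_url {
        if let Some(idx) = sources.iter().position(|s| s.url == last_url) {
            let next = (idx + 1) % sources.len();
            sources.rotate_left(next);
        }
    }
    sources
}

/// Test helper: build sources from a list of URLs.
fn make_sources(urls: &[&str]) -> Vec<Source> {
    urls.iter().map(|u| Source { url: u.to_string() }).collect()
}

/// Test helper: read the URLs back for easy comparison.
fn urls(sources: &[Source]) -> Vec<&str> {
    sources.iter().map(|s| s.url.as_str()).collect()
}

#[cfg(test)]
mod tests {
    use super::*;

    const A: &str = "https://a.example";
    const B: &str = "https://b.example";
    const C: &str = "https://c.example";

    #[test]
    fn rotate_sources_after_last_used() {
        let rotated = rotate_sources(make_sources(&[A, B, C]), Some(A));
        assert_eq!(urls(&rotated), vec![B, C, A]);
    }

    #[test]
    fn rotate_sources_wraps_after_final_source() {
        let rotated = rotate_sources(make_sources(&[A, B, C]), Some(C));
        assert_eq!(urls(&rotated), vec![A, B, C]);
    }

    #[test]
    fn rotate_sources_keeps_order_without_history_or_match() {
        let unchanged = rotate_sources(make_sources(&[A, B]), None);
        assert_eq!(urls(&unchanged), vec![A, B]);
        let unknown = rotate_sources(make_sources(&[A, B]), Some(C));
        assert_eq!(urls(&unknown), vec![A, B]);
    }

    #[test]
    fn rotate_sources_handles_empty_list() {
        assert!(rotate_sources(Vec::new(), Some(A)).is_empty());
    }
}
```

The wrap-around and empty-list cases are the ones most worth keeping, since they cover the modulo and the early exit when no source matches.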

- [ ] **Step 5: Verify + commit**

```bash
cd backend && cargo test --lib
git add backend/src/services/synthesis.rs
git commit -m "feat: rewrite synthesis pipeline — per-article classify/summarize, no rewrite pass"
```

---

### Task 6: Frontend — remove deprecated settings

**Files:**
- Modify: `frontend/src/types.ts`
- Modify: `frontend/src/pages/Settings.tsx`
- Modify: `frontend/src/i18n/fr.ts`

- [ ] **Step 1: Remove fields from types**

Remove `source_diversity_window: number` and `use_llm_for_article_extraction: boolean` from `UserSettings` and `DEFAULT_SETTINGS`.

- [ ] **Step 2: Remove from Settings page**

Remove the diversity window number input and the LLM extraction checkbox from `Settings.tsx`.

- [ ] **Step 3: Remove i18n labels**

Remove `settings.diversityWindow` and `settings.useLlmForArticleExtraction` labels.

- [ ] **Step 4: Verify + commit**

```bash
cd frontend && npx tsc --noEmit && npx vitest run
git add frontend/src/types.ts frontend/src/pages/Settings.tsx frontend/src/i18n/fr.ts
git commit -m "feat: remove deprecated settings from frontend"
```

---

### Task 7: Update E2E test

**Files:**
- Modify: `e2e/tests/generation-live.spec.ts`

- [ ] **Step 1: Update settings payload**

Remove `source_diversity_window` and `use_llm_for_article_extraction` from the PUT settings body.

- [ ] **Step 2: Commit**

```bash
git add e2e/tests/generation-live.spec.ts
git commit -m "test: update E2E test for new pipeline (remove deprecated settings)"
```