Design: Algorithm Rewrite — Per-Article Classification, No Rewrite Pass
Date: 2026-03-25
Scope: Complete rewrite of synthesis generation pipeline. Per-article LLM classify/summarize, source rotation, remove rewrite pass, remove deprecated settings.
Context
The current pipeline has grown complex: batch classification, separate rewrite pass, URL restoration, "Autre" fill-up, LLM article extraction, source diversity window. The new algorithm simplifies to: scrape each article → LLM classify/summarize per article → stop when categories are full. No batch steps, no rewrite pass.
New Algorithm
See docs/algorithm.md for the complete specification.
Key Changes from Current Code
- Per-article LLM call replaces batch classification + batch rewrite. Each article gets one LLM call returning {title, summary, category}.
- Source rotation: a rolling window based on the last source used in article_history.
- No rewrite pass: summaries are final from the per-article call.
- No URL restoration: without a rewrite pass there are no hallucinated URLs to fix.
- No "Autre" fill-up: the 75% target logic is removed.
- No LLM article extraction: removed; the per-article LLM call handles content directly.
- Phase 2 summaries are final: the search LLM returns title/url/summary, and scraping is validation only.
Removed Settings (Destructive Migration)
ALTER TABLE settings DROP COLUMN source_diversity_window;
ALTER TABLE settings DROP COLUMN use_llm_for_article_extraction;
Removed Code
- SYNTHESIS_MIN_FILL_RATIO constant + fill-up logic
- scrape_single_article_with_llm function
- scrape_article_dispatch / LLM extraction branching in scrape_flat_urls and scrape_articles
- build_rewrite_prompt / build_rewrite_schema
- build_classification_prompt / build_classification_schema
- build_article_extraction_prompt / build_article_extraction_schema
- parse_classification_response
- filter_empty_scraped_articles
- restore_scraped_urls
- limit_articles_per_source / dedup_by_url / filter_homepage_urls (dedup happens inline)
- build_rewrite_schema / build_final_sections (no rewrite pass)
- scrape_articles (replaced by per-article scraping in the main loop)
- scrape_flat_urls (replaced by per-article scraping in the main loop)
- head_html field from ScrapedContent (LLM extraction removed)
New Code
New prompt: build_article_classify_prompt
pub fn build_article_classify_prompt(
    title: &str,
    body_snippet: &str,
    categories: &[String], // includes "Autre"
) -> (String, String)
Asks the LLM to classify the article into a category and generate a title + 4-5 line summary.
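A minimal sketch of what such a prompt builder could look like. The prompt wording and the reading of the returned tuple as (system prompt, user prompt) are assumptions; this document only fixes the signature and the intent.

```rust
/// Hypothetical sketch: returns (system_prompt, user_prompt).
/// The tuple layout and all prompt text are assumptions, not the final wording.
pub fn build_article_classify_prompt(
    title: &str,
    body_snippet: &str,
    categories: &[String], // includes "Autre"
) -> (String, String) {
    let system = String::from(
        "You classify news articles. Reply with JSON containing \
         `title`, `summary` (4-5 lines) and `category`.",
    );

    // One category per line so the model can copy a name verbatim.
    let category_list = categories
        .iter()
        .map(|c| format!("- {c}"))
        .collect::<Vec<_>>()
        .join("\n");

    let user = format!(
        "Allowed categories:\n{category_list}\n\nArticle title: {title}\n\nArticle body:\n{body_snippet}"
    );

    (system, user)
}
```

Listing the categories explicitly in the user prompt matters here because the schema types `category` as a free string, so the prompt is the only place the allowed values are stated.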
New schema: build_article_classify_schema
{
  "type": "object",
  "properties": {
    "title": { "type": "string" },
    "summary": { "type": "string" },
    "category": { "type": "string" }
  },
  "required": ["title", "summary", "category"],
  "additionalProperties": false
}
New DB query: get_last_source_url
pub async fn get_last_source_url(pool: &PgPool, user_id: Uuid) -> Result<Option<String>, AppError>
Returns the source_url of the most recent used entry in article_history, or None if the user has no such entry. Source rotation starts from this value.
Source rotation
Rotate the user's source list so that the source following the last-used source comes first.
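The rotation can be sketched as follows. The helper name rotate_sources is hypothetical, and leaving the list unchanged when the last-used source is missing or unknown is an assumed behaviour, not something this document specifies.

```rust
/// Hypothetical helper: rotate `sources` so that the entry following
/// `last_used` comes first and `last_used` itself ends up last.
/// Assumption: the order is left untouched when there is no last-used
/// source or when it is no longer in the user's list.
pub fn rotate_sources(sources: &mut [String], last_used: Option<&str>) {
    if let Some(last) = last_used {
        if let Some(idx) = sources.iter().position(|s| s.as_str() == last) {
            // Shifting by idx + 1 moves the successor to index 0.
            // The modulo keeps the shift in range when `last` is the final entry.
            let shift = (idx + 1) % sources.len();
            sources.rotate_left(shift);
        }
    }
}
```

With sources [a, b, c] and a last-used source of a, the list becomes [b, c, a], so the next run starts from b.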
Rewritten run_generation_inner Flow
Initialization:
Load settings, sources, provider, models, rate limiter
Cleanup article history + LLM logs
Build classification categories (user categories + "Autre")
Initialize: article_scraped, source_counts, url_source, filled_counts, seen_urls
Phase 1 (if sources not empty):
1a. Rotate sources (rolling window via get_last_source_url)
For each source:
Extract article links (LLM or heuristic, max 10)
Deduplicate + filter against article history
Track url → source mapping
1b. For each candidate URL:
Check source_counts → skip if exceeded max_articles_per_source (trace: filtered_diversity)
Scrape article (heuristic only — title, date, body, soft-404)
Skip if empty/failed (trace: filtered_empty)
LLM call: classify + summarize → {title, summary, category} (logged)
If category full → assign to "Autre"
Add to article_scraped + update filled_counts + source_counts
If total >= (num_categories_including_autre) × max_items_per_category → break
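The bookkeeping in step 1b can be sketched with three small helpers. All function names are hypothetical, and reading "exceeded" as "count has reached max_articles_per_source" is an assumption. The sketch also does not cap "Autre", since the flow above does not say what happens when "Autre" itself is full.

```rust
use std::collections::HashMap;

const AUTRE: &str = "Autre";

/// Per-source cap check (the filtered_diversity trace in the flow above).
/// Assumption: a source is skipped once its count has reached the limit.
pub fn source_exceeded(
    source_counts: &HashMap<String, usize>,
    source: &str,
    max_articles_per_source: usize,
) -> bool {
    source_counts.get(source).copied().unwrap_or(0) >= max_articles_per_source
}

/// Keep the LLM's category while it has room, otherwise fall back to "Autre".
pub fn assign_category<'a>(
    llm_category: &'a str,
    filled_counts: &HashMap<String, usize>,
    max_items_per_category: usize,
) -> &'a str {
    let filled = filled_counts.get(llm_category).copied().unwrap_or(0);
    if filled >= max_items_per_category {
        AUTRE
    } else {
        llm_category
    }
}

/// Phase 1 stops once the total reaches
/// (categories including "Autre") x max_items_per_category.
pub fn phase1_done(
    total: usize,
    num_categories_including_autre: usize,
    max_items_per_category: usize,
) -> bool {
    total >= num_categories_including_autre * max_items_per_category
}
```

For example, with two user categories plus "Autre" and max_items_per_category = 2, the loop breaks once six articles have been accepted.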
Phase 2 (if any user category under-filled):
2a. Compute gaps per category
2b. LLM search call → {category_0: [{title, url, summary}], ...} (logged)
2c. Filter: homepage, cross-phase dedup, url dedup, source limit, history (all traced)
Scrape for validation (filter empty, trace drops)
Merge into article_scraped
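Step 2a can be sketched as below. The function name compute_gaps and its return shape are hypothetical; the sketch assumes a gap is max_items_per_category minus the filled count, and that only user categories (not "Autre") are considered.

```rust
use std::collections::HashMap;

/// Hypothetical helper: for each user category, how many items are still
/// missing. Categories that are already full are omitted, so an empty
/// result means Phase 2 can be skipped.
pub fn compute_gaps(
    user_categories: &[String], // assumed to exclude "Autre"
    filled_counts: &HashMap<String, usize>,
    max_items_per_category: usize,
) -> Vec<(String, usize)> {
    user_categories
        .iter()
        .filter_map(|cat| {
            let filled = filled_counts.get(cat).copied().unwrap_or(0);
            // saturating_sub guards against a category that is over-filled.
            let gap = max_items_per_category.saturating_sub(filled);
            (gap > 0).then(|| (cat.clone(), gap))
        })
        .collect()
}
```

The resulting list preserves the user's category order, which would make it straightforward to map onto the category_0, category_1, ... keys of the search response.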
Save:
Sanitize null bytes
Build sections from article_scraped (group by category key → NewsSection)
Save synthesis with job_id
Record used articles in article_history
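The first two save steps can be sketched as follows. ScrapedItem and Section are simplified, hypothetical stand-ins for ScrapedNewsItem and NewsSection, whose real fields are not listed here. Emitting sections in category order and dropping empty ones are also assumptions.

```rust
/// Hypothetical stand-in for ScrapedNewsItem.
#[derive(Debug, Clone, PartialEq)]
pub struct ScrapedItem {
    pub category: String,
    pub title: String,
    pub summary: String,
}

/// Hypothetical stand-in for NewsSection.
#[derive(Debug, PartialEq)]
pub struct Section {
    pub category: String,
    pub items: Vec<ScrapedItem>,
}

/// Remove NUL bytes, which PostgreSQL text columns do not accept.
pub fn strip_null_bytes(text: &str) -> String {
    text.replace('\0', "")
}

/// Group items by category key, in the order of `categories`,
/// sanitizing text on the way. Empty sections are dropped (assumption).
pub fn build_sections(categories: &[String], scraped: &[ScrapedItem]) -> Vec<Section> {
    categories
        .iter()
        .map(|category| Section {
            category: category.clone(),
            items: scraped
                .iter()
                .filter(|item| &item.category == category)
                .map(|item| ScrapedItem {
                    category: item.category.clone(),
                    title: strip_null_bytes(&item.title),
                    summary: strip_null_bytes(&item.summary),
                })
                .collect(),
        })
        .filter(|section| !section.items.is_empty())
        .collect()
}
```

Because summaries are final after the per-article call, this grouping is the only transformation applied between article_scraped and the saved synthesis in this sketch.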
ScrapedContent Simplification
Remove head_html: String field (LLM extraction removed). Keep url: String (redirect-resolved URL).
Files to Modify
Migration:
- Create: migration to drop source_diversity_window and use_llm_for_article_extraction
Backend — rewrite:
- Rewrite: backend/src/services/synthesis.rs (new run_generation_inner; remove all batch/rewrite/fill-up code)
- Simplify: backend/src/services/scraper.rs (remove head_html from ScrapedContent)
- Modify: backend/src/services/prompts.rs (remove old prompts, add build_article_classify_prompt)
- Modify: backend/src/services/llm/schema.rs (remove old schemas, add build_article_classify_schema)
- Modify: backend/src/models/settings.rs (remove 2 fields from all structs + Default + validation)
- Modify: backend/src/db/settings.rs (remove from queries)
- Modify: backend/src/db/article_history.rs (add get_last_source_url)
- Modify: backend/src/models/synthesis.rs (remove source_url from ScrapedNewsItem if no longer needed, or keep for tracing)
Frontend:
- Modify: frontend/src/types.ts (remove 2 fields)
- Modify: frontend/src/pages/Settings.tsx (remove diversity window input + LLM extraction checkbox)
- Modify: frontend/src/i18n/fr.ts (remove labels)
- Modify: e2e/tests/generation-live.spec.ts (update settings payload)
What Does NOT Change
- source_scraper.rs: link extraction (both heuristic and LLM) stays as-is
- article_history table + tracing: stays, used for dedup + provenance
- llm_call_log table: stays, logging per LLM call
- use_llm_for_source_links setting: stays
- max_articles_per_source setting: stays
- article_history_days setting: stays
- LLM provider trait (call_llm): stays
- Frontend (ArticleHistory page, LlmLogs page, provenance section): all stay