From 1d5dc0596c71fc4955d2cf71c0f40140357a8fc8 Mon Sep 17 00:00:00 2001
From: oabrivard
Date: Wed, 25 Mar 2026 00:35:27 +0100
Subject: [PATCH] =?UTF-8?q?docs:=20add=20spec=20for=20algorithm=20rewrite?=
 =?UTF-8?q?=20=E2=80=94=20per-article=20classify,=20no=20rewrite=20pass?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .../2026-03-25-algorithm-rewrite-design.md    | 164 ++++++++++++++++++
 1 file changed, 164 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-03-25-algorithm-rewrite-design.md

diff --git a/docs/superpowers/specs/2026-03-25-algorithm-rewrite-design.md b/docs/superpowers/specs/2026-03-25-algorithm-rewrite-design.md
new file mode 100644
index 0000000..6dfeb2d
--- /dev/null
+++ b/docs/superpowers/specs/2026-03-25-algorithm-rewrite-design.md
@@ -0,0 +1,164 @@
+# Design: Algorithm Rewrite — Per-Article Classification, No Rewrite Pass
+
+**Date**: 2026-03-25
+**Scope**: Complete rewrite of synthesis generation pipeline. Per-article LLM classify/summarize, source rotation, remove rewrite pass, remove deprecated settings.
+
+---
+
+## Context
+
+The current pipeline has grown complex: batch classification, separate rewrite pass, URL restoration, "Autre" fill-up, LLM article extraction, source diversity window. The new algorithm simplifies to: scrape each article → LLM classify/summarize per article → stop when categories are full. No batch steps, no rewrite pass.
+
+## New Algorithm
+
+See `docs/algorithm.md` for the complete specification.
+
+### Key Changes from Current Code
+
+1. **Per-article LLM call** replaces batch classification + batch rewrite. Each article gets one LLM call returning `{title, summary, category}`.
+2. **Source rotation** — rolling window based on last source used in article_history.
+3. **No rewrite pass** — summaries are final from the per-article call.
+4. **No URL restoration** — no rewrite means no hallucinated URLs to fix.
+5. **No "Autre" fill-up** — the 75% target logic is removed.
+6. **No LLM article extraction** — removed, per-article LLM call handles content directly.
+7. **Phase 2 summaries are final** — search LLM returns title/url/summary, scraping is validation only.
+
+## Removed Settings (Destructive Migration)
+
+```sql
+ALTER TABLE settings DROP COLUMN source_diversity_window;
+ALTER TABLE settings DROP COLUMN use_llm_for_article_extraction;
+```
+
+## Removed Code
+
+- `SYNTHESIS_MIN_FILL_RATIO` constant + fill-up logic
+- `scrape_single_article_with_llm` function
+- `scrape_article_dispatch` / LLM extraction branching in `scrape_flat_urls` and `scrape_articles`
+- `build_rewrite_prompt` / `build_rewrite_schema`
+- `build_classification_prompt` / `build_classification_schema`
+- `build_article_extraction_prompt` / `build_article_extraction_schema`
+- `parse_classification_response`
+- `filter_empty_scraped_articles`
+- `restore_scraped_urls`
+- `limit_articles_per_source` / `dedup_by_url` / `filter_homepage_urls` (dedup happens inline)
+- `build_final_sections` (no rewrite pass)
+- `scrape_articles` (replaced by per-article scraping in the main loop)
+- `scrape_flat_urls` (replaced by per-article scraping in the main loop)
+- `head_html` field from `ScrapedContent` (LLM extraction removed)
+
+## New Code
+
+### New prompt: `build_article_classify_prompt`
+
+```rust
+pub fn build_article_classify_prompt(
+    title: &str,
+    body_snippet: &str,
+    categories: &[String], // includes "Autre"
+) -> (String, String)
+```
+
+Asks the LLM to classify the article into a category and generate a title + 4-5 line summary.
+
+### New schema: `build_article_classify_schema`
+
+```json
+{
+  "type": "object",
+  "properties": {
+    "title": { "type": "string" },
+    "summary": { "type": "string" },
+    "category": { "type": "string" }
+  },
+  "required": ["title", "summary", "category"],
+  "additionalProperties": false
+}
+```
+
+### New DB query: `get_last_source_url`
+
+```rust
+pub async fn get_last_source_url(pool: &PgPool, user_id: Uuid) -> Result<Option<String>, AppError>
+```
+
+Returns the `source_url` from the most recent `used` entry for source rotation.
+
+### Source rotation
+
+Rotate the user's source list so that the source following the last-used source comes first.
+
+## Rewritten `run_generation_inner` Flow
+
+```
+Initialization:
+  Load settings, sources, provider, models, rate limiter
+  Cleanup article history + LLM logs
+  Build classification categories (user categories + "Autre")
+  Initialize: article_scraped, source_counts, url_source, filled_counts, seen_urls
+
+Phase 1 (if sources not empty):
+  1a. Rotate sources (rolling window via get_last_source_url)
+  For each source:
+    Extract article links (LLM or heuristic, max 10)
+    Deduplicate + filter against article history
+    Track url → source mapping
+
+  1b. For each candidate URL:
+    Check source_counts → skip if exceeded max_articles_per_source (trace: filtered_diversity)
+    Scrape article (heuristic only — title, date, body, soft-404)
+    Skip if empty/failed (trace: filtered_empty)
+    LLM call: classify + summarize → {title, summary, category} (logged)
+    If category full → assign to "Autre"
+    Add to article_scraped + update filled_counts + source_counts
+    If total >= (num_categories_including_autre) × max_items_per_category → break
+
+Phase 2 (if any user category under-filled):
+  2a. Compute gaps per category
+  2b. LLM search call → {category_0: [{title, url, summary}], ...} (logged)
+  2c. Filter: homepage, cross-phase dedup, url dedup, source limit, history (all traced)
+  Scrape for validation (filter empty, trace drops)
+  Merge into article_scraped
+
+Save:
+  Sanitize null bytes
+  Build sections from article_scraped (group by category key → NewsSection)
+  Save synthesis with job_id
+  Record used articles in article_history
+```
+
+## ScrapedContent Simplification
+
+Remove `head_html: String` field (LLM extraction removed). Keep `url: String` (redirect-resolved URL).
+
+## Files to Modify
+
+**Migration:**
+- **Create:** migration to drop `source_diversity_window` and `use_llm_for_article_extraction`
+
+**Backend — rewrite:**
+- **Rewrite:** `backend/src/services/synthesis.rs` — new `run_generation_inner`, remove all batch/rewrite/fill-up code
+- **Simplify:** `backend/src/services/scraper.rs` — remove `head_html` from `ScrapedContent`
+- **Modify:** `backend/src/services/prompts.rs` — remove old prompts, add `build_article_classify_prompt`
+- **Modify:** `backend/src/services/llm/schema.rs` — remove old schemas, add `build_article_classify_schema`
+- **Modify:** `backend/src/models/settings.rs` — remove 2 fields from all structs + Default + validation
+- **Modify:** `backend/src/db/settings.rs` — remove from queries
+- **Modify:** `backend/src/db/article_history.rs` — add `get_last_source_url`
+- **Modify:** `backend/src/models/synthesis.rs` — remove `source_url` from `ScrapedNewsItem` if no longer needed, or keep for tracing
+
+**Frontend:**
+- **Modify:** `frontend/src/types.ts` — remove 2 fields
+- **Modify:** `frontend/src/pages/Settings.tsx` — remove diversity window input + LLM extraction checkbox
+- **Modify:** `frontend/src/i18n/fr.ts` — remove labels
+- **Modify:** `e2e/tests/generation-live.spec.ts` — update settings payload
+
+## What Does NOT Change
+
+- `source_scraper.rs` — link extraction (both heuristic and LLM) stays as-is
+- `article_history` table + tracing — stays, used for dedup + provenance
+- `llm_call_log` table — stays, logging per LLM call
+- `use_llm_for_source_links` setting — stays
+- `max_articles_per_source` setting — stays
+- `article_history_days` setting — stays
+- LLM provider trait (`call_llm`) — stays
+- Frontend: ArticleHistory page, LlmLogs page, provenance section — all stay
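
The source rotation step in the spec ("the source following the last-used source comes first") could be sketched as below. This is only an illustration under assumptions: `rotate_sources` is a hypothetical helper name that does not appear in the patch, sources are modeled as plain URL strings rather than the project's own source type, and `last_used` stands in for whatever `get_last_source_url` returns.

```rust
/// Hypothetical helper (not part of the patch): rotate `sources` so that the
/// entry right after `last_used` comes first.
/// If `last_used` is `None`, or is no longer in the list, the order is unchanged.
fn rotate_sources(mut sources: Vec<String>, last_used: Option<&str>) -> Vec<String> {
    if let Some(last) = last_used {
        // Find where the last-used source sits in the user's list.
        if let Some(idx) = sources.iter().position(|s| s.as_str() == last) {
            // A match implies the list is non-empty, so the modulo is safe.
            // When the last-used source is the final entry, shift becomes 0
            // and the list starts again from its first source.
            let shift = (idx + 1) % sources.len();
            sources.rotate_left(shift);
        }
    }
    sources
}
```

With sources `[a, b, c]` and `b` as the last-used source, the rotated order is `[c, a, b]`, so each run starts from a different source while keeping the user's relative ordering.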