Design: Algorithm Rewrite — Per-Article Classification, No Rewrite Pass

Date: 2026-03-25

Scope: Complete rewrite of synthesis generation pipeline. Per-article LLM classify/summarize, source rotation, remove rewrite pass, remove deprecated settings.

Context

The current pipeline has grown complex: batch classification, separate rewrite pass, URL restoration, "Autre" fill-up, LLM article extraction, source diversity window. The new algorithm simplifies to: scrape each article → LLM classify/summarize per article → stop when categories are full. No batch steps, no rewrite pass.

New Algorithm

See docs/algorithm.md for the complete specification.

Key Changes from Current Code

Per-article LLM call replaces batch classification + batch rewrite. Each article gets one LLM call returning {title, summary, category}.
Source rotation — rolling window based on last source used in article_history.
No rewrite pass — summaries are final from the per-article call.
No URL restoration — no rewrite means no hallucinated URLs to fix.
No "Autre" fill-up — the 75% target logic is removed.
No LLM article extraction — removed, per-article LLM call handles content directly.
Phase 2 summaries are final — search LLM returns title/url/summary, scraping is validation only.

Removed Settings (Destructive Migration)

ALTER TABLE settings DROP COLUMN source_diversity_window;
ALTER TABLE settings DROP COLUMN use_llm_for_article_extraction;

Removed Code

SYNTHESIS_MIN_FILL_RATIO constant + fill-up logic
scrape_single_article_with_llm function
scrape_article_dispatch / LLM extraction branching in scrape_flat_urls and scrape_articles
build_rewrite_prompt / build_rewrite_schema
build_classification_prompt / build_classification_schema
build_article_extraction_prompt / build_article_extraction_schema
parse_classification_response
filter_empty_scraped_articles
restore_scraped_urls
limit_articles_per_source / dedup_by_url / filter_homepage_urls (dedup happens inline)
build_final_sections (no rewrite pass)
scrape_articles (replaced by per-article scraping in the main loop)
scrape_flat_urls (replaced by per-article scraping in the main loop)
head_html field from ScrapedContent (LLM extraction removed)

New Code

New prompt: `build_article_classify_prompt`

pub fn build_article_classify_prompt(
    title: &str,
    body_snippet: &str,
    categories: &[String], // includes "Autre"
) -> (String, String)

Asks the LLM to classify the article into a category and generate a title + 4-5 line summary.
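A minimal sketch of what the prompt builder could look like, to make the signature above concrete. It assumes the returned tuple is (system prompt, user prompt), which this design does not state, and the prompt wording is purely illustrative.

```rust
/// Sketch of the per-article prompt builder.
/// Assumption: the return value is (system_prompt, user_prompt).
pub fn build_article_classify_prompt(
    title: &str,
    body_snippet: &str,
    categories: &[String], // includes "Autre"
) -> (String, String) {
    // One category per line so the model can copy a label verbatim.
    let category_list = categories
        .iter()
        .map(|c| format!("- {c}"))
        .collect::<Vec<_>>()
        .join("\n");

    // Illustrative wording only; the real prompt text is not part of this design.
    let system = format!(
        "You classify and summarize news articles.\n\
         Pick exactly one category from this list:\n{category_list}\n\
         Return JSON with the keys title, summary, category. \
         The summary should be 4-5 lines long."
    );

    // The scraped title and body snippet go into the user message.
    let user = format!("Title: {title}\n\nContent:\n{body_snippet}");

    (system, user)
}
```

Listing the categories verbatim in the prompt keeps the returned category string directly comparable to the keys used for the fill counters.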

New schema: `build_article_classify_schema`

{
  "type": "object",
  "properties": {
    "title": { "type": "string" },
    "summary": { "type": "string" },
    "category": { "type": "string" }
  },
  "required": ["title", "summary", "category"],
  "additionalProperties": false
}

New DB query: `get_last_source_url`

pub async fn get_last_source_url(pool: &PgPool, user_id: Uuid) -> Result<Option<String>, AppError>

Returns the source_url of the most recently used article_history entry for the user, or None if there is none. Source rotation starts from this value.

Source rotation

Rotate the user's source list so that the source following the last-used source comes first.
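The rotation can be sketched as follows. The function name `rotate_sources` is hypothetical, and the sketch assumes sources are matched by exact URL string; when there is no last-used source, or it is no longer in the list, the order is left unchanged.

```rust
/// Rotates `sources` in place so the entry after `last_used` comes first.
/// Hypothetical helper; matching by exact URL string is an assumption.
pub fn rotate_sources(sources: &mut [String], last_used: Option<&str>) {
    // No history yet: keep the configured order.
    let Some(last) = last_used else { return };

    if let Some(pos) = sources.iter().position(|s| s.as_str() == last) {
        // Moving the first pos + 1 entries to the back puts the successor at index 0.
        // The modulo handles the case where the last-used source is the final entry.
        let shift = (pos + 1) % sources.len();
        sources.rotate_left(shift);
    }
}
```

With sources [a, b, c] and last-used b, the resulting order is [c, a, b], so each run starts one source later than the previous one.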

Rewritten `run_generation_inner` Flow

Initialization:
  Load settings, sources, provider, models, rate limiter
  Cleanup article history + LLM logs
  Build classification categories (user categories + "Autre")
  Initialize: article_scraped, source_counts, url_source, filled_counts, seen_urls

Phase 1 (if sources not empty):
  1a. Rotate sources (rolling window via get_last_source_url)
  For each source:
    Extract article links (LLM or heuristic, max 10)
    Deduplicate + filter against article history
    Track url → source mapping

  1b. For each candidate URL:
    Check source_counts → skip if exceeded max_articles_per_source (trace: filtered_diversity)
    Scrape article (heuristic only — title, date, body, soft-404)
    Skip if empty/failed (trace: filtered_empty)
    LLM call: classify + summarize → {title, summary, category} (logged)
    If category full → assign to "Autre"
    Add to article_scraped + update filled_counts + source_counts
    If total >= (num_categories_including_autre) × max_items_per_category → break

Phase 2 (if any user category under-filled):
  2a. Compute gaps per category
  2b. LLM search call → {category_0: [{title, url, summary}], ...} (logged)
  2c. Filter: homepage, cross-phase dedup, url dedup, source limit, history (all traced)
  Scrape for validation (filter empty, trace drops)
  Merge into article_scraped

Save:
  Sanitize null bytes
  Build sections from article_scraped (group by category key → NewsSection)
  Save synthesis with job_id
  Record used articles in article_history
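The counter bookkeeping in steps 1b and 2a can be sketched as below. `FillTracker` and its methods are hypothetical names, not part of the codebase. The flow above does not say what happens once "Autre" itself is full, so this sketch leaves "Autre" uncapped and relies on the total-count stop condition.

```rust
use std::collections::HashMap;

const OTHER: &str = "Autre";

/// Hypothetical holder for the Phase 1b counters (filled_counts, source_counts).
pub struct FillTracker {
    num_categories: usize, // user categories + "Autre"
    max_items_per_category: usize,
    max_articles_per_source: usize,
    filled_counts: HashMap<String, usize>,
    source_counts: HashMap<String, usize>,
}

impl FillTracker {
    pub fn new(num_categories: usize, max_items_per_category: usize, max_articles_per_source: usize) -> Self {
        Self {
            num_categories,
            max_items_per_category,
            max_articles_per_source,
            filled_counts: HashMap::new(),
            source_counts: HashMap::new(),
        }
    }

    fn count(&self, category: &str) -> usize {
        self.filled_counts.get(category).copied().unwrap_or(0)
    }

    /// True when the source already reached max_articles_per_source (trace: filtered_diversity).
    pub fn source_exhausted(&self, source: &str) -> bool {
        self.source_counts.get(source).copied().unwrap_or(0) >= self.max_articles_per_source
    }

    /// Records one article and returns the category it finally landed in.
    /// A full category overflows to "Autre" (assumed uncapped in this sketch).
    pub fn record(&mut self, source: &str, llm_category: &str) -> String {
        let full = self.count(llm_category) >= self.max_items_per_category;
        let category = if full { OTHER } else { llm_category };
        *self.filled_counts.entry(category.to_string()).or_insert(0) += 1;
        *self.source_counts.entry(source.to_string()).or_insert(0) += 1;
        category.to_string()
    }

    /// Stop condition: total >= num_categories x max_items_per_category.
    pub fn is_done(&self) -> bool {
        let total: usize = self.filled_counts.values().sum();
        total >= self.num_categories * self.max_items_per_category
    }

    /// Phase 2a: missing items per user category ("Autre" is skipped).
    pub fn gaps(&self, categories: &[String]) -> Vec<(String, usize)> {
        categories
            .iter()
            .filter(|c| c.as_str() != OTHER)
            .map(|c| (c.clone(), self.max_items_per_category.saturating_sub(self.count(c))))
            .filter(|(_, missing)| *missing > 0)
            .collect()
    }
}
```

In the main loop, `source_exhausted` would be checked before scraping, `record` called after the LLM response, and `is_done` after each recorded article; a non-empty `gaps` result is what triggers Phase 2.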

ScrapedContent Simplification

Remove head_html: String field (LLM extraction removed). Keep url: String (redirect-resolved URL).

Files to Modify

Migration:

Create: migration to drop source_diversity_window and use_llm_for_article_extraction

Backend — rewrite:

Rewrite: backend/src/services/synthesis.rs — new run_generation_inner, remove all batch/rewrite/fill-up code
Simplify: backend/src/services/scraper.rs — remove head_html from ScrapedContent
Modify: backend/src/services/prompts.rs — remove old prompts, add build_article_classify_prompt
Modify: backend/src/services/llm/schema.rs — remove old schemas, add build_article_classify_schema
Modify: backend/src/models/settings.rs — remove 2 fields from all structs + Default + validation
Modify: backend/src/db/settings.rs — remove from queries
Modify: backend/src/db/article_history.rs — add get_last_source_url
Modify: backend/src/models/synthesis.rs — remove source_url from ScrapedNewsItem if no longer needed, or keep for tracing

Frontend:

Modify: frontend/src/types.ts — remove 2 fields
Modify: frontend/src/pages/Settings.tsx — remove diversity window input + LLM extraction checkbox
Modify: frontend/src/i18n/fr.ts — remove labels
Modify: e2e/tests/generation-live.spec.ts — update settings payload

What Does NOT Change

source_scraper.rs — link extraction (both heuristic and LLM) stays as-is
article_history table + tracing — stays, used for dedup + provenance
llm_call_log table — stays, logging per LLM call
use_llm_for_source_links setting — stays
max_articles_per_source setting — stays
article_history_days setting — stays
LLM provider trait (call_llm) — stays
Frontend: ArticleHistory page, LlmLogs page, provenance section — all stay
