# Design: Algorithm Rewrite — Per-Article Classification, No Rewrite Pass
**Date**: 2026-03-25

**Scope**: Complete rewrite of synthesis generation pipeline. Per-article LLM classify/summarize, source rotation, remove rewrite pass, remove deprecated settings.

---
## Context
The current pipeline has grown complex: batch classification, separate rewrite pass, URL restoration, "Autre" fill-up, LLM article extraction, source diversity window. The new algorithm simplifies to: scrape each article → LLM classify/summarize per article → stop when categories are full. No batch steps, no rewrite pass.
## New Algorithm
See `docs/algorithm.md` for the complete specification.
### Key Changes from Current Code
1. **Per-article LLM call** replaces batch classification + batch rewrite. Each article gets one LLM call returning `{title, summary, category}`.
2. **Source rotation** — rolling window based on the last source used in `article_history`.
3. **No rewrite pass** — summaries are final from the per-article call.
4. **No URL restoration** — no rewrite means no hallucinated URLs to fix.
5. **No "Autre" fill-up** — the 75% target logic is removed.
6. **No LLM article extraction** — removed; the per-article LLM call handles content directly.
7. **Phase 2 summaries are final** — the search LLM returns title/url/summary; scraping is validation only.
## Removed Settings (Destructive Migration)
```sql
ALTER TABLE settings DROP COLUMN source_diversity_window;
ALTER TABLE settings DROP COLUMN use_llm_for_article_extraction;
```
## Removed Code
- `SYNTHESIS_MIN_FILL_RATIO` constant + fill-up logic
- `scrape_single_article_with_llm` function
- `scrape_article_dispatch` / LLM extraction branching in `scrape_flat_urls` and `scrape_articles`
- `build_rewrite_prompt` / `build_rewrite_schema`
- `build_classification_prompt` / `build_classification_schema`
- `build_article_extraction_prompt` / `build_article_extraction_schema`
- `parse_classification_response`
- `filter_empty_scraped_articles`
- `restore_scraped_urls`
- `limit_articles_per_source` / `dedup_by_url` / `filter_homepage_urls` (dedup happens inline)
- `build_final_sections` (no rewrite pass)
- `scrape_articles` (replaced by per-article scraping in the main loop)
- `scrape_flat_urls` (replaced by per-article scraping in the main loop)
- `head_html` field from `ScrapedContent` (LLM extraction removed)
## New Code
### New prompt: `build_article_classify_prompt`
```rust
pub fn build_article_classify_prompt(
    title: &str,
    body_snippet: &str,
    categories: &[String], // includes "Autre"
) -> (String, String)
```
Asks the LLM to classify the article into a category and generate a title + 4-5 line summary.
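A minimal sketch of what the body of this function could look like. It assumes the returned tuple is `(system_prompt, user_prompt)`, which the signature alone does not confirm, and the prompt wording is purely illustrative rather than the actual text.

```rust
/// Hypothetical body for `build_article_classify_prompt`.
/// Assumption: the tuple is (system prompt, user prompt).
/// The prompt wording below is illustrative, not the project's real text.
pub fn build_article_classify_prompt(
    title: &str,
    body_snippet: &str,
    categories: &[String], // includes "Autre"
) -> (String, String) {
    // Render the allowed categories as a bullet list so the model
    // can only pick from known names.
    let category_list = categories
        .iter()
        .map(|c| format!("- {c}"))
        .collect::<Vec<_>>()
        .join("\n");

    let system = format!(
        "You classify and summarize news articles.\n\
         Choose exactly one category from this list:\n\
         {category_list}\n\
         Return a title, a 4-5 line summary, and the chosen category."
    );

    // The article itself goes into the user message.
    let user = format!("Title: {title}\n\nContent:\n{body_snippet}");

    (system, user)
}
```

Keeping the category list in the system prompt and the article in the user prompt means the same system prompt can be reused for every article in a run.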
### New schema: `build_article_classify_schema`
```json
{
  "type": "object",
  "properties": {
    "title": { "type": "string" },
    "summary": { "type": "string" },
    "category": { "type": "string" }
  },
  "required": ["title", "summary", "category"],
  "additionalProperties": false
}
```
### New DB query: `get_last_source_url`
```rust
pub async fn get_last_source_url(pool: &PgPool, user_id: Uuid) -> Result<Option<String>, AppError>
```
Returns the `source_url` from the most recent `used` entry for source rotation.
### Source rotation
Rotate the user's source list so that the source following the last-used source comes first.
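The rotation can be sketched as below. The function name `rotate_sources` and the choice to leave the order unchanged when there is no last-used source (or it is no longer in the list) are assumptions made for illustration; the spec only states the rotation rule itself.

```rust
/// Hypothetical helper: rotate `sources` so that the entry following
/// `last_used` comes first. Assumption: if `last_used` is `None` or is not
/// present in the list, the original order is returned unchanged.
pub fn rotate_sources(sources: &[String], last_used: Option<&str>) -> Vec<String> {
    let mut rotated = sources.to_vec();
    if let Some(last) = last_used {
        if let Some(pos) = rotated.iter().position(|s| s.as_str() == last) {
            // Start at the source after the last-used one, wrapping
            // around to the front when the last-used source was the final entry.
            let start = (pos + 1) % rotated.len();
            rotated.rotate_left(start);
        }
    }
    rotated
}
```

For example, with sources `[a, b, c]` and a last-used source of `a`, the rotated order is `[b, c, a]`; with a last-used source of `c`, the order wraps around and stays `[a, b, c]`.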
## Rewritten `run_generation_inner` Flow
```
Initialization:
  Load settings, sources, provider, models, rate limiter
  Cleanup article history + LLM logs
  Build classification categories (user categories + "Autre")
  Initialize: article_scraped, source_counts, url_source, filled_counts, seen_urls
Phase 1 (if sources not empty):
  1a. Rotate sources (rolling window via get_last_source_url)
      For each source:
        Extract article links (LLM or heuristic, max 10)
        Deduplicate + filter against article history
        Track url → source mapping
  1b. For each candidate URL:
        Check source_counts → skip if exceeded max_articles_per_source (trace: filtered_diversity)
        Scrape article (heuristic only — title, date, body, soft-404)
        Skip if empty/failed (trace: filtered_empty)
        LLM call: classify + summarize → {title, summary, category} (logged)
        If category full → assign to "Autre"
        Add to article_scraped + update filled_counts + source_counts
        If total >= (num_categories_including_autre) × max_items_per_category → break
Phase 2 (if any user category under-filled):
  2a. Compute gaps per category
  2b. LLM search call → {category_0: [{title, url, summary}], ...} (logged)
  2c. Filter: homepage, cross-phase dedup, url dedup, source limit, history (all traced)
      Scrape for validation (filter empty, trace drops)
      Merge into article_scraped
Save:
  Sanitize null bytes
  Build sections from article_scraped (group by category key → NewsSection)
  Save synthesis with job_id
  Record used articles in article_history
```
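The category bookkeeping in steps 1b and 2a can be sketched as three small pure functions. The names `assign_category`, `phase1_done` and `compute_gaps` are hypothetical, and the sketch assumes `filled_counts` is a map from category name to item count. It does not cover cases the flow leaves open, such as "Autre" itself being full or the LLM returning a category outside the list.

```rust
use std::collections::HashMap;

/// Step 1b (hypothetical helper): keep the LLM's category if it still has
/// room, otherwise fall back to the "Autre" catch-all.
pub fn assign_category(
    llm_category: &str,
    filled_counts: &HashMap<String, usize>,
    max_items_per_category: usize,
) -> String {
    // A category with no entry yet counts as empty.
    let filled = filled_counts.get(llm_category).copied().unwrap_or(0);
    if filled >= max_items_per_category {
        "Autre".to_string()
    } else {
        llm_category.to_string()
    }
}

/// Step 1b stop condition (hypothetical helper):
/// total >= num_categories_including_autre × max_items_per_category.
pub fn phase1_done(
    total: usize,
    num_categories_including_autre: usize,
    max_items_per_category: usize,
) -> bool {
    total >= num_categories_including_autre * max_items_per_category
}

/// Step 2a (hypothetical helper): for each user category, how many items
/// are still missing. Categories that are already full are omitted.
pub fn compute_gaps(
    user_categories: &[String],
    filled_counts: &HashMap<String, usize>,
    max_items_per_category: usize,
) -> Vec<(String, usize)> {
    user_categories
        .iter()
        .filter_map(|cat| {
            let filled = filled_counts.get(cat).copied().unwrap_or(0);
            let gap = max_items_per_category.saturating_sub(filled);
            (gap > 0).then(|| (cat.clone(), gap))
        })
        .collect()
}
```

Phase 2 would then run only when `compute_gaps` returns a non-empty list, which matches the "if any user category under-filled" guard in the flow.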
## ScrapedContent Simplification
Remove `head_html: String` field (LLM extraction removed). Keep `url: String` (redirect-resolved URL).
## Files to Modify
**Migration:**
- **Create:** migration to drop `source_diversity_window` and `use_llm_for_article_extraction`
**Backend — rewrite:**
- **Rewrite:** `backend/src/services/synthesis.rs` — new `run_generation_inner`, remove all batch/rewrite/fill-up code
- **Simplify:** `backend/src/services/scraper.rs` — remove `head_html` from `ScrapedContent`
- **Modify:** `backend/src/services/prompts.rs` — remove old prompts, add `build_article_classify_prompt`
- **Modify:** `backend/src/services/llm/schema.rs` — remove old schemas, add `build_article_classify_schema`
- **Modify:** `backend/src/models/settings.rs` — remove 2 fields from all structs + Default + validation
- **Modify:** `backend/src/db/settings.rs` — remove from queries
- **Modify:** `backend/src/db/article_history.rs` — add `get_last_source_url`
- **Modify:** `backend/src/models/synthesis.rs` — remove `source_url` from `ScrapedNewsItem` if no longer needed, or keep for tracing
**Frontend:**
- **Modify:** `frontend/src/types.ts` — remove 2 fields
- **Modify:** `frontend/src/pages/Settings.tsx` — remove diversity window input + LLM extraction checkbox
- **Modify:** `frontend/src/i18n/fr.ts` — remove labels
- **Modify:** `e2e/tests/generation-live.spec.ts` — update settings payload
## What Does NOT Change
- `source_scraper.rs` — link extraction (both heuristic and LLM) stays as-is
- `article_history` table + tracing — stays, used for dedup + provenance
- `llm_call_log` table — stays, logging per LLM call
- `use_llm_for_source_links` setting — stays
- `max_articles_per_source` setting — stays
- `article_history_days` setting — stays
- LLM provider trait (`call_llm`) — stays
- Frontend: ArticleHistory page, LlmLogs page, provenance section — all stay