# Design: Algorithm Rewrite — Per-Article Classification, No Rewrite Pass

**Date**: 2026-03-25

**Scope**: Complete rewrite of the synthesis generation pipeline: per-article LLM classify/summarize, source rotation, removal of the rewrite pass, and removal of deprecated settings.

---

## Context

The current pipeline has grown complex: batch classification, a separate rewrite pass, URL restoration, "Autre" fill-up, LLM article extraction, and a source diversity window. The new algorithm reduces this to three steps: scrape each article, classify and summarize it with one LLM call, and stop when the categories are full. There are no batch steps and no rewrite pass.

## New Algorithm

See `docs/algorithm.md` for the complete specification.

### Key Changes from Current Code

1. **Per-article LLM call** replaces batch classification + batch rewrite. Each article gets one LLM call returning `{title, summary, category}`.
2. **Source rotation**: rolling window based on the last source used in `article_history`.
3. **No rewrite pass**: summaries are final from the per-article call.
4. **No URL restoration**: without a rewrite pass there are no hallucinated URLs to fix.
5. **No "Autre" fill-up**: the 75% target logic is removed.
6. **No LLM article extraction**: removed; the per-article LLM call handles content directly.
7. **Phase 2 summaries are final**: the search LLM returns title/url/summary, and scraping serves only as validation.

## Removed Settings (Destructive Migration)

```sql
ALTER TABLE settings DROP COLUMN source_diversity_window;
ALTER TABLE settings DROP COLUMN use_llm_for_article_extraction;
```

## Removed Code

- `SYNTHESIS_MIN_FILL_RATIO` constant + fill-up logic
- `scrape_single_article_with_llm` function
- `scrape_article_dispatch` / LLM extraction branching in `scrape_flat_urls` and `scrape_articles`
- `build_rewrite_prompt` / `build_rewrite_schema`
- `build_classification_prompt` / `build_classification_schema`
- `build_article_extraction_prompt` / `build_article_extraction_schema`
- `parse_classification_response`
- `filter_empty_scraped_articles`
- `restore_scraped_urls`
- `limit_articles_per_source` / `dedup_by_url` / `filter_homepage_urls` (dedup happens inline)
- `build_final_sections` (no rewrite pass)
- `scrape_articles` (replaced by per-article scraping in the main loop)
- `scrape_flat_urls` (replaced by per-article scraping in the main loop)
- `head_html` field from `ScrapedContent` (LLM extraction removed)

## New Code

### New prompt: `build_article_classify_prompt`

```rust
pub fn build_article_classify_prompt(
    title: &str,
    body_snippet: &str,
    categories: &[String], // includes "Autre"
) -> (String, String)
```

Asks the LLM to classify the article into one of the categories and to generate a title plus a 4-5 line summary.

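
As a rough illustration of the signature above, the body could look like the following. This is only a sketch: it assumes the returned tuple is `(system_prompt, user_prompt)`, which the signature does not state, and the prompt wording is invented for the example rather than taken from the project.

```rust
/// Hypothetical body for the prompt builder.
/// Assumption: the tuple is (system prompt, user prompt).
pub fn build_article_classify_prompt(
    title: &str,
    body_snippet: &str,
    categories: &[String], // includes "Autre"
) -> (String, String) {
    // The system prompt lists the allowed categories and the expected keys.
    // The wording here is illustrative only.
    let system = format!(
        "You classify news articles. Choose exactly one category from: {}. \
         Return JSON with the keys title, summary and category. \
         The summary should be 4-5 lines.",
        categories.join(", ")
    );
    // The user prompt carries the scraped article content.
    let user = format!("Title: {}\n\nBody:\n{}", title, body_snippet);
    (system, user)
}
```

Keeping the category list in the system prompt and the article text in the user prompt means the per-article call only varies in its second element, which may make logged calls easier to compare.
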
### New schema: `build_article_classify_schema`

```json
{
  "type": "object",
  "properties": {
    "title": { "type": "string" },
    "summary": { "type": "string" },
    "category": { "type": "string" }
  },
  "required": ["title", "summary", "category"],
  "additionalProperties": false
}
```

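
Because the schema types `category` as a free string rather than an enum, the value returned by the LLM may not match the configured list exactly. The sketch below shows one way to normalize it. The struct name `ArticleClassification`, the helper `normalize_category`, the case-insensitive match, and the fallback to "Autre" are all assumptions made for illustration, not behavior described in this design.

```rust
/// Hypothetical Rust mirror of the JSON schema above.
#[derive(Debug, Clone, PartialEq)]
pub struct ArticleClassification {
    pub title: String,
    pub summary: String,
    pub category: String,
}

/// Map the category returned by the LLM onto the allowed list.
/// Assumption: unknown values fall back to "Autre".
pub fn normalize_category(
    mut item: ArticleClassification,
    categories: &[String],
) -> ArticleClassification {
    let returned = item.category.trim();
    // Case-insensitive comparison is a choice of this sketch.
    let matched = categories
        .iter()
        .find(|c| c.eq_ignore_ascii_case(returned));
    item.category = match matched {
        Some(c) => c.clone(),
        None => "Autre".to_string(),
    };
    item
}
```

An alternative would be to emit the category list as a JSON Schema `enum` so the provider enforces it, if the provider's structured-output mode supports that.
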
### New DB query: `get_last_source_url`

```rust
pub async fn get_last_source_url(pool: &PgPool, user_id: Uuid) -> Result<Option<String>, AppError>
```

Returns the `source_url` of the most recent `used` entry in `article_history`. Source rotation starts from the source that follows it.

### Source rotation

Rotate the user's source list so that the source following the last-used source comes first.

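
The rotation described above can be sketched as a pure function over the source list. The name `rotate_sources` is hypothetical, and leaving the order unchanged when there is no last-used source, or when it is no longer in the list, is an assumption of this sketch.

```rust
/// Rotate `sources` so that the entry after `last_used` comes first.
/// Assumption: if `last_used` is `None` or not in the list, the order is kept.
pub fn rotate_sources(sources: &[String], last_used: Option<&str>) -> Vec<String> {
    let mut rotated = sources.to_vec();
    if let Some(last) = last_used {
        if let Some(pos) = sources.iter().position(|s| s.as_str() == last) {
            // Start right after the last-used source, wrapping to the front
            // when the last-used source was the final entry.
            rotated.rotate_left((pos + 1) % sources.len());
        }
    }
    rotated
}
```

For example, with sources `[a, b, c]` and a last-used source of `a`, the rotated order is `[b, c, a]`. With a last-used source of `c`, the order stays `[a, b, c]`.
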
## Rewritten `run_generation_inner` Flow

```
Initialization:
  Load settings, sources, provider, models, rate limiter
  Clean up article history + LLM logs
  Build classification categories (user categories + "Autre")
  Initialize: article_scraped, source_counts, url_source, filled_counts, seen_urls

Phase 1 (if sources not empty):
  1a. Rotate sources (rolling window via get_last_source_url)
      For each source:
        Extract article links (LLM or heuristic, max 10)
        Deduplicate + filter against article history
        Track url → source mapping

  1b. For each candidate URL:
        Check source_counts → skip if exceeded max_articles_per_source (trace: filtered_diversity)
        Scrape article (heuristic only: title, date, body, soft-404)
        Skip if empty/failed (trace: filtered_empty)
        LLM call: classify + summarize → {title, summary, category} (logged)
        If category full → assign to "Autre"
        Add to article_scraped + update filled_counts + source_counts
        If total >= (num_categories_including_autre) × max_items_per_category → break

Phase 2 (if any user category under-filled):
  2a. Compute gaps per category
  2b. LLM search call → {category_0: [{title, url, summary}], ...} (logged)
  2c. Filter: homepage, cross-phase dedup, url dedup, source limit, history (all traced)
      Scrape for validation (filter empty, trace drops)
      Merge into article_scraped

Save:
  Sanitize null bytes
  Build sections from article_scraped (group by category key → NewsSection)
  Save synthesis with job_id
  Record used articles in article_history
```

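
The bookkeeping in steps 1b and 2a (overflow into "Autre", the stop condition, and the per-category gaps) can be sketched with plain counters. The function names below are hypothetical. Dropping an article when both its category and "Autre" are full is an assumption, since the flow does not say what happens in that case.

```rust
use std::collections::HashMap;

const AUTRE: &str = "Autre";

/// Step 1b: pick the category an article lands in and update the counters.
/// Returns `None` when both the requested category and "Autre" are full
/// (dropping the article then is an assumption of this sketch).
pub fn assign_category(
    requested: &str,
    filled: &mut HashMap<String, usize>,
    max_items_per_category: usize,
) -> Option<String> {
    for candidate in [requested, AUTRE] {
        let count = filled.entry(candidate.to_string()).or_insert(0);
        if *count < max_items_per_category {
            *count += 1;
            return Some(candidate.to_string());
        }
    }
    None
}

/// Step 1b stop condition: total >= categories (including "Autre") × max.
pub fn all_full(
    filled: &HashMap<String, usize>,
    num_categories_including_autre: usize,
    max_items_per_category: usize,
) -> bool {
    let total: usize = filled.values().sum();
    total >= num_categories_including_autre * max_items_per_category
}

/// Step 2a: missing item count for each under-filled user category.
pub fn compute_gaps(
    user_categories: &[String],
    filled: &HashMap<String, usize>,
    max_items_per_category: usize,
) -> Vec<(String, usize)> {
    user_categories
        .iter()
        .filter_map(|c| {
            let have = filled.get(c).copied().unwrap_or(0);
            (have < max_items_per_category)
                .then(|| (c.clone(), max_items_per_category - have))
        })
        .collect()
}
```

Phase 2 would run only when `compute_gaps` returns a non-empty list, which corresponds to the "any user category under-filled" condition in the flow.
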
## ScrapedContent Simplification

Remove the `head_html: String` field, which existed only for LLM extraction. Keep `url: String` (the redirect-resolved URL).

## Files to Modify

**Migration:**
- **Create:** migration to drop `source_diversity_window` and `use_llm_for_article_extraction`

**Backend (rewrite):**
- **Rewrite:** `backend/src/services/synthesis.rs` (new `run_generation_inner`; remove all batch/rewrite/fill-up code)
- **Simplify:** `backend/src/services/scraper.rs` (remove `head_html` from `ScrapedContent`)
- **Modify:** `backend/src/services/prompts.rs` (remove old prompts, add `build_article_classify_prompt`)
- **Modify:** `backend/src/services/llm/schema.rs` (remove old schemas, add `build_article_classify_schema`)
- **Modify:** `backend/src/models/settings.rs` (remove the 2 dropped settings fields from all structs + Default + validation)
- **Modify:** `backend/src/db/settings.rs` (remove the 2 dropped settings fields from queries)
- **Modify:** `backend/src/db/article_history.rs` (add `get_last_source_url`)
- **Modify:** `backend/src/models/synthesis.rs` (remove `source_url` from `ScrapedNewsItem` if it is no longer needed; otherwise keep it for tracing)

**Frontend:**
- **Modify:** `frontend/src/types.ts` (remove the 2 dropped settings fields)
- **Modify:** `frontend/src/pages/Settings.tsx` (remove diversity window input + LLM extraction checkbox)
- **Modify:** `frontend/src/i18n/fr.ts` (remove the corresponding labels)
- **Modify:** `e2e/tests/generation-live.spec.ts` (update settings payload)

## What Does NOT Change

- `source_scraper.rs`: link extraction (both heuristic and LLM) stays as-is
- `article_history` table + tracing: stays, used for dedup + provenance
- `llm_call_log` table: stays, logging per LLM call
- `use_llm_for_source_links` setting: stays
- `max_articles_per_source` setting: stays
- `article_history_days` setting: stays
- LLM provider trait (`call_llm`): stays
- Frontend: ArticleHistory page, LlmLogs page, provenance section all stay