# Design: Windowed Source Extraction Pipeline

**Date**: 2026-03-26

**Scope**: Restructure Phase 1 to extract links in waves, stopping early when the synthesis is full

---

## Context

Currently Phase 1 extracts links from ALL personalized sources in parallel, collects them into one big list, shuffles, then batch-processes (scrape+classify). If the user has 8 sources but only needs 20 articles, all 8 sources are extracted upfront, wasting time and LLM calls (when `use_llm_for_source_links` is enabled).

## Design

### New setting: `source_extraction_window`

`source_extraction_window INTEGER NOT NULL DEFAULT 3` in the `settings` table. Range: 1-10. Controls how many sources are extracted per wave.

### Pipeline flow

```
sources = rotate_sources(all_sources)  // existing rotation logic
waves = chunk sources into groups of source_extraction_window

For each wave:
  1. Extract links from all sources in this wave IN PARALLEL (JoinSet)
  2. Collect all links → deduplicate (seen_urls) → filter against article history
  3. Shuffle
  4. Batch scrape+classify (existing batch loop with batch_size)
  5. Update filled_counts, source_counts
  6. Check if max_total reached → if full, STOP (skip remaining waves)
  7. Flush pending_traces
```

### Key behaviors

- Sources processed in **rotation order** (existing rolling window logic)
- `filled_counts`, `seen_urls`, `source_counts` carry across waves (accumulate)
- `max_articles_per_source` cap still applies per source within each wave
- `batch_size` still controls parallelism for scrape+classify within each wave
- Progress events per wave: "Extraction des sources (vague 1/3)..." → "Traitement des articles..."

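
The wave loop and its early stop can be sketched in plain Rust as follows. This is a simplified, synchronous illustration and not the actual `synthesis.rs` code: `Source`, `RunSummary`, and `run_windowed` are hypothetical names, each source carries its links in memory instead of being extracted, and the parallel extraction (JoinSet), history filter, shuffle, scrape+classify, and trace flushing are left out. Every deduplicated link is treated as an accepted article, subject to the per-source cap.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for a personalized source (not the real model).
#[derive(Debug, Clone)]
pub struct Source {
    pub name: String,
    /// Links that extraction would return for this source.
    pub links: Vec<String>,
}

/// Hypothetical summary of one windowed run.
#[derive(Debug, Default)]
pub struct RunSummary {
    pub accepted: Vec<String>,
    pub waves_run: usize,
    pub sources_extracted: usize,
}

pub fn run_windowed(
    sources: &[Source],
    window: usize,
    max_total: usize,
    max_articles_per_source: usize,
) -> RunSummary {
    let mut summary = RunSummary::default();
    // State that carries across waves.
    let mut seen_urls: HashSet<String> = HashSet::new();
    let mut source_counts: HashMap<String, usize> = HashMap::new();

    // `chunks` panics on a size of 0, so clamp the window to at least 1.
    for wave in sources.chunks(window.max(1)) {
        summary.waves_run += 1;

        // Steps 1-2: "extract" links for this wave only, deduplicating on URL.
        let mut candidates: Vec<(&str, &str)> = Vec::new();
        for source in wave {
            summary.sources_extracted += 1;
            for link in &source.links {
                if seen_urls.insert(link.clone()) {
                    candidates.push((source.name.as_str(), link.as_str()));
                }
            }
        }

        // Steps 4-5: accept candidates, honoring the per-source cap.
        for (name, link) in candidates {
            if summary.accepted.len() >= max_total {
                break;
            }
            let count = source_counts.entry(name.to_string()).or_insert(0);
            if *count >= max_articles_per_source {
                continue;
            }
            *count += 1;
            summary.accepted.push(link.to_string());
        }

        // Step 6: once full, skip every remaining wave.
        if summary.accepted.len() >= max_total {
            break;
        }
    }
    summary
}
```

With 8 sources, a window of 3, `max_total` of 20, and a cap of 5 articles per source, this sketch runs two waves and never touches the last two sources.
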
### What's saved

With `source_extraction_window=3` and 8 sources:

- Wave 1 (sources 1-3): 15-30 links → scrape+classify → maybe fills 15/20 articles
- Wave 2 (sources 4-6): 15-30 links → scrape+classify → fills remaining 5 → STOP
- Waves 3+ (sources 7-8): completely skipped, with no link extraction, no scraping, no LLM calls

---

## Files to modify

- **Create:** `backend/migrations/20260326000025_add_source_extraction_window.sql`
- **Modify:** `backend/src/models/settings.rs`: add `source_extraction_window` to structs, validation (1-10), default 3
- **Modify:** `backend/src/db/settings.rs`: add to queries
- **Modify:** `backend/src/services/synthesis.rs`: restructure Phase 1 into wave loop
- **Modify:** `frontend/src/types.ts`: add field
- **Modify:** `frontend/src/pages/Settings.tsx`: add number input
- **Modify:** `frontend/src/i18n/fr.ts`: labels
- **Modify:** `CLAUDE.md`: migration count
- **Modify:** test fixtures (prompts.rs, api_syntheses_test.rs, pipeline_test.rs, e2e generation-live)
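
The validation and default listed for `backend/src/models/settings.rs` might look roughly like the sketch below. The constant and function names, the `i64` field type, and the error message are assumptions made for illustration; only the 1-10 range, the default of 3, and the progress label format come from this design.

```rust
/// Default from the design: 3 sources per wave.
pub const DEFAULT_SOURCE_EXTRACTION_WINDOW: i64 = 3;

/// Accepts values in the documented 1-10 range (hypothetical helper).
pub fn validate_source_extraction_window(value: i64) -> Result<i64, String> {
    if (1..=10).contains(&value) {
        Ok(value)
    } else {
        Err(format!(
            "source_extraction_window must be between 1 and 10, got {value}"
        ))
    }
}

/// Number of waves needed for `source_count` sources (ceiling division).
/// The window is clamped to at least 1 to avoid dividing by zero.
pub fn wave_count(source_count: usize, window: usize) -> usize {
    let window = window.max(1);
    (source_count + window - 1) / window
}

/// Per-wave progress label, in the format quoted in the design.
pub fn wave_progress_label(wave_index: usize, total_waves: usize) -> String {
    format!("Extraction des sources (vague {wave_index}/{total_waves})...")
}
```

For 8 sources and the default window, `wave_count` gives 3, which matches the "vague 1/3" progress label in the design.
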