Design: Windowed Source Extraction Pipeline
Date: 2026-03-26
Scope: Restructure Phase 1 to extract links in waves, stopping early when the synthesis is full
Context
Currently Phase 1 extracts links from ALL personalized sources in parallel, collects them into one big list, shuffles, then batch-processes (scrape+classify). If the user has 8 sources but only needs 20 articles, all 8 sources are extracted upfront — wasting time and LLM calls (when use_llm_for_source_links is enabled).
Design
New setting: source_extraction_window
source_extraction_window INTEGER NOT NULL DEFAULT 3 in the settings table. Range: 1-10. Controls how many sources are extracted per wave.
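The range check and default described above could look roughly like the following sketch. The constant and function names are hypothetical and chosen only for illustration; the actual validation in backend/src/models/settings.rs may be structured differently.

```rust
/// Default number of sources extracted per wave, as stated in this design.
pub const DEFAULT_SOURCE_EXTRACTION_WINDOW: i32 = 3;

/// Hypothetical validator (name and error type are assumptions, not the
/// project's real API). Accepts values in the documented range 1-10.
pub fn validate_source_extraction_window(value: i32) -> Result<i32, String> {
    if (1..=10).contains(&value) {
        Ok(value)
    } else {
        Err(format!(
            "source_extraction_window must be between 1 and 10, got {value}"
        ))
    }
}
```

Returning the validated value keeps the call site short, for example when filling in a settings struct from a request payload.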
Pipeline flow
sources = rotate_sources(all_sources) // existing rotation logic
waves = chunk sources into groups of source_extraction_window
For each wave:
1. Extract links from all sources in this wave IN PARALLEL (JoinSet)
2. Collect all links → deduplicate (seen_urls) → filter against article history
3. Shuffle
4. Batch scrape+classify (existing batch loop with batch_size)
5. Update filled_counts, source_counts
6. Check if max_total reached → if full, STOP (skip remaining waves)
7. Flush pending_traces
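The wave loop above can be sketched in plain Rust as follows. This is a simplified, synchronous illustration under several assumptions: run_waves, RunSummary, and the extract and keep closures are hypothetical stand-ins (for link extraction and scrape+classify respectively), extraction runs sequentially here rather than in a JoinSet, and the shuffle, article-history filter, per-source caps, and trace flushing are omitted.

```rust
use std::collections::HashSet;

/// Illustrative result type (hypothetical, not part of the real pipeline).
#[derive(Debug, PartialEq)]
pub struct RunSummary {
    pub waves_run: usize,
    pub accepted: Vec<String>,
}

/// Processes `sources` in waves of `window`, stopping once `max_total`
/// articles are accepted. `extract` stands in for link extraction and
/// `keep` for scrape+classify; both are assumptions for this sketch.
pub fn run_waves<E, K>(
    sources: &[&str],
    window: usize,
    max_total: usize,
    mut extract: E,
    mut keep: K,
) -> RunSummary
where
    E: FnMut(&str) -> Vec<String>,
    K: FnMut(&str) -> bool,
{
    // State that carries across waves.
    let mut seen_urls: HashSet<String> = HashSet::new();
    let mut accepted: Vec<String> = Vec::new();
    let mut waves_run = 0;

    // `max(1)` guards against a zero window, which `chunks` rejects.
    for wave in sources.chunks(window.max(1)) {
        // Step 6: once the synthesis is full, skip all remaining waves.
        if accepted.len() >= max_total {
            break;
        }
        waves_run += 1;

        // Steps 1-2: extract links for this wave and deduplicate.
        let mut links: Vec<String> = Vec::new();
        for &source in wave {
            for url in extract(source) {
                if seen_urls.insert(url.clone()) {
                    links.push(url);
                }
            }
        }

        // Steps 4-5: process links until the target is reached.
        for url in links {
            if accepted.len() >= max_total {
                break;
            }
            if keep(&url) {
                accepted.push(url);
            }
        }
    }

    RunSummary { waves_run, accepted }
}
```

With 8 sources that each yield 5 unique links, a window of 3, and max_total of 20, this sketch runs two waves and never calls extract for the last two sources.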
Key behaviors
- Sources processed in rotation order (existing rolling window logic)
- filled_counts, seen_urls, source_counts carry across waves (accumulate)
- max_articles_per_source cap still applies per source within each wave
- batch_size still controls parallelism for scrape+classify within each wave
- Progress events per wave: "Extraction des sources (vague 1/3)..." → "Traitement des articles..."
What's saved
With source_extraction_window=3 and 8 sources:
- Wave 1 (sources 1-3): 15-30 links → scrape+classify → maybe fills 15/20 articles
- Wave 2 (sources 4-6): 15-30 links → scrape+classify → fills remaining 5 → STOP
- Wave 3 (sources 7-8): skipped entirely, so no link extraction, no scraping, no LLM calls
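The arithmetic behind the 8-source example can be checked with a small sketch. Both helper names are hypothetical and exist only to illustrate the counting; they are not part of the pipeline.

```rust
/// Number of waves needed to cover `source_count` sources
/// (ceiling division; a zero window is treated as 1).
pub fn wave_count(source_count: usize, window: usize) -> usize {
    let w = window.max(1);
    (source_count + w - 1) / w
}

/// Number of sources never extracted when the loop stops after `waves_run` waves.
pub fn sources_skipped(source_count: usize, window: usize, waves_run: usize) -> usize {
    source_count.saturating_sub(waves_run * window.max(1))
}
```

For 8 sources and a window of 3, wave_count gives 3 waves, and stopping after 2 waves leaves 2 sources untouched, matching the breakdown above.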
Files to modify
- Create: backend/migrations/20260326000025_add_source_extraction_window.sql
- Modify: backend/src/models/settings.rs (add source_extraction_window to structs, validation 1-10, default 3)
- Modify: backend/src/db/settings.rs (add to queries)
- Modify: backend/src/services/synthesis.rs (restructure Phase 1 into wave loop)
- Modify: frontend/src/types.ts (add field)
- Modify: frontend/src/pages/Settings.tsx (add number input)
- Modify: frontend/src/i18n/fr.ts (labels)
- Modify: CLAUDE.md (migration count)
- Modify: test fixtures (prompts.rs, api_syntheses_test.rs, pipeline_test.rs, e2e generation-live)