From 5426b342efdc3f692af4b2355e49147f5bd41752 Mon Sep 17 00:00:00 2001
From: oabrivard
Date: Thu, 26 Mar 2026 16:34:16 +0100
Subject: [PATCH] docs: add spec for windowed source extraction pipeline

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 ...03-26-windowed-source-extraction-design.md | 61 +++++++++++++++++++
 1 file changed, 61 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-03-26-windowed-source-extraction-design.md

diff --git a/docs/superpowers/specs/2026-03-26-windowed-source-extraction-design.md b/docs/superpowers/specs/2026-03-26-windowed-source-extraction-design.md
new file mode 100644
index 0000000..a4c872f
--- /dev/null
+++ b/docs/superpowers/specs/2026-03-26-windowed-source-extraction-design.md
@@ -0,0 +1,61 @@
+# Design: Windowed Source Extraction Pipeline
+
+**Date**: 2026-03-26
+**Scope**: Restructure Phase 1 to extract links in waves, stopping early when the synthesis is full
+
+---
+
+## Context
+
+Currently Phase 1 extracts links from ALL personalized sources in parallel, collects them into one big list, shuffles, then batch-processes (scrape+classify). If the user has 8 sources but only needs 20 articles, all 8 sources are extracted upfront — wasting time and LLM calls (when `use_llm_for_source_links` is enabled).
+
+## Design
+
+### New setting: `source_extraction_window`
+
+`source_extraction_window INTEGER NOT NULL DEFAULT 3` in the `settings` table. Range: 1-10. Controls how many sources are extracted per wave.
+
+### Pipeline flow
+
+```
+sources = rotate_sources(all_sources)  // existing rotation logic
+waves = chunk sources into groups of source_extraction_window
+
+For each wave:
+  1. Extract links from all sources in this wave IN PARALLEL (JoinSet)
+  2. Collect all links → deduplicate (seen_urls) → filter against article history
+  3. Shuffle
+  4. Batch scrape+classify (existing batch loop with batch_size)
+  5. Update filled_counts, source_counts
+  6. Check if max_total reached → if full, STOP (skip remaining waves)
+  7. Flush pending_traces
+```
+
+### Key behaviors
+
+- Sources processed in **rotation order** (existing rolling window logic)
+- `filled_counts`, `seen_urls`, `source_counts` carry across waves (accumulate)
+- `max_articles_per_source` cap still applies per source within each wave
+- `batch_size` still controls parallelism for scrape+classify within each wave
+- Progress events per wave: "Extraction des sources (vague 1/3)..." → "Traitement des articles..."
+
+### What's saved
+
+With `source_extraction_window=3` and 8 sources:
+- Wave 1 (sources 1-3): 15-30 links → scrape+classify → maybe fills 15/20 articles
+- Wave 2 (sources 4-6): 15-30 links → scrape+classify → fills remaining 5 → STOP
+- Waves 3+ (sources 7-8): completely skipped — no link extraction, no scraping, no LLM calls
+
+---
+
+## Files to modify
+
+- **Create:** `backend/migrations/20260326000025_add_source_extraction_window.sql`
+- **Modify:** `backend/src/models/settings.rs` — add `source_extraction_window` to structs, validation (1-10), default 3
+- **Modify:** `backend/src/db/settings.rs` — add to queries
+- **Modify:** `backend/src/services/synthesis.rs` — restructure Phase 1 into wave loop
+- **Modify:** `frontend/src/types.ts` — add field
+- **Modify:** `frontend/src/pages/Settings.tsx` — add number input
+- **Modify:** `frontend/src/i18n/fr.ts` — labels
+- **Modify:** `CLAUDE.md` — migration count
+- **Modify:** test fixtures (prompts.rs, api_syntheses_test.rs, pipeline_test.rs, e2e generation-live)
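
The wave loop and its early stop, as described in the spec's "Pipeline flow" and "What's saved" sections, could be sketched roughly as follows. This is a simplified illustration and not the code in `backend/src/services/synthesis.rs`: `run_waves`, `WaveOutcome`, and the `extract` closure are hypothetical names, extraction runs sequentially instead of through a `JoinSet`, and shuffling, the article-history filter, scrape+classify, and the per-source cap are left out, so every new link is assumed to count as one accepted article.

```rust
use std::collections::HashSet;

/// Hypothetical result type: accepted article URLs and the number of waves extracted.
#[derive(Debug)]
struct WaveOutcome {
    accepted: Vec<String>,
    waves_run: usize,
}

/// Processes `sources` in waves of `window` and stops once `max_total` links are accepted.
/// `extract` is a hypothetical stand-in for per-source link extraction.
fn run_waves<F>(sources: &[&str], window: usize, max_total: usize, extract: F) -> WaveOutcome
where
    F: Fn(&str) -> Vec<String>,
{
    let mut seen_urls: HashSet<String> = HashSet::new(); // carried across waves
    let mut accepted: Vec<String> = Vec::new();
    let mut waves_run = 0;

    // `chunks` panics on a zero size, so clamp to the spec's minimum of 1.
    for wave in sources.chunks(window.max(1)) {
        if accepted.len() >= max_total {
            break; // synthesis is full: remaining waves are never extracted
        }
        waves_run += 1;

        // Step 1, sequential here (the spec runs it in parallel).
        let links: Vec<String> = wave.iter().flat_map(|&s| extract(s)).collect();

        // Step 2: deduplicate against everything seen so far, then accept up to the cap.
        for url in links {
            if accepted.len() >= max_total {
                break;
            }
            if seen_urls.insert(url.clone()) {
                accepted.push(url);
            }
        }
    }

    WaveOutcome { accepted, waves_run }
}

fn main() {
    let sources = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"];
    // Each fake source yields 5 unique links.
    let outcome = run_waves(&sources, 3, 20, |s| {
        (0..5).map(|i| format!("https://{}.example/{}", s, i)).collect()
    });
    println!(
        "accepted {} articles in {} waves",
        outcome.accepted.len(),
        outcome.waves_run
    );
}
```

With 8 sources, a window of 3, and a target of 20, this sketch extracts only the first two waves, matching the scenario in "What's saved": sources 7-8 are never touched.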