From 5426b342efdc3f692af4b2355e49147f5bd41752 Mon Sep 17 00:00:00 2001
From: oabrivard <olivier@abrivard.fr>
Date: Thu, 26 Mar 2026 16:34:16 +0100
Subject: [PATCH] docs: add spec for windowed source extraction pipeline

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 ...03-26-windowed-source-extraction-design.md | 61 +++++++++++++++++++
 1 file changed, 61 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-03-26-windowed-source-extraction-design.md

diff --git a/docs/superpowers/specs/2026-03-26-windowed-source-extraction-design.md b/docs/superpowers/specs/2026-03-26-windowed-source-extraction-design.md
new file mode 100644
index 0000000..a4c872f
--- /dev/null
+++ b/docs/superpowers/specs/2026-03-26-windowed-source-extraction-design.md
@@ -0,0 +1,61 @@
+# Design: Windowed Source Extraction Pipeline
+
+**Date**: 2026-03-26
+**Scope**: Restructure Phase 1 to extract links in waves, stopping early when the synthesis is full
+
+---
+
+## Context
+
+Currently, Phase 1 extracts links from ALL personalized sources in parallel, collects them into one list, shuffles it, then processes it in batches (scrape+classify). If the user has 8 sources but only needs 20 articles, all 8 sources are still extracted upfront, which wastes time and LLM calls (when `use_llm_for_source_links` is enabled).
+
+## Design
+
+### New setting: `source_extraction_window`
+
+`source_extraction_window INTEGER NOT NULL DEFAULT 3` in the `settings` table. Range: 1-10. Controls how many sources are extracted per wave.
+
+### Pipeline flow
+
+```
+sources = rotate_sources(all_sources)   // existing rotation logic
+waves = chunk sources into groups of source_extraction_window
+
+For each wave:
+  1. Extract links from all sources in this wave IN PARALLEL (JoinSet)
+  2. Collect all links → deduplicate (seen_urls) → filter against article history
+  3. Shuffle
+  4. Batch scrape+classify (existing batch loop with batch_size)
+  5. Update filled_counts, source_counts
+  6. Flush pending_traces
+  7. Check if max_total reached → if full, STOP (skip remaining waves)
+```
+
+### Key behaviors
+
+- Sources processed in **rotation order** (existing rolling window logic)
+- `filled_counts`, `seen_urls`, `source_counts` carry across waves (accumulate)
+- `max_articles_per_source` cap still applies per source within each wave
+- `batch_size` still controls parallelism for scrape+classify within each wave
+- Progress events per wave: "Extraction des sources (vague 1/3)..." → "Traitement des articles..."
+
+### What's saved
+
+With `source_extraction_window=3`, 8 sources, and a target of 20 articles:
+- Wave 1 (sources 1-3): 15-30 links → scrape+classify → maybe fills 15/20 articles
+- Wave 2 (sources 4-6): 15-30 links → scrape+classify → fills remaining 5 → STOP
+- Wave 3 (sources 7-8): skipped entirely, with no link extraction, no scraping, no LLM calls
+
+---
+
+## Files to modify
+
+- **Create:** `backend/migrations/20260326000025_add_source_extraction_window.sql`
+- **Modify:** `backend/src/models/settings.rs` — add `source_extraction_window` to structs, validation (1-10), default 3
+- **Modify:** `backend/src/db/settings.rs` — add to queries
+- **Modify:** `backend/src/services/synthesis.rs` — restructure Phase 1 into wave loop
+- **Modify:** `frontend/src/types.ts` — add field
+- **Modify:** `frontend/src/pages/Settings.tsx` — add number input
+- **Modify:** `frontend/src/i18n/fr.ts` — labels
+- **Modify:** `CLAUDE.md` — migration count
+- **Modify:** test fixtures (`prompts.rs`, `api_syntheses_test.rs`, `pipeline_test.rs`, e2e generation-live)