ai_synth/docs/superpowers/specs/2026-03-26-windowed-source-...

# Design: Windowed Source Extraction Pipeline

**Date**: 2026-03-26
**Scope**: Restructure Phase 1 to extract links in waves, stopping early when the synthesis is full

---

## Context

Phase 1 currently extracts links from all personalized sources in parallel, collects them into a single list, shuffles it, then batch-processes the links (scrape+classify). If the user has 8 sources but only needs 20 articles, all 8 sources are still extracted upfront. This wastes time and, when `use_llm_for_source_links` is enabled, LLM calls.

## Design

### New setting: `source_extraction_window`

`source_extraction_window INTEGER NOT NULL DEFAULT 3` in the `settings` table. Range: 1-10. Controls how many sources are extracted per wave.
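
As a minimal sketch of the validation implied above (range 1-10, default 3), something like the following could live alongside the settings structs. The constant and function names are hypothetical, and the `i64` type is an assumption about how the `INTEGER` column is mapped.

```rust
// Hypothetical names; only the range (1-10) and default (3) come from the spec.
pub const DEFAULT_SOURCE_EXTRACTION_WINDOW: i64 = 3;
pub const MIN_SOURCE_EXTRACTION_WINDOW: i64 = 1;
pub const MAX_SOURCE_EXTRACTION_WINDOW: i64 = 10;

/// Returns the value unchanged when it lies inside the allowed range,
/// otherwise a human-readable error message.
pub fn validate_source_extraction_window(value: i64) -> Result<i64, String> {
    if (MIN_SOURCE_EXTRACTION_WINDOW..=MAX_SOURCE_EXTRACTION_WINDOW).contains(&value) {
        Ok(value)
    } else {
        Err(format!(
            "source_extraction_window must be between {} and {}, got {}",
            MIN_SOURCE_EXTRACTION_WINDOW, MAX_SOURCE_EXTRACTION_WINDOW, value
        ))
    }
}
```

Rejecting out-of-range values at the model layer keeps the wave loop free of defensive checks such as a zero-sized window.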

### Pipeline flow

```
sources = rotate_sources(all_sources)   // existing rotation logic
waves = chunk sources into groups of source_extraction_window

For each wave:
  1. Extract links from all sources in this wave IN PARALLEL (JoinSet)
  2. Collect all links → deduplicate (seen_urls) → filter against article history
  3. Shuffle
  4. Batch scrape+classify (existing batch loop with batch_size)
  5. Update filled_counts, source_counts
  6. Flush pending_traces (before the stop check, so the final wave's traces are not lost)
  7. Check if max_total reached → if full, STOP (skip remaining waves)
```
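
The flow above can be sketched in plain Rust. This is an illustration under stated assumptions, not the actual `synthesis.rs` code: `run_waves` and `RunSummary` are hypothetical names, the `extract` and `accept` closures stand in for link extraction and scrape+classify, extraction runs sequentially rather than through a `JoinSet`, and the shuffle, article-history filter, per-source cap, and trace flushing are left out to keep the example deterministic.

```rust
use std::collections::HashSet;

/// Outcome of a windowed run (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
pub struct RunSummary {
    pub waves_run: usize,
    pub sources_extracted: usize,
    pub articles_filled: usize,
}

/// `extract` stands in for link extraction from one source.
/// `accept` stands in for scrape+classify and returns true when a link yields an article.
pub fn run_waves<E, A>(
    sources: &[&str],
    window: usize,
    max_total: usize,
    mut extract: E,
    mut accept: A,
) -> RunSummary
where
    E: FnMut(&str) -> Vec<String>,
    A: FnMut(&str) -> bool,
{
    // State that carries across waves.
    let mut seen_urls: HashSet<String> = HashSet::new();
    let mut summary = RunSummary { waves_run: 0, sources_extracted: 0, articles_filled: 0 };

    // `chunks` panics on a size of 0, so clamp to the minimum window of 1.
    for wave in sources.chunks(window.max(1)) {
        summary.waves_run += 1;

        // Extract links for every source in this wave (sequential stand-in
        // for the parallel extraction described in the spec).
        let mut links: Vec<String> = Vec::new();
        for &source in wave {
            summary.sources_extracted += 1;
            links.extend(extract(source));
        }

        // Deduplicate against everything seen so far, including earlier waves.
        links.retain(|url| seen_urls.insert(url.clone()));

        // Process links until the synthesis is full.
        for url in &links {
            if summary.articles_filled >= max_total {
                break;
            }
            if accept(url.as_str()) {
                summary.articles_filled += 1;
            }
        }

        // Early stop: the remaining waves are never extracted.
        if summary.articles_filled >= max_total {
            break;
        }
    }
    summary
}
```

With 8 sources, a window of 3, 5 accepted links per source, and `max_total` of 20, this sketch stops after the second wave with 6 sources extracted; the last two sources are never touched.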

### Key behaviors

- Sources processed in **rotation order** (existing rolling window logic)
- `filled_counts`, `seen_urls`, `source_counts` carry across waves (accumulate)
- `max_articles_per_source` cap still applies per source within each wave
- `batch_size` still controls parallelism for scrape+classify within each wave
- Progress events per wave: "Extraction des sources (vague 1/3)..." ("Extracting sources (wave 1/3)...") → "Traitement des articles..." ("Processing articles...")
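
The `1/3` in the progress label needs the total wave count, which is the source count divided by the window, rounded up. A small sketch follows, with hypothetical helper names; how the real pipeline builds and emits its progress events is not specified here.

```rust
/// Number of waves needed to cover `source_count` sources (ceiling division).
/// A window of 0 is treated as 1 to avoid dividing by zero.
pub fn wave_count(source_count: usize, window: usize) -> usize {
    let window = window.max(1);
    (source_count + window - 1) / window
}

/// Builds the per-wave extraction label, with `wave` counted from 1.
pub fn extraction_progress_label(wave: usize, total_waves: usize) -> String {
    format!("Extraction des sources (vague {}/{})...", wave, total_waves)
}
```

For 8 sources and a window of 3, `wave_count` gives 3, so the first event would read "Extraction des sources (vague 1/3)...".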

### What's saved

With `source_extraction_window=3` and 8 sources:
- Wave 1 (sources 1-3): 15-30 links → scrape+classify → fills, for example, 15 of 20 articles
- Wave 2 (sources 4-6): 15-30 links → scrape+classify → fills the remaining 5 → STOP
- Wave 3 (sources 7-8): skipped entirely, so no link extraction, no scraping, and no LLM calls

---

## Files to modify

- **Create:** `backend/migrations/20260326000025_add_source_extraction_window.sql`
- **Modify:** `backend/src/models/settings.rs` — add `source_extraction_window` to structs, validation (1-10), default 3
- **Modify:** `backend/src/db/settings.rs` — add to queries
- **Modify:** `backend/src/services/synthesis.rs` — restructure Phase 1 into wave loop
- **Modify:** `frontend/src/types.ts` — add field
- **Modify:** `frontend/src/pages/Settings.tsx` — add number input
- **Modify:** `frontend/src/i18n/fr.ts` — labels
- **Modify:** `CLAUDE.md` — migration count
- **Modify:** test fixtures (`prompts.rs`, `api_syntheses_test.rs`, `pipeline_test.rs`, e2e generation-live)