# Design: Windowed Source Extraction Pipeline
**Date**: 2026-03-26
**Scope**: Restructure Phase 1 to extract links in waves, stopping early when the synthesis is full
---
## Context
Currently, Phase 1 extracts links from all personalized sources in parallel, collects them into a single list, shuffles it, then batch-processes the links (scrape + classify). If the user has 8 sources but only needs 20 articles, all 8 sources are still extracted upfront, which wastes time and, when `use_llm_for_source_links` is enabled, LLM calls.
## Design
### New setting: `source_extraction_window`
`source_extraction_window INTEGER NOT NULL DEFAULT 3` in the `settings` table. Range: 1-10. Controls how many sources are extracted per wave.
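The range check could look roughly like the sketch below. The constant names, the function name, and the `i64` input type are assumptions made for illustration; the actual structs in `backend/src/models/settings.rs` may validate differently.

```rust
/// Default and bounds taken from the spec; the constant names are hypothetical.
pub const SOURCE_EXTRACTION_WINDOW_DEFAULT: i64 = 3;
pub const SOURCE_EXTRACTION_WINDOW_MIN: i64 = 1;
pub const SOURCE_EXTRACTION_WINDOW_MAX: i64 = 10;

/// Accepts a raw setting value and returns it as a wave size,
/// or an error message when it falls outside 1-10.
pub fn validate_source_extraction_window(value: i64) -> Result<usize, String> {
    if (SOURCE_EXTRACTION_WINDOW_MIN..=SOURCE_EXTRACTION_WINDOW_MAX).contains(&value) {
        Ok(value as usize)
    } else {
        Err(format!(
            "source_extraction_window must be between {} and {}, got {}",
            SOURCE_EXTRACTION_WINDOW_MIN, SOURCE_EXTRACTION_WINDOW_MAX, value
        ))
    }
}
```

Rejecting out-of-range values at the model layer keeps the wave loop free of defensive checks for a zero-sized window.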
### Pipeline flow
```
sources = rotate_sources(all_sources)   // existing rotation logic
waves   = chunk sources into groups of source_extraction_window

For each wave:
  1. Extract links from all sources in this wave IN PARALLEL (JoinSet)
  2. Collect all links → deduplicate (seen_urls) → filter against article history
  3. Shuffle
  4. Batch scrape+classify (existing batch loop with batch_size)
  5. Update filled_counts, source_counts
  6. Check if max_total reached → if full, STOP (skip remaining waves)
  7. Flush pending_traces
```
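The control flow above can be sketched with the standard library alone. This is a simplified illustration, not the code in `synthesis.rs`: `run_waves`, `RunSummary`, and the two closures are hypothetical, extraction runs sequentially here (the design calls for a `JoinSet`), and the shuffle, history filter, batching, and trace flushing are left out.

```rust
use std::collections::HashSet;

/// Outcome of a windowed run (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
pub struct RunSummary {
    pub waves_run: usize,
    pub accepted: Vec<String>,
}

/// Processes `sources` in waves of `window`, stopping once `max_total`
/// articles have been accepted. `extract` stands in for link extraction
/// and `classify` for the scrape+classify step.
pub fn run_waves<E, C>(
    sources: &[&str],
    window: usize,
    max_total: usize,
    mut extract: E,
    mut classify: C,
) -> RunSummary
where
    E: FnMut(&str) -> Vec<String>,
    C: FnMut(&str) -> bool,
{
    let mut seen_urls: HashSet<String> = HashSet::new();
    let mut accepted: Vec<String> = Vec::new();
    let mut waves_run = 0;

    // `chunks` panics on a size of 0, so clamp to at least 1.
    for wave in sources.chunks(window.max(1)) {
        waves_run += 1;

        // Steps 1-2: extract links for this wave only, deduplicating
        // against every URL seen in earlier waves.
        let mut links: Vec<String> = Vec::new();
        for &source in wave {
            for url in extract(source) {
                if seen_urls.insert(url.clone()) {
                    links.push(url);
                }
            }
        }

        // Steps 4-5: classify links until the synthesis is full.
        for url in links {
            if accepted.len() >= max_total {
                break;
            }
            if classify(&url) {
                accepted.push(url);
            }
        }

        // Step 6: skip the remaining waves once the target is reached.
        if accepted.len() >= max_total {
            break;
        }
    }

    RunSummary { waves_run, accepted }
}
```

Because extraction happens inside the loop, a source in a later wave is never touched if an earlier wave fills the synthesis.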
### Key behaviors
- Sources processed in **rotation order** (existing rolling window logic)
- `filled_counts`, `seen_urls`, `source_counts` carry across waves (accumulate)
- `max_articles_per_source` cap still applies per source within each wave
- `batch_size` still controls parallelism for scrape+classify within each wave
- Progress events per wave: "Extraction des sources (vague 1/3)..." ("Extracting sources (wave 1/3)...") → "Traitement des articles..." ("Processing articles...")
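One way to picture the state that accumulates across waves is a single struct created before the loop and passed into each wave. The sketch below is an assumption-laden illustration: `WaveState` and `try_accept` are hypothetical names, and `filled_total` is a simplified stand-in for `filled_counts`, whose real shape is not described here.

```rust
use std::collections::{HashMap, HashSet};

/// State shared by all waves (hypothetical struct; field names echo the spec).
pub struct WaveState {
    pub seen_urls: HashSet<String>,
    pub source_counts: HashMap<String, usize>,
    /// Simplified stand-in for `filled_counts`.
    pub filled_total: usize,
    pub max_articles_per_source: usize,
    pub max_total: usize,
}

impl WaveState {
    pub fn new(max_articles_per_source: usize, max_total: usize) -> Self {
        WaveState {
            seen_urls: HashSet::new(),
            source_counts: HashMap::new(),
            filled_total: 0,
            max_articles_per_source,
            max_total,
        }
    }

    /// True once the synthesis needs no more articles.
    pub fn is_full(&self) -> bool {
        self.filled_total >= self.max_total
    }

    /// Records an article unless the synthesis is full, the source has hit
    /// its cap, or the URL was already seen in this or an earlier wave.
    pub fn try_accept(&mut self, source: &str, url: &str) -> bool {
        if self.is_full() {
            return false;
        }
        let used = self.source_counts.get(source).copied().unwrap_or(0);
        if used >= self.max_articles_per_source {
            return false;
        }
        if !self.seen_urls.insert(url.to_string()) {
            return false;
        }
        *self.source_counts.entry(source.to_string()).or_insert(0) += 1;
        self.filled_total += 1;
        true
    }
}
```

Keeping the counters outside the loop body is what lets a duplicate URL from wave 2 be rejected because wave 1 already saw it.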
### What's saved
With `source_extraction_window=3` and 8 sources:
- Wave 1 (sources 1-3): 15-30 links → scrape+classify → fills, for example, 15 of the 20 articles
- Wave 2 (sources 4-6): 15-30 links → scrape+classify → fills the remaining 5 → STOP
- Wave 3 (sources 7-8): skipped entirely, with no link extraction, no scraping, and no LLM calls
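The wave boundaries in this example follow directly from chunking 8 sources by 3. The helper names below are hypothetical; the sketch only shows the arithmetic and how the "vague i/n" progress label could be derived from it.

```rust
/// Returns the 1-based, inclusive source range covered by each wave.
pub fn wave_ranges(source_count: usize, window: usize) -> Vec<(usize, usize)> {
    // Guard against a zero window, which `step_by` would reject.
    let window = window.max(1);
    (0..source_count)
        .step_by(window)
        .map(|start| (start + 1, (start + window).min(source_count)))
        .collect()
}

/// Builds the extraction progress message for wave `wave` out of `total`.
pub fn extraction_progress_label(wave: usize, total: usize) -> String {
    format!("Extraction des sources (vague {}/{})...", wave, total)
}
```

With 8 sources and a window of 3, `wave_ranges` yields three ranges, the last of which is shorter because only two sources remain.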
---
## Files to modify
- **Create:** `backend/migrations/20260326000025_add_source_extraction_window.sql`
- **Modify:** `backend/src/models/settings.rs` — add `source_extraction_window` to structs, validation (1-10), default 3
- **Modify:** `backend/src/db/settings.rs` — add to queries
- **Modify:** `backend/src/services/synthesis.rs` — restructure Phase 1 into wave loop
- **Modify:** `frontend/src/types.ts` — add field
- **Modify:** `frontend/src/pages/Settings.tsx` — add number input
- **Modify:** `frontend/src/i18n/fr.ts` — labels
- **Modify:** `CLAUDE.md` — migration count
- **Modify:** test fixtures (`prompts.rs`, `api_syntheses_test.rs`, `pipeline_test.rs`, e2e `generation-live`)