Design: Windowed Source Extraction Pipeline

Date: 2026-03-26
Scope: Restructure Phase 1 to extract links in waves, stopping early when the synthesis is full


Context

Currently Phase 1 extracts links from ALL personalized sources in parallel, collects them into one big list, shuffles, then batch-processes (scrape+classify). If the user has 8 sources but only needs 20 articles, all 8 sources are extracted upfront — wasting time and LLM calls (when use_llm_for_source_links is enabled).

Design

New setting: source_extraction_window

source_extraction_window INTEGER NOT NULL DEFAULT 3 in the settings table. Range: 1-10. Controls how many sources are extracted per wave.
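The range check could look roughly like the following on the Rust side. This is a minimal sketch, not the actual code in backend/src/models/settings.rs: the constant names, the function name, and the error type are assumptions made for illustration. Only the bounds (1-10) and the default (3) come from this spec.

```rust
// Hypothetical names; only the values 1, 10 and 3 come from the spec.
pub const SOURCE_EXTRACTION_WINDOW_MIN: i32 = 1;
pub const SOURCE_EXTRACTION_WINDOW_MAX: i32 = 10;
pub const SOURCE_EXTRACTION_WINDOW_DEFAULT: i32 = 3;

/// Returns the value unchanged when it lies in 1..=10,
/// otherwise an error message describing the allowed range.
pub fn validate_source_extraction_window(value: i32) -> Result<i32, String> {
    if (SOURCE_EXTRACTION_WINDOW_MIN..=SOURCE_EXTRACTION_WINDOW_MAX).contains(&value) {
        Ok(value)
    } else {
        Err(format!(
            "source_extraction_window must be between {} and {}, got {}",
            SOURCE_EXTRACTION_WINDOW_MIN, SOURCE_EXTRACTION_WINDOW_MAX, value
        ))
    }
}
```

Rejecting out-of-range values (rather than clamping them) is one possible choice; the spec does not say which behavior the existing settings validation uses.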

Pipeline flow

sources = rotate_sources(all_sources)   // existing rotation logic
waves = chunk sources into groups of source_extraction_window

For each wave:
  1. Extract links from all sources in this wave IN PARALLEL (JoinSet)
  2. Collect all links → deduplicate (seen_urls) → filter against article history
  3. Shuffle
  4. Batch scrape+classify (existing batch loop with batch_size)
  5. Update filled_counts, source_counts
  6. Check if max_total reached → if full, STOP (skip remaining waves)
  7. Flush pending_traces
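
The control flow above can be sketched in plain Rust as follows. This is a simplified, sequential illustration under stated assumptions, not the real synthesis.rs code: `run_waves`, `WaveRun`, and the two closures are hypothetical, extraction runs one source at a time instead of in a JoinSet, and the shuffle, article-history filter, per-source caps, and trace flushing are left out.

```rust
use std::collections::HashSet;

/// Outcome of a run (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
pub struct WaveRun {
    /// Number of articles accepted, never more than `max_total`.
    pub filled: usize,
    /// Number of waves whose sources were actually extracted.
    pub waves_extracted: usize,
}

/// `extract` stands in for link extraction from one source;
/// `accept` stands in for scrape+classify of one link.
pub fn run_waves<E, A>(
    sources: &[&str],
    window: usize,
    max_total: usize,
    mut extract: E,
    mut accept: A,
) -> WaveRun
where
    E: FnMut(&str) -> Vec<String>,
    A: FnMut(&str) -> bool,
{
    // State that carries across waves.
    let mut seen_urls: HashSet<String> = HashSet::new();
    let mut filled = 0;
    let mut waves_extracted = 0;

    // `chunks` panics on a size of 0, so fall back to 1.
    for wave in sources.chunks(window.max(1)) {
        waves_extracted += 1;

        // Steps 1-2: extract every source of this wave, dropping URLs
        // already seen in this wave or in an earlier one.
        let mut links = Vec::new();
        for &source in wave {
            for url in extract(source) {
                if seen_urls.insert(url.clone()) {
                    links.push(url);
                }
            }
        }

        // Steps 4-5: process links until the synthesis is full.
        for url in &links {
            if filled >= max_total {
                break;
            }
            if accept(url) {
                filled += 1;
            }
        }

        // Step 6: stop early, skipping all remaining waves.
        if filled >= max_total {
            break;
        }
    }

    WaveRun { filled, waves_extracted }
}
```

With 8 sources that each yield 5 fresh links, a window of 3 and a `max_total` of 20, this sketch extracts only the first two waves (6 sources) before stopping.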

Key behaviors

  • Sources processed in rotation order (existing rolling window logic)
  • filled_counts, seen_urls, source_counts carry across waves (accumulate)
  • max_articles_per_source cap still applies per source within each wave
  • batch_size still controls parallelism for scrape+classify within each wave
  • Progress events per wave: "Extraction des sources (vague 1/3)..." → "Traitement des articles..."

What's saved

With source_extraction_window=3 and 8 sources:

  • Wave 1 (sources 1-3): 15-30 links → scrape+classify → maybe fills 15/20 articles
  • Wave 2 (sources 4-6): 15-30 links → scrape+classify → fills remaining 5 → STOP
  • Wave 3 (sources 7-8): completely skipped, so no link extraction, no scraping, no LLM calls
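
The arithmetic behind this example can be checked with a small helper. Both functions below are hypothetical and exist only to illustrate how sources are partitioned into waves and how many sources are never extracted when the run stops early.

```rust
/// Sizes of the waves obtained by chunking `num_sources` sources
/// into groups of `window` (a window of 0 is treated as 1).
pub fn wave_sizes(num_sources: usize, window: usize) -> Vec<usize> {
    let window = window.max(1);
    let mut sizes = Vec::new();
    let mut remaining = num_sources;
    while remaining > 0 {
        let size = remaining.min(window);
        sizes.push(size);
        remaining -= size;
    }
    sizes
}

/// Number of sources never extracted when the pipeline stops
/// after `waves_run` waves.
pub fn sources_skipped(num_sources: usize, window: usize, waves_run: usize) -> usize {
    wave_sizes(num_sources, window)
        .iter()
        .skip(waves_run)
        .sum()
}
```

For 8 sources and a window of 3, `wave_sizes` gives waves of 3, 3 and 2 sources, and stopping after wave 2 leaves 2 sources untouched.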

Files to modify

  • Create: backend/migrations/20260326000025_add_source_extraction_window.sql
  • Modify: backend/src/models/settings.rs — add source_extraction_window to structs, validation (1-10), default 3
  • Modify: backend/src/db/settings.rs — add to queries
  • Modify: backend/src/services/synthesis.rs — restructure Phase 1 into wave loop
  • Modify: frontend/src/types.ts — add field
  • Modify: frontend/src/pages/Settings.tsx — add number input
  • Modify: frontend/src/i18n/fr.ts — labels
  • Modify: CLAUDE.md — migration count
  • Modify: test fixtures (prompts.rs, api_syntheses_test.rs, pipeline_test.rs, e2e generation-live)