# Design: Windowed Source Extraction Pipeline

**Date**: 2026-03-26
**Scope**: Restructure Phase 1 to extract links in waves, stopping early when the synthesis is full

---

## Context

Currently, Phase 1 extracts links from ALL personalized sources in parallel, collects them into one big list, shuffles it, then batch-processes the links (scrape+classify). If the user has 8 sources but only needs 20 articles, all 8 sources are still extracted upfront, wasting time and, when `use_llm_for_source_links` is enabled, LLM calls.

## Design

### New setting: `source_extraction_window`

Add a column `source_extraction_window INTEGER NOT NULL DEFAULT 3` to the `settings` table. Valid range: 1-10. It controls how many sources are extracted per wave.
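As an illustration of the range check, here is a minimal Rust sketch. The bounds (1-10) and the default (3) come from the text above; the constant names and the `validate_source_extraction_window` function are hypothetical and are not taken from `backend/src/models/settings.rs`.

```rust
// Bounds and default as stated in the design; the names are illustrative only.
pub const SOURCE_EXTRACTION_WINDOW_MIN: i64 = 1;
pub const SOURCE_EXTRACTION_WINDOW_MAX: i64 = 10;
pub const SOURCE_EXTRACTION_WINDOW_DEFAULT: i64 = 3;

/// Hypothetical validator: returns the value unchanged when it lies in the
/// inclusive range 1-10, or a human-readable error message otherwise.
pub fn validate_source_extraction_window(value: i64) -> Result<i64, String> {
    if (SOURCE_EXTRACTION_WINDOW_MIN..=SOURCE_EXTRACTION_WINDOW_MAX).contains(&value) {
        Ok(value)
    } else {
        Err(format!(
            "source_extraction_window must be between {} and {}, got {}",
            SOURCE_EXTRACTION_WINDOW_MIN, SOURCE_EXTRACTION_WINDOW_MAX, value
        ))
    }
}
```

Under these assumptions, `validate_source_extraction_window(3)` is accepted, while `0` and `11` are rejected.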

### Pipeline flow

```
sources = rotate_sources(all_sources)  // existing rotation logic
waves = chunk sources into groups of source_extraction_window

For each wave:
  1. Extract links from all sources in this wave IN PARALLEL (JoinSet)
  2. Collect all links → deduplicate (seen_urls) → filter against article history
  3. Shuffle
  4. Batch scrape+classify (existing batch loop with batch_size)
  5. Update filled_counts, source_counts
  6. Check if max_total reached → if full, STOP (skip remaining waves)
  7. Flush pending_traces
```
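The flow above can be sketched in Rust roughly as follows. This is a simplified, standard-library-only illustration, not the code in `backend/src/services/synthesis.rs`: `run_waves`, `WaveRunSummary`, and the two closures are hypothetical stand-ins. Extraction runs sequentially here instead of in a `JoinSet`, and the shuffle, history filter, batching, and trace flushing are omitted so that only the wave chunking and the early stop remain visible.

```rust
use std::collections::HashSet;

/// Outcome of a run, used here only to make the early stop observable.
#[derive(Debug, PartialEq)]
pub struct WaveRunSummary {
    pub waves_run: usize,
    pub sources_extracted: usize,
    pub articles_kept: usize,
}

/// Hypothetical stand-in for Phase 1.
/// `extract_links` replaces per-source link extraction and
/// `accept` replaces scrape+classify (true = the article is kept).
pub fn run_waves<E, A>(
    sources: &[&str],
    window: usize,
    max_total: usize,
    mut extract_links: E,
    mut accept: A,
) -> WaveRunSummary
where
    E: FnMut(&str) -> Vec<String>,
    A: FnMut(&str) -> bool,
{
    // State that carries across waves.
    let mut seen_urls: HashSet<String> = HashSet::new();
    let mut summary = WaveRunSummary { waves_run: 0, sources_extracted: 0, articles_kept: 0 };

    // `chunks` panics on a chunk size of 0, so clamp the window to at least 1.
    for wave in sources.chunks(window.max(1)) {
        summary.waves_run += 1;

        // Step 1: extract links for every source in this wave
        // (sequential in this sketch; the design runs them in parallel).
        let mut links: Vec<String> = Vec::new();
        for &source in wave {
            summary.sources_extracted += 1;
            links.extend(extract_links(source));
        }

        // Step 2: drop URLs already seen in this or an earlier wave.
        links.retain(|url| seen_urls.insert(url.clone()));

        // Steps 4-5: process links until the synthesis is full.
        for url in &links {
            if summary.articles_kept >= max_total {
                break;
            }
            if accept(url.as_str()) {
                summary.articles_kept += 1;
            }
        }

        // Step 6: stop early and skip the remaining waves.
        if summary.articles_kept >= max_total {
            break;
        }
    }
    summary
}
```

With five sources, a window of 2, and `max_total` of 3, a run in which every link is accepted stops after the first wave, so only two sources are ever extracted.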

### Key behaviors

- Sources are processed in **rotation order** (existing rolling window logic).
- `filled_counts`, `seen_urls`, and `source_counts` carry across waves (they accumulate).
- The `max_articles_per_source` cap still applies per source within each wave.
- `batch_size` still controls parallelism for scrape+classify within each wave.
- Progress events are emitted per wave: "Extraction des sources (vague 1/3)..." ("Extracting sources (wave 1/3)...") → "Traitement des articles..." ("Processing articles...").
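To make the accumulation concrete, here is a small Rust sketch of state that is created once and reused by every wave. `WaveState`, `try_accept`, and `wave_progress_label` are hypothetical names. The shape of `filled_counts` is not described in this document, so it is left out, and the real pipeline may order the cap and deduplication checks differently.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical state created before the first wave and shared by all waves.
#[derive(Default)]
pub struct WaveState {
    pub seen_urls: HashSet<String>,
    pub source_counts: HashMap<String, usize>,
}

impl WaveState {
    /// Records an article for `source` unless the source has already reached
    /// `max_articles_per_source` or the URL was seen before.
    pub fn try_accept(&mut self, source: &str, url: &str, max_articles_per_source: usize) -> bool {
        let current = self.source_counts.get(source).copied().unwrap_or(0);
        if current >= max_articles_per_source {
            return false; // per-source cap reached
        }
        if !self.seen_urls.insert(url.to_string()) {
            return false; // duplicate URL
        }
        *self.source_counts.entry(source.to_string()).or_insert(0) += 1;
        true
    }
}

/// Formats the per-wave progress label quoted in the list above.
pub fn wave_progress_label(wave_index: usize, wave_count: usize) -> String {
    format!("Extraction des sources (vague {}/{})...", wave_index, wave_count)
}
```

Because the same `WaveState` value is passed to every wave, a URL seen in wave 1 is still rejected as a duplicate in wave 2.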

### What's saved

With `source_extraction_window=3` and 8 sources:
- Wave 1 (sources 1-3): 15-30 links → scrape+classify → maybe fills 15/20 articles
- Wave 2 (sources 4-6): 15-30 links → scrape+classify → fills remaining 5 → STOP
- Wave 3 (sources 7-8): completely skipped, with no link extraction, no scraping, and no LLM calls
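The arithmetic in this example can be checked with a short Rust sketch. `waves_until_full` is a hypothetical helper, and the per-wave article counts passed to it (15, then 5) are the illustrative figures from the list above, not measurements. The wave count is the ceiling of 8 / 3, which is 3, and the run stops after wave 2, so the 2 sources of wave 3 are never extracted.

```rust
/// Hypothetical helper: given how many articles each wave ends up keeping,
/// returns `(waves_run, sources_skipped)` for an early-stopping run.
pub fn waves_until_full(
    source_count: usize,
    window: usize,
    max_total: usize,
    kept_per_wave: &[usize],
) -> (usize, usize) {
    let window = window.max(1);
    // Ceiling division: 8 sources with a window of 3 gives 3 waves.
    let total_waves = (source_count + window - 1) / window;

    let mut kept = 0;
    let mut waves_run = 0;
    for wave in 0..total_waves {
        waves_run += 1;
        kept += kept_per_wave.get(wave).copied().unwrap_or(0);
        if kept >= max_total {
            break; // synthesis is full: remaining waves are skipped
        }
    }

    // The last wave may be shorter than the window, hence the `min`.
    let sources_extracted = (waves_run * window).min(source_count);
    (waves_run, source_count - sources_extracted)
}
```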

---

## Files to modify

- **Create:** `backend/migrations/20260326000025_add_source_extraction_window.sql`
- **Modify:** `backend/src/models/settings.rs`: add `source_extraction_window` to structs, validation (1-10), default 3
- **Modify:** `backend/src/db/settings.rs`: add to queries
- **Modify:** `backend/src/services/synthesis.rs`: restructure Phase 1 into wave loop
- **Modify:** `frontend/src/types.ts`: add field
- **Modify:** `frontend/src/pages/Settings.tsx`: add number input
- **Modify:** `frontend/src/i18n/fr.ts`: labels
- **Modify:** `CLAUDE.md`: migration count
- **Modify:** test fixtures (`prompts.rs`, `api_syntheses_test.rs`, `pipeline_test.rs`, e2e generation-live)