Update algorithm.md to reflect the rewritten per-article classify/summarize
pipeline (no batch classification, no rewrite pass). Update generation time
estimate from 1 minute to 10 minutes in frontend i18n and docs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
### 1a. Extract article links from source pages and filter against article history
- For each source (max 10), fetch the source page HTML
- Query `article_history` for the last source used. Reorder the personalized sources so that the list starts with the source that follows the last one used (rolling window)
- If `use_llm_for_source_links` enabled: send HTML `<head>` + first 8000 chars of `<body>` to LLM → extract article URLs (falls back to heuristic if LLM fails)
- For each source, fetch the source page HTML:
- Otherwise: parse HTML `<a href>` links, filter by same-domain, non-homepage path, exclude `/tag/`, `/login/`, static assets, etc.
- If `use_llm_for_source_links` enabled: send HTML `<head>` + first 8000 chars of `<body>` to LLM → extract all article URLs up to a maximum of 10, with the most recent first. If LLM call fails, fall back to HTML parsing as described below.
- Over-fetch: `2 × max_articles_per_source` candidates per source
- **LLM call logged** with full prompt/response/timing
- Deduplicate candidate URLs
- Otherwise: parse HTML `<a href>` links, filter by same-domain, non-homepage path, exclude `/tag/`, `/login/`, `/contact/`,`/presentation/`,`/newsletter/`, static assets, etc. and keep only the first 10 links found
- Deduplicate candidate URLs
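
The non-LLM link filter described above could be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `filter_article_links`, the constant names, and the exact list of static-asset extensions are assumptions.

```python
from urllib.parse import urljoin, urlparse

# Path segments excluded by the heuristic (taken from the step description).
EXCLUDED_SEGMENTS = ("/tag/", "/login/", "/contact/", "/presentation/", "/newsletter/")
# Assumed list of static-asset extensions; the real list may differ.
STATIC_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico", ".pdf")
MAX_LINKS = 10


def filter_article_links(source_url, hrefs):
    """Keep same-domain, non-homepage, non-excluded links, deduplicated, first 10 only."""
    source_host = urlparse(source_url).netloc.lower()
    kept, seen = [], set()
    for href in hrefs:
        absolute = urljoin(source_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        path = parsed.path
        if parsed.scheme not in ("http", "https"):
            continue  # mailto:, javascript:, etc.
        if parsed.netloc.lower() != source_host:
            continue  # different domain
        if path in ("", "/"):
            continue  # homepage
        padded = path if path.endswith("/") else path + "/"
        if any(segment in padded.lower() for segment in EXCLUDED_SEGMENTS):
            continue  # tag, login, contact pages, ...
        if path.lower().endswith(STATIC_EXTENSIONS):
            continue  # static assets
        if absolute in seen:
            continue  # duplicate candidate
        seen.add(absolute)
        kept.append(absolute)
        if len(kept) == MAX_LINKS:
            break
    return kept
```
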
### 1b. Scrape candidate articles
- Hash each URL (normalized: lowercase, strip fragments/UTM params/trailing slashes)
- Query `article_history` for existing hashes → remove matches
- Fetch each URL (bounded concurrency: 5 with LLM extraction, 10 without)
- Trace dropped articles with `status: filtered_history`
- SSRF check (no private IPs), 15s timeout, 5MB body limit
- Add the url to `url_soucre`
- If `use_llm_for_article_extraction` enabled: send `<head>` + body text to LLM → extract title, date, body, error detection (falls back to heuristic if LLM fails)
- Otherwise: HTML parsing heuristics for title (`<title>`, `og:title`), date (meta tags, JSON-LD, `<time>`), body (strip scripts/nav), soft-404 detection
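
The URL normalization and hashing used for the `article_history` lookup might look like the sketch below. The helper names and the choice of SHA-256 are assumptions; the description only specifies lowercasing and stripping fragments, UTM parameters, and trailing slashes.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase the URL, then drop the fragment, utm_* parameters and trailing slashes."""
    parts = urlsplit(url.strip().lower())
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.startswith("utm_")
    ]
    path = parts.path.rstrip("/")
    # The empty last element discards the fragment.
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(query), ""))


def url_hash(url):
    """Hash of the normalized URL (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()
```

Two URLs that differ only by tracking parameters, letter case, fragment, or a trailing slash then map to the same hash, so the history lookup treats them as the same article.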
### 1b. Scrape, classify and summarize articles
- Capture final URL after redirects (canonical URL)
- For each url from step 1a:
### 1c. Filter empty content
- If the number of articles in `source_counts` for the current URL's source exceeds `max_articles_per_source`:
- Trace dropped article with `status: filtered_diversity`
- Remove articles where scraped body text is empty (scrape failure, soft 404, too old)
- Move to next url
- Trace dropped articles in `article_history` with `status: filtered_empty`
- Fetch each URL (bounded concurrency: 5 with LLM extraction, 10 without).
- SSRF check (no private IPs), 15s timeout, 5MB body limit.
### 1d. Filter against article history
- HTML parsing heuristics for title (`<title>`, `og:title`), date (meta tags, JSON-LD, `<time>`), body (strip scripts/nav), soft-404 detection
- If article scraped body text is empty (scrape failure, soft 404, too old):
- Hash each URL (normalized: lowercase, strip fragments/UTM params/trailing slashes)
- Trace dropped articles in `article_history` with `status: filtered_empty`
- Query `article_history` for existing hashes → remove matches
- Move to next url
- Trace dropped articles with `status: filtered_history`
- Send article (title + first 500 chars of body) + categories + "Autre" to LLM. LLM returns `{title, summary, category}`, mapping the article to a category. The LLM generates the summary, and also a title if the provided title is empty
- **LLM call logged** with full prompt/response/timing
### 1e. Retry if under-filled
- Add the article to `article_scraped` and increase `filled_counts`
- If the number of articles in this article's category exceeds `max_items_per_category`: change the article's category to "Autre"
- If valid articles < `categories × max_items_per_category` and history is enabled
- If the total number of articles in `article_scraped` exceeds `number of categories (including Autre) × max_items_per_category` then exit for loop and move to synthesis generation
- Re-scrape source pages for NEW links (exclude already-fetched URLs)
- Scrape + filter empty + filter history on retry candidates
- Merge with existing valid articles
- Only 1 retry attempt
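
The per-article placement rules in the rewritten pipeline (fall back to "Autre" when a category is full, stop once the overall cap is reached) could be sketched as below. The function name and the `(url, category)` input shape are assumptions; `article_scraped`, `filled_counts`, and `max_items_per_category` follow the names in the steps above. The sketch reads "exceeds" as "has already reached the limit" and, like the description, does not cap "Autre" separately.

```python
AUTRE = "Autre"


def place_articles(classified, categories, max_items_per_category):
    """classified: iterable of (url, category) pairs in processing order."""
    all_categories = list(categories) + [AUTRE]
    total_cap = len(all_categories) * max_items_per_category
    filled_counts = {name: 0 for name in all_categories}
    article_scraped = []
    for url, category in classified:
        if category not in filled_counts:
            category = AUTRE  # unknown label from the LLM
        if filled_counts[category] >= max_items_per_category:
            category = AUTRE  # target category already full
        filled_counts[category] += 1
        article_scraped.append((url, category))
        if len(article_scraped) >= total_cap:
            break  # enough articles: move on to synthesis generation
    return article_scraped, filled_counts
```
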
### 1f. LLM classification
- Send articles (title + first 500 chars of body) + categories + "Autre" to LLM
- LLM returns `{assignments: [{index, category}]}` mapping each article to a category
- Overflow: articles that exceed both target category AND "Autre" limits → collected in `all_overflow`
- **LLM call logged** with full prompt/response/timing
### 1g. Enforce source diversity
- Count domains across all categories
- Remove articles where domain exceeds `max_articles_per_source`
- Trace dropped articles with `status: filtered_diversity`
- Recount category fill levels
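
As an illustration of the source-diversity rule, the sketch below keeps at most `max_articles_per_source` URLs per domain and returns the rest so they can be traced as `filtered_diversity`. The function name and return shape are assumptions.

```python
from collections import Counter
from urllib.parse import urlparse


def enforce_source_diversity(urls, max_articles_per_source):
    """Split URLs into (kept, dropped) so that no domain exceeds the per-source limit."""
    counts = Counter()
    kept, dropped = [], []
    for url in urls:
        domain = urlparse(url).netloc.lower()
        if counts[domain] >= max_articles_per_source:
            dropped.append(url)  # caller records status: filtered_diversity
            continue
        counts[domain] += 1
        kept.append(url)
    return kept, dropped
```
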
---
@@ -78,21 +60,11 @@
- For each user category: `needed = max - already_filled`
- Only proceed if any category needs more
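
The gap computation that gates Phase 2 can be illustrated with a short sketch. The function name `compute_gaps` and the dictionary shapes are assumptions; the formula is the `needed = max - already_filled` rule stated above.

```python
def compute_gaps(filled_counts, categories, max_items_per_category):
    """Return {category: needed} for every user category that is still under-filled."""
    gaps = {
        name: max(0, max_items_per_category - filled_counts.get(name, 0))
        for name in categories
    }
    return {name: needed for name, needed in gaps.items() if needed > 0}


gaps = compute_gaps({"AI News": 3, "Cybersecurity": 1}, ["AI News", "Cybersecurity"], 3)
if gaps:
    # Only proceed with the web search pass when at least one category needs more.
    print(gaps)
```
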
### 2b. Load recent domains for diversity
### 2b. LLM web search pass
- If `source_diversity_window > 0`: extract domains from last N syntheses' JSONB sections
- Used as soft "avoid if possible" instruction in search prompt
### 2c. LLM web search pass
- Build search prompt with theme, categories, gap counts ("find N articles for AI News, M for Cybersecurity"), recent domains to avoid, personalized source URLs
- Call `provider.generate_search_pass()` with web search tool enabled
- **LLM call logged** with full prompt/response/timing
- For each overflow article: check if domain is under `max_articles_per_source` limit
- Add to `all_scraped["category_autre"]` up to the shortfall
---
## Combined Rewrite Pass
- **Fail if no articles** — return error if all categories are empty
- **Build rewrite prompt** — serialize all scraped articles with body content, instruct LLM to rewrite title + summary (4-5 lines) faithfully based on scraped content
- **Build rewrite schema** — `minItems`/`maxItems` set to ACTUAL count per category (not user max), empty categories omitted, "Autre" included if non-empty
- **LLM rewrite pass** — call `provider.generate_rewrite_pass()` with writing model
- **LLM call logged** with full prompt/response/timing
- **Build final sections** — map `category_N` keys to user category names, add "Autre" section if present, omit empty categories
- **Restore scraped URLs** — replace any hallucinated URLs from LLM rewrite with the validated scraped URLs (matched by category + position)
---
@@ -144,27 +88,3 @@
- **Save synthesis** — insert into `syntheses` table with `job_id`, `week` (ISO week), `sections` (JSONB), `status: completed`
- **Record used articles** — insert each article URL into `article_history` with `status: used`, `synthesis_id`, `job_id`, and category name (for future dedup + provenance)
---
## Summary of LLM Calls (up to 4 per generation)
| # | Call | When | Model |
|---|---|---|---|
| 1 | Classification Phase 1 | After Phase 1 scraping | research |
| 2 | Web Search | Phase 2 start | research |
| 3 | Classification Phase 2 | After Phase 2 scraping | research |
| 4 | Rewrite | After both phases | writing |
Plus optionally per-article calls for LLM link extraction and LLM article extraction (when those settings are enabled).
## Summary of Filtering Steps
| Step | Phase | What's dropped |
|---|---|---|
| Empty content | 1 & 2 | Scrape failures, soft 404s, too old |
| Article history | 1 & 2 | Already used in previous syntheses |
| Homepage URLs | 2 | Path is `/` or empty |
| Cross-phase dedup | 2 | URLs already found in Phase 1 |
'generate.title': 'Generer la Synthese Hebdomadaire',
'generate.description': "Cette action va lancer l'analyse des actualites des {days} derniers jours sur le theme \"{theme}\" via {provider} ({model}).",
'generate.note': 'Note : La generation peut prendre jusqu\'a une minute.',
'generate.note': 'Note : La generation peut prendre jusqu\'a 10 minutes.',
'generate.noWebSearch': "Note : Le fournisseur selectionne ne dispose pas de la recherche web integree. Les resultats seront bases sur les connaissances du modele uniquement.",
'generate.start': 'Lancer la generation',
'generate.canLeave': 'Vous pouvez quitter cette page. La generation continuera en arriere-plan.',