ai_synth/docs/algorithm.md

# Synthesis Generation Pipeline — Full Algorithm

## Initialization

1. **Load user settings** from DB (categories, provider, models, max_items, etc.)
2. **Cleanup** — delete old article history entries (>N days, dropped only) + truncate old LLM call logs
3. **Validate** — if no categories are configured, only the default category "Autre" is used.
4. **Load user sources** (personalized URLs like `https://openai.com/blog`)
5. **Resolve LLM provider** — decrypt user's API key, create provider instance (`Arc<dyn LlmProvider>`)
6. **Resolve models** — research model + writing model (user override or admin default)
7. **Setup rate limiter** — per-user or global provider limiter
8. **Prepare LLM scraping option** — if `use_llm_for_article_extraction` enabled, clone provider+model for concurrent use
9. **Initialize tracking structures** — `article_scraped` (category→articles), `source_counts` (per-source article count), `url_source` (per-article source), `filled_counts` (per-category article count), `seen_urls` (cross-phase dedup), classification categories (user categories + "Autre")
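
The tracking structures from step 9 could be grouped roughly as follows. This is a minimal sketch, not the project's actual types: the `Article` and `Tracking` structs, their field types, and the `Tracking::new` constructor are assumptions made for illustration.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for a scraped article.
#[derive(Debug, Clone, PartialEq)]
pub struct Article {
    pub url: String,
    pub title: String,
    pub summary: String,
}

/// Per-run bookkeeping (field types are assumptions).
#[derive(Debug, Default)]
pub struct Tracking {
    /// category -> articles kept for that category
    pub article_scraped: HashMap<String, Vec<Article>>,
    /// source -> number of articles taken from it
    pub source_counts: HashMap<String, usize>,
    /// article URL -> source page it was discovered on
    pub url_source: HashMap<String, String>,
    /// category -> number of articles kept
    pub filled_counts: HashMap<String, usize>,
    /// URLs seen in any phase (cross-phase dedup)
    pub seen_urls: HashSet<String>,
    /// user categories followed by the fallback "Autre"
    pub categories: Vec<String>,
}

impl Tracking {
    /// Builds empty tracking state; "Autre" is appended if missing,
    /// so an empty category list yields just ["Autre"].
    pub fn new(user_categories: &[&str]) -> Self {
        let mut categories: Vec<String> =
            user_categories.iter().map(|c| c.to_string()).collect();
        if !categories.iter().any(|c| c == "Autre") {
            categories.push("Autre".to_string());
        }
        Tracking { categories, ..Default::default() }
    }
}
```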

---

## Phase 1: Personalized Sources

**Skipped entirely if user has 0 sources.**

### 1a. Extract article links from source pages and filter against article history

- Query `article_history` for the last source used, then reorder the personalized sources so that iteration starts with the source that follows it (rolling window).
- For each source, fetch the source page HTML:
  - If `use_llm_for_source_links` is enabled: send the HTML `<head>` + first 8000 chars of `<body>` to the LLM → extract article URLs, up to a maximum of 10, most recent first. If the LLM call fails, fall back to HTML parsing as described below.
    - **LLM call logged** with full prompt/response/timing
  - Otherwise: parse HTML `<a href>` links; keep same-domain links with a non-homepage path; exclude `/tag/`, `/login/`, `/contact/`, `/presentation/`, `/newsletter/`, static assets, etc.; keep only the first 10 links found
  - Deduplicate candidate URLs
  - Hash each URL (normalized: lowercase, strip fragments/UTM params/trailing slashes)
  - Query `article_history` for existing hashes → remove matches
  - Trace dropped articles with `status: filtered_history`
  - Record the URL's source in `url_source`
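
The source rotation and the history filter in step 1a can be sketched as below, under stated assumptions. The function names are hypothetical, URL handling is plain string manipulation rather than a real URL parser, and `DefaultHasher` stands in for whatever hash the pipeline stores (its output is not guaranteed stable across Rust releases, so a persisted hash would need a fixed algorithm).

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Rotate `sources` so iteration starts just after `last_used`.
/// Unknown or missing `last_used` leaves the order unchanged.
pub fn rotate_sources(sources: &[String], last_used: Option<&str>) -> Vec<String> {
    let mut out = sources.to_vec();
    let pos = last_used.and_then(|l| sources.iter().position(|s| s.as_str() == l));
    if let Some(pos) = pos {
        let n = out.len();
        out.rotate_left((pos + 1) % n);
    }
    out
}

/// Lowercase, drop the fragment, drop `utm_*` query params, strip trailing slashes.
pub fn normalize_url(url: &str) -> String {
    let lower = url.trim().to_lowercase();
    let no_frag = lower.split('#').next().unwrap_or("");
    let (base, query) = match no_frag.split_once('?') {
        Some((b, q)) => (b, q),
        None => (no_frag, ""),
    };
    let kept: Vec<&str> = query
        .split('&')
        .filter(|p| !p.is_empty() && !p.starts_with("utm_"))
        .collect();
    let mut out = base.trim_end_matches('/').to_string();
    if !kept.is_empty() {
        out.push('?');
        out.push_str(&kept.join("&"));
    }
    out
}

/// Hash of the normalized URL (illustrative hash function only).
pub fn url_hash(url: &str) -> u64 {
    let mut h = DefaultHasher::new();
    normalize_url(url).hash(&mut h);
    h.finish()
}

/// Deduplicate candidates, then split them into (kept, dropped_by_history).
pub fn filter_against_history(
    candidates: &[String],
    history: &HashSet<u64>,
) -> (Vec<String>, Vec<String>) {
    let mut seen = HashSet::new();
    let (mut kept, mut dropped) = (Vec::new(), Vec::new());
    for url in candidates {
        let h = url_hash(url);
        if !seen.insert(h) {
            continue; // duplicate candidate within this run
        }
        if history.contains(&h) {
            dropped.push(url.clone()); // would be traced as `filtered_history`
        } else {
            kept.push(url.clone());
        }
    }
    (kept, dropped)
}
```

Hashing the normalized form rather than the raw URL means `https://example.com/old/` and `https://example.com/old` count as the same history entry.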

### 1b. Scrape, classify and summarize articles

- For each URL from step 1a:
  - If the number of articles in `source_counts` for the URL's source exceeds `max_articles_per_source`:
    - Trace the dropped article with `status: filtered_diversity`
    - Move to the next URL
  - Fetch the URL (bounded concurrency: 5 with LLM extraction, 10 without).
  - SSRF check (no private IPs), 15s timeout, 5MB body limit.
  - HTML parsing heuristics for title (`<title>`, `og:title`), date (meta tags, JSON-LD, `<time>`), body (strip scripts/nav), soft-404 detection
  - If the scraped body text is empty (scrape failure, soft 404, too old):
    - Trace the dropped article in `article_history` with `status: filtered_empty`
    - Move to the next URL
  - Send the article (title + first 500 chars of body) and the categories (including "Autre") to the LLM. The LLM returns `{title, summary, category}`, mapping the article to a category. It generates the summary, and also a title if the provided title is empty
    - **LLM call logged** with full prompt/response/timing
  - Add the article to `article_scraped` and increase `filled_counts`
  - If the number of articles in this article's category exceeds `max_items_per_category`: change the article's category to "Autre"
  - If the total number of articles in `article_scraped` exceeds `number of categories (including Autre) × max_items_per_category`, exit the loop and move to synthesis generation
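
The bookkeeping in this loop (per-source cap, overflow into "Autre", early exit) can be sketched without the scraping and LLM calls. This is an illustrative sketch only: `Classified`, `Selection`, and `select` are hypothetical names, the input is assumed to be already scraped and classified, both limits are treated as inclusive caps (`>=`), and `source_counts` is assumed to be incremented when an article is kept.

```rust
use std::collections::HashMap;

/// Hypothetical input: an article that was already scraped and classified.
#[derive(Debug, Clone)]
pub struct Classified {
    pub url: String,
    pub source: String,
    pub category: String,
}

#[derive(Debug, Default)]
pub struct Selection {
    /// category -> URLs kept for that category
    pub by_category: HashMap<String, Vec<String>>,
    /// URLs dropped by the per-source cap (`filtered_diversity`)
    pub dropped_diversity: Vec<String>,
}

pub fn select(
    articles: &[Classified],
    n_categories_with_autre: usize,
    max_items_per_category: usize,
    max_articles_per_source: usize,
) -> Selection {
    let capacity = n_categories_with_autre * max_items_per_category;
    let mut source_counts: HashMap<&str, usize> = HashMap::new();
    let mut sel = Selection::default();
    let mut total = 0;

    for a in articles {
        // Early exit once every slot is taken.
        if total >= capacity {
            break;
        }
        // Per-source diversity cap.
        let used = source_counts.entry(a.source.as_str()).or_insert(0);
        if *used >= max_articles_per_source {
            sel.dropped_diversity.push(a.url.clone());
            continue;
        }
        *used += 1;
        // A full category overflows into "Autre".
        let filled = sel.by_category.get(&a.category).map_or(0, Vec::len);
        let category = if filled >= max_items_per_category {
            "Autre".to_string()
        } else {
            a.category.clone()
        };
        sel.by_category.entry(category).or_default().push(a.url.clone());
        total += 1;
    }
    sel
}
```

Note that nothing in this sketch caps "Autre" itself; it can hold more than `max_items_per_category` until the global capacity check stops the loop.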

---

## Phase 2: Web Search Fallback

**Skipped if all user-defined categories are already filled to `max_items_per_category`.**

### 2a. Compute category gaps

- For each user category: `needed = max_items_per_category - already_filled`
- Only proceed if any category needs more
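
A minimal sketch of the gap computation, assuming `filled_counts` is a map from category name to count and that "Autre" is not part of `user_categories`. The function name `category_gaps` is hypothetical.

```rust
use std::collections::HashMap;

/// Returns (category, needed) for every user category that is not yet full.
/// An empty result means Phase 2 can be skipped.
pub fn category_gaps(
    user_categories: &[String],
    filled_counts: &HashMap<String, usize>,
    max_items_per_category: usize,
) -> Vec<(String, usize)> {
    user_categories
        .iter()
        .filter_map(|c| {
            let filled = filled_counts.get(c).copied().unwrap_or(0);
            // saturating_sub guards against a category that is over-filled.
            let needed = max_items_per_category.saturating_sub(filled);
            (needed > 0).then(|| (c.clone(), needed))
        })
        .collect()
}
```

The resulting pairs map directly onto the gap counts that go into the search prompt in step 2b.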

### 2b. LLM web search pass

- Build search prompt with theme, categories, gap counts ("find N articles for category_1, M for category_2")
- Send search prompt to LLM. LLM returns structured JSON: `{category_0: [{title, url, summary}], category_1: [...]}`
  - **LLM call logged** with full prompt/response/timing
- **Filter homepage URLs** — drop articles with path `/` or empty
- **Cross-phase dedup** — drop URLs already seen in Phase 1
- **Dedup by URL** — drop duplicate URLs within Phase 2 (case-insensitive)
- **Limit articles per source** — enforce `max_articles_per_source` per domain (spread across categories first, then fill)
- **Filter against article history** — BEFORE scraping (saves HTTP requests), drop already-seen URLs
- Each drop traced in `article_history` with appropriate status
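
The filter chain in step 2b can be sketched as a single pass that applies the checks in the order listed. Several things here are assumptions: the `DropReason` variants are illustrative labels rather than the real `article_history` status values, `seen_phase1` and `history` are assumed to hold lowercased URLs, URL splitting is naive string handling, and the per-domain cap is applied first-come-first-served rather than spread across categories first.

```rust
use std::collections::{HashMap, HashSet};

/// Illustrative drop labels (not the pipeline's actual status strings).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DropReason {
    Homepage,
    SeenInPhase1,
    Duplicate,
    SourceLimit,
    History,
}

/// Naive split of a URL into (lowercased host, path).
fn host_and_path(url: &str) -> (String, String) {
    let rest = url.split_once("://").map_or(url, |(_, r)| r);
    let rest = rest
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or("");
    match rest.split_once('/') {
        Some((host, path)) => (host.to_lowercase(), format!("/{}", path)),
        None => (rest.to_lowercase(), String::new()),
    }
}

pub fn filter_results(
    urls: &[String],
    seen_phase1: &HashSet<String>,
    history: &HashSet<String>,
    max_articles_per_source: usize,
) -> (Vec<String>, Vec<(String, DropReason)>) {
    let mut kept = Vec::new();
    let mut dropped = Vec::new();
    let mut seen_here: HashSet<String> = HashSet::new();
    let mut per_domain: HashMap<String, usize> = HashMap::new();

    for url in urls {
        let key = url.to_lowercase(); // case-insensitive comparison key
        let (host, path) = host_and_path(url);
        if path.is_empty() || path == "/" {
            dropped.push((url.clone(), DropReason::Homepage));
            continue;
        }
        if seen_phase1.contains(&key) {
            dropped.push((url.clone(), DropReason::SeenInPhase1));
            continue;
        }
        if !seen_here.insert(key.clone()) {
            dropped.push((url.clone(), DropReason::Duplicate));
            continue;
        }
        let count = per_domain.entry(host).or_insert(0);
        if *count >= max_articles_per_source {
            dropped.push((url.clone(), DropReason::SourceLimit));
            continue;
        }
        *count += 1;
        // History check runs last, but still before any HTTP request.
        if history.contains(&key) {
            dropped.push((url.clone(), DropReason::History));
            continue;
        }
        kept.push(url.clone());
    }
    (kept, dropped)
}
```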

### 2c. Scrape web search results

- Same scraping as Phase 1 (bounded concurrency, SSRF check, optional LLM extraction)
- Filter empty content (scrape failures, soft 404, too old)
- Trace drops
- Merge results into `article_scraped`
- Move to synthesis generation
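
The SSRF check used by both scraping phases ("no private IPs") might look like the sketch below. It is an assumption about what the check covers, not the project's actual rule set: the function name `is_forbidden_ip` is hypothetical, and a real check would normally be applied to the address the hostname resolves to, and again after each redirect.

```rust
use std::net::IpAddr;

/// Returns true for addresses a scraper should refuse to connect to.
pub fn is_forbidden_ip(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => {
            v4.is_private()            // 10/8, 172.16/12, 192.168/16
                || v4.is_loopback()    // 127/8
                || v4.is_link_local()  // 169.254/16 (cloud metadata endpoints)
                || v4.is_unspecified() // 0.0.0.0
                || v4.is_broadcast()   // 255.255.255.255
        }
        IpAddr::V6(v6) => {
            // IPv4-mapped addresses (::ffff:a.b.c.d) are checked as IPv4.
            if let Some(v4) = v6.to_ipv4_mapped() {
                return is_forbidden_ip(IpAddr::V4(v4));
            }
            let first = v6.segments()[0];
            v6.is_loopback()                   // ::1
                || v6.is_unspecified()         // ::
                || (first & 0xfe00) == 0xfc00  // fc00::/7 unique local
                || (first & 0xffc0) == 0xfe80  // fe80::/10 link-local
        }
    }
}
```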

---

## Save + Record

- **Sanitize** — strip `\u0000` null bytes from JSON (PostgreSQL rejects them in JSONB)
- **Save synthesis** — insert into `syntheses` table with `job_id`, `week` (ISO week), `sections` (JSONB), `status: completed`
- **Record used articles** — insert each article URL into `article_history` with `status: used`, `synthesis_id`, `job_id`, and category name (for future dedup + provenance)
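
The sanitize step can be sketched as stripping NUL characters from every text field before the sections are serialized to JSON. The `Item` struct and function names below are hypothetical; the sketch works on in-memory strings and deliberately does not touch already-serialized JSON, where a naive search for the `\u0000` escape could misfire on an escaped backslash.

```rust
/// Hypothetical section item as stored in `sections`.
#[derive(Debug, Clone, PartialEq)]
pub struct Item {
    pub title: String,
    pub url: String,
    pub summary: String,
}

/// Remove NUL characters, which PostgreSQL cannot store in JSONB text.
pub fn strip_nul(s: &str) -> String {
    s.chars().filter(|&c| c != '\0').collect()
}

/// Apply `strip_nul` to every text field of an item.
pub fn sanitize_item(item: &Item) -> Item {
    Item {
        title: strip_nul(&item.title),
        url: strip_nul(&item.url),
        summary: strip_nul(&item.summary),
    }
}
```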