
# Design: LLM-Assisted Scraping — Link Extraction & Article Content Extraction

**Date**: 2026-03-24
**Scope**: Two optional LLM-powered enhancements to the scraping pipeline, controlled by user settings

---

## Context

The current scraping pipeline uses HTML parsing heuristics to extract article links from source pages and article content from individual pages. These heuristics fail on JavaScript-rendered pages, unusual HTML structures, and complex layouts. Two optional LLM-powered alternatives improve extraction quality when enabled.

## New User Settings

Two independent boolean toggles:

- `use_llm_for_source_links: bool` (default `false`) — "Utiliser l'IA pour extraire les liens"
- `use_llm_for_article_extraction: bool` (default `false`) — "Utiliser l'IA pour extraire le contenu"

The two toggles are fully independent: a user can enable either, both, or neither.

**Migration:** `ALTER TABLE settings ADD COLUMN use_llm_for_source_links BOOLEAN NOT NULL DEFAULT false; ALTER TABLE settings ADD COLUMN use_llm_for_article_extraction BOOLEAN NOT NULL DEFAULT false;`

**Frontend:** Two checkboxes on the Settings page, under a new "Extraction avancee" section.
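
As a rough illustration of the defaults and the independence of the two toggles, here is a minimal Rust sketch. `LlmScrapingSettings` and `LlmScrapingUpdate` are hypothetical names used only for this example; the real fields live in `UserSettings` and `UpdateSettingsRequest`, whose exact shape (in particular whether updates are partial) is not described here and is assumed.

```rust
/// Hypothetical subset of `UserSettings` holding only the two new toggles.
/// Deriving `Default` yields `false` for both, matching the migration defaults.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct LlmScrapingSettings {
    pub use_llm_for_source_links: bool,
    pub use_llm_for_article_extraction: bool,
}

/// Hypothetical partial update (assumption): `None` leaves the stored value untouched.
#[derive(Debug, Clone, Copy, Default)]
pub struct LlmScrapingUpdate {
    pub use_llm_for_source_links: Option<bool>,
    pub use_llm_for_article_extraction: Option<bool>,
}

impl LlmScrapingSettings {
    /// Applies an update; each toggle changes independently of the other.
    pub fn apply(self, update: LlmScrapingUpdate) -> Self {
        Self {
            use_llm_for_source_links: update
                .use_llm_for_source_links
                .unwrap_or(self.use_llm_for_source_links),
            use_llm_for_article_extraction: update
                .use_llm_for_article_extraction
                .unwrap_or(self.use_llm_for_article_extraction),
        }
    }
}
```

Enabling one toggle through `apply` leaves the other at its stored value, which is the behavior the two independent columns are meant to guarantee.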

## ScrapedContent URL Field

Add `pub url: String` to the `ScrapedContent` struct. Populated with the final URL after redirects (from `response.url().to_string()`).

**Pipeline impact:** `scrape_single_article` returns `(String, String, String)`, i.e. `(body_text, page_title, final_url)`, instead of `(String, String)`. The callers (`scrape_flat_urls`, `scrape_articles`) use `final_url` to set `ScrapedNewsItem.url`, replacing the original input URL with the validated, redirect-resolved URL. This URL becomes the canonical article URL used throughout the pipeline; in the synthesis it replaces the LLM-provided URL via `restore_scraped_urls`.
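
The tuple hand-off could look roughly like the sketch below. The `ScrapedNewsItem` shown here is a trimmed-down, hypothetical stand-in for the real struct, `item_from_scrape` is an illustrative helper name, and keeping the input URL when `final_url` is empty is a defensive assumption that the spec does not state.

```rust
/// Hypothetical, trimmed-down stand-in for the real `ScrapedNewsItem`.
#[derive(Debug, PartialEq)]
pub struct ScrapedNewsItem {
    pub url: String,
    pub title: String,
    pub body_text: String,
}

/// Builds an item from the `(body_text, page_title, final_url)` tuple.
/// The redirect-resolved `final_url` replaces the original input URL.
pub fn item_from_scrape(input_url: &str, scraped: (String, String, String)) -> ScrapedNewsItem {
    let (body_text, page_title, final_url) = scraped;
    // Assumption: fall back to the input URL only if no final URL was reported.
    let url = if final_url.is_empty() {
        input_url.to_string()
    } else {
        final_url
    };
    ScrapedNewsItem {
        url,
        title: page_title,
        body_text,
    }
}
```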

## Option 1: LLM-Assisted Source Link Extraction

When `use_llm_for_source_links` is enabled:

1. Fetch the source page HTML (same as today)
2. Extract `<head>` + first 8000 chars of `<body>` for the LLM
3. **LLM prompt:** "Here is the HTML of a blog/news page. Extract only the URLs that point to actual articles (not navigation, tags, categories, login pages, etc.). Return a JSON object with a single `urls` array of URL strings."
4. **LLM schema:** `{ "type": "object", "properties": { "urls": { "type": "array", "items": { "type": "string" } } }, "required": ["urls"], "additionalProperties": false }`
5. Parse the LLM response:
   - Resolve relative URLs against the source URL
   - Filter: only keep http/https URLs, skip malformed URLs (use `Url::parse`)
   - Filter: same domain only (match existing heuristic behavior)
   - Deduplicate, limit to `max_links`
6. **Fallback:** if the LLM call fails OR returns an empty `urls` array (`{"urls": []}`), fall back to the existing `extract_links_from_html` and log a warning.
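
Steps 5 and 6 can be sketched as follows. This is a std-only illustration under stated assumptions, not the actual implementation: `http_host` and `resolve` are simplified stand-ins for `Url::parse` and proper relative-URL resolution (only absolute and root-relative URLs are handled), "same domain" is approximated as an exact host match, and the names `clean_llm_links` and `links_or_fallback` are hypothetical.

```rust
use std::collections::HashSet;

/// Lowercase host of an absolute http(s) URL, or `None`.
/// Simplified stand-in for `Url::parse` (no userinfo or IPv6 handling).
fn http_host(url: &str) -> Option<String> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    let authority = rest.split(&['/', '?', '#'][..]).next()?;
    let host = authority.split(':').next()?;
    if host.is_empty() || host.contains(char::is_whitespace) {
        return None;
    }
    Some(host.to_ascii_lowercase())
}

/// Resolves absolute and root-relative candidates; everything else is dropped.
fn resolve(source_url: &str, candidate: &str) -> Option<String> {
    let candidate = candidate.trim();
    if http_host(candidate).is_some() {
        return Some(candidate.to_string());
    }
    if candidate.starts_with('/') && !candidate.starts_with("//") {
        let scheme_end = source_url.find("://")? + 3;
        let origin_end = source_url[scheme_end..]
            .find('/')
            .map_or(source_url.len(), |i| scheme_end + i);
        return Some(format!("{}{}", &source_url[..origin_end], candidate));
    }
    None
}

/// Step 5: resolve, keep same-host http(s) URLs, deduplicate, cap at `max_links`.
pub fn clean_llm_links(source_url: &str, llm_urls: &[String], max_links: usize) -> Vec<String> {
    let Some(source_host) = http_host(source_url) else {
        return Vec::new();
    };
    let mut seen = HashSet::new();
    let mut kept = Vec::new();
    for raw in llm_urls {
        if kept.len() >= max_links {
            break;
        }
        let Some(url) = resolve(source_url, raw) else {
            continue;
        };
        if http_host(&url).as_deref() != Some(source_host.as_str()) {
            continue;
        }
        if seen.insert(url.clone()) {
            kept.push(url);
        }
    }
    kept
}

/// Step 6: use the heuristic extractor when the LLM fails or returns nothing.
pub fn links_or_fallback<E: std::fmt::Display>(
    llm_result: Result<Vec<String>, E>,
    heuristic: impl FnOnce() -> Vec<String>,
) -> Vec<String> {
    match llm_result {
        Ok(urls) if !urls.is_empty() => urls,
        Ok(_) => {
            eprintln!("warning: LLM returned no links, using heuristic extraction");
            heuristic()
        }
        Err(err) => {
            eprintln!("warning: LLM link extraction failed ({err}), using heuristic extraction");
            heuristic()
        }
    }
}
```

Passing the heuristic as a closure means the existing `extract_links_from_html` path only runs when it is actually needed.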

When disabled, the existing HTML parsing + heuristic filtering is used (unchanged).

**LLM dispatch:** Uses `model_research` via `provider.generate_rewrite_pass`.
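
The input prepared in step 2 (the `<head>` plus the first 8000 characters of `<body>`) might be built along these lines. This is a string-search sketch, not a real HTML parser; `html_excerpt_for_llm` and `BODY_CHAR_BUDGET` are hypothetical names, and counting the `<body>` tag itself toward the 8000 characters is an assumption.

```rust
/// Maximum number of `<body>` characters forwarded to the model (step 2).
const BODY_CHAR_BUDGET: usize = 8000;

/// Returns the `<head>...</head>` element followed by the first
/// `BODY_CHAR_BUDGET` characters starting at the `<body>` tag.
pub fn html_excerpt_for_llm(html: &str) -> String {
    // ASCII lowercasing preserves byte offsets, so indices found in `lower`
    // are valid in `html`. Caveat: "<head" would also match "<header".
    let lower = html.to_ascii_lowercase();
    let (head, after_head) = match (lower.find("<head"), lower.find("</head>")) {
        (Some(start), Some(end)) if start < end => {
            let head_end = end + "</head>".len();
            (&html[start..head_end], head_end)
        }
        _ => ("", 0),
    };
    // Without a `<body>` tag, treat everything after the head as the body.
    let body_start = lower[after_head..]
        .find("<body")
        .map_or(after_head, |offset| after_head + offset);
    // Truncate by characters, not bytes, to avoid splitting a UTF-8 sequence.
    let body: String = html[body_start..].chars().take(BODY_CHAR_BUDGET).collect();
    format!("{head}{body}")
}
```

Truncating with `chars().take(..)` rather than a byte slice keeps the excerpt valid UTF-8 even when the page contains multi-byte characters near the cut-off.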

## Option 2: LLM-Assisted Article Content Extraction

When `use_llm_for_article_extraction` is enabled:

1. Fetch the article page (same as today — HTTP request, SSRF check, body size limit, streaming)
2. Capture the final URL after redirects (for `ScrapedContent.url`)
3. Extract `<head>` section and clean body text using existing HTML stripping
4. Send both to the LLM with a structured extraction prompt
5. **LLM prompt:** "Extract the following from this article: title, publication date (ISO 8601 format, or empty string if not found), body text (main article content only, no navigation or ads), and whether this is a real article or an error/404 page."
6. **LLM schema:** (OpenAI strict mode compatible — no union types, `published_date` uses empty string instead of null)
```json
{
  "type": "object",
  "properties": {
    "title": { "type": "string", "description": "Article title" },
    "published_date": { "type": "string", "description": "ISO 8601 date or empty string if not found" },
    "body_text": { "type": "string", "description": "Main article content" },
    "is_error_page": { "type": "boolean", "description": "True if this is an error/404 page" }
  },
  "required": ["title", "published_date", "body_text", "is_error_page"],
  "additionalProperties": false
}
```
7. Parse the LLM response into `ScrapedContent` fields:
   - `title` → `ScrapedContent.title` (wrapped in `Some`)
   - `published_date` → if non-empty, parse ISO 8601 → `ScrapedContent.published_date`; if empty string → `None`
   - `body_text` → `ScrapedContent.body_text`
   - `is_error_page` → `ScrapedContent.is_soft_404`
   - `url` → from `response.url()` (not from LLM)
   - `ok` → `true` if `!is_error_page` and `body_text` is non-empty
   - `status` → from HTTP response status
8. **Fallback:** if the LLM call fails (network error, JSON parse failure, schema validation error, timeout), fall back to the existing HTML parsing. Log a warning.
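
The field mapping in step 7 can be sketched as below. Both structs are hypothetical, trimmed-down versions used only for illustration: `LlmArticle` mirrors the JSON schema after deserialization, and this `ScrapedContent` keeps `published_date` as an `Option<String>` instead of a parsed date type, so the ISO 8601 parsing step itself is not shown.

```rust
/// Hypothetical mirror of the JSON schema above, after deserialization.
pub struct LlmArticle {
    pub title: String,
    pub published_date: String,
    pub body_text: String,
    pub is_error_page: bool,
}

/// Hypothetical, trimmed-down `ScrapedContent`; the real struct may differ
/// (for example, it may store a parsed date rather than a string).
#[derive(Debug, PartialEq)]
pub struct ScrapedContent {
    pub url: String,
    pub status: u16,
    pub ok: bool,
    pub title: Option<String>,
    pub published_date: Option<String>,
    pub body_text: String,
    pub is_soft_404: bool,
}

/// Maps the LLM answer plus HTTP metadata onto `ScrapedContent`.
/// `final_url` and `status` come from the HTTP response, never from the LLM.
pub fn scraped_from_llm(article: LlmArticle, final_url: &str, status: u16) -> ScrapedContent {
    // An empty string is the schema's "not found" marker and becomes `None`.
    let published_date =
        Some(article.published_date.trim().to_string()).filter(|date| !date.is_empty());
    let ok = !article.is_error_page && !article.body_text.trim().is_empty();
    ScrapedContent {
        url: final_url.to_string(),
        status,
        ok,
        title: Some(article.title),
        published_date,
        body_text: article.body_text,
        is_soft_404: article.is_error_page,
    }
}
```

Note that a page served with HTTP 200 can still end up with `ok == false` when the model flags it as an error page, which is the soft-404 case the `is_error_page` field exists for.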

When disabled, the existing scraper logic is used (unchanged), with the new `url` field populated from `response.url()`.

**Cost:** ~$0.001 per article with gpt-4o-mini. For 16 articles, ~$0.016 total.

**Concurrency:** LLM extraction calls run with bounded concurrency (max 5) to avoid hitting provider rate limits.

**Progress reporting:** During per-article LLM extraction, emit progress updates: "Extraction IA des articles (N/M)..."
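
One way to picture the bounded concurrency and progress reporting together is the std-only sketch below, which uses a fixed pool of at most five scoped threads pulling work from a shared counter. This is an illustration under assumptions: the real pipeline is likely async and may bound concurrency differently, and `extract_all`, `progress_message`, and `MAX_CONCURRENT_EXTRACTIONS` are hypothetical names.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Upper bound on simultaneous LLM extraction calls.
pub const MAX_CONCURRENT_EXTRACTIONS: usize = 5;

/// Formats the progress label shown during per-article extraction.
pub fn progress_message(done: usize, total: usize) -> String {
    format!("Extraction IA des articles ({done}/{total})...")
}

/// Runs `extract` over `urls` with at most `MAX_CONCURRENT_EXTRACTIONS`
/// calls in flight, reporting progress after each completed article.
/// Results are returned in input order.
pub fn extract_all<T, F, P>(urls: &[String], extract: F, on_progress: P) -> Vec<T>
where
    T: Send,
    F: Fn(&str) -> T + Sync,
    P: Fn(String) + Sync,
{
    let total = urls.len();
    let next = AtomicUsize::new(0);
    let done = AtomicUsize::new(0);
    let results: Mutex<Vec<Option<T>>> = Mutex::new((0..total).map(|_| None).collect());
    let workers = MAX_CONCURRENT_EXTRACTIONS.min(total);

    thread::scope(|scope| {
        for _ in 0..workers {
            scope.spawn(|| loop {
                // Each worker claims the next unprocessed index.
                let index = next.fetch_add(1, Ordering::SeqCst);
                if index >= total {
                    break;
                }
                let value = extract(&urls[index]);
                results.lock().unwrap()[index] = Some(value);
                let finished = done.fetch_add(1, Ordering::SeqCst) + 1;
                on_progress(progress_message(finished, total));
            });
        }
    });

    results
        .into_inner()
        .unwrap()
        .into_iter()
        .map(|slot| slot.expect("every index is processed exactly once"))
        .collect()
}
```

Because the number of workers is fixed, the cap holds no matter how many articles are queued; progress messages may arrive out of index order, but the counter `N` increases by one for each completed article.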

## Files to Modify

- **Create:** migration `20260324000014_add_llm_scraping_settings.sql`
- **Modify:** `backend/src/models/settings.rs` — add 2 bool fields to `UserSettings`, `SettingsResponse`, `UpdateSettingsRequest`, `Default`, validation (none needed for bools)
- **Modify:** `backend/src/db/settings.rs` — add to `SettingsRow`, `TryFrom`, both SQL queries
- **Modify:** `backend/src/services/scraper.rs` — add `url: String` to `ScrapedContent`, populate from `response.url()`
- **Modify:** `backend/src/services/source_scraper.rs` — add LLM-assisted link extraction path, accept provider + model + schema params
- **Modify:** `backend/src/services/synthesis.rs` — pass settings + provider to scraper functions, update `scrape_single_article` to return `ScrapedContent` and accept optional LLM provider, add LLM extraction path
- **Modify:** `backend/src/services/prompts.rs` — add `build_link_extraction_prompt` and `build_article_extraction_prompt`
- **Modify:** `backend/src/services/llm/schema.rs` — add `build_link_extraction_schema` and `build_article_extraction_schema`
- **Modify:** `frontend/src/types.ts` — add 2 bool fields to `UserSettings` + `DEFAULT_SETTINGS`
- **Modify:** `frontend/src/i18n/fr.ts` — add labels
- **Modify:** `frontend/src/pages/Settings.tsx` — add 2 checkboxes in "Extraction avancee" section
- **Modify:** `CLAUDE.md` — update migration count
- **Modify:** `frontend/src/__tests__/fixtures.ts` — add 2 bool fields to MOCK_SETTINGS if manually constructed
- **Modify:** `backend/tests/api_syntheses_test.rs` — update integration test for new settings fields
- **Modify:** `e2e/tests/generation-live.spec.ts` — update settings payload, add comprehensive synthesis validation
- **Add:** unit tests in `source_scraper.rs` — LLM link extraction, fallback
- **Add:** unit tests in `synthesis.rs` — LLM article extraction, fallback
- **Add:** unit tests in `prompts.rs` — link extraction and article extraction prompts

## E2E Synthesis Validation

The E2E test generates a synthesis and validates:
- No duplicate URLs across all sections
- All article URLs return HTTP 200 (fetch each to verify links work)
- Each user-defined category has ≤ `max_items_per_category` articles
- Each source domain appears ≤ `max_articles_per_source` times globally
- No empty titles or summaries
- No Wikipedia/hallucinated URLs
- "Autre" section (if present) respects max limit
- Every summary is non-trivial (> 50 chars)
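
To pin down a few of these rules precisely, here is a small sketch of the offline checks (duplicate URLs, empty titles, trivial summaries, per-category cap). The actual test lives in the TypeScript E2E suite; this Rust version only illustrates the rules, `SynthesisItem` and `validate_synthesis` are hypothetical names, and the network-dependent and per-domain checks are intentionally left out.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical flattened view of one article in a generated synthesis.
pub struct SynthesisItem {
    pub category: String,
    pub url: String,
    pub title: String,
    pub summary: String,
}

/// Returns a human-readable list of rule violations (empty when valid).
pub fn validate_synthesis(items: &[SynthesisItem], max_items_per_category: usize) -> Vec<String> {
    let mut problems = Vec::new();
    let mut seen_urls = HashSet::new();
    let mut per_category: HashMap<&str, usize> = HashMap::new();

    for item in items {
        if !seen_urls.insert(item.url.as_str()) {
            problems.push(format!("duplicate URL: {}", item.url));
        }
        if item.title.trim().is_empty() {
            problems.push(format!("empty title: {}", item.url));
        }
        // "Non-trivial" means strictly more than 50 characters.
        if item.summary.trim().chars().count() <= 50 {
            problems.push(format!("trivial summary: {}", item.url));
        }
        *per_category.entry(item.category.as_str()).or_insert(0) += 1;
    }

    // Sort so the report order is deterministic.
    let mut over_limit: Vec<(&str, usize)> = per_category
        .into_iter()
        .filter(|(_, count)| *count > max_items_per_category)
        .collect();
    over_limit.sort();
    for (category, count) in over_limit {
        problems.push(format!("category {category} has {count} items"));
    }
    problems
}
```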

## What Does NOT Change

- LLM providers — reused as-is (classification uses `generate_rewrite_pass`)
- Database schema for syntheses — no changes
- Frontend synthesis display — no changes
- Rewrite pass — operates on `ScrapedContent` data regardless of extraction method
- `limit_articles_per_source`, `dedup_by_url`, `filter_homepage_urls` — unchanged
- Classification pipeline (Phase 1/Phase 2) — unchanged, just receives better data