# Design: LLM-Assisted Scraping (Link Extraction & Article Content Extraction)

**Date**: 2026-03-24

**Scope**: Two optional LLM-powered enhancements to the scraping pipeline, controlled by user settings

---

## Context

The current scraping pipeline uses HTML parsing heuristics to extract article links from source pages and article content from individual pages. These heuristics fail on JavaScript-rendered pages, unusual HTML structures, and complex layouts. Two optional LLM-powered alternatives improve extraction quality when enabled.

## New User Settings

Two independent boolean toggles:

- `use_llm_for_source_links: bool` (default `false`): "Utiliser l'IA pour extraire les liens"
- `use_llm_for_article_extraction: bool` (default `false`): "Utiliser l'IA pour extraire le contenu"

The toggles are fully independent: the user can enable either, both, or neither.

**Migration:** `ALTER TABLE settings ADD COLUMN use_llm_for_source_links BOOLEAN NOT NULL DEFAULT false; ALTER TABLE settings ADD COLUMN use_llm_for_article_extraction BOOLEAN NOT NULL DEFAULT false;`

**Frontend:** Two checkboxes in the Settings page under a new "Extraction avancee" section.

## ScrapedContent URL Field

Add `pub url: String` to the `ScrapedContent` struct. It is populated with the final URL after redirects (from `response.url().to_string()`).

**Pipeline impact:** `scrape_single_article` returns `(String, String, String)`, i.e. `(body_text, page_title, final_url)`, instead of `(String, String)`. The callers (`scrape_flat_urls`, `scrape_articles`) use `final_url` to set `ScrapedNewsItem.url`, replacing the original input URL with the validated, redirect-resolved URL. This URL becomes the canonical article URL used throughout, replacing the LLM-provided URL in the synthesis via `restore_scraped_urls`.

## Option 1: LLM-Assisted Source Link Extraction

When `use_llm_for_source_links` is enabled:

1. Fetch the source page HTML (same as today)
2. Extract `<head>` + first 8000 chars of `<body>` for the LLM
3. **LLM prompt:** "Here is the HTML of a blog/news page. Extract only the URLs that point to actual articles (not navigation, tags, categories, login pages, etc.). Return a JSON array of URLs."
4. **LLM schema:** `{ "type": "object", "properties": { "urls": { "type": "array", "items": { "type": "string" } } }, "required": ["urls"], "additionalProperties": false }`
5. Parse the LLM response:
   - Resolve relative URLs against the source URL
   - Filter: only keep http/https URLs, skip malformed URLs (use `Url::parse`)
   - Filter: same domain only (match existing heuristic behavior)
   - Deduplicate, limit to `max_links`
6. **Fallback:** if the LLM call fails OR returns an empty array (`{"urls": []}`), fall back to the existing `extract_links_from_html`. Log a warning.

When disabled, the existing HTML parsing + heuristic filtering is used (unchanged).

**LLM dispatch:** Uses `model_research` via `provider.generate_rewrite_pass`.

## Option 2: LLM-Assisted Article Content Extraction

When `use_llm_for_article_extraction` is enabled:

1. Fetch the article page (same as today: HTTP request, SSRF check, body size limit, streaming)
2. Capture the final URL after redirects (for `ScrapedContent.url`)
3. Extract the `<head>` section and clean body text using existing HTML stripping
4. Send both to the LLM with a structured extraction prompt
5. **LLM prompt:** "Extract the following from this article: title, publication date (ISO 8601 format, or empty string if not found), body text (main article content only, no navigation or ads), and whether this is a real article or an error/404 page."
6. **LLM schema:** (OpenAI strict mode compatible: no union types; `published_date` uses an empty string instead of null)

   ```json
   {
     "type": "object",
     "properties": {
       "title": { "type": "string", "description": "Article title" },
       "published_date": { "type": "string", "description": "ISO 8601 date or empty string if not found" },
       "body_text": { "type": "string", "description": "Main article content" },
       "is_error_page": { "type": "boolean", "description": "True if this is an error/404 page" }
     },
     "required": ["title", "published_date", "body_text", "is_error_page"],
     "additionalProperties": false
   }
   ```

7. Parse the LLM response into `ScrapedContent` fields:
   - `title` → `ScrapedContent.title` (wrapped in `Some`)
   - `published_date` → if non-empty, parse ISO 8601 → `ScrapedContent.published_date`; if empty string → `None`
   - `body_text` → `ScrapedContent.body_text`
   - `is_error_page` → `ScrapedContent.is_soft_404`
   - `url` → from `response.url()` (not from LLM)
   - `ok` → `true` if `!is_error_page` and `body_text` is non-empty
   - `status` → from HTTP response status
8. **Fallback:** if the LLM call fails (network error, JSON parse failure, schema validation error, timeout), fall back to the existing HTML parsing. Log a warning.

When disabled, the existing scraper logic is used (unchanged), with the new `url` field populated from `response.url()`.

**Cost:** ~$0.001 per article with gpt-4o-mini. For 16 articles, ~$0.016 total.

**Concurrency:** LLM extraction calls run with bounded concurrency (max 5) to avoid hitting provider rate limits.

**Progress reporting:** During per-article LLM extraction, emit progress updates: "Extraction IA des articles (N/M)..."

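The field mapping in step 7 of Option 2 can be sketched as a small pure function. This is a minimal illustration, not the actual implementation: `LlmArticleExtraction`, `into_scraped_content`, and the exact field set and types of `ScrapedContent` shown here are assumptions, and the sketch keeps `published_date` as a string rather than parsing ISO 8601 (which the real code would presumably do with a date library).

```rust
/// Hypothetical shape of the already-deserialized LLM response.
/// Field names mirror the article extraction JSON schema.
#[derive(Debug, Clone)]
pub struct LlmArticleExtraction {
    pub title: String,
    /// ISO 8601 date, or an empty string when the LLM found none.
    pub published_date: String,
    pub body_text: String,
    pub is_error_page: bool,
}

/// Assumed subset of `ScrapedContent`; the real struct may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct ScrapedContent {
    pub url: String,
    pub title: Option<String>,
    /// Kept as a raw string here; date parsing is out of scope for this sketch.
    pub published_date: Option<String>,
    pub body_text: String,
    pub is_soft_404: bool,
    pub ok: bool,
    pub status: u16,
}

/// Maps an LLM extraction result onto `ScrapedContent`.
/// `final_url` and `status` come from the HTTP response, never from the LLM.
pub fn into_scraped_content(
    llm: LlmArticleExtraction,
    final_url: &str,
    status: u16,
) -> ScrapedContent {
    // An empty string is the schema's stand-in for "no date found".
    let published_date = if llm.published_date.is_empty() {
        None
    } else {
        Some(llm.published_date)
    };
    // `ok` requires a real article with a non-empty body.
    let ok = !llm.is_error_page && !llm.body_text.is_empty();

    ScrapedContent {
        url: final_url.to_string(),
        title: Some(llm.title),
        published_date,
        body_text: llm.body_text,
        is_soft_404: llm.is_error_page,
        ok,
        status,
    }
}
```

Keeping the mapping free of I/O makes the fallback path easy to unit test: the caller only reaches this function after the LLM call and JSON parsing have both succeeded.
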
## Files to Modify

- **Create:** migration `20260324000014_add_llm_scraping_settings.sql`
- **Modify:** `backend/src/models/settings.rs`: add 2 bool fields to `UserSettings`, `SettingsResponse`, `UpdateSettingsRequest`, `Default`, validation (none needed for bools)
- **Modify:** `backend/src/db/settings.rs`: add to `SettingsRow`, `TryFrom`, both SQL queries
- **Modify:** `backend/src/services/scraper.rs`: add `url: String` to `ScrapedContent`, populate from `response.url()`
- **Modify:** `backend/src/services/source_scraper.rs`: add LLM-assisted link extraction path, accept provider + model + schema params
- **Modify:** `backend/src/services/synthesis.rs`: pass settings + provider to scraper functions, update `scrape_single_article` to return `ScrapedContent` and accept optional LLM provider, add LLM extraction path
- **Modify:** `backend/src/services/prompts.rs`: add `build_link_extraction_prompt` and `build_article_extraction_prompt`
- **Modify:** `backend/src/services/llm/schema.rs`: add `build_link_extraction_schema` and `build_article_extraction_schema`
- **Modify:** `frontend/src/types.ts`: add 2 bool fields to `UserSettings` + `DEFAULT_SETTINGS`
- **Modify:** `frontend/src/i18n/fr.ts`: add labels
- **Modify:** `frontend/src/pages/Settings.tsx`: add 2 checkboxes in "Extraction avancee" section
- **Modify:** `CLAUDE.md`: update migration count
- **Modify:** `frontend/src/__tests__/fixtures.ts`: add 2 bool fields to MOCK_SETTINGS if manually constructed
- **Modify:** `backend/tests/api_syntheses_test.rs`: update integration test for new settings fields
- **Modify:** `e2e/tests/generation-live.spec.ts`: update settings payload, add comprehensive synthesis validation
- **Add:** unit tests in `source_scraper.rs`: LLM link extraction, fallback
- **Add:** unit tests in `synthesis.rs`: LLM article extraction, fallback
- **Add:** unit tests in `prompts.rs`: link extraction and article extraction prompts

## E2E Synthesis Validation

The E2E test generates a synthesis and validates:

- No duplicate URLs across all sections
- All article URLs return HTTP 200 (fetch each to verify links work)
- Each user-defined category has ≤ `max_items_per_category` articles
- Each source domain appears ≤ `max_articles_per_source` times globally
- No empty titles or summaries
- No Wikipedia/hallucinated URLs
- "Autre" section (if present) respects max limit
- Every summary is non-trivial (> 50 chars)

## What Does NOT Change

- LLM providers: reused as-is (classification uses `generate_rewrite_pass`)
- Database schema for syntheses: no changes
- Frontend synthesis display: no changes
- Rewrite pass: operates on `ScrapedContent` data regardless of extraction method
- `limit_articles_per_source`, `dedup_by_url`, `filter_homepage_urls`: unchanged
- Classification pipeline (Phase 1/Phase 2): unchanged, just receives better data
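
As an appendix to the E2E Synthesis Validation section, three of the offline invariants (no duplicate URLs, the per-domain cap, non-trivial summaries) can be sketched as one checking function. The sketch is in Rust for consistency with the backend, although the actual test lives in `e2e/tests/generation-live.spec.ts`. `SynthesisArticle` and `validate_articles` are hypothetical names, and the source domain is assumed to be precomputed rather than parsed from the URL.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical flattened view of one article in a generated synthesis.
pub struct SynthesisArticle {
    pub url: String,
    /// Assumed to be precomputed by the caller (e.g. the URL's host).
    pub source_domain: String,
    pub summary: String,
}

/// Returns a human-readable message for each violated invariant.
/// An empty vector means all three checks passed.
pub fn validate_articles(
    articles: &[SynthesisArticle],
    max_articles_per_source: usize,
) -> Vec<String> {
    let mut violations = Vec::new();
    let mut seen_urls: HashSet<&str> = HashSet::new();
    let mut per_domain: HashMap<&str, usize> = HashMap::new();

    for article in articles {
        // `insert` returns false when the URL was already present.
        if !seen_urls.insert(article.url.as_str()) {
            violations.push(format!("duplicate URL: {}", article.url));
        }
        *per_domain.entry(article.source_domain.as_str()).or_insert(0) += 1;
        // Non-trivial means strictly more than 50 characters.
        if article.summary.chars().count() <= 50 {
            violations.push(format!("trivial summary for {}", article.url));
        }
    }

    // Sort domains so the report order is deterministic.
    let mut domains: Vec<(&str, usize)> = per_domain.into_iter().collect();
    domains.sort();
    for (domain, count) in domains {
        if count > max_articles_per_source {
            violations.push(format!(
                "{} appears {} times (max {})",
                domain, count, max_articles_per_source
            ));
        }
    }
    violations
}
```

Collecting all violations instead of failing on the first one gives a single test run enough detail to see every broken invariant at once.
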