Design: LLM-Assisted Scraping — Link Extraction & Article Content Extraction
Date: 2026-03-24
Scope: Two optional LLM-powered enhancements to the scraping pipeline, controlled by user settings
Context
The current scraping pipeline uses HTML parsing heuristics to extract article links from source pages and article content from individual pages. These heuristics fail on JavaScript-rendered pages, unusual HTML structures, and complex layouts. Two optional LLM-powered alternatives improve extraction quality when enabled.
New User Settings
Two independent boolean toggles:
- use_llm_for_source_links: bool (default false), UI label "Utiliser l'IA pour extraire les liens" ("Use AI to extract links")
- use_llm_for_article_extraction: bool (default false), UI label "Utiliser l'IA pour extraire le contenu" ("Use AI to extract content")
The toggles are fully independent: the user can enable either, both, or neither.
Migration:
ALTER TABLE settings ADD COLUMN use_llm_for_source_links BOOLEAN NOT NULL DEFAULT false;
ALTER TABLE settings ADD COLUMN use_llm_for_article_extraction BOOLEAN NOT NULL DEFAULT false;
Frontend: Two checkboxes on the Settings page under a new "Extraction avancee" ("Advanced extraction") section.
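On the backend side, the two toggles might be modeled as in the minimal sketch below. The struct name LlmScrapingSettings is hypothetical (the design adds the fields to the existing UserSettings); the point is only that deriving Default yields false for both flags, matching the DEFAULT false in the migration.

```rust
/// Hypothetical subset of the user settings touched by this design.
/// In the real code these fields would live on `UserSettings`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct LlmScrapingSettings {
    /// Mirrors the `use_llm_for_source_links` column (DEFAULT false).
    pub use_llm_for_source_links: bool,
    /// Mirrors the `use_llm_for_article_extraction` column (DEFAULT false).
    pub use_llm_for_article_extraction: bool,
}
```

Because the flags are independent, each extraction path can check its own field without looking at the other.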
ScrapedContent URL Field
Add pub url: String to the ScrapedContent struct. Populated with the final URL after redirects (from response.url().to_string()).
Pipeline impact: scrape_single_article returns (String, String, String), i.e. (body_text, page_title, final_url), instead of (String, String). The callers (scrape_flat_urls, scrape_articles) use final_url to set ScrapedNewsItem.url, replacing the original input URL with the validated, redirect-resolved URL. This becomes the canonical article URL used throughout; in the synthesis, restore_scraped_urls substitutes it for the LLM-provided URL.
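A minimal sketch of how a caller could consume the new tuple. The ScrapedNewsItem shown here is a trimmed-down stand-in with only three fields, item_from_scrape is a hypothetical helper name, and the empty-URL fallback is a defensive assumption rather than something the design specifies.

```rust
/// Hypothetical, trimmed-down stand-in for the real `ScrapedNewsItem`.
#[derive(Debug, Clone, PartialEq)]
pub struct ScrapedNewsItem {
    pub url: String,
    pub title: String,
    pub body_text: String,
}

/// Builds an item from the `(body_text, page_title, final_url)` tuple,
/// keeping the redirect-resolved URL rather than the input URL.
pub fn item_from_scrape(input_url: &str, scraped: (String, String, String)) -> ScrapedNewsItem {
    let (body_text, page_title, final_url) = scraped;
    // Assumption: if no final URL was captured, keep the input URL.
    let url = if final_url.is_empty() {
        input_url.to_string()
    } else {
        final_url
    };
    ScrapedNewsItem { url, title: page_title, body_text }
}
```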
Option 1: LLM-Assisted Source Link Extraction
When use_llm_for_source_links is enabled:
- Fetch the source page HTML (same as today)
- Extract <head> + first 8000 chars of <body> for the LLM
- LLM prompt: "Here is the HTML of a blog/news page. Extract only the URLs that point to actual articles (not navigation, tags, categories, login pages, etc.). Return a JSON array of URLs."
- LLM schema:
  { "type": "object", "properties": { "urls": { "type": "array", "items": { "type": "string" } } }, "required": ["urls"], "additionalProperties": false }
- Parse the LLM response:
  - Resolve relative URLs against the source URL
  - Filter: only keep http/https URLs, skip malformed URLs (use Url::parse)
  - Filter: same domain only (match existing heuristic behavior)
  - Deduplicate, limit to max_links
- Fallback: if the LLM call fails OR returns an empty array ({"urls": []}), fall back to the existing extract_links_from_html. Log a warning.
When disabled, the existing HTML parsing + heuristic filtering is used (unchanged).
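The input-trimming step of the LLM path (the <head> plus the first 8000 characters of <body>) could look roughly like the sketch below. The function name build_llm_html_excerpt is hypothetical, and the tag lookup is deliberately naive string matching rather than the project's real HTML parser. Truncation counts characters, not bytes, so multi-byte text is never split.

```rust
/// Returns the `<head>...</head>` block (if found) followed by at most
/// `max_body_chars` characters of the document starting at `<body`.
/// Naive sketch: tags are located by case-insensitive substring search.
pub fn build_llm_html_excerpt(html: &str, max_body_chars: usize) -> String {
    // ASCII lowercasing keeps byte offsets identical to the original string.
    let lower = html.to_ascii_lowercase();

    let head = match (lower.find("<head"), lower.find("</head>")) {
        (Some(start), Some(end)) if start < end => &html[start..end + "</head>".len()],
        _ => "",
    };

    match lower.find("<body") {
        Some(start) => {
            // `chars().take(n)` truncates on character boundaries.
            let body_excerpt: String = html[start..].chars().take(max_body_chars).collect();
            format!("{}{}", head, body_excerpt)
        }
        // No <body> tag: fall back to a plain prefix of the document.
        None => html.chars().take(max_body_chars).collect(),
    }
}
```

With max_body_chars set to 8000 this matches the limit stated in the steps above. The limit bounds prompt size, so very long listing pages may lose links that appear late in the body.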
LLM dispatch: Uses model_research via provider.generate_rewrite_pass.
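Once the LLM has answered, the post-processing and fallback rules can be sketched as follows. This is an illustration under stated assumptions: scheme_and_host is a simplified, std-only stand-in for Url::parse (it resolves only root-relative paths and does no normalization), clean_llm_links and links_or_fallback are hypothetical names, and the heuristic closure stands in for extract_links_from_html.

```rust
use std::collections::HashSet;

/// Splits an absolute http(s) URL into (scheme, host); `None` otherwise.
/// Simplified stand-in for `Url::parse`: no userinfo or port normalization.
fn scheme_and_host(url: &str) -> Option<(&str, &str)> {
    let (scheme, rest) = url.split_once("://")?;
    if scheme != "http" && scheme != "https" {
        return None;
    }
    let host = rest.split(&['/', '?', '#'][..]).next()?;
    if host.is_empty() { None } else { Some((scheme, host)) }
}

/// Resolves, filters, deduplicates and caps the URLs returned by the LLM.
pub fn clean_llm_links(candidates: &[String], source_url: &str, max_links: usize) -> Vec<String> {
    let (scheme, host) = match scheme_and_host(source_url) {
        Some(parts) => parts,
        None => return Vec::new(),
    };
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for raw in candidates {
        if out.len() >= max_links {
            break;
        }
        let raw = raw.trim();
        // Resolve root-relative paths against the source origin.
        let absolute = if raw.starts_with('/') && !raw.starts_with("//") {
            format!("{}://{}{}", scheme, host, raw)
        } else {
            raw.to_string()
        };
        // Keep only well-formed http/https URLs on the same host.
        let same_host =
            matches!(scheme_and_host(&absolute), Some((_, h)) if h.eq_ignore_ascii_case(host));
        if same_host && seen.insert(absolute.clone()) {
            out.push(absolute);
        }
    }
    out
}

/// Uses the LLM result when it succeeded with a non-empty list,
/// otherwise falls back to the heuristic extractor.
pub fn links_or_fallback<E, F>(
    llm_result: Result<Vec<String>, E>,
    source_url: &str,
    max_links: usize,
    heuristic: F,
) -> Vec<String>
where
    F: FnOnce() -> Vec<String>,
{
    match llm_result {
        Ok(urls) if !urls.is_empty() => clean_llm_links(&urls, source_url, max_links),
        // LLM failure or `{"urls": []}`: the real code would log a warning here.
        _ => heuristic(),
    }
}
```

Passing the heuristic as a closure means it only runs when the fallback is actually needed.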
Option 2: LLM-Assisted Article Content Extraction
When use_llm_for_article_extraction is enabled:
- Fetch the article page (same as today: HTTP request, SSRF check, body size limit, streaming)
- Capture the final URL after redirects (for ScrapedContent.url)
- Extract <head> section and clean body text using existing HTML stripping
- Send both to the LLM with a structured extraction prompt
- LLM prompt: "Extract the following from this article: title, publication date (ISO 8601 format, or empty string if not found), body text (main article content only, no navigation or ads), and whether this is a real article or an error/404 page."
- LLM schema (OpenAI strict mode compatible: no union types; published_date uses an empty string instead of null):
{
"type": "object",
"properties": {
"title": { "type": "string", "description": "Article title" },
"published_date": { "type": "string", "description": "ISO 8601 date or empty string if not found" },
"body_text": { "type": "string", "description": "Main article content" },
"is_error_page": { "type": "boolean", "description": "True if this is an error/404 page" }
},
"required": ["title", "published_date", "body_text", "is_error_page"],
"additionalProperties": false
}
- Parse the LLM response into ScrapedContent fields:
  - title → ScrapedContent.title (wrapped in Some)
  - published_date → if non-empty, parse ISO 8601 → ScrapedContent.published_date; if empty string → None
  - body_text → ScrapedContent.body_text
  - is_error_page → ScrapedContent.is_soft_404
  - url → from response.url() (not from LLM)
  - ok → true if !is_error_page and body_text is non-empty
  - status → from HTTP response status
- Fallback: if the LLM call fails (network error, JSON parse failure, schema validation error, timeout), fall back to the existing HTML parsing. Log a warning.
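The field mapping above can be sketched as below. Both structs are hypothetical, trimmed-down stand-ins: LlmArticleExtraction mirrors the JSON schema, and ScrapedContent carries only the fields named in this design. The design does not show the real type of published_date, so the sketch keeps the raw ISO 8601 string instead of parsing it.

```rust
/// Hypothetical mirror of the JSON object returned by the LLM.
#[derive(Debug, Clone)]
pub struct LlmArticleExtraction {
    pub title: String,
    pub published_date: String,
    pub body_text: String,
    pub is_error_page: bool,
}

/// Trimmed-down stand-in for the real `ScrapedContent`.
#[derive(Debug, Clone, PartialEq)]
pub struct ScrapedContent {
    pub title: Option<String>,
    /// Assumption: kept as a raw ISO 8601 string; the real code parses it.
    pub published_date: Option<String>,
    pub body_text: String,
    pub is_soft_404: bool,
    pub url: String,
    pub ok: bool,
    pub status: u16,
}

/// Maps the LLM output plus HTTP metadata onto `ScrapedContent`.
/// `final_url` and `status` come from the HTTP response, never from the LLM.
pub fn to_scraped_content(llm: LlmArticleExtraction, final_url: &str, status: u16) -> ScrapedContent {
    // Empty string is the schema's encoding of "no date found".
    let published_date = if llm.published_date.is_empty() {
        None
    } else {
        Some(llm.published_date)
    };
    let ok = !llm.is_error_page && !llm.body_text.is_empty();
    ScrapedContent {
        title: Some(llm.title),
        published_date,
        body_text: llm.body_text,
        is_soft_404: llm.is_error_page,
        url: final_url.to_string(),
        ok,
        status,
    }
}
```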
When disabled, the existing scraper logic is used (unchanged), with the new url field populated from response.url().
Cost: ~$0.001 per article with gpt-4o-mini. For 16 articles, ~$0.016 total.
Concurrency: LLM extraction calls run with bounded concurrency (max 5) to avoid hitting provider rate limits.
Progress reporting: During per-article LLM extraction, emit progress updates: "Extraction IA des articles (N/M)..."
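The concurrency cap and the progress message can be illustrated together. The sketch below uses scoped std threads purely so that it is self-contained; the real pipeline is presumably async and would more likely use a semaphore or a buffered stream. run_bounded, progress_message and the on_progress callback are hypothetical names.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Formats the per-article progress update.
pub fn progress_message(done: usize, total: usize) -> String {
    format!("Extraction IA des articles ({}/{})...", done, total)
}

/// Runs `job` over `inputs` with at most `max_in_flight` calls executing
/// at once, reporting progress after each one. Results keep input order.
pub fn run_bounded<T, R, F, P>(inputs: &[T], max_in_flight: usize, job: F, on_progress: P) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
    P: Fn(usize, usize) + Sync,
{
    let total = inputs.len();
    let next = AtomicUsize::new(0);
    let done = AtomicUsize::new(0);
    let results: Mutex<Vec<Option<R>>> = Mutex::new((0..total).map(|_| None).collect());
    // The worker count is the concurrency bound.
    let workers = max_in_flight.max(1).min(total);

    thread::scope(|scope| {
        for _ in 0..workers {
            scope.spawn(|| loop {
                // Claim the next unprocessed index; stop when none are left.
                let i = next.fetch_add(1, Ordering::Relaxed);
                if i >= total {
                    break;
                }
                let out = job(&inputs[i]);
                {
                    let mut slots = results.lock().unwrap();
                    slots[i] = Some(out);
                }
                let finished = done.fetch_add(1, Ordering::Relaxed) + 1;
                on_progress(finished, total);
            });
        }
    });

    results
        .into_inner()
        .unwrap()
        .into_iter()
        .map(|slot| slot.expect("every index is processed exactly once"))
        .collect()
}
```

Calling run_bounded(&urls, 5, extract, report) would match the stated limit of 5. Progress callbacks may arrive out of order across workers, so a consumer that needs a monotonic counter should keep the maximum value seen.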
Files to Modify
- Create: migration 20260324000014_add_llm_scraping_settings.sql
- Modify: backend/src/models/settings.rs - add 2 bool fields to UserSettings, SettingsResponse, UpdateSettingsRequest, Default, validation (none needed for bools)
- Modify: backend/src/db/settings.rs - add to SettingsRow, TryFrom, both SQL queries
- Modify: backend/src/services/scraper.rs - add url: String to ScrapedContent, populate from response.url()
- Modify: backend/src/services/source_scraper.rs - add LLM-assisted link extraction path, accept provider + model + schema params
- Modify: backend/src/services/synthesis.rs - pass settings + provider to scraper functions, update scrape_single_article to return ScrapedContent and accept optional LLM provider, add LLM extraction path
- Modify: backend/src/services/prompts.rs - add build_link_extraction_prompt and build_article_extraction_prompt
- Modify: backend/src/services/llm/schema.rs - add build_link_extraction_schema and build_article_extraction_schema
- Modify: frontend/src/types.ts - add 2 bool fields to UserSettings + DEFAULT_SETTINGS
- Modify: frontend/src/i18n/fr.ts - add labels
- Modify: frontend/src/pages/Settings.tsx - add 2 checkboxes in "Extraction avancee" section
- Modify: CLAUDE.md - update migration count
- Modify: frontend/src/__tests__/fixtures.ts - add 2 bool fields to MOCK_SETTINGS if manually constructed
- Modify: backend/tests/api_syntheses_test.rs - update integration test for new settings fields
- Modify: e2e/tests/generation-live.spec.ts - update settings payload, add comprehensive synthesis validation
- Add: unit tests in source_scraper.rs - LLM link extraction, fallback
- Add: unit tests in synthesis.rs - LLM article extraction, fallback
- Add: unit tests in prompts.rs - link extraction and article extraction prompts
E2E Synthesis Validation
The E2E test generates a synthesis and validates:
- No duplicate URLs across all sections
- All article URLs return HTTP 200 (fetch each to verify links work)
- Each user-defined category has ≤ max_items_per_category articles
- Each source domain appears ≤ max_articles_per_source times globally
- No empty titles or summaries
- No Wikipedia/hallucinated URLs
- "Autre" section (if present) respects max limit
- Every summary is non-trivial (> 50 chars)
What Does NOT Change
- LLM providers: reused as-is (classification uses generate_rewrite_pass)
- Database schema for syntheses: no changes
- Frontend synthesis display: no changes
- Rewrite pass: operates on ScrapedContent data regardless of extraction method
- limit_articles_per_source, dedup_by_url, filter_homepage_urls: unchanged
- Classification pipeline (Phase 1/Phase 2): unchanged, just receives better data