Design: LLM-Assisted Scraping — Link Extraction & Article Content Extraction

Date: 2026-03-24
Scope: Two optional LLM-powered enhancements to the scraping pipeline, controlled by user settings

Context

The current scraping pipeline uses HTML parsing heuristics to extract article links from source pages and article content from individual pages. These heuristics fail on JavaScript-rendered pages, unusual HTML structures, and complex layouts. Two optional LLM-powered alternatives improve extraction quality when enabled.

New User Settings

Two independent boolean toggles:

use_llm_for_source_links: bool (default false), UI label "Utiliser l'IA pour extraire les liens" ("Use AI to extract links")
use_llm_for_article_extraction: bool (default false), UI label "Utiliser l'IA pour extraire le contenu" ("Use AI to extract content")

The toggles are fully independent: a user can enable either, both, or neither.

Migration:
ALTER TABLE settings ADD COLUMN use_llm_for_source_links BOOLEAN NOT NULL DEFAULT false;
ALTER TABLE settings ADD COLUMN use_llm_for_article_extraction BOOLEAN NOT NULL DEFAULT false;

Frontend: two checkboxes on the Settings page, under a new "Extraction avancee" ("Advanced extraction") section.
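A minimal sketch of how the two toggles could look on the Rust side. The struct name LlmScrapingSettings is hypothetical (the real fields would live on UserSettings next to the existing ones); only the two field names and their false defaults come from this design.

```rust
/// Hypothetical holder for the two new toggles; in the real code these
/// fields are added to the existing settings structs rather than split out.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct LlmScrapingSettings {
    /// Mirrors the `use_llm_for_source_links` column (DEFAULT false).
    pub use_llm_for_source_links: bool,
    /// Mirrors the `use_llm_for_article_extraction` column (DEFAULT false).
    pub use_llm_for_article_extraction: bool,
}
```

Deriving Default yields false for both fields, which matches the column defaults in the migration, so existing users keep the current heuristic behavior until they opt in.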

ScrapedContent URL Field

Add a pub url: String field to the ScrapedContent struct, populated with the final URL after redirects (from response.url().to_string()).

Pipeline impact: scrape_single_article returns (String, String, String), that is (body_text, page_title, final_url), instead of (String, String). The callers (scrape_flat_urls, scrape_articles) use final_url to set ScrapedNewsItem.url, replacing the original input URL with the validated, redirect-resolved URL. This URL becomes the canonical article URL used throughout: restore_scraped_urls substitutes it for the LLM-provided URL in the synthesis.

Option 1: LLM-Assisted Source Link Extraction

When use_llm_for_source_links is enabled:

Fetch the source page HTML (same as today)
Extract <head> + first 8000 chars of <body> for the LLM
LLM prompt: "Here is the HTML of a blog/news page. Extract only the URLs that point to actual articles (not navigation, tags, categories, login pages, etc.). Return a JSON array of URLs."
LLM schema: { "type": "object", "properties": { "urls": { "type": "array", "items": { "type": "string" } } }, "required": ["urls"], "additionalProperties": false }
Parse the LLM response:
- Resolve relative URLs against the source URL
- Filter: only keep http/https URLs, skip malformed URLs (use Url::parse)
- Filter: same domain only (match existing heuristic behavior)
- Deduplicate, limit to max_links
Fallback: if the LLM call fails OR returns an empty array ({"urls": []}), fall back to the existing extract_links_from_html. Log a warning.
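The post-processing of the LLM's URL list (scheme filter, same-domain filter, deduplication, cap at max_links) can be sketched as follows. This is a std-only illustration under stated assumptions: http_host is a deliberately simplistic, hypothetical stand-in for Url::parse, "same domain" is read as an exact host match, and relative-URL resolution is left out (the url crate's Url::join would handle it in real code).

```rust
use std::collections::HashSet;

/// Host of an absolute http/https URL, or None for anything else.
/// Simplistic stand-in for `Url::parse`: no userinfo, IPv6, or
/// upper-case scheme handling.
fn http_host(url: &str) -> Option<String> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    // The authority ends at the first '/', '?' or '#'.
    let authority = rest.split(|c: char| matches!(c, '/' | '?' | '#')).next()?;
    // Drop an optional ":port" suffix.
    let host = authority.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host.to_ascii_lowercase())
    }
}

/// Keeps http/https URLs on the source's host, deduplicated in
/// first-seen order and capped at `max_links`.
fn filter_llm_links(source_url: &str, candidates: &[String], max_links: usize) -> Vec<String> {
    let Some(source_host) = http_host(source_url) else {
        return Vec::new();
    };
    let mut seen = HashSet::new();
    let mut kept = Vec::new();
    for url in candidates {
        if kept.len() == max_links {
            break;
        }
        match http_host(url) {
            Some(host) if host == source_host => {
                if seen.insert(url.clone()) {
                    kept.push(url.clone());
                }
            }
            _ => {} // other scheme, malformed, or foreign domain
        }
    }
    kept
}
```

Filtering before applying the cap means that rejected URLs never consume one of the max_links slots.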

When disabled, the existing HTML parsing + heuristic filtering is used (unchanged).
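The choice between the two paths, including the fallback rule, can be sketched as a small dispatcher. Everything named here (LinkSource, select_links, the closure parameters) is hypothetical; the real functions are presumably async and take the provider and model as arguments, whereas this sketch uses plain closures to keep the control flow visible.

```rust
/// Which path produced the links (hypothetical, for illustration only).
#[derive(Debug, PartialEq, Eq)]
enum LinkSource {
    Llm,
    Heuristic,
}

/// Runs the LLM path when enabled; falls back to the heuristic when the
/// call fails or returns an empty list, logging a warning in both cases.
fn select_links<E: std::fmt::Display>(
    use_llm: bool,
    llm_call: impl FnOnce() -> Result<Vec<String>, E>,
    heuristic: impl FnOnce() -> Vec<String>,
) -> (LinkSource, Vec<String>) {
    if !use_llm {
        return (LinkSource::Heuristic, heuristic());
    }
    match llm_call() {
        Ok(urls) if !urls.is_empty() => (LinkSource::Llm, urls),
        Ok(_) => {
            eprintln!("warning: LLM returned no links, falling back to HTML heuristics");
            (LinkSource::Heuristic, heuristic())
        }
        Err(err) => {
            eprintln!("warning: LLM link extraction failed ({err}), falling back to HTML heuristics");
            (LinkSource::Heuristic, heuristic())
        }
    }
}
```

Taking the heuristic as a closure means it only runs when it is actually needed, so a successful LLM call does not pay for HTML parsing it will not use.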

LLM dispatch: Uses model_research via provider.generate_rewrite_pass.
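The input handed to this call (the head element plus the first 8000 characters of the body, per the steps above) could be prepared roughly as below. This is a sketch using plain string search rather than the project's HTML parser; the function name, the constant name, and the choice to count the opening body tag toward the limit are all assumptions.

```rust
/// Character budget for the body excerpt, taken from the design text.
const BODY_CHAR_LIMIT: usize = 8000;

/// Returns the head element (if found) followed by at most
/// `BODY_CHAR_LIMIT` characters starting at the opening body tag.
/// Naive by design: a `<header>` tag appearing before `<head>` would
/// confuse the search.
fn html_excerpt_for_llm(html: &str) -> String {
    // ASCII lowercasing keeps byte offsets identical to the original,
    // so indices found in `lower` are valid slice bounds for `html`.
    let lower = html.to_ascii_lowercase();
    let head = match (lower.find("<head"), lower.find("</head>")) {
        (Some(start), Some(end)) if start < end => &html[start..end + "</head>".len()],
        _ => "",
    };
    let body = lower.find("<body").map_or("", |start| &html[start..]);
    // Truncate by characters, not bytes, so multi-byte text is never split.
    let body_excerpt: String = body.chars().take(BODY_CHAR_LIMIT).collect();
    format!("{head}{body_excerpt}")
}
```

Counting characters instead of slicing at byte 8000 matters for French content: a byte-index slice in the middle of an accented letter would panic.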

Option 2: LLM-Assisted Article Content Extraction

When use_llm_for_article_extraction is enabled:

Fetch the article page (same as today — HTTP request, SSRF check, body size limit, streaming)
Capture the final URL after redirects (for ScrapedContent.url)
Extract <head> section and clean body text using existing HTML stripping
Send both to the LLM with a structured extraction prompt
LLM prompt: "Extract the following from this article: title, publication date (ISO 8601 format, or empty string if not found), body text (main article content only, no navigation or ads), and whether this is a real article or an error/404 page."
LLM schema: (OpenAI strict mode compatible — no union types, published_date uses empty string instead of null)

{
  "type": "object",
  "properties": {
    "title": { "type": "string", "description": "Article title" },
    "published_date": { "type": "string", "description": "ISO 8601 date or empty string if not found" },
    "body_text": { "type": "string", "description": "Main article content" },
    "is_error_page": { "type": "boolean", "description": "True if this is an error/404 page" }
  },
  "required": ["title", "published_date", "body_text", "is_error_page"],
  "additionalProperties": false
}

Parse the LLM response into ScrapedContent fields:
- title → ScrapedContent.title (wrapped in Some)
- published_date → if non-empty, parse ISO 8601 → ScrapedContent.published_date; if empty string → None
- body_text → ScrapedContent.body_text
- is_error_page → ScrapedContent.is_soft_404
- url → from response.url() (not from LLM)
- ok → true if !is_error_page and body_text is non-empty
- status → from HTTP response status
Fallback: if the LLM call fails (network error, JSON parse failure, schema validation error, timeout), fall back to the existing HTML parsing. Log a warning.
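The field mapping listed above can be sketched as a single conversion function. The structs here are reduced, hypothetical stand-ins: LlmArticle mirrors the JSON schema, and this ScrapedContent keeps the date as a string, whereas the real struct presumably stores a parsed date type.

```rust
/// Fields returned by the LLM, matching the JSON schema above.
struct LlmArticle {
    title: String,
    published_date: String,
    body_text: String,
    is_error_page: bool,
}

/// Reduced stand-in for `ScrapedContent` (assumption: the real struct
/// has more fields and a parsed date rather than a string).
#[derive(Debug, PartialEq)]
struct ScrapedContent {
    title: Option<String>,
    published_date: Option<String>,
    body_text: String,
    is_soft_404: bool,
    url: String,
    ok: bool,
    status: u16,
}

/// Builds `ScrapedContent` from the LLM output plus the HTTP response data.
fn to_scraped_content(llm: LlmArticle, final_url: &str, status: u16) -> ScrapedContent {
    // An empty string is the schema's encoding of "no date found".
    let published_date = if llm.published_date.is_empty() {
        None
    } else {
        Some(llm.published_date)
    };
    let ok = !llm.is_error_page && !llm.body_text.is_empty();
    ScrapedContent {
        title: Some(llm.title),
        published_date,
        body_text: llm.body_text,
        is_soft_404: llm.is_error_page,
        // The URL and status always come from the HTTP response, never from the LLM.
        url: final_url.to_string(),
        ok,
        status,
    }
}
```

Keeping url and status out of the LLM's hands means a hallucinated link cannot replace the URL the scraper actually fetched.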

When disabled, the existing scraper logic is used (unchanged), with the new url field populated from response.url().

Cost: ~$0.001 per article with gpt-4o-mini. For 16 articles, ~$0.016 total.

Concurrency: LLM extraction calls run with bounded concurrency (max 5) to avoid hitting provider rate limits.
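One way to picture the bound of 5 is a fixed pool of workers pulling from a shared queue. The sketch below uses std threads purely as an illustration; the backend is presumably async and would more likely use an async semaphore or a buffered stream. The names extract_bounded and MAX_CONCURRENT_EXTRACTIONS are hypothetical.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;
use std::thread;

/// Upper bound on simultaneous LLM extraction calls (from the design text).
const MAX_CONCURRENT_EXTRACTIONS: usize = 5;

/// Runs `extract` over `urls` with at most `MAX_CONCURRENT_EXTRACTIONS`
/// calls in flight. Results are returned in input order.
fn extract_bounded<T, F>(urls: &[String], extract: F) -> Vec<T>
where
    T: Send,
    F: Fn(&str) -> T + Sync,
{
    let queue: Mutex<VecDeque<usize>> = Mutex::new((0..urls.len()).collect());
    let results: Mutex<Vec<Option<T>>> = Mutex::new((0..urls.len()).map(|_| None).collect());
    let workers = MAX_CONCURRENT_EXTRACTIONS.min(urls.len());

    thread::scope(|scope| {
        for _ in 0..workers {
            scope.spawn(|| loop {
                // Take the next index; the lock is released before the slow call.
                let next = queue.lock().unwrap().pop_front();
                let Some(i) = next else { break };
                let value = extract(&urls[i]);
                results.lock().unwrap()[i] = Some(value);
            });
        }
    });

    // Every slot was filled by exactly one worker, so `flatten` drops nothing.
    results.into_inner().unwrap().into_iter().flatten().collect()
}
```

Because there are never more than five workers, the bound holds by construction, and slow articles do not block the others the way fixed batches of five would.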

Progress reporting: During per-article LLM extraction, emit progress updates: "Extraction IA des articles (N/M)..."

Files to Modify

Create: migration 20260324000014_add_llm_scraping_settings.sql
Modify: backend/src/models/settings.rs — add 2 bool fields to UserSettings, SettingsResponse, UpdateSettingsRequest, and the Default impl (no validation needed for bools)
Modify: backend/src/db/settings.rs — add to SettingsRow, TryFrom, both SQL queries
Modify: backend/src/services/scraper.rs — add url: String to ScrapedContent, populate from response.url()
Modify: backend/src/services/source_scraper.rs — add LLM-assisted link extraction path, accept provider + model + schema params
Modify: backend/src/services/synthesis.rs — pass settings + provider to scraper functions, update scrape_single_article to return ScrapedContent and accept optional LLM provider, add LLM extraction path
Modify: backend/src/services/prompts.rs — add build_link_extraction_prompt and build_article_extraction_prompt
Modify: backend/src/services/llm/schema.rs — add build_link_extraction_schema and build_article_extraction_schema
Modify: frontend/src/types.ts — add 2 bool fields to UserSettings + DEFAULT_SETTINGS
Modify: frontend/src/i18n/fr.ts — add labels
Modify: frontend/src/pages/Settings.tsx — add 2 checkboxes in "Extraction avancee" section
Modify: CLAUDE.md — update migration count
Modify: frontend/src/__tests__/fixtures.ts — add 2 bool fields to MOCK_SETTINGS if manually constructed
Modify: backend/tests/api_syntheses_test.rs — update integration test for new settings fields
Modify: e2e/tests/generation-live.spec.ts — update settings payload, add comprehensive synthesis validation
Add: unit tests in source_scraper.rs — LLM link extraction, fallback
Add: unit tests in synthesis.rs — LLM article extraction, fallback
Add: unit tests in prompts.rs — link extraction and article extraction prompts

E2E Synthesis Validation

The E2E test generates a synthesis and validates:

No duplicate URLs across all sections
All article URLs return HTTP 200 (fetch each to verify links work)
Each user-defined category has ≤ max_items_per_category articles
Each source domain appears ≤ max_articles_per_source times globally
No empty titles or summaries
No Wikipedia/hallucinated URLs
"Autre" section (if present) respects max limit
Every summary is non-trivial (> 50 chars)
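The offline subset of these checks can be expressed as one pure function. The actual E2E test lives in a TypeScript spec, so the Rust below is only an illustration of the rules; the Article and Section shapes and the host_of helper are hypothetical. The HTTP 200 probe and the Wikipedia/hallucination check are left out because they need network access, and the "Autre" limit is skipped because its value is not stated here.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical, reduced shapes of the synthesis payload.
struct Article {
    url: String,
    title: String,
    summary: String,
}
struct Section {
    category: String,
    articles: Vec<Article>,
}

/// Host part of an absolute URL (simplistic: no port or userinfo handling).
fn host_of(url: &str) -> &str {
    let rest = url.split_once("://").map_or(url, |(_, rest)| rest);
    rest.split('/').next().unwrap_or(rest)
}

/// Returns one message per violated rule; an empty vector means success.
fn validate_synthesis(
    sections: &[Section],
    max_items_per_category: usize,
    max_articles_per_source: usize,
) -> Vec<String> {
    let mut problems = Vec::new();
    let mut seen_urls = HashSet::new();
    let mut per_host: HashMap<&str, usize> = HashMap::new();

    for section in sections {
        // "Autre" has its own limit, not specified here, so it is skipped.
        if section.category != "Autre" && section.articles.len() > max_items_per_category {
            problems.push(format!("category '{}' has too many articles", section.category));
        }
        for article in &section.articles {
            if !seen_urls.insert(article.url.as_str()) {
                problems.push(format!("duplicate URL: {}", article.url));
            }
            *per_host.entry(host_of(&article.url)).or_insert(0) += 1;
            if article.title.trim().is_empty() {
                problems.push(format!("empty title: {}", article.url));
            }
            // "Non-trivial" means strictly more than 50 characters.
            if article.summary.trim().chars().count() <= 50 {
                problems.push(format!("summary too short: {}", article.url));
            }
        }
    }

    // Sort hosts so the report order is deterministic.
    let mut crowded: Vec<&str> = per_host
        .iter()
        .filter(|(_, count)| **count > max_articles_per_source)
        .map(|(host, _)| *host)
        .collect();
    crowded.sort_unstable();
    for host in crowded {
        problems.push(format!("source '{host}' appears too often"));
    }
    problems
}
```

Collecting every violation instead of failing on the first one gives a single test run that reports all the problems in a generated synthesis.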

What Does NOT Change

LLM providers — reused as-is (classification uses generate_rewrite_pass)
Database schema for syntheses — no changes
Frontend synthesis display — no changes
Rewrite pass — operates on ScrapedContent data regardless of extraction method
limit_articles_per_source, dedup_by_url, filter_homepage_urls — unchanged
Classification pipeline (Phase 1/Phase 2) — unchanged, just receives better data
