# Design: LLM-Assisted Scraping — Link Extraction & Article Content Extraction
**Date**: 2026-03-24

**Scope**: Two optional LLM-powered enhancements to the scraping pipeline, controlled by user settings

---
## Context
The current scraping pipeline uses HTML parsing heuristics to extract article links from source pages and article content from individual pages. These heuristics fail on JavaScript-rendered pages, unusual HTML structures, and complex layouts. Two optional LLM-powered alternatives improve extraction quality when enabled.
## New User Settings
Two independent boolean toggles:
- `use_llm_for_source_links: bool` (default `false`) — "Utiliser l'IA pour extraire les liens" ("Use AI to extract links")
- `use_llm_for_article_extraction: bool` (default `false`) — "Utiliser l'IA pour extraire le contenu" ("Use AI to extract content")

The toggles are fully independent: a user can enable either, both, or neither.

**Migration:** `ALTER TABLE settings ADD COLUMN use_llm_for_source_links BOOLEAN NOT NULL DEFAULT false; ALTER TABLE settings ADD COLUMN use_llm_for_article_extraction BOOLEAN NOT NULL DEFAULT false;`

**Frontend:** Two checkboxes in the Settings page under a new "Extraction avancee" ("Advanced extraction") section.
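A minimal sketch of how the two toggles could be modeled on the backend, assuming a partial-update request where an absent field means "leave unchanged". The type names `LlmScrapingSettings` and `LlmScrapingUpdate` and the `apply` method are hypothetical; the real fields live in `UserSettings` and `UpdateSettingsRequest`, whose shapes this spec does not show.

```rust
/// Hypothetical slice of the user settings holding only the two new toggles.
/// `Default` yields `false` for both, matching the column defaults.
#[derive(Debug, Clone, Copy, PartialEq, Default)]
struct LlmScrapingSettings {
    use_llm_for_source_links: bool,
    use_llm_for_article_extraction: bool,
}

/// Hypothetical partial update: `None` means "keep the current value".
#[derive(Debug, Default)]
struct LlmScrapingUpdate {
    use_llm_for_source_links: Option<bool>,
    use_llm_for_article_extraction: Option<bool>,
}

impl LlmScrapingSettings {
    /// Each toggle is updated on its own; changing one never touches the other.
    fn apply(self, update: LlmScrapingUpdate) -> Self {
        Self {
            use_llm_for_source_links: update
                .use_llm_for_source_links
                .unwrap_or(self.use_llm_for_source_links),
            use_llm_for_article_extraction: update
                .use_llm_for_article_extraction
                .unwrap_or(self.use_llm_for_article_extraction),
        }
    }
}
```

Because both fields are plain booleans, no validation step is needed beyond deserialization.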
## ScrapedContent URL Field
Add `pub url: String` to the `ScrapedContent` struct. Populated with the final URL after redirects (from `response.url().to_string()`).
**Pipeline impact:** `scrape_single_article` returns `(String, String, String)`, i.e. `(body_text, page_title, final_url)`, instead of `(String, String)`. The callers (`scrape_flat_urls`, `scrape_articles`) use `final_url` to set `ScrapedNewsItem.url`, replacing the original input URL with the validated, redirect-resolved URL. This URL becomes the canonical article URL used throughout: `restore_scraped_urls` substitutes it for the LLM-provided URL in the synthesis.
## Option 1: LLM-Assisted Source Link Extraction
When `use_llm_for_source_links` is enabled:
1. Fetch the source page HTML (same as today)
2. Extract `<head>` + first 8000 chars of `<body>` for the LLM
3. **LLM prompt:** "Here is the HTML of a blog/news page. Extract only the URLs that point to actual articles (not navigation, tags, categories, login pages, etc.). Return a JSON array of URLs."
4. **LLM schema:** `{ "type": "object", "properties": { "urls": { "type": "array", "items": { "type": "string" } } }, "required": ["urls"], "additionalProperties": false }`
5. Parse the LLM response:
   - Resolve relative URLs against the source URL
   - Filter: only keep http/https URLs, skip malformed URLs (use `Url::parse`)
   - Filter: same domain only (match existing heuristic behavior)
   - Deduplicate, limit to `max_links`
6. **Fallback:** if the LLM call fails OR returns an empty array (`{"urls": []}`), fall back to the existing `extract_links_from_html`. Log a warning.

When disabled, the existing HTML parsing + heuristic filtering is used (unchanged).

**LLM dispatch:** Uses `model_research` via `provider.generate_rewrite_pass`.
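The body truncation in step 2 and the post-processing in steps 5 and 6 can be sketched as follows. This is an illustration under stated assumptions, not the project's code: `truncate_chars`, `http_host` and `select_links` are hypothetical names, the LLM URLs are assumed to be already absolute (relative-URL resolution is omitted), and `http_host` is a hand-rolled, std-only stand-in for the `Url::parse` call the spec asks for (it does not handle IPv6 literals or upper-case schemes).

```rust
use std::collections::HashSet;

/// Keep at most `max_chars` characters of `body`, cutting on a char
/// boundary so multi-byte UTF-8 text never causes a slicing panic.
fn truncate_chars(body: &str, max_chars: usize) -> &str {
    match body.char_indices().nth(max_chars) {
        Some((byte_index, _)) => &body[..byte_index],
        None => body,
    }
}

/// Hypothetical helper: lower-cased host of an absolute http(s) URL,
/// or `None` for any other scheme. Simplified stand-in for `Url::parse`.
fn http_host(url: &str) -> Option<String> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    let authority = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()?;
    let host_port = authority.rsplit('@').next()?;
    let host = host_port.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host.to_ascii_lowercase())
    }
}

/// Filter, deduplicate and cap the LLM output. If the LLM call failed or
/// returned no URLs, use the heuristic extractor passed in as a closure.
fn select_links<E>(
    llm_result: Result<Vec<String>, E>,
    source_url: &str,
    max_links: usize,
    heuristic_fallback: impl FnOnce() -> Vec<String>,
) -> Vec<String> {
    let urls = match llm_result {
        Ok(urls) if !urls.is_empty() => urls,
        // A real implementation would log a warning here.
        _ => return heuristic_fallback(),
    };
    let source_host = http_host(source_url);
    let mut seen = HashSet::new();
    let kept: Vec<String> = urls
        .into_iter()
        .filter(|u| {
            let host = http_host(u);
            host.is_some() && host == source_host
        })
        .filter(|u| seen.insert(u.clone()))
        .take(max_links)
        .collect();
    kept
}
```

Passing the fallback as a closure keeps the heuristic path lazy: `extract_links_from_html` would only run when the LLM path produces nothing usable. The spec does not say what happens when the LLM returns URLs that are all filtered out; this sketch returns an empty list in that case.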
## Option 2: LLM-Assisted Article Content Extraction
When `use_llm_for_article_extraction` is enabled:
1. Fetch the article page (same as today — HTTP request, SSRF check, body size limit, streaming)
2. Capture the final URL after redirects (for `ScrapedContent.url`)
3. Extract `<head>` section and clean body text using existing HTML stripping
4. Send both to the LLM with a structured extraction prompt
5. **LLM prompt:** "Extract the following from this article: title, publication date (ISO 8601 format, or empty string if not found), body text (main article content only, no navigation or ads), and whether this is a real article or an error/404 page."
6. **LLM schema:** (OpenAI strict mode compatible — no union types, `published_date` uses empty string instead of null)
```json
{
  "type": "object",
  "properties": {
    "title": { "type": "string", "description": "Article title" },
    "published_date": { "type": "string", "description": "ISO 8601 date or empty string if not found" },
    "body_text": { "type": "string", "description": "Main article content" },
    "is_error_page": { "type": "boolean", "description": "True if this is an error/404 page" }
  },
  "required": ["title", "published_date", "body_text", "is_error_page"],
  "additionalProperties": false
}
```
7. Parse the LLM response into `ScrapedContent` fields:
   - `title` → `ScrapedContent.title` (wrapped in `Some`)
   - `published_date` → if non-empty, parse ISO 8601 → `ScrapedContent.published_date`; if empty string → `None`
   - `body_text` → `ScrapedContent.body_text`
   - `is_error_page` → `ScrapedContent.is_soft_404`
   - `url` → from `response.url()` (not from LLM)
   - `ok` → `true` if `!is_error_page` and `body_text` is non-empty
   - `status` → from HTTP response status
8. **Fallback:** if the LLM call fails (network error, JSON parse failure, schema validation error, timeout), fall back to the existing HTML parsing. Log a warning.

When disabled, the existing scraper logic is used (unchanged), with the new `url` field populated from `response.url()`.
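The field mapping in step 7 can be sketched as a pure function. Everything here is an assumption made for illustration: `LlmArticle` stands for the already-deserialized LLM response, the `ScrapedContent` shown contains only the fields named in this spec, and `published_date` is kept as a string because the real field type and the ISO 8601 parsing step are not shown in this document.

```rust
/// Hypothetical, already-deserialized LLM response matching the schema above.
struct LlmArticle {
    title: String,
    published_date: String,
    body_text: String,
    is_error_page: bool,
}

/// Assumed shape of `ScrapedContent`, limited to the fields this spec names.
#[derive(Debug, PartialEq)]
struct ScrapedContent {
    title: Option<String>,
    /// Kept as text here; the real code would parse ISO 8601 into a date type.
    published_date: Option<String>,
    body_text: String,
    is_soft_404: bool,
    url: String,
    ok: bool,
    status: u16,
}

/// Map the LLM output plus HTTP metadata into `ScrapedContent`.
/// `final_url` and `status` come from the HTTP response, never from the LLM.
fn to_scraped_content(llm: LlmArticle, final_url: &str, status: u16) -> ScrapedContent {
    // The schema uses an empty string instead of null for a missing date.
    let published_date = Some(llm.published_date).filter(|date| !date.is_empty());
    let ok = !llm.is_error_page && !llm.body_text.is_empty();
    ScrapedContent {
        title: Some(llm.title),
        published_date,
        body_text: llm.body_text,
        is_soft_404: llm.is_error_page,
        url: final_url.to_string(),
        ok,
        status,
    }
}
```

Keeping the mapping free of I/O makes the fallback rule easy to test: any failure before this function is reached (network, JSON parsing, schema validation, timeout) routes to the existing HTML parsing instead.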
**Cost:** ~$0.001 per article with gpt-4o-mini. For 16 articles, ~$0.016 total.

**Concurrency:** LLM extraction calls run with bounded concurrency (max 5) to avoid hitting provider rate limits.

**Progress reporting:** During per-article LLM extraction, emit progress updates: "Extraction IA des articles (N/M)..." ("AI extraction of articles (N/M)...")
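One way to picture bounded concurrency with per-article progress is a fixed pool of workers pulling from a shared queue. The sketch below uses standard-library scoped threads purely to stay self-contained; the real pipeline is presumably async and would more likely use a semaphore or a buffered stream. `run_bounded` and `progress_label` are hypothetical names, and `work` stands in for the per-article LLM extraction call.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Progress message in the format quoted by the spec.
fn progress_label(done: usize, total: usize) -> String {
    format!("Extraction IA des articles ({}/{})...", done, total)
}

/// Run `work` over `items` with at most `max_in_flight` calls at once,
/// reporting progress after each item. Results keep the input order.
fn run_bounded<T, R>(
    items: Vec<T>,
    max_in_flight: usize,
    work: impl Fn(T) -> R + Sync,
    on_progress: impl Fn(String) + Sync,
) -> Vec<R>
where
    T: Send,
    R: Send,
{
    let total = items.len();
    let queue = Mutex::new(items.into_iter().enumerate());
    let slots: Mutex<Vec<Option<R>>> = Mutex::new((0..total).map(|_| None).collect());
    let done = AtomicUsize::new(0);
    let workers = max_in_flight.clamp(1, total.max(1));

    thread::scope(|scope| {
        for _ in 0..workers {
            scope.spawn(|| loop {
                // Take the lock only long enough to pop the next item.
                let next = queue.lock().unwrap().next();
                let (index, item) = match next {
                    Some(pair) => pair,
                    None => break,
                };
                let result = work(item);
                slots.lock().unwrap()[index] = Some(result);
                let finished = done.fetch_add(1, Ordering::SeqCst) + 1;
                on_progress(progress_label(finished, total));
            });
        }
    });

    slots
        .into_inner()
        .unwrap()
        .into_iter()
        .map(|slot| slot.expect("each index is filled exactly once"))
        .collect()
}
```

With `max_in_flight` set to 5, a batch of 16 articles never has more than five extraction calls outstanding, and the progress callback fires 16 times, ending at "(16/16)".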
## Files to Modify
- **Create:** migration `20260324000014_add_llm_scraping_settings.sql`
- **Modify:** `backend/src/models/settings.rs` — add 2 bool fields to `UserSettings`, `SettingsResponse`, `UpdateSettingsRequest`, `Default`, validation (none needed for bools)
- **Modify:** `backend/src/db/settings.rs` — add to `SettingsRow`, `TryFrom`, both SQL queries
- **Modify:** `backend/src/services/scraper.rs` — add `url: String` to `ScrapedContent`, populate from `response.url()`
- **Modify:** `backend/src/services/source_scraper.rs` — add LLM-assisted link extraction path, accept provider + model + schema params
- **Modify:** `backend/src/services/synthesis.rs` — pass settings + provider to scraper functions, update `scrape_single_article` to return `ScrapedContent` and accept optional LLM provider, add LLM extraction path
- **Modify:** `backend/src/services/prompts.rs` — add `build_link_extraction_prompt` and `build_article_extraction_prompt`
- **Modify:** `backend/src/services/llm/schema.rs` — add `build_link_extraction_schema` and `build_article_extraction_schema`
- **Modify:** `frontend/src/types.ts` — add 2 bool fields to `UserSettings` + `DEFAULT_SETTINGS`
- **Modify:** `frontend/src/i18n/fr.ts` — add labels
- **Modify:** `frontend/src/pages/Settings.tsx` — add 2 checkboxes in "Extraction avancee" section
- **Modify:** `CLAUDE.md` — update migration count
- **Modify:** `frontend/src/__tests__/fixtures.ts` — add 2 bool fields to MOCK_SETTINGS if manually constructed
- **Modify:** `backend/tests/api_syntheses_test.rs` — update integration test for new settings fields
- **Modify:** `e2e/tests/generation-live.spec.ts` — update settings payload, add comprehensive synthesis validation
- **Add:** unit tests in `source_scraper.rs` — LLM link extraction, fallback
- **Add:** unit tests in `synthesis.rs` — LLM article extraction, fallback
- **Add:** unit tests in `prompts.rs` — link extraction and article extraction prompts
## E2E Synthesis Validation
The E2E test generates a synthesis and validates:
- No duplicate URLs across all sections
- All article URLs return HTTP 200 (fetch each to verify links work)
- Each user-defined category has ≤ `max_items_per_category` articles
- Each source domain appears ≤ `max_articles_per_source` times globally
- No empty titles or summaries
- No Wikipedia/hallucinated URLs
- "Autre" section (if present) respects max limit
- Every summary is non-trivial (> 50 chars)
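The offline checks in this list can be expressed as one pure function over the generated articles. The actual E2E test lives in `e2e/tests/generation-live.spec.ts`; the Rust sketch below only illustrates the rules, with a hypothetical `Article` type whose `domain` field is assumed to be pre-extracted. The network checks (HTTP 200 per URL) and the Wikipedia/hallucinated-URL check are left out.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical flattened view of one synthesis item.
struct Article {
    category: String,
    url: String,
    domain: String,
    title: String,
    summary: String,
}

/// Return one message per violated rule; an empty list means the synthesis passes.
fn validate_synthesis(
    articles: &[Article],
    max_items_per_category: usize,
    max_articles_per_source: usize,
) -> Vec<String> {
    let mut errors = Vec::new();
    let mut seen_urls = HashSet::new();
    let mut per_category: HashMap<&str, usize> = HashMap::new();
    let mut per_domain: HashMap<&str, usize> = HashMap::new();

    for article in articles {
        if !seen_urls.insert(article.url.as_str()) {
            errors.push(format!("duplicate URL: {}", article.url));
        }
        if article.title.trim().is_empty() {
            errors.push(format!("empty title: {}", article.url));
        }
        // "Non-trivial" means strictly more than 50 characters.
        if article.summary.chars().count() <= 50 {
            errors.push(format!("summary too short: {}", article.url));
        }
        *per_category.entry(article.category.as_str()).or_insert(0) += 1;
        *per_domain.entry(article.domain.as_str()).or_insert(0) += 1;
    }

    for (category, count) in &per_category {
        if *count > max_items_per_category {
            errors.push(format!("category over limit: {} ({})", category, count));
        }
    }
    for (domain, count) in &per_domain {
        if *count > max_articles_per_source {
            errors.push(format!("source over limit: {} ({})", domain, count));
        }
    }
    errors
}
```

Collecting every violation instead of stopping at the first gives a failing run a complete report, which is useful when a live generation is slow to repeat.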
## What Does NOT Change
- LLM providers — reused as-is (classification uses `generate_rewrite_pass`)
- Database schema for syntheses — no changes
- Frontend synthesis display — no changes
- Rewrite pass — operates on `ScrapedContent` data regardless of extraction method
- `limit_articles_per_source`, `dedup_by_url`, `filter_homepage_urls` — unchanged
- Classification pipeline (Phase 1/Phase 2) — unchanged, just receives better data