LLM-Assisted Scraping — Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.
Goal: Add two optional LLM-powered scraping enhancements: LLM link extraction from source pages and LLM article content extraction — controlled by user settings.
Architecture: Two boolean settings control two independent LLM scraping paths. Each has a fallback to existing heuristic-based extraction. ScrapedContent gains a url field for redirect-resolved URLs. New prompt/schema builders for both LLM calls.
Tech Stack: Rust (reqwest, scraper crate, serde_json), existing LLM providers via generate_rewrite_pass
Spec: docs/superpowers/specs/2026-03-24-llm-scraping-design.md
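The fallback rule in the Architecture note (try the LLM path, use the heuristic path when the LLM call fails or returns nothing) can be sketched as one small helper. This is only an illustration under assumptions: `extract_with_fallback`, its closure parameters, and the `String` error type are hypothetical names and do not appear in the codebase.

```rust
/// Run the LLM path first; fall back to the heuristic path when the LLM
/// call errors or yields an empty result. All names are hypothetical.
fn extract_with_fallback<T, E: std::fmt::Display>(
    llm: impl FnOnce() -> Result<Vec<T>, E>,
    heuristic: impl FnOnce() -> Vec<T>,
) -> Vec<T> {
    match llm() {
        // A non-empty LLM answer wins.
        Ok(items) if !items.is_empty() => items,
        // An empty answer is treated like a failure.
        Ok(_) => {
            eprintln!("LLM returned nothing, using heuristic extraction");
            heuristic()
        }
        Err(e) => {
            eprintln!("LLM call failed ({e}), using heuristic extraction");
            heuristic()
        }
    }
}

fn main() {
    // Non-capturing closure, so it is Copy and can be reused below.
    let heuristic = || vec!["https://example.com/heuristic".to_string()];

    let from_llm = extract_with_fallback(
        || Ok::<_, String>(vec!["https://example.com/llm".to_string()]),
        heuristic,
    );
    assert_eq!(from_llm, vec!["https://example.com/llm"]);

    let from_empty = extract_with_fallback(|| Ok::<_, String>(Vec::new()), heuristic);
    assert_eq!(from_empty, vec!["https://example.com/heuristic"]);

    let from_error =
        extract_with_fallback(|| Err::<Vec<String>, _>("timeout".to_string()), heuristic);
    assert_eq!(from_error, vec!["https://example.com/heuristic"]);
}
```

Both extraction paths in Tasks 4 and 5 follow this shape, with the extra detail that the article path also treats an error page or an empty body as "no content".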
Task 1: Migration + backend model (2 bool settings)
Files:
- Create: backend/migrations/20260324000014_add_llm_scraping_settings.sql
- Modify: backend/src/models/settings.rs
- Modify: backend/src/db/settings.rs
- Modify: backend/src/services/prompts.rs (test fixture)
- Modify: CLAUDE.md
- Step 1: Create migration
ALTER TABLE settings ADD COLUMN use_llm_for_source_links BOOLEAN NOT NULL DEFAULT false;
ALTER TABLE settings ADD COLUMN use_llm_for_article_extraction BOOLEAN NOT NULL DEFAULT false;
- Step 2: Add fields to all structs in models/settings.rs
Add pub use_llm_for_source_links: bool and pub use_llm_for_article_extraction: bool to UserSettings, SettingsResponse, UpdateSettingsRequest (after source_diversity_window).
Add to From<UserSettings> for SettingsResponse, Default for UserSettings (both false). No validation needed for bools.
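As a rough picture of what this step produces, the sketch below uses trimmed-down stand-ins for the real structs. Only the two new flags come from the plan; the `i32` type and the value `7` for `source_diversity_window` are assumptions made so the example compiles, and the real structs carry many more fields.

```rust
/// Trimmed-down stand-in for the real `UserSettings`.
#[derive(Debug, Clone, PartialEq)]
struct UserSettings {
    source_diversity_window: i32, // existing field; type assumed
    use_llm_for_source_links: bool,
    use_llm_for_article_extraction: bool,
}

impl Default for UserSettings {
    fn default() -> Self {
        Self {
            source_diversity_window: 7, // placeholder value, not from the plan
            // Both LLM paths are opt-in, matching the migration's DEFAULT false.
            use_llm_for_source_links: false,
            use_llm_for_article_extraction: false,
        }
    }
}

/// Trimmed-down stand-in for the real `SettingsResponse`.
#[derive(Debug, PartialEq)]
struct SettingsResponse {
    source_diversity_window: i32,
    use_llm_for_source_links: bool,
    use_llm_for_article_extraction: bool,
}

impl From<UserSettings> for SettingsResponse {
    fn from(s: UserSettings) -> Self {
        Self {
            source_diversity_window: s.source_diversity_window,
            use_llm_for_source_links: s.use_llm_for_source_links,
            use_llm_for_article_extraction: s.use_llm_for_article_extraction,
        }
    }
}

fn main() {
    let defaults = UserSettings::default();
    assert!(!defaults.use_llm_for_source_links);
    assert!(!defaults.use_llm_for_article_extraction);

    // Flags survive the conversion to the API response type.
    let enabled = UserSettings { use_llm_for_source_links: true, ..defaults };
    let response = SettingsResponse::from(enabled);
    assert!(response.use_llm_for_source_links);
    assert!(!response.use_llm_for_article_extraction);
    println!("window = {}", response.source_diversity_window);
}
```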
- Step 3: Add to DB queries in db/settings.rs
Add both fields to SettingsRow, TryFrom<SettingsRow>, and both SQL queries (get_or_create_default + upsert). Follow the pattern of the last column added.
- Step 4: Update test fixtures
Add use_llm_for_source_links: false, use_llm_for_article_extraction: false to:
- valid_request() in models/settings.rs tests
- test_settings() in services/prompts.rs tests
- Step 5: Update CLAUDE.md migration count to 14
- Step 6: Run tests + commit
Run: cd backend && cargo test --lib
git add backend/migrations/20260324000014_add_llm_scraping_settings.sql backend/src/models/settings.rs backend/src/db/settings.rs backend/src/services/prompts.rs CLAUDE.md
git commit -m "feat: add use_llm_for_source_links and use_llm_for_article_extraction settings"
Task 2: Add url field to ScrapedContent + update scrape_single_article
Files:
- Modify: backend/src/services/scraper.rs
- Modify: backend/src/services/synthesis.rs
- Step 1: Add url to ScrapedContent in scraper.rs
Add pub url: String to the ScrapedContent struct (after is_soft_404).
In scrape_url, populate it from final_url:
Ok(ScrapedContent {
    ok: !is_soft_404,
    status,
    title,
    published_date,
    body_text,
    is_soft_404,
    url: final_url.to_string(),
})
- Step 2: Update scrape_single_article to return (String, String, String)
In synthesis.rs, change scrape_single_article return type from (String, String) to (String, String, String) — (body_text, page_title, final_url):
async fn scrape_single_article(
    http_client: &reqwest::Client,
    url: &str,
    max_age_days: i64,
) -> (String, String, String) {
    match scraper::scrape_url(http_client, url).await {
        Ok(content) => {
            if !content.ok || content.is_soft_404 {
                tracing::warn!(url = url, "Soft 404 or error page detected, skipping content");
                return (String::new(), String::new(), content.url);
            }
            if scraper::is_article_too_old(content.published_date, max_age_days) {
                tracing::warn!(url = url, "Article too old, skipping content");
                return (String::new(), String::new(), content.url);
            }
            let title = content.title.unwrap_or_default();
            (content.body_text, title, content.url)
        }
        Err(e) => {
            tracing::warn!(url = url, error = %e, "Failed to scrape URL, keeping article with empty content");
            // No response was received, so the input URL is the best available value.
            (String::new(), String::new(), url.to_string())
        }
    }
}
- Step 3: Update callers of scrape_single_article
In scrape_articles and scrape_flat_urls, update destructuring from (scraped_content, page_title) to (scraped_content, page_title, final_url). Use final_url to set ScrapedNewsItem.url instead of the input URL:
In scrape_articles (inside the join_set.spawn):
let scraped = scrape_single_article(&client, &url, mad).await;
(cat_key, item, scraped)
And in the result handler:
if let Ok((cat_key, item, (scraped_content, page_title, final_url))) = join_result {
    let scraped_item = ScrapedNewsItem {
        title: item.title,
        url: final_url, // Use redirect-resolved URL instead of item.url
        summary: item.summary,
        original_title: page_title,
        scraped_content,
    };
Same pattern in scrape_flat_urls:
if let Ok((url, scraped_content, page_title, final_url)) = join_result {
    results.push(ScrapedNewsItem {
        title: page_title.clone(),
        url: final_url, // Use redirect-resolved URL
        summary: String::new(),
        original_title: page_title,
        scraped_content,
    });
Note: the task passed to join_set.spawn must also return final_url. Update it to return a 4-tuple: (url, scraped_content, page_title, final_url).
- Step 4: Run tests + commit
Run: cd backend && cargo test --lib
git add backend/src/services/scraper.rs backend/src/services/synthesis.rs
git commit -m "feat: add url field to ScrapedContent, use redirect-resolved URLs"
Task 3: LLM prompts and schemas for both extraction types
Files:
- Modify: backend/src/services/prompts.rs
- Modify: backend/src/services/llm/schema.rs
- Step 1: Add build_link_extraction_prompt to prompts.rs
/// Build a prompt for LLM-assisted link extraction from a source page.
///
/// # Arguments
/// * `head_html`: the <head> section of the page
/// * `body_html`: first 8000 chars of the <body> section
pub fn build_link_extraction_prompt(head_html: &str, body_html: &str) -> (String, String) {
    let system_prompt =
        "Tu es un assistant qui analyse des pages web. \
         Tu dois identifier les liens vers des articles d'actualite. \
         Reponds uniquement au format JSON demande."
            .to_string();
    let user_prompt = format!(
        "Voici le contenu HTML d'une page de blog ou de site d'actualites.\n\n\
         <head>\n{head}\n</head>\n\n\
         <body (extrait)>\n{body}\n</body>\n\n\
         Extrais UNIQUEMENT les URLs qui pointent vers des articles \
         (pas les liens de navigation, tags, categories, login, pages statiques, etc.).\n\
         Retourne les URLs completes dans le format JSON demande.",
        head = head_html,
        body = body_html,
    );
    (system_prompt, user_prompt)
}
- Step 2: Add build_article_extraction_prompt to prompts.rs
/// Build a prompt for LLM-assisted article content extraction.
///
/// # Arguments
/// * `head_html`: the <head> section (contains meta tags, og:*, canonical)
/// * `body_text`: cleaned body text from existing HTML stripping
pub fn build_article_extraction_prompt(head_html: &str, body_text: &str) -> (String, String) {
    let system_prompt =
        "Tu es un assistant qui analyse des articles web. \
         Tu dois extraire les informations structurees de l'article. \
         Reponds uniquement au format JSON demande."
            .to_string();
    let user_prompt = format!(
        "Voici le contenu d'une page web.\n\n\
         <head>\n{head}\n</head>\n\n\
         Contenu textuel de la page :\n{body}\n\n\
         Extrais les informations suivantes :\n\
         - title : le titre de l'article\n\
         - published_date : la date de publication au format ISO 8601 (YYYY-MM-DDTHH:MM:SSZ), \
         ou une chaine vide si introuvable\n\
         - body_text : le contenu principal de l'article (pas la navigation, pas les pubs)\n\
         - is_error_page : true si c'est une page d'erreur/404, false sinon",
        head = head_html,
        body = body_text,
    );
    (system_prompt, user_prompt)
}
- Step 3: Add schemas to schema.rs
/// Build a JSON Schema for LLM link extraction response.
pub fn build_link_extraction_schema() -> Value {
    serde_json::json!({
        "type": "object",
        "properties": {
            "urls": {
                "type": "array",
                "items": { "type": "string" }
            }
        },
        "required": ["urls"],
        "additionalProperties": false
    })
}

/// Build a JSON Schema for LLM article content extraction response.
pub fn build_article_extraction_schema() -> Value {
    serde_json::json!({
        "type": "object",
        "properties": {
            "title": { "type": "string", "description": "Article title" },
            "published_date": { "type": "string", "description": "ISO 8601 date or empty string if not found" },
            "body_text": { "type": "string", "description": "Main article content" },
            "is_error_page": { "type": "boolean", "description": "True if this is an error/404 page" }
        },
        "required": ["title", "published_date", "body_text", "is_error_page"],
        "additionalProperties": false
    })
}
- Step 4: Add tests
In prompts.rs tests:
#[test]
fn link_extraction_prompt_includes_html() {
    let (_, user) = build_link_extraction_prompt("<title>Blog</title>", "<a href='/post'>P</a>");
    assert!(user.contains("<title>Blog</title>"));
    assert!(user.contains("articles"));
}

#[test]
fn article_extraction_prompt_includes_content() {
    let (_, user) = build_article_extraction_prompt("<meta name='date' content='2026'>", "Article body text here");
    assert!(user.contains("Article body text here"));
    assert!(user.contains("published_date"));
}
In schema.rs tests:
#[test]
fn link_extraction_schema_has_urls_array() {
    let schema = build_link_extraction_schema();
    assert_eq!(schema["properties"]["urls"]["type"], "array");
    assert_eq!(schema["additionalProperties"], false);
}

#[test]
fn article_extraction_schema_has_all_fields() {
    let schema = build_article_extraction_schema();
    let props = schema["properties"].as_object().unwrap();
    assert!(props.contains_key("title"));
    assert!(props.contains_key("published_date"));
    assert!(props.contains_key("body_text"));
    assert!(props.contains_key("is_error_page"));
    assert_eq!(schema["additionalProperties"], false);
}
- Step 5: Run tests + commit
Run: cd backend && cargo test --lib
git add backend/src/services/prompts.rs backend/src/services/llm/schema.rs
git commit -m "feat: add LLM prompts and schemas for link and article extraction"
Task 4: LLM-assisted source link extraction in source_scraper.rs
Files:
- Modify: backend/src/services/source_scraper.rs
- Step 1: Update extract_article_links to accept optional LLM provider
Add a new public function, extract_article_links_with_llm, that accepts the LLM parameters. The existing extract_article_links stays unchanged and serves the non-LLM path.
use crate::services::llm::LlmProvider;
use crate::services::llm::schema::build_link_extraction_schema;
use crate::services::prompts::build_link_extraction_prompt;

/// Extract article links using LLM analysis of the page HTML.
///
/// Falls back to heuristic extraction if the LLM call fails or returns empty results.
pub async fn extract_article_links_with_llm(
    http_client: &reqwest::Client,
    source_url: &str,
    max_links: usize,
    provider: &dyn LlmProvider,
    model: &str,
) -> Result<Vec<String>, AppError> {
    let base_url = Url::parse(source_url)
        .map_err(|e| AppError::BadRequest(format!("Invalid source URL: {}", e)))?;
    let base_domain = base_url.host_str().unwrap_or("").to_lowercase();

    // Fetch the page
    let response = http_client.get(source_url).send().await.map_err(|e| {
        tracing::warn!(url = source_url, error = %e, "Failed to fetch source page");
        AppError::Internal(anyhow::anyhow!("Failed to fetch source page"))
    })?;
    if !response.status().is_success() {
        tracing::warn!(url = source_url, status = %response.status(), "Source page returned non-200");
        return Ok(Vec::new());
    }
    let html_text = response.text().await.map_err(|e| {
        AppError::Internal(anyhow::anyhow!("Failed to read source page body: {}", e))
    })?;

    // Extract <head> and first 8000 chars of <body> for the LLM
    let (head_html, body_html) = extract_head_and_body(&html_text);
    let (system, user) = build_link_extraction_prompt(&head_html, &body_html);
    let schema = build_link_extraction_schema();

    match provider.generate_rewrite_pass(model, &system, &user, &schema).await {
        Ok(response) => {
            let urls = response
                .get("urls")
                .and_then(|u| u.as_array())
                .map(|arr| {
                    arr.iter()
                        .filter_map(|v| v.as_str())
                        .filter_map(|href| {
                            // Resolve relative URLs
                            let resolved = base_url.join(href).ok()?;
                            // Filter: http/https only, same domain
                            if resolved.scheme() != "http" && resolved.scheme() != "https" {
                                return None;
                            }
                            let domain = resolved.host_str()?.to_lowercase();
                            if domain != base_domain {
                                return None;
                            }
                            Some(resolved.to_string())
                        })
                        .collect::<Vec<_>>()
                })
                .unwrap_or_default();

            if urls.is_empty() {
                tracing::warn!(url = source_url, "LLM returned no links, falling back to heuristic extraction");
                let fallback = extract_links_from_html(&html_text, &base_url, &base_domain);
                Ok(fallback.into_iter().take(max_links).collect())
            } else {
                // Deduplicate
                let mut seen = std::collections::HashSet::new();
                let deduped: Vec<String> = urls.into_iter().filter(|u| seen.insert(u.clone())).collect();
                Ok(deduped.into_iter().take(max_links).collect())
            }
        }
        Err(e) => {
            tracing::warn!(url = source_url, error = %e, "LLM link extraction failed, falling back to heuristic");
            let fallback = extract_links_from_html(&html_text, &base_url, &base_domain);
            Ok(fallback.into_iter().take(max_links).collect())
        }
    }
}
/// Extract the <head> section and up to 8000 bytes of <body> from HTML.
fn extract_head_and_body(html: &str) -> (String, String) {
    let head_start = html.find("<head").unwrap_or(0);
    let head_end = html.find("</head>").map(|i| i + 7).unwrap_or(head_start);
    let head = &html[head_start..head_end];
    let body_start = html.find("<body").unwrap_or(head_end);
    let mut body_end = (body_start + 8000).min(html.len());
    // Back up to a char boundary so slicing never panics on multi-byte UTF-8.
    while !html.is_char_boundary(body_end) {
        body_end -= 1;
    }
    let body = &html[body_start..body_end];
    (head.to_string(), body.to_string())
}
- Step 2: Add tests
#[test]
fn extract_head_and_body_splits_correctly() {
    let html = "<html><head><title>T</title></head><body><p>Content</p></body></html>";
    let (head, body) = extract_head_and_body(html);
    assert!(head.contains("<title>T</title>"));
    assert!(body.contains("<p>Content</p>"));
}

#[test]
fn extract_head_and_body_truncates_body() {
    let long_body = "x".repeat(20000);
    let html = format!("<head></head><body>{}</body>", long_body);
    let (_, body) = extract_head_and_body(&html);
    assert!(body.len() <= 8000); // 8000 bytes total, including the <body> tag
}

#[test]
fn extract_head_and_body_handles_multibyte_text() {
    // 'é' is 2 bytes in UTF-8; here the 8000-byte cut lands mid-character.
    let html = format!("<head></head><body>x{}</body>", "é".repeat(10000));
    let (_, body) = extract_head_and_body(&html);
    assert!(body.len() <= 8000);
}
- Step 3: Run tests + commit
Run: cd backend && cargo test --lib
git add backend/src/services/source_scraper.rs
git commit -m "feat: LLM-assisted source link extraction with heuristic fallback"
Task 5: LLM-assisted article extraction in synthesis pipeline
Files:
- Modify: backend/src/services/synthesis.rs
- Step 1: Add scrape_single_article_with_llm function
Add a new async function alongside scrape_single_article:
/// Scrape an article URL using LLM for content extraction.
///
/// Falls back to heuristic extraction if the LLM call fails.
async fn scrape_single_article_with_llm(
    http_client: &reqwest::Client,
    url: &str,
    max_age_days: i64,
    provider: &dyn crate::services::llm::LlmProvider,
    model: &str,
) -> (String, String, String) {
    // First, do the HTTP fetch (same as regular scraping)
    let fetch_result = scraper::scrape_url(http_client, url).await;
    let content = match fetch_result {
        Ok(c) => c,
        Err(e) => {
            tracing::warn!(url = url, error = %e, "Failed to fetch URL for LLM extraction");
            return (String::new(), String::new(), url.to_string());
        }
    };
    let final_url = content.url.clone();
    if !content.ok || content.is_soft_404 {
        return (String::new(), String::new(), final_url);
    }

    // scrape_url does not preserve the raw <head>, so the LLM receives an
    // empty head plus the heuristically cleaned body text. Passing real
    // <head> metadata would require scrape_url to expose the raw HTML.
    let head_html = String::new();
    let body_text = &content.body_text;
    let (system, user) = crate::services::prompts::build_article_extraction_prompt(
        &head_html,
        body_text,
    );
    let schema = crate::services::llm::schema::build_article_extraction_schema();

    match provider.generate_rewrite_pass(model, &system, &user, &schema).await {
        Ok(response) => {
            let title = response.get("title").and_then(|t| t.as_str()).unwrap_or("").to_string();
            let extracted_body = response.get("body_text").and_then(|b| b.as_str()).unwrap_or("").to_string();
            let is_error = response.get("is_error_page").and_then(|e| e.as_bool()).unwrap_or(false);
            let date_str = response.get("published_date").and_then(|d| d.as_str()).unwrap_or("");
            if is_error || extracted_body.trim().is_empty() {
                return (String::new(), String::new(), final_url);
            }
            // Check date if provided; an unparseable date is ignored
            if !date_str.is_empty() {
                if let Ok(date) = chrono::DateTime::parse_from_rfc3339(date_str) {
                    if scraper::is_article_too_old(Some(date.with_timezone(&chrono::Utc)), max_age_days) {
                        tracing::warn!(url = url, "LLM-extracted article too old");
                        return (String::new(), String::new(), final_url);
                    }
                }
            }
            (extracted_body, title, final_url)
        }
        Err(e) => {
            tracing::warn!(url = url, error = %e, "LLM article extraction failed, using heuristic fallback");
            // Fall back to existing heuristic data
            if scraper::is_article_too_old(content.published_date, max_age_days) {
                return (String::new(), String::new(), final_url);
            }
            let title = content.title.unwrap_or_default();
            (content.body_text, title, final_url)
        }
    }
}
- Step 2: Update pipeline to use LLM extraction when enabled
In run_generation_inner, the scraping calls need to branch based on settings.use_llm_for_article_extraction. The simplest approach: update scrape_flat_urls and scrape_articles to accept an optional provider+model, and use scrape_single_article_with_llm when provided.
Add a wrapper that the pipeline calls:
/// Scrape a single article, optionally using LLM extraction.
async fn scrape_article_dispatch(
    http_client: &reqwest::Client,
    url: &str,
    max_age_days: i64,
    llm: Option<(&dyn crate::services::llm::LlmProvider, &str)>,
) -> (String, String, String) {
    match llm {
        Some((provider, model)) => {
            scrape_single_article_with_llm(http_client, url, max_age_days, provider, model).await
        }
        None => scrape_single_article(http_client, url, max_age_days).await,
    }
}
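Update scrape_flat_urls and scrape_articles to use scrape_article_dispatch. run_generation_inner passes the provider and model when settings.use_llm_for_article_extraction is true, and None otherwise. Both functions scrape inside join_set.spawn; assuming join_set is a tokio JoinSet, spawned futures must be 'static, so the borrowed (&dyn LlmProvider, &str) pair cannot cross the spawn boundary directly. Each task needs an owned handle (for example an Arc<dyn LlmProvider> and a String, if the provider is or can be wrapped in an Arc) and should borrow from it inside the task.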
Similarly, update the Phase 1 source scraping in run_generation_inner to call extract_article_links_with_llm vs extract_article_links based on settings.use_llm_for_source_links.
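One way to wire the setting into the spawned tasks is sketched below. It is a simplified, synchronous stand-in, not the pipeline's code: `std::thread::spawn` replaces `join_set.spawn` (both require `'static` captures), and the `LlmProvider` trait, `FakeProvider`, `LlmHandle`, `llm_for`, `dispatch`, and `scrape_all` are all hypothetical names defined only for this example.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for the real `LlmProvider` trait.
trait LlmProvider: Send + Sync {
    fn name(&self) -> &str;
}

struct FakeProvider;

impl LlmProvider for FakeProvider {
    fn name(&self) -> &str {
        "fake"
    }
}

/// Owned provider + model pair that can be moved into a spawned task.
type LlmHandle = (Arc<dyn LlmProvider>, String);

/// Build the optional handle from a boolean setting.
fn llm_for(enabled: bool, provider: &Arc<dyn LlmProvider>, model: &str) -> Option<LlmHandle> {
    enabled.then(|| (Arc::clone(provider), model.to_string()))
}

/// Stand-in for `scrape_article_dispatch`: reports which path would run.
fn dispatch(url: &str, llm: Option<(&dyn LlmProvider, &str)>) -> String {
    match llm {
        Some((provider, model)) => format!("llm:{}:{}:{}", provider.name(), model, url),
        None => format!("heuristic:{}", url),
    }
}

/// Spawn one task per URL; each task owns its own clone of the handle.
fn scrape_all(urls: Vec<String>, llm: Option<LlmHandle>) -> Vec<String> {
    let handles: Vec<_> = urls
        .into_iter()
        .map(|url| {
            let llm = llm.clone(); // Arc clone plus a short String
            thread::spawn(move || {
                // Borrow from the owned handle only inside the task.
                let borrowed: Option<(&dyn LlmProvider, &str)> =
                    llm.as_ref().map(|(p, m)| (&**p, m.as_str()));
                dispatch(&url, borrowed)
            })
        })
        .collect();
    // Join in spawn order so results line up with the input URLs.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let provider: Arc<dyn LlmProvider> = Arc::new(FakeProvider);
    let urls = vec!["https://example.com/a".to_string()];

    let with_llm = scrape_all(urls.clone(), llm_for(true, &provider, "some-model"));
    assert_eq!(with_llm, vec!["llm:fake:some-model:https://example.com/a"]);

    let without_llm = scrape_all(urls, llm_for(false, &provider, "some-model"));
    assert_eq!(without_llm, vec!["heuristic:https://example.com/a"]);
}
```

The same `enabled.then(...)` shape would apply to settings.use_llm_for_source_links when choosing between extract_article_links_with_llm and extract_article_links.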
- Step 3: Run tests + commit
Run: cd backend && cargo test --lib
git add backend/src/services/synthesis.rs
git commit -m "feat: LLM-assisted article extraction with heuristic fallback"
Task 6: Frontend settings
Files:
- Modify: frontend/src/types.ts
- Modify: frontend/src/i18n/fr.ts
- Modify: frontend/src/pages/Settings.tsx
- Step 1: Add fields to types
In frontend/src/types.ts, add to UserSettings:
use_llm_for_source_links: boolean;
use_llm_for_article_extraction: boolean;
Add to DEFAULT_SETTINGS:
use_llm_for_source_links: false,
use_llm_for_article_extraction: false,
- Step 2: Add i18n labels
In frontend/src/i18n/fr.ts:
'settings.advancedExtraction': 'Extraction avancee',
'settings.useLlmForSourceLinks': "Utiliser l'IA pour extraire les liens",
'settings.useLlmForArticleExtraction': "Utiliser l'IA pour extraire le contenu",
- Step 3: Add checkboxes to Settings page
In frontend/src/pages/Settings.tsx, add a new section after the existing generation settings (after the grid with maxAgeDays/maxItemsPerCategory/maxArticlesPerSource/diversityWindow):
{/* Advanced extraction */}
<div class="mt-6">
  <h3 class="text-lg font-medium text-gray-900 mb-4">
    {t('settings.advancedExtraction')}
  </h3>
  <div class="space-y-4">
    <div class="flex items-center">
      <input
        type="checkbox"
        id="useLlmSourceLinks"
        checked={settings().use_llm_for_source_links}
        onChange={(e) =>
          setSettings((prev) => ({
            ...prev,
            use_llm_for_source_links: e.currentTarget.checked,
          }))
        }
        class="h-4 w-4 text-indigo-600 focus:ring-indigo-500 border-gray-300 rounded"
      />
      <label for="useLlmSourceLinks" class="ml-2 block text-sm text-gray-700">
        {t('settings.useLlmForSourceLinks')}
      </label>
    </div>
    <div class="flex items-center">
      <input
        type="checkbox"
        id="useLlmArticleExtraction"
        checked={settings().use_llm_for_article_extraction}
        onChange={(e) =>
          setSettings((prev) => ({
            ...prev,
            use_llm_for_article_extraction: e.currentTarget.checked,
          }))
        }
        class="h-4 w-4 text-indigo-600 focus:ring-indigo-500 border-gray-300 rounded"
      />
      <label for="useLlmArticleExtraction" class="ml-2 block text-sm text-gray-700">
        {t('settings.useLlmForArticleExtraction')}
      </label>
    </div>
  </div>
</div>
- Step 4: Run frontend tests + commit
Run: cd frontend && npx tsc --noEmit && npx vitest run
git add frontend/src/types.ts frontend/src/i18n/fr.ts frontend/src/pages/Settings.tsx
git commit -m "feat: add LLM scraping toggles to Settings page"
Task 7: Update E2E test with comprehensive synthesis validation
Files:
- Modify: e2e/tests/generation-live.spec.ts
- Step 1: Update settings payload
Add the new boolean fields to the PUT settings call:
use_llm_for_source_links: false,
use_llm_for_article_extraction: false,
- Step 2: Add comprehensive validation after synthesis fetch
After the existing structure validation, add:
// Comprehensive synthesis validation
const allUrls: string[] = [];
const domainCounts: Record<string, number> = {};
for (const section of synthesis.sections) {
  for (const item of section.items) {
    // Collect URLs for duplicate check
    allUrls.push(item.url);
    // Count domains for source diversity check
    try {
      const domain = new URL(item.url).hostname;
      domainCounts[domain] = (domainCounts[domain] || 0) + 1;
    } catch {}
  }
  // Category article count check
  if (section.title !== 'Autre') {
    expect(section.items.length).toBeLessThanOrEqual(4); // max_items_per_category
  }
}
// No duplicate URLs across all sections
const uniqueUrls = new Set(allUrls);
expect(uniqueUrls.size).toBe(allUrls.length);
// No domain exceeds max_articles_per_source (3)
for (const [domain, count] of Object.entries(domainCounts)) {
  expect(count).toBeLessThanOrEqual(3);
}
// Verify article links actually resolve (HTTP 2xx or 3xx).
// Test a sample of up to 3 URLs to avoid slowness.
// Use Playwright's request context rather than fetch() inside the page:
// a cross-origin fetch from the browser is blocked by CORS on most sites.
const sampleUrls = allUrls.slice(0, 3);
for (const articleUrl of sampleUrls) {
  const resp = await page.request.head(articleUrl);
  expect(resp.status()).toBeGreaterThanOrEqual(200);
  expect(resp.status()).toBeLessThan(400);
}
- Step 3: Run E2E test
cd e2e && docker compose -f docker-compose.test.yml down
docker compose -f docker-compose.test.yml up --build -d
sleep 25 && npx tsx seed.ts && npx playwright test generation-live --reporter=list
- Step 4: Commit
git add e2e/tests/generation-live.spec.ts
git commit -m "test: comprehensive E2E synthesis validation (duplicates, links, counts)"