LLM-Assisted Scraping — Implementation Plan (Revised)

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Add two optional LLM-powered scraping enhancements: LLM link extraction from source pages and LLM article content extraction — controlled by user settings.

Architecture: Two boolean settings control two independent LLM scraping paths. Each path falls back to the existing heuristic-based extraction when the LLM call fails or returns nothing usable. ScrapedContent gains url and head_html fields. create_provider returns Arc<dyn LlmProvider> so one provider can be shared safely across concurrent tasks. When LLM article extraction is enabled, scraping concurrency drops from 10 to at most 5 tasks.

Tech Stack: Rust (reqwest, scraper crate, serde_json, Arc), existing LLM providers via generate_rewrite_pass

Spec: docs/superpowers/specs/2026-03-24-llm-scraping-design.md

Task 1: Migration + backend model (2 bool settings)

Files:

Create: backend/migrations/20260324000014_add_llm_scraping_settings.sql
Modify: backend/src/models/settings.rs
Modify: backend/src/db/settings.rs
Modify: backend/src/services/prompts.rs (test fixture)
Modify: CLAUDE.md

Step 1: Create migration

ALTER TABLE settings ADD COLUMN use_llm_for_source_links BOOLEAN NOT NULL DEFAULT false;
ALTER TABLE settings ADD COLUMN use_llm_for_article_extraction BOOLEAN NOT NULL DEFAULT false;

Step 2: Add fields to all structs in models/settings.rs

Add pub use_llm_for_source_links: bool and pub use_llm_for_article_extraction: bool to UserSettings, SettingsResponse, UpdateSettingsRequest (after source_diversity_window).

Add to From<UserSettings> for SettingsResponse, Default for UserSettings (both false). No validation needed for bools.
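
The struct changes in Step 2 might look like the sketch below. It is trimmed to the two new fields plus source_diversity_window. The i32 type and the placeholder default for source_diversity_window are assumptions for illustration, and all other settings fields are omitted, so this is not the real content of models/settings.rs.

```rust
/// Trimmed, hypothetical version of the stored settings struct.
#[derive(Debug, Clone, PartialEq)]
pub struct UserSettings {
    pub source_diversity_window: i32, // assumed type, illustration only
    pub use_llm_for_source_links: bool,
    pub use_llm_for_article_extraction: bool,
}

/// Trimmed, hypothetical version of the API response struct.
#[derive(Debug, Clone, PartialEq)]
pub struct SettingsResponse {
    pub source_diversity_window: i32,
    pub use_llm_for_source_links: bool,
    pub use_llm_for_article_extraction: bool,
}

impl Default for UserSettings {
    fn default() -> Self {
        Self {
            source_diversity_window: 7, // placeholder, not the real default
            // Both LLM paths are opt-in, matching DEFAULT false in the migration.
            use_llm_for_source_links: false,
            use_llm_for_article_extraction: false,
        }
    }
}

impl From<UserSettings> for SettingsResponse {
    fn from(s: UserSettings) -> Self {
        Self {
            source_diversity_window: s.source_diversity_window,
            use_llm_for_source_links: s.use_llm_for_source_links,
            use_llm_for_article_extraction: s.use_llm_for_article_extraction,
        }
    }
}
```

Keeping the Rust defaults aligned with the SQL DEFAULT false means existing rows and freshly constructed settings behave the same way.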

Step 3: Add to DB queries in db/settings.rs

Add both fields to SettingsRow, TryFrom<SettingsRow>, and both SQL queries. Follow the pattern of the last column added.

Step 4: Update test fixtures

Add use_llm_for_source_links: false, use_llm_for_article_extraction: false to valid_request() in settings tests and test_settings() in prompts tests.

Step 5: Update CLAUDE.md migration count to 14

Step 6: Run tests + commit

cd backend && cargo test --lib
git add backend/migrations/20260324000014_add_llm_scraping_settings.sql backend/src/models/settings.rs backend/src/db/settings.rs backend/src/services/prompts.rs CLAUDE.md
git commit -m "feat: add use_llm_for_source_links and use_llm_for_article_extraction settings"

Task 2: Add url and head_html to ScrapedContent + Arc<dyn LlmProvider> + update scraping functions

Files:

Modify: backend/src/services/scraper.rs
Modify: backend/src/services/llm/factory.rs
Modify: backend/src/services/llm/mod.rs (trait needs Send + Sync)
Modify: backend/src/services/synthesis.rs
Modify: backend/src/handlers/api_keys.rs

Step 1: Add url and head_html to ScrapedContent in scraper.rs

pub struct ScrapedContent {
    pub ok: bool,
    pub status: u16,
    pub title: Option<String>,
    pub published_date: Option<DateTime<Utc>>,
    pub body_text: String,
    pub is_soft_404: bool,
    pub url: String,
    pub head_html: String,
}

In scrape_url, before parsing the document, extract <head>:

let html_text = String::from_utf8_lossy(&bytes);

// Extract <head> section for potential LLM use
let head_html = extract_head_section(&html_text);

let document = Html::parse_document(&html_text);

Add helper:

/// Extract the <head>...</head> section from raw HTML.
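fn extract_head_section(html: &str) -> String {
    let start = html.find("<head").unwrap_or(0);
    // Search for </head> only after `start`, so start <= end always holds
    // and the slice cannot panic on malformed HTML (for example a stray
    // </head> that appears before a <header> tag).
    let end = html[start..]
        .find("</head>")
        .map(|i| start + i + "</head>".len())
        .unwrap_or(start);
    html[start..end].to_string()
}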
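
Real pages sometimes use uppercase tags such as <HEAD>, which the helper above does not match. The following is a hedged sketch of a case-insensitive variant. The name extract_head_section_ci is hypothetical and not part of the plan. Like the helper above, it would also match a <header tag if no <head tag precedes it.

```rust
/// Hypothetical variant of the head-extraction helper that tolerates
/// uppercase tags and a closing tag that is missing or misplaced.
fn extract_head_section_ci(html: &str) -> String {
    // ASCII lowercasing never changes byte length, so offsets found in
    // `lower` are valid offsets (and char boundaries) in `html`.
    let lower = html.to_ascii_lowercase();
    let Some(start) = lower.find("<head") else {
        return String::new();
    };
    // Look for the closing tag only after the opening tag.
    match lower[start..].find("</head>") {
        Some(rel) => html[start..start + rel + "</head>".len()].to_string(),
        None => String::new(),
    }
}
```

Returning an empty string when no complete head section exists keeps the LLM prompt small instead of sending an arbitrary prefix of the page.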

Populate in the return:

Ok(ScrapedContent {
    ok: !is_soft_404,
    status,
    title,
    published_date,
    body_text,
    is_soft_404,
    url: final_url.to_string(),
    head_html,
})

Step 2: Change create_provider to return Arc<dyn LlmProvider>

In backend/src/services/llm/factory.rs, change the return type:

use std::sync::Arc;

pub fn create_provider(
    provider_name: &str,
    api_key: String,
) -> Result<Arc<dyn LlmProvider>, AppError> {
    let http_client = build_llm_client()?;
    match provider_name {
        "gemini" => Ok(Arc::new(GeminiProvider::new(api_key, http_client))),
        "openai" => Ok(Arc::new(OpenAiProvider::new(api_key, http_client))),
        "anthropic" => Ok(Arc::new(AnthropicProvider::new(api_key, http_client))),
        _ => Err(AppError::BadRequest(format!("Unknown provider: '{}'", provider_name))),
    }
}

Update all factory tests to use Arc (they call methods on the provider, which works the same).

Ensure LlmProvider trait in llm/mod.rs has Send + Sync bounds:

#[async_trait]
pub trait LlmProvider: Send + Sync {

Step 3: Update all callers of create_provider

In synthesis.rs run_generation_inner: let provider = create_provider(...) now returns an Arc. Existing provider.generate_search_pass(...) calls need no change, because method calls on Arc<dyn LlmProvider> auto-deref to the trait object. Only places that spell out the old return type in a signature or type annotation need editing.

In handlers/api_keys.rs: let llm_provider = factory::create_provider(...) — same, just works via deref.
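
The sharing pattern behind this change can be sketched with the standard library alone. The trait below is a simplified, synchronous stand-in for the real async LlmProvider, and EchoProvider, create_provider(name) and run_concurrently are hypothetical names used only for illustration. Threads stand in for the async tasks the pipeline actually spawns.

```rust
use std::sync::Arc;
use std::thread;

/// Simplified, synchronous stand-in for the real async LlmProvider trait.
/// The Send + Sync supertraits are what make Arc<dyn LlmProvider> sendable.
trait LlmProvider: Send + Sync {
    fn generate(&self, prompt: &str) -> String;
}

/// Hypothetical provider used only for this sketch.
struct EchoProvider {
    name: String,
}

impl LlmProvider for EchoProvider {
    fn generate(&self, prompt: &str) -> String {
        format!("{}: {}", self.name, prompt)
    }
}

/// Stand-in factory returning a shareable trait object.
fn create_provider(name: &str) -> Arc<dyn LlmProvider> {
    Arc::new(EchoProvider { name: name.to_string() })
}

/// Run one call per prompt on its own thread, sharing a single provider.
fn run_concurrently(provider: &Arc<dyn LlmProvider>, prompts: &[&str]) -> Vec<String> {
    let handles: Vec<_> = prompts
        .iter()
        .map(|p| {
            let provider = Arc::clone(provider); // cheap reference-count bump
            let prompt = p.to_string();
            // The method call goes through Arc's Deref to the trait object.
            thread::spawn(move || provider.generate(&prompt))
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect()
}
```

Each worker owns its own Arc clone, so no lifetime ties the spawned work to the caller's stack frame.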

Step 4: Update scrape_single_article to return 3-tuple

Change return type from (String, String) to (String, String, String) — (body_text, page_title, final_url):

async fn scrape_single_article(
    http_client: &reqwest::Client,
    url: &str,
    max_age_days: i64,
) -> (String, String, String) {
    match scraper::scrape_url(http_client, url).await {
        Ok(content) => {
            let final_url = content.url.clone();
            if !content.ok || content.is_soft_404 {
                return (String::new(), String::new(), final_url);
            }
            if scraper::is_article_too_old(content.published_date, max_age_days) {
                return (String::new(), String::new(), final_url);
            }
            let title = content.title.unwrap_or_default();
            (content.body_text, title, final_url)
        }
        Err(e) => {
            tracing::warn!(url = url, error = %e, "Failed to scrape URL");
            (String::new(), String::new(), url.to_string())
        }
    }
}

Step 5: Update callers of scrape_single_article

In scrape_articles: update spawn closure to return (cat_key, item, (scraped_content, page_title, final_url)). In result handler, use final_url for ScrapedNewsItem.url.

In scrape_flat_urls: update spawn closure to return (original_url, scraped_content, page_title, final_url). Use final_url for ScrapedNewsItem.url.

Step 6: Run tests + commit

cd backend && cargo test --lib
git add backend/src/services/scraper.rs backend/src/services/llm/factory.rs backend/src/services/llm/mod.rs backend/src/services/synthesis.rs backend/src/handlers/api_keys.rs
git commit -m "feat: ScrapedContent url+head_html fields, Arc<dyn LlmProvider>, 3-tuple scrape returns"

Task 3: LLM prompts and schemas for both extraction types

Files:

Modify: backend/src/services/prompts.rs
Modify: backend/src/services/llm/schema.rs

Step 1: Add build_link_extraction_prompt and build_article_extraction_prompt to prompts.rs

/// Build a prompt for LLM-assisted link extraction from a source page.
pub fn build_link_extraction_prompt(head_html: &str, body_html: &str) -> (String, String) {
    let system_prompt =
        "Tu es un assistant qui analyse des pages web. \
         Tu dois identifier les liens vers des articles d'actualite. \
         Reponds uniquement au format JSON demande."
            .to_string();

    let body_truncated: String = body_html.chars().take(8000).collect();

    let user_prompt = format!(
        "Voici le contenu HTML d'une page de blog ou de site d'actualites.\n\n\
         <head>\n{head}\n</head>\n\n\
         <body (extrait)>\n{body}\n</body>\n\n\
         Extrais UNIQUEMENT les URLs qui pointent vers des articles \
         (pas les liens de navigation, tags, categories, login, pages statiques, etc.).\n\
         Retourne les URLs completes dans le format JSON demande.",
        head = head_html,
        body = body_truncated,
    );

    (system_prompt, user_prompt)
}

/// Build a prompt for LLM-assisted article content extraction.
pub fn build_article_extraction_prompt(head_html: &str, body_text: &str) -> (String, String) {
    let system_prompt =
        "Tu es un assistant qui analyse des articles web. \
         Tu dois extraire les informations structurees de l'article. \
         Reponds uniquement au format JSON demande."
            .to_string();

    let user_prompt = format!(
        "Voici le contenu d'une page web.\n\n\
         <head>\n{head}\n</head>\n\n\
         Contenu textuel de la page :\n{body}\n\n\
         Extrais les informations suivantes :\n\
         - title : le titre de l'article\n\
         - published_date : la date de publication au format ISO 8601 (YYYY-MM-DDTHH:MM:SSZ), \
         ou une chaine vide si introuvable\n\
         - body_text : le contenu principal de l'article (pas la navigation, pas les pubs)\n\
         - is_error_page : true si c'est une page d'erreur/404, false sinon",
        head = head_html,
        body = body_text,
    );

    (system_prompt, user_prompt)
}

Note: build_link_extraction_prompt truncates body using .chars().take(8000) (UTF-8 safe).
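
Why character-based truncation matters can be shown with a small sketch. The helper names truncate_chars and truncate_bytes are hypothetical and exist only to contrast the two approaches.

```rust
/// Keep at most `max_chars` Unicode scalar values of `s`.
/// This never splits a multi-byte character.
fn truncate_chars(s: &str, max_chars: usize) -> String {
    s.chars().take(max_chars).collect()
}

/// Byte-based alternative: returns None when `max_bytes` falls inside a
/// multi-byte character (or past the end), where a plain `&s[..max_bytes]`
/// would panic.
fn truncate_bytes(s: &str, max_bytes: usize) -> Option<&str> {
    s.get(..max_bytes)
}
```

Note that 8000 characters can occupy up to 32,000 bytes in UTF-8, so a byte-length bound such as the one in link_extraction_prompt_truncates_body holds for ASCII input but not necessarily for accented or non-Latin text.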

Step 2: Add schemas to schema.rs

pub fn build_link_extraction_schema() -> Value {
    serde_json::json!({
        "type": "object",
        "properties": {
            "urls": {
                "type": "array",
                "items": { "type": "string" }
            }
        },
        "required": ["urls"],
        "additionalProperties": false
    })
}

pub fn build_article_extraction_schema() -> Value {
    serde_json::json!({
        "type": "object",
        "properties": {
            "title": { "type": "string", "description": "Article title" },
            "published_date": { "type": "string", "description": "ISO 8601 date or empty string if not found" },
            "body_text": { "type": "string", "description": "Main article content" },
            "is_error_page": { "type": "boolean", "description": "True if this is an error/404 page" }
        },
        "required": ["title", "published_date", "body_text", "is_error_page"],
        "additionalProperties": false
    })
}

Step 3: Add tests for prompts and schemas

In prompts.rs tests:

    #[test]
    fn link_extraction_prompt_includes_html() {
        let (sys, user) = build_link_extraction_prompt("<title>Blog</title>", "<a href='/post'>P</a>");
        assert!(user.contains("<title>Blog</title>"));
        assert!(user.contains("articles"));
        assert!(sys.contains("liens"));
    }

    #[test]
    fn link_extraction_prompt_truncates_body() {
        let long_body = "x".repeat(20000);
        let (_, user) = build_link_extraction_prompt("", &long_body);
        // Should not contain the full 20000 chars
        assert!(user.len() < 15000);
    }

    #[test]
    fn article_extraction_prompt_includes_content() {
        let (_, user) = build_article_extraction_prompt("<meta name='date'>", "Article body here");
        assert!(user.contains("Article body here"));
        assert!(user.contains("published_date"));
        assert!(user.contains("is_error_page"));
    }

In schema.rs tests:

    #[test]
    fn link_extraction_schema_has_urls_array() {
        let schema = build_link_extraction_schema();
        assert_eq!(schema["properties"]["urls"]["type"], "array");
        assert_eq!(schema["additionalProperties"], false);
    }

    #[test]
    fn article_extraction_schema_strict_mode_compatible() {
        let schema = build_article_extraction_schema();
        let props = schema["properties"].as_object().unwrap();
        assert!(props.contains_key("title"));
        assert!(props.contains_key("published_date"));
        assert!(props.contains_key("body_text"));
        assert!(props.contains_key("is_error_page"));
        assert_eq!(schema["additionalProperties"], false);
        // published_date is string (not ["string", "null"]) for OpenAI strict mode
        assert_eq!(props["published_date"]["type"], "string");
    }

Step 4: Run tests + commit

cd backend && cargo test --lib
git add backend/src/services/prompts.rs backend/src/services/llm/schema.rs
git commit -m "feat: add LLM prompts and schemas for link and article extraction"

Task 4: LLM-assisted source link extraction in source_scraper.rs

Files:

Modify: backend/src/services/source_scraper.rs

Step 1: Add extract_article_links_with_llm

use std::sync::Arc;
use crate::services::llm::LlmProvider;
use crate::services::llm::schema::build_link_extraction_schema;
use crate::services::prompts::build_link_extraction_prompt;

/// Extract article links using LLM analysis of the page HTML.
///
/// Falls back to heuristic extraction if the LLM call fails or returns empty.
pub async fn extract_article_links_with_llm(
    http_client: &reqwest::Client,
    source_url: &str,
    max_links: usize,
    provider: &Arc<dyn LlmProvider>,
    model: &str,
) -> Result<Vec<String>, AppError> {
    let base_url = Url::parse(source_url)
        .map_err(|e| AppError::BadRequest(format!("Invalid source URL: {}", e)))?;
    let base_domain = base_url.host_str().unwrap_or("").to_lowercase();

    let response = http_client.get(source_url).send().await.map_err(|e| {
        tracing::warn!(url = source_url, error = %e, "Failed to fetch source page");
        AppError::Internal(anyhow::anyhow!("Failed to fetch source page"))
    })?;

    if !response.status().is_success() {
        return Ok(Vec::new());
    }

    let html_text = response.text().await.map_err(|e| {
        AppError::Internal(anyhow::anyhow!("Failed to read source page body: {}", e))
    })?;

    let (head_html, body_html) = extract_head_and_body(&html_text);
    let (system, user) = build_link_extraction_prompt(&head_html, &body_html);
    let schema = build_link_extraction_schema();

    match provider.generate_rewrite_pass(model, &system, &user, &schema).await {
        Ok(llm_response) => {
            let urls: Vec<String> = llm_response
                .get("urls")
                .and_then(|u| u.as_array())
                .map(|arr| {
                    arr.iter()
                        .filter_map(|v| v.as_str())
                        .filter_map(|href| {
                            let resolved = base_url.join(href).ok()?;
                            if resolved.scheme() != "http" && resolved.scheme() != "https" {
                                return None;
                            }
                            if resolved.host_str()?.to_lowercase() != base_domain {
                                return None;
                            }
                            Some(resolved.to_string())
                        })
                        .collect()
                })
                .unwrap_or_default();

            if urls.is_empty() {
                tracing::warn!(url = source_url, "LLM returned no links, falling back to heuristic");
                let fallback = extract_links_from_html(&html_text, &base_url, &base_domain);
                Ok(fallback.into_iter().take(max_links).collect())
            } else {
                let mut seen = std::collections::HashSet::new();
                let deduped: Vec<String> = urls.into_iter().filter(|u| seen.insert(u.clone())).collect();
                Ok(deduped.into_iter().take(max_links).collect())
            }
        }
        Err(e) => {
            tracing::warn!(url = source_url, error = %e, "LLM link extraction failed, falling back");
            let fallback = extract_links_from_html(&html_text, &base_url, &base_domain);
            Ok(fallback.into_iter().take(max_links).collect())
        }
    }
}

/// Extract <head> section and first 8000 chars of <body> from HTML (UTF-8 safe).
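pub fn extract_head_and_body(html: &str) -> (String, String) {
    let head_start = html.find("<head").unwrap_or(0);
    // Search for </head> only after head_start so head_start <= head_end
    // always holds and the slice below cannot panic on malformed HTML.
    let head_end = html[head_start..]
        .find("</head>")
        .map(|i| head_start + i + "</head>".len())
        .unwrap_or(head_start);
    let head = &html[head_start..head_end];

    let body_start = html.find("<body").unwrap_or(head_end);
    // chars().take() truncates on character boundaries, never mid-codepoint.
    let body: String = html[body_start..].chars().take(8000).collect();

    (head.to_string(), body)
}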
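
The deduplication at the end of extract_article_links_with_llm can be looked at in isolation. This is a minimal sketch under the assumption that first-seen order should be preserved. The function name dedup_and_limit is hypothetical.

```rust
use std::collections::HashSet;

/// Drop repeated URLs while keeping first-seen order, then cap the list.
fn dedup_and_limit(urls: Vec<String>, max_links: usize) -> Vec<String> {
    let mut seen = HashSet::new();
    urls.into_iter()
        // HashSet::insert returns false for a value already present.
        .filter(|u| seen.insert(u.clone()))
        .take(max_links)
        .collect()
}
```

Applying take after the filter matters: capping first could leave fewer than max_links unique URLs even when more are available.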

Step 2: Add tests

    #[test]
    fn extract_head_and_body_splits_correctly() {
        let html = "<html><head><title>T</title></head><body><p>Content</p></body></html>";
        let (head, body) = extract_head_and_body(html);
        assert!(head.contains("<title>T</title>"));
        assert!(body.contains("<p>Content</p>"));
    }

    #[test]
    fn extract_head_and_body_truncates_body_safely() {
        let long_body = "x".repeat(20000);
        let html = format!("<head></head><body>{}</body>", long_body);
        let (_, body) = extract_head_and_body(&html);
        assert_eq!(body.chars().count(), 8000);
    }

Step 3: Run tests + commit

cd backend && cargo test --lib
git add backend/src/services/source_scraper.rs
git commit -m "feat: LLM-assisted source link extraction with heuristic fallback"

Task 5: LLM-assisted article extraction in synthesis pipeline

Files:

Modify: backend/src/services/synthesis.rs

Step 1: Add scrape_single_article_with_llm

This function receives the LLM provider via Arc and uses head_html from ScrapedContent:

async fn scrape_single_article_with_llm(
    http_client: &reqwest::Client,
    url: &str,
    max_age_days: i64,
    provider: Arc<dyn crate::services::llm::LlmProvider>,
    model: String,
) -> (String, String, String) {
    let content = match scraper::scrape_url(http_client, url).await {
        Ok(c) => c,
        Err(e) => {
            tracing::warn!(url = url, error = %e, "Failed to fetch URL for LLM extraction");
            return (String::new(), String::new(), url.to_string());
        }
    };

    let final_url = content.url.clone();

    if !content.ok || content.is_soft_404 {
        return (String::new(), String::new(), final_url);
    }

    let (system, user) = crate::services::prompts::build_article_extraction_prompt(
        &content.head_html,
        &content.body_text,
    );
    let schema = crate::services::llm::schema::build_article_extraction_schema();

    match provider.generate_rewrite_pass(&model, &system, &user, &schema).await {
        Ok(response) => {
            let title = response.get("title").and_then(|t| t.as_str()).unwrap_or("").to_string();
            let body = response.get("body_text").and_then(|b| b.as_str()).unwrap_or("").to_string();
            let is_error = response.get("is_error_page").and_then(|e| e.as_bool()).unwrap_or(false);
            let date_str = response.get("published_date").and_then(|d| d.as_str()).unwrap_or("");

            if is_error || body.trim().is_empty() {
                return (String::new(), String::new(), final_url);
            }

            if !date_str.is_empty() {
                if let Ok(date) = chrono::DateTime::parse_from_rfc3339(date_str) {
                    if scraper::is_article_too_old(Some(date.with_timezone(&chrono::Utc)), max_age_days) {
                        return (String::new(), String::new(), final_url);
                    }
                }
            }

            (body, title, final_url)
        }
        Err(e) => {
            tracing::warn!(url = url, error = %e, "LLM extraction failed, using heuristic fallback");
            if scraper::is_article_too_old(content.published_date, max_age_days) {
                return (String::new(), String::new(), final_url);
            }
            let title = content.title.unwrap_or_default();
            (content.body_text, title, final_url)
        }
    }
}

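Note: provider: Arc<dyn LlmProvider> and model: String are owned values with no borrowed data, so they satisfy 'static. Because the trait requires Send + Sync, the Arc is also Send and can be moved into spawned tasks.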
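
The decision logic of scrape_single_article_with_llm, stripped of I/O and of the age checks, can be sketched as a pure function. LlmArticle, HeuristicArticle and choose_extraction are hypothetical names. The real function works on the JSON response and on ScrapedContent.

```rust
/// Hypothetical, trimmed result of the LLM extraction call.
struct LlmArticle {
    title: String,
    body_text: String,
    is_error_page: bool,
}

/// Hypothetical, trimmed heuristic result (stands in for ScrapedContent).
struct HeuristicArticle {
    title: Option<String>,
    body_text: String,
}

/// Returns (body_text, title). Two empty strings mean "skip this article".
fn choose_extraction(
    llm: Result<LlmArticle, String>,
    heuristic: HeuristicArticle,
) -> (String, String) {
    match llm {
        // LLM answered but flagged an error page or produced no body: skip.
        Ok(a) if a.is_error_page || a.body_text.trim().is_empty() => {
            (String::new(), String::new())
        }
        // LLM answered with usable content: prefer it.
        Ok(a) => (a.body_text, a.title),
        // LLM call failed: fall back to what the heuristic scraper found.
        Err(_) => (heuristic.body_text, heuristic.title.unwrap_or_default()),
    }
}
```

Keeping this logic separate from the network call would make the fallback behavior unit-testable without a live provider.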

Step 2: Update scrape_flat_urls and scrape_articles for LLM dispatch

Add a parameter llm: Option<(Arc<dyn LlmProvider>, String)> to both functions. When Some, use scrape_single_article_with_llm instead of scrape_single_article. Set max_concurrent = 5 when LLM is enabled, 10 otherwise.

In the spawn closures, clone the Arc and String:

if let Some((ref provider, ref model)) = llm {
    let provider = Arc::clone(provider);
    let model = model.clone();
    join_set.spawn(async move {
        let scraped = scrape_single_article_with_llm(&client, &url, mad, provider, model).await;
        // ...
    });
} else {
    join_set.spawn(async move {
        let scraped = scrape_single_article(&client, &url, mad).await;
        // ...
    });
}

Add progress reporting for LLM extraction:

let progress_label = if llm.is_some() {
    format!("Extraction IA des articles ({}/{})...", completed, total)
} else {
    format!("Verification des sources ({}/{})...", completed, total)
};
emit_progress(tx, "scraping", &progress_label, pct as u8);
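
The concurrency cap and the progress label both depend only on whether LLM extraction is on, so they can be sketched as two small helpers. The names max_concurrent and progress are hypothetical, and the percentage formula is an assumption, since the plan only shows pct as u8.

```rust
/// Concurrency cap from the plan: 5 with LLM extraction, 10 without.
fn max_concurrent(llm_enabled: bool) -> usize {
    if llm_enabled { 5 } else { 10 }
}

/// Progress label plus a percentage clamped to 0..=100.
/// The percentage formula is an assumption made for this sketch.
fn progress(llm_enabled: bool, completed: usize, total: usize) -> (String, u8) {
    let label = if llm_enabled {
        format!("Extraction IA des articles ({}/{})...", completed, total)
    } else {
        format!("Verification des sources ({}/{})...", completed, total)
    };
    // Guard against division by zero when there is nothing to scrape.
    let pct = if total == 0 {
        100
    } else {
        (completed * 100 / total).min(100)
    };
    (label, pct as u8)
}
```

Clamping before the cast avoids a silent wrap-around if completed ever exceeds total.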

Step 3: Update run_generation_inner to pass LLM params

In Phase 1 and Phase 2 scraping calls, pass the LLM option:

let llm_for_scraping = if settings.use_llm_for_article_extraction {
    Some((Arc::clone(&provider), model_research.clone()))
} else {
    None
};

Pass llm_for_scraping to scrape_flat_urls and scrape_articles.

Similarly for source link extraction:

if settings.use_llm_for_source_links {
    source_scraper::extract_article_links_with_llm(
        &state.http_client, &source.url, max_links_per_source,
        &provider, &model_research,
    ).await
} else {
    source_scraper::extract_article_links(
        &state.http_client, &source.url, max_links_per_source,
    ).await
}

Step 4: Run tests + commit

cd backend && cargo test --lib
git add backend/src/services/synthesis.rs
git commit -m "feat: LLM-assisted article extraction with Arc provider and heuristic fallback"

Task 6: Frontend settings

Files:

Modify: frontend/src/types.ts
Modify: frontend/src/i18n/fr.ts
Modify: frontend/src/pages/Settings.tsx

Step 1: Add fields to types + DEFAULT_SETTINGS

// In UserSettings interface:
use_llm_for_source_links: boolean;
use_llm_for_article_extraction: boolean;

// In DEFAULT_SETTINGS:
use_llm_for_source_links: false,
use_llm_for_article_extraction: false,

Step 2: Add i18n labels

'settings.advancedExtraction': 'Extraction avancee',
'settings.useLlmForSourceLinks': "Utiliser l'IA pour extraire les liens",
'settings.useLlmForArticleExtraction': "Utiliser l'IA pour extraire le contenu",

Step 3: Add checkboxes in Settings page

Add after the generation settings grid, before the search agent behavior section:

          {/* Advanced extraction */}
          <div class="mt-6">
            <h3 class="text-lg font-medium text-gray-900 mb-4">
              {t('settings.advancedExtraction')}
            </h3>
            <div class="space-y-4">
              <div class="flex items-center">
                <input
                  type="checkbox"
                  id="useLlmSourceLinks"
                  checked={settings().use_llm_for_source_links}
                  onChange={(e) =>
                    setSettings((prev) => ({
                      ...prev,
                      use_llm_for_source_links: e.currentTarget.checked,
                    }))
                  }
                  class="h-4 w-4 text-indigo-600 focus:ring-indigo-500 border-gray-300 rounded"
                />
                <label for="useLlmSourceLinks" class="ml-2 block text-sm text-gray-700">
                  {t('settings.useLlmForSourceLinks')}
                </label>
              </div>
              <div class="flex items-center">
                <input
                  type="checkbox"
                  id="useLlmArticleExtraction"
                  checked={settings().use_llm_for_article_extraction}
                  onChange={(e) =>
                    setSettings((prev) => ({
                      ...prev,
                      use_llm_for_article_extraction: e.currentTarget.checked,
                    }))
                  }
                  class="h-4 w-4 text-indigo-600 focus:ring-indigo-500 border-gray-300 rounded"
                />
                <label for="useLlmArticleExtraction" class="ml-2 block text-sm text-gray-700">
                  {t('settings.useLlmForArticleExtraction')}
                </label>
              </div>
            </div>
          </div>

Step 4: Run frontend tests + commit

cd frontend && npx tsc --noEmit && npx vitest run
git add frontend/src/types.ts frontend/src/i18n/fr.ts frontend/src/pages/Settings.tsx
git commit -m "feat: add LLM scraping toggles to Settings page"

Task 7: Update E2E test with comprehensive synthesis validation

Files:

Modify: e2e/tests/generation-live.spec.ts

Step 1: Update settings payload

Add to the PUT settings body:

use_llm_for_source_links: false,
use_llm_for_article_extraction: false,

Step 2: Add comprehensive validation using request fixture

Update the test function signature to include the request fixture:

test('full generation pipeline produces valid synthesis', async ({
    page,
    request,
}) => {

Add after existing structure validation:

    // Comprehensive synthesis validation
    const allUrls: string[] = [];
    const domainCounts: Record<string, number> = {};

    for (const section of synthesis.sections) {
      for (const item of section.items) {
        allUrls.push(item.url);
        try {
          const domain = new URL(item.url).hostname;
          domainCounts[domain] = (domainCounts[domain] || 0) + 1;
        } catch {}
      }

      // Category article count check (including Autre)
      expect(section.items.length).toBeLessThanOrEqual(4); // max_items_per_category
    }

    // No duplicate URLs across all sections
    const uniqueUrls = new Set(allUrls);
    expect(uniqueUrls.size).toBe(allUrls.length);

    // No domain exceeds max_articles_per_source (3)
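    for (const [domain, count] of Object.entries(domainCounts)) {
      // Include the domain in the failure message so a violation is easy to trace
      expect(count, `too many articles from ${domain}`).toBeLessThanOrEqual(3);
    }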

    // Verify a sample of article links actually work (using Playwright request API, no CORS issues)
    const sampleUrls = allUrls.slice(0, 3);
    for (const articleUrl of sampleUrls) {
      const resp = await request.head(articleUrl);
      expect(resp.status()).toBeGreaterThanOrEqual(200);
      expect(resp.status()).toBeLessThan(400);
    }

Step 3: Run E2E test

cd e2e && docker compose -f docker-compose.test.yml down
docker compose -f docker-compose.test.yml up --build -d
sleep 25 && npx tsx seed.ts && npx playwright test generation-live --reporter=list

Step 4: Commit

git add e2e/tests/generation-live.spec.ts
git commit -m "test: comprehensive E2E synthesis validation (duplicates, links, counts, domains)"

Task 8: Update integration test

Files:

Modify: backend/tests/api_syntheses_test.rs

Step 1: Update settings payload in generate_pipeline_resolves_model_from_admin_config

Add the new boolean fields to the PUT settings body:

"use_llm_for_source_links": false,
"use_llm_for_article_extraction": false,

Step 2: Run integration test compilation check + commit

cd backend && cargo test --no-run
git add backend/tests/api_syntheses_test.rs
git commit -m "test: update integration test with LLM scraping settings"
