Site Search Fallback for Failed Sources

Date: 2026-04-03
Status: Approved

Summary

When a personalized source yields 0 links from both its RSS feed and HTML extraction (e.g., Cloudflare-protected sites, JS-only pages), the pipeline automatically falls back to a site:{domain} {theme} search to discover articles from that source. The search uses the Brave Search API if available, otherwise LLM websearch. It runs inline in the Phase 1 spawn for each source and is transparent to the user.

Motivation

Inspired by Claude Chat's behavior when scraping Cloudflare-protected sites: it fell back to site:korben.info intelligence artificielle via web search and successfully found articles. Our pipeline currently yields 0 links and moves on, losing that source entirely.

Design Decisions

| Decision | Choice | Rationale |
|---|---|---|
| Trigger condition | RSS + HTML both return 0 links | Only when both primary strategies fail, not for sparse results |
| Integration point | Inline in Phase 1 join_set.spawn | No new phase, reuses existing parallel processing, minimal pipeline change |
| Search provider | Brave if available, LLM websearch fallback | Maximizes coverage: works with or without a Brave API key |
| Query format | site:{domain} {theme} | Targets the specific source domain with thematic filtering |
| Max results | max_links_per_source setting | Consistent with RSS/HTML extraction limits |
| New settings | None | Uses existing use_brave_search + Brave API key |
| Frontend/API changes | None | Feature is transparent to the user |

New Service: site_search.rs

Location: backend/src/services/site_search.rs

Responsibility: Execute a site:{domain} {theme} search via Brave API or LLM websearch and return article URLs.

Public API

pub struct SiteSearchConfig {
    pub domain: String,        // e.g., "korben.info"
    pub theme: String,         // e.g., "intelligence artificielle"
    pub max_results: usize,    // = max_links_per_source
    pub max_age_days: i32,     // for Brave freshness filter
}

pub enum SiteSearchProvider {
    Brave { api_key: String },
    Llm {
        provider: Arc<dyn LlmProvider>,
        model: String,
    },
}

/// Execute a site-scoped search, returning article URLs.
/// Returns an empty Vec on failure (silent fallback).
pub async fn search(
    http_client: &reqwest::Client,
    config: &SiteSearchConfig,
    provider: &SiteSearchProvider,
) -> Vec<String>

Brave Path

  • Calls brave_search::search with query site:{domain} {theme}, count = max_results, freshness based on max_age_days.
  • Extracts URLs from BraveResult entries.
  • Filters out URLs that don't match the target domain (safety check).
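
The query construction and the domain safety check could look roughly like the sketch below, which uses only the standard library. The helper names (build_query, host_of, matches_domain, filter_to_domain) are illustrative and not taken from the codebase; host_of is a simplified stand-in for the existing extract_domain() helper, and accepting subdomains such as www.korben.info is an assumption the spec does not state.

```rust
/// Build the site-scoped query: `site:{domain} {theme}`.
fn build_query(domain: &str, theme: &str) -> String {
    format!("site:{} {}", domain.trim(), theme.trim())
}

/// Extract the lowercased host of an absolute http(s) URL.
/// Hypothetical, simplified stand-in for the project's `extract_domain()`.
fn host_of(url: &str) -> Option<String> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    // The authority ends at the first '/', '?' or '#'.
    let authority = rest.split(|c: char| matches!(c, '/' | '?' | '#')).next()?;
    // Drop userinfo ("user@") and port (":443") if present.
    let host = authority.rsplit('@').next()?;
    let host = host.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host.to_ascii_lowercase())
    }
}

/// True when `url` is served from `domain` or one of its subdomains (assumption).
fn matches_domain(url: &str, domain: &str) -> bool {
    let domain = domain.trim_start_matches("www.").to_ascii_lowercase();
    match host_of(url) {
        Some(host) => host == domain || host.ends_with(&format!(".{}", domain)),
        None => false,
    }
}

/// Keep only URLs on the target domain, capped at `max_results`.
fn filter_to_domain(urls: Vec<String>, domain: &str, max_results: usize) -> Vec<String> {
    urls.into_iter()
        .filter(|u| matches_domain(u, domain))
        .take(max_results)
        .collect()
}
```

Comparing hosts rather than searching for the domain as a substring keeps URLs like https://evil.example/korben.info out of the result set.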

LLM Path

Prompt sent to the LLM with the websearch model:

Trouve les {max_results} articles les plus récents publiés sur le site {domain}
à propos de "{theme}".

Retourne uniquement un tableau JSON d'URLs, sans explication :
["https://...", "https://...", ...]

Critères :
- Articles publiés dans les {max_age_days} derniers jours
- URLs complètes pointant vers des pages d'articles (pas de pages catégorie, tag, ou accueil)
- Uniquement des URLs du domaine {domain}

Response handling:

  • Parses the JSON array from the LLM response.
  • Filters URLs to only keep those matching the target domain (protection against LLM hallucinations).
  • Returns empty Vec if parsing fails or LLM returns non-JSON.
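
As an illustration of the parsing step, here is a dependency-free sketch that pulls the string elements out of the first JSON array in a response and returns an empty Vec for anything it cannot read. The function name parse_url_array is hypothetical, and this is not a full JSON parser; a real implementation would more likely rely on a JSON library. Domain filtering is assumed to happen afterwards as a separate step.

```rust
/// Extract http(s) URLs from the JSON array found in `response`.
/// Tolerates prose or code fences around the array; returns an empty Vec
/// when no well-formed array of strings is found.
fn parse_url_array(response: &str) -> Vec<String> {
    let start = match response.find('[') {
        Some(i) => i,
        None => return Vec::new(),
    };
    let end = match response.rfind(']') {
        Some(i) if i > start => i,
        _ => return Vec::new(),
    };

    let mut urls = Vec::new();
    let mut current = String::new();
    let mut in_string = false;
    let mut escaped = false;

    for c in response[start + 1..end].chars() {
        if !in_string {
            match c {
                '"' => in_string = true,
                ',' => {} // separator between elements
                c if c.is_whitespace() => {}
                // Numbers, objects, nested arrays: not a plain list of URLs.
                _ => return Vec::new(),
            }
        } else if escaped {
            // Keep the escaped character itself (covers `\/` and `\"`).
            current.push(c);
            escaped = false;
        } else if c == '\\' {
            escaped = true;
        } else if c == '"' {
            in_string = false;
            urls.push(std::mem::take(&mut current));
        } else {
            current.push(c);
        }
    }

    // An unterminated string means the array was malformed.
    if in_string {
        return Vec::new();
    }
    // Drop entries that are not absolute http(s) URLs.
    urls.retain(|u| u.starts_with("https://") || u.starts_with("http://"));
    urls
}
```

Returning an empty Vec on every failure path matches the silent-fallback contract of search: the caller cannot tell a parse failure from "no articles found", and does not need to.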

Error Handling

No errors propagated — this is a fallback. All failures result in an empty Vec with tracing::warn! logging.
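
One way to express this rule is a small adapter that turns any provider error into an empty result. This is a sketch only: the name or_empty is hypothetical, and eprintln! stands in for tracing::warn! so the example needs no external crates.

```rust
use std::fmt::Display;

/// Collapse a provider error into an empty result so the caller never sees it.
/// `eprintln!` is a stand-in for `tracing::warn!` in this sketch.
fn or_empty<E: Display>(domain: &str, result: Result<Vec<String>, E>) -> Vec<String> {
    match result {
        Ok(urls) => urls,
        Err(err) => {
            eprintln!("site_search failed for {}: {}", domain, err);
            Vec::new()
        }
    }
}
```

Both the Brave path and the LLM path could funnel their internal Result through such an adapter, which keeps the public signature at Vec<String>.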

Pipeline Integration

Phase 1 Spawn Modification (synthesis/mod.rs)

Current flow per source (in spawn):

RSS (>= 3 entries) → use RSS links
  else → HTML extraction → use HTML links (may be 0)

New flow per source (in spawn):

RSS (>= 3 entries) → use RSS links
  else → HTML extraction
    if HTML > 0 links → use HTML links
    if HTML == 0 links → site_search(domain, theme)
      if site_search > 0 links → use those
      else → source contributes 0 links
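
The new flow can be sketched as a synchronous function in which each later strategy is a closure that only runs when the previous one came up short. The names LinkOrigin, MIN_RSS_ENTRIES, and select_links are illustrative; the real code runs asynchronously inside the spawn.

```rust
/// Where the links for a source ended up coming from.
#[derive(Debug, PartialEq)]
enum LinkOrigin {
    Rss,
    Html,
    SiteSearch,
    Nothing,
}

/// Minimum number of RSS entries for the feed to be used directly.
const MIN_RSS_ENTRIES: usize = 3;

/// Pick the links for one source following the new flow.
/// `extract_html` and `site_search` are called lazily, in that order.
fn select_links(
    rss_links: Vec<String>,
    extract_html: impl FnOnce() -> Vec<String>,
    site_search: impl FnOnce() -> Vec<String>,
) -> (LinkOrigin, Vec<String>) {
    if rss_links.len() >= MIN_RSS_ENTRIES {
        return (LinkOrigin::Rss, rss_links);
    }
    let html_links = extract_html();
    if !html_links.is_empty() {
        return (LinkOrigin::Html, html_links);
    }
    let found = site_search();
    if found.is_empty() {
        (LinkOrigin::Nothing, found)
    } else {
        (LinkOrigin::SiteSearch, found)
    }
}
```

As drawn in the flow, a feed with only one or two entries is set aside once the RSS threshold is missed, so the search fallback triggers on the HTML result alone.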

SiteSearchProvider Construction

Built once before the wave_loop, after LLM provider resolution:

// Resolve the provider once; every spawn shares it through the Arc.
let site_search_provider = if settings.use_brave_search {
    match resolve_brave_key(state, user_id).await {
        Ok(key) => SiteSearchProvider::Brave { api_key: key },
        // Brave is enabled but no usable key was found: use LLM websearch.
        Err(_) => SiteSearchProvider::Llm { provider, model: model_websearch },
    }
} else {
    SiteSearchProvider::Llm { provider, model: model_websearch }
};
let site_search_provider = Arc::new(site_search_provider);

Wrapped in Arc and cloned into each spawn. The Brave key is resolved once (DB + decryption), not per-spawn.
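
The sharing pattern can be illustrated with standard-library threads standing in for the async tasks. This is a simplified sketch: Provider is a reduced, hypothetical version of SiteSearchProvider (its LLM variant carries only a model name), and describe and run_sources are illustrative names.

```rust
use std::sync::Arc;
use std::thread;

/// Reduced stand-in for the spec's `SiteSearchProvider`.
enum Provider {
    Brave { api_key: String },
    Llm { model: String },
}

/// Read-only use of the shared provider inside a worker.
fn describe(provider: &Provider) -> String {
    match provider {
        Provider::Brave { api_key } => format!("brave (key of {} chars)", api_key.len()),
        Provider::Llm { model } => format!("llm:{}", model),
    }
}

/// Spawn one worker per source; each gets its own `Arc` handle to the same provider.
fn run_sources(provider: Arc<Provider>, sources: &[&str]) -> Vec<String> {
    let handles: Vec<_> = sources
        .iter()
        .map(|source| {
            let provider = Arc::clone(&provider); // cheap: bumps a reference count
            let source = (*source).to_string();
            thread::spawn(move || format!("{} via {}", source, describe(&provider)))
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect()
}
```

Cloning the Arc copies a pointer rather than the key or the model name, which is why resolving the Brave key once up front is enough.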

Data Passed to Spawn (additions)

  • site_search_provider: Arc<SiteSearchProvider>, cloned per spawn
  • theme_text: String, from theme.theme.clone()
  • max_age_days: i32, from theme.max_age_days

The domain is extracted from source_url inside the spawn via the existing extract_domain() helper.

Article History

Articles found via site_search use source_type = "site_search" in the article_history table (existing TEXT column, no migration needed). This distinguishes them from "personalized_source" and "brave_search" entries.

No Changes

  • No new settings: uses existing use_brave_search and Brave API key
  • No frontend changes: feature is transparent
  • No API changes: source and synthesis endpoints unchanged
  • No DB migration: source_type is a free-form TEXT column
  • Phase 2 unchanged: continues to fill category gaps after Phase 1

Observability

  • tracing::info! when site_search fallback triggers for a source (logs domain and result count)
  • tracing::warn! when site_search also fails (for diagnosis)
  • source_type = "site_search" in article_history for stats/debugging

Testing Strategy

  • Unit tests for site_search::search: mock Brave API response, mock LLM response, domain filtering, empty results, malformed LLM response
  • Unit tests for LLM response parsing: valid JSON array, mixed valid/invalid URLs, non-JSON response, wrong domain URLs filtered
  • Integration test: pipeline with a source that returns 0 links from RSS+HTML, verify site_search fallback kicks in and produces articles with source_type = "site_search" in article_history