From a09973f5699472148c3f3cccbfd80579b528da4b Mon Sep 17 00:00:00 2001
From: oabrivard
Date: Sat, 4 Apr 2026 00:00:29 +0200
Subject: [PATCH] docs: add site search fallback design spec

Spec for automatic site:{domain} search fallback when RSS + HTML extraction both return 0 links for a personalized source. Uses Brave Search or LLM websearch. Inline in Phase 1 spawn.

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 .../2026-04-03-site-search-fallback-design.md | 158 ++++++++++++++++++
 1 file changed, 158 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-04-03-site-search-fallback-design.md

diff --git a/docs/superpowers/specs/2026-04-03-site-search-fallback-design.md b/docs/superpowers/specs/2026-04-03-site-search-fallback-design.md
new file mode 100644
index 0000000..fcf3069
--- /dev/null
+++ b/docs/superpowers/specs/2026-04-03-site-search-fallback-design.md
@@ -0,0 +1,158 @@
+# Site Search Fallback for Failed Sources
+
+**Date:** 2026-04-03
+**Status:** Approved
+
+## Summary
+
+When a personalized source yields 0 links from both RSS feed and HTML extraction (e.g., Cloudflare-protected sites, JS-only pages), automatically fall back to a `site:{domain} {theme}` search to discover articles from that source. Uses Brave Search API if available, otherwise LLM websearch. Integrated inline in the Phase 1 spawn per source, transparent to the user.
+
+## Motivation
+
+Inspired by Claude Chat's behavior when scraping Cloudflare-protected sites: it fell back to `site:korben.info intelligence artificielle` via web search and successfully found articles. Our pipeline currently yields 0 links and moves on, losing that source entirely.
+
+## Design Decisions
+
+| Decision | Choice | Rationale |
+|---|---|---|
+| Trigger condition | RSS + HTML both return 0 links | Only when both primary strategies fail — not for sparse results |
+| Integration point | Inline in Phase 1 `join_set.spawn` | No new phase, reuses existing parallel processing, minimal pipeline change |
+| Search provider | Brave if available, LLM websearch fallback | Maximizes coverage — works with or without Brave API key |
+| Query format | `site:{domain} {theme}` | Targets the specific source domain with thematic filtering |
+| Max results | `max_links_per_source` setting | Consistent with RSS/HTML extraction limits |
+| New settings | None | Uses existing `use_brave_search` + Brave API key |
+| Frontend/API changes | None | Feature is transparent to the user |
+
+## New Service: `site_search.rs`
+
+**Location:** `backend/src/services/site_search.rs`
+
+**Responsibility:** Execute a `site:{domain} {theme}` search via Brave API or LLM websearch and return article URLs.
+
+### Public API
+
+```rust
+pub struct SiteSearchConfig {
+    pub domain: String, // e.g., "korben.info"
+    pub theme: String, // e.g., "intelligence artificielle"
+    pub max_results: usize, // = max_links_per_source
+    pub max_age_days: i32, // for Brave freshness filter
+}
+
+pub enum SiteSearchProvider {
+    Brave { api_key: String },
+    Llm {
+        provider: Arc,
+        model: String,
+    },
+}
+
+/// Execute a site-scoped search, returning article URLs.
+/// Returns an empty Vec on failure (silent fallback).
+pub async fn search(
+    http_client: &reqwest::Client,
+    config: &SiteSearchConfig,
+    provider: &SiteSearchProvider,
+) -> Vec
+```
+
+### Brave Path
+
+- Calls `brave_search::search` with query `site:{domain} {theme}`, count = `max_results`, freshness based on `max_age_days`.
+- Extracts URLs from `BraveResult` entries.
+- Filters out URLs that don't match the target domain (safety check).
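The query construction and the domain safety check from the Brave path could look roughly like the sketch below. It is illustrative only: `build_site_query`, `host_of`, `matches_domain`, and `filter_to_domain` are hypothetical names that do not appear in the spec, URLs are assumed to be plain `String`s, and treating subdomains such as `www.` as matching the target domain is an assumption. The host is parsed by hand to keep the example free of external crates; a real implementation might reuse the existing `extract_domain()` helper instead.

```rust
/// Build the site-scoped query, e.g. "site:korben.info intelligence artificielle".
/// (Hypothetical helper, not part of the spec.)
fn build_site_query(domain: &str, theme: &str) -> String {
    format!("site:{} {}", domain.trim(), theme.trim())
}

/// Extract the lowercase host from an absolute http(s) URL.
/// Simplified on purpose: lowercase schemes only, no IPv6 literals.
fn host_of(url: &str) -> Option<String> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    // The authority ends at the first '/', '?' or '#'.
    let authority = rest
        .split(|c: char| matches!(c, '/' | '?' | '#'))
        .next()?;
    // Drop optional userinfo ("user@") and port (":8080").
    let host = authority.rsplit('@').next()?;
    let host = host.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host.to_ascii_lowercase())
    }
}

/// True when the URL's host is the target domain or one of its subdomains.
/// Accepting subdomains is an assumption of this sketch.
fn matches_domain(url: &str, domain: &str) -> bool {
    let domain = domain.to_ascii_lowercase();
    match host_of(url) {
        Some(host) => host == domain || host.ends_with(&format!(".{}", domain)),
        None => false,
    }
}

/// Keep only URLs on the target domain, capped at `max_results`.
fn filter_to_domain(urls: Vec<String>, domain: &str, max_results: usize) -> Vec<String> {
    urls.into_iter()
        .filter(|u| matches_domain(u, domain))
        .take(max_results)
        .collect()
}
```

Matching on the parsed host rather than on a substring of the URL is what rejects look-alikes such as `korben.info.evil.com` or `notkorben.info`, which a naive `contains` check would let through.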
+
+### LLM Path
+
+Prompt sent to the LLM with the websearch model:
+
+```
+Trouve les {max_results} articles les plus récents publiés sur le site {domain}
+à propos de "{theme}".
+
+Retourne uniquement un tableau JSON d'URLs, sans explication :
+["https://...", "https://...", ...]
+
+Critères :
+- Articles publiés dans les {max_age_days} derniers jours
+- URLs complètes pointant vers des pages d'articles (pas de pages catégorie, tag, ou accueil)
+- Uniquement des URLs du domaine {domain}
+```
+
+- Parses the JSON array from the LLM response.
+- Filters URLs to only keep those matching the target domain (protection against LLM hallucinations).
+- Returns empty Vec if parsing fails or LLM returns non-JSON.
+
+### Error Handling
+
+No errors propagated — this is a fallback. All failures result in an empty Vec with `tracing::warn!` logging.
+
+## Pipeline Integration
+
+### Phase 1 Spawn Modification (`synthesis/mod.rs`)
+
+**Current flow per source (in spawn):**
+```
+RSS (>= 3 entries) → use RSS links
+  else → HTML extraction → use HTML links (may be 0)
+```
+
+**New flow per source (in spawn):**
+```
+RSS (>= 3 entries) → use RSS links
+  else → HTML extraction
+    if HTML > 0 links → use HTML links
+    if HTML == 0 links → site_search(domain, theme)
+      if site_search > 0 links → use those
+      else → source contributes 0 links
+```
+
+### SiteSearchProvider Construction
+
+Built once before the wave_loop, after LLM provider resolution:
+
+```rust
+// Prefer Brave when enabled and a key can be resolved; otherwise use LLM websearch.
+let site_search_provider = if settings.use_brave_search {
+    match resolve_brave_key(state, user_id).await {
+        Ok(key) => SiteSearchProvider::Brave { api_key: key },
+        // The enum field is `model`, so the websearch model is bound explicitly.
+        Err(_) => SiteSearchProvider::Llm { provider, model: model_websearch },
+    }
+} else {
+    SiteSearchProvider::Llm { provider, model: model_websearch }
+};
+let site_search_provider = Arc::new(site_search_provider);
+```
+
+Wrapped in `Arc` and cloned into each spawn. The Brave key is resolved once (DB + decryption), not per-spawn.
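The new per-source flow described under "Phase 1 Spawn Modification" can be sketched as a small decision function. This is a synchronous simplification of what would be async work inside the spawn, and `LinkOrigin`, `MIN_RSS_ENTRIES`, and `select_links` are hypothetical names introduced only for illustration. The HTML and site-search steps are passed as closures so that each one runs only if the previous strategy came up short.

```rust
/// Where a source's links came from (hypothetical enum for this sketch).
#[derive(Debug, PartialEq)]
enum LinkOrigin {
    Rss,
    Html,
    SiteSearch,
    Empty,
}

/// Threshold taken from the flow diagram: RSS is used when it has >= 3 entries.
const MIN_RSS_ENTRIES: usize = 3;

/// Pick the links for one source, evaluating each fallback lazily.
fn select_links(
    rss: Vec<String>,
    html: impl FnOnce() -> Vec<String>,
    site_search: impl FnOnce() -> Vec<String>,
) -> (LinkOrigin, Vec<String>) {
    if rss.len() >= MIN_RSS_ENTRIES {
        return (LinkOrigin::Rss, rss);
    }
    // RSS was too sparse: try HTML extraction.
    let html_links = html();
    if !html_links.is_empty() {
        return (LinkOrigin::Html, html_links);
    }
    // Both primary strategies returned nothing: site-scoped search fallback.
    let found = site_search();
    if found.is_empty() {
        (LinkOrigin::Empty, found)
    } else {
        (LinkOrigin::SiteSearch, found)
    }
}
```

Returning the origin alongside the links is one possible way to decide later whether an article is recorded with `source_type = "site_search"`; the spec does not say how that information is actually carried through the pipeline.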
+
+### Data Passed to Spawn (additions)
+
+- `site_search_provider: Arc<SiteSearchProvider>` — cloned per spawn
+- `theme_text: String` — `theme.theme.clone()`
+- `max_age_days: i32` — `theme.max_age_days`
+
+The domain is extracted from `source_url` inside the spawn via the existing `extract_domain()` helper.
+
+### Article History
+
+Articles found via site_search use `source_type = "site_search"` in the article_history table (existing TEXT column, no migration needed). This distinguishes them from `"personalized_source"` and `"brave_search"` entries.
+
+## No Changes
+
+- **No new settings** — uses existing `use_brave_search` and Brave API key
+- **No frontend changes** — feature is transparent
+- **No API changes** — source and synthesis endpoints unchanged
+- **No DB migration** — `source_type` is a free-form TEXT column
+- **Phase 2 unchanged** — continues to fill category gaps after Phase 1
+
+## Observability
+
+- `tracing::info!` when site_search fallback triggers for a source (logs domain and result count)
+- `tracing::warn!` when site_search also fails (for diagnosis)
+- `source_type = "site_search"` in article_history for stats/debugging
+
+## Testing Strategy
+
+- **Unit tests** for `site_search::search`: mock Brave API response, mock LLM response, domain filtering, empty results, malformed LLM response
+- **Unit tests** for LLM response parsing: valid JSON array, mixed valid/invalid URLs, non-JSON response, wrong domain URLs filtered
+- **Integration test**: pipeline with a source that returns 0 links from RSS+HTML, verify site_search fallback kicks in and produces articles with `source_type = "site_search"` in article_history