docs: add site search fallback design spec
Spec for automatic site:{domain} search fallback when RSS + HTML
extraction both return 0 links for a personalized source. Uses
Brave Search or LLM websearch. Inline in Phase 1 spawn.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Branch: master
Parent: 1cb7bf6c6f
Commit: a09973f569
@@ -0,0 +1,158 @@

# Site Search Fallback for Failed Sources

**Date:** 2026-04-03
**Status:** Approved

## Summary

When a personalized source yields 0 links from both its RSS feed and HTML extraction (e.g., Cloudflare-protected sites, JS-only pages), the pipeline automatically falls back to a `site:{domain} {theme}` search to discover articles from that source. The fallback uses the Brave Search API if available, otherwise LLM websearch. It runs inline in the Phase 1 spawn for each source and is transparent to the user.

## Motivation

Inspired by Claude Chat's behavior when scraping Cloudflare-protected sites: it fell back to `site:korben.info intelligence artificielle` via web search and successfully found articles. Our pipeline currently yields 0 links and moves on, losing that source entirely.

## Design Decisions

| Decision | Choice | Rationale |
|---|---|---|
| Trigger condition | RSS + HTML both return 0 links | Only when both primary strategies fail, not for sparse results |
| Integration point | Inline in Phase 1 `join_set.spawn` | No new phase, reuses existing parallel processing, minimal pipeline change |
| Search provider | Brave if available, LLM websearch fallback | Maximizes coverage: works with or without a Brave API key |
| Query format | `site:{domain} {theme}` | Targets the specific source domain with thematic filtering |
| Max results | `max_links_per_source` setting | Consistent with RSS/HTML extraction limits |
| New settings | None | Uses existing `use_brave_search` + Brave API key |
| Frontend/API changes | None | Feature is transparent to the user |

## New Service: `site_search.rs`

**Location:** `backend/src/services/site_search.rs`

**Responsibility:** Execute a `site:{domain} {theme}` search via Brave API or LLM websearch and return article URLs.

### Public API

```rust
pub struct SiteSearchConfig {
    pub domain: String,      // e.g., "korben.info"
    pub theme: String,       // e.g., "intelligence artificielle"
    pub max_results: usize,  // = max_links_per_source
    pub max_age_days: i32,   // for Brave freshness filter
}

pub enum SiteSearchProvider {
    Brave { api_key: String },
    Llm {
        provider: Arc<dyn LlmProvider>,
        model: String,
    },
}

/// Execute a site-scoped search, returning article URLs.
/// Returns an empty Vec on failure (silent fallback).
pub async fn search(
    http_client: &reqwest::Client,
    config: &SiteSearchConfig,
    provider: &SiteSearchProvider,
) -> Vec<String>
```

### Brave Path

- Calls `brave_search::search` with query `site:{domain} {theme}`, count = `max_results`, freshness based on `max_age_days`.
- Extracts URLs from `BraveResult` entries.
- Filters out URLs that don't match the target domain (safety check).

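The query construction and the domain safety check could look roughly like the following. This is a minimal std-only sketch, not the actual service code: `build_site_query`, `host_of`, `matches_domain`, and `filter_to_domain` are hypothetical names, and accepting subdomains such as `www.korben.info` is an assumption the spec does not state.

```rust
/// Build the site-scoped query string described in the spec.
fn build_site_query(domain: &str, theme: &str) -> String {
    format!("site:{} {}", domain.trim(), theme.trim())
}

/// Extract the lowercase host from an absolute http(s) URL.
/// Hypothetical helper; returns None for anything that is not http(s).
fn host_of(url: &str) -> Option<String> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    // The authority ends at the first '/', '?' or '#'.
    let authority = rest.split(|c: char| c == '/' || c == '?' || c == '#').next()?;
    // Drop optional userinfo ("user@") and port (":443").
    let host_port = authority.rsplit('@').next()?;
    let host = host_port.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host.to_ascii_lowercase())
    }
}

/// True when `url` is on `domain` or one of its subdomains (assumption).
fn matches_domain(url: &str, domain: &str) -> bool {
    let domain = domain.to_ascii_lowercase();
    match host_of(url) {
        Some(host) => host == domain || host.ends_with(&format!(".{}", domain)),
        None => false,
    }
}

/// Keep only on-domain URLs, capped at `max_results`.
fn filter_to_domain(urls: Vec<String>, domain: &str, max_results: usize) -> Vec<String> {
    urls.into_iter()
        .filter(|u| matches_domain(u, domain))
        .take(max_results)
        .collect()
}
```

Matching on the parsed host rather than on a substring of the URL keeps a result like `https://evil.example/korben.info` from passing the check.
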
### LLM Path

Prompt sent to the LLM with the websearch model:

```
Trouve les {max_results} articles les plus récents publiés sur le site {domain}
à propos de "{theme}".

Retourne uniquement un tableau JSON d'URLs, sans explication :
["https://...", "https://...", ...]

Critères :
- Articles publiés dans les {max_age_days} derniers jours
- URLs complètes pointant vers des pages d'articles (pas de pages catégorie, tag, ou accueil)
- Uniquement des URLs du domaine {domain}
```

The prompt is written in French. In English, it asks for the `{max_results}` most recent articles published on `{domain}` about "`{theme}`", returned only as a JSON array of URLs with no explanation. The criteria are: articles published in the last `{max_age_days}` days, full URLs pointing to article pages (no category, tag, or home pages), and only URLs from `{domain}`.

The service then handles the response as follows:

- Parses the JSON array from the LLM response.
- Filters URLs to only keep those matching the target domain (protection against LLM hallucinations).
- Returns an empty Vec if parsing fails or the LLM returns non-JSON.

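The parse-then-filter steps could be sketched as below. This is an illustrative std-only version under stated assumptions: a real implementation would more likely deserialize with a JSON library, the parser here is deliberately lenient about commas, and `parse_url_array`, `on_domain`, and `urls_from_llm` are hypothetical names. Accepting subdomains is also an assumption.

```rust
/// Parse the first JSON array of strings found in `response`.
/// Returns an empty Vec for anything malformed (silent fallback).
fn parse_url_array(response: &str) -> Vec<String> {
    let start = match response.find('[') {
        Some(i) => i,
        None => return Vec::new(),
    };
    let mut urls = Vec::new();
    let mut chars = response[start + 1..].chars();
    loop {
        match chars.next() {
            Some(c) if c.is_whitespace() || c == ',' => continue,
            Some(']') => return urls,
            Some('"') => {
                let mut s = String::new();
                loop {
                    match chars.next() {
                        Some('"') => break,
                        Some('\\') => match chars.next() {
                            // JSON may escape '/' as "\/" inside URLs.
                            Some(e @ ('/' | '"' | '\\')) => s.push(e),
                            _ => return Vec::new(),
                        },
                        Some(c) => s.push(c),
                        None => return Vec::new(), // unterminated string
                    }
                }
                urls.push(s);
            }
            // Non-string element or truncated array: treat as malformed.
            _ => return Vec::new(),
        }
    }
}

/// True when `url` is an http(s) URL on `domain` or a subdomain of it.
fn on_domain(url: &str, domain: &str) -> bool {
    let rest = match url.strip_prefix("https://").or_else(|| url.strip_prefix("http://")) {
        Some(r) => r,
        None => return false,
    };
    let authority = rest.split(|c: char| c == '/' || c == '?' || c == '#').next().unwrap_or("");
    // Ignore userinfo and port so "korben.info:x@evil.example" is rejected.
    let host_port = authority.rsplit('@').next().unwrap_or("");
    let host = host_port.split(':').next().unwrap_or("").to_ascii_lowercase();
    let domain = domain.to_ascii_lowercase();
    host == domain || host.ends_with(&format!(".{}", domain))
}

/// Parse the LLM response, drop off-domain URLs, and cap the result.
fn urls_from_llm(response: &str, domain: &str, max_results: usize) -> Vec<String> {
    parse_url_array(response)
        .into_iter()
        .filter(|u| on_domain(u, domain))
        .take(max_results)
        .collect()
}
```

Searching for the first `[` lets the sketch tolerate a response that wraps the array in prose or a Markdown fence, which LLMs sometimes do despite the "sans explication" instruction.
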
### Error Handling

No errors are propagated, because this is a fallback. Every failure results in an empty Vec and is logged with `tracing::warn!`.

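One way to express this "log and return empty" policy is a small wrapper around the fallible inner call. The sketch below is an assumption about shape only: `links_or_empty` is a hypothetical name, and `eprintln!` stands in for `tracing::warn!` so the example needs no external crates.

```rust
/// Collapse any search error into an empty result instead of propagating it.
/// `eprintln!` is a stand-in for `tracing::warn!` in this sketch.
fn links_or_empty<E: std::fmt::Display>(
    domain: &str,
    result: Result<Vec<String>, E>,
) -> Vec<String> {
    match result {
        Ok(urls) => urls,
        Err(err) => {
            eprintln!("site_search failed for {}: {}", domain, err);
            Vec::new()
        }
    }
}
```

Keeping the conversion in one place means both the Brave path and the LLM path can return `Result` internally while `search` still exposes a plain `Vec<String>`.
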
## Pipeline Integration

### Phase 1 Spawn Modification (`synthesis/mod.rs`)

**Current flow per source (in spawn):**
```
RSS (>= 3 entries) → use RSS links
else → HTML extraction → use HTML links (may be 0)
```

**New flow per source (in spawn):**
```
RSS (>= 3 entries) → use RSS links
else → HTML extraction
    if HTML > 0 links → use HTML links
    if HTML == 0 links → site_search(domain, theme)
        if site_search > 0 links → use those
        else → source contributes 0 links
```

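The new flow can be read as a three-step decision. The following is a simplified synchronous sketch, assuming closures in place of the real async extraction calls; `LinkOrigin`, `MIN_RSS_ENTRIES`, and `choose_links` are hypothetical names introduced for illustration.

```rust
/// Which strategy produced the links for a source (illustrative only).
#[derive(Debug, PartialEq)]
enum LinkOrigin {
    Rss,
    Html,
    SiteSearch,
    Empty,
}

/// RSS is used only when it yields at least this many entries.
const MIN_RSS_ENTRIES: usize = 3;

/// Apply the fallback order: RSS, then HTML extraction, then site search.
/// Later strategies are closures so they run only when actually needed.
fn choose_links(
    rss: Vec<String>,
    html_extract: impl FnOnce() -> Vec<String>,
    site_search: impl FnOnce() -> Vec<String>,
) -> (LinkOrigin, Vec<String>) {
    if rss.len() >= MIN_RSS_ENTRIES {
        return (LinkOrigin::Rss, rss);
    }
    let html = html_extract();
    if !html.is_empty() {
        return (LinkOrigin::Html, html);
    }
    // Both primary strategies returned 0 links: trigger the fallback.
    let found = site_search();
    if found.is_empty() {
        (LinkOrigin::Empty, found)
    } else {
        (LinkOrigin::SiteSearch, found)
    }
}
```

Passing the later strategies lazily mirrors the trigger condition in the design table: the site search is never executed for sparse but non-empty HTML results.
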
### SiteSearchProvider Construction

Built once before the wave_loop, after LLM provider resolution:

```rust
let site_search_provider = if settings.use_brave_search {
    match resolve_brave_key(state, user_id).await {
        Ok(key) => SiteSearchProvider::Brave { api_key: key },
        // No usable Brave key: fall back to LLM websearch.
        Err(_) => SiteSearchProvider::Llm { provider, model: model_websearch },
    }
} else {
    SiteSearchProvider::Llm { provider, model: model_websearch }
};
let site_search_provider = Arc::new(site_search_provider);
```

The provider is wrapped in `Arc` and cloned into each spawn. The Brave key is resolved once (DB + decryption), not per-spawn.

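The selection logic can be isolated into a pure function, which makes it easy to test without a database. This is a sketch under assumptions: the `LlmProvider` trait here is an empty stand-in for the project's real trait, `FakeLlm` and `select_provider` are hypothetical, and the Brave key is passed in as an `Option` rather than resolved inside the function.

```rust
use std::sync::Arc;

/// Stand-in for the project's LLM abstraction (assumed shape).
trait LlmProvider: Send + Sync {}

/// Placeholder implementation used only for illustration.
struct FakeLlm;
impl LlmProvider for FakeLlm {}

enum SiteSearchProvider {
    Brave { api_key: String },
    Llm { provider: Arc<dyn LlmProvider>, model: String },
}

/// Pick Brave when it is enabled and a key was resolved; otherwise use LLM websearch.
fn select_provider(
    use_brave_search: bool,
    brave_key: Option<String>,
    provider: Arc<dyn LlmProvider>,
    model_websearch: String,
) -> Arc<SiteSearchProvider> {
    let selected = match (use_brave_search, brave_key) {
        (true, Some(api_key)) => SiteSearchProvider::Brave { api_key },
        _ => SiteSearchProvider::Llm { provider, model: model_websearch },
    };
    // Wrapped once so each spawned task can hold a cheap clone.
    Arc::new(selected)
}
```

Each spawn would then take its own handle with `Arc::clone(&site_search_provider)`, which only bumps a reference count and does not copy the API key or the LLM provider.
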
### Data Passed to Spawn (additions)

- `site_search_provider: Arc<SiteSearchProvider>` (cloned per spawn)
- `theme_text: String` (from `theme.theme.clone()`)
- `max_age_days: i32` (from `theme.max_age_days`)

The domain is extracted from `source_url` inside the spawn via the existing `extract_domain()` helper.

### Article History

Articles found via site_search use `source_type = "site_search"` in the article_history table (existing TEXT column, no migration needed). This distinguishes them from `"personalized_source"` and `"brave_search"` entries.

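Because the column is free-form text, one option for keeping the three values consistent in code is a small enum that maps to the stored strings. The spec does not say the codebase does this; `SourceType` and `as_str` are hypothetical names in this sketch.

```rust
/// The `source_type` values named in this spec, as a typed enum (illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SourceType {
    PersonalizedSource,
    BraveSearch,
    SiteSearch,
}

impl SourceType {
    /// The exact string stored in the `article_history.source_type` TEXT column.
    fn as_str(self) -> &'static str {
        match self {
            SourceType::PersonalizedSource => "personalized_source",
            SourceType::BraveSearch => "brave_search",
            SourceType::SiteSearch => "site_search",
        }
    }
}
```
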
## No Changes

- **No new settings**: uses existing `use_brave_search` and Brave API key
- **No frontend changes**: feature is transparent
- **No API changes**: source and synthesis endpoints unchanged
- **No DB migration**: `source_type` is a free-form TEXT column
- **Phase 2 unchanged**: continues to fill category gaps after Phase 1

## Observability

- `tracing::info!` when site_search fallback triggers for a source (logs domain and result count)
- `tracing::warn!` when site_search also fails (for diagnosis)
- `source_type = "site_search"` in article_history for stats/debugging

## Testing Strategy

- **Unit tests** for `site_search::search`: mock Brave API response, mock LLM response, domain filtering, empty results, malformed LLM response
- **Unit tests** for LLM response parsing: valid JSON array, mixed valid/invalid URLs, non-JSON response, wrong domain URLs filtered
- **Integration test**: pipeline with a source that returns 0 links from RSS+HTML, verify site_search fallback kicks in and produces articles with `source_type = "site_search"` in article_history