docs: add RSS feed integration design spec
Spec for adding RSS/Atom feed support to personalized sources in Phase 1 of the synthesis pipeline: auto-discovery, persistence with 30-day re-check, and fallback to HTML extraction.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

branch master
parent f8588a57a3
commit 8e56fcdb3a
@@ -0,0 +1,170 @@

# RSS Feed Integration for Personalized Sources

**Date:** 2026-04-03
**Status:** Approved

## Summary

Add RSS/Atom feed support to the synthesis pipeline. When processing personalized sources in Phase 1, the system first tries the source's RSS/Atom feed (discovered automatically or provided directly) and falls back to the existing HTML extraction if no feed is found or the feed yields fewer than 3 links. Feed entries are sorted by publication date, newest first, so the most recent articles take priority.

## Design Decisions

| Decision | Choice | Rationale |
|---|---|---|
| Feed detection | Content-Type + `<link rel="alternate">` | Simple; covers the two standard mechanisms without speculative URL probing |
| Feed URL persistence | Persist with 30-day re-discovery | Avoids repeated discovery requests while handling URL changes over time |
| Metadata extracted | URL + title + published_date | Minimum needed for sorting by recency; scrape+classify handles enrichment |
| Fallback threshold | < 3 entries | Below 3 entries the feed is too sparse to be useful; HTML extraction may find more |
| Architecture | Separate `feed_parser` service | Clean separation of concerns, independently testable |
| Frontend/API changes | None | RSS discovery is transparent to the user |

## Data Model Changes

### Migration: `sources` table

Add two nullable columns:

```sql
ALTER TABLE sources ADD COLUMN rss_url TEXT;
ALTER TABLE sources ADD COLUMN rss_discovered_at TIMESTAMPTZ;
```

- `rss_url`: URL of the discovered or directly provided RSS/Atom feed
- `rss_discovered_at`: Timestamp of the last successful discovery/verification

No changes to `article_history` or any other table.

## New Service: `feed_parser.rs`

**Location:** `backend/src/services/feed_parser.rs`

### Public API

```rust
pub struct FeedEntry {
    pub url: String,
    pub title: String,
    pub published_date: Option<DateTime<Utc>>,
}

pub enum FeedResult {
    /// Feed found and parsed successfully
    Found {
        feed_url: String,
        entries: Vec<FeedEntry>,
    },
    /// No feed discovered or feed invalid
    NotFound,
}

/// Main entry point, called by the Phase 1 pipeline once per source.
pub async fn detect_and_parse_feed(
    http_client: &HttpClient,
    source_url: &str,
    rss_url: Option<&str>,
    rss_discovered_at: Option<DateTime<Utc>>,
    max_links: usize,
) -> FeedResult

/// Discover a feed URL from a source URL.
/// Checks Content-Type (direct RSS/Atom) or parses <link rel="alternate"> from HTML.
pub async fn discover_feed(
    http_client: &HttpClient,
    source_url: &str,
) -> Option<String>

/// Fetch and parse an RSS/Atom feed. Returns entries sorted by published_date descending.
pub async fn parse_feed(
    http_client: &HttpClient,
    feed_url: &str,
    max_links: usize,
) -> Result<Vec<FeedEntry>, FeedError>
```

### `detect_and_parse_feed` Logic

```
if rss_url is Some AND rss_discovered_at is less than 30 days old:
    parse_feed(rss_url) → return Found or NotFound

if rss_url is Some AND rss_discovered_at is 30 or more days old:
    discover_feed(source_url)
    if feed found → parse_feed(new_feed_url) → return Found (+ signal update)
    else → return NotFound (+ signal clear rss_url)

if rss_url is None:
    discover_feed(source_url)
    if feed found → parse_feed(feed_url) → return Found (+ signal persist)
    else → return NotFound
```
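
The three branches above depend only on whether a URL is cached and how old its timestamp is, so the decision can be isolated as a pure function. The following is a minimal sketch, not the actual implementation: `FeedAction` and `choose_action` are hypothetical names, the age is passed in as a `std::time::Duration` instead of being derived from a `DateTime<Utc>`, and a cached URL with no timestamp is treated as stale, a case the spec does not cover.

```rust
use std::time::Duration;

/// Re-discovery interval from the spec: 30 days.
const REDISCOVERY_INTERVAL: Duration = Duration::from_secs(30 * 24 * 60 * 60);

/// First step `detect_and_parse_feed` would take (hypothetical helper type).
#[derive(Debug, PartialEq)]
pub enum FeedAction<'a> {
    /// Cached URL is fresh: parse it directly.
    UseCached(&'a str),
    /// Cached URL is stale: re-run discovery, then update or clear the stored URL.
    Rediscover,
    /// Nothing cached: run discovery, then persist the URL if one is found.
    Discover,
}

/// `age` is the time elapsed since `rss_discovered_at`, if that column is set.
pub fn choose_action<'a>(rss_url: Option<&'a str>, age: Option<Duration>) -> FeedAction<'a> {
    match (rss_url, age) {
        (Some(url), Some(age)) if age < REDISCOVERY_INTERVAL => FeedAction::UseCached(url),
        // Stale timestamp, or a URL without a timestamp (assumption: treat as stale).
        (Some(_), _) => FeedAction::Rediscover,
        (None, _) => FeedAction::Discover,
    }
}
```

Keeping the decision free of I/O would let the re-discovery unit tests cover the 30-day boundary without mocking HTTP.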

### Feed Detection Strategy

1. **Content-Type check:** Fetch `source_url` and inspect the response Content-Type:
   - `application/rss+xml` or `application/atom+xml` → the URL itself is a feed
   - `text/xml` or `application/xml` → the URL itself is a feed only if the body is RSS/Atom content
2. **HTML `<link>` discovery:** If the Content-Type is `text/html`, parse the document for:
   - `<link rel="alternate" type="application/rss+xml" href="...">`
   - `<link rel="alternate" type="application/atom+xml" href="...">`
   - Take the first match
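
The Content-Type check in step 1 can be sketched as a small classifier. This is an illustration under stated assumptions rather than the service's code: `SourceKind`, `classify_response`, and `looks_like_feed` are hypothetical names, and the body sniffing for generic XML types is a crude root-element heuristic (it would also accept an unrelated element such as `<feedback>`), where a real implementation might simply try to parse the body.

```rust
/// How the response to `source_url` should be treated (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum SourceKind {
    /// The URL itself is an RSS/Atom feed.
    Feed,
    /// HTML page: continue with `<link rel="alternate">` discovery.
    Html,
    /// Anything else: no feed via this mechanism.
    Other,
}

pub fn classify_response(content_type: &str, body: &str) -> SourceKind {
    // Drop parameters such as "; charset=utf-8" and normalize case.
    let mime = content_type
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();
    match mime.as_str() {
        "application/rss+xml" | "application/atom+xml" => SourceKind::Feed,
        // Generic XML types only count as a feed if the body looks like one.
        "text/xml" | "application/xml" => {
            if looks_like_feed(body) {
                SourceKind::Feed
            } else {
                SourceKind::Other
            }
        }
        "text/html" => SourceKind::Html,
        _ => SourceKind::Other,
    }
}

/// Heuristic: look for an RSS 2.0, Atom, or RSS 1.0 root element near the start.
fn looks_like_feed(body: &str) -> bool {
    let head: String = body.chars().take(512).collect();
    ["<rss", "<feed", "<rdf:RDF"].iter().any(|root| head.contains(root))
}
```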

### Feed Parsing

- **Crate:** `feed-rs` (handles RSS 1.0, RSS 2.0, Atom, JSON Feed)
- **Extract per entry:** `url` (from `<link>` or `<guid>`), `title`, `published_date` (from `<pubDate>` or `<updated>`)
- **Sort:** by `published_date` descending (most recent first), entries without dates placed last
- **Limit:** return at most `max_links` entries
- **SSRF protection:** reuse existing URL validation (no private IPs, http/https only)
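
The sort and limit rules above can be sketched with the standard library alone. In this sketch `Entry` and `sort_and_limit` are hypothetical stand-ins for `FeedEntry` and the tail of `parse_feed`, and the publication date is a Unix timestamp (`Option<i64>`) rather than `Option<DateTime<Utc>>`, purely to keep the example dependency-free.

```rust
use std::cmp::Reverse;

/// Simplified stand-in for `FeedEntry` (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub struct Entry {
    pub url: String,
    /// Unix timestamp in seconds; stands in for Option<DateTime<Utc>>.
    pub published: Option<i64>,
}

/// Newest first, undated entries last, at most `max_links` results.
pub fn sort_and_limit(mut entries: Vec<Entry>, max_links: usize) -> Vec<Entry> {
    // Key is (is_undated, Reverse(timestamp)):
    // - `false < true` puts dated entries before undated ones;
    // - `Reverse` orders dated entries from newest to oldest.
    // The sort is stable, so undated entries keep their feed order.
    entries.sort_by_key(|e| (e.published.is_none(), Reverse(e.published)));
    entries.truncate(max_links);
    entries
}
```

Sorting before truncating matters: truncating first would keep whichever entries the feed happened to list first, not the most recent ones.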

## Pipeline Integration

### Phase 1 Modification (`synthesis/mod.rs`)

**Current flow per source:**
```
extract_article_links(source_url) → Vec<String>
```

**New flow per source:**
```
1. feed_parser::detect_and_parse_feed(http_client, source_url, rss_url, rss_discovered_at, max_links)
2. If Found AND entries.len() >= 3 → use entry URLs
3. Else → fallback to source_scraper::extract_article_links(source_url)
4. If rss_url or rss_discovered_at changed → async UPDATE sources SET rss_url, rss_discovered_at
```

**What stays the same:**
- Downstream pipeline: dedup against article_history → batch scrape → classify → accumulate
- Preferred-first ordering (preferred sources processed first in wave)
- Wave-based parallel processing
- Phase 2 web search fallback (completely untouched)
- Source diversity limits (max_articles_per_source)
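
Steps 2 and 3 of the new per-source flow above reduce to a single threshold check. A minimal sketch, assuming the feed result has already been reduced to an optional list of entry URLs; `LinkSource`, `select_link_source`, and `MIN_FEED_ENTRIES` are hypothetical names.

```rust
/// Fallback threshold from the design decisions: fewer than 3 entries is too sparse.
pub const MIN_FEED_ENTRIES: usize = 3;

/// Where Phase 1 takes its article links from (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum LinkSource {
    /// Use the URLs of the feed entries.
    Feed(Vec<String>),
    /// Fall back to the existing HTML link extraction.
    HtmlFallback,
}

/// `feed_entry_urls` is `None` when no feed was found (FeedResult::NotFound).
pub fn select_link_source(feed_entry_urls: Option<Vec<String>>) -> LinkSource {
    match feed_entry_urls {
        Some(urls) if urls.len() >= MIN_FEED_ENTRIES => LinkSource::Feed(urls),
        // No feed, or a feed with fewer than 3 entries.
        _ => LinkSource::HtmlFallback,
    }
}
```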

### RSS URL Persistence (`db/sources.rs`)

New function:

```rust
pub async fn update_source_rss(
    pool: &PgPool,
    source_id: Uuid,
    rss_url: Option<&str>,
    rss_discovered_at: Option<DateTime<Utc>>,
) -> Result<(), sqlx::Error>
```

Called during generation:
- Discovery successful → `rss_url = Some(url)`, `rss_discovered_at = Some(now)`
- Re-discovery, feed still valid → update `rss_discovered_at` only
- Re-discovery, feed gone → `rss_url = None`, `rss_discovered_at = None`
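
The three cases above map directly onto the two optional arguments of `update_source_rss`. The sketch below shows that mapping as a pure function; `DiscoveryOutcome` and `rss_update` are hypothetical names, and `now` is a Unix timestamp (`i64`) standing in for `DateTime<Utc>`.

```rust
/// Result of a discovery or re-discovery attempt (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum DiscoveryOutcome<'a> {
    /// Discovery found a feed URL to store.
    Discovered(&'a str),
    /// Re-discovery confirmed the stored feed URL.
    StillValid,
    /// Re-discovery found no feed.
    Gone,
}

/// Returns the `(rss_url, rss_discovered_at)` pair to persist.
pub fn rss_update<'a>(
    outcome: DiscoveryOutcome<'a>,
    current_url: Option<&'a str>,
    now: i64,
) -> (Option<&'a str>, Option<i64>) {
    match outcome {
        DiscoveryOutcome::Discovered(url) => (Some(url), Some(now)),
        // Keep the stored URL and only refresh the timestamp.
        DiscoveryOutcome::StillValid => (current_url, Some(now)),
        // Clear both columns so the next run starts from discovery.
        DiscoveryOutcome::Gone => (None, None),
    }
}
```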

## No Frontend / API Changes

The `rss_url` and `rss_discovered_at` fields are internal and are not exposed via the API or UI. The user adds a URL as before; the system transparently discovers and uses RSS feeds when available.

## Testing Strategy

- **Unit tests** for `feed_parser`: mock HTTP responses for RSS 2.0, Atom, HTML with `<link>`, HTML without feed, direct feed URL, malformed feeds
- **Unit tests** for date sorting: verify newest-first ordering, entries without dates placed last
- **Unit tests** for re-discovery logic: fresh cache, stale cache (30 days or older), cache with changed feed
- **Integration tests**: full Phase 1 pipeline with a source that has a mock RSS feed, verify articles extracted and sorted by date
- **Edge cases**: feed with 0 entries, feed with < 3 entries (triggers fallback), feed with duplicate URLs, feed with relative URLs, feed URL that returns 404
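
The duplicate-URL edge case interacts with the fallback threshold: a feed that lists the same article several times may look denser than it is. The spec does not say how duplicates are handled, so the following is only one possible approach, assuming duplicates are dropped (first occurrence kept) before entries are counted; `dedup_urls` is a hypothetical helper.

```rust
use std::collections::HashSet;

/// Remove repeated URLs while preserving the order of first occurrences.
pub fn dedup_urls(urls: Vec<String>) -> Vec<String> {
    let mut seen = HashSet::new();
    // `insert` returns false when the URL was already present.
    urls.into_iter().filter(|u| seen.insert(u.clone())).collect()
}
```

Under this assumption, a feed with four entries but only two distinct URLs would fall below the 3-entry threshold and trigger the HTML fallback.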