RSS Feed Integration for Personalized Sources

Date: 2026-04-03 Status: Approved

Summary

Add RSS/Atom feed support to the synthesis pipeline. When processing personalized sources in Phase 1, the system attempts to use the source's RSS feed first (discovered automatically or provided directly), falling back to the existing HTML extraction if no feed is found or if the feed yields fewer than 3 links. Feed entries are sorted by publication date (newest first), giving priority to the most recent articles.

Design Decisions

Decision	Choice	Rationale
Feed detection	Content-Type + `<link rel="alternate">`	Simple, covers the two standard mechanisms without speculative URL probing
Feed URL persistence	Persist with 30-day re-discovery	Avoids repeated discovery requests while handling URL changes over time
Metadata extracted	URL + title + published_date	Minimum needed for sorting by recency; scrape+classify handles enrichment
Fallback threshold	< 3 entries	With fewer than 3 entries the feed is too sparse to be useful; HTML extraction may find more
Architecture	Separate `feed_parser` service	Clean separation of concerns, independently testable
Frontend/API changes	None	RSS discovery is transparent to the user

Data Model Changes

Migration: `sources` table

Add two nullable columns:

ALTER TABLE sources ADD COLUMN rss_url TEXT;
ALTER TABLE sources ADD COLUMN rss_discovered_at TIMESTAMPTZ;

rss_url — URL of the discovered or directly-provided RSS/Atom feed
rss_discovered_at — Timestamp of last successful discovery/verification

No changes to article_history or any other table.

New Service: `feed_parser.rs`

Location: backend/src/services/feed_parser.rs

Public API

pub struct FeedEntry {
    pub url: String,
    pub title: String,
    pub published_date: Option<DateTime<Utc>>,
}

pub enum FeedResult {
    /// Feed found and parsed successfully
    Found {
        feed_url: String,
        entries: Vec<FeedEntry>,
    },
    /// No feed discovered or feed invalid
    NotFound,
}

/// Main entry point — called by Phase 1 pipeline per source.
pub async fn detect_and_parse_feed(
    http_client: &HttpClient,
    source_url: &str,
    rss_url: Option<&str>,
    rss_discovered_at: Option<DateTime<Utc>>,
    max_links: usize,
) -> FeedResult

/// Discover a feed URL from a source URL.
/// Checks Content-Type (direct RSS/Atom) or parses <link rel="alternate"> from HTML.
pub async fn discover_feed(
    http_client: &HttpClient,
    source_url: &str,
) -> Option<String>

/// Fetch and parse an RSS/Atom feed. Returns entries sorted by published_date descending.
pub async fn parse_feed(
    http_client: &HttpClient,
    feed_url: &str,
    max_links: usize,
) -> Result<Vec<FeedEntry>, FeedError>

`detect_and_parse_feed` Logic

if rss_url is Some AND rss_discovered_at is less than 30 days old:
    parse_feed(rss_url) → return Found, or NotFound if the feed cannot be fetched or parsed

if rss_url is Some AND rss_discovered_at is 30 or more days old:
    discover_feed(source_url)
    if feed found → parse_feed(new_feed_url) → return Found (+ signal update)
    else → return NotFound (+ signal clear rss_url)

if rss_url is None:
    discover_feed(source_url)
    if feed found → parse_feed(feed_url) → return Found (+ signal persist)
    else → return NotFound
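
The three branches above depend only on the cached values and the current time, so the choice of branch can be sketched as a pure function. This is a minimal illustration, not the actual implementation: `FeedPlan` and `plan_feed_lookup` are hypothetical names, `std::time::SystemTime` stands in for `DateTime<Utc>`, and treating a cached URL with no timestamp as stale is an assumption the design does not spell out.

```rust
use std::time::{Duration, SystemTime};

/// Re-discovery interval from the design: 30 days.
const REDISCOVERY_INTERVAL: Duration = Duration::from_secs(30 * 24 * 60 * 60);

/// Hypothetical helper type: which branch of the logic applies.
#[derive(Debug, PartialEq)]
pub enum FeedPlan<'a> {
    /// Cached feed URL is fresh: parse it directly.
    UseCached(&'a str),
    /// Cached feed URL is stale: run discovery again on the source URL.
    Rediscover,
    /// No cached feed URL: run discovery for the first time.
    Discover,
}

pub fn plan_feed_lookup<'a>(
    rss_url: Option<&'a str>,
    rss_discovered_at: Option<SystemTime>,
    now: SystemTime,
) -> FeedPlan<'a> {
    match (rss_url, rss_discovered_at) {
        (None, _) => FeedPlan::Discover,
        // Assumption: a URL without a timestamp is treated as stale.
        (Some(_), None) => FeedPlan::Rediscover,
        (Some(url), Some(at)) => {
            // duration_since fails when `at` lies in the future (clock skew);
            // in that case the age is treated as zero.
            let age = now.duration_since(at).unwrap_or(Duration::ZERO);
            if age < REDISCOVERY_INTERVAL {
                FeedPlan::UseCached(url)
            } else {
                FeedPlan::Rediscover
            }
        }
    }
}
```

Keeping this decision separate from the HTTP calls would let the fresh/stale boundary be unit-tested without mocking any responses.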

Feed Detection Strategy

1. Content-Type check: fetch source_url and inspect the response Content-Type:
   - application/rss+xml, application/atom+xml, or a generic XML type (text/xml, application/xml) whose body is RSS/Atom → the URL itself is a feed
2. HTML <link> discovery: if the Content-Type is text/html, parse the page for:
   - <link rel="alternate" type="application/rss+xml" href="...">
   - <link rel="alternate" type="application/atom+xml" href="...">
   - Take the first match
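
The Content-Type step can be sketched as a small classifier. The names `ResponseKind` and `classify_content_type` are hypothetical, and the sketch assumes that the generic XML types need a look at the body before the URL is accepted as a feed.

```rust
/// Hypothetical classification of a fetched response by its Content-Type.
#[derive(Debug, PartialEq)]
pub enum ResponseKind {
    /// Dedicated feed media type: the URL itself is a feed.
    Feed,
    /// Generic XML: a feed only if the body turns out to be RSS/Atom.
    MaybeFeed,
    /// HTML page: look for <link rel="alternate"> elements.
    Html,
    /// Anything else: no feed can be derived from this response.
    Other,
}

pub fn classify_content_type(header: &str) -> ResponseKind {
    // Drop parameters such as "; charset=utf-8" and normalise case,
    // since media types are case-insensitive.
    let media_type = header
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();

    match media_type.as_str() {
        "application/rss+xml" | "application/atom+xml" => ResponseKind::Feed,
        "text/xml" | "application/xml" => ResponseKind::MaybeFeed,
        "text/html" => ResponseKind::Html,
        _ => ResponseKind::Other,
    }
}
```

Stripping the parameters first matters in practice, because servers commonly append a charset to the media type.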

Feed Parsing

Crate: feed-rs (handles RSS 1.0, RSS 2.0, Atom, JSON Feed)
Extract per entry: url (from <link> or <guid>), title, published_date (from <pubDate> or <updated>)
Sort: by published_date descending (most recent first), entries without dates placed last
Limit: return at most max_links entries
SSRF protection: reuse existing URL validation (no private IPs, http/https only)
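
The sort and limit rules can be illustrated with the standard library alone. This is a sketch under stated assumptions: `Entry` is a simplified stand-in for `FeedEntry` with Unix seconds in place of `DateTime<Utc>`, and `sort_and_limit` is a hypothetical helper name.

```rust
use std::cmp::Reverse;

/// Simplified stand-in for FeedEntry (assumption: Unix seconds replace DateTime<Utc>).
#[derive(Debug, Clone, PartialEq)]
pub struct Entry {
    pub url: String,
    pub title: String,
    pub published: Option<i64>,
}

/// Sort newest first, place undated entries last, then keep at most `max_links`.
pub fn sort_and_limit(mut entries: Vec<Entry>, max_links: usize) -> Vec<Entry> {
    // Key is (is_undated, Reverse(timestamp)):
    //   - `false` sorts before `true`, so dated entries come first;
    //   - Reverse flips the timestamp order, so newer entries come first.
    // sort_by_key is stable, so ties keep their original feed order.
    entries.sort_by_key(|e| (e.published.is_none(), Reverse(e.published)));
    entries.truncate(max_links);
    entries
}
```

Because truncation happens after sorting, a small `max_links` drops undated entries first and then the oldest dated ones.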

Pipeline Integration

Phase 1 Modification (`synthesis/mod.rs`)

Current flow per source:

extract_article_links(source_url) → Vec<String>

New flow per source:

1. feed_parser::detect_and_parse_feed(http_client, source_url, rss_url, rss_discovered_at, max_links)
2. If Found AND entries.len() >= 3 → use entry URLs
3. Else → fallback to source_scraper::extract_article_links(source_url)
4. If rss_url changed → async UPDATE sources SET rss_url, rss_discovered_at
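
Steps 2 and 3 of the new flow reduce to a single decision, sketched below. `FeedOutcome` is a simplified stand-in for `FeedResult` that carries only entry URLs, `select_links` is a hypothetical name, and the closure stands in for the call to `source_scraper::extract_article_links`.

```rust
/// Fallback threshold from the design: fewer than 3 entries triggers HTML extraction.
const MIN_FEED_ENTRIES: usize = 3;

/// Simplified stand-in for FeedResult (assumption: entries reduced to their URLs).
pub enum FeedOutcome {
    Found { feed_url: String, urls: Vec<String> },
    NotFound,
}

/// Use the feed's URLs when the feed is rich enough, otherwise run the fallback.
pub fn select_links<F>(outcome: FeedOutcome, html_fallback: F) -> Vec<String>
where
    F: FnOnce() -> Vec<String>,
{
    match outcome {
        FeedOutcome::Found { urls, .. } if urls.len() >= MIN_FEED_ENTRIES => urls,
        // Covers both a sparse feed and NotFound. The closure runs only here,
        // so the HTML scrape is skipped whenever the feed is used.
        _ => html_fallback(),
    }
}
```

Passing the fallback as a closure keeps the HTML extraction lazy, which matches the intent of trying the feed first.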

What stays the same:

Downstream pipeline: dedup against article_history → batch scrape → classify → accumulate
Preferred-first ordering (preferred sources processed first in wave)
Wave-based parallel processing
Phase 2 web search fallback (completely untouched)
Source diversity limits (max_articles_per_source)

RSS URL Persistence (`db/sources.rs`)

New function:

pub async fn update_source_rss(
    pool: &PgPool,
    source_id: Uuid,
    rss_url: Option<&str>,
    rss_discovered_at: Option<DateTime<Utc>>,
) -> Result<(), sqlx::Error>

Called during generation:

Discovery successful → rss_url = Some(url), rss_discovered_at = Some(now)
Re-discovery, feed still valid → update rss_discovered_at only
Re-discovery, feed gone → rss_url = None, rss_discovered_at = None
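
The three outcomes above map onto the two columns as follows. This is an in-memory sketch only: `RssState`, `DiscoveryOutcome`, and `next_rss_state` are hypothetical names, Unix seconds stand in for `DateTime<Utc>`, and the resulting values would still need to be passed to `update_source_rss`.

```rust
/// In-memory view of the two new columns (assumption: Unix seconds replace DateTime<Utc>).
#[derive(Debug, Clone, PartialEq)]
pub struct RssState {
    pub rss_url: Option<String>,
    pub rss_discovered_at: Option<i64>,
}

/// Hypothetical summary of what a discovery or re-discovery run found.
pub enum DiscoveryOutcome {
    /// A feed URL was discovered; store it.
    Discovered(String),
    /// Re-discovery confirmed the stored feed URL is still valid.
    StillValid,
    /// Re-discovery found no feed any more.
    Gone,
}

pub fn next_rss_state(current: &RssState, outcome: DiscoveryOutcome, now: i64) -> RssState {
    match outcome {
        DiscoveryOutcome::Discovered(url) => RssState {
            rss_url: Some(url),
            rss_discovered_at: Some(now),
        },
        // Only the timestamp moves forward; the stored URL is kept as-is.
        DiscoveryOutcome::StillValid => RssState {
            rss_url: current.rss_url.clone(),
            rss_discovered_at: Some(now),
        },
        DiscoveryOutcome::Gone => RssState {
            rss_url: None,
            rss_discovered_at: None,
        },
    }
}
```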

No Frontend / API Changes

The rss_url and rss_discovered_at fields are internal — not exposed via the API or UI. The user adds a URL as before; the system transparently discovers and exploits RSS feeds when available.

Testing Strategy

Unit tests for feed_parser: mock HTTP responses for RSS 2.0, Atom, HTML with <link>, HTML without feed, direct feed URL, malformed feeds
Unit tests for date sorting: verify newest-first ordering, entries without dates placed last
Unit tests for re-discovery logic: fresh cache (< 30 days), stale cache (>= 30 days), cache with changed feed
Integration tests: full Phase 1 pipeline with a source that has a mock RSS feed, verify articles extracted and sorted by date
Edge cases: feed with 0 entries, feed with < 3 entries (triggers fallback), feed with duplicate URLs, feed with relative URLs, feed URL that returns 404
