RSS Feed Integration for Personalized Sources

Date: 2026-04-03 Status: Approved

Summary

Add RSS/Atom feed support to the synthesis pipeline. When processing personalized sources in Phase 1, the system attempts to use the source's RSS feed first (discovered automatically or provided directly), falling back to the existing HTML extraction if no feed is found or if the feed yields fewer than 3 links. Feed entries are sorted by publication date (newest first), giving priority to the most recent articles.

Design Decisions

| Decision | Choice | Rationale |
|---|---|---|
| Feed detection | Content-Type + <link rel="alternate"> | Simple, covers the two standard mechanisms without speculative URL probing |
| Feed URL persistence | Persist with 30-day re-discovery | Avoids repeated discovery requests while handling URL changes over time |
| Metadata extracted | URL + title + published_date | Minimum needed for sorting by recency; scrape+classify handles enrichment |
| Fallback threshold | < 3 entries | Below 3 entries the feed is too sparse to be useful; HTML extraction may find more |
| Architecture | Separate feed_parser service | Clean separation of concerns, independently testable |
| Frontend/API changes | None | RSS discovery is transparent to the user |

Data Model Changes

Migration: sources table

Add two nullable columns:

ALTER TABLE sources ADD COLUMN rss_url TEXT;
ALTER TABLE sources ADD COLUMN rss_discovered_at TIMESTAMPTZ;
  • rss_url — URL of the discovered or directly-provided RSS/Atom feed
  • rss_discovered_at — Timestamp of last successful discovery/verification

No changes to article_history or any other table.

New Service: feed_parser.rs

Location: backend/src/services/feed_parser.rs

Public API

pub struct FeedEntry {
    pub url: String,
    pub title: String,
    pub published_date: Option<DateTime<Utc>>,
}

pub enum FeedResult {
    /// Feed found and parsed successfully
    Found {
        feed_url: String,
        entries: Vec<FeedEntry>,
    },
    /// No feed discovered or feed invalid
    NotFound,
}

/// Main entry point — called by Phase 1 pipeline per source.
pub async fn detect_and_parse_feed(
    http_client: &HttpClient,
    source_url: &str,
    rss_url: Option<&str>,
    rss_discovered_at: Option<DateTime<Utc>>,
    max_links: usize,
) -> FeedResult

/// Discover a feed URL from a source URL.
/// Checks Content-Type (direct RSS/Atom) or parses <link rel="alternate"> from HTML.
pub async fn discover_feed(
    http_client: &HttpClient,
    source_url: &str,
) -> Option<String>

/// Fetch and parse an RSS/Atom feed. Returns entries sorted by published_date descending.
pub async fn parse_feed(
    http_client: &HttpClient,
    feed_url: &str,
    max_links: usize,
) -> Result<Vec<FeedEntry>, FeedError>

detect_and_parse_feed Logic

if rss_url is Some AND rss_discovered_at is less than 30 days old:
    parse_feed(rss_url) → return Found or NotFound

if rss_url is Some AND rss_discovered_at is 30 or more days old:
    discover_feed(source_url)
    if new feed found → parse_feed(new_feed_url) → return Found (+ signal update)
    else → return NotFound (+ signal clear rss_url)

if rss_url is None:
    discover_feed(source_url)
    if feed found → parse_feed(feed_url) → return Found (+ signal persist)
    else → return NotFound
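
The branch selection above can be sketched as a pure function, kept separate from any HTTP work so it is easy to unit test. This is a minimal sketch, not the actual implementation: FeedPlan and plan_feed_lookup are hypothetical names, the age of rss_discovered_at is passed in as a std::time::Duration rather than computed from a DateTime<Utc>, and treating a missing timestamp as stale is an assumption the design does not spell out.

```rust
use std::time::Duration;

/// Re-discovery interval from the design: 30 days.
const REDISCOVERY_INTERVAL: Duration = Duration::from_secs(30 * 24 * 60 * 60);

/// Which branch of the logic applies (hypothetical helper type).
#[derive(Debug, PartialEq)]
pub enum FeedPlan<'a> {
    /// Stored feed URL is fresh: parse it directly.
    ParseCached(&'a str),
    /// Stored feed URL is stale: run discovery again.
    Rediscover,
    /// No stored feed URL: run discovery for the first time.
    Discover,
}

/// `age` is the time elapsed since rss_discovered_at. `None` means the
/// timestamp is missing, which this sketch treats as stale (an assumption).
pub fn plan_feed_lookup<'a>(rss_url: Option<&'a str>, age: Option<Duration>) -> FeedPlan<'a> {
    match (rss_url, age) {
        (Some(url), Some(age)) if age < REDISCOVERY_INTERVAL => FeedPlan::ParseCached(url),
        (Some(_), _) => FeedPlan::Rediscover,
        (None, _) => FeedPlan::Discover,
    }
}
```

With this split, the 30-day boundary is tested without mocking HTTP: an age of exactly 30 days falls into the re-discovery branch.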

Feed Detection Strategy

  1. Content-Type check: Fetch source_url, inspect response Content-Type:
    • application/rss+xml, application/atom+xml, text/xml, application/xml with RSS/Atom content → the URL itself is a feed
  2. HTML <link> discovery: If Content-Type is text/html, parse for:
    • <link rel="alternate" type="application/rss+xml" href="...">
    • <link rel="alternate" type="application/atom+xml" href="...">
    • Take the first match
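
The Content-Type step might look like the sketch below. ResponseKind and classify_content_type are hypothetical names, not part of the planned API. Treating text/xml and application/xml as "feed only if the body is RSS/Atom" is this sketch's reading of the rule above; the body check and the HTML <link> parsing are left out.

```rust
/// How a response should be treated during discovery (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum ResponseKind {
    /// Dedicated feed media type: the URL itself is a feed.
    Feed,
    /// Generic XML: a feed only if the body turns out to be RSS/Atom.
    MaybeFeed,
    /// HTML page: look for <link rel="alternate"> elements.
    Html,
    /// Anything else: no feed can be discovered from this response.
    Other,
}

/// Classify the raw value of a Content-Type response header.
pub fn classify_content_type(header: &str) -> ResponseKind {
    // Drop parameters such as "; charset=utf-8" and normalise case,
    // since media types are case-insensitive.
    let media_type = header
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();

    match media_type.as_str() {
        "application/rss+xml" | "application/atom+xml" => ResponseKind::Feed,
        "text/xml" | "application/xml" => ResponseKind::MaybeFeed,
        "text/html" => ResponseKind::Html,
        _ => ResponseKind::Other,
    }
}
```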

Feed Parsing

  • Crate: feed-rs (handles RSS 1.0, RSS 2.0, Atom, JSON Feed)
  • Extract per entry: url (from <link> or <guid>), title, published_date (from <pubDate> or <updated>)
  • Sort: by published_date descending (most recent first), entries without dates placed last
  • Limit: return at most max_links entries
  • SSRF protection: reuse existing URL validation (no private IPs, http/https only)
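
The sort and limit rules can be illustrated with a dependency-free sketch. Entry is a simplified stand-in for FeedEntry, and published_date is a Unix timestamp (i64) rather than DateTime<Utc>; both are assumptions made so the example compiles on its own. The function name sort_and_limit is hypothetical.

```rust
use std::cmp::Reverse;

/// Simplified stand-in for FeedEntry. `published_date` is a Unix timestamp
/// in seconds here instead of DateTime<Utc> (assumption for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub struct Entry {
    pub url: String,
    pub published_date: Option<i64>,
}

/// Sort newest first, place undated entries last, keep at most `max_links`.
pub fn sort_and_limit(mut entries: Vec<Entry>, max_links: usize) -> Vec<Entry> {
    // Key: (is_undated, Reverse(date)). `false < true`, so dated entries
    // come first; Reverse makes larger (newer) timestamps sort earlier.
    // sort_by_key is stable, so undated entries keep their feed order.
    entries.sort_by_key(|e| (e.published_date.is_none(), Reverse(e.published_date)));
    entries.truncate(max_links);
    entries
}
```

Because truncation happens after sorting, undated entries are the first to be dropped when the feed has more than max_links items.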

Pipeline Integration

Phase 1 Modification (synthesis/mod.rs)

Current flow per source:

extract_article_links(source_url) → Vec<String>

New flow per source:

1. feed_parser::detect_and_parse_feed(source_url, rss_url, rss_discovered_at, max_links)
2. If Found AND entries.len() >= 3 → use entry URLs
3. Else → fallback to source_scraper::extract_article_links(source_url)
4. If rss_url changed → async UPDATE sources SET rss_url, rss_discovered_at
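
Steps 2 and 3 reduce to a small decision that can be sketched independently of the rest of the pipeline. FeedOutcome is a simplified stand-in for FeedResult that carries only entry URLs, and select_links is a hypothetical helper; the html_fallback closure stands in for source_scraper::extract_article_links so the fallback is only evaluated when needed.

```rust
/// Minimum number of feed entries before the feed is used (from the design).
const MIN_FEED_ENTRIES: usize = 3;

/// Simplified stand-in for FeedResult that carries only entry URLs.
pub enum FeedOutcome {
    Found { feed_url: String, urls: Vec<String> },
    NotFound,
}

/// Returns the links to process and whether they came from the feed.
/// `html_fallback` stands in for source_scraper::extract_article_links
/// and is called only when the feed is missing or too sparse.
pub fn select_links<F>(outcome: FeedOutcome, html_fallback: F) -> (Vec<String>, bool)
where
    F: FnOnce() -> Vec<String>,
{
    match outcome {
        FeedOutcome::Found { urls, .. } if urls.len() >= MIN_FEED_ENTRIES => (urls, true),
        _ => (html_fallback(), false),
    }
}
```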

What stays the same:

  • Downstream pipeline: dedup against article_history → batch scrape → classify → accumulate
  • Preferred-first ordering (preferred sources processed first in wave)
  • Wave-based parallel processing
  • Phase 2 web search fallback (completely untouched)
  • Source diversity limits (max_articles_per_source)

RSS URL Persistence (db/sources.rs)

New function:

pub async fn update_source_rss(
    pool: &PgPool,
    source_id: Uuid,
    rss_url: Option<&str>,
    rss_discovered_at: Option<DateTime<Utc>>,
) -> Result<(), sqlx::Error>

Called during generation:

  • Discovery successful → rss_url = Some(url), rss_discovered_at = Some(now)
  • Re-discovery, feed still valid → update rss_discovered_at only
  • Re-discovery, feed gone → rss_url = None, rss_discovered_at = None
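
The three cases map onto the arguments of update_source_rss roughly as follows. This is a sketch under assumptions: DiscoveryOutcome and rss_update_args are hypothetical names, the timestamp is a plain u64 instead of DateTime<Utc>, and "update rss_discovered_at only" is modelled by passing the unchanged URL together with a fresh timestamp.

```rust
/// Result of a discovery or re-discovery attempt (hypothetical type).
pub enum DiscoveryOutcome<'a> {
    /// A feed URL was found for the first time, or a different one was found.
    Discovered(&'a str),
    /// Re-discovery confirmed the URL that is already stored.
    StillValid(&'a str),
    /// Re-discovery found no feed.
    Gone,
}

/// Values to pass as (rss_url, rss_discovered_at). `now` is a Unix
/// timestamp here; the real function takes DateTime<Utc>.
pub fn rss_update_args<'a>(
    outcome: DiscoveryOutcome<'a>,
    now: u64,
) -> (Option<&'a str>, Option<u64>) {
    match outcome {
        // Both cases store a URL with a fresh timestamp; for StillValid the
        // URL is unchanged, so only the timestamp effectively moves.
        DiscoveryOutcome::Discovered(url) | DiscoveryOutcome::StillValid(url) => {
            (Some(url), Some(now))
        }
        // Clear both columns so the next run starts discovery from scratch.
        DiscoveryOutcome::Gone => (None, None),
    }
}
```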

No Frontend / API Changes

The rss_url and rss_discovered_at fields are internal and are not exposed via the API or UI. The user adds a URL as before; the system transparently discovers and uses RSS feeds when available.

Testing Strategy

  • Unit tests for feed_parser: mock HTTP responses for RSS 2.0, Atom, HTML with <link>, HTML without feed, direct feed URL, malformed feeds
  • Unit tests for date sorting: verify newest-first ordering, entries without dates placed last
  • Unit tests for re-discovery logic: fresh cache (less than 30 days old), stale cache (30 days or older), stale cache where re-discovery returns a different feed URL
  • Integration tests: full Phase 1 pipeline with a source that has a mock RSS feed, verify articles extracted and sorted by date
  • Edge cases: feed with 0 entries, feed with < 3 entries (triggers fallback), feed with duplicate URLs, feed with relative URLs, feed URL that returns 404