RSS Feed Integration Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Add RSS/Atom feed support to personalized sources so the synthesis pipeline discovers articles via feeds first (sorted by recency), falling back to HTML extraction when no feed is found or it yields fewer than 3 links.

Architecture: A new feed_parser service handles feed discovery, parsing, and caching. The Phase 1 pipeline calls it before the existing source_scraper. Two new nullable columns on sources (rss_url and rss_discovered_at) persist the discovered feed URL and its discovery time, with a 30-day re-discovery cycle.

Tech Stack: Rust, feed-rs crate (RSS/Atom/JSON Feed parsing), scraper crate (HTML <link> discovery), reqwest (HTTP), sqlx (Postgres), wiremock (test mocks)
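
The feed-first rule stated in the Goal can be read as a small decision function. The sketch below is illustrative only: LinkSource, MIN_FEED_LINKS, and choose_links are hypothetical names that do not appear in the plan's code, and the threshold of 3 simply mirrors the "fewer than 3 links" wording above.

```rust
/// Where the final list of article links came from (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
pub enum LinkSource {
    Feed,
    Html,
}

/// Minimum number of feed links before the feed is trusted
/// (assumption: taken from the Goal's "fewer than 3 links").
pub const MIN_FEED_LINKS: usize = 3;

/// Prefer feed links when there are enough of them; otherwise fall back to HTML extraction.
/// `html_links` is a closure so the HTML scrape only runs when it is actually needed.
pub fn choose_links<F>(feed_links: Option<Vec<String>>, html_links: F) -> (LinkSource, Vec<String>)
where
    F: FnOnce() -> Vec<String>,
{
    match feed_links {
        Some(links) if links.len() >= MIN_FEED_LINKS => (LinkSource::Feed, links),
        _ => (LinkSource::Html, html_links()),
    }
}
```

Passing the HTML path as a closure keeps the fallback lazy, which matches the intent that HTML extraction runs only when the feed is missing or too short.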


Task 1: Add feed-rs dependency

Files:

  • Modify: backend/Cargo.toml

  • Step 1: Add feed-rs to dependencies

In backend/Cargo.toml, add after the scraper line in [dependencies]:

# RSS/Atom feed parsing
feed-rs = "2"
  • Step 2: Verify it compiles

Run: cd backend && cargo check
Expected: compiles with no errors

  • Step 3: Commit
git add backend/Cargo.toml
git commit -m "deps: add feed-rs crate for RSS/Atom feed parsing"

Task 2: Database migration — add RSS columns to sources

Files:

  • Create: backend/migrations/20260403000031_add_source_rss_fields.sql

  • Modify: backend/src/models/source.rs

  • Modify: backend/src/db/sources.rs

  • Step 1: Create the migration file

Create backend/migrations/20260403000031_add_source_rss_fields.sql:

ALTER TABLE sources ADD COLUMN rss_url TEXT;
ALTER TABLE sources ADD COLUMN rss_discovered_at TIMESTAMPTZ;
  • Step 2: Add fields to the Source struct

In backend/src/models/source.rs, add two fields to the Source struct after is_preferred:

pub struct Source {
    pub id: Uuid,
    pub user_id: Uuid,
    pub title: String,
    pub url: String,
    pub theme_id: Option<Uuid>,
    pub is_preferred: bool,
    pub rss_url: Option<String>,
    pub rss_discovered_at: Option<DateTime<Utc>>,
    pub created_at: DateTime<Utc>,
}
  • Step 3: Update all SQL SELECT queries in db/sources.rs

Every SELECT in backend/src/db/sources.rs that uses query_as::<_, Source> needs the two new columns. Update the column lists from:

SELECT id, user_id, title, url, theme_id, is_preferred, created_at

to:

SELECT id, user_id, title, url, theme_id, is_preferred, rss_url, rss_discovered_at, created_at

This applies to:

  • list_for_user (2 queries: with and without theme_id filter)

  • create (the RETURNING clause)

  • bulk_create (the RETURNING clause)

  • Step 4: Add update_source_rss function to db/sources.rs

Append to backend/src/db/sources.rs:

/// Update the cached RSS feed URL and discovery timestamp for a source.
///
/// Called during synthesis generation when a feed is discovered or re-verified.
/// Pass `rss_url = None` to clear a previously cached feed (e.g., feed no longer exists).
pub async fn update_source_rss(
    pool: &PgPool,
    source_id: Uuid,
    rss_url: Option<&str>,
    rss_discovered_at: Option<DateTime<Utc>>,
) -> Result<(), AppError> {
    sqlx::query(
        "UPDATE sources SET rss_url = $1, rss_discovered_at = $2 WHERE id = $3",
    )
    .bind(rss_url)
    .bind(rss_discovered_at)
    .bind(source_id)
    .execute(pool)
    .await?;

    Ok(())
}

Add use chrono::{DateTime, Utc}; to the imports at the top of db/sources.rs.

  • Step 5: Verify it compiles

Run: cd backend && cargo check
Expected: compiles (migration will run at startup)

  • Step 6: Commit
git add backend/migrations/20260403000031_add_source_rss_fields.sql backend/src/models/source.rs backend/src/db/sources.rs
git commit -m "feat: add rss_url and rss_discovered_at columns to sources"
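
To make the set/clear semantics of update_source_rss concrete, here is a minimal in-memory sketch. RssCache is a hypothetical stand-in for the two new columns, not a type from the plan, and the timestamp is plain Unix seconds rather than the TIMESTAMPTZ / DateTime<Utc> used by the real code.

```rust
/// In-memory stand-in for the two new nullable columns (hypothetical type, not in the plan).
#[derive(Debug, Clone, PartialEq, Default)]
pub struct RssCache {
    pub rss_url: Option<String>,
    /// Seconds since the Unix epoch; the real column is TIMESTAMPTZ.
    pub rss_discovered_at: Option<u64>,
}

impl RssCache {
    /// Mirrors `UPDATE sources SET rss_url = $1, rss_discovered_at = $2`:
    /// both values are always overwritten together, so passing `None`
    /// for both clears a previously cached feed.
    pub fn update(&mut self, rss_url: Option<&str>, discovered_at: Option<u64>) {
        self.rss_url = rss_url.map(str::to_string);
        self.rss_discovered_at = discovered_at;
    }
}
```

Because both columns are written in one statement, a source is never left with a feed URL but a stale timestamp from an earlier discovery.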

Task 3: Create feed_parser service — parse_feed function

Files:

  • Create: backend/src/services/feed_parser.rs

  • Modify: backend/src/services/mod.rs

  • Step 1: Write parse_feed together with its tests

Create backend/src/services/feed_parser.rs with the types, the parse_feed implementation, and the test module:

//! RSS/Atom feed parser service.
//!
//! Discovers and parses RSS/Atom feeds from source URLs.
//! Used in Phase 1 of the generation pipeline to extract article links
//! sorted by publication date (newest first), before falling back
//! to the HTML-based source_scraper.

use chrono::{DateTime, Utc};
use url::Url;

use crate::errors::AppError;

/// A single entry extracted from an RSS/Atom feed.
#[derive(Debug, Clone)]
pub struct FeedEntry {
    pub url: String,
    pub title: String,
    pub published_date: Option<DateTime<Utc>>,
}

/// Result of attempting to detect and parse a feed for a source.
pub enum FeedResult {
    /// Feed found and parsed successfully.
    Found {
        feed_url: String,
        entries: Vec<FeedEntry>,
    },
    /// No feed discovered or feed invalid.
    NotFound,
}

/// Minimum number of feed entries to consider the feed useful.
/// Below this threshold, the pipeline falls back to HTML extraction.
pub const MIN_FEED_ENTRIES: usize = 3;

/// Number of days before a cached feed URL is re-verified.
pub const REDISCOVERY_DAYS: i64 = 30;

/// Parse an RSS/Atom feed URL and return entries sorted by date (newest first).
///
/// Uses the `feed-rs` crate which handles RSS 1.0, RSS 2.0, Atom, and JSON Feed.
/// Entries without a published date are placed last.
pub async fn parse_feed(
    http_client: &reqwest::Client,
    feed_url: &str,
    max_links: usize,
) -> Result<Vec<FeedEntry>, AppError> {
    let parsed_url = Url::parse(feed_url)
        .map_err(|e| AppError::BadRequest(format!("Invalid feed URL: {}", e)))?;

    if let Err(e) = crate::services::scraper::check_ssrf(&parsed_url).await {
        tracing::warn!(url = feed_url, error = %e, "Feed URL failed SSRF check");
        return Ok(Vec::new());
    }

    let response = http_client
        .get(feed_url)
        .send()
        .await
        .map_err(|e| {
            tracing::warn!(url = feed_url, error = %e, "Failed to fetch feed");
            AppError::Internal(anyhow::anyhow!("Failed to fetch feed"))
        })?;

    if !response.status().is_success() {
        tracing::warn!(url = feed_url, status = %response.status(), "Feed returned non-2xx status");
        return Ok(Vec::new());
    }

    let body = response.bytes().await.map_err(|e| {
        AppError::Internal(anyhow::anyhow!("Failed to read feed body: {}", e))
    })?;

    let feed = feed_rs::parser::parse(&body[..]).map_err(|e| {
        tracing::warn!(url = feed_url, error = %e, "Failed to parse feed");
        AppError::Internal(anyhow::anyhow!("Failed to parse feed: {}", e))
    })?;

    let mut entries: Vec<FeedEntry> = feed
        .entries
        .into_iter()
        .filter_map(|entry| {
            // Get the article URL: prefer links, fall back to id if it looks like a URL
            let url = entry
                .links
                .first()
                .map(|l| l.href.clone())
                .or_else(|| {
                    if entry.id.starts_with("http://") || entry.id.starts_with("https://") {
                        Some(entry.id.clone())
                    } else {
                        None
                    }
                })?;

            let title = entry
                .title
                .map(|t| t.content)
                .unwrap_or_default();

            let published_date = entry
                .published
                .or(entry.updated);

            Some(FeedEntry {
                url,
                title,
                published_date,
            })
        })
        .collect();

    // Sort by published_date descending (newest first). `Option` orders `None`
    // before `Some`, so the reversed comparison places undated entries last.
    entries.sort_by(|a, b| b.published_date.cmp(&a.published_date));

    entries.truncate(max_links);

    Ok(entries)
}

#[cfg(test)]
mod tests {
    use super::*;
    use wiremock::{Mock, MockServer, ResponseTemplate};
    use wiremock::matchers::method;

    #[tokio::test]
    async fn parse_feed_rss2() {
        let server = MockServer::start().await;
        let rss_body = r#"<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Test Blog</title>
    <item>
      <title>Article 1</title>
      <link>https://example.com/article-1</link>
      <pubDate>Fri, 03 Apr 2026 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Article 2</title>
      <link>https://example.com/article-2</link>
      <pubDate>Thu, 02 Apr 2026 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Article 3</title>
      <link>https://example.com/article-3</link>
      <pubDate>Wed, 01 Apr 2026 10:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"#;

        Mock::given(method("GET"))
            .respond_with(ResponseTemplate::new(200).set_body_raw(rss_body, "application/rss+xml"))
            .mount(&server)
            .await;

        let client = reqwest::Client::new();
        let entries = parse_feed(&client, &server.uri(), 10).await.unwrap();

        assert_eq!(entries.len(), 3);
        assert_eq!(entries[0].title, "Article 1");
        assert_eq!(entries[0].url, "https://example.com/article-1");
        assert!(entries[0].published_date > entries[1].published_date);
        assert!(entries[1].published_date > entries[2].published_date);
    }

    #[tokio::test]
    async fn parse_feed_atom() {
        let server = MockServer::start().await;
        let atom_body = r#"<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Test Feed</title>
  <entry>
    <title>Atom Article</title>
    <link href="https://example.com/atom-1"/>
    <updated>2026-04-03T12:00:00Z</updated>
  </entry>
</feed>"#;

        Mock::given(method("GET"))
            .respond_with(ResponseTemplate::new(200).set_body_raw(atom_body, "application/atom+xml"))
            .mount(&server)
            .await;

        let client = reqwest::Client::new();
        let entries = parse_feed(&client, &server.uri(), 10).await.unwrap();

        assert_eq!(entries.len(), 1);
        assert_eq!(entries[0].title, "Atom Article");
        assert_eq!(entries[0].url, "https://example.com/atom-1");
        assert!(entries[0].published_date.is_some());
    }

    #[tokio::test]
    async fn parse_feed_respects_max_links() {
        let server = MockServer::start().await;
        let rss_body = r#"<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Test</title>
    <item><title>A1</title><link>https://example.com/1</link><pubDate>Fri, 03 Apr 2026 10:00:00 GMT</pubDate></item>
    <item><title>A2</title><link>https://example.com/2</link><pubDate>Thu, 02 Apr 2026 10:00:00 GMT</pubDate></item>
    <item><title>A3</title><link>https://example.com/3</link><pubDate>Wed, 01 Apr 2026 10:00:00 GMT</pubDate></item>
  </channel>
</rss>"#;

        Mock::given(method("GET"))
            .respond_with(ResponseTemplate::new(200).set_body_raw(rss_body, "application/rss+xml"))
            .mount(&server)
            .await;

        let client = reqwest::Client::new();
        let entries = parse_feed(&client, &server.uri(), 2).await.unwrap();

        assert_eq!(entries.len(), 2);
        assert_eq!(entries[0].url, "https://example.com/1"); // newest first
    }

    #[tokio::test]
    async fn parse_feed_entries_without_dates_come_last() {
        let server = MockServer::start().await;
        let rss_body = r#"<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Test</title>
    <item><title>No date</title><link>https://example.com/no-date</link></item>
    <item><title>Has date</title><link>https://example.com/has-date</link><pubDate>Fri, 03 Apr 2026 10:00:00 GMT</pubDate></item>
  </channel>
</rss>"#;

        Mock::given(method("GET"))
            .respond_with(ResponseTemplate::new(200).set_body_raw(rss_body, "application/rss+xml"))
            .mount(&server)
            .await;

        let client = reqwest::Client::new();
        let entries = parse_feed(&client, &server.uri(), 10).await.unwrap();

        assert_eq!(entries.len(), 2);
        assert_eq!(entries[0].url, "https://example.com/has-date");
        assert_eq!(entries[1].url, "https://example.com/no-date");
    }

    #[tokio::test]
    async fn parse_feed_404_returns_empty() {
        let server = MockServer::start().await;

        Mock::given(method("GET"))
            .respond_with(ResponseTemplate::new(404))
            .mount(&server)
            .await;

        let client = reqwest::Client::new();
        let entries = parse_feed(&client, &server.uri(), 10).await.unwrap();
        assert!(entries.is_empty());
    }

    #[tokio::test]
    async fn parse_feed_invalid_xml_returns_error() {
        let server = MockServer::start().await;

        Mock::given(method("GET"))
            .respond_with(ResponseTemplate::new(200).set_body_string("not xml at all"))
            .mount(&server)
            .await;

        let client = reqwest::Client::new();
        let result = parse_feed(&client, &server.uri(), 10).await;
        assert!(result.is_err());
    }
}
  • Step 2: Register the module in services/mod.rs

In backend/src/services/mod.rs, add after the export line:

pub mod feed_parser;
  • Step 3: Run tests to verify they pass

Run: cd backend && cargo test --lib feed_parser -- --nocapture
Expected: all 6 tests pass

  • Step 4: Commit
git add backend/src/services/feed_parser.rs backend/src/services/mod.rs
git commit -m "feat: add feed_parser service with parse_feed function and tests"
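
The two pure pieces of parse_feed, choosing an entry URL and ordering entries, can be exercised without any HTTP or feed-rs involvement. The following is a simplified sketch under stated assumptions: Entry is a hypothetical stand-in for FeedEntry with the date reduced to Unix seconds, and entry_url and newest_first are illustrative helper names that the plan itself does not define.

```rust
/// Simplified stand-in for `FeedEntry`; the date is Unix seconds instead of `DateTime<Utc>`.
#[derive(Debug, Clone, PartialEq)]
pub struct Entry {
    pub url: String,
    pub published: Option<u64>,
}

/// Pick an entry URL: the first link if present, otherwise the id
/// when it looks like an http(s) URL, otherwise nothing.
pub fn entry_url(links: &[&str], id: &str) -> Option<String> {
    links.first().map(|l| l.to_string()).or_else(|| {
        if id.starts_with("http://") || id.starts_with("https://") {
            Some(id.to_string())
        } else {
            None
        }
    })
}

/// Sort newest first with undated entries last, then keep at most `max_links`.
/// `Option` orders `None` before `Some`, so comparing `b` against `a`
/// (a descending sort) pushes `None` to the end.
pub fn newest_first(mut entries: Vec<Entry>, max_links: usize) -> Vec<Entry> {
    entries.sort_by(|a, b| b.published.cmp(&a.published));
    entries.truncate(max_links);
    entries
}
```

Sorting before truncating matters: with max_links smaller than the feed length, undated entries are the first to be dropped.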

Task 4: Add discover_feed function

Files:

  • Modify: backend/src/services/feed_parser.rs

  • Step 1: Write failing tests for discover_feed

Add these tests to the mod tests block in backend/src/services/feed_parser.rs:

    #[tokio::test]
    async fn discover_feed_from_link_rss() {
        let server = MockServer::start().await;
        let html = format!(
            r#"<html><head>
            <link rel="alternate" type="application/rss+xml" href="{}/feed.xml">
            </head><body></body></html>"#,
            server.uri()
        );

        Mock::given(method("GET"))
            .respond_with(ResponseTemplate::new(200).set_body_raw(html, "text/html"))
            .mount(&server)
            .await;

        let client = reqwest::Client::new();
        let result = discover_feed(&client, &server.uri()).await;

        assert!(result.is_some());
        assert!(result.unwrap().contains("/feed.xml"));
    }

    #[tokio::test]
    async fn discover_feed_from_link_atom() {
        let server = MockServer::start().await;
        let html = format!(
            r#"<html><head>
            <link rel="alternate" type="application/atom+xml" href="{}/atom.xml">
            </head><body></body></html>"#,
            server.uri()
        );

        Mock::given(method("GET"))
            .respond_with(ResponseTemplate::new(200).set_body_raw(html, "text/html"))
            .mount(&server)
            .await;

        let client = reqwest::Client::new();
        let result = discover_feed(&client, &server.uri()).await;

        assert!(result.is_some());
        assert!(result.unwrap().contains("/atom.xml"));
    }

    #[tokio::test]
    async fn discover_feed_direct_rss_url() {
        let server = MockServer::start().await;
        let rss_body = r#"<?xml version="1.0"?><rss version="2.0"><channel><title>T</title></channel></rss>"#;

        Mock::given(method("GET"))
            .respond_with(
                ResponseTemplate::new(200)
                    .set_body_raw(rss_body, "application/rss+xml")
            )
            .mount(&server)
            .await;

        let client = reqwest::Client::new();
        let result = discover_feed(&client, &server.uri()).await;

        assert!(result.is_some());
        assert_eq!(result.unwrap(), server.uri());
    }

    #[tokio::test]
    async fn discover_feed_no_feed_found() {
        let server = MockServer::start().await;
        let html = "<html><head><title>No feed</title></head><body></body></html>";

        Mock::given(method("GET"))
            .respond_with(ResponseTemplate::new(200).set_body_raw(html, "text/html"))
            .mount(&server)
            .await;

        let client = reqwest::Client::new();
        let result = discover_feed(&client, &server.uri()).await;

        assert!(result.is_none());
    }

    #[tokio::test]
    async fn discover_feed_resolves_relative_href() {
        let server = MockServer::start().await;
        let html = r#"<html><head>
        <link rel="alternate" type="application/rss+xml" href="/feed.xml">
        </head><body></body></html>"#;

        Mock::given(method("GET"))
            .respond_with(ResponseTemplate::new(200).set_body_raw(html, "text/html"))
            .mount(&server)
            .await;

        let client = reqwest::Client::new();
        let result = discover_feed(&client, &server.uri()).await;

        assert!(result.is_some());
        let feed_url = result.unwrap();
        assert!(feed_url.starts_with(&server.uri()));
        assert!(feed_url.ends_with("/feed.xml"));
    }
  • Step 2: Run tests to verify they fail

Run: cd backend && cargo test --lib feed_parser::tests::discover_feed -- --nocapture
Expected: compilation error (discover_feed not defined)

  • Step 3: Implement discover_feed

Add this function to backend/src/services/feed_parser.rs, before the #[cfg(test)] block:

/// RSS/Atom content types that indicate a direct feed URL.
const FEED_CONTENT_TYPES: &[&str] = &[
    "application/rss+xml",
    "application/atom+xml",
    "application/xml",
    "text/xml",
];

/// Discover an RSS/Atom feed URL from a source URL.
///
/// Two detection strategies:
/// 1. If the URL itself returns an RSS/Atom Content-Type, it is a feed directly.
/// 2. If the URL returns HTML, look for `<link rel="alternate" type="application/rss+xml">`
///    or `type="application/atom+xml"` in the `<head>`.
///
/// Returns `Some(feed_url)` if a feed is found, `None` otherwise.
pub async fn discover_feed(
    http_client: &reqwest::Client,
    source_url: &str,
) -> Option<String> {
    let parsed_url = Url::parse(source_url).ok()?;

    if let Err(e) = crate::services::scraper::check_ssrf(&parsed_url).await {
        tracing::warn!(url = source_url, error = %e, "Source URL failed SSRF check during feed discovery");
        return None;
    }

    let response = http_client
        .get(source_url)
        .send()
        .await
        .ok()?;

    if !response.status().is_success() {
        return None;
    }

    // Check Content-Type for direct feed
    let content_type = response
        .headers()
        .get(reqwest::header::CONTENT_TYPE)
        .and_then(|v| v.to_str().ok())
        .unwrap_or("")
        .to_lowercase();

    if FEED_CONTENT_TYPES.iter().any(|ct| content_type.contains(ct)) {
        return Some(source_url.to_string());
    }

    // If HTML, look for <link rel="alternate"> with feed type
    if !content_type.contains("text/html") {
        return None;
    }

    let body = response.text().await.ok()?;
    let document = scraper::Html::parse_document(&body);

    let selector = scraper::Selector::parse(r#"link[rel="alternate"]"#).ok()?;

    for element in document.select(&selector) {
        let link_type = element.value().attr("type").unwrap_or("");
        if link_type == "application/rss+xml" || link_type == "application/atom+xml" {
            if let Some(href) = element.value().attr("href") {
                // Resolve relative URLs against the source URL; skip hrefs that
                // cannot be resolved and keep scanning the remaining <link> tags.
                if let Ok(resolved) = parsed_url.join(href) {
                    return Some(resolved.to_string());
                }
            }
        }
    }

    None
}
  • Step 4: Run tests to verify they pass

Run: cd backend && cargo test --lib feed_parser -- --nocapture
Expected: all 11 tests pass (6 from Task 3 + 5 new)

  • Step 5: Commit
git add backend/src/services/feed_parser.rs
git commit -m "feat: add discover_feed function for RSS/Atom auto-discovery"
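
The Content-Type branch of discover_feed decides between three outcomes before any HTML is parsed. A minimal sketch of that classification follows; Kind and classify are hypothetical names introduced for illustration, while the list of feed content types is copied from the plan.

```rust
/// Outcome of inspecting a response's Content-Type header (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
pub enum Kind {
    /// The URL itself is a feed.
    Feed,
    /// An HTML page whose <head> may advertise a feed.
    Html,
    /// Anything else: discovery gives up.
    Other,
}

/// Content types treated as a direct feed (same list as FEED_CONTENT_TYPES in the plan).
const FEED_CONTENT_TYPES: &[&str] = &[
    "application/rss+xml",
    "application/atom+xml",
    "application/xml",
    "text/xml",
];

/// Classify a raw Content-Type value. Matching is case-insensitive and uses
/// substring search, so parameters such as `; charset=utf-8` are tolerated.
pub fn classify(content_type: &str) -> Kind {
    let ct = content_type.to_lowercase();
    if FEED_CONTENT_TYPES.iter().any(|t| ct.contains(*t)) {
        Kind::Feed
    } else if ct.contains("text/html") {
        Kind::Html
    } else {
        Kind::Other
    }
}
```

Note that a page served as text/plain, or with no Content-Type at all, lands in Other, so discovery returns None without reading the body.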

Task 5: Add detect_and_parse_feed orchestration function

Files:

  • Modify: backend/src/services/feed_parser.rs

  • Step 1: Write failing tests for detect_and_parse_feed

Add these tests to the mod tests block in backend/src/services/feed_parser.rs:

    #[tokio::test]
    async fn detect_and_parse_cached_fresh_feed() {
        let server = MockServer::start().await;
        let rss_body = r#"<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"><channel><title>T</title>
  <item><title>A1</title><link>https://example.com/1</link><pubDate>Fri, 03 Apr 2026 10:00:00 GMT</pubDate></item>
  <item><title>A2</title><link>https://example.com/2</link><pubDate>Thu, 02 Apr 2026 10:00:00 GMT</pubDate></item>
  <item><title>A3</title><link>https://example.com/3</link><pubDate>Wed, 01 Apr 2026 10:00:00 GMT</pubDate></item>
</channel></rss>"#;

        Mock::given(method("GET"))
            .respond_with(ResponseTemplate::new(200).set_body_raw(rss_body, "application/rss+xml"))
            .mount(&server)
            .await;

        let client = reqwest::Client::new();
        let result = detect_and_parse_feed(
            &client,
            "https://example.com",
            Some(&server.uri()),
            Some(Utc::now()), // fresh
            10,
        ).await;

        match result {
            FeedResult::Found { entries, .. } => assert_eq!(entries.len(), 3),
            FeedResult::NotFound => panic!("Expected Found"),
        }
    }

    #[tokio::test]
    async fn detect_and_parse_no_cache_discovers_feed() {
        let server = MockServer::start().await;

        // First request: HTML page with feed link
        let feed_path = format!("{}/feed.xml", server.uri());
        let html = format!(
            r#"<html><head>
            <link rel="alternate" type="application/rss+xml" href="{}">
            </head><body></body></html>"#,
            feed_path
        );

        let rss_body = r#"<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"><channel><title>T</title>
  <item><title>A1</title><link>https://example.com/1</link><pubDate>Fri, 03 Apr 2026 10:00:00 GMT</pubDate></item>
  <item><title>A2</title><link>https://example.com/2</link><pubDate>Thu, 02 Apr 2026 10:00:00 GMT</pubDate></item>
  <item><title>A3</title><link>https://example.com/3</link><pubDate>Wed, 01 Apr 2026 10:00:00 GMT</pubDate></item>
</channel></rss>"#;

        // Mock: source page returns HTML
        Mock::given(method("GET"))
            .and(wiremock::matchers::path("/"))
            .respond_with(ResponseTemplate::new(200).set_body_raw(html, "text/html"))
            .mount(&server)
            .await;

        // Mock: feed URL returns RSS
        Mock::given(method("GET"))
            .and(wiremock::matchers::path("/feed.xml"))
            .respond_with(ResponseTemplate::new(200).set_body_raw(rss_body, "application/rss+xml"))
            .mount(&server)
            .await;

        let client = reqwest::Client::new();
        let result = detect_and_parse_feed(
            &client,
            &server.uri(),
            None,  // no cache
            None,
            10,
        ).await;

        match result {
            FeedResult::Found { feed_url, entries } => {
                assert!(feed_url.contains("/feed.xml"));
                assert_eq!(entries.len(), 3);
            }
            FeedResult::NotFound => panic!("Expected Found"),
        }
    }

    #[tokio::test]
    async fn detect_and_parse_no_feed_returns_not_found() {
        let server = MockServer::start().await;
        let html = "<html><head><title>No feed</title></head><body></body></html>";

        Mock::given(method("GET"))
            .respond_with(ResponseTemplate::new(200).set_body_raw(html, "text/html"))
            .mount(&server)
            .await;

        let client = reqwest::Client::new();
        let result = detect_and_parse_feed(
            &client,
            &server.uri(),
            None,
            None,
            10,
        ).await;

        assert!(matches!(result, FeedResult::NotFound));
    }

    #[tokio::test]
    async fn detect_and_parse_stale_cache_rediscovers() {
        let server = MockServer::start().await;

        let feed_path = format!("{}/feed.xml", server.uri());
        let html = format!(
            r#"<html><head>
            <link rel="alternate" type="application/rss+xml" href="{}">
            </head><body></body></html>"#,
            feed_path
        );

        let rss_body = r#"<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"><channel><title>T</title>
  <item><title>A1</title><link>https://example.com/1</link><pubDate>Fri, 03 Apr 2026 10:00:00 GMT</pubDate></item>
  <item><title>A2</title><link>https://example.com/2</link><pubDate>Thu, 02 Apr 2026 10:00:00 GMT</pubDate></item>
  <item><title>A3</title><link>https://example.com/3</link><pubDate>Wed, 01 Apr 2026 10:00:00 GMT</pubDate></item>
</channel></rss>"#;

        Mock::given(method("GET"))
            .and(wiremock::matchers::path("/"))
            .respond_with(ResponseTemplate::new(200).set_body_raw(html, "text/html"))
            .mount(&server)
            .await;

        Mock::given(method("GET"))
            .and(wiremock::matchers::path("/feed.xml"))
            .respond_with(ResponseTemplate::new(200).set_body_raw(rss_body, "application/rss+xml"))
            .mount(&server)
            .await;

        let client = reqwest::Client::new();
        let stale_date = Utc::now() - chrono::Duration::days(31);
        let result = detect_and_parse_feed(
            &client,
            &server.uri(),
            Some("https://old-feed.example.com/rss"), // stale cached URL
            Some(stale_date),
            10,
        ).await;

        match result {
            FeedResult::Found { feed_url, entries } => {
                assert!(feed_url.contains("/feed.xml"), "Should discover new feed URL");
                assert_eq!(entries.len(), 3);
            }
            FeedResult::NotFound => panic!("Expected Found after re-discovery"),
        }
    }
  • Step 2: Run tests to verify they fail

Run: cd backend && cargo test --lib feed_parser::tests::detect_and_parse -- --nocapture
Expected: compilation error (detect_and_parse_feed not defined)

  • Step 3: Implement detect_and_parse_feed

Add this function to backend/src/services/feed_parser.rs, before the #[cfg(test)] block:

/// Detect and parse an RSS/Atom feed for a source URL.
///
/// Orchestrates the discovery and parsing logic:
/// - If `rss_url` is cached and fresh (< 30 days), parse it directly.
/// - If `rss_url` is cached but stale (>= 30 days), re-discover from `source_url`.
/// - If no `rss_url` cached, attempt discovery from `source_url`.
///
/// Returns `FeedResult::Found` with the feed URL and sorted entries,
/// or `FeedResult::NotFound` if no feed could be found/parsed.
pub async fn detect_and_parse_feed(
    http_client: &reqwest::Client,
    source_url: &str,
    rss_url: Option<&str>,
    rss_discovered_at: Option<DateTime<Utc>>,
    max_links: usize,
) -> FeedResult {
    // Case 1: Cached and fresh — use directly
    if let Some(cached_url) = rss_url {
        let is_fresh = rss_discovered_at
            .map(|d| Utc::now().signed_duration_since(d).num_days() < REDISCOVERY_DAYS)
            .unwrap_or(false);

        if is_fresh {
            match parse_feed(http_client, cached_url, max_links).await {
                Ok(entries) if !entries.is_empty() => {
                    return FeedResult::Found {
                        feed_url: cached_url.to_string(),
                        entries,
                    };
                }
                _ => {
                    tracing::warn!(url = cached_url, "Cached feed failed or returned no entries, attempting re-discovery");
                }
            }
        }
    }

    // Case 2: No cache or stale — discover
    let discovered = discover_feed(http_client, source_url).await;

    if let Some(feed_url) = discovered {
        match parse_feed(http_client, &feed_url, max_links).await {
            Ok(entries) if !entries.is_empty() => {
                return FeedResult::Found {
                    feed_url,
                    entries,
                };
            }
            Ok(_) => {
                tracing::info!(url = feed_url, "Discovered feed is empty");
            }
            Err(e) => {
                tracing::warn!(url = feed_url, error = %e, "Discovered feed failed to parse");
            }
        }
    }

    FeedResult::NotFound
}
  • Step 4: Run tests to verify they pass

Run: cd backend && cargo test --lib feed_parser -- --nocapture
Expected: all 15 tests pass

  • Step 5: Commit
git add backend/src/services/feed_parser.rs
git commit -m "feat: add detect_and_parse_feed orchestration function"
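
The cache decision at the top of detect_and_parse_feed reduces to a two-way choice based on the cached URL and its age. The sketch below isolates that rule using only the standard library; CacheAction and cache_action are hypothetical names, and the age is passed in as a std::time::Duration instead of being derived from chrono timestamps as in the plan.

```rust
use std::time::Duration;

/// Days before a cached feed URL is re-verified (same value as in the plan).
pub const REDISCOVERY_DAYS: u64 = 30;
const SECS_PER_DAY: u64 = 24 * 60 * 60;

/// What to do with the cached feed URL (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
pub enum CacheAction {
    /// Parse the cached feed URL directly.
    UseCached,
    /// Run discovery against the source URL.
    Discover,
}

/// Decide from the cached URL and its age. `age` is how long ago the feed was
/// discovered; `None` means no timestamp was stored, which counts as stale.
pub fn cache_action(rss_url: Option<&str>, age: Option<Duration>) -> CacheAction {
    match (rss_url, age) {
        // Whole elapsed days, matching the `num_days() < REDISCOVERY_DAYS` check.
        (Some(_), Some(age)) if age.as_secs() / SECS_PER_DAY < REDISCOVERY_DAYS => {
            CacheAction::UseCached
        }
        _ => CacheAction::Discover,
    }
}
```

The sketch covers only the freshness rule. In the plan's function, a fresh cached feed that fails to parse or yields no entries also falls through to discovery.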

Task 6: Integrate feed_parser into the Phase 1 pipeline

Files:

  • Modify: backend/src/services/synthesis/mod.rs

  • Step 1: Add feed_parser import

In backend/src/services/synthesis/mod.rs, add after line 29 (use crate::services::source_scraper;):

use crate::services::feed_parser;
  • Step 2: Replace the link extraction in Phase 1 wave processing

In backend/src/services/synthesis/mod.rs, locate the Phase 1 link extraction block (around lines 193-224). It sits inside the 'wave_loop, where join_set spawns tasks that call source_scraper::extract_article_links.

Replace the entire block from let mut wave_urls: Vec<(String, String)> = Vec::new(); through the closing } of while let Some(join_result) = join_set.join_next().await { ... } (lines 193-224) with:

            let mut wave_urls: Vec<(String, String)> = Vec::new();
            let mut rss_updates: Vec<(Uuid, Option<String>, Option<DateTime<Utc>>)> = Vec::new();
            {
                let mut join_set = tokio::task::JoinSet::new();
                for source in wave_sources {
                    let client = state.http_client.clone();
                    let source_id = source.id;
                    let source_url = source.url.clone();
                    let source_title = source.title.clone();
                    let rss_url = source.rss_url.clone();
                    let rss_discovered_at = source.rss_discovered_at;
                    let max_l = max_links;
                    join_set.spawn(async move {
                        // Try RSS feed first
                        let feed_result = feed_parser::detect_and_parse_feed(
                            &client,
                            &source_url,
                            rss_url.as_deref(),
                            rss_discovered_at,
                            max_l,
                        ).await;

                        match feed_result {
                            feed_parser::FeedResult::Found { feed_url, entries }
                                if entries.len() >= feed_parser::MIN_FEED_ENTRIES =>
                            {
                                let links: Vec<String> = entries.into_iter().map(|e| e.url).collect();
                                tracing::info!(
                                    source = %source_title,
                                    feed = %feed_url,
                                    links = links.len(),
                                    "Extracted links from RSS feed"
                                );
                                // Signal RSS URL update if it changed
                                let rss_changed = rss_url.as_deref() != Some(feed_url.as_str());
                                let rss_stale = rss_discovered_at
                                    .map(|d| Utc::now().signed_duration_since(d).num_days() >= feed_parser::REDISCOVERY_DAYS)
                                    .unwrap_or(true);
                                let update = if rss_changed || rss_stale {
                                    Some((source_id, Some(feed_url), Some(Utc::now())))
                                } else {
                                    None
                                };
                                (source_url, source_title, Ok(links), update)
                            }
                            _ => {
                                // Fallback to HTML extraction
                                let links = source_scraper::extract_article_links(&client, &source_url, max_l).await;
                                // This arm also covers feeds with fewer than MIN_FEED_ENTRIES entries.
                                // If a cached RSS URL exists, clear it so discovery reruns next time.
                                let update = if rss_url.is_some() {
                                    Some((source_id, None, None))
                                } else {
                                    None
                                };
                                (source_url, source_title, links, update)
                            }
                        }
                    });
                }

                while let Some(join_result) = join_set.join_next().await {
                    if let Ok((source_url, source_title, links_result, rss_update)) = join_result {
                        if let Some(update) = rss_update {
                            rss_updates.push(update);
                        }
                        match links_result {
                            Ok(links) => {
                                tracing::info!(source = %source_title, links = links.len(), "Extracted links from source");
                                for link in links {
                                    if seen_urls.insert(link.to_lowercase()) {
                                        wave_urls.push((link, source_url.clone()));
                                    }
                                }
                            }
                            Err(e) => {
                                tracing::warn!(source = %source_title, error = %e, "Failed to extract links");
                            }
                        }
                    }
                }
            }

            // Persist RSS URL updates (best-effort: each write is awaited, errors are ignored)
            for (source_id, new_rss_url, new_discovered_at) in rss_updates {
                db::sources::update_source_rss(
                    &state.pool,
                    source_id,
                    new_rss_url.as_deref(),
                    new_discovered_at,
                ).await.ok();
            }
  • Step 3: Verify it compiles

Run: cd backend && cargo check Expected: compiles with no errors

  • Step 4: Run existing unit tests

Run: cd backend && cargo test --lib Expected: all tests pass (no regressions)

  • Step 5: Commit
git add backend/src/services/synthesis/mod.rs
git commit -m "feat: integrate feed_parser into Phase 1 pipeline with HTML fallback"
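
The decision about when to write back the RSS columns is buried inside the spawned task above, so it may help to see it as a pure function. The following is a minimal sketch, not part of the plan's code: `RssUpdate` and `decide_rss_update` are hypothetical names, the age is passed in as whole days instead of being computed with chrono, and the local `REDISCOVERY_DAYS` constant assumes the 30-day cycle stated in the architecture note.

```rust
/// Hypothetical stand-in for `feed_parser::REDISCOVERY_DAYS` (assumed to be 30,
/// per the plan's 30-day re-discovery cycle).
const REDISCOVERY_DAYS: i64 = 30;

/// What to do with the cached feed columns after one source has been processed.
#[derive(Debug, PartialEq)]
enum RssUpdate {
    /// Leave the row untouched.
    Keep,
    /// Store this feed URL and refresh the discovery timestamp.
    Store(String),
    /// Clear both RSS columns so discovery starts over on the next run.
    Clear,
}

/// `found_feed` is `Some(url)` only when the feed yielded enough entries.
/// `cached_age_days` is `None` when no discovery timestamp is stored.
fn decide_rss_update(
    cached_url: Option<&str>,
    cached_age_days: Option<i64>,
    found_feed: Option<&str>,
) -> RssUpdate {
    match found_feed {
        Some(url) => {
            let changed = cached_url != Some(url);
            // A missing timestamp counts as stale, mirroring `unwrap_or(true)`.
            let stale = cached_age_days.map_or(true, |days| days >= REDISCOVERY_DAYS);
            if changed || stale {
                RssUpdate::Store(url.to_string())
            } else {
                RssUpdate::Keep
            }
        }
        // Feed missing or too small: clear only if something was cached.
        None if cached_url.is_some() => RssUpdate::Clear,
        None => RssUpdate::Keep,
    }
}
```

Written this way, the three outcomes (keep, store, clear) can be unit-tested without a mock server or a database, which is one possible reason to extract the logic if the closure grows further.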

Task 7: Add integration test for RSS feed in pipeline

Files:

  • Modify: backend/src/services/feed_parser.rs (add a focused test to the existing mod tests block; no separate synthesis integration test is required)

  • Step 1: Write a test that verifies RSS-first behavior end-to-end

Add this test to the mod tests block at the end of backend/src/services/feed_parser.rs:

    #[tokio::test]
    async fn full_flow_rss_first_with_html_fallback() {
        // Source 1: has an RSS feed with 5 articles
        let server1 = MockServer::start().await;
        let rss_body = r#"<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"><channel><title>Blog</title>
  <item><title>A1</title><link>https://blog.example.com/1</link><pubDate>Fri, 03 Apr 2026 10:00:00 GMT</pubDate></item>
  <item><title>A2</title><link>https://blog.example.com/2</link><pubDate>Thu, 02 Apr 2026 10:00:00 GMT</pubDate></item>
  <item><title>A3</title><link>https://blog.example.com/3</link><pubDate>Wed, 01 Apr 2026 10:00:00 GMT</pubDate></item>
  <item><title>A4</title><link>https://blog.example.com/4</link><pubDate>Tue, 31 Mar 2026 10:00:00 GMT</pubDate></item>
  <item><title>A5</title><link>https://blog.example.com/5</link><pubDate>Mon, 30 Mar 2026 10:00:00 GMT</pubDate></item>
</channel></rss>"#;

        Mock::given(method("GET"))
            .respond_with(ResponseTemplate::new(200).set_body_raw(rss_body, "application/rss+xml"))
            .mount(&server1)
            .await;

        let client = reqwest::Client::new();

        // With cached RSS URL (fresh) — should use RSS directly
        let result = detect_and_parse_feed(
            &client,
            "https://blog.example.com",
            Some(&server1.uri()),
            Some(Utc::now()),
            10,
        ).await;

        match result {
            FeedResult::Found { entries, .. } => {
                assert_eq!(entries.len(), 5);
                // Verify sorted newest first
                for i in 0..entries.len() - 1 {
                    if let (Some(a), Some(b)) = (&entries[i].published_date, &entries[i + 1].published_date) {
                        assert!(a >= b, "Entries should be sorted newest first");
                    }
                }
            }
            FeedResult::NotFound => panic!("Expected Found"),
        }

        // Source 2: no RSS feed, only HTML — should return NotFound
        let server2 = MockServer::start().await;
        let html = r#"<html><head><title>No feed</title></head><body>
        <a href="/article-1">Article 1</a>
        </body></html>"#;

        Mock::given(method("GET"))
            .respond_with(ResponseTemplate::new(200).set_body_string(html))
            .mount(&server2)
            .await;

        let result = detect_and_parse_feed(
            &client,
            &server2.uri(),
            None,
            None,
            10,
        ).await;

        // No feed found — pipeline would fall back to source_scraper
        assert!(matches!(result, FeedResult::NotFound));
    }
  • Step 2: Run all feed_parser tests

Run: cd backend && cargo test --lib feed_parser -- --nocapture Expected: all 16 tests pass

  • Step 3: Run full unit test suite

Run: cd backend && cargo test --lib Expected: all tests pass

  • Step 4: Commit
git add backend/src/services/feed_parser.rs
git commit -m "test: add end-to-end RSS flow test for feed_parser"
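
The ordering check in the test above skips any pair where a date is missing, so it would pass vacuously if no pubDate values were parsed. The sketch below shows one way to make that explicit. It is illustrative only: `Entry` is a simplified, hypothetical stand-in for the plan's feed entry type (it uses a Unix timestamp rather than a chrono value), and `is_newest_first` and `all_dated` are assumed helper names.

```rust
/// Hypothetical, simplified stand-in for a parsed feed entry.
struct Entry {
    url: String,
    /// Publication time as Unix seconds; `None` when the feed gave no usable date.
    published_unix: Option<i64>,
}

/// True when every adjacent pair of dated entries is ordered newest first.
/// Pairs with a missing date are skipped, like the `if let` in the test.
/// `windows(2)` yields nothing for slices shorter than two, so an empty
/// slice is handled without any index arithmetic.
fn is_newest_first(entries: &[Entry]) -> bool {
    entries.windows(2).all(|pair| {
        match (pair[0].published_unix, pair[1].published_unix) {
            (Some(newer), Some(older)) => newer >= older,
            _ => true,
        }
    })
}

/// Stricter companion check: every entry with a non-empty URL has a date.
fn all_dated(entries: &[Entry]) -> bool {
    entries
        .iter()
        .filter(|e| !e.url.is_empty())
        .all(|e| e.published_unix.is_some())
}
```

Asserting both conditions in a test would catch the case where date parsing silently fails, which the ordering check alone cannot detect.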