RSS Feed Integration Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.
Goal: Add RSS/Atom feed support to personalized sources so the synthesis pipeline discovers articles via feeds first (sorted by recency), falling back to HTML extraction when no feed is found or it yields fewer than 3 links.
Architecture: New feed_parser service handles feed discovery, parsing, and caching. The Phase 1 pipeline calls it before the existing source_scraper. Two new nullable columns on sources persist discovered feed URLs with a 30-day re-discovery cycle.
Tech Stack: Rust, feed-rs crate (RSS/Atom/JSON Feed parsing), scraper crate (HTML <link> discovery), reqwest (HTTP), sqlx (Postgres), wiremock (test mocks)
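The feed-first rule stated in the Goal can be sketched as a small pure function. This is an illustration only: choose_links and its parameters are hypothetical names that do not appear in the plan's code, and the threshold of 3 simply mirrors the Goal.

```rust
/// Minimum number of feed links before the feed is trusted (taken from the Goal).
const MIN_FEED_ENTRIES: usize = 3;

/// Hypothetical helper: prefer feed links, otherwise fall back to HTML extraction.
///
/// `feed_links` is `None` when no feed was found. The fallback closure is only
/// invoked when the feed is missing or yields fewer than `MIN_FEED_ENTRIES` links.
fn choose_links<F>(feed_links: Option<Vec<String>>, html_fallback: F) -> Vec<String>
where
    F: FnOnce() -> Vec<String>,
{
    match feed_links {
        Some(links) if links.len() >= MIN_FEED_ENTRIES => links,
        _ => html_fallback(),
    }
}
```

Passing the fallback as a closure keeps the HTML scrape lazy, so it costs nothing when the feed already provides enough links.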
Task 1: Add feed-rs dependency
Files:
- Modify: backend/Cargo.toml
- Step 1: Add feed-rs to dependencies
In backend/Cargo.toml, add after the scraper line in [dependencies]:
# RSS/Atom feed parsing
feed-rs = "2"
- Step 2: Verify it compiles
Run: cd backend && cargo check
Expected: compiles with no errors
- Step 3: Commit
git add backend/Cargo.toml
git commit -m "deps: add feed-rs crate for RSS/Atom feed parsing"
Task 2: Database migration — add RSS columns to sources
Files:
- Create: backend/migrations/20260403000031_add_source_rss_fields.sql
- Modify: backend/src/models/source.rs
- Modify: backend/src/db/sources.rs
- Step 1: Create the migration file
Create backend/migrations/20260403000031_add_source_rss_fields.sql:
ALTER TABLE sources ADD COLUMN rss_url TEXT;
ALTER TABLE sources ADD COLUMN rss_discovered_at TIMESTAMPTZ;
- Step 2: Add fields to the Source struct
In backend/src/models/source.rs, add two fields to the Source struct after is_preferred:
pub struct Source {
pub id: Uuid,
pub user_id: Uuid,
pub title: String,
pub url: String,
pub theme_id: Option<Uuid>,
pub is_preferred: bool,
pub rss_url: Option<String>,
pub rss_discovered_at: Option<DateTime<Utc>>,
pub created_at: DateTime<Utc>,
}
- Step 3: Update all SQL SELECT queries in db/sources.rs
Every SELECT in backend/src/db/sources.rs that uses query_as::<_, Source> needs the two new columns. Update the column lists from:
SELECT id, user_id, title, url, theme_id, is_preferred, created_at
to:
SELECT id, user_id, title, url, theme_id, is_preferred, rss_url, rss_discovered_at, created_at
This applies to:
- list_for_user (2 queries: with and without theme_id filter)
- create (the RETURNING clause)
- bulk_create (the RETURNING clause)
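Because the same column list now appears in four places, one way to keep the queries in sync is a shared constant. The following is a minimal sketch, assuming the queries are built as runtime strings (which query_as::<_, Source> permits). SOURCE_COLUMNS and select_sources_sql are hypothetical names, and the WHERE clauses are illustrative rather than copied from db/sources.rs.

```rust
/// Hypothetical shared column list for every query that maps to `Source`.
const SOURCE_COLUMNS: &str =
    "id, user_id, title, url, theme_id, is_preferred, rss_url, rss_discovered_at, created_at";

/// Build a SELECT for a user's sources, optionally filtered by theme.
/// The WHERE clauses are illustrative placeholders.
fn select_sources_sql(filter_by_theme: bool) -> String {
    let mut sql = format!("SELECT {} FROM sources WHERE user_id = $1", SOURCE_COLUMNS);
    if filter_by_theme {
        sql.push_str(" AND theme_id = $2");
    }
    sql
}
```

With this approach a future column only needs to be added once, at the cost of assembling the SQL text at runtime.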
- Step 4: Add update_source_rss function to db/sources.rs
Append to backend/src/db/sources.rs:
/// Update the cached RSS feed URL and discovery timestamp for a source.
///
/// Called during synthesis generation when a feed is discovered or re-verified.
/// Pass `rss_url = None` to clear a previously cached feed (e.g., feed no longer exists).
pub async fn update_source_rss(
pool: &PgPool,
source_id: Uuid,
rss_url: Option<&str>,
rss_discovered_at: Option<DateTime<Utc>>,
) -> Result<(), AppError> {
sqlx::query(
"UPDATE sources SET rss_url = $1, rss_discovered_at = $2 WHERE id = $3",
)
.bind(rss_url)
.bind(rss_discovered_at)
.bind(source_id)
.execute(pool)
.await?;
Ok(())
}
Add use chrono::{DateTime, Utc}; to the imports at the top of db/sources.rs.
- Step 5: Verify it compiles
Run: cd backend && cargo check
Expected: compiles (migration will run at startup)
- Step 6: Commit
git add backend/migrations/20260403000031_add_source_rss_fields.sql backend/src/models/source.rs backend/src/db/sources.rs
git commit -m "feat: add rss_url and rss_discovered_at columns to sources"
Task 3: Create feed_parser service — parse_feed function
Files:
- Create: backend/src/services/feed_parser.rs
- Modify: backend/src/services/mod.rs
- Step 1: Write parse_feed together with its tests
Create backend/src/services/feed_parser.rs with the types, the parse_feed implementation, and the test module:
//! RSS/Atom feed parser service.
//!
//! Discovers and parses RSS/Atom feeds from source URLs.
//! Used in Phase 1 of the generation pipeline to extract article links
//! sorted by publication date (newest first), before falling back
//! to the HTML-based source_scraper.
use chrono::{DateTime, Utc};
use url::Url;
use crate::errors::AppError;
/// A single entry extracted from an RSS/Atom feed.
#[derive(Debug, Clone)]
pub struct FeedEntry {
pub url: String,
pub title: String,
pub published_date: Option<DateTime<Utc>>,
}
/// Result of attempting to detect and parse a feed for a source.
pub enum FeedResult {
/// Feed found and parsed successfully.
Found {
feed_url: String,
entries: Vec<FeedEntry>,
},
/// No feed discovered or feed invalid.
NotFound,
}
/// Minimum number of feed entries to consider the feed useful.
/// Below this threshold, the pipeline falls back to HTML extraction.
pub const MIN_FEED_ENTRIES: usize = 3;
/// Number of days before a cached feed URL is re-verified.
pub const REDISCOVERY_DAYS: i64 = 30;
/// Parse an RSS/Atom feed URL and return entries sorted by date (newest first).
///
/// Uses the `feed-rs` crate which handles RSS 1.0, RSS 2.0, Atom, and JSON Feed.
/// Entries without a published date are placed last.
pub async fn parse_feed(
http_client: &reqwest::Client,
feed_url: &str,
max_links: usize,
) -> Result<Vec<FeedEntry>, AppError> {
let parsed_url = Url::parse(feed_url)
.map_err(|e| AppError::BadRequest(format!("Invalid feed URL: {}", e)))?;
if let Err(e) = crate::services::scraper::check_ssrf(&parsed_url).await {
tracing::warn!(url = feed_url, error = %e, "Feed URL failed SSRF check");
return Ok(Vec::new());
}
let response = http_client
.get(feed_url)
.send()
.await
.map_err(|e| {
tracing::warn!(url = feed_url, error = %e, "Failed to fetch feed");
AppError::Internal(anyhow::anyhow!("Failed to fetch feed"))
})?;
if !response.status().is_success() {
tracing::warn!(url = feed_url, status = %response.status(), "Feed returned non-success status");
return Ok(Vec::new());
}
let body = response.bytes().await.map_err(|e| {
AppError::Internal(anyhow::anyhow!("Failed to read feed body: {}", e))
})?;
let feed = feed_rs::parser::parse(&body[..]).map_err(|e| {
tracing::warn!(url = feed_url, error = %e, "Failed to parse feed");
AppError::Internal(anyhow::anyhow!("Failed to parse feed: {}", e))
})?;
let mut entries: Vec<FeedEntry> = feed
.entries
.into_iter()
.filter_map(|entry| {
// Get the article URL: prefer links, fall back to id if it looks like a URL
let url = entry
.links
.first()
.map(|l| l.href.clone())
.or_else(|| {
if entry.id.starts_with("http://") || entry.id.starts_with("https://") {
Some(entry.id.clone())
} else {
None
}
})?;
let title = entry
.title
.map(|t| t.content)
.unwrap_or_default();
let published_date = entry
.published
.or(entry.updated);
Some(FeedEntry {
url,
title,
published_date,
})
})
.collect();
// Sort by published_date descending (newest first), entries without dates last
entries.sort_by(|a, b| {
match (&a.published_date, &b.published_date) {
(Some(da), Some(db)) => db.cmp(da),
(Some(_), None) => std::cmp::Ordering::Less,
(None, Some(_)) => std::cmp::Ordering::Greater,
(None, None) => std::cmp::Ordering::Equal,
}
});
entries.truncate(max_links);
Ok(entries)
}
#[cfg(test)]
mod tests {
use super::*;
use wiremock::{Mock, MockServer, ResponseTemplate};
use wiremock::matchers::method;
#[tokio::test]
async fn parse_feed_rss2() {
let server = MockServer::start().await;
let rss_body = r#"<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>Test Blog</title>
<item>
<title>Article 1</title>
<link>https://example.com/article-1</link>
<pubDate>Fri, 03 Apr 2026 10:00:00 GMT</pubDate>
</item>
<item>
<title>Article 2</title>
<link>https://example.com/article-2</link>
<pubDate>Thu, 02 Apr 2026 10:00:00 GMT</pubDate>
</item>
<item>
<title>Article 3</title>
<link>https://example.com/article-3</link>
<pubDate>Wed, 01 Apr 2026 10:00:00 GMT</pubDate>
</item>
</channel>
</rss>"#;
Mock::given(method("GET"))
.respond_with(ResponseTemplate::new(200).set_body_raw(rss_body, "application/rss+xml"))
.mount(&server)
.await;
let client = reqwest::Client::new();
let entries = parse_feed(&client, &server.uri(), 10).await.unwrap();
assert_eq!(entries.len(), 3);
assert_eq!(entries[0].title, "Article 1");
assert_eq!(entries[0].url, "https://example.com/article-1");
assert!(entries[0].published_date > entries[1].published_date);
assert!(entries[1].published_date > entries[2].published_date);
}
#[tokio::test]
async fn parse_feed_atom() {
let server = MockServer::start().await;
let atom_body = r#"<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>Test Feed</title>
<entry>
<title>Atom Article</title>
<link href="https://example.com/atom-1"/>
<updated>2026-04-03T12:00:00Z</updated>
</entry>
</feed>"#;
Mock::given(method("GET"))
.respond_with(ResponseTemplate::new(200).set_body_raw(atom_body, "application/atom+xml"))
.mount(&server)
.await;
let client = reqwest::Client::new();
let entries = parse_feed(&client, &server.uri(), 10).await.unwrap();
assert_eq!(entries.len(), 1);
assert_eq!(entries[0].title, "Atom Article");
assert_eq!(entries[0].url, "https://example.com/atom-1");
assert!(entries[0].published_date.is_some());
}
#[tokio::test]
async fn parse_feed_respects_max_links() {
let server = MockServer::start().await;
let rss_body = r#"<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>Test</title>
<item><title>A1</title><link>https://example.com/1</link><pubDate>Fri, 03 Apr 2026 10:00:00 GMT</pubDate></item>
<item><title>A2</title><link>https://example.com/2</link><pubDate>Thu, 02 Apr 2026 10:00:00 GMT</pubDate></item>
<item><title>A3</title><link>https://example.com/3</link><pubDate>Wed, 01 Apr 2026 10:00:00 GMT</pubDate></item>
</channel>
</rss>"#;
Mock::given(method("GET"))
.respond_with(ResponseTemplate::new(200).set_body_raw(rss_body, "application/rss+xml"))
.mount(&server)
.await;
let client = reqwest::Client::new();
let entries = parse_feed(&client, &server.uri(), 2).await.unwrap();
assert_eq!(entries.len(), 2);
assert_eq!(entries[0].url, "https://example.com/1"); // newest first
}
#[tokio::test]
async fn parse_feed_entries_without_dates_come_last() {
let server = MockServer::start().await;
let rss_body = r#"<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>Test</title>
<item><title>No date</title><link>https://example.com/no-date</link></item>
<item><title>Has date</title><link>https://example.com/has-date</link><pubDate>Fri, 03 Apr 2026 10:00:00 GMT</pubDate></item>
</channel>
</rss>"#;
Mock::given(method("GET"))
.respond_with(ResponseTemplate::new(200).set_body_raw(rss_body, "application/rss+xml"))
.mount(&server)
.await;
let client = reqwest::Client::new();
let entries = parse_feed(&client, &server.uri(), 10).await.unwrap();
assert_eq!(entries.len(), 2);
assert_eq!(entries[0].url, "https://example.com/has-date");
assert_eq!(entries[1].url, "https://example.com/no-date");
}
#[tokio::test]
async fn parse_feed_404_returns_empty() {
let server = MockServer::start().await;
Mock::given(method("GET"))
.respond_with(ResponseTemplate::new(404))
.mount(&server)
.await;
let client = reqwest::Client::new();
let entries = parse_feed(&client, &server.uri(), 10).await.unwrap();
assert!(entries.is_empty());
}
#[tokio::test]
async fn parse_feed_invalid_xml_returns_error() {
let server = MockServer::start().await;
Mock::given(method("GET"))
.respond_with(ResponseTemplate::new(200).set_body_string("not xml at all"))
.mount(&server)
.await;
let client = reqwest::Client::new();
let result = parse_feed(&client, &server.uri(), 10).await;
assert!(result.is_err());
}
}
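The ordering that parse_feed aims for (newest first, undated entries last) can also be expressed with std::cmp::Reverse, because Option orders None before Some. Below is a minimal std-only sketch, assuming Unix-second timestamps stand in for DateTime<Utc>. Entry and sort_newest_first are illustrative names, not part of the plan.

```rust
use std::cmp::Reverse;

/// Simplified stand-in for `FeedEntry`: the date is a Unix timestamp in seconds.
#[derive(Debug, Clone, PartialEq)]
struct Entry {
    url: String,
    published: Option<i64>,
}

/// Sort newest first; entries without a date end up last.
///
/// `Option` orders `None < Some(_)`, so reversing the key puts the largest
/// `Some` first and every `None` at the end. The sort is stable, so undated
/// entries keep their original feed order.
fn sort_newest_first(entries: &mut [Entry]) {
    entries.sort_by_key(|e| Reverse(e.published));
}
```

The explicit match in parse_feed and this key-based form should produce the same order; the key-based form just leaves less room for swapping the two mixed cases.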
- Step 2: Register the module in services/mod.rs
In backend/src/services/mod.rs, add after the export line:
pub mod feed_parser;
- Step 3: Run tests to verify they pass
Run: cd backend && cargo test --lib feed_parser -- --nocapture
Expected: all 6 tests pass
- Step 4: Commit
git add backend/src/services/feed_parser.rs backend/src/services/mod.rs
git commit -m "feat: add feed_parser service with parse_feed function and tests"
Task 4: Add discover_feed function
Files:
- Modify: backend/src/services/feed_parser.rs
- Step 1: Write failing tests for discover_feed
Add these tests to the mod tests block in backend/src/services/feed_parser.rs:
#[tokio::test]
async fn discover_feed_from_link_rss() {
let server = MockServer::start().await;
let html = format!(
r#"<html><head>
<link rel="alternate" type="application/rss+xml" href="{}/feed.xml">
</head><body></body></html>"#,
server.uri()
);
Mock::given(method("GET"))
.respond_with(ResponseTemplate::new(200).set_body_raw(html, "text/html"))
.mount(&server)
.await;
let client = reqwest::Client::new();
let result = discover_feed(&client, &server.uri()).await;
assert!(result.is_some());
assert!(result.unwrap().contains("/feed.xml"));
}
#[tokio::test]
async fn discover_feed_from_link_atom() {
let server = MockServer::start().await;
let html = format!(
r#"<html><head>
<link rel="alternate" type="application/atom+xml" href="{}/atom.xml">
</head><body></body></html>"#,
server.uri()
);
Mock::given(method("GET"))
.respond_with(ResponseTemplate::new(200).set_body_raw(html, "text/html"))
.mount(&server)
.await;
let client = reqwest::Client::new();
let result = discover_feed(&client, &server.uri()).await;
assert!(result.is_some());
assert!(result.unwrap().contains("/atom.xml"));
}
#[tokio::test]
async fn discover_feed_direct_rss_url() {
let server = MockServer::start().await;
let rss_body = r#"<?xml version="1.0"?><rss version="2.0"><channel><title>T</title></channel></rss>"#;
Mock::given(method("GET"))
.respond_with(
ResponseTemplate::new(200)
.set_body_raw(rss_body, "application/rss+xml")
)
.mount(&server)
.await;
let client = reqwest::Client::new();
let result = discover_feed(&client, &server.uri()).await;
assert!(result.is_some());
assert_eq!(result.unwrap(), server.uri());
}
#[tokio::test]
async fn discover_feed_no_feed_found() {
let server = MockServer::start().await;
let html = "<html><head><title>No feed</title></head><body></body></html>";
Mock::given(method("GET"))
.respond_with(ResponseTemplate::new(200).set_body_raw(html, "text/html"))
.mount(&server)
.await;
let client = reqwest::Client::new();
let result = discover_feed(&client, &server.uri()).await;
assert!(result.is_none());
}
#[tokio::test]
async fn discover_feed_resolves_relative_href() {
let server = MockServer::start().await;
let html = r#"<html><head>
<link rel="alternate" type="application/rss+xml" href="/feed.xml">
</head><body></body></html>"#;
Mock::given(method("GET"))
.respond_with(ResponseTemplate::new(200).set_body_raw(html, "text/html"))
.mount(&server)
.await;
let client = reqwest::Client::new();
let result = discover_feed(&client, &server.uri()).await;
assert!(result.is_some());
let feed_url = result.unwrap();
assert!(feed_url.starts_with(&server.uri()));
assert!(feed_url.ends_with("/feed.xml"));
}
- Step 2: Run tests to verify they fail
Run: cd backend && cargo test --lib feed_parser::tests::discover_feed -- --nocapture
Expected: compilation error — discover_feed not defined
- Step 3: Implement discover_feed
Add this function to backend/src/services/feed_parser.rs, before the #[cfg(test)] block:
/// RSS/Atom content types that indicate a direct feed URL.
const FEED_CONTENT_TYPES: &[&str] = &[
"application/rss+xml",
"application/atom+xml",
"application/xml",
"text/xml",
];
/// Discover an RSS/Atom feed URL from a source URL.
///
/// Two detection strategies:
/// 1. If the URL itself returns an RSS/Atom Content-Type, it is a feed directly.
/// 2. If the URL returns HTML, look for `<link rel="alternate" type="application/rss+xml">`
/// or `type="application/atom+xml"` in the `<head>`.
///
/// Returns `Some(feed_url)` if a feed is found, `None` otherwise.
pub async fn discover_feed(
http_client: &reqwest::Client,
source_url: &str,
) -> Option<String> {
let parsed_url = Url::parse(source_url).ok()?;
if let Err(e) = crate::services::scraper::check_ssrf(&parsed_url).await {
tracing::warn!(url = source_url, error = %e, "Source URL failed SSRF check during feed discovery");
return None;
}
let response = http_client
.get(source_url)
.send()
.await
.ok()?;
if !response.status().is_success() {
return None;
}
// Check Content-Type for direct feed
let content_type = response
.headers()
.get(reqwest::header::CONTENT_TYPE)
.and_then(|v| v.to_str().ok())
.unwrap_or("")
.to_lowercase();
if FEED_CONTENT_TYPES.iter().any(|ct| content_type.contains(ct)) {
return Some(source_url.to_string());
}
// If HTML, look for <link rel="alternate"> with feed type
if !content_type.contains("text/html") {
return None;
}
let body = response.text().await.ok()?;
let document = scraper::Html::parse_document(&body);
let selector = scraper::Selector::parse(r#"link[rel="alternate"]"#).ok()?;
for element in document.select(&selector) {
let link_type = element.value().attr("type").unwrap_or("");
if link_type == "application/rss+xml" || link_type == "application/atom+xml" {
if let Some(href) = element.value().attr("href") {
// Resolve relative URLs against the source URL
let resolved = parsed_url.join(href).ok()?;
return Some(resolved.to_string());
}
}
}
None
}
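Strategy 1 relies on a substring match against the Content-Type header. A slightly stricter variant compares only the media type and ignores parameters such as charset. The sketch below is std-only, and is_feed_content_type is a hypothetical helper rather than part of the plan's code.

```rust
/// Media types treated as a direct feed (same values as FEED_CONTENT_TYPES above).
const FEED_CONTENT_TYPES: &[&str] = &[
    "application/rss+xml",
    "application/atom+xml",
    "application/xml",
    "text/xml",
];

/// Return true when a Content-Type header value names a feed media type.
///
/// Parameters after `;` (for example `charset=utf-8`) are ignored and the
/// comparison is case-insensitive.
fn is_feed_content_type(header_value: &str) -> bool {
    let media_type = header_value
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();
    FEED_CONTENT_TYPES.iter().any(|ct| *ct == media_type)
}
```

An exact comparison avoids matching unrelated types that merely contain one of the listed strings, although generic XML types can still describe documents that are not feeds.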
- Step 4: Run tests to verify they pass
Run: cd backend && cargo test --lib feed_parser -- --nocapture
Expected: all 11 tests pass (6 from Task 3 + 5 new)
- Step 5: Commit
git add backend/src/services/feed_parser.rs
git commit -m "feat: add discover_feed function for RSS/Atom auto-discovery"
Task 5: Add detect_and_parse_feed orchestration function
Files:
- Modify: backend/src/services/feed_parser.rs
- Step 1: Write failing tests for detect_and_parse_feed
Add these tests to the mod tests block in backend/src/services/feed_parser.rs:
#[tokio::test]
async fn detect_and_parse_cached_fresh_feed() {
let server = MockServer::start().await;
let rss_body = r#"<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"><channel><title>T</title>
<item><title>A1</title><link>https://example.com/1</link><pubDate>Fri, 03 Apr 2026 10:00:00 GMT</pubDate></item>
<item><title>A2</title><link>https://example.com/2</link><pubDate>Thu, 02 Apr 2026 10:00:00 GMT</pubDate></item>
<item><title>A3</title><link>https://example.com/3</link><pubDate>Wed, 01 Apr 2026 10:00:00 GMT</pubDate></item>
</channel></rss>"#;
Mock::given(method("GET"))
.respond_with(ResponseTemplate::new(200).set_body_raw(rss_body, "application/rss+xml"))
.mount(&server)
.await;
let client = reqwest::Client::new();
let result = detect_and_parse_feed(
&client,
"https://example.com",
Some(&server.uri()),
Some(Utc::now()), // fresh
10,
).await;
match result {
FeedResult::Found { entries, .. } => assert_eq!(entries.len(), 3),
FeedResult::NotFound => panic!("Expected Found"),
}
}
#[tokio::test]
async fn detect_and_parse_no_cache_discovers_feed() {
let server = MockServer::start().await;
// First request: HTML page with feed link
let feed_path = format!("{}/feed.xml", server.uri());
let html = format!(
r#"<html><head>
<link rel="alternate" type="application/rss+xml" href="{}">
</head><body></body></html>"#,
feed_path
);
let rss_body = r#"<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"><channel><title>T</title>
<item><title>A1</title><link>https://example.com/1</link><pubDate>Fri, 03 Apr 2026 10:00:00 GMT</pubDate></item>
<item><title>A2</title><link>https://example.com/2</link><pubDate>Thu, 02 Apr 2026 10:00:00 GMT</pubDate></item>
<item><title>A3</title><link>https://example.com/3</link><pubDate>Wed, 01 Apr 2026 10:00:00 GMT</pubDate></item>
</channel></rss>"#;
// Mock: source page returns HTML
Mock::given(method("GET"))
.and(wiremock::matchers::path("/"))
.respond_with(ResponseTemplate::new(200).set_body_raw(html, "text/html"))
.mount(&server)
.await;
// Mock: feed URL returns RSS
Mock::given(method("GET"))
.and(wiremock::matchers::path("/feed.xml"))
.respond_with(ResponseTemplate::new(200).set_body_raw(rss_body, "application/rss+xml"))
.mount(&server)
.await;
let client = reqwest::Client::new();
let result = detect_and_parse_feed(
&client,
&server.uri(),
None, // no cache
None,
10,
).await;
match result {
FeedResult::Found { feed_url, entries } => {
assert!(feed_url.contains("/feed.xml"));
assert_eq!(entries.len(), 3);
}
FeedResult::NotFound => panic!("Expected Found"),
}
}
#[tokio::test]
async fn detect_and_parse_no_feed_returns_not_found() {
let server = MockServer::start().await;
let html = "<html><head><title>No feed</title></head><body></body></html>";
Mock::given(method("GET"))
.respond_with(ResponseTemplate::new(200).set_body_raw(html, "text/html"))
.mount(&server)
.await;
let client = reqwest::Client::new();
let result = detect_and_parse_feed(
&client,
&server.uri(),
None,
None,
10,
).await;
assert!(matches!(result, FeedResult::NotFound));
}
#[tokio::test]
async fn detect_and_parse_stale_cache_rediscovers() {
let server = MockServer::start().await;
let feed_path = format!("{}/feed.xml", server.uri());
let html = format!(
r#"<html><head>
<link rel="alternate" type="application/rss+xml" href="{}">
</head><body></body></html>"#,
feed_path
);
let rss_body = r#"<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"><channel><title>T</title>
<item><title>A1</title><link>https://example.com/1</link><pubDate>Fri, 03 Apr 2026 10:00:00 GMT</pubDate></item>
<item><title>A2</title><link>https://example.com/2</link><pubDate>Thu, 02 Apr 2026 10:00:00 GMT</pubDate></item>
<item><title>A3</title><link>https://example.com/3</link><pubDate>Wed, 01 Apr 2026 10:00:00 GMT</pubDate></item>
</channel></rss>"#;
Mock::given(method("GET"))
.and(wiremock::matchers::path("/"))
.respond_with(ResponseTemplate::new(200).set_body_raw(html, "text/html"))
.mount(&server)
.await;
Mock::given(method("GET"))
.and(wiremock::matchers::path("/feed.xml"))
.respond_with(ResponseTemplate::new(200).set_body_raw(rss_body, "application/rss+xml"))
.mount(&server)
.await;
let client = reqwest::Client::new();
let stale_date = Utc::now() - chrono::Duration::days(31);
let result = detect_and_parse_feed(
&client,
&server.uri(),
Some("https://old-feed.example.com/rss"), // stale cached URL
Some(stale_date),
10,
).await;
match result {
FeedResult::Found { feed_url, entries } => {
assert!(feed_url.contains("/feed.xml"), "Should discover new feed URL");
assert_eq!(entries.len(), 3);
}
FeedResult::NotFound => panic!("Expected Found after re-discovery"),
}
}
- Step 2: Run tests to verify they fail
Run: cd backend && cargo test --lib feed_parser::tests::detect_and_parse -- --nocapture
Expected: compilation error — detect_and_parse_feed not defined
- Step 3: Implement detect_and_parse_feed
Add this function to backend/src/services/feed_parser.rs, before the #[cfg(test)] block:
/// Detect and parse an RSS/Atom feed for a source URL.
///
/// Orchestrates the discovery and parsing logic:
/// - If `rss_url` is cached and fresh (< 30 days), parse it directly.
/// - If `rss_url` is cached but stale (>= 30 days), re-discover from `source_url`.
/// - If no `rss_url` cached, attempt discovery from `source_url`.
///
/// Returns `FeedResult::Found` with the feed URL and sorted entries,
/// or `FeedResult::NotFound` if no feed could be found/parsed.
pub async fn detect_and_parse_feed(
http_client: &reqwest::Client,
source_url: &str,
rss_url: Option<&str>,
rss_discovered_at: Option<DateTime<Utc>>,
max_links: usize,
) -> FeedResult {
// Case 1: Cached and fresh — use directly
if let Some(cached_url) = rss_url {
let is_fresh = rss_discovered_at
.map(|d| Utc::now().signed_duration_since(d).num_days() < REDISCOVERY_DAYS)
.unwrap_or(false);
if is_fresh {
match parse_feed(http_client, cached_url, max_links).await {
Ok(entries) if !entries.is_empty() => {
return FeedResult::Found {
feed_url: cached_url.to_string(),
entries,
};
}
_ => {
tracing::warn!(url = cached_url, "Cached feed failed to parse, attempting re-discovery");
}
}
}
}
// Case 2: No cache or stale — discover
let discovered = discover_feed(http_client, source_url).await;
if let Some(feed_url) = discovered {
match parse_feed(http_client, &feed_url, max_links).await {
Ok(entries) if !entries.is_empty() => {
return FeedResult::Found {
feed_url,
entries,
};
}
Ok(_) => {
tracing::info!(url = feed_url, "Discovered feed is empty");
}
Err(e) => {
tracing::warn!(url = feed_url, error = %e, "Discovered feed failed to parse");
}
}
}
FeedResult::NotFound
}
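The caching rules in the doc comment reduce to a three-way decision. Here is a minimal std-only sketch, assuming the age of the cached URL is already available in whole days. CacheState and cache_state are hypothetical names used only for illustration.

```rust
/// Days before a cached feed URL is re-verified (mirrors REDISCOVERY_DAYS).
const REDISCOVERY_DAYS: i64 = 30;

/// What to do with the cached feed URL for a source.
#[derive(Debug, PartialEq)]
enum CacheState {
    /// Cached URL is younger than the threshold: parse it directly.
    Fresh,
    /// Cached URL exists but is too old, or has no timestamp: re-discover.
    Stale,
    /// Nothing cached: discover from the source URL.
    Missing,
}

/// Classify the cache from the stored URL and its age in days.
fn cache_state(cached_url: Option<&str>, age_days: Option<i64>) -> CacheState {
    match (cached_url, age_days) {
        (None, _) => CacheState::Missing,
        (Some(_), Some(age)) if age < REDISCOVERY_DAYS => CacheState::Fresh,
        (Some(_), _) => CacheState::Stale,
    }
}
```

As in detect_and_parse_feed, a cached URL without a discovery timestamp is treated as stale, and both Stale and Missing lead to the same discovery path.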
- Step 4: Run tests to verify they pass
Run: cd backend && cargo test --lib feed_parser -- --nocapture
Expected: all 15 tests pass
- Step 5: Commit
git add backend/src/services/feed_parser.rs
git commit -m "feat: add detect_and_parse_feed orchestration function"
Task 6: Integrate feed_parser into the Phase 1 pipeline
Files:
- Modify: backend/src/services/synthesis/mod.rs
- Step 1: Add feed_parser import
In backend/src/services/synthesis/mod.rs, add after line 29 (use crate::services::source_scraper;):
use crate::services::feed_parser;
- Step 2: Replace the link extraction in Phase 1 wave processing
In backend/src/services/synthesis/mod.rs, locate the Phase 1 link extraction block (around line 193-224). This is inside the 'wave_loop where join_set spawns tasks calling source_scraper::extract_article_links.
Replace the entire block from let mut wave_urls: Vec<(String, String)> = Vec::new(); through the closing } of while let Some(join_result) = join_set.join_next().await { ... } (lines 193-224) with:
let mut wave_urls: Vec<(String, String)> = Vec::new();
let mut rss_updates: Vec<(Uuid, Option<String>, Option<DateTime<Utc>>)> = Vec::new();
{
let mut join_set = tokio::task::JoinSet::new();
for source in wave_sources {
let client = state.http_client.clone();
let source_id = source.id;
let source_url = source.url.clone();
let source_title = source.title.clone();
let rss_url = source.rss_url.clone();
let rss_discovered_at = source.rss_discovered_at;
let max_l = max_links;
join_set.spawn(async move {
// Try RSS feed first
let feed_result = feed_parser::detect_and_parse_feed(
&client,
&source_url,
rss_url.as_deref(),
rss_discovered_at,
max_l,
).await;
match feed_result {
feed_parser::FeedResult::Found { feed_url, entries }
if entries.len() >= feed_parser::MIN_FEED_ENTRIES =>
{
let links: Vec<String> = entries.into_iter().map(|e| e.url).collect();
tracing::info!(
source = %source_title,
feed = %feed_url,
links = links.len(),
"Extracted links from RSS feed"
);
// Signal RSS URL update if it changed
let rss_changed = rss_url.as_deref() != Some(feed_url.as_str());
let rss_stale = rss_discovered_at
.map(|d| Utc::now().signed_duration_since(d).num_days() >= feed_parser::REDISCOVERY_DAYS)
.unwrap_or(true);
let update = if rss_changed || rss_stale {
Some((source_id, Some(feed_url), Some(Utc::now())))
} else {
None
};
(source_url, source_title, Ok(links), update)
}
_ => {
// Fallback to HTML extraction
let links = source_scraper::extract_article_links(&client, &source_url, max_l).await;
// If we had a cached RSS URL but feed failed, clear it
let update = if rss_url.is_some() {
Some((source_id, None, None))
} else {
None
};
(source_url, source_title, links, update)
}
}
});
}
while let Some(join_result) = join_set.join_next().await {
if let Ok((source_url, source_title, links_result, rss_update)) = join_result {
if let Some(update) = rss_update {
rss_updates.push(update);
}
match links_result {
Ok(links) => {
tracing::info!(source = %source_title, links = links.len(), "Extracted links from source");
for link in links {
if seen_urls.insert(link.to_lowercase()) {
wave_urls.push((link, source_url.clone()));
}
}
}
Err(e) => {
tracing::warn!(source = %source_title, error = %e, "Failed to extract links");
}
}
}
}
}
// Persist RSS URL updates (best effort: errors are ignored)
for (source_id, new_rss_url, new_discovered_at) in rss_updates {
db::sources::update_source_rss(
&state.pool,
source_id,
new_rss_url.as_deref(),
new_discovered_at,
).await.ok();
}
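The cache bookkeeping in the spawned task comes down to three outcomes: store a new URL and timestamp, clear the cached URL, or leave the row alone. Below is a std-only sketch of that decision. RssUpdate and plan_rss_update are hypothetical names, and feed_used stands for the feed URL that passed the MIN_FEED_ENTRIES check (None when the pipeline fell back to HTML extraction).

```rust
/// Outcome of the cache bookkeeping for one source.
#[derive(Debug, PartialEq)]
enum RssUpdate {
    /// Persist this feed URL with a fresh discovery timestamp.
    Set(String),
    /// Remove the cached feed URL and timestamp.
    Clear,
    /// Leave the row untouched.
    Keep,
}

/// Decide what to write back after processing a source.
fn plan_rss_update(
    cached_url: Option<&str>,
    cache_is_stale: bool,
    feed_used: Option<&str>,
) -> RssUpdate {
    match feed_used {
        // Feed worked: write only if the URL changed or the timestamp expired.
        Some(url) if cached_url != Some(url) || cache_is_stale => RssUpdate::Set(url.to_string()),
        Some(_) => RssUpdate::Keep,
        // Fell back to HTML: drop a cached URL that no longer helps.
        None if cached_url.is_some() => RssUpdate::Clear,
        None => RssUpdate::Keep,
    }
}
```

Note that under this rule a feed with fewer than MIN_FEED_ENTRIES entries also clears the cache, so the next run repeats discovery for that source.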
- Step 3: Verify it compiles
Run: cd backend && cargo check
Expected: compiles with no errors
- Step 4: Run existing unit tests
Run: cd backend && cargo test --lib
Expected: all tests pass (no regressions)
- Step 5: Commit
git add backend/src/services/synthesis/mod.rs
git commit -m "feat: integrate feed_parser into Phase 1 pipeline with HTML fallback"
Task 7: Add integration test for RSS feed in pipeline
Files:
- Modify: the existing integration test structure (if a synthesis integration test exists), OR create a focused unit test
- Step 1: Write a test that verifies RSS-first behavior end-to-end
Add this test to the mod tests block at the end of backend/src/services/feed_parser.rs:
#[tokio::test]
async fn full_flow_rss_first_with_html_fallback() {
// Source 1: has an RSS feed with 5 articles
let server1 = MockServer::start().await;
let rss_body = r#"<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"><channel><title>Blog</title>
<item><title>A1</title><link>https://blog.example.com/1</link><pubDate>Fri, 03 Apr 2026 10:00:00 GMT</pubDate></item>
<item><title>A2</title><link>https://blog.example.com/2</link><pubDate>Thu, 02 Apr 2026 10:00:00 GMT</pubDate></item>
<item><title>A3</title><link>https://blog.example.com/3</link><pubDate>Wed, 01 Apr 2026 10:00:00 GMT</pubDate></item>
<item><title>A4</title><link>https://blog.example.com/4</link><pubDate>Tue, 31 Mar 2026 10:00:00 GMT</pubDate></item>
<item><title>A5</title><link>https://blog.example.com/5</link><pubDate>Mon, 30 Mar 2026 10:00:00 GMT</pubDate></item>
</channel></rss>"#;
Mock::given(method("GET"))
.respond_with(ResponseTemplate::new(200).set_body_raw(rss_body, "application/rss+xml"))
.mount(&server1)
.await;
let client = reqwest::Client::new();
// With cached RSS URL (fresh) — should use RSS directly
let result = detect_and_parse_feed(
&client,
"https://blog.example.com",
Some(&server1.uri()),
Some(Utc::now()),
10,
).await;
match result {
FeedResult::Found { entries, .. } => {
assert_eq!(entries.len(), 5);
// Verify sorted newest first
for i in 0..entries.len() - 1 {
if let (Some(a), Some(b)) = (&entries[i].published_date, &entries[i + 1].published_date) {
assert!(a >= b, "Entries should be sorted newest first");
}
}
}
FeedResult::NotFound => panic!("Expected Found"),
}
// Source 2: no RSS feed, only HTML — should return NotFound
let server2 = MockServer::start().await;
let html = r#"<html><head><title>No feed</title></head><body>
<a href="/article-1">Article 1</a>
</body></html>"#;
Mock::given(method("GET"))
.respond_with(ResponseTemplate::new(200).set_body_raw(html, "text/html"))
.mount(&server2)
.await;
let result = detect_and_parse_feed(
&client,
&server2.uri(),
None,
None,
10,
).await;
// No feed found — pipeline would fall back to source_scraper
assert!(matches!(result, FeedResult::NotFound));
}
- Step 2: Run all feed_parser tests
Run: cd backend && cargo test --lib feed_parser -- --nocapture
Expected: all 16 tests pass
- Step 3: Run full unit test suite
Run: cd backend && cargo test --lib
Expected: all tests pass
- Step 4: Commit
git add backend/src/services/feed_parser.rs
git commit -m "test: add end-to-end RSS flow test for feed_parser"