You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

25 KiB

Raw Blame History

Rust Expert Audit Report v2 -- AI Weekly Synth Backend

Date: 2026-03-27 Scope: Full audit of backend Rust code focusing on idiomatic patterns, async correctness, type safety, memory/performance, safety, and newly added modules (scheduler, themes, schedules, preferred sources pipeline).

Executive Summary

The codebase demonstrates strong Rust fundamentals for a learning project. Error handling is consistent, unwrap() is absent from production code, SQL injection is prevented via parameterized queries, and the async architecture is sound. The main areas for improvement are: (1) the synthesis.rs monolith function is ~850 lines with heavy code duplication between Phase 1 and Phase 2; (2) the type system underuses newtypes and enums where stringly-typed data could be statically checked; (3) several concurrency patterns have subtle correctness or performance gaps; (4) the scheduler has a safety regression with two bare unwrap_or_default() calls on serde deserialization.

Severity Legend: [CRITICAL] = correctness/data bug, [HIGH] = production risk, [MEDIUM] = code quality / maintainability, [LOW] = polish / best practice.

1. Idiomatic Rust

1.1 Ownership and Borrowing

Positive: The code avoids gratuitous clone() in most places. Arc<dyn LlmProvider> is correctly shared across spawned tasks. The ArticleTrace<'a> struct borrows data from the pipeline scope, avoiding allocation until build_trace_entry converts to owned strings for DB insertion.

[MEDIUM] Redundant clone in category initialization (synthesis.rs:272-276)

let user_categories = if theme_categories.is_empty() {
    Vec::new()
} else {
    theme_categories.clone()
};

This clone is unnecessary since theme_categories is already a Vec<String> owned by this scope. The if branch constructing an empty vec vs cloning is equivalent to just using theme_categories directly (it is already a potentially empty vec). The code could simply be let user_categories = theme_categories;.

[MEDIUM] Excessive .clone() on source iteration (synthesis.rs:320-322)

let preferred: Vec<_> = rotated_sources.iter().filter(|s| s.is_preferred).cloned().collect();
let non_preferred: Vec<_> = rotated_sources.iter().filter(|s| !s.is_preferred).cloned().collect();

This makes two full passes and clones every Source. A single partition() or sort_by_key() would halve the clones and passes:

let mut ordered_sources = rotated_sources;
ordered_sources.sort_by_key(|s| !s.is_preferred); // true sorts before false (reversed)

1.2 Error Handling

Positive: AppError is a well-designed thiserror enum with IntoResponse for Axum. The From<sqlx::Error> impl correctly hides DB details. from conversion on anyhow::Error provides the catch-all. All LLM providers map errors through map_provider_http_error() for consistency.

[HIGH] Silent .ok() on DB writes with no fallback (synthesis.rs, multiple locations)

Throughout the pipeline, db::article_history::batch_insert_entries(...).await.ok() and db::llm_call_log::insert(...).await.ok() silently discard DB errors. While the comment says "Don't fail synthesis if logging fails", there is no logging when these calls fail. If the DB goes down mid-pipeline, the user's synthesis completes but all traceability data is lost without any indication. At minimum, these should use .map_err(|e| tracing::warn!(...)) or .unwrap_or_else(|e| ...) to log the failure.

[MEDIUM] Validation returns String instead of structured errors

CreateThemeRequest::validate(), UpdateSettingsRequest::validate(), and UpsertScheduleRequest::validate() all return Result<(), String>. Idiomatic Rust would use a dedicated ValidationError type (or at least AppError::Validation) to enable programmatic inspection and i18n. The current pattern requires callers to wrap the string:

req.validate().map_err(AppError::Validation)?;

This works but is fragile. Multiple validation failures cannot be reported at once.

1.3 Iterators and Functional Style

Positive: Good use of filter_map, and_then, chaining in brave_search.rs, source_scraper.rs, and the LLM response parsers. The sanitize_json_null_bytes recursive function is clean.

[LOW] Manual loop where Iterator::find suffices (synthesis.rs:1229-1232)

if !classification_categories.iter().any(|c| c.to_lowercase() == llm_category.to_lowercase()) {
    llm_category = "Divers".to_string();
}

The repeated to_lowercase() on classification_categories for every article is O(n*m) string allocations. Pre-lowering once into a HashSet<String> at pipeline start would reduce this to O(1) lookup per article.

1.4 Trait Usage

Positive: The LlmProvider trait is well-designed -- minimal surface area, Send + Sync bounds on the trait, async_trait for object safety. The factory pattern in factory.rs is clean.

[LOW] LlmProvider trait could benefit from a name() method alias

provider_id() returns &str, which is good. However, the API key is stored as a String field directly in each provider struct. If a provider were ever Cloned, the API key would be duplicated in memory. Consider wrapping it in a zeroize::Zeroizing<String> (the zeroize crate is already a dependency).

2. Async Patterns

2.1 JoinSet Usage

Positive: The pipeline uses JoinSet correctly for parallel scraping and classification. Tasks are spawned, then drained with join_next().await. This is the right pattern for bounded concurrency.

[HIGH] No concurrency limit on JoinSet spawns

In Phase 1, all URLs in a wave's batch are scraped in parallel via JoinSet (synthesis.rs:472-482). If max_links_per_source is 30 and source_extraction_window is 10, that could be 300 simultaneous HTTP requests in a single JoinSet batch. Similarly, the link extraction phase (line 350-378) spawns all sources in a wave concurrently. There is no semaphore or concurrency cap.

Impact: Under load with many sources, this could exhaust file descriptors, trigger target-site WAFs, or cause reqwest connection pool contention.

Recommendation: Add a tokio::sync::Semaphore with a configurable max (e.g., 20 concurrent scrapes) to JoinSet tasks.

[MEDIUM] JoinSet task panic handling

If a task inside scrape_set.spawn(...) panics, join_next().await returns Err(JoinError). The code handles this with if let Ok(...), silently dropping panicked tasks. This is correct behavior (no crash propagation), but panics should at minimum be logged:

match join_result {
    Ok(tuple) => { /* process */ }
    Err(e) => tracing::error!("Task panicked: {}", e),
}

2.2 Cancellation

Positive: The cancellation mechanism via Arc<AtomicBool> is well-designed. The pipeline checks cancelled.load(Ordering::Relaxed) at wave boundaries and batch boundaries. On cancellation, it saves collected articles rather than discarding them.

[MEDIUM] Cancellation during LLM calls is not preemptive

Once an LLM call is in progress (provider.call_llm(...) inside a JoinSet task), cancellation cannot interrupt it. The user must wait for the LLM API response (up to 120s per call) before the cancellation flag is checked. This is a limitation of the current design, not a bug, but worth documenting. A tokio::select! with a cancellation token could abort waiting.

[LOW] Ordering::Relaxed for cancellation flag

cancelled.store(true, Ordering::Relaxed) and cancelled.load(Ordering::Relaxed) are used. Relaxed is correct here because:

There is only one writer (the cancel endpoint) and one reader (the pipeline).
There are no dependent memory operations that need ordering guarantees.
The flag is checked at batch boundaries, so a few microseconds of delay in visibility is irrelevant.

This is fine. Relaxed is the right choice.

2.3 Spawned Tasks and Lifetimes

[MEDIUM] Fire-and-forget cleanup spawn (synthesis.rs:239-242)

tokio::spawn(async move {
    tokio::time::sleep(Duration::from_secs(300)).await;
    store.remove(&jid);
});

This spawned task lives for 5 minutes after every generation completes. If many generations finish, these accumulate. The task is untracked -- if the server shuts down, these are dropped mid-sleep (which is fine, but the job lingers in the store until process exit). Consider using cleanup_expired() on a timer instead of per-job spawns.

[MEDIUM] Double-spawn panic handler pattern (generation.rs:87-109)

The handler spawns a task, then spawns another task to monitor the first for panics. This works but is unusual. The inner spawn handles timeout, and the outer spawn catches panics. This could be simplified with tokio::spawn(...).catch_unwind() or by wrapping AssertUnwindSafe around the inner future. The current pattern is correct but adds an extra long-lived task per generation.

2.4 Scheduler

[HIGH] Scheduler runs sequentially (scheduler.rs:43)

for schedule in due {
    // ...runs generation synchronously...
}

If multiple users have schedules due at the same minute, they run sequentially. A single slow generation (up to 15 minutes) blocks all subsequent scheduled jobs. This could cause significant delays.

Recommendation: Use JoinSet or FuturesUnordered with a concurrency limit (e.g., 3 concurrent scheduled jobs) to process them in parallel.

[MEDIUM] Scheduler time comparison is string-based (scheduler.rs:29, schedules.rs:91)

let time = chrono::Utc::now().format("%H:%M").to_string();

The SQL query uses time_utc <= $2 as a string comparison. This works for HH:MM format because it is lexicographically ordinal. However, if the scheduler tick happens at 08:01 and the schedule was for 08:00, the schedule runs. If the tick happens at 08:59, the 08:00 schedule runs again -- but last_run_at date check prevents double execution. This is correct but relies on the CURRENT_DATE check. If the server restarts after midnight UTC but before the tick, a 23:30 schedule from the previous day would not run (missed execution). This is a known limitation of tick-based scheduling.

3. Type System

3.1 Newtype Gaps

[MEDIUM] Stringly-typed identifiers

The codebase passes provider_name: &str, source_type: &str, status: &str, call_type: &str as raw strings throughout the pipeline. These should be enums:

enum ProviderName { Gemini, OpenAI, Anthropic }
enum ArticleStatus { Used, FilteredHistory, FilteredDiversity, FilteredEmpty, FilteredTooOld, FilteredNotArticle, FilteredHomepage }
enum SourceType { PersonalizedSource, WebSearch, BraveSearch }

Currently, a typo in a status string (e.g., "filtered_diversty") would silently produce incorrect data. Enums catch this at compile time.

[MEDIUM] serde_json::Value as domain type

Theme.categories is serde_json::Value rather than Vec<String>. Every access requires serde_json::from_value(theme.categories).unwrap_or_default(). This pattern appears in:

synthesis.rs:265
models/theme.rs:104
models/schedule.rs:89-92
scheduler.rs:67,70

Deserializing to Vec<String> at the DB layer (in FromRow or TryFrom) would eliminate this repeated pattern and make invalid states unrepresentable.

[LOW] i32 used for semantic integers

max_items_per_category, max_age_days, summary_length, batch_size, source_extraction_window are all i32. These are always positive. Using a newtype like NonZeroU32 or a validated wrapper would prevent negative values at the type level rather than at validation time.

3.2 Enum Design

Positive: AppError and ProgressEvent are well-designed tagged enums. ProgressEvent uses #[serde(tag = "type")] for clean JSON serialization.

[LOW] ProgressEvent::Error should carry structured error info

Currently ProgressEvent::Error { message: String }. Adding an optional error code would help the frontend differentiate between user-fixable errors (bad API key) and transient errors (rate limit).

4. Memory and Performance

4.1 Cloning

[MEDIUM] AppState is cloned on every handler call

AppState is #[derive(Clone)] and contains a DashMap<Uuid, UserRateLimitEntry> directly (not behind Arc). The DashMap itself is Clone, but it wraps Arc internally, so this is actually cheap. This is fine -- just worth noting that the Clone is O(1) due to internal Arc.

[LOW] sources.clone() before rotation (synthesis.rs:318)

let rotated_sources = rotate_sources(sources.clone(), last_source_url.as_deref());

sources is not used after this point, so rotate_sources(sources, ...) would avoid the clone. The function already takes ownership.

4.2 Allocations

[MEDIUM] Repeated string allocations in hot path

In the article processing loop, to_lowercase() is called on category names, source URLs, and domain names for every article. Pre-computing lowered versions at pipeline start would reduce allocations:

// At pipeline start:
let classification_categories_lower: HashSet<String> = classification_categories.iter()
    .map(|c| c.to_lowercase()).collect();

[LOW] format!("category_{}", i) repeated many times

The pattern format!("category_{}", i) appears ~15 times across the pipeline. Pre-computing a Vec<String> of category keys at pipeline start would avoid redundant format allocations.

4.3 Arc Usage

Positive: Arc is used appropriately for shared data across spawned tasks:

Arc<watch::Sender<ProgressEvent>> for progress channel
Arc<AtomicBool> for cancellation flag
Arc<dyn LlmProvider> for the LLM provider
Arc<String> for model_research (shared across classify tasks)
Arc<Vec<String>> for classification_categories
Arc<Value> for classify_schema

These are all correct. The Arc is needed because the values cross tokio::spawn boundaries.

5. Safety

5.1 `unwrap()` in Production Code

Positive: No unwrap() calls exist in production (non-test) code paths. All unwrap() usage is confined to:

#[cfg(test)] blocks
LazyLock static initializers (Selector::parse("a[href]").unwrap()) -- these are infallible for valid CSS selectors and run once at startup.

[HIGH] unwrap_or_default() on serde deserialization in scheduler (scheduler.rs:67,70)

let emails: Vec<String> = serde_json::from_value(schedule.emails.clone()).unwrap_or_default();
// ...
let sections: Vec<NewsSection> = serde_json::from_value(synth.sections).unwrap_or_default();

While unwrap_or_default() is not unwrap(), these silently swallow malformed JSON. If schedule.emails is corrupted, no email is sent and no error is logged. The sections case is worse: a corrupted synthesis would send an empty email. These should at minimum log a warning:

let emails: Vec<String> = match serde_json::from_value(schedule.emails.clone()) {
    Ok(e) => e,
    Err(e) => {
        tracing::warn!(schedule_id = %schedule.id, error = %e, "Failed to parse schedule emails");
        continue; // or Vec::new()
    }
};

5.2 SQL Safety

Positive: All queries use parameterized bindings ($1, $2, etc.) via sqlx. No string interpolation into SQL. The find_due_schedules query uses to_jsonb($1::text) for JSONB containment, which is safe.

[LOW] Raw SQL in resolve_model (synthesis.rs:1419-1429)

let model = sqlx::query_scalar::<_, String>(
    r#"SELECT m->>'model_id' FROM admin_providers, jsonb_array_elements(models_scraping) AS m
    WHERE provider_name = $1 AND is_enabled = true AND (m->>'is_default')::boolean = true LIMIT 1"#,
)

This is parameterized and safe, but uses a cross-join with jsonb_array_elements which could be expensive if models_scraping is large. In practice it is always small (< 10 models per provider), so this is fine.

5.3 Input Validation

Positive: All request types have validate() methods that check bounds. URL validation requires http:// or https:// scheme. Email validation in UpsertScheduleRequest checks for @.

[MEDIUM] Email validation is too permissive (schedule.rs:62-64)

if !email.contains('@') {
    return Err(format!("Invalid email address: '{}'", email));
}

The email_address crate is already a dependency (used for auth registration). Scheduled email validation should use it for consistency:

if email_address::EmailAddress::parse(&email, None).is_err() { ... }

[LOW] No URL validation on source URLs beyond scheme check

validate_url() checks scheme and length but does not parse with url::Url::parse(). A string like https:// (no host) passes validation but will fail at scrape time. Adding a Url::parse() check would catch these early.

5.4 SSRF Prevention

Positive: Comprehensive SSRF protection:

DNS resolution before connecting (check_ssrf)
IPv4/IPv6 private range checking including mapped addresses
Redirect policy with per-hop DNS validation
Documented TOCTOU limitation

[LOW] SKIP_SSRF_CHECK env var bypass

if std::env::var("SKIP_SSRF_CHECK").is_ok() {
    return Ok(());
}

This is documented as being for integration tests, but the env check runs on every scrape call. In production, if this env var is accidentally set, all SSRF protection is disabled. Consider gating this with #[cfg(test)] or a compile-time feature flag.

5.5 Secret Handling

Positive:

API keys are encrypted at rest with AES-256-GCM.
sanitize_error_message() strips potential API key patterns from error messages sent to clients.
Gemini provider specifically avoids logging the full URL (which contains the API key in a query parameter).
AppError::Internal hides details from the client response.

[MEDIUM] Gemini API key in URL query parameter

fn api_url(&self, model: &str) -> String {
    format!("...?key={}", self.api_key)
}

The API key is in the URL. If reqwest ever logs the URL at debug level, the key leaks to logs. The provider already mitigates this by logging only the error kind, not the full error (which includes the URL). However, any middleware or proxy that logs URLs would expose the key. Consider using a header-based auth approach if Gemini supports it, or ensuring the HTTP client never logs request URLs.

6. New Modules Analysis

6.1 Scheduler (`scheduler.rs`)

Architecture: Simple tick-based scheduler. An external caller (likely a periodic tokio::spawn with tokio::time::interval) calls run_scheduled_jobs() each minute.

[HIGH] No jitter or distributed lock

If multiple instances of the application run (e.g., horizontal scaling), all would execute the same scheduled jobs simultaneously because find_due_schedules has no locking. The CLAUDE.md says "single-tenant self-hosted", so this is likely acceptable, but the design should be documented as single-instance only.

[MEDIUM] mark_run is called after generation completes

If the server crashes during generation, the schedule is never marked as run. On restart, it would re-run. This is probably acceptable behavior (retry on failure), but could lead to duplicate syntheses if the first run partially succeeded and saved to DB.

6.2 Themes (`db/themes.rs`, `models/theme.rs`)

Well-structured: Clean CRUD operations with user_id scoping. The COALESCE pattern for partial updates is idiomatic SQL. The TryFrom<Theme> for ThemeResponse conversion handles the serde_json::Value -> Vec<String> deserialization.

[LOW] UpdateThemeRequest lacks validation

Unlike CreateThemeRequest which has a thorough validate() method, UpdateThemeRequest has no validation at all. Optional fields bypass all checks. If max_items_per_category is Some(-1), it passes through to the DB unchecked.

6.3 Schedules (`db/schedules.rs`, `models/schedule.rs`)

Well-structured: The ON CONFLICT (theme_id) DO UPDATE upsert pattern is clean. find_due_schedules uses JSONB containment (@>) correctly.

[LOW] time_utc is stored as String rather than a SQL TIME type

Using a TEXT column for time means the DB cannot enforce format constraints. A TIME column would provide built-in validation and enable time arithmetic in queries.

6.4 Preferred Sources Pipeline Logic

Well-designed: The pipeline correctly:

Rotates sources to avoid always starting from the same one
Separates preferred sources (processed first) from non-preferred
Enforces per-source diversity limits (max_articles_per_source)
Checks article history for deduplication
Batches scraping and classification for efficiency
Tracks URL provenance for history

[MEDIUM] Source rotation uses last URL, not last source index

let last_source = db::article_history::get_last_source_url(&state.pool, user_id).await.unwrap_or(None);
let rotated_sources = rotate_sources(sources.clone(), last_source.as_deref());

If a source is deleted and re-added with a different URL, rotation breaks. If the last-used source URL changes (e.g., site redesign), the rotation starts from index 0. Using source ID instead of URL for rotation tracking would be more robust.

7. Code Structure and Maintainability

7.1 `synthesis.rs` Monolith

[HIGH] run_generation_inner is ~800 lines with significant duplication

The Phase 1 (personalized sources) and Phase 2 Brave Search path share nearly identical scrape+classify logic. The batch processing loop (scrape in parallel -> classify in parallel -> assign category -> track counts) is copy-pasted with minor variations (presence/absence of source_url).

Recommendation: Extract shared logic into helper functions:

scrape_and_classify_batch(urls, provider, model, schema, categories, ...)
process_classify_response(response, categories, filled_counts, max_per_category, ...)

This would reduce the function by ~300 lines and eliminate the risk of fixing a bug in one path but not the other.

7.2 `#[allow(clippy::too_many_arguments)]`

This attribute appears on log_llm_call and build_search_prompt. Functions with 8+ parameters are a code smell. Builder patterns or parameter structs would improve readability:

struct LlmCallLog { user_id: Uuid, job_id: Uuid, call_type: &str, ... }

8. Dependency Review

Crate	Version	Notes
`axum`	0.8	Current. Good.
`sqlx`	0.8	Current. Using runtime-checked queries (no compile-time verification).
`reqwest`	0.12	Current.
`tokio`	1	Full features. Fine for a backend server.
`dashmap`	6	Current. Used appropriately for concurrent maps.
`rand`	0.8	One major version behind (0.9 available). Not urgent.
`anyhow` + `thiserror`	1 + 2	Good combination: `thiserror` for structured errors, `anyhow` for internal catch-all.
`zeroize`	1	Present in dependencies but not used on API key strings. See recommendation above.
`async-trait`	0.1	Still needed for trait object dispatch. Rust 1.75+ has native async traits, but not for `dyn` dispatch.

9. Summary of Findings by Severity

CRITICAL (0)

None found.

HIGH (5)

No concurrency limit on JoinSet scrape/classify spawns -- risk of resource exhaustion under load
Silent .ok() on DB writes -- traceability data loss without any logging
Scheduler runs due jobs sequentially -- one slow job blocks all others
Scheduler unwrap_or_default() silently swallows malformed data -- corrupted schedule sends no email with no warning
run_generation_inner monolith -- 800+ lines with duplicated Phase 1/Phase 2 logic; high bug introduction risk

MEDIUM (12)

Redundant clone in category initialization
Excessive .clone() on source partitioning
Validation returns String instead of structured errors
serde_json::Value used as domain type for categories/days/emails
Stringly-typed identifiers (provider names, status codes, source types)
JoinSet task panic handling drops errors silently
Fire-and-forget cleanup spawn per job
Double-spawn panic handler pattern in generation handler
Scheduler time comparison and missed-execution edge case
Email validation too permissive in schedules
Gemini API key in URL -- logging risk
Source rotation based on URL rather than ID

LOW (9)

Repeated to_lowercase() allocations in hot path
Repeated format!("category_{}", i) allocations
sources.clone() before rotation is avoidable
ProgressEvent::Error lacks structured error code
UpdateThemeRequest lacks validation
time_utc stored as String instead of SQL TIME
No URL parsing in validate_url()
SKIP_SSRF_CHECK env var bypass not compile-gated
zeroize unused on API key strings despite being a dependency

10. Recommendations (Prioritized)

Extract scrape+classify batch logic from run_generation_inner into a shared helper. This is the single highest-ROI refactor.
Add a concurrency semaphore to JoinSet-based scraping (e.g., Semaphore::new(20)).
Log DB write failures instead of using bare .ok().
Parallelize the scheduler with bounded concurrency for due jobs.
Replace serde_json::Value with typed fields in Theme.categories, ThemeSchedule.days, ThemeSchedule.emails.
Introduce enums for stringly-typed constants (ProviderName, ArticleStatus, SourceType).
Add validation to UpdateThemeRequest to match CreateThemeRequest.
Use email_address crate for schedule email validation.
Wrap API key strings in Zeroizing<String> in LLM provider structs.
Consider #[cfg(test)] gating for SKIP_SSRF_CHECK instead of runtime env check.

25 KiB Raw Blame History