You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

22 KiB

Raw Blame History

Tech Lead Code Review Report

Date: 2026-03-26 Codebase: AI Weekly Synth (Rust/Axum + SolidJS) Total Backend LOC: ~14,263 (Rust) Total Frontend LOC: ~10,622 (TypeScript/TSX)

Executive Summary

AI Weekly Synth is a well-structured, well-documented codebase with clear module boundaries and strong security practices (SSRF prevention, AES-256-GCM encryption, anti-enumeration auth). The code quality is above average for a learning project.

The primary concern is synthesis.rs at 1,686 lines, which contains a monolithic pipeline function (run_generation_inner, lines 243-890) spanning ~650 lines with significant copy-pasted code blocks between Phase 1 (personalized sources), Phase 2 Brave Search, and Phase 2 LLM Search. This file alone accounts for 12% of all backend code and is the single biggest maintainability risk.

Secondary concerns include: triple-duplication of settings field lists across UserSettings, SettingsResponse, and UpdateSettingsRequest; nearly identical error-mapping functions across all three LLM providers; and a 1,025-line Settings.tsx that could benefit from component extraction.

The codebase is otherwise clean: naming is consistent, comments are useful without being excessive, the error handling is centralized, and the module structure follows idiomatic Rust patterns. The recommendations below are ordered by impact-to-effort ratio.

1. Complexity Hotspots

1.1 `backend/src/services/synthesis.rs` -- 1,686 lines (CRITICAL)

This is by far the most complex file. Key concerns:

run_generation_inner (lines 243-890, ~650 lines): This single async function orchestrates the entire generation pipeline. It manages 8+ mutable tracking structures (article_scraped, source_counts, url_source, filled_counts, seen_urls, etc.) and contains 3 nearly identical "scrape + classify + assign category" loops.
Deep nesting: The Brave Search path (lines 562-759) is nested inside if !category_gaps.is_empty() > if settings.use_brave_search > while !done > while let Some(join_result), reaching 5 levels of indentation.
Copy-pasted classify logic: The category assignment block (lines 505-541) is duplicated almost verbatim at lines 702-750 for Brave Search and implicitly again for the LLM search path. Each block does:
- Extract title/summary/category from LLM response
- Validate category against allowed list
- Compute cat_key string
- Check filled_counts for overflow to "Autre"
- Insert into article_scraped
- Update counters
trace_article function (line 926): Has 11 parameters and is called 15+ times throughout the pipeline with #[allow(clippy::too_many_arguments)]. This is a code smell that the function is doing too much parameter passing.

1.2 `backend/src/services/scraper.rs` -- 1,242 lines

Well-organized but large. 572 lines are production code, 670 lines are tests. The test-to-code ratio is excellent. The main scrape_url function is a reasonable 95 lines. No urgent action needed, but could benefit from splitting the test module into a separate file.

1.3 `frontend/src/pages/Settings.tsx` -- 1,025 lines

A single component handling theme, categories, max age, items per category, batch size, article history, LLM source links, Brave Search key management, provider selection, model selection, websearch model selection, rate limits, search behavior, export/import, and API key management. Contains:

14 signals (state variables)
10+ event handler functions
Inline JSX for every form section

1.4 `frontend/src/pages/admin/Providers.tsx` -- 854 lines

Similar to Settings: a single component managing CRUD for providers with inline model list editing. The model list editing JSX (for both models_scraping and models_websearch) is nearly identical, repeated twice per provider card and twice for the "new provider" form.

1.5 `backend/src/services/llm/anthropic.rs` -- 500 lines

Of these, ~270 lines are tests, which is healthy. The production code is clean.

2. Code Duplication

2.1 Settings Field Lists (HIGH -- 4 locations)

The same 16 fields are listed in 4 places:

Location	Type	Purpose
`models/settings.rs:9-27`	`UserSettings`	DB model
`models/settings.rs:31-47`	`SettingsResponse`	API response
`models/settings.rs:73-89`	`UpdateSettingsRequest`	API input
`db/settings.rs:14-31`	`SettingsRow`	sqlx row type

Every time a new settings field is added (which has happened several times based on commit history), all 4 structs plus the From impl, the TryFrom impl, the SQL INSERT/UPDATE queries, and the validation function must all be updated in lockstep. The SettingsResponse is particularly pointless -- it has exactly the same fields as UserSettings minus user_id and updated_at.

Files affected: backend/src/models/settings.rs, backend/src/db/settings.rs

2.2 LLM Error Mapping (MEDIUM -- 3 providers)

The map_*_error functions in all three providers are nearly identical:

// gemini.rs:151-178, openai.rs:158-189, anthropic.rs:194-228
match status {
    400 => AppError::BadRequest("Invalid request to LLM provider".into()),
    401 => AppError::BadRequest("Invalid or unauthorized API key".into()),
    // ...
    429 => AppError::RateLimited("LLM provider rate limit exceeded...".into()),
    _ => AppError::Internal(anyhow::anyhow!("LLM provider returned an error")),
}

Only Anthropic adds a 529 case; Gemini merges 401|403. The error extraction logic (reading body["error"]["message"]) differs slightly per provider but could be parameterized.

Files affected: backend/src/services/llm/gemini.rs, openai.rs, anthropic.rs

2.3 Scrape + Classify Pipeline (CRITICAL -- 3 code paths)

The pattern "scrape batch in parallel -> classify batch in parallel -> assign to categories" appears 3 times in synthesis.rs:

Phase 1 Personalized Sources (lines 429-551): scrape set -> classify set -> category assignment with source tracking
Phase 2 Brave Search (lines 639-758): near-identical with minor tuple shape differences
Phase 2 LLM Search (lines 822-843): simplified (scrape for validation only, no classify needed)

Paths 1 and 2 differ only in:

Tuple shape: (url, source_url, body_text, page_title) vs (url, body_text, page_title)
Source tracking: Phase 1 uses source_url from url_source, Phase 2 uses extract_domain
Otherwise the classify/assign logic is character-for-character identical

2.4 Article Filtering Logic (MEDIUM -- 3 locations)

The "homepage filter + cross-phase dedup + history dedup + source diversity" filter pipeline appears in 3 places:

Phase 2 LLM search (lines 780-820)
Phase 2 Brave search (lines 575-616)
Phase 1 history filter (lines 376-387)

Each applies the same checks in slightly different orders with identical logic.

2.5 Frontend Model List JSX (LOW -- Providers.tsx)

In Providers.tsx, the model list editing UI (add/remove/edit model rows) is repeated 4 times:

Existing provider models_scraping list
Existing provider models_websearch list
New provider models_scraping list
New provider models_websearch list

2.6 Frontend Default Settings (LOW -- 2 locations)

DEFAULT_SETTINGS is defined in frontend/src/types.ts:60-83 and UserSettings::default() is defined in backend/src/models/settings.rs:162-190. The category lists are slightly different between the two (the frontend has longer, more descriptive French names). This divergence is a latent bug source.

3. Readability Assessment

Strengths

Module-level documentation: Every .rs file starts with //! module docs explaining purpose and listing endpoints. This is excellent and should be maintained.
Function documentation: Important functions have /// doc comments with # Arguments sections. The scraper and synthesis modules are particularly well-documented.
Naming conventions: Consistent snake_case for Rust, camelCase for TypeScript. No single-letter variables except loop iterators.
Error handling: Centralized in errors.rs with clear variant-to-status-code mapping. Internal errors never leak to clients.
Security comments: The SSRF check has explicit notes about TOCTOU gaps (scraper.rs:63). The job store documents why _rx is kept alive. These are high-value comments.

Concerns

Short variable names in synthesis.rs: Variables like u, su, mad, jid, uid, mdl, sys, usr hurt readability in the already-complex pipeline code. These abbreviations exist because of async move closures requiring cloned values, but they make the code harder to follow.
Magic numbers in synthesis.rs: max_links = 15, max_concurrent = 5, progress percentages (5, 10, 12, 15, 25, 70, 75, 80, 90) are hardcoded inline. The progress percentages especially are fragile -- changing the pipeline structure requires manually recalculating all percent values.
French strings in Rust: Error messages in synthesis.rs are in French (e.g., "Aucun article valide trouve"), which is correct for this French-language app, but they are hardcoded rather than going through an i18n system. This is acceptable for a single-language self-hosted app.

4. Maintainability Risks

4.1 Settings Field Addition Ceremony

Adding one settings field requires changes to:

models/settings.rs -- 3 structs + Default + From + validate()
db/settings.rs -- SettingsRow + TryFrom + 2 SQL queries (INSERT + UPDATE with RETURNING)
SQL migration
frontend/src/types.ts -- UserSettings + DEFAULT_SETTINGS
frontend/src/pages/Settings.tsx -- UI input
frontend/src/i18n/fr.ts -- label

This 6-file ceremony is a maintenance burden and has likely already produced bugs (note the divergent default categories between frontend and backend).

4.2 Tight Coupling Between Pipeline Steps

The generation pipeline in synthesis.rs is a single function with shared mutable state (article_scraped, filled_counts, source_counts, seen_urls). This means:

Cannot test individual phases in isolation
Cannot add a new phase without modifying the 650-line function
Cannot run phases in different orders for different configurations
Any bug in counter management affects all downstream phases

4.3 Provider-Specific Behavior Spread Across Files

Adding a new LLM provider requires:

New file in services/llm/ implementing LlmProvider
Update services/llm/factory.rs match arm
Potentially update services/synthesis.rs if the provider has unique behavior
Update admin provider seed data

This is mostly well-contained by the trait abstraction, but the factory pattern could be more extensible.

4.4 Frontend types.ts as a God File

types.ts (303 lines) defines every interface for the entire app. While currently manageable, it will become harder to navigate as the app grows. Types are not co-located with their usage.

5. Simplification Opportunities

5.1 Eliminate `SettingsResponse`

SettingsResponse is a strict subset of UserSettings (same fields minus user_id and updated_at). The handler could simply serialize UserSettings with #[serde(skip)] on the excluded fields, or use a #[serde(rename)] attribute, eliminating one of the 4 duplicate structs.

5.2 Extract Category Assignment Logic

The 40-line category assignment block (validate LLM category -> compute cat_key -> check filled_counts -> overflow to "Autre" -> insert into article_scraped -> update counters) is a pure function that can be extracted:

fn assign_to_category(
    item: NewsItem,
    llm_response: &Value,
    categories: &[String],
    filled_counts: &mut HashMap<String, usize>,
    article_scraped: &mut HashMap<String, Vec<NewsItem>>,
    max_per_cat: usize,
) -> bool { ... }

5.3 Unify Scrape+Classify Pipeline

Create a helper function:

async fn scrape_and_classify_batch(
    urls: &[(String, Option<String>)],  // (url, optional_source_url)
    provider: &Arc<dyn LlmProvider>,
    model: &str,
    schema: &Value,
    categories: &[String],
    settings: &UserSettings,
    // tracking state refs
) -> Vec<ClassifiedArticle>

This would eliminate ~200 lines of duplication between Phase 1 and Phase 2 Brave paths.

5.4 Extract Article Filter Pipeline

fn should_include_article(
    url: &str,
    seen_urls: &HashSet<String>,
    source_counts: &HashMap<String, usize>,
    max_per_source: usize,
) -> Result<(), &'static str>

5.5 Introduce `PipelineContext` Struct

Replace the 8+ mutable tracking variables with a single struct:

struct PipelineContext {
    article_scraped: HashMap<String, Vec<NewsItem>>,
    source_counts: HashMap<String, usize>,
    url_source: HashMap<String, String>,
    filled_counts: HashMap<String, usize>,
    seen_urls: HashSet<String>,
    max_total: usize,
}

This reduces parameter passing and makes the tracking state explicit.

5.6 Shared LLM Error Mapper

Move the common error mapping logic to services/llm/mod.rs:

pub fn map_provider_error(status: u16, provider_id: &str, message: &str) -> AppError

5.7 Frontend: Extract Settings Section Components

Break Settings.tsx into:

SettingsThemeSection
SettingsNumericFields
SettingsCategoriesSection
SettingsProviderSection
SettingsBraveSection
SettingsAdvancedSection
SettingsExportImport

Each would be 50-100 lines, down from 1,025 total.

5.8 Frontend: Extract ModelListEditor Component

In Providers.tsx, the model list UI (with add/remove/edit/default toggle) appears 4 times. A ModelListEditor component would receive models, onChange, and label props.

5.9 Dead/Unused Code

SettingsRow in db/settings.rs could be generated by sqlx macros instead of manual struct + TryFrom, but this is a style preference rather than dead code.
The JobStore::len() and JobStore::is_empty() methods are only used in tests. They are correctly gated for testing purposes.

6. Detailed Refactoring Plan

Critical Priority

C1. Extract Scrape+Classify Pipeline from `synthesis.rs`

Description: Extract the duplicated scrape-batch + classify-batch + assign-category logic into a reusable async function. Create a PipelineContext struct to hold the shared tracking state. This would reduce run_generation_inner from ~650 to ~300 lines.
Files affected: backend/src/services/synthesis.rs
Estimated effort: Medium (1-2 days)
Rationale: This is the single highest-impact change. The current 3x duplication means any bug fix or feature addition must be applied in 3 places. The run_generation_inner function is too long to review effectively.

C2. Extract Article Filter Pipeline

Description: Create a filter_candidate_url() function that applies homepage check, cross-phase dedup, history dedup, and source diversity in one place.
Files affected: backend/src/services/synthesis.rs
Estimated effort: Small (2-4 hours)
Rationale: The same filtering logic is duplicated 3 times with minor variations. A single function with configuration parameters eliminates this.
Dependencies: Can be done independently or as part of C1.

High Priority

H1. Eliminate `SettingsResponse` Struct Duplication

Description: Use #[serde(skip_serializing)] on user_id and updated_at in UserSettings, and serialize it directly in the handler. Remove SettingsResponse and its From impl entirely.
Files affected: backend/src/models/settings.rs, backend/src/handlers/settings.rs
Estimated effort: Small (1-2 hours)
Rationale: Eliminates one of the 4 duplicate field lists with zero risk.

H2. Introduce `PipelineContext` Struct

Description: Group article_scraped, source_counts, url_source, filled_counts, seen_urls, and max_total into a PipelineContext struct with methods like is_full(), add_article(), track_source().
Files affected: backend/src/services/synthesis.rs
Estimated effort: Medium (4-6 hours)
Rationale: Reduces cognitive load when reading the pipeline. Makes the tracking state self-documenting. Enables methods like ctx.add_article() that encapsulate the category assignment logic.
Dependencies: Best done alongside C1.

H3. Replace `trace_article` 11-Parameter Function with a Struct

Description: Create TraceEntry struct and pass it to trace_article(). This is already partially done with ArticleHistoryEntry in db/article_history.rs but trace_article() constructs it from 11 individual parameters.
Files affected: backend/src/services/synthesis.rs
Estimated effort: Small (2-3 hours)
Rationale: 11-parameter functions violate Rust idioms. A builder pattern or struct makes each call site clearer and less error-prone.

Medium Priority

M1. Extract Shared LLM Error Mapping

Description: Create map_provider_error(status, provider_name, error_message, error_type) -> AppError in services/llm/mod.rs. Each provider calls this with its parsed error details.
Files affected: backend/src/services/llm/mod.rs, gemini.rs, openai.rs, anthropic.rs
Estimated effort: Small (2-3 hours)
Rationale: Eliminates 3x duplication of the same match-on-status-code pattern. Makes it easy to add consistent error handling for new providers.

M2. Break Up `Settings.tsx` into Section Components

Description: Extract 6-7 section components (Theme, NumericFields, Categories, Provider, Brave, Advanced, ExportImport). The parent Settings component becomes a form container with save logic.
Files affected: frontend/src/pages/Settings.tsx, new files in frontend/src/components/settings/
Estimated effort: Medium (4-6 hours)
Rationale: 1,025 lines is too large for a single component. Section extraction makes each section independently testable and easier to navigate.

M3. Extract `ModelListEditor` Component from `Providers.tsx`

Description: Create a reusable ModelListEditor component that accepts models, onChange, label, and onAddModel props. Use it 4 times in Providers.tsx.
Files affected: frontend/src/pages/admin/Providers.tsx, new frontend/src/components/admin/ModelListEditor.tsx
Estimated effort: Small (2-3 hours)
Rationale: Eliminates 4x JSX duplication of the model editing UI.

M4. Name Pipeline Progress Percentages as Constants

Description: Define named constants for progress stages: PROGRESS_INIT = 5, PROGRESS_SOURCES = 10, PROGRESS_PROVIDER = 12, etc. Better yet, compute them from a stage list.
Files affected: backend/src/services/synthesis.rs
Estimated effort: Small (1-2 hours)
Rationale: Progress percentages are currently magic numbers scattered through 650 lines. Named constants make them self-documenting and easier to adjust.

M5. Align Frontend/Backend Default Settings

Description: The DEFAULT_SETTINGS in types.ts and UserSettings::default() in Rust have different category lists. Choose one as the source of truth. Ideally the backend defaults should be authoritative, and the frontend should fetch them on first load (which it already does -- the frontend defaults are only used before the first API call).
Files affected: frontend/src/types.ts, backend/src/models/settings.rs
Estimated effort: Small (1 hour)
Rationale: Divergent defaults can cause subtle bugs when a user's first-load experience differs from what the backend creates.

Low Priority

L1. Split `scraper.rs` Tests into a Separate File

Description: Move the 670-line test module from scraper.rs to scraper_tests.rs or use #[path] to keep them co-located but in a separate file.
Files affected: backend/src/services/scraper.rs
Estimated effort: Small (30 minutes)
Rationale: The file is large but most of it is tests. Moving tests out keeps the main file focused on production code.

L2. Use Builder Pattern for `PdfWriter`

Description: The PdfWriter::new() takes 6 parameters (doc, page, layer, font_regular, font_bold, font_italic). A builder would be cleaner.
Files affected: backend/src/services/export.rs
Estimated effort: Small (1-2 hours)
Rationale: Minor improvement. The current code works fine.

L3. Co-locate Frontend Types with API Modules

Description: Move type definitions from types.ts to their respective API modules (e.g., Source to api/sources.ts, Synthesis to api/syntheses.ts). Keep shared types like ApiError in types.ts.
Files affected: frontend/src/types.ts, multiple API files
Estimated effort: Medium (2-3 hours)
Rationale: Better co-location makes it easier to find and maintain types. However, the current approach works and 303 lines is still manageable.

L4. Improve Variable Names in Async Closures

Description: In synthesis.rs, replace abbreviated variables in async move closures (u, su, mad, jid) with descriptive names (url_to_scrape, source_url_ref, max_age, job_id_ref). The abbreviations exist because async closures require cloning, but descriptive names are worth the extra characters.
Files affected: backend/src/services/synthesis.rs
Estimated effort: Small (1-2 hours)
Rationale: Minor readability improvement in the most complex file.

Appendix: File Size Summary (Top 15)

Backend (Rust)

File	Lines	Notes
`services/synthesis.rs`	1,686	~650 lines pipeline + ~400 helpers + ~600 tests
`services/scraper.rs`	1,242	~570 code + ~670 tests
`services/llm/anthropic.rs`	500	~230 code + ~270 tests
`services/rate_limiter.rs`	471	Clean, well-structured
`services/export.rs`	456	PDF generation
`handlers/admin.rs`	438	Standard CRUD, clean
`models/settings.rs`	433	4 structs with field duplication
`services/source_scraper.rs`	428	Link extraction from source pages
`services/llm/openai.rs`	399	~190 code + ~200 tests
`services/prompts.rs`	390	Prompt templates + tests

Frontend (TypeScript/TSX)

File	Lines	Notes
`pages/Settings.tsx`	1,025	Monolithic settings form
`pages/admin/Providers.tsx`	854	Provider CRUD with duplicated model UI
`pages/Sources.tsx`	488	Manageable
`pages/SynthesisDetail.tsx`	476	Clean component extraction
`i18n/fr.ts`	394	Translation strings
`pages/GenerateSynthesis.tsx`	404	Generation with SSE progress
`pages/ArticleHistory.tsx`	321	Table with filtering
`components/ApiKeyManager.tsx`	313	API key CRUD component
`types.ts`	303	All TypeScript interfaces

22 KiB Raw Blame History