ai_synth/docs/audit/tech-lead-report.md

# Tech Lead Code Review Report

**Date:** 2026-03-26
**Codebase:** AI Weekly Synth (Rust/Axum + SolidJS)
**Total Backend LOC:** ~14,263 (Rust)
**Total Frontend LOC:** ~10,622 (TypeScript/TSX)

---

## Executive Summary

AI Weekly Synth is a well-structured, well-documented codebase with clear module boundaries and strong security practices (SSRF prevention, AES-256-GCM encryption, anti-enumeration auth). The code quality is above average for a learning project.

The primary concern is **`synthesis.rs` at 1,686 lines**, which contains a monolithic pipeline function (`run_generation_inner`, lines 243-890) spanning ~650 lines with significant copy-pasted code blocks between Phase 1 (personalized sources), Phase 2 Brave Search, and Phase 2 LLM Search. This file alone accounts for 12% of all backend code and is the single biggest maintainability risk.

Secondary concerns include: triple-duplication of settings field lists across `UserSettings`, `SettingsResponse`, and `UpdateSettingsRequest`; nearly identical error-mapping functions across all three LLM providers; and a 1,025-line Settings.tsx that could benefit from component extraction.

The codebase is otherwise clean: naming is consistent, comments are useful without being excessive, the error handling is centralized, and the module structure follows idiomatic Rust patterns. The recommendations below are ordered by impact-to-effort ratio.

---

## 1. Complexity Hotspots

### 1.1 `backend/src/services/synthesis.rs` -- 1,686 lines (CRITICAL)

This is by far the most complex file. Key concerns:

- **`run_generation_inner`** (lines 243-890, ~650 lines): This single async function orchestrates the entire generation pipeline. It manages 8+ mutable tracking structures (`article_scraped`, `source_counts`, `url_source`, `filled_counts`, `seen_urls`, etc.) and contains 3 nearly identical "scrape + classify + assign category" loops.

- **Deep nesting**: The Brave Search path (lines 562-759) is nested inside `if !category_gaps.is_empty()` > `if settings.use_brave_search` > `while !done` > `while let Some(join_result)`, reaching 5 levels of indentation.

- **Copy-pasted classify logic**: The category assignment block (lines 505-541) is duplicated almost verbatim at lines 702-750 for Brave Search and implicitly again for the LLM search path. Each block does:
  - Extract title/summary/category from LLM response
  - Validate category against allowed list
  - Compute `cat_key` string
  - Check `filled_counts` for overflow to "Autre"
  - Insert into `article_scraped`
  - Update counters

- **`trace_article` function** (line 926): Has 11 parameters and is called 15+ times throughout the pipeline with `#[allow(clippy::too_many_arguments)]`. This is a code smell that the function is doing too much parameter passing.

### 1.2 `backend/src/services/scraper.rs` -- 1,242 lines

Well-organized but large. 572 lines are production code, 670 lines are tests. The test-to-code ratio is excellent. The main `scrape_url` function is a reasonable 95 lines. No urgent action needed, but could benefit from splitting the test module into a separate file.

### 1.3 `frontend/src/pages/Settings.tsx` -- 1,025 lines

A single component handling theme, categories, max age, items per category, batch size, article history, LLM source links, Brave Search key management, provider selection, model selection, websearch model selection, rate limits, search behavior, export/import, and API key management. Contains:
- 14 signals (state variables)
- 10+ event handler functions
- Inline JSX for every form section

### 1.4 `frontend/src/pages/admin/Providers.tsx` -- 854 lines

Similar to Settings: a single component managing CRUD for providers with inline model list editing. The model list editing JSX (for both `models_scraping` and `models_websearch`) is nearly identical, repeated twice per provider card and twice for the "new provider" form.

### 1.5 `backend/src/services/llm/anthropic.rs` -- 500 lines

Of these, ~270 lines are tests, which is healthy. The production code is clean.

---

## 2. Code Duplication

### 2.1 Settings Field Lists (HIGH -- 4 locations)

The same 16 fields are listed in 4 places:

| Location | Type | Purpose |
|----------|------|---------|
| `models/settings.rs:9-27` | `UserSettings` | DB model |
| `models/settings.rs:31-47` | `SettingsResponse` | API response |
| `models/settings.rs:73-89` | `UpdateSettingsRequest` | API input |
| `db/settings.rs:14-31` | `SettingsRow` | sqlx row type |

Every time a new settings field is added (which has happened several times based on commit history), all 4 structs plus the `From` impl, the `TryFrom` impl, the SQL INSERT/UPDATE queries, and the validation function must all be updated in lockstep. The `SettingsResponse` is particularly pointless -- it has exactly the same fields as `UserSettings` minus `user_id` and `updated_at`.

**Files affected:** `backend/src/models/settings.rs`, `backend/src/db/settings.rs`

### 2.2 LLM Error Mapping (MEDIUM -- 3 providers)

The `map_*_error` functions in all three providers are nearly identical:

```
// gemini.rs:151-178, openai.rs:158-189, anthropic.rs:194-228
match status {
    400 => AppError::BadRequest("Invalid request to LLM provider".into()),
    401 => AppError::BadRequest("Invalid or unauthorized API key".into()),
    // ...
    429 => AppError::RateLimited("LLM provider rate limit exceeded...".into()),
    _ => AppError::Internal(anyhow::anyhow!("LLM provider returned an error")),
}
```

Only Anthropic adds a `529` case; Gemini merges `401|403`. The error extraction logic (reading `body["error"]["message"]`) differs slightly per provider but could be parameterized.

**Files affected:** `backend/src/services/llm/gemini.rs`, `openai.rs`, `anthropic.rs`

### 2.3 Scrape + Classify Pipeline (CRITICAL -- 3 code paths)

The pattern "scrape batch in parallel -> classify batch in parallel -> assign to categories" appears 3 times in `synthesis.rs`:

1. **Phase 1 Personalized Sources** (lines 429-551): scrape set -> classify set -> category assignment with source tracking
2. **Phase 2 Brave Search** (lines 639-758): near-identical with minor tuple shape differences
3. **Phase 2 LLM Search** (lines 822-843): simplified (scrape for validation only, no classify needed)

Paths 1 and 2 differ only in:
- Tuple shape: `(url, source_url, body_text, page_title)` vs `(url, body_text, page_title)`
- Source tracking: Phase 1 uses `source_url` from `url_source`, Phase 2 uses `extract_domain`
- Otherwise the classify/assign logic is character-for-character identical

### 2.4 Article Filtering Logic (MEDIUM -- 3 locations)

The "homepage filter + cross-phase dedup + history dedup + source diversity" filter pipeline appears in 3 places:
- Phase 2 LLM search (lines 780-820)
- Phase 2 Brave search (lines 575-616)
- Phase 1 history filter (lines 376-387)

Each applies the same checks in slightly different orders with identical logic.

### 2.5 Frontend Model List JSX (LOW -- Providers.tsx)

In `Providers.tsx`, the model list editing UI (add/remove/edit model rows) is repeated 4 times:
- Existing provider `models_scraping` list
- Existing provider `models_websearch` list
- New provider `models_scraping` list
- New provider `models_websearch` list

### 2.6 Frontend Default Settings (LOW -- 2 locations)

`DEFAULT_SETTINGS` is defined in `frontend/src/types.ts:60-83` and `UserSettings::default()` is defined in `backend/src/models/settings.rs:162-190`. The category lists are slightly different between the two (the frontend has longer, more descriptive French names). This divergence is a latent bug source.

---

## 3. Readability Assessment

### Strengths

- **Module-level documentation**: Every `.rs` file starts with `//!` module docs explaining purpose and listing endpoints. This is excellent and should be maintained.
- **Function documentation**: Important functions have `///` doc comments with `# Arguments` sections. The scraper and synthesis modules are particularly well-documented.
- **Naming conventions**: Consistent `snake_case` for Rust, `camelCase` for TypeScript. No single-letter variables except loop iterators.
- **Error handling**: Centralized in `errors.rs` with clear variant-to-status-code mapping. Internal errors never leak to clients.
- **Security comments**: The SSRF check has explicit notes about TOCTOU gaps (`scraper.rs:63`). The job store documents why `_rx` is kept alive. These are high-value comments.

### Concerns

- **Short variable names in synthesis.rs**: Variables like `u`, `su`, `mad`, `jid`, `uid`, `mdl`, `sys`, `usr` hurt readability in the already-complex pipeline code. These abbreviations exist because of async move closures requiring cloned values, but they make the code harder to follow.
- **Magic numbers in synthesis.rs**: `max_links = 15`, `max_concurrent = 5`, progress percentages (5, 10, 12, 15, 25, 70, 75, 80, 90) are hardcoded inline. The progress percentages especially are fragile -- changing the pipeline structure requires manually recalculating all percent values.
- **French strings in Rust**: Error messages in `synthesis.rs` are in French (e.g., "Aucun article valide trouve"), which is correct for this French-language app, but they are hardcoded rather than going through an i18n system. This is acceptable for a single-language self-hosted app.

---

## 4. Maintainability Risks

### 4.1 Settings Field Addition Ceremony

Adding one settings field requires changes to:
1. `models/settings.rs` -- 3 structs + `Default` + `From` + `validate()`
2. `db/settings.rs` -- `SettingsRow` + `TryFrom` + 2 SQL queries (INSERT + UPDATE with RETURNING)
3. SQL migration
4. `frontend/src/types.ts` -- `UserSettings` + `DEFAULT_SETTINGS`
5. `frontend/src/pages/Settings.tsx` -- UI input
6. `frontend/src/i18n/fr.ts` -- label

This 6-file ceremony is a maintenance burden and has likely already produced bugs (note the divergent default categories between frontend and backend).

### 4.2 Tight Coupling Between Pipeline Steps

The generation pipeline in `synthesis.rs` is a single function with shared mutable state (`article_scraped`, `filled_counts`, `source_counts`, `seen_urls`). This means:
- Cannot test individual phases in isolation
- Cannot add a new phase without modifying the 650-line function
- Cannot run phases in different orders for different configurations
- Any bug in counter management affects all downstream phases

### 4.3 Provider-Specific Behavior Spread Across Files

Adding a new LLM provider requires:
1. New file in `services/llm/` implementing `LlmProvider`
2. Update `services/llm/factory.rs` match arm
3. Potentially update `services/synthesis.rs` if the provider has unique behavior
4. Update admin provider seed data

This is mostly well-contained by the trait abstraction, but the factory pattern could be more extensible.

### 4.4 Frontend types.ts as a God File

`types.ts` (303 lines) defines every interface for the entire app. While currently manageable, it will become harder to navigate as the app grows. Types are not co-located with their usage.

---

## 5. Simplification Opportunities

### 5.1 Eliminate `SettingsResponse`

`SettingsResponse` is a strict subset of `UserSettings` (same fields minus `user_id` and `updated_at`). The handler could simply serialize `UserSettings` with `#[serde(skip)]` on the excluded fields, or use a `#[serde(rename)]` attribute, eliminating one of the 4 duplicate structs.

### 5.2 Extract Category Assignment Logic

The 40-line category assignment block (validate LLM category -> compute cat_key -> check filled_counts -> overflow to "Autre" -> insert into article_scraped -> update counters) is a pure function that can be extracted:

```rust
fn assign_to_category(
    item: NewsItem,
    llm_response: &Value,
    categories: &[String],
    filled_counts: &mut HashMap<String, usize>,
    article_scraped: &mut HashMap<String, Vec<NewsItem>>,
    max_per_cat: usize,
) -> bool { ... }
```

### 5.3 Unify Scrape+Classify Pipeline

Create a helper function:
```rust
async fn scrape_and_classify_batch(
    urls: &[(String, Option<String>)],  // (url, optional_source_url)
    provider: &Arc<dyn LlmProvider>,
    model: &str,
    schema: &Value,
    categories: &[String],
    settings: &UserSettings,
    // tracking state refs
) -> Vec<ClassifiedArticle>
```

This would eliminate ~200 lines of duplication between Phase 1 and Phase 2 Brave paths.

### 5.4 Extract Article Filter Pipeline

```rust
fn should_include_article(
    url: &str,
    seen_urls: &HashSet<String>,
    source_counts: &HashMap<String, usize>,
    max_per_source: usize,
) -> Result<(), &'static str>
```

### 5.5 Introduce `PipelineContext` Struct

Replace the 8+ mutable tracking variables with a single struct:
```rust
struct PipelineContext {
    article_scraped: HashMap<String, Vec<NewsItem>>,
    source_counts: HashMap<String, usize>,
    url_source: HashMap<String, String>,
    filled_counts: HashMap<String, usize>,
    seen_urls: HashSet<String>,
    max_total: usize,
}
```

This reduces parameter passing and makes the tracking state explicit.

### 5.6 Shared LLM Error Mapper

Move the common error mapping logic to `services/llm/mod.rs`:
```rust
pub fn map_provider_error(status: u16, provider_id: &str, message: &str) -> AppError
```

### 5.7 Frontend: Extract Settings Section Components

Break `Settings.tsx` into:
- `SettingsThemeSection`
- `SettingsNumericFields`
- `SettingsCategoriesSection`
- `SettingsProviderSection`
- `SettingsBraveSection`
- `SettingsAdvancedSection`
- `SettingsExportImport`

Each would be 50-100 lines, down from 1,025 total.

### 5.8 Frontend: Extract ModelListEditor Component

In `Providers.tsx`, the model list UI (with add/remove/edit/default toggle) appears 4 times. A `ModelListEditor` component would receive `models`, `onChange`, and `label` props.

### 5.9 Dead/Unused Code

- `SettingsRow` in `db/settings.rs` could be generated by sqlx macros instead of manual struct + TryFrom, but this is a style preference rather than dead code.
- The `JobStore::len()` and `JobStore::is_empty()` methods are only used in tests. They are correctly gated for testing purposes.

---

## 6. Detailed Refactoring Plan

### Critical Priority

#### C1. Extract Scrape+Classify Pipeline from `synthesis.rs`
- **Description:** Extract the duplicated scrape-batch + classify-batch + assign-category logic into a reusable async function. Create a `PipelineContext` struct to hold the shared tracking state. This would reduce `run_generation_inner` from ~650 to ~300 lines.
- **Files affected:** `backend/src/services/synthesis.rs`
- **Estimated effort:** Medium (1-2 days)
- **Rationale:** This is the single highest-impact change. The current 3x duplication means any bug fix or feature addition must be applied in 3 places. The `run_generation_inner` function is too long to review effectively.

#### C2. Extract Article Filter Pipeline
- **Description:** Create a `filter_candidate_url()` function that applies homepage check, cross-phase dedup, history dedup, and source diversity in one place.
- **Files affected:** `backend/src/services/synthesis.rs`
- **Estimated effort:** Small (2-4 hours)
- **Rationale:** The same filtering logic is duplicated 3 times with minor variations. A single function with configuration parameters eliminates this.
- **Dependencies:** Can be done independently or as part of C1.

### High Priority

#### H1. Eliminate `SettingsResponse` Struct Duplication
- **Description:** Use `#[serde(skip_serializing)]` on `user_id` and `updated_at` in `UserSettings`, and serialize it directly in the handler. Remove `SettingsResponse` and its `From` impl entirely.
- **Files affected:** `backend/src/models/settings.rs`, `backend/src/handlers/settings.rs`
- **Estimated effort:** Small (1-2 hours)
- **Rationale:** Eliminates one of the 4 duplicate field lists with zero risk.

#### H2. Introduce `PipelineContext` Struct
- **Description:** Group `article_scraped`, `source_counts`, `url_source`, `filled_counts`, `seen_urls`, and `max_total` into a `PipelineContext` struct with methods like `is_full()`, `add_article()`, `track_source()`.
- **Files affected:** `backend/src/services/synthesis.rs`
- **Estimated effort:** Medium (4-6 hours)
- **Rationale:** Reduces cognitive load when reading the pipeline. Makes the tracking state self-documenting. Enables methods like `ctx.add_article()` that encapsulate the category assignment logic.
- **Dependencies:** Best done alongside C1.

#### H3. Replace `trace_article` 11-Parameter Function with a Struct
- **Description:** Create `TraceEntry` struct and pass it to `trace_article()`. This is already partially done with `ArticleHistoryEntry` in `db/article_history.rs` but `trace_article()` constructs it from 11 individual parameters.
- **Files affected:** `backend/src/services/synthesis.rs`
- **Estimated effort:** Small (2-3 hours)
- **Rationale:** 11-parameter functions violate Rust idioms. A builder pattern or struct makes each call site clearer and less error-prone.

### Medium Priority

#### M1. Extract Shared LLM Error Mapping
- **Description:** Create `map_provider_error(status, provider_name, error_message, error_type) -> AppError` in `services/llm/mod.rs`. Each provider calls this with its parsed error details.
- **Files affected:** `backend/src/services/llm/mod.rs`, `gemini.rs`, `openai.rs`, `anthropic.rs`
- **Estimated effort:** Small (2-3 hours)
- **Rationale:** Eliminates 3x duplication of the same match-on-status-code pattern. Makes it easy to add consistent error handling for new providers.

#### M2. Break Up `Settings.tsx` into Section Components
- **Description:** Extract 6-7 section components (Theme, NumericFields, Categories, Provider, Brave, Advanced, ExportImport). The parent `Settings` component becomes a form container with save logic.
- **Files affected:** `frontend/src/pages/Settings.tsx`, new files in `frontend/src/components/settings/`
- **Estimated effort:** Medium (4-6 hours)
- **Rationale:** 1,025 lines is too large for a single component. Section extraction makes each section independently testable and easier to navigate.

#### M3. Extract `ModelListEditor` Component from `Providers.tsx`
- **Description:** Create a reusable `ModelListEditor` component that accepts `models`, `onChange`, `label`, and `onAddModel` props. Use it 4 times in `Providers.tsx`.
- **Files affected:** `frontend/src/pages/admin/Providers.tsx`, new `frontend/src/components/admin/ModelListEditor.tsx`
- **Estimated effort:** Small (2-3 hours)
- **Rationale:** Eliminates 4x JSX duplication of the model editing UI.

#### M4. Name Pipeline Progress Percentages as Constants
- **Description:** Define named constants for progress stages: `PROGRESS_INIT = 5`, `PROGRESS_SOURCES = 10`, `PROGRESS_PROVIDER = 12`, etc. Better yet, compute them from a stage list.
- **Files affected:** `backend/src/services/synthesis.rs`
- **Estimated effort:** Small (1-2 hours)
- **Rationale:** Progress percentages are currently magic numbers scattered through 650 lines. Named constants make them self-documenting and easier to adjust.

#### M5. Align Frontend/Backend Default Settings
- **Description:** The `DEFAULT_SETTINGS` in `types.ts` and `UserSettings::default()` in Rust have different category lists. Choose one as the source of truth. Ideally the backend defaults should be authoritative, and the frontend should fetch them on first load (which it already does -- the frontend defaults are only used before the first API call).
- **Files affected:** `frontend/src/types.ts`, `backend/src/models/settings.rs`
- **Estimated effort:** Small (1 hour)
- **Rationale:** Divergent defaults can cause subtle bugs when a user's first-load experience differs from what the backend creates.

### Low Priority

#### L1. Split `scraper.rs` Tests into a Separate File
- **Description:** Move the 670-line test module from `scraper.rs` to `scraper_tests.rs` or use `#[path]` to keep them co-located but in a separate file.
- **Files affected:** `backend/src/services/scraper.rs`
- **Estimated effort:** Small (30 minutes)
- **Rationale:** The file is large but most of it is tests. Moving tests out keeps the main file focused on production code.

#### L2. Use Builder Pattern for `PdfWriter`
- **Description:** The `PdfWriter::new()` takes 6 parameters (doc, page, layer, font_regular, font_bold, font_italic). A builder would be cleaner.
- **Files affected:** `backend/src/services/export.rs`
- **Estimated effort:** Small (1-2 hours)
- **Rationale:** Minor improvement. The current code works fine.

#### L3. Co-locate Frontend Types with API Modules
- **Description:** Move type definitions from `types.ts` to their respective API modules (e.g., `Source` to `api/sources.ts`, `Synthesis` to `api/syntheses.ts`). Keep shared types like `ApiError` in `types.ts`.
- **Files affected:** `frontend/src/types.ts`, multiple API files
- **Estimated effort:** Medium (2-3 hours)
- **Rationale:** Better co-location makes it easier to find and maintain types. However, the current approach works and 303 lines is still manageable.

#### L4. Improve Variable Names in Async Closures
- **Description:** In `synthesis.rs`, replace abbreviated variables in async move closures (`u`, `su`, `mad`, `jid`) with descriptive names (`url_to_scrape`, `source_url_ref`, `max_age`, `job_id_ref`). The abbreviations exist because async closures require cloning, but descriptive names are worth the extra characters.
- **Files affected:** `backend/src/services/synthesis.rs`
- **Estimated effort:** Small (1-2 hours)
- **Rationale:** Minor readability improvement in the most complex file.

---

## Appendix: File Size Summary (Top 15)

### Backend (Rust)
| File | Lines | Notes |
|------|-------|-------|
| `services/synthesis.rs` | 1,686 | ~650 lines pipeline + ~400 helpers + ~600 tests |
| `services/scraper.rs` | 1,242 | ~570 code + ~670 tests |
| `services/llm/anthropic.rs` | 500 | ~230 code + ~270 tests |
| `services/rate_limiter.rs` | 471 | Clean, well-structured |
| `services/export.rs` | 456 | PDF generation |
| `handlers/admin.rs` | 438 | Standard CRUD, clean |
| `models/settings.rs` | 433 | 4 structs with field duplication |
| `services/source_scraper.rs` | 428 | Link extraction from source pages |
| `services/llm/openai.rs` | 399 | ~190 code + ~200 tests |
| `services/prompts.rs` | 390 | Prompt templates + tests |

### Frontend (TypeScript/TSX)
| File | Lines | Notes |
|------|-------|-------|
| `pages/Settings.tsx` | 1,025 | Monolithic settings form |
| `pages/admin/Providers.tsx` | 854 | Provider CRUD with duplicated model UI |
| `pages/Sources.tsx` | 488 | Manageable |
| `pages/SynthesisDetail.tsx` | 476 | Clean component extraction |
| `i18n/fr.ts` | 394 | Translation strings |
| `pages/GenerateSynthesis.tsx` | 404 | Generation with SSE progress |
| `pages/ArticleHistory.tsx` | 321 | Table with filtering |
| `components/ApiKeyManager.tsx` | 313 | API key CRUD component |
| `types.ts` | 303 | All TypeScript interfaces |