# V2 Tech Lead Audit Report — AI Weekly Synth
**Date:** 2026-03-27
**Scope:** Full codebase (backend + frontend), complexity, duplication, readability, maintainability
---
## Executive Summary
The codebase is well-structured for a learning project and demonstrates solid engineering practices: clean error handling, SSRF protection, rate limiting, encryption at rest, and thorough test coverage for utility functions. However, organic growth has introduced one critical complexity hotspot (`synthesis.rs` at 2010 lines), significant frontend duplication between `Sources.tsx` and `ThemeManager.tsx`, and several patterns that will impede future development if not addressed.
**Priority:** 19 findings ranked P1 (do first) through P4 (nice to have); 14 of them are carried into the action plan in Section 8.
---
## 1. Complexity Hotspots
### 1.1 [P1] `backend/src/services/synthesis.rs` — 2010 lines, God Function
**File:** `/Users/oabrivard/Projects/rust/ai_synth/backend/src/services/synthesis.rs`
`run_generation_inner()` spans approximately **800 lines** (lines 246-1038). It handles initialization, source rotation, link extraction, article history filtering, preferred-source shuffling, batch scraping, LLM classification, date filtering, category assignment, Brave Search fallback, LLM web search fallback, final assembly, and database persistence — all in a single function.
**Specific issues:**
- **Deep nesting**: The wave loop (`'wave_loop`) contains a batch loop (`while !done`), which contains a JoinSet collection loop, which contains match arms with multiple `continue` branches. This is 4-5 levels of nesting.
- **Duplicated scrape+classify logic**: The Phase 1 scrape+classify block (lines 471-632) and the Brave Search scrape+classify block (lines 704-888) are near-identical. Both build a JoinSet, spawn scrape tasks, collect results, build another JoinSet for LLM classification, parse responses, check `is_article`, filter by date, handle no-date articles, and assign categories.
- **12 calls to `build_trace_entry()`** with the same boilerplate `ArticleTrace` struct construction scattered throughout.
- **7 flush-pending-traces blocks** (check `!pending_traces.is_empty()`, call `batch_insert_entries`, call `pending_traces.clear()`).
**Recommendation:** Extract into a pipeline module with distinct phases:
```
services/pipeline/mod.rs — orchestrator (run_generation_inner)
services/pipeline/phase1.rs — personalized source processing
services/pipeline/phase2.rs — web search fallback (Brave + LLM)
services/pipeline/classify.rs — shared scrape+classify batch logic
services/pipeline/tracing.rs — ArticleTrace builder + flush helper
services/pipeline/progress.rs — ProgressEvent + emit_progress
```
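As an illustration of what `tracing.rs` could hold, the seven flush-pending-traces blocks can collapse into one helper. This is a minimal sketch under stated assumptions: `flush_pending` is a hypothetical name, the insert step is modeled as a synchronous closure so the example needs no dependencies (the real `batch_insert_entries` call would be passed in, and is likely async), and keeping the buffer when the insert fails is a design choice made here, not confirmed behavior of the current code.

```rust
/// Hypothetical helper for `services/pipeline/tracing.rs`.
/// Sends all buffered entries through `insert`, then clears the buffer.
/// The buffer is left untouched if `insert` fails (an assumption of this sketch).
/// Returns the number of entries that were flushed.
pub fn flush_pending<T, E>(
    pending: &mut Vec<T>,
    insert: impl FnOnce(&[T]) -> Result<(), E>,
) -> Result<usize, E> {
    if pending.is_empty() {
        // Nothing buffered: skip the insert call entirely.
        return Ok(0);
    }
    insert(pending.as_slice())?;
    let flushed = pending.len();
    pending.clear();
    Ok(flushed)
}
```

Each call site then shrinks to a single `flush_pending(&mut pending_traces, ...)` line, and the empty-check, insert, and clear steps can no longer drift apart between phases.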
### 1.2 [P2] `backend/src/services/scraper.rs` — 1280 lines
**File:** `/Users/oabrivard/Projects/rust/ai_synth/backend/src/services/scraper.rs`
This file is reasonably well-organized but large. The 600+ lines of tests (starting at line 678) constitute nearly half the file. The SSRF validation, HTML parsing, date extraction, and soft-404 detection are logically distinct concerns.
**Recommendation:** Move tests to `backend/src/services/scraper/tests.rs` using a `#[cfg(test)] mod tests;` pattern. Consider splitting the file into `scraper/ssrf.rs`, `scraper/html.rs`, `scraper/dates.rs` if it continues to grow.
### 1.3 [P3] `frontend/src/pages/ThemeManager.tsx` — 935 lines, monolithic component
**File:** `/Users/oabrivard/Projects/rust/ai_synth/frontend/src/pages/ThemeManager.tsx`
This single component manages 20+ signals, handles theme CRUD, source CRUD, bulk import, CSV import/export, preferred sources, category editing, and a schedule sub-component. The render function alone (lines 429-931) is 500 lines of JSX.
**Recommendation:** Extract sub-components:
- `ThemeContentForm` — name, topic, categories, max age/items, summary length
- `ThemeSourceList` — source list, add, delete, preferred toggle
- `ThemeImportExport` — CSV and bulk import sections
---
## 2. Code Duplication
### 2.1 [P1] Sources.tsx and ThemeManager.tsx — ~80% duplicated source management logic
**Files:**
- `/Users/oabrivard/Projects/rust/ai_synth/frontend/src/pages/Sources.tsx` (481 lines)
- `/Users/oabrivard/Projects/rust/ai_synth/frontend/src/pages/ThemeManager.tsx` (935 lines)
Nearly every source-management function in `ThemeManager.tsx` is a copy-paste of `Sources.tsx` with minor adaptations (adding `theme_id` parameter):
- `handleAddSource` — identical validation logic, same error handling pattern
- `handleDeleteClick` / `performDelete` — identical two-click confirmation with timer
- `handleExportCsv` / `handleImportCsv` — identical
- `handleBulkImport` — identical line parsing, same semicolon splitting
The JSX for source list rendering (star toggle, delete button, link display) is also duplicated.
**Recommendation:** Extract a `SourceManager` component that accepts an optional `themeId` prop. Both pages delegate to it. The `normalizeUrl` and `isValidUrl` functions are already exported from `Sources.tsx` and imported by `ThemeManager.tsx` — this pattern should extend to the full source management UI.
### 2.2 [P1] Synthesis pipeline: duplicated scrape+classify blocks
As noted in 1.1, the Phase 1 and Brave Search paths in `synthesis.rs` duplicate approximately 120 lines of scrape-then-classify logic. The only differences are:
- Phase 1 tracks `source_url` per article; Brave does not
- Phase 1 uses `(String, String, String, String)` tuples; Brave uses `(String, String, String)`
**Recommendation:** Create a `scrape_and_classify_batch()` function parameterized by source type and optional source URL. This eliminates the duplication and makes adding future search backends (e.g., Google Search, Bing) trivial.
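One possible shape for that function is sketched below. Everything except the name `scrape_and_classify_batch` is hypothetical: the `Origin`, `Candidate`, and `Classified` types are invented for illustration, scraping and LLM classification are passed in as plain closures, and the JoinSet-based concurrency, date filtering, and trace recording of the real pipeline are left out. The point is only that an `Origin` enum can replace the two tuple shapes, with `source_url` becoming an `Option`.

```rust
/// Where a candidate article came from (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub enum Origin {
    /// Phase 1: link extracted from a user-configured source page.
    PersonalizedSource { source_url: String },
    /// Phase 2: result returned by Brave Search.
    BraveSearch,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub url: String,
    pub origin: Origin,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Classified {
    pub url: String,
    pub category: String,
    /// Present only for Phase 1 articles.
    pub source_url: Option<String>,
}

/// Scrapes each candidate, classifies the text, and keeps accepted articles.
/// `scrape` returns `None` on failure; `classify` returns `None` when the
/// page is not an article.
pub fn scrape_and_classify_batch<S, C>(
    candidates: &[Candidate],
    scrape: S,
    classify: C,
) -> Vec<Classified>
where
    S: Fn(&str) -> Option<String>,
    C: Fn(&str) -> Option<String>,
{
    candidates
        .iter()
        .filter_map(|c| {
            let text = scrape(&c.url)?;
            let category = classify(&text)?;
            let source_url = match &c.origin {
                Origin::PersonalizedSource { source_url } => Some(source_url.clone()),
                Origin::BraveSearch => None,
            };
            Some(Classified { url: c.url.clone(), category, source_url })
        })
        .collect()
}
```

A new search backend would then add one `Origin` variant rather than a third copy of the batch loop.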
### 2.3 [P2] Frontend error handling boilerplate — 40+ occurrences
The pattern `catch (err) { if (isApiError(err)) { setX(err.message) } else { setX(t('...')) } }` appears at least 40 times across 14 files. It is purely mechanical and can be reduced to a single call.
**Recommendation:** Create a `handleApiError(err, t, fallbackKey)` utility that returns the message to display:
```typescript
function handleApiError(err: unknown, t: TFunction, fallbackKey: string): string {
  return isApiError(err) ? err.message : t(fallbackKey);
}
```
### 2.4 [P3] Admin audit logging boilerplate
In `admin.rs`, every handler follows the same pattern: perform action, then call `db::audit::create_entry` with a `CreateAuditLog` struct. This is 5 occurrences, each ~15 lines.
**Recommendation:** Consider an audit middleware or macro that captures the action, target type, and details from the handler return value.
---
## 3. Readability
### 3.1 [P2] French/English mixing in backend code
User-facing strings in `synthesis.rs` and `prompts.rs` are hardcoded in French:
- Progress messages: `"Chargement des parametres..."`, `"Analyse des sources personnalisees..."`
- Error messages: `"Aucun article valide trouve. Verifiez vos sources et categories."`
- Prompt text: entire system/user prompts in French
Meanwhile, code comments, doc strings, log messages, and error variants are in English. This inconsistency makes it harder for non-French speakers to contribute and prevents future i18n.
**Recommendation:** Move all user-facing strings to constants or a backend i18n module. Keep code, comments, and logs in English.
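A minimal sketch of such a module follows, assuming a small message enum and a two-locale lookup. The `UserMessage` and `Locale` names are hypothetical, the French strings are copied from the examples above, and the English strings are translations added here for illustration only.

```rust
/// User-facing pipeline messages (illustrative subset, hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UserMessage {
    LoadingSettings,
    AnalyzingSources,
    NoValidArticle,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Locale {
    Fr,
    En,
}

impl UserMessage {
    /// Returns the display text for this message in the given locale.
    pub fn text(self, locale: Locale) -> &'static str {
        match (self, locale) {
            (UserMessage::LoadingSettings, Locale::Fr) => "Chargement des parametres...",
            (UserMessage::LoadingSettings, Locale::En) => "Loading settings...",
            (UserMessage::AnalyzingSources, Locale::Fr) => {
                "Analyse des sources personnalisees..."
            }
            (UserMessage::AnalyzingSources, Locale::En) => "Analyzing personalized sources...",
            (UserMessage::NoValidArticle, Locale::Fr) => {
                "Aucun article valide trouve. Verifiez vos sources et categories."
            }
            (UserMessage::NoValidArticle, Locale::En) => {
                "No valid article found. Check your sources and categories."
            }
        }
    }
}
```

Because the `match` is exhaustive, adding a message or a locale without supplying every translation becomes a compile error rather than a missing string at runtime.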
### 3.2 [P3] `#[allow(clippy::too_many_arguments)]` — 3 occurrences
**Files:** `synthesis.rs`, `prompts.rs`, `llm_call_log.rs`
These suppressions mark functions whose parameter count exceeds Clippy's `too_many_arguments` threshold (more than 7 parameters by default). They are a code smell: the parameters should be grouped into structs.
- `build_search_prompt` takes 9 parameters
- `log_llm_call` takes 10 parameters
- `insert` in `llm_call_log.rs` takes 10 parameters
**Recommendation:** Introduce parameter structs:
```rust
struct SearchPromptParams<'a> {
    theme: &'a str,
    categories: &'a [String],
    max_items_per_category: i32,
    // ...
}
```
### 3.3 [P4] Magic strings for category keys
Category keys like `"category_0"`, `"category_autre"`, `"category_no_date"` are used as HashMap keys throughout `synthesis.rs` and in `schema.rs`. These appear as raw string literals in ~15 places.
**Recommendation:** Define constants or an enum:
```rust
const CATEGORY_OTHER: &str = "category_autre";
const CATEGORY_NO_DATE: &str = "category_no_date";
fn category_key(index: usize) -> String { format!("category_{}", index) }
```
---
## 4. Maintainability Risks
### 4.1 [P2] Tight coupling between synthesis pipeline and database
`run_generation_inner()` directly calls `db::settings::get_or_create_default`, `db::themes::get_by_id`, `db::sources::list_for_user`, `db::article_history::*`, `db::llm_call_log::insert`, `db::syntheses::create`, and a raw `sqlx::query_scalar` (lines 1419-1429, in `resolve_model`). The function takes `AppState`, which bundles the database pool, HTTP client, job store, and rate limiters.
**Impact:** Unit testing the pipeline logic requires either a real Postgres database or a complete mock of `AppState`. The existing E2E tests use a mock LLM provider (good) but still need Postgres (expensive).
**Recommendation:** Introduce a `PipelineContext` trait or struct that abstracts data access. This would allow testing the orchestration logic with in-memory implementations.
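The trait variant could look roughly like the sketch below. Only the name `PipelineContext` comes from the recommendation; the two methods, the `u64` user id, `InMemoryContext`, and `filter_new_links` are assumptions chosen to keep the example small and synchronous. The production implementation would wrap the existing `db::` functions and would most likely be async.

```rust
use std::collections::{HashMap, HashSet};

/// Data access needed by the pipeline (hypothetical method set).
pub trait PipelineContext {
    /// Source URLs configured by the user.
    fn source_urls(&self, user_id: u64) -> Vec<String>;
    /// Whether the article was already delivered to this user.
    fn is_in_history(&self, user_id: u64, article_url: &str) -> bool;
}

/// In-memory implementation for unit tests; no Postgres required.
#[derive(Default)]
pub struct InMemoryContext {
    pub sources: HashMap<u64, Vec<String>>,
    pub history: HashSet<(u64, String)>,
}

impl PipelineContext for InMemoryContext {
    fn source_urls(&self, user_id: u64) -> Vec<String> {
        self.sources.get(&user_id).cloned().unwrap_or_default()
    }

    fn is_in_history(&self, user_id: u64, article_url: &str) -> bool {
        self.history.contains(&(user_id, article_url.to_string()))
    }
}

/// Example of orchestration logic that depends only on the trait:
/// drops links the user has already seen, preserving order.
pub fn filter_new_links(
    ctx: &dyn PipelineContext,
    user_id: u64,
    links: &[String],
) -> Vec<String> {
    links
        .iter()
        .filter(|link| !ctx.is_in_history(user_id, link.as_str()))
        .cloned()
        .collect()
}
```

Pipeline steps written against `&dyn PipelineContext` can be exercised in plain unit tests by filling an `InMemoryContext`, leaving the Postgres-backed E2E tests to cover the real queries.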
### 4.2 [P2] Raw SQL inline in `resolve_model()`
**File:** `/Users/oabrivard/Projects/rust/ai_synth/backend/src/services/synthesis.rs`, lines 1419-1429
```rust
let model = sqlx::query_scalar::<_, String>(
r#"SELECT m->>'model_id' FROM admin_providers, ..."#,
)
```
This is the only place in the service layer that contains raw SQL. All other queries go through the `db/` module, maintaining a clean separation. This breaks the pattern.
**Recommendation:** Move to `db::providers::get_default_scraping_model(pool, provider_name)`.
### 4.3 [P3] LLM provider implementations share identical HTTP error handling
Each of the three providers (`gemini.rs`, `openai.rs`, `anthropic.rs`) implements the same pattern:
1. Build request body (provider-specific)
2. Send HTTP request
3. Map network errors with `is_timeout()` / `is_connect()` classification
4. Parse response JSON
5. Check HTTP status and map errors
6. Extract content from provider-specific response structure
Steps 2-4 are identical across all three providers (~20 lines each). Only steps 1, 5, and 6 differ.
**Recommendation:** Extract a `send_llm_request()` helper in `llm/mod.rs`:
```rust
async fn send_llm_request(
    client: &reqwest::Client,
    url: &str,
    body: &Value,
    headers: &[(String, String)],
    provider_name: &str,
) -> Result<(u16, Value), AppError>
```
### 4.4 [P3] `Providers.tsx` — 854 lines, complex inline state management
**File:** `/Users/oabrivard/Projects/rust/ai_synth/frontend/src/pages/admin/Providers.tsx`
The admin Providers page manages local editable copies of provider state in a `Record<string, ProviderFormState>` map, with functions for model array manipulation (add, remove, toggle default), scraping vs. websearch model lists, and inline validation. This is the most complex admin page and would benefit from splitting the model list editor into a reusable `ModelListEditor` component.
### 4.5 [P4] `Settings.tsx` — 694 lines, growing form complexity
**File:** `/Users/oabrivard/Projects/rust/ai_synth/frontend/src/pages/Settings.tsx`
The settings page has already been partially decomposed (`SettingsBraveSearch`, `SettingsRateLimit`, `ApiKeyManager`), which is good. The remaining monolithic JSX sections (provider selection, model dropdowns, import/export) could follow the same pattern for consistency.
---
## 5. Simplification Opportunities
### 5.1 [P3] `Sources.tsx` may be dead code
With the introduction of `ThemeManager.tsx`, which subsumes all source management under themes, the standalone `Sources.tsx` page may no longer be reachable by users. It is still registered in the router, but if all sources must now belong to a theme, the standalone page serves no purpose.
**Action:** Verify whether `Sources.tsx` is still linked in the navigation. If not, remove it and its route to eliminate 481 lines of duplicated code.
### 5.2 [P3] `list_for_user` query branch duplication
**File:** `/Users/oabrivard/Projects/rust/ai_synth/backend/src/db/sources.rs`, lines 15-44
The function has two nearly identical SQL queries: one with `AND theme_id = $2` and one without. The only difference is that optional `theme_id` filter.
**Recommendation:** Use a single query with a conditional clause:
```rust
sqlx::query_as::<_, Source>(
"SELECT ... FROM sources WHERE user_id = $1 AND ($2::uuid IS NULL OR theme_id = $2) ORDER BY ..."
)
.bind(user_id)
.bind(theme_id)
```
### 5.3 [P4] `bulk_create` uses sequential inserts instead of batch
**File:** `/Users/oabrivard/Projects/rust/ai_synth/backend/src/db/sources.rs`, lines 97-127
Sources are inserted one by one in a loop. For bulk imports of 50-100 sources, this generates 50-100 round-trips to the database.
**Recommendation:** Build a single multi-row `INSERT ... VALUES ($1, $2), ($3, $4), ...` statement, for example with `sqlx::QueryBuilder::push_values`, so the whole import is one round-trip. This is a performance optimization, not a correctness issue.
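For the hand-built variant, the placeholder list can be generated with a small pure function. The sketch below is dependency-free and makes assumptions that are not taken from the codebase: the helper names are hypothetical, and the column list `(user_id, url)` stands in for whatever columns `bulk_create` actually writes.

```rust
/// Builds the `($1, $2), ($3, $4), ...` placeholder list for `rows` rows
/// of `cols` bind parameters each, numbered the way Postgres expects.
pub fn values_placeholders(rows: usize, cols: usize) -> String {
    (0..rows)
        .map(|r| {
            let params: Vec<String> = (1..=cols)
                .map(|c| format!("${}", r * cols + c))
                .collect();
            format!("({})", params.join(", "))
        })
        .collect::<Vec<_>>()
        .join(", ")
}

/// Returns the full statement, or `None` when there is nothing to insert.
/// The column names are placeholders for illustration.
pub fn bulk_insert_sql(rows: usize) -> Option<String> {
    if rows == 0 {
        return None;
    }
    Some(format!(
        "INSERT INTO sources (user_id, url) VALUES {}",
        values_placeholders(rows, 2)
    ))
}
```

For example, `values_placeholders(2, 2)` yields `($1, $2), ($3, $4)`. Very large imports would still need chunking, since a single Postgres statement accepts at most 65535 bind parameters.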
### 5.4 [P4] Hardcoded snippet sizes
In `synthesis.rs`, the snippet size is computed from `summary_length` in two separate places (Phase 1 at line 437 and Brave at lines 766-770):
```rust
let snippet_size = match theme.summary_length { 1 => 500, 2 => 2000, _ => 4000 };
```
**Recommendation:** Extract to a function `fn snippet_size_for_length(summary_length: i32) -> usize`.
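A minimal version of that function, using the three values from the existing `match`, might look like this. The constant names are hypothetical; the mapping itself is unchanged.

```rust
// Snippet sizes per summary length; names are illustrative.
const SNIPPET_SHORT: usize = 500;
const SNIPPET_MEDIUM: usize = 2000;
const SNIPPET_LONG: usize = 4000;

/// Maps a theme's `summary_length` setting to the snippet size
/// passed to the LLM. Any value other than 1 or 2 gets the long size,
/// matching the current inline `match`.
pub fn snippet_size_for_length(summary_length: i32) -> usize {
    match summary_length {
        1 => SNIPPET_SHORT,
        2 => SNIPPET_MEDIUM,
        _ => SNIPPET_LONG,
    }
}
```

Both the Phase 1 and Brave call sites would then read `snippet_size_for_length(theme.summary_length)`, so a future change to the sizes happens in one place.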
---
## 6. File Size Summary
### Backend (top 10 by line count)
| File | Lines | Assessment |
|------|-------|------------|
| `services/synthesis.rs` | 2010 | Needs decomposition (P1) |
| `services/scraper.rs` | 1280 | Acceptable, extract tests |
| `services/rate_limiter.rs` | 471 | Clean |
| `services/llm/anthropic.rs` | 471 | Minor shared-code opportunity |
| `services/export.rs` | 459 | Clean |
| `handlers/admin.rs` | 438 | Audit boilerplate |
| `models/synthesis.rs` | 416 | Clean |
| `services/email.rs` | 384 | Clean |
| `handlers/auth.rs` | 381 | Clean |
| `services/llm/openai.rs` | 373 | Minor shared-code opportunity |
### Frontend (top 10 by line count)
| File | Lines | Assessment |
|------|-------|------------|
| `pages/ThemeManager.tsx` | 935 | Needs decomposition (P1/P3) |
| `pages/admin/Providers.tsx` | 854 | Extract ModelListEditor (P3) |
| `pages/Settings.tsx` | 694 | Partially decomposed, continue (P4) |
| `pages/SynthesisDetail.tsx` | 548 | Acceptable |
| `pages/Sources.tsx` | 481 | Possibly dead code (P3) |
| `pages/GenerateSynthesis.tsx` | 471 | Clean |
| `i18n/fr.ts` | 462 | Expected size for translations |
| `pages/ArticleHistory.tsx` | 371 | Clean |
| `pages/Home.tsx` | 345 | Clean |
| `components/settings/SettingsSchedule.tsx` | 286 | Clean |
---
## 7. Positive Observations
These aspects of the codebase are well-executed and should be preserved:
1. **Error handling**: `AppError` enum with `IntoResponse` is clean, consistent, and hides internal details. Tests verify that secrets are never leaked.
2. **Security**: SSRF prevention with DNS resolution checks, AES-256-GCM encryption for API keys, CSRF via `X-Requested-With`, timing-attack mitigation in auth, and sensitive data scrubbing in error messages.
3. **LLM provider abstraction**: The `LlmProvider` trait + factory pattern makes adding new providers straightforward.
4. **Documentation**: Module-level `//!` doc comments on every file, function-level `///` doc comments with examples, and clear CLAUDE.md project instructions.
5. **Frontend component extraction**: `SettingsBraveSearch`, `SettingsRateLimit`, `SettingsSchedule`, and `ApiKeyManager` demonstrate good instincts for decomposition.
6. **Type safety**: Frontend `types.ts` is clean, well-organized, and provides `isApiError` type guard.
7. **Test coverage**: Unit tests for error handling, SSRF checks, URL normalization, job store, role validation, and CSV parsing.
---
## 8. Prioritized Action Plan
| Priority | Item | Effort | Impact |
|----------|------|--------|--------|
| **P1** | Decompose `synthesis.rs` into pipeline module (1.1) | Large | Reduces complexity, enables testing |
| **P1** | Extract shared `SourceManager` component (2.1) | Medium | Eliminates ~300 lines of duplication |
| **P1** | Extract shared scrape+classify function (2.2) | Medium | Eliminates ~120 lines of duplication |
| **P2** | Move hardcoded French strings to constants (3.1) | Medium | Enables future i18n, improves consistency |
| **P2** | Frontend error-handling helper (2.3) | Small | Reduces boilerplate in 14 files |
| **P2** | Abstract data access from pipeline (4.1) | Large | Enables unit testing without Postgres |
| **P2** | Move inline SQL from `resolve_model` to db module (4.2) | Small | Maintains architecture consistency |
| **P2** | Extract scraper tests to separate file (1.2) | Small | Improves file navigation |
| **P3** | Decompose `ThemeManager.tsx` into sub-components (1.3) | Medium | Improves readability |
| **P3** | Introduce parameter structs for long signatures (3.2) | Small | Removes clippy suppressions |
| **P3** | Audit whether `Sources.tsx` is dead code (5.1) | Small | Potential -481 lines |
| **P3** | Consolidate LLM HTTP request handling (4.3) | Medium | Reduces duplication across 3 files |
| **P4** | Define category key constants (3.3) | Small | Prevents typo bugs |
| **P4** | Batch insert for `bulk_create` (5.3) | Small | Performance improvement |