ai_synth/docs/technical_specs.md

# AI Weekly Synth -- Technical Specifications

## 1. Backend Tech Stack

| Dependency | Version | Purpose |
|---|---|---|
| axum | 0.8 | Web framework (macros, multipart) |
| tokio | 1 | Async runtime (full features) |
| tower | 0.5 | Middleware composition |
| tower-http | 0.6 | CORS, static files, tracing, headers |
| sqlx | 0.8 | Async Postgres driver (runtime-tokio, tls-rustls, uuid, chrono, json, migrate) |
| reqwest | 0.12 | HTTP client (JSON) |
| serde / serde_json | 1 | Serialization/deserialization |
| chrono | 0.4 | Date/time handling (serde feature) |
| aes-gcm | 0.10 | AES-256-GCM encryption |
| zeroize | 1 | Secure memory zeroing |
| sha2 | 0.10 | SHA-256 hashing |
| rand | 0.8 | Random number generation |
| base64 | 0.22 | Base64 encoding |
| hex | 0.4 | Hex encoding/decoding |
| async-trait | 0.1 | Async trait objects |
| tracing / tracing-subscriber | 0.1 / 0.3 | Structured logging (env-filter, json) |
| dotenvy | 0.15 | .env file loading |
| clap | 4 | CLI argument parsing |
| scraper | 0.22 | HTML parsing (CSS selectors) |
| ego-tree | 0.10 | Tree data structure (used by scraper) |
| url | 2 | URL parsing and validation |
| email_address | 0.2 | Email validation |
| anyhow | 1 | Error context |
| thiserror | 2 | Error type derivation |
| uuid | 1 | UUID v4 generation (serde feature) |
| dashmap | 6 | Concurrent hash maps |
| tokio-stream | 0.1 | Stream utilities for SSE |
| futures | 0.3 | Async stream combinators |
| printpdf | 0.7 | PDF generation |

**Dev dependencies**: tower (util), http-body-util, wiremock 0.6.

**Rust edition**: 2021.

---

## 2. Frontend Tech Stack

| Dependency | Version | Purpose |
|---|---|---|
| solid-js | ^1.9.0 | Reactive UI framework |
| @solidjs/router | ^0.15.0 | Client-side routing |
| lucide-solid | ^0.475.0 | Icon library |
| date-fns | ^4.1.0 | Date formatting |
| tailwindcss | ^4.1.0 | Utility-first CSS (v4) |
| @tailwindcss/vite | ^4.1.0 | Tailwind Vite plugin |
| vite | ^6.2.0 | Build tool and dev server |
| vite-plugin-solid | ^2.11.0 | SolidJS Vite integration |
| typescript | ~5.8.0 | Type checking |
| vitest | ^3.0.0 | Unit testing |
| @solidjs/testing-library | ^0.8.0 | Component testing |
| jsdom | ^25.0.0 | DOM environment for tests |

### Frontend Routes

| Path | Component | Auth | Description |
|---|---|---|---|
| /login | Login | Public | Login page |
| /register | Register | Public | Registration page |
| /auth/verify | AuthVerify | Public | Magic link verification |
| / | Home | Protected | Dashboard / synthesis list |
| /settings | Settings | Protected | User settings |
| /themes | ThemeManager | Protected | Theme CRUD + source management |
| /generate | GenerateSynthesis | Protected | Generation trigger + progress |
| /synthesis/:id | SynthesisDetail | Protected | Full synthesis view |
| /article-history | ArticleHistory | Protected | Article history browser |
| /llm-logs/:jobId | LlmLogs | Protected | LLM call log viewer |
| /admin/providers | AdminProviders | Admin | Provider configuration |
| /admin/rate-limits | AdminRateLimits | Admin | Rate limit configuration |
| /admin/users | AdminUsers | Admin | User management |

---

## 3. Database Schema

### 3.1 `users`

| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| email | TEXT | NOT NULL, UNIQUE |
| display_name | TEXT | nullable |
| role | TEXT | NOT NULL, DEFAULT 'user', CHECK (role IN ('user', 'admin')) |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |

Indexes: `idx_users_email` on (email).

### 3.2 `sessions`

| Column | Type | Constraints |
|---|---|---|
| session_hash | TEXT | PK (SHA-256 of raw token) |
| user_id | UUID | NOT NULL, FK users(id) CASCADE |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
| expires_at | TIMESTAMPTZ | NOT NULL |
| last_active_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
| ip_address | TEXT | nullable |
| user_agent | TEXT | nullable |

Indexes: `idx_sessions_user_id`, `idx_sessions_expires_at`.

### 3.3 `magic_tokens`

| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| email | TEXT | NOT NULL |
| token_hash | TEXT | NOT NULL, UNIQUE |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
| expires_at | TIMESTAMPTZ | NOT NULL |
| used | BOOLEAN | NOT NULL, DEFAULT false |

Indexes: `idx_magic_tokens_email`, `idx_magic_tokens_expires`.

### 3.4 `settings`

Per-user pipeline configuration. One row per user (user_id is the PK).

| Column | Type | Constraints |
|---|---|---|
| user_id | UUID | PK, FK users(id) CASCADE |
| max_articles_per_source | INTEGER | NOT NULL, DEFAULT 3 |
| max_links_per_source | INTEGER | NOT NULL, DEFAULT 8 |
| use_brave_search | BOOLEAN | NOT NULL, DEFAULT false |
| article_history_days | INTEGER | NOT NULL, DEFAULT 90 |
| batch_size | INTEGER | NOT NULL, DEFAULT 5 |
| source_extraction_window | INTEGER | NOT NULL, DEFAULT 3 |
| search_agent_behavior | TEXT | NOT NULL, DEFAULT '' |
| ai_provider | TEXT | NOT NULL, DEFAULT '' |
| ai_model | TEXT | NOT NULL, DEFAULT '' |
| ai_model_websearch | TEXT | NOT NULL, DEFAULT '' |
| rate_limit_max_requests | INTEGER | nullable |
| rate_limit_time_window_seconds | INTEGER | nullable |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |

### 3.5 `themes`

Per-user topic configurations with content settings.

| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| user_id | UUID | NOT NULL, FK users(id) CASCADE |
| name | TEXT | NOT NULL |
| theme | TEXT | NOT NULL (search topic) |
| categories | JSONB | NOT NULL, DEFAULT '[]' |
| max_items_per_category | INTEGER | NOT NULL, DEFAULT 4 |
| max_age_days | INTEGER | NOT NULL, DEFAULT 7 |
| summary_length | INTEGER | NOT NULL, DEFAULT 3 |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |

Indexes: `idx_themes_user_id`.

### 3.6 `sources`

User-curated news source URLs, optionally tied to a theme.

| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| user_id | UUID | NOT NULL, FK users(id) CASCADE |
| title | VARCHAR(200) | NOT NULL, CHECK length 1-200 |
| url | VARCHAR(1000) | NOT NULL, CHECK length <= 1000 |
| theme_id | UUID | nullable, FK themes(id) CASCADE |
| is_preferred | BOOLEAN | NOT NULL, DEFAULT false |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |

Indexes: `idx_sources_user_id`, UNIQUE `idx_sources_user_id_url` on (user_id, url).

### 3.7 `syntheses`

Generated synthesis results with JSONB section data.

| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| user_id | UUID | NOT NULL, FK users(id) CASCADE |
| week | VARCHAR(10) | NOT NULL (ISO week string) |
| sections | JSONB | NOT NULL, DEFAULT '[]' |
| status | VARCHAR(20) | NOT NULL, DEFAULT 'completed' |
| job_id | UUID | nullable |
| theme_id | UUID | nullable, FK themes(id) SET NULL |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |

Indexes: `idx_syntheses_user_id_created_at` on (user_id, created_at DESC).

JSONB structure for `sections`:
```json
[
  {
    "title": "Category Name",
    "items": [
      { "title": "Article Title", "url": "https://...", "summary": "...", "date": "2026-03-25" }
    ]
  }
]
```

### 3.8 `theme_schedules`

Automated generation schedules, one per theme.

| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| theme_id | UUID | NOT NULL, UNIQUE, FK themes(id) CASCADE |
| user_id | UUID | NOT NULL, FK users(id) CASCADE |
| enabled | BOOLEAN | NOT NULL, DEFAULT true |
| days | JSONB | NOT NULL, DEFAULT '[]' (e.g. ["mon","fri"]) |
| time_utc | TEXT | NOT NULL, DEFAULT '08:00' (HH:MM) |
| emails | JSONB | NOT NULL, DEFAULT '[]' (up to 3 addresses) |
| last_run_at | TIMESTAMPTZ | nullable |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |

Indexes: `idx_theme_schedules_enabled` (partial, WHERE enabled = true).

### 3.9 `article_history`

Article URL deduplication and full provenance tracing.

| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| user_id | UUID | NOT NULL, FK users(id) CASCADE |
| url_hash | TEXT | NOT NULL (SHA-256 of normalized URL) |
| url | TEXT | NOT NULL |
| title | TEXT | NOT NULL, DEFAULT '' |
| source_type | TEXT | NOT NULL, DEFAULT 'unknown' |
| source_url | TEXT | nullable |
| category | TEXT | nullable |
| synthesis_id | UUID | nullable, FK syntheses(id) SET NULL |
| status | TEXT | NOT NULL, DEFAULT 'used' |
| scraped_ok | BOOLEAN | NOT NULL, DEFAULT true |
| job_id | UUID | NOT NULL |
| published_date | TEXT | nullable |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |

Indexes: `idx_article_history_user_url` on (user_id, url_hash), `idx_article_history_job_id`.

Status values: `used`, `filtered_history`, `filtered_diversity`, `filtered_not_article`, `filtered_too_old`, `filtered_empty`, `filtered_homepage`, `filtered_cross_phase_dedup`.

Source type values: `personalized_source`, `brave_search`, `web_search`.

### 3.10 `llm_call_log`

Full LLM interaction logging for debugging and analysis.

| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| user_id | UUID | NOT NULL, FK users(id) CASCADE |
| job_id | UUID | NOT NULL |
| call_type | TEXT | NOT NULL |
| model | TEXT | NOT NULL |
| system_prompt | TEXT | NOT NULL, DEFAULT '' |
| user_prompt | TEXT | NOT NULL, DEFAULT '' |
| response_body | TEXT | NOT NULL, DEFAULT '' |
| duration_ms | INTEGER | NOT NULL, DEFAULT 0 |
| article_url | TEXT | nullable |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |

Indexes: `idx_llm_call_log_job_id`, `idx_llm_call_log_user_id` on (user_id, created_at).

### 3.11 `admin_providers`

Admin-curated catalog of LLM providers and their models.

| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| provider_name | VARCHAR(50) | NOT NULL, UNIQUE |
| display_name | VARCHAR(100) | NOT NULL |
| models_scraping | JSONB | NOT NULL, DEFAULT '[]' |
| models_websearch | JSONB | NOT NULL, DEFAULT '[]' |
| is_enabled | BOOLEAN | NOT NULL, DEFAULT true |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |

Indexes: `idx_admin_providers_enabled` (partial, WHERE is_enabled = true).

Seeded with: gemini, openai, anthropic.

JSONB model structure:
```json
[{"model_id": "gemini-2.5-pro", "display_name": "Gemini 2.5 Pro", "is_default": true}]
```

### 3.12 `admin_rate_limits`

Per-provider rate limit configuration.

| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| provider_name | VARCHAR(50) | NOT NULL, UNIQUE, FK admin_providers(provider_name) CASCADE |
| max_requests | INTEGER | NOT NULL, DEFAULT 30 |
| time_window_seconds | INTEGER | NOT NULL, DEFAULT 60 |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |

Seeded defaults: gemini 29/60s, openai 50/60s, anthropic 40/60s.

### 3.13 `user_api_keys`

Encrypted user LLM API keys.

| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| user_id | UUID | NOT NULL, FK users(id) CASCADE |
| provider_name | VARCHAR(50) | NOT NULL |
| encrypted_key | BYTEA | NOT NULL |
| nonce | BYTEA | NOT NULL |
| key_prefix | VARCHAR(20) | NOT NULL |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |

Constraint: UNIQUE(user_id, provider_name). Valid providers: gemini, openai, anthropic, brave_search.

### 3.14 `audit_log`

Admin mutation audit trail.

| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| admin_user_id | UUID | nullable, FK users(id) SET NULL |
| action | VARCHAR(100) | NOT NULL |
| target_type | VARCHAR(50) | nullable |
| target_id | VARCHAR(255) | nullable |
| details | JSONB | nullable |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |

Indexes: `idx_audit_log_created_at` (DESC), `idx_audit_log_admin_user`.

---

## 4. API Endpoints

All endpoints are prefixed with `/api/v1`. Responses are JSON. Errors follow the shape `{ "error": "message" }`.

### 4.1 Authentication

**POST /auth/register**
- Auth: Public
- Body: `{ email: string, display_name?: string, turnstile_token: string }`
- Response: `{ message: string }`
- Sends magic link email. Rate limited.

**POST /auth/login**
- Auth: Public
- Body: `{ email: string, turnstile_token: string }`
- Response: `{ message: string }`
- Sends magic link email. Rate limited.

**GET /auth/verify?token=...&email=...**
- Auth: Public
- Response: Redirect to frontend with session cookie set.

**POST /auth/verify**
- Auth: Public
- Body: `{ token: string, email: string }`
- Response: `{ message: string, user: User }`
- Sets `session` HttpOnly cookie (30-day expiry).

**POST /auth/logout**
- Auth: Authenticated
- Response: `{ message: string }`
- Clears session cookie and deletes DB session.

**GET /auth/me**
- Auth: Authenticated
- Response: `{ id, email, display_name, role, created_at }`

### 4.2 Settings

**GET /settings**
- Auth: Authenticated
- Response: `UserSettings` (creates defaults if not exists)

**PUT /settings**
- Auth: Authenticated
- Body: `UpdateSettingsRequest` (all fields required)
- Validation: max_articles_per_source 1-10, max_links_per_source 1-30, batch_size 1-20, source_extraction_window 1-10, article_history_days 0-365, search_agent_behavior max 2000 chars, ai_provider/ai_model/ai_model_websearch max 100 chars.
- Response: Updated `UserSettings`
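
The range checks listed above could be expressed as a small table-driven validator. This is a minimal sketch, not the project's actual code: `SettingsInput` and `validate_settings` are hypothetical names, only a subset of the fields is shown, and counting characters (rather than bytes) for the 2000-character limit is an assumption.

```rust
/// Hypothetical mirror of some fields of `UpdateSettingsRequest`.
pub struct SettingsInput {
    pub max_articles_per_source: i32,
    pub max_links_per_source: i32,
    pub batch_size: i32,
    pub source_extraction_window: i32,
    pub article_history_days: i32,
    pub search_agent_behavior: String,
}

/// Returns the name of the first field that violates its documented range.
pub fn validate_settings(s: &SettingsInput) -> Result<(), &'static str> {
    // (field name, value, inclusive minimum, inclusive maximum)
    let ranges = [
        ("max_articles_per_source", s.max_articles_per_source, 1, 10),
        ("max_links_per_source", s.max_links_per_source, 1, 30),
        ("batch_size", s.batch_size, 1, 20),
        ("source_extraction_window", s.source_extraction_window, 1, 10),
        ("article_history_days", s.article_history_days, 0, 365),
    ];
    for &(name, value, min, max) in &ranges {
        if value < min || value > max {
            return Err(name);
        }
    }
    // Assumption: "max 2000 chars" counts Unicode scalar values, not bytes.
    if s.search_agent_behavior.chars().count() > 2000 {
        return Err("search_agent_behavior");
    }
    Ok(())
}
```

Keeping the ranges in one array makes it easy to compare them against the User Settings Model table in section 8.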

### 4.3 Themes

**GET /themes**
- Auth: Authenticated
- Response: `ThemeResponse[]`

**POST /themes**
- Auth: Authenticated
- Body: `{ name, theme, categories: string[], max_items_per_category?, max_age_days?, summary_length? }`
- Validation: name non-empty max 200 chars, categories 1-20 non-empty entries, max_items 1-50, max_age 1-365, summary_length 1-3.
- Response: `ThemeResponse`

**PUT /themes/{id}**
- Auth: Authenticated (owner only)
- Body: `UpdateThemeRequest` (all fields optional)
- Response: `ThemeResponse`

**DELETE /themes/{id}**
- Auth: Authenticated (owner only)
- Response: 204 No Content

### 4.4 Schedules

**GET /themes/{id}/schedule**
- Auth: Authenticated (theme owner)
- Response: `ScheduleResponse` or 404

**PUT /themes/{id}/schedule**
- Auth: Authenticated (theme owner)
- Body: `{ enabled, days: string[], time_utc: "HH:MM", emails: string[] }`
- Validation: days from mon-sun, time HH:MM format, max 3 emails.
- Response: `ScheduleResponse`
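
The schedule validation rules above might look like the following sketch. The function names are hypothetical, the intermediate day codes (`tue`, `wed`, `thu`, `sat`) are assumed from the "mon-sun" wording, and per-address email syntax checks (handled elsewhere by the `email_address` crate) are left out.

```rust
/// Assumed day codes; the spec only shows "mon" and "fri" explicitly.
const DAY_CODES: [&str; 7] = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"];

/// Checks that `s` is a zero-padded 24-hour `HH:MM` string.
pub fn is_valid_time_utc(s: &str) -> bool {
    let b = s.as_bytes();
    if b.len() != 5 || b[2] != b':' {
        return false;
    }
    let digits = [b[0], b[1], b[3], b[4]];
    if !digits.iter().all(|d| d.is_ascii_digit()) {
        return false;
    }
    let hour = (b[0] - b'0') * 10 + (b[1] - b'0');
    let minute = (b[3] - b'0') * 10 + (b[4] - b'0');
    hour < 24 && minute < 60
}

/// Applies the three documented rules: known day codes, HH:MM time, at most 3 emails.
pub fn validate_schedule(days: &[&str], time_utc: &str, emails: &[&str]) -> Result<(), String> {
    if let Some(bad) = days.iter().find(|d| !DAY_CODES.contains(*d)) {
        return Err(format!("unknown day code: {}", bad));
    }
    if !is_valid_time_utc(time_utc) {
        return Err("time_utc must use the HH:MM format".to_string());
    }
    if emails.len() > 3 {
        return Err("at most 3 email addresses are allowed".to_string());
    }
    Ok(())
}
```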

**DELETE /themes/{id}/schedule**
- Auth: Authenticated (theme owner)
- Response: 204 No Content

### 4.5 Sources

**GET /sources?theme_id=...**
- Auth: Authenticated
- Response: `SourceResponse[]`

**POST /sources**
- Auth: Authenticated
- Body: `{ title, url, theme_id? }`
- Validation: title non-empty max 200, URL http(s) max 1000 chars.
- Response: `SourceResponse`

**PUT /sources/preferred**
- Auth: Authenticated
- Body: `{ source_ids: UUID[] }`
- Response: `{ updated: number }`

**DELETE /sources/{id}**
- Auth: Authenticated (owner only)
- Response: 204 No Content

**POST /sources/bulk**
- Auth: Authenticated
- Body: `{ sources: CreateSourceRequest[], theme_id? }`
- Response: `{ imported, skipped, errors }`

**POST /sources/import-csv**
- Auth: Authenticated
- Body: Multipart file upload (CSV: title,url)
- Response: `{ imported, skipped, errors }`

**GET /sources/export-csv**
- Auth: Authenticated
- Response: CSV file download

### 4.6 Generation

**POST /syntheses/generate**
- Auth: Authenticated
- Body: `{ theme_id: UUID }`
- Response: `{ job_id: UUID }`
- Creates a job in the `JobStore` and spawns a background generation task. Returns 409 if the user already has an active job.

**GET /syntheses/generate/{job_id}/progress**
- Auth: Authenticated (job owner)
- Response: SSE stream of `ProgressEvent`
- Events: `progress` (step, message, percent), `complete` (synthesis_id), `error` (message).

**POST /syntheses/generate/{job_id}/stop**
- Auth: Authenticated (job owner)
- Response: `{ message: string }`
- Sets cooperative cancellation flag.

### 4.7 Syntheses

**GET /syntheses**
- Auth: Authenticated
- Response: `SynthesisListItem[]` (with section summaries, theme info)

**GET /syntheses/{id}**
- Auth: Authenticated (owner only)
- Response: `SynthesisResponse` (full sections data)

**DELETE /syntheses/{id}**
- Auth: Authenticated (owner only)
- Response: 204 No Content

**POST /syntheses/{id}/send-email**
- Auth: Authenticated
- Body: `{ email: string }`
- Response: `{ message: string }`

**GET /syntheses/{id}/export/markdown**
- Auth: Authenticated
- Response: Markdown file download

**GET /syntheses/{id}/export/pdf**
- Auth: Authenticated
- Response: PDF file download

### 4.8 Article History & Provenance

**GET /article-history?limit=&offset=&job_id=&status=**
- Auth: Authenticated
- Response: `{ items: ArticleHistoryEntry[], total: number }`

**DELETE /article-history**
- Auth: Authenticated
- Response: `{ deleted: number }`

**GET /syntheses/{id}/provenance**
- Auth: Authenticated
- Response: `ArticleHistoryEntry[]` (articles with status "used" for this synthesis's job_id)

### 4.9 LLM Call Logs

**GET /llm-logs/{job_id}**
- Auth: Authenticated
- Response: `LlmCallLogEntry[]`

### 4.10 User API Keys

**GET /user/api-keys**
- Auth: Authenticated
- Response: `ApiKeyResponse[]` (id, provider_name, key_prefix, timestamps; never the full key)

**POST /user/api-keys**
- Auth: Authenticated
- Body: `{ provider_name, api_key }`
- Validation: provider in (gemini, openai, anthropic, brave_search), key 8-500 chars.
- Response: `ApiKeyResponse`
- Encrypts key with AES-256-GCM before storage; upserts (one key per user per provider).

**DELETE /user/api-keys/{provider}**
- Auth: Authenticated
- Response: 204 No Content

**POST /user/api-keys/{provider}/test**
- Auth: Authenticated
- Response: `{ success: boolean, message: string }`
- Decrypts key, calls provider test endpoint.

**POST /user/api-keys/export**
- Auth: Authenticated
- Response: `{ keys: [{ provider_name, api_key }] }`
- Decrypts and returns all keys (used for backup/migration).

### 4.11 Public Configuration

**GET /config/providers**
- Auth: Authenticated
- Response: `ProviderConfigResponse[]` (enabled providers with model lists for scraping and websearch)

### 4.12 Admin Endpoints

All admin endpoints require the `AdminUser` extractor (role = admin).

**GET /admin/providers**
- Response: `AdminProviderResponse[]`

**POST /admin/providers**
- Body: `CreateProviderRequest`
- Validation: provider_name in (gemini, openai, anthropic), at least one model per list, at most one default per list.
- Response: `AdminProviderResponse`

**PUT /admin/providers/{id}**
- Body: `UpdateProviderRequest` (all fields optional)
- Response: `AdminProviderResponse`

**DELETE /admin/providers/{id}**
- Response: 204 No Content

**GET /admin/rate-limits**
- Response: `RateLimitResponse[]`

**PUT /admin/rate-limits/{provider_name}**
- Body: `{ max_requests: 1-1000, time_window_seconds: 1-3600 }`
- Response: `RateLimitResponse`
- Hot-reloads the in-memory provider rate limiter.

**GET /admin/users**
- Response: `AdminUserResponse[]`

**PUT /admin/users/{id}/role**
- Body: `{ role: "user" | "admin" }`
- Response: `{ message: string }`

### 4.13 Health Check

**GET /health**
- Auth: Public
- Response: `{ status: "ok" }`

---

## 5. Generation Pipeline — Full Algorithm

### Startup & Background Tasks

- **Session cleanup**: an hourly background task deletes expired DB sessions (`db::sessions::delete_expired`).
- **Job store TTL**: expired job entries (older than 1 hour) are cleaned up via `JobStore::cleanup_expired`.

### Generation Lifecycle

`POST /api/v1/syntheses/generate` creates a job in the `JobStore`, then spawns two nested tasks:
- Inner task: wraps `run_generation` in a **15-minute `tokio::time::timeout`**. If the timeout fires, it sends an `Error` progress event and releases the user lock.
- Outer task: monitors the inner task's `JoinHandle` for panics. If the inner task panics, it sends an `Error` progress event and releases the user lock.

Progress is streamed to clients via a `tokio::sync::watch` channel, to which the SSE endpoint subscribes.

### Initialization

1. **Load user settings** from DB (provider, models, batch_size, rate limits, etc.)
2. **Cleanup** — delete old article history entries (older than `article_history_days` days, dropped entries only) and truncate old LLM call logs
3. **Validate** — if no categories are configured, the only available category is "Divers".
4. **Load theme** — categories, max_items_per_category, max_age_days, summary_length
5. **Load user sources** (personalized URLs filtered by theme_id)
6. **Resolve LLM provider** — decrypt user's API key, create provider instance (`Arc<dyn LlmProvider>`)
7. **Resolve models** — research model + web-search model (user override or admin default)
8. **Setup rate limiter** — per-user or global provider limiter
9. **Initialize tracking structures** — `article_scraped` (category→articles), `source_counts` (per-domain article count), `url_source` (per-article source), `filled_counts` (per-category article count), `seen_urls` (cross-phase dedup), `classification_categories` (user categories + "Divers")
10. **Batch trace buffer** — `pending_traces: Vec<ArticleHistoryEntry>` accumulates all article history writes; flushed with `db::article_history::batch_insert_entries` at phase boundaries.

### Phase 1: Personalized Sources

**Skipped entirely if the user has no sources.**

#### 1a. Windowed source extraction

- Query `article_history` for the last source used. Reorder sources so the first source follows the last one used (rolling window).
- Separate preferred sources (processed first) from non-preferred, preserving rotation order within each group.
- Process sources in waves of `source_extraction_window` size:
  - For each source in the wave: fetch page HTML, extract up to `max_links_per_source` article URLs via HTML parsing (same-domain, non-homepage, no static assets).
  - **SSRF check** performed on each source URL before fetching.
  - Deduplicate candidate URLs (case-insensitive, cross-source via `seen_urls`).
  - **Filter against article history** — hash each URL (normalized: lowercase, strip fragments/UTM params/trailing slashes), batch-query `article_history` → remove matches. Trace dropped articles as `status: filtered_history`.
  - **Preferred-first shuffle** — shuffle preferred URLs separately from non-preferred, then concatenate (preferred first).
  - Track url → source in `url_source`.
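
The rotation, preferred-first grouping, and wave splitting described above can be sketched as follows. This is an illustration under assumptions: `Source`, `order_sources`, and `waves` are hypothetical names, and the real pipeline reads the last-used source from `article_history` rather than receiving it as an argument.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Source {
    pub url: String,
    pub is_preferred: bool,
}

/// Rotates `sources` so the entry after `last_used_url` comes first, then
/// moves preferred sources to the front while keeping rotation order in each group.
pub fn order_sources(sources: &[Source], last_used_url: Option<&str>) -> Vec<Source> {
    // Index of the source that follows the last one used (0 if unknown).
    let start = last_used_url
        .and_then(|last| sources.iter().position(|s| s.url == last))
        .map(|i| (i + 1) % sources.len())
        .unwrap_or(0);
    let rotated = sources[start..].iter().chain(sources[..start].iter());
    // `partition` preserves the relative order inside both groups.
    let (preferred, others): (Vec<Source>, Vec<Source>) =
        rotated.cloned().partition(|s| s.is_preferred);
    preferred.into_iter().chain(others).collect()
}

/// Splits the ordered sources into waves of `window` entries (at least 1).
pub fn waves(ordered: &[Source], window: usize) -> Vec<&[Source]> {
    ordered.chunks(window.max(1)).collect()
}
```

With sources `a, b*, c, d*` (preferred marked `*`) and `b` as the last source used, the rotation yields `c, d, a, b`, and the preferred-first grouping turns that into `d, b, c, a`.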

#### 1b. Scrape, classify, and summarize articles (batched)

Candidates are processed in batches of `settings.batch_size` (minimum 1). For each batch:

**Batch assembly**: Pull up to `batch_size` candidates, skipping any where `source_counts[domain] >= max_articles_per_source` (traced as `filtered_diversity`).
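
A minimal sketch of the batch assembly step, assuming candidates already carry their domain. `Candidate` and `assemble_batch` are hypothetical names; the skipped list stands in for the entries that would be traced as `filtered_diversity`.

```rust
use std::collections::{HashMap, VecDeque};

#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub url: String,
    pub domain: String,
}

/// Pulls candidates from the front of `queue` until the batch is full or the
/// queue is empty. Returns `(batch, skipped_for_diversity)`.
pub fn assemble_batch(
    queue: &mut VecDeque<Candidate>,
    source_counts: &HashMap<String, usize>,
    max_articles_per_source: usize,
    batch_size: usize,
) -> (Vec<Candidate>, Vec<Candidate>) {
    let mut batch = Vec::new();
    let mut skipped = Vec::new();
    // The spec enforces a minimum batch size of 1.
    while batch.len() < batch_size.max(1) {
        let candidate = match queue.pop_front() {
            Some(c) => c,
            None => break,
        };
        let count = source_counts.get(&candidate.domain).copied().unwrap_or(0);
        if count >= max_articles_per_source {
            skipped.push(candidate); // would be traced as `filtered_diversity`
        } else {
            batch.push(candidate);
        }
    }
    (batch, skipped)
}
```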

**Phase A — Scrape batch in parallel** (`JoinSet`):
- SSRF check (no private IPs), 15s timeout, 5MB body limit.
- HTML parsing for title (`<title>`, `og:title`), date (meta tags, JSON-LD, `<time>`), body (strip scripts/nav), soft-404 detection.
- If article body is empty, is a soft-404, or is too old: trace as `filtered_empty` / `filtered_too_old` and skip.

**Phase B — Classify/summarize batch in parallel** (`JoinSet`):
- Check rate limit before classifying (waits up to 60s, then errors).
- Send article (title + body snippet based on `summary_length`: 500/2000/4000 chars) + categories + "Divers" to LLM.
- LLM returns `{title, summary, category, date, is_article}`.
- **`is_article` check**: if false, trace as `filtered_not_article` and skip.
- **Date fallback**: if the LLM returned a date and that date is older than `max_age_days`, trace as `filtered_too_old` and skip.
- **No-date routing**: if no date found (neither scraper nor LLM), route to "Articles sans date" category.
- **`assign_category()`** helper: validates category, falls back to "Divers" if unknown or full. If "Divers" is also full, drops the article.
- **LLM call logged** with full prompt/response/timing.
- Add article to `article_scraped`, increment `filled_counts` and `source_counts`.
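
The category fallback rule from the list above can be sketched like this. The signature is an assumption (the real `assign_category()` may take different arguments); only the decision logic follows the description: keep a known category that still has room, otherwise fall back to "Divers", otherwise drop.

```rust
use std::collections::HashMap;

pub const FALLBACK_CATEGORY: &str = "Divers";

/// Returns the category the article should be filed under, or `None` if it
/// must be dropped because both the proposed category and "Divers" are full.
pub fn assign_category(
    proposed: &str,
    categories: &[String],
    filled_counts: &HashMap<String, usize>,
    max_items_per_category: usize,
) -> Option<String> {
    let has_room =
        |name: &str| filled_counts.get(name).copied().unwrap_or(0) < max_items_per_category;
    let known = categories.iter().any(|c| c == proposed);
    if known && has_room(proposed) {
        return Some(proposed.to_string());
    }
    // Unknown or full category: try the fallback bucket.
    if has_room(FALLBACK_CATEGORY) {
        Some(FALLBACK_CATEGORY.to_string())
    } else {
        None
    }
}
```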

**Early exit**: After each batch, if total articles ≥ `(num_categories + 1) × max_items_per_category` (the `+ 1` accounts for "Divers"), stop.

**Wave check**: After each wave, if synthesis is full, skip remaining waves.

**Trace flush**: Pending traces batch-inserted into `article_history` between waves.
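
The early-exit condition above reduces to a single comparison. The sketch below assumes `filled_counts` holds one counter per category including "Divers"; whether "Articles sans date" entries are counted is not stated in this document, and `synthesis_is_full` is a hypothetical name.

```rust
use std::collections::HashMap;

/// True once the total number of kept articles reaches
/// `(num_user_categories + 1) * max_items_per_category`.
pub fn synthesis_is_full(
    filled_counts: &HashMap<String, usize>,
    num_user_categories: usize,
    max_items_per_category: usize,
) -> bool {
    let total: usize = filled_counts.values().sum();
    total >= (num_user_categories + 1) * max_items_per_category
}
```

With 2 user categories and `max_items_per_category = 4`, the threshold is `(2 + 1) × 4 = 12` articles.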

### Phase 2: Web Search Fallback

**Skipped if all user-defined categories are already filled.**

#### 2a. Compute category gaps

For each user category: `needed = max_items_per_category - already_filled`. Only proceed if any category needs more.
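
As a sketch of the gap computation, assuming the same `filled_counts` map used in Phase 1 (`category_gaps` is a hypothetical name):

```rust
use std::collections::HashMap;

/// Returns `(category, needed)` for every user category that still has room,
/// in the user's category order. An empty result means Phase 2 can be skipped.
pub fn category_gaps(
    user_categories: &[String],
    filled_counts: &HashMap<String, usize>,
    max_items_per_category: usize,
) -> Vec<(String, usize)> {
    user_categories
        .iter()
        .filter_map(|category| {
            let filled = filled_counts.get(category).copied().unwrap_or(0);
            // saturating_sub guards against a counter that overshot the limit.
            let needed = max_items_per_category.saturating_sub(filled);
            if needed > 0 {
                Some((category.clone(), needed))
            } else {
                None
            }
        })
        .collect()
}
```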

#### 2b. Choose path: Brave Search or LLM web search

Selected by `settings.use_brave_search`.

#### Path A: Brave Search (`use_brave_search = true`)

1. Resolve and decrypt the user's Brave Search API key (error if not configured).
2. Query: `"{theme} actualites"`, up to 20 results, freshness mapped from `max_age_days` (pd/pw/pm/py).
3. Filter results through **`filter_phase2_url()`**: homepage filter → cross-phase dedup → article history → source diversity.
4. Batch scrape + classify (same as Phase 1b, `source_type = "brave_search"`).
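
The freshness mapping in step 2 could look like the sketch below. The codes `pd`, `pw`, `pm`, and `py` come from this document; the day thresholds are an assumption, since the exact cut-offs are not specified here, and `brave_freshness` is a hypothetical name.

```rust
/// Maps `max_age_days` to a Brave Search freshness code.
/// Thresholds are assumed: 1 day, 7 days, 31 days, then one year.
pub fn brave_freshness(max_age_days: u32) -> &'static str {
    match max_age_days {
        0..=1 => "pd",  // past day
        2..=7 => "pw",  // past week
        8..=31 => "pm", // past month
        _ => "py",      // past year
    }
}
```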

#### Path B: LLM Web Search (`use_brave_search = false`)

1. Build search prompt with theme, categories, gap counts.
2. Call LLM with `model_websearch`. Returns `{category_0: [{title, url, summary}], ...}`.
3. Filter URLs through **`filter_phase2_url()`**.
4. Scrape each result sequentially. Keep LLM-provided title/summary (no re-classification).
5. `source_type = "web_search"`.

### Save + Record

1. **Error if empty** — if all article lists are empty and generation wasn't cancelled, return error.
2. **Order sections** — user-defined categories first (in order), then "Divers" if non-empty, then "Articles sans date" if non-empty.
3. **Sanitize** — strip `\u0000` null bytes from JSON (PostgreSQL JSONB requirement).
4. **Save synthesis** — insert into `syntheses` table with `job_id`, `week` (ISO week), `sections` (JSONB), `status: completed`, `theme_id`.
5. **Record used articles** — for each article in the final synthesis, build trace with `status: "used"`, `synthesis_id`, and correct `source_type` (inferred from `url_source`). Batch-insert into `article_history`.
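
Steps 2 and 3 can be illustrated with a short sketch. The names are hypothetical and two details are assumptions: user-defined categories are kept even when empty (the "if non-empty" condition is only stated for the two special sections), and null characters are stripped from string values before serialization.

```rust
use std::collections::HashMap;

/// Orders sections: user categories first (in order), then "Divers" and
/// "Articles sans date" when they contain at least one item.
pub fn order_sections(
    user_categories: &[String],
    mut by_category: HashMap<String, Vec<String>>,
) -> Vec<(String, Vec<String>)> {
    let mut sections = Vec::new();
    for name in user_categories {
        let items = by_category.remove(name).unwrap_or_default();
        sections.push((name.clone(), items));
    }
    for special in ["Divers", "Articles sans date"].iter() {
        if let Some(items) = by_category.remove(*special) {
            if !items.is_empty() {
                sections.push((special.to_string(), items));
            }
        }
    }
    sections
}

/// Removes NUL characters, which PostgreSQL rejects inside JSONB text.
pub fn strip_nul(text: &str) -> String {
    text.replace('\u{0000}', "")
}
```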

### Shared Helpers

- **`build_trace_entry()`** — constructs an `ArticleHistoryEntry` from an `ArticleTrace` struct. Never writes to DB directly; caller accumulates in `pending_traces`.
- **`scrape_and_classify_batch()`** — shared batch processing logic used by Phase 1 and Phase 2 Brave paths.
- **`assign_category()`** — validates LLM-returned category, falls back to "Divers", drops if all full.
- **`filter_phase2_url()`** — async helper applying homepage/dedup/history/diversity filters for Phase 2.
- **`scrape_single_article()`** — thin wrapper around `scraper::scrape_url` returning `(body_text, page_title, final_url, drop_reason)`.
- **`hash_article_url()`** — normalizes URL (strips fragments, UTM params, trailing slashes, lowercases) then SHA-256 hashes.
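
The normalization performed by `hash_article_url()` before hashing might be sketched as below. This is a std-only illustration: the function name and the order of the steps are assumptions, and the final SHA-256 step (done with the `sha2` crate in the backend) is omitted.

```rust
/// Normalizes a URL for deduplication: lowercases it, drops the fragment,
/// removes `utm_*` query parameters, and strips trailing slashes.
pub fn normalize_article_url(url: &str) -> String {
    let lower = url.trim().to_lowercase();
    // Everything after '#' is the fragment.
    let without_fragment = lower.split('#').next().unwrap_or("");
    let (base, query) = match without_fragment.split_once('?') {
        Some((b, q)) => (b, q),
        None => (without_fragment, ""),
    };
    // Keep every query parameter except tracking ones.
    let kept: Vec<&str> = query
        .split('&')
        .filter(|p| !p.is_empty() && !p.starts_with("utm_"))
        .collect();
    let base = base.trim_end_matches('/');
    if kept.is_empty() {
        base.to_string()
    } else {
        format!("{}?{}", base, kept.join("&"))
    }
}
```

Two URLs that differ only in case, tracking parameters, fragment, or a trailing slash normalize to the same string and therefore produce the same `url_hash`.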

---

## 6. LLM Provider Abstraction

### Trait Definition

```rust
#[async_trait]
pub trait LlmProvider: Send + Sync {
    fn provider_id(&self) -> &str;
    async fn call_llm(&self, model: &str, system_prompt: &str,
                       user_prompt: &str, response_schema: &Value)
        -> Result<Value, AppError>;
}
```

All calls use structured JSON output (response_schema defines the expected shape).

### Implementations

| Provider | Module | API Endpoint | Auth Method |
|---|---|---|---|
| Google Gemini | `llm/gemini.rs` | `generativelanguage.googleapis.com` | Query param `?key=` |
| OpenAI | `llm/openai.rs` | `api.openai.com/v1/chat/completions` | Bearer token |
| Anthropic | `llm/anthropic.rs` | `api.anthropic.com/v1/messages` | `x-api-key` header |
| Mock | `llm/mock.rs` | N/A (in-memory) | N/A |

### Factory

`llm/factory.rs` provides `create_provider(provider_name, api_key, http_client) -> Arc<dyn LlmProvider>`. Matches on provider name string.

### Response Schema

`llm/schema.rs` builds JSON Schema definitions for:
- Classification/summarization: `{title, summary, category, date, is_article}`
- Web search: `{category_0: [{title, url, summary}], ...}` with per-category arrays
- Source link extraction: `{links: [{url}]}`

### Error Mapping

`map_provider_http_error()` translates HTTP status codes to `AppError` variants:
- 400 -> BadRequest
- 401/403 -> BadRequest (invalid key)
- 404 -> BadRequest (model not found)
- 429/529 -> RateLimited
- Other -> Internal
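
The mapping above translates naturally into a `match`. In this sketch the `AppError` variants carry assumed payloads and the function signature is hypothetical; only the status-to-variant mapping is taken from the list.

```rust
/// Simplified stand-in for the backend's error type (payloads are assumed).
#[derive(Debug, PartialEq)]
pub enum AppError {
    BadRequest(String),
    RateLimited,
    Internal(String),
}

/// Translates a provider HTTP status code into an `AppError`.
pub fn map_provider_http_error(status: u16, body: &str) -> AppError {
    match status {
        400 => AppError::BadRequest(body.to_string()),
        401 | 403 => AppError::BadRequest("invalid API key".to_string()),
        404 => AppError::BadRequest("model not found".to_string()),
        429 | 529 => AppError::RateLimited,
        other => AppError::Internal(format!("provider returned HTTP {}", other)),
    }
}
```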

---

## 7. Background Tasks

### Session Cleanup

Runs hourly via `tokio::spawn`. Calls `db::sessions::delete_expired` to remove sessions past their `expires_at` timestamp.

### Job Store Cleanup

`JobStore::cleanup_expired` is called periodically. It removes job entries older than 1 hour (the TTL constant) and releases the user locks held by expired jobs.

### Scheduler

Runs every minute via `tokio::spawn` with a 60-second interval. On each tick:

1. `current_day_code()` -> "mon" through "sun"
2. `find_due_schedules(pool, day, time)` -> queries enabled schedules matching current day and time (HH:MM)
3. For each due schedule:
   - Skip if `job_store.has_active_job(user_id)` returns `Some` (a manual generation is in progress)
   - Create a temporary `watch::channel` and `AtomicBool`
   - Call `synthesis::run_generation_inner` directly (bypasses job store)
   - On success: send emails to configured recipients (up to 3), mark schedule as run
   - On failure: log error, do not mark as run

---

## 8. Configuration

### Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| DATABASE_URL | Yes | - | PostgreSQL connection string |
| MASTER_ENCRYPTION_KEY | Yes | - | 64 hex chars (32 bytes) for AES-256-GCM |
| APP_URL | Yes | - | Public URL (CORS, magic links, cookies). No trailing slash. |
| PORT | No | 8080 | HTTP server port |
| RUST_LOG | No | - | Logging filter (e.g., "info,ai_synth_backend=debug") |
| STATIC_DIR | No | ../frontend/dist | Path to built SolidJS files |
| RESEND_API_KEY | Yes | - | Resend email service API key |
| EMAIL_FROM | Yes | - | Sender address for emails |
| TURNSTILE_SECRET_KEY | Yes | - | Cloudflare Turnstile server secret |
| TURNSTILE_SITE_KEY | Yes | - | Cloudflare Turnstile client key |
| POSTGRES_PASSWORD | Yes | - | Used by docker-compose for DB container |

### Startup Validation

`AppConfig::validate()` checks at startup:
- `MASTER_ENCRYPTION_KEY` is exactly 64 hex characters
- `APP_URL` starts with http:// or https:// and has no trailing slash

The application refuses to start with invalid configuration.
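
The two startup checks can be sketched as a plain function. `validate_config` is a hypothetical free function standing in for `AppConfig::validate()`, whose actual signature and error type are not shown in this document.

```rust
/// Validates the encryption key and public URL as described above.
pub fn validate_config(master_encryption_key: &str, app_url: &str) -> Result<(), String> {
    // 64 hex characters encode the 32-byte AES-256 key.
    let key_ok = master_encryption_key.len() == 64
        && master_encryption_key.chars().all(|c| c.is_ascii_hexdigit());
    if !key_ok {
        return Err("MASTER_ENCRYPTION_KEY must be exactly 64 hex characters".to_string());
    }
    if !(app_url.starts_with("http://") || app_url.starts_with("https://")) {
        return Err("APP_URL must start with http:// or https://".to_string());
    }
    if app_url.ends_with('/') {
        return Err("APP_URL must not end with a trailing slash".to_string());
    }
    Ok(())
}
```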

### User Settings Model

Default values applied when a user has no saved settings:

| Setting | Default | Range |
|---|---|---|
| max_articles_per_source | 3 | 1-10 |
| max_links_per_source | 8 | 1-30 |
| use_brave_search | false | boolean |
| article_history_days | 90 | 0-365 |
| batch_size | 5 | 1-20 |
| source_extraction_window | 3 | 1-10 |
| search_agent_behavior | "" | max 2000 chars |
| ai_provider | "" | max 100 chars |
| ai_model | "" | max 100 chars |
| ai_model_websearch | "" | max 100 chars |
| rate_limit_max_requests | null | >= 1 if set |
| rate_limit_time_window_seconds | null | >= 1 if set |