# AI Weekly Synth -- Technical Specifications
## 1. Backend Tech Stack
| Dependency | Version | Purpose |
|---|---|---|
| axum | 0.8 | Web framework (macros, multipart) |
| tokio | 1 | Async runtime (full features) |
| tower | 0.5 | Middleware composition |
| tower-http | 0.6 | CORS, static files, tracing, headers |
| sqlx | 0.8 | Async Postgres driver (runtime-tokio, tls-rustls, uuid, chrono, json, migrate) |
| reqwest | 0.12 | HTTP client (JSON) |
| serde / serde_json | 1 | Serialization/deserialization |
| chrono | 0.4 | Date/time handling (serde feature) |
| aes-gcm | 0.10 | AES-256-GCM encryption |
| zeroize | 1 | Secure memory zeroing |
| sha2 | 0.10 | SHA-256 hashing |
| rand | 0.8 | Random number generation |
| base64 | 0.22 | Base64 encoding |
| hex | 0.4 | Hex encoding/decoding |
| async-trait | 0.1 | Async trait objects |
| tracing / tracing-subscriber | 0.1 / 0.3 | Structured logging (env-filter, json) |
| dotenvy | 0.15 | .env file loading |
| clap | 4 | CLI argument parsing |
| scraper | 0.22 | HTML parsing (CSS selectors) |
| ego-tree | 0.10 | Tree data structure (used by scraper) |
| url | 2 | URL parsing and validation |
| email_address | 0.2 | Email validation |
| anyhow | 1 | Error context |
| thiserror | 2 | Error type derivation |
| uuid | 1 | UUID v4 generation (serde feature) |
| dashmap | 6 | Concurrent hash maps |
| tokio-stream | 0.1 | Stream utilities for SSE |
| futures | 0.3 | Async stream combinators |
| printpdf | 0.7 | PDF generation |
**Dev dependencies**: tower (util), http-body-util, wiremock 0.6.
**Rust edition**: 2021.
---
## 2. Frontend Tech Stack
| Dependency | Version | Purpose |
|---|---|---|
| solid-js | ^1.9.0 | Reactive UI framework |
| @solidjs/router | ^0.15.0 | Client-side routing |
| lucide-solid | ^0.475.0 | Icon library |
| date-fns | ^4.1.0 | Date formatting |
| tailwindcss | ^4.1.0 | Utility-first CSS (v4) |
| @tailwindcss/vite | ^4.1.0 | Tailwind Vite plugin |
| vite | ^6.2.0 | Build tool and dev server |
| vite-plugin-solid | ^2.11.0 | SolidJS Vite integration |
| typescript | ~5.8.0 | Type checking |
| vitest | ^3.0.0 | Unit testing |
| @solidjs/testing-library | ^0.8.0 | Component testing |
| jsdom | ^25.0.0 | DOM environment for tests |
### Frontend Routes
| Path | Component | Auth | Description |
|---|---|---|---|
| /login | Login | Public | Login page |
| /register | Register | Public | Registration page |
| /auth/verify | AuthVerify | Public | Magic link verification |
| / | Home | Protected | Dashboard / synthesis list |
| /settings | Settings | Protected | User settings |
| /themes | ThemeManager | Protected | Theme CRUD + source management |
| /generate | GenerateSynthesis | Protected | Generation trigger + progress |
| /synthesis/:id | SynthesisDetail | Protected | Full synthesis view |
| /article-history | ArticleHistory | Protected | Article history browser |
| /llm-logs/:jobId | LlmLogs | Protected | LLM call log viewer |
| /admin/providers | AdminProviders | Admin | Provider configuration |
| /admin/rate-limits | AdminRateLimits | Admin | Rate limit configuration |
| /admin/users | AdminUsers | Admin | User management |
---
## 3. Database Schema
### 3.1 `users`
| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| email | TEXT | NOT NULL, UNIQUE |
| display_name | TEXT | nullable |
| role | TEXT | NOT NULL, DEFAULT 'user', CHECK (user/admin) |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
Indexes: `idx_users_email` on (email).
### 3.2 `sessions`
| Column | Type | Constraints |
|---|---|---|
| session_hash | TEXT | PK (SHA-256 of raw token) |
| user_id | UUID | NOT NULL, FK users(id) CASCADE |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
| expires_at | TIMESTAMPTZ | NOT NULL |
| last_active_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
| ip_address | TEXT | nullable |
| user_agent | TEXT | nullable |
Indexes: `idx_sessions_user_id`, `idx_sessions_expires_at`.
### 3.3 `magic_tokens`
| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| email | TEXT | NOT NULL |
| token_hash | TEXT | NOT NULL, UNIQUE |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
| expires_at | TIMESTAMPTZ | NOT NULL |
| used | BOOLEAN | NOT NULL, DEFAULT false |
Indexes: `idx_magic_tokens_email`, `idx_magic_tokens_expires`.
### 3.4 `settings`
Per-user pipeline configuration. One row per user (user_id is the PK).
| Column | Type | Constraints |
|---|---|---|
| user_id | UUID | PK, FK users(id) CASCADE |
| max_articles_per_source | INTEGER | NOT NULL, DEFAULT 3 |
| max_links_per_source | INTEGER | NOT NULL, DEFAULT 8 |
| use_brave_search | BOOLEAN | NOT NULL, DEFAULT false |
| article_history_days | INTEGER | NOT NULL, DEFAULT 90 |
| batch_size | INTEGER | NOT NULL, DEFAULT 5 |
| source_extraction_window | INTEGER | NOT NULL, DEFAULT 3 |
| search_agent_behavior | TEXT | NOT NULL, DEFAULT '' |
| ai_provider | TEXT | NOT NULL, DEFAULT '' |
| ai_model | TEXT | NOT NULL, DEFAULT '' |
| ai_model_websearch | TEXT | NOT NULL, DEFAULT '' |
| rate_limit_max_requests | INTEGER | nullable |
| rate_limit_time_window_seconds | INTEGER | nullable |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
### 3.5 `themes`
Per-user topic configurations with content settings.
| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| user_id | UUID | NOT NULL, FK users(id) CASCADE |
| name | TEXT | NOT NULL |
| theme | TEXT | NOT NULL (search topic) |
| categories | JSONB | NOT NULL, DEFAULT '[]' |
| max_items_per_category | INTEGER | NOT NULL, DEFAULT 4 |
| max_age_days | INTEGER | NOT NULL, DEFAULT 7 |
| summary_length | INTEGER | NOT NULL, DEFAULT 3 |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
Indexes: `idx_themes_user_id`.
### 3.6 `sources`
User-curated news source URLs, optionally tied to a theme.
| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| user_id | UUID | NOT NULL, FK users(id) CASCADE |
| title | VARCHAR(200) | NOT NULL, CHECK length 1-200 |
| url | VARCHAR(1000) | NOT NULL, CHECK length <= 1000 |
| theme_id | UUID | nullable, FK themes(id) CASCADE |
| is_preferred | BOOLEAN | NOT NULL, DEFAULT false |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
Indexes: `idx_sources_user_id`, UNIQUE `idx_sources_user_id_url` on (user_id, url).
### 3.7 `syntheses`
Generated synthesis results with JSONB section data.
| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| user_id | UUID | NOT NULL, FK users(id) CASCADE |
| week | VARCHAR(10) | NOT NULL (ISO week string) |
| sections | JSONB | NOT NULL, DEFAULT '[]' |
| status | VARCHAR(20) | NOT NULL, DEFAULT 'completed' |
| job_id | UUID | nullable |
| theme_id | UUID | nullable, FK themes(id) SET NULL |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
Indexes: `idx_syntheses_user_id_created_at` on (user_id, created_at DESC).
JSONB structure for `sections`:
```json
[
{
"title": "Category Name",
"items": [
{ "title": "Article Title", "url": "https://...", "summary": "...", "date": "2026-03-25" }
]
}
]
```
### 3.8 `theme_schedules`
Automated generation schedules, one per theme.
| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| theme_id | UUID | NOT NULL, UNIQUE, FK themes(id) CASCADE |
| user_id | UUID | NOT NULL, FK users(id) CASCADE |
| enabled | BOOLEAN | NOT NULL, DEFAULT true |
| days | JSONB | NOT NULL, DEFAULT '[]' (e.g. ["mon","fri"]) |
| time_utc | TEXT | NOT NULL, DEFAULT '08:00' (HH:MM) |
| emails | JSONB | NOT NULL, DEFAULT '[]' (up to 3 addresses) |
| last_run_at | TIMESTAMPTZ | nullable |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
Indexes: `idx_theme_schedules_enabled` (partial, WHERE enabled = true).
### 3.9 `article_history`
Article URL deduplication and full provenance tracing.
| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| user_id | UUID | NOT NULL, FK users(id) CASCADE |
| url_hash | TEXT | NOT NULL (SHA-256 of normalized URL) |
| url | TEXT | NOT NULL |
| title | TEXT | NOT NULL, DEFAULT '' |
| source_type | TEXT | NOT NULL, DEFAULT 'unknown' |
| source_url | TEXT | nullable |
| category | TEXT | nullable |
| synthesis_id | UUID | nullable, FK syntheses(id) SET NULL |
| status | TEXT | NOT NULL, DEFAULT 'used' |
| scraped_ok | BOOLEAN | NOT NULL, DEFAULT true |
| job_id | UUID | NOT NULL |
| published_date | TEXT | nullable |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
Indexes: `idx_article_history_user_url` on (user_id, url_hash), `idx_article_history_job_id`.
Status values: `used`, `filtered_history`, `filtered_diversity`, `filtered_not_article`, `filtered_too_old`, `filtered_empty`, `filtered_homepage`, `filtered_cross_phase_dedup`.
Source type values: `personalized_source`, `brave_search`, `web_search`.
### 3.10 `llm_call_log`
Full LLM interaction logging for debugging and analysis.
| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| user_id | UUID | NOT NULL, FK users(id) CASCADE |
| job_id | UUID | NOT NULL |
| call_type | TEXT | NOT NULL |
| model | TEXT | NOT NULL |
| system_prompt | TEXT | NOT NULL, DEFAULT '' |
| user_prompt | TEXT | NOT NULL, DEFAULT '' |
| response_body | TEXT | NOT NULL, DEFAULT '' |
| duration_ms | INTEGER | NOT NULL, DEFAULT 0 |
| article_url | TEXT | nullable |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
Indexes: `idx_llm_call_log_job_id`, `idx_llm_call_log_user_id` on (user_id, created_at).
### 3.11 `admin_providers`
Admin-curated catalog of LLM providers and their models.
| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| provider_name | VARCHAR(50) | NOT NULL, UNIQUE |
| display_name | VARCHAR(100) | NOT NULL |
| models_scraping | JSONB | NOT NULL, DEFAULT '[]' |
| models_websearch | JSONB | NOT NULL, DEFAULT '[]' |
| is_enabled | BOOLEAN | NOT NULL, DEFAULT true |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
Indexes: `idx_admin_providers_enabled` (partial, WHERE is_enabled = true).
Seeded with: gemini, openai, anthropic.
JSONB model structure:
```json
[{"model_id": "gemini-2.5-pro", "display_name": "Gemini 2.5 Pro", "is_default": true}]
```
### 3.12 `admin_rate_limits`
Per-provider rate limit configuration.
| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| provider_name | VARCHAR(50) | NOT NULL, UNIQUE, FK admin_providers(provider_name) CASCADE |
| max_requests | INTEGER | NOT NULL, DEFAULT 30 |
| time_window_seconds | INTEGER | NOT NULL, DEFAULT 60 |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
Seeded defaults: gemini 29/60s, openai 50/60s, anthropic 40/60s.
### 3.13 `user_api_keys`
Encrypted user LLM API keys.
| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| user_id | UUID | NOT NULL, FK users(id) CASCADE |
| provider_name | VARCHAR(50) | NOT NULL |
| encrypted_key | BYTEA | NOT NULL |
| nonce | BYTEA | NOT NULL |
| key_prefix | VARCHAR(20) | NOT NULL |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
Constraint: UNIQUE(user_id, provider_name). Valid providers: gemini, openai, anthropic, brave_search.
### 3.14 `audit_log`
Admin mutation audit trail.
| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| admin_user_id | UUID | nullable, FK users(id) SET NULL |
| action | VARCHAR(100) | NOT NULL |
| target_type | VARCHAR(50) | nullable |
| target_id | VARCHAR(255) | nullable |
| details | JSONB | nullable |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
Indexes: `idx_audit_log_created_at` (DESC), `idx_audit_log_admin_user`.
---
## 4. API Endpoints
All endpoints are prefixed with `/api/v1`. Responses are JSON. Errors follow the shape `{ "error": "message" }`.
### 4.1 Authentication
**POST /auth/register**
- Auth: Public
- Body: `{ email: string, display_name?: string, turnstile_token: string }`
- Response: `{ message: string }`
- Sends magic link email. Rate limited.
**POST /auth/login**
- Auth: Public
- Body: `{ email: string, turnstile_token: string }`
- Response: `{ message: string }`
- Sends magic link email. Rate limited.
**GET /auth/verify?token=...&email=...**
- Auth: Public
- Response: Redirect to frontend with session cookie set.
**POST /auth/verify**
- Auth: Public
- Body: `{ token: string, email: string }`
- Response: `{ message: string, user: User }`
- Sets `session` HttpOnly cookie (30-day expiry).
**POST /auth/logout**
- Auth: Authenticated
- Response: `{ message: string }`
- Clears session cookie and deletes DB session.
**GET /auth/me**
- Auth: Authenticated
- Response: `{ id, email, display_name, role, created_at }`
### 4.2 Settings
**GET /settings**
- Auth: Authenticated
- Response: `UserSettings` (a default row is created if none exists)
**PUT /settings**
- Auth: Authenticated
- Body: `UpdateSettingsRequest` (all fields required)
- Validation: max_articles_per_source 1-10, max_links_per_source 1-30, batch_size 1-20, source_extraction_window 1-10, article_history_days 0-365, search_agent_behavior max 2000 chars, ai_provider/ai_model/ai_model_websearch max 100 chars.
- Response: Updated `UserSettings`
### 4.3 Themes
**GET /themes**
- Auth: Authenticated
- Response: `ThemeResponse[]`
**POST /themes**
- Auth: Authenticated
- Body: `{ name, theme, categories: string[], max_items_per_category?, max_age_days?, summary_length? }`
- Validation: name non-empty max 200 chars, categories 1-20 non-empty entries, max_items 1-50, max_age 1-365, summary_length 1-3.
- Response: `ThemeResponse`
**PUT /themes/{id}**
- Auth: Authenticated (owner only)
- Body: `UpdateThemeRequest` (all fields optional)
- Response: `ThemeResponse`
**DELETE /themes/{id}**
- Auth: Authenticated (owner only)
- Response: 204 No Content
### 4.4 Schedules
**GET /themes/{id}/schedule**
- Auth: Authenticated (theme owner)
- Response: `ScheduleResponse` or 404
**PUT /themes/{id}/schedule**
- Auth: Authenticated (theme owner)
- Body: `{ enabled, days: string[], time_utc: "HH:MM", emails: string[] }`
- Validation: days from mon-sun, time HH:MM format, max 3 emails.
- Response: `ScheduleResponse`
**DELETE /themes/{id}/schedule**
- Auth: Authenticated (theme owner)
- Response: 204 No Content
### 4.5 Sources
**GET /sources?theme_id=...**
- Auth: Authenticated
- Response: `SourceResponse[]`
**POST /sources**
- Auth: Authenticated
- Body: `{ title, url, theme_id? }`
- Validation: title non-empty max 200, URL http(s) max 1000 chars.
- Response: `SourceResponse`
**PUT /sources/preferred**
- Auth: Authenticated
- Body: `{ source_ids: UUID[] }`
- Response: `{ updated: number }`
**DELETE /sources/{id}**
- Auth: Authenticated (owner only)
- Response: 204 No Content
**POST /sources/bulk**
- Auth: Authenticated
- Body: `{ sources: CreateSourceRequest[], theme_id? }`
- Response: `{ imported, skipped, errors }`
**POST /sources/import-csv**
- Auth: Authenticated
- Body: Multipart file upload (CSV: title,url)
- Response: `{ imported, skipped, errors }`
**GET /sources/export-csv**
- Auth: Authenticated
- Response: CSV file download
### 4.6 Generation
**POST /syntheses/generate**
- Auth: Authenticated
- Body: `{ theme_id: UUID }`
- Response: `{ job_id: UUID }`
- Creates a job in the `JobStore` and spawns a background generation task. Returns 409 if the user already has an active job.
**GET /syntheses/generate/{job_id}/progress**
- Auth: Authenticated (job owner)
- Response: SSE stream of `ProgressEvent`
- Events: `progress` (step, message, percent), `complete` (synthesis_id), `error` (message).
**POST /syntheses/generate/{job_id}/stop**
- Auth: Authenticated (job owner)
- Response: `{ message: string }`
- Sets cooperative cancellation flag.
### 4.7 Syntheses
**GET /syntheses**
- Auth: Authenticated
- Response: `SynthesisListItem[]` (with section summaries, theme info)
**GET /syntheses/{id}**
- Auth: Authenticated (owner only)
- Response: `SynthesisResponse` (full sections data)
**DELETE /syntheses/{id}**
- Auth: Authenticated (owner only)
- Response: 204 No Content
**POST /syntheses/{id}/send-email**
- Auth: Authenticated
- Body: `{ email: string }`
- Response: `{ message: string }`
**GET /syntheses/{id}/export/markdown**
- Auth: Authenticated
- Response: Markdown file download
**GET /syntheses/{id}/export/pdf**
- Auth: Authenticated
- Response: PDF file download
### 4.8 Article History & Provenance
**GET /article-history?limit=&offset=&job_id=&status=**
- Auth: Authenticated
- Response: `{ items: ArticleHistoryEntry[], total: number }`
**DELETE /article-history**
- Auth: Authenticated
- Response: `{ deleted: number }`
**GET /syntheses/{id}/provenance**
- Auth: Authenticated
- Response: `ArticleHistoryEntry[]` (articles with status "used" for this synthesis's job_id)
### 4.9 LLM Call Logs
**GET /llm-logs/{job_id}**
- Auth: Authenticated
- Response: `LlmCallLogEntry[]`
### 4.10 User API Keys
**GET /user/api-keys**
- Auth: Authenticated
- Response: `ApiKeyResponse[]` (id, provider_name, key_prefix, timestamps; never the full key)
**POST /user/api-keys**
- Auth: Authenticated
- Body: `{ provider_name, api_key }`
- Validation: provider in (gemini, openai, anthropic, brave_search), key 8-500 chars.
- Response: `ApiKeyResponse`
- Encrypts key with AES-256-GCM before storage; upserts (one key per user per provider).
**DELETE /user/api-keys/{provider}**
- Auth: Authenticated
- Response: 204 No Content
**POST /user/api-keys/{provider}/test**
- Auth: Authenticated
- Response: `{ success: boolean, message: string }`
- Decrypts key, calls provider test endpoint.
**POST /user/api-keys/export**
- Auth: Authenticated
- Response: `{ keys: [{ provider_name, api_key }] }`
- Decrypts and returns all keys (used for backup/migration).
### 4.11 Public Configuration
**GET /config/providers**
- Auth: Authenticated
- Response: `ProviderConfigResponse[]` (enabled providers with model lists for scraping and websearch)
### 4.12 Admin Endpoints
All admin endpoints require the `AdminUser` extractor (role = admin). The only exception in this section is `GET /health`, which is public.
**GET /admin/providers**
- Response: `AdminProviderResponse[]`
**POST /admin/providers**
- Body: `CreateProviderRequest`
- Validation: provider_name in (gemini, openai, anthropic), at least one model per list, at most one default per list.
- Response: `AdminProviderResponse`
**PUT /admin/providers/{id}**
- Body: `UpdateProviderRequest` (all fields optional)
- Response: `AdminProviderResponse`
**DELETE /admin/providers/{id}**
- Response: 204 No Content
**GET /admin/rate-limits**
- Response: `RateLimitResponse[]`
**PUT /admin/rate-limits/{provider_name}**
- Body: `{ max_requests: 1-1000, time_window_seconds: 1-3600 }`
- Response: `RateLimitResponse`
- Hot-reloads the in-memory provider rate limiter.
**GET /admin/users**
- Response: `AdminUserResponse[]`
**PUT /admin/users/{id}/role**
- Body: `{ role: "user" | "admin" }`
- Response: `{ message: string }`
**GET /health**
- Auth: Public
- Response: `{ status: "ok" }`
---
## 5. Generation Pipeline — Full Algorithm
### Startup & Background Tasks
- **Session cleanup**: an hourly background task deletes expired DB sessions (`db::sessions::delete_expired`).
- **Job store TTL**: expired job entries (older than 1 hour) are cleaned up via `JobStore::cleanup_expired`.
### Generation Lifecycle
`POST /api/v1/syntheses/generate` creates a job in the `JobStore`, then spawns two nested tasks:
- Inner task: wraps `run_generation` in a **15-minute `tokio::time::timeout`**. If the timeout fires, sends an `Error` progress event and releases the user lock.
- Outer task: monitors the inner task's `JoinHandle` for panics. If the inner task panics, sends an `Error` progress event and releases the user lock.
Progress is streamed to clients via a `tokio::sync::watch` channel (SSE endpoint subscribes to it).
### Initialization
1. **Load user settings** from DB (provider, models, batch_size, rate limits, etc.)
2. **Cleanup**: delete article history entries older than the retention window (`article_history_days`; dropped entries only) and truncate old LLM call logs
3. **Validate** — if no categories configured, the only available category will be "Divers".
4. **Load theme** — categories, max_items_per_category, max_age_days, summary_length
5. **Load user sources** (personalized URLs filtered by theme_id)
6. **Resolve LLM provider** — decrypt user's API key, create provider instance (`Arc<dyn LlmProvider>`)
7. **Resolve models** — research model + web-search model (user override or admin default)
8. **Setup rate limiter** — per-user or global provider limiter
9. **Initialize tracking structures**: `article_scraped` (category→articles), `source_counts` (per-domain article count), `url_source` (per-article source), `filled_counts` (per-category article count), `seen_urls` (cross-phase dedup), `classification_categories` (user categories + "Divers")
10. **Batch trace buffer**: `pending_traces: Vec<ArticleHistoryEntry>` accumulates all article history writes; flushed with `db::article_history::batch_insert_entries` at phase boundaries.
### Phase 1: Personalized Sources
**Skipped entirely if user has 0 sources.**
#### 1a. Windowed source extraction
- Query `article_history` for the last source used. Reorder sources so the first source follows the last one used (rolling window).
- Separate preferred sources (processed first) from non-preferred, preserving rotation order within each group.
- Process sources in waves of `source_extraction_window` size:
- For each source in the wave: fetch page HTML, extract up to `max_links_per_source` article URLs via HTML parsing (same-domain, non-homepage, no static assets).
- **SSRF check** performed on each source URL before fetching.
- Deduplicate candidate URLs (case-insensitive, cross-source via `seen_urls`).
- **Filter against article history** — hash each URL (normalized: lowercase, strip fragments/UTM params/trailing slashes), batch-query `article_history` → remove matches. Trace dropped articles as `status: filtered_history`.
- **Preferred-first shuffle** — shuffle preferred URLs separately from non-preferred, then concatenate (preferred first).
- Track url → source in `url_source`.
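The rotation and preferred-first ordering described above can be sketched with plain slices. This is a minimal illustration, not the project's code: the `Source` struct, `order_sources`, and `waves` are hypothetical names, and the last-used source is assumed to be identified by its URL.

```rust
/// Hypothetical, simplified source record; the real model has more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Source {
    pub url: String,
    pub is_preferred: bool,
}

/// Rotate `sources` so the entry after `last_used_url` comes first, then
/// move preferred sources to the front while keeping the rotated order
/// inside each group.
pub fn order_sources(sources: &[Source], last_used_url: Option<&str>) -> Vec<Source> {
    let mut rotated = sources.to_vec();
    if let Some(last) = last_used_url {
        if let Some(pos) = rotated.iter().position(|s| s.url == last) {
            // Start with the source that follows the last one used.
            let shift = (pos + 1) % rotated.len();
            rotated.rotate_left(shift);
        }
    }
    // `partition` preserves relative order within both halves.
    let (mut preferred, rest): (Vec<Source>, Vec<Source>) =
        rotated.into_iter().partition(|s| s.is_preferred);
    preferred.extend(rest);
    preferred
}

/// Split the ordered sources into waves of `window` entries (minimum 1).
pub fn waves(ordered: &[Source], window: usize) -> Vec<Vec<Source>> {
    ordered.chunks(window.max(1)).map(|w| w.to_vec()).collect()
}
```

With sources `a, b (preferred), c, d` and `b` as the last source used, the rotation yields `c, d, a, b`, and the preferred-first pass yields `b, c, d, a`.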
#### 1b. Scrape, classify, and summarize articles (batched)
Processing in batches of `settings.batch_size` (minimum 1). For each batch:
**Batch assembly**: Pull up to `batch_size` candidates, skipping any where `source_counts[domain] >= max_articles_per_source` (traced as `filtered_diversity`).
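As a rough sketch of the batch assembly rule, assuming candidates sit in a queue of URL strings and `source_counts` is keyed by host name. The function name `assemble_batch` and the naive `domain_of` helper are illustrative only; real code would use a proper URL parser.

```rust
use std::collections::{HashMap, VecDeque};

/// Naive host extraction used only for this sketch.
fn domain_of(url: &str) -> &str {
    let rest = url.split("://").nth(1).unwrap_or(url);
    rest.split('/').next().unwrap_or(rest)
}

/// Pull up to `batch_size` candidates from the queue. Candidates whose domain
/// already reached `max_per_source` are returned separately so the caller can
/// trace them as `filtered_diversity`.
pub fn assemble_batch(
    queue: &mut VecDeque<String>,
    source_counts: &HashMap<String, usize>,
    batch_size: usize,
    max_per_source: usize,
) -> (Vec<String>, Vec<String>) {
    let limit = batch_size.max(1);
    let mut batch = Vec::new();
    let mut skipped = Vec::new();
    while batch.len() < limit {
        let url = match queue.pop_front() {
            Some(u) => u,
            None => break, // no candidates left
        };
        let count = source_counts.get(domain_of(&url)).copied().unwrap_or(0);
        if count >= max_per_source {
            skipped.push(url);
        } else {
            batch.push(url);
        }
    }
    (batch, skipped)
}
```

Skipped candidates do not consume a batch slot, so the loop keeps pulling until the batch is full or the queue is empty.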
**Phase A — Scrape batch in parallel** (`JoinSet`):
- SSRF check (no private IPs), 15s timeout, 5MB body limit.
- HTML parsing for title (`<title>`, `og:title`), date (meta tags, JSON-LD, `<time>`), body (strip scripts/nav), soft-404 detection.
- If article body is empty, is a soft-404, or is too old: trace as `filtered_empty` / `filtered_too_old` and skip.
**Phase B — Classify/summarize batch in parallel** (`JoinSet`):
- Check rate limit before classifying (waits up to 60s, then errors).
- Send article (title + body snippet based on `summary_length`: 500/2000/4000 chars) + categories + "Divers" to LLM.
- LLM returns `{title, summary, category, date, is_article}`.
- **`is_article` check**: if false, trace as `filtered_not_article` and skip.
- **Date fallback**: if the LLM returned a date and the article is older than `max_age_days`, trace as `filtered_too_old` and skip.
- **No-date routing**: if no date found (neither scraper nor LLM), route to "Articles sans date" category.
- **`assign_category()`** helper: validates category, falls back to "Divers" if unknown or full. If "Divers" is also full, drops the article.
- **LLM call logged** with full prompt/response/timing.
- Add article to `article_scraped`, increment `filled_counts` and `source_counts`.
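The category fallback rule can be written as a small pure function. The signature below is an assumption made for illustration; only the behavior (unknown or full category falls back to "Divers", a full "Divers" drops the article) comes from the description above.

```rust
use std::collections::HashMap;

const FALLBACK_CATEGORY: &str = "Divers";

/// Returns the category the article should be filed under, or `None` when
/// the article must be dropped.
pub fn assign_category(
    proposed: &str,
    allowed: &[String],
    filled_counts: &HashMap<String, usize>,
    max_items_per_category: usize,
) -> Option<String> {
    // A category has room while its count is below the per-category cap.
    let has_room = |name: &str| {
        filled_counts.get(name).copied().unwrap_or(0) < max_items_per_category
    };
    let known = allowed.iter().any(|c| c == proposed);
    if known && has_room(proposed) {
        return Some(proposed.to_string());
    }
    if has_room(FALLBACK_CATEGORY) {
        return Some(FALLBACK_CATEGORY.to_string());
    }
    None
}
```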
**Early exit**: After each batch, if total articles ≥ `(num_categories + 1) × max_items_per_category`, stop.
**Wave check**: After each wave, if synthesis is full, skip remaining waves.
**Trace flush**: Pending traces batch-inserted into `article_history` between waves.
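The early-exit threshold counts one extra slot group for "Divers" on top of the user-defined categories. A minimal sketch of that arithmetic, with `synthesis_is_full` as a hypothetical helper name:

```rust
/// True once the synthesis holds enough articles to fill every user-defined
/// category plus the "Divers" fallback category.
pub fn synthesis_is_full(
    total_articles: usize,
    num_user_categories: usize,
    max_items_per_category: usize,
) -> bool {
    let capacity = (num_user_categories + 1) * max_items_per_category;
    total_articles >= capacity
}
```

For example, with 4 user categories and the default `max_items_per_category` of 4, the capacity is (4 + 1) × 4 = 20 articles.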
### Phase 2: Web Search Fallback
**Skipped if all user-defined categories are already filled.**
#### 2a. Compute category gaps
For each user category: `needed = max_items_per_category - already_filled`. Only proceed if any category needs more.
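A sketch of the gap computation, assuming `filled_counts` maps category names to the number of articles already placed. The name `category_gaps` is illustrative; an empty result corresponds to skipping Phase 2.

```rust
use std::collections::HashMap;

/// For each user category, compute how many more articles are needed.
/// Categories that are already full are left out of the result.
pub fn category_gaps(
    categories: &[String],
    filled_counts: &HashMap<String, usize>,
    max_items_per_category: usize,
) -> Vec<(String, usize)> {
    categories
        .iter()
        .filter_map(|c| {
            let filled = filled_counts.get(c).copied().unwrap_or(0);
            // saturating_sub guards against a count above the cap.
            let needed = max_items_per_category.saturating_sub(filled);
            (needed > 0).then(|| (c.clone(), needed))
        })
        .collect()
}
```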
#### 2b. Choose path: Brave Search or LLM web search
Selected by `settings.use_brave_search`.
#### Path A: Brave Search (`use_brave_search = true`)
1. Resolve and decrypt the user's Brave Search API key (error if not configured).
2. Query: `"{theme} actualites"`, up to 20 results, freshness mapped from `max_age_days` (pd/pw/pm/py).
3. Filter results through **`filter_phase2_url()`**: homepage filter → cross-phase dedup → article history → source diversity.
4. Batch scrape + classify (same as Phase 1b, `source_type = "brave_search"`).
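The freshness mapping in step 2 might look like the following. The four codes (`pd`, `pw`, `pm`, `py`) come from the text above; the day thresholds are an assumption, since the specification does not state where each bucket begins.

```rust
/// Map `max_age_days` to a Brave Search freshness code.
/// Thresholds are assumed: up to 1 day, 7 days, 31 days, then a year.
pub fn brave_freshness(max_age_days: u32) -> &'static str {
    match max_age_days {
        0..=1 => "pd",  // past day
        2..=7 => "pw",  // past week
        8..=31 => "pm", // past month
        _ => "py",      // past year
    }
}
```

Under these assumed thresholds, the default `max_age_days` of 7 maps to `pw`.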
#### Path B: LLM Web Search (`use_brave_search = false`)
1. Build search prompt with theme, categories, gap counts.
2. Call LLM with `model_websearch`. Returns `{category_0: [{title, url, summary}], ...}`.
3. Filter URLs through **`filter_phase2_url()`**.
4. Scrape each result sequentially. Keep LLM-provided title/summary (no re-classification).
5. `source_type = "web_search"`.
### Save + Record
1. **Error if empty** — if all article lists are empty and generation wasn't cancelled, return error.
2. **Order sections** — user-defined categories first (in order), then "Divers" if non-empty, then "Articles sans date" if non-empty.
3. **Sanitize** — strip `\u0000` null bytes from JSON (PostgreSQL JSONB requirement).
4. **Save synthesis** — insert into `syntheses` table with `job_id`, `week` (ISO week), `sections` (JSONB), `status: completed`, `theme_id`.
5. **Record used articles** — for each article in the final synthesis, build trace with `status: "used"`, `synthesis_id`, and correct `source_type` (inferred from `url_source`). Batch-insert into `article_history`.
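The section ordering in step 2 can be sketched as follows. Articles are represented as plain strings for brevity, `order_sections` is a hypothetical name, and skipping empty user-defined categories is an assumption (the specification only states the non-empty condition for the two special categories).

```rust
use std::collections::HashMap;

const FALLBACK_CATEGORY: &str = "Divers";
const UNDATED_CATEGORY: &str = "Articles sans date";

/// Emit user-defined categories in their configured order, followed by the
/// two special categories. Empty sections are skipped (assumed for user
/// categories, stated for the special ones).
pub fn order_sections(
    user_categories: &[String],
    mut by_category: HashMap<String, Vec<String>>,
) -> Vec<(String, Vec<String>)> {
    let mut names: Vec<&str> = user_categories.iter().map(String::as_str).collect();
    names.push(FALLBACK_CATEGORY);
    names.push(UNDATED_CATEGORY);

    let mut ordered = Vec::new();
    for name in names {
        if let Some(items) = by_category.remove(name) {
            if !items.is_empty() {
                ordered.push((name.to_string(), items));
            }
        }
    }
    ordered
}
```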
### Shared Helpers
- **`build_trace_entry()`** — constructs an `ArticleHistoryEntry` from an `ArticleTrace` struct. Never writes to DB directly; caller accumulates in `pending_traces`.
- **`scrape_and_classify_batch()`** — shared batch processing logic used by Phase 1 and Phase 2 Brave paths.
- **`assign_category()`** — validates LLM-returned category, falls back to "Divers", drops if all full.
- **`filter_phase2_url()`** — async helper applying homepage/dedup/history/diversity filters for Phase 2.
- **`scrape_single_article()`** — thin wrapper around `scraper::scrape_url` returning `(body_text, page_title, final_url, drop_reason)`.
- **`hash_article_url()`** — normalizes URL (strips fragments, UTM params, trailing slashes, lowercases) then SHA-256 hashes.
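The normalization half of `hash_article_url()` can be illustrated with the standard library alone. This is a sketch under assumptions: the order of the steps and the function name `normalize_article_url` are not taken from the codebase, and the SHA-256 step is left out to keep the example dependency-free.

```rust
/// Normalize a URL for deduplication: drop the fragment, remove `utm_*`
/// query parameters, trim trailing slashes, and lowercase the result.
pub fn normalize_article_url(url: &str) -> String {
    // Drop the fragment.
    let without_fragment = url.split('#').next().unwrap_or("");
    // Split the path part from the query string.
    let (base, query) = match without_fragment.split_once('?') {
        Some((b, q)) => (b, Some(q)),
        None => (without_fragment, None),
    };
    // Keep every query parameter except the utm_* tracking ones.
    let kept: Vec<&str> = query
        .unwrap_or("")
        .split('&')
        .filter(|p| !p.is_empty() && !p.to_ascii_lowercase().starts_with("utm_"))
        .collect();
    let mut out = base.trim_end_matches('/').to_string();
    if !kept.is_empty() {
        out.push('?');
        out.push_str(&kept.join("&"));
    }
    out.to_lowercase()
}
```

Two links that differ only by tracking parameters, fragment, case, or a trailing slash normalize to the same string and would therefore produce the same hash.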
---
## 6. LLM Provider Abstraction
### Trait Definition
```rust
#[async_trait]
pub trait LlmProvider: Send + Sync {
fn provider_id(&self) -> &str;
async fn call_llm(&self, model: &str, system_prompt: &str,
user_prompt: &str, response_schema: &Value)
-> Result<Value, AppError>;
}
```
All calls use structured JSON output (response_schema defines the expected shape).
### Implementations
| Provider | Module | API Endpoint | Auth Method |
|---|---|---|---|
| Google Gemini | `llm/gemini.rs` | `generativelanguage.googleapis.com` | Query param `?key=` |
| OpenAI | `llm/openai.rs` | `api.openai.com/v1/chat/completions` | Bearer token |
| Anthropic | `llm/anthropic.rs` | `api.anthropic.com/v1/messages` | `x-api-key` header |
| Mock | `llm/mock.rs` | N/A (in-memory) | N/A |
### Factory
`llm/factory.rs` provides `create_provider(provider_name, api_key, http_client) -> Arc<dyn LlmProvider>`. Matches on provider name string.
### Response Schema
`llm/schema.rs` builds JSON Schema definitions for:
- Classification/summarization: `{title, summary, category, date, is_article}`
- Web search: `{category_0: [{title, url, summary}], ...}` with per-category arrays
- Source link extraction: `{links: [{url}]}`
### Error Mapping
`map_provider_http_error()` translates HTTP status codes to `AppError` variants:
- 400 -> BadRequest
- 401/403 -> BadRequest (invalid key)
- 404 -> BadRequest (model not found)
- 429/529 -> RateLimited
- Other -> Internal
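The status-code mapping above translates directly into a `match`. The `AppError` enum below is a reduced stand-in with only the three variants named in this section, and the function signature and message strings are assumptions.

```rust
/// Reduced stand-in for the application's error type.
#[derive(Debug, PartialEq)]
pub enum AppError {
    BadRequest(String),
    RateLimited,
    Internal(String),
}

/// Translate a provider HTTP status code into an `AppError`.
pub fn map_provider_http_error(status: u16, body: &str) -> AppError {
    match status {
        400 => AppError::BadRequest(format!("provider rejected request: {}", body)),
        401 | 403 => AppError::BadRequest("invalid API key".to_string()),
        404 => AppError::BadRequest("model not found".to_string()),
        // 529 is grouped with 429 as a rate-limit signal.
        429 | 529 => AppError::RateLimited,
        other => AppError::Internal(format!("provider returned HTTP {}: {}", other, body)),
    }
}
```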
---
## 7. Background Tasks
### Session Cleanup
Runs hourly via `tokio::spawn`. Calls `db::sessions::delete_expired` to remove sessions past their `expires_at` timestamp.
### Job Store Cleanup
`JobStore::cleanup_expired` removes job entries older than 1 hour (the TTL constant). Called periodically. Releases user locks for expired jobs.
### Scheduler
Runs every minute via `tokio::spawn` with a 60-second interval. For each tick:
1. `current_day_code()` -> "mon" through "sun"
2. `find_due_schedules(pool, day, time)` -> queries enabled schedules matching current day and time (HH:MM)
3. For each due schedule:
- Skip if `job_store.has_active_job(user_id)` returns `Some` (a manual generation is already in progress)
- Create a temporary `watch::channel` and `AtomicBool`
- Call `synthesis::run_generation_inner` directly (bypasses job store)
- On success: send emails to configured recipients (up to 3), mark schedule as run
- On failure: log error, do not mark as run
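The day and time matching in steps 1 and 2 can be illustrated without a database. The sketch below is a std-only stand-in: `day_code_from_unix`, `hhmm_from_unix`, and `is_due` are hypothetical helpers working on a Unix timestamp in UTC, whereas the real `find_due_schedules` runs as a query against `theme_schedules`.

```rust
/// Simplified view of a `theme_schedules` row.
pub struct Schedule {
    pub enabled: bool,
    pub days: Vec<String>,
    pub time_utc: String, // "HH:MM"
}

/// Day code ("mon" through "sun") for a Unix timestamp in UTC.
/// 1970-01-01 was a Thursday, hence the offset of 3 with Monday at index 0.
pub fn day_code_from_unix(secs: u64) -> &'static str {
    const CODES: [&str; 7] = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"];
    let days = secs / 86_400;
    CODES[((days + 3) % 7) as usize]
}

/// "HH:MM" (UTC) for a Unix timestamp.
pub fn hhmm_from_unix(secs: u64) -> String {
    let in_day = secs % 86_400;
    format!("{:02}:{:02}", in_day / 3_600, (in_day % 3_600) / 60)
}

/// A schedule is due when it is enabled and both day and minute match.
pub fn is_due(s: &Schedule, day_code: &str, now_hhmm: &str) -> bool {
    s.enabled && s.time_utc == now_hhmm && s.days.iter().any(|d| d == day_code)
}
```

Because the comparison is exact to the minute, a one-minute tick interval is what keeps a schedule from being missed.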
---
## 8. Configuration
### Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| DATABASE_URL | Yes | - | PostgreSQL connection string |
| MASTER_ENCRYPTION_KEY | Yes | - | 64 hex chars (32 bytes) for AES-256-GCM |
| APP_URL | Yes | - | Public URL (CORS, magic links, cookies). No trailing slash. |
| PORT | No | 8080 | HTTP server port |
| RUST_LOG | No | - | Logging filter (e.g., "info,ai_synth_backend=debug") |
| STATIC_DIR | No | ../frontend/dist | Path to built SolidJS files |
| RESEND_API_KEY | Yes | - | Resend email service API key |
| EMAIL_FROM | Yes | - | Sender address for emails |
| TURNSTILE_SECRET_KEY | Yes | - | Cloudflare Turnstile server secret |
| TURNSTILE_SITE_KEY | Yes | - | Cloudflare Turnstile client key |
| POSTGRES_PASSWORD | Yes | - | Used by docker-compose for DB container |
### Startup Validation
`AppConfig::validate()` checks at startup:
- `MASTER_ENCRYPTION_KEY` is exactly 64 hex characters
- `APP_URL` starts with http:// or https:// and has no trailing slash
The application refuses to start with invalid configuration.
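The two startup checks can be sketched as a standalone function. `validate_config` and its error strings are illustrative; the actual `AppConfig::validate()` may report errors differently.

```rust
/// Check the two invariants described above; returns a message on failure.
pub fn validate_config(master_key: &str, app_url: &str) -> Result<(), String> {
    // 64 hex characters decode to the 32-byte AES-256 key.
    if master_key.len() != 64 || !master_key.chars().all(|c| c.is_ascii_hexdigit()) {
        return Err("MASTER_ENCRYPTION_KEY must be exactly 64 hex characters".to_string());
    }
    if !(app_url.starts_with("http://") || app_url.starts_with("https://")) {
        return Err("APP_URL must start with http:// or https://".to_string());
    }
    if app_url.ends_with('/') {
        return Err("APP_URL must not end with a trailing slash".to_string());
    }
    Ok(())
}
```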
### User Settings Model
Default values applied when a user has no saved settings:
| Setting | Default | Range |
|---|---|---|
| max_articles_per_source | 3 | 1-10 |
| max_links_per_source | 8 | 1-30 |
| use_brave_search | false | boolean |
| article_history_days | 90 | 0-365 |
| batch_size | 5 | 1-20 |
| source_extraction_window | 3 | 1-10 |
| search_agent_behavior | "" | max 2000 chars |
| ai_provider | "" | max 100 chars |
| ai_model | "" | max 100 chars |
| ai_model_websearch | "" | max 100 chars |
| rate_limit_max_requests | null | >= 1 if set |
| rate_limit_time_window_seconds | null | >= 1 if set |
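The ranges in the table lend themselves to a single range-check helper. The sketch below covers the numeric fields and the `search_agent_behavior` length limit; `SettingsInput`, `check_range`, and `validate_settings` are hypothetical names, not the project's request types.

```rust
/// Subset of the settings payload used for this sketch.
pub struct SettingsInput {
    pub max_articles_per_source: i32,
    pub max_links_per_source: i32,
    pub article_history_days: i32,
    pub batch_size: i32,
    pub source_extraction_window: i32,
    pub search_agent_behavior: String,
}

fn check_range(name: &str, value: i32, min: i32, max: i32) -> Result<(), String> {
    if value < min || value > max {
        Err(format!("{} must be between {} and {}", name, min, max))
    } else {
        Ok(())
    }
}

/// Apply the ranges from the table; the first violation is returned.
pub fn validate_settings(s: &SettingsInput) -> Result<(), String> {
    check_range("max_articles_per_source", s.max_articles_per_source, 1, 10)?;
    check_range("max_links_per_source", s.max_links_per_source, 1, 30)?;
    check_range("article_history_days", s.article_history_days, 0, 365)?;
    check_range("batch_size", s.batch_size, 1, 20)?;
    check_range("source_extraction_window", s.source_extraction_window, 1, 10)?;
    if s.search_agent_behavior.chars().count() > 2000 {
        return Err("search_agent_behavior must be at most 2000 characters".to_string());
    }
    Ok(())
}
```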