From f07e91ba116849b1f69b0950513c707fac3d0326 Mon Sep 17 00:00:00 2001
From: oabrivard
Date: Fri, 27 Mar 2026 15:00:49 +0100
Subject: [PATCH] docs: add consolidated architecture.md and technical_specs.md

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 docs/architecture.md    | 382 +++++++++++++++++++
 docs/technical_specs.md | 793 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 1175 insertions(+)
 create mode 100644 docs/architecture.md
 create mode 100644 docs/technical_specs.md

diff --git a/docs/architecture.md b/docs/architecture.md
new file mode 100644
index 0000000..ae2a428
--- /dev/null
+++ b/docs/architecture.md
@@ -0,0 +1,382 @@
+# AI Weekly Synth -- Architecture Document
+
+## 1. System Overview
+
+AI Weekly Synth is a self-hosted web application that generates AI-powered weekly news syntheses. Users configure topics (themes), categories, and an LLM provider; the system then searches the web, scrapes and validates sources, classifies articles, and produces structured summaries.
+
+### Technology Stack
+
+| Layer | Technology |
+|---|---|
+| Backend | Rust (Axum 0.8) |
+| Frontend | SolidJS 1.9 + Tailwind CSS v4 |
+| Database | PostgreSQL 17 (via sqlx with compile-time query checking) |
+| Deployment | Docker Compose (app + Postgres) |
+
+### Deployment Topology
+
+```
+docker-compose.yml
+  ├── app (ai-synth)           port 8080
+  │     ├── Axum HTTP server
+  │     ├── Static file serving (SPA fallback)
+  │     └── Background tasks (scheduler, session cleanup, job TTL)
+  └── db (postgres:17-alpine)  port 5432 (localhost only)
+        └── postgres_data volume
+```
+
+The app container builds from a multi-stage Dockerfile, serves the SolidJS frontend as static files, and connects to Postgres over the `internal` bridge network.
+
+---
+
+## 2. Layer Architecture
+
+The backend follows a three-layer architecture with shared model types:
+
+```
+handlers/          (HTTP layer)
+  │
+  ├── extracts request data (Axum extractors, JSON, path params)
+  ├── validates input
+  ├── calls services/ or db/ directly
+  └── formats HTTP responses
+        │
+services/          (Business logic)
+  │
+  ├── synthesis pipeline orchestration
+  ├── LLM provider abstraction + factory
+  ├── scraping (articles, source pages)
+  ├── encryption, email, CSV, PDF export
+  ├── rate limiting, job store, scheduler
+  └── Brave Search client
+        │
+db/                (Data access)
+  │
+  ├── pure SQL queries via sqlx
+  ├── typed result mapping (FromRow)
+  └── no business logic
+        │
+models/            (Shared types -- used by all layers)
+  │
+  ├── domain structs (User, Theme, Source, Synthesis, etc.)
+  ├── request/response DTOs
+  └── validation logic
+```
+
+### Module Inventory
+
+**Handlers** (`handlers/`): `admin`, `api_keys`, `article_history`, `auth`, `config`, `generation`, `health`, `llm_logs`, `schedules`, `settings`, `sources`, `syntheses`, `themes`
+
+**Services** (`services/`): `auth`, `brave_search`, `csv`, `email`, `encryption`, `export`, `job_store`, `llm` (with `gemini`, `openai`, `anthropic`, `mock`, `factory`, `schema`), `prompts`, `rate_limiter`, `scheduler`, `scraper`, `source_scraper`, `synthesis`, `turnstile`
+
+**DB** (`db/`): `api_keys`, `article_history`, `audit`, `llm_call_log`, `magic_links`, `providers`, `rate_limits`, `schedules`, `sessions`, `settings`, `sources`, `syntheses`, `themes`, `users`
+
+**Models** (`models/`): `api_key`, `audit`, `magic_link`, `provider`, `rate_limit`, `schedule`, `session`, `settings`, `source`, `synthesis`, `theme`, `user`
+
+---
+
+## 3. Key Components
+
+### 3.1 LLM Provider Abstraction
+
+The `LlmProvider` trait defines a unified interface for all LLM backends:
+
+```rust
+#[async_trait]
+pub trait LlmProvider: Send + Sync {
+    fn provider_id(&self) -> &str;
+    async fn call_llm(&self, model: &str, system_prompt: &str,
+                      user_prompt: &str, response_schema: &Value)
+        -> Result;
+}
+```
+
+Implementations: `GeminiProvider`, `OpenAiProvider`, `AnthropicProvider`, `MockLlmProvider`.
+
+The factory (`llm/factory.rs`) creates provider instances by name. The mock provider enables end-to-end pipeline testing without real API calls.
+
+### 3.2 Synthesis Pipeline
+
+The pipeline is the core business logic, orchestrated in `services/synthesis.rs`. It runs as a background tokio task with a 15-minute timeout.
+
+**Three phases:**
+
+1. **Phase 1 -- Personalized Sources**: Extract article links from user-curated source pages (windowed, rolling), scrape articles, classify and summarize each via LLM. Batched processing with configurable `batch_size`.
+
+2. **Phase 2 -- Web Search Fallback**: For under-filled categories, either call the Brave Search API or use the LLM's web search capability to find additional articles. Scrape and validate results.
+
+3. **Save**: Assemble sections by category, sanitize JSON, persist to database, record article history traces.
+
+Progress is reported via `tokio::sync::watch` channels consumed by SSE endpoints.
+
+### 3.3 Job Store
+
+`JobStore` (`services/job_store.rs`) is an in-memory concurrent store for active generation jobs:
+
+- Backed by `DashMap` for lock-free access
+- `DashSet` for per-user deduplication (one active job per user)
+- Each job holds a `watch::Sender` for real-time SSE streaming
+- `AtomicBool` for cooperative cancellation
+- 1-hour TTL with automatic cleanup
+
+### 3.4 Scheduler
+
+`services/scheduler.rs` runs as a background task, checking every minute for due `theme_schedules`. When a schedule fires:
+
+1. Query `find_due_schedules` matching current day code + time
+2. Skip if user already has a manual generation in progress
+3. Run `synthesis::run_generation_inner` directly
+4. Send email to configured recipients (up to 3)
+5. Mark schedule as run
+
+### 3.5 Scraper
+
+Two scraping services:
+
+- **`scraper.rs`**: Article page scraper with SSRF prevention, HTML parsing, title/date/body extraction, soft-404 detection, 15s timeout, 5MB body limit.
+- **`source_scraper.rs`**: Source index page scraper that extracts article links from user-configured source URLs (HTML `<a>` parsing with filters, or LLM-assisted extraction).
+
+### 3.6 Rate Limiters
+
+- **Auth rate limiter**: 10 requests/60s per key (email or IP) for magic link endpoints.
+- **Provider rate limiter**: Per-LLM-provider sliding window, admin-configured, hot-reloaded from DB.
+- **User rate limiters**: Per-user generation rate limits cached in `DashMap`, recreated on settings change.
+
+---
+
+## 4. Data Model
+
+### Tables and Relationships
+
+```
+users
+  ├── sessions (user_id FK, CASCADE)
+  ├── magic_tokens (email reference, no FK)
+  ├── settings (user_id PK/FK, CASCADE)
+  ├── themes (user_id FK, CASCADE)
+  │     ├── sources (theme_id FK, CASCADE)
+  │     ├── syntheses (theme_id FK, SET NULL)
+  │     └── theme_schedules (theme_id FK, CASCADE, UNIQUE)
+  ├── user_api_keys (user_id FK, CASCADE; UNIQUE per provider)
+  ├── article_history (user_id FK, CASCADE)
+  ├── llm_call_log (user_id FK, CASCADE)
+  └── audit_log (admin_user_id FK, SET NULL)
+
+admin_providers
+  └── admin_rate_limits (provider_name FK, CASCADE)
+```
+
+### Table Summary
+
+| Table | Purpose | Key Columns |
+|---|---|---|
+| `users` | User accounts | id, email, display_name, role (user/admin), created_at |
+| `sessions` | Login sessions | session_hash (PK), user_id, expires_at, last_active_at, ip_address |
+| `magic_tokens` | Passwordless auth tokens | id, email, token_hash, expires_at, used |
+| `settings` | Per-user pipeline config | user_id (PK), ai_provider, ai_model, ai_model_websearch, batch_size, max_articles_per_source, max_links_per_source, use_brave_search, source_extraction_window, article_history_days, search_agent_behavior, rate_limit_max_requests, rate_limit_time_window_seconds |
+| `themes` | Per-user topic configurations | id, user_id, name, theme, categories (JSONB), max_items_per_category, max_age_days, summary_length |
+| `sources` | User-curated news source URLs | id, user_id, title, url, theme_id, is_preferred |
+| `syntheses` | Generated synthesis results | id, user_id, week, sections (JSONB), status, job_id, theme_id |
+| `theme_schedules` | Automated generation schedules | id, theme_id (UNIQUE), user_id, enabled, days (JSONB), time_utc, emails (JSONB), last_run_at |
+| `article_history` | Article URL dedup + provenance trace | id, user_id, url, url_hash, title, source_type, source_url, category, synthesis_id, status, scraped_ok, job_id, published_date |
+| `llm_call_log` | Full LLM interaction log | id, user_id, job_id, call_type, model, system_prompt, user_prompt, response_body, duration_ms, article_url |
+| `admin_providers` | Admin-curated LLM provider catalog | id, provider_name (UNIQUE), display_name, models_scraping (JSONB), models_websearch (JSONB), is_enabled |
+| `admin_rate_limits` | Per-provider rate limit config | id, provider_name (UNIQUE, FK), max_requests, time_window_seconds |
+| `user_api_keys` | Encrypted user LLM API keys | id, user_id, provider_name, encrypted_key (BYTEA), nonce (BYTEA), key_prefix; UNIQUE(user_id, provider_name) |
+| `audit_log` | Admin mutation audit trail | id, admin_user_id, action, target_type, target_id, details (JSONB) |
+
+---
+
+## 5. API Overview
+
+All API routes are prefixed with `/api/v1`. CSRF protection (`X-Requested-With` header) is applied to all mutating endpoints.
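Editor's sketch (not part of the patch): the CSRF rule stated above can be illustrated with a small standard-library function. This is only a sketch under assumptions: the function names and the `(name, value)` header representation are hypothetical, the GET/HEAD/OPTIONS exemption follows the document's own CSRF Protection section, and only the presence of the header is checked because the document does not say whether its value is inspected. The real check lives in an Axum middleware layer that this sketch does not reproduce.

```rust
/// Methods treated as non-mutating and therefore exempt from the CSRF check.
fn is_exempt(method: &str) -> bool {
    matches!(
        method.to_ascii_uppercase().as_str(),
        "GET" | "HEAD" | "OPTIONS"
    )
}

/// Returns true when the request may proceed.
///
/// `headers` is a hypothetical list of (name, value) pairs. Header names are
/// compared case-insensitively. Assumption: only presence is required; the
/// header value itself is not inspected.
fn csrf_check(method: &str, headers: &[(&str, &str)]) -> bool {
    if is_exempt(method) {
        return true;
    }
    headers
        .iter()
        .any(|(name, _value)| name.eq_ignore_ascii_case("x-requested-with"))
}
```

A browser cannot attach a custom header to a cross-site form post, which is why a plain presence check is a common lightweight defense for cookie-authenticated JSON APIs.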
+
+### Authentication
+
+| Method | Path | Auth | Description |
+|---|---|---|---|
+| POST | /auth/register | Public | Create account + send magic link |
+| POST | /auth/login | Public | Request magic link |
+| GET | /auth/verify | Public | Verify token (email click redirect) |
+| POST | /auth/verify | Public | Verify token (frontend API call) |
+| POST | /auth/logout | Authenticated | Destroy session |
+| GET | /auth/me | Authenticated | Current user info |
+
+### Settings
+
+| Method | Path | Auth | Description |
+|---|---|---|---|
+| GET | /settings | Authenticated | Get user settings |
+| PUT | /settings | Authenticated | Update user settings |
+
+### Themes
+
+| Method | Path | Auth | Description |
+|---|---|---|---|
+| GET | /themes | Authenticated | List user themes |
+| POST | /themes | Authenticated | Create theme |
+| PUT | /themes/{id} | Authenticated | Update theme |
+| DELETE | /themes/{id} | Authenticated | Delete theme |
+
+### Schedules
+
+| Method | Path | Auth | Description |
+|---|---|---|---|
+| GET | /themes/{id}/schedule | Authenticated | Get theme schedule |
+| PUT | /themes/{id}/schedule | Authenticated | Create or update schedule |
+| DELETE | /themes/{id}/schedule | Authenticated | Delete schedule |
+
+### Sources
+
+| Method | Path | Auth | Description |
+|---|---|---|---|
+| GET | /sources | Authenticated | List sources |
+| POST | /sources | Authenticated | Create source |
+| PUT | /sources/preferred | Authenticated | Update preferred sources |
+| DELETE | /sources/{id} | Authenticated | Delete source |
+| POST | /sources/bulk | Authenticated | Bulk import (JSON) |
+| POST | /sources/import-csv | Authenticated | Import from CSV |
+| GET | /sources/export-csv | Authenticated | Export as CSV |
+
+### Syntheses & Generation
+
+| Method | Path | Auth | Description |
+|---|---|---|---|
+| GET | /syntheses | Authenticated | List syntheses |
+| GET | /syntheses/{id} | Authenticated | Get full synthesis |
+| DELETE | /syntheses/{id} | Authenticated | Delete synthesis |
+| POST | /syntheses/generate | Authenticated | Trigger generation |
+| GET | /syntheses/generate/{job_id}/progress | Authenticated | SSE progress stream |
+| POST | /syntheses/generate/{job_id}/stop | Authenticated | Cancel generation |
+| POST | /syntheses/{id}/send-email | Authenticated | Email synthesis |
+| GET | /syntheses/{id}/export/markdown | Authenticated | Markdown download |
+| GET | /syntheses/{id}/export/pdf | Authenticated | PDF download |
+
+### Article History & LLM Logs
+
+| Method | Path | Auth | Description |
+|---|---|---|---|
+| GET | /article-history | Authenticated | List article history |
+| DELETE | /article-history | Authenticated | Clear article history |
+| GET | /syntheses/{id}/provenance | Authenticated | Get synthesis provenance |
+| GET | /llm-logs/{job_id} | Authenticated | Get LLM call logs for job |
+
+### User API Keys
+
+| Method | Path | Auth | Description |
+|---|---|---|---|
+| GET | /user/api-keys | Authenticated | List keys (prefix only) |
+| POST | /user/api-keys | Authenticated | Store encrypted key |
+| DELETE | /user/api-keys/{provider} | Authenticated | Delete key |
+| POST | /user/api-keys/{provider}/test | Authenticated | Test key validity |
+| POST | /user/api-keys/export | Authenticated | Export keys |
+
+### Configuration & Admin
+
+| Method | Path | Auth | Description |
+|---|---|---|---|
+| GET | /config/providers | Authenticated | Available providers/models |
+| GET | /admin/providers | Admin | List all providers |
+| POST | /admin/providers | Admin | Create provider |
+| PUT | /admin/providers/{id} | Admin | Update provider |
+| DELETE | /admin/providers/{id} | Admin | Delete provider |
+| GET | /admin/rate-limits | Admin | List rate limits |
+| PUT | /admin/rate-limits/{provider_name} | Admin | Update rate limit |
+| GET | /admin/users | Admin | List users |
+| PUT | /admin/users/{id}/role | Admin | Change user role |
+
+### Infrastructure
+
+| Method | Path | Auth | Description |
+|---|---|---|---|
+| GET | /health | Public | Health check |
+
+---
+
+## 6. Security Architecture
+
+### Authentication & Session Management
+
+- **Passwordless**: Magic link tokens sent via email (Resend API), single-use, time-limited
+- **Captcha**: Cloudflare Turnstile on registration and login
+- **Sessions**: SHA-256 hashed tokens stored in DB, 30-day expiry, `HttpOnly` + `SameSite=Lax` cookies, optionally `Secure`
+- **Anti-enumeration**: Same response for existent/non-existent emails, timing attack mitigation
+- **Authorization**: `AuthUser` and `AdminUser` Axum extractors enforce auth levels per handler
+
+### CSRF Protection
+
+All mutating API endpoints require the `X-Requested-With` header (checked by `csrf::csrf_check` middleware layer). Non-mutating GET/HEAD/OPTIONS requests are exempt.
+
+### Encryption at Rest
+
+User LLM API keys are encrypted with AES-256-GCM before storage:
+- 32-byte master key from `MASTER_ENCRYPTION_KEY` env var (64 hex chars)
+- Random 12-byte nonce per encryption (stored alongside ciphertext)
+- Key bytes are zeroized on drop (`zeroize` crate)
+- Only a key prefix (first 8 chars + "...") is ever returned via the API
+
+### SSRF Prevention
+
+Both `scraper.rs` and `source_scraper.rs` validate URLs before fetching:
+- DNS resolution check against private/loopback IP ranges
+- Redirect chain validation (no redirects to private IPs)
+- Only HTTP/HTTPS schemes allowed
+
+### Security Headers
+
+Applied as global middleware layers:
+- `Content-Security-Policy` (self + Cloudflare Turnstile)
+- `X-Content-Type-Options: nosniff`
+- `X-Frame-Options: DENY`
+- `Referrer-Policy: strict-origin-when-cross-origin`
+- `X-XSS-Protection: 1; mode=block`
+- `Strict-Transport-Security` (HTTPS only)
+
+### Error Sanitization
+
+The `sanitize_error_message` function strips API keys and internal details from error messages before they reach SSE clients. Internal errors log full details server-side but return generic messages to users.
+
+### CORS
+
+Configured to allow only the `APP_URL` origin, with credentials (cookies), limited to GET/POST/PUT/DELETE methods.
+
+---
+
+## 7. Concurrency Model
+
+### Async Runtime
+
+Tokio with full features. The Axum server runs as a multi-threaded async runtime.
+
+### Background Tasks
+
+Spawned at startup via `tokio::spawn`:
+- **Session cleanup**: Hourly deletion of expired DB sessions
+- **Job store cleanup**: Periodic removal of expired job entries (1-hour TTL)
+- **Scheduler**: Minute-by-minute check for due theme schedules
+
+### Generation Pipeline Concurrency
+
+- **`tokio::task::JoinSet`**: Used for parallel scraping (bounded concurrency of 5 for source extraction) and parallel LLM classification calls within each batch
+- **`tokio::sync::watch`**: Fan-out progress notifications to SSE clients; late subscribers immediately receive the latest state
+- **`AtomicBool`**: Cooperative cancellation flag checked between pipeline stages; avoids mutex overhead
+- **`DashMap` / `DashSet`**: Lock-free concurrent access for the job store (job entries), generating-users set, per-user rate limiter cache, and provider rate limiter state
+
+### Task Lifecycle
+
+```
+POST /generate
+  └── handler creates job in JobStore
+        └── spawns outer task (panic monitor)
+              └── spawns inner task (15-min timeout)
+                    └── run_generation_inner()
+                          ├── Phase 1 (JoinSet scrape, JoinSet classify)
+                          ├── Phase 2 (JoinSet scrape, JoinSet classify)
+                          └── Save to DB
+              └── on complete/error: send final ProgressEvent
+              └── delayed cleanup (5 min) then remove from JobStore
+```
+
+### Graceful Shutdown
+
+The server supports graceful shutdown via signal handling, allowing in-flight requests to complete.
diff --git a/docs/technical_specs.md b/docs/technical_specs.md
new file mode 100644
index 0000000..c4ee605
--- /dev/null
+++ b/docs/technical_specs.md
@@ -0,0 +1,793 @@
+# AI Weekly Synth -- Technical Specifications
+
+## 1. Backend Tech Stack
+
+| Dependency | Version | Purpose |
+|---|---|---|
+| axum | 0.8 | Web framework (macros, multipart) |
+| tokio | 1 | Async runtime (full features) |
+| tower | 0.5 | Middleware composition |
+| tower-http | 0.6 | CORS, static files, tracing, headers |
+| sqlx | 0.8 | Async Postgres driver (runtime-tokio, tls-rustls, uuid, chrono, json, migrate) |
+| reqwest | 0.12 | HTTP client (JSON) |
+| serde / serde_json | 1 | Serialization/deserialization |
+| chrono | 0.4 | Date/time handling (serde feature) |
+| aes-gcm | 0.10 | AES-256-GCM encryption |
+| zeroize | 1 | Secure memory zeroing |
+| sha2 | 0.10 | SHA-256 hashing |
+| rand | 0.8 | Random number generation |
+| base64 | 0.22 | Base64 encoding |
+| hex | 0.4 | Hex encoding/decoding |
+| async-trait | 0.1 | Async trait objects |
+| tracing / tracing-subscriber | 0.1 / 0.3 | Structured logging (env-filter, json) |
+| dotenvy | 0.15 | .env file loading |
+| clap | 4 | CLI argument parsing |
+| scraper | 0.22 | HTML parsing (CSS selectors) |
+| ego-tree | 0.10 | Tree data structure (used by scraper) |
+| url | 2 | URL parsing and validation |
+| email_address | 0.2 | Email validation |
+| anyhow | 1 | Error context |
+| thiserror | 2 | Error type derivation |
+| uuid | 1 | UUID v4 generation (serde feature) |
+| dashmap | 6 | Concurrent hash maps |
+| tokio-stream | 0.1 | Stream utilities for SSE |
+| futures | 0.3 | Async stream combinators |
+| printpdf | 0.7 | PDF generation |
+
+**Dev dependencies**: tower (util), http-body-util, wiremock 0.6.
+
+**Rust edition**: 2021.
+
+---
+
+## 2. Frontend Tech Stack
+
+| Dependency | Version | Purpose |
+|---|---|---|
+| solid-js | ^1.9.0 | Reactive UI framework |
+| @solidjs/router | ^0.15.0 | Client-side routing |
+| lucide-solid | ^0.475.0 | Icon library |
+| date-fns | ^4.1.0 | Date formatting |
+| tailwindcss | ^4.1.0 | Utility-first CSS (v4) |
+| @tailwindcss/vite | ^4.1.0 | Tailwind Vite plugin |
+| vite | ^6.2.0 | Build tool and dev server |
+| vite-plugin-solid | ^2.11.0 | SolidJS Vite integration |
+| typescript | ~5.8.0 | Type checking |
+| vitest | ^3.0.0 | Unit testing |
+| @solidjs/testing-library | ^0.8.0 | Component testing |
+| jsdom | ^25.0.0 | DOM environment for tests |
+
+### Frontend Routes
+
+| Path | Component | Auth | Description |
+|---|---|---|---|
+| /login | Login | Public | Login page |
+| /register | Register | Public | Registration page |
+| /auth/verify | AuthVerify | Public | Magic link verification |
+| / | Home | Protected | Dashboard / synthesis list |
+| /settings | Settings | Protected | User settings |
+| /themes | ThemeManager | Protected | Theme CRUD + source management |
+| /generate | GenerateSynthesis | Protected | Generation trigger + progress |
+| /synthesis/:id | SynthesisDetail | Protected | Full synthesis view |
+| /article-history | ArticleHistory | Protected | Article history browser |
+| /llm-logs/:jobId | LlmLogs | Protected | LLM call log viewer |
+| /admin/providers | AdminProviders | Admin | Provider configuration |
+| /admin/rate-limits | AdminRateLimits | Admin | Rate limit configuration |
+| /admin/users | AdminUsers | Admin | User management |
+
+---
+
+## 3. Database Schema
+
+### 3.1 `users`
+
+| Column | Type | Constraints |
+|---|---|---|
+| id | UUID | PK, DEFAULT gen_random_uuid() |
+| email | TEXT | NOT NULL, UNIQUE |
+| display_name | TEXT | nullable |
+| role | TEXT | NOT NULL, DEFAULT 'user', CHECK (user/admin) |
+| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
+| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
+
+Indexes: `idx_users_email` on (email).
+
+### 3.2 `sessions`
+
+| Column | Type | Constraints |
+|---|---|---|
+| session_hash | TEXT | PK (SHA-256 of raw token) |
+| user_id | UUID | NOT NULL, FK users(id) CASCADE |
+| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
+| expires_at | TIMESTAMPTZ | NOT NULL |
+| last_active_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
+| ip_address | TEXT | nullable |
+| user_agent | TEXT | nullable |
+
+Indexes: `idx_sessions_user_id`, `idx_sessions_expires_at`.
+
+### 3.3 `magic_tokens`
+
+| Column | Type | Constraints |
+|---|---|---|
+| id | UUID | PK, DEFAULT gen_random_uuid() |
+| email | TEXT | NOT NULL |
+| token_hash | TEXT | NOT NULL, UNIQUE |
+| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
+| expires_at | TIMESTAMPTZ | NOT NULL |
+| used | BOOLEAN | NOT NULL, DEFAULT false |
+
+Indexes: `idx_magic_tokens_email`, `idx_magic_tokens_expires`.
+
+### 3.4 `settings`
+
+Per-user pipeline configuration. One row per user (user_id is the PK).
+
+| Column | Type | Constraints |
+|---|---|---|
+| user_id | UUID | PK, FK users(id) CASCADE |
+| max_articles_per_source | INTEGER | NOT NULL, DEFAULT 3 |
+| max_links_per_source | INTEGER | NOT NULL, DEFAULT 8 |
+| use_brave_search | BOOLEAN | NOT NULL, DEFAULT false |
+| article_history_days | INTEGER | NOT NULL, DEFAULT 90 |
+| batch_size | INTEGER | NOT NULL, DEFAULT 5 |
+| source_extraction_window | INTEGER | NOT NULL, DEFAULT 3 |
+| search_agent_behavior | TEXT | NOT NULL, DEFAULT '' |
+| ai_provider | TEXT | NOT NULL, DEFAULT '' |
+| ai_model | TEXT | NOT NULL, DEFAULT '' |
+| ai_model_websearch | TEXT | NOT NULL, DEFAULT '' |
+| rate_limit_max_requests | INTEGER | nullable |
+| rate_limit_time_window_seconds | INTEGER | nullable |
+| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
+
+### 3.5 `themes`
+
+Per-user topic configurations with content settings.
+
+| Column | Type | Constraints |
+|---|---|---|
+| id | UUID | PK, DEFAULT gen_random_uuid() |
+| user_id | UUID | NOT NULL, FK users(id) CASCADE |
+| name | TEXT | NOT NULL |
+| theme | TEXT | NOT NULL (search topic) |
+| categories | JSONB | NOT NULL, DEFAULT '[]' |
+| max_items_per_category | INTEGER | NOT NULL, DEFAULT 4 |
+| max_age_days | INTEGER | NOT NULL, DEFAULT 7 |
+| summary_length | INTEGER | NOT NULL, DEFAULT 3 |
+| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
+| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
+
+Indexes: `idx_themes_user_id`.
+
+### 3.6 `sources`
+
+User-curated news source URLs, optionally tied to a theme.
+
+| Column | Type | Constraints |
+|---|---|---|
+| id | UUID | PK, DEFAULT gen_random_uuid() |
+| user_id | UUID | NOT NULL, FK users(id) CASCADE |
+| title | VARCHAR(200) | NOT NULL, CHECK length 1-200 |
+| url | VARCHAR(1000) | NOT NULL, CHECK length <= 1000 |
+| theme_id | UUID | nullable, FK themes(id) CASCADE |
+| is_preferred | BOOLEAN | NOT NULL, DEFAULT false |
+| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
+
+Indexes: `idx_sources_user_id`, UNIQUE `idx_sources_user_id_url` on (user_id, url).
+
+### 3.7 `syntheses`
+
+Generated synthesis results with JSONB section data.
+
+| Column | Type | Constraints |
+|---|---|---|
+| id | UUID | PK, DEFAULT gen_random_uuid() |
+| user_id | UUID | NOT NULL, FK users(id) CASCADE |
+| week | VARCHAR(10) | NOT NULL (ISO week string) |
+| sections | JSONB | NOT NULL, DEFAULT '[]' |
+| status | VARCHAR(20) | NOT NULL, DEFAULT 'completed' |
+| job_id | UUID | nullable |
+| theme_id | UUID | nullable, FK themes(id) SET NULL |
+| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
+
+Indexes: `idx_syntheses_user_id_created_at` on (user_id, created_at DESC).
+
+JSONB structure for `sections`:
+```json
+[
+  {
+    "title": "Category Name",
+    "items": [
+      { "title": "Article Title", "url": "https://...", "summary": "...", "date": "2026-03-25" }
+    ]
+  }
+]
+```
+
+### 3.8 `theme_schedules`
+
+Automated generation schedules, one per theme.
+
+| Column | Type | Constraints |
+|---|---|---|
+| id | UUID | PK, DEFAULT gen_random_uuid() |
+| theme_id | UUID | NOT NULL, UNIQUE, FK themes(id) CASCADE |
+| user_id | UUID | NOT NULL, FK users(id) CASCADE |
+| enabled | BOOLEAN | NOT NULL, DEFAULT true |
+| days | JSONB | NOT NULL, DEFAULT '[]' (e.g. ["mon","fri"]) |
+| time_utc | TEXT | NOT NULL, DEFAULT '08:00' (HH:MM) |
+| emails | JSONB | NOT NULL, DEFAULT '[]' (up to 3 addresses) |
+| last_run_at | TIMESTAMPTZ | nullable |
+| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
+| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
+
+Indexes: `idx_theme_schedules_enabled` (partial, WHERE enabled = true).
+
+### 3.9 `article_history`
+
+Article URL deduplication and full provenance tracing.
+
+| Column | Type | Constraints |
+|---|---|---|
+| id | UUID | PK, DEFAULT gen_random_uuid() |
+| user_id | UUID | NOT NULL, FK users(id) CASCADE |
+| url_hash | TEXT | NOT NULL (SHA-256 of normalized URL) |
+| url | TEXT | NOT NULL |
+| title | TEXT | NOT NULL, DEFAULT '' |
+| source_type | TEXT | NOT NULL, DEFAULT 'unknown' |
+| source_url | TEXT | nullable |
+| category | TEXT | nullable |
+| synthesis_id | UUID | nullable, FK syntheses(id) SET NULL |
+| status | TEXT | NOT NULL, DEFAULT 'used' |
+| scraped_ok | BOOLEAN | NOT NULL, DEFAULT true |
+| job_id | UUID | NOT NULL |
+| published_date | TEXT | nullable |
+| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
+
+Indexes: `idx_article_history_user_url` on (user_id, url_hash), `idx_article_history_job_id`.
+
+Status values: `used`, `filtered_history`, `filtered_diversity`, `filtered_not_article`, `filtered_too_old`, `filtered_empty`, `filtered_homepage`, `filtered_cross_phase_dedup`.
+
+Source type values: `personalized_source`, `brave_search`, `web_search`.
+
+### 3.10 `llm_call_log`
+
+Full LLM interaction logging for debugging and analysis.
+
+| Column | Type | Constraints |
+|---|---|---|
+| id | UUID | PK, DEFAULT gen_random_uuid() |
+| user_id | UUID | NOT NULL, FK users(id) CASCADE |
+| job_id | UUID | NOT NULL |
+| call_type | TEXT | NOT NULL |
+| model | TEXT | NOT NULL |
+| system_prompt | TEXT | NOT NULL, DEFAULT '' |
+| user_prompt | TEXT | NOT NULL, DEFAULT '' |
+| response_body | TEXT | NOT NULL, DEFAULT '' |
+| duration_ms | INTEGER | NOT NULL, DEFAULT 0 |
+| article_url | TEXT | nullable |
+| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
+
+Indexes: `idx_llm_call_log_job_id`, `idx_llm_call_log_user_id` on (user_id, created_at).
+
+### 3.11 `admin_providers`
+
+Admin-curated catalog of LLM providers and their models.
+
+| Column | Type | Constraints |
+|---|---|---|
+| id | UUID | PK, DEFAULT gen_random_uuid() |
+| provider_name | VARCHAR(50) | NOT NULL, UNIQUE |
+| display_name | VARCHAR(100) | NOT NULL |
+| models_scraping | JSONB | NOT NULL, DEFAULT '[]' |
+| models_websearch | JSONB | NOT NULL, DEFAULT '[]' |
+| is_enabled | BOOLEAN | NOT NULL, DEFAULT true |
+| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
+| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
+
+Indexes: `idx_admin_providers_enabled` (partial, WHERE is_enabled = true).
+
+Seeded with: gemini, openai, anthropic.
+
+JSONB model structure:
+```json
+[{"model_id": "gemini-2.5-pro", "display_name": "Gemini 2.5 Pro", "is_default": true}]
+```
+
+### 3.12 `admin_rate_limits`
+
+Per-provider rate limit configuration.
+
+| Column | Type | Constraints |
+|---|---|---|
+| id | UUID | PK, DEFAULT gen_random_uuid() |
+| provider_name | VARCHAR(50) | NOT NULL, UNIQUE, FK admin_providers(provider_name) CASCADE |
+| max_requests | INTEGER | NOT NULL, DEFAULT 30 |
+| time_window_seconds | INTEGER | NOT NULL, DEFAULT 60 |
+| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
+
+Seeded defaults: gemini 29/60s, openai 50/60s, anthropic 40/60s.
+
+### 3.13 `user_api_keys`
+
+Encrypted user LLM API keys.
+
+| Column | Type | Constraints |
+|---|---|---|
+| id | UUID | PK, DEFAULT gen_random_uuid() |
+| user_id | UUID | NOT NULL, FK users(id) CASCADE |
+| provider_name | VARCHAR(50) | NOT NULL |
+| encrypted_key | BYTEA | NOT NULL |
+| nonce | BYTEA | NOT NULL |
+| key_prefix | VARCHAR(20) | NOT NULL |
+| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
+| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
+
+Constraint: UNIQUE(user_id, provider_name). Valid providers: gemini, openai, anthropic, brave_search.
+
+### 3.14 `audit_log`
+
+Admin mutation audit trail.
+
+| Column | Type | Constraints |
+|---|---|---|
+| id | UUID | PK, DEFAULT gen_random_uuid() |
+| admin_user_id | UUID | nullable, FK users(id) SET NULL |
+| action | VARCHAR(100) | NOT NULL |
+| target_type | VARCHAR(50) | nullable |
+| target_id | VARCHAR(255) | nullable |
+| details | JSONB | nullable |
+| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
+
+Indexes: `idx_audit_log_created_at` (DESC), `idx_audit_log_admin_user`.
+
+---
+
+## 4. API Endpoints
+
+All endpoints are prefixed with `/api/v1`. Responses are JSON. Errors follow the shape `{ "error": "message" }`.
+
+### 4.1 Authentication
+
+**POST /auth/register**
+- Auth: Public
+- Body: `{ email: string, display_name?: string, turnstile_token: string }`
+- Response: `{ message: string }`
+- Sends magic link email. Rate limited.
+
+**POST /auth/login**
+- Auth: Public
+- Body: `{ email: string, turnstile_token: string }`
+- Response: `{ message: string }`
+- Sends magic link email. Rate limited.
+
+**GET /auth/verify?token=...&email=...**
+- Auth: Public
+- Response: Redirect to frontend with session cookie set.
+
+**POST /auth/verify**
+- Auth: Public
+- Body: `{ token: string, email: string }`
+- Response: `{ message: string, user: User }`
+- Sets `session` HttpOnly cookie (30-day expiry).
+
+**POST /auth/logout**
+- Auth: Authenticated
+- Response: `{ message: string }`
+- Clears session cookie and deletes DB session.
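Editor's sketch (not part of the patch): the `session` cookie set by `POST /auth/verify` and cleared by `POST /auth/logout` might be serialized roughly as below. The cookie name, `HttpOnly`, `SameSite=Lax`, the optional `Secure` flag, and the 30-day lifetime come from the document; the helper names, `Path=/`, and the use of `Max-Age` rather than `Expires` are assumptions made for illustration.

```rust
/// 30 days expressed in seconds, matching the documented session lifetime.
const SESSION_MAX_AGE_SECS: u64 = 30 * 24 * 60 * 60;

/// Builds a `Set-Cookie` value for a freshly verified session.
/// Assumption: `Path=/` and `Max-Age` are illustrative choices.
fn session_cookie(raw_token: &str, secure: bool) -> String {
    let mut cookie = format!(
        "session={}; Path=/; HttpOnly; SameSite=Lax; Max-Age={}",
        raw_token, SESSION_MAX_AGE_SECS
    );
    if secure {
        cookie.push_str("; Secure");
    }
    cookie
}

/// Builds a `Set-Cookie` value that asks the browser to drop the session
/// cookie, as a logout handler would (`Max-Age=0` expires it immediately).
fn clear_session_cookie(secure: bool) -> String {
    let mut cookie = String::from("session=; Path=/; HttpOnly; SameSite=Lax; Max-Age=0");
    if secure {
        cookie.push_str("; Secure");
    }
    cookie
}
```

Only the raw token travels in the cookie; per the `sessions` table, the database stores its SHA-256 hash, so a leaked table does not yield usable session tokens.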
+
+**GET /auth/me**
+- Auth: Authenticated
+- Response: `{ id, email, display_name, role, created_at }`
+
+### 4.2 Settings
+
+**GET /settings**
+- Auth: Authenticated
+- Response: `UserSettings` (creates defaults if not exists)
+
+**PUT /settings**
+- Auth: Authenticated
+- Body: `UpdateSettingsRequest` (all fields required)
+- Validation: max_articles_per_source 1-10, max_links_per_source 1-30, batch_size 1-20, source_extraction_window 1-10, article_history_days 0-365, search_agent_behavior max 2000 chars, ai_provider/ai_model/ai_model_websearch max 100 chars.
+- Response: Updated `UserSettings`
+
+### 4.3 Themes
+
+**GET /themes**
+- Auth: Authenticated
+- Response: `ThemeResponse[]`
+
+**POST /themes**
+- Auth: Authenticated
+- Body: `{ name, theme, categories: string[], max_items_per_category?, max_age_days?, summary_length? }`
+- Validation: name non-empty max 200 chars, categories 1-20 non-empty entries, max_items 1-50, max_age 1-365, summary_length 1-3.
+- Response: `ThemeResponse`
+
+**PUT /themes/{id}**
+- Auth: Authenticated (owner only)
+- Body: `UpdateThemeRequest` (all fields optional)
+- Response: `ThemeResponse`
+
+**DELETE /themes/{id}**
+- Auth: Authenticated (owner only)
+- Response: 204 No Content
+
+### 4.4 Schedules
+
+**GET /themes/{id}/schedule**
+- Auth: Authenticated (theme owner)
+- Response: `ScheduleResponse` or 404
+
+**PUT /themes/{id}/schedule**
+- Auth: Authenticated (theme owner)
+- Body: `{ enabled, days: string[], time_utc: "HH:MM", emails: string[] }`
+- Validation: days from mon-sun, time HH:MM format, max 3 emails.
+- Response: `ScheduleResponse`
+
+**DELETE /themes/{id}/schedule**
+- Auth: Authenticated (theme owner)
+- Response: 204 No Content
+
+### 4.5 Sources
+
+**GET /sources?theme_id=...**
+- Auth: Authenticated
+- Response: `SourceResponse[]`
+
+**POST /sources**
+- Auth: Authenticated
+- Body: `{ title, url, theme_id? }`
+- Validation: title non-empty max 200, URL http(s) max 1000 chars.
+- Response: `SourceResponse`
+
+**PUT /sources/preferred**
+- Auth: Authenticated
+- Body: `{ source_ids: UUID[] }`
+- Response: `{ updated: number }`
+
+**DELETE /sources/{id}**
+- Auth: Authenticated (owner only)
+- Response: 204 No Content
+
+**POST /sources/bulk**
+- Auth: Authenticated
+- Body: `{ sources: CreateSourceRequest[], theme_id? }`
+- Response: `{ imported, skipped, errors }`
+
+**POST /sources/import-csv**
+- Auth: Authenticated
+- Body: Multipart file upload (CSV: title,url)
+- Response: `{ imported, skipped, errors }`
+
+**GET /sources/export-csv**
+- Auth: Authenticated
+- Response: CSV file download
+
+### 4.6 Generation
+
+**POST /syntheses/generate**
+- Auth: Authenticated
+- Body: `{ theme_id: UUID }`
+- Response: `{ job_id: UUID }`
+- Creates job in JobStore, spawns background generation task. Returns 409 if user already has active job.
+
+**GET /syntheses/generate/{job_id}/progress**
+- Auth: Authenticated (job owner)
+- Response: SSE stream of `ProgressEvent`
+- Events: `progress` (step, message, percent), `complete` (synthesis_id), `error` (message).
+
+**POST /syntheses/generate/{job_id}/stop**
+- Auth: Authenticated (job owner)
+- Response: `{ message: string }`
+- Sets cooperative cancellation flag.
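
The generation endpoints imply two mechanisms: a one-active-job-per-user rule (the 409 response) and a cooperative cancellation flag that the worker polls. The sketch below shows one way these could fit together using only the standard library. The `JobStore` shape, the `u64` identifiers (UUIDs in the real API), and `run_steps` are all hypothetical stand-ins, not the project's implementation.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the real job record.
struct Job {
    user_id: u64,
    cancelled: Arc<AtomicBool>,
}

/// Hypothetical in-memory store allowing at most one active job per user.
#[derive(Default)]
struct JobStore {
    next_id: Mutex<u64>,
    jobs: Mutex<HashMap<u64, Job>>,
}

#[derive(Debug, PartialEq)]
enum StartError {
    /// A handler would map this to HTTP 409.
    AlreadyActive,
}

impl JobStore {
    /// Registers a job and returns its id plus the shared cancellation flag.
    fn start(&self, user_id: u64) -> Result<(u64, Arc<AtomicBool>), StartError> {
        let mut jobs = self.jobs.lock().unwrap();
        if jobs.values().any(|job| job.user_id == user_id) {
            return Err(StartError::AlreadyActive);
        }
        let mut next = self.next_id.lock().unwrap();
        *next += 1;
        let flag = Arc::new(AtomicBool::new(false));
        jobs.insert(*next, Job { user_id, cancelled: Arc::clone(&flag) });
        Ok((*next, flag))
    }

    /// Sets the flag; returns false if the job is missing or owned by someone else.
    fn stop(&self, job_id: u64, user_id: u64) -> bool {
        let jobs = self.jobs.lock().unwrap();
        match jobs.get(&job_id) {
            Some(job) if job.user_id == user_id => {
                job.cancelled.store(true, Ordering::SeqCst);
                true
            }
            _ => false,
        }
    }

    /// Removes a finished job, which lets the user start another one.
    fn finish(&self, job_id: u64) {
        self.jobs.lock().unwrap().remove(&job_id);
    }
}

/// Worker loop: checks the flag before each step and returns the steps completed.
fn run_steps(total_steps: u32, cancelled: &AtomicBool) -> u32 {
    let mut done = 0;
    for _ in 0..total_steps {
        if cancelled.load(Ordering::SeqCst) {
            break;
        }
        done += 1;
    }
    done
}
```

Cancellation is cooperative in the sense that `stop` only flips a flag; the worker decides when to observe it, so a step that is already running completes before the loop exits.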
+
+### 4.7 Syntheses
+
+**GET /syntheses**
+- Auth: Authenticated
+- Response: `SynthesisListItem[]` (with section summaries, theme info)
+
+**GET /syntheses/{id}**
+- Auth: Authenticated (owner only)
+- Response: `SynthesisResponse` (full sections data)
+
+**DELETE /syntheses/{id}**
+- Auth: Authenticated (owner only)
+- Response: 204 No Content
+
+**POST /syntheses/{id}/send-email**
+- Auth: Authenticated
+- Body: `{ email: string }`
+- Response: `{ message: string }`
+
+**GET /syntheses/{id}/export/markdown**
+- Auth: Authenticated
+- Response: Markdown file download
+
+**GET /syntheses/{id}/export/pdf**
+- Auth: Authenticated
+- Response: PDF file download
+
+### 4.8 Article History & Provenance
+
+**GET /article-history?limit=&offset=&job_id=&status=**
+- Auth: Authenticated
+- Response: `{ items: ArticleHistoryEntry[], total: number }`
+
+**DELETE /article-history**
+- Auth: Authenticated
+- Response: `{ deleted: number }`
+
+**GET /syntheses/{id}/provenance**
+- Auth: Authenticated
+- Response: `ArticleHistoryEntry[]` (articles with status "used" for this synthesis's job_id)
+
+### 4.9 LLM Call Logs
+
+**GET /llm-logs/{job_id}**
+- Auth: Authenticated
+- Response: `LlmCallLogEntry[]`
+
+### 4.10 User API Keys
+
+**GET /user/api-keys**
+- Auth: Authenticated
+- Response: `ApiKeyResponse[]` (id, provider_name, key_prefix, timestamps; never the full key)
+
+**POST /user/api-keys**
+- Auth: Authenticated
+- Body: `{ provider_name, api_key }`
+- Validation: provider in (gemini, openai, anthropic, brave_search), key 8-500 chars.
+- Response: `ApiKeyResponse`
+- Encrypts key with AES-256-GCM before storage; upserts (one key per user per provider).
+
+**DELETE /user/api-keys/{provider}**
+- Auth: Authenticated
+- Response: 204 No Content
+
+**POST /user/api-keys/{provider}/test**
+- Auth: Authenticated
+- Response: `{ success: boolean, message: string }`
+- Decrypts key, calls provider test endpoint.
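
The validation rule stated for `POST /user/api-keys` (known provider, key of 8 to 500 characters) can be sketched as a small pure function. The names `validate_api_key` and `KeyError` are hypothetical, and counting Unicode characters rather than bytes is an assumption; the document does not say which the real handler uses.

```rust
/// Provider names accepted for user API keys, as listed in the endpoint description.
const VALID_PROVIDERS: [&str; 4] = ["gemini", "openai", "anthropic", "brave_search"];

#[derive(Debug, PartialEq)]
enum KeyError {
    UnknownProvider,
    BadLength,
}

/// Checks the documented rules: known provider, key of 8 to 500 characters.
/// Length is measured in characters here, which is an assumption.
fn validate_api_key(provider_name: &str, api_key: &str) -> Result<(), KeyError> {
    if !VALID_PROVIDERS.contains(&provider_name) {
        return Err(KeyError::UnknownProvider);
    }
    let len = api_key.chars().count();
    if !(8..=500).contains(&len) {
        return Err(KeyError::BadLength);
    }
    Ok(())
}
```

Running this check before encryption means a rejected key never reaches the AES-256-GCM step or the database.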
+
+**POST /user/api-keys/export**
+- Auth: Authenticated
+- Response: `{ keys: [{ provider_name, api_key }] }`
+- Decrypts and returns all keys (used for backup/migration).
+
+### 4.11 Public Configuration
+
+**GET /config/providers**
+- Auth: Authenticated
+- Response: `ProviderConfigResponse[]` (enabled providers with model lists for scraping and websearch)
+
+### 4.12 Admin Endpoints
+
+All admin endpoints require `AdminUser` extractor (role = admin).
+
+**GET /admin/providers**
+- Response: `AdminProviderResponse[]`
+
+**POST /admin/providers**
+- Body: `CreateProviderRequest`
+- Validation: provider_name in (gemini, openai, anthropic), at least one model per list, at most one default per list.
+- Response: `AdminProviderResponse`
+
+**PUT /admin/providers/{id}**
+- Body: `UpdateProviderRequest` (all fields optional)
+- Response: `AdminProviderResponse`
+
+**DELETE /admin/providers/{id}**
+- Response: 204 No Content
+
+**GET /admin/rate-limits**
+- Response: `RateLimitResponse[]`
+
+**PUT /admin/rate-limits/{provider_name}**
+- Body: `{ max_requests: 1-1000, time_window_seconds: 1-3600 }`
+- Response: `RateLimitResponse`
+- Hot-reloads the in-memory provider rate limiter.
+
+**GET /admin/users**
+- Response: `AdminUserResponse[]`
+
+**PUT /admin/users/{id}/role**
+- Body: `{ role: "user" | "admin" }`
+- Response: `{ message: string }`
+
+**GET /health**
+- Auth: Public
+- Response: `{ status: "ok" }`
+
+---
+
+## 5. Generation Pipeline Technical Flow
+
+### Overview
+
+The pipeline runs as a background tokio task spawned by `POST /syntheses/generate`. It has a 15-minute global timeout and supports cooperative cancellation via `AtomicBool`.
+
+### Initialization
+
+1. Load `UserSettings` from DB (or create defaults)
+2. Cleanup old article history (entries older than `article_history_days` with dropped status) and truncate old LLM call logs
+3. Load the target `Theme` (categories, max_items, max_age_days, summary_length)
+4. Load user `Sources` for the theme
+5. Decrypt user's LLM API key, create `Arc<dyn LlmProvider>` via factory
+6. Resolve models: `ai_model` (for scraping/classification) and `ai_model_websearch` (for web search); user override or admin default fallback
+7. Initialize per-user rate limiter (from settings or admin defaults)
+8. Initialize tracking structures: `article_scraped` (category -> Vec), `source_counts`, `url_source`, `filled_counts`, `seen_urls`, `pending_traces`
+
+### Phase 1: Personalized Sources
+
+Skipped if user has 0 sources for the theme.
+
+**1a. Windowed source extraction**
+
+- Query article_history for the last source used; reorder sources in a rolling window starting after that source
+- Select up to `source_extraction_window` sources per generation
+- For each source (bounded concurrency of 5): fetch page HTML, extract up to `max_links_per_source` article URLs via HTML parsing (same-domain, non-homepage, no static assets)
+- Deduplicate URLs cross-source via `seen_urls`
+- Batch-check `article_history` for already-seen URL hashes; filter matches (traced as `filtered_history`)
+- Shuffle remaining candidates to interleave sources
+- Track url -> source in `url_source`
+
+**1b. Batch scrape + classify**
+
+Processing in batches of `settings.batch_size`:
+
+- **Batch assembly**: Pull up to batch_size candidates, skip if `source_counts[domain] >= max_articles_per_source` (traced as `filtered_diversity`)
+- **Scrape** (JoinSet, parallel): SSRF check, 15s timeout, 5MB limit, HTML parsing, title/date/body extraction, soft-404 detection. Skip empty/too-old articles.
+- **Classify** (JoinSet, parallel): Rate limit check (60s wait), send title + first 500 chars to LLM with categories list. LLM returns `{title, summary, category}`. Validate category via `assign_category()` (fallback to "Autre", drop if full).
+- **LLM call logging**: Every LLM call is logged with full prompt, response, timing, and article URL.
+- **Early exit**: Stop when total articles >= `(num_categories + 1) * max_items_per_category`.
+- Batch-flush pending traces to `article_history`.
+
+### Phase 2: Web Search Fallback
+
+Skipped if all categories are filled to `max_items_per_category`.
+
+**2a. Compute gaps**: For each category, `needed = max_items - filled`.
+
+**2b. Path selection** based on `settings.use_brave_search`:
+
+**Path A -- Brave Search** (`use_brave_search = true`):
+- Decrypt user's Brave Search API key
+- Query: `"{theme} actualites"`, up to 20 results, freshness mapped from `max_age_days` (pd/pw/pm/py)
+- Filter results through `filter_phase2_url()`: homepage filter, cross-phase dedup, article history check, source diversity check
+- Batch scrape + classify (same logic as Phase 1b, source_type = "brave_search")
+
+**Path B -- LLM Web Search** (`use_brave_search = false`):
+- Build search prompt with theme, categories, and gap counts
+- Call LLM with `ai_model_websearch` model; returns structured JSON: `{category_0: [{title, url, summary}], ...}`
+- Filter URLs through `filter_phase2_url()`
+- Scrape each result sequentially to validate; keep LLM-provided title/summary (no re-classification)
+- source_type = "web_search"
+
+### Save & Record
+
+1. Error if all article lists are empty
+2. Order sections: user-defined categories first (in order), then "Autre" if non-empty
+3. Sanitize: strip `\u0000` null bytes from JSON (PostgreSQL JSONB requirement)
+4. Insert synthesis row: job_id, week (ISO week string), sections (JSONB), status "completed", theme_id
+5. Record used articles: batch-insert `article_history` entries with status "used", synthesis_id, and correct source_type
+
+---
+
+## 6. LLM Provider Abstraction
+
+### Trait Definition
+
+```rust
+#[async_trait]
+pub trait LlmProvider: Send + Sync {
+    fn provider_id(&self) -> &str;
+    async fn call_llm(&self, model: &str, system_prompt: &str,
+        user_prompt: &str, response_schema: &Value)
+        -> Result;
+}
+```
+
+All calls use structured JSON output (response_schema defines the expected shape).
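
To illustrate why the pipeline can stay provider-agnostic, here is a deliberately simplified, synchronous version of the trait with an in-memory mock behind a trait object. It is a sketch under stated assumptions: the real trait is async, and this version carries JSON as `String` and uses `Result<String, String>` purely as a hypothetical placeholder for the actual return type. `MockProvider` and `classify` are illustrative names, not the project's code.

```rust
use std::sync::Arc;

/// Synchronous, std-only simplification of the trait described in this section.
/// The `Result<String, String>` return type is a placeholder assumption.
trait LlmProvider: Send + Sync {
    fn provider_id(&self) -> &str;
    fn call_llm(
        &self,
        model: &str,
        system_prompt: &str,
        user_prompt: &str,
        response_schema: &str,
    ) -> Result<String, String>;
}

/// Hypothetical in-memory provider that always returns a canned reply.
struct MockProvider {
    canned: String,
}

impl LlmProvider for MockProvider {
    fn provider_id(&self) -> &str {
        "mock"
    }

    fn call_llm(
        &self,
        _model: &str,
        _system_prompt: &str,
        user_prompt: &str,
        _response_schema: &str,
    ) -> Result<String, String> {
        if user_prompt.is_empty() {
            return Err("empty prompt".to_string());
        }
        Ok(self.canned.clone())
    }
}

/// Caller code depends only on the trait object, never on a concrete provider.
fn classify(provider: &Arc<dyn LlmProvider>, title: &str) -> Result<String, String> {
    provider.call_llm("any-model", "Classify the article.", title, "{}")
}
```

Because `classify` takes an `Arc<dyn LlmProvider>`, swapping the mock for a network-backed provider would not change the calling code, which is the property the pipeline relies on.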
+
+### Implementations
+
+| Provider | Module | API Endpoint | Auth Method |
+|---|---|---|---|
+| Google Gemini | `llm/gemini.rs` | `generativelanguage.googleapis.com` | Query param `?key=` |
+| OpenAI | `llm/openai.rs` | `api.openai.com/v1/chat/completions` | Bearer token |
+| Anthropic | `llm/anthropic.rs` | `api.anthropic.com/v1/messages` | `x-api-key` header |
+| Mock | `llm/mock.rs` | N/A (in-memory) | N/A |
+
+### Factory
+
+`llm/factory.rs` provides `create_provider(provider_name, api_key, http_client) -> Arc<dyn LlmProvider>`. Matches on provider name string.
+
+### Response Schema
+
+`llm/schema.rs` builds JSON Schema definitions for:
+- Classification/summarization: `{title, summary, category, is_article}`
+- Web search: `{category_0: [{title, url, summary}], ...}` with per-category arrays
+- Source link extraction: `{links: [{url}]}`
+
+### Error Mapping
+
+`map_provider_http_error()` translates HTTP status codes to `AppError` variants:
+- 400 -> BadRequest
+- 401/403 -> BadRequest (invalid key)
+- 404 -> BadRequest (model not found)
+- 429/529 -> RateLimited
+- Other -> Internal
+
+---
+
+## 7. Background Tasks
+
+### Session Cleanup
+
+Runs hourly via `tokio::spawn`. Calls `db::sessions::delete_expired` to remove sessions past their `expires_at` timestamp.
+
+### Job Store Cleanup
+
+`JobStore::cleanup_expired` removes job entries older than 1 hour (the TTL constant). Called periodically. Releases user locks for expired jobs.
+
+### Scheduler
+
+Runs every minute via `tokio::spawn` with a 60-second interval. For each tick:
+
+1. `current_day_code()` -> "mon" through "sun"
+2. `find_due_schedules(pool, day, time)` -> queries enabled schedules matching current day and time (HH:MM)
+3. For each due schedule:
+   - Skip if `job_store.has_active_job(user_id)` returns Some (manual generation in progress)
+   - Create a temporary `watch::channel` and `AtomicBool`
+   - Call `synthesis::run_generation_inner` directly (bypasses job store)
+   - On success: send emails to configured recipients (up to 3), mark schedule as run
+   - On failure: log error, do not mark as run
+
+---
+
+## 8. Configuration
+
+### Environment Variables
+
+| Variable | Required | Default | Description |
+|---|---|---|---|
+| DATABASE_URL | Yes | - | PostgreSQL connection string |
+| MASTER_ENCRYPTION_KEY | Yes | - | 64 hex chars (32 bytes) for AES-256-GCM |
+| APP_URL | Yes | - | Public URL (CORS, magic links, cookies). No trailing slash. |
+| PORT | No | 8080 | HTTP server port |
+| RUST_LOG | No | - | Logging filter (e.g., "info,ai_synth_backend=debug") |
+| STATIC_DIR | No | ../frontend/dist | Path to built SolidJS files |
+| RESEND_API_KEY | Yes | - | Resend email service API key |
+| EMAIL_FROM | Yes | - | Sender address for emails |
+| TURNSTILE_SECRET_KEY | Yes | - | Cloudflare Turnstile server secret |
+| TURNSTILE_SITE_KEY | Yes | - | Cloudflare Turnstile client key |
+| POSTGRES_PASSWORD | Yes | - | Used by docker-compose for DB container |
+
+### Startup Validation
+
+`AppConfig::validate()` checks at startup:
+- `MASTER_ENCRYPTION_KEY` is exactly 64 hex characters
+- `APP_URL` starts with http:// or https:// and has no trailing slash
+
+The application refuses to start with invalid configuration.
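
The two startup checks can be expressed as small pure functions. This is a sketch of the stated rules rather than the body of `AppConfig::validate()`: the free function `validate`, the helper names, and the `ConfigError` enum are hypothetical.

```rust
#[derive(Debug, PartialEq)]
enum ConfigError {
    BadEncryptionKey,
    BadAppUrl,
}

/// Returns true when `key` is exactly 64 ASCII hex digits (32 bytes once decoded).
fn is_valid_master_key(key: &str) -> bool {
    key.len() == 64 && key.bytes().all(|b| b.is_ascii_hexdigit())
}

/// Returns true when `url` has an http(s) scheme and no trailing slash.
fn is_valid_app_url(url: &str) -> bool {
    (url.starts_with("http://") || url.starts_with("https://")) && !url.ends_with('/')
}

/// Applies both checks in order and reports the first failure.
fn validate(master_key: &str, app_url: &str) -> Result<(), ConfigError> {
    if !is_valid_master_key(master_key) {
        return Err(ConfigError::BadEncryptionKey);
    }
    if !is_valid_app_url(app_url) {
        return Err(ConfigError::BadAppUrl);
    }
    Ok(())
}
```

Rejecting a trailing slash at startup keeps later string concatenation of `APP_URL` with paths such as magic-link routes from producing double slashes.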
+
+### User Settings Model
+
+Default values applied when a user has no saved settings:
+
+| Setting | Default | Range |
+|---|---|---|
+| max_articles_per_source | 3 | 1-10 |
+| max_links_per_source | 8 | 1-30 |
+| use_brave_search | false | boolean |
+| article_history_days | 90 | 0-365 |
+| batch_size | 5 | 1-20 |
+| source_extraction_window | 3 | 1-10 |
+| search_agent_behavior | "" | max 2000 chars |
+| ai_provider | "" | max 100 chars |
+| ai_model | "" | max 100 chars |
+| ai_model_websearch | "" | max 100 chars |
+| rate_limit_max_requests | null | >= 1 if set |
+| rate_limit_time_window_seconds | null | >= 1 if set |
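
The defaults and ranges in the table map naturally onto a struct with a `Default` implementation and a range check. The sketch below assumes Rust field types (`u32`, `Option<u32>`) that the document does not specify, and its `validate` method is a hypothetical illustration of the listed ranges, not the project's validation code.

```rust
/// Sketch of the settings record; field names follow the table, types are assumed.
#[derive(Debug, Clone, PartialEq)]
struct UserSettings {
    max_articles_per_source: u32,
    max_links_per_source: u32,
    use_brave_search: bool,
    article_history_days: u32,
    batch_size: u32,
    source_extraction_window: u32,
    search_agent_behavior: String,
    ai_provider: String,
    ai_model: String,
    ai_model_websearch: String,
    rate_limit_max_requests: Option<u32>,
    rate_limit_time_window_seconds: Option<u32>,
}

impl Default for UserSettings {
    /// Values from the "Default" column of the table.
    fn default() -> Self {
        Self {
            max_articles_per_source: 3,
            max_links_per_source: 8,
            use_brave_search: false,
            article_history_days: 90,
            batch_size: 5,
            source_extraction_window: 3,
            search_agent_behavior: String::new(),
            ai_provider: String::new(),
            ai_model: String::new(),
            ai_model_websearch: String::new(),
            rate_limit_max_requests: None,
            rate_limit_time_window_seconds: None,
        }
    }
}

impl UserSettings {
    /// Returns the name of the first field outside its documented range.
    fn validate(&self) -> Result<(), &'static str> {
        let in_range = |v: u32, lo: u32, hi: u32| (lo..=hi).contains(&v);
        if !in_range(self.max_articles_per_source, 1, 10) { return Err("max_articles_per_source"); }
        if !in_range(self.max_links_per_source, 1, 30) { return Err("max_links_per_source"); }
        if !in_range(self.article_history_days, 0, 365) { return Err("article_history_days"); }
        if !in_range(self.batch_size, 1, 20) { return Err("batch_size"); }
        if !in_range(self.source_extraction_window, 1, 10) { return Err("source_extraction_window"); }
        if self.search_agent_behavior.chars().count() > 2000 { return Err("search_agent_behavior"); }
        for (name, value) in [
            ("ai_provider", &self.ai_provider),
            ("ai_model", &self.ai_model),
            ("ai_model_websearch", &self.ai_model_websearch),
        ] {
            if value.chars().count() > 100 { return Err(name); }
        }
        // Rate limit overrides are optional, but must be at least 1 when present.
        if self.rate_limit_max_requests == Some(0) { return Err("rate_limit_max_requests"); }
        if self.rate_limit_time_window_seconds == Some(0) { return Err("rate_limit_time_window_seconds"); }
        Ok(())
    }
}
```

Modelling the two rate limit fields as `Option` mirrors the `null` defaults: an absent value means the admin defaults apply, as described in the pipeline initialization.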