# AI Weekly Synth -- Architecture Document

## 1. System Overview

AI Weekly Synth is a self-hosted web application that generates AI-powered weekly news syntheses. Users configure topics (themes), categories, and an LLM provider; the system then searches the web, scrapes and validates sources, classifies articles, and produces structured summaries.

### Technology Stack

| Layer | Technology |
|---|---|
| Backend | Rust (Axum 0.8) |
| Frontend | SolidJS 1.9 + Tailwind CSS v4 |
| Database | PostgreSQL 17 (via sqlx with compile-time query checking) |
| Deployment | Docker Compose (app + Postgres) |

### Deployment Topology

```
docker-compose.yml
├── app (ai-synth)            port 8080
│   ├── Axum HTTP server
│   ├── Static file serving (SPA fallback)
│   └── Background tasks (scheduler, session cleanup, job TTL)
└── db (postgres:17-alpine)   port 5432 (localhost only)
    └── postgres_data volume
```

The app container builds from a multi-stage Dockerfile, serves the SolidJS frontend as static files, and connects to Postgres over the `internal` bridge network.

---

## 2. Layer Architecture

The backend follows a three-layer architecture with shared model types:

```
handlers/        (HTTP layer)
    │
    ├── extracts request data (Axum extractors, JSON, path params)
    ├── validates input
    ├── calls services/ or db/ directly
    └── formats HTTP responses
    │
services/        (Business logic)
    │
    ├── synthesis pipeline orchestration
    ├── LLM provider abstraction + factory
    ├── scraping (articles, source pages)
    ├── encryption, email, CSV, PDF export
    ├── rate limiting, job store, scheduler
    └── Brave Search client
    │
db/              (Data access)
    │
    ├── pure SQL queries via sqlx
    ├── typed result mapping (FromRow)
    └── no business logic
    │
models/          (Shared types -- used by all layers)
    │
    ├── domain structs (User, Theme, Source, Synthesis, etc.)
    ├── request/response DTOs
    └── validation logic
```

### Module Inventory

**Handlers** (`handlers/`): `admin`, `api_keys`, `article_history`, `auth`, `config`, `generation`, `health`, `llm_logs`, `schedules`, `settings`, `sources`, `syntheses`, `themes`

**Services** (`services/`): `auth`, `brave_search`, `csv`, `email`, `encryption`, `export`, `job_store`, `llm` (with `gemini`, `openai`, `anthropic`, `mock`, `factory`, `schema`), `prompts`, `rate_limiter`, `scheduler`, `scraper`, `source_scraper`, `synthesis`, `turnstile`

**DB** (`db/`): `api_keys`, `article_history`, `audit`, `llm_call_log`, `magic_links`, `providers`, `rate_limits`, `schedules`, `sessions`, `settings`, `sources`, `syntheses`, `themes`, `users`

**Models** (`models/`): `api_key`, `audit`, `magic_link`, `provider`, `rate_limit`, `schedule`, `session`, `settings`, `source`, `synthesis`, `theme`, `user`

---

## 3. Key Components

### 3.1 LLM Provider Abstraction

The `LlmProvider` trait defines a unified interface for all LLM backends:

```rust
#[async_trait]
pub trait LlmProvider: Send + Sync {
    fn provider_id(&self) -> &str;

    async fn call_llm(
        &self,
        model: &str,
        system_prompt: &str,
        user_prompt: &str,
        response_schema: &Value,
    ) -> Result;
}
```

Implementations: `GeminiProvider`, `OpenAiProvider`, `AnthropicProvider`, `MockLlmProvider`. The factory (`llm/factory.rs`) creates provider instances by name. The mock provider enables end-to-end pipeline testing without real API calls.

### 3.2 Synthesis Pipeline

The pipeline is the core business logic, orchestrated in `services/synthesis.rs`. It runs as a background tokio task with a 15-minute timeout.

**Three phases:**

1. **Phase 1 -- Personalized Sources**: Extract article links from user-curated source pages (windowed, rolling), scrape articles, classify and summarize each via LLM. Batched processing with configurable `batch_size`.
2. **Phase 2 -- Web Search Fallback**: For under-filled categories, either call the Brave Search API or use the LLM's web search capability to find additional articles. Scrape and validate results.
3. **Save**: Assemble sections by category, sanitize JSON, persist to database, record article history traces.

Progress is reported via `tokio::sync::watch` channels consumed by SSE endpoints.

### 3.3 Job Store

`JobStore` (`services/job_store.rs`) is an in-memory concurrent store for active generation jobs:

- Backed by `DashMap` for sharded, low-contention concurrent access
- `DashSet` for per-user deduplication (one active job per user)
- Each job holds a `watch::Sender` for real-time SSE streaming
- `AtomicBool` for cooperative cancellation
- 1-hour TTL with automatic cleanup

### 3.4 Scheduler

`services/scheduler.rs` runs as a background task, checking every minute for due `theme_schedules`. When a schedule fires:

1. Query `find_due_schedules` matching current day code + time
2. Skip if user already has a manual generation in progress
3. Run `synthesis::run_generation_inner` directly
4. Send email to configured recipients (up to 3)
5. Mark schedule as run

### 3.5 Scraper

Two scraping services:

- **`scraper.rs`**: Article page scraper with SSRF prevention, HTML parsing, title/date/body extraction, soft-404 detection, 15s timeout, 5MB body limit.
- **`source_scraper.rs`**: Source index page scraper that extracts article links from user-configured source URLs (HTML `<a>` parsing with filters, or LLM-assisted extraction).

### 3.6 Rate Limiters

- **Auth rate limiter**: 10 requests/60s per key (email or IP) for magic link endpoints.
- **Provider rate limiter**: Per-LLM-provider sliding window, admin-configured, hot-reloaded from DB.
- **User rate limiters**: Per-user generation rate limits cached in `DashMap`, recreated on settings change.

---

## 4. Data Model

### Tables and Relationships

```
users
├── sessions (user_id FK, CASCADE)
├── magic_tokens (email reference, no FK)
├── settings (user_id PK/FK, CASCADE)
├── themes (user_id FK, CASCADE)
│   ├── sources (theme_id FK, CASCADE)
│   ├── syntheses (theme_id FK, SET NULL)
│   └── theme_schedules (theme_id FK, CASCADE, UNIQUE)
├── user_api_keys (user_id FK, CASCADE; UNIQUE per provider)
├── article_history (user_id FK, CASCADE)
├── llm_call_log (user_id FK, CASCADE)
└── audit_log (admin_user_id FK, SET NULL)

admin_providers
└── admin_rate_limits (provider_name FK, CASCADE)
```

### Table Summary

| Table | Purpose | Key Columns |
|---|---|---|
| `users` | User accounts | id, email, display_name, role (user/admin), created_at |
| `sessions` | Login sessions | session_hash (PK), user_id, expires_at, last_active_at, ip_address |
| `magic_tokens` | Passwordless auth tokens | id, email, token_hash, expires_at, used |
| `settings` | Per-user pipeline config | user_id (PK), ai_provider, ai_model, ai_model_websearch, batch_size, max_articles_per_source, max_links_per_source, use_brave_search, source_extraction_window, article_history_days, search_agent_behavior, rate_limit_max_requests, rate_limit_time_window_seconds |
| `themes` | Per-user topic configurations | id, user_id, name, theme, categories (JSONB), max_items_per_category, max_age_days, summary_length |
| `sources` | User-curated news source URLs | id, user_id, title, url, theme_id, is_preferred |
| `syntheses` | Generated synthesis results | id, user_id, week, sections (JSONB), status, job_id, theme_id |
| `theme_schedules` | Automated generation schedules | id, theme_id (UNIQUE), user_id, enabled, days (JSONB), time_utc, emails (JSONB), last_run_at |
| `article_history` | Article URL dedup + provenance trace | id, user_id, url, url_hash, title, source_type, source_url, category, synthesis_id, status, scraped_ok, job_id, published_date |
| `llm_call_log` | Full LLM interaction log | id, user_id, job_id, call_type, model, system_prompt, user_prompt, response_body, duration_ms, article_url |
| `admin_providers` | Admin-curated LLM provider catalog | id, provider_name (UNIQUE), display_name, models_scraping (JSONB), models_websearch (JSONB), is_enabled |
| `admin_rate_limits` | Per-provider rate limit config | id, provider_name (UNIQUE, FK), max_requests, time_window_seconds |
| `user_api_keys` | Encrypted user LLM API keys | id, user_id, provider_name, encrypted_key (BYTEA), nonce (BYTEA), key_prefix; UNIQUE(user_id, provider_name) |
| `audit_log` | Admin mutation audit trail | id, admin_user_id, action, target_type, target_id, details (JSONB) |

---

## 5. API Overview

All API routes are prefixed with `/api/v1`. CSRF protection (`X-Requested-With` header) is applied to all mutating endpoints.

### Authentication

| Method | Path | Auth | Description |
|---|---|---|---|
| POST | /auth/register | Public | Create account + send magic link |
| POST | /auth/login | Public | Request magic link |
| GET | /auth/verify | Public | Verify token (email click redirect) |
| POST | /auth/verify | Public | Verify token (frontend API call) |
| POST | /auth/logout | Authenticated | Destroy session |
| GET | /auth/me | Authenticated | Current user info |

### Settings

| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /settings | Authenticated | Get user settings |
| PUT | /settings | Authenticated | Update user settings |

### Themes

| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /themes | Authenticated | List user themes |
| POST | /themes | Authenticated | Create theme |
| PUT | /themes/{id} | Authenticated | Update theme |
| DELETE | /themes/{id} | Authenticated | Delete theme |

### Schedules

| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /themes/{id}/schedule | Authenticated | Get theme schedule |
| PUT | /themes/{id}/schedule | Authenticated | Create or update schedule |
| DELETE | /themes/{id}/schedule | Authenticated | Delete schedule |

### Sources

| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /sources | Authenticated | List sources |
| POST | /sources | Authenticated | Create source |
| PUT | /sources/preferred | Authenticated | Update preferred sources |
| DELETE | /sources/{id} | Authenticated | Delete source |
| POST | /sources/bulk | Authenticated | Bulk import (JSON) |
| POST | /sources/import-csv | Authenticated | Import from CSV |
| GET | /sources/export-csv | Authenticated | Export as CSV |

### Syntheses & Generation

| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /syntheses | Authenticated | List syntheses |
| GET | /syntheses/{id} | Authenticated | Get full synthesis |
| DELETE | /syntheses/{id} | Authenticated | Delete synthesis |
| POST | /syntheses/generate | Authenticated | Trigger generation |
| GET | /syntheses/generate/{job_id}/progress | Authenticated | SSE progress stream |
| POST | /syntheses/generate/{job_id}/stop | Authenticated | Cancel generation |
| POST | /syntheses/{id}/send-email | Authenticated | Email synthesis |
| GET | /syntheses/{id}/export/markdown | Authenticated | Markdown download |
| GET | /syntheses/{id}/export/pdf | Authenticated | PDF download |

### Article History & LLM Logs

| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /article-history | Authenticated | List article history |
| DELETE | /article-history | Authenticated | Clear article history |
| GET | /syntheses/{id}/provenance | Authenticated | Get synthesis provenance |
| GET | /llm-logs/{job_id} | Authenticated | Get LLM call logs for job |

### User API Keys

| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /user/api-keys | Authenticated | List keys (prefix only) |
| POST | /user/api-keys | Authenticated | Store encrypted key |
| DELETE | /user/api-keys/{provider} | Authenticated | Delete key |
| POST | /user/api-keys/{provider}/test | Authenticated | Test key validity |
| POST | /user/api-keys/export | Authenticated | Export keys |
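
The "prefix only" listing for user API keys could be produced by a helper along the following lines. This is a minimal sketch, not the project's code: it assumes the display form is the first 8 characters followed by `...` (the rule stated under Encryption at Rest), and `mask_api_key` is a hypothetical name.

```rust
/// Hypothetical helper (name is an assumption, not from the codebase):
/// returns the display form of an API key, i.e. its first 8 characters
/// followed by "...". Keys shorter than 8 characters are shown in full
/// before the ellipsis; a real implementation might reject them instead.
fn mask_api_key(key: &str) -> String {
    // Iterate over chars rather than slicing bytes so that a non-ASCII
    // key can never cause a panic on a char boundary.
    let prefix: String = key.chars().take(8).collect();
    format!("{prefix}...")
}
```

For example, `mask_api_key("sk-test-1234567890abcdef")` yields `sk-test-...`. Storing this value in the `key_prefix` column at write time means the list endpoint never needs to decrypt anything.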
### Configuration & Admin

| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /config/providers | Authenticated | Available providers/models |
| GET | /admin/providers | Admin | List all providers |
| POST | /admin/providers | Admin | Create provider |
| PUT | /admin/providers/{id} | Admin | Update provider |
| DELETE | /admin/providers/{id} | Admin | Delete provider |
| GET | /admin/rate-limits | Admin | List rate limits |
| PUT | /admin/rate-limits/{provider_name} | Admin | Update rate limit |
| GET | /admin/users | Admin | List users |
| PUT | /admin/users/{id}/role | Admin | Change user role |

### Infrastructure

| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /health | Public | Health check |

---

## 6. Security Architecture

### Authentication & Session Management

- **Passwordless**: Magic link tokens sent via email (Resend API), single-use, time-limited
- **Captcha**: Cloudflare Turnstile on registration and login
- **Sessions**: SHA-256 hashed tokens stored in DB, 30-day expiry, `HttpOnly` + `SameSite=Lax` cookies, optionally `Secure`
- **Anti-enumeration**: Same response for existent/non-existent emails, timing attack mitigation
- **Authorization**: `AuthUser` and `AdminUser` Axum extractors enforce auth levels per handler

### CSRF Protection

All mutating API endpoints require the `X-Requested-With` header (checked by `csrf::csrf_check` middleware layer). Non-mutating GET/HEAD/OPTIONS requests are exempt.
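
The CSRF rule above can be sketched as a framework-free predicate. This is an illustration under stated assumptions, not the actual `csrf::csrf_check` middleware: the function name `csrf_allows` and the header representation are hypothetical, and the sketch assumes only the presence of the header is checked, not its value.

```rust
/// Hypothetical, framework-free version of the CSRF rule (the real check
/// lives in an Axum middleware layer and may differ in detail).
/// `headers` is a list of (name, value) pairs from the incoming request.
fn csrf_allows(method: &str, headers: &[(&str, &str)]) -> bool {
    // Non-mutating methods are exempt from the header requirement.
    let is_safe = matches!(
        method.to_ascii_uppercase().as_str(),
        "GET" | "HEAD" | "OPTIONS"
    );
    if is_safe {
        return true;
    }
    // Assumption: any value is accepted as long as the header is present.
    // HTTP header names are case-insensitive, so compare accordingly.
    headers
        .iter()
        .any(|(name, _)| name.eq_ignore_ascii_case("x-requested-with"))
}
```

The design relies on browsers refusing to attach custom headers to cross-origin requests without a CORS preflight, so a plain HTML form on another site cannot satisfy the check.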
### Encryption at Rest

User LLM API keys are encrypted with AES-256-GCM before storage:

- 32-byte master key from `MASTER_ENCRYPTION_KEY` env var (64 hex chars)
- Random 12-byte nonce per encryption (stored alongside ciphertext)
- Key bytes are zeroized on drop (`zeroize` crate)
- Only a key prefix (first 8 chars + "...") is ever returned via the API

### SSRF Prevention

Both `scraper.rs` and `source_scraper.rs` validate URLs before fetching:

- DNS resolution check against private/loopback IP ranges
- Redirect chain validation (no redirects to private IPs)
- Only HTTP/HTTPS schemes allowed

### Security Headers

Applied as global middleware layers:

- `Content-Security-Policy` (self + Cloudflare Turnstile)
- `X-Content-Type-Options: nosniff`
- `X-Frame-Options: DENY`
- `Referrer-Policy: strict-origin-when-cross-origin`
- `X-XSS-Protection: 1; mode=block`
- `Strict-Transport-Security` (HTTPS only)

### Error Sanitization

The `sanitize_error_message` function strips API keys and internal details from error messages before they reach SSE clients. Internal errors log full details server-side but return generic messages to users.

### CORS

Configured to allow only the `APP_URL` origin, with credentials (cookies), limited to GET/POST/PUT/DELETE methods.

---

## 7. Concurrency Model

### Async Runtime

Tokio with full features. The Axum server runs on Tokio's multi-threaded async runtime.
### Background Tasks

Spawned at startup via `tokio::spawn`:

- **Session cleanup**: Hourly deletion of expired DB sessions
- **Job store cleanup**: Periodic removal of expired job entries (1-hour TTL)
- **Scheduler**: Minute-by-minute check for due theme schedules

### Generation Pipeline Concurrency

- **`tokio::task::JoinSet`**: Used for parallel scraping (bounded concurrency of 5 for source extraction) and parallel LLM classification calls within each batch
- **`tokio::sync::watch`**: Fan-out progress notifications to SSE clients; late subscribers immediately receive the latest state
- **`AtomicBool`**: Cooperative cancellation flag checked between pipeline stages; avoids mutex overhead
- **`DashMap` / `DashSet`**: Sharded concurrent access (no single global lock) for the job store (job entries), generating-users set, per-user rate limiter cache, and provider rate limiter state

### Task Lifecycle

```
POST /generate
  └── handler creates job in JobStore
       └── spawns outer task (panic monitor)
            └── spawns inner task (15-min timeout)
                 └── run_generation_inner()
                      ├── Phase 1 (JoinSet scrape, JoinSet classify)
                      ├── Phase 2 (JoinSet scrape, JoinSet classify)
                      └── Save to DB
            └── on complete/error: send final ProgressEvent
            └── delayed cleanup (5 min) then remove from JobStore
```

### Graceful Shutdown

The server supports graceful shutdown via signal handling, allowing in-flight requests to complete.
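
The cooperative cancellation described under Generation Pipeline Concurrency can be illustrated with the standard library alone. The sketch below is an assumption-laden simplification: `run_pipeline`, `Outcome`, and the representation of stages as plain function pointers are all hypothetical, and the real pipeline runs async stages on Tokio rather than synchronous functions.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Result of a run in this sketch (hypothetical type, not from the codebase).
#[derive(Debug, PartialEq)]
enum Outcome {
    Completed,
    /// The flag was observed before stage number `stages_done` started.
    Cancelled { stages_done: usize },
}

/// Runs each stage in order, checking the shared flag before every stage.
/// A stage that has already started is never interrupted; cancellation
/// only takes effect at the next stage boundary.
fn run_pipeline(cancel: &AtomicBool, stages: &[fn()]) -> Outcome {
    for (done, stage) in stages.iter().enumerate() {
        // Relaxed ordering is enough here: the flag carries no other data,
        // and a slightly delayed observation only costs one extra stage.
        if cancel.load(Ordering::Relaxed) {
            return Outcome::Cancelled { stages_done: done };
        }
        stage();
    }
    Outcome::Completed
}
```

In the architecture above, the stop endpoint would set the flag (`cancel.store(true, Ordering::Relaxed)`) while the generation task polls it, so neither side ever blocks on a lock.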