# AI Weekly Synth -- Architecture Document
## 1. System Overview
AI Weekly Synth is a self-hosted web application that generates AI-powered weekly news syntheses. Users configure topics (themes), categories, and an LLM provider; the system then searches the web, scrapes and validates sources, classifies articles, and produces structured summaries.
### Technology Stack
| Layer | Technology |
|---|---|
| Backend | Rust (Axum 0.8) |
| Frontend | SolidJS 1.9 + Tailwind CSS v4 |
| Database | PostgreSQL 17 (via sqlx with compile-time query checking) |
| Deployment | Docker Compose (app + Postgres) |
### Deployment Topology
```
docker-compose.yml
├── app (ai-synth)            port 8080
│   ├── Axum HTTP server
│   ├── Static file serving (SPA fallback)
│   └── Background tasks (scheduler, session cleanup, job TTL)
└── db (postgres:17-alpine)   port 5432 (localhost only)
    └── postgres_data volume
```
The app container builds from a multi-stage Dockerfile, serves the SolidJS frontend as static files, and connects to Postgres over the `internal` bridge network.
---
## 2. Layer Architecture
The backend follows a three-layer architecture with shared model types:
```
handlers/ (HTTP layer)
├── extracts request data (Axum extractors, JSON, path params)
├── validates input
├── calls services/ or db/ directly
└── formats HTTP responses

services/ (Business logic)
├── synthesis pipeline orchestration
├── LLM provider abstraction + factory
├── scraping (articles, source pages)
├── encryption, email, CSV, PDF export
├── rate limiting, job store, scheduler
└── Brave Search client

db/ (Data access)
├── pure SQL queries via sqlx
├── typed result mapping (FromRow)
└── no business logic

models/ (Shared types -- used by all layers)
├── domain structs (User, Theme, Source, Synthesis, etc.)
├── request/response DTOs
└── validation logic
```
### Module Inventory
- **Handlers** (`handlers/`): `admin`, `api_keys`, `article_history`, `auth`, `config`, `generation`, `health`, `llm_logs`, `schedules`, `settings`, `sources`, `syntheses`, `themes`
- **Services** (`services/`): `auth`, `brave_search`, `csv`, `email`, `encryption`, `export`, `job_store`, `llm` (with `gemini`, `openai`, `anthropic`, `mock`, `factory`, `schema`), `prompts`, `rate_limiter`, `scheduler`, `scraper`, `source_scraper`, `synthesis`, `turnstile`
- **DB** (`db/`): `api_keys`, `article_history`, `audit`, `llm_call_log`, `magic_links`, `providers`, `rate_limits`, `schedules`, `sessions`, `settings`, `sources`, `syntheses`, `themes`, `users`
- **Models** (`models/`): `api_key`, `audit`, `magic_link`, `provider`, `rate_limit`, `schedule`, `session`, `settings`, `source`, `synthesis`, `theme`, `user`
---
## 3. Key Components
### 3.1 LLM Provider Abstraction
The `LlmProvider` trait defines a unified interface for all LLM backends:
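```rust
use async_trait::async_trait;
use serde_json::Value;

#[async_trait]
pub trait LlmProvider: Send + Sync {
    /// Identifier of the backend.
    fn provider_id(&self) -> &str;
    /// Sends the prompts to `model`; `response_schema` describes the expected JSON.
    async fn call_llm(&self, model: &str, system_prompt: &str,
                      user_prompt: &str, response_schema: &Value)
        -> Result<Value, AppError>;
}
```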
Implementations: `GeminiProvider`, `OpenAiProvider`, `AnthropicProvider`, `MockLlmProvider`.
The factory (`llm/factory.rs`) creates provider instances by name. The mock provider enables end-to-end pipeline testing without real API calls.
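
A minimal sketch of name-based provider creation, under stated assumptions: the trait is reduced to a synchronous `Provider` stand-in, `create_provider` is a hypothetical name, the identifier strings mirror the module names listed earlier, and a plain `String` replaces `AppError`. It is not the code in `llm/factory.rs`.

```rust
/// Simplified, synchronous stand-in for the async `LlmProvider` trait.
pub trait Provider: Send + Sync {
    fn provider_id(&self) -> &str;
}

pub struct GeminiProvider;
pub struct OpenAiProvider;
pub struct AnthropicProvider;
pub struct MockLlmProvider;

impl Provider for GeminiProvider {
    fn provider_id(&self) -> &str { "gemini" }
}
impl Provider for OpenAiProvider {
    fn provider_id(&self) -> &str { "openai" }
}
impl Provider for AnthropicProvider {
    fn provider_id(&self) -> &str { "anthropic" }
}
impl Provider for MockLlmProvider {
    fn provider_id(&self) -> &str { "mock" }
}

/// Hypothetical factory: maps a provider name to a boxed implementation.
/// The name strings are assumptions, not values confirmed by the project.
pub fn create_provider(name: &str) -> Result<Box<dyn Provider>, String> {
    match name {
        "gemini" => Ok(Box::new(GeminiProvider)),
        "openai" => Ok(Box::new(OpenAiProvider)),
        "anthropic" => Ok(Box::new(AnthropicProvider)),
        "mock" => Ok(Box::new(MockLlmProvider)),
        other => Err(format!("unknown LLM provider: {other}")),
    }
}
```

Returning a trait object keeps callers independent of the concrete backend, which is what lets the mock provider stand in for a real one in tests.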
### 3.2 Synthesis Pipeline
The pipeline is the core business logic, orchestrated in `services/synthesis.rs`. It runs as a background tokio task with a 15-minute timeout.
**Three phases:**
1. **Phase 1 -- Personalized Sources**: Extract article links from user-curated source pages (windowed, rolling), scrape articles, classify and summarize each via LLM. Batched processing with configurable `batch_size`.
2. **Phase 2 -- Web Search Fallback**: For under-filled categories, either call the Brave Search API or use the LLM's web search capability to find additional articles. Scrape and validate results.
3. **Save**: Assemble sections by category, sanitize JSON, persist to database, record article history traces.
Progress is reported via `tokio::sync::watch` channels consumed by SSE endpoints.
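
The hand-off from Phase 1 to Phase 2 can be sketched as a filter over category counts. This is an illustration only: `under_filled` is a hypothetical helper, and treating a category as under-filled when it holds fewer than `max_items_per_category` articles is an assumption about the threshold.

```rust
use std::collections::HashMap;

/// Returns the categories that still hold fewer than `max_items` articles
/// after Phase 1, in the order they were configured.
pub fn under_filled<'a>(
    categories: &'a [String],
    phase1_counts: &HashMap<String, usize>,
    max_items: usize,
) -> Vec<&'a str> {
    categories
        .iter()
        // A category absent from the map received no articles in Phase 1.
        .filter(|c| phase1_counts.get(c.as_str()).copied().unwrap_or(0) < max_items)
        .map(|c| c.as_str())
        .collect()
}
```

Only the categories returned here would need a Brave Search or LLM web search query in Phase 2.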
### 3.3 Job Store
`JobStore` (`services/job_store.rs`) is an in-memory concurrent store for active generation jobs:
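- Backed by `DashMap<Uuid, JobEntry>` for concurrent access without a global lock (DashMap shards its entries behind per-shard locks, so it is fine-grained rather than strictly lock-free)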
- `DashSet<Uuid>` for per-user deduplication (one active job per user)
- Each job holds a `watch::Sender<ProgressEvent>` for real-time SSE streaming
- `AtomicBool` for cooperative cancellation
- 1-hour TTL with automatic cleanup
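
The store's responsibilities can be sketched with the standard library alone. This sketch makes several substitutions that are assumptions, not the project's code: a `Mutex<HashMap>` replaces `DashMap`/`DashSet`, `u64` ids replace `Uuid`, the `watch::Sender` is omitted, and all method names are hypothetical.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

pub struct JobEntry {
    pub user_id: u64,
    pub cancel: Arc<AtomicBool>,
    pub created_at: Instant,
}

#[derive(Default)]
struct Inner {
    jobs: HashMap<u64, JobEntry>,
    active_users: HashSet<u64>,
    next_id: u64,
}

#[derive(Default)]
pub struct JobStore {
    inner: Mutex<Inner>,
}

impl JobStore {
    /// Registers a job unless the user already has one running.
    pub fn try_start(&self, user_id: u64) -> Option<u64> {
        let mut g = self.inner.lock().unwrap();
        if !g.active_users.insert(user_id) {
            return None; // per-user deduplication
        }
        g.next_id += 1;
        let id = g.next_id;
        let entry = JobEntry {
            user_id,
            cancel: Arc::new(AtomicBool::new(false)),
            created_at: Instant::now(),
        };
        g.jobs.insert(id, entry);
        Some(id)
    }

    /// Sets the cooperative cancellation flag; the pipeline polls it.
    pub fn cancel(&self, job_id: u64) -> bool {
        let g = self.inner.lock().unwrap();
        match g.jobs.get(&job_id) {
            Some(job) => {
                job.cancel.store(true, Ordering::SeqCst);
                true
            }
            None => false,
        }
    }

    /// Removes a finished job and frees the user's slot.
    pub fn finish(&self, job_id: u64) {
        let mut g = self.inner.lock().unwrap();
        if let Some(job) = g.jobs.remove(&job_id) {
            g.active_users.remove(&job.user_id);
        }
    }

    /// Drops entries older than `ttl`; returns how many were removed.
    pub fn cleanup(&self, ttl: Duration) -> usize {
        let mut g = self.inner.lock().unwrap();
        let inner = &mut *g;
        let before = inner.jobs.len();
        let active = &mut inner.active_users;
        inner.jobs.retain(|_, job| {
            let keep = job.created_at.elapsed() < ttl;
            if !keep {
                active.remove(&job.user_id);
            }
            keep
        });
        before - inner.jobs.len()
    }
}
```

A single mutex is simpler to reason about than a sharded map but serializes all callers, which is the trade-off a sharded `DashMap` avoids.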
### 3.4 Scheduler
`services/scheduler.rs` runs as a background task, checking every minute for due `theme_schedules`. When a schedule fires:
1. Query `find_due_schedules` matching current day code + time
2. Skip if user already has a manual generation in progress
3. Run `synthesis::run_generation_inner` directly
4. Send email to configured recipients (up to 3)
5. Mark schedule as run
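
Step 1 can be illustrated with a small due-check. Everything here is an assumption made for the sketch: the `Schedule` fields, the `"mon"`..`"sun"` day codes, the `(hour, minute)` encoding of `time_utc`, and the rule that a schedule does not fire twice in the same minute. The real check runs as a SQL query in `find_due_schedules`.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Schedule {
    pub enabled: bool,
    pub days: Vec<String>,            // assumed day codes: "mon".."sun"
    pub time_utc: (u8, u8),           // assumed encoding: (hour, minute)
    pub last_run_minute: Option<u64>, // minutes since the Unix epoch
}

/// Splits Unix seconds (UTC) into (day code, hour, minute).
pub fn utc_parts(unix_secs: u64) -> (&'static str, u8, u8) {
    // 1970-01-01 was a Thursday, so day 0 maps to "thu".
    const DAYS: [&str; 7] = ["thu", "fri", "sat", "sun", "mon", "tue", "wed"];
    let day = DAYS[((unix_secs / 86_400) % 7) as usize];
    let secs_of_day = unix_secs % 86_400;
    let hour = (secs_of_day / 3_600) as u8;
    let minute = ((secs_of_day % 3_600) / 60) as u8;
    (day, hour, minute)
}

/// True when the schedule matches the current day code and minute
/// and has not already run in this minute.
pub fn is_due(s: &Schedule, now_unix_secs: u64) -> bool {
    let (day, hour, minute) = utc_parts(now_unix_secs);
    let already_ran = s.last_run_minute == Some(now_unix_secs / 60);
    s.enabled
        && !already_ran
        && s.time_utc == (hour, minute)
        && s.days.iter().any(|d| d.as_str() == day)
}
```

Recording the minute of the last run is one way to keep a once-per-minute poll from firing the same schedule twice.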
### 3.5 Scraper
Two scraping services:
- **`scraper.rs`**: Article page scraper with SSRF prevention, HTML parsing, title/date/body extraction, soft-404 detection, 15s timeout, 5MB body limit.
- **`source_scraper.rs`**: Source index page scraper that extracts article links from user-configured source URLs (HTML `<a>` parsing with filters, or LLM-assisted extraction).
### 3.6 Rate Limiters
- **Auth rate limiter**: 10 requests/60s per key (email or IP) for magic link endpoints.
- **Provider rate limiter**: Per-LLM-provider sliding window, admin-configured, hot-reloaded from DB.
- **User rate limiters**: Per-user generation rate limits cached in `DashMap`, recreated on settings change.
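
A sliding-window limiter of the kind described for providers can be sketched as follows, using the auth limiter's 10 requests per 60 s as example numbers. The type and method names are hypothetical, and passing the current time in as whole seconds is an assumption made to keep the sketch deterministic.

```rust
use std::collections::{HashMap, VecDeque};

pub struct SlidingWindowLimiter {
    max_requests: usize,
    window_secs: u64,
    hits: HashMap<String, VecDeque<u64>>,
}

impl SlidingWindowLimiter {
    pub fn new(max_requests: usize, window_secs: u64) -> Self {
        Self { max_requests, window_secs, hits: HashMap::new() }
    }

    /// Returns whether a request for `key` at time `now` (seconds) is allowed.
    /// Allowed requests are recorded; rejected ones are not.
    pub fn check(&mut self, key: &str, now: u64) -> bool {
        let q = self.hits.entry(key.to_string()).or_default();
        // Evict timestamps that have left the window.
        while let Some(&t) = q.front() {
            if now.saturating_sub(t) >= self.window_secs {
                q.pop_front();
            } else {
                break;
            }
        }
        if q.len() >= self.max_requests {
            return false;
        }
        q.push_back(now);
        true
    }
}
```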
---
## 4. Data Model
### Tables and Relationships
```
users
├── sessions (user_id FK, CASCADE)
├── magic_tokens (email reference, no FK)
├── settings (user_id PK/FK, CASCADE)
├── themes (user_id FK, CASCADE)
│   ├── sources (theme_id FK, CASCADE)
│   ├── syntheses (theme_id FK, SET NULL)
│   └── theme_schedules (theme_id FK, CASCADE, UNIQUE)
├── user_api_keys (user_id FK, CASCADE; UNIQUE per provider)
├── article_history (user_id FK, CASCADE)
├── llm_call_log (user_id FK, CASCADE)
└── audit_log (admin_user_id FK, SET NULL)

admin_providers
└── admin_rate_limits (provider_name FK, CASCADE)
```
### Table Summary
| Table | Purpose | Key Columns |
|---|---|---|
| `users` | User accounts | id, email, display_name, role (user/admin), created_at |
| `sessions` | Login sessions | session_hash (PK), user_id, expires_at, last_active_at, ip_address |
| `magic_tokens` | Passwordless auth tokens | id, email, token_hash, expires_at, used |
| `settings` | Per-user pipeline config | user_id (PK), ai_provider, ai_model, ai_model_websearch, batch_size, max_articles_per_source, max_links_per_source, use_brave_search, source_extraction_window, article_history_days, search_agent_behavior, rate_limit_max_requests, rate_limit_time_window_seconds |
| `themes` | Per-user topic configurations | id, user_id, name, theme, categories (JSONB), max_items_per_category, max_age_days, summary_length |
| `sources` | User-curated news source URLs | id, user_id, title, url, theme_id, is_preferred |
| `syntheses` | Generated synthesis results | id, user_id, week, sections (JSONB), status, job_id, theme_id |
| `theme_schedules` | Automated generation schedules | id, theme_id (UNIQUE), user_id, enabled, days (JSONB), time_utc, emails (JSONB), last_run_at |
| `article_history` | Article URL dedup + provenance trace | id, user_id, url, url_hash, title, source_type, source_url, category, synthesis_id, status, scraped_ok, job_id, published_date |
| `llm_call_log` | Full LLM interaction log | id, user_id, job_id, call_type, model, system_prompt, user_prompt, response_body, duration_ms, article_url |
| `admin_providers` | Admin-curated LLM provider catalog | id, provider_name (UNIQUE), display_name, models_scraping (JSONB), models_websearch (JSONB), is_enabled |
| `admin_rate_limits` | Per-provider rate limit config | id, provider_name (UNIQUE, FK), max_requests, time_window_seconds |
| `user_api_keys` | Encrypted user LLM API keys | id, user_id, provider_name, encrypted_key (BYTEA), nonce (BYTEA), key_prefix; UNIQUE(user_id, provider_name) |
| `audit_log` | Admin mutation audit trail | id, admin_user_id, action, target_type, target_id, details (JSONB) |
---
## 5. API Overview
All API routes are prefixed with `/api/v1`. CSRF protection (`X-Requested-With` header) is applied to all mutating endpoints.
### Authentication
| Method | Path | Auth | Description |
|---|---|---|---|
| POST | /auth/register | Public | Create account + send magic link |
| POST | /auth/login | Public | Request magic link |
| GET | /auth/verify | Public | Verify token (email click redirect) |
| POST | /auth/verify | Public | Verify token (frontend API call) |
| POST | /auth/logout | Authenticated | Destroy session |
| GET | /auth/me | Authenticated | Current user info |
### Settings
| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /settings | Authenticated | Get user settings |
| PUT | /settings | Authenticated | Update user settings |
### Themes
| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /themes | Authenticated | List user themes |
| POST | /themes | Authenticated | Create theme |
| PUT | /themes/{id} | Authenticated | Update theme |
| DELETE | /themes/{id} | Authenticated | Delete theme |
### Schedules
| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /themes/{id}/schedule | Authenticated | Get theme schedule |
| PUT | /themes/{id}/schedule | Authenticated | Create or update schedule |
| DELETE | /themes/{id}/schedule | Authenticated | Delete schedule |
### Sources
| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /sources | Authenticated | List sources |
| POST | /sources | Authenticated | Create source |
| PUT | /sources/preferred | Authenticated | Update preferred sources |
| DELETE | /sources/{id} | Authenticated | Delete source |
| POST | /sources/bulk | Authenticated | Bulk import (JSON) |
| POST | /sources/import-csv | Authenticated | Import from CSV |
| GET | /sources/export-csv | Authenticated | Export as CSV |
### Syntheses & Generation
| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /syntheses | Authenticated | List syntheses |
| GET | /syntheses/{id} | Authenticated | Get full synthesis |
| DELETE | /syntheses/{id} | Authenticated | Delete synthesis |
| POST | /syntheses/generate | Authenticated | Trigger generation |
| GET | /syntheses/generate/{job_id}/progress | Authenticated | SSE progress stream |
| POST | /syntheses/generate/{job_id}/stop | Authenticated | Cancel generation |
| POST | /syntheses/{id}/send-email | Authenticated | Email synthesis |
| GET | /syntheses/{id}/export/markdown | Authenticated | Markdown download |
| GET | /syntheses/{id}/export/pdf | Authenticated | PDF download |
### Article History & LLM Logs
| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /article-history | Authenticated | List article history |
| DELETE | /article-history | Authenticated | Clear article history |
| GET | /syntheses/{id}/provenance | Authenticated | Get synthesis provenance |
| GET | /llm-logs/{job_id} | Authenticated | Get LLM call logs for job |
### User API Keys
| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /user/api-keys | Authenticated | List keys (prefix only) |
| POST | /user/api-keys | Authenticated | Store encrypted key |
| DELETE | /user/api-keys/{provider} | Authenticated | Delete key |
| POST | /user/api-keys/{provider}/test | Authenticated | Test key validity |
| POST | /user/api-keys/export | Authenticated | Export keys |
### Configuration & Admin
| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /config/providers | Authenticated | Available providers/models |
| GET | /admin/providers | Admin | List all providers |
| POST | /admin/providers | Admin | Create provider |
| PUT | /admin/providers/{id} | Admin | Update provider |
| DELETE | /admin/providers/{id} | Admin | Delete provider |
| GET | /admin/rate-limits | Admin | List rate limits |
| PUT | /admin/rate-limits/{provider_name} | Admin | Update rate limit |
| GET | /admin/users | Admin | List users |
| PUT | /admin/users/{id}/role | Admin | Change user role |
### Infrastructure
| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /health | Public | Health check |
---
## 6. Security Architecture
### Authentication & Session Management
- **Passwordless**: Magic link tokens sent via email (Resend API), single-use, time-limited
- **Captcha**: Cloudflare Turnstile on registration and login
- **Sessions**: SHA-256 hashed tokens stored in DB, 30-day expiry, `HttpOnly` + `SameSite=Lax` cookies, optionally `Secure`
- **Anti-enumeration**: Identical responses for registered and unregistered emails, plus timing-attack mitigation
- **Authorization**: `AuthUser` and `AdminUser` Axum extractors enforce auth levels per handler
### CSRF Protection
All mutating API endpoints require the `X-Requested-With` header (checked by `csrf::csrf_check` middleware layer). Non-mutating GET/HEAD/OPTIONS requests are exempt.
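
The rule reduces to a small predicate. The sketch below is not the `csrf::csrf_check` middleware itself: `csrf_allows` is a hypothetical function, headers are modelled as plain string pairs, and only the header's presence is checked (whether its value matters is not stated here).

```rust
/// Returns true when the request may proceed under the CSRF rule:
/// safe methods are exempt, everything else needs `X-Requested-With`.
pub fn csrf_allows(method: &str, headers: &[(&str, &str)]) -> bool {
    let is_safe = matches!(method, "GET" | "HEAD" | "OPTIONS");
    // Header names are case-insensitive in HTTP.
    let has_header = headers
        .iter()
        .any(|(name, _)| name.eq_ignore_ascii_case("x-requested-with"));
    is_safe || has_header
}
```

The check works because browsers do not let a cross-origin form or plain link attach custom headers without a CORS preflight, which the CORS policy restricts to the `APP_URL` origin.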
### Encryption at Rest
User LLM API keys are encrypted with AES-256-GCM before storage:
- 32-byte master key from `MASTER_ENCRYPTION_KEY` env var (64 hex chars)
- Random 12-byte nonce per encryption (stored alongside ciphertext)
- Key bytes are zeroized on drop (`zeroize` crate)
- Only a key prefix (first 8 chars + "...") is ever returned via the API
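
Two of the steps above need no cryptography crate and can be sketched directly: decoding the 64-character hex master key and deriving the display prefix. `parse_master_key` and `key_prefix` are hypothetical names; the AES-256-GCM encryption and zeroization are left out because they depend on external crates.

```rust
/// Decodes a 64-character hex string into a 32-byte key.
pub fn parse_master_key(hex: &str) -> Result<[u8; 32], String> {
    if hex.len() != 64 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Err("MASTER_ENCRYPTION_KEY must be 64 hex characters".to_string());
    }
    let mut key = [0u8; 32];
    for (i, byte) in key.iter_mut().enumerate() {
        // Slicing is safe: the input was verified to be ASCII above.
        *byte = u8::from_str_radix(&hex[2 * i..2 * i + 2], 16)
            .map_err(|e| e.to_string())?;
    }
    Ok(key)
}

/// Returns the first 8 characters of an API key followed by "...".
pub fn key_prefix(api_key: &str) -> String {
    let prefix: String = api_key.chars().take(8).collect();
    format!("{prefix}...")
}
```

Validating the characters before parsing matters because `u8::from_str_radix` also accepts a leading `+`, which would otherwise let a malformed key through.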
### SSRF Prevention
Both `scraper.rs` and `source_scraper.rs` validate URLs before fetching:
- DNS resolution check against private/loopback IP ranges
- Redirect chain validation (no redirects to private IPs)
- Only HTTP/HTTPS schemes allowed
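
The address and scheme checks can be sketched with `std::net`. The function names are hypothetical, and the exact set of blocked ranges is an assumption: beyond the private and loopback ranges named above, this sketch also rejects link-local, unspecified, broadcast, and IPv4-mapped IPv6 addresses.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// Returns true when a resolved address must not be fetched.
pub fn is_blocked_ip(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => is_blocked_v4(v4),
        // Unwrap IPv4-mapped addresses (::ffff:a.b.c.d) before checking.
        IpAddr::V6(v6) => match v6.to_ipv4_mapped() {
            Some(v4) => is_blocked_v4(v4),
            None => is_blocked_v6(v6),
        },
    }
}

fn is_blocked_v4(ip: Ipv4Addr) -> bool {
    ip.is_private()
        || ip.is_loopback()
        || ip.is_link_local()
        || ip.is_unspecified()
        || ip.is_broadcast()
}

fn is_blocked_v6(ip: Ipv6Addr) -> bool {
    let first = ip.segments()[0];
    ip.is_loopback()
        || ip.is_unspecified()
        || (first & 0xfe00) == 0xfc00 // unique local, fc00::/7
        || (first & 0xffc0) == 0xfe80 // link-local, fe80::/10
}

/// Only HTTP and HTTPS URLs are eligible for fetching.
pub fn has_allowed_scheme(url: &str) -> bool {
    let lower = url.to_ascii_lowercase();
    lower.starts_with("http://") || lower.starts_with("https://")
}
```

The same predicate has to run on every redirect hop, since a public URL can redirect to an internal address.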
### Security Headers
Applied as global middleware layers:
- `Content-Security-Policy` (self + Cloudflare Turnstile)
- `X-Content-Type-Options: nosniff`
- `X-Frame-Options: DENY`
- `Referrer-Policy: strict-origin-when-cross-origin`
- `X-XSS-Protection: 1; mode=block`
- `Strict-Transport-Security` (HTTPS only)
### Error Sanitization
The `sanitize_error_message` function strips API keys and internal details from error messages before they reach SSE clients. Internal errors log full details server-side but return generic messages to users.
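
As a rough illustration of the redaction idea, the sketch below masks tokens that look like API keys. It is not the project's `sanitize_error_message`: `redact_secrets` is a hypothetical name, and the heuristics (tokens starting with `sk-` or `AIza`, and anything following `key=`) are assumptions chosen for the example.

```rust
/// Masks words that look like API keys before a message leaves the server.
pub fn redact_secrets(msg: &str) -> String {
    // Assumed key shapes; a real list would depend on the providers in use.
    const KEY_PREFIXES: [&str; 2] = ["sk-", "AIza"];
    msg.split(' ')
        .map(|word| {
            if KEY_PREFIXES.iter().any(|p| word.starts_with(*p)) {
                "[REDACTED]".to_string()
            } else if let Some(pos) = word.find("key=") {
                // Keep the URL up to the parameter name, drop its value.
                format!("{}key=[REDACTED]", &word[..pos])
            } else {
                word.to_string()
            }
        })
        .collect::<Vec<String>>()
        .join(" ")
}
```

Pattern-based redaction is best-effort; returning a generic message to users, as described above, remains the safer default for internal errors.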
### CORS
CORS is configured to allow only the `APP_URL` origin, with credentials (cookies) enabled and methods limited to GET, POST, PUT, and DELETE.
---
## 7. Concurrency Model
### Async Runtime
Tokio with full features. The Axum server runs on Tokio's multi-threaded runtime.
### Background Tasks
Spawned at startup via `tokio::spawn`:
- **Session cleanup**: Hourly deletion of expired DB sessions
- **Job store cleanup**: Periodic removal of expired job entries (1-hour TTL)
- **Scheduler**: Minute-by-minute check for due theme schedules
### Generation Pipeline Concurrency
- **`tokio::task::JoinSet`**: Used for parallel scraping (bounded concurrency of 5 for source extraction) and parallel LLM classification calls within each batch
- **`tokio::sync::watch`**: Fan-out progress notifications to SSE clients; late subscribers immediately receive the latest state
- **`AtomicBool`**: Cooperative cancellation flag checked between pipeline stages; avoids mutex overhead
- **`DashMap` / `DashSet`**: Sharded concurrent maps and sets (fine-grained per-shard locking rather than a single global lock) used for the job store (job entries), generating-users set, per-user rate limiter cache, and provider rate limiter state
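
Cooperative cancellation between stages can be sketched as a flag check ahead of each stage. `run_stages` and `Outcome` are hypothetical names, and the stages are modelled as plain names handed to a callback rather than as the real async pipeline phases.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

#[derive(Debug, PartialEq)]
pub enum Outcome {
    Completed,
    Cancelled { stages_run: usize },
}

/// Runs each stage in order, checking the cancellation flag before every one.
/// A stage that has already started is never interrupted midway.
pub fn run_stages<F: FnMut(&str)>(cancel: &AtomicBool, stages: &[&str], mut run: F) -> Outcome {
    for (i, &name) in stages.iter().enumerate() {
        if cancel.load(Ordering::Relaxed) {
            return Outcome::Cancelled { stages_run: i };
        }
        run(name);
    }
    Outcome::Completed
}
```

Because the flag is only polled between stages, a stop request takes effect at the next boundary, so cancellation latency is bounded by the longest single stage.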
### Task Lifecycle
```
POST /generate
└── handler creates job in JobStore
    └── spawns outer task (panic monitor)
        ├── spawns inner task (15-min timeout)
        │   └── run_generation_inner()
        │       ├── Phase 1 (JoinSet scrape, JoinSet classify)
        │       ├── Phase 2 (JoinSet scrape, JoinSet classify)
        │       └── Save to DB
        ├── on complete/error: send final ProgressEvent
        └── delayed cleanup (5 min) then remove from JobStore
```
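
The outer/inner split can be illustrated with standard-library threads as an analogy for the Tokio tasks. This is a sketch under assumptions: `run_monitored` and `FinalEvent` are hypothetical, the inner work is a plain closure, and unlike a Tokio timeout, which drops the future, a timed-out thread here keeps running in the background.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum FinalEvent {
    Done(String),
    Failed(String),
}

/// Runs `work` on an inner thread and always produces one final event,
/// whether the work completes, times out, or panics.
pub fn run_monitored<F>(work: F, timeout: Duration) -> FinalEvent
where
    F: FnOnce() -> String + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    let inner = thread::spawn(move || {
        let _ = tx.send(work());
    });
    match rx.recv_timeout(timeout) {
        Ok(result) => FinalEvent::Done(result),
        Err(mpsc::RecvTimeoutError::Timeout) => FinalEvent::Failed("timed out".to_string()),
        Err(mpsc::RecvTimeoutError::Disconnected) => {
            // The sender was dropped without sending: the inner thread panicked.
            let _ = inner.join();
            FinalEvent::Failed("generation task panicked".to_string())
        }
    }
}
```

Keeping the monitor separate from the work ensures that SSE clients receive a terminal event even when the pipeline itself cannot report one.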
### Graceful Shutdown
The server supports graceful shutdown via signal handling, allowing in-flight requests to complete.