# AI Weekly Synth -- Architecture Document

## 1. System Overview

AI Weekly Synth is a self-hosted web application that generates AI-powered weekly news syntheses. Users configure topics (themes), categories, and an LLM provider; the system then searches the web, scrapes and validates sources, classifies articles, and produces structured summaries.

### Technology Stack

| Layer | Technology |
|---|---|
| Backend | Rust (Axum 0.8) |
| Frontend | SolidJS 1.9 + Tailwind CSS v4 |
| Database | PostgreSQL 17 (via sqlx with compile-time query checking) |
| Deployment | Docker Compose (app + Postgres) |

### Deployment Topology

```
docker-compose.yml
  ├── app  (ai-synth)       port 8080
  │     ├── Axum HTTP server
  │     ├── Static file serving (SPA fallback)
  │     └── Background tasks (scheduler, session cleanup, job TTL)
  └── db   (postgres:17-alpine)  port 5432 (localhost only)
        └── postgres_data volume
```

The app container builds from a multi-stage Dockerfile, serves the SolidJS frontend as static files, and connects to Postgres over the `internal` bridge network.

---

## 2. Layer Architecture

The backend follows a three-layer architecture with shared model types:

```
handlers/  (HTTP layer)
    │
    ├── extracts request data (Axum extractors, JSON, path params)
    ├── validates input
    ├── calls services/ or db/ directly
    └── formats HTTP responses
    │
services/  (Business logic)
    │
    ├── synthesis pipeline orchestration
    ├── LLM provider abstraction + factory
    ├── scraping (articles, source pages)
    ├── encryption, email, CSV, PDF export
    ├── rate limiting, job store, scheduler
    └── Brave Search client
    │
db/  (Data access)
    │
    ├── pure SQL queries via sqlx
    ├── typed result mapping (FromRow)
    └── no business logic
    │
models/  (Shared types -- used by all layers)
    │
    ├── domain structs (User, Theme, Source, Synthesis, etc.)
    ├── request/response DTOs
    └── validation logic
```

### Module Inventory

**Handlers** (`handlers/`): `admin`, `api_keys`, `article_history`, `auth`, `config`, `generation`, `health`, `llm_logs`, `schedules`, `settings`, `sources`, `syntheses`, `themes`

**Services** (`services/`): `auth`, `brave_search`, `csv`, `email`, `encryption`, `export`, `job_store`, `llm` (with `gemini`, `openai`, `anthropic`, `mock`, `factory`, `schema`), `prompts`, `rate_limiter`, `scheduler`, `scraper`, `source_scraper`, `synthesis`, `turnstile`

**DB** (`db/`): `api_keys`, `article_history`, `audit`, `llm_call_log`, `magic_links`, `providers`, `rate_limits`, `schedules`, `sessions`, `settings`, `sources`, `syntheses`, `themes`, `users`

**Models** (`models/`): `api_key`, `audit`, `magic_link`, `provider`, `rate_limit`, `schedule`, `session`, `settings`, `source`, `synthesis`, `theme`, `user`

---

## 3. Key Components

### 3.1 LLM Provider Abstraction

The `LlmProvider` trait defines a unified interface for all LLM backends:

```rust
#[async_trait]
pub trait LlmProvider: Send + Sync {
    fn provider_id(&self) -> &str;
    async fn call_llm(&self, model: &str, system_prompt: &str,
                       user_prompt: &str, response_schema: &Value)
        -> Result<Value, AppError>;
}
```

Implementations: `GeminiProvider`, `OpenAiProvider`, `AnthropicProvider`, `MockLlmProvider`.

The factory (`llm/factory.rs`) creates provider instances by name. The mock provider enables end-to-end pipeline testing without real API calls.
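
The name-based lookup can be illustrated with a deliberately simplified sketch. It assumes a synchronous stand-in trait (`SimpleProvider`), a hypothetical `ProviderError` type, and a `create_provider` function; none of these names come from the codebase, and the real trait is async and works with JSON `Value`s.

```rust
/// Hypothetical error type for this sketch only.
#[derive(Debug, PartialEq)]
pub enum ProviderError {
    UnknownProvider(String),
}

/// Simplified, synchronous stand-in for the async `LlmProvider` trait.
pub trait SimpleProvider {
    fn provider_id(&self) -> &str;
    fn call_llm(&self, model: &str, user_prompt: &str) -> Result<String, ProviderError>;
}

/// Mock provider: returns a canned JSON string and never touches the network.
pub struct MockProvider;

impl SimpleProvider for MockProvider {
    fn provider_id(&self) -> &str {
        "mock"
    }

    fn call_llm(&self, model: &str, user_prompt: &str) -> Result<String, ProviderError> {
        Ok(format!(
            "{{\"model\":\"{model}\",\"echo_len\":{}}}",
            user_prompt.len()
        ))
    }
}

/// Factory: maps a provider name to a boxed trait object.
/// Only the mock is wired up here; real backends would add more arms.
pub fn create_provider(name: &str) -> Result<Box<dyn SimpleProvider>, ProviderError> {
    match name {
        "mock" => Ok(Box::new(MockProvider)),
        other => Err(ProviderError::UnknownProvider(other.to_string())),
    }
}
```

Returning a boxed trait object keeps callers independent of the concrete backend, which is what lets a mock stand in for a real provider in pipeline tests.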

### 3.2 Synthesis Pipeline

The pipeline is the core business logic, orchestrated in `services/synthesis.rs`. It runs as a background Tokio task with a 15-minute timeout.

**Three phases:**

1. **Phase 1 -- Personalized Sources**: Extract article links from user-curated source pages (windowed, rolling), scrape articles, classify and summarize each via LLM. Batched processing with configurable `batch_size`.

2. **Phase 2 -- Web Search Fallback**: For under-filled categories, either call the Brave Search API or use the LLM's web search capability to find additional articles. Scrape and validate results.

3. **Save**: Assemble sections by category, sanitize JSON, persist to database, record article history traces.

Progress is reported via `tokio::sync::watch` channels consumed by SSE endpoints.
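
As a rough illustration of how Phase 2 might decide which categories need a web search, the sketch below treats a category as under-filled when Phase 1 produced fewer than `max_items_per_category` articles for it. That definition, and the `underfilled_categories` helper itself, are assumptions made for this example rather than the pipeline's actual rule.

```rust
use std::collections::HashMap;

/// Returns `(category, missing_count)` for every configured category that
/// holds fewer articles than `max_items_per_category`, in configuration order.
pub fn underfilled_categories(
    categories: &[&str],
    phase1_counts: &HashMap<String, usize>,
    max_items_per_category: usize,
) -> Vec<(String, usize)> {
    categories
        .iter()
        .filter_map(|cat| {
            // Categories with no Phase 1 articles are absent from the map.
            let have = phase1_counts.get(*cat).copied().unwrap_or(0);
            (have < max_items_per_category)
                .then(|| (cat.to_string(), max_items_per_category - have))
        })
        .collect()
}
```

With counts of 5 for "Research" and 2 for "Industry" and a limit of 5, only "Industry" (3 missing) and any category with no articles at all would be passed to the search step.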

### 3.3 Job Store

`JobStore` (`services/job_store.rs`) is an in-memory concurrent store for active generation jobs:

- Backed by `DashMap<Uuid, JobEntry>`, whose sharded locks keep contention low under concurrent access
- `DashSet<Uuid>` for per-user deduplication (one active job per user)
- Each job holds a `watch::Sender<ProgressEvent>` for real-time SSE streaming
- `AtomicBool` for cooperative cancellation
- 1-hour TTL with automatic cleanup
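
The behaviour listed above (per-user deduplication, a cancellation flag, TTL cleanup) can be sketched with the standard library alone. This is not the real `JobStore`: it assumes a single `Mutex` in place of `DashMap`/`DashSet`, `u64` ids in place of `Uuid`, and it omits the progress channel. `SimpleJobStore` and its method names are hypothetical.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

pub struct JobEntry {
    pub user_id: u64,
    pub created_at: Instant,
    pub cancelled: Arc<AtomicBool>,
}

#[derive(Default)]
struct Inner {
    jobs: HashMap<u64, JobEntry>,
    active_users: HashSet<u64>,
    next_id: u64,
}

#[derive(Default)]
pub struct SimpleJobStore {
    inner: Mutex<Inner>,
}

impl SimpleJobStore {
    /// Registers a job unless the user already has one; returns the new job id.
    pub fn try_start(&self, user_id: u64, now: Instant) -> Option<u64> {
        let mut inner = self.inner.lock().unwrap();
        if !inner.active_users.insert(user_id) {
            return None; // one active job per user
        }
        inner.next_id += 1;
        let id = inner.next_id;
        let entry = JobEntry {
            user_id,
            created_at: now,
            cancelled: Arc::new(AtomicBool::new(false)),
        };
        inner.jobs.insert(id, entry);
        Some(id)
    }

    /// Sets the cooperative cancellation flag; returns false for unknown jobs.
    pub fn cancel(&self, job_id: u64) -> bool {
        let inner = self.inner.lock().unwrap();
        match inner.jobs.get(&job_id) {
            Some(job) => {
                job.cancelled.store(true, Ordering::SeqCst);
                true
            }
            None => false,
        }
    }

    /// Removes jobs older than `ttl`, frees their users, and returns the count.
    pub fn cleanup(&self, now: Instant, ttl: Duration) -> usize {
        let mut inner = self.inner.lock().unwrap();
        let expired: Vec<u64> = inner
            .jobs
            .iter()
            .filter(|(_, job)| now.duration_since(job.created_at) >= ttl)
            .map(|(id, _)| *id)
            .collect();
        for id in &expired {
            if let Some(job) = inner.jobs.remove(id) {
                inner.active_users.remove(&job.user_id);
            }
        }
        expired.len()
    }
}
```

Passing `now` in explicitly keeps the TTL logic testable without sleeping; a production store would more likely read the clock itself.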

### 3.4 Scheduler

`services/scheduler.rs` runs as a background task, checking every minute for due `theme_schedules`. When a schedule fires:

1. Query `find_due_schedules` matching current day code + time
2. Skip if user already has a manual generation in progress
3. Run `synthesis::run_generation_inner` directly
4. Send email to configured recipients (up to 3)
5. Mark schedule as run
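
Step 1 is performed in SQL by `find_due_schedules`; the sketch below expresses a comparable "day code + time" check in plain Rust to make the matching rule concrete. The encodings are assumptions for illustration only: lowercase three-letter day codes, an `HH:MM` string for `time_utc`, and a hypothetical `last_run_minute` field used to avoid firing twice within the same minute.

```rust
/// Hypothetical in-memory view of a `theme_schedules` row.
pub struct Schedule {
    pub enabled: bool,
    pub days: Vec<String>,            // assumed encoding, e.g. ["mon", "thu"]
    pub time_utc: String,             // assumed encoding, "HH:MM"
    pub last_run_minute: Option<u64>, // minutes since the Unix epoch
}

/// 1970-01-01 was a Thursday, so day index 0 maps to "thu".
const DAY_CODES: [&str; 7] = ["thu", "fri", "sat", "sun", "mon", "tue", "wed"];

/// Returns true if the schedule should fire at `unix_secs` (UTC).
pub fn is_due(schedule: &Schedule, unix_secs: u64) -> bool {
    let minute = unix_secs / 60;
    let day_code = DAY_CODES[((unix_secs / 86_400) % 7) as usize];
    let hhmm = format!(
        "{:02}:{:02}",
        (unix_secs % 86_400) / 3_600,
        (unix_secs % 3_600) / 60
    );
    schedule.enabled
        && schedule.days.iter().any(|d| d.as_str() == day_code)
        && schedule.time_utc == hhmm
        && schedule.last_run_minute != Some(minute)
}
```

Because the scheduler wakes once per minute, matching on the minute and remembering the last run guards against both missed and duplicate firings within that minute.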

### 3.5 Scraper

Two scraping services:

- **`scraper.rs`**: Article page scraper with SSRF prevention, HTML parsing, title/date/body extraction, soft-404 detection, 15s timeout, 5MB body limit.
- **`source_scraper.rs`**: Source index page scraper that extracts article links from user-configured source URLs (HTML `<a>` parsing with filters, or LLM-assisted extraction).

### 3.6 Rate Limiters

- **Auth rate limiter**: 10 requests/60s per key (email or IP) for magic link endpoints.
- **Provider rate limiter**: Per-LLM-provider sliding window, admin-configured, hot-reloaded from DB.
- **User rate limiters**: Per-user generation rate limits cached in `DashMap`, recreated on settings change.
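
A sliding-window limiter of the kind described for providers can be sketched as a queue of request timestamps. This is a minimal single-threaded illustration; the `SlidingWindowLimiter` type is hypothetical, and the real limiters additionally handle keying, sharing across tasks, and reloading from the database.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

pub struct SlidingWindowLimiter {
    max_requests: usize,
    window: Duration,
    hits: VecDeque<Instant>,
}

impl SlidingWindowLimiter {
    pub fn new(max_requests: usize, window: Duration) -> Self {
        Self {
            max_requests,
            window,
            hits: VecDeque::new(),
        }
    }

    /// Returns true and records the request if it fits in the window ending at `now`.
    pub fn try_acquire(&mut self, now: Instant) -> bool {
        // Evict timestamps that have slid out of the window.
        while let Some(&oldest) = self.hits.front() {
            if now.duration_since(oldest) >= self.window {
                self.hits.pop_front();
            } else {
                break;
            }
        }
        if self.hits.len() < self.max_requests {
            self.hits.push_back(now);
            true
        } else {
            false
        }
    }
}
```

With `SlidingWindowLimiter::new(10, Duration::from_secs(60))`, the eleventh request inside any 60-second span is rejected, and capacity returns as soon as the oldest recorded request ages out.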

---

## 4. Data Model

### Tables and Relationships

```
users
  ├── sessions          (user_id FK, CASCADE)
  ├── magic_tokens      (email reference, no FK)
  ├── settings          (user_id PK/FK, CASCADE)
  ├── themes            (user_id FK, CASCADE)
  │     ├── sources           (theme_id FK, CASCADE)
  │     ├── syntheses         (theme_id FK, SET NULL)
  │     └── theme_schedules   (theme_id FK, CASCADE, UNIQUE)
  ├── user_api_keys     (user_id FK, CASCADE; UNIQUE per provider)
  ├── article_history   (user_id FK, CASCADE)
  ├── llm_call_log      (user_id FK, CASCADE)
  └── audit_log         (admin_user_id FK, SET NULL)

admin_providers
  └── admin_rate_limits (provider_name FK, CASCADE)
```

### Table Summary

| Table | Purpose | Key Columns |
|---|---|---|
| `users` | User accounts | id, email, display_name, role (user/admin), created_at |
| `sessions` | Login sessions | session_hash (PK), user_id, expires_at, last_active_at, ip_address |
| `magic_tokens` | Passwordless auth tokens | id, email, token_hash, expires_at, used |
| `settings` | Per-user pipeline config | user_id (PK), ai_provider, ai_model, ai_model_websearch, batch_size, max_articles_per_source, max_links_per_source, use_brave_search, source_extraction_window, article_history_days, search_agent_behavior, rate_limit_max_requests, rate_limit_time_window_seconds |
| `themes` | Per-user topic configurations | id, user_id, name, theme, categories (JSONB), max_items_per_category, max_age_days, summary_length |
| `sources` | User-curated news source URLs | id, user_id, title, url, theme_id, is_preferred |
| `syntheses` | Generated synthesis results | id, user_id, week, sections (JSONB), status, job_id, theme_id |
| `theme_schedules` | Automated generation schedules | id, theme_id (UNIQUE), user_id, enabled, days (JSONB), time_utc, emails (JSONB), last_run_at |
| `article_history` | Article URL dedup + provenance trace | id, user_id, url, url_hash, title, source_type, source_url, category, synthesis_id, status, scraped_ok, job_id, published_date |
| `llm_call_log` | Full LLM interaction log | id, user_id, job_id, call_type, model, system_prompt, user_prompt, response_body, duration_ms, article_url |
| `admin_providers` | Admin-curated LLM provider catalog | id, provider_name (UNIQUE), display_name, models_scraping (JSONB), models_websearch (JSONB), is_enabled |
| `admin_rate_limits` | Per-provider rate limit config | id, provider_name (UNIQUE, FK), max_requests, time_window_seconds |
| `user_api_keys` | Encrypted user LLM API keys | id, user_id, provider_name, encrypted_key (BYTEA), nonce (BYTEA), key_prefix; UNIQUE(user_id, provider_name) |
| `audit_log` | Admin mutation audit trail | id, admin_user_id, action, target_type, target_id, details (JSONB) |

---

## 5. API Overview

All API routes are prefixed with `/api/v1`. CSRF protection (`X-Requested-With` header) is applied to all mutating endpoints.

### Authentication

| Method | Path | Auth | Description |
|---|---|---|---|
| POST | /auth/register | Public | Create account + send magic link |
| POST | /auth/login | Public | Request magic link |
| GET | /auth/verify | Public | Verify token (email click redirect) |
| POST | /auth/verify | Public | Verify token (frontend API call) |
| POST | /auth/logout | Authenticated | Destroy session |
| GET | /auth/me | Authenticated | Current user info |

### Settings

| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /settings | Authenticated | Get user settings |
| PUT | /settings | Authenticated | Update user settings |

### Themes

| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /themes | Authenticated | List user themes |
| POST | /themes | Authenticated | Create theme |
| PUT | /themes/{id} | Authenticated | Update theme |
| DELETE | /themes/{id} | Authenticated | Delete theme |

### Schedules

| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /themes/{id}/schedule | Authenticated | Get theme schedule |
| PUT | /themes/{id}/schedule | Authenticated | Create or update schedule |
| DELETE | /themes/{id}/schedule | Authenticated | Delete schedule |

### Sources

| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /sources | Authenticated | List sources |
| POST | /sources | Authenticated | Create source |
| PUT | /sources/preferred | Authenticated | Update preferred sources |
| DELETE | /sources/{id} | Authenticated | Delete source |
| POST | /sources/bulk | Authenticated | Bulk import (JSON) |
| POST | /sources/import-csv | Authenticated | Import from CSV |
| GET | /sources/export-csv | Authenticated | Export as CSV |

### Syntheses & Generation

| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /syntheses | Authenticated | List syntheses |
| GET | /syntheses/{id} | Authenticated | Get full synthesis |
| DELETE | /syntheses/{id} | Authenticated | Delete synthesis |
| POST | /syntheses/generate | Authenticated | Trigger generation |
| GET | /syntheses/generate/{job_id}/progress | Authenticated | SSE progress stream |
| POST | /syntheses/generate/{job_id}/stop | Authenticated | Cancel generation |
| POST | /syntheses/{id}/send-email | Authenticated | Email synthesis |
| GET | /syntheses/{id}/export/markdown | Authenticated | Markdown download |
| GET | /syntheses/{id}/export/pdf | Authenticated | PDF download |

### Article History & LLM Logs

| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /article-history | Authenticated | List article history |
| DELETE | /article-history | Authenticated | Clear article history |
| GET | /syntheses/{id}/provenance | Authenticated | Get synthesis provenance |
| GET | /llm-logs/{job_id} | Authenticated | Get LLM call logs for job |

### User API Keys

| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /user/api-keys | Authenticated | List keys (prefix only) |
| POST | /user/api-keys | Authenticated | Store encrypted key |
| DELETE | /user/api-keys/{provider} | Authenticated | Delete key |
| POST | /user/api-keys/{provider}/test | Authenticated | Test key validity |
| POST | /user/api-keys/export | Authenticated | Export keys |

### Configuration & Admin

| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /config/providers | Authenticated | Available providers/models |
| GET | /admin/providers | Admin | List all providers |
| POST | /admin/providers | Admin | Create provider |
| PUT | /admin/providers/{id} | Admin | Update provider |
| DELETE | /admin/providers/{id} | Admin | Delete provider |
| GET | /admin/rate-limits | Admin | List rate limits |
| PUT | /admin/rate-limits/{provider_name} | Admin | Update rate limit |
| GET | /admin/users | Admin | List users |
| PUT | /admin/users/{id}/role | Admin | Change user role |

### Infrastructure

| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /health | Public | Health check |

---

## 6. Security Architecture

### Authentication & Session Management

- **Passwordless**: Magic link tokens sent via email (Resend API), single-use, time-limited
- **Captcha**: Cloudflare Turnstile on registration and login
- **Sessions**: SHA-256 hashed tokens stored in DB, 30-day expiry, `HttpOnly` + `SameSite=Lax` cookies, optionally `Secure`
- **Anti-enumeration**: Identical responses for registered and unregistered emails, plus timing-attack mitigation
- **Authorization**: `AuthUser` and `AdminUser` Axum extractors enforce auth levels per handler

### CSRF Protection

All mutating API endpoints require the `X-Requested-With` header (checked by `csrf::csrf_check` middleware layer). Non-mutating GET/HEAD/OPTIONS requests are exempt.
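
The rule can be reduced to a small pure function. The sketch assumes the middleware only checks that the header is present (not its value) and represents headers as plain string pairs; `csrf_allows` is a hypothetical name, not the signature of `csrf::csrf_check`.

```rust
/// Returns true if a request with this method and these headers passes the check.
pub fn csrf_allows(method: &str, headers: &[(&str, &str)]) -> bool {
    // Safe (non-mutating) methods are exempt.
    let is_safe = matches!(
        method.to_ascii_uppercase().as_str(),
        "GET" | "HEAD" | "OPTIONS"
    );
    // Header names are case-insensitive in HTTP.
    is_safe
        || headers
            .iter()
            .any(|(name, _)| name.eq_ignore_ascii_case("x-requested-with"))
}
```

The check relies on browsers refusing to attach custom headers to cross-origin requests without a CORS preflight, so a plain form post from another site cannot satisfy it.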

### Encryption at Rest

User LLM API keys are encrypted with AES-256-GCM before storage:
- 32-byte master key from `MASTER_ENCRYPTION_KEY` env var (64 hex chars)
- Random 12-byte nonce per encryption (stored alongside ciphertext)
- Key bytes are zeroized on drop (`zeroize` crate)
- Only a key prefix (first 8 chars + "...") is ever returned via the API
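
Two of the details above are easy to show without a cryptography crate: decoding the 64-character hex master key into 32 bytes, and building the masked prefix returned by the API. Both helpers are hypothetical sketches; the AES-256-GCM encryption and zeroization steps are intentionally left out.

```rust
/// Decodes a 64-character hex string into a 32-byte key.
pub fn parse_master_key(hex: &str) -> Result<[u8; 32], String> {
    if hex.len() != 64 || !hex.chars().all(|c| c.is_ascii_hexdigit()) {
        return Err("MASTER_ENCRYPTION_KEY must be exactly 64 hex characters".to_string());
    }
    let mut key = [0u8; 32];
    for (i, byte) in key.iter_mut().enumerate() {
        // Each byte is encoded as two hex digits.
        *byte = u8::from_str_radix(&hex[2 * i..2 * i + 2], 16)
            .map_err(|e| format!("invalid hex at offset {}: {e}", 2 * i))?;
    }
    Ok(key)
}

/// Builds the display form of an API key: first 8 characters plus "...".
pub fn key_prefix(api_key: &str) -> String {
    let head: String = api_key.chars().take(8).collect();
    format!("{head}...")
}
```

Validating the length and character set up front means a misconfigured key fails at startup rather than at the first encryption call.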

### SSRF Prevention

Both `scraper.rs` and `source_scraper.rs` validate URLs before fetching:
- DNS resolution check against private/loopback IP ranges
- Redirect chain validation (no redirects to private IPs)
- Only HTTP/HTTPS schemes allowed
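
The address and scheme checks might look roughly like the following, using only `std::net`. The exact set of blocked ranges in the scrapers is not documented here, so the ranges below (private, loopback, link-local, unspecified, broadcast, IPv6 unique-local) are an assumption, and the function names are hypothetical. In practice the check would be applied to every address a hostname resolves to, and again on each redirect hop.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// Returns true if fetching from this address should be refused.
pub fn is_forbidden_ip(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => is_forbidden_v4(v4),
        // IPv4-mapped IPv6 addresses (::ffff:a.b.c.d) must not bypass the v4 rules.
        IpAddr::V6(v6) => match v6.to_ipv4_mapped() {
            Some(v4) => is_forbidden_v4(v4),
            None => is_forbidden_v6(v6),
        },
    }
}

fn is_forbidden_v4(ip: Ipv4Addr) -> bool {
    ip.is_private()
        || ip.is_loopback()
        || ip.is_link_local() // includes 169.254.169.254 metadata endpoints
        || ip.is_unspecified()
        || ip.is_broadcast()
}

fn is_forbidden_v6(ip: Ipv6Addr) -> bool {
    let first = ip.segments()[0];
    ip.is_loopback()
        || ip.is_unspecified()
        || (first & 0xfe00) == 0xfc00 // unique local, fc00::/7
        || (first & 0xffc0) == 0xfe80 // link-local, fe80::/10
}

/// Only plain HTTP and HTTPS URLs are eligible for scraping.
pub fn is_allowed_scheme(url: &str) -> bool {
    let lower = url.to_ascii_lowercase();
    lower.starts_with("http://") || lower.starts_with("https://")
}
```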

### Security Headers

Applied as global middleware layers:
- `Content-Security-Policy` (self + Cloudflare Turnstile)
- `X-Content-Type-Options: nosniff`
- `X-Frame-Options: DENY`
- `Referrer-Policy: strict-origin-when-cross-origin`
- `X-XSS-Protection: 1; mode=block`
- `Strict-Transport-Security` (HTTPS only)

### Error Sanitization

The `sanitize_error_message` function strips API keys and internal details from error messages before they reach SSE clients. Internal errors log full details server-side but return generic messages to users.
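
One simple way to strip secrets is to replace every occurrence of a known secret value before the message leaves the server. The sketch below takes that approach; it is an assumption about the technique, not the actual body of `sanitize_error_message`, which may rely on pattern matching instead. `redact_secrets` is a hypothetical helper.

```rust
/// Replaces every occurrence of each known secret with a fixed marker.
pub fn redact_secrets(message: &str, secrets: &[&str]) -> String {
    let mut out = message.to_string();
    // Skip empty strings: replacing "" would insert the marker between every character.
    for secret in secrets.iter().filter(|s| !s.is_empty()) {
        out = out.replace(*secret, "[REDACTED]");
    }
    out
}
```

Upstream providers sometimes echo the submitted key in their error bodies, which is why redaction has to happen on the final message rather than only on the application's own log lines.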

### CORS

Configured to allow only the `APP_URL` origin, with credentials (cookies), limited to GET/POST/PUT/DELETE methods.

---

## 7. Concurrency Model

### Async Runtime

Tokio with full features. The Axum server runs on Tokio's multi-threaded async runtime.

### Background Tasks

Spawned at startup via `tokio::spawn`:
- **Session cleanup**: Hourly deletion of expired DB sessions
- **Job store cleanup**: Periodic removal of expired job entries (1-hour TTL)
- **Scheduler**: Minute-by-minute check for due theme schedules

### Generation Pipeline Concurrency

- **`tokio::task::JoinSet`**: Used for parallel scraping (bounded concurrency of 5 for source extraction) and parallel LLM classification calls within each batch
- **`tokio::sync::watch`**: Fan-out progress notifications to SSE clients; late subscribers immediately receive the latest state
- **`AtomicBool`**: Cooperative cancellation flag checked between pipeline stages; avoids mutex overhead
- **`DashMap` / `DashSet`**: Sharded-lock concurrent collections (low contention rather than strictly lock-free) for the job store (job entries), generating-users set, per-user rate limiter cache, and provider rate limiter state
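
Cooperative cancellation with an `AtomicBool` amounts to polling the flag between stages. The sketch below is a synchronous illustration under that assumption; `run_stages` and `Outcome` are hypothetical names, and the real pipeline stages are async.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

#[derive(Debug, PartialEq)]
pub enum Outcome {
    Completed,
    Cancelled { stages_run: usize },
}

/// Runs `total_stages` stages in order, checking the flag before each one.
/// A stage that has already started is never interrupted.
pub fn run_stages(
    cancel: &AtomicBool,
    total_stages: usize,
    mut stage: impl FnMut(usize),
) -> Outcome {
    for index in 0..total_stages {
        if cancel.load(Ordering::Relaxed) {
            return Outcome::Cancelled { stages_run: index };
        }
        stage(index);
    }
    Outcome::Completed
}
```

The trade-off is latency: a stop request takes effect only at the next check, so long stages delay cancellation, but no stage is ever torn down halfway through.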

### Task Lifecycle

```
POST /generate
  └── handler creates job in JobStore
        └── spawns outer task (panic monitor)
              ├── spawns inner task (15-min timeout)
              │     └── run_generation_inner()
              │           ├── Phase 1 (JoinSet scrape, JoinSet classify)
              │           ├── Phase 2 (JoinSet scrape, JoinSet classify)
              │           └── Save to DB
              └── on complete/error: send final ProgressEvent
                    └── delayed cleanup (5 min) then remove from JobStore
```
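
The outer "panic monitor" idea can be demonstrated with OS threads: the outer layer joins the inner work and converts a panic into an ordinary failure event. This is only an analogy for the Tokio task pair, written with `std::thread` so it runs without dependencies; `run_monitored` and `FinalEvent` are hypothetical names, and the 15-minute timeout is not modelled.

```rust
use std::thread;

#[derive(Debug, PartialEq)]
pub enum FinalEvent {
    Done(String),
    Failed(String),
}

/// Runs `job` on its own thread and always produces a final event,
/// even if the job panics.
pub fn run_monitored<F>(job: F) -> FinalEvent
where
    F: FnOnce() -> Result<String, String> + Send + 'static,
{
    match thread::spawn(job).join() {
        Ok(Ok(summary)) => FinalEvent::Done(summary),
        Ok(Err(message)) => FinalEvent::Failed(message),
        // The panic payload is dropped; clients only see a generic message.
        Err(_panic) => FinalEvent::Failed("internal error".to_string()),
    }
}
```

Without the outer layer, a panic inside the pipeline would leave SSE clients waiting on a job that never reports completion.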

### Graceful Shutdown

The server supports graceful shutdown via signal handling, allowing in-flight requests to complete.