From ad613aa001fa63f33ea0ed9c1da29569c1b94c30 Mon Sep 17 00:00:00 2001 From: oabrivard Date: Thu, 2 Apr 2026 10:01:46 +0200 Subject: [PATCH] fix: resolve all markdownlint errors (blank lines, table spacing, bare URLs) Co-Authored-By: Claude Sonnet 4.6 --- .markdownlint.json | 7 +++ CLAUDE.md | 13 ++++++ docs/architecture.md | 3 ++ docs/deployment.md | 15 ++++-- docs/dev_guidelines.md | 4 ++ docs/functional_specs.md | 17 ++++++- docs/qa_guidelines.md | 15 ++++-- docs/technical_specs.md | 98 ++++++++++++++++++++++++++++++++-------- 8 files changed, 140 insertions(+), 32 deletions(-) create mode 100644 .markdownlint.json diff --git a/.markdownlint.json b/.markdownlint.json new file mode 100644 index 0000000..667ed0b --- /dev/null +++ b/.markdownlint.json @@ -0,0 +1,7 @@ +{ + "MD013": false, + "MD024": { "siblings_only": true }, + "MD033": false, + "MD036": false, + "MD040": false +} diff --git a/CLAUDE.md b/CLAUDE.md index a46cee5..f5a0d94 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -1,15 +1,18 @@ # AI Weekly Synth ## Overview + AI Weekly Synth is a self-hosted web application that generates AI-powered weekly news syntheses. Users create themes (topics), configure categories and sources, then the app scrapes sources, classifies articles via LLM, and produces structured summaries. Supports scheduled generation with email delivery. 
## Architecture + - **Backend**: Rust (Axum) — `backend/` - **Frontend**: SolidJS + Tailwind CSS v4 — `frontend/` - **Database**: PostgreSQL (via sqlx with runtime-checked queries) - **Deployment**: Docker (`docker-compose.yml`, `restart: unless-stopped`) ## Project Structure + ``` ai_synth/ ├── backend/ Rust/Axum backend @@ -51,6 +54,7 @@ ai_synth/ ``` ## Documentation + - [`docs/requirements.md`](docs/requirements.md) — Product vision, features, user roles, non-functional requirements - [`docs/functional_specs.md`](docs/functional_specs.md) — User journeys, feature details, pipeline description - [`docs/architecture.md`](docs/architecture.md) — System design, layers, data model, security, concurrency @@ -60,6 +64,7 @@ ai_synth/ - [`docs/deployment.md`](docs/deployment.md) — Docker setup, env vars, monitoring, security ## Key Features + - **Multi-Theme**: Users create multiple themes, each with its own categories, sources, and schedule - **LLM Providers**: Google Gemini, OpenAI, Anthropic — users bring their own API keys - **Generation Pipeline**: Two-phase (personalized sources → web search fallback), windowed extraction, batched scrape+classify @@ -76,12 +81,14 @@ ai_synth/ ## Running Locally ### Docker (production) + ```bash cp .env.example .env # Fill in values docker compose up -d ``` ### Development + ```bash # Backend (requires Postgres running) cd backend && cargo run -- serve @@ -91,11 +98,13 @@ cd frontend && npm install && npm run dev ``` ### CLI + ```bash cd backend && cargo run -- create-admin admin@example.com ``` ## Testing + ```bash # Backend unit tests (no Postgres needed) cd backend && cargo test --lib @@ -114,10 +123,13 @@ cd frontend && npx tsc --noEmit ``` ## Database (30 migrations) + Tables: `users`, `sessions`, `magic_link_tokens`, `settings`, `themes`, `theme_schedules`, `sources`, `syntheses`, `article_history`, `llm_call_log`, `admin_providers`, `admin_rate_limits`, `user_api_keys`, `audit_log` ## Environment Variables + See 
`.env.example` for the complete list. Key ones: + - `DATABASE_URL` — Postgres connection string - `MASTER_ENCRYPTION_KEY` — 64 hex chars for AES-256-GCM - `RESEND_API_KEY` — for email sending @@ -125,6 +137,7 @@ See `.env.example` for the complete list. Key ones: - `APP_URL` — public URL (for CORS, magic links, cookies) ## Design Decisions + - Idiomatic Rust (learning project) — no unwrap() in production code - Users bring their own LLM API keys (encrypted at rest) - Admin curates available providers/models, users select from the list diff --git a/docs/architecture.md b/docs/architecture.md index 255ad1c..c2139c6 100644 --- a/docs/architecture.md +++ b/docs/architecture.md @@ -122,6 +122,7 @@ All mutating API endpoints require the `X-Requested-With` header (checked by `cs ### Encryption at Rest User LLM API keys are encrypted with AES-256-GCM before storage: + - 32-byte master key from `MASTER_ENCRYPTION_KEY` env var (64 hex chars) - Random 12-byte nonce per encryption (stored alongside ciphertext) - Key bytes are zeroized on drop (`zeroize` crate) @@ -130,6 +131,7 @@ User LLM API keys are encrypted with AES-256-GCM before storage: ### SSRF Prevention Both `scraper.rs` and `source_scraper.rs` validate URLs before fetching: + - DNS resolution check against private/loopback IP ranges - Redirect chain validation (no redirects to private IPs) - Only HTTP/HTTPS schemes allowed @@ -137,6 +139,7 @@ Both `scraper.rs` and `source_scraper.rs` validate URLs before fetching: ### Security Headers Applied as global middleware layers: + - `Content-Security-Policy` (self + Cloudflare Turnstile) - `X-Content-Type-Options: nosniff` - `X-Frame-Options: DENY` diff --git a/docs/deployment.md b/docs/deployment.md index 3775463..1515886 100644 --- a/docs/deployment.md +++ b/docs/deployment.md @@ -29,6 +29,7 @@ The application will be available at `http://localhost:8080` (or the port config The `docker-compose.yml` defines two services: **app** (AI Weekly Synth backend + frontend): + - 
Multi-stage Docker image: Node.js builds the frontend, Rust builds the backend, then both are combined into a minimal Debian runtime - Runs as a non-root user (`appuser`) - Depends on `db` with a health check condition (waits for Postgres to be ready) @@ -36,6 +37,7 @@ The `docker-compose.yml` defines two services: - Restart policy: `unless-stopped` **db** (PostgreSQL 17 Alpine): + - Data persisted to a named Docker volume (`postgres_data`) - Exposed on `127.0.0.1:5432` (localhost only, not accessible from external networks) - Health check: `pg_isready` every 10 seconds @@ -59,20 +61,20 @@ All environment variables are documented in `.env.example`. The `.env` file is l ### Required | Variable | Description | Example | -|----------|-------------|---------| +| ---------- | ------------- | --------- | | `DATABASE_URL` | PostgreSQL connection string. In docker-compose, the hostname is `db`. | `postgres://ai_synth:secret@db:5432/ai_synth` | | `POSTGRES_PASSWORD` | Password for the PostgreSQL user. Used by both the `db` service and in `DATABASE_URL`. | `a-strong-random-password` | | `MASTER_ENCRYPTION_KEY` | 256-bit key for AES-256-GCM encryption of user API keys at rest. Must be exactly 64 hex characters. Generate with `openssl rand -hex 32`. **Back this up securely -- losing it means all stored API keys become unreadable.** | `ab12cd34...` (64 hex chars) | | `APP_URL` | Public URL where the app is accessible (no trailing slash). Used for magic link URLs, CORS origin, and cookie domain. | `https://synth.example.com` | -| `RESEND_API_KEY` | API key for Resend (email service). Required for magic link emails and synthesis email export. Sign up at https://resend.com. | `re_xxxxx` | +| `RESEND_API_KEY` | API key for Resend (email service). Required for magic link emails and synthesis email export. Sign up at <https://resend.com>. | `re_xxxxx` | | `EMAIL_FROM` | Sender address for emails. Must be a verified domain in Resend. 
| `AI Weekly Synth ` | -| `TURNSTILE_SECRET_KEY` | Server-side secret key for Cloudflare Turnstile captcha. Sign up at https://dash.cloudflare.com/turnstile. | `0x4AAAAAAA...` | +| `TURNSTILE_SECRET_KEY` | Server-side secret key for Cloudflare Turnstile captcha. Sign up at <https://dash.cloudflare.com/turnstile>. | `0x4AAAAAAA...` | | `TURNSTILE_SITE_KEY` | Client-side site key for Cloudflare Turnstile. | `0x4BBBBBB...` | ### Optional | Variable | Description | Default | -|----------|-------------|---------| +| ---------- | ------------- | --------- | | `PORT` | Port for the backend HTTP server (inside the container). The docker-compose maps this to the host. | `8080` | | `RUST_LOG` | Logging level. Format: `level` or `level,crate=level`. | `info,ai_synth_backend=debug` | | `STATIC_DIR` | Path to the built frontend files. In Docker, this is `./static` (set by docker-compose). For local dev, use `../frontend/dist`. | `./static` (Docker) | @@ -87,6 +89,7 @@ All environment variables are documented in `.env.example`. The `.env` file is l The application uses PostgreSQL 17. The `docker-compose.yml` runs it as the `db` service with a named volume for data persistence. Key configuration: + - User: `ai_synth` (configurable via `POSTGRES_PASSWORD`) - Database: `ai_synth` - Shared memory: 128 MB (for complex queries) @@ -103,7 +106,7 @@ No manual migration step is needed. The application will not start serving reque The database contains the following tables: | Table | Purpose | -|-------|---------| +| ------- | --------- | | `users` | User accounts (email, display name, role) | | `sessions` | Active sessions (hashed tokens, expiry) | | `magic_link_tokens` | Passwordless login tokens | @@ -165,6 +168,7 @@ RUST_LOG=info,ai_synth_backend=debug ``` This provides: + - `info` level for all crates (HTTP requests, startup/shutdown, background tasks) - `debug` level for the application code (detailed pipeline progress, LLM call timing) @@ -217,6 +221,7 @@ docker compose up -d --build ``` This will: + 1. 
Rebuild the Docker image (frontend build + Rust compilation) 2. Restart the `app` container with the new image 3. Automatically run any new migrations on startup diff --git a/docs/dev_guidelines.md b/docs/dev_guidelines.md index 8a11361..d224e42 100644 --- a/docs/dev_guidelines.md +++ b/docs/dev_guidelines.md @@ -127,6 +127,7 @@ pub enum AppError { ``` Key rules: + - **Never use `unwrap()` in production code.** Use `?`, `ok_or_else`, `map_err`, or `unwrap_or_default` with appropriate logging. `unwrap()` is only acceptable in `#[cfg(test)]` blocks and `LazyLock` static initializers. - **`AppError::Internal` hides details** from the client. The full error is logged via `tracing::error!` but the response body only contains `"An internal error occurred"`. - **`From` and `From`** conversions are implemented, so you can use `?` with both types. @@ -135,6 +136,7 @@ Key rules: #### Arc Usage `Arc` is used to share data across `tokio::spawn` boundaries. Common patterns: + - `Arc` for the LLM provider (shared across classify tasks) - `Arc` for cancellation flags - `Arc>` for SSE progress channels @@ -298,6 +300,7 @@ Longer explanation if needed. Types: `feat`, `fix`, `docs`, `refactor`, `test`, `chore`. 
Examples from the repo: + - `fix: rewrite pass schema uses actual scraped item counts, not max setting` - `fix: filter empty scraped articles + restore URLs after rewrite + E2E assertions` - `docs: add spec and plan for source priority pipeline redesign` @@ -329,6 +332,7 @@ The `PUT /settings` endpoint requires the **complete** settings object, not a pa ### Pipeline Test Requirements Pipeline integration tests require: + - A running Postgres instance (via `TEST_DATABASE_URL`) - `SKIP_SSRF_CHECK=1` (to allow wiremock on localhost) - Wiremock for mocking HTTP responses from source websites diff --git a/docs/functional_specs.md b/docs/functional_specs.md index 72c4af1..e1bb3cd 100644 --- a/docs/functional_specs.md +++ b/docs/functional_specs.md @@ -66,6 +66,7 @@ ### 1.7 Export a Synthesis From the synthesis detail page: + - **Email**: enter a recipient address or click "S'envoyer a soi-meme". The synthesis is sent as a formatted email via Resend. - **Markdown**: download as a `.md` file. - **PDF**: download as a `.pdf` file. @@ -75,6 +76,7 @@ From the synthesis detail page: ### 2.1 Multi-Theme Each user can create multiple themes. A theme groups together: + - Content settings (search topic, categories, max items, max age, summary length) - Personalized sources - Generated syntheses @@ -88,6 +90,7 @@ The generate page requires selecting a theme before launching. The home page sho Categories are user-defined per theme. Users add and remove category names in the theme editor after creating a theme. The system always includes two default categories: + - `Divers`: overflow category for unmatched or full categories. - `Sans date`: category for articles without a usable publication date. @@ -100,6 +103,7 @@ Sources can be marked as preferred. Preference is stored per theme. 
During gener ### 2.4 Scheduled Generation Each theme can have an optional schedule with: + - **Enabled/disabled toggle** - **Days**: selection of days of the week (Mon-Sun) - **Time**: execution time in UTC (HH:MM) @@ -112,6 +116,7 @@ Changes to the schedule are saved immediately (auto-save). ### 2.5 Brave Search An optional alternative to LLM-powered web search in Phase 2. When enabled: + - The user provides a Brave Search API key (stored encrypted alongside LLM keys). - Phase 2 queries the Brave Search API with the theme topic, filtered by article freshness. - Results are scraped and classified/summarized by the LLM, following the same pipeline as Phase 1. @@ -127,6 +132,7 @@ Generation follows a two-phase pipeline. Phase 1 processes the user's personaliz ### 3.2 Initialization Before generation starts: + 1. Load theme settings (user-defined categories plus defaults `Divers` and `Sans date`, search topic, max items, max age, summary length) and global user settings (provider, models, batch size, rate limits, etc.). 2. Decrypt the user's LLM API key and create the provider instance. 3. Clean up old article history and LLM call logs. @@ -141,6 +147,7 @@ Skipped if the user has no sources for the theme. Sources are split into waves of `source_extraction_window` size (default 3). Sources are rotated so extraction starts after the last source used in a previous generation (rolling window). Preferred sources are placed before non-preferred sources within the rotation order. For each wave: + 1. Extract article links from all sources in the wave in parallel (bounded concurrency of 5). Link extraction uses HTML `<a>` tag parsing. 2. Deduplicate candidate URLs and filter against article history (previously seen articles are skipped). 3. Shuffle remaining candidates, with URLs from preferred sources placed first. @@ -158,11 +165,13 @@ Skipped if all user-defined categories are already filled. 
The system computes category gaps (how many articles each category still needs), then follows one of two paths: **Path A -- Brave Search** (when `use_brave_search` is enabled): + 1. Query the Brave Search API with the theme topic and freshness filter. 2. Filter results: reject homepage URLs, deduplicate against Phase 1, check article history, apply source diversity cap. 3. Scrape and classify/summarize results using the same batched pipeline as Phase 1. **Path B -- LLM Web Search** (default): + 1. Send a search prompt to the LLM with the theme, categories, and gap counts. The LLM uses web grounding to find articles and returns structured results. 2. Filter results using the same filters as Path A. 3. Scrape each result to validate it. Keep the LLM-provided title and summary (no re-classification). @@ -183,7 +192,7 @@ For the complete technical algorithm, see `technical_specs.md` Section 5. Managed on the theme management page. Each theme has its own values. | Setting | Description | Default | -|---------|-------------|---------| +| --------- | ------------- | --------- | | Name | Display label for the theme | -- | | Search topic | Subject for AI search queries | -- | | Categories | Ordered list of user-defined category names (`Divers` and `Sans date` are always included by the system) | [] | @@ -196,7 +205,7 @@ Managed on the theme management page. Each theme has its own values. Managed on the settings page. Apply across all themes. 
| Setting | Description | Default | -|---------|-------------|---------| +| --------- | ------------- | --------- | | Provider | LLM provider (Gemini, OpenAI, Anthropic) | -- | | Research model | Model for scraping/classification | Admin default | | Web search model | Model for web search | Admin default | @@ -219,6 +228,7 @@ Users can export their global settings as a JSON file and import settings from a ### 5.1 Provider Management Admins configure which LLM providers and models are available to users: + - Add providers with a unique identifier and display name. - For each provider, configure two model lists: scraping/extraction models and web search models. - Set a default model for each category. @@ -234,6 +244,7 @@ Admins set default rate limits per provider (max requests / time window in secon ### 5.3 User Management Admins can: + - View all registered users (email, name, role, registration date). - Promote a user to admin or demote an admin to user. - Admins cannot modify their own role. @@ -259,6 +270,7 @@ A Markdown export is available from the synthesis detail page. The file can be s ### 7.1 Article History Every article encountered during generation is recorded in the article history with its status: + - **used**: included in the final synthesis. - **filtered_history**: skipped because it was seen in a previous generation. - **filtered_diversity**: skipped due to per-domain cap. @@ -272,6 +284,7 @@ Users can view the article history per synthesis (provenance view) or globally. 
### 7.2 LLM Call Logs Every LLM call during generation is logged with: + - Call type (link extraction, classify/summarize, web search) - Model used - System prompt and user prompt diff --git a/docs/qa_guidelines.md b/docs/qa_guidelines.md index 19fe942..af85b6e 100644 --- a/docs/qa_guidelines.md +++ b/docs/qa_guidelines.md @@ -3,7 +3,7 @@ ## Test Inventory | Type | Count | Status | Location | -|------|-------|--------|----------| +| ------ | ------- | -------- | ---------- | | Backend unit tests | 358 | All passing | `backend/src/**/*.rs` (inline `#[cfg(test)]`) | | Backend integration tests | 183 | All passing | `backend/tests/*.rs` | | Frontend unit tests | 141 | 131 passing, 10 failing | `frontend/src/**/*.test.{ts,tsx}` | @@ -21,7 +21,7 @@ ### Backend Unit Test Breakdown | Source file | Tests | Coverage area | -|---|---|---| +| --- | --- | --- | | `services/scraper.rs` | 74 | SSRF IP checks, soft-404, redirect, HTML parsing | | `services/synthesis.rs` | 36 | Pipeline logic, schema building, category overflow | | `services/llm/anthropic.rs` | 20 | Response parsing, error handling | @@ -51,7 +51,7 @@ ### Backend Integration Test Breakdown | File | Tests | Coverage area | -|---|---|---| +| --- | --- | --- | | `api_sources_test.rs` | 36 | Sources CRUD, validation, CSV, bulk import, max limit | | `api_admin_test.rs` | 30 | Provider CRUD, rate limits, user management, audit log | | `api_keys_test.rs` | 18 | API key CRUD, encryption, ownership, test endpoint | @@ -73,7 +73,7 @@ ### E2E Test Breakdown | File | Coverage area | -|---|---| +| --- | --- | | `registration.spec.ts` | Full magic link registration flow | | `settings.spec.ts` | Settings persistence across reloads | | `settings-export.spec.ts` | Settings export/import roundtrip | @@ -107,6 +107,7 @@ Requires a running Postgres instance. 
Use the helper script: ``` The script automatically: + - Starts the test Postgres container on port 5433 (via `e2e/docker-compose.test.yml`) - Sets `TEST_DATABASE_URL` and `SKIP_SSRF_CHECK=1` - Runs `cargo test` with the specified arguments @@ -144,6 +145,7 @@ Use the helper script, which builds the Docker image, starts the full stack, see ``` The script: + 1. Builds the test Docker image (`docker compose -f docker-compose.test.yml build`) 2. Starts the full stack (app + Postgres) 3. Waits for the app health check to pass @@ -163,6 +165,7 @@ The `generation-live.spec.ts` test requires `OPENAI_TEST_API_KEY` to be set (in `backend/tests/common/mod.rs` provides the `TestApp` struct, which is the foundation for all integration tests. **What it does:** + - Creates a unique temporary Postgres database per test (named `ai_synth_test_{uuid}`) - Runs all migrations - Builds the full Axum router with test configuration (bypassed Turnstile and Resend) @@ -172,6 +175,7 @@ The `generation-live.spec.ts` test requires `OPENAI_TEST_API_KEY` to be set (in - Handles cleanup via `Drop` (fire-and-forget) or explicit `cleanup().await` **Request helpers** automatically: + - Set `Content-Type: application/json` for requests with a body - Set `X-Requested-With: XMLHttpRequest` (CSRF header) for mutating methods (POST, PUT, DELETE, PATCH) - Set the session cookie when `session_cookie` is provided @@ -347,6 +351,7 @@ The `TestApp::Drop` implementation spawns a background thread to drop the test d ### Flaky generation-live Test The `generation-live.spec.ts` test depends on a real OpenAI API call. It may fail due to: + - API rate limits - Slow responses exceeding the 30-second timeout - Changes in model behavior affecting output format @@ -374,7 +379,7 @@ As of the last audit, 10 of 141 frontend unit tests are failing. Investigate wit The following gaps must be addressed to satisfy the release gate policy. 
| Gap | Priority | Description | -|-----|----------|-------------| +| ----- | ---------- | ------------- | | Scheduled execution | Critical | `scheduler.rs` has zero tests. Autonomous process that generates syntheses and sends emails. | | Brave Search pipeline | High | Only 1 unit test. The Brave Search code path in the pipeline is untested in integration. | | Date filtering | High | No tests verify that `max_age_days` actually filters old articles. | diff --git a/docs/technical_specs.md b/docs/technical_specs.md index d25465f..d4626c8 100644 --- a/docs/technical_specs.md +++ b/docs/technical_specs.md @@ -3,7 +3,7 @@ ## 1. Backend Tech Stack | Dependency | Version | Purpose | -|---|---|---| +| --- | --- | --- | | axum | 0.8 | Web framework (macros, multipart) | | tokio | 1 | Async runtime (full features) | | tower | 0.5 | Middleware composition | @@ -43,7 +43,7 @@ ## 2. Frontend Tech Stack | Dependency | Version | Purpose | -|---|---|---| +| --- | --- | --- | | solid-js | ^1.9.0 | Reactive UI framework | | @solidjs/router | ^0.15.0 | Client-side routing | | lucide-solid | ^0.475.0 | Icon library | @@ -60,7 +60,7 @@ ### Frontend Routes | Path | Component | Auth | Description | -|---|---|---|---| +| --- | --- | --- | --- | | /login | Login | Public | Login page | | /register | Register | Public | Registration page | | /auth/verify | AuthVerify | Public | Magic link verification | @@ -82,7 +82,7 @@ ### 3.1 `users` | Column | Type | Constraints | -|---|---|---| +| --- | --- | --- | | id | UUID | PK, DEFAULT gen_random_uuid() | | email | TEXT | NOT NULL, UNIQUE | | display_name | TEXT | nullable | @@ -95,7 +95,7 @@ Indexes: `idx_users_email` on (email). 
### 3.2 `sessions` | Column | Type | Constraints | -|---|---|---| +| --- | --- | --- | | session_hash | TEXT | PK (SHA-256 of raw token) | | user_id | UUID | NOT NULL, FK users(id) CASCADE | | created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() | @@ -109,7 +109,7 @@ Indexes: `idx_sessions_user_id`, `idx_sessions_expires_at`. ### 3.3 `magic_tokens` | Column | Type | Constraints | -|---|---|---| +| --- | --- | --- | | id | UUID | PK, DEFAULT gen_random_uuid() | | email | TEXT | NOT NULL | | token_hash | TEXT | NOT NULL, UNIQUE | @@ -124,7 +124,7 @@ Indexes: `idx_magic_tokens_email`, `idx_magic_tokens_expires`. Per-user pipeline configuration. One row per user (user_id is the PK). | Column | Type | Constraints | -|---|---|---| +| --- | --- | --- | | user_id | UUID | PK, FK users(id) CASCADE | | max_articles_per_source | INTEGER | NOT NULL, DEFAULT 3 | | max_links_per_source | INTEGER | NOT NULL, DEFAULT 8 | @@ -145,7 +145,7 @@ Per-user pipeline configuration. One row per user (user_id is the PK). Per-user topic configurations with content settings. | Column | Type | Constraints | -|---|---|---| +| --- | --- | --- | | id | UUID | PK, DEFAULT gen_random_uuid() | | user_id | UUID | NOT NULL, FK users(id) CASCADE | | name | TEXT | NOT NULL | @@ -166,7 +166,7 @@ Indexes: `idx_themes_user_id`. User-curated news source URLs, always tied to a theme. | Column | Type | Constraints | -|---|---|---| +| --- | --- | --- | | id | UUID | PK, DEFAULT gen_random_uuid() | | user_id | UUID | NOT NULL, FK users(id) CASCADE | | title | VARCHAR(200) | NOT NULL, CHECK length 1-200 | @@ -182,7 +182,7 @@ Indexes: `idx_sources_user_id`, UNIQUE `idx_sources_user_id_url` on (user_id, ur Generated synthesis results with JSONB section data. 
| Column | Type | Constraints | -|---|---|---| +| --- | --- | --- | | id | UUID | PK, DEFAULT gen_random_uuid() | | user_id | UUID | NOT NULL, FK users(id) CASCADE | | week | VARCHAR(10) | NOT NULL (ISO week string) | @@ -195,6 +195,7 @@ Generated synthesis results with JSONB section data. Indexes: `idx_syntheses_user_id_created_at` on (user_id, created_at DESC). JSONB structure for `sections`: + ```json [ { @@ -211,7 +212,7 @@ JSONB structure for `sections`: Automated generation schedules, one per theme. | Column | Type | Constraints | -|---|---|---| +| --- | --- | --- | | id | UUID | PK, DEFAULT gen_random_uuid() | | theme_id | UUID | NOT NULL, UNIQUE, FK themes(id) CASCADE | | user_id | UUID | NOT NULL, FK users(id) CASCADE | @@ -230,7 +231,7 @@ Indexes: `idx_theme_schedules_enabled` (partial, WHERE enabled = true). Article URL deduplication and full provenance tracing. | Column | Type | Constraints | -|---|---|---| +| --- | --- | --- | | id | UUID | PK, DEFAULT gen_random_uuid() | | user_id | UUID | NOT NULL, FK users(id) CASCADE | | url_hash | TEXT | NOT NULL (SHA-256 of normalized URL) | @@ -257,7 +258,7 @@ Source type values: `personalized_source`, `brave_search`, `web_search`. Full LLM interaction logging for debugging and analysis. | Column | Type | Constraints | -|---|---|---| +| --- | --- | --- | | id | UUID | PK, DEFAULT gen_random_uuid() | | user_id | UUID | NOT NULL, FK users(id) CASCADE | | job_id | UUID | NOT NULL | @@ -277,7 +278,7 @@ Indexes: `idx_llm_call_log_job_id`, `idx_llm_call_log_user_id` on (user_id, crea Admin-curated catalog of LLM providers and their models. | Column | Type | Constraints | -|---|---|---| +| --- | --- | --- | | id | UUID | PK, DEFAULT gen_random_uuid() | | provider_name | VARCHAR(50) | NOT NULL, UNIQUE | | display_name | VARCHAR(100) | NOT NULL | @@ -292,6 +293,7 @@ Indexes: `idx_admin_providers_enabled` (partial, WHERE is_enabled = true). Seeded with: gemini, openai, anthropic. 
JSONB model structure: + ```json [{"model_id": "gemini-2.5-pro", "display_name": "Gemini 2.5 Pro", "is_default": true}] ``` @@ -301,7 +303,7 @@ JSONB model structure: Per-provider rate limit configuration. | Column | Type | Constraints | -|---|---|---| +| --- | --- | --- | | id | UUID | PK, DEFAULT gen_random_uuid() | | provider_name | VARCHAR(50) | NOT NULL, UNIQUE, FK admin_providers(provider_name) CASCADE | | max_requests | INTEGER | NOT NULL, DEFAULT 30 | @@ -315,7 +317,7 @@ Seeded defaults: gemini 29/60s, openai 50/60s, anthropic 40/60s. Encrypted user LLM API keys. | Column | Type | Constraints | -|---|---|---| +| --- | --- | --- | | id | UUID | PK, DEFAULT gen_random_uuid() | | user_id | UUID | NOT NULL, FK users(id) CASCADE | | provider_name | VARCHAR(50) | NOT NULL | @@ -332,7 +334,7 @@ Constraint: UNIQUE(user_id, provider_name). Valid providers: gemini, openai, ant Admin mutation audit trail. | Column | Type | Constraints | -|---|---|---| +| --- | --- | --- | | id | UUID | PK, DEFAULT gen_random_uuid() | | admin_user_id | UUID | nullable, FK users(id) SET NULL | | action | VARCHAR(100) | NOT NULL | @@ -352,43 +354,51 @@ All endpoints are prefixed with `/api/v1`. Responses are JSON. Errors follow the ### 4.1 Authentication **POST /auth/register** + - Auth: Public - Body: `{ email: string, display_name?: string, turnstile_token: string }` - Response: `{ message: string }` - Sends magic link email. Rate limited. **POST /auth/login** + - Auth: Public - Body: `{ email: string, turnstile_token: string }` - Response: `{ message: string }` - Sends magic link email. Rate limited. **GET /auth/verify?token=...&email=...** + - Auth: Public - Response: Redirect to frontend with session cookie set. **POST /auth/verify** + - Auth: Public - Body: `{ token: string, email: string }` - Response: `{ message: string, user: User }` - Sets `session` HttpOnly cookie (30-day expiry). 
**POST /auth/logout** + - Auth: Authenticated - Response: `{ message: string }` - Clears session cookie and deletes DB session. **GET /auth/me** + - Auth: Authenticated - Response: `{ id, email, display_name, role, created_at }` ### 4.2 Settings **GET /settings** + - Auth: Authenticated - Response: `UserSettings` (creates defaults if not exists) **PUT /settings** + - Auth: Authenticated - Body: `UpdateSettingsRequest` (all fields required) - Validation: max_articles_per_source 1-10, max_links_per_source 1-30, batch_size 1-20, source_extraction_window 1-10, article_history_days 0-365, search_agent_behavior max 2000 chars, ai_provider/ai_model/ai_model_websearch max 100 chars. @@ -397,10 +407,12 @@ All endpoints are prefixed with `/api/v1`. Responses are JSON. Errors follow the ### 4.3 Themes **GET /themes** + - Auth: Authenticated - Response: `ThemeResponse[]` **POST /themes** + - Auth: Authenticated - Body: `{ name, theme, categories: string[], max_items_per_category?, max_age_days?, summary_length? }` - Validation: name non-empty max 200 chars, categories 0-20 non-empty entries, max_items 1-50, max_age 1-365, summary_length 1-3. @@ -408,64 +420,76 @@ All endpoints are prefixed with `/api/v1`. Responses are JSON. Errors follow the - Response: `ThemeResponse` **PUT /themes/{id}** + - Auth: Authenticated (owner only) - Body: `UpdateThemeRequest` (all fields optional) - Response: `ThemeResponse` **DELETE /themes/{id}** + - Auth: Authenticated (owner only) - Response: 204 No Content ### 4.4 Schedules **GET /themes/{id}/schedule** + - Auth: Authenticated (theme owner) - Response: `ScheduleResponse | null` with HTTP 200 **PUT /themes/{id}/schedule** + - Auth: Authenticated (theme owner) - Body: `{ enabled, days: string[], time_utc: "HH:MM", emails: string[] }` - Validation: days from mon-sun, time HH:MM format, max 3 emails. 
- Response: `ScheduleResponse` **DELETE /themes/{id}/schedule** + - Auth: Authenticated (theme owner) - Response: 204 No Content ### 4.5 Sources **GET /sources?theme_id=...** + - Auth: Authenticated - Query: `theme_id` is required - Response: `SourceResponse[]` **POST /sources** + - Auth: Authenticated - Body: `{ title, url, theme_id }` - Validation: title non-empty max 200, URL http(s) max 1000 chars. - Response: `SourceResponse` **PUT /sources/preferred** + - Auth: Authenticated - Body: `{ theme_id: UUID, source_ids: UUID[] }` - Note: preferred state is scoped per theme. - Response: `{ updated: number }` **DELETE /sources/{id}** + - Auth: Authenticated (owner only) - Response: 204 No Content **POST /sources/bulk** + - Auth: Authenticated - Body: `{ sources: CreateSourceRequest[], theme_id: UUID }` - Response: `{ imported, skipped, errors }` **POST /sources/import-csv** + - Auth: Authenticated - Body: Multipart file upload (CSV: title,url) + required `theme_id` - Response: `{ imported, skipped, errors }` **GET /sources/export-csv** + - Auth: Authenticated - Query: `theme_id` is required - Scope: exports sources for the selected theme only @@ -474,17 +498,20 @@ All endpoints are prefixed with `/api/v1`. Responses are JSON. Errors follow the ### 4.6 Generation **POST /syntheses/generate** + - Auth: Authenticated - Body: `{ theme_id: UUID }` - Response: `{ job_id: UUID }` - Creates job in JobStore, spawns background generation task. Returns 409 if user already has active job. **GET /syntheses/generate/{job_id}/progress** + - Auth: Authenticated (job owner) - Response: SSE stream of `ProgressEvent` - Events: `progress` (step, message, percent), `complete` (synthesis_id), `error` (message). **POST /syntheses/generate/{job_id}/stop** + - Auth: Authenticated (job owner) - Response: `{ message: string }` - Sets cooperative cancellation flag. @@ -492,57 +519,69 @@ All endpoints are prefixed with `/api/v1`. Responses are JSON. 
Errors follow the ### 4.7 Syntheses **GET /syntheses** + - Auth: Authenticated - Response: `SynthesisListItem[]` (with section summaries, theme info) **GET /syntheses/{id}** + - Auth: Authenticated (owner only) - Response: `SynthesisResponse` (full sections data) **DELETE /syntheses/{id}** + - Auth: Authenticated (owner only) - Response: 204 No Content **POST /syntheses/{id}/send-email** + - Auth: Authenticated - Body: `{ email: string }` - Response: `{ message: string }` **GET /syntheses/{id}/export/markdown** + - Auth: Authenticated - Response: Markdown file download **GET /syntheses/{id}/export/pdf** + - Auth: Authenticated - Response: PDF file download ### 4.8 Article History & Provenance **GET /article-history?limit=&offset=&job_id=&status=** + - Auth: Authenticated - Response: `{ items: ArticleHistoryEntry[], total: number }` **DELETE /article-history** + - Auth: Authenticated - Response: `{ deleted: number }` **GET /syntheses/{id}/provenance** + - Auth: Authenticated - Response: `ArticleHistoryEntry[]` (articles with status "used" for this synthesis's job_id) ### 4.9 LLM Call Logs **GET /llm-logs/{job_id}** + - Auth: Authenticated - Response: `LlmCallLogEntry[]` ### 4.10 User API Keys **GET /user/api-keys** + - Auth: Authenticated - Response: `ApiKeyResponse[]` (id, provider_name, key_prefix, timestamps; never the full key) **POST /user/api-keys** + - Auth: Authenticated - Body: `{ provider_name, api_key }` - Validation: provider in (gemini, openai, anthropic, brave_search), key 8-500 chars. @@ -550,15 +589,18 @@ All endpoints are prefixed with `/api/v1`. Responses are JSON. Errors follow the - Encrypts key with AES-256-GCM before storage; upserts (one key per user per provider). **DELETE /user/api-keys/{provider}** + - Auth: Authenticated - Response: 204 No Content **POST /user/api-keys/{provider}/test** + - Auth: Authenticated - Response: `{ success: boolean, message: string }` - Decrypts key, calls provider test endpoint. 
 **POST /user/api-keys/export**
+
 - Auth: Authenticated
 - Response: `{ keys: [{ provider_name, api_key }] }`
 - Decrypts and returns all keys (used for backup/migration).
@@ -566,6 +608,7 @@ All endpoints are prefixed with `/api/v1`. Responses are JSON. Errors follow the
 ### 4.11 Public Configuration
 
 **GET /config/providers**
+
 - Auth: Authenticated
 - Response: `ProviderConfigResponse[]` (enabled providers with model lists for scraping and websearch)
 
@@ -574,36 +617,45 @@ All endpoints are prefixed with `/api/v1`. Responses are JSON. Errors follow the
 All admin endpoints require `AdminUser` extractor (role = admin).
 
 **GET /admin/providers**
+
 - Response: `AdminProviderResponse[]`
 
 **POST /admin/providers**
+
 - Body: `CreateProviderRequest`
 - Validation: provider_name in (gemini, openai, anthropic), at least one model per list, at most one default per list.
 - Response: `AdminProviderResponse`
 
 **PUT /admin/providers/{id}**
+
 - Body: `UpdateProviderRequest` (all fields optional)
 - Response: `AdminProviderResponse`
 
 **DELETE /admin/providers/{id}**
+
 - Response: 204 No Content
 
 **GET /admin/rate-limits**
+
 - Response: `RateLimitResponse[]`
 
 **PUT /admin/rate-limits/{provider_name}**
+
 - Body: `{ max_requests: 1-1000, time_window_seconds: 1-3600 }`
 - Response: `RateLimitResponse`
 - Hot-reloads the in-memory provider rate limiter.
 
 **GET /admin/users**
+
 - Response: `AdminUserResponse[]`
 
 **PUT /admin/users/{id}/role**
+
 - Body: `{ role: "user" | "admin" }`
 - Response: `{ message: string }`
 
 **GET /health**
+
 - Auth: Public
 - Response: `{ status: "ok" }`
 
@@ -619,6 +671,7 @@ All admin endpoints require `AdminUser` extractor (role = admin).
 ### Generation Lifecycle
 
 `POST /api/v1/syntheses/generate` creates a job in the `JobStore`, then spawns two nested tasks:
+
 - Inner task: wraps `run_generation` in a **15-minute `tokio::time::timeout`**. If the timeout fires, sends an `Error` progress event and releases the user lock.
 - Outer task: monitors the inner task's `JoinHandle` for panics. If the inner task panics, sends an `Error` progress event and releases the user lock.
 
@@ -660,11 +713,13 @@ Processing in batches of `settings.batch_size` (minimum 1). For each batch:
 **Batch assembly**: Pull up to `batch_size` candidates, skipping any where `source_counts[domain] >= max_articles_per_source` (traced as `filtered_diversity`).
 
 **Phase A — Scrape batch in parallel** (`JoinSet`):
+
 - SSRF check (no private IPs), 15s timeout, 5MB body limit.
 - HTML parsing for title (`<title>`, `og:title`), date (meta tags, JSON-LD, `<time>`), body (strip scripts/nav), soft-404 detection.
 - If article body is empty, is a soft-404, or is too old: trace as `filtered_empty` / `filtered_too_old` and skip.
 
 **Phase B — Classify/summarize batch in parallel** (`JoinSet`):
+
 - Check rate limit before classifying (waits up to 60s, then errors).
 - Send article (title + body snippet based on `summary_length`: 500/2000/4000 chars) + categories + "Divers" to LLM.
 - LLM returns `{title, summary, category, date, is_article}`.
@@ -746,7 +801,7 @@ All calls use structured JSON output (response_schema defines the expected shape
 ### Implementations
 
 | Provider | Module | API Endpoint | Auth Method |
-|---|---|---|---|
+| --- | --- | --- | --- |
 | Google Gemini | `llm/gemini.rs` | `generativelanguage.googleapis.com` | Query param `?key=` |
 | OpenAI | `llm/openai.rs` | `api.openai.com/v1/chat/completions` | Bearer token |
 | Anthropic | `llm/anthropic.rs` | `api.anthropic.com/v1/messages` | `x-api-key` header |
@@ -759,6 +814,7 @@ All calls use structured JSON output (response_schema defines the expected shape
 ### Response Schema
 
 `llm/schema.rs` builds JSON Schema definitions for:
+
 - Classification/summarization: `{title, summary, category, is_article}`
 - Web search: `{category_0: [{title, url, summary}], ...}` with per-category arrays
 - Source link extraction: handled via heuristic HTML parsing (no LLM schema).
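Reviewer note: the batch assembly rule quoted in the pipeline hunk above (pull up to `batch_size` candidates, skip any whose domain has reached `max_articles_per_source`) can be illustrated with a short sketch. The `Candidate` struct, the `assemble_batch` function, and the use of a `VecDeque` as the candidate queue are assumptions made for illustration. The sketch also assumes the caller updates `source_counts` after articles are accepted, which the spec text shown here does not specify.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical candidate shape for this sketch; the real pipeline likely carries more fields.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    url: String,
    domain: String,
}

/// Pulls up to `batch_size` candidates from the front of `queue`, skipping any whose
/// domain already has `max_articles_per_source` accepted articles in `source_counts`.
/// Skipped candidates are returned separately so the caller can trace them
/// (the spec names this trace status `filtered_diversity`).
fn assemble_batch(
    queue: &mut VecDeque<Candidate>,
    source_counts: &HashMap<String, usize>,
    batch_size: usize,
    max_articles_per_source: usize,
) -> (Vec<Candidate>, Vec<Candidate>) {
    // The spec states a minimum batch size of 1.
    let batch_size = batch_size.max(1);
    let mut batch = Vec::new();
    let mut filtered = Vec::new();
    while batch.len() < batch_size {
        match queue.pop_front() {
            Some(candidate) => {
                let used = source_counts.get(&candidate.domain).copied().unwrap_or(0);
                if used >= max_articles_per_source {
                    filtered.push(candidate);
                } else {
                    batch.push(candidate);
                }
            }
            // No candidates left: return a short (possibly empty) batch.
            None => break,
        }
    }
    (batch, filtered)
}
```

Returning the filtered candidates instead of dropping them keeps the tracing decision with the caller, which is one plausible way to feed the article history described elsewhere in the spec.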
@@ -766,6 +822,7 @@ All calls use structured JSON output (response_schema defines the expected shape
 ### Error Mapping
 
 `map_provider_http_error()` translates HTTP status codes to `AppError` variants:
+
 - 400 -> BadRequest
 - 401/403 -> BadRequest (invalid key)
 - 404 -> BadRequest (model not found)
@@ -804,7 +861,7 @@ Runs every minute via `tokio::spawn` with a 60-second interval. For each tick:
 ### Environment Variables
 
 | Variable | Required | Default | Description |
-|---|---|---|---|
+| --- | --- | --- | --- |
 | DATABASE_URL | Yes | - | PostgreSQL connection string |
 | MASTER_ENCRYPTION_KEY | Yes | - | 64 hex chars (32 bytes) for AES-256-GCM |
 | APP_URL | Yes | - | Public URL (CORS, magic links, cookies). No trailing slash. |
@@ -820,6 +877,7 @@ Runs every minute via `tokio::spawn` with a 60-second interval. For each tick:
 ### Startup Validation
 
 `AppConfig::validate()` checks at startup:
+
 - `MASTER_ENCRYPTION_KEY` is exactly 64 hex characters
 - `APP_URL` starts with http:// or https:// and has no trailing slash
 
@@ -830,7 +888,7 @@ The application refuses to start with invalid configuration.
 Default values applied when a user has no saved settings:
 
 | Setting | Default | Range |
-|---|---|---|
+| --- | --- | --- |
 | max_articles_per_source | 3 | 1-10 |
 | max_links_per_source | 8 | 1-30 |
 | use_brave_search | false | boolean |