From 58f42d0a87221101e31ffcd0a8fc3577b00723ee Mon Sep 17 00:00:00 2001
From: oabrivard
Date: Sat, 28 Mar 2026 19:25:14 +0100
Subject: [PATCH] =?UTF-8?q?docs:=20remove=20redundancy=20across=20document?=
 =?UTF-8?q?ation=20=E2=80=94=20cross-references=20instead=20of=20duplicati?=
 =?UTF-8?q?on?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Trim architecture.md significantly (section 1 overview, technology
stack, deployment topology, module inventory lists, LLM trait block,
pipeline details, data model table, full API tables, background task
list). Replace section 5 API tables with a one-liner. Requirements.md
sections 3.1/3.5/3.6/3.7/3.8 and 4.2 condensed with cross-references.
deployment.md security feature list replaced by cross-reference to
architecture.md Section 6. functional_specs.md Section 3 gains a
cross-reference to technical_specs.md Section 5.

Co-Authored-By: Claude Sonnet 4.6
---
 docs/architecture.md     | 226 ++++---------------------------------------
 docs/deployment.md       |  13 +--
 docs/dev_guidelines.md   |   1 +
 docs/functional_specs.md |  30 +++---
 docs/qa_guidelines.md    |  14 ++-
 docs/requirements.md     |  53 +++------
 docs/technical_specs.md  |  33 +++---
 7 files changed, 85 insertions(+), 285 deletions(-)

diff --git a/docs/architecture.md b/docs/architecture.md
index ae2a428..255ad1c 100644
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -2,30 +2,9 @@

 ## 1. System Overview

-AI Weekly Synth is a self-hosted web application that generates AI-powered weekly news syntheses. Users configure topics (themes), categories, and an LLM provider; the system then searches the web, scrapes and validates sources, classifies articles, and produces structured summaries.
+AI Weekly Synth is a self-hosted Rust/Axum backend with a SolidJS frontend, backed by PostgreSQL, deployed as a Docker Compose stack. It generates AI-powered weekly news syntheses organized by user-configured themes and categories.

-### Technology Stack
-
-| Layer | Technology |
-|---|---|
-| Backend | Rust (Axum 0.8) |
-| Frontend | SolidJS 1.9 + Tailwind CSS v4 |
-| Database | PostgreSQL 17 (via sqlx with compile-time query checking) |
-| Deployment | Docker Compose (app + Postgres) |
-
-### Deployment Topology
-
-```
-docker-compose.yml
- ├── app (ai-synth) port 8080
- │ ├── Axum HTTP server
- │ ├── Static file serving (SPA fallback)
- │ └── Background tasks (scheduler, session cleanup, job TTL)
- └── db (postgres:17-alpine) port 5432 (localhost only)
- └── postgres_data volume
-```
-
-The app container builds from a multi-stage Dockerfile, serves the SolidJS frontend as static files, and connects to Postgres over the `internal` bridge network.
+See `requirements.md` for product vision and features. See `technical_specs.md` for the full technology stack. See `deployment.md` for the Docker topology and operational details.

 ---

@@ -36,42 +15,16 @@ The backend follows a three-layer architecture with shared model types:
 ```
 handlers/ (HTTP layer)
  │
- ├── extracts request data (Axum extractors, JSON, path params)
- ├── validates input
- ├── calls services/ or db/ directly
- └── formats HTTP responses
- │
 services/ (Business logic)
  │
- ├── synthesis pipeline orchestration
- ├── LLM provider abstraction + factory
- ├── scraping (articles, source pages)
- ├── encryption, email, CSV, PDF export
- ├── rate limiting, job store, scheduler
- └── Brave Search client
- │
 db/ (Data access)
  │
- ├── pure SQL queries via sqlx
- ├── typed result mapping (FromRow)
- └── no business logic
- │
 models/ (Shared types -- used by all layers)
- │
- ├── domain structs (User, Theme, Source, Synthesis, etc.)
- ├── request/response DTOs
- └── validation logic
 ```

-### Module Inventory
-
-**Handlers** (`handlers/`): `admin`, `api_keys`, `article_history`, `auth`, `config`, `generation`, `health`, `llm_logs`, `schedules`, `settings`, `sources`, `syntheses`, `themes`
-
-**Services** (`services/`): `auth`, `brave_search`, `csv`, `email`, `encryption`, `export`, `job_store`, `llm` (with `gemini`, `openai`, `anthropic`, `mock`, `factory`, `schema`), `prompts`, `rate_limiter`, `scheduler`, `scraper`, `source_scraper`, `synthesis`, `turnstile`
+Handlers extract and validate request data, delegate to services or db, and format responses. Services contain all business logic. The db layer executes pure SQL via sqlx with typed result mapping and no business logic. Models define domain structs, request/response DTOs, and validation logic.

-**DB** (`db/`): `api_keys`, `article_history`, `audit`, `llm_call_log`, `magic_links`, `providers`, `rate_limits`, `schedules`, `sessions`, `settings`, `sources`, `syntheses`, `themes`, `users`
-
-**Models** (`models/`): `api_key`, `audit`, `magic_link`, `provider`, `rate_limit`, `schedule`, `session`, `settings`, `source`, `synthesis`, `theme`, `user`
+See `dev_guidelines.md` Section 2 for complete project structure.

 ---

@@ -79,35 +32,15 @@ models/ (Shared types -- used by all layers)

 ### 3.1 LLM Provider Abstraction

-The `LlmProvider` trait defines a unified interface for all LLM backends:
-
-```rust
-#[async_trait]
-pub trait LlmProvider: Send + Sync {
-    fn provider_id(&self) -> &str;
-    async fn call_llm(&self, model: &str, system_prompt: &str,
-        user_prompt: &str, response_schema: &Value)
-        -> Result;
-}
-```
-
-Implementations: `GeminiProvider`, `OpenAiProvider`, `AnthropicProvider`, `MockLlmProvider`.
+The `LlmProvider` trait defines a unified interface for all LLM backends, with implementations for Gemini, OpenAI, Anthropic, and a mock provider for testing. A factory creates provider instances by name from the admin-curated provider list.

-The factory (`llm/factory.rs`) creates provider instances by name. The mock provider enables end-to-end pipeline testing without real API calls.
+See `technical_specs.md` Section 6 for provider interface details and supported models.

 ### 3.2 Synthesis Pipeline

-The pipeline is the core business logic, orchestrated in `services/synthesis.rs`. It runs as a background tokio task with a 15-minute timeout.
-
-**Three phases:**
-
-1. **Phase 1 -- Personalized Sources**: Extract article links from user-curated source pages (windowed, rolling), scrape articles, classify and summarize each via LLM. Batched processing with configurable `batch_size`.
+The pipeline is orchestrated in `services/synthesis.rs` and runs as a background tokio task with a 15-minute timeout. Phase 1 processes the user's personalized sources using a rolling windowed extraction with batched parallel scraping and LLM classification. Phase 2 fills remaining category gaps via Brave Search or LLM web search. The finalization step assembles sections, persists the synthesis, and records article history. Progress is reported via `tokio::sync::watch` channels consumed by SSE endpoints.

-2. **Phase 2 -- Web Search Fallback**: For under-filled categories, either call the Brave Search API or use the LLM's web search capability to find additional articles. Scrape and validate results.
-
-3. **Save**: Assemble sections by category, sanitize JSON, persist to database, record article history traces.
-
-Progress is reported via `tokio::sync::watch` channels consumed by SSE endpoints.
+See `technical_specs.md` Section 5 for the full algorithm.

 ### 3.3 Job Store

@@ -121,20 +54,16 @@ Progress is reported via `tokio::sync::watch` channels consumed by SSE endpoints

 ### 3.4 Scheduler

-`services/scheduler.rs` runs as a background task, checking every minute for due `theme_schedules`. When a schedule fires:
+`services/scheduler.rs` runs as a background task checking every minute for due `theme_schedules`. When a schedule fires it runs the generation pipeline directly, emails results to configured recipients (up to 3), and marks the schedule as run to prevent double-execution on the same day.

-1. Query `find_due_schedules` matching current day code + time
-2. Skip if user already has a manual generation in progress
-3. Run `synthesis::run_generation_inner` directly
-4. Send email to configured recipients (up to 3)
-5. Mark schedule as run
+See `deployment.md` for operational details.

 ### 3.5 Scraper

 Two scraping services:

 - **`scraper.rs`**: Article page scraper with SSRF prevention, HTML parsing, title/date/body extraction, soft-404 detection, 15s timeout, 5MB body limit.
-- **`source_scraper.rs`**: Source index page scraper that extracts article links from user-configured source URLs (HTML `` parsing with filters, or LLM-assisted extraction).
+- **`source_scraper.rs`**: Source index page scraper that extracts article links from user-configured source URLs (HTML `` parsing with filters).

 ### 3.6 Rate Limiters

@@ -166,130 +95,13 @@ admin_providers
  └── admin_rate_limits (provider_name FK, CASCADE)
 ```

-### Table Summary
-
-| Table | Purpose | Key Columns |
-|---|---|---|
-| `users` | User accounts | id, email, display_name, role (user/admin), created_at |
-| `sessions` | Login sessions | session_hash (PK), user_id, expires_at, last_active_at, ip_address |
-| `magic_tokens` | Passwordless auth tokens | id, email, token_hash, expires_at, used |
-| `settings` | Per-user pipeline config | user_id (PK), ai_provider, ai_model, ai_model_websearch, batch_size, max_articles_per_source, max_links_per_source, use_brave_search, source_extraction_window, article_history_days, search_agent_behavior, rate_limit_max_requests, rate_limit_time_window_seconds |
-| `themes` | Per-user topic configurations | id, user_id, name, theme, categories (JSONB), max_items_per_category, max_age_days, summary_length |
-| `sources` | User-curated news source URLs | id, user_id, title, url, theme_id, is_preferred |
-| `syntheses` | Generated synthesis results | id, user_id, week, sections (JSONB), status, job_id, theme_id |
-| `theme_schedules` | Automated generation schedules | id, theme_id (UNIQUE), user_id, enabled, days (JSONB), time_utc, emails (JSONB), last_run_at |
-| `article_history` | Article URL dedup + provenance trace | id, user_id, url, url_hash, title, source_type, source_url, category, synthesis_id, status, scraped_ok, job_id, published_date |
-| `llm_call_log` | Full LLM interaction log | id, user_id, job_id, call_type, model, system_prompt, user_prompt, response_body, duration_ms, article_url |
-| `admin_providers` | Admin-curated LLM provider catalog | id, provider_name (UNIQUE), display_name, models_scraping (JSONB), models_websearch (JSONB), is_enabled |
-| `admin_rate_limits` | Per-provider rate limit config | id, provider_name (UNIQUE, FK), max_requests, time_window_seconds |
-| `user_api_keys` | Encrypted user LLM API keys | id, user_id, provider_name, encrypted_key (BYTEA), nonce (BYTEA), key_prefix; UNIQUE(user_id, provider_name) |
-| `audit_log` | Admin mutation audit trail | id, admin_user_id, action, target_type, target_id, details (JSONB) |
+See `technical_specs.md` Section 3 for complete column definitions.

 ---

 ## 5. API Overview

-All API routes are prefixed with `/api/v1`. CSRF protection (`X-Requested-With` header) is applied to all mutating endpoints.
-
-### Authentication
-
-| Method | Path | Auth | Description |
-|---|---|---|---|
-| POST | /auth/register | Public | Create account + send magic link |
-| POST | /auth/login | Public | Request magic link |
-| GET | /auth/verify | Public | Verify token (email click redirect) |
-| POST | /auth/verify | Public | Verify token (frontend API call) |
-| POST | /auth/logout | Authenticated | Destroy session |
-| GET | /auth/me | Authenticated | Current user info |
-
-### Settings
-
-| Method | Path | Auth | Description |
-|---|---|---|---|
-| GET | /settings | Authenticated | Get user settings |
-| PUT | /settings | Authenticated | Update user settings |
-
-### Themes
-
-| Method | Path | Auth | Description |
-|---|---|---|---|
-| GET | /themes | Authenticated | List user themes |
-| POST | /themes | Authenticated | Create theme |
-| PUT | /themes/{id} | Authenticated | Update theme |
-| DELETE | /themes/{id} | Authenticated | Delete theme |
-
-### Schedules
-
-| Method | Path | Auth | Description |
-|---|---|---|---|
-| GET | /themes/{id}/schedule | Authenticated | Get theme schedule |
-| PUT | /themes/{id}/schedule | Authenticated | Create or update schedule |
-| DELETE | /themes/{id}/schedule | Authenticated | Delete schedule |
-
-### Sources
-
-| Method | Path | Auth | Description |
-|---|---|---|---|
-| GET | /sources | Authenticated | List sources |
-| POST | /sources | Authenticated | Create source |
-| PUT | /sources/preferred | Authenticated | Update preferred sources |
-| DELETE | /sources/{id} | Authenticated | Delete source |
-| POST | /sources/bulk | Authenticated | Bulk import (JSON) |
-| POST | /sources/import-csv | Authenticated | Import from CSV |
-| GET | /sources/export-csv | Authenticated | Export as CSV |
-
-### Syntheses & Generation
-
-| Method | Path | Auth | Description |
-|---|---|---|---|
-| GET | /syntheses | Authenticated | List syntheses |
-| GET | /syntheses/{id} | Authenticated | Get full synthesis |
-| DELETE | /syntheses/{id} | Authenticated | Delete synthesis |
-| POST | /syntheses/generate | Authenticated | Trigger generation |
-| GET | /syntheses/generate/{job_id}/progress | Authenticated | SSE progress stream |
-| POST | /syntheses/generate/{job_id}/stop | Authenticated | Cancel generation |
-| POST | /syntheses/{id}/send-email | Authenticated | Email synthesis |
-| GET | /syntheses/{id}/export/markdown | Authenticated | Markdown download |
-| GET | /syntheses/{id}/export/pdf | Authenticated | PDF download |
-
-### Article History & LLM Logs
-
-| Method | Path | Auth | Description |
-|---|---|---|---|
-| GET | /article-history | Authenticated | List article history |
-| DELETE | /article-history | Authenticated | Clear article history |
-| GET | /syntheses/{id}/provenance | Authenticated | Get synthesis provenance |
-| GET | /llm-logs/{job_id} | Authenticated | Get LLM call logs for job |
-
-### User API Keys
-
-| Method | Path | Auth | Description |
-|---|---|---|---|
-| GET | /user/api-keys | Authenticated | List keys (prefix only) |
-| POST | /user/api-keys | Authenticated | Store encrypted key |
-| DELETE | /user/api-keys/{provider} | Authenticated | Delete key |
-| POST | /user/api-keys/{provider}/test | Authenticated | Test key validity |
-| POST | /user/api-keys/export | Authenticated | Export keys |
-
-### Configuration & Admin
-
-| Method | Path | Auth | Description |
-|---|---|---|---|
-| GET | /config/providers | Authenticated | Available providers/models |
-| GET | /admin/providers | Admin | List all providers |
-| POST | /admin/providers | Admin | Create provider |
-| PUT | /admin/providers/{id} | Admin | Update provider |
-| DELETE | /admin/providers/{id} | Admin | Delete provider |
-| GET | /admin/rate-limits | Admin | List rate limits |
-| PUT | /admin/rate-limits/{provider_name} | Admin | Update rate limit |
-| GET | /admin/users | Admin | List users |
-| PUT | /admin/users/{id}/role | Admin | Change user role |
-
-### Infrastructure
-
-| Method | Path | Auth | Description |
-|---|---|---|---|
-| GET | /health | Public | Health check |
+See `technical_specs.md` Section 4 for complete API endpoint specifications.

 ---

@@ -350,10 +162,7 @@ Tokio with full features. The Axum server runs as a multi-threaded async runtime

 ### Background Tasks

-Spawned at startup via `tokio::spawn`:
-- **Session cleanup**: Hourly deletion of expired DB sessions
-- **Job store cleanup**: Periodic removal of expired job entries (1-hour TTL)
-- **Scheduler**: Minute-by-minute check for due theme schedules
+Three tasks are spawned at startup: hourly session cleanup, periodic job store TTL cleanup, and the minute-by-minute theme schedule checker. See `deployment.md` Section 2.

 ### Generation Pipeline Concurrency

@@ -380,3 +189,10 @@ POST /generate
 ### Graceful Shutdown

 The server supports graceful shutdown via signal handling, allowing in-flight requests to complete.
+
+---
+
+## 8. Quality Gates
+
+- Release candidates must include deterministic CI coverage for critical autonomous flows, especially scheduler execution and SSE progress behavior.
+- External-provider tests (for example live LLM E2E checks) are supplemental and non-blocking; they do not replace deterministic CI coverage.
diff --git a/docs/deployment.md b/docs/deployment.md
index 3e33b05..3775463 100644
--- a/docs/deployment.md
+++ b/docs/deployment.md
@@ -243,15 +243,4 @@ Before deploying to production, verify:

 ### Security Features (Built-in)

-The application includes the following security measures that require no additional configuration:
-
-- **AES-256-GCM encryption** for user LLM API keys at rest (per-key random nonces)
-- **SSRF prevention** in the web scraper (DNS resolution checks, private IP blocking, redirect validation)
-- **CSRF protection** via `X-Requested-With` header on all mutating API endpoints
-- **Session cookies**: `HttpOnly`, `SameSite=Lax`, `Secure` (when HTTPS)
-- **Security headers**: CSP, X-Frame-Options (DENY), X-Content-Type-Options (nosniff), Referrer-Policy, HSTS (when HTTPS)
-- **Anti-enumeration**: Same response for existent/non-existent emails in auth flows
-- **Error sanitization**: Internal errors and API key patterns are stripped from client-facing error messages
-- **Rate limiting**: Configurable per-provider rate limits for LLM API calls
-- **Non-root container**: The Docker image runs as `appuser`
-- **Graceful shutdown**: SIGTERM/Ctrl+C triggers clean shutdown with database pool closure
+See `architecture.md` Section 6 for detailed security architecture.
diff --git a/docs/dev_guidelines.md b/docs/dev_guidelines.md
index 7ff0934..8a11361 100644
--- a/docs/dev_guidelines.md
+++ b/docs/dev_guidelines.md
@@ -179,6 +179,7 @@ async fn admin_handler(admin: AdminUser, State(state): State) -> Resul
 #### Component Patterns

 - Use the `Button` component (`components/ui/Button.tsx`) with `variant`/`loading`/`icon` props instead of raw `