AI Weekly Synth is a self-hosted web application that generates AI-powered weekly news syntheses. Users create themes (topics) and configure categories and sources; the app then scrapes the sources, classifies articles via an LLM, and produces structured summaries. It also supports scheduled generation with email delivery.
@ -29,6 +29,7 @@ The application will be available at `http://localhost:8080` (or the port config
The `docker-compose.yml` defines two services:
**app** (AI Weekly Synth backend + frontend):
- Multi-stage Docker image: Node.js builds the frontend, Rust builds the backend, then both are combined into a minimal Debian runtime
- Runs as a non-root user (`appuser`)
- Depends on `db` with a health check condition (waits for Postgres to be ready)
@ -36,6 +37,7 @@ The `docker-compose.yml` defines two services:
- Restart policy: `unless-stopped`
**db** (PostgreSQL 17 Alpine):
- Data persisted to a named Docker volume (`postgres_data`)
- Exposed on `127.0.0.1:5432` (localhost only, not accessible from external networks)
- Health check: `pg_isready` every 10 seconds
@ -59,20 +61,20 @@ All environment variables are documented in `.env.example`. The `.env` file is l
### Required
| Variable | Description | Example |
|----------|-------------|---------|
| `DATABASE_URL` | PostgreSQL connection string. In docker-compose, the hostname is `db`. | `postgres://ai_synth:secret@db:5432/ai_synth` |
| `POSTGRES_PASSWORD` | Password for the PostgreSQL user. Used by both the `db` service and in `DATABASE_URL`. | `a-strong-random-password` |
| `MASTER_ENCRYPTION_KEY` | 256-bit key for AES-256-GCM encryption of user API keys at rest. Must be exactly 64 hex characters. Generate with `openssl rand -hex 32`. **Back this up securely -- losing it means all stored API keys become unreadable.** | `ab12cd34...` (64 hex chars) |
| `APP_URL` | Public URL where the app is accessible (no trailing slash). Used for magic link URLs, CORS origin, and cookie domain. | `https://synth.example.com` |
| `RESEND_API_KEY` | API key for Resend (email service). Required for magic link emails and synthesis email export. Sign up at <https://resend.com>. | `re_xxxxx` |
| `EMAIL_FROM` | Sender address for emails. Must be a verified domain in Resend. | `AI Weekly Synth <noreply@synth.example.com>` |
| `TURNSTILE_SECRET_KEY` | Server-side secret key for Cloudflare Turnstile captcha. Sign up at <https://dash.cloudflare.com/turnstile>. | `0x4AAAAAAA...` |
| `TURNSTILE_SITE_KEY` | Client-side site key for Cloudflare Turnstile. | `0x4BBBBBB...` |
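
The 64-hex-character requirement on `MASTER_ENCRYPTION_KEY` can be checked before the key is ever used. The sketch below is illustrative only: `parse_master_key` is a hypothetical helper, not the application's actual validation code, and it relies on nothing beyond the Rust standard library.

```rust
/// Decode a 64-character hex string into a 32-byte (256-bit) key.
/// Hypothetical helper: the application's real validation is not shown in this document.
fn parse_master_key(hex: &str) -> Result<[u8; 32], String> {
    if hex.len() != 64 {
        return Err(format!("expected 64 hex characters, got {}", hex.len()));
    }
    let mut key = [0u8; 32];
    // Each pair of hex digits becomes one byte of the key.
    for (i, pair) in hex.as_bytes().chunks(2).enumerate() {
        let hi = (pair[0] as char).to_digit(16).ok_or("non-hex character")?;
        let lo = (pair[1] as char).to_digit(16).ok_or("non-hex character")?;
        key[i] = (hi * 16 + lo) as u8;
    }
    Ok(key)
}
```

A value produced by `openssl rand -hex 32` passes this check; anything shorter, longer, or containing non-hex characters is rejected with an error instead of a panic.
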
### Optional
| Variable | Description | Default |
|----------|-------------|---------|
| `PORT` | Port for the backend HTTP server (inside the container). `docker-compose.yml` maps this port to the host. | `8080` |
| `STATIC_DIR` | Path to the built frontend files. In Docker, this is `./static` (set by docker-compose). For local dev, use `../frontend/dist`. | `./static` (Docker) |
@ -87,6 +89,7 @@ All environment variables are documented in `.env.example`. The `.env` file is l
The application uses PostgreSQL 17. The `docker-compose.yml` runs it as the `db` service with a named volume for data persistence.
Key configuration:
- User: `ai_synth` (password set via `POSTGRES_PASSWORD`)
- Database: `ai_synth`
- Shared memory: 128 MB (for complex queries)
@ -103,7 +106,7 @@ No manual migration step is needed. The application will not start serving reque
The database contains the following tables:
| Table | Purpose |
|-------|---------|
| `users` | User accounts (email, display name, role) |
| `sessions` | Active sessions (hashed tokens, expiry) |
- **Never use `unwrap()` in production code.** Use `?`, `ok_or_else`, `map_err`, or `unwrap_or_default` with appropriate logging. `unwrap()` is only acceptable in `#[cfg(test)]` blocks and `LazyLock` static initializers.
- **`AppError::Internal` hides details** from the client. The full error is logged via `tracing::error!` but the response body only contains `"An internal error occurred"`.
- **`From<sqlx::Error>` and `From<anyhow::Error>`** conversions are implemented, so you can use `?` with both types.
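
The rules above can be illustrated with a small standard-library example. Everything here is a stand-in: this `AppError` enum and the `max_items` function are hypothetical, and `ParseIntError` plays the role that `sqlx::Error` or `anyhow::Error` plays in the real code.

```rust
use std::collections::HashMap;
use std::num::ParseIntError;

// Hypothetical stand-in for the project's `AppError`; the real enum is not shown here.
#[derive(Debug, PartialEq)]
enum AppError {
    BadRequest(String),
    Internal(String),
}

// A `From` impl is what lets `?` convert the underlying error automatically.
impl From<ParseIntError> for AppError {
    fn from(err: ParseIntError) -> Self {
        AppError::BadRequest(err.to_string())
    }
}

fn max_items(settings: &HashMap<String, String>) -> Result<u32, AppError> {
    // `ok_or_else` replaces `unwrap()` on an `Option`.
    let raw = settings
        .get("max_items")
        .ok_or_else(|| AppError::Internal("max_items is missing".to_string()))?;
    // `?` replaces `unwrap()` on a `Result`, using the `From` impl above.
    let value: u32 = raw.trim().parse()?;
    Ok(value)
}
```

Both failure paths return an `AppError` to the caller rather than panicking, which is the behavior the `unwrap()` rule is meant to guarantee.
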
@ -135,6 +136,7 @@ Key rules:
#### Arc Usage
`Arc` is used to share data across `tokio::spawn` boundaries. Common patterns:
- `Arc<dyn LlmProvider>` for the LLM provider (shared across classify tasks)
- `Arc<AtomicBool>` for cancellation flags
- `Arc<watch::Sender<ProgressEvent>>` for SSE progress channels
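
As a minimal sketch of the cancellation-flag pattern, the example below shares an `Arc<AtomicBool>` with a worker. It uses `std::thread::spawn` so that it runs without extra crates; the same cloning pattern applies across `tokio::spawn`. The `spawn_worker` function is hypothetical.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Spawn a worker that processes up to `total` items and stops early
/// once the shared cancellation flag is set. Returns the number processed.
fn spawn_worker(cancel: Arc<AtomicBool>, total: usize) -> thread::JoinHandle<usize> {
    thread::spawn(move || {
        let mut processed = 0;
        for _ in 0..total {
            // Check the flag before each unit of work.
            if cancel.load(Ordering::Relaxed) {
                break;
            }
            processed += 1;
        }
        processed
    })
}
```

The caller keeps its own clone of the `Arc` and calls `cancel.store(true, Ordering::Relaxed)` to request a stop; the worker observes the flag at its next check.
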
- **Email**: enter a recipient address or click "S'envoyer a soi-meme" ("Send to myself"). The synthesis is sent as a formatted email via Resend.
- **Markdown**: download as a `.md` file.
- **PDF**: download as a `.pdf` file.
@ -75,6 +76,7 @@ From the synthesis detail page:
### 2.1 Multi-Theme
Each user can create multiple themes. A theme groups together:
- Content settings (search topic, categories, max items, max age, summary length)
- Personalized sources
- Generated syntheses
@ -88,6 +90,7 @@ The generate page requires selecting a theme before launching. The home page sho
Categories are user-defined per theme. Users add and remove category names in the theme editor after creating a theme.
The system always includes two default categories:
- `Divers`: overflow category for articles that match no user-defined category or whose category is already full.
- `Sans date`: category for articles without a usable publication date.
@ -100,6 +103,7 @@ Sources can be marked as preferred. Preference is stored per theme. During gener
### 2.4 Scheduled Generation
Each theme can have an optional schedule with:
- **Enabled/disabled toggle**
- **Days**: selection of days of the week (Mon-Sun)
- **Time**: execution time in UTC (HH:MM)
@ -112,6 +116,7 @@ Changes to the schedule are saved immediately (auto-save).
### 2.5 Brave Search
An optional alternative to LLM-powered web search in Phase 2. When enabled:
- The user provides a Brave Search API key (stored encrypted alongside LLM keys).
- Phase 2 queries the Brave Search API with the theme topic, filtered by article freshness.
- Results are scraped and classified/summarized by the LLM, following the same pipeline as Phase 1.
@ -127,6 +132,7 @@ Generation follows a two-phase pipeline. Phase 1 processes the user's personaliz
### 3.2 Initialization
Before generation starts:
1. Load theme settings (user-defined categories plus defaults `Divers` and `Sans date`, search topic, max items, max age, summary length) and global user settings (provider, models, batch size, rate limits, etc.).
2. Decrypt the user's LLM API key and create the provider instance.
3. Clean up old article history and LLM call logs.
@ -141,6 +147,7 @@ Skipped if the user has no sources for the theme.
Sources are split into waves of `source_extraction_window` size (default 3). Sources are rotated so extraction starts after the last source used in a previous generation (rolling window). Preferred sources are placed before non-preferred sources within the rotation order.
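
One possible reading of the rotation and wave split is sketched below. The `Source` struct, the `plan_waves` function, and the `last_used` index (the position of the last source used in the previous generation) are assumptions made for illustration; the application's actual data model may differ.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Source {
    url: String,
    preferred: bool,
}

/// Order sources for extraction and split them into waves of `window` size.
fn plan_waves(sources: &[Source], last_used: Option<usize>, window: usize) -> Vec<Vec<Source>> {
    if sources.is_empty() {
        return Vec::new();
    }
    // Rolling window: start right after the last source used previously.
    let start = last_used.map_or(0, |i| (i + 1) % sources.len());
    let mut ordered = sources.to_vec();
    ordered.rotate_left(start);
    // Stable sort: preferred sources first, rotation order kept within each group.
    ordered.sort_by_key(|s| !s.preferred);
    // A window of 0 is treated as 1 to avoid empty chunks.
    ordered.chunks(window.max(1)).map(|wave| wave.to_vec()).collect()
}
```

With five sources `a` to `e`, where `b` and `d` are preferred and `b` was the last one used, the rotated order is `c, d, e, a, b`; moving preferred sources to the front gives `d, b, c, e, a`, which a window of 3 splits into `[d, b, c]` and `[e, a]`.
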
For each wave:
1. Extract article links from all sources in the wave in parallel (bounded concurrency of 5). Link extraction uses HTML `<a>` tag parsing.
2. Deduplicate candidate URLs and filter against article history (previously seen articles are skipped).
3. Shuffle remaining candidates, with URLs from preferred sources placed first.
@ -158,11 +165,13 @@ Skipped if all user-defined categories are already filled.
The system computes category gaps (how many articles each category still needs), then follows one of two paths:
**Path A -- Brave Search** (when `use_brave_search` is enabled):
1. Query the Brave Search API with the theme topic and freshness filter.
3. Scrape and classify/summarize results using the same batched pipeline as Phase 1.
**Path B -- LLM Web Search** (default):
1. Send a search prompt to the LLM with the theme, categories, and gap counts. The LLM uses web grounding to find articles and returns structured results.
2. Filter results using the same filters as Path A.
3. Scrape each result to validate it. Keep the LLM-provided title and summary (no re-classification).
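
The category gap computation that drives both paths could look like the sketch below. It assumes that the theme's "max items" setting is a per-category cap; that interpretation, the `category_gaps` name, and the map-based representation are all assumptions rather than the documented implementation.

```rust
use std::collections::BTreeMap;

/// For each category, compute how many more articles are needed to reach
/// `max_items`. Categories that are already full are left out of the result.
fn category_gaps(
    categories: &[&str],
    filled: &BTreeMap<String, usize>,
    max_items: usize,
) -> BTreeMap<String, usize> {
    categories
        .iter()
        .filter_map(|&name| {
            let have = filled.get(name).copied().unwrap_or(0);
            let gap = max_items.saturating_sub(have);
            (gap > 0).then(|| (name.to_string(), gap))
        })
        .collect()
}
```

An empty result means every user-defined category is filled, which corresponds to the case where this phase is skipped.
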
@ -183,7 +192,7 @@ For the complete technical algorithm, see `technical_specs.md` Section 5.
Managed on the theme management page. Each theme has its own values.
| Setting | Description | Default |
|---------|-------------|---------|
| Name | Display label for the theme | -- |
| Search topic | Subject for AI search queries | -- |
| Categories | Ordered list of user-defined category names (`Divers` and `Sans date` are always included by the system) | [] |
@ -196,7 +205,7 @@ Managed on the theme management page. Each theme has its own values.
Managed on the settings page. Apply across all themes.
`POST /api/v1/syntheses/generate` creates a job in the `JobStore`, then spawns two nested tasks:
- Inner task: wraps `run_generation` in a **15-minute `tokio::time::timeout`**. If the timeout fires, sends an `Error` progress event and releases the user lock.
- Outer task: monitors the inner task's `JoinHandle` for panics. If the inner task panics, sends an `Error` progress event and releases the user lock.
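
The supervision logic of the two tasks can be approximated with the standard library alone. This is an analogue, not the application's code: it uses `std::thread` and `mpsc::recv_timeout` in place of `tokio::spawn` and `tokio::time::timeout`, and the `supervise` function and `Outcome` enum are hypothetical. Unlike a Tokio timeout, a timed-out thread here keeps running in the background.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum Outcome {
    Completed,
    TimedOut,
    Panicked,
}

/// Run `job` on a worker thread and report whether it finished,
/// exceeded `limit`, or panicked.
fn supervise<F>(job: F, limit: Duration) -> Outcome
where
    F: FnOnce() + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        job();
        // Only reached when the job returns normally.
        let _ = tx.send(());
    });
    match rx.recv_timeout(limit) {
        Ok(()) => Outcome::Completed,
        Err(mpsc::RecvTimeoutError::Timeout) => Outcome::TimedOut,
        // The sender was dropped without sending: the worker unwound from a panic.
        Err(mpsc::RecvTimeoutError::Disconnected) => Outcome::Panicked,
    }
}
```

In the flow described above, both `TimedOut` and `Panicked` would lead to the same cleanup: send an `Error` progress event and release the user lock.
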
@ -660,11 +713,13 @@ Processing in batches of `settings.batch_size` (minimum 1). For each batch:
**Batch assembly**: Pull up to `batch_size` candidates, skipping any where `source_counts[domain] >= max_articles_per_source` (traced as `filtered_diversity`).
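
A minimal sketch of batch assembly with the per-source diversity cap follows. The `Candidate` struct and the `assemble_batch` signature are illustrative assumptions; the sketch also assumes `source_counts` is updated elsewhere, after articles are accepted.

```rust
use std::collections::{HashMap, VecDeque};

#[derive(Debug)]
struct Candidate {
    url: String,
    domain: String,
}

/// Pull up to `batch_size` candidates from the front of the queue, dropping
/// any whose domain has already reached `max_articles_per_source`.
fn assemble_batch(
    queue: &mut VecDeque<Candidate>,
    source_counts: &HashMap<String, usize>,
    max_articles_per_source: usize,
    batch_size: usize,
) -> Vec<Candidate> {
    let batch_size = batch_size.max(1); // minimum batch size of 1
    let mut batch = Vec::new();
    while batch.len() < batch_size {
        let candidate = match queue.pop_front() {
            Some(c) => c,
            None => break, // queue exhausted
        };
        let used = source_counts.get(&candidate.domain).copied().unwrap_or(0);
        if used >= max_articles_per_source {
            continue; // this is where `filtered_diversity` would be traced
        }
        batch.push(candidate);
    }
    batch
}
```

Skipped candidates are consumed from the queue, so a domain that has hit its cap does not block later candidates from other domains.
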
**Phase A — Scrape batch in parallel** (`JoinSet`):
- SSRF check (no private IPs), 15s timeout, 5MB body limit.
- HTML parsing for title (`<title>`, `og:title`), date (meta tags, JSON-LD, `<time>`), body (strip scripts/nav), soft-404 detection.
- If article body is empty, is a soft-404, or is too old: trace as `filtered_empty` / `filtered_too_old` and skip.
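
The "no private IPs" part of the SSRF check can be sketched with `std::net`. The `is_public_ip` function is hypothetical and deliberately narrow: it assumes the hostname has already been resolved, and it does not cover every special-purpose range (for example IPv4-mapped IPv6 addresses), which a production check would also need to handle.

```rust
use std::net::IpAddr;

/// Return `true` only for addresses outside the private, loopback,
/// link-local, and unspecified ranges.
fn is_public_ip(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => !(v4.is_private()
            || v4.is_loopback()
            || v4.is_link_local()
            || v4.is_unspecified()
            || v4.is_broadcast()),
        IpAddr::V6(v6) => {
            let first = v6.segments()[0];
            let unique_local = (first & 0xfe00) == 0xfc00; // fc00::/7
            let link_local = (first & 0xffc0) == 0xfe80; // fe80::/10
            !(v6.is_loopback() || v6.is_unspecified() || unique_local || link_local)
        }
    }
}
```

The check belongs on the resolved address rather than the URL string, since a public-looking hostname can resolve to an internal address.
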
**Phase B — Classify/summarize batch in parallel** (`JoinSet`):
- Check rate limit before classifying (waits up to 60s, then errors).
- Send article (title + body snippet based on `summary_length`: 500/2000/4000 chars) + categories + "Divers" to LLM.
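
The body snippet sizing can be sketched as a character-bounded truncation. The `SummaryLength` variant names are assumptions (only the three limits of 500, 2000, and 4000 characters come from the text above), and `body_snippet` is a hypothetical helper.

```rust
// Hypothetical variant names; only the three character limits are documented.
#[derive(Clone, Copy)]
enum SummaryLength {
    Short,
    Medium,
    Long,
}

fn snippet_limit(length: SummaryLength) -> usize {
    match length {
        SummaryLength::Short => 500,
        SummaryLength::Medium => 2000,
        SummaryLength::Long => 4000,
    }
}

/// Return at most `snippet_limit(length)` characters of `body`,
/// cutting on a character boundary so multi-byte text stays valid UTF-8.
fn body_snippet(body: &str, length: SummaryLength) -> &str {
    let limit = snippet_limit(length);
    match body.char_indices().nth(limit) {
        Some((byte_index, _)) => &body[..byte_index],
        None => body, // body is already within the limit
    }
}
```

Counting characters instead of bytes matters for accented or non-Latin text, where slicing at a fixed byte offset could panic in the middle of a character.
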