AI Weekly Synth is a self-hosted web application that generates AI-powered weekly news syntheses. Users create themes (topics) and configure categories and sources; the app then scrapes the sources, classifies articles with an LLM, and produces structured summaries. It also supports scheduled generation with email delivery.
@ -29,6 +29,7 @@ The application will be available at `http://localhost:8080` (or the port config
The `docker-compose.yml` defines two services:
**app** (AI Weekly Synth backend + frontend):
- Multi-stage Docker image: Node.js builds the frontend, Rust builds the backend, then both are combined into a minimal Debian runtime
- Runs as a non-root user (`appuser`)
- Depends on `db` with a health check condition (waits for Postgres to be ready)
@ -36,6 +37,7 @@ The `docker-compose.yml` defines two services:
- Restart policy: `unless-stopped`
**db** (PostgreSQL 17 Alpine):
- Data persisted to a named Docker volume (`postgres_data`)
- Exposed on `127.0.0.1:5432` (localhost only, not accessible from external networks)
- Health check: `pg_isready` every 10 seconds
@ -64,9 +66,9 @@ All environment variables are documented in `.env.example`. The `.env` file is l
| `POSTGRES_PASSWORD` | Password for the PostgreSQL user. Used by both the `db` service and in `DATABASE_URL`. | `a-strong-random-password` |
| `MASTER_ENCRYPTION_KEY` | 256-bit key for AES-256-GCM encryption of user API keys at rest. Must be exactly 64 hex characters. Generate with `openssl rand -hex 32`. **Back this up securely -- losing it means all stored API keys become unreadable.** | `ab12cd34...` (64 hex chars) |
| `APP_URL` | Public URL where the app is accessible (no trailing slash). Used for magic link URLs, CORS origin, and cookie domain. | `https://synth.example.com` |
| `RESEND_API_KEY` | API key for Resend (email service). Required for magic link emails and synthesis email export. Sign up at <https://resend.com>. | `re_xxxxx` |
| `EMAIL_FROM` | Sender address for emails. Must be a verified domain in Resend. | `AI Weekly Synth <noreply@synth.example.com>` |
| `TURNSTILE_SECRET_KEY` | Server-side secret key for Cloudflare Turnstile captcha. Sign up at <https://dash.cloudflare.com/turnstile>. | `0x4AAAAAAA...` |
| `TURNSTILE_SITE_KEY` | Client-side site key for Cloudflare Turnstile. | `0x4BBBBBB...` |
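
The 64-hex-character requirement on `MASTER_ENCRYPTION_KEY` can be checked at startup before any encryption is attempted. The following is a minimal sketch using only the standard library; `parse_master_key` and `hex_val` are hypothetical helper names, not functions confirmed to exist in the codebase.

```rust
/// Decode a 64-character hex string into a 32-byte (256-bit) key.
/// Hypothetical helper for illustration; the real startup check may differ.
pub fn parse_master_key(hex: &str) -> Result<[u8; 32], String> {
    if hex.len() != 64 {
        return Err(format!("expected 64 hex characters, got {}", hex.len()));
    }
    let mut key = [0u8; 32];
    // Each pair of hex digits becomes one byte of the key.
    for (i, pair) in hex.as_bytes().chunks(2).enumerate() {
        let hi = hex_val(pair[0])?;
        let lo = hex_val(pair[1])?;
        key[i] = (hi << 4) | lo;
    }
    Ok(key)
}

/// Convert one ASCII hex digit to its numeric value.
fn hex_val(byte: u8) -> Result<u8, String> {
    (byte as char)
        .to_digit(16)
        .map(|d| d as u8)
        .ok_or_else(|| format!("invalid hex character: {:?}", byte as char))
}
```

A value produced by `openssl rand -hex 32` passes this check; anything shorter, longer, or containing non-hex characters is rejected with a descriptive error instead of a panic.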
### Optional
@ -87,6 +89,7 @@ All environment variables are documented in `.env.example`. The `.env` file is l
The application uses PostgreSQL 17. The `docker-compose.yml` runs it as the `db` service with a named volume for data persistence.
Key configuration:
- User: `ai_synth` (password set via `POSTGRES_PASSWORD`)
- **Never use `unwrap()` in production code.** Use `?`, `ok_or_else`, `map_err`, or `unwrap_or_default` with appropriate logging. `unwrap()` is only acceptable in `#[cfg(test)]` blocks and `LazyLock` static initializers.
- **`AppError::Internal` hides details** from the client. The full error is logged via `tracing::error!` but the response body only contains `"An internal error occurred"`.
- **`From<sqlx::Error>` and `From<anyhow::Error>`** conversions are implemented, so you can use `?` with both types.
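The rules above can be illustrated with a small, dependency-free sketch. It is an approximation under stated assumptions: the `NotFound` variant, `client_message`, and `parse_max_items` are hypothetical names, `eprintln!` stands in for `tracing::error!`, and a `From<std::num::ParseIntError>` conversion stands in for the `sqlx` and `anyhow` conversions so the example compiles with the standard library alone.

```rust
#[derive(Debug)]
pub enum AppError {
    /// Hypothetical variant whose message is safe to show to the client.
    NotFound(String),
    /// Carries the full detail, which must never reach the client.
    Internal(String),
}

impl AppError {
    /// Build the message placed in the response body.
    pub fn client_message(&self) -> String {
        match self {
            AppError::NotFound(what) => format!("{what} not found"),
            AppError::Internal(detail) => {
                // Stand-in for tracing::error!: the detail is logged server-side only.
                eprintln!("internal error: {detail}");
                "An internal error occurred".to_string()
            }
        }
    }
}

// Stand-in for the From<sqlx::Error> / From<anyhow::Error> conversions.
impl From<std::num::ParseIntError> for AppError {
    fn from(err: std::num::ParseIntError) -> Self {
        AppError::Internal(err.to_string())
    }
}

/// Hypothetical handler helper: `?` converts the parse error, no unwrap() needed.
pub fn parse_max_items(raw: &str) -> Result<u32, AppError> {
    let n: u32 = raw.trim().parse()?;
    Ok(n)
}
```

Calling `parse_max_items("twelve")` yields an `AppError::Internal` whose `client_message()` is the generic string, while the parse error text only appears in the log output.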
@ -135,6 +136,7 @@ Key rules:
#### Arc Usage
`Arc` is used to share data across `tokio::spawn` boundaries. Common patterns:
- `Arc<dyn LlmProvider>` for the LLM provider (shared across classify tasks)
- `Arc<AtomicBool>` for cancellation flags
- `Arc<watch::Sender<ProgressEvent>>` for SSE progress channels
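As an illustration of the cancellation-flag pattern, here is a minimal sketch. To stay dependency-free it uses `std::thread::spawn` in place of `tokio::spawn`; the sharing mechanics of `Arc<AtomicBool>` are the same. `run_until_cancelled` is a hypothetical function, not part of the codebase.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Process up to `items` work units on a worker thread, checking the shared
/// cancellation flag before each unit. Returns how many units were processed.
pub fn run_until_cancelled(items: usize, cancel: Arc<AtomicBool>) -> Result<usize, String> {
    let processed = Arc::new(AtomicUsize::new(0));
    let worker = {
        // Clone the Arcs so the spawned closure owns its own handles.
        let cancel = Arc::clone(&cancel);
        let processed = Arc::clone(&processed);
        thread::spawn(move || {
            for _ in 0..items {
                if cancel.load(Ordering::Relaxed) {
                    break; // cancellation requested: stop before the next unit
                }
                processed.fetch_add(1, Ordering::Relaxed);
            }
        })
    };
    // Surface a worker panic as an error instead of unwrapping.
    worker
        .join()
        .map_err(|_| "worker thread panicked".to_string())?;
    Ok(processed.load(Ordering::Relaxed))
}
```

Any holder of a clone of the flag can call `store(true, Ordering::Relaxed)` to request cancellation; the worker notices it at its next check.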
- **Email**: enter a recipient address or click "S'envoyer a soi-meme" ("Send to myself"). The synthesis is sent as a formatted email via Resend.
- **Markdown**: download as a `.md` file.
- **PDF**: download as a `.pdf` file.
@ -75,6 +76,7 @@ From the synthesis detail page:
### 2.1 Multi-Theme
Each user can create multiple themes. A theme groups together:
- Content settings (search topic, categories, max items, max age, summary length)
- Personalized sources
- Generated syntheses
@ -88,6 +90,7 @@ The generate page requires selecting a theme before launching. The home page sho
Categories are user-defined per theme. Users add and remove category names in the theme editor after creating a theme.
The system always includes two default categories:
- `Divers` ("Miscellaneous"): overflow category for articles that match no user-defined category or whose category is already full.
- `Sans date` ("No date"): category for articles without a usable publication date.
@ -100,6 +103,7 @@ Sources can be marked as preferred. Preference is stored per theme. During gener
### 2.4 Scheduled Generation
Each theme can have an optional schedule with:
- **Enabled/disabled toggle**
- **Days**: selection of days of the week (Mon-Sun)
- **Time**: execution time in UTC (HH:MM)
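The schedule fields listed above could be modeled roughly as follows. This is a sketch under assumptions: the `Schedule` struct, its field names, the Monday-first day indexing, and `parse_utc_time` are all hypothetical and only illustrate strict `HH:MM` validation and a due check.

```rust
/// Parse a strict `HH:MM` 24-hour UTC time into minutes since midnight.
pub fn parse_utc_time(s: &str) -> Option<u16> {
    let (h, m) = s.split_once(':')?;
    // Require exactly two ASCII digits on each side (rejects "7:30" and "+7:30").
    let two_digits = |p: &str| p.len() == 2 && p.bytes().all(|b| b.is_ascii_digit());
    if !two_digits(h) || !two_digits(m) {
        return None;
    }
    let h: u16 = h.parse().ok()?;
    let m: u16 = m.parse().ok()?;
    (h < 24 && m < 60).then_some(h * 60 + m)
}

/// Hypothetical in-memory form of a theme schedule.
pub struct Schedule {
    pub enabled: bool,
    /// Index 0 = Monday ... 6 = Sunday (an assumed convention).
    pub days: [bool; 7],
    /// Execution time in UTC, as minutes since midnight.
    pub minute_of_day: u16,
}

impl Schedule {
    /// True when the schedule is enabled and matches the given UTC weekday and minute.
    pub fn is_due(&self, weekday: usize, minute_of_day: u16) -> bool {
        self.enabled
            && self.days.get(weekday).copied().unwrap_or(false)
            && self.minute_of_day == minute_of_day
    }
}
```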
@ -112,6 +116,7 @@ Changes to the schedule are saved immediately (auto-save).
### 2.5 Brave Search
Brave Search is an optional alternative to the LLM-powered web search in Phase 2. When enabled:
- The user provides a Brave Search API key (stored encrypted alongside LLM keys).
- Phase 2 queries the Brave Search API with the theme topic, filtered by article freshness.
- Results are scraped and classified/summarized by the LLM, following the same pipeline as Phase 1.
@ -127,6 +132,7 @@ Generation follows a two-phase pipeline. Phase 1 processes the user's personaliz
### 3.2 Initialization
Before generation starts:
1. Load theme settings (user-defined categories plus defaults `Divers` and `Sans date`, search topic, max items, max age, summary length) and global user settings (provider, models, batch size, rate limits, etc.).
2. Decrypt the user's LLM API key and create the provider instance.
3. Clean up old article history and LLM call logs.
@ -141,6 +147,7 @@ Skipped if the user has no sources for the theme.
Sources are split into waves of `source_extraction_window` sources each (default 3). The source list is rotated so that extraction starts with the source following the last one used in the previous generation (rolling window). Within that rotation order, preferred sources are placed before non-preferred sources.
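One possible reading of the rotation and wave split described above is sketched below. The `Source` struct, `plan_waves`, and the `last_used` index parameter are hypothetical; the actual implementation may track the rolling window differently.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Source {
    pub url: String,
    pub preferred: bool,
}

/// Order sources for extraction and split them into waves of `window` sources.
/// `last_used` is the index of the last source used by the previous generation.
pub fn plan_waves(sources: &[Source], last_used: Option<usize>, window: usize) -> Vec<Vec<Source>> {
    if sources.is_empty() {
        return Vec::new();
    }
    // Rolling window: start right after the last source used previously.
    let start = last_used.map_or(0, |i| (i + 1) % sources.len());
    let mut ordered = sources.to_vec();
    ordered.rotate_left(start);
    // Stable sort: preferred sources first, rotation order kept within each group.
    ordered.sort_by_key(|s| !s.preferred);
    // A window of 0 is treated as 1 to avoid an empty chunk size.
    ordered.chunks(window.max(1)).map(|c| c.to_vec()).collect()
}
```

For five sources `a, b, c, d, e` with `b` preferred, `last_used = Some(1)`, and a window of 3, the rotation gives `c, d, e, a, b`, the preferred source moves to the front, and the waves are `[b, c, d]` and `[e, a]`.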
For each wave:
1. Extract article links from all sources in the wave in parallel (bounded concurrency of 5). Link extraction uses HTML `<a>` tag parsing.
2. Deduplicate candidate URLs and filter against article history (previously seen articles are skipped).
3. Shuffle remaining candidates, with URLs from preferred sources placed first.
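The deduplication and history filtering in step 2 can be sketched as follows. `filter_candidates` is a hypothetical function; the shuffle from step 3 is omitted here because it would need a random number generator outside the standard library.

```rust
use std::collections::HashSet;

/// Drop duplicate URLs, then split the rest into fresh candidates and
/// URLs already present in the article history.
/// Returns `(fresh, filtered_history)`, both in first-seen order.
pub fn filter_candidates(
    candidates: Vec<String>,
    history: &HashSet<String>,
) -> (Vec<String>, Vec<String>) {
    let mut seen = HashSet::new();
    let mut fresh = Vec::new();
    let mut filtered_history = Vec::new();
    for url in candidates {
        // `insert` returns false when the URL was already seen in this wave.
        if !seen.insert(url.clone()) {
            continue;
        }
        if history.contains(&url) {
            filtered_history.push(url);
        } else {
            fresh.push(url);
        }
    }
    (fresh, filtered_history)
}
```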
@ -158,11 +165,13 @@ Skipped if all user-defined categories are already filled.
The system computes category gaps (how many articles each category still needs), then follows one of two paths:
**Path A -- Brave Search** (when `use_brave_search` is enabled):
1. Query the Brave Search API with the theme topic and freshness filter.
3. Scrape and classify/summarize results using the same batched pipeline as Phase 1.
**Path B -- LLM Web Search** (default):
1. Send a search prompt to the LLM with the theme, categories, and gap counts. The LLM uses web grounding to find articles and returns structured results.
2. Filter results using the same filters as Path A.
3. Scrape each result to validate it. Keep the LLM-provided title and summary (no re-classification).
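The category-gap computation that precedes both paths can be sketched as below. This assumes that the theme's max items setting applies per category, which the text does not state explicitly; `category_gaps` and its parameters are hypothetical.

```rust
use std::collections::BTreeMap;

/// For each user-defined category, compute how many articles are still needed.
/// Categories that are already full are left out of the result.
pub fn category_gaps(
    categories: &[&str],
    filled: &BTreeMap<String, usize>,
    max_items: usize,
) -> BTreeMap<String, usize> {
    categories
        .iter()
        .filter_map(|name| {
            let have = filled.get(*name).copied().unwrap_or(0);
            // saturating_sub guards against a category holding more than the cap.
            let gap = max_items.saturating_sub(have);
            (gap > 0).then(|| (name.to_string(), gap))
        })
        .collect()
}
```

An empty result corresponds to the case where Phase 2 is skipped because every user-defined category is already filled.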
@ -219,6 +228,7 @@ Users can export their global settings as a JSON file and import settings from a
### 5.1 Provider Management
Admins configure which LLM providers and models are available to users:
- Add providers with a unique identifier and display name.
- For each provider, configure two model lists: scraping/extraction models and web search models.
- Set a default model for each category.
@ -234,6 +244,7 @@ Admins set default rate limits per provider (max requests / time window in secon
### 5.3 User Management
Admins can:
- View all registered users (email, name, role, registration date).
- Promote a user to admin or demote an admin to user.
- Admins cannot modify their own role.
@ -259,6 +270,7 @@ A Markdown export is available from the synthesis detail page. The file can be s
### 7.1 Article History
Every article encountered during generation is recorded in the article history with its status:
- **used**: included in the final synthesis.
- **filtered_history**: skipped because it was seen in a previous generation.
- **filtered_diversity**: skipped due to per-domain cap.
@ -272,6 +284,7 @@ Users can view the article history per synthesis (provenance view) or globally.
### 7.2 LLM Call Logs
Every LLM call during generation is logged with:
- Call type (link extraction, classify/summarize, web search)
`POST /api/v1/syntheses/generate` creates a job in the `JobStore`, then spawns two nested tasks:
- Inner task: wraps `run_generation` in a **15-minute `tokio::time::timeout`**. If the timeout fires, sends an `Error` progress event and releases the user lock.
- Outer task: monitors the inner task's `JoinHandle` for panics. If the inner task panics, sends an `Error` progress event and releases the user lock.
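The supervision idea (a time limit plus panic detection, both ending in an `Error` event) can be approximated without an async runtime. The sketch below deliberately swaps `tokio::spawn` and `tokio::time::timeout` for `std::thread` and an `mpsc` channel with `recv_timeout`; `run_job` and this two-variant `ProgressEvent` are hypothetical. Unlike a Tokio timeout, a timed-out thread here keeps running in the background rather than being cancelled.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum ProgressEvent {
    Done,
    Error(String),
}

/// Run `job` on a worker thread and report how it ended.
/// The caller would release the user lock after any `Error` event.
pub fn run_job<F>(job: F, timeout: Duration) -> ProgressEvent
where
    F: FnOnce() + Send + 'static,
{
    let (tx, rx) = mpsc::channel::<()>();
    // Inner task: runs the job, then signals completion.
    let inner = thread::spawn(move || {
        job();
        let _ = tx.send(());
    });
    // Outer supervision: wait for completion, a timeout, or a dropped sender.
    match rx.recv_timeout(timeout) {
        Ok(()) => ProgressEvent::Done,
        Err(mpsc::RecvTimeoutError::Timeout) => {
            ProgressEvent::Error("generation timed out".to_string())
        }
        Err(mpsc::RecvTimeoutError::Disconnected) => {
            // The sender was dropped without sending: the job panicked.
            let _ = inner.join();
            ProgressEvent::Error("generation task panicked".to_string())
        }
    }
}
```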
@ -660,11 +713,13 @@ Processing in batches of `settings.batch_size` (minimum 1). For each batch:
**Batch assembly**: Pull up to `batch_size` candidates, skipping any where `source_counts[domain] >= max_articles_per_source` (traced as `filtered_diversity`).
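A minimal sketch of this batch assembly step, assuming candidates sit in a queue and carry a precomputed domain. The `Candidate` struct and `assemble_batch` signature are hypothetical; only `source_counts`, `max_articles_per_source`, and `batch_size` come from the description above.

```rust
use std::collections::{HashMap, VecDeque};

#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub url: String,
    pub domain: String,
}

/// Pull up to `batch_size` candidates from the front of the queue, skipping
/// those whose domain has already reached the per-source cap.
/// Returns `(batch, filtered_diversity)`.
pub fn assemble_batch(
    queue: &mut VecDeque<Candidate>,
    source_counts: &HashMap<String, usize>,
    max_articles_per_source: usize,
    batch_size: usize,
) -> (Vec<Candidate>, Vec<Candidate>) {
    let target = batch_size.max(1); // batch size has a minimum of 1
    let mut batch = Vec::new();
    let mut filtered_diversity = Vec::new();
    while batch.len() < target {
        let candidate = match queue.pop_front() {
            Some(c) => c,
            None => break, // queue exhausted
        };
        let used = source_counts.get(&candidate.domain).copied().unwrap_or(0);
        if used >= max_articles_per_source {
            filtered_diversity.push(candidate); // would be traced as filtered_diversity
        } else {
            batch.push(candidate);
        }
    }
    (batch, filtered_diversity)
}
```

This sketch only consults the counts passed in; whether counts are also updated for articles within the same batch is not specified in the text and is left out.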
**Phase A — Scrape batch in parallel** (`JoinSet`):
- SSRF check (no private IPs), 15s timeout, 5MB body limit.
- HTML parsing for title (`<title>`, `og:title`), date (meta tags, JSON-LD, `<time>`), body (strip scripts/nav), soft-404 detection.
- If article body is empty, is a soft-404, or is too old: trace as `filtered_empty` / `filtered_too_old` and skip.
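The SSRF check in the first bullet can be sketched with the standard library's address types. The exact set of blocked ranges is an assumption (the text only says "no private IPs"), and `is_forbidden_target` is a hypothetical name; a real check would run on the resolved address of the target host.

```rust
use std::net::{IpAddr, Ipv4Addr};

/// Private, loopback, link-local, and unspecified IPv4 ranges.
fn is_forbidden_v4(v4: Ipv4Addr) -> bool {
    v4.is_private() || v4.is_loopback() || v4.is_link_local() || v4.is_unspecified()
}

/// Return true when a resolved address must not be fetched.
pub fn is_forbidden_target(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => is_forbidden_v4(v4),
        IpAddr::V6(v6) => {
            // IPv4-mapped addresses (::ffff:a.b.c.d) are checked as IPv4.
            if let Some(v4) = v6.to_ipv4_mapped() {
                return is_forbidden_v4(v4);
            }
            let first = v6.segments()[0];
            v6.is_loopback()
                || v6.is_unspecified()
                || (first & 0xfe00) == 0xfc00 // unique local, fc00::/7
                || (first & 0xffc0) == 0xfe80 // link-local, fe80::/10
        }
    }
}
```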
**Phase B — Classify/summarize batch in parallel** (`JoinSet`):
- Check rate limit before classifying (waits up to 60s, then errors).
- Send article (title + body snippet based on `summary_length`: 500/2000/4000 chars) + categories + "Divers" to LLM.
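The mapping from `summary_length` to a body snippet size might look like the sketch below. The variant names `Short`, `Medium`, and `Long` are assumptions (the text only gives the three character limits), and `body_snippet` is a hypothetical helper. Truncation counts characters rather than bytes so that multi-byte UTF-8 text is never split mid-character.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SummaryLength {
    Short,
    Medium,
    Long,
}

impl SummaryLength {
    /// Maximum number of body characters sent to the LLM.
    pub fn snippet_chars(self) -> usize {
        match self {
            SummaryLength::Short => 500,
            SummaryLength::Medium => 2000,
            SummaryLength::Long => 4000,
        }
    }
}

/// Return the leading part of `body`, limited to the character budget
/// for the given summary length.
pub fn body_snippet(body: &str, length: SummaryLength) -> &str {
    // Find the byte offset where the character just past the budget begins.
    match body.char_indices().nth(length.snippet_chars()) {
        Some((byte_idx, _)) => &body[..byte_idx],
        None => body, // body is shorter than the budget
    }
}
```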