Trim architecture.md significantly (Section 1 overview, technology stack, deployment topology,
module inventory lists, LLM trait block, pipeline details, data model table, full API tables,
background task list). Replace the Section 5 API tables with a one-liner. Condense requirements.md
Sections 3.1/3.5/3.6/3.7/3.8 and 4.2 with cross-references. Replace the deployment.md security
feature list with a cross-reference to architecture.md Section 6. Add a cross-reference to
technical_specs.md Section 5 in functional_specs.md Section 3.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AI Weekly Synth is a self-hosted web application that generates AI-powered weekly news syntheses. Users configure topics (themes), categories, and an LLM provider; the system then searches the web, scrapes and validates sources, classifies articles, and produces structured summaries.
AI Weekly Synth is a self-hosted Rust/Axum backend with a SolidJS frontend, backed by PostgreSQL, deployed as a Docker Compose stack. It generates AI-powered weekly news syntheses organized by user-configured themes and categories.
### Technology Stack
See `requirements.md` for product vision and features. See `technical_specs.md` for the full technology stack. See `deployment.md` for the Docker topology and operational details.
```
└── db (postgres:17-alpine) port 5432 (localhost only)
└── postgres_data volume
```
The app container builds from a multi-stage Dockerfile, serves the SolidJS frontend as static files, and connects to Postgres over the `internal` bridge network.
---
@@ -36,42 +15,16 @@ The backend follows a three-layer architecture with shared model types:
```
handlers/ (HTTP layer)
│
├── extracts request data (Axum extractors, JSON, path params)
```
Handlers extract and validate request data, delegate to services or db, and format responses. Services contain all business logic. The db layer executes pure SQL via sqlx with typed result mapping and no business logic. Models define domain structs, request/response DTOs, and validation logic.
@@ -79,35 +32,15 @@ models/ (Shared types -- used by all layers)
### 3.1 LLM Provider Abstraction
The `LlmProvider` trait defines a unified interface for all LLM backends:
The `LlmProvider` trait defines a unified interface for all LLM backends, with implementations for Gemini, OpenAI, Anthropic, and a mock provider for testing. A factory creates provider instances by name from the admin-curated provider list.
The factory (`llm/factory.rs`) creates provider instances by name. The mock provider enables end-to-end pipeline testing without real API calls.
See `technical_specs.md` Section 6 for provider interface details and supported models.
### 3.2 Synthesis Pipeline
The pipeline is the core business logic, orchestrated in `services/synthesis.rs`. It runs as a background tokio task with a 15-minute timeout.
The pipeline is orchestrated in `services/synthesis.rs` and runs as a background tokio task with a 15-minute timeout. Phase 1 processes the user's personalized sources using a rolling windowed extraction with batched parallel scraping and LLM classification. Phase 2 fills remaining category gaps via Brave Search or LLM web search. The finalization step assembles sections, persists the synthesis, and records article history. Progress is reported via `tokio::sync::watch` channels consumed by SSE endpoints.
**Three phases:**
1. **Phase 1 -- Personalized Sources**: Extract article links from user-curated source pages (windowed, rolling), scrape articles, classify and summarize each via LLM. Batched processing with configurable `batch_size`.
2. **Phase 2 -- Web Search Fallback**: For under-filled categories, either call the Brave Search API or use the LLM's web search capability to find additional articles. Scrape and validate results.
See `technical_specs.md` Section 5 for the full algorithm.
3. **Save**: Assemble sections by category, sanitize JSON, persist to database, record article history traces.
Progress is reported via `tokio::sync::watch` channels consumed by SSE endpoints.
### 3.3 Job Store
@@ -121,20 +54,16 @@ Progress is reported via `tokio::sync::watch` channels consumed by SSE endpoints
### 3.4 Scheduler
`services/scheduler.rs` runs as a background task, checking every minute for due `theme_schedules`. When a schedule fires:
`services/scheduler.rs` runs as a background task, checking every minute for due `theme_schedules`. When a schedule fires, it runs the generation pipeline directly, emails results to configured recipients (up to 3), and marks the schedule as run to prevent double-execution on the same day.
1. Query `find_due_schedules` matching current day code + time
See `deployment.md` for operational details.
2. Skip if user already has a manual generation in progress
3. Run `synthesis::run_generation_inner` directly
4. Send email to configured recipients (up to 3)
5. Mark schedule as run
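The due check and the double-run guard in the steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `services/scheduler.rs`: the `Schedule` struct, its field names, the day-code strings, and the `should_fire` and `mark_run` functions are all hypothetical, and dates are passed in as plain tuples so the example needs only the standard library.

```rust
/// Hypothetical, simplified view of a `theme_schedules` row.
#[derive(Debug, Clone)]
struct Schedule {
    /// Day codes on which the schedule fires, e.g. "mon" (assumed format).
    days: Vec<String>,
    /// Time of day as (hour, minute).
    time: (u8, u8),
    /// Calendar date (year, month, day) of the last completed run, if any.
    last_run_day: Option<(i32, u8, u8)>,
}

/// Returns true when the schedule matches the current day code and minute,
/// has not already run today, and no manual generation is in progress.
fn should_fire(
    s: &Schedule,
    day_code: &str,
    now: (u8, u8),
    today: (i32, u8, u8),
    manual_in_progress: bool,
) -> bool {
    let day_matches = s.days.iter().any(|d| d == day_code);
    let time_matches = s.time == now;
    let already_ran_today = s.last_run_day == Some(today);
    day_matches && time_matches && !already_ran_today && !manual_in_progress
}

/// Marks the schedule as run so a second check on the same day is a no-op.
fn mark_run(s: &mut Schedule, today: (i32, u8, u8)) {
    s.last_run_day = Some(today);
}
```

Keeping the check a pure function of the current time makes it straightforward to cover with deterministic tests, without waiting on a real clock.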
### 3.5 Scraper
Two scraping services:
- **`scraper.rs`**: Article page scraper with SSRF prevention, HTML parsing, title/date/body extraction, soft-404 detection, 15s timeout, 5MB body limit.
- **`source_scraper.rs`**: Source index page scraper that extracts article links from user-configured source URLs (HTML `<a>` parsing with filters, or LLM-assisted extraction).
- **`source_scraper.rs`**: Source index page scraper that extracts article links from user-configured source URLs (HTML `<a>` parsing with filters).
### 3.6 Rate Limiters
@@ -166,130 +95,13 @@ admin_providers
```
└── admin_rate_limits (provider_name FK, CASCADE)
```
### Table Summary
See `technical_specs.md` Section 3 for complete column definitions.
| Table | Purpose | Key Columns |
|---|---|---|
| `users` | User accounts | id, email, display_name, role (user/admin), created_at |
| PUT | /admin/users/{id}/role | Admin | Change user role |
### Infrastructure
| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /health | Public | Health check |
---
@@ -350,10 +162,7 @@ Tokio with full features. The Axum server runs as a multi-threaded async runtime
### Background Tasks
Spawned at startup via `tokio::spawn`:
Three tasks are spawned at startup: hourly session cleanup, periodic job store TTL cleanup, and the minute-by-minute theme schedule checker. See `deployment.md` Section 2.
- **Session cleanup**: Hourly deletion of expired DB sessions
- **Job store cleanup**: Periodic removal of expired job entries (1-hour TTL)
- **Scheduler**: Minute-by-minute check for due theme schedules
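As an illustration of the job store cleanup task, the sketch below drops entries older than a TTL. It assumes an in-memory map keyed by job id; `JobEntry`, `cleanup_expired`, and the choice of `Instant` timestamps are hypothetical and are not taken from the actual job store.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical job entry: only the creation time matters for TTL cleanup.
struct JobEntry {
    created_at: Instant,
}

/// Removes every job whose age has reached `ttl` and returns how many were dropped.
/// `now` is a parameter so the function can be tested without sleeping.
fn cleanup_expired(
    jobs: &mut HashMap<String, JobEntry>,
    now: Instant,
    ttl: Duration,
) -> usize {
    let before = jobs.len();
    // `saturating_duration_since` yields zero if `created_at` is later than `now`.
    jobs.retain(|_, job| now.saturating_duration_since(job.created_at) < ttl);
    before - jobs.len()
}
```

A periodic background task would call `cleanup_expired(&mut jobs, Instant::now(), Duration::from_secs(3600))` for the 1-hour TTL mentioned above.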
### Generation Pipeline Concurrency
@@ -380,3 +189,10 @@ POST /generate
### Graceful Shutdown
The server supports graceful shutdown via signal handling, allowing in-flight requests to complete.
---
## 8. Quality Gates
- Release candidates must include deterministic CI coverage for critical autonomous flows, especially scheduler execution and SSE progress behavior.
- External-provider tests (for example live LLM E2E checks) are supplemental and non-blocking; they do not replace deterministic CI coverage.
- Use the `Button` component (`components/ui/Button.tsx`) with `variant`/`loading`/`icon` props instead of raw `<button>` elements with inline Tailwind classes.
- This rule is strict for all frontend UI code (no raw `<button>` in application components).
- Use `<Switch>/<Match>` for mutually exclusive conditional rendering instead of multiple adjacent `<Show>` blocks.
- Use `<For each={...}>` for list rendering.
- Use the `useToast` context for user feedback (success/error notifications).
- **Search topic**: the subject the AI uses to search for news (e.g. "Intelligence Artificielle").
- **Categories**: an ordered list of user-defined category names. Categories can be added and removed. The system always adds an implicit "Autre" overflow category.
- **Categories**: an ordered list of user-defined category names. Themes can be created without user-defined categories. The system always includes `Divers` (overflow) and `Sans date` (undated articles).
- **Max age (days)**: how old articles can be.
- **Max items per category**: cap per category.
- **Summary length**: slider with three positions -- Court (3-4 lines), Moyen (6-8 lines), Detaille (12-15 lines).
@@ -34,10 +34,10 @@
1. On the theme management page, below theme settings, the sources section shows sources scoped to the selected theme.
2. User adds sources individually (title + URL) or via:
- **CSV import**: upload a `.csv` file with `Titre,URL` columns. Auto-detects comma/semicolon delimiters, skips header rows, prepends `https://` to bare URLs.
- **CSV import**: upload a `.csv` file with `Titre,URL` columns. Auto-detects comma/semicolon delimiters, skips header rows, prepends `https://` to bare URLs. Import is always applied to the selected theme.
- **Bulk text import**: paste multiple sources in `Nom;URL` format, one per line.
- **Bulk text import**: paste multiple sources in `Nom;URL` format, one per line. Import is always applied to the selected theme.
- **CSV export**: download all sources for the theme as a CSV file.
- **CSV export**: download sources for the selected theme only.
3. Sources can be marked as **preferred** (prioritaire) via checkboxes. Preferred sources are processed first during generation. A counter shows how many sources are preferred.
3. Sources can be marked as **preferred** (prioritaire) via checkboxes. Preferred sources are scoped per theme and do not affect other themes.
4. Sources can be deleted individually.
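The CSV import rules in step 2 (delimiter auto-detection, header skipping, `https://` prefixing) could look roughly like the sketch below. It is an assumption-laden illustration: `SourceRow` and `parse_sources` are hypothetical names, the delimiter heuristic (whichever of `,` and `;` occurs more often) is a guess at one reasonable approach, and quoted CSV fields are not handled.

```rust
/// A parsed source row (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
struct SourceRow {
    title: String,
    url: String,
}

/// Parses `Titre,URL` or `Titre;URL` text into source rows.
fn parse_sources(input: &str) -> Vec<SourceRow> {
    // Assumed heuristic: pick whichever delimiter occurs more often overall.
    let semicolons = input.matches(';').count();
    let commas = input.matches(',').count();
    let delimiter = if semicolons > commas { ';' } else { ',' };

    input
        .lines()
        .filter_map(|line| {
            // Lines without the delimiter (including blank lines) are skipped.
            let (title, url) = line.split_once(delimiter)?;
            let (title, url) = (title.trim(), url.trim());
            if title.is_empty() || url.is_empty() {
                return None;
            }
            // Skip header rows such as `Titre,URL`.
            if title.eq_ignore_ascii_case("titre") && url.eq_ignore_ascii_case("url") {
                return None;
            }
            // Prepend a scheme to bare URLs.
            let url = if url.starts_with("http://") || url.starts_with("https://") {
                url.to_string()
            } else {
                format!("https://{}", url)
            };
            Some(SourceRow { title: title.to_string(), url })
        })
        .collect()
}
```

The same function covers the `Nom;URL` bulk text format, since that is the semicolon-delimited case without a header row.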
### 1.5 Generate a Synthesis
@@ -85,13 +85,17 @@ The generate page requires selecting a theme before launching. The home page sho
### 2.2 Categories
Categories are user-defined per theme. Users add and remove category names in the theme editor. The system always appends an implicit "Autre" category to catch articles that do not match any user-defined category, or articles from categories that have reached their max items cap.
Categories are user-defined per theme. Users add and remove category names in the theme editor after creating a theme.
If no categories are configured, the only available category is "Autre".
The system always includes two default categories:
- `Divers`: overflow category for unmatched or full categories.
- `Sans date`: category for articles without a usable publication date.
If no user-defined categories are configured, the available categories are still `Divers` and `Sans date`.
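The fallback behavior of the two default categories can be sketched as a single decision function. This is an illustration only: `assign_category` and its parameters are hypothetical, and checking the date before the category match is an assumption, since the text does not say which rule takes precedence.

```rust
use std::collections::HashMap;

const OVERFLOW: &str = "Divers";
const UNDATED: &str = "Sans date";

/// Chooses the section for one article.
/// `suggested` is the category proposed by classification; `counts` holds how
/// many articles each category already contains; `max_items` is the per-category cap.
fn assign_category(
    suggested: &str,
    has_date: bool,
    user_categories: &[&str],
    counts: &HashMap<String, usize>,
    max_items: usize,
) -> String {
    // Assumption: the undated rule is checked first.
    if !has_date {
        return UNDATED.to_string();
    }
    let known = user_categories.contains(&suggested);
    let full = counts.get(suggested).copied().unwrap_or(0) >= max_items;
    if known && !full {
        suggested.to_string()
    } else {
        // Unmatched or full categories overflow into `Divers`.
        OVERFLOW.to_string()
    }
}
```

With an empty `user_categories` slice, every dated article lands in `Divers` and every undated one in `Sans date`, which matches the no-categories case described above.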
### 2.3 Preferred Sources
Sources can be marked as preferred. During generation, preferred sources are extracted and processed before non-preferred sources. Within each extraction wave, URLs from preferred sources are also shuffled and placed before other URLs. This maximizes the chance that articles from preferred sources fill the synthesis.
Sources can be marked as preferred. Preference is stored per theme. During generation, preferred sources are extracted and processed before non-preferred sources. Within each extraction wave, URLs from preferred sources are also shuffled and placed before other URLs. This maximizes the chance that articles from preferred sources fill the synthesis.
### 2.4 Scheduled Generation
@@ -123,7 +127,7 @@ Generation follows a two-phase pipeline. Phase 1 processes the user's personaliz
### 3.2 Initialization
Before generation starts:
1. Load theme settings (categories, search topic, max items, max age, summary length) and global user settings (provider, models, batch size, rate limits, etc.).
1. Load theme settings (user-defined categories plus defaults `Divers` and `Sans date`, search topic, max items, max age, summary length) and global user settings (provider, models, batch size, rate limits, etc.).
2. Decrypt the user's LLM API key and create the provider instance.
3. Clean up old article history and LLM call logs.
4. Load personalized sources for the selected theme.
@@ -137,7 +141,7 @@ Skipped if the user has no sources for the theme.
Sources are split into waves of `source_extraction_window` size (default 3). Sources are rotated so extraction starts after the last source used in a previous generation (rolling window). Preferred sources are placed before non-preferred sources within the rotation order.
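One way to picture the rolling window is the sketch below: rotate the source list so it starts after the last source used, move preferred sources to the front while keeping the rotation order within each group, then cut the result into waves. The `Source` struct and `plan_waves` function are hypothetical, and the real pipeline may combine these steps differently.

```rust
/// Hypothetical source record for this sketch.
#[derive(Debug, Clone, PartialEq)]
struct Source {
    id: u32,
    preferred: bool,
}

/// Orders sources for extraction and splits them into waves of `window` size.
fn plan_waves(sources: &[Source], last_used_id: Option<u32>, window: usize) -> Vec<Vec<Source>> {
    let mut ordered: Vec<Source> = sources.to_vec();

    // Rolling window: start right after the last source used previously.
    if let Some(last) = last_used_id {
        if let Some(pos) = ordered.iter().position(|s| s.id == last) {
            let len = ordered.len();
            ordered.rotate_left((pos + 1) % len);
        }
    }

    // `sort_by_key` is stable, so the rotation order is kept inside each group;
    // preferred sources map to `false`, which sorts before `true`.
    ordered.sort_by_key(|s| !s.preferred);

    // A window of 0 is treated as 1 to avoid an invalid chunk size.
    ordered.chunks(window.max(1)).map(|c| c.to_vec()).collect()
}
```

For five sources with ids 1 to 5, source 4 preferred, `last_used_id = Some(2)`, and a window of 3, the rotation gives 3, 4, 5, 1, 2; moving the preferred source forward gives 4, 3, 5, 1, 2; and the waves are `[4, 3, 5]` and `[1, 2]`.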
For each wave:
1. Extract article links from all sources in the wave in parallel (bounded concurrency of 5). Link extraction uses either LLM analysis of the page content or HTML `<a>` tag parsing (configurable).
1. Extract article links from all sources in the wave in parallel (bounded concurrency of 5). Link extraction uses HTML `<a>` tag parsing.
2. Deduplicate candidate URLs and filter against article history (previously seen articles are skipped).
3. Shuffle remaining candidates, with URLs from preferred sources placed first.
4. Process articles in batches of `batch_size`:
@@ -166,10 +170,12 @@ The system computes category gaps (how many articles each category still needs),
### 3.5 Finalization
1. If no articles were collected across both phases, return an error.
2. Order sections: user-defined categories first (in their configured order), then "Autre" if non-empty.
2. Order sections: user-defined categories first (in their configured order), then `Divers` if non-empty, then `Sans date` if non-empty.
3. Save the synthesis to the database with status "completed".
4. Record all used articles in article history for future deduplication.
For the complete technical algorithm, see `technical_specs.md` Section 5.
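The section ordering rule from the finalization steps can be illustrated with a small sketch. The `order_sections` function and the use of a map from category name to article titles are hypothetical. User-defined categories are emitted in configured order when present in the map, and the two default categories are appended only when non-empty.

```rust
use std::collections::HashMap;

/// Returns sections in display order: user-defined categories first
/// (in configured order), then `Divers`, then `Sans date`.
fn order_sections(
    user_categories: &[&str],
    mut sections: HashMap<String, Vec<String>>,
) -> Vec<(String, Vec<String>)> {
    let mut ordered = Vec::new();

    // User-defined categories keep their configured order.
    for &name in user_categories {
        if let Some(articles) = sections.remove(name) {
            ordered.push((name.to_string(), articles));
        }
    }

    // Default categories come last and are skipped when empty.
    for name in ["Divers", "Sans date"] {
        if let Some(articles) = sections.remove(name) {
            if !articles.is_empty() {
                ordered.push((name.to_string(), articles));
            }
        }
    }

    ordered
}
```

Taking the map by value and removing entries as they are emitted guarantees that no section is output twice, even if a user-defined category happens to share a name with a default one.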
## 4. Settings Overview
### 4.1 Per-Theme Settings
@@ -180,7 +186,7 @@ Managed on the theme management page. Each theme has its own values.
|---------|-------------|---------|
| Name | Display label for the theme | -- |
| Search topic | Subject for AI search queries | -- |
| Categories | Ordered list of category names | [] |
| Categories | Ordered list of user-defined category names (`Divers` and `Sans date` are always included by the system) | [] |
- Tests requiring external providers (for example `generation-live.spec.ts`) are non-blocking supplemental checks and must not be the only coverage for critical flows.
### Backend Unit Test Breakdown
| Source file | Tests | Coverage area |
@@ -144,7 +152,7 @@ The script:
6. Runs Playwright tests
7. Cleans up on exit (stops containers, removes volumes)
The `generation-live.spec.ts` test requires `OPENAI_TEST_API_KEY` to be set (in `e2e/.env.test` or environment). It exercises the real pipeline with an actual LLM API call.
The `generation-live.spec.ts` test requires `OPENAI_TEST_API_KEY` to be set (in `e2e/.env.test` or environment). It is a supplemental non-blocking check and does not replace deterministic CI coverage.
---
@@ -322,7 +330,7 @@ Pipeline integration tests in `pipeline_test.rs` use wiremock + MockLlmProvider:
4. **Use `createDbClient()`** from `e2e/helpers/auth.ts` when you need to verify database state directly.
5. **The `generation-live.spec.ts` test** is gated on `OPENAI_TEST_API_KEY`. It exercises the full pipeline including provenance and LLM log verification.
5. **The `generation-live.spec.ts` test** is gated on `OPENAI_TEST_API_KEY`. Treat it as supplemental coverage only.
---
@@ -363,6 +371,8 @@ As of the last audit, 10 of 141 frontend unit tests are failing. Investigate wit
### Critical Gaps
The following gaps must be addressed to satisfy the release gate policy.
| Gap | Priority | Description |
|-----|----------|-------------|
| Scheduled execution | Critical | `scheduler.rs` has zero tests. Autonomous process that generates syntheses and sends emails. |
@@ -15,10 +15,9 @@ The application is designed for individuals or small teams who want an automated
### 3.1 Multi-Theme Support
- Users create multiple themes, each with its own search topic, categories, and content settings.
Users create multiple independent themes, each with its own search topic, categories, personalized sources, and content settings. Syntheses are generated and tagged per theme. Deleting a theme preserves its existing syntheses.
- Each theme has its own set of personalized sources.
- Syntheses are generated per theme and tagged accordingly.
See `functional_specs.md` Section 2 for detailed behavior.
- Themes can be created, edited, and deleted independently. Deleting a theme preserves its existing syntheses.
### 3.2 Synthesis Generation
@@ -38,48 +37,25 @@ The application is designed for individuals or small teams who want an automated
### 3.4 Personalized Sources
- Users add web sources (blogs, news sites) per theme.
- Sources can be imported in bulk via text input, CSV upload, or added individually.
- Sources can be imported in bulk via text input, CSV upload, or added individually, always bound to the selected theme.
- Sources can be exported as CSV.
- Sources can be exported as CSV, always scoped to the selected theme.
- Sources can be marked as **preferred** (prioritized during generation -- processed before non-preferred sources).
- Sources can be marked as **preferred** (prioritized during generation -- processed before non-preferred sources), with preference state scoped per theme.
### 3.5 Brave Search Integration
- Optional alternative to LLM web search for Phase 2.
Optional alternative to LLM web search for Phase 2. Users provide their own Brave Search API key; when enabled, Phase 2 queries Brave instead of using LLM web grounding. See `functional_specs.md` Section 2.5.
- Users provide their own Brave Search API key.
- When enabled, Phase 2 queries the Brave Search API instead of using LLM web grounding, then scrapes and classifies the results.
### 3.6 Export and Sharing
- **Email**: send a synthesis to any email address (or to self) via Resend.
Syntheses can be exported as email (via Resend), PDF, or Markdown. See `functional_specs.md` Section 6.
- **PDF**: download a synthesis as a PDF file.
- **Markdown**: download a synthesis as a Markdown file.
### 3.7 Settings
#### Per-theme settings (content)
Settings are split into two levels: per-theme content settings (search topic, categories, max age, max items, summary length) and global pipeline settings (LLM provider/model, Brave Search, batch size, rate limits, article history retention, import/export). See `functional_specs.md` Section 4 for the complete settings reference.
- Theme name and search topic
- Categories (user-defined list)
- Max age of articles (days)
- Max items per category
- Summary detail level (short / medium / detailed)
#### Global settings (pipeline and AI)
- LLM provider and model selection (research model + web search model)
- Search agent behavior (custom instructions for the AI research prompt)
- Brave Search toggle and API key
- Batch size (articles processed in parallel)
- Source extraction window (number of sources per extraction wave)
- Max articles per source (diversity cap)
- Max links extracted per source
- Rate limiting (max requests / time window)
- Article history retention (days)
- Settings import/export (JSON)
### 3.8 Authentication
- Passwordless authentication via magic link emails.
Passwordless authentication via magic link emails with Cloudflare Turnstile captcha. Sessions use 30-day HttpOnly/SameSite cookies. See `architecture.md` Section 6 for the full security model.
- Cloudflare Turnstile captcha on login and registration.
- 30-day session cookies (HttpOnly, SameSite).
## 4. User Roles
@@ -95,13 +71,7 @@ The application is designed for individuals or small teams who want an automated
### 4.2 Admin
All user capabilities, plus:
All user capabilities, plus provider management (add/edit/enable/disable LLM providers and models), rate limit configuration (defaults per provider), and user management (view all users, promote/demote roles). The first admin is created via the `create-admin` CLI command. See `functional_specs.md` Section 5.
- **Provider management**: add, edit, enable/disable, and remove LLM providers and their available models. Users select from admin-curated providers.
- **Rate limit configuration**: set default rate limits per provider (max requests / time window). Users can override with their own values.
- **User management**: view all users, promote users to admin or demote admins to user.
The first admin is created via a CLI command (`create-admin`).
## 5. Non-Functional Requirements
@@ -139,3 +109,4 @@ The first admin is created via a CLI command (`create-admin`).
- Job store with TTL for expired generation jobs.
- Scheduled generation with double-run prevention (`last_run_at` tracking).
- Panic recovery and timeout handling for generation tasks.
- Release gating in CI requires deterministic coverage for critical autonomous flows (notably scheduler execution and SSE progress behavior).
10. **Batch trace buffer** — `pending_traces: Vec<ArticleHistoryEntry>` accumulates all article history writes; flushed with `db::article_history::batch_insert_entries` at phase boundaries.
### Phase 1: Personalized Sources
@@ -663,7 +670,7 @@ Processing in batches of `settings.batch_size` (minimum 1). For each batch:
4. **Save synthesis** — insert into `syntheses` table with `job_id`, `week` (ISO week), `sections` (JSONB), `status: completed`, `theme_id`.
5. **Record used articles** — for each article in the final synthesis, build trace with `status: "used"`, `synthesis_id`, and correct `source_type` (inferred from `url_source`). Batch-insert into `article_history`.
@@ -754,7 +761,7 @@ All calls use structured JSON output (response_schema defines the expected shape