AI Weekly Synth -- Technical Specifications
1. Backend Tech Stack
| Dependency | Version | Purpose |
|---|---|---|
| axum | 0.8 | Web framework (macros, multipart) |
| tokio | 1 | Async runtime (full features) |
| tower | 0.5 | Middleware composition |
| tower-http | 0.6 | CORS, static files, tracing, headers |
| sqlx | 0.8 | Async Postgres driver (runtime-tokio, tls-rustls, uuid, chrono, json, migrate) |
| reqwest | 0.12 | HTTP client (JSON) |
| serde / serde_json | 1 | Serialization/deserialization |
| chrono | 0.4 | Date/time handling (serde feature) |
| aes-gcm | 0.10 | AES-256-GCM encryption |
| zeroize | 1 | Secure memory zeroing |
| sha2 | 0.10 | SHA-256 hashing |
| rand | 0.8 | Random number generation |
| base64 | 0.22 | Base64 encoding |
| hex | 0.4 | Hex encoding/decoding |
| async-trait | 0.1 | Async trait objects |
| tracing / tracing-subscriber | 0.1 / 0.3 | Structured logging (env-filter, json) |
| dotenvy | 0.15 | .env file loading |
| clap | 4 | CLI argument parsing |
| scraper | 0.22 | HTML parsing (CSS selectors) |
| ego-tree | 0.10 | Tree data structure (used by scraper) |
| url | 2 | URL parsing and validation |
| email_address | 0.2 | Email validation |
| anyhow | 1 | Error context |
| thiserror | 2 | Error type derivation |
| uuid | 1 | UUID v4 generation (serde feature) |
| dashmap | 6 | Concurrent hash maps |
| tokio-stream | 0.1 | Stream utilities for SSE |
| futures | 0.3 | Async stream combinators |
| printpdf | 0.7 | PDF generation |
Dev dependencies: tower (util), http-body-util, wiremock 0.6.
Rust edition: 2021.
2. Frontend Tech Stack
| Dependency | Version | Purpose |
|---|---|---|
| solid-js | ^1.9.0 | Reactive UI framework |
| @solidjs/router | ^0.15.0 | Client-side routing |
| lucide-solid | ^0.475.0 | Icon library |
| date-fns | ^4.1.0 | Date formatting |
| tailwindcss | ^4.1.0 | Utility-first CSS (v4) |
| @tailwindcss/vite | ^4.1.0 | Tailwind Vite plugin |
| vite | ^6.2.0 | Build tool and dev server |
| vite-plugin-solid | ^2.11.0 | SolidJS Vite integration |
| typescript | ~5.8.0 | Type checking |
| vitest | ^3.0.0 | Unit testing |
| @solidjs/testing-library | ^0.8.0 | Component testing |
| jsdom | ^25.0.0 | DOM environment for tests |
Frontend Routes
| Path | Component | Auth | Description |
|---|---|---|---|
| /login | Login | Public | Login page |
| /register | Register | Public | Registration page |
| /auth/verify | AuthVerify | Public | Magic link verification |
| / | Home | Protected | Dashboard / synthesis list |
| /settings | Settings | Protected | User settings |
| /themes | ThemeManager | Protected | Theme CRUD + source management |
| /generate | GenerateSynthesis | Protected | Generation trigger + progress |
| /synthesis/:id | SynthesisDetail | Protected | Full synthesis view |
| /article-history | ArticleHistory | Protected | Article history browser |
| /llm-logs/:jobId | LlmLogs | Protected | LLM call log viewer |
| /admin/providers | AdminProviders | Admin | Provider configuration |
| /admin/rate-limits | AdminRateLimits | Admin | Rate limit configuration |
| /admin/users | AdminUsers | Admin | User management |
3. Database Schema
3.1 users
| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| email | TEXT | NOT NULL, UNIQUE |
| display_name | TEXT | nullable |
| role | TEXT | NOT NULL, DEFAULT 'user', CHECK (user/admin) |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
Indexes: idx_users_email on (email).
3.2 sessions
| Column | Type | Constraints |
|---|---|---|
| session_hash | TEXT | PK (SHA-256 of raw token) |
| user_id | UUID | NOT NULL, FK users(id) CASCADE |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
| expires_at | TIMESTAMPTZ | NOT NULL |
| last_active_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
| ip_address | TEXT | nullable |
| user_agent | TEXT | nullable |
Indexes: idx_sessions_user_id, idx_sessions_expires_at.
3.3 magic_tokens
| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| email | TEXT | NOT NULL |
| token_hash | TEXT | NOT NULL, UNIQUE |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
| expires_at | TIMESTAMPTZ | NOT NULL |
| used | BOOLEAN | NOT NULL, DEFAULT false |
Indexes: idx_magic_tokens_email, idx_magic_tokens_expires.
3.4 settings
Per-user pipeline configuration. One row per user (user_id is the PK).
| Column | Type | Constraints |
|---|---|---|
| user_id | UUID | PK, FK users(id) CASCADE |
| max_articles_per_source | INTEGER | NOT NULL, DEFAULT 3 |
| max_links_per_source | INTEGER | NOT NULL, DEFAULT 8 |
| use_brave_search | BOOLEAN | NOT NULL, DEFAULT false |
| article_history_days | INTEGER | NOT NULL, DEFAULT 90 |
| batch_size | INTEGER | NOT NULL, DEFAULT 5 |
| source_extraction_window | INTEGER | NOT NULL, DEFAULT 3 |
| search_agent_behavior | TEXT | NOT NULL, DEFAULT '' |
| ai_provider | TEXT | NOT NULL, DEFAULT '' |
| ai_model | TEXT | NOT NULL, DEFAULT '' |
| ai_model_websearch | TEXT | NOT NULL, DEFAULT '' |
| rate_limit_max_requests | INTEGER | nullable |
| rate_limit_time_window_seconds | INTEGER | nullable |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
3.5 themes
Per-user topic configurations with content settings.
| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| user_id | UUID | NOT NULL, FK users(id) CASCADE |
| name | TEXT | NOT NULL |
| theme | TEXT | NOT NULL (search topic) |
| categories | JSONB | NOT NULL, DEFAULT '[]' |
| max_items_per_category | INTEGER | NOT NULL, DEFAULT 4 |
| max_age_days | INTEGER | NOT NULL, DEFAULT 7 |
| summary_length | INTEGER | NOT NULL, DEFAULT 3 |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
Indexes: idx_themes_user_id.
categories stores user-defined categories only. At runtime, category assignment always includes Divers and Sans date.
3.6 sources
User-curated news source URLs, always tied to a theme.
| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| user_id | UUID | NOT NULL, FK users(id) CASCADE |
| title | VARCHAR(200) | NOT NULL, CHECK length 1-200 |
| url | VARCHAR(1000) | NOT NULL, CHECK length <= 1000 |
| theme_id | UUID | NOT NULL, FK themes(id) CASCADE |
| is_preferred | BOOLEAN | NOT NULL, DEFAULT false |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
Indexes: idx_sources_user_id, UNIQUE idx_sources_user_id_url on (user_id, url).
3.7 syntheses
Generated synthesis results with JSONB section data.
| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| user_id | UUID | NOT NULL, FK users(id) CASCADE |
| week | VARCHAR(10) | NOT NULL (ISO week string) |
| sections | JSONB | NOT NULL, DEFAULT '[]' |
| status | VARCHAR(20) | NOT NULL, DEFAULT 'completed' |
| job_id | UUID | nullable |
| theme_id | UUID | nullable, FK themes(id) SET NULL |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
Indexes: idx_syntheses_user_id_created_at on (user_id, created_at DESC).
JSONB structure for sections:
[
  {
    "title": "Category Name",
    "items": [
      { "title": "Article Title", "url": "https://...", "summary": "...", "date": "2026-03-25" }
    ]
  }
]
3.8 theme_schedules
Automated generation schedules, one per theme.
| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| theme_id | UUID | NOT NULL, UNIQUE, FK themes(id) CASCADE |
| user_id | UUID | NOT NULL, FK users(id) CASCADE |
| enabled | BOOLEAN | NOT NULL, DEFAULT true |
| days | JSONB | NOT NULL, DEFAULT '[]' (e.g. ["mon","fri"]) |
| time_utc | TEXT | NOT NULL, DEFAULT '08:00' (HH:MM) |
| emails | JSONB | NOT NULL, DEFAULT '[]' (up to 3 addresses) |
| last_run_at | TIMESTAMPTZ | nullable |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
Indexes: idx_theme_schedules_enabled (partial, WHERE enabled = true).
3.9 article_history
Article URL deduplication and full provenance tracing.
| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| user_id | UUID | NOT NULL, FK users(id) CASCADE |
| url_hash | TEXT | NOT NULL (SHA-256 of normalized URL) |
| url | TEXT | NOT NULL |
| title | TEXT | NOT NULL, DEFAULT '' |
| source_type | TEXT | NOT NULL, DEFAULT 'unknown' |
| source_url | TEXT | nullable |
| category | TEXT | nullable |
| synthesis_id | UUID | nullable, FK syntheses(id) SET NULL |
| status | TEXT | NOT NULL, DEFAULT 'used' |
| scraped_ok | BOOLEAN | NOT NULL, DEFAULT true |
| job_id | UUID | NOT NULL |
| published_date | TEXT | nullable |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
Indexes: idx_article_history_user_url on (user_id, url_hash), idx_article_history_job_id.
Status values: used, filtered_history, filtered_diversity, filtered_not_article, filtered_too_old, filtered_empty, filtered_homepage, filtered_cross_phase_dedup.
Source type values: personalized_source, brave_search, web_search.
3.10 llm_call_log
Full LLM interaction logging for debugging and analysis.
| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| user_id | UUID | NOT NULL, FK users(id) CASCADE |
| job_id | UUID | NOT NULL |
| call_type | TEXT | NOT NULL |
| model | TEXT | NOT NULL |
| system_prompt | TEXT | NOT NULL, DEFAULT '' |
| user_prompt | TEXT | NOT NULL, DEFAULT '' |
| response_body | TEXT | NOT NULL, DEFAULT '' |
| duration_ms | INTEGER | NOT NULL, DEFAULT 0 |
| article_url | TEXT | nullable |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
Indexes: idx_llm_call_log_job_id, idx_llm_call_log_user_id on (user_id, created_at).
3.11 admin_providers
Admin-curated catalog of LLM providers and their models.
| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| provider_name | VARCHAR(50) | NOT NULL, UNIQUE |
| display_name | VARCHAR(100) | NOT NULL |
| models_scraping | JSONB | NOT NULL, DEFAULT '[]' |
| models_websearch | JSONB | NOT NULL, DEFAULT '[]' |
| is_enabled | BOOLEAN | NOT NULL, DEFAULT true |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
Indexes: idx_admin_providers_enabled (partial, WHERE is_enabled = true).
Seeded with: gemini, openai, anthropic.
JSONB model structure:
[{"model_id": "gemini-2.5-pro", "display_name": "Gemini 2.5 Pro", "is_default": true}]
3.12 admin_rate_limits
Per-provider rate limit configuration.
| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| provider_name | VARCHAR(50) | NOT NULL, UNIQUE, FK admin_providers(provider_name) CASCADE |
| max_requests | INTEGER | NOT NULL, DEFAULT 30 |
| time_window_seconds | INTEGER | NOT NULL, DEFAULT 60 |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
Seeded defaults: gemini 29/60s, openai 50/60s, anthropic 40/60s.
3.13 user_api_keys
Encrypted user LLM API keys.
| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| user_id | UUID | NOT NULL, FK users(id) CASCADE |
| provider_name | VARCHAR(50) | NOT NULL |
| encrypted_key | BYTEA | NOT NULL |
| nonce | BYTEA | NOT NULL |
| key_prefix | VARCHAR(20) | NOT NULL |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
Constraint: UNIQUE(user_id, provider_name). Valid providers: gemini, openai, anthropic, brave_search.
3.14 audit_log
Admin mutation audit trail.
| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| admin_user_id | UUID | nullable, FK users(id) SET NULL |
| action | VARCHAR(100) | NOT NULL |
| target_type | VARCHAR(50) | nullable |
| target_id | VARCHAR(255) | nullable |
| details | JSONB | nullable |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
Indexes: idx_audit_log_created_at (DESC), idx_audit_log_admin_user.
4. API Endpoints
All endpoints are prefixed with /api/v1. Responses are JSON. Errors follow the shape { "error": "message" }.
4.1 Authentication
POST /auth/register
- Auth: Public
- Body: { email: string, display_name?: string, turnstile_token: string }
- Response: { message: string }
- Sends magic link email. Rate limited.
POST /auth/login
- Auth: Public
- Body: { email: string, turnstile_token: string }
- Response: { message: string }
- Sends magic link email. Rate limited.
GET /auth/verify?token=...&email=...
- Auth: Public
- Response: Redirect to frontend with session cookie set.
POST /auth/verify
- Auth: Public
- Body: { token: string, email: string }
- Response: { message: string, user: User }
- Sets session HttpOnly cookie (30-day expiry).
POST /auth/logout
- Auth: Authenticated
- Response: { message: string }
- Clears session cookie and deletes DB session.
GET /auth/me
- Auth: Authenticated
- Response: { id, email, display_name, role, created_at }
4.2 Settings
GET /settings
- Auth: Authenticated
- Response: UserSettings (creates defaults if not exists)
PUT /settings
- Auth: Authenticated
- Body: UpdateSettingsRequest (all fields required)
- Validation: max_articles_per_source 1-10, max_links_per_source 1-30, batch_size 1-20, source_extraction_window 1-10, article_history_days 0-365, search_agent_behavior max 2000 chars, ai_provider/ai_model/ai_model_websearch max 100 chars.
- Response: Updated UserSettings
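The numeric ranges listed for PUT /settings can be checked with a small table-driven validator. The following is only a sketch: the SettingsInput struct and the validate_settings function are hypothetical names, the three ai_* length checks are omitted for brevity, and counting the 2000-character limit in characters rather than bytes is an assumption.

```rust
/// Hypothetical subset of the fields carried by UpdateSettingsRequest.
pub struct SettingsInput {
    pub max_articles_per_source: i32,
    pub max_links_per_source: i32,
    pub batch_size: i32,
    pub source_extraction_window: i32,
    pub article_history_days: i32,
    pub search_agent_behavior: String,
}

/// Returns the name of the first field that is out of range, if any.
pub fn validate_settings(s: &SettingsInput) -> Result<(), &'static str> {
    // (field name, value, inclusive minimum, inclusive maximum)
    let checks = [
        ("max_articles_per_source", s.max_articles_per_source, 1, 10),
        ("max_links_per_source", s.max_links_per_source, 1, 30),
        ("batch_size", s.batch_size, 1, 20),
        ("source_extraction_window", s.source_extraction_window, 1, 10),
        ("article_history_days", s.article_history_days, 0, 365),
    ];
    for (name, value, min, max) in checks {
        if value < min || value > max {
            return Err(name);
        }
    }
    // Assumption: the limit is counted in characters, not bytes.
    if s.search_agent_behavior.chars().count() > 2000 {
        return Err("search_agent_behavior");
    }
    Ok(())
}
```

Returning the offending field name keeps the sketch short; a real handler would presumably map it to the { "error": "message" } shape.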
4.3 Themes
GET /themes
- Auth: Authenticated
- Response: ThemeResponse[]
POST /themes
- Auth: Authenticated
- Body: { name, theme, categories: string[], max_items_per_category?, max_age_days?, summary_length? }
- Validation: name non-empty max 200 chars, categories 0-20 non-empty entries, max_items 1-50, max_age 1-365, summary_length 1-3.
- Notes: theme creation is valid with an empty user-defined categories list. The system always includes Divers and Sans date.
- Response: ThemeResponse
PUT /themes/{id}
- Auth: Authenticated (owner only)
- Body: UpdateThemeRequest (all fields optional)
- Response: ThemeResponse
DELETE /themes/{id}
- Auth: Authenticated (owner only)
- Response: 204 No Content
4.4 Schedules
GET /themes/{id}/schedule
- Auth: Authenticated (theme owner)
- Response: ScheduleResponse | null with HTTP 200
PUT /themes/{id}/schedule
- Auth: Authenticated (theme owner)
- Body: { enabled, days: string[], time_utc: "HH:MM", emails: string[] }
- Validation: days from mon-sun, time HH:MM format, max 3 emails.
- Response: ScheduleResponse
DELETE /themes/{id}/schedule
- Auth: Authenticated (theme owner)
- Response: 204 No Content
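The schedule validation rules (day codes, HH:MM time, at most 3 emails) could look roughly like the sketch below. The function names are hypothetical, the intermediate day codes (tue, wed, thu, sat) are assumed from the mon-sun range, and per-address email syntax checking is left out.

```rust
/// Assumed day codes; the spec only names "mon" through "sun".
const DAY_CODES: [&str; 7] = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"];

/// Returns true for a zero-padded 24-hour "HH:MM" string.
pub fn is_valid_time_utc(s: &str) -> bool {
    let b = s.as_bytes();
    if b.len() != 5 || b[2] != b':' {
        return false;
    }
    if ![b[0], b[1], b[3], b[4]].iter().all(|c| c.is_ascii_digit()) {
        return false;
    }
    let hour = (b[0] - b'0') * 10 + (b[1] - b'0');
    let minute = (b[3] - b'0') * 10 + (b[4] - b'0');
    hour < 24 && minute < 60
}

/// Applies the three documented rules and reports the first violation.
pub fn validate_schedule(days: &[&str], time_utc: &str, emails: &[&str]) -> Result<(), String> {
    if let Some(bad) = days.iter().find(|d| !DAY_CODES.contains(*d)) {
        return Err(format!("unknown day code: {bad}"));
    }
    if !is_valid_time_utc(time_utc) {
        return Err("time_utc must be HH:MM".to_string());
    }
    if emails.len() > 3 {
        return Err("at most 3 emails".to_string());
    }
    Ok(())
}
```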
4.5 Sources
GET /sources?theme_id=...
- Auth: Authenticated
- Query: theme_id is required
- Response: SourceResponse[]
POST /sources
- Auth: Authenticated
- Body: { title, url, theme_id }
- Validation: title non-empty max 200, URL http(s) max 1000 chars.
- Response: SourceResponse
PUT /sources/preferred
- Auth: Authenticated
- Body: { theme_id: UUID, source_ids: UUID[] }
- Note: preferred state is scoped per theme.
- Response: { updated: number }
DELETE /sources/{id}
- Auth: Authenticated (owner only)
- Response: 204 No Content
POST /sources/bulk
- Auth: Authenticated
- Body: { sources: CreateSourceRequest[], theme_id: UUID }
- Response: { imported, skipped, errors }
POST /sources/import-csv
- Auth: Authenticated
- Body: Multipart file upload (CSV: title,url) + required theme_id
- Response: { imported, skipped, errors }
GET /sources/export-csv
- Auth: Authenticated
- Query: theme_id is required
- Scope: exports sources for the selected theme only
- Response: CSV file download
4.6 Generation
POST /syntheses/generate
- Auth: Authenticated
- Body: { theme_id: UUID }
- Response: { job_id: UUID }
- Creates job in JobStore, spawns background generation task. Returns 409 if user already has active job.
GET /syntheses/generate/{job_id}/progress
- Auth: Authenticated (job owner)
- Response: SSE stream of ProgressEvent
- Events: progress (step, message, percent), complete (synthesis_id), error (message).
POST /syntheses/generate/{job_id}/stop
- Auth: Authenticated (job owner)
- Response: { message: string }
- Sets cooperative cancellation flag.
4.7 Syntheses
GET /syntheses
- Auth: Authenticated
- Response: SynthesisListItem[] (with section summaries, theme info)
GET /syntheses/{id}
- Auth: Authenticated (owner only)
- Response: SynthesisResponse (full sections data)
DELETE /syntheses/{id}
- Auth: Authenticated (owner only)
- Response: 204 No Content
POST /syntheses/{id}/send-email
- Auth: Authenticated
- Body: { email: string }
- Response: { message: string }
GET /syntheses/{id}/export/markdown
- Auth: Authenticated
- Response: Markdown file download
GET /syntheses/{id}/export/pdf
- Auth: Authenticated
- Response: PDF file download
4.8 Article History & Provenance
GET /article-history?limit=&offset=&job_id=&status=
- Auth: Authenticated
- Response: { items: ArticleHistoryEntry[], total: number }
DELETE /article-history
- Auth: Authenticated
- Response: { deleted: number }
GET /syntheses/{id}/provenance
- Auth: Authenticated
- Response: ArticleHistoryEntry[] (articles with status "used" for this synthesis's job_id)
4.9 LLM Call Logs
GET /llm-logs/{job_id}
- Auth: Authenticated
- Response: LlmCallLogEntry[]
4.10 User API Keys
GET /user/api-keys
- Auth: Authenticated
- Response: ApiKeyResponse[] (id, provider_name, key_prefix, timestamps; never the full key)
POST /user/api-keys
- Auth: Authenticated
- Body: { provider_name, api_key }
- Validation: provider in (gemini, openai, anthropic, brave_search), key 8-500 chars.
- Response: ApiKeyResponse
- Encrypts key with AES-256-GCM before storage; upserts (one key per user per provider).
DELETE /user/api-keys/{provider}
- Auth: Authenticated
- Response: 204 No Content
POST /user/api-keys/{provider}/test
- Auth: Authenticated
- Response: { success: boolean, message: string }
- Decrypts key, calls provider test endpoint.
POST /user/api-keys/export
- Auth: Authenticated
- Response: { keys: [{ provider_name, api_key }] }
- Decrypts and returns all keys (used for backup/migration).
4.11 Public Configuration
GET /config/providers
- Auth: Authenticated
- Response: ProviderConfigResponse[] (enabled providers with model lists for scraping and websearch)
4.12 Admin Endpoints
All admin endpoints require AdminUser extractor (role = admin).
GET /admin/providers
- Response: AdminProviderResponse[]
POST /admin/providers
- Body: CreateProviderRequest
- Validation: provider_name in (gemini, openai, anthropic), at least one model per list, at most one default per list.
- Response: AdminProviderResponse
PUT /admin/providers/{id}
- Body: UpdateProviderRequest (all fields optional)
- Response: AdminProviderResponse
DELETE /admin/providers/{id}
- Response: 204 No Content
GET /admin/rate-limits
- Response: RateLimitResponse[]
PUT /admin/rate-limits/{provider_name}
- Body: { max_requests: 1-1000, time_window_seconds: 1-3600 }
- Response: RateLimitResponse
- Hot-reloads the in-memory provider rate limiter.
GET /admin/users
- Response: AdminUserResponse[]
PUT /admin/users/{id}/role
- Body: { role: "user" | "admin" }
- Response: { message: string }
GET /health
- Auth: Public
- Response:
{ status: "ok" }
5. Generation Pipeline — Full Algorithm
Startup & Background Tasks
- Session cleanup: an hourly background task deletes expired DB sessions (db::sessions::delete_expired).
- Job store TTL: expired job entries (older than 1 hour) are cleaned up via JobStore::cleanup_expired.
Generation Lifecycle
POST /api/v1/syntheses/generate creates a job in the JobStore, then spawns two nested tasks:
- Inner task: wraps run_generation in a 15-minute tokio::time::timeout. If the timeout fires, sends an Error progress event and releases the user lock.
- Outer task: monitors the inner task's JoinHandle for panics. If the inner task panics, sends an Error progress event and releases the user lock.
Progress is streamed to clients via a tokio::sync::watch channel (SSE endpoint subscribes to it).
Initialization
- Load user settings from DB (provider, models, batch_size, rate limits, etc.)
- Cleanup: delete old article history entries (>N days, dropped only) + truncate old LLM call logs
- Validate: runtime category set always includes Divers and Sans date even when no user-defined categories are configured.
- Load theme: categories, max_items_per_category, max_age_days, summary_length
- Load user sources (personalized URLs filtered by theme_id)
- Resolve LLM provider: decrypt user's API key, create provider instance (Arc<dyn LlmProvider>)
- Resolve models: research model + web-search model (user override or admin default)
- Setup rate limiter: per-user or global provider limiter
- Initialize tracking structures: article_scraped (category→articles), source_counts (per-domain article count), url_source (per-article source), filled_counts (per-category article count), seen_urls (cross-phase dedup), classification_categories (user categories + Divers; Sans date is assigned by no-date routing)
- Batch trace buffer: pending_traces: Vec<ArticleHistoryEntry> accumulates all article history writes; flushed with db::article_history::batch_insert_entries at phase boundaries.
Phase 1: Personalized Sources
Skipped entirely if user has 0 sources.
1a. Windowed source extraction
- Query article_history for the last source used. Reorder sources so the first source follows the last one used (rolling window).
- Separate preferred sources (processed first) from non-preferred, preserving rotation order within each group.
- Process sources in waves of source_extraction_window size:
  - For each source in the wave: fetch page HTML, extract up to max_links_per_source article URLs via HTML parsing (same-domain, non-homepage, no static assets).
  - SSRF check performed on each source URL before fetching.
  - Deduplicate candidate URLs (case-insensitive, cross-source via seen_urls).
  - Filter against article history: hash each URL (normalized: lowercase, strip fragments/UTM params/trailing slashes), batch-query article_history → remove matches. Trace dropped articles as status: filtered_history.
  - Preferred-first shuffle: shuffle preferred URLs separately from non-preferred, then concatenate (preferred first).
  - Track url → source in url_source.
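The rolling-window reordering and the preferred-first split can be sketched as follows. This is an illustration under assumptions, not the project's code: the Source struct, order_sources and waves are hypothetical names, and the last source used is identified here by its URL.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Source {
    pub url: String,
    pub is_preferred: bool,
}

/// Rotates `sources` so processing resumes after `last_used_url`, then
/// moves preferred sources to the front while keeping rotation order.
pub fn order_sources(mut sources: Vec<Source>, last_used_url: Option<&str>) -> Vec<Source> {
    if let Some(last) = last_used_url {
        if let Some(pos) = sources.iter().position(|s| s.url == last) {
            // The source right after the last one used becomes the first.
            let len = sources.len();
            sources.rotate_left((pos + 1) % len);
        }
    }
    // Iterator::partition keeps relative order inside each group.
    let (preferred, others): (Vec<Source>, Vec<Source>) =
        sources.into_iter().partition(|s| s.is_preferred);
    preferred.into_iter().chain(others).collect()
}

/// Splits the ordered sources into waves of `window` entries
/// (source_extraction_window in the settings).
pub fn waves(sources: &[Source], window: usize) -> Vec<&[Source]> {
    sources.chunks(window.max(1)).collect()
}
```

With sources a, b (preferred), c, d and b as the last source used, the rotation yields c, d, a, b and the preferred-first split then yields b, c, d, a.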
1b. Scrape, classify, and summarize articles (batched)
Processing in batches of settings.batch_size (minimum 1). For each batch:
Batch assembly: Pull up to batch_size candidates, skipping any where source_counts[domain] >= max_articles_per_source (traced as filtered_diversity).
Phase A — Scrape batch in parallel (JoinSet):
- SSRF check (no private IPs), 15s timeout, 5MB body limit.
- HTML parsing for title (<title>, og:title), date (meta tags, JSON-LD, <time>), body (strip scripts/nav), soft-404 detection.
- If article body is empty, is a soft-404, or is too old: trace as filtered_empty/filtered_too_old and skip.
Phase B — Classify/summarize batch in parallel (JoinSet):
- Check rate limit before classifying (waits up to 60s, then errors).
- Send article (title + body snippet based on summary_length: 500/2000/4000 chars) + categories + "Divers" to LLM.
- LLM returns {title, summary, category, date, is_article}.
- is_article check: if false, trace as filtered_not_article and skip.
- Date fallback: if LLM returned a date and it exceeds max_age_days, trace as filtered_too_old and skip.
- No-date routing: if no date found (neither scraper nor LLM), route to Sans date category.
- assign_category() helper: validates category, falls back to "Divers" if unknown or full. If "Divers" is also full, drops the article.
- LLM call logged with full prompt/response/timing.
- Add article to article_scraped, increment filled_counts and source_counts.
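The fallback behaviour described for assign_category() can be illustrated with a small pure function. The signature below is an assumption (the real helper's parameters are not given in this document); only the decision rule, known category with room, else "Divers" with room, else drop, comes from the text.

```rust
use std::collections::HashMap;

pub const FALLBACK: &str = "Divers";

/// Picks the category an article lands in, or None when it must be dropped.
/// `allowed` holds the user categories plus "Divers"; `filled` counts the
/// articles already accepted per category.
pub fn assign_category(
    proposed: &str,
    allowed: &[String],
    filled: &HashMap<String, usize>,
    max_items: usize,
) -> Option<String> {
    let has_room = |name: &str| filled.get(name).copied().unwrap_or(0) < max_items;
    let known = allowed.iter().any(|c| c.as_str() == proposed);
    if known && has_room(proposed) {
        return Some(proposed.to_string());
    }
    // Unknown or full category: fall back to "Divers" if it still has room.
    if has_room(FALLBACK) {
        return Some(FALLBACK.to_string());
    }
    // "Divers" is full as well: the article is dropped.
    None
}
```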
Early exit: After each batch, if total articles ≥ (num_categories + 1) × max_items_per_category, stop.
Wave check: After each wave, if synthesis is full, skip remaining waves.
Trace flush: Pending traces batch-inserted into article_history between waves.
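The early-exit condition is a single comparison. In the sketch below the function name is hypothetical, and reading the "+ 1" as the extra "Divers" category is an assumption based on how classification_categories is described.

```rust
/// True once the synthesis holds enough articles to stop processing.
/// Assumption: the "+ 1" stands for the "Divers" fallback category.
pub fn synthesis_is_full(
    total_articles: usize,
    num_categories: usize,
    max_items_per_category: usize,
) -> bool {
    total_articles >= (num_categories + 1) * max_items_per_category
}
```

For example, with 3 user categories and max_items_per_category = 4, the threshold is (3 + 1) × 4 = 16 articles.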
Phase 2: Web Search Fallback
Skipped if all user-defined categories are already filled.
2a. Compute category gaps
For each user category: needed = max_items_per_category - already_filled. Only proceed if any category needs more.
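A minimal sketch of the gap computation, assuming filled_counts is a map from category name to article count (the function name and types are illustrative):

```rust
use std::collections::HashMap;

/// Returns (category, needed) for every user category that is still short
/// of max_items_per_category, preserving the user's category order.
pub fn category_gaps(
    categories: &[String],
    filled_counts: &HashMap<String, usize>,
    max_items_per_category: usize,
) -> Vec<(String, usize)> {
    categories
        .iter()
        .filter_map(|c| {
            let filled = filled_counts.get(c).copied().unwrap_or(0);
            // saturating_sub guards against a count above the maximum.
            let needed = max_items_per_category.saturating_sub(filled);
            (needed > 0).then(|| (c.clone(), needed))
        })
        .collect()
}
```

An empty result corresponds to the case where Phase 2 is skipped.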
2b. Choose path: Brave Search or LLM web search
Selected by settings.use_brave_search.
Path A: Brave Search (use_brave_search = true)
- Resolve and decrypt the user's Brave Search API key (error if not configured).
- Query: "{theme} actualites", up to 20 results, freshness mapped from max_age_days (pd/pw/pm/py).
- Filter results through filter_phase2_url(): homepage filter → cross-phase dedup → article history → source diversity.
- Batch scrape + classify (same as Phase 1b, source_type = "brave_search").
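The document does not state the thresholds used to map max_age_days to a freshness code, so the cut-offs below (1, 7 and 31 days) are assumptions chosen to match the usual meaning of pd (past day), pw (past week), pm (past month) and py (past year). Function names are hypothetical.

```rust
/// Maps max_age_days to a freshness code.
/// Assumption: thresholds of 1, 7 and 31 days; the real mapping may differ.
pub fn freshness_for(max_age_days: u32) -> &'static str {
    match max_age_days {
        0..=1 => "pd",
        2..=7 => "pw",
        8..=31 => "pm",
        _ => "py",
    }
}

/// Builds the query text in the "{theme} actualites" form.
pub fn brave_query(theme: &str) -> String {
    format!("{theme} actualites")
}
```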
Path B: LLM Web Search (use_brave_search = false)
- Build search prompt with theme, categories, gap counts.
- Call LLM with model_websearch. Returns {category_0: [{title, url, summary}], ...}.
- Filter URLs through filter_phase2_url().
- Scrape each result sequentially. Keep LLM-provided title/summary (no re-classification). source_type = "web_search".
Save + Record
- Error if empty: if all article lists are empty and generation wasn't cancelled, return error.
- Order sections: user-defined categories first (in order), then Divers if non-empty, then Sans date if non-empty.
- Sanitize: strip \u0000 null bytes from JSON (PostgreSQL JSONB requirement).
- Save synthesis: insert into syntheses table with job_id, week (ISO week), sections (JSONB), status: completed, theme_id.
- Record used articles: for each article in the final synthesis, build trace with status: "used", synthesis_id, and correct source_type (inferred from url_source). Batch-insert into article_history.
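The section ordering and null-byte stripping might be sketched like this. The types are simplified (items are plain strings), the function names are hypothetical, and emitting user-defined categories even when they are empty is an assumption, since the text only conditions Divers and Sans date on being non-empty.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub struct Section {
    pub title: String,
    pub items: Vec<String>, // simplified: one string per article
}

/// Orders sections: user categories first (in their order), then
/// "Divers" and "Sans date" only when they contain articles.
pub fn order_sections(
    user_categories: &[String],
    mut by_category: HashMap<String, Vec<String>>,
) -> Vec<Section> {
    let mut sections = Vec::new();
    for name in user_categories {
        // Assumption: an empty user category still yields a section.
        let items = by_category.remove(name).unwrap_or_default();
        sections.push(Section { title: name.clone(), items });
    }
    for name in ["Divers", "Sans date"] {
        if let Some(items) = by_category.remove(name) {
            if !items.is_empty() {
                sections.push(Section { title: name.to_string(), items });
            }
        }
    }
    sections
}

/// Removes NUL characters, which PostgreSQL rejects inside JSONB text.
pub fn strip_nul(s: &str) -> String {
    s.replace('\0', "")
}
```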
Shared Helpers
- build_trace_entry(): constructs an ArticleHistoryEntry from an ArticleTrace struct. Never writes to DB directly; caller accumulates in pending_traces.
- scrape_and_classify_batch(): shared batch processing logic used by Phase 1 and Phase 2 Brave paths.
- assign_category(): validates LLM-returned category, falls back to "Divers", drops if all full.
- filter_phase2_url(): async helper applying homepage/dedup/history/diversity filters for Phase 2.
- scrape_single_article(): thin wrapper around scraper::scrape_url returning (body_text, page_title, final_url, drop_reason).
- hash_article_url(): normalizes URL (strips fragments, UTM params, trailing slashes, lowercases) then SHA-256 hashes.
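The normalization step that precedes hashing in hash_article_url() can be approximated with the standard library alone. This is a sketch, not the project's function: normalize_url is a hypothetical name, the order of the steps is an assumption, and the final SHA-256 step (which would need the sha2 crate) is left out so the example stays dependency-free.

```rust
/// Normalizes a URL: drops the fragment, removes utm_* query parameters,
/// strips trailing slashes from the path, then lowercases the result.
pub fn normalize_url(raw: &str) -> String {
    // 1. Drop the fragment.
    let without_fragment = raw.split('#').next().unwrap_or("");
    // 2. Separate the query string and filter out utm_* parameters.
    let (base, query) = match without_fragment.split_once('?') {
        Some((b, q)) => (b, q),
        None => (without_fragment, ""),
    };
    let kept: Vec<&str> = query
        .split('&')
        .filter(|p| !p.is_empty() && !p.to_ascii_lowercase().starts_with("utm_"))
        .collect();
    // 3. Strip trailing slashes from the path part.
    let mut out = base.trim_end_matches('/').to_string();
    if !kept.is_empty() {
        out.push('?');
        out.push_str(&kept.join("&"));
    }
    // 4. Lowercase everything.
    out.to_lowercase()
}
```

Two spellings of the same article, such as one with a tracking parameter and a trailing slash and one without, then normalize to the same string and would therefore produce the same hash.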
6. LLM Provider Abstraction
Trait Definition
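use async_trait::async_trait;
use serde_json::Value;

#[async_trait]
pub trait LlmProvider: Send + Sync {
    // Provider identifier string.
    fn provider_id(&self) -> &str;
    // Sends the prompts and returns JSON shaped by response_schema.
    async fn call_llm(&self, model: &str, system_prompt: &str,
                      user_prompt: &str, response_schema: &Value)
        -> Result<Value, AppError>;
}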
All calls use structured JSON output (response_schema defines the expected shape).
Implementations
| Provider | Module | API Endpoint | Auth Method |
|---|---|---|---|
| Google Gemini | llm/gemini.rs | generativelanguage.googleapis.com | Query param ?key= |
| OpenAI | llm/openai.rs | api.openai.com/v1/chat/completions | Bearer token |
| Anthropic | llm/anthropic.rs | api.anthropic.com/v1/messages | x-api-key header |
| Mock | llm/mock.rs | N/A (in-memory) | N/A |
Factory
llm/factory.rs provides create_provider(provider_name, api_key, http_client) -> Arc<dyn LlmProvider>. Matches on provider name string.
Response Schema
llm/schema.rs builds JSON Schema definitions for:
- Classification/summarization: {title, summary, category, is_article}
- Web search: {category_0: [{title, url, summary}], ...} with per-category arrays
- Source link extraction: handled via heuristic HTML parsing (no LLM schema).
Error Mapping
map_provider_http_error() translates HTTP status codes to AppError variants:
- 400 -> BadRequest
- 401/403 -> BadRequest (invalid key)
- 404 -> BadRequest (model not found)
- 429/529 -> RateLimited
- Other -> Internal
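The status-code table above translates directly into a match expression. The AppError enum below is a stand-in with only the three variants named in this section, and the function signature and message strings are assumptions.

```rust
/// Stand-in for the application's error type (variant payloads assumed).
#[derive(Debug, PartialEq)]
pub enum AppError {
    BadRequest(String),
    RateLimited,
    Internal(String),
}

/// Translates a provider HTTP status code into an AppError.
pub fn map_provider_http_error(status: u16, body: &str) -> AppError {
    match status {
        400 => AppError::BadRequest(body.to_string()),
        401 | 403 => AppError::BadRequest("invalid API key".to_string()),
        404 => AppError::BadRequest("model not found".to_string()),
        // 529 is treated like 429 (rate limited).
        429 | 529 => AppError::RateLimited,
        _ => AppError::Internal(format!("provider returned HTTP {status}")),
    }
}
```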
7. Background Tasks
Session Cleanup
Runs hourly via tokio::spawn. Calls db::sessions::delete_expired to remove sessions past their expires_at timestamp.
Job Store Cleanup
JobStore::cleanup_expired removes job entries older than 1 hour (the TTL constant). Called periodically. Releases user locks for expired jobs.
Scheduler
Runs every minute via tokio::spawn with a 60-second interval. For each tick:
- current_day_code() -> "mon" through "sun"
- find_due_schedules(pool, day, time) -> queries enabled schedules matching current day and time (HH:MM)
- For each due schedule:
  - Skip if job_store.has_active_job(user_id) returns Some (manual generation in progress)
  - Create a temporary watch::channel and AtomicBool
  - Call synthesis::run_generation_inner directly (bypasses job store)
  - On success: send emails to configured recipients (up to 3), mark schedule as run
  - On failure: log error, do not mark as run
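The matching rule behind find_due_schedules is applied in SQL in the real system; the in-memory sketch below only illustrates the predicate. Schedule, is_due and day_code are hypothetical names, and the intermediate day codes are assumed from the "mon" through "sun" range.

```rust
/// Simplified in-memory view of a theme_schedules row.
pub struct Schedule {
    pub enabled: bool,
    pub days: Vec<String>,
    pub time_utc: String, // "HH:MM"
}

/// Maps days since Monday (0..=6) to a day code.
pub fn day_code(days_from_monday: u32) -> &'static str {
    const CODES: [&str; 7] = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"];
    CODES[(days_from_monday % 7) as usize]
}

/// A schedule is due when it is enabled, lists the current day code,
/// and its time_utc equals the current HH:MM.
pub fn is_due(s: &Schedule, current_day: &str, now_hhmm: &str) -> bool {
    s.enabled
        && s.time_utc == now_hhmm
        && s.days.iter().any(|d| d.as_str() == current_day)
}
```

Because the comparison is exact on HH:MM, a 60-second tick gives each schedule one matching minute per listed day.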
8. Configuration
Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| DATABASE_URL | Yes | - | PostgreSQL connection string |
| MASTER_ENCRYPTION_KEY | Yes | - | 64 hex chars (32 bytes) for AES-256-GCM |
| APP_URL | Yes | - | Public URL (CORS, magic links, cookies). No trailing slash. |
| PORT | No | 8080 | HTTP server port |
| RUST_LOG | No | - | Logging filter (e.g., "info,ai_synth_backend=debug") |
| STATIC_DIR | No | ../frontend/dist | Path to built SolidJS files |
| RESEND_API_KEY | Yes | - | Resend email service API key |
| EMAIL_FROM | Yes | - | Sender address for emails |
| TURNSTILE_SECRET_KEY | Yes | - | Cloudflare Turnstile server secret |
| TURNSTILE_SITE_KEY | Yes | - | Cloudflare Turnstile client key |
| POSTGRES_PASSWORD | Yes | - | Used by docker-compose for DB container |
Startup Validation
AppConfig::validate() checks at startup:
- MASTER_ENCRYPTION_KEY is exactly 64 hex characters
- APP_URL starts with http:// or https:// and has no trailing slash
The application refuses to start with invalid configuration.
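The two startup checks are simple string predicates. A minimal sketch follows, with hypothetical free functions standing in for whatever AppConfig::validate() does internally; the error messages are illustrative.

```rust
/// Checks that the key is exactly 64 hexadecimal characters (32 bytes).
pub fn validate_master_key(key: &str) -> Result<(), String> {
    if key.len() != 64 || !key.chars().all(|c| c.is_ascii_hexdigit()) {
        return Err("MASTER_ENCRYPTION_KEY must be exactly 64 hex characters".to_string());
    }
    Ok(())
}

/// Checks the scheme prefix and the absence of a trailing slash.
pub fn validate_app_url(url: &str) -> Result<(), String> {
    if !(url.starts_with("http://") || url.starts_with("https://")) {
        return Err("APP_URL must start with http:// or https://".to_string());
    }
    if url.ends_with('/') {
        return Err("APP_URL must not end with a trailing slash".to_string());
    }
    Ok(())
}
```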
User Settings Model
Default values applied when a user has no saved settings:
| Setting | Default | Range |
|---|---|---|
| max_articles_per_source | 3 | 1-10 |
| max_links_per_source | 8 | 1-30 |
| use_brave_search | false | boolean |
| article_history_days | 90 | 0-365 |
| batch_size | 5 | 1-20 |
| source_extraction_window | 3 | 1-10 |
| search_agent_behavior | "" | max 2000 chars |
| ai_provider | "" | max 100 chars |
| ai_model | "" | max 100 chars |
| ai_model_websearch | "" | max 100 chars |
| rate_limit_max_requests | null | >= 1 if set |
| rate_limit_time_window_seconds | null | >= 1 if set |