AI Weekly Synth -- Technical Specifications

1. Backend Tech Stack

| Dependency | Version | Purpose |
|---|---|---|
| axum | 0.8 | Web framework (macros, multipart) |
| tokio | 1 | Async runtime (full features) |
| tower | 0.5 | Middleware composition |
| tower-http | 0.6 | CORS, static files, tracing, headers |
| sqlx | 0.8 | Async Postgres driver (runtime-tokio, tls-rustls, uuid, chrono, json, migrate) |
| reqwest | 0.12 | HTTP client (JSON) |
| serde / serde_json | 1 | Serialization/deserialization |
| chrono | 0.4 | Date/time handling (serde feature) |
| aes-gcm | 0.10 | AES-256-GCM encryption |
| zeroize | 1 | Secure memory zeroing |
| sha2 | 0.10 | SHA-256 hashing |
| rand | 0.8 | Random number generation |
| base64 | 0.22 | Base64 encoding |
| hex | 0.4 | Hex encoding/decoding |
| async-trait | 0.1 | Async trait objects |
| tracing / tracing-subscriber | 0.1 / 0.3 | Structured logging (env-filter, json) |
| dotenvy | 0.15 | .env file loading |
| clap | 4 | CLI argument parsing |
| scraper | 0.22 | HTML parsing (CSS selectors) |
| ego-tree | 0.10 | Tree data structure (used by scraper) |
| url | 2 | URL parsing and validation |
| email_address | 0.2 | Email validation |
| anyhow | 1 | Error context |
| thiserror | 2 | Error type derivation |
| uuid | 1 | UUID v4 generation (serde feature) |
| dashmap | 6 | Concurrent hash maps |
| tokio-stream | 0.1 | Stream utilities for SSE |
| futures | 0.3 | Async stream combinators |
| printpdf | 0.7 | PDF generation |

Dev dependencies: tower (util), http-body-util, wiremock 0.6.

Rust edition: 2021.


2. Frontend Tech Stack

| Dependency | Version | Purpose |
|---|---|---|
| solid-js | ^1.9.0 | Reactive UI framework |
| @solidjs/router | ^0.15.0 | Client-side routing |
| lucide-solid | ^0.475.0 | Icon library |
| date-fns | ^4.1.0 | Date formatting |
| tailwindcss | ^4.1.0 | Utility-first CSS (v4) |
| @tailwindcss/vite | ^4.1.0 | Tailwind Vite plugin |
| vite | ^6.2.0 | Build tool and dev server |
| vite-plugin-solid | ^2.11.0 | SolidJS Vite integration |
| typescript | ~5.8.0 | Type checking |
| vitest | ^3.0.0 | Unit testing |
| @solidjs/testing-library | ^0.8.0 | Component testing |
| jsdom | ^25.0.0 | DOM environment for tests |

Frontend Routes

| Path | Component | Auth | Description |
|---|---|---|---|
| /login | Login | Public | Login page |
| /register | Register | Public | Registration page |
| /auth/verify | AuthVerify | Public | Magic link verification |
| / | Home | Protected | Dashboard / synthesis list |
| /settings | Settings | Protected | User settings |
| /themes | ThemeManager | Protected | Theme CRUD + source management |
| /generate | GenerateSynthesis | Protected | Generation trigger + progress |
| /synthesis/:id | SynthesisDetail | Protected | Full synthesis view |
| /article-history | ArticleHistory | Protected | Article history browser |
| /llm-logs/:jobId | LlmLogs | Protected | LLM call log viewer |
| /admin/providers | AdminProviders | Admin | Provider configuration |
| /admin/rate-limits | AdminRateLimits | Admin | Rate limit configuration |
| /admin/users | AdminUsers | Admin | User management |

3. Database Schema

3.1 users

| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| email | TEXT | NOT NULL, UNIQUE |
| display_name | TEXT | nullable |
| role | TEXT | NOT NULL, DEFAULT 'user', CHECK (user/admin) |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |

Indexes: idx_users_email on (email).

3.2 sessions

| Column | Type | Constraints |
|---|---|---|
| session_hash | TEXT | PK (SHA-256 of raw token) |
| user_id | UUID | NOT NULL, FK users(id) CASCADE |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
| expires_at | TIMESTAMPTZ | NOT NULL |
| last_active_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
| ip_address | TEXT | nullable |
| user_agent | TEXT | nullable |

Indexes: idx_sessions_user_id, idx_sessions_expires_at.

3.3 magic_tokens

| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| email | TEXT | NOT NULL |
| token_hash | TEXT | NOT NULL, UNIQUE |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
| expires_at | TIMESTAMPTZ | NOT NULL |
| used | BOOLEAN | NOT NULL, DEFAULT false |

Indexes: idx_magic_tokens_email, idx_magic_tokens_expires.

3.4 settings

Per-user pipeline configuration. One row per user (user_id is the PK).

| Column | Type | Constraints |
|---|---|---|
| user_id | UUID | PK, FK users(id) CASCADE |
| max_articles_per_source | INTEGER | NOT NULL, DEFAULT 3 |
| max_links_per_source | INTEGER | NOT NULL, DEFAULT 8 |
| use_brave_search | BOOLEAN | NOT NULL, DEFAULT false |
| article_history_days | INTEGER | NOT NULL, DEFAULT 90 |
| batch_size | INTEGER | NOT NULL, DEFAULT 5 |
| source_extraction_window | INTEGER | NOT NULL, DEFAULT 3 |
| search_agent_behavior | TEXT | NOT NULL, DEFAULT '' |
| ai_provider | TEXT | NOT NULL, DEFAULT '' |
| ai_model | TEXT | NOT NULL, DEFAULT '' |
| ai_model_websearch | TEXT | NOT NULL, DEFAULT '' |
| rate_limit_max_requests | INTEGER | nullable |
| rate_limit_time_window_seconds | INTEGER | nullable |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |

3.5 themes

Per-user topic configurations with content settings.

| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| user_id | UUID | NOT NULL, FK users(id) CASCADE |
| name | TEXT | NOT NULL |
| theme | TEXT | NOT NULL (search topic) |
| categories | JSONB | NOT NULL, DEFAULT '[]' |
| max_items_per_category | INTEGER | NOT NULL, DEFAULT 4 |
| max_age_days | INTEGER | NOT NULL, DEFAULT 7 |
| summary_length | INTEGER | NOT NULL, DEFAULT 3 |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |

Indexes: idx_themes_user_id.

The categories column stores user-defined categories only. At runtime, category assignment always also includes the two built-in categories Divers and Sans date.

3.6 sources

User-curated news source URLs, always tied to a theme.

| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| user_id | UUID | NOT NULL, FK users(id) CASCADE |
| title | VARCHAR(200) | NOT NULL, CHECK length 1-200 |
| url | VARCHAR(1000) | NOT NULL, CHECK length <= 1000 |
| theme_id | UUID | NOT NULL, FK themes(id) CASCADE |
| is_preferred | BOOLEAN | NOT NULL, DEFAULT false |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |

Indexes: idx_sources_user_id, UNIQUE idx_sources_user_id_url on (user_id, url).

3.7 syntheses

Generated synthesis results with JSONB section data.

| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| user_id | UUID | NOT NULL, FK users(id) CASCADE |
| week | VARCHAR(10) | NOT NULL (ISO week string) |
| sections | JSONB | NOT NULL, DEFAULT '[]' |
| status | VARCHAR(20) | NOT NULL, DEFAULT 'completed' |
| job_id | UUID | nullable |
| theme_id | UUID | nullable, FK themes(id) SET NULL |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |

Indexes: idx_syntheses_user_id_created_at on (user_id, created_at DESC).

JSONB structure for sections:

[
  {
    "title": "Category Name",
    "items": [
      { "title": "Article Title", "url": "https://...", "summary": "...", "date": "2026-03-25" }
    ]
  }
]

3.8 theme_schedules

Automated generation schedules, one per theme.

| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| theme_id | UUID | NOT NULL, UNIQUE, FK themes(id) CASCADE |
| user_id | UUID | NOT NULL, FK users(id) CASCADE |
| enabled | BOOLEAN | NOT NULL, DEFAULT true |
| days | JSONB | NOT NULL, DEFAULT '[]' (e.g. ["mon","fri"]) |
| time_utc | TEXT | NOT NULL, DEFAULT '08:00' (HH:MM) |
| emails | JSONB | NOT NULL, DEFAULT '[]' (up to 3 addresses) |
| last_run_at | TIMESTAMPTZ | nullable |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |

Indexes: idx_theme_schedules_enabled (partial, WHERE enabled = true).

3.9 article_history

Article URL deduplication and full provenance tracing.

| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| user_id | UUID | NOT NULL, FK users(id) CASCADE |
| url_hash | TEXT | NOT NULL (SHA-256 of normalized URL) |
| url | TEXT | NOT NULL |
| title | TEXT | NOT NULL, DEFAULT '' |
| source_type | TEXT | NOT NULL, DEFAULT 'unknown' |
| source_url | TEXT | nullable |
| category | TEXT | nullable |
| synthesis_id | UUID | nullable, FK syntheses(id) SET NULL |
| status | TEXT | NOT NULL, DEFAULT 'used' |
| scraped_ok | BOOLEAN | NOT NULL, DEFAULT true |
| job_id | UUID | NOT NULL |
| published_date | TEXT | nullable |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |

Indexes: idx_article_history_user_url on (user_id, url_hash), idx_article_history_job_id.

Status values: used, filtered_history, filtered_diversity, filtered_not_article, filtered_too_old, filtered_empty, filtered_homepage, filtered_cross_phase_dedup.

Source type values: personalized_source, brave_search, web_search.
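
The status strings above lend themselves to a closed enum, so that a typo in a status cannot reach the database. The following is a minimal sketch, not the project's actual type: the enum name `ArticleStatus` and its methods are hypothetical, and only the string values come from the list above.

```rust
/// Hypothetical enum mirroring the documented article_history.status values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ArticleStatus {
    Used,
    FilteredHistory,
    FilteredDiversity,
    FilteredNotArticle,
    FilteredTooOld,
    FilteredEmpty,
    FilteredHomepage,
    FilteredCrossPhaseDedup,
}

impl ArticleStatus {
    /// Every variant, used by `parse` to avoid a second hand-written match.
    pub const ALL: [ArticleStatus; 8] = [
        ArticleStatus::Used,
        ArticleStatus::FilteredHistory,
        ArticleStatus::FilteredDiversity,
        ArticleStatus::FilteredNotArticle,
        ArticleStatus::FilteredTooOld,
        ArticleStatus::FilteredEmpty,
        ArticleStatus::FilteredHomepage,
        ArticleStatus::FilteredCrossPhaseDedup,
    ];

    /// The exact string stored in the TEXT column.
    pub fn as_str(self) -> &'static str {
        match self {
            ArticleStatus::Used => "used",
            ArticleStatus::FilteredHistory => "filtered_history",
            ArticleStatus::FilteredDiversity => "filtered_diversity",
            ArticleStatus::FilteredNotArticle => "filtered_not_article",
            ArticleStatus::FilteredTooOld => "filtered_too_old",
            ArticleStatus::FilteredEmpty => "filtered_empty",
            ArticleStatus::FilteredHomepage => "filtered_homepage",
            ArticleStatus::FilteredCrossPhaseDedup => "filtered_cross_phase_dedup",
        }
    }

    /// Parses a stored string; unknown values yield None.
    pub fn parse(s: &str) -> Option<Self> {
        Self::ALL.iter().copied().find(|v| v.as_str() == s)
    }

    /// True for every status except `used`, i.e. the article was dropped.
    pub fn is_dropped(self) -> bool {
        self != ArticleStatus::Used
    }
}
```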

3.10 llm_call_log

Full LLM interaction logging for debugging and analysis.

| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| user_id | UUID | NOT NULL, FK users(id) CASCADE |
| job_id | UUID | NOT NULL |
| call_type | TEXT | NOT NULL |
| model | TEXT | NOT NULL |
| system_prompt | TEXT | NOT NULL, DEFAULT '' |
| user_prompt | TEXT | NOT NULL, DEFAULT '' |
| response_body | TEXT | NOT NULL, DEFAULT '' |
| duration_ms | INTEGER | NOT NULL, DEFAULT 0 |
| article_url | TEXT | nullable |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |

Indexes: idx_llm_call_log_job_id, idx_llm_call_log_user_id on (user_id, created_at).

3.11 admin_providers

Admin-curated catalog of LLM providers and their models.

| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| provider_name | VARCHAR(50) | NOT NULL, UNIQUE |
| display_name | VARCHAR(100) | NOT NULL |
| models_scraping | JSONB | NOT NULL, DEFAULT '[]' |
| models_websearch | JSONB | NOT NULL, DEFAULT '[]' |
| is_enabled | BOOLEAN | NOT NULL, DEFAULT true |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |

Indexes: idx_admin_providers_enabled (partial, WHERE is_enabled = true).

Seeded with: gemini, openai, anthropic.

JSONB model structure:

[{"model_id": "gemini-2.5-pro", "display_name": "Gemini 2.5 Pro", "is_default": true}]

3.12 admin_rate_limits

Per-provider rate limit configuration.

| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| provider_name | VARCHAR(50) | NOT NULL, UNIQUE, FK admin_providers(provider_name) CASCADE |
| max_requests | INTEGER | NOT NULL, DEFAULT 30 |
| time_window_seconds | INTEGER | NOT NULL, DEFAULT 60 |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |

Seeded defaults: gemini 29/60s, openai 50/60s, anthropic 40/60s.

3.13 user_api_keys

Encrypted user LLM API keys.

| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| user_id | UUID | NOT NULL, FK users(id) CASCADE |
| provider_name | VARCHAR(50) | NOT NULL |
| encrypted_key | BYTEA | NOT NULL |
| nonce | BYTEA | NOT NULL |
| key_prefix | VARCHAR(20) | NOT NULL |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |
| updated_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |

Constraint: UNIQUE(user_id, provider_name). Valid providers: gemini, openai, anthropic, brave_search.

3.14 audit_log

Admin mutation audit trail.

| Column | Type | Constraints |
|---|---|---|
| id | UUID | PK, DEFAULT gen_random_uuid() |
| admin_user_id | UUID | nullable, FK users(id) SET NULL |
| action | VARCHAR(100) | NOT NULL |
| target_type | VARCHAR(50) | nullable |
| target_id | VARCHAR(255) | nullable |
| details | JSONB | nullable |
| created_at | TIMESTAMPTZ | NOT NULL, DEFAULT now() |

Indexes: idx_audit_log_created_at (DESC), idx_audit_log_admin_user.


4. API Endpoints

All endpoints are prefixed with /api/v1. Responses are JSON. Errors follow the shape { "error": "message" }.

4.1 Authentication

POST /auth/register

  • Auth: Public
  • Body: { email: string, display_name?: string, turnstile_token: string }
  • Response: { message: string }
  • Sends magic link email. Rate limited.

POST /auth/login

  • Auth: Public
  • Body: { email: string, turnstile_token: string }
  • Response: { message: string }
  • Sends magic link email. Rate limited.

GET /auth/verify?token=...&email=...

  • Auth: Public
  • Response: Redirect to frontend with session cookie set.

POST /auth/verify

  • Auth: Public
  • Body: { token: string, email: string }
  • Response: { message: string, user: User }
  • Sets session HttpOnly cookie (30-day expiry).

POST /auth/logout

  • Auth: Authenticated
  • Response: { message: string }
  • Clears session cookie and deletes DB session.

GET /auth/me

  • Auth: Authenticated
  • Response: { id, email, display_name, role, created_at }

4.2 Settings

GET /settings

  • Auth: Authenticated
  • Response: UserSettings (creates defaults if not exists)

PUT /settings

  • Auth: Authenticated
  • Body: UpdateSettingsRequest (all fields required)
  • Validation: max_articles_per_source 1-10, max_links_per_source 1-30, batch_size 1-20, source_extraction_window 1-10, article_history_days 0-365, search_agent_behavior max 2000 chars, ai_provider/ai_model/ai_model_websearch max 100 chars.
  • Response: Updated UserSettings
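
The validation rules for PUT /settings can be expressed as a handful of range and length checks. This is a sketch under stated assumptions rather than the real handler: the struct below carries only the fields that have a documented rule, the helper names `check_range` and `check_len` are hypothetical, errors are plain strings, and lengths are counted in characters (the source does not say whether characters or bytes are meant).

```rust
/// Subset of the request body: only the fields with documented validation rules.
pub struct UpdateSettingsRequest {
    pub max_articles_per_source: i32,
    pub max_links_per_source: i32,
    pub batch_size: i32,
    pub source_extraction_window: i32,
    pub article_history_days: i32,
    pub search_agent_behavior: String,
    pub ai_provider: String,
    pub ai_model: String,
    pub ai_model_websearch: String,
}

/// Hypothetical helper: inclusive integer range check.
fn check_range(name: &str, value: i32, min: i32, max: i32) -> Result<(), String> {
    if (min..=max).contains(&value) {
        Ok(())
    } else {
        Err(format!("{name} must be between {min} and {max}"))
    }
}

/// Hypothetical helper: maximum length in characters (assumption: chars, not bytes).
fn check_len(name: &str, value: &str, max_chars: usize) -> Result<(), String> {
    if value.chars().count() <= max_chars {
        Ok(())
    } else {
        Err(format!("{name} must be at most {max_chars} characters"))
    }
}

/// Applies the documented limits and returns the first violation found.
pub fn validate_settings(req: &UpdateSettingsRequest) -> Result<(), String> {
    check_range("max_articles_per_source", req.max_articles_per_source, 1, 10)?;
    check_range("max_links_per_source", req.max_links_per_source, 1, 30)?;
    check_range("batch_size", req.batch_size, 1, 20)?;
    check_range("source_extraction_window", req.source_extraction_window, 1, 10)?;
    check_range("article_history_days", req.article_history_days, 0, 365)?;
    check_len("search_agent_behavior", &req.search_agent_behavior, 2000)?;
    check_len("ai_provider", &req.ai_provider, 100)?;
    check_len("ai_model", &req.ai_model, 100)?;
    check_len("ai_model_websearch", &req.ai_model_websearch, 100)?;
    Ok(())
}
```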

4.3 Themes

GET /themes

  • Auth: Authenticated
  • Response: ThemeResponse[]

POST /themes

  • Auth: Authenticated
  • Body: { name, theme, categories: string[], max_items_per_category?, max_age_days?, summary_length? }
  • Validation: name non-empty max 200 chars, categories 0-20 non-empty entries, max_items 1-50, max_age 1-365, summary_length 1-3.
  • Notes: theme creation is valid with an empty user-defined categories list. The system always includes Divers and Sans date.
  • Response: ThemeResponse

PUT /themes/{id}

  • Auth: Authenticated (owner only)
  • Body: UpdateThemeRequest (all fields optional)
  • Response: ThemeResponse

DELETE /themes/{id}

  • Auth: Authenticated (owner only)
  • Response: 204 No Content

4.4 Schedules

GET /themes/{id}/schedule

  • Auth: Authenticated (theme owner)
  • Response: ScheduleResponse | null with HTTP 200

PUT /themes/{id}/schedule

  • Auth: Authenticated (theme owner)
  • Body: { enabled, days: string[], time_utc: "HH:MM", emails: string[] }
  • Validation: days from mon-sun, time HH:MM format, max 3 emails.
  • Response: ScheduleResponse
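
A sketch of the schedule validation described above, using only the standard library. The function names are hypothetical. The source names the day codes "mon" through "sun" and shows "mon" and "fri"; the remaining codes are assumed to follow the same three-letter pattern. Email syntax checking (the email_address crate in the dependency table) is left out; only the count is enforced here.

```rust
/// Assumed day codes: three-letter lowercase, Monday first.
const DAY_CODES: [&str; 7] = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"];
const MAX_EMAILS: usize = 3;

/// Returns true for a zero-padded 24-hour "HH:MM" string.
pub fn is_valid_time_utc(s: &str) -> bool {
    let b = s.as_bytes();
    if b.len() != 5 || b[2] != b':' {
        return false;
    }
    if ![0usize, 1, 3, 4].iter().all(|&i| b[i].is_ascii_digit()) {
        return false;
    }
    let hour = (b[0] - b'0') * 10 + (b[1] - b'0');
    let minute = (b[3] - b'0') * 10 + (b[4] - b'0');
    hour < 24 && minute < 60
}

/// Checks day codes, time format, and the email count limit.
pub fn validate_schedule(days: &[&str], time_utc: &str, emails: &[&str]) -> Result<(), String> {
    for day in days {
        if !DAY_CODES.contains(day) {
            return Err(format!("unknown day code: {day}"));
        }
    }
    if !is_valid_time_utc(time_utc) {
        return Err(format!("time_utc must be HH:MM, got {time_utc}"));
    }
    if emails.len() > MAX_EMAILS {
        return Err(format!("at most {} emails allowed", MAX_EMAILS));
    }
    Ok(())
}
```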

DELETE /themes/{id}/schedule

  • Auth: Authenticated (theme owner)
  • Response: 204 No Content

4.5 Sources

GET /sources?theme_id=...

  • Auth: Authenticated
  • Query: theme_id is required
  • Response: SourceResponse[]

POST /sources

  • Auth: Authenticated
  • Body: { title, url, theme_id }
  • Validation: title non-empty max 200, URL http(s) max 1000 chars.
  • Response: SourceResponse

PUT /sources/preferred

  • Auth: Authenticated
  • Body: { theme_id: UUID, source_ids: UUID[] }
  • Note: preferred state is scoped per theme.
  • Response: { updated: number }

DELETE /sources/{id}

  • Auth: Authenticated (owner only)
  • Response: 204 No Content

POST /sources/bulk

  • Auth: Authenticated
  • Body: { sources: CreateSourceRequest[], theme_id: UUID }
  • Response: { imported, skipped, errors }

POST /sources/import-csv

  • Auth: Authenticated
  • Body: Multipart file upload (CSV: title,url) + required theme_id
  • Response: { imported, skipped, errors }

GET /sources/export-csv

  • Auth: Authenticated
  • Query: theme_id is required
  • Scope: exports sources for the selected theme only
  • Response: CSV file download

4.6 Generation

POST /syntheses/generate

  • Auth: Authenticated
  • Body: { theme_id: UUID }
  • Response: { job_id: UUID }
  • Creates job in JobStore, spawns background generation task. Returns 409 if user already has active job.

GET /syntheses/generate/{job_id}/progress

  • Auth: Authenticated (job owner)
  • Response: SSE stream of ProgressEvent
  • Events: progress (step, message, percent), complete (synthesis_id), error (message).

POST /syntheses/generate/{job_id}/stop

  • Auth: Authenticated (job owner)
  • Response: { message: string }
  • Sets cooperative cancellation flag.

4.7 Syntheses

GET /syntheses

  • Auth: Authenticated
  • Response: SynthesisListItem[] (with section summaries, theme info)

GET /syntheses/{id}

  • Auth: Authenticated (owner only)
  • Response: SynthesisResponse (full sections data)

DELETE /syntheses/{id}

  • Auth: Authenticated (owner only)
  • Response: 204 No Content

POST /syntheses/{id}/send-email

  • Auth: Authenticated
  • Body: { email: string }
  • Response: { message: string }

GET /syntheses/{id}/export/markdown

  • Auth: Authenticated
  • Response: Markdown file download

GET /syntheses/{id}/export/pdf

  • Auth: Authenticated
  • Response: PDF file download

4.8 Article History & Provenance

GET /article-history?limit=&offset=&job_id=&status=

  • Auth: Authenticated
  • Response: { items: ArticleHistoryEntry[], total: number }

DELETE /article-history

  • Auth: Authenticated
  • Response: { deleted: number }

GET /syntheses/{id}/provenance

  • Auth: Authenticated
  • Response: ArticleHistoryEntry[] (articles with status "used" for this synthesis's job_id)

4.9 LLM Call Logs

GET /llm-logs/{job_id}

  • Auth: Authenticated
  • Response: LlmCallLogEntry[]

4.10 User API Keys

GET /user/api-keys

  • Auth: Authenticated
  • Response: ApiKeyResponse[] (id, provider_name, key_prefix, timestamps; never the full key)

POST /user/api-keys

  • Auth: Authenticated
  • Body: { provider_name, api_key }
  • Validation: provider in (gemini, openai, anthropic, brave_search), key 8-500 chars.
  • Response: ApiKeyResponse
  • Encrypts key with AES-256-GCM before storage; upserts (one key per user per provider).

DELETE /user/api-keys/{provider}

  • Auth: Authenticated
  • Response: 204 No Content

POST /user/api-keys/{provider}/test

  • Auth: Authenticated
  • Response: { success: boolean, message: string }
  • Decrypts key, calls provider test endpoint.

POST /user/api-keys/export

  • Auth: Authenticated
  • Response: { keys: [{ provider_name, api_key }] }
  • Decrypts and returns all keys (used for backup/migration).

4.11 Public Configuration

GET /config/providers

  • Auth: Authenticated
  • Response: ProviderConfigResponse[] (enabled providers with model lists for scraping and websearch)

4.12 Admin Endpoints

All /admin endpoints below require the AdminUser extractor (role = admin). GET /health, listed at the end of this section, is public.

GET /admin/providers

  • Response: AdminProviderResponse[]

POST /admin/providers

  • Body: CreateProviderRequest
  • Validation: provider_name in (gemini, openai, anthropic), at least one model per list, at most one default per list.
  • Response: AdminProviderResponse

PUT /admin/providers/{id}

  • Body: UpdateProviderRequest (all fields optional)
  • Response: AdminProviderResponse

DELETE /admin/providers/{id}

  • Response: 204 No Content

GET /admin/rate-limits

  • Response: RateLimitResponse[]

PUT /admin/rate-limits/{provider_name}

  • Body: { max_requests: 1-1000, time_window_seconds: 1-3600 }
  • Response: RateLimitResponse
  • Hot-reloads the in-memory provider rate limiter.

GET /admin/users

  • Response: AdminUserResponse[]

PUT /admin/users/{id}/role

  • Body: { role: "user" | "admin" }
  • Response: { message: string }

GET /health

  • Auth: Public
  • Response: { status: "ok" }

5. Generation Pipeline — Full Algorithm

Startup & Background Tasks

  • Session cleanup: an hourly background task deletes expired DB sessions (db::sessions::delete_expired).
  • Job store TTL: expired job entries (older than 1 hour) are cleaned up via JobStore::cleanup_expired.

Generation Lifecycle

POST /api/v1/syntheses/generate creates a job in the JobStore, then spawns two nested tasks:

  • Inner task: wraps run_generation in a 15-minute tokio::time::timeout. If the timeout fires, sends an Error progress event and releases the user lock.
  • Outer task: monitors the inner task's JoinHandle for panics. If the inner task panics, sends an Error progress event and releases the user lock.

Progress is streamed to clients via a tokio::sync::watch channel (SSE endpoint subscribes to it).

Initialization

  1. Load user settings from DB (provider, models, batch_size, rate limits, etc.)
  2. Cleanup — delete old article history entries (>N days, dropped only) + truncate old LLM call logs
  3. Validate — runtime category set always includes Divers and Sans date even when no user-defined categories are configured.
  4. Load theme — categories, max_items_per_category, max_age_days, summary_length
  5. Load user sources (personalized URLs filtered by theme_id)
  6. Resolve LLM provider — decrypt user's API key, create provider instance (Arc<dyn LlmProvider>)
  7. Resolve models — research model + web-search model (user override or admin default)
  8. Setup rate limiter — per-user or global provider limiter
  9. Initialize tracking structures: article_scraped (category→articles), source_counts (per-domain article count), url_source (per-article source), filled_counts (per-category article count), seen_urls (cross-phase dedup), classification_categories (user categories + Divers; Sans date is assigned by no-date routing).
  10. Batch trace buffer: pending_traces: Vec<ArticleHistoryEntry> accumulates all article history writes and is flushed with db::article_history::batch_insert_entries at phase boundaries.

Phase 1: Personalized Sources

Skipped entirely if user has 0 sources.

1a. Windowed source extraction

  • Query article_history for the last source used. Reorder sources so the first source follows the last one used (rolling window).
  • Separate preferred sources (processed first) from non-preferred, preserving rotation order within each group.
  • Process sources in waves of source_extraction_window size:
    • For each source in the wave: fetch page HTML, extract up to max_links_per_source article URLs via HTML parsing (same-domain, non-homepage, no static assets).
    • SSRF check performed on each source URL before fetching.
    • Deduplicate candidate URLs (case-insensitive, cross-source via seen_urls).
    • Filter against article history — hash each URL (normalized: lowercase, strip fragments/UTM params/trailing slashes), batch-query article_history → remove matches. Trace dropped articles as status: filtered_history.
    • Preferred-first shuffle — shuffle preferred URLs separately from non-preferred, then concatenate (preferred first).
    • Track url → source in url_source.
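
The source ordering in step 1a (rolling window, preferred first, waves) can be sketched as a pure function. This is an illustration, not the project's code: the `Source` struct is reduced to two fields and `plan_waves` is a hypothetical name. It assumes the "last source used" is identified by its URL.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Source {
    pub url: String,
    pub is_preferred: bool,
}

/// Orders sources for extraction and splits them into waves of `window` sources.
pub fn plan_waves(sources: &[Source], last_used_url: Option<&str>, window: usize) -> Vec<Vec<Source>> {
    // 1. Rolling window: start with the source that follows the last one used.
    let start = last_used_url
        .and_then(|last| sources.iter().position(|s| s.url == last))
        .map(|i| (i + 1) % sources.len())
        .unwrap_or(0);
    let mut rotated: Vec<Source> = sources.to_vec();
    rotated.rotate_left(start);

    // 2. Preferred sources first; partition keeps the rotation order inside each group.
    let (preferred, others): (Vec<Source>, Vec<Source>) =
        rotated.into_iter().partition(|s| s.is_preferred);
    let ordered: Vec<Source> = preferred.into_iter().chain(others).collect();

    // 3. Waves of source_extraction_window size (at least 1 to avoid a zero chunk size).
    ordered.chunks(window.max(1)).map(|c| c.to_vec()).collect()
}
```

With sources a, b (preferred), c, d (preferred), last used b, and a window of 3, the rotation gives c, d, a, b, the preferred split gives d, b, c, a, and the waves are [d, b, c] followed by [a].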

1b. Scrape, classify, and summarize articles (batched)

Processing in batches of settings.batch_size (minimum 1). For each batch:

Batch assembly: Pull up to batch_size candidates, skipping any where source_counts[domain] >= max_articles_per_source (traced as filtered_diversity).

Phase A — Scrape batch in parallel (JoinSet):

  • SSRF check (no private IPs), 15s timeout, 5MB body limit.
  • HTML parsing for title (<title>, og:title), date (meta tags, JSON-LD, <time>), body (strip scripts/nav), soft-404 detection.
  • If article body is empty, is a soft-404, or is too old: trace as filtered_empty / filtered_too_old and skip.

Phase B — Classify/summarize batch in parallel (JoinSet):

  • Check rate limit before classifying (waits up to 60s, then errors).
  • Send article (title + body snippet based on summary_length: 500/2000/4000 chars) + categories + "Divers" to LLM.
  • LLM returns {title, summary, category, date, is_article}.
  • is_article check: if false, trace as filtered_not_article and skip.
  • Date fallback: if the LLM returned a date and that date is older than max_age_days, trace as filtered_too_old and skip.
  • No-date routing: if no date found (neither scraper nor LLM), route to Sans date category.
  • assign_category() helper: validates category, falls back to "Divers" if unknown or full. If "Divers" is also full, drops the article.
  • LLM call logged with full prompt/response/timing.
  • Add article to article_scraped, increment filled_counts and source_counts.
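
The fallback rule of assign_category() described above can be sketched as follows. The signature is an assumption (the source names the helper but not its parameters), category names are compared exactly, and Divers is assumed to share the same max_items_per_category cap as user categories. No-date routing to Sans date is a separate step and is not modelled here.

```rust
use std::collections::HashMap;

pub const FALLBACK: &str = "Divers";

/// Returns the category the article should be filed under, or None if it must be dropped.
pub fn assign_category(
    llm_category: &str,
    user_categories: &[&str],
    filled_counts: &HashMap<String, usize>,
    max_items_per_category: usize,
) -> Option<String> {
    // A category has room while its count is below the per-category cap.
    let has_room =
        |name: &str| filled_counts.get(name).copied().unwrap_or(0) < max_items_per_category;

    // Accept the LLM's answer only if it is a known category with room left.
    let known = llm_category == FALLBACK || user_categories.contains(&llm_category);
    if known && has_room(llm_category) {
        return Some(llm_category.to_string());
    }
    // Unknown or full: fall back to Divers if it still has room.
    if has_room(FALLBACK) {
        return Some(FALLBACK.to_string());
    }
    // Divers is full as well: drop the article.
    None
}
```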

Early exit: After each batch, if total articles ≥ (num_categories + 1) × max_items_per_category, stop.
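
As a worked example of the early-exit threshold: with 3 categories and max_items_per_category = 4, the pipeline stops once (3 + 1) × 4 = 16 articles are collected. The sketch below assumes that num_categories counts user-defined categories and that the extra 1 accounts for Divers; the source does not state this explicitly, and the function names are hypothetical.

```rust
/// Total number of articles at which the synthesis is considered full.
pub fn synthesis_capacity(num_categories: usize, max_items_per_category: usize) -> usize {
    (num_categories + 1) * max_items_per_category
}

/// Early-exit check evaluated after each batch.
pub fn should_stop(total_articles: usize, num_categories: usize, max_items_per_category: usize) -> bool {
    total_articles >= synthesis_capacity(num_categories, max_items_per_category)
}
```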

Wave check: After each wave, if synthesis is full, skip remaining waves.

Trace flush: Pending traces batch-inserted into article_history between waves.

Phase 2: Web Search Fallback

Skipped if all user-defined categories are already filled.

2a. Compute category gaps

For each user category: needed = max_items_per_category - already_filled. Only proceed if any category needs more.
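
The gap computation is a per-category subtraction; Phase 2 only runs if the result is non-empty. A minimal sketch, with a hypothetical function name and with filled counts keyed by category name:

```rust
use std::collections::HashMap;

/// Returns (category, needed) for every user category that is not yet full.
pub fn category_gaps(
    user_categories: &[&str],
    filled_counts: &HashMap<String, usize>,
    max_items_per_category: usize,
) -> Vec<(String, usize)> {
    user_categories
        .iter()
        .filter_map(|name| {
            let already_filled = filled_counts.get(*name).copied().unwrap_or(0);
            // saturating_sub guards against a count that somehow exceeds the cap.
            let needed = max_items_per_category.saturating_sub(already_filled);
            (needed > 0).then(|| ((*name).to_string(), needed))
        })
        .collect()
}
```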

The search path is selected by settings.use_brave_search.

Path A: Brave Search (use_brave_search = true)

  1. Resolve and decrypt the user's Brave Search API key (error if not configured).
  2. Query: "{theme} actualites", up to 20 results, freshness mapped from max_age_days (pd/pw/pm/py).
  3. Filter results through filter_phase2_url(): homepage filter → cross-phase dedup → article history → source diversity.
  4. Batch scrape + classify (same as Phase 1b, source_type = "brave_search").
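
The query and freshness mapping for Path A might look like the sketch below. The freshness codes pd, pw, pm, and py come from the text; the day boundaries used here (1, 7, and 31 days) are an assumption, since the source does not give the thresholds, and both function names are hypothetical.

```rust
/// Builds the Brave query string from the theme's search topic.
pub fn brave_query(theme: &str) -> String {
    format!("{theme} actualites")
}

/// Maps max_age_days to a freshness code. Boundaries are assumed, not taken from the source.
pub fn brave_freshness(max_age_days: u32) -> &'static str {
    match max_age_days {
        0..=1 => "pd",  // past day
        2..=7 => "pw",  // past week
        8..=31 => "pm", // past month
        _ => "py",      // past year
    }
}
```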

Path B: LLM Web Search (use_brave_search = false)

  1. Build search prompt with theme, categories, gap counts.
  2. Call LLM with model_websearch. Returns {category_0: [{title, url, summary}], ...}.
  3. Filter URLs through filter_phase2_url().
  4. Scrape each result sequentially. Keep LLM-provided title/summary (no re-classification).
  5. source_type = "web_search".

Save + Record

  1. Error if empty — if all article lists are empty and generation wasn't cancelled, return error.
  2. Order sections — user-defined categories first (in order), then Divers if non-empty, then Sans date if non-empty.
  3. Sanitize — strip \u0000 null bytes from JSON (PostgreSQL JSONB requirement).
  4. Save synthesis — insert into syntheses table with job_id, week (ISO week), sections (JSONB), status: completed, theme_id.
  5. Record used articles — for each article in the final synthesis, build trace with status: "used", synthesis_id, and correct source_type (inferred from url_source). Batch-insert into article_history.
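
Steps 2 and 3 above (section ordering and null-byte sanitizing) can be sketched with the standard library alone. Function names are hypothetical, items are reduced to plain strings, and the sketch assumes user-defined categories are emitted even when empty, which the source does not spell out.

```rust
use std::collections::HashMap;

/// Orders sections: user categories in their configured order, then the
/// built-in Divers and Sans date categories only when they contain items.
pub fn order_sections(
    user_categories: &[&str],
    mut by_category: HashMap<String, Vec<String>>,
) -> Vec<(String, Vec<String>)> {
    let mut sections = Vec::new();
    for name in user_categories {
        let items = by_category.remove(*name).unwrap_or_default();
        sections.push(((*name).to_string(), items));
    }
    for special in ["Divers", "Sans date"] {
        if let Some(items) = by_category.remove(special) {
            if !items.is_empty() {
                sections.push((special.to_string(), items));
            }
        }
    }
    sections
}

/// Removes U+0000, which PostgreSQL rejects inside JSONB text values.
pub fn strip_null_bytes(s: &str) -> String {
    s.replace('\u{0000}', "")
}
```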

Shared Helpers

  • build_trace_entry() — constructs an ArticleHistoryEntry from an ArticleTrace struct. Never writes to DB directly; caller accumulates in pending_traces.
  • scrape_and_classify_batch() — shared batch processing logic used by Phase 1 and Phase 2 Brave paths.
  • assign_category() — validates LLM-returned category, falls back to "Divers", drops if all full.
  • filter_phase2_url() — async helper applying homepage/dedup/history/diversity filters for Phase 2.
  • scrape_single_article() — thin wrapper around scraper::scrape_url returning (body_text, page_title, final_url, drop_reason).
  • hash_article_url() — normalizes URL (strips fragments, UTM params, trailing slashes, lowercases) then SHA-256 hashes.
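
The normalization half of hash_article_url() can be illustrated without external crates. This sketch is an assumption about the order of operations (lowercase, drop the fragment, drop utm_ parameters, trim trailing slashes); the real helper may differ. The SHA-256 step is omitted so the example stays dependency-free; in the project it would be applied to the returned string using the sha2 crate from the dependency table.

```rust
/// Normalizes a URL so that trivially different spellings compare equal.
pub fn normalize_article_url(raw: &str) -> String {
    let lowered = raw.trim().to_lowercase();

    // Drop everything from the first '#' onward (the fragment).
    let without_fragment = lowered.split('#').next().unwrap_or("");

    // Separate the query string, if any.
    let (base, query) = match without_fragment.split_once('?') {
        Some((b, q)) => (b, q),
        None => (without_fragment, ""),
    };

    // Keep every query parameter except UTM tracking parameters.
    let kept: Vec<&str> = query
        .split('&')
        .filter(|p| !p.is_empty() && !p.starts_with("utm_"))
        .collect();

    let mut out = base.trim_end_matches('/').to_string();
    if !kept.is_empty() {
        out.push('?');
        out.push_str(&kept.join("&"));
    }
    out
}
```

Two spellings of the same article, such as one with a trailing slash and tracking parameters and one without, then hash to the same url_hash and are caught by the history filter.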

6. LLM Provider Abstraction

Trait Definition

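use async_trait::async_trait;
use serde_json::Value;

// AppError is the crate's own error type (see Error Mapping).
#[async_trait]
pub trait LlmProvider: Send + Sync {
    /// Identifier of the provider implementation.
    fn provider_id(&self) -> &str;
    /// Sends one request; response_schema describes the expected JSON shape.
    async fn call_llm(&self, model: &str, system_prompt: &str,
                       user_prompt: &str, response_schema: &Value)
        -> Result<Value, AppError>;
}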

All calls use structured JSON output (response_schema defines the expected shape).

Implementations

| Provider | Module | API Endpoint | Auth Method |
|---|---|---|---|
| Google Gemini | llm/gemini.rs | generativelanguage.googleapis.com | Query param ?key= |
| OpenAI | llm/openai.rs | api.openai.com/v1/chat/completions | Bearer token |
| Anthropic | llm/anthropic.rs | api.anthropic.com/v1/messages | x-api-key header |
| Mock | llm/mock.rs | N/A (in-memory) | N/A |

Factory

llm/factory.rs provides create_provider(provider_name, api_key, http_client) -> Arc<dyn LlmProvider>. Matches on provider name string.

Response Schema

llm/schema.rs builds JSON Schema definitions for:

  • Classification/summarization: {title, summary, category, date, is_article}
  • Web search: {category_0: [{title, url, summary}], ...} with per-category arrays
  • Source link extraction: handled via heuristic HTML parsing (no LLM schema).

Error Mapping

map_provider_http_error() translates HTTP status codes to AppError variants:

  • 400 -> BadRequest
  • 401/403 -> BadRequest (invalid key)
  • 404 -> BadRequest (model not found)
  • 429/529 -> RateLimited
  • Other -> Internal
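
The status-code mapping above translates directly into a match expression. In this sketch the `AppError` enum is a reduced stand-in with only the three variants named in the list, and the function signature (status code plus response body) is an assumption.

```rust
/// Reduced stand-in for the crate's error type; only the variants named above.
#[derive(Debug, Clone, PartialEq)]
pub enum AppError {
    BadRequest(String),
    RateLimited,
    Internal(String),
}

/// Maps a provider's HTTP status code to an AppError variant.
pub fn map_provider_http_error(status: u16, body: &str) -> AppError {
    match status {
        400 => AppError::BadRequest(format!("provider rejected the request: {body}")),
        401 | 403 => AppError::BadRequest("invalid API key".to_string()),
        404 => AppError::BadRequest("model not found".to_string()),
        429 | 529 => AppError::RateLimited,
        other => AppError::Internal(format!("provider returned HTTP {other}")),
    }
}
```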

7. Background Tasks

Session Cleanup

Runs hourly via tokio::spawn. Calls db::sessions::delete_expired to remove sessions past their expires_at timestamp.

Job Store Cleanup

JobStore::cleanup_expired removes job entries older than 1 hour (the TTL constant). Called periodically. Releases user locks for expired jobs.
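
A simplified, single-threaded sketch of the TTL cleanup. It is not the real JobStore: IDs are plain integers instead of UUIDs, a std HashMap stands in for the concurrent map (the dependency table lists dashmap), has_active_job returns a bool rather than an Option, and the method `try_start` is hypothetical. The current time is passed in so the behaviour is easy to test.

```rust
use std::collections::{HashMap, HashSet};
use std::time::{Duration, Instant};

/// TTL from the text: job entries older than one hour are removed.
pub const JOB_TTL: Duration = Duration::from_secs(60 * 60);

#[derive(Default)]
pub struct JobStore {
    /// job_id -> (user_id, created_at)
    jobs: HashMap<u64, (u64, Instant)>,
    /// Users currently holding the one-active-job lock.
    active_users: HashSet<u64>,
}

impl JobStore {
    /// Registers a job; returns false if the user already has one (the 409 case).
    pub fn try_start(&mut self, job_id: u64, user_id: u64, now: Instant) -> bool {
        if !self.active_users.insert(user_id) {
            return false;
        }
        self.jobs.insert(job_id, (user_id, now));
        true
    }

    pub fn has_active_job(&self, user_id: u64) -> bool {
        self.active_users.contains(&user_id)
    }

    /// Removes jobs older than JOB_TTL and releases their user locks.
    /// Returns the number of jobs removed.
    pub fn cleanup_expired(&mut self, now: Instant) -> usize {
        let expired: Vec<u64> = self
            .jobs
            .iter()
            .filter(|(_, (_, created))| now.saturating_duration_since(*created) > JOB_TTL)
            .map(|(id, _)| *id)
            .collect();
        for id in &expired {
            if let Some((user_id, _)) = self.jobs.remove(id) {
                self.active_users.remove(&user_id);
            }
        }
        expired.len()
    }
}
```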

Scheduler

Runs every minute via tokio::spawn with a 60-second interval. For each tick:

  1. current_day_code() -> "mon" through "sun"
  2. find_due_schedules(pool, day, time) -> queries enabled schedules matching current day and time (HH:MM)
  3. For each due schedule:
    • Skip if job_store.has_active_job(user_id) returns Some (manual generation in progress)
    • Create a temporary watch::channel and AtomicBool
    • Call synthesis::run_generation_inner directly (bypasses job store)
    • On success: send emails to configured recipients (up to 3), mark schedule as run
    • On failure: log error, do not mark as run
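
The selection logic of a scheduler tick can be sketched in memory. In the project, find_due_schedules queries the database; here a slice stands in for the table, the `Schedule` struct keeps only the fields needed for matching, integer IDs replace UUIDs, and the active-job check is passed in as a closure. All names other than the column names are hypothetical.

```rust
pub struct Schedule {
    pub theme_id: u32,
    pub user_id: u32,
    pub enabled: bool,
    pub days: Vec<String>,
    pub time_utc: String,
}

/// Returns the schedules that should run for the given day code and HH:MM time,
/// skipping users who already have a generation in progress.
pub fn find_due<'a>(
    schedules: &'a [Schedule],
    day: &str,
    time: &str,
    has_active_job: impl Fn(u32) -> bool,
) -> Vec<&'a Schedule> {
    schedules
        .iter()
        .filter(|s| s.enabled)
        .filter(|s| s.days.iter().any(|d| d.as_str() == day))
        .filter(|s| s.time_utc == time)
        .filter(|s| !has_active_job(s.user_id))
        .collect()
}
```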

8. Configuration

Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| DATABASE_URL | Yes | - | PostgreSQL connection string |
| MASTER_ENCRYPTION_KEY | Yes | - | 64 hex chars (32 bytes) for AES-256-GCM |
| APP_URL | Yes | - | Public URL (CORS, magic links, cookies). No trailing slash. |
| PORT | No | 8080 | HTTP server port |
| RUST_LOG | No | - | Logging filter (e.g., "info,ai_synth_backend=debug") |
| STATIC_DIR | No | ../frontend/dist | Path to built SolidJS files |
| RESEND_API_KEY | Yes | - | Resend email service API key |
| EMAIL_FROM | Yes | - | Sender address for emails |
| TURNSTILE_SECRET_KEY | Yes | - | Cloudflare Turnstile server secret |
| TURNSTILE_SITE_KEY | Yes | - | Cloudflare Turnstile client key |
| POSTGRES_PASSWORD | Yes | - | Used by docker-compose for DB container |

Startup Validation

AppConfig::validate() checks at startup:

  • MASTER_ENCRYPTION_KEY is exactly 64 hex characters
  • APP_URL starts with http:// or https:// and has no trailing slash

The application refuses to start with invalid configuration.
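
The two startup checks can be sketched as follows. The struct is reduced to the two validated fields, the field names are assumptions, and errors are plain strings rather than the project's error type.

```rust
/// Reduced stand-in for AppConfig: only the fields that validate() inspects.
pub struct AppConfig {
    pub master_encryption_key: String,
    pub app_url: String,
}

impl AppConfig {
    /// Returns Err with a human-readable reason if the configuration is invalid.
    pub fn validate(&self) -> Result<(), String> {
        let key = &self.master_encryption_key;
        // 64 hex characters encode the 32-byte AES-256 key.
        if key.len() != 64 || !key.chars().all(|c| c.is_ascii_hexdigit()) {
            return Err("MASTER_ENCRYPTION_KEY must be exactly 64 hex characters".to_string());
        }
        let url = &self.app_url;
        if !(url.starts_with("http://") || url.starts_with("https://")) {
            return Err("APP_URL must start with http:// or https://".to_string());
        }
        if url.ends_with('/') {
            return Err("APP_URL must not end with a trailing slash".to_string());
        }
        Ok(())
    }
}
```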

User Settings Model

Default values applied when a user has no saved settings:

| Setting | Default | Range |
|---|---|---|
| max_articles_per_source | 3 | 1-10 |
| max_links_per_source | 8 | 1-30 |
| use_brave_search | false | boolean |
| article_history_days | 90 | 0-365 |
| batch_size | 5 | 1-20 |
| source_extraction_window | 3 | 1-10 |
| search_agent_behavior | "" | max 2000 chars |
| ai_provider | "" | max 100 chars |
| ai_model | "" | max 100 chars |
| ai_model_websearch | "" | max 100 chars |
| rate_limit_max_requests | null | >= 1 if set |
| rate_limit_time_window_seconds | null | >= 1 if set |