AI Weekly Synth -- Technical Specifications

1. Backend Tech Stack

Dependency	Version	Purpose
axum	0.8	Web framework (macros, multipart)
tokio	1	Async runtime (full features)
tower	0.5	Middleware composition
tower-http	0.6	CORS, static files, tracing, headers
sqlx	0.8	Async Postgres driver (runtime-tokio, tls-rustls, uuid, chrono, json, migrate)
reqwest	0.12	HTTP client (JSON)
serde / serde_json	1	Serialization/deserialization
chrono	0.4	Date/time handling (serde feature)
aes-gcm	0.10	AES-256-GCM encryption
zeroize	1	Secure memory zeroing
sha2	0.10	SHA-256 hashing
rand	0.8	Random number generation
base64	0.22	Base64 encoding
hex	0.4	Hex encoding/decoding
async-trait	0.1	Async trait objects
tracing / tracing-subscriber	0.1 / 0.3	Structured logging (env-filter, json)
dotenvy	0.15	.env file loading
clap	4	CLI argument parsing
scraper	0.22	HTML parsing (CSS selectors)
ego-tree	0.10	Tree data structure (used by scraper)
url	2	URL parsing and validation
email_address	0.2	Email validation
anyhow	1	Error context
thiserror	2	Error type derivation
uuid	1	UUID v4 generation (serde feature)
dashmap	6	Concurrent hash maps
tokio-stream	0.1	Stream utilities for SSE
futures	0.3	Async stream combinators
printpdf	0.7	PDF generation

Dev dependencies: tower (util), http-body-util, wiremock 0.6.

Rust edition: 2021.

2. Frontend Tech Stack

Dependency	Version	Purpose
solid-js	^1.9.0	Reactive UI framework
@solidjs/router	^0.15.0	Client-side routing
lucide-solid	^0.475.0	Icon library
date-fns	^4.1.0	Date formatting
tailwindcss	^4.1.0	Utility-first CSS (v4)
@tailwindcss/vite	^4.1.0	Tailwind Vite plugin
vite	^6.2.0	Build tool and dev server
vite-plugin-solid	^2.11.0	SolidJS Vite integration
typescript	~5.8.0	Type checking
vitest	^3.0.0	Unit testing
@solidjs/testing-library	^0.8.0	Component testing
jsdom	^25.0.0	DOM environment for tests

Frontend Routes

Path	Component	Auth	Description
/login	Login	Public	Login page
/register	Register	Public	Registration page
/auth/verify	AuthVerify	Public	Magic link verification
/	Home	Protected	Dashboard / synthesis list
/settings	Settings	Protected	User settings
/themes	ThemeManager	Protected	Theme CRUD + source management
/generate	GenerateSynthesis	Protected	Generation trigger + progress
/synthesis/:id	SynthesisDetail	Protected	Full synthesis view
/article-history	ArticleHistory	Protected	Article history browser
/llm-logs/:jobId	LlmLogs	Protected	LLM call log viewer
/admin/providers	AdminProviders	Admin	Provider configuration
/admin/rate-limits	AdminRateLimits	Admin	Rate limit configuration
/admin/users	AdminUsers	Admin	User management

3. Database Schema

3.1 `users`

Column	Type	Constraints
id	UUID	PK, DEFAULT gen_random_uuid()
email	TEXT	NOT NULL, UNIQUE
display_name	TEXT	nullable
role	TEXT	NOT NULL, DEFAULT 'user', CHECK (user/admin)
created_at	TIMESTAMPTZ	NOT NULL, DEFAULT now()
updated_at	TIMESTAMPTZ	NOT NULL, DEFAULT now()

Indexes: idx_users_email on (email).

3.2 `sessions`

Column	Type	Constraints
session_hash	TEXT	PK (SHA-256 of raw token)
user_id	UUID	NOT NULL, FK users(id) CASCADE
created_at	TIMESTAMPTZ	NOT NULL, DEFAULT now()
expires_at	TIMESTAMPTZ	NOT NULL
last_active_at	TIMESTAMPTZ	NOT NULL, DEFAULT now()
ip_address	TEXT	nullable
user_agent	TEXT	nullable

Indexes: idx_sessions_user_id, idx_sessions_expires_at.

3.3 `magic_tokens`

Column	Type	Constraints
id	UUID	PK, DEFAULT gen_random_uuid()
email	TEXT	NOT NULL
token_hash	TEXT	NOT NULL, UNIQUE
created_at	TIMESTAMPTZ	NOT NULL, DEFAULT now()
expires_at	TIMESTAMPTZ	NOT NULL
used	BOOLEAN	NOT NULL, DEFAULT false

Indexes: idx_magic_tokens_email, idx_magic_tokens_expires.

3.4 `settings`

Per-user pipeline configuration. One row per user (user_id is the PK).

Column	Type	Constraints
user_id	UUID	PK, FK users(id) CASCADE
max_articles_per_source	INTEGER	NOT NULL, DEFAULT 3
max_links_per_source	INTEGER	NOT NULL, DEFAULT 8
use_brave_search	BOOLEAN	NOT NULL, DEFAULT false
article_history_days	INTEGER	NOT NULL, DEFAULT 90
batch_size	INTEGER	NOT NULL, DEFAULT 5
source_extraction_window	INTEGER	NOT NULL, DEFAULT 3
search_agent_behavior	TEXT	NOT NULL, DEFAULT ''
ai_provider	TEXT	NOT NULL, DEFAULT ''
ai_model	TEXT	NOT NULL, DEFAULT ''
ai_model_websearch	TEXT	NOT NULL, DEFAULT ''
rate_limit_max_requests	INTEGER	nullable
rate_limit_time_window_seconds	INTEGER	nullable
updated_at	TIMESTAMPTZ	NOT NULL, DEFAULT now()

3.5 `themes`

Per-user topic configurations with content settings.

Column	Type	Constraints
id	UUID	PK, DEFAULT gen_random_uuid()
user_id	UUID	NOT NULL, FK users(id) CASCADE
name	TEXT	NOT NULL
theme	TEXT	NOT NULL (search topic)
categories	JSONB	NOT NULL, DEFAULT '[]'
max_items_per_category	INTEGER	NOT NULL, DEFAULT 4
max_age_days	INTEGER	NOT NULL, DEFAULT 7
summary_length	INTEGER	NOT NULL, DEFAULT 3
created_at	TIMESTAMPTZ	NOT NULL, DEFAULT now()
updated_at	TIMESTAMPTZ	NOT NULL, DEFAULT now()

Indexes: idx_themes_user_id.

categories stores user-defined categories only. At runtime, category assignment always adds the two system categories Divers (French for "miscellaneous") and Sans date ("undated").

3.6 `sources`

User-curated news source URLs, always tied to a theme.

Column	Type	Constraints
id	UUID	PK, DEFAULT gen_random_uuid()
user_id	UUID	NOT NULL, FK users(id) CASCADE
title	VARCHAR(200)	NOT NULL, CHECK length 1-200
url	VARCHAR(1000)	NOT NULL, CHECK length <= 1000
theme_id	UUID	NOT NULL, FK themes(id) CASCADE
is_preferred	BOOLEAN	NOT NULL, DEFAULT false
created_at	TIMESTAMPTZ	NOT NULL, DEFAULT now()

Indexes: idx_sources_user_id, UNIQUE idx_sources_user_id_url on (user_id, url).

3.7 `syntheses`

Generated synthesis results with JSONB section data.

Column	Type	Constraints
id	UUID	PK, DEFAULT gen_random_uuid()
user_id	UUID	NOT NULL, FK users(id) CASCADE
week	VARCHAR(10)	NOT NULL (ISO week string)
sections	JSONB	NOT NULL, DEFAULT '[]'
status	VARCHAR(20)	NOT NULL, DEFAULT 'completed'
job_id	UUID	nullable
theme_id	UUID	nullable, FK themes(id) SET NULL
created_at	TIMESTAMPTZ	NOT NULL, DEFAULT now()

Indexes: idx_syntheses_user_id_created_at on (user_id, created_at DESC).

JSONB structure for sections:

[
  {
    "title": "Category Name",
    "items": [
      { "title": "Article Title", "url": "https://...", "summary": "...", "date": "2026-03-25" }
    ]
  }
]

3.8 `theme_schedules`

Automated generation schedules, one per theme.

Column	Type	Constraints
id	UUID	PK, DEFAULT gen_random_uuid()
theme_id	UUID	NOT NULL, UNIQUE, FK themes(id) CASCADE
user_id	UUID	NOT NULL, FK users(id) CASCADE
enabled	BOOLEAN	NOT NULL, DEFAULT true
days	JSONB	NOT NULL, DEFAULT '[]' (e.g. ["mon","fri"])
time_utc	TEXT	NOT NULL, DEFAULT '08:00' (HH:MM)
emails	JSONB	NOT NULL, DEFAULT '[]' (up to 3 addresses)
last_run_at	TIMESTAMPTZ	nullable
created_at	TIMESTAMPTZ	NOT NULL, DEFAULT now()
updated_at	TIMESTAMPTZ	NOT NULL, DEFAULT now()

Indexes: idx_theme_schedules_enabled (partial, WHERE enabled = true).

3.9 `article_history`

Article URL deduplication and full provenance tracing.

Column	Type	Constraints
id	UUID	PK, DEFAULT gen_random_uuid()
user_id	UUID	NOT NULL, FK users(id) CASCADE
url_hash	TEXT	NOT NULL (SHA-256 of normalized URL)
url	TEXT	NOT NULL
title	TEXT	NOT NULL, DEFAULT ''
source_type	TEXT	NOT NULL, DEFAULT 'unknown'
source_url	TEXT	nullable
category	TEXT	nullable
synthesis_id	UUID	nullable, FK syntheses(id) SET NULL
status	TEXT	NOT NULL, DEFAULT 'used'
scraped_ok	BOOLEAN	NOT NULL, DEFAULT true
job_id	UUID	NOT NULL
published_date	TEXT	nullable
created_at	TIMESTAMPTZ	NOT NULL, DEFAULT now()

Indexes: idx_article_history_user_url on (user_id, url_hash), idx_article_history_job_id.

Status values: used, filtered_history, filtered_diversity, filtered_not_article, filtered_too_old, filtered_empty, filtered_homepage, filtered_cross_phase_dedup.

Source type values: personalized_source, brave_search, web_search.

3.10 `llm_call_log`

Full LLM interaction logging for debugging and analysis.

Column	Type	Constraints
id	UUID	PK, DEFAULT gen_random_uuid()
user_id	UUID	NOT NULL, FK users(id) CASCADE
job_id	UUID	NOT NULL
call_type	TEXT	NOT NULL
model	TEXT	NOT NULL
system_prompt	TEXT	NOT NULL, DEFAULT ''
user_prompt	TEXT	NOT NULL, DEFAULT ''
response_body	TEXT	NOT NULL, DEFAULT ''
duration_ms	INTEGER	NOT NULL, DEFAULT 0
article_url	TEXT	nullable
created_at	TIMESTAMPTZ	NOT NULL, DEFAULT now()

Indexes: idx_llm_call_log_job_id, idx_llm_call_log_user_id on (user_id, created_at).

3.11 `admin_providers`

Admin-curated catalog of LLM providers and their models.

Column	Type	Constraints
id	UUID	PK, DEFAULT gen_random_uuid()
provider_name	VARCHAR(50)	NOT NULL, UNIQUE
display_name	VARCHAR(100)	NOT NULL
models_scraping	JSONB	NOT NULL, DEFAULT '[]'
models_websearch	JSONB	NOT NULL, DEFAULT '[]'
is_enabled	BOOLEAN	NOT NULL, DEFAULT true
created_at	TIMESTAMPTZ	NOT NULL, DEFAULT now()
updated_at	TIMESTAMPTZ	NOT NULL, DEFAULT now()

Indexes: idx_admin_providers_enabled (partial, WHERE is_enabled = true).

Seeded with: gemini, openai, anthropic.

JSONB model structure:

[{"model_id": "gemini-2.5-pro", "display_name": "Gemini 2.5 Pro", "is_default": true}]

3.12 `admin_rate_limits`

Per-provider rate limit configuration.

Column	Type	Constraints
id	UUID	PK, DEFAULT gen_random_uuid()
provider_name	VARCHAR(50)	NOT NULL, UNIQUE, FK admin_providers(provider_name) CASCADE
max_requests	INTEGER	NOT NULL, DEFAULT 30
time_window_seconds	INTEGER	NOT NULL, DEFAULT 60
updated_at	TIMESTAMPTZ	NOT NULL, DEFAULT now()

Seeded defaults: gemini 29/60s, openai 50/60s, anthropic 40/60s.

3.13 `user_api_keys`

Encrypted user LLM API keys.

Column	Type	Constraints
id	UUID	PK, DEFAULT gen_random_uuid()
user_id	UUID	NOT NULL, FK users(id) CASCADE
provider_name	VARCHAR(50)	NOT NULL
encrypted_key	BYTEA	NOT NULL
nonce	BYTEA	NOT NULL
key_prefix	VARCHAR(20)	NOT NULL
created_at	TIMESTAMPTZ	NOT NULL, DEFAULT now()
updated_at	TIMESTAMPTZ	NOT NULL, DEFAULT now()

Constraint: UNIQUE(user_id, provider_name). Valid providers: gemini, openai, anthropic, brave_search.

3.14 `audit_log`

Admin mutation audit trail.

Column	Type	Constraints
id	UUID	PK, DEFAULT gen_random_uuid()
admin_user_id	UUID	nullable, FK users(id) SET NULL
action	VARCHAR(100)	NOT NULL
target_type	VARCHAR(50)	nullable
target_id	VARCHAR(255)	nullable
details	JSONB	nullable
created_at	TIMESTAMPTZ	NOT NULL, DEFAULT now()

Indexes: idx_audit_log_created_at (DESC), idx_audit_log_admin_user.

4. API Endpoints

All endpoints are prefixed with /api/v1. Responses are JSON. Errors follow the shape { "error": "message" }.

4.1 Authentication

POST /auth/register

Auth: Public
Body: { email: string, display_name?: string, turnstile_token: string }
Response: { message: string }
Sends magic link email. Rate limited.

POST /auth/login

Auth: Public
Body: { email: string, turnstile_token: string }
Response: { message: string }
Sends magic link email. Rate limited.

GET /auth/verify?token=...&email=...

Auth: Public
Response: Redirect to frontend with session cookie set.

POST /auth/verify

Auth: Public
Body: { token: string, email: string }
Response: { message: string, user: User }
Sets session HttpOnly cookie (30-day expiry).

POST /auth/logout

Auth: Authenticated
Response: { message: string }
Clears session cookie and deletes DB session.

GET /auth/me

Auth: Authenticated
Response: { id, email, display_name, role, created_at }

4.2 Settings

GET /settings

Auth: Authenticated
Response: UserSettings (creates defaults if not exists)

PUT /settings

Auth: Authenticated
Body: UpdateSettingsRequest (all fields required)
Validation: max_articles_per_source 1-10, max_links_per_source 1-30, batch_size 1-20, source_extraction_window 1-10, article_history_days 0-365, search_agent_behavior max 2000 chars, ai_provider/ai_model/ai_model_websearch max 100 chars.
Response: Updated UserSettings
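
The range checks above can be sketched as a small validator. This is an illustration only: SettingsInput and validate_settings are hypothetical names, only a subset of the fields is shown, and counting characters as Unicode scalar values is an assumption.

```rust
/// Hypothetical subset of the fields carried by UpdateSettingsRequest.
pub struct SettingsInput {
    pub max_articles_per_source: i32,
    pub max_links_per_source: i32,
    pub batch_size: i32,
    pub source_extraction_window: i32,
    pub article_history_days: i32,
    pub search_agent_behavior: String,
}

/// Returns the name of the first field that violates its documented range.
pub fn validate_settings(s: &SettingsInput) -> Result<(), &'static str> {
    // (field name, value, inclusive minimum, inclusive maximum)
    let checks = [
        ("max_articles_per_source", s.max_articles_per_source, 1, 10),
        ("max_links_per_source", s.max_links_per_source, 1, 30),
        ("batch_size", s.batch_size, 1, 20),
        ("source_extraction_window", s.source_extraction_window, 1, 10),
        ("article_history_days", s.article_history_days, 0, 365),
    ];
    for &(name, value, min, max) in checks.iter() {
        if value < min || value > max {
            return Err(name);
        }
    }
    // "max 2000 chars": counted here as Unicode scalar values (assumption).
    if s.search_agent_behavior.chars().count() > 2000 {
        return Err("search_agent_behavior");
    }
    Ok(())
}
```

A handler would typically map the returned field name into the { "error": "message" } response shape.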

4.3 Themes

GET /themes

Auth: Authenticated
Response: ThemeResponse[]

POST /themes

Auth: Authenticated
Body: { name, theme, categories: string[], max_items_per_category?, max_age_days?, summary_length? }
Validation: name non-empty max 200 chars, categories 0-20 non-empty entries, max_items 1-50, max_age 1-365, summary_length 1-3.
Notes: theme creation is valid with an empty user-defined categories list. The system always includes Divers and Sans date.
Response: ThemeResponse

PUT /themes/{id}

Auth: Authenticated (owner only)
Body: UpdateThemeRequest (all fields optional)
Response: ThemeResponse

DELETE /themes/{id}

Auth: Authenticated (owner only)
Response: 204 No Content

4.4 Schedules

GET /themes/{id}/schedule

Auth: Authenticated (theme owner)
Response: ScheduleResponse, or null when the theme has no schedule (HTTP 200 in both cases)

PUT /themes/{id}/schedule

Auth: Authenticated (theme owner)
Body: { enabled, days: string[], time_utc: "HH:MM", emails: string[] }
Validation: days from mon-sun, time HH:MM format, max 3 emails.
Response: ScheduleResponse
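
A minimal sketch of the schedule validation rules, using only the standard library. The function names are hypothetical, and the day codes other than "mon" and "fri" are assumed to follow the same three-letter pattern. Email syntax checking (handled by the email_address crate in the stack) is left out.

```rust
/// Accepted day codes. The spec only shows "mon" and "fri" explicitly;
/// the other three-letter codes are an assumption.
const VALID_DAYS: [&str; 7] = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"];
const MAX_EMAILS: usize = 3;

/// True if `s` is a 24-hour "HH:MM" string such as "08:00".
pub fn is_valid_time_utc(s: &str) -> bool {
    let b = s.as_bytes();
    if b.len() != 5 || b[2] != b':' {
        return false;
    }
    let digits = [b[0], b[1], b[3], b[4]];
    if !digits.iter().all(|d| d.is_ascii_digit()) {
        return false;
    }
    let hour = (b[0] - b'0') * 10 + (b[1] - b'0');
    let minute = (b[3] - b'0') * 10 + (b[4] - b'0');
    hour < 24 && minute < 60
}

/// Applies the three documented rules: known day codes, HH:MM time, at most 3 emails.
pub fn validate_schedule(days: &[&str], time_utc: &str, emails: &[&str]) -> Result<(), String> {
    for day in days {
        if !VALID_DAYS.contains(day) {
            return Err(format!("invalid day code: {}", day));
        }
    }
    if !is_valid_time_utc(time_utc) {
        return Err(format!("time_utc must be HH:MM, got: {}", time_utc));
    }
    if emails.len() > MAX_EMAILS {
        return Err(format!("at most {} emails allowed", MAX_EMAILS));
    }
    Ok(())
}
```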

DELETE /themes/{id}/schedule

Auth: Authenticated (theme owner)
Response: 204 No Content

4.5 Sources

GET /sources?theme_id=...

Auth: Authenticated
Query: theme_id is required
Response: SourceResponse[]

POST /sources

Auth: Authenticated
Body: { title, url, theme_id }
Validation: title non-empty max 200, URL http(s) max 1000 chars.
Response: SourceResponse

PUT /sources/preferred

Auth: Authenticated
Body: { theme_id: UUID, source_ids: UUID[] }
Note: preferred state is scoped per theme.
Response: { updated: number }

DELETE /sources/{id}

Auth: Authenticated (owner only)
Response: 204 No Content

POST /sources/bulk

Auth: Authenticated
Body: { sources: CreateSourceRequest[], theme_id: UUID }
Response: { imported, skipped, errors }

POST /sources/import-csv

Auth: Authenticated
Body: Multipart file upload (CSV: title,url) + required theme_id
Response: { imported, skipped, errors }

GET /sources/export-csv

Auth: Authenticated
Query: theme_id is required
Scope: exports sources for the selected theme only
Response: CSV file download

4.6 Generation

POST /syntheses/generate

Auth: Authenticated
Body: { theme_id: UUID }
Response: { job_id: UUID }
Creates a job in the JobStore and spawns a background generation task. Returns 409 if the user already has an active job.

GET /syntheses/generate/{job_id}/progress

Auth: Authenticated (job owner)
Response: SSE stream of ProgressEvent
Events: progress (step, message, percent), complete (synthesis_id), error (message).

POST /syntheses/generate/{job_id}/stop

Auth: Authenticated (job owner)
Response: { message: string }
Sets cooperative cancellation flag.

4.7 Syntheses

GET /syntheses

Auth: Authenticated
Response: SynthesisListItem[] (with section summaries, theme info)

GET /syntheses/{id}

Auth: Authenticated (owner only)
Response: SynthesisResponse (full sections data)

DELETE /syntheses/{id}

Auth: Authenticated (owner only)
Response: 204 No Content

POST /syntheses/{id}/send-email

Auth: Authenticated
Body: { email: string }
Response: { message: string }

GET /syntheses/{id}/export/markdown

Auth: Authenticated
Response: Markdown file download

GET /syntheses/{id}/export/pdf

Auth: Authenticated
Response: PDF file download

4.8 Article History & Provenance

GET /article-history?limit=&offset=&job_id=&status=

Auth: Authenticated
Response: { items: ArticleHistoryEntry[], total: number }

DELETE /article-history

Auth: Authenticated
Response: { deleted: number }

GET /syntheses/{id}/provenance

Auth: Authenticated
Response: ArticleHistoryEntry[] (articles with status "used" for this synthesis's job_id)

4.9 LLM Call Logs

GET /llm-logs/{job_id}

Auth: Authenticated
Response: LlmCallLogEntry[]

4.10 User API Keys

GET /user/api-keys

Auth: Authenticated
Response: ApiKeyResponse[] (id, provider_name, key_prefix, timestamps; never the full key)

POST /user/api-keys

Auth: Authenticated
Body: { provider_name, api_key }
Validation: provider in (gemini, openai, anthropic, brave_search), key 8-500 chars.
Response: ApiKeyResponse
Encrypts key with AES-256-GCM before storage; upserts (one key per user per provider).

DELETE /user/api-keys/{provider}

Auth: Authenticated
Response: 204 No Content

POST /user/api-keys/{provider}/test

Auth: Authenticated
Response: { success: boolean, message: string }
Decrypts key, calls provider test endpoint.

POST /user/api-keys/export

Auth: Authenticated
Response: { keys: [{ provider_name, api_key }] }
Decrypts and returns all keys (used for backup/migration).

4.11 Public Configuration

GET /config/providers

Auth: Authenticated
Response: ProviderConfigResponse[] (enabled providers with model lists for scraping and websearch)

4.12 Admin Endpoints

All admin endpoints require the AdminUser extractor (role = admin). The one exception listed in this section is GET /health, which is public.

GET /admin/providers

Response: AdminProviderResponse[]

POST /admin/providers

Body: CreateProviderRequest
Validation: provider_name in (gemini, openai, anthropic), at least one model per list, at most one default per list.
Response: AdminProviderResponse

PUT /admin/providers/{id}

Body: UpdateProviderRequest (all fields optional)
Response: AdminProviderResponse

DELETE /admin/providers/{id}

Response: 204 No Content

GET /admin/rate-limits

Response: RateLimitResponse[]

PUT /admin/rate-limits/{provider_name}

Body: { max_requests: 1-1000, time_window_seconds: 1-3600 }
Response: RateLimitResponse
Hot-reloads the in-memory provider rate limiter.

GET /admin/users

Response: AdminUserResponse[]

PUT /admin/users/{id}/role

Body: { role: "user" | "admin" }
Response: { message: string }

GET /health

Auth: Public
Response: { status: "ok" }

5. Generation Pipeline — Full Algorithm

Startup & Background Tasks

Session cleanup: an hourly background task deletes expired DB sessions (db::sessions::delete_expired).
Job store TTL: expired job entries (older than 1 hour) are cleaned up via JobStore::cleanup_expired.

Generation Lifecycle

POST /api/v1/syntheses/generate creates a job in the JobStore, then spawns two nested tasks:

Inner task: wraps run_generation in a 15-minute tokio::time::timeout. If the timeout fires, sends an Error progress event and releases the user lock.
Outer task: monitors the inner task's JoinHandle for panics. If the inner task panics, sends an Error progress event and releases the user lock.

Progress is streamed to clients via a tokio::sync::watch channel (SSE endpoint subscribes to it).
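
The inner/outer supervision idea can be illustrated without tokio. The sketch below is a standard-library analogue, not the application's code: a worker thread stands in for the inner task, and a channel receive with a deadline stands in for both tokio::time::timeout and the JoinHandle panic check. Outcome and run_supervised are hypothetical names.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// What the supervisor reports back (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Outcome<T> {
    Complete(T),
    TimedOut,
    Panicked,
}

/// Runs `job` on a worker thread and waits at most `limit` for its result.
pub fn run_supervised<T, F>(job: F, limit: Duration) -> Outcome<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    // Inner worker: runs the job and reports its result.
    thread::spawn(move || {
        let result = job();
        // The supervisor may already have given up after a timeout; ignore that.
        let _ = tx.send(result);
    });
    // Outer supervisor: distinguishes success, timeout, and a dead worker.
    match rx.recv_timeout(limit) {
        Ok(value) => Outcome::Complete(value),
        // Unlike a cancelled future, a thread cannot be stopped; it is simply detached.
        Err(RecvTimeoutError::Timeout) => Outcome::TimedOut,
        // The sender was dropped without sending, which happens when the job panicked.
        Err(RecvTimeoutError::Disconnected) => Outcome::Panicked,
    }
}
```

In the described design, both failure branches would send an Error progress event and release the user lock.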

Initialization

Load user settings from DB (provider, models, batch_size, rate limits, etc.)
Cleanup — delete old article history entries (>N days, dropped only) + truncate old LLM call logs
Validate — runtime category set always includes Divers and Sans date even when no user-defined categories are configured.
Load theme — categories, max_items_per_category, max_age_days, summary_length
Load user sources (personalized URLs filtered by theme_id)
Resolve LLM provider — decrypt user's API key, create provider instance (Arc<dyn LlmProvider>)
Resolve models — research model + web-search model (user override or admin default)
Setup rate limiter — per-user or global provider limiter
Initialize tracking structures:
- article_scraped: category → articles
- source_counts: per-domain article count
- url_source: per-article source
- filled_counts: per-category article count
- seen_urls: cross-phase dedup
- classification_categories: user categories + Divers (Sans date is assigned by no-date routing)
Batch trace buffer — pending_traces: Vec<ArticleHistoryEntry> accumulates all article history writes; flushed with db::article_history::batch_insert_entries at phase boundaries.

Phase 1: Personalized Sources

Skipped entirely if user has 0 sources.

1a. Windowed source extraction

Query article_history for the last source used. Reorder sources so the first source follows the last one used (rolling window).
Separate preferred sources (processed first) from non-preferred, preserving rotation order within each group.
Process sources in waves of source_extraction_window size:
- For each source in the wave: fetch page HTML, extract up to max_links_per_source article URLs via HTML parsing (same-domain, non-homepage, no static assets).
- SSRF check performed on each source URL before fetching.
- Deduplicate candidate URLs (case-insensitive, cross-source via seen_urls).
- Filter against article history — hash each URL (normalized: lowercase, strip fragments/UTM params/trailing slashes), batch-query article_history → remove matches. Trace dropped articles as status: filtered_history.
- Preferred-first shuffle — shuffle preferred URLs separately from non-preferred, then concatenate (preferred first).
- Track url → source in url_source.
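
The rotation, preferred-first ordering, and wave splitting described above might look like the following sketch. Source is reduced to two fields, and order_sources and waves are hypothetical helper names. How the last used source is identified (here, by URL) is an assumption.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Source {
    pub url: String,
    pub is_preferred: bool,
}

/// Rotates the list so the source after `last_used_url` comes first,
/// then moves preferred sources to the front while keeping rotation order.
pub fn order_sources(sources: &[Source], last_used_url: Option<&str>) -> Vec<Source> {
    let mut ordered = sources.to_vec();
    if let Some(last) = last_used_url {
        if let Some(pos) = ordered.iter().position(|s| s.url == last) {
            let len = ordered.len();
            // (pos + 1) % len is always < len, so rotate_left cannot panic.
            ordered.rotate_left((pos + 1) % len);
        }
    }
    // partition keeps the relative order inside each group.
    let (preferred, others): (Vec<Source>, Vec<Source>) =
        ordered.into_iter().partition(|s| s.is_preferred);
    preferred.into_iter().chain(others).collect()
}

/// Splits the ordered sources into waves of `window` sources (at least 1).
pub fn waves(ordered: &[Source], window: usize) -> Vec<Vec<Source>> {
    ordered.chunks(window.max(1)).map(|c| c.to_vec()).collect()
}
```

For sources a, b (preferred), c, d with b used last, the rotation gives c, d, a, b and the preferred-first pass gives b, c, d, a. With source_extraction_window = 3 this yields two waves: [b, c, d] and [a].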

1b. Scrape, classify, and summarize articles (batched)

Processing in batches of settings.batch_size (minimum 1). For each batch:

Batch assembly: Pull up to batch_size candidates, skipping any where source_counts[domain] >= max_articles_per_source (traced as filtered_diversity).
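
A possible shape for batch assembly, as a sketch. Candidate and assemble_batch are hypothetical names, and the domain is assumed to have been extracted from the URL beforehand. Candidates rejected for source diversity are returned separately so the caller can trace them as filtered_diversity.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical candidate record; `domain` is assumed to be precomputed.
pub struct Candidate {
    pub url: String,
    pub domain: String,
}

/// Pulls up to `batch_size` candidates from the queue.
/// Returns (batch, dropped), where `dropped` exceeded the per-source cap.
pub fn assemble_batch(
    queue: &mut VecDeque<Candidate>,
    source_counts: &HashMap<String, usize>,
    max_articles_per_source: usize,
    batch_size: usize,
) -> (Vec<Candidate>, Vec<Candidate>) {
    let mut batch = Vec::new();
    let mut dropped = Vec::new();
    // batch_size has a documented minimum of 1.
    while batch.len() < batch_size.max(1) {
        let candidate = match queue.pop_front() {
            Some(c) => c,
            None => break,
        };
        let count = source_counts.get(&candidate.domain).copied().unwrap_or(0);
        if count >= max_articles_per_source {
            dropped.push(candidate);
        } else {
            batch.push(candidate);
        }
    }
    (batch, dropped)
}
```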

Phase A — Scrape batch in parallel (JoinSet):

SSRF check (no private IPs), 15s timeout, 5MB body limit.
HTML parsing for title (<title>, og:title), date (meta tags, JSON-LD, <time>), body (strip scripts/nav), soft-404 detection.
If article body is empty, is a soft-404, or is too old: trace as filtered_empty / filtered_too_old and skip.

Phase B — Classify/summarize batch in parallel (JoinSet):

Check rate limit before classifying (waits up to 60s, then errors).
Send article (title + body snippet based on summary_length: 500/2000/4000 chars) + categories + "Divers" to LLM.
LLM returns {title, summary, category, date, is_article}.
is_article check: if false, trace as filtered_not_article and skip.
Date fallback: if the LLM returned a date and the article is older than max_age_days, trace as filtered_too_old and skip.
No-date routing: if no date found (neither scraper nor LLM), route to Sans date category.
assign_category() helper: validates category, falls back to "Divers" if unknown or full. If "Divers" is also full, drops the article.
LLM call logged with full prompt/response/timing.
Add article to article_scraped, increment filled_counts and source_counts.
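
The fallback behaviour of assign_category() can be sketched as below. The signature is an assumption (the spec only names the helper). The article is kept in the LLM's category when that category is known and has room, falls back to "Divers" otherwise, and is dropped (None) when "Divers" is full too.

```rust
use std::collections::HashMap;

const FALLBACK: &str = "Divers";

/// Returns the category the article should go into, or None to drop it.
/// `filled` maps a category name to the number of articles already assigned.
pub fn assign_category(
    llm_category: &str,
    valid: &[&str],
    filled: &HashMap<String, usize>,
    max_items_per_category: usize,
) -> Option<String> {
    let has_room =
        |name: &str| filled.get(name).copied().unwrap_or(0) < max_items_per_category;
    if valid.contains(&llm_category) && has_room(llm_category) {
        return Some(llm_category.to_string());
    }
    if has_room(FALLBACK) {
        Some(FALLBACK.to_string())
    } else {
        None
    }
}
```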

Early exit: After each batch, if total articles ≥ (num_categories + 1) × max_items_per_category, stop.
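
As a worked example of the early-exit threshold: with 3 user-defined categories and max_items_per_category = 4, generation stops once (3 + 1) × 4 = 16 articles are collected. The extra slot presumably accounts for Divers. The function names below are hypothetical.

```rust
/// Capacity used by the early-exit check: one extra category on top of the user-defined ones.
pub fn synthesis_capacity(num_categories: usize, max_items_per_category: usize) -> usize {
    (num_categories + 1) * max_items_per_category
}

/// True once the number of collected articles reaches the capacity.
pub fn is_synthesis_full(
    total_articles: usize,
    num_categories: usize,
    max_items_per_category: usize,
) -> bool {
    total_articles >= synthesis_capacity(num_categories, max_items_per_category)
}
```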

Wave check: After each wave, if synthesis is full, skip remaining waves.

Trace flush: Pending traces batch-inserted into article_history between waves.

Phase 2: Web Search Fallback

Skipped if all user-defined categories are already filled.

2a. Compute category gaps

For each user category: needed = max_items_per_category - already_filled. Only proceed if any category needs more.
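
The gap computation can be sketched as follows. category_gaps is a hypothetical name, and an empty result corresponds to skipping Phase 2.

```rust
use std::collections::HashMap;

/// For each user category, how many more articles are needed.
/// Categories that are already full are omitted from the result.
pub fn category_gaps(
    user_categories: &[&str],
    filled_counts: &HashMap<String, usize>,
    max_items_per_category: usize,
) -> Vec<(String, usize)> {
    user_categories
        .iter()
        .filter_map(|name| {
            let already_filled = filled_counts.get(*name).copied().unwrap_or(0);
            // saturating_sub guards against counts above the maximum.
            let needed = max_items_per_category.saturating_sub(already_filled);
            if needed > 0 {
                Some(((*name).to_string(), needed))
            } else {
                None
            }
        })
        .collect()
}
```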

2b. Choose path: Brave Search or LLM web search

Selected by settings.use_brave_search.

Path A: Brave Search (`use_brave_search = true`)

Resolve and decrypt the user's Brave Search API key (error if not configured).
Query: "{theme} actualites", up to 20 results, freshness mapped from max_age_days (pd/pw/pm/py).
Filter results through filter_phase2_url(): homepage filter → cross-phase dedup → article history → source diversity.
Batch scrape + classify (same as Phase 1b, source_type = "brave_search").

Path B: LLM Web Search (`use_brave_search = false`)

Build search prompt with theme, categories, gap counts.
Call LLM with model_websearch. Returns {category_0: [{title, url, summary}], ...}.
Filter URLs through filter_phase2_url().
Scrape each result sequentially. Keep LLM-provided title/summary (no re-classification).
source_type = "web_search".

Save + Record

Error if empty — if all article lists are empty and generation wasn't cancelled, return error.
Order sections — user-defined categories first (in order), then Divers if non-empty, then Sans date if non-empty.
Sanitize — strip \u0000 null bytes from JSON (PostgreSQL JSONB requirement).
Save synthesis — insert into syntheses table with job_id, week (ISO week), sections (JSONB), status: completed, theme_id.
Record used articles — for each article in the final synthesis, build trace with status: "used", synthesis_id, and correct source_type (inferred from url_source). Batch-insert into article_history.
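
The ordering and sanitizing steps might be sketched like this. Section is simplified to a title plus a list of article URLs, and both function names are hypothetical. Two assumptions are made: user-defined categories are emitted even when empty, and null bytes are removed from string values before serialization.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub struct Section {
    pub title: String,
    pub items: Vec<String>,
}

/// User-defined categories first (in their configured order),
/// then "Divers" and "Sans date", each only if non-empty.
pub fn order_sections(
    user_categories: &[&str],
    mut by_category: HashMap<String, Vec<String>>,
) -> Vec<Section> {
    let mut sections = Vec::new();
    for name in user_categories {
        // Assumption: a user category with no articles still yields a section.
        let items = by_category.remove(*name).unwrap_or_default();
        sections.push(Section { title: (*name).to_string(), items });
    }
    for name in ["Divers", "Sans date"].iter() {
        if let Some(items) = by_category.remove(*name) {
            if !items.is_empty() {
                sections.push(Section { title: (*name).to_string(), items });
            }
        }
    }
    sections
}

/// Removes NUL characters, which PostgreSQL JSONB cannot store.
pub fn strip_nul(text: &str) -> String {
    text.replace('\u{0}', "")
}
```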

Shared Helpers

build_trace_entry() — constructs an ArticleHistoryEntry from an ArticleTrace struct. Never writes to DB directly; caller accumulates in pending_traces.
scrape_and_classify_batch() — shared batch processing logic used by Phase 1 and Phase 2 Brave paths.
assign_category() — validates LLM-returned category, falls back to "Divers", drops if all full.
filter_phase2_url() — async helper applying homepage/dedup/history/diversity filters for Phase 2.
scrape_single_article() — thin wrapper around scraper::scrape_url returning (body_text, page_title, final_url, drop_reason).
hash_article_url() — normalizes URL (strips fragments, UTM params, trailing slashes, lowercases) then SHA-256 hashes.
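
The normalization half of hash_article_url() can be approximated with the standard library alone. This is a sketch under assumptions: normalize_article_url is a hypothetical name, "UTM params" is taken to mean any query parameter whose name starts with utm_, and the whole URL is lowercased. The SHA-256 step (via the sha2 crate) is omitted.

```rust
/// Normalizes a URL so that trivially different spellings compare equal.
pub fn normalize_article_url(url: &str) -> String {
    let lowered = url.trim().to_lowercase();
    // Drop the fragment ("#...").
    let without_fragment = lowered.split('#').next().unwrap_or("");
    // Separate the query string from the rest.
    let (base, query) = match without_fragment.find('?') {
        Some(i) => (&without_fragment[..i], &without_fragment[i + 1..]),
        None => (without_fragment, ""),
    };
    // Keep every query parameter except the utm_* tracking ones.
    let kept: Vec<&str> = query
        .split('&')
        .filter(|p| !p.is_empty() && !p.starts_with("utm_"))
        .collect();
    // Strip trailing slashes from the path part.
    let mut out = base.trim_end_matches('/').to_string();
    if !kept.is_empty() {
        out.push('?');
        out.push_str(&kept.join("&"));
    }
    out
}
```

Two URLs that differ only in case, fragment, tracking parameters, or a trailing slash then normalize to the same string and therefore to the same hash.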

6. LLM Provider Abstraction

Trait Definition

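use async_trait::async_trait;
use serde_json::Value;

/// Common interface implemented by every LLM backend (AppError is the application's error type).
#[async_trait]
pub trait LlmProvider: Send + Sync {
    /// Identifier of the provider behind this instance.
    fn provider_id(&self) -> &str;
    /// Sends one system/user prompt pair and returns JSON shaped by `response_schema`.
    async fn call_llm(&self, model: &str, system_prompt: &str,
                       user_prompt: &str, response_schema: &Value)
        -> Result<Value, AppError>;
}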

All calls use structured JSON output (response_schema defines the expected shape).

Implementations

Provider	Module	API Endpoint	Auth Method
Google Gemini	`llm/gemini.rs`	`generativelanguage.googleapis.com`	Query param `?key=`
OpenAI	`llm/openai.rs`	`api.openai.com/v1/chat/completions`	Bearer token
Anthropic	`llm/anthropic.rs`	`api.anthropic.com/v1/messages`	`x-api-key` header
Mock	`llm/mock.rs`	N/A (in-memory)	N/A

Factory

llm/factory.rs provides create_provider(provider_name, api_key, http_client) -> Arc<dyn LlmProvider>. Matches on provider name string.

Response Schema

llm/schema.rs builds JSON Schema definitions for:

Classification/summarization: {title, summary, category, date, is_article}
Web search: {category_0: [{title, url, summary}], ...} with per-category arrays
Source link extraction: handled via heuristic HTML parsing (no LLM schema).

Error Mapping

map_provider_http_error() translates HTTP status codes to AppError variants:

400 -> BadRequest
401/403 -> BadRequest (invalid key)
404 -> BadRequest (model not found)
429/529 -> RateLimited
Other -> Internal
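
The mapping table translates directly into a match. In this sketch the AppError variants carry payloads chosen for illustration, and the function signature is an assumption. Only the status-to-variant mapping comes from the table above.

```rust
/// Simplified stand-in for the application's error type (payloads are assumed).
#[derive(Debug, PartialEq)]
pub enum AppError {
    BadRequest(String),
    RateLimited,
    Internal(String),
}

/// Translates a provider HTTP status code into an AppError variant.
pub fn map_provider_http_error(status: u16, body: &str) -> AppError {
    match status {
        400 => AppError::BadRequest(body.to_string()),
        401 | 403 => AppError::BadRequest("invalid API key".to_string()),
        404 => AppError::BadRequest("model not found".to_string()),
        429 | 529 => AppError::RateLimited,
        other => AppError::Internal(format!("provider returned HTTP {}: {}", other, body)),
    }
}
```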

7. Background Tasks

Session Cleanup

Runs hourly via tokio::spawn. Calls db::sessions::delete_expired to remove sessions past their expires_at timestamp.

Job Store Cleanup

JobStore::cleanup_expired removes job entries older than 1 hour (the TTL constant). Called periodically. Releases user locks for expired jobs.

Scheduler

Runs every minute via tokio::spawn with a 60-second interval. For each tick:

current_day_code() -> "mon" through "sun"
find_due_schedules(pool, day, time) -> queries enabled schedules matching current day and time (HH:MM)
For each due schedule:
- Skip if job_store.has_active_job(user_id) returns Some (manual generation in progress)
- Create a temporary watch::channel and AtomicBool
- Call synthesis::run_generation_inner directly (bypasses job store)
- On success: send emails to configured recipients (up to 3), mark schedule as run
- On failure: log error, do not mark as run

8. Configuration

Environment Variables

Variable	Required	Default	Description
DATABASE_URL	Yes	-	PostgreSQL connection string
MASTER_ENCRYPTION_KEY	Yes	-	64 hex chars (32 bytes) for AES-256-GCM
APP_URL	Yes	-	Public URL (CORS, magic links, cookies). No trailing slash.
PORT	No	8080	HTTP server port
RUST_LOG	No	-	Logging filter (e.g., "info,ai_synth_backend=debug")
STATIC_DIR	No	../frontend/dist	Path to built SolidJS files
RESEND_API_KEY	Yes	-	Resend email service API key
EMAIL_FROM	Yes	-	Sender address for emails
TURNSTILE_SECRET_KEY	Yes	-	Cloudflare Turnstile server secret
TURNSTILE_SITE_KEY	Yes	-	Cloudflare Turnstile client key
POSTGRES_PASSWORD	Yes	-	Used by docker-compose for DB container

Startup Validation

AppConfig::validate() checks at startup:

MASTER_ENCRYPTION_KEY is exactly 64 hex characters
APP_URL starts with http:// or https:// and has no trailing slash

The application refuses to start with invalid configuration.
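
The two startup checks could be written roughly as follows. validate_config is a hypothetical free function standing in for AppConfig::validate(), and the error messages are illustrative.

```rust
/// Checks the two documented startup rules and reports the first violation.
pub fn validate_config(master_encryption_key: &str, app_url: &str) -> Result<(), String> {
    // 64 hex characters encode the 32-byte AES-256 key.
    let key_ok = master_encryption_key.len() == 64
        && master_encryption_key.chars().all(|c| c.is_ascii_hexdigit());
    if !key_ok {
        return Err("MASTER_ENCRYPTION_KEY must be exactly 64 hex characters".to_string());
    }
    if !(app_url.starts_with("http://") || app_url.starts_with("https://")) {
        return Err("APP_URL must start with http:// or https://".to_string());
    }
    if app_url.ends_with('/') {
        return Err("APP_URL must not end with a trailing slash".to_string());
    }
    Ok(())
}
```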

User Settings Model

Default values applied when a user has no saved settings:

Setting	Default	Range
max_articles_per_source	3	1-10
max_links_per_source	8	1-30
use_brave_search	false	boolean
article_history_days	90	0-365
batch_size	5	1-20
source_extraction_window	3	1-10
search_agent_behavior	""	max 2000 chars
ai_provider	""	max 100 chars
ai_model	""	max 100 chars
ai_model_websearch	""	max 100 chars
rate_limit_max_requests	null	>= 1 if set
rate_limit_time_window_seconds	null	>= 1 if set
