Technical Architecture Analysis: AI Weekly Synth Refactoring
Open Questions and Clarifications Needed
Before implementation, the following points require decisions from stakeholders:
- Admin scope: Is the "admin" a single super-user defined by config, or a full role-based system with multiple admins? This analysis assumes a simple role flag on users plus a single bootstrap admin defined via environment variable.
- Google OAuth retention: The requirements specify email+captcha and magic link auth. Should Google SSO be dropped entirely, or kept as an additional option? This analysis assumes Google SSO is dropped to remove all Google dependencies.
- Email sending for syntheses: The current app sends syntheses via Gmail API with OAuth popup. With Google dependencies removed, should SMTP-based email sending replace this? This analysis assumes yes, using the same SMTP configuration as magic link delivery.
- Data migration volume: How many existing users and syntheses need migrating? This impacts whether a one-shot script suffices or whether incremental migration tooling is needed.
- Concurrent users target: Rate limiter design and session store choice depend on expected load. This analysis assumes a small-to-medium deployment (1-100 concurrent users).
- Legacy data: The current SynthesisData has legacy fields (majorAnnouncements, financialSector, etc.). The requirements say "remove legacy data/formats/code." This analysis assumes legacy fields are dropped during migration; only the sections[] format is carried forward.
1. Rust Backend Architecture
1.1 Framework Choice: Axum
Recommendation: Axum over Actix-web.
Justification:
| Criterion | Axum | Actix-web |
|---|---|---|
| Ecosystem alignment | Built on tokio + tower + hyper -- the de-facto Rust async stack | Has its own runtime layer (though uses tokio underneath) |
| Middleware model | Tower Layer/Service -- composable, reusable, testable | Framework-specific Transform/Service traits -- powerful but idiosyncratic |
| Extractors | Type-safe, ergonomic, uses FromRequest traits | Similar, but with web::Data, web::Json wrappers |
| Community trajectory | Growing faster, backed by the tokio team | Mature, stable, but slower growth |
| Learning curve | Lower for developers already using tokio ecosystem | Slightly higher due to framework-specific concepts |
| Compile-time type safety | Strong -- handler function signatures are validated at compile time | Strong, but less ergonomic error messages |
Axum's tower-based middleware model is a decisive advantage for this project: the auth middleware, rate limiter, and CORS layer compose naturally as tower Layers. Axum also has first-class support for shared state via State extractor, which maps well to a shared database pool and configuration.
1.2 Project Structure
ai-synth-backend/
├── Cargo.toml
├── Cargo.lock
├── .env.example
├── migrations/ # sqlx migrations
│ ├── 001_create_users.sql
│ ├── 002_create_sessions.sql
│ ├── 003_create_settings.sql
│ ├── 004_create_sources.sql
│ ├── 005_create_syntheses.sql
│ ├── 006_create_admin_config.sql
│ └── 007_create_rate_limits.sql
├── src/
│ ├── main.rs # Entry point: init tracing, DB, run server
│ ├── config.rs # Env-based configuration (envy / dotenvy)
│ ├── app_state.rs # AppState struct (pool, config, http client)
│ ├── error.rs # AppError enum, IntoResponse impl
│ ├── router.rs # All route definitions, middleware wiring
│ ├── middleware/
│ │ ├── mod.rs
│ │ ├── auth.rs # Session cookie extraction, user injection
│ │ ├── csrf.rs # Double-submit cookie CSRF protection
│ │ └── rate_limit.rs # Per-provider, configurable rate limiter
│ ├── models/
│ │ ├── mod.rs
│ │ ├── user.rs # User, NewUser, UserRole
│ │ ├── session.rs # Session
│ │ ├── settings.rs # UserSettings
│ │ ├── source.rs # Source
│ │ ├── synthesis.rs # Synthesis, NewsSection, NewsItem
│ │ └── admin.rs # LlmProviderConfig, RateLimitConfig
│ ├── handlers/
│ │ ├── mod.rs
│ │ ├── auth.rs # register, login (magic link), verify, logout
│ │ ├── syntheses.rs # list, get, create (trigger generation), delete
│ │ ├── sources.rs # CRUD, bulk import, CSV export
│ │ ├── settings.rs # get, update, export, import
│ │ ├── admin.rs # LLM config CRUD, rate limit config, user list
│ │ └── email.rs # Send synthesis by email
│ ├── services/
│ │ ├── mod.rs
│ │ ├── llm/
│ │ │ ├── mod.rs # LlmProvider trait, factory function
│ │ │ ├── gemini.rs # Google Gemini implementation
│ │ │ ├── openai.rs # OpenAI implementation
│ │ │ ├── anthropic.rs # Anthropic implementation
│ │ │ └── types.rs # Shared request/response types
│ │ ├── synthesis.rs # 2-pass generation pipeline orchestration
│ │ ├── scraper.rs # URL validation, HTML scraping, date extraction
│ │ ├── email.rs # SMTP email sending (magic links + syntheses)
│ │ └── captcha.rs # Captcha verification
│ └── db/
│ ├── mod.rs
│ ├── users.rs # User queries
│ ├── sessions.rs # Session queries
│ ├── settings.rs # Settings queries
│ ├── sources.rs # Source queries
│ ├── syntheses.rs # Synthesis queries
│ └── admin.rs # Admin config queries
└── tests/
├── api/ # Integration tests
└── services/ # Unit tests for services
1.3 Layered Architecture
The application follows a clean 3-layer architecture:
- Handlers (HTTP layer): Extract request data, call services, return responses. No business logic.
- Services (Business layer): Orchestrate operations, enforce business rules, call DB and external APIs.
- DB (Persistence layer): Raw sqlx queries, mapping to/from model structs.
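The layering above can be sketched with plain functions, using only the standard library. This is a minimal illustration, not the project's code: SourceStore stands in for the sqlx-backed db/ module, and the duplicate-URL rule and the status codes are assumptions chosen for the example.

```rust
use std::collections::HashMap;

// --- DB layer: persistence only (in-memory stand-in for the sqlx queries) ---
#[derive(Default)]
pub struct SourceStore {
    rows: HashMap<String, Vec<String>>, // user_id -> source URLs
}

impl SourceStore {
    pub fn insert(&mut self, user_id: &str, url: &str) {
        self.rows.entry(user_id.to_string()).or_default().push(url.to_string());
    }

    pub fn list(&self, user_id: &str) -> Vec<String> {
        self.rows.get(user_id).cloned().unwrap_or_default()
    }
}

// --- Service layer: business rules, no HTTP types ---
#[derive(Debug, PartialEq)]
pub enum ServiceError {
    InvalidUrl,
    Duplicate, // assumed rule: a user cannot add the same URL twice
}

pub fn add_source(store: &mut SourceStore, user_id: &str, url: &str) -> Result<(), ServiceError> {
    if !(url.starts_with("http://") || url.starts_with("https://")) {
        return Err(ServiceError::InvalidUrl);
    }
    if store.list(user_id).iter().any(|u| u == url) {
        return Err(ServiceError::Duplicate);
    }
    store.insert(user_id, url);
    Ok(())
}

// --- Handler layer: translate the service result into an HTTP status ---
pub fn add_source_handler(store: &mut SourceStore, user_id: &str, url: &str) -> u16 {
    match add_source(store, user_id, url) {
        Ok(()) => 201,
        Err(ServiceError::InvalidUrl) => 400,
        Err(ServiceError::Duplicate) => 409,
    }
}
```

Because the service function takes no HTTP types, it can be unit-tested without starting a server, which is the main practical benefit of keeping handlers thin.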
1.4 Error Handling
A unified AppError enum implements IntoResponse:
#[derive(Debug)]
pub enum AppError {
// Client errors
BadRequest(String),
Unauthorized(String),
Forbidden(String),
NotFound(String),
Conflict(String),
TooManyRequests { retry_after_secs: u64 },
ValidationError(Vec<FieldError>),
// Server errors
Internal(anyhow::Error),
LlmError(String),
SmtpError(String),
ScrapingError(String),
}
impl IntoResponse for AppError {
fn into_response(self) -> axum::response::Response {
let (status, message) = match &self {
AppError::BadRequest(msg) => (StatusCode::BAD_REQUEST, msg.clone()),
AppError::Unauthorized(_) => (StatusCode::UNAUTHORIZED, "Unauthorized".into()),
AppError::Forbidden(_) => (StatusCode::FORBIDDEN, "Forbidden".into()),
AppError::NotFound(msg) => (StatusCode::NOT_FOUND, msg.clone()),
AppError::TooManyRequests { retry_after_secs } => {
// Include Retry-After header
(StatusCode::TOO_MANY_REQUESTS, format!("Retry after {retry_after_secs}s"))
}
AppError::Internal(e) => {
tracing::error!("Internal error: {e:#}");
(StatusCode::INTERNAL_SERVER_ERROR, "Internal server error".into())
}
// ...
};
(status, Json(json!({ "error": message }))).into_response()
}
}
All handlers return Result<impl IntoResponse, AppError>. The ? operator propagates errors naturally. From implementations convert sqlx::Error, reqwest::Error, etc. into AppError.
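How a From implementation lets the ? operator turn a library error into an AppError can be shown with a reduced, standard-library-only sketch. The three-variant enum, the status_code helper, and parse_page are hypothetical stand-ins for illustration; the real enum and its IntoResponse impl are the ones listed above.

```rust
use std::num::ParseIntError;

#[derive(Debug, PartialEq)]
pub enum AppError {
    BadRequest(String),
    NotFound(String),
    TooManyRequests { retry_after_secs: u64 },
}

impl AppError {
    /// Status code the corresponding HTTP response would carry.
    pub fn status_code(&self) -> u16 {
        match self {
            AppError::BadRequest(_) => 400,
            AppError::NotFound(_) => 404,
            AppError::TooManyRequests { .. } => 429,
        }
    }
}

// This impl is what allows `?` to convert a ParseIntError automatically.
impl From<ParseIntError> for AppError {
    fn from(e: ParseIntError) -> Self {
        AppError::BadRequest(format!("invalid number: {e}"))
    }
}

/// Hypothetical helper: parses a 1-based `page` query parameter.
pub fn parse_page(raw: &str) -> Result<u32, AppError> {
    let page: u32 = raw.trim().parse()?; // ParseIntError -> AppError via From
    if page == 0 {
        return Err(AppError::BadRequest("page starts at 1".into()));
    }
    Ok(page)
}
```

The same pattern would apply to sqlx::Error and reqwest::Error, each mapped to whichever variant fits best.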
1.5 SQLite with sqlx: Schema Design
All tables use TEXT primary keys (UUIDs generated by the backend) for portability. Timestamps are stored as TEXT in ISO 8601 format (SQLite has no native timestamp; this also works on Postgres via TIMESTAMPTZ cast).
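Storing timestamps as TEXT still allows expiry checks and ordering, because fixed-width ISO 8601 strings in UTC sort in chronological order. The sketch below illustrates that property; it assumes every value uses the exact format YYYY-MM-DDTHH:MM:SSZ, and the helper names are hypothetical.

```rust
/// Returns true if the expiry instant has been reached.
/// Assumes both arguments are UTC timestamps in the identical fixed-width
/// format `YYYY-MM-DDTHH:MM:SSZ`, so byte-wise order equals time order.
pub fn is_expired(expires_at: &str, now: &str) -> bool {
    expires_at <= now
}

/// Sorts timestamps newest-first, as `ORDER BY created_at DESC` would.
pub fn newest_first(mut stamps: Vec<&str>) -> Vec<&str> {
    stamps.sort_unstable_by(|a, b| b.cmp(a));
    stamps
}
```

The property breaks if rows mix time zones, offsets, or fractional-second precision, so the backend would need to normalize every timestamp before writing it.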
Migration 001: Users
CREATE TABLE users (
id TEXT PRIMARY KEY, -- UUID
email TEXT NOT NULL UNIQUE,
display_name TEXT,
role TEXT NOT NULL DEFAULT 'user', -- 'user' | 'admin'
created_at TEXT NOT NULL, -- ISO 8601
updated_at TEXT NOT NULL
);
CREATE INDEX idx_users_email ON users(email);
Migration 002: Sessions
CREATE TABLE sessions (
id TEXT PRIMARY KEY, -- Secure random token (32 bytes, base64url)
user_id TEXT NOT NULL REFERENCES users(id) ON DELETE CASCADE,
created_at TEXT NOT NULL,
expires_at TEXT NOT NULL,
ip_address TEXT,
user_agent TEXT
);
CREATE INDEX idx_sessions_user_id ON sessions(user_id);
CREATE INDEX idx_sessions_expires_at ON sessions(expires_at);
Migration 003: Settings
CREATE TABLE settings (
user_id TEXT PRIMARY KEY REFERENCES users(id) ON DELETE CASCADE,
theme TEXT NOT NULL DEFAULT 'Intelligence Artificielle',
max_age_days INTEGER NOT NULL DEFAULT 7,
categories TEXT NOT NULL, -- JSON array stored as TEXT
max_items_per_category INTEGER NOT NULL DEFAULT 4,
search_agent_behavior TEXT NOT NULL DEFAULT '',
ai_model TEXT NOT NULL DEFAULT 'gemini-3.1-pro-preview',
updated_at TEXT NOT NULL
);
Migration 004: Sources
CREATE TABLE sources (
id TEXT PRIMARY KEY,
user_id TEXT NOT NULL REFERENCES users(id) ON DELETE CASCADE,
title TEXT NOT NULL,
url TEXT NOT NULL,
created_at TEXT NOT NULL
);
CREATE INDEX idx_sources_user_id ON sources(user_id);
Migration 005: Syntheses
CREATE TABLE syntheses (
id TEXT PRIMARY KEY,
user_id TEXT NOT NULL REFERENCES users(id) ON DELETE CASCADE,
week TEXT NOT NULL, -- e.g. "2026-W12"
sections TEXT NOT NULL, -- JSON: [{ title, items: [{ title, url, summary }] }]
created_at TEXT NOT NULL
);
CREATE INDEX idx_syntheses_user_id ON syntheses(user_id);
CREATE INDEX idx_syntheses_created_at ON syntheses(created_at);
Migration 006: Admin Config (LLM Providers)
CREATE TABLE llm_providers (
id TEXT PRIMARY KEY,
provider TEXT NOT NULL, -- 'gemini' | 'openai' | 'anthropic'
display_name TEXT NOT NULL,
api_key TEXT NOT NULL, -- Encrypted at rest (AES-256-GCM)
base_url TEXT, -- Optional override for self-hosted/proxy
models TEXT NOT NULL, -- JSON array of available model identifiers
is_enabled BOOLEAN NOT NULL DEFAULT TRUE, -- TRUE/FALSE literals work on SQLite 3.23+ and Postgres; DEFAULT 1 is rejected by Postgres
created_at TEXT NOT NULL,
updated_at TEXT NOT NULL,
UNIQUE(provider)
);
Migration 007: Rate Limit Configuration
CREATE TABLE rate_limits (
id TEXT PRIMARY KEY,
provider_id TEXT NOT NULL REFERENCES llm_providers(id) ON DELETE CASCADE,
max_requests INTEGER NOT NULL DEFAULT 29,
time_window_ms INTEGER NOT NULL DEFAULT 60000,
updated_at TEXT NOT NULL,
UNIQUE(provider_id)
);
-- Magic link rate limiting
CREATE TABLE magic_link_tokens (
id TEXT PRIMARY KEY,
email TEXT NOT NULL,
token_hash TEXT NOT NULL, -- SHA-256 of the token
created_at TEXT NOT NULL,
expires_at TEXT NOT NULL,
used BOOLEAN NOT NULL DEFAULT FALSE -- FALSE instead of 0 keeps the DDL valid on Postgres
);
CREATE INDEX idx_magic_link_email ON magic_link_tokens(email);
1.6 SQLite/Postgres Dual Compatibility Strategy
Option considered: runtime database selection via sqlx::AnyPool.
AnyPool has limitations, however (notably no compile-time query checking), so a more robust approach is preferred:
Strategy: Feature-flag based conditional compilation.
# Cargo.toml
[features]
default = ["sqlite"]
sqlite = ["sqlx/sqlite"]
postgres = ["sqlx/postgres"]
For this project, the SQL differences between SQLite and Postgres are minimal:
| Concern | SQLite | Postgres | Resolution |
|---|---|---|---|
| Auto-increment PK | INTEGER PRIMARY KEY | SERIAL | Use UUID TEXT PKs -- identical on both |
| Timestamps | TEXT (ISO 8601) | TIMESTAMPTZ | Store as TEXT on both; parse in application layer |
| JSON columns | TEXT + app-side JSON parse | JSONB | Store as TEXT on both; Postgres can migrate to JSONB later |
| Boolean | INTEGER (0/1) | BOOLEAN | Use INTEGER on SQLite, BOOLEAN on Postgres; sqlx handles mapping |
| RETURNING clause | Supported since SQLite 3.35 | Supported | Use RETURNING on both |
Practical approach for v1: Target SQLite only. Write SQL that is Postgres-compatible by design (UUID text PKs, ISO timestamps, no SQLite-specific functions). When the Postgres upgrade happens, create a parallel migrations_pg/ folder and swap the connection pool. The query layer (db/) should then need only mechanical changes (pool type, bind-parameter syntax), provided all queries stick to standard SQL.
Compile-time checking is preserved by using sqlx::query! and sqlx::query_as! macros with the DATABASE_URL environment variable pointing to an SQLite file during development.
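One difference the table above does not list is bind-parameter syntax: SQLite statements are usually written with ?, while Postgres requires numbered $1, $2, ... placeholders. The sketch below shows one way to isolate that difference; the Backend enum and insert_sql helper are hypothetical and not part of sqlx.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Backend {
    Sqlite,
    Postgres,
}

/// Builds an INSERT statement using the placeholder style of the backend:
/// `?` for SQLite, `$1, $2, ...` for Postgres.
pub fn insert_sql(backend: Backend, table: &str, columns: &[&str]) -> String {
    let placeholders: Vec<String> = (1..=columns.len())
        .map(|i| match backend {
            Backend::Sqlite => "?".to_string(),
            Backend::Postgres => format!("${i}"),
        })
        .collect();
    format!(
        "INSERT INTO {table} ({}) VALUES ({})",
        columns.join(", "),
        placeholders.join(", ")
    )
}
```

With the sqlx::query! macros the SQL must be a string literal, so in practice the two variants would more likely live behind #[cfg(feature = "...")] blocks than be generated at runtime; the helper only makes the difference visible.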
2. API Design
2.1 REST API Endpoints
All endpoints prefixed with /api/v1. Request and response bodies are JSON unless stated otherwise.
Authentication
| Method | Path | Auth | Description |
|---|---|---|---|
| POST | /auth/register | No | Create account (email + captcha) |
| POST | /auth/login | No | Request magic link (email + captcha) |
| GET | /auth/verify?token=... | No | Verify magic link token, create session |
| POST | /auth/logout | Yes | Invalidate session |
| GET | /auth/me | Yes | Get current user info |
Syntheses
| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /syntheses | Yes | List user's syntheses (paginated) |
| GET | /syntheses/:id | Yes | Get synthesis detail |
| POST | /syntheses/generate | Yes | Trigger generation (async, returns job ID) |
| GET | /syntheses/generate/:job_id/status | Yes | Poll generation status |
| DELETE | /syntheses/:id | Yes | Delete a synthesis |
| POST | /syntheses/:id/email | Yes | Send synthesis by email |
Sources
| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /sources | Yes | List user's sources |
| POST | /sources | Yes | Add a source |
| DELETE | /sources/:id | Yes | Delete a source |
| POST | /sources/bulk | Yes | Bulk import (JSON array) |
| POST | /sources/import-csv | Yes | Import from CSV (multipart upload) |
| GET | /sources/export-csv | Yes | Export as CSV download |
Settings
| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /settings | Yes | Get user's settings |
| PUT | /settings | Yes | Update settings |
| GET | /settings/export | Yes | Export as JSON download |
| POST | /settings/import | Yes | Import from JSON |
Admin
| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /admin/providers | Admin | List LLM provider configs |
| POST | /admin/providers | Admin | Add/update provider config |
| DELETE | /admin/providers/:id | Admin | Remove provider |
| GET | /admin/rate-limits | Admin | Get rate limit configs |
| PUT | /admin/rate-limits/:provider_id | Admin | Update rate limit config |
| GET | /admin/users | Admin | List all users |
| PUT | /admin/users/:id/role | Admin | Change user role |
Public (for frontend config)
| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /config/providers | Yes | List enabled providers + their model names (no API keys) |
2.2 Request/Response Shapes
POST /auth/register
// Request
{
"email": "user@example.com",
"display_name": "Jane Doe",
"captcha_token": "turnstile-response-token"
}
// Response 200
{
"message": "A verification link has been sent to your email."
}
POST /syntheses/generate
// Request (empty body -- uses user's saved settings and sources)
{}
// Response 202
{
"job_id": "uuid-of-generation-job",
"status": "pending"
}
GET /syntheses/:id
// Response 200
{
"id": "uuid",
"week": "2026-W12",
"created_at": "2026-03-21T10:30:00Z",
"sections": [
{
"title": "Annonces majeures",
"items": [
{
"title": "Article title",
"url": "https://example.com/article",
"summary": "4-5 line summary..."
}
]
}
]
}
PUT /settings
// Request
{
"theme": "Intelligence Artificielle",
"max_age_days": 7,
"categories": ["Annonces majeures", "Secteur financier"],
"max_items_per_category": 4,
"search_agent_behavior": "Custom instructions...",
"ai_model": "gemini-3.1-pro-preview"
}
// Response 200
{
"message": "Settings updated successfully."
}
POST /admin/providers
// Request
{
"provider": "openai",
"display_name": "OpenAI GPT-4o",
"api_key": "sk-...",
"base_url": null,
"models": ["gpt-4o", "gpt-4o-mini"],
"is_enabled": true
}
2.3 Authentication Middleware
The auth middleware is a tower Layer that:
- Extracts the session cookie (ai_synth_session) from the request.
- Looks up the session ID in the sessions table.
- Checks expires_at has not passed.
- Loads the User from the users table.
- Injects the User into request extensions (request.extensions_mut().insert(user)).
- Handlers extract the user via Extension<User> or a custom AuthUser extractor.
For admin routes, an additional RequireAdmin layer checks user.role == "admin".
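The first three middleware steps (cookie extraction, session lookup, expiry check) reduce to a pure function that can be tested without axum. The sketch below makes several assumptions for brevity: sessions live in a HashMap rather than the sessions table, expiry is a Unix timestamp rather than ISO 8601 text, and the AuthError names are hypothetical.

```rust
use std::collections::HashMap;

pub struct Session {
    pub user_id: String,
    pub expires_at: u64, // Unix seconds (the schema stores ISO 8601 text)
}

#[derive(Debug, PartialEq)]
pub enum AuthError {
    MissingCookie,
    UnknownSession,
    Expired,
}

/// Pulls one cookie value out of a raw `Cookie` request header.
pub fn cookie_value<'a>(header: &'a str, name: &str) -> Option<&'a str> {
    header.split(';').find_map(|pair| {
        let (k, v) = pair.trim().split_once('=')?;
        (k == name).then_some(v)
    })
}

/// Cookie -> session row -> expiry check. Returns the owning user's id.
pub fn resolve_user_id<'a>(
    cookie_header: Option<&str>,
    sessions: &'a HashMap<String, Session>,
    now: u64,
) -> Result<&'a str, AuthError> {
    let header = cookie_header.ok_or(AuthError::MissingCookie)?;
    let sid = cookie_value(header, "ai_synth_session").ok_or(AuthError::MissingCookie)?;
    let session = sessions.get(sid).ok_or(AuthError::UnknownSession)?;
    if session.expires_at <= now {
        return Err(AuthError::Expired);
    }
    Ok(session.user_id.as_str())
}
```

In the real middleware all three error cases would most likely collapse into a single 401 response, so that a client cannot tell an unknown session from an expired one.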
Session cookies configuration:
Cookie::build(("ai_synth_session", session_id))
.http_only(true)
.secure(true) // HTTPS only
.same_site(SameSite::Lax)
.path("/")
.max_age(Duration::days(30)) // time::Duration, not std::time::Duration
.build();
CSRF Protection:
Since this is an API consumed by a SPA on the same origin (or proxied), the combination of SameSite=Lax cookies and requiring a custom header (X-Requested-With: XMLHttpRequest) on mutating requests provides sufficient CSRF protection. This is the "custom header" pattern -- browsers will not send custom headers on cross-origin requests without CORS preflight approval.
For the SPA, every fetch call to the API includes:
headers: { "X-Requested-With": "XMLHttpRequest" }
The CSRF middleware rejects POST/PUT/DELETE requests missing this header.
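The rejection rule itself is small enough to express as a pure function. This is a sketch of the decision only, assuming headers arrive as (name, value) pairs; the function name csrf_allows is hypothetical, and wiring it into a tower Layer is left out.

```rust
/// Returns true when the request may proceed.
/// Header names are compared case-insensitively, as HTTP requires.
pub fn csrf_allows(method: &str, headers: &[(&str, &str)]) -> bool {
    let mutating = matches!(
        method.to_ascii_uppercase().as_str(),
        "POST" | "PUT" | "DELETE"
    );
    if !mutating {
        return true; // safe methods such as GET are not checked
    }
    headers.iter().any(|(name, value)| {
        name.eq_ignore_ascii_case("x-requested-with") && *value == "XMLHttpRequest"
    })
}
```

If PATCH endpoints are added later, the method list would need to grow with them; checking "anything other than GET, HEAD, OPTIONS" is a common alternative that avoids that maintenance step.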
3. LLM Provider Abstraction
3.1 Trait Design
#[async_trait]
pub trait LlmProvider: Send + Sync {
/// Returns the provider identifier (e.g., "gemini", "openai", "anthropic").
fn provider_id(&self) -> &str;
/// Pass 1: Search the web and generate structured news items.
/// Returns raw JSON matching the category schema.
async fn generate_search_pass(
&self,
model: &str,
system_prompt: &str,
user_prompt: &str,
response_schema: &serde_json::Value,
) -> Result<serde_json::Value, AppError>;
/// Pass 2: Rewrite titles and summaries based on scraped content.
/// No web search tool needed.
async fn generate_rewrite_pass(
&self,
model: &str,
system_prompt: &str,
user_prompt: &str,
response_schema: &serde_json::Value,
) -> Result<serde_json::Value, AppError>;
/// Lists available models for this provider.
fn available_models(&self) -> &[String];
}
3.2 Provider-Specific Web Search Handling
Each provider handles web grounding differently. The trait design abstracts this:
| Provider | Pass 1 (Search) | Pass 2 (Rewrite) |
|---|---|---|
| Gemini | Uses googleSearch tool in config. Structured output via responseSchema. | Standard generation, no tools. responseSchema for structured output. |
| OpenAI | Uses web_search tool (Responses API) or a two-step approach: first call with browsing tool, then structured output. | Standard chat completion with response_format: { type: "json_schema", ... }. |
| Anthropic | Uses web_search tool (available on Claude models). Structured output via tool-use pattern or explicit JSON instructions. | Standard message with JSON output instructions. Anthropic does not have native JSON schema enforcement, so the prompt includes the schema and parsing is done server-side with validation. |
Implementation details for each provider:
// Gemini implementation
pub struct GeminiProvider {
client: reqwest::Client,
api_key: String,
base_url: String,
models: Vec<String>,
}
impl GeminiProvider {
async fn generate_search_pass(&self, model: &str, ...) -> Result<serde_json::Value, AppError> {
// POST to /v1beta/models/{model}:generateContent
// Config includes: tools: [{ googleSearch: {} }]
// responseMimeType: "application/json"
// responseSchema: <schema>
}
}
// OpenAI implementation
pub struct OpenAiProvider {
client: reqwest::Client,
api_key: String,
base_url: String, // default: https://api.openai.com/v1
models: Vec<String>,
}
// Anthropic implementation
pub struct AnthropicProvider {
client: reqwest::Client,
api_key: String,
base_url: String, // default: https://api.anthropic.com
models: Vec<String>,
}
3.3 Provider Factory
pub fn create_provider(config: &LlmProviderConfig) -> Result<Box<dyn LlmProvider>, AppError> {
match config.provider.as_str() {
"gemini" => Ok(Box::new(GeminiProvider::new(
config.api_key.clone(),
config.base_url.clone(),
config.models.clone(),
))),
"openai" => Ok(Box::new(OpenAiProvider::new(...))),
"anthropic" => Ok(Box::new(AnthropicProvider::new(...))),
_ => Err(AppError::BadRequest(format!("Unknown provider: {}", config.provider))),
}
}
3.4 Rate Limiter Design
The rate limiter is a server-side, per-provider, in-memory sliding-window limiter (it keeps a log of recent request timestamps rather than a token count), with configuration stored in the database.
pub struct RateLimiter {
state: Arc<DashMap<String, ProviderBucket>>,
}
struct ProviderBucket {
timestamps: VecDeque<Instant>,
max_requests: u32,
time_window: Duration,
}
impl RateLimiter {
/// Waits until a slot is available for the given provider.
pub async fn acquire(&self, provider_id: &str) -> Result<(), AppError> {
loop {
// Scope the DashMap guard so it is released before the await below.
let wait_time = {
let mut bucket = self.state
.entry(provider_id.to_string())
.or_insert_with(|| self.default_bucket());
// Copy the window out first: the retain closure cannot borrow `bucket` again.
let window = bucket.time_window;
bucket.timestamps.retain(|t| t.elapsed() < window);
if bucket.timestamps.len() < bucket.max_requests as usize {
bucket.timestamps.push_back(Instant::now());
return Ok(());
}
// Time until the oldest request leaves the window (saturating, so it never underflows).
bucket.timestamps.front()
.map_or(window, |t| window.saturating_sub(t.elapsed()))
};
tokio::time::sleep(wait_time).await;
}
}
/// Reload configuration from DB (called by admin update endpoint).
pub async fn reload_config(&self, pool: &SqlitePool) -> Result<(), AppError> {
// Fetch rate_limits table, update each ProviderBucket
}
}
The rate limiter lives in AppState and is shared across all requests. When an admin updates rate limit configuration, reload_config is called to hot-reload without restart.
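The windowing logic can be checked in isolation with a synchronous variant that takes the current time as a parameter. This is a standard-library sketch under that assumption; SlidingWindow and try_acquire are hypothetical names, and the async version above would sleep for the returned duration instead of handing it back.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

pub struct SlidingWindow {
    timestamps: VecDeque<Instant>,
    max_requests: usize,
    window: Duration,
}

impl SlidingWindow {
    pub fn new(max_requests: usize, window: Duration) -> Self {
        Self { timestamps: VecDeque::new(), max_requests, window }
    }

    /// Tries to take a slot at time `now`.
    /// Ok(()) means the call may proceed; Err(wait) is the time until the
    /// oldest recorded request leaves the window.
    pub fn try_acquire(&mut self, now: Instant) -> Result<(), Duration> {
        // Drop entries that have aged out of the window.
        while let Some(&oldest) = self.timestamps.front() {
            if now.duration_since(oldest) >= self.window {
                self.timestamps.pop_front();
            } else {
                break;
            }
        }
        if self.timestamps.len() < self.max_requests {
            self.timestamps.push_back(now);
            return Ok(());
        }
        match self.timestamps.front() {
            // Every remaining entry is younger than the window, so this cannot underflow.
            Some(&oldest) => Err(self.window - now.duration_since(oldest)),
            None => Err(self.window), // only reachable when max_requests == 0
        }
    }
}
```

Passing the clock in makes the limiter deterministic under test: with a limit of 2 per 60 s and requests at 0 s and 10 s, a third request at 20 s is told to wait 40 s.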
3.5 Two-Pass Generation Pipeline
The SynthesisService orchestrates the full pipeline:
pub struct SynthesisService;
impl SynthesisService {
pub async fn generate(
state: &AppState,
user_id: &str,
) -> Result<Synthesis, AppError> {
// 1. Load user settings
let settings = db::settings::get(pool, user_id).await?;
// 2. Load user sources
let sources = db::sources::list(pool, user_id).await?;
// 3. Resolve LLM provider + model
let (provider, model) = resolve_provider(state, &settings.ai_model).await?;
// 4. Build dynamic schema from categories
let schema = build_category_schema(&settings.categories);
// 5. Rate limit: acquire slot
state.rate_limiter.acquire(provider.provider_id()).await?;
// 6. Pass 1: Search
let raw_results = provider.generate_search_pass(
&model, &system_prompt, &user_prompt, &schema
).await?;
// 7. Validate & scrape URLs (server-side, no CORS issues)
let scraped = scraper::validate_and_scrape(
&state.http_client,
raw_results,
settings.max_age_days,
).await;
// 8. Rate limit: acquire slot for pass 2
state.rate_limiter.acquire(provider.provider_id()).await?;
// 9. Pass 2: Rewrite with scraped content
let final_results = provider.generate_rewrite_pass(
&model, &rewrite_system_prompt, &rewrite_prompt, &schema
).await?;
// 10. Persist
let synthesis = db::syntheses::create(
pool, user_id, &week_string, &final_results
).await?;
Ok(synthesis)
}
}
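Step 3 of the pipeline (resolving a provider from the user's ai_model setting) is not spelled out above. One plausible reading, sketched here with the standard library only, is to pick the first enabled provider whose models list contains the requested model. ProviderConfig, ResolveError, and find_provider_for_model are hypothetical names, and the lookup rule is an assumption rather than the project's confirmed behavior.

```rust
pub struct ProviderConfig {
    pub provider: String,
    pub models: Vec<String>,
    pub is_enabled: bool,
}

#[derive(Debug, PartialEq)]
pub enum ResolveError {
    UnknownModel,
    ProviderDisabled,
}

/// Finds an enabled provider that offers `model`.
/// A model listed only by disabled providers yields ProviderDisabled.
pub fn find_provider_for_model<'a>(
    configs: &'a [ProviderConfig],
    model: &str,
) -> Result<&'a ProviderConfig, ResolveError> {
    let mut found_disabled = false;
    for cfg in configs {
        if cfg.models.iter().any(|m| m == model) {
            if cfg.is_enabled {
                return Ok(cfg);
            }
            found_disabled = true;
        }
    }
    Err(if found_disabled {
        ResolveError::ProviderDisabled
    } else {
        ResolveError::UnknownModel
    })
}
```

Distinguishing the two failure cases lets the API tell a user that their saved model has been switched off by an admin, rather than reporting it as nonexistent.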
3.6 Asynchronous Generation
Synthesis generation can take 30-90 seconds. Two options:
Option A: Synchronous with long timeout. Simple, but ties up a connection. Acceptable for low-traffic deployments.
Option B (Recommended): Background task with polling. The POST /syntheses/generate endpoint spawns a tokio task and returns a job ID. The frontend polls GET /syntheses/generate/:job_id/status. Job state is kept in an in-memory DashMap<String, JobStatus> (not in DB, since jobs are ephemeral).
enum JobStatus {
Pending,
InProgress { step: String }, // "search", "scraping", "rewriting"
Completed { synthesis_id: String },
Failed { error: String },
}
The frontend polls every 3-5 seconds with the same loading UX as the current React app.
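A minimal job registry for Option B might look like the following. It is a sketch with assumptions: a Mutex<HashMap> replaces the DashMap to stay within the standard library, JobStore and its methods are hypothetical names, and the rule that terminal states are never overwritten is an added safeguard rather than something the text above specifies.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Clone, Debug, PartialEq)]
pub enum JobStatus {
    Pending,
    InProgress { step: String },
    Completed { synthesis_id: String },
    Failed { error: String },
}

#[derive(Default)]
pub struct JobStore {
    jobs: Mutex<HashMap<String, JobStatus>>,
}

impl JobStore {
    /// Registers a new job in the Pending state.
    pub fn create(&self, job_id: &str) {
        self.jobs.lock().unwrap().insert(job_id.to_string(), JobStatus::Pending);
    }

    /// Updates a job unless it already reached a terminal state.
    /// Returns false for unknown or finished jobs.
    pub fn update(&self, job_id: &str, next: JobStatus) -> bool {
        let mut jobs = self.jobs.lock().unwrap();
        match jobs.get_mut(job_id) {
            Some(JobStatus::Completed { .. }) | Some(JobStatus::Failed { .. }) | None => false,
            Some(current) => {
                *current = next;
                true
            }
        }
    }

    /// What the polling endpoint would return; None maps to a 404.
    pub fn status(&self, job_id: &str) -> Option<JobStatus> {
        self.jobs.lock().unwrap().get(job_id).cloned()
    }
}
```

Because the map is in memory only, finished entries would need to be evicted after some retention period, and any job in flight is lost on restart; both points are worth deciding before implementation.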
4. URL Scraping / Validation
4.1 CORS Elimination
Moving scraping to the backend completely eliminates CORS issues. The Rust backend makes direct HTTP requests to target URLs -- no proxies needed. This is the single biggest reliability improvement in the refactoring.
4.2 reqwest-Based HTTP Client
let client = reqwest::Client::builder()
.user_agent("Mozilla/5.0 (compatible; AISynthBot/1.0; +https://your-domain.com/bot)")
.timeout(Duration::from_secs(15))
.redirect(reqwest::redirect::Policy::limited(5))
.connect_timeout(Duration::from_secs(5))
.danger_accept_invalid_certs(false)
.build()?;
The HTTP client is created once in AppState and reused across all requests (connection pooling).
4.3 HTML Parsing with scraper Crate
The current app uses the browser's DOMParser. The Rust equivalent uses the scraper crate (built on html5ever):
use scraper::{Html, Selector};
pub async fn validate_and_scrape(
client: &reqwest::Client,
items: Vec<RawNewsItem>,
max_age_days: i64,
) -> Vec<ScrapedNewsItem> {
let futures = items.into_iter().map(|item| {
let client = client.clone();
async move { scrape_single(&client, item, max_age_days).await }
});
let results = futures::future::join_all(futures).await;
results.into_iter().flatten().collect() // keep only the Some(..) results
}
async fn scrape_single(
client: &reqwest::Client,
item: RawNewsItem,
max_age_days: i64,
) -> Option<ScrapedNewsItem> {
// 1. Validate URL format
let url = Url::parse(&item.url).ok()?;
// 2. Fetch
let resp = client.get(url).send().await.ok()?;
if !resp.status().is_success() { return None; }
let html_text = resp.text().await.ok()?;
// 3. Parse HTML
let document = Html::parse_document(&html_text);
// 4. Soft-404 detection
let title_sel = Selector::parse("title").unwrap();
let h1_sel = Selector::parse("h1").unwrap();
let title_text = document.select(&title_sel).next()
.map(|el| el.text().collect::<String>().to_lowercase())
.unwrap_or_default();
let h1_text = document.select(&h1_sel).next()
.map(|el| el.text().collect::<String>().to_lowercase())
.unwrap_or_default();
let error_keywords = [
"page not found", "404", "403", "access denied",
"forbidden", "not found", "introuvable",
];
if error_keywords.iter().any(|kw| title_text.contains(kw) || h1_text.contains(kw)) {
return None;
}
// 5. Date extraction (meta tags, JSON-LD, <time>)
if let Some(pub_date) = extract_publication_date(&document) {
let age = Utc::now() - pub_date;
if age.num_days() > max_age_days {
return None;
}
}
// 6. Extract body text (remove script, style, nav, etc.)
let content = extract_body_text(&document, 4000);
Some(ScrapedNewsItem {
title: item.title,
url: item.url,
summary: item.summary,
scraped_content: content,
})
}
Date extraction mirrors the current logic: check meta[property="article:published_time"], meta[itemprop="datePublished"], <time datetime>, and JSON-LD datePublished. The chrono crate handles date parsing with multiple format attempts.
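Two of the smaller decisions in scrape_single are easy to get subtly wrong: the freshness rule in step 5 and the length cap in step 6. The sketch below restates both as pure functions. It assumes dates have already been converted to Unix seconds, and the names is_fresh and normalize_and_truncate are hypothetical (the document's own helper is extract_body_text, whose internals are not shown).

```rust
const SECS_PER_DAY: u64 = 86_400;

/// Step 5: items with no detectable date are kept; dated items are dropped
/// once their age in whole days exceeds `max_age_days`.
pub fn is_fresh(published_unix: Option<u64>, now_unix: u64, max_age_days: u64) -> bool {
    match published_unix {
        None => true,
        // saturating_sub treats future-dated pages as age zero.
        Some(p) => now_unix.saturating_sub(p) / SECS_PER_DAY <= max_age_days,
    }
}

/// Step 6 (part): collapses whitespace runs and caps the text at `max_chars`.
/// Counting chars rather than bytes keeps the cut on a UTF-8 boundary.
pub fn normalize_and_truncate(raw: &str, max_chars: usize) -> String {
    raw.split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
        .chars()
        .take(max_chars)
        .collect()
}
```

Note the consequence of whole-day truncation: with max_age_days = 7, an article that is 7 days and 23 hours old still passes, which matches the age.num_days() > max_age_days comparison in the code above.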
4.4 Concurrency Control
To avoid overwhelming target sites, scraping runs with bounded concurrency:
use futures::stream::{self, StreamExt};
stream::iter(items)
.map(|item| scrape_single(&client, item, max_age_days))
.buffer_unordered(10) // Max 10 concurrent scrapes
.collect::<Vec<_>>()
.await
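The effect of the concurrency cap can be demonstrated without an async runtime. The sketch below swaps in a different technique, a fixed pool of scoped OS threads pulling from a shared queue, purely to show the bound in action with the standard library; it is not a replacement for buffer_unordered, and run_bounded is a hypothetical helper.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;
use std::thread;

/// Runs `work` over `items` on at most `limit` worker threads and returns
/// the results in completion order (unordered, like `buffer_unordered`).
pub fn run_bounded<T, R, F>(items: Vec<T>, limit: usize, work: F) -> Vec<R>
where
    T: Send,
    R: Send,
    F: Fn(T) -> R + Sync,
{
    let queue = Mutex::new(VecDeque::from(items));
    let results = Mutex::new(Vec::new());
    thread::scope(|scope| {
        for _ in 0..limit.max(1) {
            scope.spawn(|| loop {
                // Take the next item; the lock is released before the work runs.
                let next = queue.lock().unwrap().pop_front();
                match next {
                    Some(item) => {
                        let out = work(item);
                        results.lock().unwrap().push(out);
                    }
                    None => break, // queue drained, worker exits
                }
            });
        }
    });
    results.into_inner().unwrap()
}
```

As with the stream version, at most limit items are in flight at any moment, and a slow item delays only its own worker rather than the whole batch.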
5. SolidJS Frontend
5.1 Build Tooling
SolidJS uses Vite natively. The migration is straightforward:
// vite.config.ts
import { defineConfig } from 'vite';
import solidPlugin from 'vite-plugin-solid';
import tailwindcss from '@tailwindcss/vite';
export default defineConfig({
plugins: [solidPlugin(), tailwindcss()],
server: {
port: 3000,
proxy: {
'/api': 'http://localhost:8080', // Proxy to Rust backend during dev
},
},
build: {
target: 'esnext',
},
});
package.json dependencies:
{
"dependencies": {
"solid-js": "^1.9",
"@solidjs/router": "^0.15",
"lucide-solid": "^0.450",
"date-fns": "^4.1"
},
"devDependencies": {
"vite": "^6.2",
"vite-plugin-solid": "^2.11",
"@tailwindcss/vite": "^4.1",
"tailwindcss": "^4.1",
"typescript": "^5.8"
}
}
5.2 State Management: React to SolidJS Mapping
| React Pattern | SolidJS Equivalent | Notes |
|---|---|---|
| useState(value) | createSignal(value) | Returns [getter, setter] -- getter is a function call: count() |
| useEffect(() => {}, [deps]) | createEffect(() => {}) | Auto-tracks dependencies, no dep array needed |
| useContext(Ctx) | useContext(Ctx) | Nearly identical API |
| createContext() | createContext() | Same concept |
| React.FC<Props> | Component<Props> | import { Component } from 'solid-js' |
| {items.map(i => ...)} | <For each={items()}>{(item) => ...}</For> | SolidJS uses <For> for efficient list rendering |
| {condition && <X/>} | <Show when={condition()}><X/></Show> | <Show> avoids unnecessary DOM creation |
| useNavigate() | useNavigate() | Same API from @solidjs/router |
| useParams() | useParams() | Same API |
| onSnapshot (realtime) | createResource + polling or SSE | SolidJS does not have a Firestore equivalent; use createResource for data fetching |
5.3 Authentication Context Port
// src/context/AuthContext.tsx
import { createContext, useContext, createSignal, createResource, ParentComponent } from 'solid-js';
interface User {
id: string;
email: string;
display_name: string | null;
role: string;
}
interface AuthContextType {
user: () => User | null | undefined;
loading: () => boolean;
logout: () => Promise<void>;
}
const AuthContext = createContext<AuthContextType>();
async function fetchCurrentUser(): Promise<User | null> {
const resp = await fetch('/api/v1/auth/me', {
headers: { 'X-Requested-With': 'XMLHttpRequest' },
credentials: 'include',
});
if (resp.status === 401) return null;
if (!resp.ok) throw new Error('Failed to fetch user');
return resp.json();
}
export const AuthProvider: ParentComponent = (props) => {
const [user, { refetch }] = createResource(fetchCurrentUser);
const logout = async () => {
await fetch('/api/v1/auth/logout', {
method: 'POST',
headers: { 'X-Requested-With': 'XMLHttpRequest' },
credentials: 'include',
});
refetch();
};
return (
<AuthContext.Provider value={{
user: () => user(),
loading: () => user.loading,
logout,
}}>
{props.children}
</AuthContext.Provider>
);
};
export const useAuth = () => {
const ctx = useContext(AuthContext);
if (!ctx) throw new Error('useAuth must be used within AuthProvider');
return ctx;
};
5.4 Data Fetching Pattern
The current React app uses Firestore's onSnapshot for real-time updates. With the REST API backend, data fetching uses createResource:
// src/pages/Home.tsx
import { createResource, For, Show } from 'solid-js';
import { A } from '@solidjs/router';
import { fetchApi } from '../lib/api';
async function fetchSyntheses() {
return fetchApi<SynthesisDocument[]>('/api/v1/syntheses');
}
export default function Home() {
const [syntheses, { refetch }] = createResource(fetchSyntheses);
return (
<Show when={!syntheses.loading} fallback={<Spinner />}>
<For each={syntheses()}>
{(synth) => (
<A href={`/synthesis/${synth.id}`}>
{/* card content */}
</A>
)}
</For>
</Show>
);
}
5.5 Tailwind CSS Compatibility
Tailwind CSS v4 works the same way with SolidJS. The @tailwindcss/vite plugin scans .tsx files for class names regardless of framework, so existing utility classes carry over unchanged; the main mechanical change is that SolidJS JSX conventionally uses the class attribute where React uses className. The lucide-solid package provides the same icon set as lucide-react with a near-identical component API.
5.6 Routing
// src/App.tsx
import { Router, Route } from '@solidjs/router';
import { AuthProvider } from './context/AuthContext';
function App() {
return (
<AuthProvider>
<Router>
<Route path="/login" component={Login} />
<Route path="/" component={ProtectedLayout}>
<Route path="/" component={Home} />
<Route path="/sources" component={Sources} />
<Route path="/settings" component={Settings} />
<Route path="/generate" component={GenerateSynthesis} />
<Route path="/synthesis/:id" component={SynthesisDetail} />
</Route>
</Router>
</AuthProvider>
);
}
The ProtectedLayout component checks auth and renders <Navigate> if not logged in -- same pattern as the current React ProtectedRoute but using SolidJS's <Navigate>.
6. Authentication System
6.1 Magic Link Flow
User Frontend Backend SMTP Server
| | | |
|-- Enter email -------->| | |
| |-- POST /auth/login --> |
| | { email, captcha_token } |
| | |-- verify captcha ->|
| | |-- generate token |
| | |-- store hash in DB |
| | |-- send email ------+-->
| |<-- 200 "Check email" | |
| | | |
|<---- Email arrives (link: /auth/verify?token=xxx) -------------|
| | | |
|-- Click link --------->| | |
| |-- GET /auth/verify?token=xxx --> |
| | |-- hash token |
| | |-- lookup in DB |
| | |-- verify not expired|
| | |-- mark as used |
| | |-- create/get user |
| | |-- create session |
| |<-- 302 redirect + Set-Cookie |
|<-- Redirect to / ------| | |
Token generation:
- 32 bytes of cryptographically secure random data (rand::rngs::OsRng)
- Base64url encoded for URL safety
- SHA-256 hash stored in DB (never store raw token)
- 15-minute expiry
- Single use (marked used = true after verification)
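The encoding step in the list above can be illustrated with a small unpadded base64url encoder (RFC 4648, section 5) written against the standard library. This is a sketch for checking token length and alphabet only: random generation and SHA-256 hashing are omitted because they need external crates, and in the real service an established base64 crate would be the more sensible choice.

```rust
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

/// Base64url without padding.
pub fn base64url_nopad(bytes: &[u8]) -> String {
    let mut out = String::with_capacity((bytes.len() * 4 + 2) / 3);
    for chunk in bytes.chunks(3) {
        // Pack up to three bytes into one 24-bit group (missing bytes are zero).
        let b = [chunk[0], *chunk.get(1).unwrap_or(&0), *chunk.get(2).unwrap_or(&0)];
        let n = (b[0] as u32) << 16 | (b[1] as u32) << 8 | b[2] as u32;
        // A chunk of k input bytes yields k + 1 output characters.
        for i in 0..=chunk.len() {
            let idx = (n >> (18 - 6 * i)) & 0x3f;
            out.push(ALPHABET[idx as usize] as char);
        }
    }
    out
}
```

A 32-byte token encodes to 43 characters: ten full 3-byte groups give 40 characters, and the remaining 2 bytes give 3 more.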
Rate limiting on magic link requests:
- Max 3 requests per email per 15 minutes
- Max 10 requests per IP per hour
- Prevents email bombing
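Both limits above amount to counting recent requests inside a time window. The sketch below expresses that check as a pure function over Unix timestamps; Limits and allow_magic_link are hypothetical names. It also assumes that request times per IP are recorded somewhere, which the magic_link_tokens table as defined does not yet do (it has an email column but no IP column).

```rust
pub struct Limits {
    pub per_email: usize,       // 3 in the text above
    pub email_window_secs: u64, // 15 minutes = 900
    pub per_ip: usize,          // 10
    pub ip_window_secs: u64,    // 1 hour = 3600
}

/// Counts timestamps that fall inside the window ending at `now`.
fn recent(stamps: &[u64], now: u64, window_secs: u64) -> usize {
    stamps
        .iter()
        .filter(|&&t| t <= now && now - t < window_secs)
        .count()
}

/// `email_requests` and `ip_requests` are Unix timestamps of earlier
/// requests for the same email address and the same client IP.
pub fn allow_magic_link(
    email_requests: &[u64],
    ip_requests: &[u64],
    now: u64,
    limits: &Limits,
) -> bool {
    recent(email_requests, now, limits.email_window_secs) < limits.per_email
        && recent(ip_requests, now, limits.ip_window_secs) < limits.per_ip
}
```

When a request is refused, returning the same "check your email" response as a successful one avoids revealing whether an address is registered or throttled.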
6.2 Account Registration Flow
- User submits email + display name + captcha token.
- Backend verifies captcha with provider.
- Backend checks email uniqueness.
- Backend creates user with role = 'user' and default settings.
- Backend sends magic link email for initial verification.
- User clicks link, session is created.
The first user can be bootstrapped as admin via environment variable:
ADMIN_EMAIL=admin@example.com
On startup, if a user with this email exists, their role is set to admin.
6.3 Session Management
Sessions are stored in the sessions table. The session ID is a 32-byte random token (base64url-encoded, 43 characters). Session lookup is O(1) via primary key.
Session lifecycle:
- Created on magic link verification
- Expires after 30 days (configurable)
- Refreshed (expiry extended) on each authenticated request
- Deleted on logout
- Periodic cleanup job (tokio interval) removes expired sessions
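The lifecycle above can be modelled with a small in-memory store. This is only a sketch under stated assumptions: `SessionStore` is a hypothetical stand-in for the sessions table, timestamps are Unix seconds passed in by the caller, and the 30-day lifetime is hard-coded rather than configurable.

```rust
use std::collections::HashMap;

/// 30-day lifetime from the list above (configurable in the real service).
pub const SESSION_TTL_SECS: u64 = 30 * 24 * 60 * 60;

/// Hypothetical stand-in for the sessions table: session ID -> (user ID, expiry).
#[derive(Default)]
pub struct SessionStore {
    rows: HashMap<String, (String, u64)>,
}

impl SessionStore {
    /// Called after a magic link has been verified.
    pub fn create(&mut self, session_id: &str, user_id: &str, now: u64) {
        self.rows
            .insert(session_id.to_string(), (user_id.to_string(), now + SESSION_TTL_SECS));
    }

    /// Looks up a session on an authenticated request and extends its expiry.
    pub fn touch(&mut self, session_id: &str, now: u64) -> Option<String> {
        let (user_id, expires_at) = self.rows.get_mut(session_id)?;
        if now >= *expires_at {
            return None; // expired; the row is left for the cleanup job
        }
        *expires_at = now + SESSION_TTL_SECS;
        Some(user_id.clone())
    }

    /// Logout.
    pub fn delete(&mut self, session_id: &str) {
        self.rows.remove(session_id);
    }

    /// Periodic cleanup; returns how many expired rows were removed.
    pub fn purge_expired(&mut self, now: u64) -> usize {
        let before = self.rows.len();
        self.rows.retain(|_, (_, expires_at)| now < *expires_at);
        before - self.rows.len()
    }
}
```

In the service itself, `purge_expired` would correspond to a DELETE on expired rows run from the tokio interval task.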
6.4 Captcha Integration
Recommendation: Cloudflare Turnstile.
| Option | Self-hostable | Privacy | Free tier |
|---|---|---|---|
| hCaptcha | No (SaaS) | Better than reCAPTCHA | Yes (unlimited) |
| Cloudflare Turnstile | No (SaaS) | Excellent (often invisible) | Yes (unlimited) |
| mCaptcha | Yes (open source) | Full control | N/A (self-hosted) |
None of the mainstream captcha services are fully self-hostable. Cloudflare Turnstile is recommended for its invisible challenge mode (better UX) and generous free tier. If strict self-hosting is required, mCaptcha (Rust-based, open source) is the only viable option, though it requires running a separate service.
Backend verification is simple:
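use serde::Deserialize;

// Only the field this check needs; other fields in the response are ignored.
#[derive(Deserialize)]
struct TurnstileResponse {
    success: bool,
}

// Assumes AppError implements From<reqwest::Error> so that `?` can convert errors.
pub async fn verify_captcha(client: &reqwest::Client, token: &str, secret: &str) -> Result<bool, AppError> {
    let resp = client
        .post("https://challenges.cloudflare.com/turnstile/v0/siteverify")
        .form(&[("secret", secret), ("response", token)])
        .send()
        .await?;
    let result: TurnstileResponse = resp.json().await?;
    Ok(result.success)
}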
7. Docker Deployment
7.1 Multi-Stage Dockerfile
# ===== Stage 1: Build Rust backend =====
FROM rust:1.85-bookworm AS backend-builder
WORKDIR /app
COPY Cargo.toml Cargo.lock ./
COPY src/ src/
COPY migrations/ migrations/
# Create a dummy SQLite DB for sqlx compile-time checks
ENV DATABASE_URL="sqlite:///tmp/build.db"
RUN cargo install sqlx-cli --no-default-features --features sqlite \
&& sqlx database create \
&& sqlx migrate run
RUN cargo build --release
# ===== Stage 2: Build SolidJS frontend =====
FROM node:22-alpine AS frontend-builder
WORKDIR /app/frontend
COPY frontend/package.json frontend/package-lock.json ./
RUN npm ci
COPY frontend/ ./
RUN npm run build
# ===== Stage 3: Minimal runtime =====
FROM debian:bookworm-slim AS runtime
# curl is required by the docker-compose healthcheck
RUN apt-get update && apt-get install -y --no-install-recommends \
    ca-certificates \
    curl \
    libssl3 \
    && rm -rf /var/lib/apt/lists/*
RUN useradd -ms /bin/bash appuser
WORKDIR /app
# Copy backend binary
COPY --from=backend-builder /app/target/release/ai-synth-backend .
# Copy migrations for runtime migration
COPY --from=backend-builder /app/migrations/ migrations/
# Copy frontend static files
COPY --from=frontend-builder /app/frontend/dist/ static/
# Create data directory for SQLite
RUN mkdir -p /app/data && chown appuser:appuser /app/data
USER appuser
ENV DATABASE_URL="sqlite:///app/data/ai_synth.db"
ENV STATIC_DIR="/app/static"
ENV PORT=8080
EXPOSE 8080
# The binary applies pending migrations on startup, then starts the server
CMD ["./ai-synth-backend"]
The Rust backend serves the static SolidJS files directly (via tower-http's ServeDir), eliminating the need for a separate nginx container. Requests under /api/* go to handlers; any other path is served from the static directory, falling back to index.html when no file matches (SPA fallback).
7.2 docker-compose.yml
version: "3.9"
services:
  app:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: ai-synth
    restart: unless-stopped
    ports:
      - "${PORT:-8080}:8080"
    volumes:
      - ai_synth_data:/app/data # SQLite persistence
    environment:
      - DATABASE_URL=sqlite:///app/data/ai_synth.db
      - PORT=8080
      - ADMIN_EMAIL=${ADMIN_EMAIL}
      - SESSION_SECRET=${SESSION_SECRET} # 64-byte hex for cookie signing
      - SMTP_HOST=${SMTP_HOST}
      - SMTP_PORT=${SMTP_PORT:-587}
      - SMTP_USER=${SMTP_USER}
      - SMTP_PASSWORD=${SMTP_PASSWORD}
      - SMTP_FROM=${SMTP_FROM}
      - CAPTCHA_SECRET=${CAPTCHA_SECRET}
      - CAPTCHA_SITE_KEY=${CAPTCHA_SITE_KEY}
      - ENCRYPTION_KEY=${ENCRYPTION_KEY} # 32-byte hex for API key encryption
      - RUST_LOG=info
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/api/v1/health"]
      interval: 30s
      timeout: 5s
      retries: 3
  # Optional: Mailpit for local development (SMTP catch-all)
  mailpit:
    image: axllent/mailpit
    container_name: ai-synth-mail
    restart: unless-stopped
    ports:
      - "8025:8025" # Web UI
      - "1025:1025" # SMTP
    profiles:
      - dev
volumes:
  ai_synth_data:
    driver: local
7.3 Volume Mounts for SQLite
The SQLite database file is stored in a Docker named volume (ai_synth_data). This ensures:
- Data persists across container restarts and rebuilds
- The volume can be backed up via docker cp or volume backup tools
- WAL mode is used for concurrent read/write performance
Important SQLite configuration for production:
use sqlx::{sqlite::SqlitePoolOptions, Executor};

let pool = SqlitePoolOptions::new()
    .max_connections(5) // SQLite handles limited concurrency
    // Apply the PRAGMAs on every new connection in the pool
    .after_connect(|conn, _| {
        Box::pin(async move {
            conn.execute("PRAGMA journal_mode=WAL").await?;
            conn.execute("PRAGMA synchronous=NORMAL").await?;
            conn.execute("PRAGMA foreign_keys=ON").await?;
            conn.execute("PRAGMA busy_timeout=5000").await?;
            Ok(())
        })
    })
    .connect(&database_url)
    .await?;
7.4 Environment Variable Configuration
A .env.example file documents all required and optional variables:
# === Required ===
DATABASE_URL=sqlite:///app/data/ai_synth.db
SESSION_SECRET=<64-byte-hex-string>
ENCRYPTION_KEY=<32-byte-hex-string>
ADMIN_EMAIL=admin@example.com
# === SMTP (required for magic link auth) ===
SMTP_HOST=smtp.example.com
SMTP_PORT=587
SMTP_USER=user@example.com
SMTP_PASSWORD=password
SMTP_FROM=noreply@example.com
# === Captcha ===
CAPTCHA_SECRET=<turnstile-secret-key>
CAPTCHA_SITE_KEY=<turnstile-site-key>
# === Optional ===
PORT=8080
RUST_LOG=info
BASE_URL=https://your-domain.com # For magic link URLs
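Loading and validating these variables at startup might look like the sketch below. It is an illustration under assumptions, not the planned code: `Config`, `ConfigError`, and `load` are hypothetical names, only a subset of the variables is covered, and ENCRYPTION_KEY is assumed to be 32 bytes written as 64 hex characters. Variables are read from a map so the logic can be exercised without touching the process environment.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum ConfigError {
    Missing(&'static str),
    Invalid(&'static str),
}

#[derive(Debug)]
pub struct Config {
    pub database_url: String,
    pub admin_email: String,
    pub encryption_key: [u8; 32],
    pub smtp_port: u16,
    pub port: u16,
}

/// Returns the value of a required variable, rejecting missing or empty values.
fn required<'a>(vars: &'a HashMap<String, String>, name: &'static str) -> Result<&'a str, ConfigError> {
    match vars.get(name).map(String::as_str) {
        Some(v) if !v.is_empty() => Ok(v),
        _ => Err(ConfigError::Missing(name)),
    }
}

/// Parses an optional port variable, using `default` when it is not set.
fn port_or(vars: &HashMap<String, String>, name: &'static str, default: u16) -> Result<u16, ConfigError> {
    match vars.get(name) {
        Some(v) => v.parse().map_err(|_| ConfigError::Invalid(name)),
        None => Ok(default),
    }
}

/// Decodes exactly 64 hex characters into a 32-byte key.
fn hex_key(s: &str, name: &'static str) -> Result<[u8; 32], ConfigError> {
    if s.len() != 64 || !s.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Err(ConfigError::Invalid(name));
    }
    let mut key = [0u8; 32];
    for (i, byte) in key.iter_mut().enumerate() {
        *byte = u8::from_str_radix(&s[2 * i..2 * i + 2], 16)
            .map_err(|_| ConfigError::Invalid(name))?;
    }
    Ok(key)
}

/// Builds the configuration; fields are checked in declaration order.
pub fn load(vars: &HashMap<String, String>) -> Result<Config, ConfigError> {
    Ok(Config {
        database_url: required(vars, "DATABASE_URL")?.to_string(),
        admin_email: required(vars, "ADMIN_EMAIL")?.to_string(),
        encryption_key: hex_key(required(vars, "ENCRYPTION_KEY")?, "ENCRYPTION_KEY")?,
        smtp_port: port_or(vars, "SMTP_PORT", 587)?,
        port: port_or(vars, "PORT", 8080)?,
    })
}
```

At startup the map could come from `std::env::vars().collect()`, after dotenvy has loaded the .env file. Failing fast on a missing or malformed value keeps misconfiguration from surfacing later as a runtime error.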
8. Migration from Firebase
8.1 Data Migration Strategy
A standalone Rust CLI tool (or a script using firebase-admin SDK in Python/Node) handles the migration:
Step 1: Export Firestore data
Use firebase-admin SDK (Python or Node.js is simplest for this one-shot task):
# migrate_export.py
import json

import firebase_admin
from firebase_admin import auth, credentials, firestore

cred = credentials.Certificate("service-account.json")
firebase_admin.initialize_app(cred)
db = firestore.client()

data = {
    "users": [],
    "syntheses": [],
    "sources": [],
    "settings": [],
}

# Export users (from Firebase Auth); the email links old data to the new login
for user in auth.list_users().iterate_all():
    data["users"].append({"uid": user.uid, "email": user.email})

# Export syntheses, sources, settings collections
for doc in db.collection("syntheses").stream():
    d = doc.to_dict()
    d["_id"] = doc.id
    data["syntheses"].append(d)

# ... same for sources, settings

with open("firebase_export.json", "w") as f:
    json.dump(data, f, default=str)
Step 2: Transform and import into SQLite
A Rust CLI tool reads the JSON export and inserts into SQLite:
cargo run --bin migrate -- --input firebase_export.json --db ai_synth.db
Key transformations:
- authorUid / userId from Firebase Auth UID -> new UUID in users table (mapping table maintained during migration)
- Firebase Timestamp -> ISO 8601 string
- Legacy SynthesisData fields (majorAnnouncements, financialSector, etc.) -> normalized sections[] JSON
- Settings doc ID (was {userId} in Firestore) -> user_id foreign key
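The UID mapping table from the first transformation could be as simple as the sketch below. `UidMap` and `resolve` are hypothetical names, and the ID generator is passed in as a closure so the example stays free of external crates; the migration tool itself would presumably pass something like a UUID v4 generator.

```rust
use std::collections::HashMap;

/// Mapping kept for the duration of the migration: Firebase Auth UID -> new user ID.
#[derive(Default)]
pub struct UidMap {
    by_firebase_uid: HashMap<String, String>,
}

impl UidMap {
    /// Returns the new ID for `firebase_uid`, creating one with `new_id` the first
    /// time this UID is seen. Later calls return the same ID without calling `new_id`.
    pub fn resolve(&mut self, firebase_uid: &str, new_id: impl FnOnce() -> String) -> String {
        self.by_firebase_uid
            .entry(firebase_uid.to_string())
            .or_insert_with(new_id)
            .clone()
    }

    /// Number of distinct Firebase users seen so far.
    pub fn len(&self) -> usize {
        self.by_firebase_uid.len()
    }
}
```

Every authorUid or userId in the export would go through `resolve`, so records owned by the same Firebase user end up under the same new users row.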
Step 3: User notification
Since authentication changes from Google SSO to email+magic link, existing users need to be notified that they must use the magic link flow. Their email addresses (from Firebase Auth) are imported into the users table. On first magic link login, the user's existing data is accessible via their email.
8.2 Mapping Firestore Security Rules to Rust
The Firestore rules enforce three categories of protection that map to backend patterns:
| Firestore Rule | Rust Equivalent |
|---|---|
| isAuthenticated() | Auth middleware layer (rejects with 401 if no valid session) |
| isDocOwner() / request.auth.uid == resource.data.authorUid | Query-level filtering: WHERE user_id = $1 with the authenticated user's ID |
| isValidSynthesis() / isValidSettings() / isValidSource() | Request validation using validator crate or manual checks in handlers |
| uidUnchanged() / uidNotModified() | Not applicable -- user_id is never in the request body; it is injected server-side from the session |
| request.resource.data.createdAt == resource.data.createdAt | created_at is set server-side and never updatable via API |
| Field type checks (string, number, timestamp) | Serde deserialization + custom validators |
| Size limits (e.g., title.size() < 500) | Validator annotations: #[validate(length(max = 500))] |
Example validation in Rust:
use serde::Deserialize;
use validator::Validate;

#[derive(Deserialize, Validate)]
pub struct CreateSourceRequest {
    #[validate(length(min = 1, max = 200))]
    pub title: String,
    #[validate(url, length(max = 1000))]
    pub url: String,
}
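For comparison, the "manual checks in handlers" option from the table could look like the following dependency-free sketch. The struct mirrors the request type without its derives, `ValidationError` and `validate` are hypothetical names, and the URL rule is an assumption that is deliberately simpler than a real URL parser: it only accepts an http:// or https:// prefix.

```rust
#[derive(Debug, PartialEq)]
pub enum ValidationError {
    TitleLength,
    UrlLength,
    UrlScheme,
}

/// Plain copy of the request shape, without serde or validator derives.
pub struct CreateSourceRequest {
    pub title: String,
    pub url: String,
}

/// Enforces the same length limits as the annotated struct, counted in characters.
pub fn validate(req: &CreateSourceRequest) -> Result<(), ValidationError> {
    let title_len = req.title.chars().count();
    if title_len == 0 || title_len > 200 {
        return Err(ValidationError::TitleLength);
    }
    if req.url.chars().count() > 1000 {
        return Err(ValidationError::UrlLength);
    }
    // Assumption: only http(s) sources are accepted; no further URL parsing is done.
    if !(req.url.starts_with("http://") || req.url.starts_with("https://")) {
        return Err(ValidationError::UrlScheme);
    }
    Ok(())
}
```

A handler would call `validate` before touching the database and map each error variant to a 400 response.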
The key architectural difference: in Firestore, rules are the only security layer (the client has direct DB access). In the Rust backend, security is enforced at the handler level (authentication middleware + query scoping + input validation). The database is never directly accessible from the client.
Ownership enforcement pattern:
Every query that reads or mutates user data includes WHERE user_id = ? with the authenticated user's ID. This is not a "rule" but a structural guarantee -- there is no code path that can access another user's data because the user ID comes from the session, not the request.
// db/syntheses.rs
pub async fn get_by_id(pool: &SqlitePool, user_id: &str, synthesis_id: &str) -> Result<Option<Synthesis>, sqlx::Error> {
    sqlx::query_as!(
        Synthesis,
        "SELECT * FROM syntheses WHERE id = ? AND user_id = ?",
        synthesis_id,
        user_id
    )
    .fetch_optional(pool)
    .await
}
If the synthesis belongs to another user, this returns None, and the handler returns 404. There is no way for a user to query, update, or delete another user's data.
Summary of Key Crate Dependencies
| Purpose | Crate | Version Guidance |
|---|---|---|
| Web framework | axum | ^0.8 |
| Async runtime | tokio | ^1 (full features) |
| Database | sqlx | ^0.8 (features: sqlite, runtime-tokio) |
| HTTP client | reqwest | ^0.12 (features: json, cookies) |
| HTML parsing | scraper | ^0.22 |
| Serialization | serde, serde_json | ^1 |
| Date/time | chrono | ^0.4 |
| Token hashing | sha2 | ^0.10 |
| Random tokens | rand | ^0.8 |
| SMTP | lettre | ^0.11 |
| Logging | tracing, tracing-subscriber | ^0.1 / ^0.3 |
| Config | dotenvy | ^0.15 |
| Validation | validator | ^0.19 |
| Concurrent map | dashmap | ^6 |
| Static file serving | tower-http | ^0.6 (features: fs, cors, trace) |
| Cookie handling | axum-extra | ^0.10 (features: cookie) |
| Encryption (API keys) | aes-gcm | ^0.10 |
| Base64 | base64 | ^0.22 |
| UUID | uuid | ^1 (features: v4) |
| Error handling | anyhow, thiserror | ^1 |
Architecture Diagram (Text)
                                ┌─────────────────────┐
                                │  Docker Container   │
                                │                     │
Browser ◄──── HTTPS ────► ┌─────┴─────────────────┐   │
(SolidJS SPA)             │    Axum Web Server    │   │
                          │                       │   │
                          │ /static/* ──► ServeDir│   │
                          │ /api/v1/* ──► Router  │   │
                          │                       │   │
                          │ ┌─ Auth Middleware ─┐ │   │
                          │ │ Session Cookie    │ │   │
                          │ │ CSRF Check        │ │   │
                          │ └───────────────────┘ │   │
                          │                       │   │
                          │ ┌─ Handlers ────────┐ │   │
                          │ │ auth, syntheses,  │ │   │
                          │ │ sources, settings,│ │   │
                          │ │ admin, email      │ │   │
                          │ └────────┬──────────┘ │   │
                          │          │            │   │
                          │ ┌─ Services ────────┐ │   │
                          │ │ LLM providers     │─┼───┼──► Gemini API
                          │ │ (trait-based)     │─┼───┼──► OpenAI API
                          │ │                   │─┼───┼──► Anthropic API
                          │ │ Scraper (reqwest) │─┼───┼──► Target URLs
                          │ │ Email (lettre)    │─┼───┼──► SMTP Server
                          │ │ Captcha           │─┼───┼──► Turnstile API
                          │ └────────┬──────────┘ │   │
                          │          │            │   │
                          │ ┌─ DB Layer (sqlx) ─┐ │   │
                          │ │ SQLite (WAL)      │ │   │
                          │ └───────────────────┘ │   │
                          └───────────┬───────────┘   │
                                      │               │
                          ┌───────────▼───────────┐   │
                          │      /app/data/       │   │
                          │      ai_synth.db      │   │
                          │    (Docker volume)    │   │
                          └───────────────────────┘   │
                                └─────────────────────┘