# Technical Architecture Analysis: AI Weekly Synth Refactoring
## Open Questions and Clarifications Needed
Before implementation, the following points require decisions from stakeholders:
1. **Admin scope**: Is the "admin" a single super-user defined by config, or a full role-based system with multiple admins? This analysis assumes a simple role flag on users plus a single bootstrap admin defined via environment variable.
2. **Google OAuth retention**: The requirements specify email+captcha and magic link auth. Should Google SSO be dropped entirely, or kept as an additional option? This analysis assumes Google SSO is dropped to remove all Google dependencies.
3. **Email sending for syntheses**: The current app sends syntheses via Gmail API with OAuth popup. With Google dependencies removed, should SMTP-based email sending replace this? This analysis assumes yes, using the same SMTP configuration as magic link delivery.
4. **Data migration volume**: How many existing users and syntheses need migrating? This impacts whether a one-shot script suffices or whether incremental migration tooling is needed.
5. **Concurrent users target**: Rate limiter design and session store choice depend on expected load. This analysis assumes a small-to-medium deployment (1-100 concurrent users).
6. **Legacy data**: The current `SynthesisData` has legacy fields (`majorAnnouncements`, `financialSector`, etc.). The requirements say "remove legacy data/formats/code." This analysis assumes legacy fields are dropped during migration; only the `sections[]` format is carried forward.
---
## 1. Rust Backend Architecture
### 1.1 Framework Choice: Axum
**Recommendation: Axum** over Actix-web.
**Justification:**
| Criterion | Axum | Actix-web |
|---|---|---|
| Ecosystem alignment | Built on `tokio` + `tower` + `hyper` -- the de-facto Rust async stack | Has its own runtime layer (though uses tokio underneath) |
| Middleware model | Tower `Layer`/`Service` -- composable, reusable, testable | Own `Transform`/`Service` traits -- powerful, but middleware is specific to actix-web |
| Extractors | Type-safe, ergonomic, uses `FromRequest` traits | Similar, but with `web::Data`, `web::Json` wrappers |
| Community trajectory | Growing faster, backed by the tokio team | Mature, stable, but slower growth |
| Learning curve | Lower for developers already using tokio ecosystem | Slightly higher due to framework-specific middleware and runtime concepts |
| Compile-time type safety | Strong -- handler function signatures are validated at compile time | Strong, but less ergonomic error messages |
Axum's tower-based middleware model is a decisive advantage for this project: the auth middleware, rate limiter, and CORS layer compose naturally as tower `Layer`s. Axum also has first-class support for shared state via `State` extractor, which maps well to a shared database pool and configuration.
### 1.2 Project Structure
```
ai-synth-backend/
├── Cargo.toml
├── Cargo.lock
├── .env.example
├── migrations/ # sqlx migrations
│ ├── 001_create_users.sql
│ ├── 002_create_sessions.sql
│ ├── 003_create_settings.sql
│ ├── 004_create_sources.sql
│ ├── 005_create_syntheses.sql
│ ├── 006_create_admin_config.sql
│ └── 007_create_rate_limits.sql
├── src/
│ ├── main.rs # Entry point: init tracing, DB, run server
│ ├── config.rs # Env-based configuration (envy / dotenvy)
│ ├── app_state.rs # AppState struct (pool, config, http client)
│ ├── error.rs # AppError enum, IntoResponse impl
│ ├── router.rs # All route definitions, middleware wiring
│ ├── middleware/
│ │ ├── mod.rs
│ │ ├── auth.rs # Session cookie extraction, user injection
│ │ ├── csrf.rs # Custom-header CSRF check (see 2.3)
│ │ └── rate_limit.rs # Per-provider, configurable rate limiter
│ ├── models/
│ │ ├── mod.rs
│ │ ├── user.rs # User, NewUser, UserRole
│ │ ├── session.rs # Session
│ │ ├── settings.rs # UserSettings
│ │ ├── source.rs # Source
│ │ ├── synthesis.rs # Synthesis, NewsSection, NewsItem
│ │ └── admin.rs # LlmProviderConfig, RateLimitConfig
│ ├── handlers/
│ │ ├── mod.rs
│ │ ├── auth.rs # register, login (magic link), verify, logout
│ │ ├── syntheses.rs # list, get, create (trigger generation), delete
│ │ ├── sources.rs # CRUD, bulk import, CSV export
│ │ ├── settings.rs # get, update, export, import
│ │ ├── admin.rs # LLM config CRUD, rate limit config, user list
│ │ └── email.rs # Send synthesis by email
│ ├── services/
│ │ ├── mod.rs
│ │ ├── llm/
│ │ │ ├── mod.rs # LlmProvider trait, factory function
│ │ │ ├── gemini.rs # Google Gemini implementation
│ │ │ ├── openai.rs # OpenAI implementation
│ │ │ ├── anthropic.rs # Anthropic implementation
│ │ │ └── types.rs # Shared request/response types
│ │ ├── synthesis.rs # 2-pass generation pipeline orchestration
│ │ ├── scraper.rs # URL validation, HTML scraping, date extraction
│ │ ├── email.rs # SMTP email sending (magic links + syntheses)
│ │ └── captcha.rs # Captcha verification
│ └── db/
│ ├── mod.rs
│ ├── users.rs # User queries
│ ├── sessions.rs # Session queries
│ ├── settings.rs # Settings queries
│ ├── sources.rs # Source queries
│ ├── syntheses.rs # Synthesis queries
│ └── admin.rs # Admin config queries
└── tests/
    ├── api/ # Integration tests
    └── services/ # Unit tests for services
```
### 1.3 Layered Architecture
The application follows a clean 3-layer architecture:
- **Handlers** (HTTP layer): Extract request data, call services, return responses. No business logic.
- **Services** (Business layer): Orchestrate operations, enforce business rules, call DB and external APIs.
- **DB** (Persistence layer): Raw sqlx queries, mapping to/from model structs.
### 1.4 Error Handling
A unified `AppError` enum implements `IntoResponse`:
```rust
#[derive(Debug)]
pub enum AppError {
    // Client errors
    BadRequest(String),
    Unauthorized(String),
    Forbidden(String),
    NotFound(String),
    Conflict(String),
    TooManyRequests { retry_after_secs: u64 },
    ValidationError(Vec<FieldError>),
    // Server errors
    Internal(anyhow::Error),
    LlmError(String),
    SmtpError(String),
    ScrapingError(String),
}

impl IntoResponse for AppError {
    fn into_response(self) -> axum::response::Response {
        let (status, message) = match &self {
            AppError::BadRequest(msg) => (StatusCode::BAD_REQUEST, msg.clone()),
            AppError::Unauthorized(_) => (StatusCode::UNAUTHORIZED, "Unauthorized".into()),
            AppError::Forbidden(_) => (StatusCode::FORBIDDEN, "Forbidden".into()),
            AppError::NotFound(msg) => (StatusCode::NOT_FOUND, msg.clone()),
            AppError::TooManyRequests { retry_after_secs } => {
                (StatusCode::TOO_MANY_REQUESTS, format!("Retry after {retry_after_secs}s"))
            }
            AppError::Internal(e) => {
                // Log the full error chain, but never leak it to the client.
                tracing::error!("Internal error: {e:#}");
                (StatusCode::INTERNAL_SERVER_ERROR, "Internal server error".into())
            }
            // ...
        };
        let mut response = (status, Json(json!({ "error": message }))).into_response();
        // Rate-limited responses also carry a Retry-After header.
        if let AppError::TooManyRequests { retry_after_secs } = &self {
            response.headers_mut().insert(
                axum::http::header::RETRY_AFTER,
                axum::http::HeaderValue::from(*retry_after_secs),
            );
        }
        response
    }
}
```
All handlers return `Result<impl IntoResponse, AppError>`. The `?` operator propagates errors naturally. `From` implementations convert `sqlx::Error`, `reqwest::Error`, etc. into `AppError`.
### 1.5 SQLite with sqlx: Schema Design
All tables use TEXT primary keys (UUIDs generated by the backend) for portability. Timestamps are stored as `TEXT` in ISO 8601 format (SQLite has no native timestamp; this also works on Postgres via `TIMESTAMPTZ` cast).
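Storing timestamps as TEXT means that checks such as "has this session expired?" become string comparisons. That is only safe when every stored value uses the same fixed-width UTC form. The sketch below illustrates the idea; the helper names `is_expired` and `has_canonical_shape` are hypothetical, and the format `YYYY-MM-DDTHH:MM:SSZ` is an assumption based on the example timestamps in this document.

```rust
/// Returns true when `expires_at` is strictly before `now`.
/// Both arguments are assumed to be UTC timestamps in the fixed-width
/// form `YYYY-MM-DDTHH:MM:SSZ`; under that assumption, byte-wise string
/// comparison agrees with chronological order.
fn is_expired(expires_at: &str, now: &str) -> bool {
    expires_at < now
}

/// Cheap shape check that rejects values which would break the ordering
/// assumption (wrong length, missing `T` separator or `Z` suffix).
fn has_canonical_shape(ts: &str) -> bool {
    let b = ts.as_bytes();
    b.len() == 20
        && b[4] == b'-' && b[7] == b'-' && b[10] == b'T'
        && b[13] == b':' && b[16] == b':' && b[19] == b'Z'
        && b.iter().enumerate().all(|(i, c)| {
            matches!(i, 4 | 7 | 10 | 13 | 16 | 19) || c.is_ascii_digit()
        })
}
```

A value with a numeric offset (for example `+01:00`) or a missing leading zero would sort incorrectly, so normalizing to one UTC format before writing is the safer choice.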
#### Migration 001: Users
```sql
CREATE TABLE users (
id TEXT PRIMARY KEY, -- UUID
email TEXT NOT NULL UNIQUE,
display_name TEXT,
role TEXT NOT NULL DEFAULT 'user', -- 'user' | 'admin'
created_at TEXT NOT NULL, -- ISO 8601
updated_at TEXT NOT NULL
);
CREATE INDEX idx_users_email ON users(email);
```
#### Migration 002: Sessions
```sql
CREATE TABLE sessions (
id TEXT PRIMARY KEY, -- Secure random token (32 bytes, base64url)
user_id TEXT NOT NULL REFERENCES users(id) ON DELETE CASCADE,
created_at TEXT NOT NULL,
expires_at TEXT NOT NULL,
ip_address TEXT,
user_agent TEXT
);
CREATE INDEX idx_sessions_user_id ON sessions(user_id);
CREATE INDEX idx_sessions_expires_at ON sessions(expires_at);
```
#### Migration 003: Settings
```sql
CREATE TABLE settings (
user_id TEXT PRIMARY KEY REFERENCES users(id) ON DELETE CASCADE,
theme TEXT NOT NULL DEFAULT 'Intelligence Artificielle',
max_age_days INTEGER NOT NULL DEFAULT 7,
categories TEXT NOT NULL, -- JSON array stored as TEXT
max_items_per_category INTEGER NOT NULL DEFAULT 4,
search_agent_behavior TEXT NOT NULL DEFAULT '',
ai_model TEXT NOT NULL DEFAULT 'gemini-3.1-pro-preview',
updated_at TEXT NOT NULL
);
```
#### Migration 004: Sources
```sql
CREATE TABLE sources (
id TEXT PRIMARY KEY,
user_id TEXT NOT NULL REFERENCES users(id) ON DELETE CASCADE,
title TEXT NOT NULL,
url TEXT NOT NULL,
created_at TEXT NOT NULL
);
CREATE INDEX idx_sources_user_id ON sources(user_id);
```
#### Migration 005: Syntheses
```sql
CREATE TABLE syntheses (
id TEXT PRIMARY KEY,
user_id TEXT NOT NULL REFERENCES users(id) ON DELETE CASCADE,
week TEXT NOT NULL, -- e.g. "2026-W12"
sections TEXT NOT NULL, -- JSON: [{ title, items: [{ title, url, summary }] }]
created_at TEXT NOT NULL
);
CREATE INDEX idx_syntheses_user_id ON syntheses(user_id);
CREATE INDEX idx_syntheses_created_at ON syntheses(created_at);
```
#### Migration 006: Admin Config (LLM Providers)
```sql
CREATE TABLE llm_providers (
id TEXT PRIMARY KEY,
provider TEXT NOT NULL, -- 'gemini' | 'openai' | 'anthropic'
display_name TEXT NOT NULL,
api_key TEXT NOT NULL, -- Encrypted at rest (AES-256-GCM)
base_url TEXT, -- Optional override for self-hosted/proxy
models TEXT NOT NULL, -- JSON array of available model identifiers
is_enabled BOOLEAN NOT NULL DEFAULT TRUE, -- TRUE/FALSE literals work on SQLite 3.23+ and Postgres
created_at TEXT NOT NULL,
updated_at TEXT NOT NULL,
UNIQUE(provider)
);
```
#### Migration 007: Rate Limit Configuration
```sql
CREATE TABLE rate_limits (
id TEXT PRIMARY KEY,
provider_id TEXT NOT NULL REFERENCES llm_providers(id) ON DELETE CASCADE,
max_requests INTEGER NOT NULL DEFAULT 29,
time_window_ms INTEGER NOT NULL DEFAULT 60000,
updated_at TEXT NOT NULL,
UNIQUE(provider_id)
);
-- Magic link tokens (also used for magic link rate limiting)
CREATE TABLE magic_link_tokens (
id TEXT PRIMARY KEY,
email TEXT NOT NULL,
token_hash TEXT NOT NULL, -- SHA-256 of the token
created_at TEXT NOT NULL,
expires_at TEXT NOT NULL,
used BOOLEAN NOT NULL DEFAULT FALSE
);
CREATE INDEX idx_magic_link_email ON magic_link_tokens(email);
```
### 1.6 SQLite/Postgres Dual Compatibility Strategy
**Option considered: runtime database selection via `sqlx::AnyPool`.** `AnyPool` gives up compile-time query checking, so it is not the recommended path.
**Recommendation: feature-flag based conditional compilation.**
```toml
# Cargo.toml
[features]
default = ["sqlite"]
sqlite = ["sqlx/sqlite"]
postgres = ["sqlx/postgres"]
```
For this project, the SQL differences between SQLite and Postgres are minimal:
| Concern | SQLite | Postgres | Resolution |
|---|---|---|---|
| Auto-increment PK | `INTEGER PRIMARY KEY` | `SERIAL` | Use UUID TEXT PKs -- identical on both |
| Timestamps | `TEXT` (ISO 8601) | `TIMESTAMPTZ` | Store as TEXT on both; parse in application layer |
| JSON columns | `TEXT` + app-side JSON parse | `JSONB` | Store as TEXT on both; Postgres can migrate to JSONB later |
| Boolean | `INTEGER` (0/1) | `BOOLEAN` | Use `INTEGER` on SQLite, `BOOLEAN` on Postgres; sqlx handles mapping |
| RETURNING clause | Supported since SQLite 3.35 | Supported | Use `RETURNING` on both |
**Practical approach for v1**: Target SQLite only. Write SQL that is Postgres-compatible by design (UUID text PKs, ISO timestamps, no SQLite-specific functions). When the Postgres upgrade happens, create a parallel `migrations_pg/` folder and swap the connection pool. The query layer (db/) remains identical because all queries use standard SQL.
Compile-time checking is preserved by using `sqlx::query!` and `sqlx::query_as!` macros with the `DATABASE_URL` environment variable pointing to an SQLite file during development.
---
## 2. API Design
### 2.1 REST API Endpoints
All endpoints prefixed with `/api/v1`. Request and response bodies are JSON unless stated otherwise.
#### Authentication
| Method | Path | Auth | Description |
|---|---|---|---|
| `POST` | `/auth/register` | No | Create account (email + captcha) |
| `POST` | `/auth/login` | No | Request magic link (email + captcha) |
| `GET` | `/auth/verify?token=...` | No | Verify magic link token, create session |
| `POST` | `/auth/logout` | Yes | Invalidate session |
| `GET` | `/auth/me` | Yes | Get current user info |
#### Syntheses
| Method | Path | Auth | Description |
|---|---|---|---|
| `GET` | `/syntheses` | Yes | List user's syntheses (paginated) |
| `GET` | `/syntheses/:id` | Yes | Get synthesis detail |
| `POST` | `/syntheses/generate` | Yes | Trigger generation (async, returns job ID) |
| `GET` | `/syntheses/generate/:job_id/status` | Yes | Poll generation status |
| `DELETE` | `/syntheses/:id` | Yes | Delete a synthesis |
| `POST` | `/syntheses/:id/email` | Yes | Send synthesis by email |
#### Sources
| Method | Path | Auth | Description |
|---|---|---|---|
| `GET` | `/sources` | Yes | List user's sources |
| `POST` | `/sources` | Yes | Add a source |
| `DELETE` | `/sources/:id` | Yes | Delete a source |
| `POST` | `/sources/bulk` | Yes | Bulk import (JSON array) |
| `POST` | `/sources/import-csv` | Yes | Import from CSV (multipart upload) |
| `GET` | `/sources/export-csv` | Yes | Export as CSV download |
#### Settings
| Method | Path | Auth | Description |
|---|---|---|---|
| `GET` | `/settings` | Yes | Get user's settings |
| `PUT` | `/settings` | Yes | Update settings |
| `GET` | `/settings/export` | Yes | Export as JSON download |
| `POST` | `/settings/import` | Yes | Import from JSON |
#### Admin
| Method | Path | Auth | Description |
|---|---|---|---|
| `GET` | `/admin/providers` | Admin | List LLM provider configs |
| `POST` | `/admin/providers` | Admin | Add/update provider config |
| `DELETE` | `/admin/providers/:id` | Admin | Remove provider |
| `GET` | `/admin/rate-limits` | Admin | Get rate limit configs |
| `PUT` | `/admin/rate-limits/:provider_id` | Admin | Update rate limit config |
| `GET` | `/admin/users` | Admin | List all users |
| `PUT` | `/admin/users/:id/role` | Admin | Change user role |
#### Frontend Configuration (authenticated)
| Method | Path | Auth | Description |
|---|---|---|---|
| `GET` | `/config/providers` | Yes | List enabled providers + their model names (no API keys) |
### 2.2 Request/Response Shapes
**POST /auth/register**
```json
// Request
{
"email": "user@example.com",
"display_name": "Jane Doe",
"captcha_token": "hcaptcha-response-token"
}
// Response 200
{
"message": "A verification link has been sent to your email."
}
```
**POST /syntheses/generate**
```json
// Request (empty body -- uses user's saved settings and sources)
{}
// Response 202
{
"job_id": "uuid-of-generation-job",
"status": "pending"
}
```
**GET /syntheses/:id**
```json
// Response 200
{
"id": "uuid",
"week": "2026-W12",
"created_at": "2026-03-21T10:30:00Z",
"sections": [
{
"title": "Annonces majeures",
"items": [
{
"title": "Article title",
"url": "https://example.com/article",
"summary": "4-5 line summary..."
}
]
}
]
}
```
**PUT /settings**
```json
// Request
{
"theme": "Intelligence Artificielle",
"max_age_days": 7,
"categories": ["Annonces majeures", "Secteur financier"],
"max_items_per_category": 4,
"search_agent_behavior": "Custom instructions...",
"ai_model": "gemini-3.1-pro-preview"
}
// Response 200
{
"message": "Settings updated successfully."
}
```
**POST /admin/providers**
```json
// Request
{
"provider": "openai",
"display_name": "OpenAI GPT-4o",
"api_key": "sk-...",
"base_url": null,
"models": ["gpt-4o", "gpt-4o-mini"],
"is_enabled": true
}
```
### 2.3 Authentication Middleware
The auth middleware is a tower `Layer` that:
1. Extracts the session cookie (`ai_synth_session`) from the request.
2. Looks up the session ID in the `sessions` table.
3. Checks `expires_at` has not passed.
4. Loads the `User` from the `users` table.
5. Injects the `User` into request extensions (`request.extensions_mut().insert(user)`).
6. Handlers extract the user via `Extension<User>` or a custom `AuthUser` extractor.
For admin routes, an additional `RequireAdmin` layer checks `user.role == "admin"`.
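Step 1 of the middleware (finding the session cookie) can be illustrated without any framework. The function below is a hand-written sketch for clarity only; its name is hypothetical, and in an Axum application a cookie library (for example the `CookieJar` extractor from `axum-extra`) would normally do this parsing.

```rust
/// Extracts the value of the `ai_synth_session` cookie from a raw
/// `Cookie` request header such as `a=1; ai_synth_session=abc; b=2`.
/// Returns `None` when the cookie is absent or has an empty value.
fn session_id_from_cookie_header(header: &str) -> Option<&str> {
    header
        .split(';')
        // Each pair looks like ` name=value`; skip malformed entries.
        .filter_map(|pair| pair.trim().split_once('='))
        // Cookie names are matched exactly, not by suffix or prefix.
        .find(|(name, _)| *name == "ai_synth_session")
        .map(|(_, value)| value)
        .filter(|value| !value.is_empty())
}
```

The returned string is what steps 2 and 3 would look up in the `sessions` table; a `None` result maps to a 401 response.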
**Session cookies configuration:**
```rust
Cookie::build(("ai_synth_session", session_id))
.http_only(true)
.secure(true) // HTTPS only
.same_site(SameSite::Lax)
.path("/")
.max_age(Duration::days(30))
```
**CSRF Protection:**
Since this is an API consumed by a SPA on the same origin (or proxied), the combination of `SameSite=Lax` cookies and requiring a custom header (`X-Requested-With: XMLHttpRequest`) on mutating requests provides sufficient CSRF protection. This is the "custom header" pattern -- browsers will not send custom headers on cross-origin requests without CORS preflight approval.
For the SPA, every `fetch` call to the API includes:
```javascript
headers: { "X-Requested-With": "XMLHttpRequest" }
```
The CSRF middleware rejects `POST/PUT/DELETE` requests missing this header.
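The decision the CSRF middleware makes can be sketched as a pure function. This is a minimal illustration, not the middleware itself: the function name and the `(name, value)` header representation are assumptions, and treating `PATCH` as mutating is an addition beyond the three methods listed above.

```rust
/// Decides whether a request passes the custom-header CSRF check.
/// `headers` is a list of (name, value) pairs; header names are
/// compared case-insensitively, as HTTP requires.
fn passes_csrf_check(method: &str, headers: &[(&str, &str)]) -> bool {
    // PATCH is included here as an assumption; the text names POST/PUT/DELETE.
    let mutating = matches!(
        method.to_ascii_uppercase().as_str(),
        "POST" | "PUT" | "PATCH" | "DELETE"
    );
    if !mutating {
        return true; // non-mutating methods are not checked
    }
    headers.iter().any(|(name, value)| {
        name.eq_ignore_ascii_case("X-Requested-With") && *value == "XMLHttpRequest"
    })
}
```

In the real tower layer the same test would run against the request's header map and return a 403 when it fails.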
---
## 3. LLM Provider Abstraction
### 3.1 Trait Design
```rust
#[async_trait]
pub trait LlmProvider: Send + Sync {
/// Returns the provider identifier (e.g., "gemini", "openai", "anthropic").
fn provider_id(&self) -> &str;
/// Pass 1: Search the web and generate structured news items.
/// Returns raw JSON matching the category schema.
async fn generate_search_pass(
&self,
model: &str,
system_prompt: &str,
user_prompt: &str,
response_schema: &serde_json::Value,
) -> Result<serde_json::Value, AppError>;
/// Pass 2: Rewrite titles and summaries based on scraped content.
/// No web search tool needed.
async fn generate_rewrite_pass(
&self,
model: &str,
system_prompt: &str,
user_prompt: &str,
response_schema: &serde_json::Value,
) -> Result<serde_json::Value, AppError>;
/// Lists available models for this provider.
fn available_models(&self) -> &[String];
}
```
### 3.2 Provider-Specific Web Search Handling
Each provider handles web grounding differently. The trait design abstracts this:
| Provider | Pass 1 (Search) | Pass 2 (Rewrite) |
|---|---|---|
| **Gemini** | Uses `googleSearch` tool in config. Structured output via `responseSchema`. | Standard generation, no tools. `responseSchema` for structured output. |
| **OpenAI** | Uses `web_search` tool (Responses API) or a two-step approach: first call with `browsing` tool, then structured output. | Standard chat completion with `response_format: { type: "json_schema", ... }`. |
| **Anthropic** | Uses `web_search` tool (available on Claude models). Structured output via tool-use pattern or explicit JSON instructions. | Standard message with JSON output instructions. Anthropic does not have native JSON schema enforcement, so the prompt includes the schema and parsing is done server-side with validation. |
**Implementation details for each provider:**
```rust
// Gemini implementation
pub struct GeminiProvider {
client: reqwest::Client,
api_key: String,
base_url: String,
models: Vec<String>,
}
impl GeminiProvider {
async fn generate_search_pass(&self, model: &str, ...) -> Result<serde_json::Value, AppError> {
// POST to /v1beta/models/{model}:generateContent
// Config includes: tools: [{ googleSearch: {} }]
// responseMimeType: "application/json"
// responseSchema: <schema>
}
}
// OpenAI implementation
pub struct OpenAiProvider {
client: reqwest::Client,
api_key: String,
base_url: String, // default: https://api.openai.com/v1
models: Vec<String>,
}
// Anthropic implementation
pub struct AnthropicProvider {
client: reqwest::Client,
api_key: String,
base_url: String, // default: https://api.anthropic.com
models: Vec<String>,
}
```
### 3.3 Provider Factory
```rust
pub fn create_provider(config: &LlmProviderConfig) -> Result<Box<dyn LlmProvider>, AppError> {
match config.provider.as_str() {
"gemini" => Ok(Box::new(GeminiProvider::new(
config.api_key.clone(),
config.base_url.clone(),
config.models.clone(),
))),
"openai" => Ok(Box::new(OpenAiProvider::new(...))),
"anthropic" => Ok(Box::new(AnthropicProvider::new(...))),
_ => Err(AppError::BadRequest(format!("Unknown provider: {}", config.provider))),
}
}
```
### 3.4 Rate Limiter Design
The rate limiter is a server-side, per-provider, in-memory sliding-window limiter (it keeps a log of recent request timestamps rather than a token count), with configuration stored in the database.
```rust
pub struct RateLimiter {
    state: Arc<DashMap<String, ProviderBucket>>,
}

struct ProviderBucket {
    timestamps: VecDeque<Instant>,
    max_requests: u32,
    time_window: Duration,
}

impl RateLimiter {
    /// Waits until a slot is available for the given provider.
    pub async fn acquire(&self, provider_id: &str) -> Result<(), AppError> {
        loop {
            // Scope the DashMap guard so it is released before the `.await`.
            let wait_time = {
                // `default_bucket()` (not shown) builds a bucket from the configured defaults.
                let mut bucket = self.state
                    .entry(provider_id.to_string())
                    .or_insert_with(|| self.default_bucket());
                let window = bucket.time_window;
                bucket.timestamps.retain(|t| t.elapsed() < window);
                if bucket.timestamps.len() < bucket.max_requests as usize {
                    bucket.timestamps.push_back(Instant::now());
                    return Ok(());
                }
                // Time until the oldest request leaves the window;
                // saturating_sub avoids a panic if it already has.
                let oldest = bucket.timestamps.front().map(Instant::elapsed).unwrap_or_default();
                window.saturating_sub(oldest)
            };
            tokio::time::sleep(wait_time).await;
        }
    }

    /// Reload configuration from DB (called by admin update endpoint).
    pub async fn reload_config(&self, pool: &SqlitePool) -> Result<(), AppError> {
        // Fetch rate_limits table, update each ProviderBucket
        todo!()
    }
}
```
The rate limiter lives in `AppState` and is shared across all requests. When an admin updates rate limit configuration, `reload_config` is called to hot-reload without restart.
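The `acquire` loop above keeps a log of timestamps per provider and admits a request when fewer than `max_requests` fall inside the trailing window. The core of that logic can be isolated into a synchronous, dependency-free sketch that takes the current time as a parameter, which makes it easy to unit-test. The type and method names here are hypothetical.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Sliding-window log: remembers the instant of each accepted request
/// and admits a new one only if fewer than `max_requests` fall inside
/// the trailing `window`.
struct SlidingWindow {
    timestamps: VecDeque<Instant>,
    max_requests: usize,
    window: Duration,
}

impl SlidingWindow {
    fn new(max_requests: usize, window: Duration) -> Self {
        Self { timestamps: VecDeque::new(), max_requests, window }
    }

    /// Tries to take a slot at time `now`.
    /// Returns `Ok(())` on success, or `Err(wait)` with the time until
    /// the oldest recorded request leaves the window.
    fn try_acquire(&mut self, now: Instant) -> Result<(), Duration> {
        // Drop entries that are no longer inside the window.
        while let Some(&oldest) = self.timestamps.front() {
            if now.duration_since(oldest) >= self.window {
                self.timestamps.pop_front();
            } else {
                break;
            }
        }
        if self.timestamps.len() < self.max_requests {
            self.timestamps.push_back(now);
            return Ok(());
        }
        match self.timestamps.front() {
            Some(&oldest) => Err(self.window.saturating_sub(now.duration_since(oldest))),
            None => Err(self.window), // max_requests == 0: never admits
        }
    }
}
```

With the defaults from the `rate_limits` table this would be `SlidingWindow::new(29, Duration::from_millis(60_000))`. An async wrapper would sleep for the returned wait time and retry, as `acquire` does.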
### 3.5 Two-Pass Generation Pipeline
The `SynthesisService` orchestrates the full pipeline:
```rust
pub struct SynthesisService;

impl SynthesisService {
    pub async fn generate(
        state: &AppState,
        user_id: &str,
    ) -> Result<Synthesis, AppError> {
        // Outline only: assumes AppState exposes the connection pool as `pool`.
        // Construction of system_prompt, user_prompt, rewrite_system_prompt,
        // rewrite_prompt and week_string is elided.
        let pool = &state.pool;

        // 1. Load user settings
        let settings = db::settings::get(pool, user_id).await?;
        // 2. Load user sources
        let sources = db::sources::list(pool, user_id).await?;
        // 3. Resolve LLM provider + model
        let (provider, model) = resolve_provider(state, &settings.ai_model).await?;
        // 4. Build dynamic schema from categories
        let schema = build_category_schema(&settings.categories);
        // 5. Rate limit: acquire slot
        state.rate_limiter.acquire(provider.provider_id()).await?;
        // 6. Pass 1: Search
        let raw_results = provider.generate_search_pass(
            &model, &system_prompt, &user_prompt, &schema
        ).await?;
        // 7. Validate & scrape URLs (server-side, no CORS issues)
        let scraped = scraper::validate_and_scrape(
            &state.http_client,
            raw_results,
            settings.max_age_days,
        ).await;
        // 8. Rate limit: acquire slot for pass 2
        state.rate_limiter.acquire(provider.provider_id()).await?;
        // 9. Pass 2: Rewrite with scraped content (rewrite_prompt is built from `scraped`)
        let final_results = provider.generate_rewrite_pass(
            &model, &rewrite_system_prompt, &rewrite_prompt, &schema
        ).await?;
        // 10. Persist
        let synthesis = db::syntheses::create(
            pool, user_id, &week_string, &final_results
        ).await?;
        Ok(synthesis)
    }
}
```
### 3.6 Asynchronous Generation
Synthesis generation can take 30-90 seconds. Two options:
**Option A: Synchronous with long timeout.** Simple, but ties up a connection. Acceptable for low-traffic deployments.
**Option B (Recommended): Background task with polling.** The `POST /syntheses/generate` endpoint spawns a tokio task and returns a job ID. The frontend polls `GET /syntheses/generate/:job_id/status`. Job state is kept in an in-memory `DashMap<String, JobStatus>` (not in DB, since jobs are ephemeral).
```rust
enum JobStatus {
Pending,
InProgress { step: String }, // "search", "scraping", "rewriting"
Completed { synthesis_id: String },
Failed { error: String },
}
```
The frontend polls every 3-5 seconds with the same loading UX as the current React app.
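The job registry behind the polling endpoint can be sketched as follows. This is an illustration under stated assumptions: a `Mutex<HashMap>` stands in for the `DashMap` so that no external crate is needed, the `JobStore` type and its methods are hypothetical, and the rule that terminal states cannot be overwritten is a design choice that the text above does not specify.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Debug, Clone, PartialEq)]
enum JobStatus {
    Pending,
    InProgress { step: String },
    Completed { synthesis_id: String },
    Failed { error: String },
}

/// In-memory job registry shared between the spawned task and handlers.
struct JobStore {
    jobs: Mutex<HashMap<String, JobStatus>>,
}

impl JobStore {
    fn new() -> Self {
        Self { jobs: Mutex::new(HashMap::new()) }
    }

    /// Registers a new job in the `Pending` state.
    fn create(&self, job_id: &str) {
        self.jobs.lock().unwrap().insert(job_id.to_string(), JobStatus::Pending);
    }

    /// Overwrites the status; returns false for unknown job IDs and for
    /// jobs that already reached a terminal state.
    fn update(&self, job_id: &str, status: JobStatus) -> bool {
        let mut jobs = self.jobs.lock().unwrap();
        match jobs.get_mut(job_id) {
            Some(JobStatus::Completed { .. }) | Some(JobStatus::Failed { .. }) | None => false,
            Some(slot) => {
                *slot = status;
                true
            }
        }
    }

    /// What the polling endpoint would read; `None` maps to a 404.
    fn status(&self, job_id: &str) -> Option<JobStatus> {
        self.jobs.lock().unwrap().get(job_id).cloned()
    }
}
```

Because the registry lives in memory, job state is lost on restart; a production version would also need to evict finished jobs after some time so the map does not grow without bound.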
---
## 4. URL Scraping / Validation
### 4.1 CORS Elimination
Moving scraping to the backend **completely eliminates CORS issues**. The Rust backend makes direct HTTP requests to target URLs -- no proxies needed. This is the single biggest reliability improvement in the refactoring.
### 4.2 reqwest-Based HTTP Client
```rust
let client = reqwest::Client::builder()
.user_agent("Mozilla/5.0 (compatible; AISynthBot/1.0; +https://your-domain.com/bot)")
.timeout(Duration::from_secs(15))
.redirect(reqwest::redirect::Policy::limited(5))
.connect_timeout(Duration::from_secs(5))
.danger_accept_invalid_certs(false)
.build()?;
```
The HTTP client is created once in `AppState` and reused across all requests (connection pooling).
### 4.3 HTML Parsing with `scraper` Crate
The current app uses the browser's `DOMParser`. The Rust equivalent uses the `scraper` crate (built on `html5ever`):
```rust
use scraper::{Html, Selector};
pub async fn validate_and_scrape(
client: &reqwest::Client,
items: Vec<RawNewsItem>,
max_age_days: i64,
) -> Vec<ScrapedNewsItem> {
let futures = items.into_iter().map(|item| {
let client = client.clone();
async move { scrape_single(&client, item, max_age_days).await }
});
let results = futures::future::join_all(futures).await;
results.into_iter().flatten().collect()
}
async fn scrape_single(
client: &reqwest::Client,
item: RawNewsItem,
max_age_days: i64,
) -> Option<ScrapedNewsItem> {
// 1. Validate URL format
let url = Url::parse(&item.url).ok()?;
// 2. Fetch
let resp = client.get(url).send().await.ok()?;
if !resp.status().is_success() { return None; }
let html_text = resp.text().await.ok()?;
// 3. Parse HTML
let document = Html::parse_document(&html_text);
// 4. Soft-404 detection
let title_sel = Selector::parse("title").unwrap();
let h1_sel = Selector::parse("h1").unwrap();
let title_text = document.select(&title_sel).next()
.map(|el| el.text().collect::<String>().to_lowercase())
.unwrap_or_default();
let h1_text = document.select(&h1_sel).next()
.map(|el| el.text().collect::<String>().to_lowercase())
.unwrap_or_default();
let error_keywords = [
"page not found", "404", "403", "access denied",
"forbidden", "not found", "introuvable",
];
if error_keywords.iter().any(|kw| title_text.contains(kw) || h1_text.contains(kw)) {
return None;
}
// 5. Date extraction (meta tags, JSON-LD, <time>)
if let Some(pub_date) = extract_publication_date(&document) {
let age = Utc::now() - pub_date;
if age.num_days() > max_age_days {
return None;
}
}
// 6. Extract body text (remove script, style, nav, etc.)
let content = extract_body_text(&document, 4000);
Some(ScrapedNewsItem {
title: item.title,
url: item.url,
summary: item.summary,
scraped_content: content,
})
}
```
**Date extraction** mirrors the current logic: check `meta[property="article:published_time"]`, `meta[itemprop="datePublished"]`, `<time datetime>`, and JSON-LD `datePublished`. The `chrono` crate handles date parsing with multiple format attempts.
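The age filter itself is simple once a date has been extracted. The sketch below shows it without any crate, using the well-known "days from civil" conversion in place of `chrono`. All function names are hypothetical, and it compares calendar days on the leading `YYYY-MM-DD` part only, which can differ by one day from the elapsed-time check in `scrape_single`.

```rust
/// Days since 1970-01-01 for a proleptic Gregorian date
/// (the "days from civil" algorithm).
fn days_from_civil(year: i64, month: i64, day: i64) -> i64 {
    let y = if month <= 2 { year - 1 } else { year };
    let era = y.div_euclid(400);
    let yoe = y - era * 400; // year of era, 0..=399
    let mp = (month + 9) % 12; // March = 0 ... February = 11
    let doy = (153 * mp + 2) / 5 + day - 1; // day of the March-based year
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    era * 146_097 + doe - 719_468
}

/// Parses the leading `YYYY-MM-DD` of an ISO 8601 string into a day number.
fn parse_iso_day(s: &str) -> Option<i64> {
    let date = s.get(..10)?;
    let mut parts = date.split('-');
    let year: i64 = parts.next()?.parse().ok()?;
    let month: i64 = parts.next()?.parse().ok()?;
    let day: i64 = parts.next()?.parse().ok()?;
    if !(1..=12).contains(&month) || !(1..=31).contains(&day) {
        return None;
    }
    Some(days_from_civil(year, month, day))
}

/// Mirrors the filter in `scrape_single`: an item is dropped only when a
/// publication date was found and it is older than `max_age_days`.
fn is_too_old(published: Option<&str>, today: &str, max_age_days: i64) -> bool {
    match (published.and_then(parse_iso_day), parse_iso_day(today)) {
        (Some(p), Some(t)) => t - p > max_age_days,
        _ => false, // no usable date: keep the item
    }
}
```

Note the asymmetry: pages with no detectable date pass the filter, so `max_age_days` is a best-effort bound rather than a guarantee.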
### 4.4 Concurrency Control
To avoid overwhelming target sites, scraping runs with bounded concurrency:
```rust
use futures::stream::{self, StreamExt};
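let scraped: Vec<ScrapedNewsItem> = stream::iter(items)
    .map(|item| scrape_single(&client, item, max_age_days))
    .buffer_unordered(10) // Max 10 concurrent scrapes
    .filter_map(|result| async move { result }) // Drop items that failed validation
    .collect()
    .await;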
```
---
## 5. SolidJS Frontend
### 5.1 Build Tooling
SolidJS uses Vite natively. The migration is straightforward:
```ts
// vite.config.ts
import { defineConfig } from 'vite';
import solidPlugin from 'vite-plugin-solid';
import tailwindcss from '@tailwindcss/vite';
export default defineConfig({
plugins: [solidPlugin(), tailwindcss()],
server: {
port: 3000,
proxy: {
'/api': 'http://localhost:8080', // Proxy to Rust backend during dev
},
},
build: {
target: 'esnext',
},
});
```
**package.json dependencies:**
```json
{
"dependencies": {
"solid-js": "^1.9",
"@solidjs/router": "^0.15",
"lucide-solid": "^0.450",
"date-fns": "^4.1"
},
"devDependencies": {
"vite": "^6.2",
"vite-plugin-solid": "^2.11",
"@tailwindcss/vite": "^4.1",
"tailwindcss": "^4.1",
"typescript": "^5.8"
}
}
```
### 5.2 State Management: React to SolidJS Mapping
| React Pattern | SolidJS Equivalent | Notes |
|---|---|---|
| `useState(value)` | `createSignal(value)` | Returns `[getter, setter]` -- getter is a function call: `count()` |
| `useEffect(() => {}, [deps])` | `createEffect(() => {})` | Auto-tracks dependencies, no dep array needed |
| `useContext(Ctx)` | `useContext(Ctx)` | Nearly identical API |
| `createContext()` | `createContext()` | Same concept |
| `React.FC<Props>` | `Component<Props>` | `import { Component } from 'solid-js'` |
| `{items.map(i => ...)}` | `<For each={items()}>{(item) => ...}</For>` | SolidJS uses `<For>` for efficient list rendering |
| `{condition && <X/>}` | `<Show when={condition()}><X/></Show>` | `<Show>` avoids unnecessary DOM creation |
| `useNavigate()` | `useNavigate()` | Same API from `@solidjs/router` |
| `useParams()` | `useParams()` | Same API |
| `onSnapshot` (realtime) | `createResource` + polling or SSE | The REST backend has no equivalent of Firestore's realtime listeners; use `createResource` for data fetching and refetch (or stream via SSE) to pick up changes |
### 5.3 Authentication Context Port
```tsx
// src/context/AuthContext.tsx
import { createContext, useContext, createResource, ParentComponent } from 'solid-js';
interface User {
id: string;
email: string;
display_name: string | null;
role: string;
}
interface AuthContextType {
user: () => User | null | undefined;
loading: () => boolean;
logout: () => Promise<void>;
}
const AuthContext = createContext<AuthContextType>();
async function fetchCurrentUser(): Promise<User | null> {
const resp = await fetch('/api/v1/auth/me', {
headers: { 'X-Requested-With': 'XMLHttpRequest' },
credentials: 'include',
});
if (resp.status === 401) return null;
if (!resp.ok) throw new Error('Failed to fetch user');
return resp.json();
}
export const AuthProvider: ParentComponent = (props) => {
const [user, { refetch }] = createResource(fetchCurrentUser);
const logout = async () => {
await fetch('/api/v1/auth/logout', {
method: 'POST',
headers: { 'X-Requested-With': 'XMLHttpRequest' },
credentials: 'include',
});
refetch();
};
return (
<AuthContext.Provider value={{
user: () => user(),
loading: () => user.loading,
logout,
}}>
{props.children}
</AuthContext.Provider>
);
};
export const useAuth = () => {
const ctx = useContext(AuthContext);
if (!ctx) throw new Error('useAuth must be used within AuthProvider');
return ctx;
};
```
### 5.4 Data Fetching Pattern
The current React app uses Firestore's `onSnapshot` for real-time updates. With the REST API backend, data fetching uses `createResource`:
```tsx
// src/pages/Home.tsx
import { createResource, For, Show } from 'solid-js';
import { A } from '@solidjs/router';
import { fetchApi } from '../lib/api';
async function fetchSyntheses() {
return fetchApi<SynthesisDocument[]>('/api/v1/syntheses');
}
export default function Home() {
const [syntheses, { refetch }] = createResource(fetchSyntheses);
return (
<Show when={!syntheses.loading} fallback={<Spinner />}>
<For each={syntheses()}>
{(synth) => (
<A href={`/synthesis/${synth.id}`}>
{/* card content */}
</A>
)}
</For>
</Show>
);
}
```
### 5.5 Tailwind CSS Compatibility
Tailwind CSS v4 is framework-agnostic: the `@tailwindcss/vite` plugin scans `.tsx` files for class names regardless of framework, so existing utility classes carry over unchanged. The one mechanical change in the JSX is that SolidJS prefers the `class` attribute over React's `className`. The `lucide-solid` package provides the same icon set as `lucide-react`, exposed as SolidJS components.
### 5.6 Routing
```tsx
// src/App.tsx
import { Router, Route } from '@solidjs/router';
import { AuthProvider } from './context/AuthContext';
function App() {
return (
<AuthProvider>
<Router>
<Route path="/login" component={Login} />
<Route path="/" component={ProtectedLayout}>
<Route path="/" component={Home} />
<Route path="/sources" component={Sources} />
<Route path="/settings" component={Settings} />
<Route path="/generate" component={GenerateSynthesis} />
<Route path="/synthesis/:id" component={SynthesisDetail} />
</Route>
</Router>
</AuthProvider>
);
}
```
The `ProtectedLayout` component checks auth and renders `<Navigate>` if not logged in -- same pattern as the current React `ProtectedRoute` but using SolidJS's `<Navigate>`.
---
## 6. Authentication System
### 6.1 Magic Link Flow
```
User Frontend Backend SMTP Server
| | | |
|-- Enter email -------->| | |
| |-- POST /auth/login --> |
| | { email, captcha_token } |
| | |-- verify captcha ->|
| | |-- generate token |
| | |-- store hash in DB |
| | |-- send email ------+-->
| |<-- 200 "Check email" | |
| | | |
|<---- Email arrives (link: /auth/verify?token=xxx) -------------|
| | | |
|-- Click link --------->| | |
| |-- GET /auth/verify?token=xxx --> |
| | |-- hash token |
| | |-- lookup in DB |
| | |-- verify not expired|
| | |-- mark as used |
| | |-- create/get user |
| | |-- create session |
| |<-- 302 redirect + Set-Cookie |
|<-- Redirect to / ------| | |
```
**Token generation:**
- 32 bytes of cryptographically secure random data (`rand::rngs::OsRng`)
- Base64url encoded for URL safety
- SHA-256 hash stored in DB (never store raw token)
- 15-minute expiry
- Single use (marked `used = true` after verification)
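
The expiry and single-use rules above can be sketched with an in-memory stand-in for the token table. This is a minimal illustration, not the backend's actual code: `MagicLinkStore`, `VerifyError`, and the use of plain Unix seconds are assumptions made for the example, and the caller is assumed to have already computed the SHA-256 hash of the raw token (for instance with the `sha2` crate).

```rust
use std::collections::HashMap;

/// Validity window for a magic link (the 15-minute expiry described above).
const TOKEN_TTL_SECS: u64 = 15 * 60;

/// One issued magic link. The raw token is never stored; the map key is its hash.
struct MagicLink {
    email: String,
    expires_at: u64, // Unix seconds
    used: bool,
}

#[derive(Debug, PartialEq)]
enum VerifyError {
    Unknown,
    Expired,
    AlreadyUsed,
}

/// Hypothetical in-memory stand-in for the magic-link table.
#[derive(Default)]
struct MagicLinkStore {
    by_hash: HashMap<String, MagicLink>,
}

impl MagicLinkStore {
    /// Records a newly issued link under the hash of its token.
    fn issue(&mut self, token_hash: &str, email: &str, now: u64) {
        self.by_hash.insert(
            token_hash.to_string(),
            MagicLink {
                email: email.to_string(),
                expires_at: now + TOKEN_TTL_SECS,
                used: false,
            },
        );
    }

    /// Looks up the hash, checks single use and expiry, then marks the link as used.
    /// Returns the email address the link was issued for.
    fn consume(&mut self, token_hash: &str, now: u64) -> Result<String, VerifyError> {
        let link = self
            .by_hash
            .get_mut(token_hash)
            .ok_or(VerifyError::Unknown)?;
        if link.used {
            return Err(VerifyError::AlreadyUsed);
        }
        if now >= link.expires_at {
            return Err(VerifyError::Expired);
        }
        link.used = true;
        Ok(link.email.clone())
    }
}
```

Against SQLite, the check and the `used` update would presumably be combined into a single statement or transaction so that two concurrent clicks cannot both succeed.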
**Rate limiting on magic link requests:**
- Max 3 requests per email per 15 minutes
- Max 10 requests per IP per hour
- Prevents email bombing
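
One way to enforce both limits is a sliding-window counter per key. The sketch below is an assumption about how this could look, not the planned implementation: `RateLimiter` and `MagicLinkLimits` are hypothetical names, time is passed in as Unix seconds to keep the logic testable, and a real server would wrap the state in a mutex or a concurrent map.

```rust
use std::collections::{HashMap, VecDeque};

/// Sliding-window limiter: at most `max_hits` accepted requests per
/// `window_secs` for each key.
struct RateLimiter {
    max_hits: usize,
    window_secs: u64,
    hits: HashMap<String, VecDeque<u64>>,
}

impl RateLimiter {
    fn new(max_hits: usize, window_secs: u64) -> Self {
        Self { max_hits, window_secs, hits: HashMap::new() }
    }

    /// Returns true and records the request if `key` is still under its limit
    /// at time `now` (Unix seconds); returns false otherwise.
    fn check(&mut self, key: &str, now: u64) -> bool {
        let queue = self.hits.entry(key.to_string()).or_default();
        // Drop timestamps that have fallen out of the window.
        while let Some(&oldest) = queue.front() {
            if now.saturating_sub(oldest) >= self.window_secs {
                queue.pop_front();
            } else {
                break;
            }
        }
        if queue.len() >= self.max_hits {
            return false;
        }
        queue.push_back(now);
        true
    }
}

/// The two limits listed above: 3 per email per 15 minutes, 10 per IP per hour.
struct MagicLinkLimits {
    per_email: RateLimiter,
    per_ip: RateLimiter,
}

impl MagicLinkLimits {
    fn new() -> Self {
        Self {
            per_email: RateLimiter::new(3, 15 * 60),
            per_ip: RateLimiter::new(10, 60 * 60),
        }
    }

    /// A request must pass both limits. Both counters are evaluated so that
    /// each one sees the attempt even when the other rejects it.
    fn allow(&mut self, email: &str, ip: &str, now: u64) -> bool {
        let email_ok = self.per_email.check(&email.to_lowercase(), now);
        let ip_ok = self.per_ip.check(ip, now);
        email_ok && ip_ok
    }
}
```

Lowercasing the email before counting is a design choice made here so that case variants of one address share a single budget.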
### 6.2 Account Registration Flow
1. User submits email + display name + captcha token.
2. Backend verifies captcha with provider.
3. Backend checks email uniqueness.
4. Backend creates user with `role = 'user'` and default settings.
5. Backend sends magic link email for initial verification.
6. User clicks link, session is created.
The first user can be bootstrapped as admin via environment variable:
```
ADMIN_EMAIL=admin@example.com
```
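On startup, if a user with this email exists, their role is set to `admin`. As described, the check runs only at startup, so on a fresh deployment the admin account has to register first and the promotion takes effect on the next restart, unless the same check is also applied at registration time.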
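
A minimal sketch of that startup check, operating on an in-memory user list instead of the `users` table. The `User` and `Role` types and the case-insensitive comparison are assumptions made for the example.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Role {
    User,
    Admin,
}

struct User {
    email: String,
    role: Role,
}

/// Promotes the user whose email matches `admin_email`, if there is one.
/// Returns true when a matching user was found (and is now an admin).
fn bootstrap_admin(users: &mut [User], admin_email: Option<&str>) -> bool {
    // ADMIN_EMAIL unset or blank: nothing to do.
    let Some(target) = admin_email.map(str::trim).filter(|e| !e.is_empty()) else {
        return false;
    };
    match users
        .iter_mut()
        .find(|u| u.email.eq_ignore_ascii_case(target))
    {
        Some(user) => {
            user.role = Role::Admin; // idempotent: safe to run on every startup
            true
        }
        None => false,
    }
}
```

At startup this could be called with `std::env::var("ADMIN_EMAIL").ok().as_deref()` as the second argument.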
### 6.3 Session Management
Sessions are stored in the `sessions` table. The session ID is a 32-byte random token (base64url-encoded, 43 characters). Session lookup is a single indexed lookup on the primary key.
**Session lifecycle:**
- Created on magic link verification
- Expires after 30 days (configurable)
- Refreshed (expiry extended) on each authenticated request
- Deleted on logout
- Periodic cleanup job (tokio interval) removes expired sessions
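
The lifecycle above amounts to a sliding expiry. The following sketch models it with an in-memory map in place of the `sessions` table; `SessionStore` and its method names are hypothetical, and timestamps are plain Unix seconds.

```rust
use std::collections::HashMap;

/// Default session lifetime (30 days), extended on every authenticated request.
const SESSION_TTL_SECS: u64 = 30 * 24 * 60 * 60;

struct Session {
    user_id: String,
    expires_at: u64, // Unix seconds
}

/// Hypothetical in-memory stand-in for the `sessions` table.
#[derive(Default)]
struct SessionStore {
    sessions: HashMap<String, Session>,
}

impl SessionStore {
    /// Called after a successful magic link verification.
    fn create(&mut self, session_id: &str, user_id: &str, now: u64) {
        self.sessions.insert(
            session_id.to_string(),
            Session { user_id: user_id.to_string(), expires_at: now + SESSION_TTL_SECS },
        );
    }

    /// Called on each authenticated request: rejects (and removes) expired
    /// sessions, otherwise extends the expiry and returns the user ID.
    fn touch(&mut self, session_id: &str, now: u64) -> Option<String> {
        let session = self.sessions.get_mut(session_id)?;
        if session.expires_at <= now {
            self.sessions.remove(session_id);
            return None;
        }
        session.expires_at = now + SESSION_TTL_SECS;
        Some(session.user_id.clone())
    }

    /// Called on logout.
    fn logout(&mut self, session_id: &str) {
        self.sessions.remove(session_id);
    }

    /// Body of the periodic cleanup job; returns how many sessions were removed.
    fn purge_expired(&mut self, now: u64) -> usize {
        let before = self.sessions.len();
        self.sessions.retain(|_, s| s.expires_at > now);
        before - self.sessions.len()
    }
}
```

Because `touch` writes on every request, a SQLite-backed version might only extend the expiry when it has moved by more than some threshold; that is a tuning choice the text above leaves open.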
### 6.4 Captcha Integration
**Recommendation: Cloudflare Turnstile.**
| Option | Self-hostable | Privacy | Free tier |
|---|---|---|---|
| hCaptcha | No (SaaS) | Better than reCAPTCHA | Yes (unlimited) |
| Cloudflare Turnstile | No (SaaS) | Excellent (often invisible) | Yes (unlimited) |
| mCaptcha | Yes (open source) | Full control | N/A (self-hosted) |
None of the mainstream captcha services are fully self-hostable. **Cloudflare Turnstile** is recommended for its invisible challenge mode (better UX) and generous free tier. If strict self-hosting is required, **mCaptcha** (Rust-based, open source) is the only viable option, though it requires running a separate service.
Backend verification is simple:
```rust
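use serde::Deserialize;

/// Subset of the Turnstile `siteverify` response that the backend needs.
#[derive(Deserialize)]
struct TurnstileResponse {
    success: bool,
}

// The `?` operators assume `AppError` implements `From<reqwest::Error>`.
pub async fn verify_captcha(client: &reqwest::Client, token: &str, secret: &str) -> Result<bool, AppError> {
    let resp = client
        .post("https://challenges.cloudflare.com/turnstile/v0/siteverify")
        .form(&[("secret", secret), ("response", token)])
        .send()
        .await?;
    let result: TurnstileResponse = resp.json().await?;
    Ok(result.success)
}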
```
---
## 7. Docker Deployment
### 7.1 Multi-Stage Dockerfile
```dockerfile
# ===== Stage 1: Build Rust backend =====
FROM rust:1.85-bookworm AS backend-builder
WORKDIR /app
COPY Cargo.toml Cargo.lock ./
COPY src/ src/
COPY migrations/ migrations/
# Create a dummy SQLite DB for sqlx compile-time checks
ENV DATABASE_URL="sqlite:///tmp/build.db"
RUN cargo install sqlx-cli --no-default-features --features sqlite \
    && sqlx database create \
    && sqlx migrate run
RUN cargo build --release
# ===== Stage 2: Build SolidJS frontend =====
FROM node:22-alpine AS frontend-builder
WORKDIR /app/frontend
COPY frontend/package.json frontend/package-lock.json ./
RUN npm ci
COPY frontend/ ./
RUN npm run build
# ===== Stage 3: Minimal runtime =====
FROM debian:bookworm-slim AS runtime
# curl is required by the healthcheck defined in docker-compose.yml
RUN apt-get update && apt-get install -y --no-install-recommends \
    ca-certificates \
    curl \
    libssl3 \
    && rm -rf /var/lib/apt/lists/*
RUN useradd -ms /bin/bash appuser
WORKDIR /app
# Copy backend binary
COPY --from=backend-builder /app/target/release/ai-synth-backend .
# Copy migrations for runtime migration
COPY --from=backend-builder /app/migrations/ migrations/
# Copy frontend static files
COPY --from=frontend-builder /app/frontend/dist/ static/
# Create data directory for SQLite
RUN mkdir -p /app/data && chown appuser:appuser /app/data
USER appuser
ENV DATABASE_URL="sqlite:///app/data/ai_synth.db"
ENV STATIC_DIR="/app/static"
ENV PORT=8080
EXPOSE 8080
# The server binary is expected to apply pending migrations itself before it starts listening
CMD ["./ai-synth-backend"]
```
The Rust backend serves the static SolidJS files directly (via `tower_http::services::ServeDir`), eliminating the need for a separate nginx container. All `/api/*` routes go to handlers; any other path is served from the built frontend, falling back to `index.html` when no file matches (SPA fallback).
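
The dispatch rule can be written down as a small pure function. This is only an illustration of the decision order, with a hypothetical `Target` enum and a set of known file paths standing in for the contents of `STATIC_DIR`; in the actual server the same effect would come from the router plus the static file service's not-found fallback.

```rust
use std::collections::HashSet;

/// Where a request path ends up.
#[derive(Debug, PartialEq)]
enum Target {
    ApiHandler,
    StaticFile(String),
    SpaIndex,
}

/// Decides how a request path is served:
/// 1. anything under `/api/` goes to the API router,
/// 2. a path matching a built frontend file is served as that file,
/// 3. everything else gets `index.html` so client-side routing can take over.
fn dispatch(path: &str, built_files: &HashSet<&str>) -> Target {
    if path == "/api" || path.starts_with("/api/") {
        Target::ApiHandler
    } else if built_files.contains(path) {
        Target::StaticFile(path.to_string())
    } else {
        Target::SpaIndex
    }
}
```

Checking for the `/api/` prefix with the trailing slash keeps a frontend route such as `/apiary` from being mistaken for an API call.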
### 7.2 docker-compose.yml
```yaml
version: "3.9"
services:
  app:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: ai-synth
    restart: unless-stopped
    ports:
      - "${PORT:-8080}:8080"
    volumes:
      - ai_synth_data:/app/data  # SQLite persistence
    environment:
      - DATABASE_URL=sqlite:///app/data/ai_synth.db
      - PORT=8080
      - ADMIN_EMAIL=${ADMIN_EMAIL}
      - SESSION_SECRET=${SESSION_SECRET}  # 64-byte hex for cookie signing
      - SMTP_HOST=${SMTP_HOST}
      - SMTP_PORT=${SMTP_PORT:-587}
      - SMTP_USER=${SMTP_USER}
      - SMTP_PASSWORD=${SMTP_PASSWORD}
      - SMTP_FROM=${SMTP_FROM}
      - CAPTCHA_SECRET=${CAPTCHA_SECRET}
      - CAPTCHA_SITE_KEY=${CAPTCHA_SITE_KEY}
      - ENCRYPTION_KEY=${ENCRYPTION_KEY}  # 32-byte hex for API key encryption
      - RUST_LOG=info
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/api/v1/health"]
      interval: 30s
      timeout: 5s
      retries: 3

  # Optional: Mailpit for local development (SMTP catch-all)
  mailpit:
    image: axllent/mailpit
    container_name: ai-synth-mail
    restart: unless-stopped
    ports:
      - "8025:8025"  # Web UI
      - "1025:1025"  # SMTP
    profiles:
      - dev

volumes:
  ai_synth_data:
    driver: local
```
### 7.3 Volume Mounts for SQLite
The SQLite database file is stored in a Docker named volume (`ai_synth_data`). This ensures:
- Data persists across container restarts and rebuilds
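- The volume can be backed up via `docker cp` or volume backup tools, either with the app stopped or through SQLite's online backup, because in WAL mode recent writes may still sit in the `-wal` file next to the database
- WAL mode is used for concurrent read/write performance; its `-wal` and `-shm` side files live in the same directory, which the directory-level mount on `/app/data` persists as well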
**Important SQLite configuration for production:**
```rust
use sqlx::sqlite::SqlitePoolOptions;
use sqlx::Executor; // brings `execute` into scope for the connection

let pool = SqlitePoolOptions::new()
    .max_connections(5) // SQLite handles limited concurrency
    .after_connect(|conn, _| {
        Box::pin(async move {
            // Applied to every new connection in the pool
            conn.execute("PRAGMA journal_mode=WAL").await?;
            conn.execute("PRAGMA synchronous=NORMAL").await?;
            conn.execute("PRAGMA foreign_keys=ON").await?;
            conn.execute("PRAGMA busy_timeout=5000").await?;
            Ok(())
        })
    })
    .connect(&database_url)
    .await?;
```
### 7.4 Environment Variable Configuration
A `.env.example` file documents all required and optional variables:
```env
# === Required ===
DATABASE_URL=sqlite:///app/data/ai_synth.db
SESSION_SECRET=<64-byte-hex-string>
ENCRYPTION_KEY=<32-byte-hex-string>
ADMIN_EMAIL=admin@example.com
# === SMTP (required for magic link auth) ===
SMTP_HOST=smtp.example.com
SMTP_PORT=587
SMTP_USER=user@example.com
SMTP_PASSWORD=password
SMTP_FROM=noreply@example.com
# === Captcha ===
CAPTCHA_SECRET=<turnstile-secret-key>
CAPTCHA_SITE_KEY=<turnstile-site-key>
# === Optional ===
PORT=8080
RUST_LOG=info
BASE_URL=https://your-domain.com # For magic link URLs
```
---
## 8. Migration from Firebase
### 8.1 Data Migration Strategy
A standalone Rust CLI tool (or a script using `firebase-admin` SDK in Python/Node) handles the migration:
**Step 1: Export Firestore data**
Use `firebase-admin` SDK (Python or Node.js is simplest for this one-shot task):
```python
# migrate_export.py
import firebase_admin
from firebase_admin import credentials, firestore
import json

cred = credentials.Certificate("service-account.json")
firebase_admin.initialize_app(cred)
db = firestore.client()

# Export users (from Firebase Auth)
# Export syntheses, sources, settings collections
data = {
    "syntheses": [],
    "sources": [],
    "settings": [],
}

for doc in db.collection("syntheses").stream():
    d = doc.to_dict()
    d["_id"] = doc.id
    data["syntheses"].append(d)
# ... same for sources, settings

with open("firebase_export.json", "w") as f:
    json.dump(data, f, default=str)
```
**Step 2: Transform and import into SQLite**
A Rust CLI tool reads the JSON export and inserts into SQLite:
```
cargo run --bin migrate -- --input firebase_export.json --db ai_synth.db
```
Key transformations:
- `authorUid` / `userId` from Firebase Auth UID -> new UUID in `users` table (mapping table maintained during migration)
- Firebase `Timestamp` -> ISO 8601 string
- Legacy `SynthesisData` fields (`majorAnnouncements`, `financialSector`, etc.) -> normalized `sections[]` JSON
- Settings doc ID (was `{userId}` in Firestore) -> `user_id` foreign key
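
The UID mapping in the first bullet can be sketched as follows. The types `UidMap`, `ExportedSynthesis`, and `SynthesisRow` are hypothetical, and the ID generator is injected as a closure so the example stays dependency-free; the migration tool would presumably pass something like a UUID v4 generator instead.

```rust
use std::collections::HashMap;

/// Mapping table kept for the duration of the migration:
/// Firebase Auth UID -> ID of the row in the new `users` table.
#[derive(Default)]
struct UidMap {
    firebase_to_new: HashMap<String, String>,
}

impl UidMap {
    /// Returns the new ID for a Firebase UID, generating one on first sight.
    fn resolve(&mut self, firebase_uid: &str, new_id: impl FnOnce() -> String) -> String {
        self.firebase_to_new
            .entry(firebase_uid.to_string())
            .or_insert_with(new_id)
            .clone()
    }
}

/// Minimal view of an exported synthesis document (illustrative fields only).
struct ExportedSynthesis {
    id: String,
    author_uid: String,
}

/// Minimal view of the row written to SQLite (illustrative fields only).
struct SynthesisRow {
    id: String,
    user_id: String,
}

/// Rewrites `authorUid` references to new user IDs, reusing the same new ID
/// for every document that belongs to the same Firebase user.
fn remap_authors(
    docs: &[ExportedSynthesis],
    map: &mut UidMap,
    mut new_id: impl FnMut() -> String,
) -> Vec<SynthesisRow> {
    docs.iter()
        .map(|doc| SynthesisRow {
            id: doc.id.clone(),
            user_id: map.resolve(&doc.author_uid, &mut new_id),
        })
        .collect()
}
```

Running the same `UidMap` over sources and settings afterwards keeps all of a user's records attached to one new `user_id`.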
**Step 3: User notification**
Since authentication changes from Google SSO to email+magic link, existing users need to be notified that they must use the magic link flow. Their email addresses (from Firebase Auth) are imported into the `users` table. On first magic link login, the user's existing data is accessible via their email.
### 8.2 Mapping Firestore Security Rules to Rust
The Firestore rules enforce three categories of protection that map to backend patterns:
| Firestore Rule | Rust Equivalent |
|---|---|
| `isAuthenticated()` | Auth middleware layer (responds with 401 if there is no valid session) |
| `isDocOwner()` / `request.auth.uid == resource.data.authorUid` | Query-level filtering: `WHERE user_id = $1` with the authenticated user's ID |
| `isValidSynthesis()` / `isValidSettings()` / `isValidSource()` | Request validation using `validator` crate or manual checks in handlers |
| `uidUnchanged()` / `uidNotModified()` | Not applicable -- `user_id` is never in the request body; it is injected server-side from the session |
| `request.resource.data.createdAt == resource.data.createdAt` | `created_at` is set server-side and never updatable via API |
| Field type checks (string, number, timestamp) | Serde deserialization + custom validators |
| Size limits (e.g., `title.size() < 500`) | Validator annotations: `#[validate(length(max = 499))]` |
**Example validation in Rust:**
```rust
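use serde::Deserialize;
use validator::Validate;

#[derive(Deserialize, Validate)]
pub struct CreateSourceRequest {
    /// Display title, 1 to 200 characters.
    #[validate(length(min = 1, max = 200))]
    pub title: String,
    /// Must be a valid URL of at most 1000 characters.
    #[validate(url, length(max = 1000))]
    pub url: String,
}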
```
The key architectural difference: in Firestore, rules are the *only* security layer (the client has direct DB access). In the Rust backend, security is enforced at the handler level (authentication middleware + query scoping + input validation). The database is never directly accessible from the client.
**Ownership enforcement pattern:**
Every query that reads or mutates user data includes `WHERE user_id = ?` with the authenticated user's ID. This is not a "rule" evaluated per request but a convention built into the data-access layer: as long as every query function takes the user ID from the session rather than from the request, no code path can reach another user's data.
```rust
// db/syntheses.rs
pub async fn get_by_id(pool: &SqlitePool, user_id: &str, synthesis_id: &str) -> Result<Option<Synthesis>, sqlx::Error> {
    sqlx::query_as!(
        Synthesis,
        "SELECT * FROM syntheses WHERE id = ? AND user_id = ?",
        synthesis_id,
        user_id
    )
    .fetch_optional(pool)
    .await
}
```
If the synthesis belongs to another user, this returns `None`, and the handler returns 404. There is no way for a user to query, update, or delete another user's data.
---
## Summary of Key Crate Dependencies
| Purpose | Crate | Version Guidance |
|---|---|---|
| Web framework | `axum` | ^0.8 |
| Async runtime | `tokio` | ^1 (full features) |
| Database | `sqlx` | ^0.8 (features: sqlite, runtime-tokio) |
| HTTP client | `reqwest` | ^0.12 (features: json, cookies) |
| HTML parsing | `scraper` | ^0.22 |
| Serialization | `serde`, `serde_json` | ^1 |
| Date/time | `chrono` | ^0.4 |
| Token hashing (SHA-256) | `sha2` | ^0.10 |
| Random tokens | `rand` | ^0.8 |
| SMTP | `lettre` | ^0.11 |
| Logging | `tracing`, `tracing-subscriber` | ^0.1 / ^0.3 |
| Config | `dotenvy` | ^0.15 |
| Validation | `validator` | ^0.19 |
| Concurrent map | `dashmap` | ^6 |
| Static file serving | `tower-http` | ^0.6 (features: fs, cors, trace) |
| Cookie handling | `axum-extra` | ^0.10 (features: cookie) |
| Encryption (API keys) | `aes-gcm` | ^0.10 |
| Base64 | `base64` | ^0.22 |
| UUID | `uuid` | ^1 (features: v4) |
| Error handling | `anyhow`, `thiserror` | ^1 |
---
## Architecture Diagram (Text)
```
                          ┌─ Docker Container ───────────┐
                          │                              │
Browser ◄──── HTTPS ────► │ ┌─ Axum Web Server ────────┐ │
(SolidJS SPA)             │ │                          │ │
                          │ │ /static/* ──► ServeDir   │ │
                          │ │ /api/v1/* ──► Router     │ │
                          │ │                          │ │
                          │ │ ┌─ Auth Middleware ────┐ │ │
                          │ │ │ Session Cookie       │ │ │
                          │ │ │ CSRF Check           │ │ │
                          │ │ └──────────────────────┘ │ │
                          │ │                          │ │
                          │ │ ┌─ Handlers ───────────┐ │ │
                          │ │ │ auth, syntheses,     │ │ │
                          │ │ │ sources, settings,   │ │ │
                          │ │ │ admin, email         │ │ │
                          │ │ └──────────┬───────────┘ │ │
                          │ │            │             │ │
                          │ │ ┌─ Services ───────────┐ │ │
                          │ │ │ LLM providers        │─┼─┼──► Gemini API
                          │ │ │ (trait-based)        │─┼─┼──► OpenAI API
                          │ │ │                      │─┼─┼──► Anthropic API
                          │ │ │ Scraper (reqwest)    │─┼─┼──► Target URLs
                          │ │ │ Email (lettre)       │─┼─┼──► SMTP Server
                          │ │ │ Captcha              │─┼─┼──► Turnstile API
                          │ │ └──────────┬───────────┘ │ │
                          │ │            │             │ │
                          │ │ ┌─ DB Layer (sqlx) ────┐ │ │
                          │ │ │ SQLite (WAL)         │ │ │
                          │ │ └──────────────────────┘ │ │
                          │ └────────────┬─────────────┘ │
                          │              │               │
                          │ ┌────────────▼─────────────┐ │
                          │ │ /app/data/ai_synth.db    │ │
                          │ │ (Docker volume)          │ │
                          │ └──────────────────────────┘ │
                          └──────────────────────────────┘
```