
Technical Architecture Analysis: AI Weekly Synth Refactoring

Open Questions and Clarifications Needed

Before implementation, the following points require decisions from stakeholders:

Admin scope: Is the "admin" a single super-user defined by config, or a full role-based system with multiple admins? This analysis assumes a simple role flag on users plus a single bootstrap admin defined via environment variable.
Google OAuth retention: The requirements specify email+captcha and magic link auth. Should Google SSO be dropped entirely, or kept as an additional option? This analysis assumes Google SSO is dropped to remove all Google dependencies.
Email sending for syntheses: The current app sends syntheses via Gmail API with OAuth popup. With Google dependencies removed, should SMTP-based email sending replace this? This analysis assumes yes, using the same SMTP configuration as magic link delivery.
Data migration volume: How many existing users and syntheses need migrating? This impacts whether a one-shot script suffices or whether incremental migration tooling is needed.
Concurrent users target: Rate limiter design and session store choice depend on expected load. This analysis assumes a small-to-medium deployment (1-100 concurrent users).
Legacy data: The current SynthesisData has legacy fields (majorAnnouncements, financialSector, etc.). The requirements say "remove legacy data/formats/code." This analysis assumes legacy fields are dropped during migration; only the sections[] format is carried forward.

1. Rust Backend Architecture

1.1 Framework Choice: Axum

Recommendation: Axum over Actix-web.

Justification:

Criterion	Axum	Actix-web
Ecosystem alignment	Built on `tokio` + `tower` + `hyper` -- the de-facto Rust async stack	Has its own runtime layer (though uses tokio underneath)
Middleware model	Tower `Layer`/`Service` -- composable, reusable, testable	Actor-based middleware -- powerful but idiosyncratic
Extractors	Type-safe, ergonomic, uses `FromRequest` traits	Similar, but with `web::Data`, `web::Json` wrappers
Community trajectory	Growing faster, backed by the tokio team	Mature, stable, but slower growth
Learning curve	Lower for developers already using tokio ecosystem	Slightly higher due to actor concepts
Compile-time type safety	Strong -- handler function signatures are validated at compile time	Strong, but less ergonomic error messages

Axum's tower-based middleware model is a decisive advantage for this project: the auth middleware, rate limiter, and CORS layer compose naturally as tower Layers. Axum also has first-class support for shared state via the State extractor, which maps well to a shared database pool and configuration.

1.2 Project Structure

ai-synth-backend/
├── Cargo.toml
├── Cargo.lock
├── .env.example
├── migrations/                    # sqlx migrations
│   ├── 001_create_users.sql
│   ├── 002_create_sessions.sql
│   ├── 003_create_settings.sql
│   ├── 004_create_sources.sql
│   ├── 005_create_syntheses.sql
│   ├── 006_create_admin_config.sql
│   └── 007_create_rate_limits.sql
├── src/
│   ├── main.rs                    # Entry point: init tracing, DB, run server
│   ├── config.rs                  # Env-based configuration (envy / dotenvy)
│   ├── app_state.rs               # AppState struct (pool, config, http client)
│   ├── error.rs                   # AppError enum, IntoResponse impl
│   ├── router.rs                  # All route definitions, middleware wiring
│   ├── middleware/
│   │   ├── mod.rs
│   │   ├── auth.rs                # Session cookie extraction, user injection
│   │   ├── csrf.rs                # Custom-header CSRF check (X-Requested-With)
│   │   └── rate_limit.rs          # Per-provider, configurable rate limiter
│   ├── models/
│   │   ├── mod.rs
│   │   ├── user.rs                # User, NewUser, UserRole
│   │   ├── session.rs             # Session
│   │   ├── settings.rs            # UserSettings
│   │   ├── source.rs              # Source
│   │   ├── synthesis.rs           # Synthesis, NewsSection, NewsItem
│   │   └── admin.rs               # LlmProviderConfig, RateLimitConfig
│   ├── handlers/
│   │   ├── mod.rs
│   │   ├── auth.rs                # register, login (magic link), verify, logout
│   │   ├── syntheses.rs           # list, get, create (trigger generation), delete
│   │   ├── sources.rs             # CRUD, bulk import, CSV export
│   │   ├── settings.rs            # get, update, export, import
│   │   ├── admin.rs               # LLM config CRUD, rate limit config, user list
│   │   └── email.rs               # Send synthesis by email
│   ├── services/
│   │   ├── mod.rs
│   │   ├── llm/
│   │   │   ├── mod.rs             # LlmProvider trait, factory function
│   │   │   ├── gemini.rs          # Google Gemini implementation
│   │   │   ├── openai.rs          # OpenAI implementation
│   │   │   ├── anthropic.rs       # Anthropic implementation
│   │   │   └── types.rs           # Shared request/response types
│   │   ├── synthesis.rs           # 2-pass generation pipeline orchestration
│   │   ├── scraper.rs             # URL validation, HTML scraping, date extraction
│   │   ├── email.rs               # SMTP email sending (magic links + syntheses)
│   │   └── captcha.rs             # Captcha verification
│   └── db/
│       ├── mod.rs
│       ├── users.rs               # User queries
│       ├── sessions.rs            # Session queries
│       ├── settings.rs            # Settings queries
│       ├── sources.rs             # Source queries
│       ├── syntheses.rs           # Synthesis queries
│       └── admin.rs               # Admin config queries
└── tests/
    ├── api/                       # Integration tests
    └── services/                  # Unit tests for services

1.3 Layered Architecture

The application follows a clean 3-layer architecture:

Handlers (HTTP layer): Extract request data, call services, return responses. No business logic.
Services (Business layer): Orchestrate operations, enforce business rules, call DB and external APIs.
DB (Persistence layer): Raw sqlx queries, mapping to/from model structs.

1.4 Error Handling

A unified AppError enum implements IntoResponse:

#[derive(Debug)]
pub enum AppError {
    // Client errors
    BadRequest(String),
    Unauthorized(String),
    Forbidden(String),
    NotFound(String),
    Conflict(String),
    TooManyRequests { retry_after_secs: u64 },
    ValidationError(Vec<FieldError>),

    // Server errors
    Internal(anyhow::Error),
    LlmError(String),
    SmtpError(String),
    ScrapingError(String),
}

impl IntoResponse for AppError {
    fn into_response(self) -> axum::response::Response {
        // Set only for TooManyRequests; emitted as a Retry-After header below.
        let mut retry_after = None;
        let (status, message) = match &self {
            AppError::BadRequest(msg) => (StatusCode::BAD_REQUEST, msg.clone()),
            AppError::Unauthorized(_) => (StatusCode::UNAUTHORIZED, "Unauthorized".into()),
            AppError::Forbidden(_) => (StatusCode::FORBIDDEN, "Forbidden".into()),
            AppError::NotFound(msg) => (StatusCode::NOT_FOUND, msg.clone()),
            AppError::TooManyRequests { retry_after_secs } => {
                retry_after = Some(*retry_after_secs);
                (StatusCode::TOO_MANY_REQUESTS, format!("Retry after {retry_after_secs}s"))
            }
            AppError::Internal(e) => {
                // Log the full error chain; never expose it to the client.
                tracing::error!("Internal error: {e:#}");
                (StatusCode::INTERNAL_SERVER_ERROR, "Internal server error".into())
            }
            // ...
        };
        let mut response = (status, Json(json!({ "error": message }))).into_response();
        if let Some(secs) = retry_after {
            // HeaderValue implements From<u64>.
            response.headers_mut().insert(
                axum::http::header::RETRY_AFTER,
                axum::http::HeaderValue::from(secs),
            );
        }
        response
    }
}

All handlers return Result<impl IntoResponse, AppError>. The ? operator propagates errors naturally. From implementations convert sqlx::Error, reqwest::Error, etc. into AppError.
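
As a rough, dependency-free illustration of the From + ? pattern described above: in the real backend the conversions would come from sqlx::Error and reqwest::Error, but here std::num::ParseIntError stands in for a library error. SketchError, status_code, and parse_page are hypothetical names used only for this sketch.

```rust
use std::num::ParseIntError;

/// Hypothetical, trimmed-down stand-in for AppError.
#[derive(Debug, PartialEq)]
pub enum SketchError {
    BadRequest(String),
    NotFound(String),
    TooManyRequests { retry_after_secs: u64 },
}

impl SketchError {
    /// Maps each variant to the HTTP status code it would produce.
    pub fn status_code(&self) -> u16 {
        match self {
            SketchError::BadRequest(_) => 400,
            SketchError::NotFound(_) => 404,
            SketchError::TooManyRequests { .. } => 429,
        }
    }
}

/// This From impl is what lets `?` convert the library error automatically.
impl From<ParseIntError> for SketchError {
    fn from(e: ParseIntError) -> Self {
        SketchError::BadRequest(format!("invalid number: {e}"))
    }
}

/// Parses a 1-based `page` parameter; `?` turns ParseIntError into SketchError.
pub fn parse_page(raw: &str) -> Result<u32, SketchError> {
    let page: u32 = raw.trim().parse()?;
    if page == 0 {
        return Err(SketchError::BadRequest("page must be >= 1".into()));
    }
    Ok(page)
}
```

With one From impl per upstream error type, handlers stay free of explicit map_err calls, and the status mapping lives in a single place.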

1.5 SQLite with sqlx: Schema Design

All tables use TEXT primary keys (UUIDs generated by the backend) for portability. Timestamps are stored as TEXT in ISO 8601 format (SQLite has no native timestamp; this also works on Postgres via TIMESTAMPTZ cast).
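
One consequence of TEXT timestamps is worth spelling out: range checks such as "is this session expired?" only behave chronologically if every value uses the same fixed-width UTC form, because SQLite compares TEXT byte by byte by default. The sketch below illustrates that property in plain Rust; is_expired and is_canonical_utc are hypothetical helpers, and the exact format ("YYYY-MM-DDTHH:MM:SSZ", no fractional seconds, no offsets) is an assumption of this example.

```rust
/// Returns true when a session has expired, comparing fixed-width UTC
/// ISO 8601 strings lexicographically (byte-wise), the same way SQLite
/// compares TEXT columns under its default collation.
pub fn is_expired(expires_at: &str, now: &str) -> bool {
    expires_at <= now
}

/// Checks the fixed-width shape the comparison relies on.
/// This validates the format only, not calendar validity.
pub fn is_canonical_utc(ts: &str) -> bool {
    let bytes = ts.as_bytes();
    bytes.len() == 20
        && bytes.iter().enumerate().all(|(i, c)| match i {
            4 | 7 => *c == b'-',
            10 => *c == b'T',
            13 | 16 => *c == b':',
            19 => *c == b'Z',
            _ => c.is_ascii_digit(),
        })
}
```

Values with a numeric offset (for example "+01:00") or non-padded fields would break the ordering, so normalizing to UTC before writing is the safer choice under this scheme.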

Migration 001: Users

CREATE TABLE users (
    id          TEXT PRIMARY KEY,           -- UUID
    email       TEXT NOT NULL UNIQUE,
    display_name TEXT,
    role        TEXT NOT NULL DEFAULT 'user', -- 'user' | 'admin'
    created_at  TEXT NOT NULL,              -- ISO 8601
    updated_at  TEXT NOT NULL
);
-- No separate index on email: the UNIQUE constraint already creates one.

Migration 002: Sessions

CREATE TABLE sessions (
    id          TEXT PRIMARY KEY,           -- Secure random token (32 bytes, base64url)
    user_id     TEXT NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    created_at  TEXT NOT NULL,
    expires_at  TEXT NOT NULL,
    ip_address  TEXT,
    user_agent  TEXT
);
CREATE INDEX idx_sessions_user_id ON sessions(user_id);
CREATE INDEX idx_sessions_expires_at ON sessions(expires_at);

Migration 003: Settings

CREATE TABLE settings (
    user_id              TEXT PRIMARY KEY REFERENCES users(id) ON DELETE CASCADE,
    theme                TEXT NOT NULL DEFAULT 'Intelligence Artificielle',
    max_age_days         INTEGER NOT NULL DEFAULT 7,
    categories           TEXT NOT NULL,    -- JSON array stored as TEXT
    max_items_per_category INTEGER NOT NULL DEFAULT 4,
    search_agent_behavior TEXT NOT NULL DEFAULT '',
    ai_model             TEXT NOT NULL DEFAULT 'gemini-3.1-pro-preview',
    updated_at           TEXT NOT NULL
);

Migration 004: Sources

CREATE TABLE sources (
    id          TEXT PRIMARY KEY,
    user_id     TEXT NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    title       TEXT NOT NULL,
    url         TEXT NOT NULL,
    created_at  TEXT NOT NULL
);
CREATE INDEX idx_sources_user_id ON sources(user_id);

Migration 005: Syntheses

CREATE TABLE syntheses (
    id          TEXT PRIMARY KEY,
    user_id     TEXT NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    week        TEXT NOT NULL,             -- e.g. "2026-W12"
    sections    TEXT NOT NULL,             -- JSON: [{ title, items: [{ title, url, summary }] }]
    created_at  TEXT NOT NULL
);
CREATE INDEX idx_syntheses_user_id ON syntheses(user_id);
CREATE INDEX idx_syntheses_created_at ON syntheses(created_at);

Migration 006: Admin Config (LLM Providers)

CREATE TABLE llm_providers (
    id           TEXT PRIMARY KEY,
    provider     TEXT NOT NULL,            -- 'gemini' | 'openai' | 'anthropic'
    display_name TEXT NOT NULL,
    api_key      TEXT NOT NULL,            -- Encrypted at rest (AES-256-GCM)
    base_url     TEXT,                     -- Optional override for self-hosted/proxy
    models       TEXT NOT NULL,            -- JSON array of available model identifiers
    is_enabled   BOOLEAN NOT NULL DEFAULT TRUE,  -- TRUE/FALSE literals: SQLite 3.23+, also valid on Postgres
    created_at   TEXT NOT NULL,
    updated_at   TEXT NOT NULL,
    UNIQUE(provider)
);

Migration 007: Rate Limit Configuration

CREATE TABLE rate_limits (
    id              TEXT PRIMARY KEY,
    provider_id     TEXT NOT NULL REFERENCES llm_providers(id) ON DELETE CASCADE,
    max_requests    INTEGER NOT NULL DEFAULT 29,
    time_window_ms  INTEGER NOT NULL DEFAULT 60000,
    updated_at      TEXT NOT NULL,
    UNIQUE(provider_id)
);

-- Magic link rate limiting
CREATE TABLE magic_link_tokens (
    id          TEXT PRIMARY KEY,
    email       TEXT NOT NULL,
    token_hash  TEXT NOT NULL,             -- SHA-256 of the token
    created_at  TEXT NOT NULL,
    expires_at  TEXT NOT NULL,
    used        BOOLEAN NOT NULL DEFAULT FALSE   -- TRUE/FALSE literals: SQLite 3.23+, also valid on Postgres
);
CREATE INDEX idx_magic_link_email ON magic_link_tokens(email);

1.6 SQLite/Postgres Dual Compatibility Strategy

Option considered: runtime database selection via sqlx::AnyPool. AnyPool has limitations, most notably no compile-time query checking, so it is not the recommended route.

Recommended strategy: feature-flag based conditional compilation.

# Cargo.toml
[features]
default = ["sqlite"]
sqlite = ["sqlx/sqlite"]
postgres = ["sqlx/postgres"]

For this project, the SQL differences between SQLite and Postgres are minimal:

Concern	SQLite	Postgres	Resolution
Auto-increment PK	`INTEGER PRIMARY KEY`	`SERIAL`	Use UUID TEXT PKs -- identical on both
Timestamps	`TEXT` (ISO 8601)	`TIMESTAMPTZ`	Store as TEXT on both; parse in application layer
JSON columns	`TEXT` + app-side JSON parse	`JSONB`	Store as TEXT on both; Postgres can migrate to JSONB later
Boolean	`BOOLEAN` (stored as 0/1)	`BOOLEAN`	Declare `BOOLEAN` on both; SQLite stores it as an integer; sqlx handles mapping
RETURNING clause	Supported since SQLite 3.35	Supported	Use `RETURNING` on both

Practical approach for v1: Target SQLite only. Write SQL that is Postgres-compatible by design (UUID text PKs, ISO timestamps, no SQLite-specific functions). When the Postgres upgrade happens, create a parallel migrations_pg/ folder and swap the connection pool. The query layer (db/) remains identical because all queries use standard SQL.

Compile-time checking is preserved by using sqlx::query! and sqlx::query_as! macros with the DATABASE_URL environment variable pointing to an SQLite file during development.

2. API Design

2.1 REST API Endpoints

All endpoints prefixed with /api/v1. Request and response bodies are JSON unless stated otherwise.

Authentication

Method	Path	Auth	Description
`POST`	`/auth/register`	No	Create account (email + captcha)
`POST`	`/auth/login`	No	Request magic link (email + captcha)
`GET`	`/auth/verify?token=...`	No	Verify magic link token, create session
`POST`	`/auth/logout`	Yes	Invalidate session
`GET`	`/auth/me`	Yes	Get current user info

Syntheses

Method	Path	Auth	Description
`GET`	`/syntheses`	Yes	List user's syntheses (paginated)
`GET`	`/syntheses/:id`	Yes	Get synthesis detail
`POST`	`/syntheses/generate`	Yes	Trigger generation (async, returns job ID)
`GET`	`/syntheses/generate/:job_id/status`	Yes	Poll generation status
`DELETE`	`/syntheses/:id`	Yes	Delete a synthesis
`POST`	`/syntheses/:id/email`	Yes	Send synthesis by email

Sources

Method	Path	Auth	Description
`GET`	`/sources`	Yes	List user's sources
`POST`	`/sources`	Yes	Add a source
`DELETE`	`/sources/:id`	Yes	Delete a source
`POST`	`/sources/bulk`	Yes	Bulk import (JSON array)
`POST`	`/sources/import-csv`	Yes	Import from CSV (multipart upload)
`GET`	`/sources/export-csv`	Yes	Export as CSV download

Settings

Method	Path	Auth	Description
`GET`	`/settings`	Yes	Get user's settings
`PUT`	`/settings`	Yes	Update settings
`GET`	`/settings/export`	Yes	Export as JSON download
`POST`	`/settings/import`	Yes	Import from JSON

Admin

Method	Path	Auth	Description
`GET`	`/admin/providers`	Admin	List LLM provider configs
`POST`	`/admin/providers`	Admin	Add/update provider config
`DELETE`	`/admin/providers/:id`	Admin	Remove provider
`GET`	`/admin/rate-limits`	Admin	Get rate limit configs
`PUT`	`/admin/rate-limits/:provider_id`	Admin	Update rate limit config
`GET`	`/admin/users`	Admin	List all users
`PUT`	`/admin/users/:id/role`	Admin	Change user role

Frontend configuration (any authenticated user)

Method	Path	Auth	Description
`GET`	`/config/providers`	Yes	List enabled providers + their model names (no API keys)

2.2 Request/Response Shapes

POST /auth/register

// Request
{
  "email": "user@example.com",
  "display_name": "Jane Doe",
  "captcha_token": "hcaptcha-response-token"
}
// Response 200
{
  "message": "A verification link has been sent to your email."
}

POST /syntheses/generate

// Request (empty body -- uses user's saved settings and sources)
{}
// Response 202
{
  "job_id": "uuid-of-generation-job",
  "status": "pending"
}

GET /syntheses/:id

// Response 200
{
  "id": "uuid",
  "week": "2026-W12",
  "created_at": "2026-03-21T10:30:00Z",
  "sections": [
    {
      "title": "Annonces majeures",
      "items": [
        {
          "title": "Article title",
          "url": "https://example.com/article",
          "summary": "4-5 line summary..."
        }
      ]
    }
  ]
}

PUT /settings

// Request
{
  "theme": "Intelligence Artificielle",
  "max_age_days": 7,
  "categories": ["Annonces majeures", "Secteur financier"],
  "max_items_per_category": 4,
  "search_agent_behavior": "Custom instructions...",
  "ai_model": "gemini-3.1-pro-preview"
}
// Response 200
{
  "message": "Settings updated successfully."
}

POST /admin/providers

// Request
{
  "provider": "openai",
  "display_name": "OpenAI GPT-4o",
  "api_key": "sk-...",
  "base_url": null,
  "models": ["gpt-4o", "gpt-4o-mini"],
  "is_enabled": true
}

2.3 Authentication Middleware

The auth middleware is a tower Layer that:

Extracts the session cookie (ai_synth_session) from the request.
Looks up the session ID in the sessions table.
Checks expires_at has not passed.
Loads the User from the users table.
Injects the User into request extensions (request.extensions_mut().insert(user)).
Handlers extract the user via Extension<User> or a custom AuthUser extractor.

For admin routes, an additional RequireAdmin layer checks user.role == "admin".
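
The steps above can be sketched without any framework, with in-memory maps standing in for the sessions and users tables. This is only an illustration of the control flow: authenticate, require_admin, and the use of Unix seconds for expires_at are assumptions of the sketch, not the project's actual types.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub struct User {
    pub id: String,
    pub role: String, // "user" | "admin"
}

pub struct Session {
    pub user_id: String,
    pub expires_at: u64, // Unix seconds (assumption for this sketch)
}

#[derive(Debug, PartialEq)]
pub enum AuthError {
    Unauthorized, // would map to HTTP 401
    Forbidden,    // would map to HTTP 403
}

/// Cookie -> session lookup -> expiry check -> user load.
pub fn authenticate(
    cookie: Option<&str>,
    sessions: &HashMap<String, Session>,
    users: &HashMap<String, User>,
    now: u64,
) -> Result<User, AuthError> {
    let session_id = cookie.ok_or(AuthError::Unauthorized)?;
    let session = sessions.get(session_id).ok_or(AuthError::Unauthorized)?;
    if session.expires_at <= now {
        return Err(AuthError::Unauthorized);
    }
    users
        .get(&session.user_id)
        .cloned()
        .ok_or(AuthError::Unauthorized)
}

/// The extra check applied on admin routes.
pub fn require_admin(user: &User) -> Result<(), AuthError> {
    if user.role == "admin" {
        Ok(())
    } else {
        Err(AuthError::Forbidden)
    }
}
```

Every failure before the user is loaded returns the same Unauthorized value, so a client cannot distinguish a missing cookie from an unknown or expired session.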

Session cookies configuration:

Cookie::build(("ai_synth_session", session_id))
    .http_only(true)
    .secure(true)           // HTTPS only
    .same_site(SameSite::Lax)
    .path("/")
    .max_age(Duration::days(30))

CSRF Protection:

Since this is an API consumed by a SPA on the same origin (or proxied), the combination of SameSite=Lax cookies and requiring a custom header (X-Requested-With: XMLHttpRequest) on mutating requests provides sufficient CSRF protection. This is the "custom header" pattern -- browsers will not send custom headers on cross-origin requests without CORS preflight approval.

For the SPA, every fetch call to the API includes:

headers: { "X-Requested-With": "XMLHttpRequest" }

The CSRF middleware rejects POST/PUT/DELETE requests missing this header.
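
A minimal sketch of that rule as a pure function, independent of Axum. csrf_check is a hypothetical name, and the header list is modeled as plain (name, value) pairs rather than a real HeaderMap.

```rust
/// Returns true when the request may proceed.
/// Non-mutating methods always pass; POST/PUT/DELETE must carry
/// `X-Requested-With: XMLHttpRequest`. Header names are matched
/// case-insensitively, as HTTP requires.
pub fn csrf_check(method: &str, headers: &[(&str, &str)]) -> bool {
    let mutating = matches!(
        method.to_ascii_uppercase().as_str(),
        "POST" | "PUT" | "DELETE"
    );
    if !mutating {
        return true;
    }
    headers.iter().any(|(name, value)| {
        name.eq_ignore_ascii_case("x-requested-with") && *value == "XMLHttpRequest"
    })
}
```

If PATCH endpoints are added later, they would need to join the list of mutating methods.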

3. LLM Provider Abstraction

3.1 Trait Design

#[async_trait]
pub trait LlmProvider: Send + Sync {
    /// Returns the provider identifier (e.g., "gemini", "openai", "anthropic").
    fn provider_id(&self) -> &str;

    /// Pass 1: Search the web and generate structured news items.
    /// Returns raw JSON matching the category schema.
    async fn generate_search_pass(
        &self,
        model: &str,
        system_prompt: &str,
        user_prompt: &str,
        response_schema: &serde_json::Value,
    ) -> Result<serde_json::Value, AppError>;

    /// Pass 2: Rewrite titles and summaries based on scraped content.
    /// No web search tool needed.
    async fn generate_rewrite_pass(
        &self,
        model: &str,
        system_prompt: &str,
        user_prompt: &str,
        response_schema: &serde_json::Value,
    ) -> Result<serde_json::Value, AppError>;

    /// Lists available models for this provider.
    fn available_models(&self) -> &[String];
}

3.2 Provider-Specific Web Search Handling

Each provider handles web grounding differently. The trait design abstracts this:

Provider	Pass 1 (Search)	Pass 2 (Rewrite)
Gemini	Uses `googleSearch` tool in config. Structured output via `responseSchema`.	Standard generation, no tools. `responseSchema` for structured output.
OpenAI	Uses the `web_search` tool (Responses API), optionally as a two-step approach: a first call with web search enabled, then a second call for structured output.	Standard chat completion with `response_format: { type: "json_schema", ... }`.
Anthropic	Uses `web_search` tool (available on Claude models). Structured output via tool-use pattern or explicit JSON instructions.	Standard message with JSON output instructions. Anthropic does not have native JSON schema enforcement, so the prompt includes the schema and parsing is done server-side with validation.

Implementation details for each provider:

// Gemini implementation
pub struct GeminiProvider {
    client: reqwest::Client,
    api_key: String,
    base_url: String,
    models: Vec<String>,
}

impl GeminiProvider {
    async fn generate_search_pass(&self, model: &str, ...) -> Result<serde_json::Value, AppError> {
        // POST to /v1beta/models/{model}:generateContent
        // Config includes: tools: [{ googleSearch: {} }]
        //                  responseMimeType: "application/json"
        //                  responseSchema: <schema>
    }
}

// OpenAI implementation
pub struct OpenAiProvider {
    client: reqwest::Client,
    api_key: String,
    base_url: String,  // default: https://api.openai.com/v1
    models: Vec<String>,
}

// Anthropic implementation
pub struct AnthropicProvider {
    client: reqwest::Client,
    api_key: String,
    base_url: String,  // default: https://api.anthropic.com
    models: Vec<String>,
}

3.3 Provider Factory

pub fn create_provider(config: &LlmProviderConfig) -> Result<Box<dyn LlmProvider>, AppError> {
    match config.provider.as_str() {
        "gemini" => Ok(Box::new(GeminiProvider::new(
            config.api_key.clone(),
            config.base_url.clone(),
            config.models.clone(),
        ))),
        "openai" => Ok(Box::new(OpenAiProvider::new(...))),
        "anthropic" => Ok(Box::new(AnthropicProvider::new(...))),
        _ => Err(AppError::BadRequest(format!("Unknown provider: {}", config.provider))),
    }
}

3.4 Rate Limiter Design

The rate limiter is a server-side, per-provider, in-memory sliding-window limiter (it keeps a queue of recent request timestamps rather than a token count), with configuration stored in the database.

pub struct RateLimiter {
    state: Arc<DashMap<String, ProviderBucket>>,
}

struct ProviderBucket {
    timestamps: VecDeque<Instant>,
    max_requests: u32,
    time_window: Duration,
}

impl RateLimiter {
    /// Waits (asynchronously) until a slot is available for the given provider.
    pub async fn acquire(&self, provider_id: &str) -> Result<(), AppError> {
        loop {
            // Scope the DashMap guard so it is released before any `.await`.
            let wait_time = {
                let mut bucket = self.state
                    .entry(provider_id.to_string())
                    .or_insert_with(|| self.default_bucket());

                // Copy the window out first: the closure cannot borrow `bucket`
                // while `timestamps` is mutably borrowed through the guard.
                let window = bucket.time_window;
                bucket.timestamps.retain(|t| t.elapsed() < window);

                if bucket.timestamps.len() < bucket.max_requests as usize {
                    bucket.timestamps.push_back(Instant::now());
                    return Ok(());
                }

                // saturating_sub avoids a panic if the oldest entry expires
                // between `retain` and this line.
                bucket.timestamps.front()
                    .map_or(Duration::ZERO, |t| window.saturating_sub(t.elapsed()))
            };
            tokio::time::sleep(wait_time).await;
        }
    }

    /// Reload configuration from DB (called by admin update endpoint).
    pub async fn reload_config(&self, pool: &SqlitePool) -> Result<(), AppError> {
        // Fetch rate_limits table, update each ProviderBucket
    }
}

The rate limiter lives in AppState and is shared across all requests. When an admin updates rate limit configuration, reload_config is called to hot-reload without restart.
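
To make the timestamp-queue logic easier to reason about and test, here is a synchronous, single-provider sketch using only the standard library. SlidingWindow and try_acquire are hypothetical names; instead of sleeping, the sketch returns how long the caller would have to wait, and the current time is passed in so behavior is deterministic.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

pub struct SlidingWindow {
    timestamps: VecDeque<Instant>,
    max_requests: usize,
    window: Duration,
}

impl SlidingWindow {
    pub fn new(max_requests: usize, window: Duration) -> Self {
        Self { timestamps: VecDeque::new(), max_requests, window }
    }

    /// Ok(()) if a slot was taken at `now`, otherwise Err(time to wait).
    pub fn try_acquire(&mut self, now: Instant) -> Result<(), Duration> {
        // Drop entries that have left the window (oldest first).
        while let Some(&oldest) = self.timestamps.front() {
            if now.duration_since(oldest) >= self.window {
                self.timestamps.pop_front();
            } else {
                break;
            }
        }
        if self.timestamps.len() < self.max_requests {
            self.timestamps.push_back(now);
            return Ok(());
        }
        match self.timestamps.front() {
            // Wait until the oldest entry leaves the window.
            Some(&oldest) => Err(self.window.saturating_sub(now.duration_since(oldest))),
            // Only reachable when max_requests == 0: nothing is ever admitted.
            None => Err(self.window),
        }
    }
}
```

With a limit of 2 requests per 60 seconds, requests at t=0s and t=10s are admitted, a request at t=20s is told to wait 40 seconds, and a request at t=60s is admitted because the first timestamp has aged out.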

3.5 Two-Pass Generation Pipeline

The SynthesisService orchestrates the full pipeline:

pub struct SynthesisService;

impl SynthesisService {
    pub async fn generate(
        state: &AppState,
        user_id: &str,
    ) -> Result<Synthesis, AppError> {
        // 1. Load user settings
        let settings = db::settings::get(pool, user_id).await?;

        // 2. Load user sources
        let sources = db::sources::list(pool, user_id).await?;

        // 3. Resolve LLM provider + model
        let (provider, model) = resolve_provider(state, &settings.ai_model).await?;

        // 4. Build dynamic schema from categories
        let schema = build_category_schema(&settings.categories);

        // 5. Rate limit: acquire slot
        state.rate_limiter.acquire(provider.provider_id()).await?;

        // 6. Pass 1: Search
        let raw_results = provider.generate_search_pass(
            &model, &system_prompt, &user_prompt, &schema
        ).await?;

        // 7. Validate & scrape URLs (server-side, no CORS issues)
        let scraped = scraper::validate_and_scrape(
            &state.http_client,
            raw_results,
            settings.max_age_days,
        ).await;

        // 8. Rate limit: acquire slot for pass 2
        state.rate_limiter.acquire(provider.provider_id()).await?;

        // 9. Pass 2: Rewrite with scraped content
        let final_results = provider.generate_rewrite_pass(
            &model, &rewrite_system_prompt, &rewrite_prompt, &schema
        ).await?;

        // 10. Persist
        let synthesis = db::syntheses::create(
            pool, user_id, &week_string, &final_results
        ).await?;

        Ok(synthesis)
    }
}
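
The pipeline calls resolve_provider without defining it. One plausible reading, given that settings store only a model name and llm_providers stores a list of models per provider, is a lookup of the enabled provider that lists the requested model. The sketch below shows that lookup over plain structs; ProviderEntry and this signature are assumptions, not the project's actual API.

```rust
/// Hypothetical in-memory view of one llm_providers row.
#[derive(Debug, Clone, PartialEq)]
pub struct ProviderEntry {
    pub provider: String,    // "gemini" | "openai" | "anthropic"
    pub models: Vec<String>, // decoded from the JSON `models` column
    pub is_enabled: bool,
}

/// Finds the first enabled provider that lists `ai_model`.
/// Returns None for unknown models and for models whose provider is disabled.
pub fn resolve_provider<'a>(
    providers: &'a [ProviderEntry],
    ai_model: &str,
) -> Option<&'a ProviderEntry> {
    providers
        .iter()
        .find(|p| p.is_enabled && p.models.iter().any(|m| m == ai_model))
}
```

A None result would most naturally surface as a BadRequest, since it means the user's saved ai_model no longer matches any enabled provider.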

3.6 Asynchronous Generation

Synthesis generation can take 30-90 seconds. Two options:

Option A: Synchronous with long timeout. Simple, but ties up a connection. Acceptable for low-traffic deployments.

Option B (Recommended): Background task with polling. The POST /syntheses/generate endpoint spawns a tokio task and returns a job ID. The frontend polls GET /syntheses/generate/:job_id/status. Job state is kept in an in-memory DashMap<String, JobStatus> (not in DB, since jobs are ephemeral).

enum JobStatus {
    Pending,
    InProgress { step: String },  // "search", "scraping", "rewriting"
    Completed { synthesis_id: String },
    Failed { error: String },
}

The frontend polls every 3-5 seconds with the same loading UX as the current React app.
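
A small sketch of the job registry behind Option B, using only the standard library. Mutex<HashMap> stands in for the DashMap mentioned above, a plain thread stands in for the spawned tokio task, and JobRegistry with its methods is a hypothetical shape.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

#[derive(Debug, Clone, PartialEq)]
pub enum JobStatus {
    Pending,
    InProgress { step: String },
    Completed { synthesis_id: String },
    Failed { error: String },
}

/// Cheap to clone: all clones share the same underlying map.
#[derive(Clone, Default)]
pub struct JobRegistry {
    jobs: Arc<Mutex<HashMap<String, JobStatus>>>,
}

impl JobRegistry {
    /// Called by POST /syntheses/generate before spawning the worker.
    pub fn create(&self, job_id: &str) {
        self.set(job_id, JobStatus::Pending);
    }

    /// Called by the worker as it moves through the pipeline steps.
    pub fn set(&self, job_id: &str, status: JobStatus) {
        self.jobs.lock().unwrap().insert(job_id.to_string(), status);
    }

    /// Called by the status endpoint; None means the job id is unknown.
    pub fn status(&self, job_id: &str) -> Option<JobStatus> {
        self.jobs.lock().unwrap().get(job_id).cloned()
    }
}
```

Because the map lives only in memory, a server restart loses in-flight jobs, and the status endpoint would then see None for a job the frontend is still polling; treating that as a 404 lets the client stop and offer a retry.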

4. URL Scraping / Validation

4.1 CORS Elimination

Moving scraping to the backend completely eliminates CORS issues. The Rust backend makes direct HTTP requests to target URLs -- no proxies needed. This is the single biggest reliability improvement in the refactoring.

4.2 reqwest-Based HTTP Client

let client = reqwest::Client::builder()
    .user_agent("Mozilla/5.0 (compatible; AISynthBot/1.0; +https://your-domain.com/bot)")
    .timeout(Duration::from_secs(15))
    .redirect(reqwest::redirect::Policy::limited(5))
    .connect_timeout(Duration::from_secs(5))
    .danger_accept_invalid_certs(false)
    .build()?;

The HTTP client is created once in AppState and reused across all requests (connection pooling).

4.3 HTML Parsing with `scraper` Crate

The current app uses the browser's DOMParser. The Rust equivalent uses the scraper crate (built on html5ever):

use scraper::{Html, Selector};

pub async fn validate_and_scrape(
    client: &reqwest::Client,
    items: Vec<RawNewsItem>,
    max_age_days: i64,
) -> Vec<ScrapedNewsItem> {
    let futures = items.into_iter().map(|item| {
        let client = client.clone();
        async move { scrape_single(&client, item, max_age_days).await }
    });

    let results = futures::future::join_all(futures).await;
    results.into_iter().flatten().collect()
}

async fn scrape_single(
    client: &reqwest::Client,
    item: RawNewsItem,
    max_age_days: i64,
) -> Option<ScrapedNewsItem> {
    // 1. Validate URL format
    let url = Url::parse(&item.url).ok()?;

    // 2. Fetch
    let resp = client.get(url).send().await.ok()?;
    if !resp.status().is_success() { return None; }
    let html_text = resp.text().await.ok()?;

    // 3. Parse HTML
    let document = Html::parse_document(&html_text);

    // 4. Soft-404 detection
    let title_sel = Selector::parse("title").unwrap();
    let h1_sel = Selector::parse("h1").unwrap();
    let title_text = document.select(&title_sel).next()
        .map(|el| el.text().collect::<String>().to_lowercase())
        .unwrap_or_default();
    let h1_text = document.select(&h1_sel).next()
        .map(|el| el.text().collect::<String>().to_lowercase())
        .unwrap_or_default();

    let error_keywords = [
        "page not found", "404", "403", "access denied",
        "forbidden", "not found", "introuvable",
    ];
    if error_keywords.iter().any(|kw| title_text.contains(kw) || h1_text.contains(kw)) {
        return None;
    }

    // 5. Date extraction (meta tags, JSON-LD, <time>)
    if let Some(pub_date) = extract_publication_date(&document) {
        let age = Utc::now() - pub_date;
        if age.num_days() > max_age_days {
            return None;
        }
    }

    // 6. Extract body text (remove script, style, nav, etc.)
    let content = extract_body_text(&document, 4000);

    Some(ScrapedNewsItem {
        title: item.title,
        url: item.url,
        summary: item.summary,
        scraped_content: content,
    })
}

Date extraction mirrors the current logic: check meta[property="article:published_time"], meta[itemprop="datePublished"], <time datetime>, and JSON-LD datePublished. The chrono crate handles date parsing with multiple format attempts.
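
To illustrate the age check itself, here is a standard-library-only sketch. In the real service chrono would do the parsing and arithmetic; this version handles just the leading "YYYY-MM-DD" of an ISO 8601 value and uses the well-known days-from-civil calculation. parse_iso_date and is_too_old are hypothetical helpers.

```rust
/// Days since 1970-01-01 for a proleptic Gregorian date
/// (the "days_from_civil" algorithm; March is treated as month 0).
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y };
    let era = y.div_euclid(400);
    let yoe = y.rem_euclid(400); // year of era, 0..=399
    let mp = (m + 9) % 12; // Mar=0, ..., Feb=11
    let doy = (153 * mp + 2) / 5 + d - 1; // day of (March-based) year
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day of era
    era * 146_097 + doe - 719_468
}

/// Parses the leading "YYYY-MM-DD" of an ISO 8601 string (range checks only).
pub fn parse_iso_date(s: &str) -> Option<(i64, i64, i64)> {
    let mut parts = s.get(..10)?.split('-');
    let y: i64 = parts.next()?.parse().ok()?;
    let m: i64 = parts.next()?.parse().ok()?;
    let d: i64 = parts.next()?.parse().ok()?;
    if parts.next().is_some() || !(1..=12).contains(&m) || !(1..=31).contains(&d) {
        return None;
    }
    Some((y, m, d))
}

/// Some(true) if the article is older than `max_age_days`,
/// None if either date cannot be parsed.
pub fn is_too_old(published: &str, today: &str, max_age_days: i64) -> Option<bool> {
    let (py, pm, pd) = parse_iso_date(published)?;
    let (ty, tm, td) = parse_iso_date(today)?;
    let age = days_from_civil(ty, tm, td) - days_from_civil(py, pm, pd);
    Some(age > max_age_days)
}
```

The None case matches the behavior of scrape_single above: an item whose publication date cannot be determined is kept rather than rejected.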

4.4 Concurrency Control

To avoid overwhelming target sites, scraping runs with bounded concurrency:

use futures::stream::{self, StreamExt};

stream::iter(items)
    .map(|item| scrape_single(&client, item, max_age_days))
    .buffer_unordered(10)  // Max 10 concurrent scrapes
    .collect::<Vec<_>>()
    .await

5. SolidJS Frontend

5.1 Build Tooling

SolidJS uses Vite natively. The migration is straightforward:

// vite.config.ts
import { defineConfig } from 'vite';
import solidPlugin from 'vite-plugin-solid';
import tailwindcss from '@tailwindcss/vite';

export default defineConfig({
  plugins: [solidPlugin(), tailwindcss()],
  server: {
    port: 3000,
    proxy: {
      '/api': 'http://localhost:8080',  // Proxy to Rust backend during dev
    },
  },
  build: {
    target: 'esnext',
  },
});

package.json dependencies:

{
  "dependencies": {
    "solid-js": "^1.9",
    "@solidjs/router": "^0.15",
    "lucide-solid": "^0.450",
    "date-fns": "^4.1"
  },
  "devDependencies": {
    "vite": "^6.2",
    "vite-plugin-solid": "^2.11",
    "@tailwindcss/vite": "^4.1",
    "tailwindcss": "^4.1",
    "typescript": "^5.8"
  }
}

5.2 State Management: React to SolidJS Mapping

React Pattern	SolidJS Equivalent	Notes
`useState(value)`	`createSignal(value)`	Returns `[getter, setter]` -- getter is a function call: `count()`
`useEffect(() => {}, [deps])`	`createEffect(() => {})`	Auto-tracks dependencies, no dep array needed
`useContext(Ctx)`	`useContext(Ctx)`	Nearly identical API
`createContext()`	`createContext()`	Same concept
`React.FC<Props>`	`Component<Props>`	`import { Component } from 'solid-js'`
`{items.map(i => ...)}`	`<For each={items()}>{(item) => ...}</For>`	SolidJS uses `<For>` for efficient list rendering
`{condition && <X/>}`	`<Show when={condition()}><X/></Show>`	`<Show>` avoids unnecessary DOM creation
`useNavigate()`	`useNavigate()`	Same API from `@solidjs/router`
`useParams()`	`useParams()`	Same API
`onSnapshot` (realtime)	`createResource` + polling or SSE	SolidJS does not have a Firestore equivalent; use `createResource` for data fetching

5.3 Authentication Context Port

// src/context/AuthContext.tsx
import { createContext, useContext, createSignal, createResource, ParentComponent } from 'solid-js';

interface User {
  id: string;
  email: string;
  display_name: string | null;
  role: string;
}

interface AuthContextType {
  user: () => User | null | undefined;
  loading: () => boolean;
  logout: () => Promise<void>;
}

const AuthContext = createContext<AuthContextType>();

async function fetchCurrentUser(): Promise<User | null> {
  const resp = await fetch('/api/v1/auth/me', {
    headers: { 'X-Requested-With': 'XMLHttpRequest' },
    credentials: 'include',
  });
  if (resp.status === 401) return null;
  if (!resp.ok) throw new Error('Failed to fetch user');
  return resp.json();
}

export const AuthProvider: ParentComponent = (props) => {
  const [user, { refetch }] = createResource(fetchCurrentUser);

  const logout = async () => {
    await fetch('/api/v1/auth/logout', {
      method: 'POST',
      headers: { 'X-Requested-With': 'XMLHttpRequest' },
      credentials: 'include',
    });
    refetch();
  };

  return (
    <AuthContext.Provider value={{
      user: () => user(),
      loading: () => user.loading,
      logout,
    }}>
      {props.children}
    </AuthContext.Provider>
  );
};

export const useAuth = () => {
  const ctx = useContext(AuthContext);
  if (!ctx) throw new Error('useAuth must be used within AuthProvider');
  return ctx;
};

5.4 Data Fetching Pattern

The current React app uses Firestore's onSnapshot for real-time updates. With the REST API backend, data fetching uses createResource:

// src/pages/Home.tsx
import { createResource, For, Show } from 'solid-js';
import { A } from '@solidjs/router';
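import { fetchApi } from '../lib/api';
// Assumed to be defined elsewhere in the app; these import paths are illustrative.
import type { SynthesisDocument } from '../types';
import { Spinner } from '../components/Spinner';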

async function fetchSyntheses() {
  return fetchApi<SynthesisDocument[]>('/api/v1/syntheses');
}

export default function Home() {
  // Destructure { refetch } from the second tuple element when a manual reload is needed.
  const [syntheses] = createResource(fetchSyntheses);

  return (
    <Show when={!syntheses.loading} fallback={<Spinner />}>
      <For each={syntheses()}>
        {(synth) => (
          <A href={`/synthesis/${synth.id}`}>
            {/* card content */}
          </A>
        )}
      </For>
    </Show>
  );
}

5.5 Tailwind CSS Compatibility

Tailwind CSS v4 works the same way with SolidJS. The @tailwindcss/vite plugin scans .tsx files for class names regardless of framework, so existing utility classes carry over unchanged; the only difference is that idiomatic SolidJS JSX uses the class attribute rather than className. The lucide-solid package provides the same icon set as lucide-react with a near-identical component API.

5.6 Routing

// src/App.tsx
import { Router, Route } from '@solidjs/router';
import { AuthProvider } from './context/AuthContext';

function App() {
  return (
    <AuthProvider>
      <Router>
        <Route path="/login" component={Login} />
        <Route path="/" component={ProtectedLayout}>
          <Route path="/" component={Home} />
          <Route path="/sources" component={Sources} />
          <Route path="/settings" component={Settings} />
          <Route path="/generate" component={GenerateSynthesis} />
          <Route path="/synthesis/:id" component={SynthesisDetail} />
        </Route>
      </Router>
    </AuthProvider>
  );
}

The ProtectedLayout component checks auth and renders <Navigate> if not logged in -- same pattern as the current React ProtectedRoute but using SolidJS's <Navigate>.

6. Authentication System

6.1 Magic Link Flow

User                    Frontend           Backend            SMTP Server
 |                        |                   |                    |
 |-- Enter email -------->|                   |                    |
 |                        |-- POST /auth/login -->                |
 |                        |   { email, captcha_token }            |
 |                        |                   |-- verify captcha   |
 |                        |                   |-- generate token   |
 |                        |                   |-- store hash in DB |
 |                        |                   |-- send email ------+-->
 |                        |<-- 200 "Check email" |                |
 |                        |                   |                    |
 |<---- Email arrives (link: /auth/verify?token=xxx) -------------|
 |                        |                   |                    |
 |-- Click link --------->|                   |                    |
 |                        |-- GET /auth/verify?token=xxx -->      |
 |                        |                   |-- hash token       |
 |                        |                   |-- lookup in DB     |
 |                        |                   |-- verify not expired|
 |                        |                   |-- mark as used     |
 |                        |                   |-- create/get user  |
 |                        |                   |-- create session   |
 |                        |<-- 302 redirect + Set-Cookie          |
 |<-- Redirect to / ------|                   |                    |

Token generation:

32 bytes of cryptographically secure random data (rand::rngs::OsRng)
Base64url encoded for URL safety
SHA-256 hash stored in DB (never store raw token)
15-minute expiry
Single use (marked used = true after verification)
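
The expiry and single-use rules above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: MagicLinkRecord, VerifyError, and consume are hypothetical names, and the record is an in-memory stand-in for a database row.

```rust
use std::time::{Duration, SystemTime};

/// Magic link lifetime from the list above.
const TOKEN_TTL: Duration = Duration::from_secs(15 * 60);

#[derive(Debug, PartialEq)]
enum VerifyError {
    AlreadyUsed,
    Expired,
}

/// Hypothetical in-memory view of one stored magic link row.
struct MagicLinkRecord {
    created_at: SystemTime,
    used: bool,
}

/// Accepts the token at most once and only while it is younger than TOKEN_TTL.
fn consume(record: &mut MagicLinkRecord, now: SystemTime) -> Result<(), VerifyError> {
    if record.used {
        return Err(VerifyError::AlreadyUsed);
    }
    // A clock that moved backwards is treated as age zero rather than an error.
    let age = now.duration_since(record.created_at).unwrap_or(Duration::ZERO);
    if age >= TOKEN_TTL {
        return Err(VerifyError::Expired);
    }
    record.used = true;
    Ok(())
}
```

Against a real database, the check and the "mark as used" write would likely need to happen in a single statement or transaction so that two concurrent clicks cannot both succeed.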

Rate limiting on magic link requests:

Max 3 requests per email per 15 minutes
Max 10 requests per IP per hour
Prevents email bombing
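
One way to enforce limits like these is a sliding-window counter keyed by email or IP. The sketch below is an assumption about the shape of such a limiter, not the document's design: SlidingWindowLimiter is a hypothetical name, it uses a plain HashMap instead of a concurrent map, and time is passed in as seconds so the behaviour is easy to test.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical sliding-window limiter: at most `max_requests` per `window_secs` per key.
struct SlidingWindowLimiter {
    max_requests: usize,
    window_secs: u64,
    hits: HashMap<String, VecDeque<u64>>,
}

impl SlidingWindowLimiter {
    fn new(max_requests: usize, window_secs: u64) -> Self {
        Self { max_requests, window_secs, hits: HashMap::new() }
    }

    /// Returns true and records the request if `key` is still under its limit.
    fn check(&mut self, key: &str, now_secs: u64) -> bool {
        let queue = self.hits.entry(key.to_string()).or_default();
        // Drop timestamps that have fallen out of the window.
        while let Some(&oldest) = queue.front() {
            if now_secs.saturating_sub(oldest) >= self.window_secs {
                queue.pop_front();
            } else {
                break;
            }
        }
        if queue.len() >= self.max_requests {
            return false;
        }
        queue.push_back(now_secs);
        true
    }
}
```

The per-email rule would be `SlidingWindowLimiter::new(3, 15 * 60)` and the per-IP rule `SlidingWindowLimiter::new(10, 60 * 60)`; a request is allowed only if both limiters accept it.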

6.2 Account Registration Flow

User submits email + display name + captcha token.
Backend verifies captcha with provider.
Backend checks email uniqueness.
Backend creates user with role = 'user' and default settings.
Backend sends magic link email for initial verification.
User clicks link, session is created.

The first user can be bootstrapped as admin via environment variable:

ADMIN_EMAIL=admin@example.com

On startup, if a user with this email exists, their role is set to admin.
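
The startup promotion can be sketched as follows. User, Role, and bootstrap_admin are hypothetical names, the slice stands in for the users table, and the case-insensitive email comparison is an assumption rather than something the requirements state.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Role {
    User,
    Admin,
}

/// Hypothetical in-memory stand-in for a row in the users table.
struct User {
    email: String,
    role: Role,
}

/// Promotes the user matching ADMIN_EMAIL, if any. Returns true when a user was promoted.
fn bootstrap_admin(users: &mut [User], admin_email: Option<&str>) -> bool {
    // Nothing to do when ADMIN_EMAIL is not set.
    let Some(target) = admin_email else { return false };
    // Assumption: emails are compared case-insensitively.
    match users.iter_mut().find(|u| u.email.eq_ignore_ascii_case(target)) {
        Some(user) => {
            user.role = Role::Admin;
            true
        }
        None => false,
    }
}
```

If no matching user exists yet, this sketch leaves the table untouched, which matches the rule as written; whether the role should also be applied when that user registers later is left open.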

6.3 Session Management

Sessions are stored in the sessions table. The session ID is a 32-byte random token (base64url-encoded, 43 characters). Session lookup is a single indexed lookup on the primary key.

Session lifecycle:

Created on magic link verification
Expires after 30 days (configurable)
Refreshed (expiry extended) on each authenticated request
Deleted on logout
Periodic cleanup job (tokio interval) removes expired sessions
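
The lifecycle above amounts to a sliding expiry plus a periodic purge. The following sketch models it with an in-memory map; SessionStore and its methods are hypothetical names, the map stands in for the sessions table, and timestamps are plain seconds to keep the example deterministic.

```rust
use std::collections::HashMap;

/// 30 days, the default session lifetime from the list above.
const SESSION_TTL_SECS: u64 = 30 * 24 * 60 * 60;

/// Hypothetical in-memory stand-in for the sessions table: session ID -> expiry time.
struct SessionStore {
    expires_at: HashMap<String, u64>,
}

impl SessionStore {
    fn new() -> Self {
        Self { expires_at: HashMap::new() }
    }

    /// Called after a successful magic link verification.
    fn create(&mut self, id: &str, now: u64) {
        self.expires_at.insert(id.to_string(), now + SESSION_TTL_SECS);
    }

    /// Called on each authenticated request: validates the session and extends its expiry.
    fn touch(&mut self, id: &str, now: u64) -> bool {
        if let Some(exp) = self.expires_at.get_mut(id) {
            if *exp > now {
                *exp = now + SESSION_TTL_SECS;
                return true;
            }
        }
        false
    }

    /// Called on logout.
    fn delete(&mut self, id: &str) {
        self.expires_at.remove(id);
    }

    /// Called by the periodic cleanup job. Returns how many sessions were removed.
    fn purge_expired(&mut self, now: u64) -> usize {
        let before = self.expires_at.len();
        self.expires_at.retain(|_, exp| *exp > now);
        before - self.expires_at.len()
    }
}
```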

6.4 Captcha Integration

Recommendation: Cloudflare Turnstile.

Option	Self-hostable	Privacy	Free tier
hCaptcha	No (SaaS)	Better than reCAPTCHA	Yes (unlimited)
Cloudflare Turnstile	No (SaaS)	Excellent (often invisible)	Yes (unlimited)
mCaptcha	Yes (open source)	Full control	N/A (self-hosted)

Neither hCaptcha nor Cloudflare Turnstile is self-hostable. Cloudflare Turnstile is recommended for its invisible challenge mode (better UX) and generous free tier. If strict self-hosting is required, mCaptcha (Rust-based, open source) is the self-hostable candidate listed above, though it requires running a separate service.

Backend verification is simple:

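use serde::Deserialize;

// Only the `success` field of the siteverify response is needed here.
#[derive(Deserialize)]
struct TurnstileResponse {
    success: bool,
}

// Assumes AppError implements From<reqwest::Error> so that `?` can convert errors.
pub async fn verify_captcha(client: &reqwest::Client, token: &str, secret: &str) -> Result<bool, AppError> {
    let resp = client
        .post("https://challenges.cloudflare.com/turnstile/v0/siteverify")
        .form(&[("secret", secret), ("response", token)])
        .send()
        .await?;
    let result: TurnstileResponse = resp.json().await?;
    Ok(result.success)
}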

7. Docker Deployment

7.1 Multi-Stage Dockerfile

# ===== Stage 1: Build Rust backend =====
FROM rust:1.85-bookworm AS backend-builder

WORKDIR /app
COPY Cargo.toml Cargo.lock ./
COPY src/ src/
COPY migrations/ migrations/

# Create a dummy SQLite DB for sqlx compile-time checks
ENV DATABASE_URL="sqlite:///tmp/build.db"
RUN cargo install sqlx-cli --no-default-features --features sqlite \
    && sqlx database create \
    && sqlx migrate run

RUN cargo build --release

# ===== Stage 2: Build SolidJS frontend =====
FROM node:22-alpine AS frontend-builder

WORKDIR /app/frontend
COPY frontend/package.json frontend/package-lock.json ./
RUN npm ci

COPY frontend/ ./
RUN npm run build

# ===== Stage 3: Minimal runtime =====
FROM debian:bookworm-slim AS runtime

# curl is required by the docker-compose healthcheck
RUN apt-get update && apt-get install -y --no-install-recommends \
    ca-certificates \
    curl \
    libssl3 \
    && rm -rf /var/lib/apt/lists/*

RUN useradd -ms /bin/bash appuser

WORKDIR /app

# Copy backend binary
COPY --from=backend-builder /app/target/release/ai-synth-backend .
# Copy migrations for runtime migration
COPY --from=backend-builder /app/migrations/ migrations/
# Copy frontend static files
COPY --from=frontend-builder /app/frontend/dist/ static/

# Create data directory for SQLite
RUN mkdir -p /app/data && chown appuser:appuser /app/data

USER appuser

ENV DATABASE_URL="sqlite:///app/data/ai_synth.db"
ENV STATIC_DIR="/app/static"
ENV PORT=8080

EXPOSE 8080

# The binary applies pending migrations at startup, then starts the server
CMD ["./ai-synth-backend"]

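The Rust backend serves the built SolidJS files directly (via tower-http's ServeDir), eliminating the need for a separate nginx container. Requests under /api/* go to handlers; any other path is served from STATIC_DIR when a matching file exists and otherwise falls back to index.html so that client-side routes resolve (SPA fallback).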
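
The dispatch rule can be written down as a small pure function. This is only a model of the decision, with hypothetical names (Target, resolve) and a list of known files standing in for the static directory; in the real server the same behaviour would come from the router and ServeDir configuration.

```rust
#[derive(Debug, PartialEq)]
enum Target<'a> {
    /// Handled by an API route.
    ApiHandler,
    /// A file that exists in the static directory (path relative to it).
    StaticFile(&'a str),
    /// Anything else: serve index.html and let the client-side router take over.
    SpaIndex,
}

/// Decides where a request path goes. `static_files` lists paths relative to the static dir.
fn resolve<'a>(path: &'a str, static_files: &[&str]) -> Target<'a> {
    if path == "/api" || path.starts_with("/api/") {
        return Target::ApiHandler;
    }
    let rel = path.trim_start_matches('/');
    if !rel.is_empty() && static_files.iter().any(|f| *f == rel) {
        Target::StaticFile(rel)
    } else {
        Target::SpaIndex
    }
}
```

With this rule, a deep link such as /synthesis/42 loads index.html and is then resolved by the SolidJS router, while an unknown /api/ path is still left to the API layer to answer, for example with a JSON 404 rather than the HTML shell.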

7.2 docker-compose.yml

services:
  app:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: ai-synth
    restart: unless-stopped
    ports:
      - "${PORT:-8080}:8080"
    volumes:
      - ai_synth_data:/app/data        # SQLite persistence
    environment:
      - DATABASE_URL=sqlite:///app/data/ai_synth.db
      - PORT=8080
      - ADMIN_EMAIL=${ADMIN_EMAIL}
      - SESSION_SECRET=${SESSION_SECRET}     # 64-byte hex for cookie signing
      - SMTP_HOST=${SMTP_HOST}
      - SMTP_PORT=${SMTP_PORT:-587}
      - SMTP_USER=${SMTP_USER}
      - SMTP_PASSWORD=${SMTP_PASSWORD}
      - SMTP_FROM=${SMTP_FROM}
      - CAPTCHA_SECRET=${CAPTCHA_SECRET}
      - CAPTCHA_SITE_KEY=${CAPTCHA_SITE_KEY}
      - ENCRYPTION_KEY=${ENCRYPTION_KEY}     # 32-byte hex for API key encryption
      - BASE_URL=${BASE_URL}                 # Public URL used in magic link emails
      - RUST_LOG=info
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/api/v1/health"]
      interval: 30s
      timeout: 5s
      retries: 3

  # Optional: Mailpit for local development (SMTP catch-all)
  mailpit:
    image: axllent/mailpit
    container_name: ai-synth-mail
    restart: unless-stopped
    ports:
      - "8025:8025"   # Web UI
      - "1025:1025"   # SMTP
    profiles:
      - dev

volumes:
  ai_synth_data:
    driver: local

7.3 Volume Mounts for SQLite

The SQLite database file is stored in a Docker named volume (ai_synth_data). This ensures:

Data persists across container restarts and rebuilds
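The volume can be backed up via docker cp or volume backup tools; because the database runs in WAL mode, take the copy while the app is stopped or use SQLite's online backup (for example VACUUM INTO) so the backup is consistent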
WAL mode is used for concurrent read/write performance

Important SQLite configuration for production:

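use std::{str::FromStr, time::Duration};
use sqlx::sqlite::{SqliteConnectOptions, SqliteJournalMode, SqlitePoolOptions, SqliteSynchronous};

// Typed connect options apply the same PRAGMAs to every pooled connection.
let options = SqliteConnectOptions::from_str(&database_url)?
    .create_if_missing(true)                    // a fresh Docker volume has no DB file yet
    .journal_mode(SqliteJournalMode::Wal)
    .synchronous(SqliteSynchronous::Normal)
    .foreign_keys(true)
    .busy_timeout(Duration::from_secs(5));

let pool = SqlitePoolOptions::new()
    .max_connections(5)         // SQLite allows only one writer at a time
    .connect_with(options)
    .await?;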

7.4 Environment Variable Configuration

A .env.example file documents all required and optional variables:

# === Required ===
DATABASE_URL=sqlite:///app/data/ai_synth.db
SESSION_SECRET=<64-byte-hex-string>
ENCRYPTION_KEY=<32-byte-hex-string>
ADMIN_EMAIL=admin@example.com

# === SMTP (required for magic link auth) ===
SMTP_HOST=smtp.example.com
SMTP_PORT=587
SMTP_USER=user@example.com
SMTP_PASSWORD=password
SMTP_FROM=noreply@example.com

# === Captcha ===
CAPTCHA_SECRET=<turnstile-secret-key>
CAPTCHA_SITE_KEY=<turnstile-site-key>

# === Optional ===
PORT=8080
RUST_LOG=info
BASE_URL=https://your-domain.com  # For magic link URLs
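
A minimal sketch of how the backend might load these variables, failing fast on missing required values and applying the defaults shown above. Config and load_config are hypothetical names, and only a subset of the variables is modelled; the lookup is passed in as a closure so the logic can be tested without touching the process environment.

```rust
#[derive(Debug, PartialEq)]
struct Config {
    database_url: String,
    session_secret: String,
    encryption_key: String,
    admin_email: String,
    port: u16,
    rust_log: String,
}

/// Builds a Config from a lookup function; in production this would be
/// `load_config(|name| std::env::var(name).ok())`.
fn load_config(get: impl Fn(&str) -> Option<String>) -> Result<Config, String> {
    // Required variables must be present and non-empty.
    let required = |name: &str| {
        get(name)
            .filter(|value| !value.is_empty())
            .ok_or_else(|| format!("missing required variable {name}"))
    };
    // PORT is optional and defaults to 8080, but must parse when present.
    let port = match get("PORT") {
        Some(raw) => raw
            .parse::<u16>()
            .map_err(|e| format!("invalid PORT {raw:?}: {e}"))?,
        None => 8080,
    };
    Ok(Config {
        database_url: required("DATABASE_URL")?,
        session_secret: required("SESSION_SECRET")?,
        encryption_key: required("ENCRYPTION_KEY")?,
        admin_email: required("ADMIN_EMAIL")?,
        port,
        rust_log: get("RUST_LOG").unwrap_or_else(|| "info".to_string()),
    })
}
```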

8. Migration from Firebase

8.1 Data Migration Strategy

The migration is split into a one-shot export script using the firebase-admin SDK (Python or Node) and a standalone Rust CLI tool that imports the result into SQLite:

Step 1: Export Firestore data

Use firebase-admin SDK (Python or Node.js is simplest for this one-shot task):

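# migrate_export.py
import json

import firebase_admin
from firebase_admin import auth, credentials, firestore

cred = credentials.Certificate("service-account.json")
firebase_admin.initialize_app(cred)
db = firestore.client()

data = {"users": [], "syntheses": [], "sources": [], "settings": []}

# Export users from Firebase Auth; their emails seed the new users table
for user in auth.list_users().iterate_all():
    data["users"].append(
        {"uid": user.uid, "email": user.email, "display_name": user.display_name}
    )

# Export each Firestore collection, keeping the document ID alongside the fields
for name in ("syntheses", "sources", "settings"):
    for doc in db.collection(name).stream():
        d = doc.to_dict()
        d["_id"] = doc.id
        data[name].append(d)

# default=str stringifies Firestore timestamps; the import step normalizes them
with open("firebase_export.json", "w") as f:
    json.dump(data, f, default=str)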

Step 2: Transform and import into SQLite

A Rust CLI tool reads the JSON export and inserts into SQLite:

cargo run --bin migrate -- --input firebase_export.json --db ai_synth.db

Key transformations:

authorUid / userId from Firebase Auth UID -> new UUID in users table (mapping table maintained during migration)
Firebase Timestamp -> ISO 8601 string
Legacy SynthesisData fields (majorAnnouncements, financialSector, etc.) -> normalized sections[] JSON
Settings doc ID (was {userId} in Firestore) -> user_id foreign key
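
The UID mapping is the transformation every other table depends on, so a sketch may help. UidMapper is a hypothetical name, and the ID generator is injected as a closure; the real tool would presumably generate UUIDs there, while the example uses a counter so the output is predictable.

```rust
use std::collections::HashMap;

/// Hypothetical mapping table: Firebase Auth UID -> new users.id.
struct UidMapper<F: FnMut() -> String> {
    new_id: F,
    map: HashMap<String, String>,
}

impl<F: FnMut() -> String> UidMapper<F> {
    fn new(new_id: F) -> Self {
        Self { new_id, map: HashMap::new() }
    }

    /// Returns the new ID for a Firebase UID, allocating one on first sight.
    /// Repeated calls with the same UID always return the same ID, so
    /// syntheses, sources, and settings all point at the same users row.
    fn resolve(&mut self, firebase_uid: &str) -> String {
        if let Some(id) = self.map.get(firebase_uid) {
            return id.clone();
        }
        let id = (self.new_id)();
        self.map.insert(firebase_uid.to_string(), id.clone());
        id
    }
}
```

During import, each authorUid or userId field, and each settings document ID, would be passed through resolve before the row is inserted.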

Step 3: User notification

Since authentication changes from Google SSO to email+magic link, existing users need to be notified that they must use the magic link flow. Their email addresses (from Firebase Auth) are imported into the users table. On first magic link login, the user's existing data is accessible via their email.

8.2 Mapping Firestore Security Rules to Rust

The Firestore rules enforce three categories of protection that map to backend patterns:

Firestore Rule	Rust Equivalent
`isAuthenticated()`	Auth middleware layer (rejects 401 if no valid session)
`isDocOwner()` / `request.auth.uid == resource.data.authorUid`	Query-level filtering: `WHERE user_id = $1` with the authenticated user's ID
`isValidSynthesis()` / `isValidSettings()` / `isValidSource()`	Request validation using `validator` crate or manual checks in handlers
`uidUnchanged()` / `uidNotModified()`	Not applicable -- `user_id` is never in the request body; it is injected server-side from the session
`request.resource.data.createdAt == resource.data.createdAt`	`created_at` is set server-side and never updatable via API
Field type checks (string, number, timestamp)	Serde deserialization + custom validators
Size limits (e.g., `title.size() < 500`)	Validator annotations: `#[validate(length(max = 500))]`

Example validation in Rust:

use serde::Deserialize;
use validator::Validate;

#[derive(Deserialize, Validate)]
pub struct CreateSourceRequest {
    #[validate(length(min = 1, max = 200))]
    pub title: String,

    #[validate(url, length(max = 1000))]
    pub url: String,
}
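
For comparison, the "manual checks in handlers" alternative from the table could look like the sketch below. It is dependency-free and mirrors the limits in the annotated struct; the struct is redeclared without derives to keep the example self-contained, and the http/https scheme check is a simplifying assumption, not what the validator crate's url rule does.

```rust
struct CreateSourceRequest {
    title: String,
    url: String,
}

/// Hand-written equivalent of the length limits above, collecting every failure.
fn validate(req: &CreateSourceRequest) -> Result<(), Vec<&'static str>> {
    let mut errors = Vec::new();
    let title_len = req.title.chars().count();
    if title_len == 0 || title_len > 200 {
        errors.push("title must be 1-200 characters");
    }
    if req.url.chars().count() > 1000 {
        errors.push("url must be at most 1000 characters");
    }
    // Assumption: only http(s) sources are accepted; this is stricter than a generic URL parse.
    if !(req.url.starts_with("http://") || req.url.starts_with("https://")) {
        errors.push("url must start with http:// or https://");
    }
    if errors.is_empty() {
        Ok(())
    } else {
        Err(errors)
    }
}
```

A handler would typically translate the Err case into a 400-class response listing the messages.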

The key architectural difference: in Firestore, rules are the only security layer (the client has direct DB access). In the Rust backend, security is enforced at the handler level (authentication middleware + query scoping + input validation). The database is never directly accessible from the client.

Ownership enforcement pattern:

Every query that reads or mutates user data includes WHERE user_id = ? with the authenticated user's ID. This is a structural convention rather than a declarative rule: as long as every data-access function takes the user ID from the session and never from the request, a handler has no code path that reaches another user's data.

// db/syntheses.rs
pub async fn get_by_id(pool: &SqlitePool, user_id: &str, synthesis_id: &str) -> Result<Option<Synthesis>, sqlx::Error> {
    sqlx::query_as!(
        Synthesis,
        "SELECT * FROM syntheses WHERE id = ? AND user_id = ?",
        synthesis_id,
        user_id
    )
    .fetch_optional(pool)
    .await
}

If the synthesis belongs to another user, this returns None, and the handler returns 404. There is no way for a user to query, update, or delete another user's data.

Summary of Key Crate Dependencies

Purpose	Crate	Version Guidance
Web framework	`axum`	^0.8
Async runtime	`tokio`	^1 (full features)
Database	`sqlx`	^0.8 (features: sqlite, runtime-tokio)
HTTP client	`reqwest`	^0.12 (features: json, cookies)
HTML parsing	`scraper`	^0.22
Serialization	`serde`, `serde_json`	^1
Date/time	`chrono`	^0.4
Token hashing (SHA-256)	`sha2`	^0.10
Random tokens	`rand`	^0.8
SMTP	`lettre`	^0.11
Logging	`tracing`, `tracing-subscriber`	^0.1 / ^0.3
Config	`dotenvy`	^0.15
Validation	`validator`	^0.19
Concurrent map	`dashmap`	^6
Static file serving	`tower-http`	^0.6 (features: fs, cors, trace)
Cookie handling	`axum-extra`	^0.10 (features: cookie)
Encryption (API keys)	`aes-gcm`	^0.10
Base64	`base64`	^0.22
UUID	`uuid`	^1 (features: v4)
Error handling	`anyhow`, `thiserror`	^1

Architecture Diagram (Text)

                                   ┌─────────────────────┐
                                   │   Docker Container   │
                                   │                     │
  Browser ◄──── HTTPS ────►  ┌─────┴─────────────────┐   │
  (SolidJS SPA)               │    Axum Web Server     │   │
                              │                       │   │
                              │  /static/* ──► ServeDir│   │
                              │  /api/v1/* ──► Router  │   │
                              │                       │   │
                              │  ┌─ Auth Middleware ─┐ │   │
                              │  │  Session Cookie   │ │   │
                              │  │  CSRF Check       │ │   │
                              │  └───────────────────┘ │   │
                              │                       │   │
                              │  ┌─ Handlers ────────┐ │   │
                              │  │ auth, syntheses,  │ │   │
                              │  │ sources, settings,│ │   │
                              │  │ admin, email      │ │   │
                              │  └────────┬──────────┘ │   │
                              │           │            │   │
                              │  ┌─ Services ────────┐ │   │
                              │  │ LLM providers     │─┼───┼──► Gemini API
                              │  │ (trait-based)     │─┼───┼──► OpenAI API
                              │  │                   │─┼───┼──► Anthropic API
                              │  │ Scraper (reqwest) │─┼───┼──► Target URLs
                              │  │ Email (lettre)    │─┼───┼──► SMTP Server
                              │  │ Captcha           │─┼───┼──► Turnstile API
                              │  └────────┬──────────┘ │   │
                              │           │            │   │
                              │  ┌─ DB Layer (sqlx) ─┐ │   │
                              │  │  SQLite (WAL)     │ │   │
                              │  └───────────────────┘ │   │
                              └───────────┬────────────┘   │
                                          │                │
                              ┌───────────▼────────────┐   │
                              │   /app/data/            │   │
                              │   ai_synth.db           │   │
                              │   (Docker volume)       │   │
                              └─────────────────────────┘   │
                                   └─────────────────────┘
