# Technical Architecture Analysis: AI Weekly Synth Refactoring

## Open Questions and Clarifications Needed

Before implementation, the following points require decisions from stakeholders:

1. **Admin scope**: Is the "admin" a single super-user defined by config, or a full role-based system with multiple admins? This analysis assumes a simple role flag on users plus a single bootstrap admin defined via environment variable.
2. **Google OAuth retention**: The requirements specify email+captcha and magic link auth. Should Google SSO be dropped entirely, or kept as an additional option? This analysis assumes Google SSO is dropped to remove all Google dependencies.
3. **Email sending for syntheses**: The current app sends syntheses via Gmail API with OAuth popup. With Google dependencies removed, should SMTP-based email sending replace this? This analysis assumes yes, using the same SMTP configuration as magic link delivery.
4. **Data migration volume**: How many existing users and syntheses need migrating? This impacts whether a one-shot script suffices or whether incremental migration tooling is needed.
5. **Concurrent users target**: Rate limiter design and session store choice depend on expected load. This analysis assumes a small-to-medium deployment (1-100 concurrent users).
6. **Legacy data**: The current `SynthesisData` has legacy fields (`majorAnnouncements`, `financialSector`, etc.). The requirements say "remove legacy data/formats/code." This analysis assumes legacy fields are dropped during migration; only the `sections[]` format is carried forward.

---

## 1. Rust Backend Architecture

### 1.1 Framework Choice: Axum

**Recommendation: Axum** over Actix-web.

**Justification:**

| Criterion | Axum | Actix-web |
|---|---|---|
| Ecosystem alignment | Built on `tokio` + `tower` + `hyper` -- the de facto Rust async stack | Has its own runtime layer (though uses tokio underneath) |
| Middleware model | Tower `Layer`/`Service` -- composable, reusable, testable | Actix-specific `Transform`/`Service` traits -- powerful but idiosyncratic |
| Extractors | Type-safe, ergonomic, uses `FromRequest` traits | Similar, but with `web::Data`, `web::Json` wrappers |
| Community trajectory | Growing faster, backed by the tokio team | Mature, stable, but slower growth |
| Learning curve | Lower for developers already using the tokio ecosystem | Slightly higher due to Actix-specific abstractions |
| Compile-time type safety | Strong -- handler function signatures are validated at compile time | Strong, but less ergonomic error messages |

Axum's tower-based middleware model is a decisive advantage for this project: the auth middleware, rate limiter, and CORS layer compose naturally as tower `Layer`s. Axum also has first-class support for shared state via the `State` extractor, which maps well to a shared database pool and configuration.

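
To make the composition argument concrete, here is a minimal std-only sketch of the wrap-and-delegate idea behind tower's `Layer`/`Service` model. It is an analogue, not tower's or Axum's actual API: the `Service` trait, `Request`, `Response`, `RequireSession`, `MaxCalls`, and `build_stack` are all hypothetical names introduced for illustration.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Hypothetical request/response types; real code would use Axum's types.
#[derive(Debug, Clone, PartialEq)]
pub struct Request {
    pub path: String,
    pub session: Option<String>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Response {
    pub status: u16,
    pub body: String,
}

/// Simplified stand-in for the tower `Service` concept (synchronous here).
pub trait Service {
    fn call(&self, req: Request) -> Response;
}

/// Innermost service: plays the role of a handler.
pub struct Handler;

impl Service for Handler {
    fn call(&self, req: Request) -> Response {
        Response { status: 200, body: format!("ok {}", req.path) }
    }
}

/// Middleware: rejects requests without a session, otherwise delegates.
pub struct RequireSession<S> {
    pub inner: S,
}

impl<S: Service> Service for RequireSession<S> {
    fn call(&self, req: Request) -> Response {
        if req.session.is_none() {
            return Response { status: 401, body: "Unauthorized".into() };
        }
        self.inner.call(req)
    }
}

/// Middleware: allows a fixed number of calls, then answers 429.
pub struct MaxCalls<S> {
    pub inner: S,
    pub remaining: AtomicU32,
}

impl<S: Service> Service for MaxCalls<S> {
    fn call(&self, req: Request) -> Response {
        // Atomically take one slot; fails once the budget reaches zero.
        let took_slot = self
            .remaining
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |n| n.checked_sub(1))
            .is_ok();
        if !took_slot {
            return Response { status: 429, body: "Too many requests".into() };
        }
        self.inner.call(req)
    }
}

/// Composes the stack: rate limit (outermost) -> auth -> handler.
pub fn build_stack(max_calls: u32) -> impl Service {
    MaxCalls {
        inner: RequireSession { inner: Handler },
        remaining: AtomicU32::new(max_calls),
    }
}
```

In this sketch the outermost wrapper runs first, so the call budget is consumed even by unauthenticated requests; swapping the nesting order would change that. The same ordering question applies when stacking real tower layers.
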
### 1.2 Project Structure

```
ai-synth-backend/
├── Cargo.toml
├── Cargo.lock
├── .env.example
├── migrations/                 # sqlx migrations
│   ├── 001_create_users.sql
│   ├── 002_create_sessions.sql
│   ├── 003_create_settings.sql
│   ├── 004_create_sources.sql
│   ├── 005_create_syntheses.sql
│   ├── 006_create_admin_config.sql
│   └── 007_create_rate_limits.sql
├── src/
│   ├── main.rs                 # Entry point: init tracing, DB, run server
│   ├── config.rs               # Env-based configuration (envy / dotenvy)
│   ├── app_state.rs            # AppState struct (pool, config, http client)
│   ├── error.rs                # AppError enum, IntoResponse impl
│   ├── router.rs               # All route definitions, middleware wiring
│   ├── middleware/
│   │   ├── mod.rs
│   │   ├── auth.rs             # Session cookie extraction, user injection
│   │   ├── csrf.rs             # Custom-header CSRF check (see 2.3)
│   │   └── rate_limit.rs       # Per-provider, configurable rate limiter
│   ├── models/
│   │   ├── mod.rs
│   │   ├── user.rs             # User, NewUser, UserRole
│   │   ├── session.rs          # Session
│   │   ├── settings.rs         # UserSettings
│   │   ├── source.rs           # Source
│   │   ├── synthesis.rs        # Synthesis, NewsSection, NewsItem
│   │   └── admin.rs            # LlmProviderConfig, RateLimitConfig
│   ├── handlers/
│   │   ├── mod.rs
│   │   ├── auth.rs             # register, login (magic link), verify, logout
│   │   ├── syntheses.rs        # list, get, create (trigger generation), delete
│   │   ├── sources.rs          # CRUD, bulk import, CSV export
│   │   ├── settings.rs         # get, update, export, import
│   │   ├── admin.rs            # LLM config CRUD, rate limit config, user list
│   │   └── email.rs            # Send synthesis by email
│   ├── services/
│   │   ├── mod.rs
│   │   ├── llm/
│   │   │   ├── mod.rs          # LlmProvider trait, factory function
│   │   │   ├── gemini.rs       # Google Gemini implementation
│   │   │   ├── openai.rs       # OpenAI implementation
│   │   │   ├── anthropic.rs    # Anthropic implementation
│   │   │   └── types.rs        # Shared request/response types
│   │   ├── synthesis.rs        # 2-pass generation pipeline orchestration
│   │   ├── scraper.rs          # URL validation, HTML scraping, date extraction
│   │   ├── email.rs            # SMTP email sending (magic links + syntheses)
│   │   └── captcha.rs          # Captcha verification
│   └── db/
│       ├── mod.rs
│       ├── users.rs            # User queries
│       ├── sessions.rs         # Session queries
│       ├── settings.rs         # Settings queries
│       ├── sources.rs          # Source queries
│       ├── syntheses.rs        # Synthesis queries
│       └── admin.rs            # Admin config queries
└── tests/
    ├── api/                    # Integration tests
    └── services/               # Unit tests for services
```

### 1.3 Layered Architecture

The application follows a clean 3-layer architecture:

- **Handlers** (HTTP layer): Extract request data, call services, return responses. No business logic.
- **Services** (Business layer): Orchestrate operations, enforce business rules, call DB and external APIs.
- **DB** (Persistence layer): Raw sqlx queries, mapping to/from model structs.

### 1.4 Error Handling

A unified `AppError` enum implements `IntoResponse`:

```rust
use axum::{http::StatusCode, response::IntoResponse, Json};
use serde_json::json;

#[derive(Debug)]
pub enum AppError {
    // Client errors
    BadRequest(String),
    Unauthorized(String),
    Forbidden(String),
    NotFound(String),
    Conflict(String),
    TooManyRequests { retry_after_secs: u64 },
    ValidationError(Vec<String>), // element type assumed: one message per failed check

    // Server errors
    Internal(anyhow::Error),
    LlmError(String),
    SmtpError(String),
    ScrapingError(String),
}

impl IntoResponse for AppError {
    fn into_response(self) -> axum::response::Response {
        let (status, message) = match &self {
            AppError::BadRequest(msg) => (StatusCode::BAD_REQUEST, msg.clone()),
            // Auth failures return a generic message; the inner detail is not exposed.
            AppError::Unauthorized(_) => (StatusCode::UNAUTHORIZED, "Unauthorized".into()),
            AppError::Forbidden(_) => (StatusCode::FORBIDDEN, "Forbidden".into()),
            AppError::NotFound(msg) => (StatusCode::NOT_FOUND, msg.clone()),
            AppError::TooManyRequests { retry_after_secs } => {
                // A `Retry-After` header should also be set on the response (omitted here).
                (StatusCode::TOO_MANY_REQUESTS, format!("Retry after {retry_after_secs}s"))
            }
            AppError::Internal(e) => {
                // Log the full error chain; return only a generic message to the client.
                tracing::error!("Internal error: {e:#}");
                (StatusCode::INTERNAL_SERVER_ERROR, "Internal server error".into())
            }
            // ...
        };
        (status, Json(json!({ "error": message }))).into_response()
    }
}
```

All handlers return `Result<T, AppError>`. The `?` operator propagates errors naturally.
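
The interplay between `?` and error conversion can be sketched with the standard library alone. This is a reduced stand-in, not the project's code: the three-variant `AppError`, the `status_code` helper, and `parse_page` are hypothetical, and `ParseIntError` stands in for a library error type.

```rust
use std::num::ParseIntError;

/// Reduced stand-in for the application error type (illustrative only).
#[derive(Debug)]
pub enum AppError {
    BadRequest(String),
    NotFound(String),
    Internal(String),
}

impl AppError {
    /// HTTP status code a response mapping would use for each variant.
    pub fn status_code(&self) -> u16 {
        match self {
            AppError::BadRequest(_) => 400,
            AppError::NotFound(_) => 404,
            AppError::Internal(_) => 500,
        }
    }
}

/// The conversion that lets `?` turn a library error into `AppError`.
impl From<ParseIntError> for AppError {
    fn from(e: ParseIntError) -> Self {
        AppError::BadRequest(format!("invalid number: {e}"))
    }
}

/// Hypothetical handler-style helper: parses a 1-based `page` query value.
pub fn parse_page(raw: &str) -> Result<u32, AppError> {
    // `?` calls `From<ParseIntError>::from` on the error path.
    let page: u32 = raw.trim().parse()?;
    if page == 0 {
        return Err(AppError::BadRequest("page starts at 1".into()));
    }
    Ok(page)
}
```

The handler body stays free of explicit `match` or `map_err` calls; each conversion is written once, next to the error type.
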
`From` implementations convert `sqlx::Error`, `reqwest::Error`, etc. into `AppError`.

### 1.5 SQLite with sqlx: Schema Design

All tables use TEXT primary keys (UUIDs generated by the backend) for portability. Timestamps are stored as `TEXT` in ISO 8601 format (SQLite has no native timestamp type; on Postgres the same values can be cast to `TIMESTAMPTZ`).

#### Migration 001: Users

```sql
CREATE TABLE users (
    id            TEXT PRIMARY KEY,              -- UUID
    email         TEXT NOT NULL UNIQUE,
    display_name  TEXT,
    role          TEXT NOT NULL DEFAULT 'user',  -- 'user' | 'admin'
    created_at    TEXT NOT NULL,                 -- ISO 8601
    updated_at    TEXT NOT NULL
);

CREATE INDEX idx_users_email ON users(email);
```

#### Migration 002: Sessions

```sql
CREATE TABLE sessions (
    id          TEXT PRIMARY KEY,  -- Secure random token (32 bytes, base64url)
    user_id     TEXT NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    created_at  TEXT NOT NULL,
    expires_at  TEXT NOT NULL,
    ip_address  TEXT,
    user_agent  TEXT
);

CREATE INDEX idx_sessions_user_id ON sessions(user_id);
CREATE INDEX idx_sessions_expires_at ON sessions(expires_at);
```

#### Migration 003: Settings

```sql
CREATE TABLE settings (
    user_id                 TEXT PRIMARY KEY REFERENCES users(id) ON DELETE CASCADE,
    theme                   TEXT NOT NULL DEFAULT 'Intelligence Artificielle',
    max_age_days            INTEGER NOT NULL DEFAULT 7,
    categories              TEXT NOT NULL,  -- JSON array stored as TEXT
    max_items_per_category  INTEGER NOT NULL DEFAULT 4,
    search_agent_behavior   TEXT NOT NULL DEFAULT '',
    ai_model                TEXT NOT NULL DEFAULT 'gemini-3.1-pro-preview',
    updated_at              TEXT NOT NULL
);
```

#### Migration 004: Sources

```sql
CREATE TABLE sources (
    id          TEXT PRIMARY KEY,
    user_id     TEXT NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    title       TEXT NOT NULL,
    url         TEXT NOT NULL,
    created_at  TEXT NOT NULL
);

CREATE INDEX idx_sources_user_id ON sources(user_id);
```

#### Migration 005: Syntheses

```sql
CREATE TABLE syntheses (
    id          TEXT PRIMARY KEY,
    user_id     TEXT NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    week        TEXT NOT NULL,  -- e.g. "2026-W12"
    sections    TEXT NOT NULL,  -- JSON: [{ title, items: [{ title, url, summary }] }]
    created_at  TEXT NOT NULL
);

CREATE INDEX idx_syntheses_user_id ON syntheses(user_id);
CREATE INDEX idx_syntheses_created_at ON syntheses(created_at);
```

#### Migration 006: Admin Config (LLM Providers)

```sql
CREATE TABLE llm_providers (
    id            TEXT PRIMARY KEY,
    provider      TEXT NOT NULL,  -- 'gemini' | 'openai' | 'anthropic'
    display_name  TEXT NOT NULL,
    api_key       TEXT NOT NULL,  -- Encrypted at rest (AES-256-GCM)
    base_url      TEXT,           -- Optional override for self-hosted/proxy
    models        TEXT NOT NULL,  -- JSON array of available model identifiers
    is_enabled    BOOLEAN NOT NULL DEFAULT TRUE,  -- TRUE/FALSE literals work on SQLite (3.23+) and Postgres
    created_at    TEXT NOT NULL,
    updated_at    TEXT NOT NULL,
    UNIQUE(provider)
);
```

#### Migration 007: Rate Limit Configuration

```sql
CREATE TABLE rate_limits (
    id              TEXT PRIMARY KEY,
    provider_id     TEXT NOT NULL REFERENCES llm_providers(id) ON DELETE CASCADE,
    max_requests    INTEGER NOT NULL DEFAULT 29,
    time_window_ms  INTEGER NOT NULL DEFAULT 60000,
    updated_at      TEXT NOT NULL,
    UNIQUE(provider_id)
);

-- Magic link rate limiting
CREATE TABLE magic_link_tokens (
    id          TEXT PRIMARY KEY,
    email       TEXT NOT NULL,
    token_hash  TEXT NOT NULL,  -- SHA-256 of the token
    created_at  TEXT NOT NULL,
    expires_at  TEXT NOT NULL,
    used        BOOLEAN NOT NULL DEFAULT FALSE
);

CREATE INDEX idx_magic_link_email ON magic_link_tokens(email);
```

### 1.6 SQLite/Postgres Dual Compatibility Strategy

**Recommendation: Use sqlx with runtime database selection via `sqlx::AnyPool`.** However, `AnyPool` has limitations (no compile-time query checking).
A more robust approach:

**Strategy: Feature-flag based conditional compilation.**

```toml
# Cargo.toml
[features]
default = ["sqlite"]
sqlite = ["sqlx/sqlite"]
postgres = ["sqlx/postgres"]
```

For this project, the SQL differences between SQLite and Postgres are minimal:

| Concern | SQLite | Postgres | Resolution |
|---|---|---|---|
| Auto-increment PK | `INTEGER PRIMARY KEY` | `SERIAL` | Use UUID TEXT PKs -- identical on both |
| Timestamps | `TEXT` (ISO 8601) | `TIMESTAMPTZ` | Store as TEXT on both; parse in application layer |
| JSON columns | `TEXT` + app-side JSON parse | `JSONB` | Store as TEXT on both; Postgres can migrate to JSONB later |
| Boolean | `INTEGER` (0/1) | `BOOLEAN` | Use `INTEGER` on SQLite, `BOOLEAN` on Postgres; sqlx handles mapping |
| RETURNING clause | Supported since SQLite 3.35 | Supported | Use `RETURNING` on both |

**Practical approach for v1**: Target SQLite only. Write SQL that is Postgres-compatible by design (UUID text PKs, ISO timestamps, no SQLite-specific functions). When the Postgres upgrade happens, create a parallel `migrations_pg/` folder and swap the connection pool. The query layer (`db/`) remains identical because all queries use standard SQL.

Compile-time checking is preserved by using the `sqlx::query!` and `sqlx::query_as!` macros with the `DATABASE_URL` environment variable pointing to an SQLite file during development.

---

## 2. API Design

### 2.1 REST API Endpoints

All endpoints are prefixed with `/api/v1`. Request and response bodies are JSON unless stated otherwise.

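
The endpoint tables in this section list paths relative to the `/api/v1` prefix and mark each route with one of three access levels in the Auth column (No, Yes, Admin). A minimal std-only sketch of how those two conventions might be modeled follows; the `Access` enum, `full_path`, and `is_allowed` are hypothetical names, and the role strings follow the `'user' | 'admin'` values of the `users.role` column.

```rust
/// Prefix shared by every endpoint in this section.
pub const API_PREFIX: &str = "/api/v1";

/// Access level of a route, mirroring the Auth column (No / Yes / Admin).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Access {
    Public,
    User,
    Admin,
}

/// Builds the full path for a route listed relative to the prefix.
pub fn full_path(path: &str) -> String {
    format!("{}{}", API_PREFIX, path)
}

/// Decides whether a caller may use a route.
/// `role` is `None` for anonymous callers, otherwise the stored role string.
pub fn is_allowed(required: Access, role: Option<&str>) -> bool {
    match (required, role) {
        // Public routes accept everyone, logged in or not.
        (Access::Public, _) => true,
        // Any authenticated user, whatever the role.
        (Access::User, Some(_)) => true,
        // Admin routes require the exact "admin" role.
        (Access::Admin, Some("admin")) => true,
        _ => false,
    }
}
```

For example, `full_path("/auth/me")` yields `/api/v1/auth/me`, and `is_allowed(Access::Admin, Some("user"))` is `false`.
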
#### Authentication

| Method | Path | Auth | Description |
|---|---|---|---|
| `POST` | `/auth/register` | No | Create account (email + captcha) |
| `POST` | `/auth/login` | No | Request magic link (email + captcha) |
| `GET` | `/auth/verify?token=...` | No | Verify magic link token, create session |
| `POST` | `/auth/logout` | Yes | Invalidate session |
| `GET` | `/auth/me` | Yes | Get current user info |

#### Syntheses

| Method | Path | Auth | Description |
|---|---|---|---|
| `GET` | `/syntheses` | Yes | List user's syntheses (paginated) |
| `GET` | `/syntheses/:id` | Yes | Get synthesis detail |
| `POST` | `/syntheses/generate` | Yes | Trigger generation (async, returns job ID) |
| `GET` | `/syntheses/generate/:job_id/status` | Yes | Poll generation status |
| `DELETE` | `/syntheses/:id` | Yes | Delete a synthesis |
| `POST` | `/syntheses/:id/email` | Yes | Send synthesis by email |

#### Sources

| Method | Path | Auth | Description |
|---|---|---|---|
| `GET` | `/sources` | Yes | List user's sources |
| `POST` | `/sources` | Yes | Add a source |
| `DELETE` | `/sources/:id` | Yes | Delete a source |
| `POST` | `/sources/bulk` | Yes | Bulk import (JSON array) |
| `POST` | `/sources/import-csv` | Yes | Import from CSV (multipart upload) |
| `GET` | `/sources/export-csv` | Yes | Export as CSV download |

#### Settings

| Method | Path | Auth | Description |
|---|---|---|---|
| `GET` | `/settings` | Yes | Get user's settings |
| `PUT` | `/settings` | Yes | Update settings |
| `GET` | `/settings/export` | Yes | Export as JSON download |
| `POST` | `/settings/import` | Yes | Import from JSON |

#### Admin

| Method | Path | Auth | Description |
|---|---|---|---|
| `GET` | `/admin/providers` | Admin | List LLM provider configs |
| `POST` | `/admin/providers` | Admin | Add/update provider config |
| `DELETE` | `/admin/providers/:id` | Admin | Remove provider |
| `GET` | `/admin/rate-limits` | Admin | Get rate limit configs |
| `PUT` | `/admin/rate-limits/:provider_id` | Admin | Update rate limit config |
| `GET` | `/admin/users` | Admin | List all users |
| `PUT` | `/admin/users/:id/role` | Admin | Change user role |

#### Public (for frontend config)

| Method | Path | Auth | Description |
|---|---|---|---|
| `GET` | `/config/providers` | Yes | List enabled providers + their model names (no API keys) |

### 2.2 Request/Response Shapes

**POST /auth/register**

```json
// Request
{
  "email": "user@example.com",
  "display_name": "Jane Doe",
  "captcha_token": "hcaptcha-response-token"
}

// Response 200
{
  "message": "A verification link has been sent to your email."
}
```

**POST /syntheses/generate**

```json
// Request (empty body -- uses user's saved settings and sources)
{}

// Response 202
{
  "job_id": "uuid-of-generation-job",
  "status": "pending"
}
```

**GET /syntheses/:id**

```json
// Response 200
{
  "id": "uuid",
  "week": "2026-W12",
  "created_at": "2026-03-21T10:30:00Z",
  "sections": [
    {
      "title": "Annonces majeures",
      "items": [
        {
          "title": "Article title",
          "url": "https://example.com/article",
          "summary": "4-5 line summary..."
        }
      ]
    }
  ]
}
```

**PUT /settings**

```json
// Request
{
  "theme": "Intelligence Artificielle",
  "max_age_days": 7,
  "categories": ["Annonces majeures", "Secteur financier"],
  "max_items_per_category": 4,
  "search_agent_behavior": "Custom instructions...",
  "ai_model": "gemini-3.1-pro-preview"
}

// Response 200
{
  "message": "Settings updated successfully."
}
```

**POST /admin/providers**

```json
// Request
{
  "provider": "openai",
  "display_name": "OpenAI GPT-4o",
  "api_key": "sk-...",
  "base_url": null,
  "models": ["gpt-4o", "gpt-4o-mini"],
  "is_enabled": true
}
```

### 2.3 Authentication Middleware

The auth middleware is a tower `Layer` that:

1. Extracts the session cookie (`ai_synth_session`) from the request.
2. Looks up the session ID in the `sessions` table.
3. Checks `expires_at` has not passed.
4. Loads the `User` from the `users` table.
5. Injects the `User` into request extensions (`request.extensions_mut().insert(user)`).
6. Handlers extract the user via `Extension<User>` or a custom `AuthUser` extractor.

For admin routes, an additional `RequireAdmin` layer checks `user.role == "admin"`.

**Session cookie configuration:**

```rust
Cookie::build(("ai_synth_session", session_id))
    .http_only(true)
    .secure(true) // HTTPS only
    .same_site(SameSite::Lax)
    .path("/")
    .max_age(Duration::days(30))
```

**CSRF Protection:**

Since this is an API consumed by a SPA on the same origin (or proxied), the combination of `SameSite=Lax` cookies and requiring a custom header (`X-Requested-With: XMLHttpRequest`) on mutating requests provides sufficient CSRF protection. This is the "custom header" pattern -- browsers will not send custom headers on cross-origin requests without CORS preflight approval.

For the SPA, every `fetch` call to the API includes:

```javascript
headers: { "X-Requested-With": "XMLHttpRequest" }
```

The CSRF middleware rejects `POST`/`PUT`/`DELETE` requests missing this header.

---

## 3. LLM Provider Abstraction

### 3.1 Trait Design

```rust
use async_trait::async_trait;

#[async_trait]
pub trait LlmProvider: Send + Sync {
    /// Returns the provider identifier (e.g., "gemini", "openai", "anthropic").
    fn provider_id(&self) -> &str;

    /// Pass 1: Search the web and generate structured news items.
    /// Returns raw JSON matching the category schema.
    async fn generate_search_pass(
        &self,
        model: &str,
        system_prompt: &str,
        user_prompt: &str,
        response_schema: &serde_json::Value,
    ) -> Result<serde_json::Value, AppError>;

    /// Pass 2: Rewrite titles and summaries based on scraped content.
    /// No web search tool needed.
    async fn generate_rewrite_pass(
        &self,
        model: &str,
        system_prompt: &str,
        user_prompt: &str,
        response_schema: &serde_json::Value,
    ) -> Result<serde_json::Value, AppError>;

    /// Lists available models for this provider.
    fn available_models(&self) -> &[String];
}
```

### 3.2 Provider-Specific Web Search Handling

Each provider handles web grounding differently.
The trait design abstracts this:

| Provider | Pass 1 (Search) | Pass 2 (Rewrite) |
|---|---|---|
| **Gemini** | Uses `googleSearch` tool in config. Structured output via `responseSchema`. | Standard generation, no tools. `responseSchema` for structured output. |
| **OpenAI** | Uses `web_search` tool (Responses API) or a two-step approach: a first call with web search, then a second call for structured output. | Standard chat completion with `response_format: { type: "json_schema", ... }`. |
| **Anthropic** | Uses `web_search` tool (available on Claude models). Structured output via tool-use pattern or explicit JSON instructions. | Standard message with JSON output instructions. Anthropic does not have native JSON schema enforcement, so the prompt includes the schema and parsing is done server-side with validation. |

**Implementation details for each provider:**

```rust
// Gemini implementation
pub struct GeminiProvider {
    client: reqwest::Client,
    api_key: String,
    base_url: String,
    models: Vec<String>,
}

impl GeminiProvider {
    async fn generate_search_pass(&self, model: &str, ...) -> Result<serde_json::Value, AppError> {
        // POST to /v1beta/models/{model}:generateContent
        // Config includes: tools: [{ googleSearch: {} }]
        //                  responseMimeType: "application/json"
        //                  responseSchema: <schema>
    }
}

// OpenAI implementation
pub struct OpenAiProvider {
    client: reqwest::Client,
    api_key: String,
    base_url: String, // default: https://api.openai.com/v1
    models: Vec<String>,
}

// Anthropic implementation
pub struct AnthropicProvider {
    client: reqwest::Client,
    api_key: String,
    base_url: String, // default: https://api.anthropic.com
    models: Vec<String>,
}
```

### 3.3 Provider Factory

```rust
pub fn create_provider(config: &LlmProviderConfig) -> Result<Box<dyn LlmProvider>, AppError> {
    match config.provider.as_str() {
        "gemini" => Ok(Box::new(GeminiProvider::new(
            config.api_key.clone(),
            config.base_url.clone(),
            config.models.clone(),
        ))),
        "openai" => Ok(Box::new(OpenAiProvider::new(...))),
        "anthropic" => Ok(Box::new(AnthropicProvider::new(...))),
        _ => Err(AppError::BadRequest(format!("Unknown provider: {}", config.provider))),
    }
}
```

### 3.4 Rate Limiter Design

The rate limiter is a server-side, per-provider, in-memory sliding-window limiter (it keeps a log of recent request timestamps) with configuration stored in the database.

```rust
use std::{collections::VecDeque, sync::Arc, time::{Duration, Instant}};
use dashmap::DashMap;

pub struct RateLimiter {
    state: Arc<DashMap<String, ProviderBucket>>,
}

struct ProviderBucket {
    timestamps: VecDeque<Instant>,
    max_requests: u32,
    time_window: Duration,
}

impl RateLimiter {
    /// Waits until a slot is available for the given provider.
    pub async fn acquire(&self, provider_id: &str) -> Result<(), AppError> {
        loop {
            // Scope the DashMap guard so it is released before sleeping.
            let wait_time = {
                let mut bucket = self.state
                    .entry(provider_id.to_string())
                    .or_insert_with(|| self.default_bucket());
                let window = bucket.time_window;

                // Drop timestamps that have left the window.
                bucket.timestamps.retain(|t| t.elapsed() < window);
                if bucket.timestamps.len() < bucket.max_requests as usize {
                    bucket.timestamps.push_back(Instant::now());
                    return Ok(());
                }

                // Time until the oldest request leaves the window (never negative).
                let oldest = bucket.timestamps.front().map(|t| t.elapsed()).unwrap_or_default();
                window.saturating_sub(oldest)
            };
            tokio::time::sleep(wait_time).await;
        }
    }

    /// Reload configuration from DB (called by admin update endpoint).
    pub async fn reload_config(&self, pool: &SqlitePool) -> Result<(), AppError> {
        // Fetch rate_limits table, update each ProviderBucket
    }
}
```

The rate limiter lives in `AppState` and is shared across all requests. When an admin updates rate limit configuration, `reload_config` is called to hot-reload without restart.

### 3.5 Two-Pass Generation Pipeline

The `SynthesisService` orchestrates the full pipeline:

```rust
pub struct SynthesisService;

impl SynthesisService {
    pub async fn generate(
        state: &AppState,
        user_id: &str,
    ) -> Result<Synthesis, AppError> {
        let pool = &state.pool;

        // 1. Load user settings
        let settings = db::settings::get(pool, user_id).await?;

        // 2. Load user sources
        let sources = db::sources::list(pool, user_id).await?;

        // 3. Resolve LLM provider + model
        let (provider, model) = resolve_provider(state, &settings.ai_model).await?;

        // 4. Build dynamic schema from categories
        let schema = build_category_schema(&settings.categories);

        // Prompt construction (`system_prompt`, `user_prompt`, the rewrite prompts)
        // and `week_string` are derived from settings, sources, and scraped content;
        // omitted here.

        // 5. Rate limit: acquire slot
        state.rate_limiter.acquire(provider.provider_id()).await?;

        // 6. Pass 1: Search
        let raw_results = provider.generate_search_pass(
            &model, &system_prompt, &user_prompt, &schema
        ).await?;

        // 7. Validate & scrape URLs (server-side, no CORS issues)
        let scraped = scraper::validate_and_scrape(
            &state.http_client,
            raw_results,
            settings.max_age_days,
        ).await;

        // 8. Rate limit: acquire slot for pass 2
        state.rate_limiter.acquire(provider.provider_id()).await?;

        // 9. Pass 2: Rewrite with scraped content
        let final_results = provider.generate_rewrite_pass(
            &model, &rewrite_system_prompt, &rewrite_prompt, &schema
        ).await?;

        // 10. Persist
        let synthesis = db::syntheses::create(
            pool, user_id, &week_string, &final_results
        ).await?;

        Ok(synthesis)
    }
}
```

### 3.6 Asynchronous Generation

Synthesis generation can take 30-90 seconds. Two options:

**Option A: Synchronous with long timeout.** Simple, but ties up a connection. Acceptable for low-traffic deployments.

**Option B (Recommended): Background task with polling.** The `POST /syntheses/generate` endpoint spawns a tokio task and returns a job ID. The frontend polls `GET /syntheses/generate/:job_id/status`. Job state is kept in an in-memory `DashMap` (not in DB, since jobs are ephemeral).

```rust
enum JobStatus {
    Pending,
    InProgress { step: String }, // "search", "scraping", "rewriting"
    Completed { synthesis_id: String },
    Failed { error: String },
}
```

The frontend polls every 3-5 seconds with the same loading UX as the current React app.

---

## 4. URL Scraping / Validation

### 4.1 CORS Elimination

Moving scraping to the backend **completely eliminates CORS issues**. The Rust backend makes direct HTTP requests to target URLs -- no proxies needed. This is the single biggest reliability improvement in the refactoring.

### 4.2 reqwest-Based HTTP Client

```rust
use std::time::Duration;

let client = reqwest::Client::builder()
    .user_agent("Mozilla/5.0 (compatible; AISynthBot/1.0; +https://your-domain.com/bot)")
    .timeout(Duration::from_secs(15))          // total request timeout
    .redirect(reqwest::redirect::Policy::limited(5))
    .connect_timeout(Duration::from_secs(5))
    .danger_accept_invalid_certs(false)        // keep TLS verification on
    .build()?;
```

The HTTP client is created once in `AppState` and reused across all requests (connection pooling).

### 4.3 HTML Parsing with `scraper` Crate

The current app uses the browser's `DOMParser`. The Rust equivalent uses the `scraper` crate (built on `html5ever`):

```rust
use reqwest::Url;
use scraper::{Html, Selector};

// `ScrapedItem` is a placeholder name for the validated result type.
pub async fn validate_and_scrape(
    client: &reqwest::Client,
    items: Vec<RawNewsItem>,
    max_age_days: i64,
) -> Vec<ScrapedItem> {
    // Scrape all items concurrently; items that fail validation are dropped.
    let futures = items.into_iter().map(|item| {
        let client = client.clone();
        async move { scrape_single(&client, item, max_age_days).await }
    });
    let results = futures::future::join_all(futures).await;
    results.into_iter().flatten().collect()
}

async fn scrape_single(
    client: &reqwest::Client,
    item: RawNewsItem,
    max_age_days: i64,
) -> Option<ScrapedItem> {
    // 1. Validate URL format
    let url = Url::parse(&item.url).ok()?;

    // 2. Fetch
    let resp = client.get(url).send().await.ok()?;
    if !resp.status().is_success() {
        return None;
    }
    let html_text = resp.text().await.ok()?;

    // 3. Parse HTML
    let document = Html::parse_document(&html_text);

    // 4. Soft-404 detection
    let title_sel = Selector::parse("title").unwrap();
    let h1_sel = Selector::parse("h1").unwrap();
    let title_text = document.select(&title_sel).next()
        .map(|el| el.text().collect::<String>().to_lowercase())
        .unwrap_or_default();
    let h1_text = document.select(&h1_sel).next()
        .map(|el| el.text().collect::<String>().to_lowercase())
        .unwrap_or_default();
    let error_keywords = [
        "page not found", "404", "403", "access denied",
        "forbidden", "not found", "introuvable",
    ];
    if error_keywords.iter().any(|kw| title_text.contains(kw) || h1_text.contains(kw)) {
        return None;
    }

    // 5. Date extraction (meta tags, JSON-LD,