91 Commits (e74a1850bf8257409761b694d82d398b1bc918b4)

Author SHA1 Message Date
oabrivard ba7024e280 feat: add classification response parsing with category filling and Autre fallback
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 104b6a0d7b feat: add classification prompt and schema for article categorization
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard c06b5ba454 feat: add source_scraper module for extracting article links from source pages
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 45e5ee8a7d fix: rewrite pass schema uses actual scraped item counts, not max setting
The rewrite pass shared the search pass schema which enforced minItems/maxItems
equal to max_items_per_category. After filter_empty_scraped_articles removed
old/failed articles, the scraped data had fewer items than the schema required,
causing the LLM to duplicate content to fill the quota.

Now build_rewrite_schema counts actual items per category from the scraped data
and sets minItems/maxItems accordingly. Also removed dead domain_counts variable.
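
A minimal sketch of the counting step described above, assuming the scraped data is a map from category name to its surviving items. The `ArrayBounds` struct and the `rewrite_item_bounds` name are hypothetical stand-ins; the real build_rewrite_schema is not shown in this log and presumably emits a JSON schema rather than a plain struct.

```rust
use std::collections::BTreeMap;

/// Hypothetical holder for the minItems/maxItems pair of one category array.
#[derive(Debug, PartialEq)]
pub struct ArrayBounds {
    pub min_items: usize,
    pub max_items: usize,
}

/// Derives per-category bounds from the number of items that actually
/// survived scraping, instead of from the max_items_per_category setting.
pub fn rewrite_item_bounds(
    scraped: &BTreeMap<String, Vec<String>>,
) -> BTreeMap<String, ArrayBounds> {
    scraped
        .iter()
        .map(|(category, items)| {
            let count = items.len();
            (
                category.clone(),
                ArrayBounds { min_items: count, max_items: count },
            )
        })
        .collect()
}
```

Pinning both bounds to the real count leaves the LLM no quota to fill with duplicated content.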

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 13894a8f50 fix: filter empty scraped articles + restore URLs after rewrite + E2E assertions
- filter_empty_scraped_articles: removes articles with empty scraped content
  (too old, soft 404, scrape failure) before the rewrite pass, preventing
  empty articles in the final synthesis
- restore_scraped_urls: already existed, now has unit tests
- E2E test: added assertions for no Wikipedia URLs, no empty summaries,
  and updated settings payload with new fields (max_articles_per_source,
  source_diversity_window)
- 4 new unit tests for filter_empty + restore_scraped_urls

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard a9be1ce435 fix: restore scraped URLs after LLM rewrite pass to prevent hallucination
The rewrite pass can replace validated URLs with hallucinated ones (Wikipedia,
corporate sites) despite being instructed to preserve them. After the rewrite,
restore_scraped_urls() replaces each article's URL with the original scraped
URL by matching on position (category + item index). Logs when a URL is
restored so hallucination patterns can be monitored.
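
The positional matching described above could look roughly like the following. This is a sketch under assumptions: the `Article` struct, the `Synthesis` alias, and the return value are illustrative, and logging goes to stderr here rather than through the project's logger.

```rust
use std::collections::BTreeMap;

/// Hypothetical article shape; the real struct is not shown in this log.
#[derive(Debug, Clone, PartialEq)]
pub struct Article {
    pub title: String,
    pub url: String,
}

/// Category name mapped to its ordered articles (assumed layout).
pub type Synthesis = BTreeMap<String, Vec<Article>>;

/// Overwrites each rewritten URL with the scraped URL found at the same
/// (category, item index) position. Returns how many URLs were restored.
pub fn restore_scraped_urls(rewritten: &mut Synthesis, scraped: &Synthesis) -> usize {
    let mut restored = 0;
    for (category, articles) in rewritten.iter_mut() {
        // Categories missing from the scraped data are left untouched.
        let Some(originals) = scraped.get(category) else {
            continue;
        };
        for (article, original) in articles.iter_mut().zip(originals) {
            if article.url != original.url {
                eprintln!(
                    "restored URL in {category}: {} -> {}",
                    article.url, original.url
                );
                article.url = original.url.clone();
                restored += 1;
            }
        }
    }
    restored
}
```

Matching on position only works if the rewrite pass keeps the item order and count of the scraped data.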

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 8a18b70aff fix: set max output tokens to 16384 for all LLM providers
OpenAI's default output limit (4096 tokens) was too low for structured
synthesis output with multiple categories and articles per category,
causing truncated JSON. Set 16384 for both OpenAI APIs (Responses +
Chat Completions) and Gemini. Anthropic already had 16384.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 55c2b050b3 feat: extract recent domains and pass to search prompt for diversity
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 3f6ad9853c feat: build_search_prompt accepts recent_domains for source diversity
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard a31915d3ce feat: add source_diversity_window setting (migration + model + DB + validation tests)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard b558619d10 feat: source diversity limit + URL deduplication in generation pipeline
- Add max_articles_per_source setting (default 3, range 1-10) with migration,
  backend model, DB queries, and frontend number input
- Add limit_articles_per_source filter: spreads articles across categories
  (1 per domain per category first), then fills remaining slots up to the limit
- Add dedup_by_url filter: removes duplicate URLs across categories (case-insensitive)
- Pipeline order: parse → filter_homepage → dedup_by_url → limit_per_source → scrape
- 10 new unit tests covering spread, cap enforcement, dedup, and edge cases
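
As an illustration of the dedup_by_url step listed above, here is a minimal sketch that assumes the parsed result is an ordered list of (category, URLs) pairs. The data layout and the returned count are assumptions, not the project's actual types.

```rust
use std::collections::HashSet;

/// Removes URLs already seen in this or an earlier category, comparing
/// case-insensitively and keeping the first occurrence.
/// Returns the number of duplicates removed.
pub fn dedup_by_url(categories: &mut [(String, Vec<String>)]) -> usize {
    let mut seen: HashSet<String> = HashSet::new();
    let mut removed = 0;
    for (_category, urls) in categories.iter_mut() {
        urls.retain(|url| {
            // insert() returns false when the lowercased URL was already present.
            let first_time = seen.insert(url.to_lowercase());
            if !first_time {
                removed += 1;
            }
            first_time
        });
    }
    removed
}
```

Running this before limit_per_source, as in the pipeline order above, means the per-source cap counts only distinct URLs.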

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 6819c7193c feat: add limit_articles_per_source filter with unit tests
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard c1ee79bcf6 feat: add max_articles_per_source setting (migration + model + DB)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard a3f4c3b42f fix: always run scrape+rewrite pass to prevent hallucinated URLs
The adaptive pipeline skipped the scrape+rewrite pass when the LLM's search
results had URLs starting with "http". But LLMs hallucinate plausible URLs
(Wikipedia, corporate sites) that pass the http check but aren't actual source
articles. The scrape pass catches these by fetching each URL and validating
the content exists. Always running the full pipeline ensures URL integrity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 45c9e71589 fix: enforce max_items_per_category in JSON schema and prompt
The LLM was returning only 1 article per category despite the user setting 4.
- Added minItems/maxItems to the category array schema (enforced by OpenAI strict mode)
- Changed prompt from "au maximum N actualites" to "exactement N actualites"
- Schema builder now takes max_items_per_category parameter

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 0b0702de39 fix: strip null bytes from LLM output before saving to PostgreSQL JSONB
LLM output occasionally contains \u0000 null bytes (e.g., "annonc\u0000...")
which PostgreSQL rejects in JSONB columns. Added sanitize_json_null_bytes()
that recursively strips null bytes from all string values before DB insert.
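
A sketch of the recursive stripping described above. To stay dependency-free it uses a small hypothetical `Json` enum in place of whatever JSON value type the project actually uses, and it leaves object keys untouched because the commit only mentions string values.

```rust
/// Hypothetical stand-in for a JSON value type.
#[derive(Debug, Clone, PartialEq)]
pub enum Json {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    Array(Vec<Json>),
    Object(Vec<(String, Json)>),
}

/// Recursively removes U+0000 from every string value in place.
pub fn sanitize_json_null_bytes(value: &mut Json) {
    match value {
        Json::String(s) => s.retain(|c| c != '\0'),
        Json::Array(items) => items.iter_mut().for_each(sanitize_json_null_bytes),
        Json::Object(fields) => fields
            .iter_mut()
            .for_each(|(_key, v)| sanitize_json_null_bytes(v)),
        // Scalars without text content need no change.
        Json::Null | Json::Bool(_) | Json::Number(_) => {}
    }
}
```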

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 3fe667591d fix: LLM providers use own HTTP client with 120s timeout (was sharing scraper's 15s)
The scraper client (build_scraper_client) has a 15s timeout appropriate for web
scraping, but LLM API calls — especially with web search — take 30-60s. LLM
providers now build their own reqwest client with 120s timeout via build_llm_client().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 6fe75d77e7 feat: add source file:line to WARN and ERROR log lines
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 004f08f385 fix: runtime bugs found during first Docker run + integration tests
Bugs fixed:
- resolve_model queried non-existent admin_provider_models table (use JSONB query on admin_providers)
- key_prefix VARCHAR(10) too short for 11-char prefix (migration to VARCHAR(12))
- API key test schema missing additionalProperties: false (OpenAI strict mode)
- CSP missing font-src data: directive (PDF font embedding blocked)
- Magic link URL not logged in test mode (can't verify without real email)
- Rust 1.85 Docker image too old for dependencies (bumped to 1.88)

Tests added to prevent recurrence:
- schema_meets_openai_strict_mode_requirements: validates additionalProperties on all objects
- key_prefix_full_length_stored_in_db: verifies 11-char prefix survives DB round-trip
- generate_pipeline_resolves_model_from_admin_config: exercises full generation pipeline

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 069a4f2022 feat: graceful shutdown and frontend build in Docker
- Add SIGTERM/Ctrl+C signal handling with graceful connection draining
- Close database pool cleanly on shutdown
- Add frontend-builder stage to Dockerfile (node:22-alpine, npm ci + build)
- Move Docker build context to project root so both frontend/ and backend/ are accessible
- Frontend dist/ copied into container at ./static/ for the backend to serve

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard b961f82f01 refactor: add UserRateLimitEntry constructor and settings_changed method
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard c1f2f1456f refactor: simplify recent changes — extract helper, named struct, atomic entry, pre-alloc
- Extract auth::create_and_send_magic_link() to deduplicate token rollback logic
- Replace (i32, i32, RateLimiter) tuple with named UserRateLimitEntry struct
- Use DashMap entry API for atomic rate limiter lookup (fixes TOCTOU race)
- Pre-allocate scraper body Vec from Content-Length when available

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 54d54f2a06 fix: architect assessment remediation — 6 issues across backend, frontend, and infra
- Wire hardened scraper client into runtime (SSRF redirect validation was defined but unused)
- Stream scraper body with per-chunk size limit instead of post-download check (DoS/OOM)
- Persist user rate-limit overrides across generation jobs via AppState DashMap
- Roll back magic-link token on email send failure to prevent quota exhaustion
- Fix API error UX: prefer human message over machine error code in frontend
- Unwrap GET /syntheses { items } wrapper in frontend API layer (contract mismatch)
- Bind Postgres to localhost in docker-compose (was exposed on all interfaces)
- Fix CLAUDE.md: runtime queries not compile-time, 10 migrations not 9

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard ae01bc8e62 security: SSRF redirect validation per hop with custom reqwest policy
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard a4e618feda test: add unit tests for auth middleware cookie extraction
Extract cookie parsing into a standalone `extract_session_token` function
and add 5 unit tests covering the valid, missing, multi-cookie, whitespace,
and empty-header cases.
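
For illustration, a standalone cookie parser along these lines might look as follows. The cookie name `session` and the exact signature are assumptions; only the function name comes from the commit message.

```rust
/// Assumed cookie name; the real one is not stated in this log.
const SESSION_COOKIE: &str = "session";

/// Returns the session token from a raw Cookie header value, or None when
/// the cookie is absent or its value is empty.
pub fn extract_session_token(cookie_header: &str) -> Option<&str> {
    cookie_header
        .split(';')
        // Skip fragments that are not name=value pairs.
        .filter_map(|pair| pair.trim().split_once('='))
        .find(|(name, _)| name.trim() == SESSION_COOKIE)
        .map(|(_, value)| value.trim())
        .filter(|value| !value.is_empty())
}
```

Keeping the parser free of any request type is what makes the five cases listed above testable with plain strings.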

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 98528f51bd Fix rate limiter bug, simplify v2 code
Bug fix:
- Per-generation rate limiter was creating a new instance on every check,
  making user rate limit overrides non-functional. Fixed by creating the
  limiter once at pipeline start and reusing for both passes.

Simplifications:
- Extract spawn_task closure in scrape_articles (deduplicate spawn blocks)
- Use idiomatic if let Ok(...) instead of if let Some(..).ok() in scraper
- Replace manual loop with iterator chain in export_keys handler
- Simplify check_rate_limit to single boolean check
- Simplify handleImport settings merge (spread already provides defaults)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 0f66c28c38 v2: empty sections fallback in email template
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 7eb24cfd9a v2: API key export endpoint (POST, rate-limited)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 191e1c716b v2: enhanced scraper - title priority chain, broken link detection, noindex
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 9b994e0528 v2: pipeline user model selection, rate limiter, URL filter, original title, null-safe sections
- resolve_provider_and_key() now respects user ai_provider preference
- Dual model resolution: ai_model for search pass, ai_model_writing for rewrite pass
- Per-generation rate limiter with user override support
- Homepage URL filter removes domain-only URLs after search pass
- ScrapedNewsItem gains original_title field populated from page <title>
- SynthesisResponse::try_from handles null sections gracefully (returns empty vec)
- Search prompt warns LLM against returning homepage URLs
- Rewrite prompt instructs LLM to use originalTitle with language preservation rules
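
The homepage URL filter in the list above can be approximated with plain string handling. This sketch assumes "domain-only" means an empty or slash-only path once the query and fragment are ignored; the function name `is_homepage_url` is hypothetical and the project's actual rule may differ.

```rust
/// Returns true when the URL has no path beyond the domain root.
pub fn is_homepage_url(url: &str) -> bool {
    // Drop the scheme if one is present.
    let without_scheme = url.split_once("://").map_or(url, |(_, rest)| rest);
    // Ignore the query string and fragment.
    let end = without_scheme
        .find(|c: char| c == '?' || c == '#')
        .unwrap_or(without_scheme.len());
    let host_and_path = &without_scheme[..end];
    match host_and_path.split_once('/') {
        // No slash after the host: bare domain.
        None => true,
        // Only slashes after the host still count as the homepage.
        Some((_host, path)) => path.trim_matches('/').is_empty(),
    }
}
```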

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard ed6b41fe52 v2: add settings migration, model expansion, DB queries (provider, models, rate limits)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 04819aa926 Simplify code: deduplicate patterns, fix captcha field name bug
Bug fix:
- Fix frontend sending captcha_token instead of turnstile_token in
  login/register requests (would cause 422 errors on auth)

Backend simplifications:
- Deduplicate VALID_PROVIDERS constant (provider.rs is now the single source)
- Extract validate_display_name/validate_models helpers in provider model
- Add From<UserSettings> for SettingsResponse, From<User> for AdminUserResponse
- Consolidate Resend API call pattern into shared send_via_resend()
- Extract do_bulk_import() for sources bulk/CSV import
- Use idiomatic range.contains() for rate limit validation

Frontend simplifications:
- Consolidate file download logic (exportCsv reuses shared fetchFile/triggerDownload)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 1f9f7f39d7 Phase 7: Email sending via Resend + Markdown/PDF export
Backend:
- Synthesis email sending via Resend API with HTML template (inline CSS,
  tables-based for email client compatibility) + plain-text fallback
- XSS prevention via html_escape() on all user content in email templates
- Markdown export: clean format with headers, links, summaries
- PDF export: printpdf with built-in Helvetica fonts, indigo color scheme,
  automatic page breaks, word wrapping
- 3 new endpoints: send-email, export/markdown, export/pdf
- All endpoints enforce ownership checks
- Email validation using email_address crate
- 24 new unit tests, 13 integration tests

Frontend:
- Email section on SynthesisDetail: input pre-filled with user email,
  send button with loading state, success/error feedback
- Export buttons: Markdown + PDF with per-button loading states
- File download via Blob + temporary anchor with Content-Disposition parsing
- 6 new export tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 631bd43b9f Phase 6: Multi-provider support with OpenAI and Anthropic
Backend:
- OpenAiProvider: Responses API with web_search_preview (pass 1),
  Chat Completions with json_schema structured output (pass 2)
- AnthropicProvider: Messages API with web_search tool (pass 1),
  schema-in-prompt for structured output, code fence stripping (pass 2)
- Pipeline adaptation: skip scrape+rewrite when >70% of search URLs are valid
- Provider factory updated for all three providers
- Error sanitization extended for Anthropic key patterns (sk-ant-)
- 44 new unit tests (OpenAI, Anthropic, factory, pipeline heuristic)

Frontend:
- Provider-specific info text below model selection
- Web search support badges (green/gray)
- Generate page shows selected provider and model
- Warning banner when provider lacks web search
- Provider utility module with 10 tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard aa6f1ba76b Phase 5: Generation pipeline with SSE progress, syntheses CRUD
Backend:
- Full 2-pass generation pipeline: LLM search -> URL scraping -> LLM rewrite
- Async generation with tokio::spawn, JobStore with per-user concurrency limit
- SSE progress streaming via axum::response::Sse + tokio::sync::watch
- Syntheses CRUD: list (paginated), get (ownership check), delete
- Prompt construction ported from original geminiService.ts
- Parallel URL scraping with bounded concurrency (max 10)
- Graceful partial failure handling (some URLs fail -> continue)
- 36 new unit tests, 16 integration tests

Frontend:
- Home dashboard: synthesis card grid, week badges, delete with confirmation
- Generate page: SSE-driven progress bar, step checklist, auto-redirect
- Synthesis detail: section-by-section display, external links, delete
- SSE client helper with auto-reconnect (exponential backoff)
- Date utilities with French locale formatting

Critical fixes applied:
- SSE EventSource now sends credentials (withCredentials: true)
- Gemini error logging sanitized to prevent API key leak in logs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 439e547367 Phase 4: LLM provider abstraction with Gemini, user API key encryption
Backend:
- LlmProvider async trait with generate_search_pass/generate_rewrite_pass
- GeminiProvider: googleSearch grounding (pass 1), structured JSON output (pass 2)
- AES-256-GCM encryption for user API keys at rest (per-key random nonces)
- MasterKey with zeroize-on-drop (no Clone to prevent unzeroized copies)
- User API key endpoints: list (prefix only), create/update, delete, test
- Dynamic category schema builder for structured LLM output
- Provider factory (Gemini implemented, OpenAI/Anthropic stubbed for Phase 6)
- 37 new unit tests (encryption, schema, Gemini serialization, factory)
- 17 integration tests (CRUD, encryption verification, ownership isolation)

Frontend:
- ApiKeyManager component: per-provider key management in Settings
- Password input with show/hide toggle, key prefix display (monospace)
- Test button validates key with minimal LLM call
- Status badges (configured/not configured)
- 11 new tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 5abbf9b9ad Phase 3: Admin module with provider/model curation, rate limits, user management
Backend:
- Admin API: CRUD for providers, rate limits, user role management
- Public config endpoint for enabled providers/models
- AdminUser extractor enforces RBAC on all admin endpoints
- Per-provider rate limiter with hot-reload from DB
- Audit logging for all admin mutations
- Seed data: Gemini, OpenAI, Anthropic providers with default models
- Self-demotion prevention on role changes
- 30 integration tests, 27 new unit tests

Frontend:
- Admin layout with sidebar navigation (providers, rate limits, users)
- Provider management: enable/disable, model CRUD, default model selection
- Rate limit configuration with effective rate display
- User management with role badges and promote/demote
- Admin link in navbar/mobile menu (visible only to admins)
- Settings page: dynamic provider/model selection from admin config
- 10 new tests (admin guard, config API)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 22ff026a4c Fix Phase 2 critical issues: SSRF IPv6 gaps, body text filtering, CSV validation
- Fix body text extraction to actually filter excluded elements (script,
  nav, footer, aside, etc.) using node ID tracking instead of unused HashSet
- Add IPv6 reserved range checks to SSRF prevention: ULA (fc00::/7),
  documentation (2001:db8::/32), discard prefix (100::/64)
- Add errors field to frontend BulkImportResponse type
- Validate Content-Type on CSV multipart upload (reject non-text files)
- Add 6 new unit tests for the fixes
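
The three IPv6 ranges named above can be checked with the standard library alone. This is a sketch covering only those three prefixes; the function name is hypothetical, and the project's full SSRF check presumably also rejects loopback, link-local, and other ranges not listed in this commit.

```rust
use std::net::Ipv6Addr;

/// Returns true for addresses in fc00::/7, 2001:db8::/32, or 100::/64.
pub fn is_reserved_ipv6(addr: Ipv6Addr) -> bool {
    let seg = addr.segments();
    // Unique local addresses: the top 7 bits are 1111110.
    let is_ula = (seg[0] & 0xfe00) == 0xfc00;
    // Documentation prefix: the first 32 bits are 2001:0db8.
    let is_documentation = seg[0] == 0x2001 && seg[1] == 0x0db8;
    // Discard-only prefix: the first 64 bits are 0100:0000:0000:0000.
    let is_discard = seg[0] == 0x0100 && seg[1] == 0 && seg[2] == 0 && seg[3] == 0;
    is_ula || is_documentation || is_discard
}
```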

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 2b75dc7049 Finished phase 2
3 months ago
oabrivard a36e3732bf Fixed critical problems from phase 1
3 months ago
oabrivard 355dbf6a5a Finished phase 1
3 months ago