137 Commits (0963559e0ff4e336bf512a44fb6ab9399230d4ab)

Author SHA1 Message Date
oabrivard 9a734f136e fix: resolve all clippy warnings (0 remaining)
- db/themes: pass CreateThemeRequest/UpdateThemeRequest structs instead
  of 8-9 individual parameters
- llm/mock: add Default impl for MockLlmProvider
- middleware/auth: suppress manual_async_fn (Axum extractor constraint)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2 months ago
oabrivard 3d790e7ce7 feat: extract article URLs from JSON-LD structured data in source pages
Many modern sites (Hugo, WordPress, Next.js) load articles via JavaScript
but include full article URLs in JSON-LD schema.org markup in the <head>.
The scraper now extracts these first (highest quality), then falls back
to <a href> heuristic extraction. Supports ItemList, BlogPosting,
NewsArticle, @graph arrays, and mainEntity wrappers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2 months ago
oabrivard 9a310bbf19 feat: add French/European/US date formats + remove "Articles sans date" category
Date parser now supports: 25/03/2026, 25-03-2026, March 25 2026,
25 mars 2026, 15 février 2026, and short month variants.

Articles without dates are no longer routed to a separate category —
they stay in their LLM-classified category with date shown as empty.
This prevents losing good articles in a catch-all section.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2 months ago
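
Note on 9a310bbf19: the numeric and French day-month-year formats listed in that message could be handled along the lines below. This is a minimal std-only sketch, not the project's parser: `french_month` and `parse_date` are hypothetical names, the US form (March 25 2026) and short month variants are left out, and fields are only range-checked, not validated against the calendar.

```rust
/// Hypothetical helper: maps a full French month name to its number (1-12).
fn french_month(name: &str) -> Option<u32> {
    let months = [
        "janvier", "février", "mars", "avril", "mai", "juin",
        "juillet", "août", "septembre", "octobre", "novembre", "décembre",
    ];
    let lower = name.to_lowercase();
    months.iter().position(|m| *m == lower).map(|i| i as u32 + 1)
}

/// Parses `DD/MM/YYYY`, `DD-MM-YYYY`, or `D <French month> YYYY`
/// into `(year, month, day)`. Returns `None` for anything else.
fn parse_date(input: &str) -> Option<(i32, u32, u32)> {
    // Split on the separators used by the three supported shapes.
    let parts: Vec<&str> = input
        .trim()
        .split(|c: char| c == '/' || c == '-' || c.is_whitespace())
        .filter(|p| !p.is_empty())
        .collect();
    if parts.len() != 3 {
        return None;
    }
    let day: u32 = parts[0].parse().ok()?;
    // The middle field is either a number or a French month name.
    let month: u32 = match parts[1].parse::<u32>() {
        Ok(m) => m,
        Err(_) => french_month(parts[1])?,
    };
    let year: i32 = parts[2].parse().ok()?;
    if (1..=31).contains(&day) && (1..=12).contains(&month) {
        Some((year, month, day))
    } else {
        None
    }
}
```

Returning `None` for unrecognised input fits the behaviour described in the same commit, where an article without a date keeps its classified category and shows an empty date.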
oabrivard f0b60f3f13 fix: return 204 No Content from preferred sources endpoint
The API client expects empty responses to use 204, not 200.
Returning 200 with no body caused a JSON parse error in the frontend.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2 months ago
oabrivard 68b1956059 refactor: extract synthesis helpers (assign_category, filter_phase2_url, tracing) into helpers.rs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2 months ago
oabrivard b124d73c2a fix: P1 audit items — CSV export theme filter, theme validation, ownership checks, history enums, i18n
- export_csv now accepts optional theme_id query param and filters accordingly
- Add UpdateThemeRequest::validate() with bounds checking; call it in the update handler
- Verify theme ownership in sources::create when theme_id is provided
- Update STATUS_OPTIONS (add filtered_too_old, filtered_not_article; remove filtered_duplicate) and SOURCE_TYPE_OPTIONS (add brave_search; remove overflow) in ArticleHistory
- Replace hardcoded French strings ('Confirmer', 'Erreur inconnue') with t() calls; add settings.apiKeys.unknownError key to fr.ts

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2 months ago
oabrivard d5d624b896 fix: P0 audit bugs — theme-scoped imports/preferred, creation flow, scheduler timeout, job cleanup
- Bulk/CSV import now passes theme_id through to DB
- Preferred source update scoped by theme_id (no cross-theme reset)
- Theme creation sends sensible defaults from frontend
- Scheduler wraps generation in 15-minute timeout
- Job store cleanup runs every 5 minutes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2 months ago
oabrivard fa793de8bf test: add scheduler unit test and find_due_schedules integration tests
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 7f647bc656 refactor: extract JobStore to services/job_store.rs
Moves JobEntry, JobStore, ProgressEvent, JOB_TTL, and emit_progress
to a dedicated module. Updates imports in synthesis.rs, generation.rs,
scheduler.rs, and app_state.rs. synthesis.rs re-exports for backward
compatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 1ab9c817e4 refactor: extract scrape_and_classify_batch from synthesis pipeline
Replaces ~200 duplicated lines in Phase 1 (personalized sources) and
Phase 2 (Brave Search) with a shared scrape_and_classify_batch function.
Uses ScrapeClassifyCtx to bundle shared parameters. Also prepares
synthesis.rs for JobStore extraction by re-exporting from job_store.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard a068d04fa8 feat: add background scheduler for automated synthesis generation
Spawns a tokio task that checks for due schedules every 60 seconds,
runs generation via run_generation_inner, and sends emails to configured
recipients before marking each schedule as run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 384649b2b6 feat: add theme schedules — model, DB, CRUD handler, routes
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard e43a4d2180 feat: add preferred sources — prioritized during synthesis generation
Users can mark sources as preferred via star buttons on the theme page.
Preferred sources are processed first in the pipeline (ordered before
non-preferred in waves, shuffled separately then merged).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 78844c4ebe chore: remove unused test_settings() function from prompts.rs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 6f3e6883c9 feat: add stop generation button — saves partial synthesis on cancel
Adds Arc<AtomicBool> cancellation flag to JobStore/JobEntry. The pipeline
checks the flag before each wave and after each batch, then saves whatever
articles have been collected. A new POST /syntheses/generate/:job_id/stop
endpoint sets the flag. The frontend shows a red stop button during generation
and POSTs to the stop endpoint on click.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
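
Note on 6f3e6883c9: the cancellation flow described there (check the flag before each wave and after each batch, then keep whatever was collected) can be sketched as follows. This is a synchronous, std-only illustration; `run_pipeline`, the nested `Vec` layout, and the `after_batch` callback are assumptions made for the example, and the real pipeline is async.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical stand-in for the pipeline. `waves` holds batches of article
/// IDs; `after_batch` simulates outside activity (such as a stop request)
/// that may raise the flag while work is in progress.
fn run_pipeline(
    waves: &[Vec<Vec<u32>>],
    cancel: &AtomicBool,
    mut after_batch: impl FnMut(&[u32]),
) -> Vec<u32> {
    let mut collected = Vec::new();
    'waves: for wave in waves {
        // Check before each wave.
        if cancel.load(Ordering::SeqCst) {
            break;
        }
        for batch in wave {
            collected.extend_from_slice(batch);
            after_batch(&collected);
            // Check after each batch.
            if cancel.load(Ordering::SeqCst) {
                break 'waves;
            }
        }
    }
    // Partial results are returned rather than discarded.
    collected
}
```

Sharing the flag as `Arc<AtomicBool>`, as the commit message describes, lets a stop handler set it from another task without taking a lock; a `&Arc<AtomicBool>` coerces to the `&AtomicBool` parameter used here.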
oabrivard e444f79c0b fix: mock provider returns today's date in classify response
Without a date, articles are routed to "Articles sans date" instead
of their classified category, breaking pipeline tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 14908cf603 test: add themes CRUD, article history, and assign_category tests
Covers GAP-01 (themes API), GAP-02 (article history API), and
GAP-04 (assign_category unit tests).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 196005a27b feat: multi-theme Phase 1 — settings split, sources/syntheses theme_id, pipeline theme-aware
Remove content settings from settings table (moved to themes).
Add theme_id to sources and syntheses. Pipeline loads content
settings from the selected theme. Generate endpoint requires theme_id.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 10b8d950b9 feat: add themes CRUD endpoints
Implements GET/POST/PUT/DELETE /api/v1/themes handlers following the same patterns as sources.rs, registers the module, and wires up routes in the router.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 467ad550a5 feat: add Theme model and DB queries
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 6b84c335d0 feat: improve synthesis list cards with time, all categories, and uniform height
- Add generation time below date in synthesis cards
- Show all categories with article count in parentheses
- Use flex-col layout for uniform card height
- Add sections_summary to SynthesisListItem API response
- Add formatTime utility

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard a89c61c5b6 feat: add "Articles sans date" category for articles without publication date
Articles where neither the scraper nor the LLM could extract a date
are now placed in a separate "Articles sans date" section instead of
their classified category. This makes undated articles visible without
mixing them with properly dated content.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard fb086a706f feat: rename fallback category "Autre" to "Divers"
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard e24236a069 feat: add max_links_per_source setting (default 8, was hardcoded 15)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 2c3c6008a3 fix: monotonic progress bar with 3 clean phases (sources, websearch, saving)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard d234fa9b24 feat: add is_article LLM check + remove use_llm_for_source_links option
The LLM now determines if scraped content is a real article during
classify (zero extra cost). The separate LLM link extraction option
is removed — heuristic extraction is sufficient.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 37d17e577a feat: restructure Phase 1 into windowed source extraction waves
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 0f1b0306e4 feat: add source_extraction_window setting (default 3)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard c5a56c8fb8 feat: save publication date in article history and show in synthesis
- Add published_date column to article_history table
- Add date field to NewsItem (serialized in synthesis JSONB)
- Pass LLM-extracted date through ArticleTrace to article history
- Display date below article title in SynthesisDetail page

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard de25a08d51 feat: LLM extracts publication date as fallback for article age filtering
The classify prompt now asks the LLM to return a date field (YYYY-MM-DD).
When the scraper couldn't find a date, the LLM-extracted date is used to
filter articles that exceed max_age_days.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 91272ddfc4 feat: dynamic summary length and body snippet size based on setting
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 1b63afd12a feat: add summary_length setting (1=court, 2=moyen, 3=detaille)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 0874650a7f fix: pipeline tests use wiremock URLs + skip SSRF for localhost
- Add SKIP_SSRF_CHECK env var to bypass SSRF in test environments
- Use wiremock server as source URL (same domain as article URLs)
- Add source page mock to wiremock setup
- Set SKIP_SSRF_CHECK=1 in integration test script
- Fix unused import warning

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard aee70b37d4 fix: use docker-compose.test.yml for integration test DB
Rewrite run-integration-tests.sh to use the e2e docker-compose config
(Postgres on port 5433). Add --db-check flag for connectivity debugging.
Remove build_test_router (reverted to build_router). Keep minimal_test
for oneshot debugging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 53813007c6 fix: use lightweight test router without SPA fallback and TraceLayer
Unauthenticated requests were hanging in integration tests due to
tower middleware layers interacting with oneshot(). Add build_test_router()
that only includes API routes + CSRF middleware.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard ccecaa2d13 refactor: add provider_override for pipeline dependency injection
Adds an optional LlmProvider override to run_generation and
run_generation_inner, allowing tests to inject a mock provider without
touching real credentials or the provider-resolution path. Makes
run_generation_inner pub so integration tests can call it directly.
Production callers pass None and behaviour is unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 17e054c257 feat: add MockLlmProvider for integration testing
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 4bbdd5c4d1 perf: batch article history INSERTs to reduce DB round-trips
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard f37e0b42a0 perf: use Arc for immutable values in pipeline to reduce cloning
Wrap `model_research` (String), `classify_schema` (Value), and
`classification_categories` (Vec<String>) in Arc before the batch
loops so spawned tasks clone a cheap pointer instead of the full
heap data on every iteration. Also removes the redundant intermediate
`mdl`/`class_sys`/`class_user` bindings in both classify loops.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
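
Note on f37e0b42a0: the idea of wrapping read-only data in `Arc` before a spawn loop can be illustrated with plain threads. This is a sketch under assumptions: `classify_all` and its substring matching are invented for the example, and `std::thread` stands in for the spawned tasks mentioned in the message.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical example: each worker needs read-only access to the category
/// list. Cloning the `Arc` copies a pointer and bumps a reference count,
/// instead of deep-copying the `Vec<String>` once per worker.
fn classify_all(categories: Vec<String>, titles: Vec<String>) -> Vec<String> {
    let categories: Arc<Vec<String>> = Arc::new(categories);
    let handles: Vec<_> = titles
        .into_iter()
        .map(|title| {
            let categories = Arc::clone(&categories); // cheap pointer clone
            thread::spawn(move || {
                let lower = title.to_lowercase();
                categories
                    .iter()
                    .find(|c| lower.contains(&c.to_lowercase()))
                    .cloned()
                    // "Divers" is the fallback category name used in this log.
                    .unwrap_or_else(|| "Divers".to_string())
            })
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect()
}
```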
oabrivard 60494aeceb perf: cache CSS selectors with LazyLock to avoid re-parsing
Replace runtime Selector::parse calls on static strings with module-level
LazyLock statics in source_scraper.rs (ANCHOR_SELECTOR) and scraper.rs
(SEL_TITLE, SEL_H1, SEL_BODY), so each selector is compiled once at
first use instead of on every function call.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
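
Note on 60494aeceb: the once-only initialisation that `LazyLock` provides can be shown without the `scraper` crate. In this sketch `parse_selector` is a hypothetical stand-in for an expensive `Selector::parse` call, and `PARSE_COUNT` exists only to make the single initialisation observable. `std::sync::LazyLock` requires Rust 1.80 or later.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::LazyLock;

/// Counts how often the (pretend) expensive parse runs.
static PARSE_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for an expensive selector parse.
fn parse_selector(raw: &str) -> Vec<String> {
    PARSE_COUNT.fetch_add(1, Ordering::SeqCst);
    raw.split(',').map(|s| s.trim().to_string()).collect()
}

/// Module-level static: the closure runs on first access only.
static TITLE_SELECTOR: LazyLock<Vec<String>> =
    LazyLock::new(|| parse_selector("title, h1"));

/// Callers borrow the cached value instead of re-parsing per call.
fn selector_parts() -> &'static [String] {
    TITLE_SELECTOR.as_slice()
}
```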
oabrivard 69c1688bc7 chore: remove SESSION_SECRET and wrap master_encryption_key in Arc
SESSION_SECRET was loaded and validated but never used anywhere in the
codebase. master_encryption_key is now wrapped in Arc<String> to avoid
cloning the secret string on every AppState clone.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard f44aa44c48 refactor: replace trace_article 11 parameters with ArticleTrace struct
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard f5466a6bd5 refactor: extract shared LLM error mapping to reduce duplication
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 2036c12b24 refactor: eliminate SettingsResponse struct, serialize UserSettings directly
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard e056ef9d3e refactor: extract assign_category and filter_phase2_url helpers from synthesis pipeline
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 24d53a01d1 fix: block SSRF via IPv4-mapped IPv6 and add check to source page fetching
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
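
Note on 24d53a01d1: an IPv4-mapped IPv6 address such as `::ffff:127.0.0.1` can slip past a check that only inspects IPv4 addresses. A minimal sketch of unwrapping the mapped form before checking, using only `std::net`; the `is_blocked` and `is_blocked_v4` names and the exact block list are assumptions, and the list is deliberately incomplete (IPv6 unique-local ranges, for instance, are not covered).

```rust
use std::net::{IpAddr, Ipv4Addr};

/// Hypothetical block list for IPv4 targets.
fn is_blocked_v4(ip: Ipv4Addr) -> bool {
    ip.is_loopback()
        || ip.is_private()
        || ip.is_link_local()
        || ip.is_unspecified()
        || ip.is_broadcast()
}

/// Returns true when a resolved address should not be fetched.
fn is_blocked(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => is_blocked_v4(v4),
        IpAddr::V6(v6) => match v6.to_ipv4_mapped() {
            // ::ffff:a.b.c.d is treated exactly like a.b.c.d.
            Some(v4) => is_blocked_v4(v4),
            None => v6.is_loopback() || v6.is_unspecified(),
        },
    }
}
```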
oabrivard 93003229f1 fix: add periodic expired session cleanup (hourly)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 347558a278 fix: atomic job creation, 15min timeout, and panic handling
- Replace iterating DashMap check with atomic DashSet insert in create_job to
  eliminate the race condition where double-click could create two concurrent
  jobs for the same user
- Add release_user method called at end of generation task (normal, timeout,
  and panic paths) so the generating slot is always freed
- Wrap run_generation in tokio::time::timeout(900s) to prevent hung LLM calls
  from blocking the generation slot forever
- Spawn a second task to await the JoinHandle and call release_user + send
  error event if the generation task panics, preventing SSE clients from
  hanging indefinitely
- Update cleanup_expired to also remove users from generating_users set
- Update tests to call release_user after completion/error to match new contract

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
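
Note on 347558a278: the race described in the first bullet goes away when "check" and "mark" are a single operation. The sketch below uses a `Mutex<HashSet>` as a std-only stand-in for the `DashSet` named in the message; `GeneratingUsers` and `try_acquire` are hypothetical names, while `release_user` follows the name used in the commit.

```rust
use std::collections::HashSet;
use std::sync::Mutex;

/// Hypothetical tracker of users who currently have a generation running.
struct GeneratingUsers {
    inner: Mutex<HashSet<u64>>,
}

impl GeneratingUsers {
    fn new() -> Self {
        Self { inner: Mutex::new(HashSet::new()) }
    }

    /// `insert` returns false if the user was already present, so checking
    /// and claiming the slot happen in one step under the lock.
    fn try_acquire(&self, user_id: u64) -> bool {
        self.inner.lock().unwrap().insert(user_id)
    }

    /// Must run on every exit path (normal, timeout, panic) so the slot
    /// is always freed.
    fn release_user(&self, user_id: u64) {
        self.inner.lock().unwrap().remove(&user_id);
    }
}
```

A second request for the same user (for example from a double-click) gets `false` from `try_acquire` and can be rejected instead of starting a concurrent job.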
oabrivard 59932589cc fix: prevent UTF-8 panic in error message truncation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
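
Note on 59932589cc: slicing a `&str` at a byte index that falls inside a multi-byte character panics. One common way to truncate safely is to back up to the nearest character boundary, as sketched here; `truncate_utf8` is a hypothetical name, and the log does not show the fix actually applied.

```rust
/// Returns at most `max_bytes` bytes of `s`, never splitting a character.
fn truncate_utf8(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    let mut end = max_bytes;
    // Index 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}
```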
oabrivard e74a1850bf fix: log source URL in link_extraction LLM call logs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago