197 Commits (de8e5ab85c1c24d75ece81eef6639b6e2f7ab174)
 

Author SHA1 Message Date
oabrivard 0a87b7ed8f feat: add normalize_article_url and hash_article_url utilities
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 5a928aa990 feat: add article_history DB module (check, insert, cleanup)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard c271c240a2 feat: add article_history table and article_history_days setting
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard d7c91c956f docs: add article history implementation plan with retry logic 3 months ago
oabrivard 633a51dc8c docs: add spec for article history to prevent cross-synthesis duplicates 3 months ago
oabrivard 8e06357b47 test: update integration test with LLM scraping settings 3 months ago
oabrivard a7599e512a test: comprehensive E2E synthesis validation (duplicates, links, counts, domains)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard c8779f6ca2 feat: add LLM scraping toggles to Settings page
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 8a061c98db feat: LLM-assisted article extraction with Arc provider, concurrency control, and progress
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 357f06e405 feat: LLM-assisted source link extraction with heuristic fallback
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard e6e8aa1eeb feat: add LLM prompts and schemas for link and article extraction
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 23f121a58d feat: ScrapedContent url+head_html fields, Arc<dyn LlmProvider>, 3-tuple scrape returns
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard e483789d1b feat: add use_llm_for_source_links and use_llm_for_article_extraction settings
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard d508ea9620 docs: revise LLM scraping plan — fix Arc provider, head_html, concurrency, tests 3 months ago
oabrivard 175483dfe3 docs: add spec and plan for LLM-assisted scraping
Two optional LLM enhancements: link extraction from source pages and article
content extraction. Plan needs revision for Arc<dyn LlmProvider> threading
and <head> HTML preservation before implementation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 420603d76a Updated specifications of source diversity functionality 3 months ago
oabrivard 53ecce84b0 feat: two-phase generation pipeline — personalized sources first, web search fallback
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 51ea032838 feat: add scrape_flat_urls helper and gap-aware search prompt
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard d508b5b4ab feat: Autre category support in rewrite schema, final sections, URL restore + remove dead code
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard ba7024e280 feat: add classification response parsing with category filling and Autre fallback
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 104b6a0d7b feat: add classification prompt and schema for article categorization
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard c06b5ba454 feat: add source_scraper module for extracting article links from source pages
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 7c8a19616d docs: add spec and plan for source priority pipeline redesign
Two-phase pipeline: scrape personalized sources first, classify with LLM,
fall back to web search for gaps. Tasks 1-3 ready, Task 4 needs elaboration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 45e5ee8a7d fix: rewrite pass schema uses actual scraped item counts, not max setting
The rewrite pass shared the search pass schema which enforced minItems/maxItems
equal to max_items_per_category. After filter_empty_scraped_articles removed
old/failed articles, the scraped data had fewer items than the schema required,
causing the LLM to duplicate content to fill the quota.

Now build_rewrite_schema counts actual items per category from the scraped data
and sets minItems/maxItems accordingly. Also removed dead domain_counts variable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 13894a8f50 fix: filter empty scraped articles + restore URLs after rewrite + E2E assertions
- filter_empty_scraped_articles: removes articles with empty scraped content
  (too old, soft 404, scrape failure) before the rewrite pass, preventing
  empty articles in the final synthesis
- restore_scraped_urls: already existed, now has unit tests
- E2E test: added assertions for no Wikipedia URLs, no empty summaries,
  and updated settings payload with new fields (max_articles_per_source,
  source_diversity_window)
- 4 new unit tests for filter_empty + restore_scraped_urls

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard a9be1ce435 fix: restore scraped URLs after LLM rewrite pass to prevent hallucination
The rewrite pass can replace validated URLs with hallucinated ones (Wikipedia,
corporate sites) despite being instructed to preserve them. After the rewrite,
restore_scraped_urls() replaces each article's URL with the original scraped
URL by matching on position (category + item index). Logs when a URL is
restored so hallucination patterns can be monitored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 8a18b70aff fix: set max output tokens to 16384 for all LLM providers
OpenAI's default output limit (4096 tokens) was too low for structured
synthesis output with multiple categories and articles per category,
causing truncated JSON. Set 16384 for both OpenAI APIs (Responses +
Chat Completions) and Gemini. Anthropic already had 16384.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard fdb3110407 feat: add source_diversity_window setting to frontend
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 55c2b050b3 feat: extract recent domains and pass to search prompt for diversity
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 3f6ad9853c feat: build_search_prompt accepts recent_domains for source diversity
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard a31915d3ce feat: add source_diversity_window setting (migration + model + DB + validation tests)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard b558619d10 feat: source diversity limit + URL deduplication in generation pipeline
- Add max_articles_per_source setting (default 3, range 1-10) with migration,
  backend model, DB queries, and frontend number input
- Add limit_articles_per_source filter: spreads articles across categories
  (1 per domain per category first), then fills remaining slots up to the limit
- Add dedup_by_url filter: removes duplicate URLs across categories (case-insensitive)
- Pipeline order: parse → filter_homepage → dedup_by_url → limit_per_source → scrape
- 10 new unit tests covering spread, cap enforcement, dedup, and edge cases

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard da05965dde feat: add max_articles_per_source setting to frontend
Add max_articles_per_source field to UserSettings interface and DEFAULT_SETTINGS,
expose it as a number input on the Settings page, and add the French i18n label.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 6819c7193c feat: add limit_articles_per_source filter with unit tests
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard c1ee79bcf6 feat: add max_articles_per_source setting (migration + model + DB)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard a3f4c3b42f fix: always run scrape+rewrite pass to prevent hallucinated URLs
The adaptive pipeline skipped the scrape+rewrite pass when the LLM's search
results had URLs starting with "http". But LLMs hallucinate plausible URLs
(Wikipedia, corporate sites) that pass the http check but aren't actual source
articles. The scrape pass catches these by fetching each URL and validating
the content exists. Always running the full pipeline ensures URL integrity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 2b8f5236d5 fix: strip smart quotes and zero-width chars from pasted URLs
normalizeUrl now strips smart quotes, zero-width spaces, and other invisible
formatting characters that browsers inject when copy-pasting URLs from rich
text sources. Prevents false "URL invalid" errors for valid URLs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 45c9e71589 fix: enforce max_items_per_category in JSON schema and prompt
The LLM was returning only 1 article per category despite the user setting 4.
- Added minItems/maxItems to the category array schema (enforced by OpenAI strict mode)
- Changed prompt from "au maximum N actualites" to "exactement N actualites"
- Schema builder now takes max_items_per_category parameter

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 0b0702de39 fix: strip null bytes from LLM output before saving to PostgreSQL JSONB
LLM output occasionally contains \u0000 null bytes (e.g., "annonc\u0000...")
which PostgreSQL rejects in JSONB columns. Added sanitize_json_null_bytes()
that recursively strips null bytes from all string values before DB insert.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 3fe667591d fix: LLM providers use own HTTP client with 120s timeout (was sharing scraper's 15s)
The scraper client (build_scraper_client) has a 15s timeout appropriate for web
scraping, but LLM API calls — especially with web search — take 30-60s. LLM
providers now build their own reqwest client with 120s timeout via build_llm_client().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 69e5f2257a fix: UAT test — ESM compat, correct status codes, idempotent source setup
- Use import.meta.url for ESM-compatible __dirname
- Source creation expects 201, not 200
- Clean up existing sources before adding to avoid unique constraint violation
- Fix E2E docker-compose build context to project root

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard fb604a408b Changed Claude code configuration 3 months ago
oabrivard 97cb58ff42 fix: improve type safety and error handling in generation UAT 3 months ago
oabrivard 02017db2e0 test: add live generation UAT with real OpenAI API key
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 6fe75d77e7 feat: add source file:line to WARN and ERROR log lines
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 004f08f385 fix: runtime bugs found during first Docker run + integration tests
Bugs fixed:
- resolve_model queried non-existent admin_provider_models table (use JSONB query on admin_providers)
- key_prefix VARCHAR(10) too short for 11-char prefix (migration to VARCHAR(12))
- API key test schema missing additionalProperties: false (OpenAI strict mode)
- CSP missing font-src data: directive (PDF font embedding blocked)
- Magic link URL not logged in test mode (can't verify without real email)
- Rust 1.85 Docker image too old for dependencies (bumped to 1.88)

Tests added to prevent recurrence:
- schema_meets_openai_strict_mode_requirements: validates additionalProperties on all objects
- key_prefix_full_length_stored_in_db: verifies 11-char prefix survives DB round-trip
- generate_pipeline_resolves_model_from_admin_config: exercises full generation pipeline

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard fa8f604407 fix: set STATIC_DIR in production docker-compose for frontend serving
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 069a4f2022 feat: graceful shutdown and frontend build in Docker
- Add SIGTERM/Ctrl+C signal handling with graceful connection draining
- Close database pool cleanly on shutdown
- Add frontend-builder stage to Dockerfile (node:22-alpine, npm ci + build)
- Move Docker build context to project root so both frontend/ and backend/ are accessible
- Frontend dist/ copied into container at ./static/ for the backend to serve

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard b961f82f01 refactor: add UserRateLimitEntry constructor and settings_changed method
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard c1f2f1456f refactor: simplify recent changes — extract helper, named struct, atomic entry, pre-alloc
- Extract auth::create_and_send_magic_link() to deduplicate token rollback logic
- Replace (i32, i32, RateLimiter) tuple with named UserRateLimitEntry struct
- Use DashMap entry API for atomic rate limiter lookup (fixes TOCTOU race)
- Pre-allocate scraper body Vec from Content-Length when available

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago