Add source_url field to ScrapedNewsItem and a trace_article helper that
inserts into article_history with full provenance metadata. Instrument
Phase 1 (empty content, history dedup, source diversity) and Phase 2
(homepage filter, cross-phase dedup, history dedup, empty content) so
every dropped article is recorded with its filter reason. Replace the
old insert_urls call with per-article trace_article calls for used
articles, preserving dedup semantics via url_hash.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Accumulates overflow articles from both classification phases and redistributes
them into the Autre category when total articles fall below 75% of the configured
max, respecting per-source diversity limits.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Returns a (result, overflow) tuple so callers can access articles that
could not fit in any category or Autre. Also adds the
SYNTHESIS_MIN_FILL_RATIO constant for the upcoming fill-up logic.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The rewrite pass shared the search pass schema which enforced minItems/maxItems
equal to max_items_per_category. After filter_empty_scraped_articles removed
old/failed articles, the scraped data had fewer items than the schema required,
causing the LLM to duplicate content to fill the quota.
Now build_rewrite_schema counts actual items per category from the scraped data
and sets minItems/maxItems accordingly. Also removed dead domain_counts variable.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- filter_empty_scraped_articles: removes articles with empty scraped content
(too old, soft 404, scrape failure) before the rewrite pass, preventing
empty articles in the final synthesis
- restore_scraped_urls: already existed, now has unit tests
- E2E test: added assertions for no Wikipedia URLs, no empty summaries,
and updated settings payload with new fields (max_articles_per_source,
source_diversity_window)
- 4 new unit tests for filter_empty + restore_scraped_urls
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The rewrite pass can replace validated URLs with hallucinated ones (Wikipedia,
corporate sites) despite being instructed to preserve them. After the rewrite,
restore_scraped_urls() replaces each article's URL with the original scraped
URL by matching on position (category + item index). Logs when a URL is
restored so hallucination patterns can be monitored.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OpenAI's default output limit (4096 tokens) was too low for structured
synthesis output with multiple categories and articles per category,
causing truncated JSON. Set 16384 for both OpenAI APIs (Responses +
Chat Completions) and Gemini. Anthropic already had 16384.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add max_articles_per_source setting (default 3, range 1-10) with migration,
backend model, DB queries, and frontend number input
- Add limit_articles_per_source filter: spreads articles across categories
(1 per domain per category first), then fills remaining slots up to the limit
- Add dedup_by_url filter: removes duplicate URLs across categories (case-insensitive)
- Pipeline order: parse → filter_homepage → dedup_by_url → limit_per_source → scrape
- 10 new unit tests covering spread, cap enforcement, dedup, and edge cases
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The adaptive pipeline skipped the scrape+rewrite pass when the LLM's search
results had URLs starting with "http". But LLMs hallucinate plausible URLs
(Wikipedia, corporate sites) that pass the http check but aren't actual source
articles. The scrape pass catches these by fetching each URL and validating
the content exists. Always running the full pipeline ensures URL integrity.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The LLM was returning only 1 article per category despite the user setting 4.
- Added minItems/maxItems to the category array schema (enforced by OpenAI strict mode)
- Changed prompt from "au maximum N actualites" to "exactement N actualites"
- Schema builder now takes max_items_per_category parameter
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
LLM output occasionally contains \u0000 null bytes (e.g., "annonc\u0000...")
which PostgreSQL rejects in JSONB columns. Added sanitize_json_null_bytes()
that recursively strips null bytes from all string values before DB insert.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The scraper client (build_scraper_client) has a 15s timeout appropriate for web
scraping, but LLM API calls — especially with web search — take 30-60s. LLM
providers now build their own reqwest client with 120s timeout via build_llm_client().
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bugs fixed:
- resolve_model queried non-existent admin_provider_models table (use JSONB query on admin_providers)
- key_prefix VARCHAR(10) too short for 11-char prefix (migration to VARCHAR(12))
- API key test schema missing additionalProperties: false (OpenAI strict mode)
- CSP missing font-src data: directive (PDF font embedding blocked)
- Magic link URL not logged in test mode (can't verify without real email)
- Rust 1.85 Docker image too old for dependencies (bumped to 1.88)
Tests added to prevent recurrence:
- schema_meets_openai_strict_mode_requirements: validates additionalProperties on all objects
- key_prefix_full_length_stored_in_db: verifies 11-char prefix survives DB round-trip
- generate_pipeline_resolves_model_from_admin_config: exercises full generation pipeline
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add SIGTERM/Ctrl+C signal handling with graceful connection draining
- Close database pool cleanly on shutdown
- Add frontend-builder stage to Dockerfile (node:22-alpine, npm ci + build)
- Move Docker build context to project root so both frontend/ and backend/ are accessible
- Frontend dist/ copied into container at ./static/ for the backend to serve
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Extract auth::create_and_send_magic_link() to deduplicate token rollback logic
- Replace (i32, i32, RateLimiter) tuple with named UserRateLimitEntry struct
- Use DashMap entry API for atomic rate limiter lookup (fixes TOCTOU race)
- Pre-allocate scraper body Vec from Content-Length when available
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Wire hardened scraper client into runtime (SSRF redirect validation was defined but unused)
- Stream scraper body with per-chunk size limit instead of post-download check (DoS/OOM)
- Persist user rate-limit overrides across generation jobs via AppState DashMap
- Roll back magic-link token on email send failure to prevent quota exhaustion
- Fix API error UX: prefer human message over machine error code in frontend
- Unwrap GET /syntheses { items } wrapper in frontend API layer (contract mismatch)
- Bind Postgres to localhost in docker-compose (was exposed on all interfaces)
- Fix CLAUDE.md: runtime queries not compile-time, 10 migrations not 9
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract cookie parsing into a standalone `extract_session_token` function
and add 5 unit tests covering the valid, missing, multi-cookie, whitespace,
and empty-header cases.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bug fix:
- Per-generation rate limiter was creating a new instance on every check,
making user rate limit overrides non-functional. Fixed by creating the
limiter once at pipeline start and reusing for both passes.
Simplifications:
- Extract spawn_task closure in scrape_articles (deduplicate spawn blocks)
- Use idiomatic if let Ok(...) instead of if let Some(..).ok() in scraper
- Replace manual loop with iterator chain in export_keys handler
- Simplify check_rate_limit to single boolean check
- Simplify handleImport settings merge (spread already provides defaults)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- resolve_provider_and_key() now respects user ai_provider preference
- Dual model resolution: ai_model for search pass, ai_model_writing for rewrite pass
- Per-generation rate limiter with user override support
- Homepage URL filter removes domain-only URLs after search pass
- ScrapedNewsItem gains original_title field populated from page <title>
- SynthesisResponse::try_from handles null sections gracefully (returns empty vec)
- Search prompt warns LLM against returning homepage URLs
- Rewrite prompt instructs LLM to use originalTitle with language preservation rules
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bug fix:
- Fix frontend sending captcha_token instead of turnstile_token in
login/register requests (would cause 422 errors on auth)
Backend simplifications:
- Deduplicate VALID_PROVIDERS constant (provider.rs is now the single source)
- Extract validate_display_name/validate_models helpers in provider model
- Add From<UserSettings> for SettingsResponse, From<User> for AdminUserResponse
- Consolidate Resend API call pattern into shared send_via_resend()
- Extract do_bulk_import() for sources bulk/CSV import
- Use idiomatic range.contains() for rate limit validation
Frontend simplifications:
- Consolidate file download logic (exportCsv reuses shared fetchFile/triggerDownload)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Backend:
- Synthesis email sending via Resend API with HTML template (inline CSS,
tables-based for email client compatibility) + plain-text fallback
- XSS prevention via html_escape() on all user content in email templates
- Markdown export: clean format with headers, links, summaries
- PDF export: printpdf with built-in Helvetica fonts, indigo color scheme,
automatic page breaks, word wrapping
- 3 new endpoints: send-email, export/markdown, export/pdf
- All endpoints enforce ownership checks
- Email validation using email_address crate
- 24 new unit tests, 13 integration tests
Frontend:
- Email section on SynthesisDetail: input pre-filled with user email,
send button with loading state, success/error feedback
- Export buttons: Markdown + PDF with per-button loading states
- File download via Blob + temporary anchor with Content-Disposition parsing
- 6 new export tests
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Backend:
- OpenAiProvider: Responses API with web_search_preview (pass 1),
Chat Completions with json_schema structured output (pass 2)
- AnthropicProvider: Messages API with web_search tool (pass 1),
schema-in-prompt for structured output, code fence stripping (pass 2)
- Pipeline adaptation: skip scrape+rewrite when >70% of search URLs are valid
- Provider factory updated for all three providers
- Error sanitization extended for Anthropic key patterns (sk-ant-)
- 44 new unit tests (OpenAI, Anthropic, factory, pipeline heuristic)
Frontend:
- Provider-specific info text below model selection
- Web search support badges (green/gray)
- Generate page shows selected provider and model
- Warning banner when provider lacks web search
- Provider utility module with 10 tests
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Backend:
- Admin API: CRUD for providers, rate limits, user role management
- Public config endpoint for enabled providers/models
- AdminUser extractor enforces RBAC on all admin endpoints
- Per-provider rate limiter with hot-reload from DB
- Audit logging for all admin mutations
- Seed data: Gemini, OpenAI, Anthropic providers with default models
- Self-demotion prevention on role changes
- 30 integration tests, 27 new unit tests
Frontend:
- Admin layout with sidebar navigation (providers, rate limits, users)
- Provider management: enable/disable, model CRUD, default model selection
- Rate limit configuration with effective rate display
- User management with role badges and promote/demote
- Admin link in navbar/mobile menu (visible only to admins)
- Settings page: dynamic provider/model selection from admin config
- 10 new tests (admin guard, config API)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix body text extraction to actually filter excluded elements (script,
nav, footer, aside, etc.) using node ID tracking instead of unused HashSet
- Add IPv6 reserved range checks to SSRF prevention: ULA (fc00::/7),
documentation (2001:db8::/32), discard prefix (100::/64)
- Add errors field to frontend BulkImportResponse type
- Validate Content-Type on CSV multipart upload (reject non-text files)
- Add 6 new unit tests for the fixes
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>