- Add get_last_source_url() to article_history db module for source rotation
- Remove head_html field from ScrapedContent struct and scrape_url function
- Fix synthesis.rs scrape_single_article_with_llm to pass an empty string instead of the removed field
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the three-method LlmProvider trait (generate_search_pass,
generate_rewrite_pass, supports_web_search) and ProviderCapabilities
with a single call_llm method. Update all three provider implementations
(Gemini, OpenAI, Anthropic) and all callers in synthesis.rs,
source_scraper.rs, and api_keys.rs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add LlmLogs page with collapsible prompts/response sections, call-type
colored badges, and duration display
- Wire /llm-logs/:jobId route in App.tsx (lazy-loaded)
- Expose job_id in backend SynthesisListItem and frontend SynthesisListItem
type; update test fixture accordingly
- Add log-icon link next to delete button on each Home synthesis card
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add source_url field to ScrapedNewsItem and a trace_article helper that
inserts into article_history with full provenance metadata. Instrument
Phase 1 (empty content, history dedup, source diversity) and Phase 2
(homepage filter, cross-phase dedup, history dedup, empty content) so
every dropped article is recorded with its filter reason. Replace the
old insert_urls call with per-article trace_article calls for used
articles, preserving dedup semantics via url_hash.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Accumulates overflow articles from both classification phases and redistributes
them into the Autre category when total articles fall below 75% of the configured
max, respecting per-source diversity limits.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
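A minimal sketch of the fill-up step described above, not the project's actual code. The `Article` struct, the function name `fill_autre_from_overflow`, and its parameters are hypothetical. Two assumptions are made: the constant's value is taken from the "75%" in the message, and filling continues until the configured max is reached.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the project's article type.
#[derive(Debug, Clone, PartialEq)]
pub struct Article {
    pub url: String,
    pub domain: String,
}

/// Assumed value, taken from the "75%" in the commit message.
pub const SYNTHESIS_MIN_FILL_RATIO: f64 = 0.75;

/// Moves overflow articles into `autre` when the synthesis is under-filled.
/// `placed` holds the articles already assigned to regular categories.
pub fn fill_autre_from_overflow(
    placed: &[Article],
    autre: &mut Vec<Article>,
    overflow: Vec<Article>,
    max_total: usize,
    max_per_source: usize,
) {
    let threshold = (max_total as f64 * SYNTHESIS_MIN_FILL_RATIO).ceil() as usize;
    let mut total = placed.len() + autre.len();
    if total >= threshold {
        return; // Already full enough, leave Autre untouched.
    }

    // Count how many articles each domain already contributes.
    let mut per_domain: HashMap<String, usize> = HashMap::new();
    for article in placed.iter().chain(autre.iter()) {
        *per_domain.entry(article.domain.clone()).or_insert(0) += 1;
    }

    // Assumption: keep filling until the configured max is reached.
    for article in overflow {
        if total >= max_total {
            break;
        }
        let count = per_domain.entry(article.domain.clone()).or_insert(0);
        if *count >= max_per_source {
            continue; // Respect the per-source diversity limit.
        }
        *count += 1;
        total += 1;
        autre.push(article);
    }
}
```

With `max_total = 8`, the threshold is 6, so a synthesis holding 3 articles would be topped up while one holding 6 would be left alone.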
Returns a (result, overflow) tuple so callers can access articles that
could not fit in any category or Autre. Also adds the
SYNTHESIS_MIN_FILL_RATIO constant for the upcoming fill-up logic.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The rewrite pass shared the search pass schema, which enforced minItems/maxItems
equal to max_items_per_category. After filter_empty_scraped_articles removed
old/failed articles, the scraped data had fewer items than the schema required,
causing the LLM to duplicate content to fill the quota.
Now build_rewrite_schema counts actual items per category from the scraped data
and sets minItems/maxItems accordingly. Also removed dead domain_counts variable.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
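To illustrate the idea of sizing the rewrite schema from the scraped data, here is a small sketch using only the standard library. The `ScrapedByCategory` type and the `rewrite_array_bounds` name are hypothetical, and the JSON fragment is built as a plain string. The real build_rewrite_schema presumably builds a full schema object and may differ.

```rust
use std::collections::BTreeMap;

/// Hypothetical shape: category name -> scraped items that survived filtering.
pub type ScrapedByCategory = BTreeMap<String, Vec<String>>;

/// Returns, per category, a JSON Schema array fragment whose minItems and
/// maxItems both equal the number of items actually present, so the LLM is
/// never asked for more articles than the scraped data contains.
pub fn rewrite_array_bounds(scraped: &ScrapedByCategory) -> BTreeMap<String, String> {
    scraped
        .iter()
        .map(|(category, items)| {
            let n = items.len();
            let fragment = format!(r#"{{"type":"array","minItems":{n},"maxItems":{n}}}"#);
            (category.clone(), fragment)
        })
        .collect()
}
```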
- filter_empty_scraped_articles: removes articles with empty scraped content
(too old, soft 404, scrape failure) before the rewrite pass, preventing
empty articles in the final synthesis
- restore_scraped_urls: already existed, now has unit tests
- E2E test: added assertions for no Wikipedia URLs, no empty summaries,
and updated settings payload with new fields (max_articles_per_source,
source_diversity_window)
- 4 new unit tests for filter_empty + restore_scraped_urls
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
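The filtering step above can be sketched roughly as follows. The `ScrapedArticle` struct and the category layout are hypothetical stand-ins, and treating whitespace-only content as empty is an assumption.

```rust
/// Hypothetical stand-in for a scraped article.
#[derive(Debug, Clone, PartialEq)]
pub struct ScrapedArticle {
    pub url: String,
    pub content: String,
}

/// Drops articles whose scraped content is empty and returns how many were
/// removed. Assumption: whitespace-only content counts as empty.
pub fn filter_empty_scraped_articles(
    categories: &mut [(String, Vec<ScrapedArticle>)],
) -> usize {
    let mut removed = 0;
    for (_, items) in categories.iter_mut() {
        let before = items.len();
        items.retain(|article| !article.content.trim().is_empty());
        removed += before - items.len();
    }
    removed
}
```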
The rewrite pass can replace validated URLs with hallucinated ones (Wikipedia,
corporate sites) despite being instructed to preserve them. After the rewrite,
restore_scraped_urls() replaces each article's URL with the original scraped
URL by matching on position (category + item index). Logs when a URL is
restored so hallucination patterns can be monitored.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
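A sketch of the positional matching described above, under assumptions. The `Item` struct and the slice-of-pairs layout are hypothetical, categories are matched by name, and logging goes to stderr rather than through the project's logger.

```rust
/// Hypothetical stand-in for a synthesis item.
#[derive(Debug, Clone, PartialEq)]
pub struct Item {
    pub title: String,
    pub url: String,
}

/// Overwrites each rewritten item's URL with the scraped URL found at the same
/// position (same category, same index). Returns the number of URLs restored.
pub fn restore_scraped_urls(
    rewritten: &mut [(String, Vec<Item>)],
    scraped: &[(String, Vec<Item>)],
) -> usize {
    let mut restored = 0;
    for (name, items) in rewritten.iter_mut() {
        // Assumption: categories are matched by name.
        let originals = scraped.iter().find(|(n, _)| *n == *name);
        if let Some((_, originals)) = originals {
            // zip stops at the shorter list, so extra items are left as-is.
            for (item, original) in items.iter_mut().zip(originals.iter()) {
                if item.url != original.url {
                    eprintln!("restored URL in {name}: {} -> {}", item.url, original.url);
                    item.url = original.url.clone();
                    restored += 1;
                }
            }
        }
    }
    restored
}
```

Matching on position rather than on title keeps the step robust when the rewrite pass rephrases titles, at the cost of relying on the LLM to preserve item order.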
OpenAI's default output limit (4096 tokens) was too low for structured
synthesis output with multiple categories and articles per category,
causing truncated JSON. Set the output limit to 16384 tokens for both OpenAI
APIs (Responses + Chat Completions) and Gemini. Anthropic already had 16384.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add max_articles_per_source setting (default 3, range 1-10) with migration,
backend model, DB queries, and frontend number input
- Add limit_articles_per_source filter: spreads articles across categories
(1 per domain per category first), then fills remaining slots up to the limit
- Add dedup_by_url filter: removes duplicate URLs across categories (case-insensitive)
- Pipeline order: parse → filter_homepage → dedup_by_url → limit_per_source → scrape
- 10 new unit tests covering spread, cap enforcement, dedup, and edge cases
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
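The two filters above could look roughly like this sketch. It works on bare URL strings instead of the project's article type, and `domain_of` is a hypothetical, simplified host extractor (no port or subdomain handling). The two-pass structure is one reading of "1 per domain per category first, then fill remaining slots".

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical layout: (category name, article URLs in rank order).
pub type Categories = Vec<(String, Vec<String>)>;

/// Simplified host extraction; a real implementation would use a URL parser.
fn domain_of(url: &str) -> String {
    let rest = url.split_once("://").map_or(url, |(_, r)| r);
    let host = rest.split(['/', '?', '#']).next().unwrap_or("");
    let host = host.to_ascii_lowercase();
    host.trim_start_matches("www.").to_string()
}

/// Removes URLs already seen in an earlier position (case-insensitive).
pub fn dedup_by_url(categories: &mut Categories) -> usize {
    let mut seen = HashSet::new();
    let mut removed = 0;
    for (_, urls) in categories.iter_mut() {
        let before = urls.len();
        urls.retain(|u| seen.insert(u.to_lowercase()));
        removed += before - urls.len();
    }
    removed
}

/// Caps each domain at `limit` articles overall, spreading them across
/// categories before letting one category take several from the same domain.
pub fn limit_articles_per_source(categories: &mut Categories, limit: usize) {
    let mut global: HashMap<String, usize> = HashMap::new();
    let mut keep: Vec<Vec<bool>> =
        categories.iter().map(|(_, u)| vec![false; u.len()]).collect();

    // Pass 1: at most one article per domain per category.
    for (ci, (_, urls)) in categories.iter().enumerate() {
        let mut seen_here = HashSet::new();
        for (ui, url) in urls.iter().enumerate() {
            let domain = domain_of(url);
            let count = global.entry(domain.clone()).or_insert(0);
            if *count < limit && seen_here.insert(domain) {
                *count += 1;
                keep[ci][ui] = true;
            }
        }
    }
    // Pass 2: fill remaining slots up to the per-domain limit.
    for (ci, (_, urls)) in categories.iter().enumerate() {
        for (ui, url) in urls.iter().enumerate() {
            if keep[ci][ui] {
                continue;
            }
            let count = global.entry(domain_of(url)).or_insert(0);
            if *count < limit {
                *count += 1;
                keep[ci][ui] = true;
            }
        }
    }
    // Drop everything that was not selected, preserving order.
    for (ci, (_, urls)) in categories.iter_mut().enumerate() {
        let mut flags = keep[ci].iter();
        urls.retain(|_| *flags.next().unwrap());
    }
}
```

With a limit of 2 and three articles from one domain (two in the first category, one in the second), the spread pass keeps one per category instead of letting the first category consume both slots.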
The adaptive pipeline skipped the scrape+rewrite pass when the LLM's search
results had URLs starting with "http". But LLMs hallucinate plausible URLs
(Wikipedia, corporate sites) that pass the http check but aren't actual source
articles. The scrape pass catches these by fetching each URL and validating
that the content exists. Always running the full pipeline ensures URL integrity.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The LLM was returning only 1 article per category despite the user setting 4.
- Added minItems/maxItems to the category array schema (enforced by OpenAI strict mode)
- Changed prompt from "au maximum N actualites" to "exactement N actualites"
- Schema builder now takes max_items_per_category parameter
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
LLM output occasionally contains \u0000 null bytes (e.g., "annonc\u0000...")
which PostgreSQL rejects in JSONB columns. Added sanitize_json_null_bytes()
that recursively strips null bytes from all string values before DB insert.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
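The recursive stripping can be sketched as below. To stay dependency-free, the example uses a hypothetical minimal `Json` enum in place of the JSON value type the project actually uses, so only the traversal pattern carries over.

```rust
/// Hypothetical minimal JSON value, standing in for the project's real type.
#[derive(Debug, Clone, PartialEq)]
pub enum Json {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    Array(Vec<Json>),
    Object(Vec<(String, Json)>),
}

/// Recursively removes U+0000 from every string value. Object keys are left
/// untouched, matching the "all string values" wording of the commit.
pub fn sanitize_json_null_bytes(value: &mut Json) {
    match value {
        Json::String(s) => {
            if s.contains('\0') {
                s.retain(|c| c != '\0');
            }
        }
        Json::Array(items) => items.iter_mut().for_each(sanitize_json_null_bytes),
        Json::Object(fields) => fields
            .iter_mut()
            .for_each(|(_, v)| sanitize_json_null_bytes(v)),
        Json::Null | Json::Bool(_) | Json::Number(_) => {}
    }
}
```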
The scraper client (build_scraper_client) has a 15s timeout appropriate for web
scraping, but LLM API calls, especially with web search, take 30-60s. LLM
providers now build their own reqwest client with a 120s timeout via build_llm_client().
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bugs fixed:
- resolve_model queried non-existent admin_provider_models table (use JSONB query on admin_providers)
- key_prefix VARCHAR(10) too short for 11-char prefix (migration to VARCHAR(12))
- API key test schema missing additionalProperties: false (OpenAI strict mode)
- CSP missing font-src data: directive (PDF font embedding blocked)
- Magic link URL not logged in test mode (can't verify without real email)
- Rust 1.85 Docker image too old for dependencies (bumped to 1.88)
Tests added to prevent recurrence:
- schema_meets_openai_strict_mode_requirements: validates additionalProperties on all objects
- key_prefix_full_length_stored_in_db: verifies 11-char prefix survives DB round-trip
- generate_pipeline_resolves_model_from_admin_config: exercises full generation pipeline
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add SIGTERM/Ctrl+C signal handling with graceful connection draining
- Close database pool cleanly on shutdown
- Add frontend-builder stage to Dockerfile (node:22-alpine, npm ci + build)
- Move Docker build context to project root so both frontend/ and backend/ are accessible
- Frontend dist/ copied into container at ./static/ for the backend to serve
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>