Add missing summary_length and source_extraction_window fields to all
settings JSON payloads in api_settings_test.rs. The pipeline_test.rs,
generation-live.spec.ts, and api_syntheses_test.rs already had correct
fixtures or use JSON literals that are unaffected by the optional date field.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add published_date column to article_history table
- Add date field to NewsItem (serialized in synthesis JSONB)
- Pass LLM-extracted date through ArticleTrace to article history
- Display date below article title in SynthesisDetail page
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The classify prompt now asks the LLM to return a date field (YYYY-MM-DD).
When the scraper couldn't find a date, the LLM-extracted date is used to
filter articles that exceed max_age_days.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add SKIP_SSRF_CHECK env var to bypass SSRF in test environments
- Use wiremock server as source URL (same domain as article URLs)
- Add source page mock to wiremock setup
- Set SKIP_SSRF_CHECK=1 in integration test script
- Fix unused import warning
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The SSE stream blocks until the generation completes or times out
(15 min). With a fake API key, the LLM call hangs for 120s before
failing. Just verify the 202 trigger succeeded — that confirms
model resolution and provider creation worked.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Drop impl spawned a thread with a new tokio runtime and called
.join(), which blocked the test thread. The spawned thread's block_on
deadlocked when pool.close() tried to communicate with connections
owned by the outer tokio runtime. Removing .join() makes cleanup
fire-and-forget, avoiding the deadlock.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rewrite run-integration-tests.sh to use the e2e docker-compose config
(Postgres on port 5433). Add --db-check flag for connectivity debugging.
Remove build_test_router (reverted to build_router). Keep minimal_test
for oneshot debugging.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Unauthenticated requests were hanging in integration tests due to
tower middleware layers interacting with oneshot(). Add build_test_router()
that only includes API routes + CSRF middleware.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The SPA fallback uses ServeDir/ServeFile which can hang when the
directory doesn't exist. Create it in TestApp::new() with a minimal
index.html.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add three integration tests that exercise the synthesis generation
pipeline end-to-end using MockLlmProvider and wiremock for HTTP mocking:
- phase1_with_llm_link_extraction_classifies_articles
- phase2_search_fills_gaps_when_no_sources
- category_overflow_spills_to_autre
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds an optional LlmProvider override to run_generation and
run_generation_inner, allowing tests to inject a mock provider without
touching real credentials or the provider-resolution path. Makes
run_generation_inner pub so integration tests can call it directly.
Production callers pass None and behaviour is unchanged.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove session_secret field (no longer in AppConfig), wrap
master_encryption_key in Arc<String>, and pass a generated job_id
to db::syntheses::create which now requires it.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wrap `model_research` (String), `classify_schema` (Value), and
`classification_categories` (Vec<String>) in Arc before the batch
loops so spawned tasks clone a cheap pointer instead of the full
heap data on every iteration. Also removes the redundant intermediate
`mdl`/`class_sys`/`class_user` bindings in both classify loops.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace runtime Selector::parse calls on static strings with module-level
LazyLock statics in source_scraper.rs (ANCHOR_SELECTOR) and scraper.rs
(SEL_TITLE, SEL_H1, SEL_BODY), so each selector is compiled once at
first use instead of on every function call.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SESSION_SECRET was loaded and validated but never used anywhere in the
codebase. master_encryption_key is now wrapped in Arc<String> to avoid
cloning the secret string on every AppState clone.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace iterating DashMap check with atomic DashSet insert in create_job to
eliminate the race condition where double-click could create two concurrent
jobs for the same user
- Add release_user method called at end of generation task (normal, timeout,
and panic paths) so the generating slot is always freed
- Wrap run_generation in tokio::time::timeout(900s) to prevent hung LLM calls
from blocking the generation slot forever
- Spawn a second task to await the JoinHandle and call release_user + send
error event if the generation task panics, preventing SSE clients from
hanging indefinitely
- Update cleanup_expired to also remove users from generating_users set
- Update tests to call release_user after completion/error to match new contract
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Split VALID_PROVIDERS (LLM only) from VALID_API_KEY_PROVIDERS (includes
brave_search) so Brave keys can be stored without allowing brave_search
as an admin LLM provider.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a branch in test_key to route brave_search provider to
crate::services::brave_search::test_api_key instead of the LLM factory.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add use_brave_search boolean field to all settings structs, DB layer,
SQL queries, frontend types, i18n labels, and test fixtures following
the same pattern as use_llm_for_source_links.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds an optional article_url column to llm_call_log so classify_summarize
entries are traceable back to their source article in the LLM Logs UI.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add a user-configurable batch_size setting (default 5, range 1-20)
that controls how many articles are processed in parallel during
Phase 1 scrape+classify. Previously hardcoded to 5.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove 10-source cap; all sources are now processed
- Increase max links per source from 10 to 15
- Extract article links in parallel (up to 5 concurrent) using JoinSet
- Shuffle candidate URLs after history filtering to interleave sources
- Add DELETE /api/v1/article-history endpoint to clear all history for a user
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add get_last_source_url() to article_history db module for source rotation
- Remove head_html field from ScrapedContent struct and scrape_url function
- Fix synthesis.rs scrape_single_article_with_llm to pass empty string instead of removed field
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the three-method LlmProvider trait (generate_search_pass,
generate_rewrite_pass, supports_web_search) and ProviderCapabilities
with a single call_llm method. Update all three provider implementations
(Gemini, OpenAI, Anthropic) and all callers in synthesis.rs,
source_scraper.rs, and api_keys.rs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add LlmLogs page with collapsible prompts/response sections, call-type
colored badges, and duration display
- Wire /llm-logs/:jobId route in App.tsx (lazy-loaded)
- Expose job_id in backend SynthesisListItem and frontend SynthesisListItem
type; update test fixture accordingly
- Add log-icon link next to delete button on each Home synthesis card
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add source_url field to ScrapedNewsItem and a trace_article helper that
inserts into article_history with full provenance metadata. Instrument
Phase 1 (empty content, history dedup, source diversity) and Phase 2
(homepage filter, cross-phase dedup, history dedup, empty content) so
every dropped article is recorded with its filter reason. Replace the
old insert_urls call with per-article trace_article calls for used
articles, preserving dedup semantics via url_hash.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Accumulates overflow articles from both classification phases and redistributes
them into the Autre category when total articles fall below 75% of the configured
max, respecting per-source diversity limits.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Returns a (result, overflow) tuple so callers can access articles that
could not fit in any category or Autre. Also adds the
SYNTHESIS_MIN_FILL_RATIO constant for the upcoming fill-up logic.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The rewrite pass shared the search pass schema which enforced minItems/maxItems
equal to max_items_per_category. After filter_empty_scraped_articles removed
old/failed articles, the scraped data had fewer items than the schema required,
causing the LLM to duplicate content to fill the quota.
Now build_rewrite_schema counts actual items per category from the scraped data
and sets minItems/maxItems accordingly. Also removed dead domain_counts variable.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- filter_empty_scraped_articles: removes articles with empty scraped content
(too old, soft 404, scrape failure) before the rewrite pass, preventing
empty articles in the final synthesis
- restore_scraped_urls: already existed, now has unit tests
- E2E test: added assertions for no Wikipedia URLs, no empty summaries,
and updated settings payload with new fields (max_articles_per_source,
source_diversity_window)
- 4 new unit tests for filter_empty + restore_scraped_urls
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The rewrite pass can replace validated URLs with hallucinated ones (Wikipedia,
corporate sites) despite being instructed to preserve them. After the rewrite,
restore_scraped_urls() replaces each article's URL with the original scraped
URL by matching on position (category + item index). Logs when a URL is
restored so hallucination patterns can be monitored.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OpenAI's default output limit (4096 tokens) was too low for structured
synthesis output with multiple categories and articles per category,
causing truncated JSON. Set 16384 for both OpenAI APIs (Responses +
Chat Completions) and Gemini. Anthropic already had 16384.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add max_articles_per_source setting (default 3, range 1-10) with migration,
backend model, DB queries, and frontend number input
- Add limit_articles_per_source filter: spreads articles across categories
(1 per domain per category first), then fills remaining slots up to the limit
- Add dedup_by_url filter: removes duplicate URLs across categories (case-insensitive)
- Pipeline order: parse → filter_homepage → dedup_by_url → limit_per_source → scrape
- 10 new unit tests covering spread, cap enforcement, dedup, and edge cases
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The adaptive pipeline skipped the scrape+rewrite pass when the LLM's search
results had URLs starting with "http". But LLMs hallucinate plausible URLs
(Wikipedia, corporate sites) that pass the http check but aren't actual source
articles. The scrape pass catches these by fetching each URL and validating
the content exists. Always running the full pipeline ensures URL integrity.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>