Add a user-configurable batch_size setting (default 5, range 1-20)
that controls how many articles are processed in parallel during
Phase 1 scrape+classify. Previously hardcoded to 5.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
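A minimal sketch of how the batch_size setting above could be resolved and applied. The constant and function names are hypothetical, and clamping out-of-range values (rather than rejecting them) is an assumption; only the default of 5 and the 1-20 range come from the commit message.

```rust
// Bounds taken from the commit message; the names are hypothetical.
pub const DEFAULT_BATCH_SIZE: usize = 5;
pub const MIN_BATCH_SIZE: usize = 1;
pub const MAX_BATCH_SIZE: usize = 20;

/// Resolve the user setting to a usable batch size.
/// Assumption: out-of-range values are clamped instead of rejected.
pub fn resolve_batch_size(setting: Option<usize>) -> usize {
    setting
        .unwrap_or(DEFAULT_BATCH_SIZE)
        .clamp(MIN_BATCH_SIZE, MAX_BATCH_SIZE)
}

/// Split candidate URLs into batches. Each batch would then be
/// scraped and classified in parallel before the next one starts.
pub fn plan_batches(urls: &[String], setting: Option<usize>) -> Vec<&[String]> {
    // The clamp above guarantees a non-zero chunk size, so `chunks` cannot panic.
    urls.chunks(resolve_batch_size(setting)).collect()
}
```

With 12 candidate URLs and a setting of 5, plan_batches yields batches of 5, 5, and 2.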
- Remove 10-source cap; all sources are now processed
- Increase max links per source from 10 to 15
- Extract article links in parallel (up to 5 concurrent) using JoinSet
- Shuffle candidate URLs after history filtering to interleave sources
- Add DELETE /api/v1/article-history endpoint to clear all history for a user
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
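A rough sketch of the bounded parallel link extraction and the 15-link cap described above. It uses std scoped threads in waves of 5 as a stand-in for the tokio JoinSet named in the commit, so it is not the actual implementation: a JoinSet can start a new task as soon as one finishes, while this version waits for each wave. The function name and the `extract` callback are hypothetical.

```rust
use std::thread;

// Values from the commit message; the names are hypothetical.
pub const MAX_LINKS_PER_SOURCE: usize = 15;
pub const MAX_CONCURRENT_EXTRACTIONS: usize = 5;

/// Extract links from every source, at most 5 sources at a time.
/// `extract` stands in for the real extractor (fetch + parse).
pub fn extract_all<F>(sources: &[String], extract: F) -> Vec<String>
where
    F: Fn(&str) -> Vec<String> + Sync,
{
    let extract = &extract;
    let mut all_links = Vec::new();
    // No cap on the number of sources: every source is processed.
    for wave in sources.chunks(MAX_CONCURRENT_EXTRACTIONS) {
        let wave_results: Vec<Vec<String>> = thread::scope(|scope| {
            let handles: Vec<_> = wave
                .iter()
                .map(|src| {
                    scope.spawn(move || {
                        let mut links = extract(src.as_str());
                        // Keep at most 15 links per source.
                        links.truncate(MAX_LINKS_PER_SOURCE);
                        links
                    })
                })
                .collect();
            // Join in spawn order so the output order is deterministic.
            handles
                .into_iter()
                .map(|h| h.join().expect("extractor panicked"))
                .collect()
        });
        all_links.extend(wave_results.into_iter().flatten());
    }
    all_links
}
```

The shuffle step that follows history filtering is left out here because it would need a random number source outside the standard library.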
- Add get_last_source_url() to article_history db module for source rotation
- Remove head_html field from ScrapedContent struct and scrape_url function
- Fix synthesis.rs scrape_single_article_with_llm to pass an empty string instead of the removed field
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the three-method LlmProvider trait (generate_search_pass,
generate_rewrite_pass, supports_web_search) and ProviderCapabilities
with a single call_llm method. Update all three provider implementations
(Gemini, OpenAI, Anthropic) and all callers in synthesis.rs,
source_scraper.rs, and api_keys.rs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
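A sketch of what a single-method provider trait like the one described above might look like. The commit does not show the real signature, so the request struct, the error type, the synchronous call, and the CannedProvider test double are all assumptions for illustration; the real call_llm is quite possibly async.

```rust
/// Hypothetical request shape; the real parameters are not shown in the commit.
pub struct LlmRequest {
    pub system_prompt: String,
    pub user_prompt: String,
}

/// One entry point replaces the separate search/rewrite/capability methods.
pub trait LlmProvider {
    fn name(&self) -> &str;
    fn call_llm(&self, request: &LlmRequest) -> Result<String, String>;
}

/// Test double that returns a fixed reply instead of calling a remote API.
pub struct CannedProvider {
    pub reply: String,
}

impl LlmProvider for CannedProvider {
    fn name(&self) -> &str {
        "canned"
    }

    fn call_llm(&self, request: &LlmRequest) -> Result<String, String> {
        if request.user_prompt.is_empty() {
            return Err("empty prompt".to_string());
        }
        Ok(self.reply.clone())
    }
}

/// Callers depend only on the trait, so any provider can be swapped in.
pub fn run_pass(
    provider: &dyn LlmProvider,
    system_prompt: &str,
    user_prompt: &str,
) -> Result<String, String> {
    let request = LlmRequest {
        system_prompt: system_prompt.to_string(),
        user_prompt: user_prompt.to_string(),
    };
    provider
        .call_llm(&request)
        .map_err(|e| format!("{}: {}", provider.name(), e))
}
```

The point of the single method is that callers such as the synthesis and scraping code no longer branch on provider capabilities; they build a prompt and call one function.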
- Add LlmLogs page with collapsible prompts/response sections, call-type
colored badges, and duration display
- Wire /llm-logs/:jobId route in App.tsx (lazy-loaded)
- Expose job_id in backend SynthesisListItem and frontend SynthesisListItem
type; update test fixture accordingly
- Add log-icon link next to delete button on each Home synthesis card
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add source_url field to ScrapedNewsItem and a trace_article helper that
inserts into article_history with full provenance metadata. Instrument
Phase 1 (empty content, history dedup, source diversity) and Phase 2
(homepage filter, cross-phase dedup, history dedup, empty content) so
every dropped article is recorded with its filter reason. Replace the
old insert_urls call with per-article trace_article calls for used
articles, preserving dedup semantics via url_hash.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
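An in-memory sketch of the tracing idea above: every article, used or dropped, is recorded once, keyed by a hash of its URL. The types, the field names, and the use of a HashMap in place of the article_history table are assumptions; the filter reasons mirror the ones listed in the commit message.

```rust
use std::collections::hash_map::{DefaultHasher, Entry};
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Reasons named in the commit message (the enum itself is hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FilterReason {
    EmptyContent,
    HistoryDedup,
    SourceDiversity,
    HomepageFilter,
    CrossPhaseDedup,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TraceRow {
    pub url: String,
    pub source_url: String,
    pub phase: u8,
    /// None means the article was used in the synthesis.
    pub filter_reason: Option<FilterReason>,
}

/// Stand-in for the article_history table.
#[derive(Default)]
pub struct ArticleHistory {
    rows: HashMap<u64, TraceRow>,
}

fn url_hash(url: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    url.hash(&mut hasher);
    hasher.finish()
}

impl ArticleHistory {
    /// Record one article. Returns false when the url_hash already exists,
    /// which is the assumed dedup behavior (first write wins).
    pub fn trace_article(
        &mut self,
        url: &str,
        source_url: &str,
        phase: u8,
        filter_reason: Option<FilterReason>,
    ) -> bool {
        match self.rows.entry(url_hash(url)) {
            Entry::Occupied(_) => false,
            Entry::Vacant(slot) => {
                slot.insert(TraceRow {
                    url: url.to_string(),
                    source_url: source_url.to_string(),
                    phase,
                    filter_reason,
                });
                true
            }
        }
    }

    pub fn get(&self, url: &str) -> Option<&TraceRow> {
        self.rows.get(&url_hash(url))
    }

    pub fn len(&self) -> usize {
        self.rows.len()
    }
}
```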
Accumulates overflow articles from both classification phases and redistributes
them into the Autre category when total articles fall below 75% of the configured
max, respecting per-source diversity limits.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
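The fill-up rule above can be sketched roughly as follows. The function name, the Article type, and the data layout are hypothetical. The 0.75 value is inferred from the "75%" in the message, and filling until the configured max is reached (rather than stopping at the threshold) is an assumption.

```rust
use std::collections::HashMap;

// 0.75 is inferred from the "75%" in the commit message.
pub const SYNTHESIS_MIN_FILL_RATIO: f64 = 0.75;
pub const FALLBACK_CATEGORY: &str = "Autre";

#[derive(Debug, Clone, PartialEq)]
pub struct Article {
    pub url: String,
    pub domain: String,
}

/// Move overflow articles into "Autre" when the synthesis is under-filled.
/// Returns how many articles were added.
pub fn fill_up(
    categories: &mut HashMap<String, Vec<Article>>,
    overflow: Vec<Article>,
    max_total: usize,
    max_per_source: usize,
) -> usize {
    let mut total: usize = categories.values().map(Vec::len).sum();
    // Only fill when the total is below 75% of the configured max.
    if total as f64 >= max_total as f64 * SYNTHESIS_MIN_FILL_RATIO {
        return 0;
    }

    // Count what each source already contributes across all categories.
    let mut per_domain: HashMap<String, usize> = HashMap::new();
    for article in categories.values().flatten() {
        *per_domain.entry(article.domain.clone()).or_insert(0) += 1;
    }

    let mut added = 0;
    for article in overflow {
        if total >= max_total {
            break;
        }
        let count = per_domain.entry(article.domain.clone()).or_insert(0);
        if *count >= max_per_source {
            continue; // respect the per-source diversity limit
        }
        *count += 1;
        categories
            .entry(FALLBACK_CATEGORY.to_string())
            .or_default()
            .push(article);
        total += 1;
        added += 1;
    }
    added
}
```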
Returns a (result, overflow) tuple so callers can access articles that
could not fit in any category or Autre. Also adds the
SYNTHESIS_MIN_FILL_RATIO constant for the upcoming fill-up logic.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The rewrite pass shared the search pass schema, which enforced minItems/maxItems
equal to max_items_per_category. After filter_empty_scraped_articles removed
old/failed articles, the scraped data had fewer items than the schema required,
causing the LLM to duplicate content to fill the quota.
Now build_rewrite_schema counts actual items per category from the scraped data
and sets minItems/maxItems accordingly. Also removed dead domain_counts variable.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
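To illustrate the fix above, here is a small sketch that derives minItems/maxItems from the scraped data instead of from max_items_per_category. The function names and the map-of-vectors layout are hypothetical, and the JSON fragment is simplified; the real build_rewrite_schema is not shown in the commit.

```rust
use std::collections::BTreeMap;

/// For each category, return (minItems, maxItems), both equal to the
/// number of articles that actually survived scraping and filtering.
pub fn rewrite_bounds(
    scraped: &BTreeMap<String, Vec<String>>,
) -> BTreeMap<String, (usize, usize)> {
    scraped
        .iter()
        .map(|(category, items)| (category.clone(), (items.len(), items.len())))
        .collect()
}

/// Render a simplified JSON Schema fragment for one category's array.
pub fn array_schema_fragment(item_count: usize) -> String {
    format!(
        r#"{{"type":"array","minItems":{n},"maxItems":{n}}}"#,
        n = item_count
    )
}
```

Because the bounds match the real item count, the model is never asked for more items than it was given, which removes the incentive to duplicate content.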
- filter_empty_scraped_articles: removes articles with empty scraped content
(too old, soft 404, scrape failure) before the rewrite pass, preventing
empty articles in the final synthesis
- restore_scraped_urls: already existed, now has unit tests
- E2E test: added assertions for no Wikipedia URLs, no empty summaries,
and updated settings payload with new fields (max_articles_per_source,
source_diversity_window)
- 4 new unit tests for filter_empty_scraped_articles and restore_scraped_urls
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
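A minimal sketch of the empty-content filter described above. The struct and function name are hypothetical, and treating whitespace-only content as empty is an assumption.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct ScrapedArticle {
    pub url: String,
    pub content: String,
}

/// Remove articles whose scraped content is empty (too old, soft 404,
/// scrape failure). Returns how many articles were dropped.
pub fn drop_empty_articles(articles: &mut Vec<ScrapedArticle>) -> usize {
    let before = articles.len();
    // Assumption: whitespace-only content counts as empty.
    articles.retain(|a| !a.content.trim().is_empty());
    before - articles.len()
}
```

Running this before the rewrite pass means the rewrite prompt only ever contains articles with real content.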
The rewrite pass can replace validated URLs with hallucinated ones (Wikipedia,
corporate sites) despite being instructed to preserve them. After the rewrite,
restore_scraped_urls() replaces each article's URL with the original scraped
URL by matching on position (category + item index). Logs when a URL is
restored so hallucination patterns can be monitored.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
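The positional URL restore above could look roughly like this. The types and the function name are hypothetical, and the sketch returns its log lines instead of writing to a logger so the behavior is easy to inspect.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Item {
    pub url: String,
    pub summary: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Category {
    pub name: String,
    pub items: Vec<Item>,
}

/// Overwrite each rewritten URL with the scraped URL at the same
/// position (category name + item index). Returns one message per
/// restored URL so hallucination patterns can be monitored.
pub fn restore_urls(rewritten: &mut [Category], scraped: &[Category]) -> Vec<String> {
    let mut log = Vec::new();
    for cat in rewritten.iter_mut() {
        // Skip categories that have no counterpart in the scraped data.
        if let Some(orig_cat) = scraped.iter().find(|c| c.name == cat.name) {
            for (idx, item) in cat.items.iter_mut().enumerate() {
                if let Some(orig) = orig_cat.items.get(idx) {
                    if item.url != orig.url {
                        log.push(format!(
                            "restored {}[{}]: {} -> {}",
                            cat.name, idx, item.url, orig.url
                        ));
                        item.url = orig.url.clone();
                    }
                }
            }
        }
    }
    log
}
```

Matching by position relies on the rewrite pass keeping the item order within each category; if the model reorders items, a URL could be attached to the wrong summary.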
OpenAI's default output limit (4096 tokens) was too low for structured
synthesis output with multiple categories and articles per category,
causing truncated JSON. Set the limit to 16384 for both OpenAI APIs (Responses +
Chat Completions) and Gemini. Anthropic already had 16384.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add max_articles_per_source setting (default 3, range 1-10) with migration,
backend model, DB queries, and frontend number input
- Add limit_articles_per_source filter: spreads articles across categories
(1 per domain per category first), then fills remaining slots up to the limit
- Add dedup_by_url filter: removes duplicate URLs across categories (case-insensitive)
- Pipeline order: parse → filter_homepage → dedup_by_url → limit_per_source → scrape
- 10 new unit tests covering spread, cap enforcement, dedup, and edge cases
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
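A sketch of the two filters above, run in the order the pipeline lists them (dedup_by_url, then limit_articles_per_source). The Article type, the signatures, and the naive domain parsing are assumptions; the two-pass spread follows the description in the commit message.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, Clone, PartialEq)]
pub struct Article {
    pub category: String,
    pub url: String,
}

/// Naive host extraction, for illustration only (no real URL parsing).
pub fn domain_of(url: &str) -> String {
    let rest = url.split_once("://").map_or(url, |(_, rest)| rest);
    let host = rest.split('/').next().unwrap_or("");
    host.to_lowercase().trim_start_matches("www.").to_string()
}

/// Drop duplicate URLs across categories, comparing case-insensitively.
pub fn dedup_by_url(articles: Vec<Article>) -> Vec<Article> {
    let mut seen = HashSet::new();
    articles
        .into_iter()
        .filter(|a| seen.insert(a.url.to_lowercase()))
        .collect()
}

/// Cap articles per domain, spreading them across categories first.
pub fn limit_articles_per_source(articles: Vec<Article>, max_per_source: usize) -> Vec<Article> {
    let mut per_domain: HashMap<String, usize> = HashMap::new();
    let mut seen_pairs: HashSet<(String, String)> = HashSet::new();
    let mut keep = vec![false; articles.len()];

    // Pass 1: at most one article per (domain, category) pair.
    for (i, a) in articles.iter().enumerate() {
        let domain = domain_of(&a.url);
        let count = per_domain.entry(domain.clone()).or_insert(0);
        if *count < max_per_source && seen_pairs.insert((domain, a.category.clone())) {
            *count += 1;
            keep[i] = true;
        }
    }
    // Pass 2: fill remaining slots for each domain up to the limit.
    for (i, a) in articles.iter().enumerate() {
        if keep[i] {
            continue;
        }
        let count = per_domain.entry(domain_of(&a.url)).or_insert(0);
        if *count < max_per_source {
            *count += 1;
            keep[i] = true;
        }
    }
    // Preserve the original article order.
    articles
        .into_iter()
        .zip(keep)
        .filter(|(_, k)| *k)
        .map(|(a, _)| a)
        .collect()
}
```

The first pass is what makes a source appear in several categories before it gets a second article in any one of them; a single-pass cap would instead keep the first articles it meets, which tend to cluster in one category.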