177 Commits (c45050ce3ca9328bff0d57d2e04d83bd6239674b)

Author SHA1 Message Date
oabrivard 5b67ef2e51 fix: update integration and E2E test fixtures for summary_length, source_extraction_window, and NewsItem.date
Add missing summary_length and source_extraction_window fields to all
settings JSON payloads in api_settings_test.rs. The pipeline_test.rs,
generation-live.spec.ts, and api_syntheses_test.rs already had correct
fixtures or use JSON literals that are unaffected by the optional date field.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 37d17e577a feat: restructure Phase 1 into windowed source extraction waves
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 0f1b0306e4 feat: add source_extraction_window setting (default 3)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard c5a56c8fb8 feat: save publication date in article history and show in synthesis
- Add published_date column to article_history table
- Add date field to NewsItem (serialized in synthesis JSONB)
- Pass LLM-extracted date through ArticleTrace to article history
- Display date below article title in SynthesisDetail page

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard de25a08d51 feat: LLM extracts publication date as fallback for article age filtering
The classify prompt now asks the LLM to return a date field (YYYY-MM-DD).
When the scraper couldn't find a date, the LLM-extracted date is used to
filter articles that exceed max_age_days.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 91272ddfc4 feat: dynamic summary length and body snippet size based on setting
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 1b63afd12a feat: add summary_length setting (1=court, 2=moyen, 3=detaille)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 0874650a7f fix: pipeline tests use wiremock URLs + skip SSRF for localhost
- Add SKIP_SSRF_CHECK env var to bypass SSRF in test environments
- Use wiremock server as source URL (same domain as article URLs)
- Add source page mock to wiremock setup
- Set SKIP_SSRF_CHECK=1 in integration test script
- Fix unused import warning

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard a158f14311 fix: don't poll SSE stream in model resolution test
The SSE stream blocks until the generation completes or times out
(15 min). With a fake API key, the LLM call hangs for 120s before
failing. Just verify the 202 trigger succeeded — that confirms
model resolution and provider creation worked.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard eadfbc000b fix: expect 201 Created for source creation in syntheses test
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 54c637647b fix: add missing fields to syntheses test settings payload
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 7ec491b6ac fix: update settings test payloads for new required fields + fix unused var warning
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard a0a8f72caa fix: don't join Drop cleanup thread — prevents deadlock in tokio tests
The Drop impl spawned a thread with a new tokio runtime and called
.join(), which blocked the test thread. The spawned thread's block_on
deadlocked when pool.close() tried to communicate with connections
owned by the outer tokio runtime. Removing .join() makes cleanup
fire-and-forget, avoiding the deadlock.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 243558b950 debug: add full_app_health_check test with logging 3 months ago
oabrivard aee70b37d4 fix: use docker-compose.test.yml for integration test DB
Rewrite run-integration-tests.sh to use the e2e docker-compose config
(Postgres on port 5433). Add --db-check flag for connectivity debugging.
Remove build_test_router (reverted to build_router). Keep minimal_test
for oneshot debugging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 5fa060fadc fix: use invalid session token for admin auth rejection tests
Same fix as other test files — avoids oneshot() hang with no cookie.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard bd4f101d16 fix: use invalid session token instead of no cookie for auth rejection tests
Unauthenticated requests (no Cookie header) hang with oneshot() in tests.
Using an invalid session token achieves the same 401 result without hanging.
3 months ago
oabrivard 53813007c6 fix: use lightweight test router without SPA fallback and TraceLayer
Unauthenticated requests were hanging in integration tests due to
tower middleware layers interacting with oneshot(). Add build_test_router()
that only includes API routes + CSRF middleware.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 7cbafdfb31 fix: create test static dir to prevent ServeDir/ServeFile hang
The SPA fallback uses ServeDir/ServeFile which can hang when the
directory doesn't exist. Create it in TestApp::new() with a minimal
index.html.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard bb2209e425 fix: update admin tests for models_scraping/models_websearch split
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 370e033506 feat: add pipeline integration tests with MockLlmProvider and wiremock
Add three integration tests that exercise the synthesis generation
pipeline end-to-end using MockLlmProvider and wiremock for HTTP mocking:
- phase1_with_llm_link_extraction_classifies_articles
- phase2_search_fills_gaps_when_no_sources
- category_overflow_spills_to_autre

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard ccecaa2d13 refactor: add provider_override for pipeline dependency injection
Adds an optional LlmProvider override to run_generation and
run_generation_inner, allowing tests to inject a mock provider without
touching real credentials or the provider-resolution path. Makes
run_generation_inner pub so integration tests can call it directly.
Production callers pass None and behaviour is unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 17e054c257 feat: add MockLlmProvider for integration testing
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard ecf95ffe35 fix: update test config for session_secret removal and Arc master key
Remove session_secret field (no longer in AppConfig), wrap
master_encryption_key in Arc<String>, and pass a generated job_id
to db::syntheses::create which now requires it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 4bbdd5c4d1 perf: batch article history INSERTs to reduce DB round-trips
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard f37e0b42a0 perf: use Arc for immutable values in pipeline to reduce cloning
Wrap `model_research` (String), `classify_schema` (Value), and
`classification_categories` (Vec<String>) in Arc before the batch
loops so spawned tasks clone a cheap pointer instead of the full
heap data on every iteration. Also removes the redundant intermediate
`mdl`/`class_sys`/`class_user` bindings in both classify loops.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 60494aeceb perf: cache CSS selectors with LazyLock to avoid re-parsing
Replace runtime Selector::parse calls on static strings with module-level
LazyLock statics in source_scraper.rs (ANCHOR_SELECTOR) and scraper.rs
(SEL_TITLE, SEL_H1, SEL_BODY), so each selector is compiled once at
first use instead of on every function call.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 69c1688bc7 chore: remove SESSION_SECRET and wrap master_encryption_key in Arc
SESSION_SECRET was loaded and validated but never used anywhere in the
codebase. master_encryption_key is now wrapped in Arc<String> to avoid
cloning the secret string on every AppState clone.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard f44aa44c48 refactor: replace trace_article 11 parameters with ArticleTrace struct
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard f5466a6bd5 refactor: extract shared LLM error mapping to reduce duplication
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 2036c12b24 refactor: eliminate SettingsResponse struct, serialize UserSettings directly
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard e056ef9d3e refactor: extract assign_category and filter_phase2_url helpers from synthesis pipeline
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 24d53a01d1 fix: block SSRF via IPv4-mapped IPv6 and add check to source page fetching
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 93003229f1 fix: add periodic expired session cleanup (hourly)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 347558a278 fix: atomic job creation, 15min timeout, and panic handling
- Replace iterating DashMap check with atomic DashSet insert in create_job to
  eliminate the race condition where double-click could create two concurrent
  jobs for the same user
- Add release_user method called at end of generation task (normal, timeout,
  and panic paths) so the generating slot is always freed
- Wrap run_generation in tokio::time::timeout(900s) to prevent hung LLM calls
  from blocking the generation slot forever
- Spawn a second task to await the JoinHandle and call release_user + send
  error event if the generation task panics, preventing SSE clients from
  hanging indefinitely
- Update cleanup_expired to also remove users from generating_users set
- Update tests to call release_user after completion/error to match new contract

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 59932589cc fix: prevent UTF-8 panic in error message truncation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard e74a1850bf fix: log source URL in link_extraction LLM call logs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard a968fdc308 fix: allow brave_search as valid API key provider
Split VALID_PROVIDERS (LLM only) from VALID_API_KEY_PROVIDERS (includes
brave_search) so Brave keys can be stored without allowing brave_search
as an admin LLM provider.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard f124b056fe feat: add Brave Search Phase 2 pipeline path
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard e05c2ae75a feat: handle brave_search in API key test endpoint
Add a branch in test_key to route brave_search provider to
crate::services::brave_search::test_api_key instead of the LLM factory.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard f414ff0f58 feat: add use_brave_search setting
Add use_brave_search boolean field to all settings structs, DB layer,
SQL queries, frontend types, i18n labels, and test fixtures following
the same pattern as use_llm_for_source_links.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard fa03c60339 feat: add Brave Search API client module
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 41109b3d93 feat: send structured link pairs to LLM instead of raw HTML body
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard a5332f0996 feat: add article_url to LLM call logs for classify tracing
Adds an optional article_url column to llm_call_log so classify_summarize
entries are traceable back to their source article in the LLM Logs UI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard b062e81218 fix: remove personalized sources from Phase 2 web search prompt
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 4c6381b09a feat: add batch_size setting for Phase 1 parallelism
Add a user-configurable batch_size setting (default 5, range 1-20)
that controls how many articles are processed in parallel during
Phase 1 scrape+classify. Previously hardcoded to 5.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 7cd867c650 fix: resolve all clippy warnings
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard fa9375233e fix: remove 3 compiler warnings (unreachable code, unused variables) 3 months ago
oabrivard 14b0a0b7e8 refactor: LLM link extraction uses body only (no head), increased to 12000 chars 3 months ago
oabrivard 3353e5261f feat: rate limiter waits instead of failing — sleeps until window passes (max 60s) 3 months ago
oabrivard ed399e9a6e feat: parallelize Phase 1 scrape+classify in batches of 5 3 months ago
oabrivard a5f4239157 fix: distinguish filtered_too_old from filtered_empty in article tracing 3 months ago
oabrivard a760220d44 fix: log LLM calls for source link extraction in llm_call_log 3 months ago
oabrivard 8d232c1ade feat: split model selection — scraping vs websearch with GPT-5 models
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 7f7d584314 feat: parallel source extraction, shuffle candidates, clear history endpoint
- Remove 10-source cap; all sources are now processed
- Increase max links per source from 10 to 15
- Extract article links in parallel (up to 5 concurrent) using JoinSet
- Shuffle candidate URLs after history filtering to interleave sources
- Add DELETE /api/v1/article-history endpoint to clear all history for a user

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 7a8427316c feat: rewrite synthesis pipeline — per-article classify/summarize, no rewrite pass
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 0b180eb75c refactor: remove old classification, rewrite, and article extraction prompts/schemas
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard bb716b5dc2 feat: add get_last_source_url + remove head_html from ScrapedContent
- Add get_last_source_url() to article_history db module for source rotation
- Remove head_html field from ScrapedContent struct and scrape_url function
- Fix synthesis.rs scrape_single_article_with_llm to pass empty string instead of removed field

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard b2dbc3847a feat: add per-article classify/summarize prompt and schema
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 825b793387 feat: drop source_diversity_window and use_llm_for_article_extraction settings
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard a2fe3f3310 feat: simplify LlmProvider trait to single call_llm method
Replace the three-method LlmProvider trait (generate_search_pass,
generate_rewrite_pass, supports_web_search) and ProviderCapabilities
with a single call_llm method. Update all three provider implementations
(Gemini, OpenAI, Anthropic) and all callers in synthesis.rs,
source_scraper.rs, and api_keys.rs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard f9023cff7e feat: LLM logs viewer page + log button on Home synthesis list
- Add LlmLogs page with collapsible prompts/response sections, call-type
  colored badges, and duration display
- Wire /llm-logs/:jobId route in App.tsx (lazy-loaded)
- Expose job_id in backend SynthesisListItem and frontend SynthesisListItem
  type; update test fixture accordingly
- Add log-icon link next to delete button on each Home synthesis card

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard dafec2591b feat: API endpoint for LLM call logs by job_id
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 9fffde8312 feat: log LLM calls with timing at search, classification, and rewrite steps
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard b2b0b286c0 feat: create llm_call_log table + DB module
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 55fe828e58 feat: API endpoints for article history listing and provenance
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard b9003cde54 feat: instrument pipeline with article tracing at every filtering step
Add source_url field to ScrapedNewsItem and a trace_article helper that
inserts into article_history with full provenance metadata.  Instrument
Phase 1 (empty content, history dedup, source diversity) and Phase 2
(homepage filter, cross-phase dedup, history dedup, empty content) so
every dropped article is recorded with its filter reason.  Replace the
old insert_urls call with per-article trace_article calls for used
articles, preserving dedup semantics via url_hash.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 0e2c69edf7 feat: save job_id on syntheses for provenance lookup
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard eba721266f feat: article history entry struct + insert/query/cleanup functions
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard d7afd08eaf feat: enrich article_history with tracing metadata + syntheses.job_id 3 months ago
oabrivard 7cbb2853ce feat: Autre fill-up to 75% synthesis target with source diversity enforcement
Accumulates overflow articles from both classification phases and redistributes
them into the Autre category when total articles fall below 75% of the configured
max, respecting per-source diversity limits.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard c3e6103ef1 feat: parse_classification_response collects overflow articles
Returns a (result, overflow) tuple so callers can access articles that
could not fit in any category or Autre. Also adds the
SYNTHESIS_MIN_FILL_RATIO constant for the upcoming fill-up logic.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard cea723f7d7 test: update E2E and integration tests with article_history_days setting 3 months ago
oabrivard 65eb6004d2 feat: article history filtering in pipeline — cleanup, Phase 1/2 filter, retry, insert
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 0a87b7ed8f feat: add normalize_article_url and hash_article_url utilities
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 5a928aa990 feat: add article_history DB module (check, insert, cleanup)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard c271c240a2 feat: add article_history table and article_history_days setting
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 8e06357b47 test: update integration test with LLM scraping settings 3 months ago
oabrivard 8a061c98db feat: LLM-assisted article extraction with Arc provider, concurrency control, and progress
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 357f06e405 feat: LLM-assisted source link extraction with heuristic fallback
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard e6e8aa1eeb feat: add LLM prompts and schemas for link and article extraction
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 23f121a58d feat: ScrapedContent url+head_html fields, Arc<dyn LlmProvider>, 3-tuple scrape returns
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard e483789d1b feat: add use_llm_for_source_links and use_llm_for_article_extraction settings
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 53ecce84b0 feat: two-phase generation pipeline — personalized sources first, web search fallback
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 51ea032838 feat: add scrape_flat_urls helper and gap-aware search prompt
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard d508b5b4ab feat: Autre category support in rewrite schema, final sections, URL restore + remove dead code
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard ba7024e280 feat: add classification response parsing with category filling and Autre fallback
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 104b6a0d7b feat: add classification prompt and schema for article categorization
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard c06b5ba454 feat: add source_scraper module for extracting article links from source pages
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 45e5ee8a7d fix: rewrite pass schema uses actual scraped item counts, not max setting
The rewrite pass shared the search pass schema which enforced minItems/maxItems
equal to max_items_per_category. After filter_empty_scraped_articles removed
old/failed articles, the scraped data had fewer items than the schema required,
causing the LLM to duplicate content to fill the quota.

Now build_rewrite_schema counts actual items per category from the scraped data
and sets minItems/maxItems accordingly. Also removed dead domain_counts variable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 13894a8f50 fix: filter empty scraped articles + restore URLs after rewrite + E2E assertions
- filter_empty_scraped_articles: removes articles with empty scraped content
  (too old, soft 404, scrape failure) before the rewrite pass, preventing
  empty articles in the final synthesis
- restore_scraped_urls: already existed, now has unit tests
- E2E test: added assertions for no Wikipedia URLs, no empty summaries,
  and updated settings payload with new fields (max_articles_per_source,
  source_diversity_window)
- 4 new unit tests for filter_empty + restore_scraped_urls

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard a9be1ce435 fix: restore scraped URLs after LLM rewrite pass to prevent hallucination
The rewrite pass can replace validated URLs with hallucinated ones (Wikipedia,
corporate sites) despite being instructed to preserve them. After the rewrite,
restore_scraped_urls() replaces each article's URL with the original scraped
URL by matching on position (category + item index). Logs when a URL is
restored so hallucination patterns can be monitored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 8a18b70aff fix: set max output tokens to 16384 for all LLM providers
OpenAI's default output limit (4096 tokens) was too low for structured
synthesis output with multiple categories and articles per category,
causing truncated JSON. Set 16384 for both OpenAI APIs (Responses +
Chat Completions) and Gemini. Anthropic already had 16384.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 55c2b050b3 feat: extract recent domains and pass to search prompt for diversity
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard 3f6ad9853c feat: build_search_prompt accepts recent_domains for source diversity
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard a31915d3ce feat: add source_diversity_window setting (migration + model + DB + validation tests)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard b558619d10 feat: source diversity limit + URL deduplication in generation pipeline
- Add max_articles_per_source setting (default 3, range 1-10) with migration,
  backend model, DB queries, and frontend number input
- Add limit_articles_per_source filter: spreads articles across categories
  (1 per domain per category first), then fills remaining slots up to the limit
- Add dedup_by_url filter: removes duplicate URLs across categories (case-insensitive)
- Pipeline order: parse → filter_homepage → dedup_by_url → limit_per_source → scrape
- 10 new unit tests covering spread, cap enforcement, dedup, and edge cases

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago
oabrivard 6819c7193c feat: add limit_articles_per_source filter with unit tests
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard c1ee79bcf6 feat: add max_articles_per_source setting (migration + model + DB)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 months ago
oabrivard a3f4c3b42f fix: always run scrape+rewrite pass to prevent hallucinated URLs
The adaptive pipeline skipped the scrape+rewrite pass when the LLM's search
results had URLs starting with "http". But LLMs hallucinate plausible URLs
(Wikipedia, corporate sites) that pass the http check but aren't actual source
articles. The scrape pass catches these by fetching each URL and validating
the content exists. Always running the full pipeline ensures URL integrity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 months ago