Design: Pipeline Improvements — Web Search, LLM Logs, Link Extraction
Date: 2026-03-25
Scope: Three independent improvements to the synthesis pipeline
1. Remove personalized sources from web search prompt
Context
build_search_prompt receives &[Source] and injects personalized source URLs into the Phase 2 web search prompt. Phase 1 already handles personalized sources via scraping, so including them again in Phase 2 biases the web search away from discovering new content.
Change
In synthesis.rs, pass &[] instead of &sources when calling build_search_prompt for Phase 2. The function signature is unchanged — Phase 2 will do a pure Google search based on theme, categories, and gap counts only.
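The call-site change can be sketched as follows. This is a minimal illustration, not the project's code: the Source struct, the body of build_search_prompt, its theme and categories parameters, and the phase2_prompt wrapper are all assumptions (gap counts are omitted for brevity). Only the &[] argument reflects the change described above.

```rust
/// Simplified stand-in for the real Source type (the field is an assumption).
#[derive(Debug, Clone)]
pub struct Source {
    pub url: String,
}

/// Hypothetical stand-in for build_search_prompt. The real body and parameter
/// list are not shown in this document; only the `sources` slice matters here.
pub fn build_search_prompt(theme: &str, categories: &[&str], sources: &[Source]) -> String {
    let mut prompt = format!("Theme: {}\nCategories: {}\n", theme, categories.join(", "));
    // Personalized source URLs are injected only when the slice is non-empty.
    if !sources.is_empty() {
        prompt.push_str("Preferred sources:\n");
        for source in sources {
            prompt.push_str(&format!("- {}\n", source.url));
        }
    }
    prompt
}

/// Hypothetical Phase 2 call site: the personalized sources are still in
/// scope, but an empty slice is passed so they cannot bias the web search.
pub fn phase2_prompt(theme: &str, categories: &[&str], _sources: &[Source]) -> String {
    build_search_prompt(theme, categories, &[])
}
```

Because the signature stays the same, any other caller that still wants sources in its prompt is unaffected.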
Files to modify
- backend/src/services/synthesis.rs: pass &[] for sources in the Phase 2 build_search_prompt call
2. Add article_url to LLM call logs
Context
The llm_call_log table records every LLM call during synthesis generation but has no field linking a classify_summarize call to the specific article URL being classified. To see which article a classify call relates to, you must cross-reference article_history — cumbersome for debugging.
Changes
Migration: Add nullable article_url TEXT column to llm_call_log.
Backend:
- llm_call_log::insert: add article_url: Option<&str> parameter, bind it in the INSERT
- LlmCallLogRow: add article_url: Option<String> field, update SELECT in list_by_job_id to include article_url
- log_llm_call helper in synthesis.rs: add article_url: Option<&str> parameter, pass through to insert
- The classify_summarize call in synthesis.rs calls insert directly (not via log_llm_call); update it to pass the article URL
- The link_extraction call in source_scraper.rs also calls insert directly; update it to pass None
- All other call sites via log_llm_call (search) pass None
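The pass-through described in the list above can be sketched with an in-memory stand-in for the table. Everything here is illustrative: the real insert binds values in a SQL INSERT and takes more parameters, and the Vec-backed table and the call_type field are assumptions made to keep the example self-contained.

```rust
/// In-memory stand-in for one row of llm_call_log (only the fields needed here).
#[derive(Debug, Clone, PartialEq)]
pub struct LlmCallLogRow {
    pub call_type: String,
    /// None maps to SQL NULL: the call is not tied to a single article.
    pub article_url: Option<String>,
}

/// Stand-in for llm_call_log::insert. The borrowed URL is copied into the
/// owned row, mirroring Option<&str> on the way in and Option<String> on read.
pub fn insert(table: &mut Vec<LlmCallLogRow>, call_type: &str, article_url: Option<&str>) {
    table.push(LlmCallLogRow {
        call_type: call_type.to_string(),
        article_url: article_url.map(String::from),
    });
}

/// Stand-in for the log_llm_call helper: it forwards article_url unchanged.
pub fn log_llm_call(table: &mut Vec<LlmCallLogRow>, call_type: &str, article_url: Option<&str>) {
    insert(table, call_type, article_url);
}
```

Under this shape, the classify_summarize call site would pass Some(url) directly to insert, while search calls going through log_llm_call pass None.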
Frontend:
- LlmCallLogEntry type: add article_url: string | null
- LlmLogs.tsx: display the URL as a clickable link when present
- fr.ts: add 'llmLogs.articleUrl': 'Article'
Files to modify
- Create: backend/migrations/20260325000021_add_article_url_to_llm_log.sql
- Modify: backend/src/db/llm_call_log.rs (insert signature, row struct, SELECT queries)
- Modify: backend/src/services/synthesis.rs (pass article URL in classify insert call, update log_llm_call helper)
- Modify: backend/src/services/source_scraper.rs (update insert call to pass None)
- Modify: frontend/src/types.ts (add field to LlmCallLogEntry)
- Modify: frontend/src/pages/LlmLogs.tsx (display article URL)
- Modify: frontend/src/i18n/fr.ts (add label)
- Modify: CLAUDE.md (migration count)
3. Send structured link pairs to LLM instead of raw HTML body
Context
The LLM link extraction path (extract_article_links_with_llm) sends the first 12000 chars of the HTML <body> to the LLM. This is noisy — the LLM must parse raw HTML with scripts, styles, and irrelevant markup, wasting tokens and reducing accuracy.
Changes
New function: extract_links_as_pairs(html: &str, base_url: &Url) -> Vec<(String, String)> in source_scraper.rs. Parses all <a href> tags and returns (resolved_href, anchor_text) pairs. Filtering: http/https only, same-domain, non-empty path. No dedup or article-pattern filtering (the LLM decides). Same-domain filtering is kept to avoid sending irrelevant cross-domain links that waste tokens.
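The filtering rules can be illustrated on links that have already been parsed and resolved to absolute URLs. This is a rough sketch under stated assumptions: split_url, keep_link, and filter_pairs are hypothetical names, the URL splitting uses plain string handling (the real function takes a &Url and would more likely rely on a URL-parsing crate), hosts are compared as exact case-insensitive strings, and "non-empty path" is read as excluding both an empty path and the bare root "/".

```rust
/// Splits an absolute URL into (scheme, host, path) using plain string
/// handling. Hypothetical helper; ports and userinfo are not treated specially.
fn split_url(url: &str) -> Option<(&str, &str, &str)> {
    let (scheme, rest) = url.split_once("://")?;
    // The host ends at the first '/', '?' or '#'.
    let host_end = rest
        .find(|c: char| c == '/' || c == '?' || c == '#')
        .unwrap_or(rest.len());
    let (host, tail) = rest.split_at(host_end);
    // The path ends where the query string or fragment begins.
    let path_end = tail
        .find(|c: char| c == '?' || c == '#')
        .unwrap_or(tail.len());
    Some((scheme, host, &tail[..path_end]))
}

/// Applies the three filters: http/https only, same domain, non-empty path.
pub fn keep_link(resolved_href: &str, base_host: &str) -> bool {
    match split_url(resolved_href) {
        Some((scheme, host, path)) => {
            let is_http =
                scheme.eq_ignore_ascii_case("http") || scheme.eq_ignore_ascii_case("https");
            let same_domain = host.eq_ignore_ascii_case(base_host);
            // Assumption: the bare root "/" counts as an empty path.
            let has_path = !path.is_empty() && path != "/";
            is_http && same_domain && has_path
        }
        // No "://" separator (mailto:, javascript:, etc.): drop the link.
        None => false,
    }
}

/// Keeps every pair that passes the filters, in order, with no deduplication.
pub fn filter_pairs(pairs: Vec<(String, String)>, base_host: &str) -> Vec<(String, String)> {
    pairs
        .into_iter()
        .filter(|(href, _)| keep_link(href, base_host))
        .collect()
}
```

Duplicate hrefs deliberately survive this step, matching the decision to leave deduplication and article-pattern judgment to the LLM.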
Updated flow in extract_article_links_with_llm:
- Fetch the page HTML (unchanged)
- Call extract_links_as_pairs instead of extract_body_html
- Format pairs as a text list: - /blog/article-1 | "OpenAI launches GPT-6" (capped at 200 links)
- Pass the formatted list to build_link_extraction_prompt
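The formatting step in the flow above might look like the sketch below. The function name format_links_for_prompt and the constant name MAX_LINKS are assumptions, as is the whitespace collapsing of anchor text (added here so each link stays on a single line); the line format and the 200-link cap come from the design.

```rust
/// Cap on the number of links sent to the LLM (value from the design;
/// the constant name is an assumption).
pub const MAX_LINKS: usize = 200;

/// Formats (href, anchor_text) pairs as one `- href | "text"` line per link,
/// keeping at most MAX_LINKS entries in their original order.
pub fn format_links_for_prompt(pairs: &[(String, String)]) -> String {
    pairs
        .iter()
        .take(MAX_LINKS)
        .map(|(href, text)| {
            // Assumption: collapse runs of whitespace (including newlines)
            // in the anchor text so that every link occupies exactly one line.
            let text = text.split_whitespace().collect::<Vec<_>>().join(" ");
            format!("- {} | \"{}\"", href, text)
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

Because the cap bounds the number of lines, the prompt builder no longer needs its own character-based truncation.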
Updated prompt: build_link_extraction_prompt parameter renamed from body_html to links_text. Remove the internal 12000-char truncation (the input is now a pre-formatted list, not raw HTML; the 200-link cap controls size). Update prompt wording to ask the LLM to select article links from the list rather than extract URLs from HTML.
Schema: build_link_extraction_schema returns { "urls": [...] } — unchanged. The LLM now selects URLs from the provided list rather than extracting from HTML, but the output format stays the same.
Cleanup: Remove extract_body_html and its tests if no longer used elsewhere.
Files to modify
- Modify: backend/src/services/source_scraper.rs (add extract_links_as_pairs, update extract_article_links_with_llm, remove extract_body_html)
- Modify: backend/src/services/prompts.rs (update build_link_extraction_prompt: rename parameter, remove truncation, update wording)
- Modify: backend/src/services/source_scraper.rs tests (add tests for extract_links_as_pairs, remove extract_body_html tests)
- Modify: backend/src/services/prompts.rs tests (update link extraction prompt tests)