# Design: Pipeline Improvements (Web Search, LLM Logs, Link Extraction)

**Date**: 2026-03-25

**Scope**: Three independent improvements to the synthesis pipeline

---

## 1. Remove personalized sources from web search prompt

### Context

`build_search_prompt` receives `&[Source]` and injects personalized source URLs into the Phase 2 web search prompt. Phase 1 already handles personalized sources via scraping, so including them again in Phase 2 biases the web search away from discovering new content.

### Change

In `synthesis.rs`, pass `&[]` instead of `&sources` when calling `build_search_prompt` for Phase 2. The function signature is unchanged. Phase 2 will do a pure Google search based on theme, categories, and gap counts only.

### Files to modify

- `backend/src/services/synthesis.rs`: pass `&[]` for sources in the Phase 2 `build_search_prompt` call

---

## 2. Add `article_url` to LLM call logs

### Context

The `llm_call_log` table records every LLM call during synthesis generation but has no field linking a `classify_summarize` call to the specific article URL being classified. To see which article a classify call relates to, you must cross-reference `article_history`, which is cumbersome for debugging.

### Changes

**Migration:** Add nullable `article_url TEXT` column to `llm_call_log`.


**Backend:**

- `llm_call_log::insert`: add an `article_url: Option<&str>` parameter and bind it in the INSERT
- `LlmCallLogRow`: add an `article_url: Option<String>` field and update the SELECT in `list_by_job_id` to include `article_url`
- `log_llm_call` helper in `synthesis.rs`: add an `article_url: Option<&str>` parameter and pass it through to `insert`
- The `classify_summarize` call in `synthesis.rs` calls `insert` directly (not via `log_llm_call`); update it to pass the article URL
- The `link_extraction` call in `source_scraper.rs` also calls `insert` directly; update it to pass `None`
- All other call sites via `log_llm_call` (`search`) pass `None`

**Frontend:**

- `LlmCallLogEntry` type: add `article_url: string | null`
- `LlmLogs.tsx`: display the URL as a clickable link when present
- `fr.ts`: add `'llmLogs.articleUrl': 'Article'`

### Files to modify

- **Create:** `backend/migrations/20260325000021_add_article_url_to_llm_log.sql`
- **Modify:** `backend/src/db/llm_call_log.rs`: insert signature, row struct, SELECT queries
- **Modify:** `backend/src/services/synthesis.rs`: pass article URL in classify `insert` call, update `log_llm_call` helper
- **Modify:** `backend/src/services/source_scraper.rs`: update `insert` call to pass `None`
- **Modify:** `frontend/src/types.ts`: add field to `LlmCallLogEntry`
- **Modify:** `frontend/src/pages/LlmLogs.tsx`: display article URL
- **Modify:** `frontend/src/i18n/fr.ts`: add label
- **Modify:** `CLAUDE.md`: migration count

---

## 3. Send structured link pairs to LLM instead of raw HTML body

### Context

The LLM link extraction path (`extract_article_links_with_llm`) sends the first 12000 chars of the HTML `<body>` to the LLM. This is noisy: the LLM must parse raw HTML with scripts, styles, and irrelevant markup, wasting tokens and reducing accuracy.

### Changes

**New function:** `extract_links_as_pairs(html: &str, base_url: &Url) -> Vec<(String, String)>` in `source_scraper.rs`.
Parses all `<a>` tags and returns `(resolved_href, anchor_text)` pairs. Filtering: http/https only, same-domain, non-empty path. No dedup or article-pattern filtering (the LLM decides). Same-domain filtering is kept to avoid sending irrelevant cross-domain links that waste tokens.

**Updated flow in `extract_article_links_with_llm`:**

1. Fetch the page HTML (unchanged)
2. Call `extract_links_as_pairs` instead of `extract_body_html`
3. Format pairs as a text list: `- /blog/article-1 | "OpenAI launches GPT-6"` (capped at 200 links)
4. Pass the formatted list to `build_link_extraction_prompt`

**Updated prompt:** `build_link_extraction_prompt` parameter renamed from `body_html` to `links_text`. Remove the internal 12000-char truncation (the input is now a pre-formatted list, not raw HTML; the 200-link cap controls size). Update prompt wording to ask the LLM to select article links from the list rather than extract URLs from HTML.

**Schema:** `build_link_extraction_schema` returns `{ "urls": [...] }`, unchanged. The LLM now selects URLs from the provided list rather than extracting from HTML, but the output format stays the same.

**Cleanup:** Remove `extract_body_html` and its tests if no longer used elsewhere.

### Files to modify

- **Modify:** `backend/src/services/source_scraper.rs`: add `extract_links_as_pairs`, update `extract_article_links_with_llm`, remove `extract_body_html`
- **Modify:** `backend/src/services/prompts.rs`: update `build_link_extraction_prompt` (rename parameter, remove truncation, update wording)
- **Modify:** `backend/src/services/source_scraper.rs` tests: add tests for `extract_links_as_pairs`, remove `extract_body_html` tests
- **Modify:** `backend/src/services/prompts.rs` tests: update link extraction prompt tests
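
### Sketch: formatting link pairs for the prompt

Step 3 of the updated flow in section 3 (format pairs as a text list, capped at 200 links) could look roughly like the sketch below. It is illustrative only: the helper name `format_link_pairs`, the `MAX_LINKS` constant, and the whitespace collapsing of anchor text are assumptions, not part of the design. It also assumes the pairs come straight from `extract_links_as_pairs` and prints each href exactly as stored in the pair.

```rust
/// Hypothetical constant mirroring the "capped at 200 links" rule in the design.
const MAX_LINKS: usize = 200;

/// Hypothetical helper: renders `(href, anchor_text)` pairs as one
/// `- href | "text"` line each, keeping at most `max_links` entries.
fn format_link_pairs(pairs: &[(String, String)], max_links: usize) -> String {
    pairs
        .iter()
        .take(max_links) // enforce the cap before building any strings
        .map(|(href, text)| {
            // Assumption: collapse whitespace runs (including newlines) so
            // every link occupies exactly one line of the prompt.
            let text = text.split_whitespace().collect::<Vec<_>>().join(" ");
            format!("- {} | \"{}\"", href, text)
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

Calling `format_link_pairs(&pairs, MAX_LINKS)` would produce the `links_text` argument for `build_link_extraction_prompt`. Applying the cap here rather than inside the prompt builder is consistent with removing the 12000-char truncation from the prompt, since the list length alone then bounds the input size.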