Design: Pipeline Improvements — Web Search, LLM Logs, Link Extraction

Date: 2026-03-25
Scope: Three independent improvements to the synthesis pipeline

1. Remove personalized sources from web search prompt

Context

build_search_prompt receives &[Source] and injects personalized source URLs into the Phase 2 web search prompt. Phase 1 already handles personalized sources via scraping, so including them again in Phase 2 biases the web search away from discovering new content.

Change

In synthesis.rs, pass &[] instead of &sources when calling build_search_prompt for Phase 2. The function signature is unchanged — Phase 2 will do a pure Google search based on theme, categories, and gap counts only.
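A minimal sketch of the call-site change, assuming a simplified build_search_prompt. The Source struct, the prompt wording, the parameter list (gap counts are omitted here), and the phase2_prompt wrapper are all hypothetical stand-ins, since the real definitions are not shown in this document. The point is only that an empty slice leaves the signature intact while dropping the sources section from the prompt.

```rust
/// Hypothetical stand-in for the pipeline's `Source` type; only a URL is assumed.
pub struct Source {
    pub url: String,
}

/// Simplified stand-in for `build_search_prompt`. The real prompt text and
/// parameters are not shown in this document; this version only demonstrates
/// how an empty slice removes the sources section.
pub fn build_search_prompt(theme: &str, categories: &[&str], sources: &[Source]) -> String {
    let mut prompt = format!(
        "Search the web for recent articles about \"{}\" in categories: {}.\n",
        theme,
        categories.join(", ")
    );
    // Only emit the sources section when there is something to list.
    if !sources.is_empty() {
        prompt.push_str("Preferred sources:\n");
        for source in sources {
            prompt.push_str(&format!("- {}\n", source.url));
        }
    }
    prompt
}

/// Hypothetical Phase 2 call site after the change: the personalized sources
/// are still available to the caller but are deliberately not forwarded.
pub fn phase2_prompt(theme: &str, categories: &[&str], _sources: &[Source]) -> String {
    build_search_prompt(theme, categories, &[])
}
```

Keeping the parameter and passing an empty slice means other callers of build_search_prompt, if any, are unaffected by this change.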

Files to modify

backend/src/services/synthesis.rs — pass &[] for sources in the Phase 2 build_search_prompt call

2. Add `article_url` to LLM call logs

Context

The llm_call_log table records every LLM call during synthesis generation but has no field linking a classify_summarize call to the specific article URL being classified. To see which article a classify call relates to, you must cross-reference article_history — cumbersome for debugging.

Changes

Migration: Add nullable article_url TEXT column to llm_call_log.

Backend:

llm_call_log::insert — add article_url: Option<&str> parameter, bind it in the INSERT
LlmCallLogRow — add article_url: Option<String> field, update SELECT in list_by_job_id to include article_url
log_llm_call helper in synthesis.rs — add article_url: Option<&str> parameter, pass through to insert
The classify_summarize call in synthesis.rs calls insert directly (not via log_llm_call) — update it to pass the article URL
The link_extraction call in source_scraper.rs also calls insert directly — update it to pass None
All other call sites via log_llm_call (search) pass None
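The backend changes above could look roughly like the following sketch. It uses an in-memory Vec in place of the database so that it stays self-contained; the real insert binds article_url in a SQL INSERT. The job_id and call_type fields are assumptions for illustration, and the actual LlmCallLogRow has columns that are not listed in this document.

```rust
/// Hypothetical, trimmed-down row. Only `article_url` is taken from this
/// design; the other fields are placeholders for the existing columns.
#[derive(Debug, Clone, PartialEq)]
pub struct LlmCallLogRow {
    pub job_id: i64,
    pub call_type: String,
    /// Nullable: set for classify_summarize calls, None everywhere else.
    pub article_url: Option<String>,
}

/// In-memory stand-in for `llm_call_log::insert`. The new parameter is
/// borrowed (`Option<&str>`) and stored as an owned `Option<String>`.
pub fn insert(
    log: &mut Vec<LlmCallLogRow>,
    job_id: i64,
    call_type: &str,
    article_url: Option<&str>,
) {
    log.push(LlmCallLogRow {
        job_id,
        call_type: call_type.to_string(),
        article_url: article_url.map(String::from),
    });
}

/// Stand-in for the `log_llm_call` helper: it simply forwards the new
/// parameter to `insert`.
pub fn log_llm_call(
    log: &mut Vec<LlmCallLogRow>,
    job_id: i64,
    call_type: &str,
    article_url: Option<&str>,
) {
    insert(log, job_id, call_type, article_url);
}
```

Under these assumptions, the classify call site would pass Some(url) with the URL of the article being classified, while the link extraction and search call sites pass None.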

Frontend:

LlmCallLogEntry type — add article_url: string | null
LlmLogs.tsx — display the URL as a clickable link when present
fr.ts — add 'llmLogs.articleUrl': 'Article'

Files to modify

Create: backend/migrations/20260325000021_add_article_url_to_llm_log.sql
Modify: backend/src/db/llm_call_log.rs — insert signature, row struct, SELECT queries
Modify: backend/src/services/synthesis.rs — pass article URL in classify insert call, update log_llm_call helper
Modify: backend/src/services/source_scraper.rs — update insert call to pass None
Modify: frontend/src/types.ts — add field to LlmCallLogEntry
Modify: frontend/src/pages/LlmLogs.tsx — display article URL
Modify: frontend/src/i18n/fr.ts — add label
Modify: CLAUDE.md — migration count

3. Send structured link pairs to LLM instead of raw HTML body

Context

The LLM link extraction path (extract_article_links_with_llm) sends the first 12000 chars of the HTML <body> to the LLM. This is noisy — the LLM must parse raw HTML with scripts, styles, and irrelevant markup, wasting tokens and reducing accuracy.

Changes

New function: extract_links_as_pairs(html: &str, base_url: &Url) -> Vec<(String, String)> in source_scraper.rs. Parses all <a href> tags and returns (resolved_href, anchor_text) pairs. Filtering: http/https only, same-domain, non-empty path. No dedup or article-pattern filtering (the LLM decides). Same-domain filtering is kept to avoid sending irrelevant cross-domain links that waste tokens.
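The filtering rules can be illustrated with a dependency-free sketch that operates on already-resolved hrefs. It does not parse HTML or resolve relative links, and it replaces the Url type with plain string handling, so it is not the proposed extract_links_as_pairs itself. The helper names keep_link and filter_pairs are hypothetical, and two readings are assumptions: "same-domain" is treated as an exact host match, and a path of "/" is treated as empty.

```rust
/// Splits an absolute URL into (scheme, host, path) with plain string
/// operations. A real implementation would more likely rely on a URL parser.
fn split_url(url: &str) -> Option<(&str, &str, &str)> {
    let (scheme, rest) = url.split_once("://")?;
    // Drop the query string and fragment before separating host from path.
    let rest = rest.split(|c| c == '?' || c == '#').next().unwrap_or("");
    let (host, path) = match rest.find('/') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, ""),
    };
    Some((scheme, host, path))
}

/// Applies the three filters: http/https only, same domain, non-empty path.
pub fn keep_link(resolved_href: &str, base_host: &str) -> bool {
    match split_url(resolved_href) {
        Some((scheme, host, path)) => {
            let is_http = scheme == "http" || scheme == "https";
            // Assumption: same-domain means an exact, case-insensitive host match.
            let same_domain = host.eq_ignore_ascii_case(base_host);
            // Assumption: "/" alone counts as an empty path.
            let has_path = !path.is_empty() && path != "/";
            is_http && same_domain && has_path
        }
        None => false,
    }
}

/// Keeps only the pairs that pass `keep_link`. No dedup and no
/// article-pattern filtering, matching the design (the LLM decides).
pub fn filter_pairs(pairs: Vec<(String, String)>, base_host: &str) -> Vec<(String, String)> {
    pairs
        .into_iter()
        .filter(|(href, _)| keep_link(href, base_host))
        .collect()
}
```

Whether subdomains such as blog.example.com should count as the same domain is not specified in this design and would need to be decided in the real implementation.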

Updated flow in extract_article_links_with_llm:

Fetch the page HTML (unchanged)
Call extract_links_as_pairs instead of extract_body_html
Format pairs as a text list, one link per line, e.g. - /blog/article-1 | "OpenAI launches GPT-6" (capped at 200 links)
Pass the formatted list to build_link_extraction_prompt
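The formatting step in the flow above can be sketched as follows. The function name format_link_pairs and the constant MAX_LINKS are hypothetical; the line format and the 200-link cap are taken from the description. Trimming the anchor text is an extra assumption.

```rust
/// Cap on the number of links sent to the LLM, per the design.
pub const MAX_LINKS: usize = 200;

/// Formats (href, anchor_text) pairs as one line per link:
/// `- <href> | "<anchor text>"`, keeping at most `MAX_LINKS` entries.
pub fn format_link_pairs(pairs: &[(String, String)]) -> String {
    pairs
        .iter()
        .take(MAX_LINKS)
        // Assumption: surrounding whitespace in anchor text is not useful to the LLM.
        .map(|(href, text)| format!("- {} | \"{}\"", href, text.trim()))
        .collect::<Vec<_>>()
        .join("\n")
}
```

The resulting string is what would be passed to build_link_extraction_prompt as links_text. Because the cap bounds the number of lines, the prompt builder no longer needs its own character-based truncation.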

Updated prompt: build_link_extraction_prompt parameter renamed from body_html to links_text. Remove the internal 12000-char truncation (the input is now a pre-formatted list, not raw HTML; the 200-link cap controls size). Update prompt wording to ask the LLM to select article links from the list rather than extract URLs from HTML.

Schema: build_link_extraction_schema returns { "urls": [...] } — unchanged. The LLM now selects URLs from the provided list rather than extracting from HTML, but the output format stays the same.

Cleanup: Remove extract_body_html and its tests if no longer used elsewhere.

Files to modify

Modify: backend/src/services/source_scraper.rs — add extract_links_as_pairs, update extract_article_links_with_llm, remove extract_body_html
Modify: backend/src/services/prompts.rs — update build_link_extraction_prompt (rename parameter, remove truncation, update wording)
Modify: backend/src/services/source_scraper.rs tests — add tests for extract_links_as_pairs, remove extract_body_html tests
Modify: backend/src/services/prompts.rs tests — update link extraction prompt tests
