Design: Pipeline Improvements — Web Search, LLM Logs, Link Extraction

Date: 2026-03-25
Scope: Three independent improvements to the synthesis pipeline


1. Remove personalized sources from web search prompt

Context

build_search_prompt receives &[Source] and injects personalized source URLs into the Phase 2 web search prompt. Phase 1 already handles personalized sources via scraping, so including them again in Phase 2 biases the web search away from discovering new content.

Change

In synthesis.rs, pass &[] instead of &sources when calling build_search_prompt for Phase 2. The function signature is unchanged — Phase 2 will do a pure Google search based on theme, categories, and gap counts only.
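A minimal sketch of the call-site change, assuming a simplified `Source` struct and a simplified `build_search_prompt` body. Both are hypothetical stand-ins (the real definitions are not shown in this spec), as are the `phase2_prompt` wrapper, the single `gap_count` parameter, and the prompt wording. The point it illustrates is that the signature stays the same and only the argument changes.

```rust
// Hypothetical stand-in for the backend's `Source` type.
#[derive(Debug, Clone)]
pub struct Source {
    pub url: String,
}

/// Simplified stand-in for `build_search_prompt`: source URLs are injected
/// only when `sources` is non-empty. The prompt wording is illustrative.
pub fn build_search_prompt(
    theme: &str,
    categories: &[&str],
    gap_count: usize,
    sources: &[Source],
) -> String {
    let mut prompt = format!(
        "Find {gap_count} recent articles about \"{theme}\" in categories: {}.",
        categories.join(", ")
    );
    if !sources.is_empty() {
        let urls: Vec<&str> = sources.iter().map(|s| s.url.as_str()).collect();
        prompt.push_str(&format!(" Prioritize these sources: {}.", urls.join(", ")));
    }
    prompt
}

/// Phase 2 call site after the change: an empty slice instead of `&sources`,
/// so the prompt depends on theme, categories, and gap count only.
pub fn phase2_prompt(theme: &str, categories: &[&str], gap_count: usize) -> String {
    build_search_prompt(theme, categories, gap_count, &[])
}
```

Because the function is untouched, any other caller that still passes real sources keeps its current behavior.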

Files to modify

  • backend/src/services/synthesis.rs — pass &[] for sources in the Phase 2 build_search_prompt call

2. Add article_url to LLM call logs

Context

The llm_call_log table records every LLM call during synthesis generation but has no field linking a classify_summarize call to the specific article URL being classified. To see which article a classify call relates to, you must cross-reference article_history — cumbersome for debugging.

Changes

Migration: Add nullable article_url TEXT column to llm_call_log.

Backend:

  • llm_call_log::insert — add article_url: Option<&str> parameter, bind it in the INSERT
  • LlmCallLogRow — add article_url: Option<String> field, update SELECT in list_by_job_id to include article_url
  • log_llm_call helper in synthesis.rs — add article_url: Option<&str> parameter, pass through to insert
  • The classify_summarize call in synthesis.rs calls insert directly (not via log_llm_call) — update it to pass the article URL
  • The link_extraction call in source_scraper.rs also calls insert directly — update it to pass None
  • All other call sites, which go through log_llm_call (the search calls), pass None
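The backend changes above can be sketched as follows. This is an in-memory illustration only: a `Vec` stands in for the `llm_call_log` table, and the `job_id` and `call_type` fields, their types, and the `record_example_calls` function are assumptions made for the example. The real `insert` binds `article_url` in a SQL INSERT.

```rust
// In-memory stand-in for a row of the `llm_call_log` table.
// Field names other than `article_url` are hypothetical.
#[derive(Debug, Clone, PartialEq)]
pub struct LlmCallLogRow {
    pub job_id: i64,
    pub call_type: String,
    pub article_url: Option<String>, // None maps to NULL in the new column
}

/// Stand-in for `llm_call_log::insert` with the new optional parameter.
pub fn insert(
    log: &mut Vec<LlmCallLogRow>,
    job_id: i64,
    call_type: &str,
    article_url: Option<&str>,
) {
    log.push(LlmCallLogRow {
        job_id,
        call_type: call_type.to_string(),
        article_url: article_url.map(str::to_string),
    });
}

/// Stand-in for the `log_llm_call` helper: forwards `article_url` unchanged.
pub fn log_llm_call(
    log: &mut Vec<LlmCallLogRow>,
    job_id: i64,
    call_type: &str,
    article_url: Option<&str>,
) {
    insert(log, job_id, call_type, article_url);
}

/// Records one call of each kind described in the list above.
pub fn record_example_calls(log: &mut Vec<LlmCallLogRow>, job_id: i64, article_url: &str) {
    // Search goes through the helper and has no single article.
    log_llm_call(log, job_id, "search", None);
    // Link extraction inserts directly and has no single article.
    insert(log, job_id, "link_extraction", None);
    // Classification inserts directly and carries the article URL.
    insert(log, job_id, "classify_summarize", Some(article_url));
}
```

Taking `Option<&str>` at the call boundary and storing `Option<String>` in the row matches the parameter and field types listed above and keeps call sites free of allocations when they pass `None`.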

Frontend:

  • LlmCallLogEntry type — add article_url: string | null
  • LlmLogs.tsx — display the URL as a clickable link when present
  • fr.ts — add 'llmLogs.articleUrl': 'Article'

Files to modify

  • Create: backend/migrations/20260325000021_add_article_url_to_llm_log.sql
  • Modify: backend/src/db/llm_call_log.rs — insert signature, row struct, SELECT queries
  • Modify: backend/src/services/synthesis.rs — pass article URL in classify insert call, update log_llm_call helper
  • Modify: backend/src/services/source_scraper.rs — update insert call to pass None
  • Modify: frontend/src/types.ts — add field to LlmCallLogEntry
  • Modify: frontend/src/pages/LlmLogs.tsx — display article URL
  • Modify: frontend/src/i18n/fr.ts — add label
  • Modify: CLAUDE.md — migration count

3. Send extracted links instead of raw HTML to the link extraction LLM

Context

The LLM link extraction path (extract_article_links_with_llm) sends the first 12000 chars of the HTML <body> to the LLM. This is noisy — the LLM must parse raw HTML with scripts, styles, and irrelevant markup, wasting tokens and reducing accuracy.

Changes

New function: extract_links_as_pairs(html: &str, base_url: &Url) -> Vec<(String, String)> in source_scraper.rs. Parses all <a href> tags and returns (resolved_href, anchor_text) pairs. Filtering: http/https only, same-domain, non-empty path. No dedup or article-pattern filtering (the LLM decides). Same-domain filtering is kept to avoid sending irrelevant cross-domain links that waste tokens.
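To make the filtering rules concrete, here is a small sketch that applies them to pairs whose hrefs are already resolved to absolute URLs. It is not the proposed `extract_links_as_pairs` itself: HTML parsing and relative-URL resolution are left out, and `split_url` and `filter_link_pairs` are hypothetical helpers that use naive string splitting in place of a real URL parser (ports, queries, and fragments are not handled). Treating a bare `/` as an empty path is an interpretation of "non-empty path", not something the spec states.

```rust
/// Splits an absolute URL into (scheme, host, path). A deliberately small
/// stand-in for a real URL parser; returns None when there is no "://".
fn split_url(url: &str) -> Option<(&str, &str, &str)> {
    let (scheme, rest) = url.split_once("://")?;
    let (host, path) = match rest.find('/') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, ""),
    };
    Some((scheme, host, path))
}

/// Keeps pairs that are http/https, on `base_host`, and have a path beyond "/".
/// No dedup and no article-pattern filtering: that decision is left to the LLM.
pub fn filter_link_pairs(
    pairs: Vec<(String, String)>,
    base_host: &str,
) -> Vec<(String, String)> {
    pairs
        .into_iter()
        .filter(|(href, _text)| match split_url(href) {
            Some((scheme, host, path)) => {
                (scheme == "http" || scheme == "https")
                    && host.eq_ignore_ascii_case(base_host)
                    && !path.is_empty()
                    && path != "/"
            }
            None => false,
        })
        .collect()
}
```

With `base_host` set to `example.com`, a `mailto:` link, a link to another domain, and the bare home page are dropped, while two links to the same article with different anchor text are both kept.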

Updated flow in extract_article_links_with_llm:

  1. Fetch the page HTML (unchanged)
  2. Call extract_links_as_pairs instead of extract_body_html
  3. Format pairs as a text list: - /blog/article-1 | "OpenAI launches GPT-6" (capped at 200 links)
  4. Pass the formatted list to build_link_extraction_prompt
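Step 3 of the flow above might look like the sketch below. The names `format_links_text` and `MAX_LINKS` are hypothetical; the line format and the 200-link cap come from the spec. Trimming the anchor text is an added assumption.

```rust
/// Cap taken from the spec: at most 200 links are sent to the LLM.
pub const MAX_LINKS: usize = 200;

/// Formats pairs as one `- href | "anchor text"` line per link,
/// keeping only the first `MAX_LINKS` pairs in document order.
pub fn format_links_text(pairs: &[(String, String)]) -> String {
    pairs
        .iter()
        .take(MAX_LINKS)
        .map(|(href, text)| format!("- {} | \"{}\"", href, text.trim()))
        .collect::<Vec<_>>()
        .join("\n")
}
```

Since the cap is applied before formatting, the size of the prompt input is bounded by link count rather than by a character limit, which is what allows the 12000-char truncation to be dropped.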

Updated prompt: build_link_extraction_prompt parameter renamed from body_html to links_text. Remove the internal 12000-char truncation (the input is now a pre-formatted list, not raw HTML; the 200-link cap controls size). Update prompt wording to ask the LLM to select article links from the list rather than extract URLs from HTML.

Schema: build_link_extraction_schema returns { "urls": [...] } — unchanged. The LLM now selects URLs from the provided list rather than extracting from HTML, but the output format stays the same.

Cleanup: Remove extract_body_html and its tests if no longer used elsewhere.

Files to modify

  • Modify: backend/src/services/source_scraper.rs — add extract_links_as_pairs, update extract_article_links_with_llm, remove extract_body_html
  • Modify: backend/src/services/prompts.rs — update build_link_extraction_prompt (rename parameter, remove truncation, update wording)
  • Modify: backend/src/services/source_scraper.rs tests — add tests for extract_links_as_pairs, remove extract_body_html tests
  • Modify: backend/src/services/prompts.rs tests — update link extraction prompt tests