
# Design: Pipeline Improvements — Web Search, LLM Logs, Link Extraction

**Date**: 2026-03-25
**Scope**: Three independent improvements to the synthesis pipeline

---

## 1. Remove personalized sources from web search prompt

### Context

`build_search_prompt` receives `&[Source]` and injects personalized source URLs into the Phase 2 web search prompt. Phase 1 already handles personalized sources via scraping, so including them again in Phase 2 biases the web search away from discovering new content.

### Change

In `synthesis.rs`, pass `&[]` instead of `&sources` when calling `build_search_prompt` for Phase 2. The function signature is unchanged — Phase 2 will do a pure Google search based on theme, categories, and gap counts only.
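A minimal sketch of the call-site change, under stated assumptions: the `Source` struct, the parameter list of `build_search_prompt`, and the `phase2_search_prompt` wrapper below are hypothetical stand-ins, not the backend's actual definitions. It only illustrates that passing an empty slice keeps the signature intact while dropping the personalized URLs from the prompt.

```rust
// Hypothetical stand-ins: the real `Source` and `build_search_prompt`
// live in the backend and may have different fields and parameters.
pub struct Source {
    pub url: String,
}

pub fn build_search_prompt(theme: &str, categories: &[&str], sources: &[Source]) -> String {
    let mut prompt = format!("Theme: {theme}\nCategories: {}\n", categories.join(", "));
    // This section is only emitted when the caller supplies sources.
    if !sources.is_empty() {
        prompt.push_str("Preferred sources:\n");
        for source in sources {
            prompt.push_str(&format!("- {}\n", source.url));
        }
    }
    prompt
}

/// Phase 2 call site after the change: the personalized sources remain
/// available to Phase 1, but the search prompt receives an empty slice.
pub fn phase2_search_prompt(theme: &str, categories: &[&str], _sources: &[Source]) -> String {
    build_search_prompt(theme, categories, &[])
}
```

Because the function signature stays the same, any other caller that still wants sources in its prompt is unaffected.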

### Files to modify

- `backend/src/services/synthesis.rs` — pass `&[]` for sources in the Phase 2 `build_search_prompt` call

---

## 2. Add `article_url` to LLM call logs

### Context

The `llm_call_log` table records every LLM call during synthesis generation but has no field linking a `classify_summarize` call to the specific article URL being classified. To see which article a classify call relates to, you must cross-reference `article_history` — cumbersome for debugging.

### Changes

**Migration:** Add nullable `article_url TEXT` column to `llm_call_log`.

**Backend:**
- `llm_call_log::insert` — add `article_url: Option<&str>` parameter, bind it in the INSERT
- `LlmCallLogRow` — add `article_url: Option<String>` field, update SELECT in `list_by_job_id` to include `article_url`
- `log_llm_call` helper in `synthesis.rs` — add `article_url: Option<&str>` parameter, pass through to `insert`
- The `classify_summarize` call in `synthesis.rs` calls `insert` directly (not via `log_llm_call`); update it to pass the article URL
- The `link_extraction` call in `source_scraper.rs` also calls `insert` directly; update it to pass `None`
- All other call sites that go through `log_llm_call` (the `search` calls) pass `None`
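The backend changes above can be sketched roughly as follows. This is an illustration only: the real `insert` binds its arguments into an SQL INSERT and has more columns, whereas this version writes to a `Vec`. The `job_id` and `call_type` fields and their types are assumptions made for the example.

```rust
// Hypothetical, simplified shapes: the real row has more columns and the
// real `insert` talks to the database rather than to a Vec.
#[derive(Debug, Clone, PartialEq)]
pub struct LlmCallLogRow {
    pub job_id: i64,
    pub call_type: String,
    // New nullable column: only set for `classify_summarize` calls.
    pub article_url: Option<String>,
}

/// Stand-in for `llm_call_log::insert` with the new optional parameter.
pub fn insert(
    log: &mut Vec<LlmCallLogRow>,
    job_id: i64,
    call_type: &str,
    article_url: Option<&str>,
) {
    log.push(LlmCallLogRow {
        job_id,
        call_type: call_type.to_string(),
        article_url: article_url.map(str::to_string),
    });
}

/// Stand-in for the `log_llm_call` helper: forwards `article_url` unchanged.
pub fn log_llm_call(
    log: &mut Vec<LlmCallLogRow>,
    job_id: i64,
    call_type: &str,
    article_url: Option<&str>,
) {
    insert(log, job_id, call_type, article_url);
}
```

Taking `Option<&str>` rather than `Option<String>` lets call sites that have nothing to record pass `None` without allocating, and lets the classify call borrow the URL it already holds.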

**Frontend:**
- `LlmCallLogEntry` type — add `article_url: string | null`
- `LlmLogs.tsx` — display the URL as a clickable link when present
- `fr.ts` — add `'llmLogs.articleUrl': 'Article'`

### Files to modify

- **Create:** `backend/migrations/20260325000021_add_article_url_to_llm_log.sql`
- **Modify:** `backend/src/db/llm_call_log.rs` — insert signature, row struct, SELECT queries
- **Modify:** `backend/src/services/synthesis.rs` — pass article URL in classify `insert` call, update `log_llm_call` helper
- **Modify:** `backend/src/services/source_scraper.rs` — update `insert` call to pass `None`
- **Modify:** `frontend/src/types.ts` — add field to `LlmCallLogEntry`
- **Modify:** `frontend/src/pages/LlmLogs.tsx` — display article URL
- **Modify:** `frontend/src/i18n/fr.ts` — add label
- **Modify:** `CLAUDE.md` — migration count

---

## 3. Send structured link pairs to LLM instead of raw HTML body

### Context

The LLM link extraction path (`extract_article_links_with_llm`) sends the first 12,000 characters of the HTML `<body>` to the LLM. This input is noisy: the LLM must wade through scripts, styles, and irrelevant markup, which wastes tokens and reduces accuracy.

### Changes

**New function:** `extract_links_as_pairs(html: &str, base_url: &Url) -> Vec<(String, String)>` in `source_scraper.rs`. It parses all `<a href>` tags and returns `(resolved_href, anchor_text)` pairs. Only links that are http/https, on the same domain as `base_url`, and have a non-empty path are kept. There is no deduplication and no article-pattern filtering, since the LLM decides which links are articles. Same-domain filtering is kept so that irrelevant cross-domain links do not waste tokens.
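The filtering rules can be sketched without external crates as below. This is not the proposed `extract_links_as_pairs` itself: HTML parsing and href resolution are left out, the helper names `split_http_url` and `filter_link_pairs` are hypothetical, and the page's host is passed as a plain string instead of a `&Url`. Treating a bare `/` as an empty path is an interpretation of the rule, not something the spec states.

```rust
/// Splits an absolute http(s) URL into (host, path). Returns None for any
/// other scheme. Naive on purpose: a real implementation would use a URL parser.
fn split_http_url(href: &str) -> Option<(&str, &str)> {
    let rest = href
        .strip_prefix("https://")
        .or_else(|| href.strip_prefix("http://"))?;
    // Drop the query string and fragment before looking at the path.
    let end = rest
        .find(|c: char| c == '?' || c == '#')
        .unwrap_or(rest.len());
    let rest = &rest[..end];
    match rest.find('/') {
        Some(i) => Some((&rest[..i], &rest[i..])),
        None => Some((rest, "")),
    }
}

/// Applies the three filters to already-resolved (href, anchor_text) pairs:
/// http/https only, same host as the source page, non-empty path.
/// Duplicates are deliberately kept; the LLM decides what is an article.
pub fn filter_link_pairs(
    candidates: &[(String, String)],
    base_host: &str,
) -> Vec<(String, String)> {
    candidates
        .iter()
        .filter(|(href, _)| match split_http_url(href) {
            Some((host, path)) => {
                // Assumption: "/" counts as an empty path (the home page).
                host.eq_ignore_ascii_case(base_host) && !path.is_empty() && path != "/"
            }
            None => false,
        })
        .cloned()
        .collect()
}
```

With `base_host = "example.com"`, a `mailto:` link, a link to `https://other.org/post`, and a link to `https://example.com/` would all be dropped, while two identical links to `https://example.com/blog/a` would both be kept.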

**Updated flow in `extract_article_links_with_llm`:**
1. Fetch the page HTML (unchanged)
2. Call `extract_links_as_pairs` instead of `extract_body_html`
3. Format pairs as a text list: `- /blog/article-1 | "OpenAI launches GPT-6"` (capped at 200 links)
4. Pass the formatted list to `build_link_extraction_prompt`
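
Step 3 of the flow above could look roughly like this. The helper name `format_link_pairs` and the `MAX_LINKS` constant are hypothetical; the spec only fixes the line format and the 200-link cap. Trimming the anchor text is an added assumption.

```rust
/// Cap from the spec: at most 200 links are sent to the LLM.
pub const MAX_LINKS: usize = 200;

/// Formats (href, anchor_text) pairs as one `- href | "text"` line each,
/// keeping only the first `MAX_LINKS` pairs in document order.
pub fn format_link_pairs(pairs: &[(String, String)]) -> String {
    pairs
        .iter()
        .take(MAX_LINKS)
        // Assumption: surrounding whitespace in anchor text is not meaningful.
        .map(|(href, text)| format!("- {} | \"{}\"", href, text.trim()))
        .collect::<Vec<_>>()
        .join("\n")
}
```

The example line in step 3 shows a path rather than a full URL. Whether hrefs are shortened before formatting is not specified here, so the sketch prints whatever the pair contains.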

**Updated prompt:** Rename the `build_link_extraction_prompt` parameter from `body_html` to `links_text`. Remove the internal 12,000-character truncation: the input is now a pre-formatted list rather than raw HTML, and the 200-link cap controls its size. Update the prompt wording to ask the LLM to select article links from the list rather than extract URLs from HTML.

**Schema:** `build_link_extraction_schema` returns `{ "urls": [...] }` — unchanged. The LLM now selects URLs from the provided list rather than extracting from HTML, but the output format stays the same.

**Cleanup:** Remove `extract_body_html` and its tests if no longer used elsewhere.

### Files to modify

- **Modify:** `backend/src/services/source_scraper.rs` — add `extract_links_as_pairs`, update `extract_article_links_with_llm`, remove `extract_body_html`
- **Modify:** `backend/src/services/prompts.rs` — update `build_link_extraction_prompt` (rename parameter, remove truncation, update wording)
- **Modify:** `backend/src/services/source_scraper.rs` tests — add tests for `extract_links_as_pairs`, remove `extract_body_html` tests
- **Modify:** `backend/src/services/prompts.rs` tests — update link extraction prompt tests