diff --git a/docs/superpowers/specs/2026-03-25-pipeline-improvements-design.md b/docs/superpowers/specs/2026-03-25-pipeline-improvements-design.md
new file mode 100644
index 0000000..ab6d35a
--- /dev/null
+++ b/docs/superpowers/specs/2026-03-25-pipeline-improvements-design.md
@@ -0,0 +1,87 @@
+# Design: Pipeline Improvements — Web Search, LLM Logs, Link Extraction
+
+**Date**: 2026-03-25
+**Scope**: Three independent improvements to the synthesis pipeline
+
+---
+
+## 1. Remove personalized sources from web search prompt
+
+### Context
+
+`build_search_prompt` receives `&[Source]` and injects personalized source URLs into the Phase 2 web search prompt. Phase 1 already handles personalized sources via scraping, so including them again in Phase 2 biases the web search away from discovering new content.
+
+### Change
+
+In `synthesis.rs`, pass `&[]` instead of `&sources` when calling `build_search_prompt` for Phase 2. The function signature is unchanged — Phase 2 will do a pure Google search based on theme, categories, and gap counts only.
+
+### Files to modify
+
+- `backend/src/services/synthesis.rs` — pass `&[]` for sources in the Phase 2 `build_search_prompt` call
+
+---
+
+## 2. Add `article_url` to LLM call logs
+
+### Context
+
+The `llm_call_log` table records every LLM call during synthesis generation but has no field linking a `classify_summarize` call to the specific article URL being classified. To see which article a classify call relates to, you must cross-reference `article_history` — cumbersome for debugging.
+
+### Changes
+
+**Migration:** Add nullable `article_url TEXT` column to `llm_call_log`.
+
+**Backend:**
+- `llm_call_log::insert` — add `article_url: Option<&str>` parameter, bind it in the INSERT
+- `LlmCallLogRow` — add `article_url: Option<String>` field, update SELECT in `list_by_job_id` to include `article_url`
+- `log_llm_call` helper in `synthesis.rs` — add `article_url: Option<&str>` parameter, pass through to `insert`
+- The `classify_summarize` call in synthesis.rs calls `insert` directly (not via `log_llm_call`) — update it to pass the article URL
+- The `link_extraction` call in `source_scraper.rs` also calls `insert` directly — update it to pass `None`
+- All other call sites via `log_llm_call` (`search`) pass `None`
+
+**Frontend:**
+- `LlmCallLogEntry` type — add `article_url: string | null`
+- `LlmLogs.tsx` — display the URL as a clickable link when present
+- `fr.ts` — add `'llmLogs.articleUrl': 'Article'`
+
+### Files to modify
+
+- **Create:** `backend/migrations/20260325000021_add_article_url_to_llm_log.sql`
+- **Modify:** `backend/src/db/llm_call_log.rs` — insert signature, row struct, SELECT queries
+- **Modify:** `backend/src/services/synthesis.rs` — pass article URL in classify `insert` call, update `log_llm_call` helper
+- **Modify:** `backend/src/services/source_scraper.rs` — update `insert` call to pass `None`
+- **Modify:** `frontend/src/types.ts` — add field to `LlmCallLogEntry`
+- **Modify:** `frontend/src/pages/LlmLogs.tsx` — display article URL
+- **Modify:** `frontend/src/i18n/fr.ts` — add label
+- **Modify:** `CLAUDE.md` — migration count
+
+---
+
+## 3. Send structured link pairs to LLM instead of raw HTML body
+
+### Context
+
+The LLM link extraction path (`extract_article_links_with_llm`) sends the first 12000 chars of the HTML `<body>` to the LLM. This is noisy — the LLM must parse raw HTML with scripts, styles, and irrelevant markup, wasting tokens and reducing accuracy.
+
+### Changes
+
+**New function:** `extract_links_as_pairs(html: &str, base_url: &Url) -> Vec<(String, String)>` in `source_scraper.rs`. Parses all `<a>` tags and returns `(resolved_href, anchor_text)` pairs. Filtering: http/https only, same-domain, non-empty path. No dedup or article-pattern filtering (the LLM decides). Same-domain filtering is kept to avoid sending irrelevant cross-domain links that waste tokens.
+
+**Updated flow in `extract_article_links_with_llm`:**
+1. Fetch the page HTML (unchanged)
+2. Call `extract_links_as_pairs` instead of `extract_body_html`
+3. Format pairs as a text list: `- /blog/article-1 | "OpenAI launches GPT-6"` (capped at 200 links)
+4. Pass the formatted list to `build_link_extraction_prompt`
+
+**Updated prompt:** `build_link_extraction_prompt` parameter renamed from `body_html` to `links_text`. Remove the internal 12000-char truncation (the input is now a pre-formatted list, not raw HTML; the 200-link cap controls size). Update prompt wording to ask the LLM to select article links from the list rather than extract URLs from HTML.
+
+**Schema:** `build_link_extraction_schema` returns `{ "urls": [...] }` — unchanged. The LLM now selects URLs from the provided list rather than extracting from HTML, but the output format stays the same.
+
+**Cleanup:** Remove `extract_body_html` and its tests if no longer used elsewhere.
+
+### Files to modify
+
+- **Modify:** `backend/src/services/source_scraper.rs` — add `extract_links_as_pairs`, update `extract_article_links_with_llm`, remove `extract_body_html`
+- **Modify:** `backend/src/services/prompts.rs` — update `build_link_extraction_prompt` (rename parameter, remove truncation, update wording)
+- **Modify:** `backend/src/services/source_scraper.rs` tests — add tests for `extract_links_as_pairs`, remove `extract_body_html` tests
+- **Modify:** `backend/src/services/prompts.rs` tests — update link extraction prompt tests
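
Step 3 of the updated flow in section 3 (formatting the pairs as a capped text list) could be sketched roughly as below. This is only an illustration under stated assumptions, not the spec's implementation: `format_link_pairs` and `max_links` are hypothetical names, the whitespace collapsing is an added assumption so that each link stays on one line, and the href is printed exactly as it appears in the pair.

```rust
/// Hypothetical helper (name not from the spec): renders `(href, anchor_text)`
/// pairs as one `- href | "text"` bullet per line, keeping at most `max_links`.
fn format_link_pairs(pairs: &[(String, String)], max_links: usize) -> String {
    pairs
        .iter()
        // The spec caps the list at 200 links; the cap is a parameter here.
        .take(max_links)
        .map(|(href, text)| {
            // Assumption: collapse whitespace runs (including newlines) in the
            // anchor text so every entry fits on a single line.
            let text = text.split_whitespace().collect::<Vec<_>>().join(" ");
            format!("- {} | \"{}\"", href, text)
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

Calling `format_link_pairs(&pairs, 200)` would give the string to pass on as `links_text`. Quotes inside anchor text are not escaped in this sketch; whether that matters depends on how the prompt wording describes the list format.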