# Design: Pipeline Improvements — Web Search, LLM Logs, Link Extraction

**Date**: 2026-03-25
**Scope**: Three independent improvements to the synthesis pipeline

---

## 1. Remove personalized sources from web search prompt

### Context

`build_search_prompt` receives `&[Source]` and injects personalized source URLs into the Phase 2 web search prompt. Phase 1 already handles personalized sources via scraping, so including them again in Phase 2 biases the web search away from discovering new content.

### Change

In `synthesis.rs`, pass `&[]` instead of `&sources` when calling `build_search_prompt` for Phase 2. The function signature is unchanged; Phase 2 then runs a pure Google search based on theme, categories, and gap counts only.
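
As a rough illustration of the effect of this change, the sketch below uses a simplified stand-in for `build_search_prompt`. The `Source` fields, the parameter list, and the prompt wording are all assumptions made for the example; only the idea of passing `&[]` for Phase 2 comes from this design.

```rust
/// Hypothetical stand-in for the backend's `Source` type (fields are assumed).
#[derive(Debug, Clone)]
struct Source {
    url: String,
}

/// Simplified stand-in for `build_search_prompt`; the real signature may differ.
/// Source URLs are only mentioned in the prompt when the slice is non-empty.
fn build_search_prompt(theme: &str, categories: &[&str], sources: &[Source]) -> String {
    let mut prompt = format!("Theme: {theme}\nCategories: {}\n", categories.join(", "));
    if !sources.is_empty() {
        prompt.push_str("Preferred sources:\n");
        for source in sources {
            prompt.push_str(&format!("- {}\n", source.url));
        }
    }
    prompt
}
```

With this shape, a Phase 2 call such as `build_search_prompt(theme, &categories, &[])` produces a prompt with no source URLs, while the function itself stays untouched for any other caller.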

### Files to modify

- `backend/src/services/synthesis.rs`: pass `&[]` for sources in the Phase 2 `build_search_prompt` call

---

## 2. Add `article_url` to LLM call logs

### Context

The `llm_call_log` table records every LLM call during synthesis generation but has no field linking a `classify_summarize` call to the specific article URL being classified. To see which article a classify call relates to, you must cross-reference `article_history`, which is cumbersome when debugging.

### Changes

**Migration:** Add nullable `article_url TEXT` column to `llm_call_log`.

**Backend:**
- `llm_call_log::insert`: add `article_url: Option<&str>` parameter, bind it in the INSERT
- `LlmCallLogRow`: add `article_url: Option<String>` field, update SELECT in `list_by_job_id` to include `article_url`
- `log_llm_call` helper in `synthesis.rs`: add `article_url: Option<&str>` parameter, pass through to `insert`
- The `classify_summarize` call in `synthesis.rs` calls `insert` directly (not via `log_llm_call`); update it to pass the article URL
- The `link_extraction` call in `source_scraper.rs` also calls `insert` directly; update it to pass `None`
- All other call sites via `log_llm_call` (`search`) pass `None`
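
The pass-through described in the list above could look roughly like the following. This is a sketch only: `InMemoryLog` is a hypothetical in-memory stand-in for the `llm_call_log` table, the row is reduced to two fields, and the real `insert` and `log_llm_call` take more parameters than shown here.

```rust
/// Hypothetical, reduced row; the real `LlmCallLogRow` has more columns.
#[derive(Debug, Clone, PartialEq)]
struct LlmCallLogRow {
    call_type: String,
    article_url: Option<String>,
}

/// In-memory stand-in for the `llm_call_log` table (assumption for illustration).
#[derive(Default)]
struct InMemoryLog {
    rows: Vec<LlmCallLogRow>,
}

impl InMemoryLog {
    /// Mirrors the proposed `insert` shape: `article_url` is optional,
    /// and `None` corresponds to a NULL column value.
    fn insert(&mut self, call_type: &str, article_url: Option<&str>) {
        self.rows.push(LlmCallLogRow {
            call_type: call_type.to_string(),
            article_url: article_url.map(str::to_string),
        });
    }
}

/// Helper that forwards `article_url` unchanged to `insert`.
fn log_llm_call(log: &mut InMemoryLog, call_type: &str, article_url: Option<&str>) {
    log.insert(call_type, article_url);
}
```

Under these assumptions, the classify path would call `insert("classify_summarize", Some(url))` directly, and every other call site would pass `None`, so existing rows and non-classify calls simply carry no URL.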

**Frontend:**
- `LlmCallLogEntry` type: add `article_url: string | null`
- `LlmLogs.tsx`: display the URL as a clickable link when present
- `fr.ts`: add `'llmLogs.articleUrl': 'Article'`

### Files to modify

- **Create:** `backend/migrations/20260325000021_add_article_url_to_llm_log.sql`
- **Modify:** `backend/src/db/llm_call_log.rs`: insert signature, row struct, SELECT queries
- **Modify:** `backend/src/services/synthesis.rs`: pass article URL in classify `insert` call, update `log_llm_call` helper
- **Modify:** `backend/src/services/source_scraper.rs`: update `insert` call to pass `None`
- **Modify:** `frontend/src/types.ts`: add field to `LlmCallLogEntry`
- **Modify:** `frontend/src/pages/LlmLogs.tsx`: display article URL
- **Modify:** `frontend/src/i18n/fr.ts`: add label
- **Modify:** `CLAUDE.md`: migration count

---

## 3. Send structured link pairs to LLM instead of raw HTML body

### Context

The LLM link extraction path (`extract_article_links_with_llm`) sends the first 12000 characters of the HTML `<body>` to the LLM. This is noisy: the LLM must parse raw HTML containing scripts, styles, and irrelevant markup, which wastes tokens and reduces accuracy.

### Changes

**New function:** `extract_links_as_pairs(html: &str, base_url: &Url) -> Vec<(String, String)>` in `source_scraper.rs`. Parses all `<a href>` tags and returns `(resolved_href, anchor_text)` pairs. Filtering: http/https only, same-domain, non-empty path. No dedup or article-pattern filtering (the LLM decides). Same-domain filtering is kept to avoid sending irrelevant cross-domain links that waste tokens.
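
To make the three filtering rules concrete, here is a minimal std-only sketch of a per-link predicate. It is not the proposed `extract_links_as_pairs`: `split_url` and `keep_link` are hypothetical helpers, the URL splitting is deliberately naive (a real implementation would rely on a proper URL parser such as the `Url` type in the signature above), and treating a bare `/` as an empty path is an assumption.

```rust
/// Naively splits an absolute URL into (scheme, host, path).
/// Simplification for illustration: ignores ports, userinfo, and query-only URLs.
fn split_url(url: &str) -> Option<(&str, &str, &str)> {
    let (scheme, rest) = url.split_once("://")?;
    let (host, path) = match rest.find('/') {
        Some(idx) => (&rest[..idx], &rest[idx..]),
        None => (rest, ""),
    };
    Some((scheme, host, path))
}

/// Returns true when an already-resolved href passes the three filters:
/// http/https scheme, same host as the base page, and a non-empty path.
fn keep_link(resolved_href: &str, base_host: &str) -> bool {
    match split_url(resolved_href) {
        Some((scheme, host, path)) => {
            let is_http = scheme == "http" || scheme == "https";
            let same_domain = host.eq_ignore_ascii_case(base_host);
            // Assumption: the site root ("/") counts as an empty path.
            let has_path = !path.is_empty() && path != "/";
            is_http && same_domain && has_path
        }
        // No "://" separator (e.g. `mailto:`), so the link is dropped.
        None => false,
    }
}
```

For example, `keep_link("https://example.com/blog/a", "example.com")` is true, while a link to another host, to the site root, or with a `mailto:` scheme is rejected.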

**Updated flow in `extract_article_links_with_llm`:**
1. Fetch the page HTML (unchanged)
2. Call `extract_links_as_pairs` instead of `extract_body_html`
3. Format pairs as a text list: `- /blog/article-1 | "OpenAI launches GPT-6"` (capped at 200 links)
4. Pass the formatted list to `build_link_extraction_prompt`
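
Step 3 of the flow above can be sketched as a small formatting helper. The name `format_link_pairs` and the `MAX_LINKS` constant are hypothetical; the line format and the 200-link cap are taken from the step description, and trimming the anchor text is an added assumption.

```rust
/// Cap on the number of links sent to the LLM, per the 200-link limit above.
const MAX_LINKS: usize = 200;

/// Formats `(href, anchor_text)` pairs as one `- href | "text"` line each,
/// keeping at most `MAX_LINKS` entries in their original order.
fn format_link_pairs(pairs: &[(String, String)]) -> String {
    pairs
        .iter()
        .take(MAX_LINKS)
        .map(|(href, text)| format!("- {href} | \"{}\"", text.trim()))
        .collect::<Vec<_>>()
        .join("\n")
}
```

The resulting string is what would be handed to `build_link_extraction_prompt` as `links_text`. Capping by link count rather than by character count keeps every line intact, so the LLM never sees a truncated URL.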

**Updated prompt:** `build_link_extraction_prompt` parameter renamed from `body_html` to `links_text`. Remove the internal 12000-char truncation (the input is now a pre-formatted list, not raw HTML; the 200-link cap controls size). Update prompt wording to ask the LLM to select article links from the list rather than extract URLs from HTML.

**Schema:** `build_link_extraction_schema` returns `{ "urls": [...] }`, unchanged. The LLM now selects URLs from the provided list rather than extracting from HTML, but the output format stays the same.

**Cleanup:** Remove `extract_body_html` and its tests if no longer used elsewhere.

### Files to modify

- **Modify:** `backend/src/services/source_scraper.rs`: add `extract_links_as_pairs`, update `extract_article_links_with_llm`, remove `extract_body_html`
- **Modify:** `backend/src/services/prompts.rs`: update `build_link_extraction_prompt` (rename parameter, remove truncation, update wording)
- **Modify:** `backend/src/services/source_scraper.rs` tests: add tests for `extract_links_as_pairs`, remove `extract_body_html` tests
- **Modify:** `backend/src/services/prompts.rs` tests: update link extraction prompt tests