# Synthesis Generation Pipeline — Full Algorithm
## Initialization
1. **Load user settings** from DB (categories, provider, models, max_items, etc.)
2. **Cleanup** — delete old article history entries (>N days, dropped only) + truncate old LLM call logs
3. **Validate** categories: if none are configured, fall back to the single default category "Autre".
4. **Load user sources** (personalized URLs like `https://openai.com/blog`)
5. **Resolve LLM provider** — decrypt user's API key, create provider instance (`Arc<dyn LlmProvider>`)
6. **Resolve models** — research model + writing model (user override or admin default)
7. **Setup rate limiter** — per-user or global provider limiter
8. **Prepare LLM scraping option** — if `use_llm_for_article_extraction` enabled, clone provider+model for concurrent use
9. **Initialize tracking structures**: `article_scraped` (category→articles), `source_counts` (per-source article count), `url_soucre` (per-article source), `filled_counts` (per-category article count), `seen_urls` (cross-phase dedup), classification categories (user categories + "Autre")
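
The tracking structures from step 9 could be grouped roughly as follows. This is a minimal sketch, not the pipeline's actual types: the `Article` fields, the `Tracking` struct, and its constructor are assumptions made for illustration; only the field names come from the list above.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical article record; the real type likely carries more fields.
#[derive(Debug, Clone, PartialEq)]
struct Article {
    url: String,
    title: String,
    summary: String,
}

/// Per-run tracking state, mirroring the structures named in step 9.
#[derive(Debug, Default)]
struct Tracking {
    /// category -> articles kept for that category
    article_scraped: HashMap<String, Vec<Article>>,
    /// source -> number of articles taken from it
    source_counts: HashMap<String, usize>,
    /// article URL -> source it came from (identifier spelled as in the text)
    url_soucre: HashMap<String, String>,
    /// category -> number of articles kept
    filled_counts: HashMap<String, usize>,
    /// URLs seen in any phase, used for cross-phase dedup
    seen_urls: HashSet<String>,
    /// user categories plus the catch-all "Autre"
    categories: Vec<String>,
}

impl Tracking {
    /// Build empty tracking state for the given user categories.
    fn new(user_categories: &[&str]) -> Self {
        let mut categories: Vec<String> =
            user_categories.iter().map(|c| c.to_string()).collect();
        // "Autre" is always available as the fallback category.
        if !categories.iter().any(|c| c == "Autre") {
            categories.push("Autre".to_string());
        }
        Tracking { categories, ..Default::default() }
    }
}
```

With no user categories, `Tracking::new(&[])` yields a classification list containing only "Autre", which matches the validation step.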
---
## Phase 1: Personalized Sources
**Skipped entirely if user has 0 sources.**
### 1a. Extract article links from source pages and filter against article history
- Query `article_history` for the last source used. Reorder the personalized sources so that the list starts with the source following the last one used (rolling window).
- For each source, fetch the source page HTML:
  - If `use_llm_for_source_links` is enabled: send the HTML `<head>` + first 8000 chars of `<body>` to the LLM → extract article URLs, most recent first, up to a maximum of 10. If the LLM call fails, fall back to the HTML parsing described in the next item.
    - **LLM call logged** with full prompt/response/timing
  - Otherwise: parse HTML `<a href>` links, keep same-domain links with a non-homepage path, exclude `/tag/`, `/login/`, `/contact/`, `/presentation/`, `/newsletter/`, static assets, etc., and keep only the first 10 links found
- Deduplicate candidate URLs
- Hash each URL (normalized: lowercase, strip fragments/UTM params/trailing slashes)
- Query `article_history` for existing hashes → remove matches
- Trace dropped articles with `status: filtered_history`
- Add the URL to `url_soucre`
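
The source rotation and the URL normalization before hashing can be sketched as below. The function names are hypothetical, the normalization is plain string handling rather than a full URL parser, and `DefaultHasher` is only a stand-in: its output is not guaranteed to be stable across Rust releases, so a persisted `article_history` hash would presumably use a fixed algorithm instead.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Rotate `sources` so the list starts right after `last_used`.
/// If `last_used` is unknown, the order is left unchanged.
fn rotate_sources(sources: &[String], last_used: Option<&str>) -> Vec<String> {
    let mut out = sources.to_vec();
    let pos = last_used.and_then(|l| sources.iter().position(|s| s.as_str() == l));
    if let Some(pos) = pos {
        let n = out.len();
        out.rotate_left((pos + 1) % n);
    }
    out
}

/// Normalize a URL: lowercase, drop the fragment, drop `utm_*` query
/// parameters, and strip trailing slashes from the path.
fn normalize_url(url: &str) -> String {
    let lower = url.trim().to_lowercase();
    // Drop the fragment (everything after '#').
    let no_fragment = lower.split('#').next().unwrap_or("");
    // Split off the query string, if any.
    let (base, query) = match no_fragment.split_once('?') {
        Some((b, q)) => (b, Some(q)),
        None => (no_fragment, None),
    };
    let base = base.trim_end_matches('/');
    // Keep only query parameters that are not UTM tracking parameters.
    let kept: Vec<&str> = query
        .unwrap_or("")
        .split('&')
        .filter(|p| !p.is_empty() && !p.starts_with("utm_"))
        .collect();
    if kept.is_empty() {
        base.to_string()
    } else {
        format!("{}?{}", base, kept.join("&"))
    }
}

/// Hash of the normalized URL (illustrative hasher, see the note above).
fn url_hash(url: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    normalize_url(url).hash(&mut hasher);
    hasher.finish()
}
```

Under these assumptions, `https://example.com/a/` and `HTTPS://EXAMPLE.COM/a#x` normalize to the same string and therefore produce the same hash, so the second one would be filtered as already seen.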
### 1b. Scrape, classify and summarize articles
- For each URL from step 1a:
  - If the number of articles in `source_counts` for the source of the current URL exceeds `max_articles_per_source`:
    - Trace dropped article with `status: filtered_diversity`
    - Move to next URL
  - Fetch the URL (bounded concurrency: 5 with LLM extraction, 10 without).
    - SSRF check (no private IPs), 15s timeout, 5MB body limit.
    - HTML parsing heuristics for title (`<title>`, `og:title`), date (meta tags, JSON-LD, `<time>`), body (strip scripts/nav), soft-404 detection
  - If the scraped article body is empty (scrape failure, soft 404, too old):
    - Trace dropped article in `article_history` with `status: filtered_empty`
    - Move to next URL
  - Send the article (title + first 500 chars of body) + categories + "Autre" to the LLM. The LLM returns `{title, summary, category}`, mapping the article to a category. The LLM generates the summary, and also a title if the provided title is empty.
    - **LLM call logged** with full prompt/response/timing
  - Add the article to `article_scraped` and increase `filled_counts`
  - If the number of articles in this article's category exceeds `max_items_per_category`: change the article's category to "Autre"
  - If the total number of articles in `article_scraped` exceeds `number of categories (including Autre) × max_items_per_category`, exit the loop and move to synthesis generation
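
The three limits in this loop (per-source cap, per-category overflow into "Autre", and the global stop condition) can be sketched as follows. This is an illustration under assumptions: the `Limits`, `State`, and `Outcome` types and the `admit` and `is_full` functions are hypothetical, the source check and the category assignment are collapsed into one call (in the pipeline, scraping and classification happen in between), and "exceeds" is read here as "has already reached the limit".

```rust
use std::collections::HashMap;

const FALLBACK: &str = "Autre";

/// Hypothetical limits; field names follow the settings mentioned in the text.
struct Limits {
    max_articles_per_source: usize,
    max_items_per_category: usize,
}

/// Hypothetical running counters for Phase 1.
#[derive(Default)]
struct State {
    source_counts: HashMap<String, usize>,
    filled_counts: HashMap<String, usize>,
    total: usize,
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Kept { category: String },
    FilteredDiversity,
}

/// Apply the per-source cap and the category overflow rule to one article.
fn admit(state: &mut State, limits: &Limits, source: &str, category: &str) -> Outcome {
    let from_source = state.source_counts.get(source).copied().unwrap_or(0);
    // Assumption: a source is skipped once it has reached its cap.
    if from_source >= limits.max_articles_per_source {
        return Outcome::FilteredDiversity;
    }
    *state.source_counts.entry(source.to_string()).or_insert(0) += 1;

    // A full category overflows into the fallback category.
    let in_category = state.filled_counts.get(category).copied().unwrap_or(0);
    let final_category = if in_category >= limits.max_items_per_category {
        FALLBACK
    } else {
        category
    };
    *state.filled_counts.entry(final_category.to_string()).or_insert(0) += 1;
    state.total += 1;
    Outcome::Kept { category: final_category.to_string() }
}

/// Stop condition: every category slot, including "Autre", is taken.
fn is_full(state: &State, limits: &Limits, category_count: usize) -> bool {
    state.total >= category_count * limits.max_items_per_category
}
```

Note that this sketch does not cap "Autre" itself; the global stop condition is what bounds the total number of articles.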
---
## Phase 2: Web Search Fallback
**Skipped if all user-defined categories are already filled to `max_items_per_category`.**
### 2a. Compute category gaps
- For each user category: `needed = max - already_filled`
- Only proceed if any category needs more
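
A minimal sketch of the gap computation, assuming `filled_counts` maps category names to counts and that `max` is `max_items_per_category`. The function name `category_gaps` is hypothetical.

```rust
use std::collections::HashMap;

/// For each user category, compute how many more articles are needed.
/// Categories that are already full are omitted from the result.
fn category_gaps(
    user_categories: &[String],
    filled_counts: &HashMap<String, usize>,
    max_items_per_category: usize,
) -> Vec<(String, usize)> {
    user_categories
        .iter()
        .filter_map(|cat| {
            let filled = filled_counts.get(cat).copied().unwrap_or(0);
            // saturating_sub avoids underflow if a category is over-filled.
            let needed = max_items_per_category.saturating_sub(filled);
            (needed > 0).then(|| (cat.clone(), needed))
        })
        .collect()
}
```

An empty result means every user category is full, which corresponds to skipping Phase 2.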
### 2b. LLM web search pass
- Build search prompt with theme, categories, gap counts ("find N articles for category_1, M for category_2")
- Send search prompt to LLM. LLM returns structured JSON: `{category_0: [{title, url, summary}], category_1: [...]}`
- **LLM call logged** with full prompt/response/timing
- **Filter homepage URLs** — drop articles with path `/` or empty
- **Cross-phase dedup** — drop URLs already seen in Phase 1
- **Dedup by URL** — drop duplicate URLs within Phase 2 (case-insensitive)
- **Limit articles per source** — enforce `max_articles_per_source` per domain (spread across categories first, then fill)
- **Filter against article history** — BEFORE scraping (saves HTTP requests), drop already-seen URLs
- Each drop traced in `article_history` with appropriate status
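
The URL filters applied to the search results can be sketched as a single pass that records a drop reason for each rejected URL. Several simplifications are assumed here: URLs are split with plain string operations, `seen_phase1` is expected to hold lowercased URLs, the per-domain limit is applied in arrival order (the "spread across categories first" strategy is not modeled), and the history filter is omitted. All names are hypothetical.

```rust
use std::collections::{HashMap, HashSet};

/// Path of a URL, e.g. "/blog/post" for "https://example.com/blog/post?x=1".
fn url_path(url: &str) -> &str {
    let rest = url.split_once("://").map_or(url, |(_, r)| r);
    let rest = rest.split(|c: char| c == '?' || c == '#').next().unwrap_or("");
    match rest.find('/') {
        Some(i) => &rest[i..],
        None => "",
    }
}

/// Host part of a URL, lowercased.
fn url_domain(url: &str) -> String {
    let rest = url.split_once("://").map_or(url, |(_, r)| r);
    let host = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()
        .unwrap_or("");
    host.to_lowercase()
}

#[derive(Debug, PartialEq)]
enum DropReason {
    Homepage,
    SeenInPhase1,
    Duplicate,
    Diversity,
}

/// Returns the kept URLs and the dropped URLs with their reasons.
fn filter_candidates(
    urls: &[&str],
    seen_phase1: &HashSet<String>,
    max_articles_per_source: usize,
) -> (Vec<String>, Vec<(String, DropReason)>) {
    let mut kept = Vec::new();
    let mut dropped = Vec::new();
    let mut seen_here: HashSet<String> = HashSet::new();
    let mut per_domain: HashMap<String, usize> = HashMap::new();

    for &url in urls {
        let key = url.to_lowercase(); // case-insensitive comparison key
        let path = url_path(url);
        let reason = if path.is_empty() || path == "/" {
            Some(DropReason::Homepage)
        } else if seen_phase1.contains(&key) {
            Some(DropReason::SeenInPhase1)
        } else if !seen_here.insert(key) {
            Some(DropReason::Duplicate)
        } else {
            let count = per_domain.entry(url_domain(url)).or_insert(0);
            if *count >= max_articles_per_source {
                Some(DropReason::Diversity)
            } else {
                *count += 1;
                None
            }
        };
        match reason {
            Some(r) => dropped.push((url.to_string(), r)),
            None => kept.push(url.to_string()),
        }
    }
    (kept, dropped)
}
```

Returning the drop reasons alongside the kept URLs makes it straightforward to write one `article_history` trace per dropped URL afterwards.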
### 2c. Scrape web search results
- Same scraping as Phase 1 (bounded concurrency, SSRF check, optional LLM extraction)
- Filter empty content (scrape failures, soft 404, too old)
- Trace drops
- Merge results into `all_scraped`
- Move to synthesis generation
---
## Save + Record
- **Sanitize** — strip `\u0000` null bytes from JSON (PostgreSQL rejects them in JSONB)
- **Save synthesis** — insert into `syntheses` table with `job_id`, `week` (ISO week), `sections` (JSONB), `status: completed`
- **Record used articles** — insert each article URL into `article_history` with `status: used`, `synthesis_id`, `job_id`, and category name (for future dedup + provenance)
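
The sanitize step can be sketched on the string values before they are serialized, which avoids having to match the escaped `\u0000` sequence in serialized JSON text. The `SectionItem` shape is hypothetical; the actual `sections` schema is not shown here.

```rust
/// Hypothetical item inside a synthesis section.
#[derive(Debug, Clone, PartialEq)]
struct SectionItem {
    title: String,
    url: String,
    summary: String,
}

/// Remove NUL characters, which PostgreSQL rejects inside JSONB strings.
fn strip_nul(s: &str) -> String {
    s.chars().filter(|&c| c != '\0').collect()
}

/// Return a copy of the item with NUL characters removed from every field.
fn sanitize_item(item: &SectionItem) -> SectionItem {
    SectionItem {
        title: strip_nul(&item.title),
        url: strip_nul(&item.url),
        summary: strip_nul(&item.summary),
    }
}
```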