# Design: Article Tracing — Track Origin and Status of Every Article Candidate

**Date**: 2026-03-24
**Scope**: Enrich article_history with provenance metadata, track dropped articles, add frontend viewers

---

## Context

The synthesis pipeline drops articles at various stages (history dedup, empty content, too old, source diversity) but doesn't record why. When the output quality is low (generic links, archives instead of articles), there's no way to diagnose which step failed. Users need visibility into the full candidate pipeline to improve their sources and settings.

## Approach

Enrich the existing `article_history` table with tracing metadata. Insert dropped articles immediately at each filtering step. Add two frontend views: a global history page and a per-synthesis provenance section.

## Enriched `article_history` Table

New columns via migration:

```sql
ALTER TABLE article_history ADD COLUMN title TEXT NOT NULL DEFAULT '';
ALTER TABLE article_history ADD COLUMN source_type TEXT NOT NULL DEFAULT 'unknown';
ALTER TABLE article_history ADD COLUMN source_url TEXT;
ALTER TABLE article_history ADD COLUMN category TEXT;
ALTER TABLE article_history ADD COLUMN synthesis_id UUID REFERENCES syntheses(id) ON DELETE SET NULL;
ALTER TABLE article_history ADD COLUMN status TEXT NOT NULL DEFAULT 'used';
ALTER TABLE article_history ADD COLUMN scraped_ok BOOLEAN NOT NULL DEFAULT true;
ALTER TABLE article_history ADD COLUMN job_id UUID NOT NULL DEFAULT gen_random_uuid();

-- Drop unique index — table is now a trace log, same URL can appear in multiple runs
DROP INDEX idx_article_history_user_url;
CREATE INDEX idx_article_history_user_url ON article_history(user_id, url_hash);
CREATE INDEX idx_article_history_job_id ON article_history(job_id);

-- Store job_id on syntheses for direct provenance lookup
ALTER TABLE syntheses ADD COLUMN job_id UUID;
```

**`source_type` values:** `personalized_source`, `web_search`, `overflow`. Rows that predate the migration keep the column default `unknown`.

**`status` values:** `used`, `filtered_history`, `filtered_empty`, `filtered_old`, `filtered_diversity`, `filtered_homepage`, `filtered_duplicate`, `filtered_cross_phase_dedup`

- `synthesis_id`: nullable — dropped articles don't belong to a synthesis. `ON DELETE SET NULL` preserves history.
- `job_id`: the generation pipeline's UUID — links both used and dropped articles from the same generation run. Enables the provenance view to show all candidates, not just used ones.
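
Since `status` is stored as free text, a typo at any insertion point would silently create a new status. One way to guard against that is a closed enum that maps to and from the column strings. This is only a sketch: `ArticleStatus` and `is_dropped` are hypothetical names, and the design above keeps `status: String` on the struct.

```rust
use std::str::FromStr;

/// Hypothetical closed set mirroring the `status` values listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ArticleStatus {
    Used,
    FilteredHistory,
    FilteredEmpty,
    FilteredOld,
    FilteredDiversity,
    FilteredHomepage,
    FilteredDuplicate,
    FilteredCrossPhaseDedup,
}

impl ArticleStatus {
    /// The exact string written to the `status` column.
    pub fn as_str(self) -> &'static str {
        match self {
            Self::Used => "used",
            Self::FilteredHistory => "filtered_history",
            Self::FilteredEmpty => "filtered_empty",
            Self::FilteredOld => "filtered_old",
            Self::FilteredDiversity => "filtered_diversity",
            Self::FilteredHomepage => "filtered_homepage",
            Self::FilteredDuplicate => "filtered_duplicate",
            Self::FilteredCrossPhaseDedup => "filtered_cross_phase_dedup",
        }
    }

    /// True for every status except `used`.
    pub fn is_dropped(self) -> bool {
        self != Self::Used
    }
}

impl FromStr for ArticleStatus {
    type Err = String;

    /// Parses a column value or an API filter; unknown strings are rejected.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "used" => Ok(Self::Used),
            "filtered_history" => Ok(Self::FilteredHistory),
            "filtered_empty" => Ok(Self::FilteredEmpty),
            "filtered_old" => Ok(Self::FilteredOld),
            "filtered_diversity" => Ok(Self::FilteredDiversity),
            "filtered_homepage" => Ok(Self::FilteredHomepage),
            "filtered_duplicate" => Ok(Self::FilteredDuplicate),
            "filtered_cross_phase_dedup" => Ok(Self::FilteredCrossPhaseDedup),
            other => Err(format!("unknown status: {other}")),
        }
    }
}
```

The same pattern would apply to `source_type`; the `status` query parameter of the history endpoint could be validated with `parse::<ArticleStatus>()` before it reaches the database.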

## Pipeline Integration

Insert dropped articles immediately at each filtering step. Insert used articles only after the synthesis is saved, so that they carry the real `synthesis_id`.

**Insertion points (in pipeline order):**

Phase 1 (personalized sources):
1. **Scrape** — after `scrape_flat_urls` + `filter_empty`, empty articles → `status: filtered_empty`, `scraped_ok: false`, `source_type: personalized_source`
2. **History filter** — articles matching existing history → `status: filtered_history`
3. **Retry scrape** (if under-filled) — same as steps 1-2 for retry candidates
4. **Classification overflow** — articles that overflow both target category and "Autre" → `status: filtered_diversity` (tracked via overflow vec)
5. **Source diversity** — articles pruned by `max_articles_per_source` → `status: filtered_diversity`

Phase 2 (web search):
6. **Homepage filter** — articles with homepage URLs → `status: filtered_homepage`, `source_type: web_search`
7. **Cross-phase dedup** — Phase 2 URLs already seen in Phase 1 → `status: filtered_cross_phase_dedup`
8. **Dedup by URL** — duplicate URLs within Phase 2 → `status: filtered_duplicate`
9. **Source diversity** — `limit_articles_per_source` drops → `status: filtered_diversity`
10. **History filter** — before scraping → `status: filtered_history`
11. **Scrape** — after scraping, empty articles → `status: filtered_empty`, `scraped_ok: false`

After save:
12. **Used articles** — insert with `status: used`, `synthesis_id` set, `scraped_ok: true`

All inserts include `job_id` from the pipeline's `_job_id` parameter (rename to `job_id` — remove underscore prefix).
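
The insertion points above share one shape: a filter splits candidates into kept and dropped, and each dropped candidate becomes a trace row. The sketch below shows that shape for the empty-content step. `Candidate`, `TraceEntry`, and `filter_empty_traced` are hypothetical stand-ins (the real `filter_empty` and `ArticleHistoryEntry` differ), and `job_id` is a plain `String` here to avoid the `uuid` dependency.

```rust
/// Hypothetical stand-in for a scraped candidate.
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub url: String,
    pub title: String,
    pub content: String,
}

/// Hypothetical, dependency-free subset of `ArticleHistoryEntry`.
#[derive(Debug, Clone, PartialEq)]
pub struct TraceEntry {
    pub url: String,
    pub title: String,
    pub source_type: &'static str,
    pub status: &'static str,
    pub scraped_ok: bool,
    pub job_id: String,
}

/// Splits candidates into those with content and trace rows for the empty ones.
pub fn filter_empty_traced(
    candidates: Vec<Candidate>,
    source_type: &'static str,
    job_id: &str,
) -> (Vec<Candidate>, Vec<TraceEntry>) {
    // Whitespace-only content counts as empty in this sketch.
    let (kept, dropped): (Vec<_>, Vec<_>) = candidates
        .into_iter()
        .partition(|c| !c.content.trim().is_empty());

    let trace = dropped
        .into_iter()
        .map(|c| TraceEntry {
            url: c.url,
            title: c.title,
            source_type,
            status: "filtered_empty",
            scraped_ok: false,
            job_id: job_id.to_string(),
        })
        .collect();

    (kept, trace)
}
```

The caller would pass each returned `TraceEntry` to `insert_entry` right away and continue the pipeline with the kept candidates, which leaves the filtering decision itself unchanged.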

**`source_url` tracking:** In Phase 1, thread the source page URL through `candidate_urls` as `Vec<(String, String)>` (article_url, source_url) and update `scrape_flat_urls` to accept pairs. In Phase 2, `source_url` is NULL. This requires modifying `ScrapedNewsItem` to add an optional `source_url: Option<String>` field.
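
A minimal sketch of how the pair could flow into the item, assuming a simplified `ItemSketch` in place of the real `ScrapedNewsItem` (whose other fields are not shown in this document). `phase1_items` and `phase2_items` are hypothetical helper names.

```rust
/// Hypothetical stand-in for `ScrapedNewsItem`, reduced to the two URL fields.
#[derive(Debug, Clone, PartialEq)]
pub struct ItemSketch {
    pub url: String,
    pub source_url: Option<String>,
}

/// Phase 1: each candidate is an (article_url, source_url) pair.
pub fn phase1_items(candidate_urls: Vec<(String, String)>) -> Vec<ItemSketch> {
    candidate_urls
        .into_iter()
        .map(|(article_url, source_url)| ItemSketch {
            url: article_url,
            source_url: Some(source_url),
        })
        .collect()
}

/// Phase 2: web search results have no source page, so `source_url` is `None`.
pub fn phase2_items(urls: Vec<String>) -> Vec<ItemSketch> {
    urls.into_iter()
        .map(|url| ItemSketch { url, source_url: None })
        .collect()
}
```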

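**Provenance lookup:** Store `job_id` directly on the `syntheses` table (set during save). The provenance endpoint does a single query: `SELECT * FROM article_history WHERE job_id = (SELECT job_id FROM syntheses WHERE id = $1)`. The DB function `list_by_job_id` additionally scopes the result by `user_id`. Syntheses saved before this migration have a NULL `job_id`, so the query returns no rows for them.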

**Cleanup:** `cleanup_old` deletes only entries where `synthesis_id IS NULL`, i.e. dropped articles. Used articles stay as long as their synthesis exists. When the synthesis is deleted, `ON DELETE SET NULL` clears their `synthesis_id`, and the next cleanup pass can remove them. This preserves provenance for existing syntheses.
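
The cleanup rule can be read as a predicate over a single row. The sketch below assumes `cleanup_old` uses an age cutoff; `retention_days` is a hypothetical parameter, since this document does not state the actual retention period, and the synthesis id is a plain string stand-in for the `Uuid`.

```rust
/// Returns true when a row may be deleted by cleanup.
/// Assumption: "old" means older than `retention_days` (hypothetical parameter).
pub fn is_cleanup_eligible(
    synthesis_id: Option<&str>,
    age_days: u32,
    retention_days: u32,
) -> bool {
    // Rows still linked to a synthesis are never deleted, whatever their age.
    synthesis_id.is_none() && age_days > retention_days
}
```

A used row therefore goes through two steps: deleting its synthesis sets `synthesis_id` to NULL, and only then does the age check decide whether the row is removed.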

## DB Module Updates

`db/article_history.rs` needs:

**Updated insert function:**
```rust
pub struct ArticleHistoryEntry {
    pub user_id: Uuid,
    pub url: String,
    pub url_hash: String,
    pub title: String,
    pub source_type: String,
    pub source_url: Option<String>,
    pub category: Option<String>,
    pub synthesis_id: Option<Uuid>,
    pub status: String,
    pub scraped_ok: bool,
    pub job_id: Uuid,
}

pub async fn insert_entry(pool: &PgPool, entry: &ArticleHistoryEntry) -> Result<(), AppError>
```

**Query functions:**
```rust
pub async fn list_history(
    pool: &PgPool, user_id: Uuid,
    limit: i64, offset: i64,
    status_filter: Option<&str>,
    source_type_filter: Option<&str>,
    synthesis_id_filter: Option<Uuid>,
) -> Result<Vec<ArticleHistoryRow>, AppError>

pub async fn list_by_job_id(
    pool: &PgPool, user_id: Uuid, job_id: Uuid,
) -> Result<Vec<ArticleHistoryRow>, AppError>

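// Takes the same filters as `list_history` so that `total` matches the filtered page.
pub async fn count_history(
    pool: &PgPool, user_id: Uuid,
    status_filter: Option<&str>,
    source_type_filter: Option<&str>,
    synthesis_id_filter: Option<Uuid>,
) -> Result<i64, AppError>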
```
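
Because `list_history` and `count_history` must agree on which rows match, one option is to build the `WHERE` clause once and reuse it in both queries. The helper below is a sketch: `build_history_where` is a hypothetical name, and the bound values are plain strings only to show placeholder numbering (real code would bind typed values, such as a `Uuid` for `synthesis_id`).

```rust
/// Builds the shared WHERE clause plus the values to bind after `$1` (the user id).
/// Column names are fixed in code; only values come from the caller.
pub fn build_history_where(
    status_filter: Option<&str>,
    source_type_filter: Option<&str>,
    synthesis_id_filter: Option<&str>,
) -> (String, Vec<String>) {
    let mut clauses = vec!["user_id = $1".to_string()];
    let mut binds: Vec<String> = Vec::new();

    let filters = [
        ("status", status_filter),
        ("source_type", source_type_filter),
        ("synthesis_id", synthesis_id_filter),
    ];

    for (column, value) in filters {
        // An empty value (e.g. `status=` in the query string) means "no filter".
        if let Some(v) = value.filter(|v| !v.is_empty()) {
            binds.push(v.to_string());
            // Placeholders start at $2 because $1 is reserved for user_id.
            clauses.push(format!("{column} = ${}", binds.len() + 1));
        }
    }

    (format!("WHERE {}", clauses.join(" AND ")), binds)
}
```

`list_history` would append `ORDER BY`, `LIMIT`, and `OFFSET` to this clause, while `count_history` would wrap it in `SELECT COUNT(*)`, so both apply identical filters.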

The existing `check_urls_exist` and `cleanup_old` stay, since the dedup logic relies on them; `cleanup_old` only gains the `synthesis_id IS NULL` condition described under Cleanup. The new `insert_entry` is for tracing. The old `insert_urls` (batch insert of used articles after save) will be replaced by `insert_entry` calls with `status: used` and `synthesis_id` set.

## API Endpoints

**`GET /api/v1/article-history?limit=50&offset=0&status=&source_type=&synthesis_id=`**

Returns paginated history with optional filters. Response:
```json
{
  "items": [
    {
      "id": "uuid",
      "url": "https://...",
      "title": "Article title",
      "source_type": "personalized_source",
      "source_url": "https://source-page.com/blog",
      "category": "AI News",
      "synthesis_id": "uuid or null",
      "status": "used",
      "scraped_ok": true,
      "job_id": "uuid",
      "created_at": "2026-03-24T..."
    }
  ],
  "total": 150
}
```

**`GET /api/v1/syntheses/:id/provenance`**

Returns all `article_history` entries for the generation run that produced this synthesis. Reads the `job_id` stored on the synthesis row, then returns every entry with that `job_id` (the single query described under Provenance lookup). Response: an array of entries, each with the same shape as an element of `items` above.

## Frontend Views

### Settings — "Historique des articles" page

- New page at `/article-history`, accessible via button in Settings
- Table: Date, Title, URL (truncated, clickable), Source Type, Status, Category, Synthesis (link)
- Filters: dropdown for status, dropdown for source_type
- Pagination (50 per page)
- Color-coded badges: green for `used`, red for any `filtered_*` status, grey for the `overflow` source type

### Synthesis detail — "Provenance" section

- Collapsible section at the bottom of `SynthesisDetail.tsx`
- Same table but pre-filtered to the generation run's `job_id`
- Shows all candidates: used + dropped, so the user sees the full funnel
- Grouped or sortable by status to quickly see what was dropped and why

## Files to Modify

**Backend:**
- **Create:** migration `20260324000016_enrich_article_history.sql`
- **Modify:** `backend/src/db/article_history.rs` — add `ArticleHistoryEntry` struct, `insert_entry`, `list_history`, `list_by_job_id`, `count_history`. Update `insert_urls` to use new columns or replace with `insert_entry`.
- **Modify:** `backend/src/services/synthesis.rs` — at each filtering step, call `insert_entry` with appropriate status/source_type/job_id. After save, insert used articles with synthesis_id.
- **Create:** `backend/src/handlers/article_history.rs` — handlers for both endpoints
- **Modify:** `backend/src/handlers/mod.rs` — register module
- **Modify:** `backend/src/router.rs` — add routes
- **Modify:** `backend/src/models/synthesis.rs` — add `source_url: Option<String>` to `ScrapedNewsItem`
- **Modify:** `backend/src/db/syntheses.rs` — save `job_id` on synthesis creation
- **Modify:** `CLAUDE.md` — migration count to 16

**Frontend:**
- **Create:** `frontend/src/pages/ArticleHistory.tsx` — global history viewer
- **Create:** `frontend/src/api/articleHistory.ts` — API client
- **Modify:** `frontend/src/pages/SynthesisDetail.tsx` — collapsible provenance section
- **Modify:** `frontend/src/App.tsx` — add route
- **Modify:** `frontend/src/pages/Settings.tsx` — add link to history page
- **Modify:** `frontend/src/i18n/fr.ts` — labels
- **Modify:** `frontend/src/types.ts` — `ArticleHistoryEntry` type

**Tests:**
- **Modify:** `e2e/tests/generation-live.spec.ts` — verify provenance endpoint
- Unit tests for DB queries

## What Does NOT Change

- `check_urls_exist` — still used for dedup filtering (checks url_hash)
- `cleanup_old`: still deletes old entries; the only change is the added `synthesis_id IS NULL` condition (see Cleanup)
- Pipeline logic — filtering steps unchanged, just adding insert calls alongside
- Existing synthesis display — unchanged
- Settings values — no new settings