# Experiment roadmap · finding the 90+ recipe that scales

**Goal:** an OD recipe that scores 90+ composite on ≥3 customers of varying data depth.

**Current state (2026-05-18):**
- vicwest-roofing: 89/100 (one customer, rich data) ✓
- vip-roofing-brisbane: 0/100 (recipe didn't scale)

## What we measure (deep audit dimensions)

Eight scoring axes. Five exist today (one of them, L1 locked-facts, needs a rewrite); three are new.

| Axis | Current | Gap | Tool to build |
|------|---------|-----|---------------|
| L1 locked-facts | ✓ `pl:audit-website-output` | bug: didn't exclude `brand/preview.html` cleanly | `pl:audit-locked-facts` (rewrite, customer-only) |
| L2 content quality | ✓ `pl:audit-website-output` L2 | doesn't catch meta-vibe headings | extend `content-validator.js` with meta-vibe regex |
| L3 build sanity | ✓ `pl:audit-website-output` L3 | OK | — |
| Vision 10-dim | ✓ `pl:audit-vision` (claude-vision, per-page array) | OK | — |
| LLM copy judge | ✓ codex judge (codex CLI) | only 1 score per page | per-section breakdown |
| **Meta-vibe leak detector** | ✗ missing | new | `pl:audit-meta-leaks` |
| **Cross-page design consistency** | ✗ missing | new | check header/footer byte-equal, section spacing, font-family |
| **Image relevance** | ✗ missing | currently uses vision D6+D9 | new vision pass: "does each section's image match its purpose" |

## What we vary (experiment axes)

Five axes, treated as independent. Greedy optimization: vary one axis with the others held fixed, lock the winner, then move to the next axis.

### Axis A · Input format
- A1 · `brief-summary.md` + sanitized `site-architecture.json` (current winner)
- A2 · `brief-summary.md` only (no architecture)
- A3 · `brief-summary.md` + per-page `copy-briefs.md` (architect's notes rewritten as copy briefs the agent shouldn't paraphrase)
- A4 · `brief-summary.md` + design-tokens-only architecture (just block types, no purpose/copy_brief)

### Axis B · Image strategy
- B1 · `I_real` (curated customer photos)
- B2 · `I_ai` (gpt-image-1 generated to spec)
- B3 · `I_mixed` (AI hero + customer rest)
- B4 · `I_library` (shared roofing template library, tagged by intent)
- B5 · `I_classified` (vision-classifier picks the best of customer + library)

### Axis C · Skill
- C1 · `web-prototype` (default, current winner)
- C2 · `web-prototype-taste-soft` (warmer for residential)
- C3 · `web-prototype-taste-editorial` (newspaper-style)
- C4 · `web-prototype-taste-brutalist` (Swiss-industrial)

### Axis D · Prompt depth
- D1 · current 200-line prompt (production-grade standards)
- D2 · D1 + explicit heading vocabulary (anti-meta-vibe)
- D3 · D2 + dropdown caret + per-page form variants
- D4 · D3 + design-language critical-rules block

### Axis E · Customer data depth
- E1 · rich data (vicwest level: 9 photos, 4 testimonials, 10-page crawl, ABN, hours, 23 suburbs)
- E2 · medium (5+ photos, 2+ testimonials, 5+ pages crawl)
- E3 · thin (3 photos, 1 testimonial, 1 page)
- E4 · none (just GBP basics)
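
To place a new customer on Axis E, the tier descriptions above can be turned into a small classifier. This is a minimal sketch: it assumes the E1 and E3 figures are minimums (the list states them as examples), and the function name `data_tier` and its three inputs are illustrative rather than part of the pipeline.

```python
def data_tier(photos: int, testimonials: int, crawled_pages: int) -> str:
    """Map raw customer data counts to an Axis E tier label."""
    # Assumption: each tier's figures are floors for photos, testimonials, crawled pages.
    floors = [
        ("E1", (9, 4, 10)),  # rich, vicwest level
        ("E2", (5, 2, 5)),   # medium
        ("E3", (3, 1, 1)),   # thin
    ]
    for tier, (min_photos, min_testimonials, min_pages) in floors:
        if (
            photos >= min_photos
            and testimonials >= min_testimonials
            and crawled_pages >= min_pages
        ):
            return tier
    return "E4"  # nothing beyond GBP basics


print(data_tier(photos=6, testimonials=2, crawled_pages=5))
```

A customer that is strong on one count but weak on another falls to the tier where all three floors are met, which is a deliberately conservative choice.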

## Stopping criteria per phase

- **Phase passes** if composite ≥ 90 AND no meta-vibe headings AND no L1 misses AND vision D6+D9 ≥ 8/10 (D6/D9 are `pl:audit-vision` dimensions, unrelated to Axis D)
- **Phase scales** if the same recipe passes on 3 different customers
- **Done** = the recipe scales on customers at the E2 (medium data) level
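
The criteria above can be written as two predicates so every run is judged the same way. This is a sketch under stated assumptions: the `AuditResult` fields are hypothetical names, not the real audit output schema, and "D6+D9 ≥ 8/10" is read as each dimension scoring at least 8.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AuditResult:
    """Scores for one customer run (illustrative field names)."""
    composite: float         # composite score, 0-100
    meta_vibe_headings: int  # headings flagged as meta-vibe
    l1_misses: int           # locked facts missing or wrong
    vision_d6: float         # vision dimension 6, out of 10
    vision_d9: float         # vision dimension 9, out of 10


def phase_passes(r: AuditResult) -> bool:
    # Assumption: D6 and D9 must each reach 8/10.
    return (
        r.composite >= 90
        and r.meta_vibe_headings == 0
        and r.l1_misses == 0
        and r.vision_d6 >= 8
        and r.vision_d9 >= 8
    )


def phase_scales(results: List[AuditResult]) -> bool:
    # The same recipe must pass on at least 3 different customers.
    return len(results) >= 3 and all(phase_passes(r) for r in results)
```

Under this reading, the current vicwest run (89 composite) does not pass, whatever its other scores.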

## Phase plan (greedy, lock one axis at a time)
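
The greedy strategy can be sketched as a loop over axes. The `score` callback stands in for a full generate-and-audit run, and `toy_score` with its bonus values is invented for illustration; it is not measured data.

```python
from typing import Callable, Dict, List

Recipe = Dict[str, str]


def greedy_lock(
    axes: Dict[str, List[str]],
    baseline: Recipe,
    score: Callable[[Recipe], float],
) -> Recipe:
    """Lock one axis at a time: try each option with the rest fixed, keep the best."""
    locked = dict(baseline)
    for axis, options in axes.items():  # dict order sets the phase order
        best = max(options, key=lambda opt: score({**locked, axis: opt}))
        locked[axis] = best
    return locked


# Phase order used in this plan: prompt depth, then images, then skill.
axes = {
    "D": ["D1", "D2", "D3", "D4"],
    "B": ["B1", "B2", "B3", "B4", "B5"],
    "C": ["C1", "C2", "C3", "C4"],
}
baseline = {"A": "A1", "B": "B1", "C": "C1", "D": "D1"}


def toy_score(recipe: Recipe) -> float:
    # Hypothetical scorer: made-up bonuses, only to exercise the loop.
    bonus = {"D4": 5.0, "B5": 3.0, "C2": 2.0}
    return 89.0 + sum(bonus.get(option, 0.0) for option in recipe.values())


print(greedy_lock(axes, baseline, toy_score))
```

Greedy locking costs one run per option instead of one per combination, but it only finds the best recipe if the axes really are independent. An interaction such as a prompt depth that works only with one image strategy would be missed.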

### Phase 1 · Lock prompt depth (vicwest only, ~$3, 30min)
Test D1/D2/D3/D4 with image=B1 (`I_real`), input=A1, skill=C1 held fixed.
**Hypothesis:** D4 wins by ~5 points (95-ish).
**Decision:** lock D, move on.

### Phase 2 · Cross-customer baseline (vip + 2 more, ~$10, 60min)
Apply Phase 1 winner to vip-roofing-brisbane, west-coast-roofing, weatherite.
**Hypothesis:** vip 60-70 (thin data), west-coast 80+, weatherite 70-80.
**Decision:** if all 3 ≥85, done. If any customer falls short, check whether the gap correlates with data depth (Axis E).

### Phase 3 · Image strategy (vicwest, ~$8, 40min)
Test B1/B2/B3/B4/B5 with the Phase 1 prompt-depth winner (Axis D) locked.
**Hypothesis:** B5 (vision-classified) wins. B2 (AI) loses on vision D9.
**Decision:** lock image strategy. Build the classifier for B5 if it wins.

### Phase 4 · Library architecture (build, ~$5 + dev time)
Build the shared roofing template library:
- Vision-classify the 60+ ChatGPT images in `Downloads/roofing-inbox/`
- Tag by intent (hero / service / before-after / team / icon / gallery)
- Quality-filter (skip cartoon / wrong-niche / low-res)
- Index in a per-niche library JSON

Build a per-customer "image-pool" assembler that combines customer photos + library + fresh-generation for gaps.
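
One possible shape for the assembler is sketched below. The entry fields (`path`, `intent`, `quality_ok`) and the priority order (customer photo first, library second, generation last) are assumptions for illustration; the real library JSON schema is not defined yet.

```python
from typing import Dict, List, Optional


def assemble_image_pool(
    customer: List[dict],
    library: List[dict],
    needed: List[str],
) -> Dict[str, Optional[dict]]:
    """Pick one image per needed intent; None marks a gap for fresh generation."""

    def first_match(images: List[dict], intent: str) -> Optional[dict]:
        for img in images:
            # Entries that failed the quality filter are skipped.
            if img["intent"] == intent and img.get("quality_ok", True):
                return img
        return None

    pool: Dict[str, Optional[dict]] = {}
    for intent in needed:
        # Assumed priority: customer photo first, shared library second.
        pool[intent] = first_match(customer, intent) or first_match(library, intent)
    return pool


customer_photos = [{"path": "customer/team.jpg", "intent": "team"}]
library_images = [
    {"path": "library/hero.png", "intent": "hero"},
    {"path": "library/team.png", "intent": "team"},
    {"path": "library/cartoon.png", "intent": "icon", "quality_ok": False},
]
pool = assemble_image_pool(customer_photos, library_images, ["hero", "team", "icon"])
gaps = [intent for intent, img in pool.items() if img is None]
print(gaps)
```

Returning `None` for unfilled intents keeps the generation step separate, so the assembler itself never calls an image model.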

### Phase 5 · Taste skill (vicwest + 2 cross-clients, ~$8, 50min)
Test C1/C2/C3/C4 with everything else locked.
**Hypothesis:** C2 (soft) wins for residential roofing. C4 (brutalist) loses.
**Decision:** lock skill per niche, not globally.

### Phase 6 · Deep audit upgrade (build, ~$0 + dev time)
Build the missing audit dimensions (three new, plus the locked-facts rewrite):
- `pl:audit-meta-leaks` — regex/LLM detection of meta-vibe headings
- `pl:audit-cross-page-consistency` — header/footer byte-equal, fonts identical, section padding consistent
- `pl:audit-image-relevance` — vision per section, does image match content
- `pl:audit-locked-facts-v2` — customer-pages-only, exact-match grep
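
As a starting point for the cross-page consistency audit, the header/footer byte-equality check could look like the sketch below. It assumes each page has at most one `<header>` and one `<footer>` element and uses a regex instead of a real HTML parser, which is a simplification. The function names are illustrative.

```python
import re
from typing import Dict, List


def extract_block(html: str, tag: str) -> str:
    """Return the first <tag>...</tag> block verbatim, or "" if absent."""
    match = re.search(
        rf"<{tag}\b.*?</{tag}>", html, flags=re.DOTALL | re.IGNORECASE
    )
    return match.group(0) if match else ""


def inconsistent_pages(pages: Dict[str, str], tag: str) -> List[str]:
    """List pages whose <tag> block differs byte-for-byte from the reference page."""
    names = sorted(pages)
    if not names:
        return []
    # The alphabetically first page serves as the reference.
    reference = extract_block(pages[names[0]], tag).encode("utf-8")
    return [
        name
        for name in names[1:]
        if extract_block(pages[name], tag).encode("utf-8") != reference
    ]


pages = {
    "index.html": "<header><nav>Home</nav></header><main>a</main><footer>ABN 1</footer>",
    "about.html": "<header><nav>Home</nav></header><main>b</main><footer>ABN 1</footer>",
    "contact.html": "<header><nav>Home </nav></header><main>c</main><footer>ABN 1</footer>",
}
print(inconsistent_pages(pages, "header"))
print(inconsistent_pages(pages, "footer"))
```

Byte equality is strict on purpose: a single extra space counts as a mismatch, as in `contact.html` here.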

### Phase 7 · Cross-customer scale validation
Re-run the full pipeline on 5 customers spanning E1-E4 data tiers.
**Pass criterion:** 3 of 5 at ≥90, none below 75.

## Total budget estimate

- Phase 1 (~$3) + Phase 2 (~$10) + Phase 3 (~$8) + Phase 5 (~$8) + Phase 7 (~$15) = **~$44** in LLM cost
- Plus ~$10 for image generation experiments
- Phase 4 + 6 are dev work, $0 LLM
- Total: **~$54 LLM cost, ~10-15 hours of automated runs**
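
The arithmetic behind the estimate, using only the figures listed above. The Phase 4 heading mentions ~$5, but the budget list counts Phase 4 as $0 LLM, so this sketch leaves it out as well.

```python
# Approximate LLM cost per phase in USD, as listed in the budget.
phase_llm_cost = {
    "Phase 1": 3,
    "Phase 2": 10,
    "Phase 3": 8,
    "Phase 5": 8,
    "Phase 7": 15,
}
image_generation = 10  # image generation experiments

llm_subtotal = sum(phase_llm_cost.values())
total = llm_subtotal + image_generation

print(f"LLM subtotal: ~${llm_subtotal}")  # ~$44
print(f"Total:        ~${total}")         # ~$54
```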

## Today's actionable next step (when you're ready)

Phase 1 (prompt depth) is fastest to validate · ~30 min · $3.
If D4 wins (≥95 on vicwest), we have a recipe ceiling.

After Phase 1: decide whether to do Phase 2 (cross-customer baseline) OR Phase 3 (image strategy) first.

## Memory anchors

- `feedback_od_recipe_lessons.md` — the 14 lessons learned so far
- `project_single_page_ship.md` — both single-page AND multi-page are products