# Lessons learned · vicwest-roofing OD pipeline (2026-05-18)

After ~6 hours of iteration we found a recipe that scores 89/100. The lessons are about WHY everything before it failed.

## TL;DR · The recipe that works

```
image:  I_real  (9 real customer photos from GMB + owned site)
brief:  A2_compact  (10KB customer-brief summary · not the full 5500-word brief)
skill:  web-prototype  (OD's default skill)
prompt: production-grade page standards (in seed/prompt.txt + threaded as --prompt to OD)
```

Composite: **89/100** · 10-page multi-page site · live at https://vicwest-roofing-od-dev.pages.dev

## Lessons (in order of how much pain each cost)

### 1 · The agent reads `--prompt` as primary · files in seed-dir are just references

The seed dir had `prompt.txt` with strict rules. I assumed the OD agent (codex) would read it. It didn't. The OD daemon's `run-concept.js` builds a DEFAULT prompt and sends THAT to codex; files in seed-dir are project files the agent may or may not look at.

**Fix:** orchestrator must pass `--prompt` to `open-design:run-concept` with the contents of seed/prompt.txt. Then it becomes the PRIMARY directive. Without this, we got 1-page output with 184 words. With it, we got 10 pages × 600+ words each.
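
A minimal sketch of that fix, assuming the orchestrator launches the script through `npm run` (the helper name `buildRunConceptArgs` is hypothetical, and any other flags the script needs are left out because these notes do not list them):

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical helper: builds the argv for `npm run open-design:run-concept`.
// Only `--prompt` comes from the notes; everything else here is an assumption.
function buildRunConceptArgs(seedDir) {
  const promptPath = path.join(seedDir, 'prompt.txt');
  const prompt = fs.readFileSync(promptPath, 'utf8').trim();
  if (prompt.length === 0) {
    // Fail loudly rather than let the run fall back to a default prompt.
    throw new Error(`Empty prompt file: ${promptPath}`);
  }
  // `--` separates npm's own arguments from the arguments handed to the script.
  return ['run', 'open-design:run-concept', '--', '--prompt', prompt];
}
```

Passing the prompt as a single argv element (rather than interpolating it into a shell string) avoids quoting problems with backticks and quotes inside the checklist text.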

### 2 · "Production-grade" requires verbatim checklists, not aspirational descriptions

First attempt at the prompt said: "use real customer facts, write good copy, multi-page output."
Result: 184 words, no VBA, no email, no testimonials, single page. Vision 40, Copy 21.

Second attempt said: "Word count ≥ 600 per page. Phone `0403 554 592` AND `tel:0403554592` link must appear in (a) header (b) hero (c) every form (d) sticky mobile bar (e) footer. Email `info@vicwestroofing.com.au` with `mailto:` link in footer AND contact page. VBA exact wording in trust-bar AND every footer."
Result: Vision 90, Copy 47, 10 pages × 600+ words.

The LLM is competent but lazy. Explicit checklists beat aspirational language.
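
The same checklist can be verified mechanically after a run. The sketch below is a hypothetical page-level check, not how `pl:audit-website-output` actually works: it uses the phone facts and word-count floor quoted above, strips tags with a rough regex, and does not check placement (header, hero, footer and so on).

```javascript
// Facts copied from the checklist; the functions themselves are illustrative.
const PHONE_TEXT = '0403 554 592';
const PHONE_LINK = 'tel:0403554592';
const MIN_WORDS = 600;

// Rough text extraction: drop script/style bodies, then every remaining tag.
function visibleText(html) {
  return html
    .replace(/<(script|style)\b[\s\S]*?<\/\1>/gi, ' ')
    .replace(/<[^>]+>/g, ' ');
}

function countWords(html) {
  const words = visibleText(html).trim().split(/\s+/);
  return words[0] === '' ? 0 : words.length;
}

// Returns a list of human-readable failures; an empty list means the page passes.
function checkPage(html) {
  const failures = [];
  const words = countWords(html);
  if (words < MIN_WORDS) failures.push(`word count ${words} < ${MIN_WORDS}`);
  if (!visibleText(html).includes(PHONE_TEXT)) failures.push('phone text missing');
  if (!html.includes(`href="${PHONE_LINK}"`)) failures.push('tel: link missing');
  return failures;
}
```

Returning a list of failures instead of a boolean makes it easy to feed the misses straight back into the next prompt iteration.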

### 3 · Hard L1 gates must exclude non-customer files

`pl:audit-website-output` audits every .html file in the output dir. OD agents sometimes generate side artifacts like `brand/preview.html` — a brand-spec render, NOT a customer page. The audit was reporting "phone missing on brand/preview.html" → L1 hard fail → composite 0.

Real vicwest customer pages (all 10 site-architecture pages) had ZERO L1 failures. The page killing the score was a non-customer artifact.

**Fix:** filter out `brand/`, `design/`, `preview/`, `spec/` paths before computing L1 gate.
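
A sketch of that filter, assuming the audit receives paths relative to the output dir (the names `NON_CUSTOMER_DIRS`, `isCustomerPage` and `customerPages` are hypothetical):

```javascript
// Directory names from the fix above; anything under them is a side artifact.
const NON_CUSTOMER_DIRS = new Set(['brand', 'design', 'preview', 'spec']);

// True when no directory segment of the path is an excluded artifact dir.
// Only directories are checked, so a root-level `preview.html` is still kept.
function isCustomerPage(relativePath) {
  const segments = relativePath.split(/[\\/]+/).filter(Boolean);
  const dirs = segments.slice(0, -1); // drop the file name itself
  return !dirs.some((dir) => NON_CUSTOMER_DIRS.has(dir.toLowerCase()));
}

function customerPages(htmlPaths) {
  return htmlPaths.filter(isCustomerPage);
}
```

Matching on whole path segments rather than substrings keeps a customer page such as `services/design-consult.html` from being excluded by accident.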

### 4 · pl:audit-vision output is per-page, not top-level

I wrote `imageScoreFromVision()` to read `vision.dim_scores.D6_hero_quality`. That field doesn't exist at the top level — the actual structure is `vision.vision_results[].scores.D6_hero_quality` (one entry per page). My function returned 0 always.

Vision D6 actual: 8.7/10. D9: 7.7/10. Image score: 40.8/50. Not zero. My orchestrator was throwing away 20+ points.

**Fix:** aggregate across all `vision_results[].scores[key]`.
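
A sketch of the aggregation, assuming a plain mean across pages is the right way to combine them (the notes only say "aggregate"; `averageVisionScore` is a hypothetical name):

```javascript
// Averages one score key across every page entry in vision.vision_results.
// Returns null (not 0) when the key is absent everywhere, so a schema
// mismatch is visible instead of silently scoring zero.
function averageVisionScore(vision, key) {
  const results = Array.isArray(vision?.vision_results) ? vision.vision_results : [];
  const values = results
    .map((page) => page?.scores?.[key])
    .filter((value) => typeof value === 'number' && Number.isFinite(value));
  if (values.length === 0) return null;
  return values.reduce((sum, value) => sum + value, 0) / values.length;
}
```

Distinguishing "missing" from "zero" is the point of the `null` return: the original bug was a wrong path that looked like a legitimate score of 0.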

### 5 · Claude CLI hangs on 50KB+ prompts · codex doesn't

Copy-judge prompt was ~50KB (page text + customer brief). First overnight run, claude CLI hung for 26+ minutes processing it. Same input through codex CLI: 27 seconds.

This is a known pattern from earlier in the day (claude CLI timed out twice on big prompts in extract-core). I forgot and used claude again. Wasted 30 min of overnight time.

**Fix:** all judge calls use codex with hard 5-min timeout. claude CLI reserved for short conversational prompts.
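
A minimal sketch of the hard timeout using Node's `child_process.spawnSync`. The helper name `runWithHardTimeout` is hypothetical, and how the prompt is handed to `codex exec` (argument vs stdin) is not stated in these notes, so the command and arguments are left to the caller:

```javascript
const { spawnSync } = require('node:child_process');

const FIVE_MINUTES_MS = 5 * 60 * 1000;

// Runs a CLI with `input` on stdin and kills it if it exceeds `timeoutMs`.
// Blocks the caller, which is acceptable for a sequential overnight loop.
function runWithHardTimeout(command, args, input, timeoutMs = FIVE_MINUTES_MS) {
  const result = spawnSync(command, args, {
    input,
    encoding: 'utf8',
    timeout: timeoutMs,
    killSignal: 'SIGKILL',
    maxBuffer: 16 * 1024 * 1024, // judge output can be large
  });
  const timedOut = result.error?.code === 'ETIMEDOUT';
  return {
    ok: !timedOut && !result.error && result.status === 0,
    timedOut,
    stdout: result.stdout ?? '',
    stderr: result.stderr ?? '',
  };
}
```

Note that the timeout only kills the direct child; anything that child spawned may survive, which is the problem lesson 8 covers.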

### 6 · Mac OD app GUI visibility requires the right daemon URL

OD has a packaged daemon (in `/Applications/Open Design.app/...`) that uses an IPC socket. Our `run-concept.js` defaults to spawning a SEPARATE HTTP daemon on port 7466. Projects created by our daemon never appear in the Mac app GUI because they live in a different daemon's registry.

The Mac app's daemon publishes its HTTP port via an IPC `status` message on the socket at `/tmp/open-design/ipc/release-stable/daemon.sock`. The port is random each session (saw 49718 tonight).

**Fix:** `pl:od-run` discovers the Mac app daemon URL via IPC, passes it to `run-concept` via `--daemon-url`. Projects appear live in the Mac OD app GUI.

### 7 · Resume-on-checkpoint must check composite, not just ok-flag

First implementation: skip variant if `result.json` exists with `ok: true`.
Problem: variants that scored composite=0 (L1 hard-gate fail) had `ok: true` (the audit ran successfully · just scored 0). Resume would skip them forever.

**Fix:** skip only if `ok && composite > 0`. Otherwise retry.
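
As a sketch, the corrected predicate (the function name `shouldSkipVariant` is hypothetical; the `ok` and `composite` fields are the ones named above):

```javascript
// `result` is the parsed result.json, or null/undefined when none exists.
// Skip only variants that both ran successfully AND scored above zero.
function shouldSkipVariant(result) {
  return (
    Boolean(result) &&
    result.ok === true &&
    typeof result.composite === 'number' &&
    result.composite > 0
  );
}
```

A variant with `{ ok: true, composite: 0 }` is retried, which is exactly the case the first implementation skipped forever.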

### 8 · pkill -f doesn't always catch codex children

Codex is spawned as `codex exec` inside an `npm run` inside a `node` script. `pkill -f "pl-od-overnight"` matches the master but not the children. The orchestrator dies, the children keep running, and every codex left behind by the now-dead orchestrator is an ORPHAN.

**Fix:** always `pgrep -lf "codex exec|run-concept"` after a kill and `kill -9` any survivors. Wait 2s. Re-check.
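
The kill, wait, re-check loop can be sketched from Node as follows. It assumes `pgrep` is on the PATH and prints the PID as the first token of each line; the helper names are hypothetical. Nothing runs at load time, so the destructive part only happens when `killSurvivors()` is called explicitly.

```javascript
const { spawnSync } = require('node:child_process');

// Extracts PIDs from pgrep output, where each line starts with the PID.
function parsePids(pgrepOutput) {
  return pgrepOutput
    .split('\n')
    .map((line) => Number.parseInt(line.trim().split(/\s+/)[0], 10))
    .filter((pid) => Number.isInteger(pid));
}

function findSurvivors(pattern) {
  const result = spawnSync('pgrep', ['-lf', pattern], { encoding: 'utf8' });
  return parsePids(result.stdout ?? '').filter((pid) => pid !== process.pid);
}

// SIGKILLs every match, waits, then returns whatever is still alive.
function killSurvivors(pattern = 'codex exec|run-concept', waitMs = 2000) {
  for (const pid of findSurvivors(pattern)) {
    try {
      process.kill(pid, 'SIGKILL');
    } catch (err) {
      if (err.code !== 'ESRCH') throw err; // ESRCH: already gone
    }
  }
  // Synchronous sleep without spawning another process.
  Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, waitMs);
  return findSurvivors(pattern);
}
```

A non-empty return value from `killSurvivors()` means something is still running and the next overnight run should not start yet.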

### 9 · Vision composite ≥ 90 doesn't mean done · L1+L2+L3 must also pass

Vision 90.3 looks like a winning score until you realize L3 (build sanity) failed. L3 checks DOCTYPE / viewport / broken refs — a vision-perfect page can still be broken HTML.

Lesson: multi-layer scoring is right (L1 facts + L2 content + L3 build + vision quality + copy judge + image score). No single layer is sufficient.
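
One way to encode "no single layer is sufficient" is a gate that lists every blocker instead of returning one number. This is a sketch under assumptions: the field names `l1Pass`, `l2Pass`, `l3Pass` and `vision` are hypothetical, and the threshold of 90 is taken from the heading above.

```javascript
// A run is done only when every hard gate passes AND vision clears the bar.
function evaluateRun(run, visionThreshold = 90) {
  const blockers = [];
  if (!run.l1Pass) blockers.push('L1 facts');
  if (!run.l2Pass) blockers.push('L2 content');
  if (!run.l3Pass) blockers.push('L3 build');
  // Written as a negated >= so a missing or NaN vision score also blocks.
  if (!(run.vision >= visionThreshold)) blockers.push(`vision < ${visionThreshold}`);
  return { done: blockers.length === 0, blockers };
}
```

For the run described here (vision 90.3, L3 failed) this reports `done: false` with `L3 build` as the only blocker.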

### 10 · Rich customer data is upstream of good output

Vicwest had:
- 21 classified photos (hero / service / gallery categorized)
- 4 verbatim Google reviews with author + suburb
- Full ABN entity record (VICWEST GROUP PTY LTD, registered 2017)
- 10-page owned-website crawl
- 6 external tinyfish mentions with summaries

A customer without this depth probably can't hit 89. The cross-client validation will reveal how much of the 89 score is "the recipe" vs "the data". This is the most important open question.

## Cost summary tonight

- ~$2 image gen (12 gpt-image-1 images — unused in winning recipe but available for I_ai/I_mixed)
- ~$0.50 codex (the I_real winning OD run · 503s)
- ~$0.20 vision audit
- ~$0.05 copy judge (codex · 27s)
- ~$3.50 wasted on failed experiments (claude hang, false-negative scoring)

**Productive spend: ~$0.75 · waste: ~$3.50 · roughly 80% of the ~$4.25 spent on runs and judging went to failed iterations** (image gen excluded), which is normal for finding the right recipe.
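
The arithmetic behind those figures, worked in integer cents from the approximate line items above (the variable names are illustrative). The waste share depends on whether the unused image-gen spend is counted in the denominator:

```javascript
// Approximate costs from the list above, in cents to avoid float drift.
const cents = { imageGen: 200, odRun: 50, visionAudit: 20, copyJudge: 5, failed: 350 };

const productive = cents.odRun + cents.visionAudit + cents.copyJudge; // 75
const total = productive + cents.failed + cents.imageGen; // 625

function percent(part, whole) {
  return Math.round((part / whole) * 100);
}

// Waste as a share of runs + judging only (image gen excluded).
const wasteShareOfRuns = percent(cents.failed, productive + cents.failed); // 82

// Waste as a share of everything spent tonight, image gen included.
const wasteShareOfTotal = percent(cents.failed, total); // 56
```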

## What didn't work (so we don't try again)

- Full 5500-word brief as input — overwhelmed the agent
- Claude CLI for big-prompt judging
- web-prototype + thin prompt → wireframe-grade output
- Spawning multiple codex processes in parallel (rate-limited, all fail in 1s)
- Skipping audit entirely (would have shipped broken sites)

## Next experiments

- [ ] Cross-client validation: vip-roofing-brisbane + west-coast-roofing (running now)
- [ ] Test taste-variants (`web-prototype-taste-soft`) on winning recipe (hypothesis: soft suits residential roofing better than default)
- [ ] Test gpt-image-1 hero (I_mixed variant) — winning recipe used real photos, AI hero may improve "wow factor"
- [ ] Fix L3 (build sanity) failure to push composite from 89 → 95+