The Designer's Eval Stack: How to Measure Design Quality When AI Generates Everything
When AI generates ten thousand design variations a day, "looks good to me" stops scaling. Designers must build eval stacks like ML engineers do. A working playbook for the eval pyramid, real tools, runnable rubrics, and the role designers grow into in 2026.

A senior designer in 2026 opens their morning queue and finds eighteen thousand candidates waiting. Thirty briefs went out yesterday. Each produced six hundred AI variants overnight. The "looks good to me" loop (the Slack thread with two thumbs up, the design lead glancing at a Figma file before standup) was tolerable when one designer made one asset a week. At AI volume it is a coin flip with extra steps.
Quality at AI scale is not a vibe, it is a stack. Cheap automated checks at the base, LLM-as-judge in the middle, human taste at the top, conversion data closing the loop. ML engineers built this in 2023 when models shipped faster than humans could review. Designers are next.
The working playbook: the four-layer pyramid, a runnable rubric, the toolchain, and the role that grows out of it.
"Looks good to me" does not scale anymore
The LGTM loop worked because the bottleneck was making the asset, not reviewing it. Production is now functionally free. Claude, Cursor, v0, Lovable, and a stack of Skills generate finished candidates in minutes. The bottleneck moved to review, and review is where every quality signal lives.
A team that has not moved review out of Slack still operates like it is 2022. It ships drift, contrast violations, off-brand voice, and broken grids at industrial volume. When AI generates ten thousand variants a day, taste plus a Slack thread is not a quality system, it is a coin flip with extra steps.

Designers should steal the ML eval playbook
ML engineers solved this three years ago. An eval suite runs before any model output reaches users, scoring candidates against a structured rubric, with cheap deterministic checks at the base, LLM-as-judge for the squishy stuff, and human review reserved for taste calls and edge cases.
The playbook ports cleanly. Same problem, same shape. The base layer kills obvious failures cheaply. The middle layer scores survivors on craft and brand fit. The top layer is the human deciding between three options that all passed everything below. Eval design is the senior skill in 2026.
The eval pyramid, bottom to top
Four layers and a feedback loop. Bottom to top: lint and token validation, visual diff and regression, LLM-as-judge with a structured rubric, human taste review. The loop is conversion data flowing back from production to retrain the rubric.
Each layer kills a different failure at a different cost. Lint is pennies. Visual diff is cheap. LLM-as-judge scales on dollars, not designer hours. Human review is the most expensive resource in the building, reserved for the last fifty candidates, not the first ten thousand.
Layer one, lint and token validation
The base of the pyramid is the cheap stuff that should never reach a designer's eyes. Contrast under WCAG AA. Token violations where the AI invented a hex instead of using a system color. Baseline grid drift. Padding off the four-pixel rhythm. Type scale escapes. Missing alt text. Touch targets under forty-four pixels. Everything axe-core flags.
These are deterministic. They run in milliseconds and kill thirty to fifty percent of AI output without anyone looking. A team without this layer pays senior designers to catch eight-pixel padding errors, which is the most expensive way to catch them.
The fix is a lint job in CI for code-rendered surfaces and a token validator in Figma for static work. Both exist, both are free or cheap, both should be table stakes by quarter end.
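A minimal sketch of what this layer looks like in code, assuming hex color tokens and a flat candidate shape. The palette, grid unit, and `Candidate` fields are placeholders, not any real system's:

```ts
// Base-layer lint: deterministic checks that run in milliseconds, before any
// human review. Palette, grid unit, and the Candidate shape are illustrative.
const BRAND_COLORS = new Set(["#0f172a", "#f8fafc", "#2563eb"]);
const GRID_UNIT = 4; // the four-pixel rhythm

// WCAG relative luminance for a "#rrggbb" sRGB color.
function luminance(hex: string): number {
  const [r, g, b] = [1, 3, 5].map((i) => {
    const c = parseInt(hex.slice(i, i + 2), 16) / 255;
    return c <= 0.03928 ? c / 12.92 : ((c + 0.055) / 1.055) ** 2.4;
  });
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

// WCAG contrast ratio: (lighter + 0.05) / (darker + 0.05).
function contrast(fg: string, bg: string): number {
  const [hi, lo] = [luminance(fg), luminance(bg)].sort((a, b) => b - a);
  return (hi + 0.05) / (lo + 0.05);
}

interface Candidate {
  id: string;
  fg: string; // foreground color token value
  bg: string; // background color token value
  paddingPx: number;
}

// Empty array means the candidate survives to layer two.
function lint(c: Candidate): string[] {
  const failures: string[] = [];
  if (contrast(c.fg, c.bg) < 4.5) failures.push("contrast below WCAG AA");
  if (!BRAND_COLORS.has(c.fg)) failures.push(`invented color ${c.fg}`);
  if (c.paddingPx % GRID_UNIT !== 0) failures.push("padding off the 4px rhythm");
  return failures;
}
```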
Layer two, visual diff and regression
Visual regression catches the unintended change before review starts. Playwright takes the screenshot. Pixelmatch diffs against baseline. Chromatic hosts the review and flags drift. Storybook isolates the component so the diff is the component, not page chrome.
It is industrial-strength git diff for pixels. A button changed three pixels in padding, the diff catches it. A spacing token got bumped and propagated to forty surfaces, the diff catches all forty. Visual diff cannot tell you the new version is better, only that it changed. Pair it with the next layer.
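A sketch of the capture-and-diff step, assuming playwright, pixelmatch, and pngjs are installed, a Storybook story URL as the render target, and a baseline captured at the same viewport. The URL, path, and thresholds are placeholders:

```ts
import fs from "node:fs";
import { chromium } from "playwright";
import pixelmatch from "pixelmatch";
import { PNG } from "pngjs";

// Screenshot a component in isolation, diff it against the stored baseline,
// and return how many pixels drifted.
async function visualDiff(url: string, baselinePath: string): Promise<number> {
  const browser = await chromium.launch();
  const page = await browser.newPage({ viewport: { width: 800, height: 600 } });
  await page.goto(url);
  const shot = PNG.sync.read(await page.screenshot());
  await browser.close();

  const baseline = PNG.sync.read(fs.readFileSync(baselinePath));
  const { width, height } = baseline; // assumes baseline matches the viewport
  const diff = new PNG({ width, height });
  return pixelmatch(baseline.data, shot.data, diff.data, width, height, {
    threshold: 0.1, // per-pixel color-distance sensitivity
  });
}

// Fail the check when more than 0.1% of pixels changed.
const changed = await visualDiff(
  "http://localhost:6006/iframe.html?id=button--primary", // Storybook isolation URL
  "baselines/button-primary.png",
);
if (changed > 800 * 600 * 0.001) process.exit(1);
```

Chromatic replaces the hand-rolled parts of this with a hosted review queue; the sketch is the free-tier version of the same idea.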
Layer three, LLM-as-judge with structured rubric
The middle of the pyramid did not exist for designers two years ago and is now the highest-leverage layer in the stack. An LLM scoring AI output against a structured rubric. Ten thousand candidates an hour, a few dollars total.
Render each candidate to an image or component. Pass it to Claude or GPT with a rubric prompt. Get back a score per criterion, a one-line reason, pass or fail. Sort survivors by score. Send the top fifty to a human.
Anthropic's eval framework, OpenAI evals, and a custom Claude rubric all do the same job in different shapes. Most design teams want the custom route, because the rubric is the brand, and the brand is what the eval enforces.
A runnable rubric for brand voice
A rubric is not a vibe statement. It is a list of measurable criteria, a score scale, and a reason field. Here is a working voice rubric a Claude call can score in three seconds.
Score the copy 1 to 5 per criterion. One-line reason per score.
1. Lead-first. Does the first sentence answer the question?
2. Concrete. Does it name real products, numbers, moves?
3. Voice match. Does the tone match the brand profile?
4. No filler. Does every sentence earn its seat?
5. No banned constructions. Em dashes, AI-slop adjectives, hedging.
Pass: average 4.0+ AND no criterion below 3.
Output JSON: {scores, reasons, pass}
Run that rubric against five hundred AI-drafted product descriptions and it surfaces the thirty worth a human eye in under two minutes. Same shape works for layout, color usage, and component composition. Score, reason, threshold, JSON.
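A minimal sketch of the judge call, assuming the Anthropic TypeScript SDK (`@anthropic-ai/sdk`). The model name is a placeholder to pin to whatever the team standardizes on, and the parse assumes the model honors the JSON contract (add a retry for the cases where it does not):

```ts
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Abbreviated version of the voice rubric above; ship the full text in production.
const RUBRIC = `Score the copy 1 to 5 per criterion, one-line reason per score:
1. Lead-first. 2. Concrete. 3. Voice match. 4. No filler. 5. No banned constructions.
Return only JSON: {"scores": {"1": n, ...}, "reasons": {"1": "...", ...}}`;

interface Verdict {
  scores: Record<string, number>;
  reasons: Record<string, string>;
  pass: boolean;
}

async function judge(copy: string): Promise<Verdict> {
  const msg = await anthropic.messages.create({
    model: "claude-sonnet-4-5", // placeholder model id
    max_tokens: 512,
    messages: [{ role: "user", content: `${RUBRIC}\n\nCOPY:\n${copy}` }],
  });
  const text = msg.content[0].type === "text" ? msg.content[0].text : "{}";
  const { scores, reasons } = JSON.parse(text);
  const values = Object.values(scores) as number[];
  const avg = values.reduce((a, b) => a + b, 0) / values.length;
  // The threshold from the rubric: average 4.0+ AND no criterion below 3.
  return { scores, reasons, pass: avg >= 4.0 && values.every((v) => v >= 3) };
}
```

Computing pass in code rather than trusting a model-emitted pass field keeps the threshold deterministic and auditable.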
The rubric is the asset. Version it. Test it. Improve it on real failures. A team that ships a rubric and tunes it monthly is running a brand operating system. A team with only a voice doc is running a coin flip.

Layer four, human taste review at the top
Human review is for what automation cannot grade. Taste calls between three options that all passed lint, diff, and rubric. Edge cases the rubric missed. The decision to break the rule on purpose. The rule: the human only sees the top of the funnel.
If a designer is reviewing four thousand candidates a week, the stack is broken. If they review twenty and ship six, the stack is working. The senior eye gets pointed at choices that actually matter. This is where taste is the last moat. The eval stack is not a replacement for taste, it is what makes taste leverageable.
Conversion-as-eval closes the loop
Shipped surfaces feed conversion data back to the rubric. Click-through per variant. Time-on-page per layout. Save rates per visual treatment. The loop closes when the rubric absorbs the signal: weight up the criteria that correlated with conversion, weight down or remove the ones that did not.
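One hedged sketch of what absorbing the signal can mean mechanically, assuming the rubric carries per-criterion weights and a month of shipped variants with their scores and a conversion metric. The field names and the nudge factor are assumptions:

```ts
interface ShippedVariant {
  scores: Record<string, number>; // per-criterion rubric scores at ship time
  conversion: number; // e.g. click-through rate in production
}

// Pearson correlation between one criterion's scores and conversion.
function correlation(xs: number[], ys: number[]): number {
  const n = xs.length;
  const mx = xs.reduce((a, b) => a + b, 0) / n;
  const my = ys.reduce((a, b) => a + b, 0) / n;
  let cov = 0, vx = 0, vy = 0;
  for (let i = 0; i < n; i++) {
    cov += (xs[i] - mx) * (ys[i] - my);
    vx += (xs[i] - mx) ** 2;
    vy += (ys[i] - my) ** 2;
  }
  return vx && vy ? cov / Math.sqrt(vx * vy) : 0;
}

// Monthly tune: nudge each criterion weight toward its conversion signal.
function retuneWeights(
  weights: Record<string, number>,
  shipped: ShippedVariant[],
  nudge = 0.1, // how far one month of data can move a weight
): Record<string, number> {
  const conversions = shipped.map((v) => v.conversion);
  const next: Record<string, number> = {};
  for (const criterion of Object.keys(weights)) {
    const r = correlation(shipped.map((v) => v.scores[criterion]), conversions);
    next[criterion] = Math.max(0, weights[criterion] * (1 + nudge * r));
  }
  return next;
}
```

The nudge factor is the conservatism dial: small enough that one noisy month cannot rewrite the brand, large enough that a quarter of consistent signal shows up in the weights.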
A rubric that never updates is a snapshot frozen in opinion. Brands running real eval stacks treat the rubric as living code: version-controlled, tuned monthly, audited quarterly. Vercel does this on Geist. Linear on writing. Stripe on the design system. The output looks like effortless brand consistency at AI volume, and it is the opposite of effortless. It is engineered.
The toolchain in 2026
Real tools. No invented categories.
- Playwright. Headless browser for screenshot capture. Free, scriptable. No review surface of its own.
- Pixelmatch. Pixel-level diff library. Pair with Playwright. Free. Not opinionated about what the diff means.
- Chromatic. Hosted visual review tied to Storybook. Best-in-class UI for component changes. Priced per seat.
- Storybook. Component isolation so the diff is the component, not page chrome. Free. Code-side, needs a dev.
- Anthropic evals. Framework for LLM-as-judge at scale with versioned rubrics. Docs skew ML, designers need a translator.
- OpenAI evals. Same job, different model family. Open-source. Defaults assume text, design teams wrap image scoring.
- Custom Claude rubric. Prompt plus API plus a JSON schema. Cheapest path to a working rubric. Your team owns maintenance.
- axe-core. Accessibility lint. Free, lives in CI. Catches WCAG violations, not aesthetic ones.
The starter stack for a small team is Playwright plus Pixelmatch plus a custom Claude rubric. Three tools, one afternoon, the eval pyramid running on the first three layers by tomorrow.
If you want help wiring this into your pipeline, hire Brainy. ClaudeBrainy ships rubric libraries and Skill packs that turn LLM-as-judge into a working surface. BrandBrainy ships the brand systems for AI generation the rubric scores against.
The new designer role, eval suite operator
When AI generates the candidates, the designer role shifts from making everything to running the eval suite that decides what ships. The job title emerging in 2026 looks more like ML evaluation engineer than visual designer. The senior designer of 2024 made fifty assets a quarter. The senior designer of 2026 ships rubrics, tunes thresholds, audits the queue, reviews the top fifty candidates a week.
The ladder reshapes around eval design. Junior runs the queue. Mid tunes the rubric on shipped data. Senior owns the eval system and defines criteria. Lead designs the loop between conversion data and rubric updates. "Do you have an eye" is now "do you have an eye and can you encode it."
Claude Skills sit underneath this role. The Skill is the rubric in package form. Ship it, install it, every candidate gets scored against the same encoded judgment. The senior eye runs against ten thousand candidates a day instead of fifty.

The AI-readiness checklist for design teams
Run this on your pipeline today. Fifteen minutes.
- Token validation runs on every component.
- Contrast and a11y lint runs in CI on every shipped surface.
- Visual regression runs on every PR.
- A written rubric exists for brand voice.
- A written rubric exists for layout and craft.
- An LLM scores AI candidates against the rubric before human review.
- Human review queue stays under one hundred candidates per week per designer.
- Conversion data flows back to the rubric monthly.
- The rubric is versioned.
- There is a named owner for the eval system.
A score under five means the team is shipping AI work on a coin flip. Five to seven, the foundation is there but the loop is open. Eight or higher, the team is operating at the level AI-native product design actually requires.
Common traps when building the first eval stack
Four traps, all avoidable.
One, building the rubric in isolation. The rubric is the brand encoded for a model. Brand lead, design lead, senior writer in the room. Not one person guessing.
Two, no threshold. Scoring without a pass threshold is theater. Set the floor (average four out of five, no criterion below three is a working starter) and let the rubric reject candidates that miss.
Three, no versioning. A rubric that does not change is not running. Version it, log every change with a reason, audit drift quarterly.
Four, automating the human layer. The top of the pyramid is human on purpose. Teams that automate taste review skip the most leverageable hour of the week and ship eval-passing mediocrity at industrial volume.
FAQ
What are design evals?
Automated and structured checks that score AI-generated design output against measurable criteria, run before any candidate reaches a human or production. Four layers: lint and token validation, visual diff and regression, LLM-as-judge with a structured rubric, human taste review at the top.
Why do designers need evals when AI gets better every month?
Better models produce more candidates faster, not fewer candidates that are obviously correct. The bottleneck moved from making the asset to reviewing it, and review at AI volume requires a layered eval stack the same way model output at scale required one for ML teams.
What tools do I need to start an eval stack?
The minimum stack is Playwright for screenshot capture, Pixelmatch for visual diff, and a custom Claude rubric for LLM-as-judge. A couple hundred dollars in API spend per month for a small team. Stands up in an afternoon.
What is LLM-as-judge?
The pattern of having an LLM score model output against a structured rubric. The model receives the candidate plus the rubric prompt, returns a score per criterion with a one-line reason, and outputs structured JSON. Anthropic and OpenAI both ship eval frameworks. Most design teams write a custom Claude version because the rubric is the brand.
Can taste be encoded in a rubric?
Most of it, yes. The mechanical parts of taste (lead-first, concrete, no filler, voice match, layout craft, accessibility) are measurable. The taste calls a rubric cannot make are edge cases, break-the-rule decisions, and the choice between three options that all pass. Those stay human.
Start the eval stack this week
Three moves. No platform purchase required.
First, write the rubric. One page, five to seven criteria, one-to-five scale, pass threshold, reason field. Brand lead and design lead in the room. Ship version one Friday.
Second, wire LLM-as-judge. Claude API, prompt with the rubric, JSON output. Run it against the last hundred candidates the team shipped. Read the scores. Tune on the failures.
Third, install lint and visual diff on the next shipping surface. Playwright, Pixelmatch, axe-core, token validator. One afternoon. Bottom of the pyramid running.
If you want help building the eval stack into a working practice, hire Brainy. ClaudeBrainy ships rubric libraries and Skill packs so the team's senior eye runs against every candidate. BrandBrainy ships the brand operating system the rubric scores against. The next generation of design quality is engineered, not vibed, and the teams that build the stack first will operate the surface area three teams used to cover.
Get Started

