Generative-AI image evaluation for a frontier model lab.

4× parallel evaluation frameworks, plus 50k UI screenshots

Generative-AI Labs · Evaluation / RLHF · Image · Human preference

Challenge

Improving an image model requires large volumes of careful human judgment across several axes (aesthetics, prompt faithfulness, text rendering and anatomical correctness), applied consistently by evaluators who don't rush or over-use “neither.”

Approach

io-ai ran four parallel evaluation frameworks: aesthetic A/B (choose the more pleasing, hallucination-free image), prompt fidelity (which image best matches the prompt; flag disqualifying omissions and added hallucinations), typography accuracy (correct spelling, font and size match, letter-level checks), and human-anatomy QA (finger and limb counts, facial symmetry, proportion, connection errors).

Evaluators followed strict tie-break rules (“Both / Neither” used ≤10% of the time) and were barred from using other AI tools to judge.
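As a rough illustration of how a tie-break cap like this could be monitored, here is a minimal Python sketch. It is not io-ai's tooling: the label names, the `(evaluator_id, label)` record shape, and the choice to apply the 10% cap per evaluator rather than across the whole batch are all assumptions made for the example.

```python
from collections import Counter

# Assumed label names for tie outcomes; real task labels may differ.
TIE_LABELS = {"both", "neither"}
# The 10% cap quoted in the tie-break rule above.
MAX_TIE_RATE = 0.10


def tie_rates(judgments):
    """Return each evaluator's share of tie labels.

    `judgments` is an iterable of (evaluator_id, label) pairs
    (a hypothetical record shape chosen for this sketch).
    """
    totals, ties = Counter(), Counter()
    for evaluator_id, label in judgments:
        totals[evaluator_id] += 1
        if label.lower() in TIE_LABELS:
            ties[evaluator_id] += 1
    return {e: ties[e] / totals[e] for e in totals}


def over_cap(judgments, cap=MAX_TIE_RATE):
    """List evaluators whose tie rate is strictly above the cap."""
    return sorted(e for e, rate in tie_rates(judgments).items() if rate > cap)
```

An evaluator who lands exactly on 10% passes, since the rule is "≤10%". A flagged evaluator would presumably be routed to review rather than rejected automatically, though the source does not say how breaches were handled.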

Quality

Every task ran through io-ai's multi-level QA pipeline, with first-batch audits before scale-up and full rejection-reason metadata on every item. Rolling audits held accuracy above 98%, and status was reported weekly.

Annotator → Peer check → Auditor → Lead review → Client delivery

Result

High-signal preference and evaluation data across four quality dimensions, delivered at scale with rubric discipline.

For the same lab we also delivered UI / screenshot annotation across 50 applications and 50,000 screenshots, with PII masking on every image.
