Generative-AI image evaluation for a frontier model lab.

Challenge
Improving an image model requires large volumes of careful human judgment across four axes (aesthetics, prompt faithfulness, text rendering and anatomical correctness), applied consistently by evaluators who neither rush nor overuse “Neither.”
Approach
io-ai ran four parallel evaluation frameworks: aesthetic A/B (choose the more pleasing, hallucination-free image), prompt fidelity (which image best matches the prompt; flag disqualifying omissions and added hallucinations), typography accuracy (correct spelling, font and size match, letter-level checks), and human-anatomy QA (finger and limb counts, facial symmetry, proportion, connection errors).
Evaluators followed strict tie-break rules (“Both / Neither” used ≤10% of the time) and were barred from using other AI tools to judge.
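As a rough illustration of how a cap like this could be monitored, the sketch below computes each evaluator's “Both / Neither” rate and lists anyone above 10%. The record fields (`evaluator`, `choice`) and the function names are hypothetical and are not io-ai's actual tooling.

```python
from collections import Counter

TIE_LABELS = {"both", "neither"}
MAX_TIE_RATE = 0.10  # cap taken from the tie-break rule; everything else is assumed


def tie_rates(judgments):
    """Return {evaluator: share of that evaluator's judgments labeled Both/Neither}."""
    totals = Counter()
    ties = Counter()
    for j in judgments:
        totals[j["evaluator"]] += 1
        if j["choice"].lower() in TIE_LABELS:
            ties[j["evaluator"]] += 1
    return {e: ties[e] / totals[e] for e in totals}


def over_cap(judgments, cap=MAX_TIE_RATE):
    """Return the evaluators whose tie rate is strictly above the cap, sorted by name."""
    return sorted(e for e, rate in tie_rates(judgments).items() if rate > cap)


# Hypothetical sample: ev1 uses a tie label 1 time in 10, ev2 uses one 3 times in 10.
sample = (
    [{"evaluator": "ev1", "choice": "A"}] * 9
    + [{"evaluator": "ev1", "choice": "Neither"}] * 1
    + [{"evaluator": "ev2", "choice": "B"}] * 7
    + [{"evaluator": "ev2", "choice": "Both"}] * 3
)
print(over_cap(sample))
```

In this sample only `ev2` is listed, since a rate of exactly 10% still satisfies the ≤10% rule.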
Quality
Every task ran through io-ai's multi-level QA pipeline, with first-batch audits before scale-up and full rejection-reason metadata on every item. Rolling audits held accuracy above 98%, and status was reported weekly.
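One way to picture a rolling audit is a fixed-size window over the most recently audited items, with accuracy recomputed as each new result arrives. The sketch below is a minimal version under that assumption. The class name, the window size and the sample rejection reason are hypothetical; only the 98% threshold comes from the text.

```python
from collections import deque


class RollingAudit:
    """Track pass/fail audit results over a sliding window of recent items."""

    def __init__(self, window=500, target=0.98):
        # window is an assumed size; target mirrors the 98% figure in the text
        self.results = deque(maxlen=window)
        self.target = target

    def record(self, passed, reason=None):
        """Store one audit outcome; reason holds rejection metadata for failures."""
        self.results.append((passed, reason))

    def accuracy(self):
        """Share of audited items in the window that passed, or None if empty."""
        if not self.results:
            return None
        return sum(1 for passed, _ in self.results if passed) / len(self.results)

    def meets_target(self):
        """True when windowed accuracy is strictly above the target."""
        acc = self.accuracy()
        return acc is not None and acc > self.target

    def rejection_reasons(self):
        """List the reasons attached to failed items still in the window."""
        return [reason for passed, reason in self.results if not passed]


audit = RollingAudit(window=100)
for _ in range(99):
    audit.record(True)
audit.record(False, reason="extra finger not flagged")
print(audit.accuracy(), audit.meets_target(), audit.rejection_reasons())
```

Because the window has a fixed length, older results drop out automatically, so a bad early batch does not mask or inflate current quality.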
Result
High-signal preference and evaluation data across four quality dimensions, delivered at scale with rubric discipline.
For the same lab we also delivered UI / screenshot annotation across 50 applications and 50,000 screenshots, with PII masking on every image.