Shaping a product designer
Company
Figma
Time
4 months in 2025
Role
AI Design Lead
Team
15 designers, ~8 researchers
Info
I was asked by Figma's leadership to direct and grow their AI design curation team from 5 to 15 product designers curating & evaluating designs for their inaugural custom models. I led design resource allocation, evals, and data curation, raising annotation quality and velocity 2-3x.
Results
I left after hiring a successor and the launch of Figma Make, our flagship AI feature, at Config '25.
Our team did not design the Figma Make product itself. All of the videos shown below are for demonstration purposes only, to show the impact of our training on the visual fidelity of Figma's models.
Training
Unlike designing a product, training an AI model for design is not a deliverable problem. It is fundamentally an evaluation problem that hinges on subjective measures of taste, judgment, and aesthetic nuance.
Prompt Interface
Our largest performance bottleneck turned out not to be model architecture or compute, but the quality, calibration, and judgment alignment of the evaluation sets themselves. Our team's mission became not just curating data, but defining and measuring the aesthetic "quality" that training loops could optimize toward.
Attach Designs
Months away from Config and under public pressure from the rapid rise of vibe-coding tools like Lovable and V0, we worked nights and weekends to generate precision-grade evaluation sets. These sets became the ground-truth targets for fine-tuning, enabling precision and recall measurements of stylistic fidelity across different UI design patterns.
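To make the precision and recall framing concrete, here is a minimal, purely illustrative sketch. The function name and attribute tags are hypothetical, not Figma's internal tooling; it simply shows how stylistic fidelity can be scored by comparing the attributes an eval-set target is expected to exhibit against those judged present in a model's output.

```python
# Illustrative sketch only; hypothetical names, not Figma's internal tooling.
# Stylistic fidelity as precision/recall over attribute tags: how much of what
# the model generated was wanted, and how much of what was wanted appeared.

def stylistic_precision_recall(expected: set[str], generated: set[str]) -> tuple[float, float]:
    """Compare the attributes the eval set expects with those judged present in the output."""
    true_positives = expected & generated
    precision = len(true_positives) / len(generated) if generated else 0.0
    recall = len(true_positives) / len(expected) if expected else 0.0
    return precision, recall

# Example: one dashboard pattern from an eval set vs. one model generation.
expected = {"card-layout", "8pt-grid", "high-contrast-header", "muted-accent-palette"}
generated = {"card-layout", "8pt-grid", "saturated-accent-palette"}
print(stylistic_precision_recall(expected, generated))  # ~(0.67, 0.5)
```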
Paste Designs
Quality
We took on a wide array of data curation and evaluation tasks, including, but not limited to, the curation of white-label interfaces, simple prototypes, and other visual & graphic inputs.
Attach Images
I improved the signal quality of these evaluation sets by leading the refinement of new prototype development standards, detailed prompt-writing guides, and multi-dimensional output evaluation rubrics. These helped establish consistent scoring across aesthetic dimensions such as hierarchy, contrast, visual density, balance, and interaction coherence.
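As a sketch of how a multi-dimensional rubric can keep scoring consistent across reviewers, here is one possible shape for it. The dimension names mirror those above; the 1-5 scale, field names, and unweighted average are assumptions for illustration, not the actual rubric.

```python
# Illustrative sketch only; hypothetical scale and weighting, not the real rubric.
from dataclasses import dataclass

DIMENSIONS = ("hierarchy", "contrast", "visual_density", "balance", "interaction_coherence")

@dataclass
class RubricScore:
    """One annotator's 1-5 ratings for a single generated design."""
    hierarchy: int
    contrast: int
    visual_density: int
    balance: int
    interaction_coherence: int

    def overall(self) -> float:
        # Unweighted mean across dimensions; a real rubric might weight them differently.
        return sum(getattr(self, d) for d in DIMENSIONS) / len(DIMENSIONS)

score = RubricScore(hierarchy=4, contrast=5, visual_density=3, balance=4, interaction_coherence=2)
print(score.overall())  # 3.6
```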
Prompt with Text
We measured our model improvement impact through downstream performance tests: higher match rates between internal researcher taste surveys and model outputs, increased stylistic precision, and faster convergence and annotation alignment in successive fine-tuning runs.
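The match-rate idea can be sketched simply, assuming a setup where researchers mark a preferred candidate design per eval prompt and the newest model produces one output per prompt; the data and names below are hypothetical.

```python
# Illustrative sketch only; hypothetical data, not the internal survey tooling.
def match_rate(researcher_preferred: list[str], model_outputs: list[str]) -> float:
    """Share of eval prompts where the model's output is the design researchers preferred."""
    matches = sum(r == m for r, m in zip(researcher_preferred, model_outputs))
    return matches / len(researcher_preferred)

researcher_preferred = ["design_a", "design_b", "design_a", "design_c"]
model_outputs        = ["design_a", "design_b", "design_c", "design_c"]
print(match_rate(researcher_preferred, model_outputs))  # 0.75
```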
Use Edit Tool
Velocity
Raising quality often meant that our team had to move slower. I mitigated this by providing individual coaching and documentation for project needs that shifted from day to day.
Publish 'Make Design'
I improved the velocity of training-data output by completing many of the more nuanced model evaluation tasks myself, including, but not limited to, 'bad prompt' evals. Later, I wrote guidelines for raising velocity around prototype generation for animations and game interfaces.