I Built a Taste-Aware AI Assistant: A Product Designer’s Perspective
How I turned subjective taste into repeatable behavior without letting the AI hallucinate its way there.
The problem
As a product designer, I’ve curated hundreds (maybe thousands) of UI inspirations. But when it’s time to use them, retrieval is messy and slow.
It gets worse when the task needs a specific style or a very particular taste.
Taste is subjective. Debugging taste is honestly weird.
LLMs can help with ranking and retrieval, but they’re inconsistent. They can also be convincingly wrong, especially when taste is involved.
So I built a small experiment to see if I could make taste repeatable. I wanted to translate taste into tags, rules, and evals that a model can follow without hallucinating its way into chaos.
My goal was to practice building an AI feature the way I would ship it in a real product: shape behavior with constraints, then test it so it stays consistent over time.
Product goal and success criteria
What the app does
- Input: I write a natural query like “Homepage with motion.”
- Output: 10 ranked references + 1 adjacent (experimental) pick.
- Explainable: each pick includes a short why.
- Safe: no anti-examples.
- Grounded: no web browsing, no invented URLs, no “generated designs.”
- Honest UX: show warnings when a pick is weak or mismatched.
What success looks like
- Results feel taste-aligned (more often than not).
- The “why” is readable and believable.
- The system fails gracefully instead of silently.
- I can change the taste profile and measure the impact through eval runs.
Key design principles
The model is not the source of truth
LLMs are great at ranking a shortlist. But if you let the model run loose, it can hallucinate confidently and the UI still looks convincing while being wrong. That’s the worst kind of failure because it looks legit.
So I designed a boundary between what the AI suggests and what the product knows.
What the AI does
The model only does one job: pick IDs from a candidate list and explain why. That's it. No URLs. No titles. No "I found this site." Think of it as a judge.
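To make that contract concrete, here's roughly the shape the model is allowed to return. This is a sketch in TypeScript; the field names are my illustration, not the exact schema.

// A sketch of the model's constrained output. Field names are illustrative.
type ModelPick = {
  id: string;             // must be an ID from the candidate list the server sent
  why: string;            // a short rationale, no facts beyond the candidates
  claimedTags?: string[]; // optional claims, verified by the server later
};

type ModelResponse = {
  topPicks: ModelPick[];    // ranked picks
  adjacentPick?: ModelPick; // the experimental slot
};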
What the server does
Everything factual comes from my library:
- Enrich: the server takes each ID and attaches the real title, URL, and tags from library.json.
- Clamp: if the model claims it “matched motion,” the server keeps that badge only if the item is actually tagged with motion. If the tag isn’t in the library, the badge gets removed.
- Guardrails: if a pick is weak or mismatched, the server adds warnings so the UI doesn’t pretend it’s confident.
This setup does most of the trust work. The AI can have an opinion about what fits the query, but it’s not allowed to invent facts.
It’s the same principle as a design system: you can remix components freely, but the tokens stay consistent.
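To make that boundary concrete, here's a simplified sketch of the enrich and clamp steps. The shapes and names are illustrative, not the actual server code.

// Sketch of enrich + clamp. library.json is the source of truth (id -> title, url, tags).
type LibraryItem = { id: string; title: string; url: string; tags: string[] };
type ModelPick = { id: string; why: string; claimedTags?: string[] };

function enrichAndClamp(picks: ModelPick[], library: Map<string, LibraryItem>) {
  return picks.flatMap((pick) => {
    const item = library.get(pick.id);
    if (!item) return []; // drop IDs that don't exist in the library

    // Clamp: keep a claimed tag badge only if the item actually has that tag.
    const matchedTags = (pick.claimedTags ?? []).filter((tag) =>
      item.tags.includes(tag)
    );

    // Everything factual (title, URL, tags) comes from the library, never the model.
    return [{ ...item, why: pick.why, matchedTags }];
  });
}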
Turning taste into labels
This next problem was more uncomfortable than I expected.
How do you describe taste in a way an AI can reliably use?
When my library was just screenshots, the model mostly ranked by surface-level cues. That works sometimes, but it’s not something you can tune or trust. So I went back to something designers already understand: taxonomy. A mini design system for taste, built by turning screenshots into JSON.
The taxonomy (my taste tokens)
When searching for inspiration, my mental model usually goes like this: What kind of page -> What interactions + layout structure -> What visual aesthetics. So I decided to split the tags into 3 buckets:
pageTypes: [
  "home",
  "pricing",
  "login",
],
uxPatterns: [
  "grid-layout",
  "micro-interaction",
  "comparison-table",
],
visualStyles: [
  "monochrome",
  "high-contrast",
  "minimal",
],

This gave it a stable vocabulary so I could ask things like: "pricing" + "micro-interaction" + "avoid neon".
Anti-examples (contrasting taste)
I also added anti-examples (styles I don't prefer). This might sound negative, but it’s actually a product tool. It helps the AI learn boundaries and prevents "almost correct" results from creeping into the top picks.
hardConstraints: {
  mustAvoidPatterns: ["scroll-jacking"],
  mustAvoidStyle: [
    "glassmorphism",
    "glow-heavy",
    "shadow-heavy",
    "neon",
    "neumorphism",
  ],
},

Why tags matter
This wasn’t about building a perfect labeling system. It was about making taste repeatable. It gave me 3 big wins:
- Better ranking outputs (the model has clearer signals)
- Better explanations (why + matched tags adds more meaning)
- Evals become possible (test whether results match intent)
At this point, the engine stopped being “a prompt” and started being a product: a library with structure, rules, and behavior I could measure.
The next part would be communicating trust without breaking the user experience.
Designing trust for AI
Even with rules and taxonomy in place, the product still runs into a reality problem. Sometimes the inspiration library simply doesn’t contain a perfect match. In those moments, AI tends to fill the gap with something that looks right.
Warnings
Warnings are my way of saying: How confident is this match, and what’s missing?
Important design choice
In my earlier version, warnings produced by the model were not consistent.
So I decided that warnings should be deterministic, meaning they can’t depend on the model.
The model can still suggest a warning, but the server generates the canonical warnings based on:
- Query intent
- Library tags
- Final match score
That keeps warnings consistent across runs and avoids depending on the model’s mood.
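Here's a simplified sketch of what those deterministic warnings can look like. The intent shape and the score threshold are assumptions, not the actual rules.

// Sketch of server-side warnings: same inputs, same warnings, every run.
type Intent = { pageTypes: string[]; uxPatterns: string[] };
type RankedPick = { id: string; tags: string[]; score: number };

function warningsFor(pick: RankedPick, intent: Intent): string[] {
  const warnings: string[] = [];
  const hasAny = (wanted: string[]) => wanted.some((t) => pick.tags.includes(t));

  if (intent.pageTypes.length > 0 && !hasAny(intent.pageTypes)) {
    warnings.push("Different page type than your query.");
  }
  if (intent.uxPatterns.length > 0 && !hasAny(intent.uxPatterns)) {
    warnings.push("Doesn't cover the interaction you asked for.");
  }
  if (pick.score < 0.5) {
    warnings.push("Weak overall match for this query."); // assumed threshold
  }
  return warnings;
}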
In short
Warnings are an important product decision. They’re the difference between “Here are some results” and “Here are some results, and here’s how confident you should be.”
Tradeoffs
The main output is straightforward: show me 10 picks. But inspiration apps should also surprise you a little, so I added one more slot: an adjacent pick.
- Still aligned with my taste profile
- More experimental
- A way to get surprised.
Think of it like Google's "I'm Feeling Lucky" button, but less random.
It sounded simple at first, until I had to decide whether the adjacent pick should ever be empty, because it has constraints:
- Must not duplicate a top pick
- Must not be an anti-example
- Ideally should match the user's intent (e.g., motion, home, geometric)
Two paths ahead
- Option A: Adjacent pick must match query intent, else return nothing.
- Option B (current choice): Adjacent pick can be lenient, always show something.
As you can imagine, both options have pros and cons. Option A is always semantically correct but often returns nothing. Option B sometimes grabs the wrong page type, so it relies on warnings to maintain trust. I chose Option B.
Why Option B?
Since this is an inspiration app, it would defeat the purpose if it stopped showing anything just because it couldn’t be semantically perfect. But I still wanted the UI to admit when it’s stretching. So the rule becomes: Always return something, but never hide any mismatch.
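A simplified sketch of that rule, with illustrative shapes and names rather than the actual code:

// Sketch of the lenient adjacent pick (Option B).
type Candidate = { id: string; tags: string[]; score: number };

function pickAdjacent(
  candidates: Candidate[],
  topPickIds: Set<string>,
  mustAvoid: string[],  // anti-example tags: hard constraint
  intentTags: string[]  // query intent: preferred, not required
): Candidate | undefined {
  const eligible = candidates.filter(
    (c) => !topPickIds.has(c.id) && !c.tags.some((t) => mustAvoid.includes(t))
  );

  // Prefer intent matches, but fall back to any eligible candidate so the slot
  // stays filled whenever possible. Mismatches are surfaced via warnings, not hidden.
  const onIntent = eligible.filter((c) => intentTags.some((t) => c.tags.includes(t)));
  const pool = onIntent.length > 0 ? onIntent : eligible;

  return [...pool].sort((a, b) => b.score - a.score)[0];
}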
Evals (keeping the system honest)
Warnings help the user understand uncertainty in the moment. But there is one lingering problem:
How do I know the system stays consistent after I change the library, tags, or taste profile?
If I tweak one rule, add new tags, or switch models, the UI might still look right while the results quietly drift. This is where evals come in. They are design QA for AI behavior.
Evals in plain English
They’re a set of repeatable test cases that ask:
- Did the system return the right kind of inspirations for the query?
- Did it avoid what I explicitly don't want?
- Did the strongest results show up at the top?
How mine works
I define test cases in a YAML file:
version: 0.3
owner: Javier
lastUpdated: 2026-01-28
cases:
  - id: pricing-with-motion
    query: "pricing page with motion"
    must_include:
      pageTypes_all: ["pricing"]
      uxPatterns_any: ["motion", "micro-interaction"]
    must_not_include:
      visualStyles_any:
        ["glassmorphism", "neon", "scroll-jacking", "glow-heavy", "shadow-heavy"]
    notes: "Pricing is a hard gate. Motion is preferred; library may limit full coverage."

Then a small script runs each case:
- Sends the query to the API
- Reads the returned topPicks
- Verifies each result against the library tags (the source of truth)
- Saves a run artifact in JSON so I can compare results
Basically, evals test whether the product behaves the way I designed it to.
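Here's a trimmed sketch of that loop. The endpoint, port, and response shape are assumptions; the real runner lives in scripts/evals/runInspoEvals.mjs.

// Sketch of one eval case: query the API, verify picks against library tags.
type EvalCase = {
  id: string;
  query: string;
  must_not_include?: { visualStyles_any?: string[] };
};

async function runCase(c: EvalCase, library: Map<string, { tags: string[] }>) {
  const res = await fetch("http://localhost:3000/api/inspo", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query: c.query }),
  });
  const { topPicks } = (await res.json()) as { topPicks: { id: string }[] };

  // Check against library tags (the source of truth), never the model's claims.
  const banned = c.must_not_include?.visualStyles_any ?? [];
  const failures = topPicks
    .filter((p) => (library.get(p.id)?.tags ?? []).some((t) => banned.includes(t)))
    .map((p) => `${p.id} includes a banned style`);

  return { id: c.id, pass: failures.length === 0, failures };
}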
Why evals matter
Evals are the AI equivalent of regression testing, and AI breakage is subtle:
- Ranking shifts
- Intent gets weaker
- Output looks plausible
Evals can catch drift.
Catching failures
Here’s a trimmed example of real failures from my run (my full run includes additional cases):
inspo-engine [main] % bun run eval:inspo
$ bun scripts/evals/runInspoEvals.mjs
FAIL pricing-with-motion (7125ms)
- Intent coverage too low: 2/10 matched (min 6)
PASS pricing-minimal-tier-comparison (4104ms)
PASS grid-layout-structure (6319ms)
PASS avoid-effects-heavy (4672ms)
...

Intent coverage = how many of the top 10 results contain the required tags in my library.
pricing-with-motion failed because only 2/10 picks actually matched the motion intent at the tag level.
In this context, “matched” means the pick’s library tags actually include the tokens required by the eval (not what the model claims).
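A simplified sketch of that check (names are illustrative; the failing case above scored 2/10 against a minimum of 6):

// Intent coverage: how many top picks have at least one of the required tags
// in the library, regardless of what the model claims.
function intentCoverage(
  topPicks: { id: string }[],
  requiredAny: string[],                 // e.g. ["motion", "micro-interaction"]
  library: Map<string, { tags: string[] }>
): number {
  return topPicks.filter((p) =>
    (library.get(p.id)?.tags ?? []).some((t) => requiredAny.includes(t))
  ).length;
}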
The failure revealed:
- My inspiration inventory is still limited
- My eval is stricter than my current tagging coverage supports
It gives me valuable feedback on what to improve next (e.g., increase pricing-with-motion intent coverage) instead of blindly tweaking system prompts.
This was the annoying truth: the library sets the ceiling. The model can’t retrieve what isn’t there, and no amount of prompt tweaking will conjure “pricing with motion” examples out of thin air. The eval didn’t expose a model problem. It exposed a coverage problem.
The initial inventory for my inspiration library is intentionally small so I can control the quality of the data and move fast.
What's next
This is a learning project. The app has about a hundred possible directions, which is exciting. But I'm keeping it private for now, because API costs add up fast.
Next, I’ll probably focus on these in no particular order:
- Expand the library and find a better way to turn screenshots into structured data
- Adjust eval thresholds (treat them as targets, not only pass or fail)
- Tighten tagging guidelines (for example, collapse “micro-interaction” into “motion”)
What I learned building this
- Why LLMs should be treated as reasoners, not databases
- How to design a “source of truth” boundary
- Why “clamping” matters: the UI should never display claimed matches that aren’t true
- How evals function like regression tests for subjective systems
A note on authorship
I’m not a backend engineer, so I approached this like product design: define constraints, set success criteria, then iterate with tight feedback loops.
AI helped draft parts of the API and eval runner, but I treated it like a collaborator, not a black box.
I iterated until I could explain how it works in plain English. Then I tightened the boundaries around what the model is allowed to do (IDs only, server enrichment, clamped tags) and used evals to catch regressions and inventory gaps.
Phew, that was a long read. Thanks for stopping by. If you have thoughts or ideas, especially about designing for AI, I’d genuinely love to hear them. Let's connect on X or LinkedIn.