Is your AI product actually working? How to develop the right metric system

In my first stint as a machine learning (ML) product manager, a simple question inspired passionate debates across functions and leaders: How do we know if this product is actually working? The product in question catered to both internal and external customers. The model enabled internal teams to identify the top issues faced by our customers so that they could prioritize the right set of experiences to fix customer issues. With such a complex web of interdependencies among internal and external customers, choosing the right metrics to capture the impact of the product was critical to steer it toward success.

Not tracking whether your product is working well is like landing a plane without any directions from air traffic control. There is absolutely no way that you can make informed decisions for your customer without knowing what is going right or wrong. Additionally, if you do not actively define the metrics, your team will identify their own backup metrics. The risk of having multiple flavors of an 'accuracy' or 'quality' metric is that everyone will develop their own version, leading to a scenario where you might not all be working toward the same outcome.

For example, when I reviewed my annual goal and the underlying metric with our engineering team, the immediate feedback was: "But this is a business metric; we already track precision and recall."


First, identify what you want to know about your AI product

Once you do get down to the task of defining the metrics for your product, where do you begin? In my experience, the complexity of operating an ML product with multiple customers translates to defining metrics for the model, too. What do I use to measure whether a model is working well? Measuring the outcome of internal teams using our models to prioritize launches would not be fast enough; measuring whether the customer adopted solutions recommended by our model could risk us drawing conclusions from a very broad adoption metric (what if the customer did not adopt the solution because they just wanted to reach a support agent?).

Fast-forward to the era of large language models (LLMs), where we do not just have a single output from an ML model; we have text answers, images and music as outputs, too. The number of product dimensions that require metrics rapidly increases: formats, customers, type … the list goes on.

Across all my products, when I try to come up with metrics, my first step is to distill what I want to know about the product's impact on customers into a few key questions. Identifying the right set of questions makes it easier to identify the right set of metrics. Here are a few examples:

  1. Did the customer get an output? → metric for coverage
  2. How long did it take for the product to provide an output? → metric for latency
  3. Did the user like the output? → metrics for customer feedback, customer adoption and retention

Once you identify your key questions, the next step is to identify a set of sub-questions for 'input' and 'output' signals. Output metrics are lagging indicators, where you measure an event that has already occurred. Input metrics and leading indicators can be used to identify trends or predict outcomes. See below for ways to add the right sub-questions for lagging and leading indicators to the questions above; not all questions need leading/lagging indicators. A sketch of computing one indicator of each kind follows the list.

  1. Did the customer get an output? → coverage
  2. How long did it take for the product to provide an output? → latency
  3. Did the user like the output? → customer feedback, customer adoption and retention
    1. Did the user indicate that the output is right/wrong? (output)
    2. Was the output good/fair? (input)
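
To make the distinction concrete, here is a minimal sketch in Python that computes one lagging (output) indicator and one leading (input) indicator from the same log. The field names (`user_marked_correct`, `rubric_grade`) are made up for illustration, not a real schema.

```python
# Minimal sketch: one lagging (output) and one leading (input) indicator
# computed from the same records. Field names are illustrative only.

records = [
    {"user_marked_correct": True,  "rubric_grade": "good"},
    {"user_marked_correct": False, "rubric_grade": "fair"},
    {"user_marked_correct": None,  "rubric_grade": "not good"},  # no explicit feedback
]

# Lagging (output) indicator: share of explicit user feedback that was positive.
# It can only be measured after the event has already occurred.
rated = [r for r in records if r["user_marked_correct"] is not None]
output_accuracy = sum(r["user_marked_correct"] for r in rated) / len(rated)

# Leading (input) indicator: share of outputs graded good/fair against a rubric.
# It is available before users react, so it can flag trends early.
input_quality = sum(r["rubric_grade"] in ("good", "fair") for r in records) / len(records)

print(f"Lagging output accuracy: {output_accuracy:.0%}")  # 50%
print(f"Leading input quality: {input_quality:.0%}")      # 67%
```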

The third and final step is to identify the method for gathering metrics. Most metrics are gathered at scale through new instrumentation via data engineering. However, in some instances (like question 3 above), especially for ML-based products, you have the option of manual or automated evaluations that assess the model outputs. While it is always best to develop automated evaluations, starting with manual evaluations for "was the output good/fair" and creating a rubric for the definitions of good, fair and not good will help you lay the groundwork for a rigorous and tested automated evaluation process, too.
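
Once such a rubric has stabilized, it can be codified. Below is a minimal sketch of what an automated version might look like; the specific criteria (length bounds, a placeholder-text check) are hypothetical stand-ins, since the real definitions of good, fair and not good would come from your own tested rubric.

```python
# Minimal sketch: turning a manual 'good / fair / not good' rubric into an
# automated check. The criteria are hypothetical placeholders; a real rubric
# would encode your own tested definitions of each grade.

def grade_output(text: str) -> str:
    """Grade one model output against a simple rubric."""
    if not text.strip():
        return "not good"  # an empty output always fails
    words = len(text.split())
    reasonable_length = 5 <= words <= 120
    has_placeholder = "TODO" in text or "lorem" in text.lower()
    if reasonable_length and not has_placeholder:
        return "good"
    if reasonable_length:
        return "fair"
    return "not good"

outputs = [
    "Hand-thrown ceramic mug, 12 oz, dishwasher and microwave safe.",
    "TODO: add a description of the blue ceramic mug here.",
    "",
]
grades = [grade_output(o) for o in outputs]
pct_good_or_fair = sum(g in ("good", "fair") for g in grades) / len(grades)
print(grades)                                  # ['good', 'fair', 'not good']
print(f"% good/fair: {pct_good_or_fair:.0%}")  # 67%
```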

Example use cases: AI search, listing descriptions

The above framework can be applied to any ML-based product to identify the list of leading metrics for your product. Let's take search as an example.

| Question | Metrics | Nature of metric |
| --- | --- | --- |
| Did the customer get an output? → Coverage | % of search sessions with search results shown to the customer | Output |
| How long did it take for the product to provide an output? → Latency | Time taken to display search results to the user | Output |
| Did the user like the output? → Customer feedback, customer adoption and retention | Did the user indicate that the output is right/wrong? → % of search sessions with 'thumbs up' feedback on search results from the customer, or % of search sessions with clicks from the customer | Output |
|  | Was the output good/fair? → % of search results marked as 'good/fair' for each search term, per quality rubric | Input |
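
If each search session is logged with these events, the table above can be computed with a few lines of aggregation. A minimal sketch, assuming a hypothetical session log with `results_shown`, `latency_ms` and `thumbs_up` fields:

```python
import statistics

# Minimal sketch of aggregating the search metrics in the table above.
# The session log and its field names are hypothetical.
sessions = [
    {"results_shown": True,  "latency_ms": 120, "thumbs_up": True},
    {"results_shown": True,  "latency_ms": 340, "thumbs_up": False},
    {"results_shown": False, "latency_ms": 95,  "thumbs_up": False},
]

# Coverage (output): % of search sessions with results shown to the customer.
coverage = sum(s["results_shown"] for s in sessions) / len(sessions)

# Latency (output): median time to display results, over sessions that showed any.
latency = statistics.median(s["latency_ms"] for s in sessions if s["results_shown"])

# Customer feedback (output): % of sessions with results that got a 'thumbs up'.
shown = [s for s in sessions if s["results_shown"]]
feedback = sum(s["thumbs_up"] for s in shown) / len(shown)

print(f"Coverage: {coverage:.0%}, median latency: {latency} ms, thumbs-up rate: {feedback:.0%}")
```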

How about a product that generates descriptions for a listing (whether it is a menu item on Doordash or a product listing on Amazon)?

| Question | Metrics | Nature of metric |
| --- | --- | --- |
| Did the customer get an output? → Coverage | % of listings with a generated description | Output |
| How long did it take for the product to provide an output? → Latency | Time taken to generate descriptions for the user | Output |
| Did the user like the output? → Customer feedback, customer adoption and retention | Did the user indicate that the output is right/wrong? → % of listings with generated descriptions that required edits from the technical content team/vendor/customer | Output |
|  | Was the output good/fair? → % of listing descriptions marked as 'good/fair', per quality rubric | Input |

The approach outlined above is extensible to a number of ML-based products. I hope this framework helps you define the right set of metrics for your ML model.

Sharanya Rao is a group product manager at Intuit.
