To harness the full potential of your model, an end-to-end AI system is necessary to develop, tune, and deploy models at scale. This system should seamlessly integrate key components: data acquisition and pre-processing, context and prompt generation, model flow orchestration (whether parallel or sequential), an automated evaluation framework, and efficient management of both models and data.
Moreover, post-processing of results is crucial to ensure the output maximizes business value. The effectiveness of this orchestrated system hinges on successful integration with both upstream and downstream business processes, facilitating informed decision making.
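The orchestration described above can be sketched as a simple sequential pipeline. This is a minimal illustration, not a real framework: the stage names, the shared-context pattern, and the stand-in for the model call are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Each stage reads and enriches a shared context dictionary.
Stage = Callable[[Dict], Dict]

@dataclass
class Pipeline:
    """A minimal sequential orchestrator for the stages described above."""
    stages: List[Stage] = field(default_factory=list)

    def run(self, context: Dict) -> Dict:
        for stage in self.stages:
            context = stage(context)
        return context

# Illustrative stages (hypothetical names, not a real API).
def preprocess(ctx: Dict) -> Dict:
    ctx["clean_text"] = ctx["raw_text"].strip().lower()
    return ctx

def build_prompt(ctx: Dict) -> Dict:
    ctx["prompt"] = f"Summarize: {ctx['clean_text']}"
    return ctx

def call_model_and_postprocess(ctx: Dict) -> Dict:
    # Stand-in for a model call plus post-processing of its output.
    ctx["output"] = ctx["prompt"].upper()
    return ctx

pipeline = Pipeline([preprocess, build_prompt, call_model_and_postprocess])
result = pipeline.run({"raw_text": "  Quarterly sales rose 4%.  "})
```

In a production system each stage would typically be an independent, monitored service, but the contract stays the same: a well-defined input flows through each component before reaching downstream business processes.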
Many organizations, having spent a decade or more digitally transforming, now have a technology stack that resembles Frankenstein’s monster — a mishmash of technologies, systems, and frameworks that have been cobbled together over time by different teams and departments. Unfortunately, understanding how everything works mostly exists within employees’ heads and nowhere else, making knowledge loss a frequent and real risk.
Fostering a knowledge-sharing ecosystem and striving for standardized technologies and processes helps ensure interoperability, quality control, and scalability. A strong, unified platform, such as Google Cloud's Vertex AI, can therefore bring more order and control. Without a strong system design, even the most sophisticated AI initiatives are likely to end up being reduced to mere experiments that deliver little to no business value.
Equally important is maintaining a high-quality data environment. The success of a gen AI project is deeply intertwined with the integrity of its data, as models inherit the flaws of the data used to train them. Without proper data governance, models can easily be trained on low-quality, biased, or irrelevant data, increasing the chances of hallucinations or problematic outputs. To mitigate the possibility of models perpetuating harmful biases, businesses should invest in labeling, organizing, and monitoring their data.
Here are some metrics to consider for tracking system quality:
- Data relevance: The degree to which your data is necessary for the current model and project. Be warned: extraneous data can introduce biases and inefficiencies that lead to harmful outputs.
- Data and AI asset reusability: The percentage of your data and AI assets that are discoverable and usable.
- Throughput: The volume of information a gen AI system can handle in a specific period of time. Calculating this metric involves understanding the processing speed of the model, efficiency at scale, parallelization, and optimized resource utilization.
- System latency: The time the system takes to return a response. This includes any ingress- or egress-based networking delays, data latency, model latency, and so on.
- Integration and backward compatibility: The APIs that upstream and downstream systems expose for direct integration with gen AI models. You should also consider whether the next version of a model will impact the systems built on top of existing models (beyond prompt engineering alone).
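The throughput and latency metrics above can be measured with a simple harness. This is a hedged sketch: `model_fn` is assumed to be any callable that takes one request and returns a response, and the stub used here simply reverses a string in place of a real model call.

```python
import statistics
import time

def measure(model_fn, requests):
    """Record per-request latency and overall throughput for a callable model.

    Returns requests-per-second throughput plus median and worst-case
    latency in seconds. `model_fn` is a placeholder for any gen AI call.
    """
    latencies = []
    start = time.perf_counter()
    for req in requests:
        t0 = time.perf_counter()
        model_fn(req)  # the call being measured
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "throughput_rps": len(requests) / elapsed,      # volume per unit time
        "latency_p50_s": statistics.median(latencies),  # typical response time
        "latency_max_s": max(latencies),                # worst observed case
    }

# Usage with a stub model standing in for a real endpoint:
stats = measure(lambda r: r[::-1], ["hello"] * 100)
```

In practice you would measure against the deployed endpoint so the numbers capture networking, data, and model latency together, and track tail percentiles (p95/p99) rather than only the median.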