Transitioning from PoCs to High-Quality AI Applications in Production
Generative AI has grown explosively over the past couple of years, bringing technologies that promise to streamline business processes, reduce wait times, and cut defects in output. While interfaces such as ChatGPT make interacting with large language models (LLMs) straightforward, judging the quality and suitability of what they produce remains a pressing challenge, particularly for businesses aiming to incorporate generative AI into their workflows.
Many managers and entrepreneurs have hit roadblocks in their quest to launch high-quality AI applications. With more than three dozen AI projects under my belt, I keep running into the same misconception: the belief that success hinges solely on the power of the underlying model. In reality, the model is only about 30% of the equation; the rest comes down to context, orchestration, and evaluation.
Beyond the Basics: Techniques for Impactful AI Applications
Building effective LLM-based applications draws on a wide range of techniques, patterns, and architectures. Whether it’s choosing the right foundation model, fine-tuning it, or building an architecture around retrieval-augmented generation (RAG), the options are extensive.
This article will serve as a guide for evaluating generative AI applications—both qualitatively and quantitatively—within specific business contexts. We will introduce key questions to consider, such as:
- What contextual factors are essential for gauging the overall quality and utility of generative AI in enterprise settings?
- At which stages in the development lifecycle should evaluation take place, and what specific goals should we aim for?
- How can we implement various metrics to select, monitor, and improve the performance of AI applications?
We will organize these questions into a framework called PEEL (Performance Evaluation for Enterprise LLM Applications), which illustrates how to assess generative AI applications comprehensively.
Generative AI in Business Processes
Every organization runs on its business processes, whether customer support, operations, or software development. Generative AI can significantly enhance these processes by accelerating workflows, improving outcomes, and answering questions in context.
Take the telecommunications industry, for example. A customer support agent typically works through several tasks to respond to a customer inquiry. With generative AI, that workflow can be streamlined, enabling faster and more consistent responses to customer questions.
Here’s a simplified view of the typical process a customer support agent might follow:
- Prioritize Incoming Requests: When a new inquiry arrives, the agent prioritizes it based on urgency.
- Find Answers: The agent seeks the right answer and drafts a response.
- Send and Await Feedback: After sending the email, the agent waits for customer replies, leading to an iterative cycle until resolution.
Leveraging Generative AI Workflows
Incorporating a generative AI workflow can streamline this process significantly. Rather than a single call to ChatGPT or another LLM, an orchestrated sequence of model calls handles each task. In the telco example, the workflow could involve the following steps (a code sketch follows the list):
- Extract the Question: The first step derives a concise search query from the customer’s email, suitable for querying a vector database.
- Semantic Search: The next task utilizes semantic retrieval to find relevant context in the knowledge base.
- Contextual Response Generation: With context in hand, the model generates the best possible answer tailored to the query.
- Formal Response: Finally, the AI transforms the output into a well-structured email reflecting the company’s tone.
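To make these steps concrete, here is a minimal Python sketch of such an orchestration. `call_llm` and `VectorStore` are hypothetical placeholders for your model client and vector database, not any specific vendor’s API; swap in your actual SDKs.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to your LLM provider."""
    raise NotImplementedError("wire up your model client here")


class VectorStore:
    """Placeholder for a semantic-search client over your knowledge base."""

    def search(self, query: str, top_k: int = 3) -> list[str]:
        raise NotImplementedError("wire up your vector database here")


def answer_customer_email(email_body: str, store: VectorStore) -> str:
    # Step 1 - extract the question: derive a concise query from the raw email.
    query = call_llm(
        "Extract the customer's core question from this email:\n" + email_body
    )
    # Step 2 - semantic search: retrieve relevant passages from the knowledge base.
    context = "\n".join(store.search(query, top_k=3))
    # Step 3 - contextual response generation: ground the answer in the context.
    draft = call_llm(
        f"Answer the question using only this context.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    # Step 4 - formal response: rewrite the draft in the company's tone.
    return call_llm(
        "Rewrite the following as a polite, well-structured support email:\n" + draft
    )
```

Because each step is a separate, inspectable model call, the pipeline can be tested step by step, which matters for the evaluation stages discussed below.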
This orchestration not only improves efficiency but also keeps sensitive company data and internal process logic within a controlled, repeatable workflow.
The Importance of Context and Orchestration
Evaluating generative AI applications in complex business scenarios requires more than assessing the capabilities of foundation models in isolation. Examining the surrounding context and orchestration is crucial for achieving high-quality outcomes.
When setting up advanced workflows with generative AI, you must decide what to evaluate: a foundation model in isolation, a fine-tuned variant, or the entire orchestration of models and techniques. Each choice yields different insights critical for application development.
The Stages of Evaluation: From Concept to Production
Developing generative AI applications is typically an iterative process encompassing three key stages:
- Before Building: Initial evaluation focuses on defining requirements and selecting foundation models. Because building models from scratch is costly, companies typically opt for fine-tuning or RAG instead.
- During Development: This stage emphasizes quality and performance requirements through representative example cases, using typical scenarios to define expectations (a sketch follows below).
- In Production: Here, evaluation expands to cover unforeseen scenarios surfaced by live user feedback.
Feedback gathered during production must feed back into development, creating a continuous cycle of improvement.
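One lightweight way to encode those representative example cases is a scenario suite that runs automatically during development. Below is a sketch under stated assumptions: the scenarios and keyword checks are invented for illustration, and `answer_fn` could be the `answer_customer_email` function from the earlier sketch. An LLM-as-judge or semantic-similarity scorer could replace the simple keyword check without changing the structure.

```python
# Hypothetical scenario suite for the development stage. Each case pairs a
# realistic inquiry with properties a good answer should exhibit.
SCENARIOS = [
    {
        "email": "My router keeps rebooting every few minutes. What can I do?",
        "must_mention": ["firmware", "restart"],
    },
    {
        "email": "How do I cancel my contract before the renewal date?",
        "must_mention": ["notice period"],
    },
]


def run_scenarios(answer_fn) -> float:
    """Return the fraction of scenarios whose answer mentions all keywords."""
    passed = 0
    for case in SCENARIOS:
        answer = answer_fn(case["email"]).lower()
        if all(keyword in answer for keyword in case["must_mention"]):
            passed += 1
    return passed / len(SCENARIOS)
```

Tracking this pass rate across iterations gives a simple, repeatable signal of whether changes to prompts, models, or retrieval actually improve the application.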
Methods for Comprehensive Evaluation
To ensure a robust evaluation process, we can employ various methods throughout each stage:
- Synthetic Benchmarks: Established benchmarks such as the AI2 Reasoning Challenge (ARC) or HellaSwag help gauge the reasoning and commonsense capabilities of candidate models (a toy scoring example follows this list).
- Scenario-Based Testing: Detailed scenarios that mirror real-world use cases, like the suite sketched earlier, reveal how well an AI application performs when faced with genuine inquiries.
- Feedback Loops: Techniques such as user satisfaction sliders gather meaningful live feedback, which is crucial for optimizing the application after deployment (also sketched below).
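To show what benchmark scoring boils down to, here is a toy multiple-choice harness in the spirit of ARC and HellaSwag. The two items are invented stand-ins, not actual benchmark data, and real harnesses typically compare per-choice log-likelihoods rather than asking the model for an index.

```python
# Toy multiple-choice benchmark; items are invented, not real ARC/HellaSwag data.
BENCHMARK_ITEMS = [
    {
        "question": "Which gas do plants primarily absorb for photosynthesis?",
        "choices": ["oxygen", "carbon dioxide", "nitrogen", "helium"],
        "answer": 1,
    },
    {
        "question": "What happens to water at 100 degrees Celsius at sea level?",
        "choices": ["it freezes", "it boils", "it condenses", "nothing"],
        "answer": 1,
    },
]


def benchmark_accuracy(choose_fn) -> float:
    """choose_fn(question, choices) returns the index of the model's pick."""
    correct = sum(
        1
        for item in BENCHMARK_ITEMS
        if choose_fn(item["question"], item["choices"]) == item["answer"]
    )
    return correct / len(BENCHMARK_ITEMS)
```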
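For the production stage, here is a sketch of a feedback loop built around a satisfaction slider. All names and the 1-to-5 scale are illustrative; the key idea is that low-rated interactions become candidate scenarios for the next development iteration.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class FeedbackLog:
    """Collects one record per interaction, scored on an illustrative 1-5 slider."""

    records: list[dict] = field(default_factory=list)

    def record(self, email: str, answer: str, score: int) -> None:
        self.records.append({"email": email, "answer": answer, "score": score})

    def satisfaction(self) -> float:
        """Average slider score across all logged interactions."""
        return mean(r["score"] for r in self.records)

    def regression_candidates(self, threshold: int = 2) -> list[dict]:
        """Low-rated interactions to review and fold into the scenario suite."""
        return [r for r in self.records if r["score"] <= threshold]
```

The aggregate satisfaction score shows whether the application is trending in the right direction, while the regression candidates close the loop back into development, exactly the continuous cycle described above.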
Final Thoughts
As demonstrated throughout this article, advanced testing and quality engineering concepts are essential for successful generative AI applications. Utilizing frameworks like PEEL positions organizations to test not just isolated models but the broader orchestration of tasks that deliver real value.
As businesses continue to navigate the complexities of AI application deployment, refining these evaluation tools and methods will be critical.