ChatGPT made waves when it first launched, capturing the attention of entrepreneurs and tech enthusiasts worldwide. That surge of interest has led to an explosion of products and services that use large language models (LLMs) to tackle straightforward text-based tasks, from composing legal contracts and job descriptions to drafting emails and website content. Demand for AI solutions in text processing continues to rise as businesses seek to automate time-consuming tasks and free their teams for more complex work.
However, day-to-day operations often involve a broader range of data types beyond just text, such as spoken communication and intricate visuals. With the advent of multimodal models, we’re unlocking exciting new opportunities for AI to redefine workflows across various industries, bringing voice and vision capabilities to the forefront of business operations.
Exciting Developments in Multimodal Architecture
In the last year, multimodal models have advanced significantly, particularly in understanding context and reducing erroneous outputs. We’re nearing human-like performance in areas like speech recognition, image analysis, and voice generation, paving the way for innovative AI applications.
Voice Capabilities
We’ve seen impressive strides in the core components of voice technology—speech-to-text and text-to-speech models. Numerous vendors are now offering these capabilities, resulting in a flurry of new conversational AI solutions. A prevailing architecture has been to transcribe voice to text, process it with an LLM, and then convert it back to audio. While effective, this method carries drawbacks like increased latency and a loss of emotional context during transcription.
Fortunately, recent developments have seen the introduction of speech-native models, such as OpenAI’s Realtime API. These models allow direct speech-to-speech interactions, significantly lowering latency and retaining contextual elements such as emotion and tone, making conversations feel more organic. We anticipate a revolutionary shift in the quality of conversational voice applications built on these advanced models.
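To make the trade-off concrete, here is a minimal sketch of the cascaded pipeline described above (speech-to-text, then an LLM, then text-to-speech). The three stage functions are hypothetical stand-ins, not any vendor's API. They only illustrate two points from the text: the stages run one after another, so their latencies add up, and the LLM sees only plain text, so tone and emotion are gone after the first step.

```python
import time
from typing import Any, Callable, List, Tuple

# Hypothetical stand-ins for real services. None of these call an actual
# vendor API; they exist only to show the shape of the pipeline.

def transcribe(audio: bytes) -> str:
    """Speech-to-text stub. Output is plain text, so tone and emotion are dropped here."""
    return audio.decode("utf-8")

def generate_reply(transcript: str) -> str:
    """LLM stub. Produces a text reply from the transcript alone."""
    return f"You said: {transcript}"

def synthesize(reply: str) -> bytes:
    """Text-to-speech stub. Turns the reply text back into 'audio' bytes."""
    return reply.encode("utf-8")

Stage = Tuple[str, Callable[[Any], Any]]

CASCADE: List[Stage] = [
    ("speech_to_text", transcribe),
    ("llm", generate_reply),
    ("text_to_speech", synthesize),
]

def run_cascade(audio: bytes, stages: List[Stage]) -> Tuple[Any, List[Tuple[str, float]]]:
    """Run the stages in order and time each one.

    Each stage waits for the previous one to finish, so the end-to-end
    latency is the sum of the per-stage latencies.
    """
    timings: List[Tuple[str, float]] = []
    data: Any = audio
    for name, stage in stages:
        start = time.perf_counter()
        data = stage(data)
        timings.append((name, time.perf_counter() - start))
    return data, timings

if __name__ == "__main__":
    reply_audio, timings = run_cascade(b"book me a visit", CASCADE)
    for name, seconds in timings:
        print(f"{name}: {seconds * 1000:.3f} ms")
    print(f"total: {sum(s for _, s in timings) * 1000:.3f} ms")
```

With real services in place of the stubs, each stage would typically add a network round trip. A speech-native model collapses the three stages into one, which is where the latency savings and the preserved tone come from.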
Use Cases for Voice
The advancements in voice technology have given rise to several compelling applications, particularly in the realm of transcription. Here are four noteworthy examples:
- Medical Transcription: Abridge, a company backed by Bessemer, provides a leading medical transcription solution that generates notes from clinical discussions and suggests follow-ups, allowing doctors to focus more on patient care.
- Sales Training: Rillavoice is revolutionizing the home services industry by recording sales calls for training purposes, enabling managers to offer coaching feedback without the time-consuming process of in-person observations.
- Automated Customer Interactions: Voice agents are now handling inbound sales calls, booking appointments, and interacting with customer databases—ensuring that businesses don’t miss out on valuable leads, even after hours.
- Enhanced Customer Support: Modern voice agents outperform traditional interactive voice response systems, allowing for more intuitive and human-like customer interactions while freeing human agents for complex inquiries.
As the field progresses, we anticipate that low latency and emotional understanding will become standard expectations, while more sophisticated capabilities such as omnichannel communications and real-time translation will set leading solutions apart.
Vision Capabilities
On the visual front, models like GPT-4 with vision show promise in interpreting images and answering related questions. Future iterations, such as GPT-5, are expected to enhance this capability further, potentially encompassing video processing as well. Google’s Gemini 1.5 Pro already demonstrates a remarkable ability to understand both images and video with a vast context window.
Use Cases for Vision and Video
Initial applications for vision capabilities tend to fall into four broad categories:
- Data Extraction: Platforms like Raft utilize AI to pull information from unstructured documents, streamlining workflow processes in industries such as freight forwarding.
- Visual Inspection: Companies like xBuild are enhancing manual inspection processes in construction, enabling faster and more accurate results.
- Design Automation: AI is increasingly utilized to generate detailed architectural designs, freeing up engineers for higher-level work.
- Video Analytics: AI models that analyze video for safety violations in manufacturing are advancing rapidly and point toward future applications, especially in robotics.
In vision applications, simplicity often drives value, and integrating seamlessly with existing workflows is paramount for adoption. Therefore, focusing on less complex solutions initially can yield better results.
The Promise of AI Agents
Although AI agents initially fell short of the hype, they are now making meaningful strides. Recent models with stronger reasoning capabilities are tackling a variety of automation tasks involving text, voice, and vision.
- Sales and Marketing: AI agents are streamlining outreach by researching potential leads and crafting personalized communications.
- Negotiations: Agents developed by companies like Pactum are automating negotiations, optimizing deal terms across multiple stakeholders.
- Cybersecurity: AI agents assist in investigating security alerts, gathering information, and providing summaries, easing the workload for cybersecurity teams.
Handling complex reasoning tasks well will set AI solutions apart. Smart architectural choices, rather than simply scaling data and compute, will be what ensures consistency and performance.
Vertical AI Expands Its Horizons
Founders in the vertical AI space are harnessing these new capabilities to address an expanding array of real-world tasks, exceeding previous expectations. As voice and vision technologies become commoditized, new waves of vertical AI applications will transform industries and redefine workplace interactions.
Up Next: Novel Business Models
The progressive advancements in LLMs and generative AI are driving innovations not just in products, but in business models as well. In our next article, we will dive deep into emerging models like copilots, agents, and AI-enabled services—exploring their applications, pricing strategies, and potential impact.
If you’re working on a Vertical AI application, we’d love to hear from you! Reach out to us at VerticalAI@bvp.com.