Priyanka Kuvalekar
Contributor

Building enterprise voice AI agents: A UX approach

opinion
Apr 2, 2026 | 11 mins

Nobody wants to look silly yelling at an AI in a meeting. Making voice agents work at the office is more about "human" feel than better code.

[Image: glass speech bubbles. Credit: Shutterstock]

The voice AI agents market is projected to grow from $2.4 billion in 2024 to $47.5 billion by 2034, a 34.8% compound annual growth rate. Yet only 1% of enterprises consider their AI deployments “mature,” and fewer than 10% of AI use cases make it past the pilot stage.

The models work; the gap is in how these systems are designed for real human interaction in enterprise collaboration, where voice commands trigger workflows, meetings have audiences and mistakes carry social weight. This article is about where those design gaps live and how to close them.

Where enterprise voice AI breaks down

81% of consumers now use voice technology daily or weekly, but satisfaction hasn’t kept up. 65% of voice assistant users report regular misunderstandings. 41% admit to yelling at their voice assistant when things go wrong. These same people walk into work the next morning and are expected to trust a voice agent with their calendar, their meetings and their messages. The frustration they’ve learned at home sets the baseline expectation at work.

Most teams look at numbers like these and reach for technical fixes: Better speech recognition models, lower Word Error Rate (WER), faster processing. But WER tells you how well your system transcribed audio. It says nothing about whether someone trusted the agent enough to use it in front of their manager, or whether they’ll open it again next week. In enterprise collaboration, one misunderstood instruction and someone has a calendar invite they never asked for.

The root of the problem is a design assumption that keeps getting repeated: Treating voice AI as text with a microphone attached. Voice has its own constraints. Anything beyond a 500-ms response breaks conversational flow. Commands arrive mixed in with meeting crosstalk and open-office noise. Users can’t scroll back through what the agent said. And when the system gets something wrong in a meeting, the embarrassment lands differently than a typo in a chat window.

When you map user journeys for voice-driven enterprise workflows, the breakdowns don’t cluster around transcription failures. They cluster around moments of social risk: Issuing a command in front of an executive, trusting the system to send the right message or waiting in awkward silence while the agent processes. Nielsen’s usability heuristics help explain why. Visibility of system status means something entirely different in a voice-only interface where there’s no progress bar, no loading spinner. Users are left interpreting silence, and that ambiguity is one of the strongest predictors of early abandonment.

UX principles for building voice AI agents

There’s a reason conversations have rhythm. Sacks, Schegloff and Jefferson (1974) documented that people take turns in speech on roughly 200-ms cycles, regardless of language. When a voice agent takes even slightly longer than that, the interaction starts to feel off. People won’t say ‘the latency was too high’. They’ll say the thing felt clunky, or they’ll just stop using it.

This means agents need to acknowledge while processing. ‘Got it, looking that up…’ feels collaborative. People describe faster-responding systems as “more helpful” even when task completion rates are identical. Google’s Speech-to-Text documentation recommends 100-ms frame sizes for streaming applications. Dan Saffer’s work on microinteractions is useful here. Think about what makes a phone call feel natural: The ‘mm-hmm’ that says someone is listening, a pause before an answer, the rising voice inviting you to keep going. Voice agents need all of that. None of it shows up in a spec, but it separates a system people tolerate from one they want to use.
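The acknowledge-while-processing pattern can be sketched in a few lines. This is a minimal illustration, not any vendor’s implementation: if the real answer isn’t ready within a turn-taking budget (the ~200-ms cadence mentioned above; the exact threshold and phrases here are assumptions), the agent fills the silence with a short acknowledgment first.

```python
import asyncio
import random

ACK_BUDGET_S = 0.2  # illustrative turn-taking budget, roughly the ~200-ms conversational cadence
ACK_PHRASES = ["Got it, looking that up…", "One moment…"]

async def respond(handle_request, say):
    """Run the real request, but speak a filler acknowledgment if it
    isn't ready inside the turn-taking budget, so silence never stretches."""
    task = asyncio.ensure_future(handle_request())
    done, _ = await asyncio.wait({task}, timeout=ACK_BUDGET_S)
    if not done:
        # Fill the gap: the user knows they were heard while work continues.
        say(random.choice(ACK_PHRASES))
    say(await task)
```

A fast request produces one utterance; a slow one produces the acknowledgment followed by the answer, which is exactly the difference users describe as “collaborative.”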

Recovery matters as much as performance. People are forgiving the first time a voice agent gets something wrong. The second time, doubt creeps in. By the third, they’ve filed it under “doesn’t work,” and that verdict is hard to reverse. The agent needs to state explicitly when it is confused or cannot give a correct response, and offer workarounds, such as the closest reference documents or suggested next steps, to build trust and transparency.
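One way to make that recovery behavior concrete is to route on recognition confidence rather than always acting. The function and thresholds below are hypothetical, a sketch of the principle rather than a production policy:

```python
def plan_reply(intent: str, confidence: float, related_docs: list) -> str:
    """Route a request by recognition confidence instead of always guessing.

    The 0.85 / 0.50 thresholds are illustrative; real systems tune them per intent.
    """
    if confidence >= 0.85:
        return f"Done: {intent}."  # act, then confirm implicitly
    if confidence >= 0.50:
        return f"I think you want to {intent}. Is that right?"  # check before acting
    # Low confidence: admit confusion and offer a workaround rather than misfire.
    if related_docs:
        return "I'm not sure I understood. Closest matches: " + ", ".join(related_docs[:2]) + "."
    return "I'm not sure I understood. Could you rephrase that?"
```

The design choice worth noting: the lowest branch never pretends. An honest “I’m not sure” plus a pointer costs one turn; a confident misfire costs the third strike.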

Implicit confirmation is another principle that pays off immediately in enterprise settings. ‘I’ve sent the updated sales invoice to your inbox’ works better than ‘Should I send the sales invoice? Please say yes or no’. There’s a half-second pause right before someone issues a voice command where they’re wondering whether the agent will get it right and whether they should proceed at all. Good confirmation design takes that social risk down.

Finally, the environment is a design constraint, not a testing variable. Open offices, conference rooms, mobile use in transit, hybrid meetings: Each sounds different, and each creates different failure modes. Denoising and automatic speaker diarization aren’t nice-to-have features. They are table stakes.

The UX research playbook for building effective voice AI agents

Standard usability testing assumes the interface is visible and the system behaves the same way every time. Voice AI agents break both of those assumptions. The system’s behavior is non-deterministic, the interaction leaves no visual trace and the environment changes everything. The research approach has to account for all of that.

Contextual inquiry is essential because the acoustic environment is the primary design constraint. Observing someone use a voice agent while a coworker’s speakerphone bleeds through a conference room wall tells you more about what needs to change than any controlled study can. Think-aloud protocols need adaptation here too. Participants are already talking to the system, so concurrent think-aloud creates interference. The workaround is retrospective think-aloud with recordings, letting participants replay interactions and narrate what they were thinking at each point.

Field research only captures a snapshot, though. Diary studies take on a different role with AI voice agents than with traditional software. Instead of tracking feature usage, they track trust over time. Participants log not just what happened, but whether they’d repeat the interaction in front of colleagues. That’s how you spot trust starting to slip before your retention numbers do. Experience sampling picks up what even diary studies miss: You check in with people at random points while they’re actually using the agent, not after. Ask someone in a debrief and they’ll tell you it was fine. Their notes from the moment tell a different story.

Then there is quantitative UX research and behavioral data collection. Look at conversation logs: How often does the agent fall back to a generic response? Where do people abandon a request halfway through? Which user segments hit more errors than others? That data shows you where the system is failing at scale. Pairing this with qualitative findings turns isolated observations into product decisions.
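The log questions above reduce to a couple of rates. A minimal sketch, assuming an illustrative log schema of `{"session": id, "event": ...}` records (the event names are made up for the example):

```python
def conversation_metrics(turns):
    """Aggregate fallback and abandonment rates from conversation logs.

    `turns` is an ordered list of {"session": id, "event": ...} records,
    where event is one of "ok", "fallback", "abandon", "complete"
    (the schema is illustrative, not any product's real log format).
    """
    total = len(turns)
    fallbacks = sum(t["event"] == "fallback" for t in turns)
    last_event = {}
    for t in turns:  # ordered log: the last event per session wins
        last_event[t["session"]] = t["event"]
    abandoned = sum(e == "abandon" for e in last_event.values())
    return {
        "fallback_rate": fallbacks / total if total else 0.0,      # per turn
        "abandon_rate": abandoned / len(last_event) if last_event else 0.0,  # per session
    }
```

Segmenting the same computation by user group is what surfaces the “which segments hit more errors” question at scale.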

But the numbers that matter most aren’t the obvious ones. The pattern that keeps showing up is how often task completion and user satisfaction tell completely different stories. Someone finishes a task and still walks away frustrated: ‘It worked but I wouldn’t do that again in a meeting’. You only catch that divergence by pairing something like the System Usability Scale with behavioral data and qualitative follow-ups. Measurement works best when you’re looking at multiple levels at once. At the conversation level, you care about how the agent handles interruptions and how often it hits a fallback. At the business level, the question is simple: Did people keep using it after the first week? The interesting stuff lives in the gaps between those levels, and you’ll only see it if research teams are involved from the beginning, not called in after the product decisions are already locked.

Testing across the full range of speech patterns, accents and accessibility needs the product will encounter in production also reshapes product direction in ways teams don’t expect. The Speech Accessibility Project, run by the University of Illinois with Google, Apple and Amazon, trained models on a broader set of speech samples and saw accuracy improve by 18% to 60% for non-standard speech patterns. Card sorting exercises with diverse user groups regularly upend what product teams assumed users wanted. Also, curb-cut effects are real in voice AI: Building for users who depend entirely on voice produces better experiences across the board.

How UX research shapes agentic voice AI

When a voice agent moves from executing single commands to acting autonomously across enterprise workflows, the UX research problem changes. ‘Prepare tomorrow’s client meeting’ might involve pulling calendar data, finding documents and writing up a summary. Zoom’s AI Companion 3.0 works this way. The research question is no longer ‘did the system understand the words?’ It’s ‘does the person trust what the agent did on their behalf?’

The trust problem comes down to mental models. If someone says ‘reschedule tomorrow’s meetings’, they’re picturing the whole job: Check for conflicts, move the time slots, update the invites, notify the attendees. If the agent only moves the slots and silently drops the rest, that half-finished job feels worse than if it had just said ‘I can’t do that’. People shrug off an honest limitation. They don’t shrug off finding out an hour later nobody got notified.

What makes enterprise different is that the agent’s actions affect other people. An enterprise voice agent that misfires wastes your colleague’s time, sends your manager the wrong information or derails a meeting you weren’t even in. When the agent gets it wrong, other people pay the price and that makes people far less forgiving. A good way to catch these problems early in research is to ask participants to walk through what they expect the agent to do before it does it, then compare that against what actually happens. Those mismatches are early warnings. They’ll show up in your research months before they show up in support tickets or churn.
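That expect-then-compare exercise is easy to operationalize in a research session. A hypothetical helper, assuming the participant’s prediction and the agent’s action log are both captured as lists of step names:

```python
def expectation_gaps(expected_steps, actual_steps):
    """Compare the steps a participant predicted against what the agent actually logged."""
    expected, actual = set(expected_steps), set(actual_steps)
    return {
        "missed": sorted(expected - actual),     # the user assumed these would happen
        "surprises": sorted(actual - expected),  # the agent did these unasked
    }
```

For the rescheduling example above, a participant who predicted ‘move time slots, update invites, notify attendees’ against an agent that only moved the slots yields two ‘missed’ steps, and each one is an early warning of the half-finished-job problem.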

‘Least surprise’ carries extra weight in agentic contexts. Even when multiple things are happening behind the scenes, the person should get back one clear answer. Giving feedback during wait times, even a line like “Let me pull together a few things for that,” buys the system a few seconds without silence. Journey mapping shows users lose confidence in the middle of a request, during that gap. That’s the moment to get right.

Teams also need to plan for novelty wearing off. Early on, people give the system a pass when it stumbles. That wears off fast. Around week two or three, the comparison shifts. People stop thinking ‘that’s pretty good for AI’ and start thinking ‘my admin assistant would have gotten that right’. At work, everyone already knows what competent help looks like: The assistant who juggles calendars, the IT person who fixes things without being asked twice, the colleague who never forgets to send the agenda. That’s the bar, and the only way to see whether the system is going to clear it over time is longitudinal research.

Design problems, not engineering ones

The problems with enterprise voice AI aren’t technical mysteries. The models work. What’s been missing is treating voice AI as a UX problem from the start, applying research practice to the specific challenges that voice and agentic AI create in enterprise collaboration. Social risk, autonomous trust decisions, the gap between what the system can do and what people will actually rely on: These are design problems, not engineering ones.

As voice AI agents grow more autonomous, the question researchers and builders should be asking together isn’t ‘does this work?’ It’s ‘do people trust it enough to let it act on their behalf, in front of other people, without checking its work first?’ That’s the real adoption threshold. The methods and principles to get there are well understood. What matters now is whether teams put UX researchers in the room early enough to use them.

Disclaimer: The views expressed in this article are my own and do not represent those of my employer.

This article is published as part of the Foundry Expert Contributor Network.

Priyanka Kuvalekar

Priyanka Kuvalekar is a senior UX researcher at Microsoft, leading mixed-method research for Microsoft Teams Calling and agentic AI collaboration experiences. She partners closely with product, design, engineering and data science to build trustworthy, accessible and usable communication workflows that directly influence product strategy and execution.

For more than eight years, Priyanka has led end-to-end UX research across Microsoft, Cisco Webex, Global Payments and Korn Ferry, shaping enterprise collaboration, communication, fintech and HR experiences. She holds a Master of Science in User Experience and Interaction Design and is also an accessibility champion, leading research with people with disabilities.
