Why AI Agents Still Suck at Real Work

According to Sifted, AI agents from companies like OpenAI, Anthropic, and Google are being touted as the next giant breakthrough by leaders including Sam Altman, but they’re still riddled with failure points that founders see daily. Anthropic’s Claude agent lost money running a vending machine for a month by giving away items for free, while Taco Bell is rethinking its AI drive-throughs after a customer crashed the system by ordering 18,000 waters. Replit’s AI agent accidentally deleted a company’s entire database, and one founder reports latency increasing 4x since GPT-5’s release. Despite the massive hype surrounding autonomous AI agents, the technology currently requires multiple layers of supervision and still struggles with hallucinations and multi-step tasks.


The Supervision Problem

Here’s the thing about these supposedly autonomous agents – they’re anything but. Matt Wilson from Jack & Jill says they constantly need “supervisor” LLMs to prevent them from getting carried away. His agent Jill will find candidates with red flags and still want to present them to clients. Basically, these systems lack the judgment to know when they’re making bad decisions. And Jan Philipp Harries from ellamind notes that tiny misses early in a multi-step plan snowball, sending the agent down a completely wrong rabbit hole. So much for independence – they need constant babysitting.
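To make the “supervisor” idea concrete, here is a minimal sketch of a review step that sits between an agent’s proposals and the client. Everything in it is an assumption for illustration: the `Candidate` type, the `red_flags` field, and the function names are hypothetical, and the supervisor is a plain rule check standing in for the second LLM that Wilson describes. It is not how Jack & Jill actually builds Jill.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """Hypothetical record for a candidate the agent wants to present."""
    name: str
    red_flags: list = field(default_factory=list)


def supervisor_review(candidate):
    """Stand-in for a supervisor LLM: veto any candidate with red flags.

    Returns (approved, reason). A real supervisor would be a second model
    call; a simple rule is used here so the sketch stays self-contained.
    """
    if candidate.red_flags:
        return False, "red flags: " + ", ".join(candidate.red_flags)
    return True, "no red flags"


def present_to_client(proposed):
    """Pass every agent proposal through the supervisor before it goes out."""
    approved = []
    for candidate in proposed:
        ok, _reason = supervisor_review(candidate)
        if ok:
            approved.append(candidate)
    return approved


proposals = [
    Candidate("Alex"),
    Candidate("Sam", red_flags=["unexplained employment gap"]),
]
shortlist = present_to_client(proposals)
print([c.name for c in shortlist])
```

The point of the pattern is that the agent never talks to the client directly. Its output is only a proposal until a separate check signs off on it.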

The Hallucination and Latency Nightmare

Daniel Keinrath from Fonio drops the bombshell that “LLMs are getting worse” in some respects. Since GPT-5’s release, Fonio has seen latency increase 4x. Think about that – we’re supposed to be moving toward faster, more capable AI, but the wait times are getting longer. And hallucinations? They’re still a massive issue. The only way to reduce them is with really strong system prompts and small context windows, which basically means limiting what the AI can do. It’s like trying to fix a leaky pipe by turning off the water main.
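As a rough illustration of the “strong system prompt, small context window” workaround, the sketch below always keeps the system prompt and then keeps only the most recent messages that fit a fixed budget. The function name `trim_context`, the word-count budget, and the sample messages are all assumptions made for this example. Real systems count tokens with the model’s own tokenizer, and nothing here reflects how Fonio actually does it.

```python
def trim_context(system_prompt, history, max_words=50):
    """Keep the system prompt plus the newest messages that fit the budget.

    Word count is a crude stand-in for token count (an assumption made to
    keep the sketch dependency-free).
    """
    kept = []
    budget = max_words
    # Walk backwards so the most recent messages are kept first.
    for message in reversed(history):
        cost = len(message.split())
        if cost > budget:
            break  # older messages are dropped once the budget runs out
        kept.append(message)
        budget -= cost
    kept.reverse()  # restore chronological order
    return [system_prompt] + kept


system_prompt = "Only answer from the order data provided. If unsure, say so."
history = [
    "Customer asked about opening hours yesterday.",
    "Customer wants a refund for order 1042.",
    "Order 1042 was delivered damaged.",
]
context = trim_context(system_prompt, history, max_words=12)
print(context)
```

The trade-off is the one described above: the older message falls out of the window, so the model has less to get confused by but also less to work with.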

Single Task Heroes, Multi-Task Zeros

Emma Burrows from Portia AI makes a crucial distinction that gets overlooked in all the hype. These agents are decent at single, specific tasks – like processing refunds. But give them multiple tasks at once? Their consistency falls apart completely. We’re seeing this play out in real time with tools that promise to handle everything but end up needing constant human intervention. The multi-agent systems that can handle diverse workflows are still in early development. It’s the difference between a specialist who’s great at one thing and a generalist who’s mediocre at everything.

Time for a Reality Check

Look, the potential is obviously there. But we’re in that awkward phase where the marketing has raced ahead of the actual capabilities. When an AI can’t even run a vending machine profitably or handle a large water order without crashing, how ready is it really for complex business workflows? The founders working with this tech daily are telling us something important – we need to temper expectations. As one developer noted, we’re still in the “toy” phase with many of these agent applications. The infrastructure and reliability just aren’t there yet for serious deployment without heavy supervision and workarounds.