Sarah watched her computer screen in disbelief as the quarterly report she’d been waiting for came back completely wrong. The AI assistant she’d trusted to handle the data analysis had somehow mixed up revenue figures with employee headcount. What should have been a simple task turned into hours of manual corrections.
If this sounds familiar, you’re not alone. Millions of us rely on AI tools daily, assuming they can handle complex work just like human colleagues. But what happens when AI systems are given real office jobs with actual responsibilities?
A groundbreaking AI company experiment just answered that question, and the results might surprise you.
When AI Bots Became Your Coworkers
Researchers at Carnegie Mellon University created something unprecedented: a completely artificial company where every single “employee” was an AI system. This wasn’t just another chatbot test. They built a real workplace environment and handed over genuine business responsibilities to some of the most advanced AI models available today.
The experiment included heavy hitters from the AI world. Claude 3.5 Sonnet from Anthropic, OpenAI’s GPT-4o, Google Gemini, Amazon Nova, Meta’s Llama, and Alibaba’s Qwen all got job titles and desk assignments. Some became financial analysts, others took on project management roles, and a few were designated as software engineers.
“Instead of neat, one-off prompts, the models faced messy, multi-step tasks that looked a lot like an ordinary day at the office,” the researchers noted in their study.
To make things realistic, the AI company experiment included simulated departments like HR. The artificial employees had to send messages, request information, and coordinate with each other just like real staff members would. No hand-holding, no perfect instructions – just the chaos of actual workplace collaboration.
The Tasks That Broke the AI Workers
The researchers didn’t throw curveballs or trick questions at their AI employees. Instead, they assigned typical knowledge work that millions of people handle every day. Here’s what the artificial workforce was expected to manage:
- Navigate company file systems to analyze databases and extract meaningful insights
- Pull information from multiple documents and create coherent summaries
- Research and compare office spaces via virtual tours to recommend new company premises
- Communicate across departments to get approvals and gather additional data
- Follow complex instructions that mixed explicit steps with implied expectations
- Handle time-sensitive projects with budget constraints
- Make judgment calls when information was incomplete or conflicting
These tasks required more than just answering questions. The AI systems needed to plan ahead, take initiative, adapt when circumstances changed, and work within real-world limitations.
| AI Model | Full Success Rate | With Partial Credit | Primary Role |
|---|---|---|---|
| Claude 3.5 Sonnet | 24% | 34.4% | Financial Analyst |
| GPT-4o | 19% | 28.2% | Project Manager |
| Google Gemini | 16% | 25.1% | Software Engineer |
| Other Models | 12-18% | 20-27% | Various Roles |
“We expected some struggle, but the extent of the failures was eye-opening,” one researcher involved in the study explained. “These are supposed to be the most capable AI systems we have.”
Why Three-Quarters of Everything Went Wrong
The results were sobering. Even Claude 3.5 Sonnet, the top performer in this AI company experiment, successfully completed only about one in four assigned tasks. When researchers gave partial credit for work that was somewhat correct but incomplete, the success rate barely crept above one-third.
The other AI models performed even worse, with most failing 75-80% of their assignments entirely. These weren't edge cases or impossible demands – they were the kinds of tasks that entry-level employees handle routinely.
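To see how the two numbers relate, here is a minimal sketch of checkpoint-based scoring, the kind of partial-credit grading the researchers describe. The function names, task data, and equal checkpoint weights are illustrative assumptions, not details taken from the study.

```python
# Hypothetical checkpoint-based scoring sketch. Each task is a list of
# pass/fail checkpoints (True = passed); weights are assumed equal.

def full_success_rate(tasks):
    """Fraction of tasks where every checkpoint passed (strict metric)."""
    return sum(all(t) for t in tasks) / len(tasks)

def partial_credit_rate(tasks):
    """Average fraction of checkpoints passed per task (lenient metric)."""
    return sum(sum(t) / len(t) for t in tasks) / len(tasks)

tasks = [
    [True, True, True],     # fully completed
    [True, False, False],   # stalled after the first step
    [True, True, False],    # almost finished
    [False, False, False],  # never got off the ground
]

print(full_success_rate(tasks))    # 0.25 – one task in four fully done
print(partial_credit_rate(tasks))  # 0.5  – partial work counts for more
```

This is why the lenient number is always higher: an agent that completes most steps of a task but never finishes it scores zero on the strict metric yet still earns substantial partial credit.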
Several patterns emerged from the failures. The AI systems consistently struggled with multi-step processes that required remembering context from earlier interactions. They had trouble when tasks required reaching out to other “departments” for information, often getting stuck in loops or abandoning efforts entirely.
Decision-making proved particularly challenging. When faced with incomplete information or conflicting data – situations that happen constantly in real workplaces – the artificial employees often froze or made arbitrary choices without explaining their reasoning.
“The AI models excel at individual tasks but fall apart when those tasks need to connect to a larger workflow,” noted a workplace automation expert reviewing the findings.
What This Means for Your Job and Future
These results have immediate implications for anyone wondering whether AI might replace their role. While AI tools can certainly help with specific tasks, the AI company experiment suggests we’re still far from artificial employees that can handle the full complexity of most knowledge work.
The experiment reveals why so many workplace AI implementations fall short of expectations. Companies investing millions in AI automation may need to recalibrate their timelines and expectations. The technology excels at narrow, well-defined tasks but struggles with the ambiguity and interconnectedness of real work environments.
For workers, this brings both relief and responsibility. Your job likely remains safe from AI replacement in the near term, but the technology will continue evolving. The smart approach involves learning to work alongside AI tools rather than competing against them.
Business leaders should take note of these findings when planning AI adoption strategies. Rather than expecting AI to replace entire roles, a more realistic strategy focuses on automating specific tasks while keeping humans in supervisory and coordination roles.
“This experiment shows us that AI is incredibly powerful for certain things, but we’re not close to artificial general intelligence in the workplace,” explained a technology consultant who wasn’t involved in the study. “The human elements of judgment, creativity, and complex problem-solving remain irreplaceable.”
FAQs
What was the main goal of this AI company experiment?
The researchers wanted to test whether current AI systems could handle real workplace responsibilities, not just answer individual questions.
Which AI model performed best in the experiment?
Claude 3.5 Sonnet had the highest success rate at 24%, though this still meant it failed three out of four tasks.
Why did the AI systems fail so often?
They struggled with multi-step processes, coordinating between departments, handling incomplete information, and making judgment calls in ambiguous situations.
Does this mean AI won’t replace jobs?
Not necessarily, but it suggests AI replacement of complex knowledge work is likely further away than many predictions suggest.
How can businesses use these findings?
Companies should focus on using AI for specific tasks while keeping humans in supervisory roles, rather than expecting complete job replacement.
What types of tasks did the AI employees handle best?
The AI systems performed better on single-step tasks with clear instructions, but struggled with anything requiring coordination or complex decision-making.