- The conversation every asset manager is having
- Six models, six different strengths
- Benchmark comparison: all six models head-to-head
- Why there is no single “best” LLM
- The five things no generic LLM can do for asset managers
- The right model for the right task
- How Sherpa solves this
- Where the industry is heading
The Conversation Every Asset Manager Is Having
If you run an asset management firm in 2026, you’ve had some version of this conversation in the last six months.
Someone on your team raises the idea of “using AI.” The room nods. One person has been experimenting with ChatGPT. Another prefers Claude for research. Someone in IT says Google Gemini has better integration with your workspace. The analyst swears by Perplexity for sourced research. And someone just read that Grok can pull real-time data from X. The compliance officer wants to know where the data goes with any of them.
And nobody can agree on which one your firm should standardise on.
Here’s what we’ve learned after two years of building AI specifically for asset managers: the debate over which LLM is “best” is the wrong debate entirely. Each of the major models is genuinely excellent at something. The real question is how you orchestrate the right model for each task — within a governed, domain-aware platform that actually connects to your operational data.
The LLM Landscape: Six Models, Six Different Strengths
The six major LLMs your firm is likely evaluating — OpenAI’s ChatGPT, Anthropic’s Claude, Google’s Gemini, xAI’s Grok, Perplexity, and Meta’s Llama — are all remarkable technology. But they are not interchangeable. Each vendor has invested heavily in different strengths, and the 2026 benchmarks make those differences clear.
Let’s give credit where it’s due.
Benchmark Comparison: All Six Models Head-to-Head
We continuously benchmark LLMs across the dimensions that matter for asset management workflows. Here’s where things stand as of April 2026.
| Benchmark | What It Measures | ChatGPT GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro | Grok 4.20 | Perplexity Sonar Pro | Llama 4 Maverick |
|---|---|---|---|---|---|---|---|
| MMLU-Pro | Broad knowledge | 91.8% (Best) | 89.5% (Strong) | 89.8% (Strong) | ~86% (Good) | 75.5% (Good) | 88.2% (Strong) |
| SWE-bench Verified | Real-world coding | 78.2% (Strong) | 80.8% (Best) | 78.8% (Strong) | ~74% (Good) | — | ~72% (Good) |
| GPQA Diamond | Scientific reasoning | 92.8% (Strong) | 91.3% (Strong) | 94.3% (Best) | 88.0% (Good) | — | 85.2% (Strong) |
| Chatbot Arena Elo | Human preference | Top 5 (Strong) | 1504 (Best) | 1493 (Strong) | 1491 (Strong) | — | 1467 (Good) |
| AIME / MATH-500 | Mathematical reasoning | Strong | Strong | Strong | 93.3% / 99% (Best) | — | Strong |
| SimpleQA | Factuality, verifiable accuracy | Strong | Strong | Strong | Good | 0.858 (Best) | Strong |
| Real-time data access | Live information | Via plugins (Good) | Limited (Good) | Google Search (Strong) | Native X/Twitter (Best) | Native web search (Best) | None |
| Long-form writing | Prose quality & nuance | Good | Excellent (Best) | Strong | Good | Good | Good |
| Multimodal input | Image, video, audio | Strong | Strong | Excellent (Best) | Strong | Limited (Good) | Strong |
| Inference cost | Price-performance | Moderate (Strong) | High (Good) | Lowest (Best) | Moderate (Strong) | Low (Best) | Free, self-hosted (Best) |
| Context window | Input capacity | 128K (Good) | 1M (Best) | 1M (Best) | 1M (Best) | 128K (Good) | 1M (Best) |
Sources: LM Arena, Artificial Analysis, MindStudio Benchmarks. Scores reflect published results as of April 2026; benchmarks evolve rapidly.
Why There Is No Single “Best” LLM
Look at that table. No single model wins every row. And that’s the point.
ChatGPT excels at structured reasoning and agentic computer use. It’s the model you want for complex multi-step logic, tool orchestration, and tasks that require interacting with systems. OpenAI has earned this — they’ve invested heavily in the o-series reasoning models and their agent capabilities are genuinely ahead.
Claude produces the most natural, nuanced writing and is the strongest at real-world coding. Anthropic’s Constitutional AI approach also makes it the most cautious about hallucination and the most transparent about uncertainty — qualities that matter enormously in regulated industries. When you need a 40-page client report that reads like it was written by a senior analyst, or you need code that handles edge cases correctly, Claude is the right choice.
Gemini leads on scientific reasoning, multimodal input, and raw speed. Google’s investment in Gemini’s reasoning capabilities has paid off — it tops GPQA at 94.3% and its ARC-AGI-2 performance on abstract reasoning is best in class. When you need to process a mix of PDFs, images, and spreadsheets at speed, Gemini delivers.
Grok has quietly become the mathematical reasoning powerhouse. xAI’s Grok 4.20 scores 93.3% on AIME and 99% on MATH-500 — numbers that no other model matches. Its native integration with X/Twitter gives it real-time access to market sentiment, breaking news, and trending topics. When your research team needs live sentiment data or your quant team needs complex calculations, Grok is the right tool.
Perplexity takes a fundamentally different approach: every answer comes with numbered, linked citations to source material. Its SimpleQA factuality score of 0.858 outperforms every other model on verifiable accuracy. The Model Council feature runs queries through multiple frontier LLMs simultaneously and shows side-by-side comparisons. When you need sourced, defensible research — the kind you can put in front of a compliance officer or a board — Perplexity is purpose-built for it.
Llama is the open-weights option. Meta’s Llama 4 Maverick can be self-hosted at no licence cost, carries a 1M-token context window, and posts competitive scores across the board. When data sovereignty requirements mean nothing can leave your own infrastructure, an open model you control is often the only workable choice on this list.

Each of these models represents billions of dollars of research and genuinely world-class engineering. The mistake isn’t choosing any of them. The mistake is choosing just one.
The Five Things No Generic LLM Can Do for Asset Managers
Whether you use ChatGPT, Claude, Gemini, Grok, Perplexity, Llama, or all of them, every generic LLM shares the same five fundamental gaps when deployed in an asset management context.
1. Access your live operational data
No generic LLM connects to your custodian, fund administrator, registry, CRM, or market data feeds. They operate on whatever text you paste into them. Someone still has to manually extract, clean, and format data before AI can help. That’s not automation — that’s adding a step.
2. Understand asset management terminology
Every asset manager uses different names for the same data. “FUA,” “AUM,” “Assets Under Management” — three different labels for the same number, scattered across your own systems. Generic models don’t know that your CRM’s “total portfolio value” is the same as your custodian’s “FUA.” Without a domain-specific ontology layer, AI gives you confident-sounding wrong answers.
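As a toy illustration of what an ontology layer does, the core move is resolving every system’s label to one canonical concept before any model sees the data. The field names below are hypothetical, not any real system’s schema:

```python
# Toy ontology: map each system's field label onto one canonical concept.
# All labels here are illustrative, not an actual production schema.
CANONICAL_FIELDS = {
    "fua": "funds_under_administration",
    "aum": "funds_under_administration",
    "assets under management": "funds_under_administration",
    "total portfolio value": "funds_under_administration",
}

def normalise_field(raw_label: str) -> str:
    """Resolve a system-specific field label to its canonical concept."""
    key = raw_label.strip().lower()
    return CANONICAL_FIELDS.get(key, key)  # unmapped labels pass through

# The CRM's "Total Portfolio Value" and the custodian's "FUA" now resolve
# to the same concept, so downstream queries compare like with like.
assert normalise_field("Total Portfolio Value") == normalise_field("FUA")
```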
3. Respect your compliance rules
This gap matters more than ever in 2026. APRA’s CPS 230 is now in effect, requiring regulated entities to identify AI systems supporting critical operations. From December 2026, Privacy Act amendments require explanation of automated decisions, with penalties up to $50M.
Generic LLMs lack data residency controls, audit trails, role-based access, and compliance guardrails. It doesn’t matter how good the model is if your compliance team can’t sign off on it.
4. Take action — not just answer questions
A generic LLM can answer questions, but it can’t generate a compliant client report from live data, flag a compliance breach before it becomes a regulatory incident, plan an optimal trip across 12 client meetings, or produce a board-ready summary from real-time FUA figures. The model is the brain. But without integrations, a data layer, and action capabilities, it’s a brain floating in empty space.
5. Operate within your existing infrastructure
Asset managers run on complex stacks of custodians, fund admins, registries, and CRMs built up over years. A purpose-built AI platform should sit on top of existing systems, not replace them. We’ve deployed across firms managing $3B to $116B in AUM — in every case, the platform connected to existing systems via APIs. Average time from kickoff to live: four weeks.
The Right Model for the Right Task
Here’s the insight that changed how we build AI for asset managers: the right answer isn’t picking one LLM and hoping it covers everything. It’s having the right LLM for each specific task.
Consider the range of AI tasks in a typical asset management firm:
- Generating a 40-page client report with nuanced commentary? You want the model that writes best — that’s Claude.
- Processing a mixed batch of PDFs, images, and spreadsheets from a custodian feed? You want multimodal strength and speed — that’s Gemini.
- Orchestrating a complex multi-step workflow that pulls data from five systems, runs compliance checks, and generates a board pack? You want the strongest reasoning and tool-use model — that’s ChatGPT’s o-series.
- Scanning real-time market sentiment and running quantitative calculations on fund performance? You want live data access and mathematical precision — that’s Grok.
- Producing sourced, defensible research on regulatory changes, competitor activity, or market trends — with every claim linked to its source? That’s Perplexity.
- Quick-turnaround tasks like drafting emails, summarising meeting notes, or answering ad-hoc data questions? You want the fastest inference at the lowest cost; on the table above, that points to Gemini or a self-hosted Llama.
No single model is the best choice for all six of those tasks. And that’s before you factor in cost — using the most expensive frontier model for a simple email summary is throwing money away.
How Sherpa Solves This
Sherpa is Datafabric’s AI assistant, purpose-built for asset managers. But unlike any single LLM, Sherpa is designed as an orchestration layer that brings together the best models for each task.
Here’s what this looks like in practice:
We offer all the key LLMs. Sherpa isn’t locked to a single provider. We integrate ChatGPT, Claude, Gemini, Grok, Perplexity, Llama, and other models as they prove their worth. When a new model launches or an existing one improves, we test it and add it to the rotation.
We benchmark them continuously. Every model is evaluated against the specific tasks that matter for asset managers — report generation, data extraction, compliance checking, research, and more. Not generic benchmarks. Your benchmarks, on your data, for your workflows.
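A minimal sketch of what such a task-specific evaluation loop might look like. Everything here is a hypothetical stand-in: the model client, the grader, and the two toy tasks. In practice the grading would involve human review or a dedicated judging model rather than a keyword check:

```python
# Sketch of a domain-specific benchmark loop, not Sherpa's actual harness.
def call_model(model: str, prompt: str) -> str:
    """Placeholder for a real multi-provider client call."""
    return f"[{model} answer to: {prompt}]"

def score_response(answer: str, rubric: str) -> float:
    """Placeholder grader; real scoring would be far more rigorous."""
    return 1.0 if rubric.lower() in answer.lower() else 0.0

DOMAIN_TASKS = [  # your benchmarks, on your data, for your workflows
    {"prompt": "Summarise March FUA movements", "rubric": "fua"},
    {"prompt": "Draft a breach notification",   "rubric": "breach"},
]

def benchmark(models: list[str]) -> dict[str, float]:
    """Average each model's score across the firm's own task suite."""
    return {
        m: sum(score_response(call_model(m, t["prompt"]), t["rubric"])
               for t in DOMAIN_TASKS) / len(DOMAIN_TASKS)
        for m in models
    }

print(benchmark(["claude", "gemini"]))
```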
We set up the right model for the right task. Client report? Routed to the model that writes best. Document processing? Routed to the fastest multimodal model. Complex multi-step workflow? Routed to the strongest reasoning model. This happens automatically — your team doesn’t need to think about which model to use.
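To make the routing idea concrete, here is a minimal sketch of a task-to-model lookup. The task names, model labels, and fallback are illustrative assumptions, not Sherpa’s actual configuration:

```python
# Illustrative task router: map each task type to the model class that
# benchmarks best for it. All identifiers below are placeholders.
ROUTING_TABLE = {
    "client_report":    "claude",      # strongest long-form writing
    "doc_processing":   "gemini",      # fast multimodal input
    "agent_workflow":   "chatgpt",     # strongest reasoning / tool use
    "live_sentiment":   "grok",        # native real-time data access
    "sourced_research": "perplexity",  # citation-backed answers
}
DEFAULT_MODEL = "low_cost_model"       # quick drafts, summaries, ad-hoc Q&A

def route(task_type: str) -> str:
    """Pick a model for a task; fall back to the cheapest capable option."""
    return ROUTING_TABLE.get(task_type, DEFAULT_MODEL)

print(route("client_report"))  # -> claude
print(route("email_summary"))  # -> low_cost_model
```

In production the table would be driven by the continuous benchmarks described above rather than hard-coded, so the routing shifts as model performance shifts.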
We monitor everything. Every query is logged. Every response is traceable. Quality scores are tracked over time. If a model’s performance degrades or a better option becomes available, Sherpa adapts. Cost is tracked per task, per model, so you know exactly what you’re paying for.
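A minimal sketch of the per-request record that kind of monitoring implies. The field names are assumptions chosen to mirror the paragraph above, not Sherpa’s actual telemetry schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class QueryRecord:
    """One logged model call: enough to trace quality and cost over time."""
    task_type: str                       # e.g. "client_report"
    model: str                           # which model served the request
    cost_usd: float                      # per-call cost
    quality_score: float | None = None   # filled in by later evaluation
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

log: list[QueryRecord] = []
log.append(QueryRecord(task_type="client_report", model="claude", cost_usd=0.42))

# Cost per task, per model, falls out of a simple aggregation over the log.
claude_spend = sum(r.cost_usd for r in log if r.model == "claude")
```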
And critically, Sherpa provides the five capabilities that no generic LLM offers on its own: live data connectivity across your custodian, fund admin, CRM, and internal systems; a domain-specific ontology that understands asset management terminology; full compliance governance with audit trails, data residency, and role-based access; the ability to take action (generate reports, flag breaches, plan trips); and integration with your existing infrastructure without replacement.
The models are the brains. Sherpa is the nervous system that connects them to your business.
Where the Industry Is Heading
The shift from “pick one LLM” to “orchestrate the right LLM for each task” is happening across financial services. Microsoft’s 2026 financial services outlook identifies domain specialisation as one of the five key predictors of AI success. NVIDIA’s latest survey shows firms doubling down on industry-specific AI investment. Regulators — including ASIC and APRA — are making it clear that “we used an AI tool” is not an acceptable governance position.
The LLM wars will continue. New models will launch. Benchmarks will shift. Some of the numbers in this article will be outdated within months.
But the principle won’t change: the right answer for asset managers is not the best model. It’s the best model for each task, connected to your data, governed by your compliance rules, and monitored continuously.
That’s what Sherpa does. And that’s the difference between experimenting with AI and actually deploying it.