
How a Demo Outage Forced Us to Rebuild Our Entire AI Inference Architecture
A demo outage forced us to rebuild our entire AI inference architecture. We went from Mistral to Nebius Token Factory and saved 75% on our reasoning costs.
I was mid-demo. A prospect was watching me walk through Orbiq's trust center live. Then the answers stopped coming. Mistral was down. Not degraded — gone.
For a compliance platform where buyers evaluate your security posture in real time, "the AI is down" is not recoverable. We lost the demo slot. I opened my notebook and wrote one line: we are never in this position again.
The problem wasn't just Mistral
After that call, I pulled up Mistral's status page: service unavailable. That's not a great look. Claude went down recently too. The real problem was a single point of failure in the part of our stack that customers, and their prospects, touch directly: sales conversations and won deals.
We fixed two things at once: added a new primary provider, and right-sized every model in the process.
Why Nebius Token Factory
We landed on Nebius Token Factory as our new primary for three reasons that matter specifically to us:
EU sovereignty. Nebius runs on EU infrastructure under EU jurisdiction. When we're selling to FinTech and HealthTech companies navigating GDPR and NIS2, "where does your AI inference run?" is a real question in security reviews. We now have a cleaner answer.
Zero data retention. Prompts and completions are not logged or retained for any purpose. Non-negotiable when security questionnaires and NDA-gated compliance materials flow through the system.
Open models at commodity prices. No licensing premium. We ended up running larger, better models at lower cost than before, which forced the model audit below.
The model audit
When you switch providers, you're forced to ask a question you should have asked earlier: is each AI call using the right model for the job?
Orbiq runs seven distinct AI tasks... for now. Here's what changed and why.
Language detection (Mistral Small 3.1 24B → Qwen3-14B): Three-way classification — English, German, or French. We were using a 24B multimodal model for it. Qwen3-14B handles this at lower latency and cost. One honest note: Mistral Small 3.1 actually beats Qwen3-14B on AIME-25 math reasoning (85.0 vs 73.7). For a language classifier, that doesn't matter.
Question parsing and evaluation (Mistral Medium 3 → Qwen3-30B-A3B-Instruct-2507): A MoE model with 30B total and 3B active parameters per pass. Faster than a dense equivalent, better structured output in our tests, roughly the same price.
Answer generation (Mistral Large 3 → Qwen3-235B-A22B-Instruct-2507): This one deserves the honest comparison. Mistral Large 3 is a genuinely strong model: 85.5% on MMLU, Chatbot Arena Elo ~1418, second among open-weight non-reasoning models. For general knowledge tasks, it's excellent. But on GPQA Diamond (graduate-level scientific reasoning), it scores 43.9%. Qwen3-235B-Instruct scores 95.6 on ArenaHard. For generating precise answers grounded in compliance documentation, the harder reasoning benchmark matters more. Output cost: $0.60/M vs $1.50/M.
Reasoning tasks (Magistral Medium → Qwen3-235B-A22B-Thinking-2507): Deep search and questionnaire evaluation — tasks that require actual chain-of-thought. The benchmarks are unambiguous:
| Benchmark | Magistral Medium | Qwen3-235B-Thinking |
|---|---|---|
| AIME-24 (pass@1) | 73.6% | 85.7% |
| GPQA Diamond | 70.83% | significantly higher |
| LiveCodeBench v5 | 59.36% | 70.7% |
Magistral Medium is a capable reasoning model. Qwen3-235B-Thinking is in a different tier. Output cost: $0.80/M vs $5.00/M.
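The result of an audit like this can be captured as a task-to-model routing table. The sketch below is illustrative only: the task keys, config shape, and model identifier strings are assumptions, not Orbiq's actual schema.

```python
# Illustrative task -> model routing table reflecting the audit above.
# Keys, structure, and identifier strings are hypothetical; the model
# choices mirror the ones described in the post.
MODEL_ROUTING = {
    "language_detection": {
        "primary": "Qwen3-14B",
        "fallback": "mistral-small-3.1-24b",
    },
    "question_parsing": {
        "primary": "Qwen3-30B-A3B-Instruct-2507",
        "fallback": "mistral-medium-3",
    },
    "answer_generation": {
        "primary": "Qwen3-235B-A22B-Instruct-2507",
        "fallback": "mistral-large-3",
    },
    "reasoning": {
        "primary": "Qwen3-235B-A22B-Thinking-2507",
        "fallback": "magistral-medium",
    },
}

def model_for(task: str, tier: str = "primary") -> str:
    """Resolve which model serves a given task at a given tier."""
    return MODEL_ROUTING[task][tier]
```

Keeping this mapping in config rather than code is what makes the per-task audit repeatable: the next audit is a diff on a table, not a refactor.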
The fallback architecture
Every AI call goes through a provider chain. Before each call, the system checks the provider's status page; if it reports a major or critical outage, that provider is skipped immediately and the decision is cached for five minutes. If the primary call fails at runtime, exponential backoff kicks in, then the call falls through to the next provider. Mistral stays in the chain as our fallback.
Adding a provider or swapping a model is a config change, not a code change.
What to take from this
Right-size your models. A 24B multimodal model for language classification is wasteful. Audit every AI call and ask what the minimum viable model actually is; the latency improvement alone is worth it for synchronous user-facing features.
MoE models are the production default now. Qwen3-235B-A22B activates 22B parameters per forward pass. You get a 235B model's knowledge at a fraction of the inference cost.
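The "fraction of the inference cost" claim follows from a standard back-of-envelope rule (an approximation, not a vendor figure): a transformer forward pass costs roughly 2 × N FLOPs per token, where N is the number of parameters actually activated.

```python
# Back-of-envelope per-token compute: ~2 FLOPs per active parameter.
# This is a common approximation, not a measured or vendor-published number.
def flops_per_token(active_params: float) -> float:
    return 2 * active_params

dense = flops_per_token(235e9)  # hypothetical dense 235B model
moe = flops_per_token(22e9)     # Qwen3-235B-A22B: 22B active per pass
ratio = dense / moe
print(f"MoE per-token compute is ~{ratio:.1f}x cheaper than dense")
# -> MoE per-token compute is ~10.7x cheaper than dense
```

Memory footprint still scales with total parameters, so MoE trades cheap compute for expensive VRAM, which is exactly the trade a high-throughput inference provider is positioned to absorb.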
Never benchmark on the wrong dimension. Mistral Large 3 wins on MMLU. Qwen3-235B wins on GPQA and ArenaHard. Your task profile determines which number matters.
Build the fallback in from day one. Between geopolitical turmoil and surging demand for compute, outages are only going to get more frequent.
Make data residency a first-class property. If you sell to regulated EU companies, "EU infrastructure, zero data retention, sovereignty-aligned provider" belongs in your security questionnaire responses. It's a product decision as much as an infrastructure one.
Orbiq is a trust center platform that helps EU companies in regulated industries close enterprise deals faster by centralizing security documentation, automating questionnaire responses, and giving buyers self-service access to compliance posture. orbiqhq.com