
How to Set Up Ollama Failover Between an RTX 5090 and a Mac

The problem isn't the fallback idea itself, but that tools like openclaw handle long timeouts and provider switching poorly. For a self-hosted RTX 5090 + Mac setup, LiteLLM is a better choice. It offers more practical failover, timeout, and routing management, making the system reliable and predictable for AI automation tasks.

Where the openclaw setup breaks down

I looked into a use case with a gaming PC running qwen on an RTX 5090 and a MacBook M5 Pro as a backup. The scenario is very common: during the day, my local model flies on the powerful GPU; in the evening, I shut down Ollama on the desktop, and requests should seamlessly redirect to the Mac. On paper, this looks like a standard failover. In practice, it leads to hanging heartbeats, LLM API timeouts, and the feeling that the router just doesn't believe the first backend is dead.

And here's my main takeaway: the problem most likely isn't with Ollama itself, but with the proxy layer. If a proxy waits too long for a socket, can't aggressively cut timeouts, or lacks a proper health-check model, it will hang on a dead node for 40 minutes. This is especially painful for cron jobs and background agents because the pipeline seems alive when it's actually not.

I wouldn't build such a system around openclaw/lm-proxy if the goal is reliable, automatic switching. The idea of 'try connection1, except connection2' is sound, but it's not enough for production. You need short timeouts, retries, backend status checks, and clear routing.
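The "try connection1, except connection2" idea can be made robust if each backend gets a short, bounded number of attempts before the router moves on. Here is a minimal, library-agnostic sketch of that pattern (the function name and structure are mine, not from any particular proxy):

```python
from typing import Callable, Sequence


def call_with_failover(
    backends: Sequence[Callable[[], str]],
    retries_per_backend: int = 1,
) -> str:
    """Try each backend in order. A backend gets a limited number of
    attempts (1 + retries_per_backend) before we move on to the next
    one, instead of hanging on a dead node indefinitely."""
    last_error = None
    for backend in backends:
        for _attempt in range(retries_per_backend + 1):
            try:
                return backend()
            except Exception as exc:  # timeout, connection refused, ...
                last_error = exc
    raise RuntimeError("all backends failed") from last_error
```

In a real setup each callable would be an HTTP request with an explicit short timeout (e.g. `urllib.request.urlopen(..., timeout=20)`), so a dead node costs you seconds, not forty minutes.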

What I would use instead

For this kind of self-hosted setup, I would use LiteLLM as a unified OpenAI-compatible proxy in front of the two Ollama instances. One Ollama runs on the RTX 5090, the other on the MacBook M5 Pro. Clients, agents, cron jobs, and everything else communicate with LiteLLM, not Ollama directly.

Why I like this approach: I change the endpoint in my applications once and then manage all the routing in one place. I can set the desktop as the primary backend, the Mac as a fallback, and even add a third layer if needed—a cloud service like OpenRouter. This isn't a makeshift AI integration anymore; it's a proper transport layer for local models.
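From the application's point of view, "change the endpoint once" means every client just sends an OpenAI-compatible request to the proxy's address and a model alias. A stdlib-only sketch (the URL and the `local-chat` alias are hypothetical placeholders for your own setup):

```python
import json
import urllib.request

# Hypothetical address of the LiteLLM proxy on your network.
LITELLM_URL = "http://127.0.0.1:4000/v1/chat/completions"


def build_chat_request(url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-compatible chat completion request aimed at the
    proxy. The client only knows the alias (e.g. 'local-chat'); the
    proxy decides whether the RTX 5090 or the Mac actually serves it."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending it (requires a running proxy), with a deliberately short timeout:
# with urllib.request.urlopen(build_chat_request(LITELLM_URL, "local-chat", "hi"),
#                             timeout=20) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Swapping backends, adding a cloud fallback, or tightening timeouts then never touches application code, only the proxy configuration.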

The basic config idea is as follows:

  • primary: Ollama on RTX 5090
  • fallback: Ollama on Mac M5
  • timeout: 15-30 seconds, no more
  • retries: 1-2; anything more just stretches out the total delay
  • a common model alias for the client, so it doesn't care where the model is actually running
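In LiteLLM's config file that idea looks roughly like the sketch below. The IP addresses and the `local-chat` alias are placeholders, and exact key names can differ between LiteLLM versions, so check the current docs before copying:

```yaml
model_list:
  - model_name: local-chat                    # the one alias clients see
    litellm_params:
      model: ollama/qwen3.5:35b-a3b
      api_base: http://192.168.1.10:11434     # RTX 5090 desktop (example address)
  - model_name: local-chat-mac
    litellm_params:
      model: ollama/qwen3.5:27b
      api_base: http://192.168.1.20:11434     # MacBook (example address)

litellm_settings:
  request_timeout: 20        # seconds; fail fast instead of hanging
  num_retries: 1
  fallbacks:
    - local-chat: ["local-chat-mac"]
```

The client always asks for `local-chat`; when the desktop stops answering within the timeout, the proxy retries once and then routes the request to the Mac.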

If your qwen3.5:35b-a3b on the PC doesn't match the name or resources of qwen3.5:27b on the Mac, I wouldn't hide this completely. It's better to give a logical name to the task class rather than pretending the models are identical. Otherwise, you'll encounter strange discrepancies in responses later and end up debugging model behavior instead of network issues.

What this changes for business and automation

The most valuable benefit here isn't saving money on cloud services, but predictability. When AI automation relies on local inference, you can't let a pipeline wait half an hour for a dead GPU. A business doesn't need hardware heroism; it needs a request route that doesn't hang.

Teams with multiple nodes and a clear AI architecture win: a powerful machine for heavy tasks, a weaker one as a backup, and the cloud as an emergency layer. Those who hardcode switching logic into every script lose. I've seen it before: it starts with 'let's just hardcode it for now,' and six months later, no one remembers why the nightly jobs only fail on Fridays.

At Nahornyi AI Lab, we often build such AI solutions for businesses: local models, proxies, queues, fallbacks, monitoring, and secure service publishing to the internal network. And almost always, the problem isn't the model itself. The glue between components breaks—timeouts, routing, health checks, queue states.

If done right, the PC + Mac setup is perfectly viable. I would put LiteLLM in as a layer, give each Ollama a short timeout, add a simple health endpoint, and separately test how streaming, cron, and long generations behave. One evening of setup turns magic into a manageable AI solutions architecture.
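A health check against Ollama doesn't have to be elaborate: Ollama exposes `GET /api/tags`, and a bounded probe of that endpoint is enough to tell "up" from "down" quickly. A minimal stdlib sketch:

```python
import json
import urllib.error
import urllib.request


def ollama_is_up(base_url: str, timeout: float = 2.0) -> bool:
    """Probe Ollama's /api/tags endpoint. Any connection error, HTTP
    error, invalid JSON, or response slower than `timeout` seconds
    counts as 'down'."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            json.load(resp)  # valid JSON means the daemon answered properly
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError, ValueError, OSError):
        return False
```

Run this from cron or a monitoring agent against both machines, and the "router doesn't believe the backend is dead" problem turns into an explicit, observable signal.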

I'm Vadim Nahornyi from Nahornyi AI Lab, and I've shared this breakdown based on real-world experience—I don't just bookmark these self-hosted setups; I actually build, debug, and integrate them into workflows.

If you'd like, I can help analyze your specific stack: what to keep local, how to implement AI without fragile workarounds, and where to place a backup route. Contact us at Nahornyi AI Lab—we'll look at your case in detail.
