Run Local LLMs Across Multiple Macs with LiteLLM: A Multi-Machine Inference Setup Guide

If you’ve started running local LLMs on Apple Silicon, you know the appeal. Zero API costs, no rate limits, no sending your data to someone else’s server. A MacBook Air M4 with 24GB of unified memory can run a surprisingly capable model at 25-30 tokens per second.

But there’s a problem: one machine can only do so much. A 24GB Mac fits a strong coding model, but it can’t also run a lightweight model for background tasks without competing for memory. And sometimes your workload needs a frontier cloud model for tasks that genuinely require deeper reasoning.

I had two Macs on my desk: a MacBook Air M4 (24GB) and a Mac Mini M1 (16GB) running 24/7. For months they operated independently. Then I wired them together with LiteLLM as a unified proxy, routing different workloads to different hardware based on model size and task complexity. The MacBook Air handles the heavy local inference. The Mac Mini runs a lightweight model for fast, simple tasks and doubles as a Redis cache server. Cloud models handle whatever the local hardware can’t.

Here’s how to build the same setup.

The Architecture

The core idea: not every LLM request needs the same model. Simple tasks (reading a file, answering a quick question, generating a test) are perfect for a small, fast model. Complex coding tasks benefit from a larger model with more parameters. And frontier reasoning tasks need a cloud model.

Three model tiers, spread across three different compute targets:

Compute target	Type	Role
MacBook Air M4 (24GB)	Main machine	Runs LiteLLM proxy, primary local model (30B MoE, 19GB)
Mac Mini M1 (16GB)	Always-on server	Lightweight model (8B, 6GB), Redis caching
Cloud (any provider)	API subscription	Frontier reasoning when local models aren’t enough

How routing works

Your application or coding assistant sends every request through LiteLLM at localhost:4000. LiteLLM sees the model name you requested and routes it to the right backend using an ordered priority list:

Tier 1 (Simple/fast tasks): Hits the Mac Mini M1 running qwen3:8b first (fast, free, always on). Falls back to the same model on the MacBook Air if the Mini is unreachable, then to a cloud model.
Tier 2 (Main coding/complex tasks): Hits qwen3-coder:30b on the MacBook Air locally (no network hop). Falls back to a cloud model.
Tier 3 (Frontier reasoning): Cloud model only. No local fallback.
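
The priority lists above can be pictured as a lookup table plus a first-reachable scan. The sketch below only illustrates that idea and is not how LiteLLM is implemented internally; the TIERS table, the resolve_backend helper, and the backend labels are hypothetical names made up for this example.

```python
# Illustrative model of the tier priority lists; not LiteLLM internals.
# Backend labels are shorthand invented for this sketch.
TIERS = {
    "lightweight-tier": ["mac-mini:qwen3:8b", "macbook:qwen3:8b", "cloud"],
    "coding-tier": ["macbook:qwen3-coder:30b", "cloud"],
    "reasoning-tier": ["cloud"],  # cloud only, no local fallback
}


def resolve_backend(tier: str, reachable: set) -> str:
    """Return the first backend in the tier's priority list that is reachable."""
    for backend in TIERS[tier]:
        if backend in reachable:
            return backend
    raise RuntimeError(f"no reachable backend for {tier}")


# Mac Mini offline: the lightweight tier lands on the MacBook's copy of the model.
up = {"macbook:qwen3:8b", "macbook:qwen3-coder:30b", "cloud"}
print(resolve_backend("lightweight-tier", up))  # macbook:qwen3:8b
```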

This isn’t specific to Claude Code. Any application that speaks the OpenAI API format can use this setup: a web chat interface, a LangChain app, a coding assistant, or a script that automates text generation. They all send requests to localhost:4000, and LiteLLM handles the routing.

Why no local fallback for the reasoning tier

When you send a request that needs deep reasoning to a small local model, you get a response that looks plausible but lacks actual reasoning depth. You follow that bad plan for an hour before realizing the model couldn’t handle the complexity.

Silent degradation is more dangerous than an error, so the reasoning tier only falls back to another cloud model. If both cloud options fail, the request errors out, and your application knows something went wrong instead of proceeding with garbage output.
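
If you want to guard that rule against accidental edits, a small check over the parsed config is one option. This is a minimal sketch, assuming the YAML has already been loaded into Python dicts with whatever parser you use; local_entries is a hypothetical helper, and it treats any model string starting with ollama/ or ollama_chat/ as local.

```python
def local_entries(model_list: list, tier: str) -> list:
    """Return the ids of entries in `tier` that point at a local Ollama backend."""
    offenders = []
    for entry in model_list:
        if entry.get("model_name") != tier:
            continue
        params = entry.get("litellm_params", {})
        model = params.get("model", "")
        if model.startswith(("ollama/", "ollama_chat/")):
            # Prefer the human-readable id; fall back to the model string.
            offenders.append(entry.get("model_info", {}).get("id", model))
    return offenders
```

Calling local_entries(config["model_list"], "reasoning-tier") should return an empty list; anything else means a local model has crept into the cloud-only tier.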

Why These Models

Model selection on Apple Silicon isn’t about benchmarks. It’s about what fits in your unified memory with room for the OS, the KV cache, and realistic context windows.

MacBook Air M4 (24GB) — `qwen3-coder:30b`

This is a 30B-parameter Mixture-of-Experts (MoE) model that activates only 3.3B parameters per token. The MoE architecture is what makes this work on a consumer laptop. You get the knowledge breadth of a 30B dense model but the inference speed of something much smaller.

At Q4_K_M quantization it takes 19GB, leaving 5GB for macOS, Ollama, and KV cache. It runs at 25-30 tok/s on the M4 Air.

Why not a smaller generalist? A 4B or 8B general-purpose model handles grep-like tasks fine. But for anything involving multi-file analysis, dependency tracing, or architecture discovery, the difference is real. The 30B coding specialist gets it right the first time more often, which means fewer corrective turns and lower total cost per completed task.

Why not a 35B generalist? At 21GB it leaves only 3GB for everything else. Any meaningful context window will push you into swap. The 30B coder is 2GB lighter and specifically trained for the kind of tasks you’d actually run locally.

Mac Mini M1 (16GB) — `qwen3:8b`

With 16GB of unified memory, you have roughly 12-13GB available after macOS and Ollama. The 8B model at Q4 takes about 6GB, leaving 6-7GB for Redis, the KV cache, and other background processes. It runs at 40+ tok/s on the M1.

This handles lightweight tasks: reading files, searching codebases, answering quick questions about individual functions, generating simple tests. The kind of stuff that doesn’t need the 30B model but you still want to run locally for speed and cost.

Quick Reference

Device	RAM	Model	Size (Q4)	Active Params	Speed
MacBook Air M4	24GB	`qwen3-coder:30b`	19GB	3.3B/token	~27 tok/s
Mac Mini M1	16GB	`qwen3:8b`	~6GB	8B	~40 tok/s
Cloud	Variable	Any frontier model	N/A	N/A	Variable

Context Window Math

Context consumes KV cache memory linearly. The model weights are fixed, but the context window grows. On the 24GB machine with a 19GB model:

Context	KV Cache (approx.)	Total Memory	Status
4K	~1.5GB	~20.5GB	Comfortable
8K	~3GB	~22GB	Fine for typical work
16K	~5-6GB	~24-25GB	Tight, may swap with other apps
32K	~10-12GB	~29-31GB	Will swap, not viable

On the Mac Mini (16GB, 6GB model):

Context	KV Cache (approx.)	Total Memory	Status
4K	~1GB	~7GB	Comfortable
8K	~2GB	~8GB	Fine
16K	~4GB	~10GB	Tight
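
The two tables follow a simple linear model: fixed weights plus a KV cache that grows with context. Here is a rough sketch of that arithmetic. The per-1K rates (0.375GB for the 30B model, 0.25GB for the 8B model) are back-calculated from the approximate 4K rows above, and the memory budgets in the usage note are hypothetical, so treat the output as a planning estimate rather than a measurement.

```python
def total_memory_gb(weights_gb: float, kv_gb_per_1k: float, context_tokens: int) -> float:
    """Weights are fixed; the KV cache is assumed to grow linearly with context."""
    return weights_gb + kv_gb_per_1k * context_tokens / 1024


def largest_context(weights_gb, kv_gb_per_1k, budget_gb,
                    sizes=(4096, 8192, 16384, 32768)):
    """Return the largest context size in `sizes` that fits in budget_gb, or None."""
    fitting = [n for n in sizes
               if total_memory_gb(weights_gb, kv_gb_per_1k, n) <= budget_gb]
    return max(fitting) if fitting else None


# 30B model (19GB weights) at 8K context, matching the table row above.
print(total_memory_gb(19, 0.375, 8192))  # 22.0
```

With a hypothetical 22GB budget on the MacBook Air, largest_context(19, 0.375, 22) returns 8192, which lines up with the 8K recommendation.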

Practical advice: configure 8K context windows for local models. If a request exceeds that, LiteLLM’s context_window_fallbacks escalates it to the next tier in the chain, which ends at a cloud model.

Prerequisites

Ollama installed on both Macs (ollama.com)
Redis on the Mac Mini (brew install redis && brew services start redis)
Python 3.10+ with uv installed (pip install uv or brew install uv)
A cloud LLM API key (OpenAI, Anthropic, Google, or any provider LiteLLM supports)
Both Macs on the same home network

Step 1: Set Up the Mac Mini M1

The Mac Mini is your always-on server. It runs the lightweight model and Redis.

1.1 Install Ollama and Redis

# On the Mac Mini
brew install ollama redis
brew services start redis

1.2 Pull the lightweight model

ollama pull qwen3:8b

This downloads the 8B model at Q4, roughly 6GB.

1.3 Expose Ollama on the network

By default, Ollama only listens on localhost. To make it reachable from the MacBook Air:

# Add to ~/.zshrc on the Mac Mini
export OLLAMA_HOST="0.0.0.0:11434"

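Open a new terminal so the variable is set, then restart Ollama (stop any running ollama serve and start it again). If you run the Ollama menu-bar app instead of ollama serve, it does not read ~/.zshrc; set the variable with launchctl setenv OLLAMA_HOST "0.0.0.0:11434" and relaunch the app. Then verify from the MacBook Air (replace the IP with your Mac Mini’s actual address):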

curl http://192.168.1.100:11434/api/tags

You should see a JSON response listing the models on the Mac Mini.
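
If you would rather script this check, the same request can be made from Python with only the standard library. This is a minimal sketch that assumes the /api/tags response is a JSON object with a models list whose items carry a name field; model_names and list_remote_models are hypothetical helper names.

```python
import json
import urllib.request


def model_names(tags_payload: dict) -> list:
    """Extract model names from an Ollama /api/tags response body."""
    return [m["name"] for m in tags_payload.get("models", [])]


def list_remote_models(base_url: str, timeout: float = 5.0) -> list:
    """Fetch /api/tags from an Ollama server and return the model names."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_names(json.load(resp))
```

From the MacBook Air, list_remote_models("http://192.168.1.100:11434") should include qwen3:8b once the Mac Mini is reachable.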

Security note: Exposing Ollama on your LAN means any device on the network can reach it, including management endpoints that can pull or delete models. Fine for a home network. In a shared space, put a reverse proxy with basic auth in front of it.

1.4 Find the Mac Mini’s IP

# On the Mac Mini
ipconfig getifaddr en0

Note this address. You’ll use it in the LiteLLM config.

Step 2: Set Up the MacBook Air M4

This is where LiteLLM runs and where the primary local inference happens.

2.1 Pull the primary model

ollama pull qwen3-coder:30b

This downloads the 30B MoE model at Q4, roughly 19GB. Expect 25-30 tok/s on the M4 Air.

2.2 Pull a local fallback for the lightweight tier

If the Mac Mini goes down, you still want fast inference for simple tasks:

ollama pull qwen3:8b

Same model as the Mac Mini. Only loaded if the Mini is unreachable.

Step 3: Install and Configure LiteLLM

LiteLLM speaks the OpenAI API format on the frontend (what your applications send) and translates to whatever provider format is needed on the backend (Ollama, OpenAI, Anthropic, etc.).

3.1 Install LiteLLM

# On the MacBook Air
uv tool install 'litellm[proxy]'

This installs the latest stable version (1.80.x as of this writing).

3.2 The config

Create ~/litellm-config.yaml on the MacBook Air:

model_list:
  # =============================================
  # TIER 1 - Lightweight tasks
  # Routed to: Mac Mini M1 (network) -> M4 fallback -> cloud
  # =============================================
  - model_name: lightweight-tier
    litellm_params:
      model: ollama_chat/qwen3:8b
      api_base: http://192.168.1.100:11434
      extra_body:
        options:
          num_ctx: 8192
      timeout: 30
      order: 1
    model_info:
      id: "lightweight-macmini"
      max_tokens: 4096
  - model_name: lightweight-tier
    litellm_params:
      model: ollama_chat/qwen3:8b
      api_base: http://localhost:11434
      extra_body:
        options:
          num_ctx: 8192
      timeout: 30
      order: 2
    model_info:
      id: "lightweight-macbook"
      max_tokens: 4096
  - model_name: lightweight-tier
    litellm_params:
      model: openai/gpt-4.1-mini
      api_key: os.environ/OPENAI_API_KEY
      timeout: 60
      order: 3
    model_info:
      id: "lightweight-cloud"
      max_tokens: 4096
  # =============================================
  # TIER 2 - Main coding/complex tasks
  # Routed to: MacBook Air M4 (localhost) -> cloud
  # =============================================
  - model_name: coding-tier
    litellm_params:
      model: ollama_chat/qwen3-coder:30b
      api_base: http://localhost:11434
      extra_body:
        options:
          num_ctx: 8192
      timeout: 120
      order: 1
    model_info:
      id: "coding-local"
      max_tokens: 8192
  - model_name: coding-tier
    litellm_params:
      model: openai/gpt-4.1
      api_key: os.environ/OPENAI_API_KEY
      timeout: 120
      order: 2
    model_info:
      id: "coding-cloud"
      max_tokens: 8192
  # =============================================
  # TIER 3 - Frontier reasoning (cloud only)
  # No local fallback. Silent degradation to a
  # weaker model on reasoning tasks is worse than
  # a clear error.
  # =============================================
  - model_name: reasoning-tier
    litellm_params:
      model: openai/o3-mini
      api_key: os.environ/OPENAI_API_KEY
      timeout: 180
      order: 1
    model_info:
      id: "reasoning-cloud-primary"
      max_tokens: 16384
  - model_name: reasoning-tier
    litellm_params:
      model: openai/gpt-4.1
      api_key: os.environ/OPENAI_API_KEY
      timeout: 180
      order: 2
    model_info:
      id: "reasoning-cloud-secondary"
      max_tokens: 16384
general_settings:
  # The proxy reads master_key from general_settings
  master_key: os.environ/LITELLM_MASTER_KEY
litellm_settings:
  drop_params: true
  drop_unsupported_params: true
  num_retries: 2
  request_timeout: 60
  allowed_fails: 3
  cooldown_time: 30
  # Redis cache on the Mac Mini
  cache: true
  cache_params:
    type: "redis"
    host: "192.168.1.100"
    port: 6379
    ttl: 3600
  fallbacks:
    - { "lightweight-tier": ["coding-tier"] }
    - { "coding-tier": ["reasoning-tier"] }
  context_window_fallbacks:
    - { "lightweight-tier": ["coding-tier", "reasoning-tier"] }
    - { "coding-tier": ["reasoning-tier"] }
router_settings:
  num_retries: 2
  timeout: 60

Replace 192.168.1.100 with your Mac Mini’s actual IP. Replace the OpenAI model names with whatever provider you prefer. LiteLLM supports 100+ providers, so swap in Anthropic, Google, or any other service your API key works with.

The order parameter controls fallback priority within each tier. LiteLLM tries order 1 first. If it fails (connection refused, timeout, too many errors), it tries order 2, then order 3.
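
To make the interplay between order, allowed_fails, and cooldown_time easier to picture, here is a toy simulation. It is a simplified mental model under my own assumptions, not LiteLLM's actual router (which also tracks time windows and retries); Deployment and TierRouter are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    name: str
    order: int
    fails: int = 0
    cooldown_until: float = 0.0


class TierRouter:
    """Toy model: try deployments by ascending order, skipping any in cooldown."""

    def __init__(self, deployments, allowed_fails=3, cooldown_time=30.0):
        self.deployments = sorted(deployments, key=lambda d: d.order)
        self.allowed_fails = allowed_fails
        self.cooldown_time = cooldown_time

    def pick(self, now: float) -> Deployment:
        """Return the lowest-order deployment that is not cooling down."""
        for d in self.deployments:
            if now >= d.cooldown_until:
                return d
        raise RuntimeError("all deployments are cooling down")

    def record_failure(self, d: Deployment, now: float) -> None:
        """Count a failure and bench the deployment once it exceeds allowed_fails."""
        d.fails += 1
        if d.fails > self.allowed_fails:
            d.cooldown_until = now + self.cooldown_time
            d.fails = 0
```

In this model, four failures on the Mac Mini entry push lightweight-tier traffic to the MacBook entry for 30 seconds, after which the Mini is tried first again.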

3.3 Environment variables

# Add to ~/.zshrc on the MacBook Air
export LITELLM_MASTER_KEY="sk-your-random-string-here"
export OPENAI_API_KEY="your-api-key-here"

The LITELLM_MASTER_KEY can be any random string. It authenticates requests to your LiteLLM proxy.
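
One quick way to produce such a string is Python's secrets module. A minimal sketch, keeping the sk- prefix used in this article; new_master_key is a hypothetical helper name.

```python
import secrets


def new_master_key(nbytes: int = 32) -> str:
    """Return a random key with an "sk-" prefix, built from nbytes of randomness."""
    return "sk-" + secrets.token_urlsafe(nbytes)


print(new_master_key())
```

Paste the printed value into the export line above instead of the placeholder.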

3.4 Start LiteLLM

litellm --config ~/litellm-config.yaml

You should see output indicating the proxy is running on http://0.0.0.0:4000.

3.5 Verify

Test all three tiers:

export LITELLM_MASTER_KEY="sk-your-random-string-here"
# Tier 1: lightweight (Mac Mini M1, qwen3:8b)
curl -X POST http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"lightweight-tier","max_tokens":100,"messages":[{"role":"user","content":"What does os.path.join do in Python?"}]}'
# Tier 2: coding (MacBook Air M4, qwen3-coder:30b)
curl -X POST http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"coding-tier","max_tokens":100,"messages":[{"role":"user","content":"Refactor this Express middleware to use async/await"}]}'
# Tier 3: reasoning (cloud)
curl -X POST http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"reasoning-tier","max_tokens":100,"messages":[{"role":"user","content":"Design a microservice architecture for a real-time collaboration system"}]}'

Check the LiteLLM dashboard at http://localhost:4000/ui to see request logs, latency, and which backend each request hit.

Using It From Your Application

This setup works with anything that speaks the OpenAI API format. Point your application’s base URL to http://localhost:4000/v1 and use the tier model names.

Python example with OpenAI SDK

from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-your-litellm-master-key"
)
# Simple question - routes to Mac Mini (qwen3:8b)
response = client.chat.completions.create(
    model="lightweight-tier",
    messages=[{"role": "user", "content": "What is a closure in JavaScript?"}]
)
# Complex coding task - routes to MacBook Air (qwen3-coder:30b)
response = client.chat.completions.create(
    model="coding-tier",
    messages=[{"role": "user", "content": "Write a Redis caching layer for this Express API"}]
)
# Deep reasoning - routes to cloud
response = client.chat.completions.create(
    model="reasoning-tier",
    messages=[{"role": "user", "content": "Analyze the trade-offs between event sourcing and CQRS for an e-commerce system"}]
)

Shell script example

#!/bin/bash
# A simple automation script that uses local LLMs
API_URL="http://localhost:4000/v1/chat/completions"
API_KEY="sk-your-litellm-master-key"
# Generate a commit message from git diff
DIFF=$(git diff --staged)
if [ -z "$DIFF" ]; then
    echo "No staged changes"
    exit 1
fi
# Build the JSON body with python3 so quotes and newlines in the diff are escaped
PAYLOAD=$(printf '%s' "$DIFF" | python3 -c '
import json, sys
print(json.dumps({
    "model": "coding-tier",
    "max_tokens": 200,
    "messages": [
        {"role": "system", "content": "Write a concise git commit message. Output only the message, nothing else."},
        {"role": "user", "content": sys.stdin.read()},
    ],
}))')
# Send the request and extract the message text from the response
COMMIT_MSG=$(curl -s "$API_URL" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" | python3 -c "import sys,json; print(json.load(sys.stdin)['choices'][0]['message']['content'])")
echo "$COMMIT_MSG"

Integration with coding assistants

Most coding assistants that support custom API endpoints work with this setup. Set the base URL to http://localhost:4000/v1, configure the API key to your LiteLLM master key, and create one configuration per tier. The assistant sends each request with a model name that matches one of your tier names, and LiteLLM routes accordingly.

Caching with Redis

Many LLM workloads have high repetition. You ask about the same function, the same error message, the same architecture across multiple requests. Without caching, every one of those hits the model.

Redis caching in LiteLLM stores responses keyed by the exact request. It’s already configured in the config above. On the Mac Mini, verify it’s running:

redis-cli ping
# Should return: PONG
# Monitor cache activity
redis-cli monitor | grep litellm

The TTL is 3600 seconds (one hour). Adjust shorter for code that changes frequently, longer for stable documentation queries.
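
To see what "keyed by the exact request" implies, here is a small in-memory sketch of exact-match caching with a TTL. It illustrates the general technique only; the hashing scheme and the TTLCache class are my own assumptions, not LiteLLM's actual key format or Redis layout.

```python
import hashlib
import json
import time
from typing import Optional


def cache_key(request: dict) -> str:
    """Hash a canonical JSON form of the request so equal requests share a key."""
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory stand-in for Redis: entries expire ttl seconds after being set."""

    def __init__(self, ttl: float = 3600.0):
        self.ttl = ttl
        self._store = {}

    def set(self, request: dict, response: str, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._store[cache_key(request)] = (now + self.ttl, response)

    def get(self, request: dict, now: Optional[float] = None) -> Optional[str]:
        now = time.time() if now is None else now
        entry = self._store.get(cache_key(request))
        if entry is None or now >= entry[0]:
            return None  # miss, or the entry has expired
        return entry[1]
```

Note the consequence: changing a single character in the prompt produces a different key, which is why exact-match hit rates stay modest.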

In practice, roughly 15-20% of lightweight-tier requests and 5-10% of coding-tier requests hit cache during a typical session. It compounds over a full day of work.

Observability: What to Actually Measure

The LiteLLM dashboard shows request logs and latency. But to know whether this setup is actually saving you money, track these metrics:

Fallback frequency. How often do requests escalate from local to cloud? If the lightweight tier hits cloud 40% of the time, your local model isn’t handling what you expected.
Latency percentiles. Track p50 and p95 per tier. A local model averaging 25 tok/s but with p95 at 5 tok/s (due to model swapping after idle periods) is worse than a consistent cloud model.
Task completion rate. How often does the model get it right on the first try? If your local model handles 80% of requests but 50% of those need correction, the economics look very different.
Context exhaustion events. How often do requests exceed the local context window and fall back to cloud? Frequent overflow means either your num_ctx is too small or you need more headroom.

The danger zone: 80% of requests hit local, but 80% of corrections come from those same local requests. The token savings evaporate. Cost per token is a vanity metric. Cost per completed task is what matters.
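
These metrics are easy to compute once you log one record per request. The sketch below assumes you keep such records yourself; the RequestLog fields are hypothetical and do not correspond to any LiteLLM export format. Percentiles use the simple nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestLog:
    tier: str
    served_by_cloud: bool
    latency_s: float
    cost_usd: float
    task_id: str
    task_completed: bool  # True on the request that finally solved the task


def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def fallback_rate(logs: list, tier: str) -> float:
    """Share of a tier's requests that ended up on a cloud backend."""
    tier_logs = [r for r in logs if r.tier == tier]
    return sum(r.served_by_cloud for r in tier_logs) / len(tier_logs)


def cost_per_completed_task(logs: list) -> float:
    """Total spend divided by the number of tasks that were actually completed."""
    completed = {r.task_id for r in logs if r.task_completed}
    return sum(r.cost_usd for r in logs) / len(completed)
```

Tracking cost_per_completed_task alongside fallback_rate is what exposes the danger zone described above: a low fallback rate can coexist with a high cost per finished task.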

Making It Persistent

Ollama on the Mac Mini (launchd):

Create ~/Library/LaunchAgents/com.ollama.server.plist:

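<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.ollama.server</string>
    <key>ProgramArguments</key>
    <array>
        <string>/opt/homebrew/bin/ollama</string>
        <string>serve</string>
    </array>
    <key>EnvironmentVariables</key>
    <dict>
        <key>OLLAMA_HOST</key>
        <string>0.0.0.0:11434</string>
    </dict>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
</dict>
</plist>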

launchctl load ~/Library/LaunchAgents/com.ollama.server.plist

Redis on the Mac Mini (already persistent if started with brew services):

brew services start redis

LiteLLM on the MacBook Air (tmux):

tmux new-session -d -s litellm 'litellm --config ~/litellm-config.yaml'

Or create a similar launchd plist.

Best Practices

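1. Keep models warm. Ollama unloads models from memory when idle (after about five minutes by default). The first request after a cold start is slow. Add a keepalive cron that sends a prompt-less /api/generate request, which loads the model and resets its idle timer:

*/5 * * * * curl -s http://localhost:11434/api/generate -d '{"model":"qwen3-coder:30b","keep_alive":"10m"}' > /dev/null
*/5 * * * * curl -s http://192.168.1.100:11434/api/generate -d '{"model":"qwen3:8b","keep_alive":"10m"}' > /dev/null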

Or set OLLAMA_KEEP_ALIVE=10m in your environment to keep models loaded longer.
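
If cron is not convenient, a small script can ask Ollama to load a model and keep it resident. This sketch relies on Ollama's /api/generate endpoint accepting a request with only a model and a keep_alive duration; keep_alive_request and warm are hypothetical helper names, and the 10m value is just an example.

```python
import json
import urllib.request


def keep_alive_request(base_url: str, model: str, keep_alive: str = "10m") -> urllib.request.Request:
    """Build a prompt-less /api/generate call that loads the model and resets its idle timer."""
    body = json.dumps({"model": model, "keep_alive": keep_alive}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm(base_url: str, model: str, keep_alive: str = "10m", timeout: float = 60.0) -> int:
    """Send the keep-alive request and return the HTTP status code."""
    request = keep_alive_request(base_url, model, keep_alive)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status
```

For example, warm("http://192.168.1.100:11434", "qwen3:8b") targets the Mac Mini, and warm("http://localhost:11434", "qwen3-coder:30b") targets the MacBook Air.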

2. Set appropriate timeouts per tier. Lightweight (local 8B) gets 30 seconds. Coding (local 30B MoE) gets 120 seconds. Reasoning (cloud) gets 180 seconds. Already configured in the YAML above.

3. Close other apps during heavy sessions. The MacBook Air’s 24GB is shared between the model, macOS, and your apps. Running Chrome with 20 tabs while generating a 16K-context response is asking for swap.

4. Benchmark your actual context usage. Don’t assume 8K is right. Run a few sessions and check ollama ps to see how much memory the model actually uses with your typical prompts. Adjust num_ctx up or down based on real data.

5. Use semantic caching for more savings. LiteLLM supports a Redis-backed semantic cache that matches near-duplicate requests, not just exact matches. This catches repeated patterns like “explain this function” across slightly different wordings. Enable it by setting type: "redis-semantic" in cache_params, together with a similarity_threshold and an embedding model (redis_semantic_cache_embedding_model). Check the LiteLLM caching docs for the exact keys in your version.
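
To show the idea behind near-duplicate matching, here is a toy semantic cache. It swaps in a bag-of-words vector for a real embedding model, so it only catches overlapping wording, and it is not how LiteLLM implements the feature; embed, cosine, and SemanticCache are hypothetical names, and the 0.8 threshold is an arbitrary example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real setup would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Return a stored response when a new prompt is similar enough to an old one."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (vector, response) pairs

    def set(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), response))

    def get(self, prompt: str):
        query = embed(prompt)
        best = max(self.entries, key=lambda e: cosine(query, e[0]), default=None)
        if best is not None and cosine(query, best[0]) >= self.threshold:
            return best[1]
        return None
```

The threshold is the main tuning knob: set it too low and unrelated prompts share answers, too high and it degenerates into exact matching.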

Troubleshooting

“model not found” errors: Verify the model name in your request matches a model_name in the LiteLLM config. Names must match exactly.

Mac Mini unreachable from MacBook Air: Ping the Mac Mini’s IP. Verify OLLAMA_HOST="0.0.0.0:11434" is set and Ollama is running on the Mini.

Context window errors from local models: The context_window_fallbacks in the config handle this automatically. If you still see errors, verify the fallback chain is correctly configured.

Slow first response: Ollama unloads idle models. Use the keepalive cron or OLLAMA_KEEP_ALIVE.

Redis connection refused: Verify Redis is running on the Mac Mini (redis-cli ping). LiteLLM falls back gracefully to no caching if Redis is unreachable.
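
For the two network-related problems above, a quick TCP check from the MacBook Air tells you whether the port is reachable at all before you dig into configs. A minimal sketch using only the standard library; port_open is a hypothetical helper name.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False
```

Try port_open("192.168.1.100", 11434) for Ollama and port_open("192.168.1.100", 6379) for Redis. If Redis is unreachable but running, check whether it is bound to localhost only.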

What’s Next

Add a 32GB+ Mac. A Mac Mini M4 Pro with 32GB lets you run the 30B model at Q5_K_M (near-lossless quantization) with 16K context, or run a larger generalist model alongside the coding specialist.

Add more machines. Each additional Mac on your network is another inference node. Add them as new entries in the LiteLLM model_list with appropriate order values.

Build a complexity classifier. Instead of hard-coding which tier to use, build a lightweight classifier that routes based on prompt length, token count, and task type. Route automatically based on complexity rather than application-level model selection.
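
A first version of such a classifier can be a few heuristics. The sketch below is deliberately crude: the keyword lists, the 500-token cutoff, and the four-characters-per-token estimate are all assumptions I picked for illustration, and substring matching will misfire on some prompts. choose_tier and estimate_tokens are hypothetical names.

```python
REASONING_HINTS = ("architecture", "trade-off", "tradeoff", "design", "strategy")
CODING_HINTS = ("refactor", "implement", "write", "fix", "debug")


def estimate_tokens(text: str) -> int:
    """Very rough token estimate: about four characters per token."""
    return max(1, len(text) // 4)


def choose_tier(prompt: str, local_ctx: int = 8192) -> str:
    """Pick a tier name from prompt size and a few keyword hints."""
    lowered = prompt.lower()
    tokens = estimate_tokens(prompt)
    # Too big for the local context window, or it smells like deep reasoning.
    if tokens > local_ctx or any(h in lowered for h in REASONING_HINTS):
        return "reasoning-tier"
    # Long prompts and coding verbs go to the 30B coding model.
    if tokens > 500 or any(h in lowered for h in CODING_HINTS):
        return "coding-tier"
    return "lightweight-tier"
```

The returned string can be passed straight through as the model name in a request to the proxy.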

Semantic caching. LiteLLM supports Redis-backed semantic cache that catches near-duplicate requests beyond exact string matching. This would catch more repeated patterns across differently worded prompts.

Honest Economics

The setup isn’t free. You need two Macs (or at least one Mac plus another machine that can run Ollama). You need a cloud API subscription for the reasoning tier. And there’s your time to set it all up.

The savings come from two places: local inference on tasks that don’t need a frontier model, and Redis caching on repeated requests. In practice, the local coding tier handles maybe 60-70% of coding requests without hitting the cloud. The lightweight tier handles 80-90% of simple tasks locally. Cloud usage drops to genuine frontier tasks.

But here’s the nuance. Local models are slower and less capable than cloud frontier models. A task a frontier model handles in one turn might take two or three turns with a local 30B model. The token savings are partially offset by extra turns. Whether this saves money depends on your workload: if most requests are straightforward coding tasks, the savings are substantial. If most are architectural or exploratory, you hit the cloud more.
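
To get a feel for how the workload mix drives the result, you can estimate what share of requests still reaches the cloud. This is back-of-envelope arithmetic only: it counts requests rather than dollars, and the 50/40/10 mix in the example is hypothetical. The local fractions are midpoints of the ranges quoted above; remaining_cloud_share is a hypothetical helper name.

```python
def remaining_cloud_share(mix: dict) -> float:
    """mix maps tier -> (share of all requests, fraction handled locally).

    Returns the fraction of all requests that still reach a cloud model.
    """
    total_share = sum(share for share, _ in mix.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("request shares must sum to 1")
    return sum(share * (1.0 - local) for share, local in mix.values())


example_mix = {
    "lightweight-tier": (0.5, 0.85),  # hypothetical 50% of traffic, 85% stays local
    "coding-tier": (0.4, 0.65),       # hypothetical 40% of traffic, 65% stays local
    "reasoning-tier": (0.1, 0.0),     # always cloud
}
print(round(remaining_cloud_share(example_mix), 3))  # 0.315
```

Shift the mix toward reasoning-heavy work and the cloud share climbs quickly, which is the point made above about architectural or exploratory workloads.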

After running this setup, cloud API costs dropped noticeably compared to routing everything through a single cloud model. The exact savings depend on what you’re building. But the system is transparent about where every request goes, and you can tune the routing to match your actual usage patterns.
