If you’ve started running local LLMs on Apple Silicon, you know the appeal. Zero API costs, no rate limits, no sending your data to someone else’s server. A MacBook Air M4 with 24GB of unified memory can run a surprisingly capable model at 25-30 tokens per second.
But there’s a problem: one machine can only do so much. A 24GB Mac fits a strong coding model, but it can’t also run a lightweight model for background tasks without competing for memory. And sometimes your workload needs a frontier cloud model for tasks that genuinely require deeper reasoning.
I had two Macs on my desk: a MacBook Air M4 (24GB) and a Mac Mini M1 (16GB) running 24/7. For months they operated independently. Then I wired them together with LiteLLM as a unified proxy, routing different workloads to different hardware based on model size and task complexity. The MacBook Air handles the heavy local inference. The Mac Mini runs a lightweight model for fast, simple tasks and doubles as a Redis cache server. Cloud models handle whatever the local hardware can’t.
Here’s how to build the same setup.
The Architecture
The core idea: not every LLM request needs the same model. Simple tasks (reading a file, answering a quick question, generating a test) are perfect for a small, fast model. Complex coding tasks benefit from a larger model with more parameters. And frontier reasoning tasks need a cloud model.
Three model tiers, spread across three different compute targets:
| Component | Type | Role |
|---|---|---|
| MacBook Air M4 (24GB) | Main machine | Runs LiteLLM proxy, primary local model (30B MoE, 19GB) |
| Mac Mini M1 (16GB) | Always-on server | Lightweight model (8B, 6GB), Redis caching |
| Cloud (any provider) | API subscription | Frontier reasoning when local models aren’t enough |
How routing works
Your application or coding assistant sends every request through LiteLLM at localhost:4000. LiteLLM sees the model name you requested and routes it to the right backend using an ordered priority list:
- Tier 1 (Simple/fast tasks): Hits the Mac Mini M1 running qwen3:8b first (fast, free, always on). Falls back to the same model on the MacBook Air if the Mini is unreachable, then to a cloud model.
- Tier 2 (Main coding/complex tasks): Hits qwen3-coder:30b on the MacBook Air locally (no network hop). Falls back to a cloud model.
- Tier 3 (Frontier reasoning): Cloud model only. No local fallback.
None of this is specific to Claude Code or any other single tool. Any application that speaks the OpenAI API format can use this setup: a web chat interface, a LangChain app, a coding assistant, a script that automates text generation. They all send requests to localhost:4000, and LiteLLM handles the routing.
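The ordered fallback described above can be sketched in a few lines of Python. This is only an illustration of the idea, not LiteLLM's implementation: the Deployment class, the route function, and the stub backends are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Deployment:
    """One backend inside a tier (illustrative, not a LiteLLM type)."""
    name: str
    order: int
    call: Callable[[str], str]  # takes a prompt, returns a completion


def route(deployments: list[Deployment], prompt: str) -> tuple[str, str]:
    """Try backends in ascending order; return (backend name, reply)."""
    errors = []
    for dep in sorted(deployments, key=lambda d: d.order):
        try:
            return dep.name, dep.call(prompt)
        except Exception as exc:  # connection refused, timeout, ...
            errors.append(f"{dep.name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


def unreachable(prompt: str) -> str:
    """Stub that simulates the Mac Mini being offline."""
    raise ConnectionError("Mac Mini unreachable")


tier1 = [
    Deployment("lightweight-cloud", 3, lambda p: "cloud reply"),
    Deployment("lightweight-macmini", 1, unreachable),
    Deployment("lightweight-macbook", 2, lambda p: "local reply"),
]

backend, reply = route(tier1, "What does os.path.join do?")
print(backend, "->", reply)
```

With the Mini stubbed as offline, the request lands on the MacBook Air copy of the model before the cloud entry is ever tried.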
Why no local fallback for the reasoning tier
When you send a request that needs deep reasoning to a small local model, you get a response that looks plausible but lacks actual reasoning depth. You follow that bad plan for an hour before realizing the model couldn’t handle the complexity.
Silent degradation is more dangerous than an error. The reasoning tier falls back to another cloud model. If both cloud options fail, the request errors out. Your application knows something went wrong instead of proceeding with garbage output.
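One way to keep that guarantee from eroding as the config grows is a small check that the reasoning tier never points at a local model. The sketch below assumes the model list has already been loaded into Python dicts shaped like the LiteLLM config shown later in this article; local_backends is a hypothetical helper, not part of LiteLLM.

```python
def local_backends(model_list: list[dict], tier: str) -> list[str]:
    """Return ids of deployments in `tier` that point at an Ollama model."""
    return [
        entry["model_info"]["id"]
        for entry in model_list
        if entry["model_name"] == tier
        and entry["litellm_params"]["model"].startswith(("ollama/", "ollama_chat/"))
    ]


# Trimmed copy of the tiers used in this article.
model_list = [
    {"model_name": "coding-tier",
     "litellm_params": {"model": "ollama_chat/qwen3-coder:30b"},
     "model_info": {"id": "coding-local"}},
    {"model_name": "reasoning-tier",
     "litellm_params": {"model": "openai/o3-mini"},
     "model_info": {"id": "reasoning-cloud-primary"}},
    {"model_name": "reasoning-tier",
     "litellm_params": {"model": "openai/gpt-4.1"},
     "model_info": {"id": "reasoning-cloud-secondary"}},
]

offenders = local_backends(model_list, "reasoning-tier")
if offenders:
    raise SystemExit(f"reasoning-tier must stay cloud-only, found: {offenders}")
print("reasoning-tier is cloud-only")
```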
Why These Models
Model selection on Apple Silicon isn’t about benchmarks. It’s about what fits in your unified memory with room for the OS, the KV cache, and realistic context windows.
MacBook Air M4 (24GB) — qwen3-coder:30b
This is a 30B-parameter Mixture-of-Experts (MoE) model that activates only 3.3B parameters per token. The MoE architecture is what makes this work on a consumer laptop. You get knowledge breadth much closer to a 30B model than to a 3B one, at the inference speed of something much smaller.
At Q4_K_M quantization it takes 19GB, leaving 5GB for macOS, Ollama, and KV cache. It runs at 25-30 tok/s on the M4 Air.
Why not a smaller generalist? A 4B or 8B general-purpose model handles grep-like tasks fine. But for anything involving multi-file analysis, dependency tracing, or architecture discovery, the difference is real. The 30B coding specialist gets it right the first time more often, which means fewer corrective turns and lower total cost per completed task.
Why not a 35B generalist? At 21GB it leaves only 3GB for everything else. Any meaningful context window will push you into swap. The 30B coder is 2GB lighter and specifically trained for the kind of tasks you’d actually run locally.
Mac Mini M1 (16GB) — qwen3:8b
With 16GB unified memory, you have roughly 12-13GB available after macOS and Ollama. The 8B model at Q4 takes 6GB, leaving 6-7GB for Redis, the KV cache, and context. It runs at 40+ tok/s on the M1.
This handles lightweight tasks: reading files, searching codebases, answering quick questions about individual functions, generating simple tests. The kind of stuff that doesn’t need the 30B model but you still want to run locally for speed and cost.
Quick Reference
| Device | RAM | Model | Size (Q4) | Active Params | Speed |
|---|---|---|---|---|---|
| MacBook Air M4 | 24GB | qwen3-coder:30b | 19GB | 3.3B/token | ~27 tok/s |
| Mac Mini M1 | 16GB | qwen3:8b | ~6GB | 8B | ~40 tok/s |
| Cloud | Variable | Any frontier model | N/A | N/A | Variable |
Context Window Math
KV cache memory grows linearly with context length. The model weights are a fixed cost, but the cache grows with every token in the window. On the 24GB machine with a 19GB model:
| Context | KV Cache (approx.) | Total Memory | Status |
|---|---|---|---|
| 4K | ~1.5GB | ~20.5GB | Comfortable |
| 8K | ~3GB | ~22GB | Fine for typical work |
| 16K | ~5-6GB | ~24-25GB | Tight, may swap with other apps |
| 32K | ~10-12GB | ~29-31GB | Will swap, not viable |
On the Mac Mini (16GB, 6GB model):
| Context | KV Cache (approx.) | Total Memory | Status |
|---|---|---|---|
| 4K | ~1GB | ~7GB | Comfortable |
| 8K | ~2GB | ~8GB | Fine |
| 16K | ~4GB | ~10GB | Tight |
Practical advice: configure 8K context windows for local models. If a request exceeds that and the backend reports a context-window error, LiteLLM’s context_window_fallbacks retries it on the next tier in the chain, ending at the cloud.
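The tables above follow a simple linear rule, which can be written down as a quick calculator. This is a rough sketch using the approximate per-4K figures from the tables; the reserve_gb value (memory held back for macOS and other apps) is an assumption to tune for your own machine.

```python
def total_memory_gb(model_gb: float, context_tokens: int, kv_gb_per_4k: float) -> float:
    """Model weights plus a KV cache that grows linearly with context."""
    return model_gb + kv_gb_per_4k * (context_tokens / 4096)


def largest_safe_context(ram_gb, model_gb, kv_gb_per_4k, reserve_gb,
                         candidates=(4096, 8192, 16384, 32768)):
    """Largest candidate context whose total fits in ram_gb - reserve_gb."""
    budget = ram_gb - reserve_gb
    fitting = [c for c in candidates
               if total_memory_gb(model_gb, c, kv_gb_per_4k) <= budget]
    return max(fitting) if fitting else None


# MacBook Air: 19GB model, ~1.5GB of KV cache per 4K tokens (from the table).
for ctx in (4096, 8192, 16384, 32768):
    print(ctx, total_memory_gb(19, ctx, 1.5))

# Assumed 2GB reserve for macOS and other apps.
print("suggested num_ctx:", largest_safe_context(24, 19, 1.5, reserve_gb=2))
```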
Prerequisites
- Ollama installed on both Macs (ollama.com)
- Redis on the Mac Mini (brew install redis && brew services start redis)
- Python 3.10+ with uv installed (pip install uv or brew install uv)
- A cloud LLM API key (OpenAI, Anthropic, Google, or any provider LiteLLM supports)
- Both Macs on the same home network
Step 1: Set Up the Mac Mini M1
The Mac Mini is your always-on server. It runs the lightweight model and Redis.
1.1 Install Ollama and Redis
# On the Mac Mini
brew install ollama redis
brew services start redis
1.2 Pull the lightweight model
ollama pull qwen3:8b
This downloads the 8B model at Q4, roughly 6GB.
1.3 Expose Ollama on the network
By default, Ollama only listens on localhost. To make it reachable from the MacBook Air:
# Add to ~/.zshrc on the Mac Mini
export OLLAMA_HOST="0.0.0.0:11434"
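Restart Ollama so it picks up the variable. A shell export only affects ollama serve started from that shell; if you run the Ollama menu-bar app instead, set the variable with launchctl setenv OLLAMA_HOST "0.0.0.0:11434" and restart the app. Then verify from the MacBook Air (replace the IP with your Mac Mini’s actual address):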
curl http://192.168.1.100:11434/api/tags
You should see a JSON response listing the models on the Mac Mini.
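If you would rather script this check than eyeball the JSON, a minimal stdlib sketch could look like the following. The sample payload is a trimmed, illustrative version of what /api/tags returns (real responses carry more fields per model), and both function names are hypothetical.

```python
import json
import urllib.request


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from an Ollama /api/tags response body."""
    return [m["name"] for m in tags_payload.get("models", [])]


def fetch_model_names(base_url: str, timeout: float = 5.0) -> list[str]:
    """GET <base_url>/api/tags and return the model names it lists."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_names(json.load(resp))


# Trimmed example payload, used here so the sketch runs without a network.
sample = {"models": [{"name": "qwen3:8b"}]}
names = model_names(sample)
print("qwen3:8b available:", "qwen3:8b" in names)
```

Against the real server, call fetch_model_names("http://192.168.1.100:11434") with your Mac Mini's address in place of the placeholder IP.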
Security note: Exposing Ollama on your LAN means any device on the network can reach it, including management endpoints that can pull or delete models. Fine for a home network. In a shared space, put a reverse proxy with basic auth in front of it.
1.4 Find the Mac Mini’s IP
# On the Mac Mini
ipconfig getifaddr en0
Note this address. You’ll use it in the LiteLLM config.
Step 2: Set Up the MacBook Air M4
This is where LiteLLM runs and where the primary local inference happens.
2.1 Pull the primary model
ollama pull qwen3-coder:30b
This downloads the 30B MoE model at Q4, roughly 19GB. Expect 25-30 tok/s on the M4 Air.
2.2 Pull a local fallback for the lightweight tier
If the Mac Mini goes down, you still want fast inference for simple tasks:
ollama pull qwen3:8b
Same model as the Mac Mini. Only loaded if the Mini is unreachable.
Step 3: Install and Configure LiteLLM
LiteLLM speaks the OpenAI API format on the frontend (what your applications send) and translates to whatever provider format is needed on the backend (Ollama, OpenAI, Anthropic, etc.).
3.1 Install LiteLLM
# On the MacBook Air
uv tool install 'litellm[proxy]'
This installs the latest stable version (1.80.x as of this writing).
3.2 The config
Create ~/litellm-config.yaml on the MacBook Air:
model_list:
  # =============================================
  # TIER 1 - Lightweight tasks
  # Routed to: Mac Mini M1 (network) -> M4 fallback -> cloud
  # =============================================
  - model_name: lightweight-tier
    litellm_params:
      model: ollama_chat/qwen3:8b
      api_base: http://192.168.1.100:11434
      extra_body:
        options:
          num_ctx: 8192
      timeout: 30
      order: 1
    model_info:
      id: "lightweight-macmini"
      max_tokens: 4096
  - model_name: lightweight-tier
    litellm_params:
      model: ollama_chat/qwen3:8b
      api_base: http://localhost:11434
      extra_body:
        options:
          num_ctx: 8192
      timeout: 30
      order: 2
    model_info:
      id: "lightweight-macbook"
      max_tokens: 4096
  - model_name: lightweight-tier
    litellm_params:
      model: openai/gpt-4.1-mini
      api_key: os.environ/OPENAI_API_KEY
      timeout: 60
      order: 3
    model_info:
      id: "lightweight-cloud"
      max_tokens: 4096
  # =============================================
  # TIER 2 - Main coding/complex tasks
  # Routed to: MacBook Air M4 (localhost) -> cloud
  # =============================================
  - model_name: coding-tier
    litellm_params:
      model: ollama_chat/qwen3-coder:30b
      api_base: http://localhost:11434
      extra_body:
        options:
          num_ctx: 8192
      timeout: 120
      order: 1
    model_info:
      id: "coding-local"
      max_tokens: 8192
  - model_name: coding-tier
    litellm_params:
      model: openai/gpt-4.1
      api_key: os.environ/OPENAI_API_KEY
      timeout: 120
      order: 2
    model_info:
      id: "coding-cloud"
      max_tokens: 8192
  # =============================================
  # TIER 3 - Frontier reasoning (cloud only)
  # No local fallback. Silent degradation to a
  # weaker model on reasoning tasks is worse than
  # a clear error.
  # =============================================
  - model_name: reasoning-tier
    litellm_params:
      model: openai/o3-mini
      api_key: os.environ/OPENAI_API_KEY
      timeout: 180
      order: 1
    model_info:
      id: "reasoning-cloud-primary"
      max_tokens: 16384
  - model_name: reasoning-tier
    litellm_params:
      model: openai/gpt-4.1
      api_key: os.environ/OPENAI_API_KEY
      timeout: 180
      order: 2
    model_info:
      id: "reasoning-cloud-secondary"
      max_tokens: 16384
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
litellm_settings:
  drop_params: true
  drop_unsupported_params: true
  num_retries: 2
  request_timeout: 60
  allowed_fails: 3
  cooldown_time: 30
  # Redis cache on the Mac Mini
  cache: true
  cache_params:
    type: "redis"
    host: "192.168.1.100"
    port: 6379
    ttl: 3600
  fallbacks:
    - { "lightweight-tier": ["coding-tier"] }
    - { "coding-tier": ["reasoning-tier"] }
  context_window_fallbacks:
    - { "lightweight-tier": ["coding-tier", "reasoning-tier"] }
    - { "coding-tier": ["reasoning-tier"] }
router_settings:
  num_retries: 2
  timeout: 60
Replace 192.168.1.100 with your Mac Mini’s actual IP. Replace the OpenAI model names with whatever provider you prefer. LiteLLM supports 100+ providers, so swap in Anthropic, Google, or any other service your API key works with.
The order parameter controls fallback priority within each tier. LiteLLM tries order 1 first. If it fails (connection refused, timeout, too many errors), it tries order 2, then order 3.
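The "too many errors" case is what the allowed_fails: 3 and cooldown_time: 30 settings in the config are for. The sketch below shows the general idea of a failure-count cooldown; it is a simplified illustration under my own assumptions (it ignores any time window on the failure count), and CooldownTracker is a hypothetical class, not LiteLLM's internal logic.

```python
class CooldownTracker:
    """Bench a backend for cooldown_time seconds after too many failures."""

    def __init__(self, allowed_fails: int = 3, cooldown_time: float = 30.0):
        self.allowed_fails = allowed_fails
        self.cooldown_time = cooldown_time
        self.fail_counts: dict[str, int] = {}
        self.cooling_until: dict[str, float] = {}

    def record_failure(self, backend: str, now: float) -> None:
        """Count a failure; start a cooldown once the allowance is exceeded."""
        self.fail_counts[backend] = self.fail_counts.get(backend, 0) + 1
        if self.fail_counts[backend] > self.allowed_fails:
            self.cooling_until[backend] = now + self.cooldown_time
            self.fail_counts[backend] = 0

    def is_available(self, backend: str, now: float) -> bool:
        """True if the backend is not currently cooling down."""
        return now >= self.cooling_until.get(backend, 0.0)


tracker = CooldownTracker(allowed_fails=3, cooldown_time=30)
for t in (0, 1, 2, 3):  # four failures, one more than allowed
    tracker.record_failure("lightweight-macmini", now=t)

print(tracker.is_available("lightweight-macmini", now=10))
print(tracker.is_available("lightweight-macmini", now=40))
```

While a backend is cooling down, the router skips it and moves straight to the next order value instead of waiting on another timeout.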
3.3 Environment variables
# Add to ~/.zshrc on the MacBook Air
export LITELLM_MASTER_KEY="sk-your-random-string-here"
export OPENAI_API_KEY="your-api-key-here"
The LITELLM_MASTER_KEY can be any random string. It authenticates requests to your LiteLLM proxy.
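If you want something better than a hand-typed string, Python's secrets module can generate one. This is a small sketch; the sk- prefix simply follows the placeholder used above, and new_master_key is a hypothetical helper name.

```python
import secrets


def new_master_key(n_bytes: int = 32) -> str:
    """Return a random URL-safe key with an sk- prefix."""
    return "sk-" + secrets.token_urlsafe(n_bytes)


key = new_master_key()
# Paste the printed line into ~/.zshrc on the MacBook Air.
print(f'export LITELLM_MASTER_KEY="{key}"')
```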
3.4 Start LiteLLM
litellm --config ~/litellm-config.yaml
You should see output indicating the proxy is running on http://0.0.0.0:4000.
3.5 Verify
Test all three tiers:
export LITELLM_MASTER_KEY="sk-your-random-string-here"
# Tier 1: lightweight (Mac Mini M1, qwen3:8b)
curl -X POST http://localhost:4000/v1/chat/completions \
-H "Authorization: Bearer $LITELLM_MASTER_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"lightweight-tier","max_tokens":100,"messages":[{"role":"user","content":"What does os.path.join do in Python?"}]}'
# Tier 2: coding (MacBook Air M4, qwen3-coder:30b)
curl -X POST http://localhost:4000/v1/chat/completions \
-H "Authorization: Bearer $LITELLM_MASTER_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"coding-tier","max_tokens":100,"messages":[{"role":"user","content":"Refactor this Express middleware to use async/await"}]}'
# Tier 3: reasoning (cloud)
curl -X POST http://localhost:4000/v1/chat/completions \
-H "Authorization: Bearer $LITELLM_MASTER_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"reasoning-tier","max_tokens":100,"messages":[{"role":"user","content":"Design a microservice architecture for a real-time collaboration system"}]}'
Check the LiteLLM dashboard at http://localhost:4000/ui to see request logs, latency, and which backend each request hit.
Using It From Your Application
This setup works with anything that speaks the OpenAI API format. Point your application’s base URL to http://localhost:4000/v1 and use the tier model names.
Python example with OpenAI SDK
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:4000/v1",
api_key="sk-your-litellm-master-key"
)
# Simple question - routes to Mac Mini (qwen3:8b)
response = client.chat.completions.create(
model="lightweight-tier",
messages=[{"role": "user", "content": "What is a closure in JavaScript?"}]
)
# Complex coding task - routes to MacBook Air (qwen3-coder:30b)
response = client.chat.completions.create(
model="coding-tier",
messages=[{"role": "user", "content": "Write a Redis caching layer for this Express API"}]
)
# Deep reasoning - routes to cloud
response = client.chat.completions.create(
model="reasoning-tier",
messages=[{"role": "user", "content": "Analyze the trade-offs between event sourcing and CQRS for an e-commerce system"}]
)
Shell script example
#!/bin/bash
# A simple automation script that uses local LLMs
API_URL="http://localhost:4000/v1/chat/completions"
API_KEY="sk-your-litellm-master-key"
# Generate a commit message from git diff
DIFF=$(git diff --staged)
if [ -z "$DIFF" ]; then
  echo "No staged changes"
  exit 1
fi
# Build the JSON payload with python3 so quotes and newlines in the diff are escaped
PAYLOAD=$(printf '%s' "$DIFF" | python3 -c '
import json, sys
print(json.dumps({
    "model": "coding-tier",
    "max_tokens": 200,
    "messages": [
        {"role": "system", "content": "Write a concise git commit message. Output only the message, nothing else."},
        {"role": "user", "content": sys.stdin.read()},
    ],
}))')
COMMIT_MSG=$(curl -s "$API_URL" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" | python3 -c "import sys,json; print(json.load(sys.stdin)['choices'][0]['message']['content'])")
echo "$COMMIT_MSG"
Integration with coding assistants
Most coding assistants that support custom API endpoints work with this setup. Set the base URL to http://localhost:4000/v1, set the API key to your LiteLLM master key, and create one configuration per tier. The assistant sends requests with the model name matching your tier names, and LiteLLM routes accordingly.
Caching with Redis
Many LLM workloads have high repetition. You ask about the same function, the same error message, the same architecture across multiple requests. Without caching, every one of those hits the model.
Redis caching in LiteLLM stores responses keyed by the exact request. It’s already configured in the config above. On the Mac Mini, verify it’s running:
redis-cli ping
# Should return: PONG
# Monitor cache activity
redis-cli monitor | grep litellm
The TTL is 3600 seconds (one hour). Adjust shorter for code that changes frequently, longer for stable documentation queries.
In practice, roughly 15-20% of lightweight-tier requests and 5-10% of coding-tier requests hit cache during a typical session. It compounds over a full day of work.
Observability: What to Actually Measure
The LiteLLM dashboard shows request logs and latency. But to know whether this setup is actually saving you money, track these metrics:
- Fallback frequency. How often do requests escalate from local to cloud? If the lightweight tier hits cloud 40% of the time, your local model isn’t handling what you expected.
- Latency percentiles. Track p50 and p95 per tier. A local model averaging 25 tok/s but with p95 at 5 tok/s (due to model swapping after idle periods) is worse than a consistent cloud model.
- Task completion rate. How often does the model get it right on the first try? If your local model handles 80% of requests but 50% of those need correction, the economics look very different.
- Context exhaustion events. How often do requests exceed the local context window and fall back to cloud? Frequent overflow means either your num_ctx is too small or you need more headroom.
The danger zone: 80% of requests hit local, but 80% of corrections come from those same local requests. The token savings evaporate. Cost per token is a vanity metric. Cost per completed task is what matters.
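The metrics above are easy to compute once you have per-request records. The sketch below uses a handful of made-up rows purely to show the arithmetic; the field names and values are hypothetical, and in practice you would export real records from your own logging.

```python
import math

# Hypothetical log rows; every value here is invented for illustration.
records = [
    {"backend": "local", "latency_s": 0.8, "completed": True, "cost_usd": 0.0},
    {"backend": "local", "latency_s": 1.2, "completed": True, "cost_usd": 0.0},
    {"backend": "local", "latency_s": 9.5, "completed": False, "cost_usd": 0.0},
    {"backend": "cloud", "latency_s": 5.0, "completed": True, "cost_usd": 0.06},
]


def fallback_rate(rows: list[dict]) -> float:
    """Share of requests that ended up on a cloud backend."""
    return sum(r["backend"] == "cloud" for r in rows) / len(rows)


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def cost_per_completed_task(rows: list[dict]) -> float:
    """Total spend divided by the number of tasks that actually succeeded."""
    done = sum(r["completed"] for r in rows)
    total = sum(r["cost_usd"] for r in rows)
    return total / done if done else float("inf")


latencies = [r["latency_s"] for r in records]
print("fallback rate:", fallback_rate(records))
print("p50 / p95 latency:", percentile(latencies, 50), percentile(latencies, 95))
print("cost per completed task:", cost_per_completed_task(records))
```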
Making It Persistent
Ollama on the Mac Mini (launchd):
Create ~/Library/LaunchAgents/com.ollama.server.plist:
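<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.ollama.server</string>
    <key>ProgramArguments</key>
    <array>
        <string>/opt/homebrew/bin/ollama</string>
        <string>serve</string>
    </array>
    <key>EnvironmentVariables</key>
    <dict>
        <key>OLLAMA_HOST</key>
        <string>0.0.0.0:11434</string>
    </dict>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
</dict>
</plist>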
launchctl load ~/Library/LaunchAgents/com.ollama.server.plist
Redis on the Mac Mini (already persistent if started with brew services):
brew services start redis
LiteLLM on the MacBook Air (tmux):
tmux new-session -d -s litellm 'litellm --config ~/litellm-config.yaml'
Or create a similar launchd plist.
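If you go the launchd route for LiteLLM, Python's plistlib can generate the file so you don't have to hand-write XML. This is a sketch under several assumptions: the com.litellm.proxy label is made up, the ~/.local/bin/litellm path is a guess at where uv tool install places the binary (check with which litellm), and the environment values are the placeholders from earlier. launchd does not read ~/.zshrc, so the keys have to be passed explicitly, which means they sit in the plist in plain text.

```python
import plistlib
from pathlib import Path


def litellm_plist(litellm_bin: str, config_path: str, env: dict[str, str]) -> bytes:
    """Build a launchd job definition that runs the LiteLLM proxy."""
    job = {
        "Label": "com.litellm.proxy",  # hypothetical label
        "ProgramArguments": [litellm_bin, "--config", config_path],
        "EnvironmentVariables": env,
        "RunAtLoad": True,
        "KeepAlive": True,
    }
    return plistlib.dumps(job)  # XML format by default


home = Path.home()
xml = litellm_plist(
    str(home / ".local/bin/litellm"),  # assumed install path
    str(home / "litellm-config.yaml"),
    {
        "LITELLM_MASTER_KEY": "sk-your-random-string-here",
        "OPENAI_API_KEY": "your-api-key-here",
    },
)
print(xml.decode("utf-8"))
```

Save the output as ~/Library/LaunchAgents/com.litellm.proxy.plist and load it with launchctl load, the same way as the Ollama job.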
Best Practices
1. Keep models warm. Ollama unloads models from memory when idle. The first request after a cold start is slow. Add a keepalive cron:
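*/5 * * * * curl -s http://localhost:11434/api/generate -d '{"model":"qwen3-coder:30b","keep_alive":"10m"}' > /dev/null
*/5 * * * * curl -s http://192.168.1.100:11434/api/generate -d '{"model":"qwen3:8b","keep_alive":"10m"}' > /dev/null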
Or set OLLAMA_KEEP_ALIVE=10m in your environment to keep models loaded longer.
2. Set appropriate timeouts per tier. Lightweight (local 8B) gets 30 seconds. Coding (local 30B MoE) gets 120 seconds. Reasoning (cloud) gets 180 seconds. Already configured in the YAML above.
3. Close other apps during heavy sessions. The MacBook Air’s 24GB is shared between the model, macOS, and your apps. Running Chrome with 20 tabs while generating a 16K-context response is asking for swap.
4. Benchmark your actual context usage. Don’t assume 8K is right. Run a few sessions and check ollama ps to see how much memory the model actually uses with your typical prompts. Adjust num_ctx up or down based on real data.
5. Use semantic caching for more savings. LiteLLM supports a Redis-backed semantic cache that matches near-duplicate requests, not just exact matches. This catches repeated patterns like “explain this function” across slightly different wordings. Enable it by setting type: "redis-semantic" in cache_params, together with a similarity_threshold and an embedding model (redis_semantic_cache_embedding_model) defined in your model_list.
Troubleshooting
“model not found” errors: Verify the model name in your request matches a model_name in the LiteLLM config. Names must match exactly.
Mac Mini unreachable from MacBook Air: Ping the Mac Mini’s IP. Verify OLLAMA_HOST="0.0.0.0:11434" is set and Ollama is running on the Mini.
Context window errors from local models: The context_window_fallbacks in the config handle this automatically. If you still see errors, verify the fallback chain is correctly configured.
Slow first response: Ollama unloads idle models. Use the keepalive cron or OLLAMA_KEEP_ALIVE.
Redis connection refused: Verify Redis is running on the Mac Mini (redis-cli ping). LiteLLM falls back gracefully to no caching if Redis is unreachable.
What’s Next
Add a 32GB+ Mac. A Mac Mini M4 Pro with 32GB lets you run the 30B model at Q5_K_M (near-lossless quantization) with 16K context, or run a larger generalist model alongside the coding specialist.
Add more machines. Each additional Mac on your network is another inference node. Add them as new entries in the LiteLLM model_list with appropriate order values.
Build a complexity classifier. Instead of having each application pick a tier by hand, build a lightweight classifier that chooses the tier automatically from signals such as prompt length and task type.
Semantic caching. LiteLLM supports Redis-backed semantic cache that catches near-duplicate requests beyond exact string matching. This would catch more repeated patterns across differently worded prompts.
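The complexity classifier idea could start as a plain keyword-and-length heuristic before you reach for anything learned. The sketch below is one possible starting point; the hint lists, the 2000-character cutoff, and the pick_tier name are all arbitrary assumptions to tune against your own traffic.

```python
# Hypothetical keyword hints; substring matching is crude ("fix" also
# matches "prefix"), so treat this as a baseline only.
REASONING_HINTS = ("architecture", "trade-off", "tradeoff", "design a")
CODING_HINTS = ("refactor", "implement", "write a", "debug", "fix")


def pick_tier(prompt: str, long_prompt_chars: int = 2000) -> str:
    """Map a prompt to one of the three tier names used in this article."""
    text = prompt.lower()
    if any(hint in text for hint in REASONING_HINTS):
        return "reasoning-tier"
    if len(prompt) > long_prompt_chars or any(hint in text for hint in CODING_HINTS):
        return "coding-tier"
    return "lightweight-tier"


for prompt in (
    "What is a closure in JavaScript?",
    "Refactor this Express middleware to use async/await",
    "Design a microservice architecture for a real-time collaboration system",
):
    print(pick_tier(prompt), "<-", prompt)
```

The returned string drops straight into the model field of the request, so the application no longer needs to hard-code a tier.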
Honest Economics
The setup isn’t free. You need two Macs (or at least one Mac plus another machine that can run Ollama). You need a cloud API subscription for the reasoning tier. And there’s your time to set it all up.
The savings come from two places: local inference on tasks that don’t need a frontier model, and Redis caching on repeated requests. In practice, the local coding tier handles maybe 60-70% of coding requests without hitting the cloud. The lightweight tier handles 80-90% of simple tasks locally. Cloud usage drops to genuine frontier tasks.
But here’s the nuance. Local models are slower than cloud frontier models, and less capable: a task a frontier model handles in one turn might take two or three turns with a local 30B model. The token savings are partially offset by extra turns. Whether this saves money depends on your workload: if most requests are straightforward coding tasks, the savings are substantial. If most are architectural or exploratory, you hit the cloud more.
After running this setup, cloud API costs dropped noticeably compared to routing everything through a single cloud model. The exact savings depend on what you’re building. But the system is transparent about where every request goes, and you can tune the routing to match your actual usage patterns.


