Lesson 2 of 6 · 15 min
Llama, Mistral, Gemma, Phi — Which Model for What?
Stop Guessing. Here’s the Actual Data.
Picking an open source model feels like choosing a restaurant in a foreign city — too many options, not enough context, and the reviews are all written by the chef. Let’s fix that with real numbers and specific recommendations.
The Big Four Model Families
Meta Llama 4 — The All-Rounder
Meta’s Llama 4 is the default recommendation for most use cases. The family uses a Mixture-of-Experts (MoE) architecture: only a small fraction of the total weights is activated for each token, so you get huge model capacity without a matching jump in per-token compute.
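The MoE trade-off can be shown with a back-of-the-envelope calculation using the published figures from the table below. This is an illustrative sketch only; the real expert-routing logic is far more involved.

```python
# Sketch: what fraction of a model's weights does one forward pass touch?
# Figures are the published totals for Llama 4 Scout and Llama 3.3 70B.

def moe_active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Fraction of weights that participate in any single forward pass."""
    return active_params_b / total_params_b

scout = moe_active_fraction(109, 17)  # Llama 4 Scout: 17B active of 109B total
dense = moe_active_fraction(70, 70)   # Llama 3.3 70B: dense, everything active

print(f"Scout activates {scout:.0%} of its weights per token")  # 16%
print(f"A dense 70B activates {dense:.0%}")                     # 100%
```

So Scout pays roughly the per-token compute of a 17B model while storing 109B parameters of capacity — which is also why its memory footprint, not its speed, is the constraint.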
| Model | Total Params | Active Params | Context Window | MMLU | HumanEval |
|---|---|---|---|---|---|
| Llama 4 Scout | 109B (16 experts) | 17B | 10M tokens | ~85 | ~80 |
| Llama 4 Maverick | 400B (128 experts) | 17B | 1M tokens | ~89 | ~85 |
| Llama 3.3 70B | 70B (dense) | 70B | 128K tokens | 86.0 | 72.6 |
| Llama 3.1 8B | 8B (dense) | 8B | 128K tokens | 69.4 | 62.5 |
Best for: General-purpose chat, long-context tasks (research, document analysis), RAG pipelines. The 10M context window on Scout is unmatched in the open source world.
Hardware: Scout fits on a single H100 (80GB) with 4-bit quantization; at full precision its 109B total weights need multiple GPUs, and even quantized it is too large for a 24GB RTX 4090. The 8B variant runs on any modern laptop with 8GB+ RAM.
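A quick way to sanity-check these hardware claims yourself: weight memory is roughly parameters × bits ÷ 8, plus some overhead for the KV cache, activations, and the runtime. The 15% overhead factor below is a heuristic assumption, not a measured figure.

```python
# Rough memory estimate for a quantized model.
# overhead=1.15 is an assumed ~15% for KV cache, activations, and runtime.

def est_vram_gb(params_b: float, bits: int = 4, overhead: float = 1.15) -> float:
    """Approximate GB needed to hold params_b billion parameters at `bits` precision."""
    return params_b * bits / 8 * overhead

for name, params in [("Llama 3.1 8B", 8), ("Llama 3.3 70B", 70), ("Llama 4 Scout", 109)]:
    print(f"{name}: ~{est_vram_gb(params):.0f} GB at 4-bit")
```

The estimates line up with the sizes in the Code Examples section (8B ≈ 4.6 GB, 70B ≈ 40 GB), and they show why Scout (~63 GB at 4-bit) fits an 80GB H100 but not a 24GB card.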
Mistral — The Speed Demon
Mistral AI from Paris builds models that consistently punch above their weight on speed. Their secret: aggressive optimization for inference throughput.
| Model | Total Params | Active Params | Context | MMLU | Throughput Edge |
|---|---|---|---|---|---|
| Mistral Large 3 | 675B (MoE) | 41B | 128K | ~88 | Highest in class |
| Mistral 24B (Small 3) | 24B | 24B | 32K | ~81 | 4 tok/s on mini PCs |
| Mistral 7B v0.3 | 7B | 7B | 32K | 64.1 | Best tok/$ ratio |
Best for: Latency-sensitive applications. Voice agents, trading assistants, customer-facing chatbots where you need sub-500ms response times. Mistral’s low memory consumption also makes it ideal for edge computing.
Hardware: Mistral 7B runs on 6GB VRAM (4-bit). The 24B model is the sweet spot for mini PCs and edge devices.
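When a sub-500ms target matters, it helps to budget it explicitly: total response time is time-to-first-token (TTFT) plus generation time for the tokens you need. The TTFT and throughput figures below are hypothetical, for illustration only.

```python
# Sketch: does a model meet a latency budget for a short reply?
# ttft_ms and tokens_per_sec here are assumed example values.

def response_ms(n_tokens: int, ttft_ms: float, tokens_per_sec: float) -> float:
    """Total time to emit n_tokens: time-to-first-token plus generation time."""
    return ttft_ms + (n_tokens / tokens_per_sec) * 1000

# A 20-token voice-agent reply at 80 tok/s with 150 ms TTFT:
total = response_ms(20, ttft_ms=150, tokens_per_sec=80)
print(f"{total:.0f} ms")  # 400 ms -> inside a 500 ms budget
```

The same math shows why throughput alone is misleading: a model with great tok/s but a slow prompt-processing phase (high TTFT) can still blow a real-time budget.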
Google Gemma 3 — The Small Model King
Gemma is built on the same research as Google’s Gemini but optimized for consumer hardware. The killer feature: all Gemma 3 models include vision capabilities out of the box.
| Model | Params | Vision | Min RAM | Best Use Case |
|---|---|---|---|---|
| Gemma 3 2B | 2B | Yes | 4GB | Mobile, IoT, edge |
| Gemma 3 9B | 9B | Yes | 8GB | Laptop, desktop chat |
| Gemma 3 27B | 27B | Yes | 16GB | Workstation, RAG, coding |
Best for: Developers who need vision + text in one model. The 2B variant is the only credible option for phones and IoT devices. Pre-aligned and instruction-tuned out of the box — minimal prompt engineering needed.
Microsoft Phi — The Researcher’s Pick
Microsoft’s Phi family proves that training data quality can compensate for parameter count. Phi-4 (14B) routinely outperforms models 3-5x its size on reasoning benchmarks.
Best for: Reasoning-heavy tasks, math, structured data analysis. If your use case is “answer complex questions” rather than “generate creative text,” Phi punches hardest per parameter.
The Dark Horse: Qwen 3
Alibaba’s Qwen 3 deserves special mention. Qwen 3 7B posts a HumanEval score of 76.0 — the highest of any model under 8B parameters, and 3.4 points ahead of even the far larger Llama 3.3 70B (72.6). Qwen 3 72B scores 83.1 MMLU and 84.2 HumanEval. If coding is your primary use case and you want a small model, Qwen 3 7B is the answer.
Qwen3-30B-A3B also excels in extended context (342K tokens) and achieves 20-22 tokens/sec on 80GB GPUs — strong competition for Llama 4 Scout on long-document tasks.
The Decision Matrix
| Use Case | Best Model | Why |
|---|---|---|
| General chat + assistant | Llama 4 Scout | Best balance of quality and context length |
| Coding on a laptop | Qwen 3 7B | Highest HumanEval in the 7B class |
| Real-time / voice | Mistral 7B or 24B | Lowest latency, best throughput |
| Vision + text | Gemma 3 9B or 27B | Built-in vision, no extra model needed |
| Reasoning / math | Phi-4 14B | Best reasoning per parameter |
| Edge / mobile | Gemma 3 2B | Only credible sub-4GB option |
| Long documents (100K+) | Llama 4 Scout | 10M token context window |
| Budget coding server | Qwen 3 72B | 84.2 HumanEval, enterprise quality |
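If you want this matrix in your own tooling, it reduces to a trivial lookup. The keys and default below are simply the table rows restated, not an official mapping.

```python
# Minimal lookup mirroring the decision matrix above.
RECOMMENDATIONS = {
    "general chat": "Llama 4 Scout",
    "laptop coding": "Qwen 3 7B",
    "real-time / voice": "Mistral 7B",
    "vision + text": "Gemma 3 27B",
    "reasoning / math": "Phi-4 14B",
    "edge / mobile": "Gemma 3 2B",
    "long documents": "Llama 4 Scout",
    "budget coding server": "Qwen 3 72B",
}

def pick_model(use_case: str) -> str:
    # Llama 4 Scout is the article's general-purpose default.
    return RECOMMENDATIONS.get(use_case.lower(), "Llama 4 Scout")

print(pick_model("laptop coding"))  # Qwen 3 7B
```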
Hardware Reality Check
The most common question: “Can I run this on my machine?”
- 8GB RAM / No GPU: Gemma 2B, Phi-3.5 Mini (3.8B). Expect 2-5 tokens/sec. Usable for testing, not comfortable for daily work.
- 16GB RAM / M1-M3 Mac: Any 7-8B model (4-bit). 10-20 tokens/sec. Solid daily driver.
- 24GB VRAM (RTX 4090/5090): Any model up to ~30B (4-bit). 20-40 tokens/sec. Professional quality.
- 64GB Unified (M2/M3 Max/Ultra): Llama 4 Scout (quantized). 8-15 tokens/sec. The laptop dream.
- 80GB (H100/A100): Llama 4 Scout full precision, any 70B model. Production-ready.
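You can turn these tiers into a quick programmatic check using the approximate 4-bit sizes from the Code Examples section below. The 80% headroom factor is an assumption — the OS, KV cache, and other apps need the rest — so treat the output as a rough filter, not a guarantee.

```python
# Approximate 4-bit sizes in GB, copied from the Code Examples list.
MODEL_SIZE_GB = {
    "Gemma 2B": 1.5, "Llama 3.1 8B": 4.7, "Gemma 9B": 5.5,
    "Phi-4 14B": 8.5, "Mistral 24B": 14, "Gemma 27B": 16,
    "Llama 3.3 70B": 40, "Qwen 3 72B": 42,
}

def runnable(memory_gb: float, headroom: float = 0.8) -> list[str]:
    """Models whose 4-bit weights fit in ~80% of available memory (assumed headroom)."""
    return [m for m, gb in MODEL_SIZE_GB.items() if gb <= memory_gb * headroom]

print(runnable(8))   # ['Gemma 2B', 'Llama 3.1 8B', 'Gemma 9B']
print(runnable(24))  # everything up to Gemma 27B
```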
Next lesson: you’ll install Ollama and have a model running in under 5 minutes. No excuses.
Code Examples
# Quick size check: how much disk/RAM does each model need? (4-bit quantized)
# Gemma 2B: ~1.5 GB
# Llama 3.1 8B: ~4.7 GB
# Gemma 9B: ~5.5 GB
# Phi-4 14B: ~8.5 GB
# Mistral 24B: ~14 GB
# Gemma 27B: ~16 GB
# Llama 3.3 70B: ~40 GB
# Qwen 3 72B: ~42 GB
Key Takeaways
- Llama 4 Scout (17B active / 109B total) is the best general-purpose choice — fits on a single H100 with a 10M token context window
- Mistral leads in speed and latency — best for real-time applications like voice agents or customer-facing bots under 500ms
- Gemma 3 from Google is the best small model for consumer hardware — 2B/9B/27B sizes with built-in vision capabilities
- Qwen 3 7B has the highest HumanEval score (76.0) of any sub-8B model — the coding champion for laptops
- For most developers, the 7-8B parameter tier is the sweet spot: fits in 8GB VRAM with 4-bit quantization