Open Source AI Models Explained — Run GPT-Level Intelligence for $0 (Here’s How)

Lesson 2 of 6 · 15 min

Llama, Mistral, Gemma, Phi — Which Model for What?

Stop Guessing. Here’s the Actual Data.

Picking an open source model feels like choosing a restaurant in a foreign city — too many options, not enough context, and the reviews are all written by the chef. Let’s fix that with real numbers and specific recommendations.

The Big Four Model Families

Meta Llama 4 — The All-Rounder

Meta’s Llama 4 is the default recommendation for most use cases. The family uses a Mixture-of-Experts (MoE) architecture: only a fraction of the network activates for each token (17B of Scout’s 109B parameters), so you get huge model capacity without paying the full compute cost on every token.

| Model | Total Params | Active Params | Context Window | MMLU | HumanEval |
|---|---|---|---|---|---|
| Llama 4 Scout | 109B (16 experts) | 17B | 10M tokens | ~85 | ~80 |
| Llama 4 Maverick | 400B (128 experts) | 17B | 1M tokens | ~89 | ~85 |
| Llama 3.3 70B | 70B (dense) | 70B | 128K tokens | 86.0 | 72.6 |
| Llama 3.1 8B | 8B (dense) | 8B | 128K tokens | 69.4 | 62.5 |

Best for: General-purpose chat, long-context tasks (research, document analysis), RAG pipelines. The 10M context window on Scout is unmatched in the open source world.

Hardware: even quantized, Scout needs datacenter-class memory. All 109B parameters must sit in memory (only 17B are active per token), and 109B weights at 4-bit is roughly 55GB — enough to fit a single H100 (80GB), but far beyond a 24GB consumer card like the RTX 4090. The 8B variant runs on any modern laptop with 8GB+ RAM.
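
If you want to kick the tires before the full Ollama walkthrough in the next lesson, the commands are one-liners. `llama3.1:8b` is a real Ollama library tag; the Llama 4 tag below is an assumption, so check the library listing for the current name.

```bash
# Pull and chat with the laptop-class 8B model
ollama pull llama3.1:8b
ollama run llama3.1:8b "Explain Mixture-of-Experts in two sentences."

# Llama 4 Scout (tag assumed; verify on the Ollama library first)
# ollama run llama4:scout "What changed between Llama 3 and Llama 4?"
```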

Mistral — The Speed Demon

Mistral AI from Paris builds models that consistently punch above their weight on speed. Their secret: aggressive optimization for inference throughput.

| Model | Total Params | Active Params | Context | MMLU | Throughput Edge |
|---|---|---|---|---|---|
| Mistral Large 3 | 675B (MoE) | 41B | 128K | ~88 | Highest in class |
| Mistral 24B (Small 3) | 24B | 24B | 32K | ~81 | 4 tok/s on mini PCs |
| Mistral 7B v0.3 | 7B | 7B | 32K | 64.1 | Best tok/$ ratio |

Best for: Latency-sensitive applications. Voice agents, trading assistants, customer-facing chatbots where you need sub-500ms response times. Mistral’s low memory consumption also makes it ideal for edge computing.

Hardware: Mistral 7B runs on 6GB VRAM (4-bit). The 24B model is the sweet spot for mini PCs and edge devices.
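
Those throughput numbers map directly onto user-perceived wait time. A back-of-the-envelope check (a sketch only — real latency also depends on prompt processing and time-to-first-token):

```bash
# Seconds to generate a reply of N tokens at a given tokens/sec rate.
# usage: reply_time TOKENS TOK_PER_SEC
reply_time() {
  awk -v n="$1" -v tps="$2" 'BEGIN { printf "%.1f\n", n / tps }'
}

reply_time 100 4    # Mistral 24B on a mini PC: 25.0 s for a 100-token answer
reply_time 100 40   # the same model on a fast GPU: 2.5 s
```

This is why sub-500ms targets hinge on streaming: show the first tokens as they arrive rather than waiting for the full reply.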

Google Gemma 3 — The Small Model King

Gemma is built on the same research as Google’s Gemini but optimized for consumer hardware. The killer feature: all Gemma 3 models include vision capabilities out of the box.

| Model | Params | Vision | Min RAM | Best Use Case |
|---|---|---|---|---|
| Gemma 3 2B | 2B | Yes | 4GB | Mobile, IoT, edge |
| Gemma 3 9B | 9B | Yes | 8GB | Laptop, desktop chat |
| Gemma 3 27B | 27B | Yes | 16GB | Workstation, RAG, coding |

Best for: Developers who need vision + text in one model. The 2B variant is the only credible option for phones and IoT devices. Pre-aligned and instruction-tuned out of the box — minimal prompt engineering needed.
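
Because vision is built in, a single Ollama call can answer questions about a local image — with multimodal models, you include the file path inside the prompt. The model tag here is an assumption; check the Ollama library for the Gemma 3 variant that fits your RAM.

```bash
# Ask a vision-capable model about a local image by putting its path
# in the prompt (tag assumed; pick the Gemma 3 size for your hardware).
ollama run gemma3 "What's going on in this chart? ./revenue-chart.png"
```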

Microsoft Phi — The Researcher’s Pick

Microsoft’s Phi family proves that training data quality can compensate for parameter count. Phi-4 (14B) routinely outperforms models 3-5x its size on reasoning benchmarks.

Best for: Reasoning-heavy tasks, math, structured data analysis. If your use case is “answer complex questions” rather than “generate creative text,” Phi punches hardest per parameter.

The Dark Horse: Qwen 3

Alibaba’s Qwen 3 deserves special mention. Qwen 3 7B posts a HumanEval score of 76.0, the highest of any model under 8B parameters; it even beats the far larger Llama 3.3 70B (72.6) by 3.4 points. Qwen 3 72B scores 83.1 MMLU and 84.2 HumanEval. If coding is your primary use case and you want a small model, Qwen 3 7B is the answer.

Qwen3-30B-A3B also excels in extended context (342K tokens) and achieves 20-22 tokens/sec on 80GB GPUs — strong competition for Llama 4 Scout on long-document tasks.

The Decision Matrix

| Use Case | Best Model | Why |
|---|---|---|
| General chat + assistant | Llama 4 Scout | Best balance of quality and context length |
| Coding on a laptop | Qwen 3 7B | Highest HumanEval in the 7B class |
| Real-time / voice | Mistral 7B or 24B | Lowest latency, best throughput |
| Vision + text | Gemma 3 9B or 27B | Built-in vision, no extra model needed |
| Reasoning / math | Phi-4 14B | Best reasoning per parameter |
| Edge / mobile | Gemma 3 2B | Only credible sub-4GB option |
| Long documents (100K+) | Llama 4 Scout | 10M token context window |
| Budget coding server | Qwen 3 72B | 84.2 HumanEval, enterprise quality |
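
If you want the matrix as something you can drop into a setup script, here is a toy lookup that just restates the table (adapt the echoed names to whatever tags your runtime uses):

```bash
# Map a use case to the recommended model from the decision matrix.
pick_model() {
  case "$1" in
    chat)      echo "Llama 4 Scout" ;;
    coding)    echo "Qwen 3 7B" ;;
    realtime)  echo "Mistral 7B or 24B" ;;
    vision)    echo "Gemma 3 9B or 27B" ;;
    reasoning) echo "Phi-4 14B" ;;
    edge)      echo "Gemma 3 2B" ;;
    longdoc)   echo "Llama 4 Scout" ;;
    *)         echo "unknown use case: $1" >&2; return 1 ;;
  esac
}

pick_model coding    # prints: Qwen 3 7B
```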

Hardware Reality Check

The most common question: “Can I run this on my machine?”

  • 8GB RAM / No GPU: Gemma 2B, Phi-3.5 Mini (3.8B). Expect 2-5 tokens/sec. Usable for testing, not comfortable for daily work.
  • 16GB RAM / M1-M3 Mac: Any 7-8B model (4-bit). 10-20 tokens/sec. Solid daily driver.
  • 24GB VRAM (RTX 4090/5090): Any model up to ~30B (4-bit). 20-40 tokens/sec. Professional quality.
  • 64GB Unified (M2/M3 Max/Ultra): Llama 4 Scout (quantized). 8-15 tokens/sec. The laptop dream.
  • 80GB (H100/A100): Llama 4 Scout (4-bit), any 70B model (8-bit or lower). Production-ready.
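
You can sanity-check any of these tiers with one line of arithmetic: parameters (in billions) times bits per weight, divided by 8, plus roughly 20% for KV cache and runtime overhead. A rough sketch — real usage varies by inference engine and context length:

```bash
# Estimate memory footprint in GB from params (billions) and bits/weight.
est_gb() {
  awk -v p="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", p * bits / 8 * 1.2 }'
}

est_gb 8 4     # Llama 3.1 8B at 4-bit: 4.8 GB (close to the ~4.7 GB below)
est_gb 109 4   # Llama 4 Scout at 4-bit: 65.4 GB, H100-class territory
est_gb 2 4     # Gemma 2B at 4-bit: 1.2 GB
```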

Next lesson: you’ll install Ollama and have a model running in under 5 minutes. No excuses.

Code Examples

model-sizes.sh
```bash
# Quick size check: how much disk/RAM does each model need? (4-bit quantized)
# Gemma 2B:       ~1.5 GB
# Llama 3.1 8B:   ~4.7 GB
# Gemma 9B:       ~5.5 GB
# Phi-4 14B:      ~8.5 GB
# Mistral 24B:    ~14 GB
# Gemma 27B:      ~16 GB
# Llama 3.3 70B:  ~40 GB
# Qwen 3 72B:     ~42 GB
```

Key Takeaways

  • Llama 4 Scout (17B active / 109B total) is the best general-purpose choice: fits on a single H100 (4-bit quantized) with a 10M token context window
  • Mistral leads in speed and latency — best for real-time applications like voice agents or customer-facing bots under 500ms
  • Gemma 3 from Google is the best small model for consumer hardware — 2B/9B/27B sizes with built-in vision capabilities
  • Qwen 3 7B has the highest HumanEval score (76.0) of any sub-8B model — the coding champion for laptops
  • For most developers, the 7-8B parameter tier is the sweet spot: fits in 8GB VRAM with 4-bit quantization
