Lesson 2 of 6 · 15 min
Llama, Mistral, Gemma, Phi — Which Model for What?
Stop Guessing. Here’s the Actual Data.
Picking an open source model feels like choosing a restaurant in a foreign city — too many options, not enough context, and the reviews are all written by the chef. Let’s fix that with real numbers and specific recommendations.
The Big Four Model Families
Meta Llama 4 — The All-Rounder
Meta’s Llama 4 is the default recommendation for most use cases. The family uses a Mixture-of-Experts (MoE) architecture: only a small fraction of the total weights is activated for each token, so you get huge model capacity without a matching jump in per-token compute.
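The MoE trade-off can be shown with a back-of-the-envelope calculation using the published figures from the table below. This is an illustrative sketch only; the real expert-routing logic is far more involved.

```python
# Sketch: what fraction of a model's weights does one forward pass touch?
# Figures are the published totals for Llama 4 Scout and Llama 3.3 70B.

def moe_active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Fraction of weights that participate in any single forward pass."""
    return active_params_b / total_params_b

scout = moe_active_fraction(109, 17)  # Llama 4 Scout: 17B active of 109B total
dense = moe_active_fraction(70, 70)   # Llama 3.3 70B: dense, everything active

print(f"Scout activates {scout:.0%} of its weights per token")  # 16%
print(f"A dense 70B activates {dense:.0%}")                     # 100%
```

So Scout pays roughly the per-token compute of a 17B model while storing 109B parameters of capacity — which is also why its memory footprint, not its speed, is the constraint.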
| Model | Total Params | Active Params | Context Window | MMLU | HumanEval |
|---|---|---|---|---|---|
| Llama 4 Scout | 109B (16 experts) | 17B | 10M tokens | ~85 | ~80 |
| Llama 4 Maverick | 400B (128 experts) | 17B | 1M tokens | ~89 | ~85 |
| Llama 3.3 70B | 70B (dense) | 70B | 128K tokens | 86.0 | 72.6 |
| Llama 3.1 8B | 8B (dense) | 8B | 128K tokens | 69.4 | 62.5 |
Best for: General-purpose chat, long-context tasks (research, document analysis), RAG pipelines. The 10M context window on Scout is unmatched in the open source world.
Hardware: Scout fits on a single H100 (80GB) with 4-bit quantization; at full precision its 109B total weights need multiple GPUs, and even quantized it is too large for a 24GB RTX 4090. The 8B variant runs on any modern laptop with 8GB+ RAM.
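A quick way to sanity-check these hardware claims yourself: weight memory is roughly parameters × bits ÷ 8, plus some overhead for the KV cache, activations, and the runtime. The 15% overhead factor below is a heuristic assumption, not a measured figure.

```python
# Rough memory estimate for a quantized model.
# overhead=1.15 is an assumed ~15% for KV cache, activations, and runtime.

def est_vram_gb(params_b: float, bits: int = 4, overhead: float = 1.15) -> float:
    """Approximate GB needed to hold params_b billion parameters at `bits` precision."""
    return params_b * bits / 8 * overhead

for name, params in [("Llama 3.1 8B", 8), ("Llama 3.3 70B", 70), ("Llama 4 Scout", 109)]:
    print(f"{name}: ~{est_vram_gb(params):.0f} GB at 4-bit")
```

The estimates line up with the sizes in the Code Examples section (8B ≈ 4.6 GB, 70B ≈ 40 GB), and they show why Scout (~63 GB at 4-bit) fits an 80GB H100 but not a 24GB card.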
Mistral — The Speed Demon
Mistral AI from Paris builds models that consistently punch above their weight on speed. Their secret: aggressive optimization for inference throughput.
| Model | Total Params | Active Params | Context | MMLU | Throughput Edge |
|---|---|---|---|---|---|
| Mistral Large 3 | 675B (MoE) | 41B | 128K | ~88 | Highest in class |
| Mistral 24B (Small 3) | 24B | 24B | 32K | ~81 | 4 tok/s on mini PCs |
| Mistral 7B v0.3 | 7B | 7B | 32K | 64.1 | Best tok/$ ratio |
Best for: Latency-sensitive applications. Voice agents, trading assistants, customer-facing chatbots where you need sub-500ms response times. Mistral’s low memory consumption also makes it ideal for edge computing.
Hardware: Mistral 7B runs on 6GB VRAM (4-bit). The 24B model is the sweet spot for mini PCs and edge devices.
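When a sub-500ms target matters, it helps to budget it explicitly: total response time is time-to-first-token (TTFT) plus generation time for the tokens you need. The TTFT and throughput figures below are hypothetical, for illustration only.

```python
# Sketch: does a model meet a latency budget for a short reply?
# ttft_ms and tokens_per_sec here are assumed example values.

def response_ms(n_tokens: int, ttft_ms: float, tokens_per_sec: float) -> float:
    """Total time to emit n_tokens: time-to-first-token plus generation time."""
    return ttft_ms + (n_tokens / tokens_per_sec) * 1000

# A 20-token voice-agent reply at 80 tok/s with 150 ms TTFT:
total = response_ms(20, ttft_ms=150, tokens_per_sec=80)
print(f"{total:.0f} ms")  # 400 ms -> inside a 500 ms budget
```

The same math shows why throughput alone is misleading: a model with great tok/s but a slow prompt-processing phase (high TTFT) can still blow a real-time budget.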
Google Gemma 3 — The Small Model King
Gemma is built on the same research as Google’s Gemini but optimized for consumer hardware. The killer feature: all Gemma 3 models include vision capabilities out of the box.
| Model | Params | Vision | Min RAM | Best Use Case |
|---|---|---|---|---|
| Gemma 3 2B | 2B | Yes | 4GB | Mobile, IoT, edge |
| Gemma 3 9B | 9B | Yes | 8GB | Laptop, desktop chat |
| Gemma 3 27B | 27B | Yes | 16GB | Workstation, RAG, coding |
Best for: Developers who need vision + text in one model. The 2B variant is the only credible option for phones and IoT devices. Pre-aligned and instruction-tuned out of the box — minimal prompt engineering needed.
Microsoft Phi — The Researcher’s Pick
Microsoft’s Phi family proves that training data quality can compensate for parameter count. Phi-4 (14B) routinely outperforms models 3-5x its size on reasoning benchmarks.
Best for: Reasoning-heavy tasks, math, structured data analysis. If your use case is “answer complex questions” rather than “generate creative text,” Phi punches hardest per parameter.
The Dark Horse: Qwen 3
Alibaba’s Qwen 3 deserves special mention. Qwen 3 7B posts a HumanEval score of 76.0 — the highest of any model under 8B parameters, and 3.4 points ahead of even the far larger Llama 3.3 70B (72.6). Qwen 3 72B scores 83.1 MMLU and 84.2 HumanEval. If coding is your primary use case and you want a small model, Qwen 3 7B is the answer.
Qwen3-30B-A3B also excels in extended context (342K tokens) and achieves 20-22 tokens/sec on 80GB GPUs — strong competition for Llama 4 Scout on long-document tasks.
The Decision Matrix
| Use Case | Best Model | Why |
|---|---|---|
| General chat + assistant | Llama 4 Scout | Best balance of quality and context length |
| Coding on a laptop | Qwen 3 7B | Highest HumanEval in the 7B class |
| Real-time / voice | Mistral 7B or 24B | Lowest latency, best throughput |
| Vision + text | Gemma 3 9B or 27B | Built-in vision, no extra model needed |
| Reasoning / math | Phi-4 14B | Best reasoning per parameter |
| Edge / mobile | Gemma 3 2B | Only credible sub-4GB option |
| Long documents (100K+) | Llama 4 Scout | 10M token context window |
| Budget coding server | Qwen 3 72B | 84.2 HumanEval, enterprise quality |
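If you want this matrix in your own tooling, it reduces to a trivial lookup. The keys and default below are simply the table rows restated, not an official mapping.

```python
# Minimal lookup mirroring the decision matrix above.
RECOMMENDATIONS = {
    "general chat": "Llama 4 Scout",
    "laptop coding": "Qwen 3 7B",
    "real-time / voice": "Mistral 7B",
    "vision + text": "Gemma 3 27B",
    "reasoning / math": "Phi-4 14B",
    "edge / mobile": "Gemma 3 2B",
    "long documents": "Llama 4 Scout",
    "budget coding server": "Qwen 3 72B",
}

def pick_model(use_case: str) -> str:
    # Llama 4 Scout is the article's general-purpose default.
    return RECOMMENDATIONS.get(use_case.lower(), "Llama 4 Scout")

print(pick_model("laptop coding"))  # Qwen 3 7B
```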
Hardware Reality Check
The most common question: “Can I run this on my machine?”
- 8GB RAM / No GPU: Gemma 2B, Phi-3.5 Mini (3.8B). Expect 2-5 tokens/sec. Usable for testing, not comfortable for daily work.
- 16GB RAM / M1-M3 Mac: Any 7-8B model (4-bit). 10-20 tokens/sec. Solid daily driver.
- 24GB VRAM (RTX 4090/5090): Any model up to ~30B (4-bit). 20-40 tokens/sec. Professional quality.
- 64GB Unified (M2/M3 Max/Ultra): Llama 4 Scout (quantized). 8-15 tokens/sec. The laptop dream.
- 80GB (H100/A100): Llama 4 Scout full precision, any 70B model. Production-ready.
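You can turn these tiers into a quick programmatic check using the approximate 4-bit sizes from the Code Examples section below. The 80% headroom factor is an assumption — the OS, KV cache, and other apps need the rest — so treat the output as a rough filter, not a guarantee.

```python
# Approximate 4-bit sizes in GB, copied from the Code Examples list.
MODEL_SIZE_GB = {
    "Gemma 2B": 1.5, "Llama 3.1 8B": 4.7, "Gemma 9B": 5.5,
    "Phi-4 14B": 8.5, "Mistral 24B": 14, "Gemma 27B": 16,
    "Llama 3.3 70B": 40, "Qwen 3 72B": 42,
}

def runnable(memory_gb: float, headroom: float = 0.8) -> list[str]:
    """Models whose 4-bit weights fit in ~80% of available memory (assumed headroom)."""
    return [m for m, gb in MODEL_SIZE_GB.items() if gb <= memory_gb * headroom]

print(runnable(8))   # ['Gemma 2B', 'Llama 3.1 8B', 'Gemma 9B']
print(runnable(24))  # everything up to Gemma 27B
```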
Next lesson: you’ll install Ollama and have a model running in under 5 minutes. No excuses.
Code Examples
# Quick size check: how much disk/RAM does each model need? (4-bit quantized)
# Gemma 2B: ~1.5 GB
# Llama 3.1 8B: ~4.7 GB
# Gemma 9B: ~5.5 GB
# Phi-4 14B: ~8.5 GB
# Mistral 24B: ~14 GB
# Gemma 27B: ~16 GB
# Llama 3.3 70B: ~40 GB
# Qwen 3 72B: ~42 GB
Key Takeaways
- Llama 4 Scout (17B active / 109B total) is the best general-purpose choice — fits on a single H100 with a 10M token context window
- Mistral leads in speed and latency — best for real-time applications like voice agents or customer-facing bots under 500ms
- Gemma 3 from Google is the best small model for consumer hardware — 2B/9B/27B sizes with built-in vision capabilities
- Qwen 3 7B has the highest HumanEval score (76.0) of any sub-8B model — the coding champion for laptops
- For most developers, the 7-8B parameter tier is the sweet spot: fits in 8GB VRAM with 4-bit quantization