Running LLMs
Use llama.cpp when you want more control over inference settings and quantization. Use Ollama when you want the simplest local runner with automatic model management.
Quick setup
| Tool | Install | Start | OpenAI-compatible endpoint |
|---|---|---|---|
| llama.cpp | Build from source or winget install -e --id ggml.llamacpp on Windows |
llama-server -hf ... |
http://localhost:8080/v1 |
| Ollama | Download from ollama.com or winget install -e --id Ollama.Ollama |
ollama pull <model> |
http://localhost:11434/v1 |
llama.cpp
llama.cpp is a C/C++ runtime for local inference on CPU or GPU.
| Flag | Description |
|---|---|
-hf |
Stream a model from Hugging Face |
--temp |
Sampling temperature |
--top-p |
Nucleus sampling threshold |
--top-k |
Top-K sampling limit |
--min-p |
Minimum probability threshold |
--spec-type |
Speculative decoding mode (e.g. draft-mtp for multi-token prediction) |
--spec-draft-n-max |
Draft tokens per step |
--ctx-size |
Context window size |
--no-mmap |
Disable memory-mapped loading |
--repeat-penalty |
Repetition penalty |
--jinja |
Enable Jinja chat templates |
--n-gpu-layers |
Layers to offload to GPU |
llama-server serves a local web UI and OpenAI-compatible API on port 8080 by default. Add --port <number> to change it.
Ollama
Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1 and downloads models with ollama pull.
Context length
| VRAM | Default context length |
|---|---|
< 24 GiB |
4k tokens |
24–48 GiB |
32k tokens |
>= 48 GiB |
256k tokens |
Override it with OLLAMA_CONTEXT_LENGTH or the app setting slider.
Models
Qwen3.6-35B-A3B
Blog · Hugging Face · Ollama
llama-server -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_M --temp 0.6 --top-p 0.95 --min-p 0.00 --top-k 20 --spec-type draft-mtp --spec-draft-n-max 2 --no-mmap --ctx-size 32768
ollama pull qwen3.6:35b-a3b
Qwen3-Coder-30B-A3B
Blog · Hugging Face · Ollama
llama-server -hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:UD-Q4_K_XL --temp 0.7 --top-p 0.80 --min-p 0.0 --top-k 20 --repeat-penalty 1.05 --jinja --n-gpu-layers 99 --ctx-size 32768
ollama pull qwen3-coder:30b
Gemma 4 E4B-IT
Blog · Hugging Face · Ollama
llama-server -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M --temp 1.0 --top-p 0.95 --top-k 64
ollama pull gemma4:e4b-it-q4_K_M
gpt-oss
Blog · Hugging Face · Ollama
llama-server -hf unsloth/gpt-oss-20b-GGUF:Q4_K_M --temp 1.0 --top-p 1.0 --top-k 0 --jinja --n-gpu-layers 99 --ctx-size 16384
ollama pull gpt-oss:20b