Running LLMs

Use llama.cpp when you want more control over inference settings and quantization. Use Ollama when you want the simplest local runner with automatic model management.

Quick setup

Tool	Install	Start	OpenAI-compatible endpoint
llama.cpp	Build from source or `winget install -e --id ggml.llamacpp` on Windows	`llama-server -hf ...`	`http://localhost:8080/v1`
Ollama	Download from ollama.com or `winget install -e --id Ollama.Ollama`	`ollama pull <model>`	`http://localhost:11434/v1`

llama.cpp

llama.cpp is a C/C++ runtime for local inference on CPU or GPU.

Flag	Description
`-hf`	Stream a model from Hugging Face
`--temp`	Sampling temperature
`--top-p`	Nucleus sampling threshold
`--top-k`	Top-K sampling limit
`--min-p`	Minimum probability threshold
`--spec-type`	Speculative decoding mode (e.g. `draft-mtp` for multi-token prediction)
`--spec-draft-n-max`	Draft tokens per step
`--ctx-size`	Context window size
`--no-mmap`	Disable memory-mapped loading
`--repeat-penalty`	Repetition penalty
`--jinja`	Enable Jinja chat templates
`--n-gpu-layers`	Layers to offload to GPU

llama-server serves a local web UI and OpenAI-compatible API on port 8080 by default. Add --port <number> to change it.

Ollama

Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1 and downloads models with ollama pull.

Context length

VRAM	Default context length
`< 24 GiB`	4k tokens
`24–48 GiB`	32k tokens
`>= 48 GiB`	256k tokens

Override it with OLLAMA_CONTEXT_LENGTH or the app setting slider.

Models

Qwen3.6-35B-A3B

Blog · Hugging Face · Ollama

llama-server -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_M --temp 0.6 --top-p 0.95 --min-p 0.00 --top-k 20 --spec-type draft-mtp --spec-draft-n-max 2 --no-mmap --ctx-size 32768

ollama pull qwen3.6:35b-a3b

Qwen3-Coder-30B-A3B

Blog · Hugging Face · Ollama

llama-server -hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:UD-Q4_K_XL --temp 0.7 --top-p 0.80 --min-p 0.0 --top-k 20 --repeat-penalty 1.05 --jinja --n-gpu-layers 99 --ctx-size 32768

ollama pull qwen3-coder:30b

Gemma 4 E4B-IT

Blog · Hugging Face · Ollama

llama-server -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M --temp 1.0 --top-p 0.95 --top-k 64

ollama pull gemma4:e4b-it-q4_K_M

gpt-oss

Blog · Hugging Face · Ollama

llama-server -hf unsloth/gpt-oss-20b-GGUF:Q4_K_M --temp 1.0 --top-p 1.0 --top-k 0 --jinja --n-gpu-layers 99 --ctx-size 16384

ollama pull gpt-oss:20b

Running LLMs

Quick setup

llama.cpp

Ollama

Context length

Models

Qwen3.6-35B-A3B

Qwen3-Coder-30B-A3B

Gemma 4 E4B-IT

gpt-oss

Resources