Skip to content

Running LLMs

Use llama.cpp when you want more control over inference settings and quantization. Use Ollama when you want the simplest local runner with automatic model management.

Quick setup

Tool Install Start OpenAI-compatible endpoint
llama.cpp Build from source or winget install -e --id ggml.llamacpp on Windows llama-server -hf ... http://localhost:8080/v1
Ollama Download from ollama.com or winget install -e --id Ollama.Ollama ollama pull <model> http://localhost:11434/v1

llama.cpp

llama.cpp is a C/C++ runtime for local inference on CPU or GPU.

Flag Description
-hf Stream a model from Hugging Face
--temp Sampling temperature
--top-p Nucleus sampling threshold
--top-k Top-K sampling limit
--min-p Minimum probability threshold
--spec-type Speculative decoding mode (e.g. draft-mtp for multi-token prediction)
--spec-draft-n-max Draft tokens per step
--ctx-size Context window size
--no-mmap Disable memory-mapped loading
--repeat-penalty Repetition penalty
--jinja Enable Jinja chat templates
--n-gpu-layers Layers to offload to GPU

llama-server serves a local web UI and OpenAI-compatible API on port 8080 by default. Add --port <number> to change it.

Ollama

Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1 and downloads models with ollama pull.

Context length

VRAM Default context length
< 24 GiB 4k tokens
24–48 GiB 32k tokens
>= 48 GiB 256k tokens

Override it with OLLAMA_CONTEXT_LENGTH or the app setting slider.

Models

Qwen3.6-35B-A3B

Blog · Hugging Face · Ollama

llama-server -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_M --temp 0.6 --top-p 0.95 --min-p 0.00 --top-k 20 --spec-type draft-mtp --spec-draft-n-max 2 --no-mmap --ctx-size 32768
ollama pull qwen3.6:35b-a3b

Qwen3-Coder-30B-A3B

Blog · Hugging Face · Ollama

llama-server -hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:UD-Q4_K_XL --temp 0.7 --top-p 0.80 --min-p 0.0 --top-k 20 --repeat-penalty 1.05 --jinja --n-gpu-layers 99 --ctx-size 32768
ollama pull qwen3-coder:30b

Gemma 4 E4B-IT

Blog · Hugging Face · Ollama

llama-server -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M --temp 1.0 --top-p 0.95 --top-k 64
ollama pull gemma4:e4b-it-q4_K_M

gpt-oss

Blog · Hugging Face · Ollama

llama-server -hf unsloth/gpt-oss-20b-GGUF:Q4_K_M --temp 1.0 --top-p 1.0 --top-k 0 --jinja --n-gpu-layers 99 --ctx-size 16384
ollama pull gpt-oss:20b

Resources