AI · Intermediate

Docker Model Runner Cheatsheet

Pull and run open-source LLMs locally with an OpenAI-compatible endpoint.

What it is

Docker Model Runner runs open models (Llama, Mistral, Phi, Qwen, etc.) on your machine, with GPU acceleration where available, and exposes an OpenAI-compatible API that your apps can call.

Installation

Enable it in Docker Desktop under Settings → Beta features → 'Docker Model Runner', then verify with `docker model --help` or `docker model status`.

Quick start

docker model pull ai/llama3.2

Download a model from the catalog.

docker model run ai/llama3.2 "Summarize Docker in one sentence."

One-shot inference from the CLI.

docker model list

Show locally available models.

curl http://localhost:12434/engines/v1/models

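Hit the OpenAI-compatible endpoint exposed by Docker Desktop. Calling it from the host requires host-side TCP support to be enabled in the Model Runner settings (default port 12434).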
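The same model listing can be done from application code. This is a minimal sketch, assuming Node 18+ (global `fetch`), host TCP access on port 12434, and a response shaped like OpenAI's `GET /models` list (`{ data: [{ id }] }`); `modelIds` and `listModels` are hypothetical helper names.

```javascript
// Sketch: list local models through the OpenAI-compatible endpoint.
// Assumes Node 18+ (global fetch) and host-side TCP access on port 12434.
const BASE_URL = "http://localhost:12434/engines/v1";

// Extract model ids from an OpenAI-style list response:
// { object: "list", data: [{ id: "ai/llama3.2", ... }, ...] }
function modelIds(body) {
  if (!body || !Array.isArray(body.data)) return [];
  return body.data.map((m) => m.id);
}

// Fetch the list from the running Model Runner and return just the ids.
async function listModels(baseUrl = BASE_URL) {
  const res = await fetch(`${baseUrl}/models`);
  if (!res.ok) throw new Error(`GET /models failed: HTTP ${res.status}`);
  return modelIds(await res.json());
}

// Usage (requires Model Runner to be running):
// listModels().then(console.log).catch(console.error);
```

Keeping the parsing in a separate pure function makes it easy to test without a running model.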

Common commands

Task	Command	Description
Pull a model	`docker model pull <model>`	Fetch from the model catalog.
Run a model	`docker model run <model> "<prompt>"`	Run inference; omit the prompt to start an interactive chat.
List models	`docker model list`	Locally available models.
Remove a model	`docker model rm <model>`	Delete local model weights.
View logs	`docker model logs <model>`	Inference engine logs.
Inspect a model	`docker model inspect <model>`	Metadata: size, quantization, context length.

Useful flags

Flag	Example	Meaning
-i, --interactive	`docker model run -i ai/llama3.2`	Open an interactive chat session.
--gpu	`docker model run --gpu ai/llama3.2 "hi"`	Request GPU acceleration (where supported).
--format	`docker model list --format json`	Machine-readable output.

Real-world examples

Chat with a local model from your app (OpenAI SDK)

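// Requires the openai package (npm install openai) and an ES module
// (e.g. app.mjs) so top-level await works.
import OpenAI from "openai";

// Point the SDK at Docker Model Runner instead of api.openai.com.
const client = new OpenAI({
  baseURL: "http://localhost:12434/engines/v1",
  apiKey: "not-needed", // placeholder: the SDK requires a key, the local endpoint doesn't
});

// Send a single-turn chat request to a locally pulled model.
const res = await client.chat.completions.create({
  model: "ai/llama3.2",
  messages: [{ role: "user", content: "Hello!" }],
});

// Print the assistant's reply.
console.log(res.choices[0].message.content);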
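If you'd rather not add the SDK dependency, the same request can be sent with plain `fetch`. This is a minimal sketch, assuming Node 18+ and that the endpoint accepts the standard OpenAI `POST /chat/completions` body; `buildChatBody`, `replyText`, and `chat` are illustrative names, not part of any Docker API.

```javascript
// Sketch: single-turn chat against Model Runner without the OpenAI SDK.
// Assumes Node 18+ (global fetch) and host-side TCP access on port 12434.
const BASE_URL = "http://localhost:12434/engines/v1";

// Build the JSON body for a one-message chat request.
function buildChatBody(model, prompt) {
  return {
    model,
    messages: [{ role: "user", content: prompt }],
  };
}

// Pull the assistant's reply out of an OpenAI-style response ("" if absent).
function replyText(response) {
  return response?.choices?.[0]?.message?.content ?? "";
}

// POST the request and return the reply text.
async function chat(model, prompt, baseUrl = BASE_URL) {
  const res = await fetch(`${baseUrl}/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildChatBody(model, prompt)),
  });
  if (!res.ok) throw new Error(`chat failed: HTTP ${res.status}`);
  return replyText(await res.json());
}

// Usage: chat("ai/llama3.2", "Hello!").then(console.log);
```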

Pull a small model and run a quick prompt

docker model pull ai/phi3.5
docker model run ai/phi3.5 "Write a haiku about containers."

Reach the endpoint from another container (Docker Desktop)

curl http://model-runner.docker.internal/engines/v1/models
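Code that runs both on the host and inside a container can pick the base URL at startup. A sketch under stated assumptions: `MODEL_RUNNER_URL` and `modelRunnerBaseUrl` are hypothetical names for this example (not Docker settings), and deciding whether the code is in a container is left to the caller (checking for `/.dockerenv` is a common but unofficial heuristic).

```javascript
// Sketch: choose the Model Runner base URL for host vs. container.
// MODEL_RUNNER_URL is a hypothetical override variable, not a Docker setting.
const HOST_URL = "http://localhost:12434/engines/v1";
const CONTAINER_URL = "http://model-runner.docker.internal/engines/v1";

function modelRunnerBaseUrl(env = process.env, inContainer = false) {
  // An explicit override always wins.
  if (env.MODEL_RUNNER_URL) return env.MODEL_RUNNER_URL;
  // Inside a container, localhost is the container itself, so use the
  // Docker Desktop hostname instead.
  return inContainer ? CONTAINER_URL : HOST_URL;
}
```

Pass the result as `baseURL` to the OpenAI client from the SDK example above.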

Best practices

Start with small models (3B to 7B parameters) before pulling 70B weights; large models quickly eat disk space and RAM.
Use the OpenAI-compatible endpoint so your app code stays portable.
Remove unused models periodically with `docker model rm` to reclaim disk.
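A rough rule of thumb shows why size matters: weight storage is about parameter count × bits per weight ÷ 8 bytes. This sketch ignores the KV cache, context length, and runtime overhead, and real quantized formats (for example Q4_K_M) average somewhat more than 4 bits per weight, so treat the results as lower bounds.

```javascript
// Rough weight size in decimal GB: (params in billions) × bits ÷ 8.
// 1e9 params × (bits / 8) bytes = (bits / 8) GB per billion parameters.
// Ignores KV cache, context length, and runtime overhead.
function approxWeightGB(paramsBillions, bitsPerWeight) {
  return (paramsBillions * bitsPerWeight) / 8;
}

for (const [name, params] of [["3B", 3], ["7B", 7], ["70B", 70]]) {
  console.log(
    `${name}: ~${approxWeightGB(params, 4).toFixed(1)} GB at 4-bit, ` +
      `~${approxWeightGB(params, 16).toFixed(1)} GB at 16-bit`
  );
}
```

At 4-bit, that works out to about 1.5 GB for a 3B model versus about 35 GB for a 70B model, before any overhead.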

Troubleshooting

`docker model: command not found`

Enable Docker Model Runner in Desktop settings and restart.

Inference is very slow

Make sure GPU support is enabled in Docker Desktop, and try a smaller or more heavily quantized variant.
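To compare variants, it helps to measure throughput rather than eyeball it. A minimal sketch, assuming Node 18+ (global `fetch` and `performance`), host TCP access on port 12434, and an OpenAI-style response that includes `usage.completion_tokens`; `tokensPerSecond` and `benchmark` are hypothetical helper names.

```javascript
// Sketch: time one chat completion and estimate generation throughput.
// Assumes an OpenAI-style response with usage.completion_tokens.
const BASE_URL = "http://localhost:12434/engines/v1";

// Tokens generated per second of wall-clock time.
function tokensPerSecond(completionTokens, elapsedMs) {
  if (!(elapsedMs > 0)) return 0; // avoid division by zero
  return completionTokens / (elapsedMs / 1000);
}

async function benchmark(model, prompt, baseUrl = BASE_URL) {
  const start = performance.now();
  const res = await fetch(`${baseUrl}/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
  });
  if (!res.ok) throw new Error(`chat failed: HTTP ${res.status}`);
  const body = await res.json();
  // Wall-clock time includes prompt processing and any model loading.
  const elapsedMs = performance.now() - start;
  const tokens = body.usage?.completion_tokens ?? 0;
  return { tokens, elapsedMs, tps: tokensPerSecond(tokens, elapsedMs) };
}

// Usage: benchmark("ai/llama3.2", "Write a haiku about containers.").then(console.log);
```

Run it more than once per model: the first request may include loading the model into memory, which makes it look slower than steady-state generation.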

Port 12434 unreachable from a container

Inside a container, `localhost` refers to the container itself. Use the Docker Desktop hostname `model-runner.docker.internal` instead of localhost.

Official Docker Docs references

Last reviewed: 2026-06-15