Running QwenVL (qwen vision) locally with docker
TLDR
Qwen’s vision models are the best open-source vision models available, but they’re tricky to run locally because they’re not supported by Ollama or llama.cpp. If you have a Linux box with an NVIDIA GPU, you can run qwen-vl directly by doing:
docker run \
-p "9192:9192" \
--gpus=all \
--shm-size=8gb \
ghcr.io/nikvdp/qwen-vl:7b
This will download and serve the 7b variant of Qwen vision with a (mostly) OpenAI-compatible API at localhost:9192. For more background, and for info on other variants and building your own docker images, read on.
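Since the API is (mostly) OpenAI compatible, a request in the standard chat-completions format should work for a quick test. The endpoint path, model name, and image-URL message format below are assumptions based on the usual OpenAI layout, so adjust as needed:
curl http://localhost:9192/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-VL-7B-Instruct",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": "https://example.com/some-image.jpg"}}
      ]
    }]
  }'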
Rolling your own
Even though all the major AI providers now have vision models, if you’re looking for one that you can run at home (perhaps for building your own web browsing agent à la OpenAI’s Operator) you may find that there are far fewer choices.
At the time of writing (assuming you don’t have access to a data center full of H100s to run the big boys) Llama’s vision series and Qwen’s qwen-vl model seem to be the best options. According to this leaderboard, Qwen’s vision models do better, but (much to my chagrin) they aren’t available on Ollama!
Apparently this is because they’re built on a sufficiently different architecture that they can’t be run through llama.cpp (which is what Ollama uses under the hood) without significant modifications, which was disappointing.
After some investigating I eventually came across this repo, which uses Hugging Face’s transformers library to serve qwen-vl through an OpenAI-compatible API (mostly: the /models endpoint isn’t implemented and it doesn’t support streaming, but it gets the job done for basic use).
That sounded ideal, but the repo has a few quirks that didn’t suit my needs. First and foremost, there was no simple way to run the model: you need to clone the repo, then run some commands to build the docker images and download the model yourself, after which it saves the actual model files into a volume on your local machine. The repo also doesn’t seem to have an easy way to run the 3b or 72b variants of qwen-vl (Qwen provides 3 variants of their vision series: 3b, 7b, and 72b).
So, I forked it into a version that can easily build all three variants (by setting a docker build arg) and lets you upload the entire image (including the model) to a docker registry so that it can be easily re-downloaded and run again later.
The 72b version is too big to run on my 3090 (and its docker image is too big to upload to GitHub’s container registry without timing out), but the 7b and 3b variants run nicely this way.
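If you build your own image and want the same convenience, pushing it to a registry is the usual docker tag-and-push routine; roughly something like this (the username is a placeholder, and the local image name is whatever docker-compose produces, as described below):
docker tag qwen25-vl-inference-openai-qwen-vl-api ghcr.io/<your-username>/qwen-vl:7b
docker push ghcr.io/<your-username>/qwen-vl:7b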
I’ve uploaded the images I built with my fork of the repo to GitHub’s container registry (ghcr), so if you just want to run the model, try this:
docker run \
-p "9192:9192" \
--gpus=all \
--shm-size=8gb \
ghcr.io/nikvdp/qwen-vl:7b
(or for the 3b version, replace the qwen-vl:7b at the end with qwen-vl:3b)
If you’ve got powerful enough hardware to run the 72b variant, you can build the 72b image by doing:
export QWEN_MODEL="Qwen2.5-VL-72B-Instruct"
docker-compose build
That will cause docker-compose to build and save the 72b image as qwen25-vl-inference-openai-qwen-vl-api (this will take a while). Once it’s finished you can run it by doing docker-compose up, or to run the image with the docker cli directly do:
docker run \
-p "9192:9192" \
--gpus=all \
--shm-size=8gb \
qwen25-vl-inference-openai-qwen-vl-api
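The smaller variants can presumably be built the same way by pointing QWEN_MODEL at the matching model name before running docker-compose build (I’m assuming here that the build arg accepts any of the official Qwen2.5-VL model names):
export QWEN_MODEL="Qwen2.5-VL-7B-Instruct" # or Qwen2.5-VL-3B-Instruct
docker-compose build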