{"id":301,"date":"2026-04-18T11:27:51","date_gmt":"2026-04-18T11:27:51","guid":{"rendered":"https:\/\/blog.ngocha.biz\/?p=301"},"modified":"2026-04-18T11:27:51","modified_gmt":"2026-04-18T11:27:51","slug":"deploying-llama-with-docker-and-vllm","status":"publish","type":"post","link":"https:\/\/blog.ngocha.biz\/?p=301","title":{"rendered":"How to Deploy Llama 3 with Docker and vLLM (Detailed Guide)"},"content":{"rendered":"<p>Managing GPU resources is a core challenge in <a href=\"https:\/\/devopscube.com\/devops-to-mlops\/\" rel=\"noreferrer\">MLOps<\/a>. In a traditional web service, when you run out of CPU, RAM, or disk, you can add more capacity or swap the hardware cheaply. <\/p>\n<p>But <strong>VRAM on a GPU<\/strong> is different: you cannot simply add more, because each card has a fixed capacity. The model either fits or it doesn&#8217;t.<\/p>\n<p>A Llama 3 8B model in FP16 needs about 16 GB for its weights alone. On a 20 GB card, that leaves almost no room for anything else unless you tune the deployment.<\/p>\n<p>Here is what this blog will cover.<\/p>\n<ul>\n<li>GPU environment setup (Docker + NVIDIA runtime)<\/li>\n<li>Model deployment with Llama 3 &amp; vLLM<\/li>\n<li>Three optimization experiments (memory tuning, FP8 quantization, and continuous batching)<\/li>\n<\/ul>\n<div class=\"kg-card kg-callout-card kg-callout-card-blue\">\n<div class=\"kg-callout-text\"><b><strong style=\"white-space: pre-wrap;\">End result:<\/strong><\/b> 2.2\u00d7 throughput (1.62 to 3.64 req\/s), 3\u00d7 faster time to first token, and 20+ concurrent users instead of 5 on the same hardware.<\/div>\n<\/div>\n<p>Let&#8217;s get started.<\/p>\n<h2 id=\"prerequisites\">Prerequisites<\/h2>\n<p>The following are the prerequisites to follow this guide.<\/p>\n<ul>\n<li>A Linux VM with an NVIDIA GPU attached<\/li>\n<li><a href=\"https:\/\/devopscube.com\/how-to-install-and-configure-docker\/\" rel=\"noreferrer\">Docker installed on the VM<\/a><\/li>\n<\/ul>\n<p>The GPU used in this guide is an NVIDIA RTX 4000 Ada provisioned on <a 
href=\"https:\/\/devopscube.com\/get-free-digital-ocean-credits\/\" rel=\"noreferrer\">DigitalOcean<\/a>.<\/p>\n<figure class=\"kg-card kg-image-card\"><img decoding=\"async\" src=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/04\/image-40.png\" class=\"kg-image\" alt=\"\" loading=\"lazy\" width=\"1322\" height=\"594\" srcset=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w600\/2026\/04\/image-40.png 600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1000\/2026\/04\/image-40.png 1000w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/04\/image-40.png 1322w\" sizes=\"auto, (min-width: 720px) 720px\"><\/figure>\n<h2 id=\"environment-setup\">Environment setup<\/h2>\n<p>Start by connecting over SSH to your <a href=\"https:\/\/devopscube.com\/kubernetes-cluster-vagrant\/\" rel=\"noreferrer\">VM<\/a>. 
Run the following command to confirm the driver is visible:<\/p>\n<pre><code class=\"language-bash\">nvidia-smi<\/code><\/pre>\n<p>You will see a table like the one below.<\/p>\n<figure class=\"kg-card kg-image-card kg-card-hascaption\"><img decoding=\"async\" src=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/03\/image-73.png\" class=\"kg-image\" alt=\"\" loading=\"lazy\" width=\"936\" height=\"428\" srcset=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w600\/2026\/03\/image-73.png 600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/03\/image-73.png 936w\" sizes=\"auto, (min-width: 720px) 720px\"><figcaption><span style=\"white-space: pre-wrap;\">The driver is visible.<\/span><\/figcaption><\/figure>\n<p>On my machine it showed the RTX 4000 Ada with 20,475 MiB of total VRAM, which means the GPU is visible.<\/p>\n<div class=\"kg-card kg-callout-card kg-callout-card-blue\">\n<div class=\"kg-callout-emoji\">\ud83d\udca1<\/div>\n<div class=\"kg-callout-text\">On most GPU VMs from cloud platforms like DigitalOcean, AWS, etc., the driver is pre-installed, so you can see your GPU details from the start.<\/div>\n<\/div>\n<p>If you don&#8217;t see your GPU details, you need to fix or install the driver before doing anything else. Refer to the <a href=\"https:\/\/docs.nvidia.com\/datacenter\/cloud-native\/container-toolkit\/latest\/install-guide.html?ref=devopscube.com\" rel=\"noreferrer\">NVIDIA Container Toolkit<\/a> page for installation steps.<\/p>\n<h2 id=\"deploying-the-first-model\">Deploying the first model<\/h2>\n<p>I used <a href=\"https:\/\/docs.vllm.ai\/en\/latest\/?ref=devopscube.com\" rel=\"noreferrer\"><strong>vLLM<\/strong><\/a> because it implements paged attention. 
Inference servers such as <a href=\"https:\/\/ollama.com\/?ref=devopscube.com\" rel=\"noreferrer\">Ollama<\/a> and <strong><code>llama.cpp<\/code><\/strong> typically preallocate the KV cache as one contiguous block of VRAM, and much of that block can sit empty. <\/p>\n<p>Paged attention breaks the cache into pages and allocates them only as needed. In a case like this one, where the model already takes up 16 GB of the 20 GB of VRAM available, that matters.<\/p>\n<p>Let&#8217;s get started with the setup.<\/p>\n<h3 id=\"step-1-configure-nvidia-as-the-default-docker-runtime\">Step 1: Configure NVIDIA as the Default Docker Runtime<\/h3>\n<p>When Docker is freshly installed, its default runtime is <code>runc<\/code>, not <code>nvidia<\/code>. vLLM needs the NVIDIA runtime to access the GPU inside the container.<\/p>\n<pre><code class=\"language-bash\">nvidia-ctk runtime configure --runtime=docker<\/code><\/pre>\n<p>This creates or updates <code>\/etc\/docker\/daemon.json<\/code>. Open it:<\/p>\n<pre><code class=\"language-bash\">nano \/etc\/docker\/daemon.json<\/code><\/pre>\n<p>Add the <code>default-runtime<\/code> line so the file looks like this:<\/p>\n<pre><code class=\"language-json\">{\n    \"default-runtime\": \"nvidia\",\n    \"runtimes\": {\n        \"nvidia\": {\n            \"args\": [],\n            \"path\": \"nvidia-container-runtime\"\n        }\n    }\n}<\/code><\/pre>\n<p>Save, then restart Docker:<\/p>\n<pre><code class=\"language-bash\">systemctl restart docker<\/code><\/pre>\n<p>Verify:<\/p>\n<pre><code class=\"language-bash\">docker info | grep -i runtime<\/code><\/pre>\n<h3 id=\"step-2-create-a-directory-to-store-model-weights\">Step 2: Create a Directory to Store Model Weights<\/h3>\n<p>The Llama 3 8B weights are over 15 GB. 
Without a persistent directory on the host, every new container re-downloads the entire model.<\/p>\n<pre><code class=\"language-bash\">mkdir -p \/opt\/models<\/code><\/pre>\n<p>This directory will be mounted into the container, so the weights are downloaded once and reused on every restart.<\/p>\n<h3 id=\"step-3-get-a-huggingface-token\">Step 3: Get a HuggingFace Token<\/h3>\n<p>vLLM pulls the Llama 3 weights directly from HuggingFace on first startup. Llama 3 is a gated model: Meta requires you to agree to its license before you can download it. You will need a HuggingFace account and an access token.<\/p>\n<p>Go to huggingface.co and sign in or create an account.<\/p>\n<p>Click your profile -&gt; Settings -&gt; Access Tokens, then click New token.<\/p>\n<figure class=\"kg-card kg-image-card\"><img decoding=\"async\" src=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/04\/image-49.png\" class=\"kg-image\" alt=\"creating huggingface access token\" loading=\"lazy\" width=\"1532\" height=\"734\" srcset=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w600\/2026\/04\/image-49.png 600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1000\/2026\/04\/image-49.png 1000w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/04\/image-49.png 1532w\" sizes=\"auto, (min-width: 720px) 720px\"><\/figure>\n<p>Give it a name (e.g., <a href=\"https:\/\/devopscube.com\/deploying-llama-with-docker-and-vllm\/\" rel=\"noreferrer\">llama<\/a>) and, under permissions, enable only &#8220;Read access to contents of all public gated repos&#8221;.<\/p>\n<p>Scroll down and click Generate token. 
HuggingFace then displays the token. Keep it safe; we will use it during model deployment.<\/p>\n<figure class=\"kg-card kg-image-card\"><img decoding=\"async\" src=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/04\/image-52.png\" class=\"kg-image\" alt=\"creating huggingface access token\" loading=\"lazy\" width=\"1486\" height=\"958\" srcset=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w600\/2026\/04\/image-52.png 600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1000\/2026\/04\/image-52.png 1000w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/04\/image-52.png 1486w\" sizes=\"auto, (min-width: 720px) 720px\"><\/figure>\n<p>Once created, you can see your token as shown below.<\/p>\n<figure class=\"kg-card kg-image-card\"><img decoding=\"async\" src=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/04\/image-51.png\" class=\"kg-image\" alt=\"created huggingface access token\" loading=\"lazy\" width=\"1566\" height=\"732\" srcset=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w600\/2026\/04\/image-51.png 600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1000\/2026\/04\/image-51.png 1000w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/04\/image-51.png 1566w\" sizes=\"auto, (min-width: 720px) 720px\"><\/figure>\n<p>Set it on your VM as an environment variable.<\/p>\n<pre><code class=\"language-bash\">export HF_TOKEN=hf_your_token_here<\/code><\/pre>\n<h3 id=\"step-4-request-access-to-llama-3-on-huggingface\">Step 4: Request Access to Llama 3 on HuggingFace<\/h3>\n<p>Even with a valid token, the model is gated. 
You must manually accept Meta&#8217;s license, or the download will fail.<\/p>\n<p>Visit <a href=\"https:\/\/huggingface.co\/meta-llama\/Meta-Llama-3-8B-Instruct?ref=devopscube.com\">Meta-Llama-3-8B-Instruct<\/a> page.<\/p>\n<p>Then, click on the &#8220;Expand to review and access&#8221; toggle button.<\/p>\n<figure class=\"kg-card kg-image-card\"><img decoding=\"async\" src=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/04\/image-45.png\" class=\"kg-image\" alt=\"Request Access to Llama model on HuggingFace\" loading=\"lazy\" width=\"2000\" height=\"1241\" srcset=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w600\/2026\/04\/image-45.png 600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1000\/2026\/04\/image-45.png 1000w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1600\/2026\/04\/image-45.png 1600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/04\/image-45.png 2192w\" sizes=\"auto, (min-width: 720px) 720px\"><\/figure>\n<p>Scroll down and fill out the form to request.<\/p>\n<figure class=\"kg-card kg-image-card\"><img decoding=\"async\" src=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/04\/image-46.png\" class=\"kg-image\" alt=\"Request Access to Llama model on HuggingFace\" loading=\"lazy\" width=\"1862\" height=\"1318\" srcset=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w600\/2026\/04\/image-46.png 600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1000\/2026\/04\/image-46.png 1000w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1600\/2026\/04\/image-46.png 1600w, 
https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/04\/image-46.png 1862w\" sizes=\"auto, (min-width: 720px) 720px\"><\/figure>\n<p>Then go to settings to check the status of your request.<\/p>\n<figure class=\"kg-card kg-image-card\"><img decoding=\"async\" src=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/04\/image-47.png\" class=\"kg-image\" alt=\"Request Access to Llama model on HuggingFace\" loading=\"lazy\" width=\"1708\" height=\"914\" srcset=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w600\/2026\/04\/image-47.png 600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1000\/2026\/04\/image-47.png 1000w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1600\/2026\/04\/image-47.png 1600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/04\/image-47.png 1708w\" sizes=\"auto, (min-width: 720px) 720px\"><\/figure>\n<p>Once your request is approved, you will see your status as accepted.<\/p>\n<figure class=\"kg-card kg-image-card\"><img decoding=\"async\" src=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/04\/image-48.png\" class=\"kg-image\" alt=\"Request for Llama model on HuggingFace got accepted\" loading=\"lazy\" width=\"1672\" height=\"556\" srcset=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w600\/2026\/04\/image-48.png 600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1000\/2026\/04\/image-48.png 1000w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1600\/2026\/04\/image-48.png 1600w, 
https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/04\/image-48.png 1672w\" sizes=\"auto, (min-width: 720px) 720px\"><\/figure>\n<p>Access is usually granted within 5 minutes.<\/p>\n<h3 id=\"step-5-run-the-vllm-container\">Step 5: Run the vLLM Container<\/h3>\n<p>Now run the following command to start the Llama 3 model.<\/p>\n<pre><code class=\"language-bash\">docker run --runtime nvidia --gpus all \\\n  -v \/opt\/models:\/root\/.cache\/huggingface \\\n  -e HF_TOKEN=$HF_TOKEN \\\n  -p 8000:8000 \\\n  --ipc=host \\\n  --name llama3 \\\n  -d vllm\/vllm-openai:latest \\\n  --model meta-llama\/Meta-Llama-3-8B-Instruct \\\n  --gpu-memory-utilization 0.90<\/code><\/pre>\n<div class=\"kg-card kg-callout-card kg-callout-card-blue\">\n<div class=\"kg-callout-emoji\">\ud83d\udca1<\/div>\n<div class=\"kg-callout-text\">The first run will take some time because vLLM downloads over 15 GB from HuggingFace. Later starts, once the weights are cached in <code spellcheck=\"false\" style=\"white-space: pre-wrap;\">\/opt\/models<\/code>, will be much faster.<\/div>\n<\/div>\n<p>Here is what each flag does.<\/p>\n<p><strong><code>--runtime nvidia --gpus all<\/code>:<\/strong> gives the container full access to the GPU.<\/p>\n<p><strong><code>-e HF_TOKEN=$HF_TOKEN<\/code>:<\/strong> passes the HuggingFace token into the container so vLLM can download the gated weights.<\/p>\n<p><strong><code>-v \/opt\/models:\/root\/.cache\/huggingface<\/code>:<\/strong> the Llama 3 weights are over 15 GB. Without a volume, every new container downloads them again, which is 15 GB each time. <\/p>\n<p>So we store the model weights on the host system, in a specific directory (e.g., <code>\/opt\/models<\/code>).&nbsp;<\/p>\n<p>The <code>-v<\/code> (volume) option maps that host directory to a directory inside the container (e.g., <code>\/root\/.cache\/huggingface<\/code>), whether the container is starting for the first time or restarting. 
<\/p>\n<p>This means that the container can access the model weights directly from the host system&#8217;s storage.<\/p>\n<p><strong><code>-p 8000:8000<\/code>:<\/strong> exposes the API on port 8000.<\/p>\n<p><strong><code>--ipc=host<\/code>:<\/strong> vLLM creates multiple worker processes that communicate over shared memory. vLLM needs far more shared memory than the 64 MB <a href=\"https:\/\/devopscube.com\/docker-image-build-promotion-piepeline\/\" rel=\"noreferrer\">Docker<\/a> provides by default.&nbsp;<\/p>\n<p>With <code>--ipc=host<\/code>, the container uses the host machine&#8217;s shared memory instead of Docker&#8217;s default allocation. Without this flag, we get CUDA IPC errors that look like GPU failures but are actually a shared memory problem.<\/p>\n<p><strong><code>vllm\/vllm-openai<\/code><\/strong> is the image we are using. It ships with <strong>vLLM<\/strong> pre-installed and starts an OpenAI-compatible API server.<\/p>\n<p>Run the following command to follow the logs and check that the model is starting.<\/p>\n<pre><code class=\"language-bash\">docker logs -f llama3<\/code><\/pre>\n<p>Do not send requests until you see:<\/p>\n<pre><code>Application startup complete.\n<\/code><\/pre>\n<h3 id=\"step-7-testing-the-model\">Step 6: Testing the Model<\/h3>\n<p>Before testing, make sure <code>jq<\/code> is installed on your VM. If it is not, run the following command to install it.<\/p>\n<pre><code class=\"language-bash\">sudo apt update &amp;&amp; sudo apt install jq -y<\/code><\/pre>\n<p>To test the model, we will send a query using curl.<\/p>\n<pre><code class=\"language-bash\">curl -s http:\/\/localhost:8000\/v1\/chat\/completions \\\n  -H \"Content-Type: application\/json\" \\\n  -d '{\n    \"model\": \"meta-llama\/Meta-Llama-3-8B-Instruct\",\n    \"messages\": [{\"role\": \"user\", \"content\": \"What is vLLM?\"}],\n    \"max_tokens\": 50\n  }' | jq -r '.choices[0].message.content'<\/code><\/pre>\n<p>This gives output similar to the following.<\/p>\n<pre><code 
class=\"language-bash\">vLLM stands for Virtual Large Language Model. It's a type of artificial intelligence (AI) model that's designed to mimic the capabilities of a large language model (LLM) but is trained on a virtual environment rather than a physical one.<\/code><\/pre>\n<p>For me, the first response came back in 2.369 seconds. During generation, power draw went from 11 W (idle) to 105 W and GPU utilisation hit 100%.<\/p>\n<h2 id=\"understanding-resource-utilization\">Understanding resource utilization<\/h2>\n<p>With the model loaded and no requests running, the GPU was already at 18,850 MiB of 20,475 MiB. <strong>That is over 92% full<\/strong>. vLLM loads the weights and reserves its share of VRAM on startup, and that memory stays allocated. The GPU looks full the whole time, even when idle.<\/p>\n<pre><code class=\"language-bash\">nvidia-smi\ndocker stats llama3 --no-stream<\/code><\/pre>\n<p><strong>docker stats reports: <\/strong>CPU utilization less than 1%, RAM 4.55 GB of 31.34 GB, 95 active processes. CPU and RAM are nowhere near their limits. The <a href=\"https:\/\/devopscube.com\/setup-gpu-operator-kubernetes\/\" rel=\"noreferrer\">GPU<\/a> is the only constraint in every test.<\/p>\n<h3 id=\"interpreting-utilization-metrics\">Interpreting utilization metrics<\/h3>\n<p>The GPU utilization percentage in <code>nvidia-smi<\/code> is less informative than it looks. It reports the share of the sampling window during which at least one kernel was running on the GPU, not how much of the GPU that kernel used. <\/p>\n<p>So a GPU showing 100% utilization isn&#8217;t necessarily doing intensive work. A kernel could be spinning or stalled while it waits for data from upstream components like the CPU or memory.<\/p>\n<p>Power draw is the more useful number. 
Around 130 W with a stable temperature means the GPU is actually computing, while 30 W alongside high reported utilization suggests the GPU is being starved of data or instructions.<\/p>\n<h3 id=\"throttling\">Throttling<\/h3>\n<p>GPUs like the RTX 4000 automatically reduce their clock speeds when the hardware temperature reaches a certain threshold (83\u00b0C in my case) to prevent overheating. <\/p>\n<p>This process is called throttling. It helps with hardware longevity and stability during sustained workloads, but it can reduce performance, for example fewer tokens per second mid-response, with no error message.<\/p>\n<h3 id=\"vram-breakdown-llama-3-8b-instruct-on-a-20-gb-card\"><strong>VRAM breakdown: Llama 3 8B Instruct on a 20 GB card<\/strong><\/h3>\n<figure class=\"kg-card kg-image-card kg-card-hascaption\"><img decoding=\"async\" src=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/03\/image-73.png\" class=\"kg-image\" alt=\"\" loading=\"lazy\" width=\"936\" height=\"428\" srcset=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w600\/2026\/03\/image-73.png 600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/03\/image-73.png 936w\" sizes=\"auto, (min-width: 720px) 720px\"><figcaption><span style=\"white-space: pre-wrap;\">VRAM usage is 18,850 MiB out of 20,475 MiB total, which is about 92.1% utilisation.<\/span><\/figcaption><\/figure>\n<p><!--kg-card-begin: html--><\/p>\n<table>\n<thead>\n<tr>\n<th>Component<\/th>\n<th>VRAM<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Model weights (FP16)<\/td>\n<td>~16 GB<\/td>\n<\/tr>\n<tr>\n<td>KV cache<\/td>\n<td>~3\u20134 GB<\/td>\n<\/tr>\n<tr>\n<td>CUDA overhead<\/td>\n<td>~200 MB<\/td>\n<\/tr>\n<tr>\n<td>Headroom at 0.90 utilization<\/td>\n<td>~200 MB<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><!--kg-card-end: 
html--><\/p>\n<p>Look at the last row: there is almost no room for anything else.<\/p>\n<div class=\"kg-card kg-callout-card kg-callout-card-blue\">\n<div class=\"kg-callout-emoji\">\ud83d\udca1<\/div>\n<div class=\"kg-callout-text\">When the model weights take up 16 GB out of 20 GB, we encounter <b><strong style=\"white-space: pre-wrap;\">VRAM starvation<\/strong><\/b>. There is simply no room left for the KV cache to breathe.<\/div>\n<\/div>\n<h2 id=\"optimisation-techniques\">Optimisation Techniques<\/h2>\n<p>Let&#8217;s look at some optimization techniques.<\/p>\n<h3 id=\"memory-optimization\">Memory optimization<\/h3>\n<p>The <code>--gpu-memory-utilization<\/code> flag sets the fraction of VRAM that vLLM gets on startup. It reserves the memory immediately.<\/p>\n<p>I tested five values for this flag:<\/p>\n<p><!--kg-card-begin: html--><\/p>\n<table>\n<thead>\n<tr>\n<th>Utilization<\/th>\n<th>VRAM Reserved<\/th>\n<th>Max Concurrent<\/th>\n<th>Avg Latency<\/th>\n<th>Result<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>0.50<\/td>\n<td>~10,240 MiB<\/td>\n<td>0<\/td>\n<td>N\/A<\/td>\n<td>Crash on startup: weights don&#8217;t fit<\/td>\n<\/tr>\n<tr>\n<td>0.70<\/td>\n<td>~14 GB<\/td>\n<td>5<\/td>\n<td>~450 ms<\/td>\n<td>Stable<\/td>\n<\/tr>\n<tr>\n<td>0.90<\/td>\n<td>~18,432 MiB<\/td>\n<td>20<\/td>\n<td>~120 ms<\/td>\n<td>Stable<\/td>\n<\/tr>\n<tr>\n<td>0.95<\/td>\n<td>19,874 MiB<\/td>\n<td>50+<\/td>\n<td>~90 ms<\/td>\n<td>Works, but dangerously thin<\/td>\n<\/tr>\n<tr>\n<td>1.00<\/td>\n<td>N\/A<\/td>\n<td>0<\/td>\n<td>N\/A<\/td>\n<td>RuntimeError: Engine core initialization failed<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><!--kg-card-end: html--><\/p>\n<ul>\n<li>Llama 3 8B&#8217;s weights alone need around 15 GB, so the crash at 0.50 is not surprising.<\/li>\n<li>When the flag is set to 1.00 (an attempt to reserve 100% of the VRAM), the engine fails to allocate the small non-paged memory blocks that CUDA kernels require. 
Full reservation is refused so that these small blocks stay free, which is why startup fails at 1.00.<\/li>\n<\/ul>\n<div class=\"kg-card kg-callout-card kg-callout-card-blue\">\n<div class=\"kg-callout-emoji\">\ud83d\udca1<\/div>\n<div class=\"kg-callout-text\">Non-paged memory blocks: small chunks of GPU memory that must be available at runtime for CUDA kernels to execute properly.<\/div>\n<\/div>\n<p>Most tutorials suggest setting <code>--gpu-memory-utilization<\/code> between 0.70 and 0.85, the reasoning being that this leaves safe headroom. But on this card it leaves very little room for the KV cache.<\/p>\n<p>After vLLM loads the weights, the remaining reserved memory is automatically allocated to the KV cache. The KV cache is what lets the model track multiple conversations at a time. <\/p>\n<p>Every user in an ongoing conversation occupies cache space in proportion to how long that conversation has grown. <\/p>\n<p>When the cache fills up, vLLM does not crash. It pauses one user\u2019s generation to process someone else\u2019s, then resumes. This leads to latency spikes.<\/p>\n<ul>\n<li>At 0.70, the weights take up nearly all of the reserved VRAM, so the effective cache is extremely thin, which is why it can only support 5 concurrent users. <\/li>\n<li>At 0.90, with 2 to 3 GB of cache available, 20+ concurrent users are supported.<\/li>\n<\/ul>\n<p>You can observe that latency drops significantly as the value of the <code>--gpu-memory-utilization<\/code> flag increases.<\/p>\n<p>To summarize, the <code>--gpu-memory-utilization<\/code> flag sets the amount of VRAM vLLM grabs on startup. <\/p>\n<p>After the model weights are loaded, the remaining VRAM is assigned to the KV cache. The more memory the KV cache gets, the more concurrent users it can support. Latency drops as well because users are less likely to be preempted. 
<\/p>\n<h2 id=\"quantization\">Quantization<\/h2>\n<p>Switching from the FP16 model to an FP8 version was a big improvement: throughput increased 2.2\u00d7.<\/p>\n<div class=\"kg-card kg-callout-card kg-callout-card-blue\">\n<div class=\"kg-callout-emoji\">\ud83d\udca1<\/div>\n<div class=\"kg-callout-text\">This version of Llama 3 has 8 billion parameters. Standard LLMs like this one store each parameter as a 16-bit floating point number, which takes 2 bytes of VRAM. Quantization maps these high-precision values to a lower-precision format like 8-bit or 4-bit.<\/div>\n<\/div>\n<p>For Llama 3 8B Instruct, the FP16 version uses 2 bytes per parameter, so 8 billion parameters come to 16 GB of weights. The FP8 model uses 1 byte per parameter, so the same 8 billion parameters come to 8 GB.<\/p>\n<p>We already know that after the model weights are loaded, the remaining memory is assigned to the KV cache. <\/p>\n<p>The FP16 version will only have ~4 GB for cache, but the quantized FP8 version will have over 12 GB. We know that the number of concurrent users or concurrent requests is constrained by the VRAM assigned to the KV cache. <\/p>\n<p>With over 12 GB of VRAM available for the KV cache, FP8 allows more concurrent users or requests before memory runs out.<\/p>\n<div class=\"kg-card kg-callout-card kg-callout-card-blue\">\n<div class=\"kg-callout-emoji\">\ud83d\udca1<\/div>\n<div class=\"kg-callout-text\">The RTX 4000 Ada has dedicated FP8 tensor cores: physical circuits on the chip designed for 8-bit floating point operations. That means a different and faster set of circuits does the work. 
This adds throughput gains on top of the memory savings.<\/div>\n<\/div>\n<p>Test: 100 requests, 20 concurrent, against the standard FP16 model first.<\/p>\n<figure class=\"kg-card kg-image-card kg-card-hascaption\"><img decoding=\"async\" src=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/03\/data-src-image-06f19b0e-9a0d-4137-86c1-ab3513eb6492-1.png\" class=\"kg-image\" alt=\"\" loading=\"lazy\" width=\"942\" height=\"702\" srcset=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w600\/2026\/03\/data-src-image-06f19b0e-9a0d-4137-86c1-ab3513eb6492-1.png 600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/03\/data-src-image-06f19b0e-9a0d-4137-86c1-ab3513eb6492-1.png 942w\" sizes=\"auto, (min-width: 720px) 720px\"><figcaption><span style=\"white-space: pre-wrap;\">FP16 baseline: 1.62 req\/s, 206 tokens\/s, 22,953 ms mean time to first token.<\/span><\/figcaption><\/figure>\n<p>Then the same test on the FP8 model.<\/p>\n<figure class=\"kg-card kg-image-card kg-card-hascaption\"><img decoding=\"async\" src=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/03\/data-src-image-a2ac4338-c11a-404f-97b4-d355655b4eef-1-1.png\" class=\"kg-image\" alt=\"\" loading=\"lazy\" width=\"745\" height=\"491\" srcset=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w600\/2026\/03\/data-src-image-a2ac4338-c11a-404f-97b4-d355655b4eef-1-1.png 600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/03\/data-src-image-a2ac4338-c11a-404f-97b4-d355655b4eef-1-1.png 745w\" sizes=\"auto, (min-width: 720px) 720px\"><figcaption><span style=\"white-space: pre-wrap;\">FP8: 3.64 req\/s, 466 tokens\/s, 7,289 ms mean time to first token. 
2.2\u00d7 throughput, 3\u00d7 faster time to first token.<\/span><\/figcaption><\/figure>\n<p>The FP8 version processed 3.64 requests per second, while the FP16 version managed only 1.62. Token generation speed also went from 206 to 466 tokens per second.<\/p>\n<p>Quantizing from FP16 to FP8 did not show any visible dip in quality. Most model weights are concentrated in a small range of values, so even though FP8 has less precision than FP16 (8 bits instead of 16), it can still represent these weights accurately enough to preserve the important details.<\/p>\n<div class=\"kg-card kg-callout-card kg-callout-card-blue\">\n<div class=\"kg-callout-emoji\">\ud83d\udca1<\/div>\n<div class=\"kg-callout-text\">In neural networks, weights are values that determine the strength of the connection between neurons. They are learned during training and are important for the model&#8217;s accuracy. <\/div>\n<\/div>\n<h2 id=\"batching-and-throughput\">Batching and throughput<\/h2>\n<p>In static batching, requests wait for the current batch to finish before the next one starts. So a user asking a simple query is stuck behind someone who is generating a 500-word essay. <\/p>\n<p>With continuous batching, scheduling happens at the token level: new requests fill freed slots immediately rather than waiting for the whole batch to complete. <\/p>\n<p>This means that when the model generates the last token of one request, that slot in the batch becomes available immediately. 
<\/p>\n<p>A new request can take that slot at the next step without waiting for the other requests in the batch to complete.<\/p>\n<div class=\"kg-card kg-callout-card kg-callout-card-blue\">\n<div class=\"kg-callout-emoji\">\ud83d\udca1<\/div>\n<div class=\"kg-callout-text\">A token is the smallest unit of text that a model processes or generates.<br \/>For example, the phrase &#8220;Pizza sauce&#8221; might be split into [&#8220;Pizza&#8221;, &#8220;sauce&#8221;] or into smaller pieces like [&#8220;Pi&#8221;, &#8220;zz&#8221;, &#8220;a&#8221;, &#8220;sa&#8221;, &#8220;uc&#8221;, &#8220;e&#8221;], depending on the tokenizer.<br \/>Models generate text sequentially, producing one token at a time until the output is complete.<\/div>\n<\/div>\n<p>I tested this with two terminals sending requests simultaneously. One sent a long essay request. The other sent &#8220;what is 2+2&#8221;. The math answer came back almost immediately while the essay was still being generated. 
<\/p>\n<p>The GPU stayed at 100% the whole time; both requests were processed at the same time.<\/p>\n<figure class=\"kg-card kg-image-card kg-card-hascaption\"><img decoding=\"async\" src=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/03\/data-src-image-0404de8c-a755-417e-9ed4-e2a371277200-1-1.png\" class=\"kg-image\" alt=\"\" loading=\"lazy\" width=\"1283\" height=\"699\" srcset=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w600\/2026\/03\/data-src-image-0404de8c-a755-417e-9ed4-e2a371277200-1-1.png 600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1000\/2026\/03\/data-src-image-0404de8c-a755-417e-9ed4-e2a371277200-1-1.png 1000w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/03\/data-src-image-0404de8c-a755-417e-9ed4-e2a371277200-1-1.png 1283w\" sizes=\"auto, (min-width: 720px) 720px\"><figcaption><span style=\"white-space: pre-wrap;\">The math answer returned almost immediately while the long essay request was still processing.<\/span><\/figcaption><\/figure>\n<h2 id=\"batching-optimisation\">Batching optimisation<\/h2>\n<p>The objective was to increase throughput by tuning the batching configuration.<\/p>\n<p>Benchmark results for different values of <code>--max-num-seqs<\/code> (100 requests, 20 concurrent).<\/p>\n<p><!--kg-card-begin: html--><\/p>\n<table>\n<thead>\n<tr>\n<th>Configuration<\/th>\n<th>Parameters<\/th>\n<th>Throughput (req\/s)<\/th>\n<th>Mean Latency (ms)<\/th>\n<th>Observation<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Config A (Conservative)<\/td>\n<td>max-seqs: 32<\/td>\n<td>1.45<\/td>\n<td>680<\/td>\n<td>Under-utilized<\/td>\n<\/tr>\n<tr>\n<td>Config B (Moderate)<\/td>\n<td>max-seqs: 128<\/td>\n<td>1.62<\/td>\n<td>850<\/td>\n<td>Best<\/td>\n<\/tr>\n<tr>\n<td>Config C (Aggressive)<\/td>\n<td>max-seqs: 
256<\/td>\n<td>1.63<\/td>\n<td>1200<\/td>\n<td>Latency spike, no speed gain<\/td>\n<\/tr>\n<tr>\n<td>Config D (Maximum)<\/td>\n<td>max-seqs: 512<\/td>\n<td>1.60<\/td>\n<td>1800<\/td>\n<td>Diminishing returns<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><!--kg-card-end: html--><\/p>\n<div class=\"kg-card kg-callout-card kg-callout-card-blue\">\n<div class=\"kg-callout-emoji\">\ud83d\udca1<\/div>\n<div class=\"kg-callout-text\"><b><code spellcheck=\"false\" style=\"white-space: pre-wrap;\"><strong>--max-num-seqs<\/strong><\/code><\/b> is the vLLM parameter that sets the maximum number of sequences processed concurrently in a single iteration. It matters because it directly affects the system&#8217;s throughput and latency.<\/div>\n<\/div>\n<p>128 is the best setting: beyond it, throughput barely increases but latency rises significantly.<\/p>\n<figure class=\"kg-card kg-image-card kg-card-hascaption\"><img decoding=\"async\" src=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/03\/data-src-image-06f19b0e-9a0d-4137-86c1-ab3513eb6492-1.png\" class=\"kg-image\" alt=\"\" loading=\"lazy\" width=\"942\" height=\"702\" srcset=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w600\/2026\/03\/data-src-image-06f19b0e-9a0d-4137-86c1-ab3513eb6492-1.png 600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/03\/data-src-image-06f19b0e-9a0d-4137-86c1-ab3513eb6492-1.png 942w\" sizes=\"auto, (min-width: 720px) 720px\"><figcaption><span style=\"white-space: pre-wrap;\">FP16 baseline: 1.62 req\/s, 206 tokens\/s, 22,953ms mean time to first token.<\/span><\/figcaption><\/figure>\n<h2 id=\"architecture-overview\">Architecture overview<\/h2>\n<p>The following image shows 
the final deployment stack: Llama 3 8B Instruct running inside a <a href=\"https:\/\/devopscube.com\/cloud-based-docker-container-monitoring\/\" rel=\"noreferrer\">Docker container<\/a> with the NVIDIA runtime on an RTX 4000 ADA with 20 GB VRAM. vLLM manages VRAM allocation using PagedAttention. <\/p>\n<figure class=\"kg-card kg-image-card\"><img decoding=\"async\" src=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2026\/04\/image-65.png\" class=\"kg-image\" alt=\"Llama 3 8B instruct running inside a docker container with NVIDIA runtime\" loading=\"lazy\" width=\"2000\" height=\"2642\" srcset=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w600\/2026\/04\/image-65.png 600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1000\/2026\/04\/image-65.png 1000w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1600\/2026\/04\/image-65.png 1600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w2400\/2026\/04\/image-65.png 2400w\" sizes=\"auto, (min-width: 720px) 720px\"><\/figure>\n<p>After quantising to FP8, the model occupies about 8 GB of VRAM, leaving around 12 GB for the KV cache, which allows 20+ concurrent users.<\/p>\n<p>Continuous batching fills freed token slots immediately, preventing simple requests from waiting behind long ones. The whole stack is exposed as an OpenAI-compatible API on port 8000.<\/p>\n<h2 id=\"conclusion\">Conclusion<\/h2>\n<p>This blog covered what I found interesting while deploying Llama 3 8B Instruct on an NVIDIA RTX 4000 ADA (20 GB VRAM). I started with a 16 GB model on a 20 GB card, with almost no room for anything else.<\/p>\n<p>Three changes closed the gap. 
Quantising from FP16 to FP8 cut the model weights in half and freed up around 8 GB, which went directly to the KV cache.<\/p>\n<p>Setting GPU memory utilization to 0.90 (90% VRAM utilization) instead of the suggested conservative values gave the cache enough space to actually be useful.<\/p>\n<p>Continuous batching handled the scheduling. New requests fill the freed-up slots at the token level, so a simple query won&#8217;t sit and wait behind a long essay generation. <br \/>Up to 128 sequences can be processed concurrently; beyond that, latency climbs without any significant throughput gain.<\/p>\n<div class=\"kg-card kg-callout-card kg-callout-card-blue\">\n<div class=\"kg-callout-text\"><b><strong style=\"white-space: pre-wrap;\">End Result: <\/strong><\/b>requests per second increased from 1.62 to 3.64, tokens generated per second increased from 206 to 466, and the model was able to handle 20+ concurrent users (up from 5 users).<\/div>\n<\/div>\n<p>The VRAM limit is still a hard limit. If a future model has heavier weights, you can either quantize further or get a bigger card. 
There is no other way out.<\/p>\n<hr>\n<p><strong>Source:<\/strong> <a href=\"https:\/\/devopscube.com\/deploying-llama-with-docker-and-vllm\/\" target=\"_blank\" rel=\"noopener noreferrer\">How to Deploy Llama 3 with Docker and vLLM (Detailed Guide) \u2014 DevOpsCube<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Source: https:\/\/devopscube.com\/deploying-llama-with-docker-and-vllm\/<\/p>\n","protected":false},"author":1,"featured_media":302,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-301","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-devops"],"_links":{"self":[{"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=\/wp\/v2\/posts\/301","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=301"}],"version-history":[{"count":0,"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=\/wp\/v2\/posts\/301\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=\/wp\/v2\/media\/302"}],"wp:attachment":[{"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=301"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=301"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=301"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}