
How We Cut AI Costs by 80% Running Ollama on Azure - A Production Story

  • Writer: Pankaj Naik
  • Mar 25
  • 4 min read

Updated: Mar 26


The Problem We Were Trying to Solve


At PANTA Flows, we build workflow automation tools. Like most modern AI SaaS products, we started exploring ways to embed AI into our product - specifically for summarization and small language tasks that would help our users make faster decisions.

The obvious choice? OpenAI or Claude.


But when we ran the numbers, the cost picture wasn't pretty. For our expected traffic of around 200 requests per day - modest by any standard - we were looking at €300–500 per month for GPT-4-level models. And here's the thing: we didn't need GPT-4. We needed something that could reliably summarize a document or extract key points. A 3-billion-parameter model would do just fine.

That's when we started asking a different question:

"What if we just ran the model ourselves?"

Why a Self-Hosted LLM Made Sense for Us


We weren't building a chatbot. We weren't doing complex reasoning. Our use cases were:

  • Summarizing workflow descriptions

  • Creating scenes from scripts and summarizing files

  • Small classification tasks


For these tasks, a quantized small model running on CPU is perfectly adequate. The key insight was that we didn't need GPT-4 intelligence; we needed GPT-4 availability.


Self-hosting gave us:

  • Predictable costs - one VM price per month, no per-token billing

  • Data privacy - nothing leaves our infrastructure

  • Speed - sub-2 second responses for warm requests

  • Control - we own the stack


Choosing the Right Model


After evaluating several options, we landed on Qwen2.5:3b - a 1.9GB quantized model (Q4_K_M).


Why this model?


Architecture Decision: Simple Wins


Before we get into the technical setup, I want to talk about what we deliberately chose NOT to build.


Early in the planning phase, we considered:

  • Docker containers with auto-scaling

  • Azure Container Apps

  • Message queues (Azure Service Bus, Redis)

  • API gateway wrappers


We rejected all of it. Here's why:

For 200 requests per day (roughly one request every 7 minutes on average), adding Kubernetes, containers, and message queues would be solving problems we don't have yet. The engineering overhead alone would outweigh any benefit.
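To put that load in perspective, here is a back-of-the-envelope sketch in Python. The 2-second service time is an assumption borrowed from the warm-response figure quoted earlier, and real traffic is burstier than a daily average suggests, so treat the result as a rough guide only.

```python
# Rough load estimate for a single sequential inference worker.
REQUESTS_PER_DAY = 200
SERVICE_SECONDS = 2.0  # assumption: upper bound for a warm response

SECONDS_PER_DAY = 24 * 60 * 60


def mean_gap_minutes(requests_per_day: int) -> float:
    """Average time between two requests, in minutes."""
    return SECONDS_PER_DAY / requests_per_day / 60


def utilization(requests_per_day: int, service_seconds: float) -> float:
    """Fraction of the day the worker spends serving requests."""
    return requests_per_day * service_seconds / SECONDS_PER_DAY


print(f"mean gap: {mean_gap_minutes(REQUESTS_PER_DAY):.1f} min")
print(f"utilization: {utilization(REQUESTS_PER_DAY, SERVICE_SECONDS):.2%}")
```

Under these assumptions the worker is busy for well under 1% of the day, which is why a single process with a short queue is enough at this scale.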


Our final architecture is deliberately boring:

No queues. No containers. No orchestration. Just a VM running a process.


Infrastructure Setup


The VM

We chose the Azure Standard_D2as_v4: 2 vCPU, 8GB RAM, running Ubuntu 22.04 LTS.


Memory breakdown:

Tight but perfectly fine for sequential inference at low traffic. Total cost: €65.78/month.


For disk, we kept the default 30GB OS disk but switched from Premium SSD to Standard SSD, saving ~€7-10/month. Since Ollama loads the model into RAM on first request, disk speed only matters at cold start. The difference between Premium and Standard SSD is maybe 3 seconds on startup. Not worth the premium.


VNet Planning: The Part We Almost Got Wrong


This is where most guides skip the important detail. We have three environments: dev, stage, and prod. Each runs as a separate Azure App Service.


Our initial instinct was to just open port 11434 on the VM and call it a day. That would have been a security disaster, because Ollama has no authentication by default. Anyone with our VM's public IP could run inference on our compute.


Instead, we designed a proper VNet subnet structure:

Each App Service connects to its dedicated subnet via VNet Integration. All subnets are in the same VNet, so they can communicate privately. Ollama never needs a public port open.


The result: port 11434 is completely invisible to the internet. Only our backend services can reach it and only via private IP.
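A backend can also enforce this rule in code. The following is a minimal sketch, not part of the setup described here: a startup check that refuses any Ollama URL whose host is not a literal private IP address. The function name and the example addresses are hypothetical.

```python
import ipaddress
from urllib.parse import urlparse


def is_private_endpoint(url: str) -> bool:
    """Return True only if the URL's host is a literal private IP address."""
    host = urlparse(url).hostname
    if host is None:
        return False
    try:
        return ipaddress.ip_address(host).is_private
    except ValueError:
        # Hostnames (rather than literal IPs) are rejected for simplicity.
        return False


# Hypothetical addresses, for illustration only.
print(is_private_endpoint("http://10.0.1.4:11434"))  # private VNet address
print(is_private_endpoint("http://8.8.8.8:11434"))   # public address
```

A check like this is a safety net on top of the network design, not a replacement for it: the VNet and the absence of a public port remain the actual security boundary.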


Deploying Ollama


Installation

Simple. The install script handles everything including creating a systemd service.


Configuring the Service


The default Ollama installation binds to 127.0.0.1 (localhost only). Since our App Services need to reach it over the VNet, we need it listening on all interfaces. We also add queue settings to handle concurrent requests gracefully.


Our final /etc/systemd/system/ollama.service:


Breaking down the environment variables:

| Variable | Value | Reason |
|---|---|---|
| OLLAMA_HOST | 0.0.0.0:11434 | Accept VNet traffic |
| OLLAMA_NUM_PARALLEL | 1 | Safe for 2 vCPU; no thrashing |
| OLLAMA_MAX_QUEUE | 5 | Buffer for traffic spikes |
| OLLAMA_MAX_LOADED_MODELS | 1 | Keep model warm in memory |
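In a systemd unit, each of these settings becomes an `Environment=` line in the `[Service]` section. The sketch below renders those lines from the values above. The helper name is hypothetical, and the output illustrates the format only; it is not a copy of the actual unit file.

```python
def render_service_env(env: dict[str, str]) -> str:
    """Render settings as systemd Environment= lines under [Service]."""
    lines = ["[Service]"]
    for key, value in env.items():
        lines.append(f'Environment="{key}={value}"')
    return "\n".join(lines) + "\n"


# Values taken from the settings listed above.
OLLAMA_ENV = {
    "OLLAMA_HOST": "0.0.0.0:11434",
    "OLLAMA_NUM_PARALLEL": "1",
    "OLLAMA_MAX_QUEUE": "5",
    "OLLAMA_MAX_LOADED_MODELS": "1",
}

print(render_service_env(OLLAMA_ENV))
```

After editing the unit, systemd needs `systemctl daemon-reload` followed by a restart of the `ollama` service before the new values take effect.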


Why We Didn't Need Azure Service Bus


This is the question we get asked most often: "Why not use a proper queue?"

Ollama has a built-in queue. For our traffic pattern, it works like this:

Azure Service Bus costs €10+/month and adds operational complexity. Ollama's built-in queue costs €0 and requires three lines of config. The math is obvious.
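One caveat with a small built-in queue: when more requests are waiting than OLLAMA_MAX_QUEUE allows, Ollama rejects the extra ones with HTTP 503, so the caller should be ready to retry. Below is a minimal client-side sketch with exponential backoff. The function name, the attempt count, and the delays are assumptions, and `send` stands in for whatever HTTP call the backend makes.

```python
import time
from typing import Callable, Tuple


def call_with_retry(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call send() until it returns 200, retrying only on 503 (server busy)."""
    for attempt in range(max_attempts):
        status, body = send()
        if status == 200:
            return body
        if status != 503:
            raise RuntimeError(f"unexpected status {status}")
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("server still busy after retries")
```

Passing `sleep` as a parameter keeps the function easy to test without real delays. At one request every few minutes, the retry path should rarely trigger.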


Connecting the Backend


Backend Integration Code
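A minimal sketch of such an integration, in Python with only the standard library, assuming the backend calls Ollama's `/api/generate` endpoint with streaming disabled. The private IP address, the helper names, and the prompt wording are hypothetical; the actual backend language and code may differ.

```python
import json
import urllib.request

OLLAMA_URL = "http://10.0.1.4:11434"  # hypothetical private IP of the VM
MODEL = "qwen2.5:3b"


def build_generate_request(
    prompt: str, base_url: str = OLLAMA_URL, model: str = MODEL
) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize(text: str, timeout: float = 60.0) -> str:
    """Send the text to Ollama and return the generated summary."""
    request = build_generate_request(f"Summarize the following text:\n\n{text}")
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Keeping the base URL in one constant (or an environment variable) means dev, stage, and prod can point at the same VM or at different ones without code changes.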


Our environment strategy is clean:


Performance in Production


Real numbers from our deployment:


For a startup running low-traffic AI workloads, the performance is more than adequate.

The biggest win was the cost reduction:

For an early-stage product, that's a meaningful difference.
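The headline figure can be sanity-checked from the numbers quoted in this post: €65.78 per month for the VM against the €300–500 per month estimated for a hosted GPT-4-level API. The sketch below only redoes that arithmetic; it does not include engineering time or any costs not mentioned above.

```python
SELF_HOSTED_EUR = 65.78           # monthly VM cost quoted above
API_LOW_EUR, API_HIGH_EUR = 300.0, 500.0  # hosted API estimate quoted above


def savings(api_cost: float, self_hosted: float = SELF_HOSTED_EUR) -> float:
    """Fraction of the hosted-API bill saved by self-hosting."""
    return 1 - self_hosted / api_cost


print(f"vs €300/month: {savings(API_LOW_EUR):.1%} saved")
print(f"vs €500/month: {savings(API_HIGH_EUR):.1%} saved")
```

The saving lands between roughly 78% and 87% depending on which end of the API estimate is used, which is where the "80%" in the title comes from.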


When to Scale Beyond This


Our current setup handles 200 requests/day comfortably. Here's when we'd consider scaling:

The beauty of this architecture is that scaling is incremental. We don't need to redesign anything, just add layers as traffic grows.


Key Takeaways


  1. A self-hosted LLM is viable for small-to-medium workloads - don't assume you need managed AI services

  2. Simple architecture beats clever architecture - a VM and a process is often enough

  3. VNet integration is the right security model - never expose Ollama publicly

  4. Ollama's built-in queue is underrated - you don't need Redis for 200 req/day

  5. Plan your network topology first - subnet structure decisions are hard to change later

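  6. Cold start is solvable - OLLAMA_MAX_LOADED_MODELS=1 keeps a single model resident, and Ollama's keep-alive setting (OLLAMA_KEEP_ALIVE, 5 minutes by default) controls how long it stays loaded between requests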
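On the cold-start point, one hedged option is Ollama's `keep_alive` request parameter: a call to `/api/generate` with a model name and no prompt loads the model without generating text, and `keep_alive` sets how long it then stays in memory. The sketch below builds such a request. The helper name, the 30-minute value, and the example IP are assumptions.

```python
import json
import urllib.request


def build_keep_warm_request(
    base_url: str, model: str, keep_alive: str = "30m"
) -> urllib.request.Request:
    """Build a prompt-less generate call that loads the model and keeps it resident."""
    payload = {"model": model, "keep_alive": keep_alive}
    return urllib.request.Request(
        url=f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def keep_warm(base_url: str, model: str) -> None:
    """Send the keep-warm request; suitable for a periodic job."""
    request = build_keep_warm_request(base_url, model)
    with urllib.request.urlopen(request, timeout=120) as resp:
        resp.read()
```

Calling `keep_warm("http://10.0.1.4:11434", "qwen2.5:3b")` from a scheduled job, at an interval shorter than the keep-alive window, would keep the model loaded even when real traffic is sparse.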


What's Next


We are planning to:

  • Add Azure Monitor alerts on VM CPU/memory

  • Implement a keep-warm cron job to eliminate cold starts

  • Evaluate larger models (7B) as we upgrade the VM

  • Add streaming responses for better perceived responsiveness
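For the streaming item, Ollama sends newline-delimited JSON when `stream` is left enabled: each line carries a `response` fragment, and the last line has `done` set to true. A minimal parsing sketch follows. The function name is hypothetical, and this is one possible approach rather than a planned implementation.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from Ollama's newline-delimited JSON stream."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break  # final chunk: stop reading
```

Because an HTTP response object from `urllib.request.urlopen` can be iterated line by line, it can be passed straight to `iter_stream_text`, and each fragment forwarded to the user as it arrives instead of waiting for the full summary.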


If you are building a similar setup or have questions about our approach, feel free to reach out.

 
 