More Than Just Savings – The Risk of Bloated Cloud Infrastructure
In the cloud, you can scale infinitely. But you can also burn infinite amounts of money and build yourself a ticking time bomb. It all started with a client facing an absurdly high Azure bill: nearly one million euros in annual costs for a Kubernetes platform (AKS) hosting four apps in early development – without a single end customer. An insane price tag, born from the "build now, optimize later" mantra, typical when time-to-market trumps cost control.
Not an isolated case: Studies show that exploding cloud costs are common. Many companies overpay by around 35% (Source: CloudComputing-Insider), often due to poor planning, lack of optimization, or missing expertise. Up to 80% struggle with unexpectedly high costs (Sources: Gartner, AP-Verlag). Our client's problem was therefore symptomatic.
But the costs were just the tip of the iceberg. The real drama lurked deeper: A massively oversized, complex infrastructure posed significant risks to stability and availability. Error-prone deployments could bring everything down. These costs were an alarm signal for deep-seated technical and organizational issues.
So, how could we tackle this mountain of technical debt and costs without jeopardizing growth?
This post outlines our journey – from quick wins to in-depth analysis and establishing FinOps practices. A journey demonstrating that targeted cost optimization often leads directly to higher technical quality and stability.
The result: Cloud costs halved while apps and teams grew (from 4 to 10+), and an infrastructure that could no longer be brought to its knees.
The Starting Point: Uncontrolled Growth in Azure and Kubernetes
Before optimization, the infrastructure resembled a digital jungle, cultivated under the maxim "growth at any cost." A central Kubernetes platform (AKS) on Azure, spread across four environments, hosted four product teams with internal apps. The foundation: around 100 VMs for the AKS clusters – an impressive number for just a few applications, some barely used.
How did this happen? The platform was developed with a focus on rapid development, scalability, and features, while costs were ignored. The "big playground, details later" philosophy led to fundamental problems:
- Systemic Overprovisioning: Pods were allocated resources on the "more is more" principle, not actual need. Developers requested resources generously without knowing the real demand – frontends requesting 2 GB of RAM instead of 200 MB were common (see the sketch after this list). Multiplied across four environments, this added up to an enormous amount of reserved but idle capacity.
- Architectural Missteps: The "just do it" culture led to bizarre solutions, like using Pod RAM as an improvised, expensive in-memory key-value store (8 GB per Pod). Such patterns revealed flaws in both infrastructure and application architecture.
- Ignored Cloud Saving Mechanisms: AKS node VMs ran almost exclusively at expensive on-demand prices. Azure Reserved Instances (RIs) or Savings Plans were not utilized – a classic mistake for constant baseline loads.
- Inefficient, Risky Database Usage: Azure databases were often massively oversized because inefficient queries strained the CPU. Instead of optimizing the queries, the instances were scaled up. Even worse: shared databases posed a huge risk – a faulty release could cripple the database for all teams, on top of the cost of the oversized instances.
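To make the first point tangible, here is a minimal sketch of the kind of manifest this culture produced – workload name, image, and numbers are illustrative, not taken from the client's repositories:

```yaml
# Typical "more is more" workload before optimization (all values illustrative):
# two full cores and 2 GiB reserved per replica for a frontend that actually
# needed around 200 MB – multiplied by three replicas and four environments.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend                                   # hypothetical workload name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
        - name: frontend
          image: example.azurecr.io/frontend:1.0.0 # placeholder image
          resources:
            requests:
              cpu: "2"                             # reserved on the node, used or not
              memory: 2Gi
            limits:
              cpu: "4"
              memory: 4Gi
```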
This mix of massive overprovisioning, questionable architectures, risky shared resources, and ignorance of savings potential showed: A fundamental cleanup was necessary.
Phase 1: Quick Wins – Tidying Up Systematically (First 30% Savings)
After the initial assessment, it was clear: time to bring out the machete and go for the low-hanging fruit. We focused on eliminating the most obvious waste and achieving quick wins to build trust.

Illustration of Kubernetes Right-Sizing.
Measure 1: Kubernetes Right-Sizing – Shrinking with a Learning Curve
The main issue was the massive overprovisioning of Pods (GBs of RAM, dozens of CPU cores reserved, barely used). Instead of gut feeling, we needed data.
- Analysis with Goldilocks: The Goldilocks tool (based on the Vertical Pod Autoscaler) ran in the clusters for weeks, collecting CPU/memory usage data and providing concrete recommendations for requests and limits. (Manual monitoring is also possible, but Goldilocks offered a quick overview.)
- Iterative Adjustment & Observation: We didn't follow the recommendations blindly but reduced requests and limits step by step (starting with Dev/Staging), in consultation with the teams and while closely monitoring the applications for stability and performance – see the manifest sketch after this list.
- Quality Gains Through Cost Pressure: This "slow tightening of the screws" revealed problems within the apps themselves! Memory leaks, previously unnoticed in the abundance of resources, surfaced; CPU spikes indicated inefficient code. The cost pressure indirectly forced teams to improve their software quality – an important side effect!
- Adjusting Replicas: Once the Pods were correctly sized and running stably, we reduced the often unnecessarily high replica counts (e.g., 3+), since there was no load to justify them and Horizontal Pod Autoscaling (HPA) was not yet in place.
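For illustration, a minimal sketch of what this looked like in practice. The namespace label is the one Goldilocks uses to opt a namespace into its dashboard (check the current Goldilocks docs for your version); the resource values only indicate the direction of the adjustments – names and numbers are ours, not the client's:

```yaml
# Opt a namespace into Goldilocks so it collects usage data and shows recommendations.
apiVersion: v1
kind: Namespace
metadata:
  name: team-a-dev                                 # hypothetical namespace
  labels:
    goldilocks.fairwinds.com/enabled: "true"
---
# The same frontend after right-sizing: values derived from weeks of observed usage.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
  namespace: team-a-dev
spec:
  replicas: 1                                      # reduced from 3+ while there is no load and no HPA
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
        - name: frontend
          image: example.azurecr.io/frontend:1.0.0 # placeholder image
          resources:
            requests:
              cpu: 100m                            # was: "2"
              memory: 256Mi                        # was: 2Gi
            limits:
              cpu: 500m
              memory: 512Mi
```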
Result: AKS nodes were better utilized, and the total number of nodes decreased significantly.
Measure 2: Azure Reserved Instances – Ending On-Demand Waste
In parallel, we addressed the Azure VMs. The remaining, now better-utilized baseline load of AKS nodes was still running at expensive on-demand prices.
- Identification & Switch to RIs: We determined the constant VM baseline load after optimization and purchased Azure Reserved Instances for it (1-year term for flexibility). This brought substantial discounts for compute power that was needed anyway.
Interim Result: The first 30% achieved!
Through Kubernetes Right-Sizing and RIs alone, the monthly Azure costs dropped by around 30%. This success was crucial: It proved feasibility, built acceptance, and provided energy for tackling the more complex, remaining 70% of the costs. This was just the beginning.
Phase 2: Detective Work – Tracking Down Hidden Costs
A thirty percent saving was a great initial success, but the journey was far from over. The remaining costs, close to half a million euros annually, still felt exorbitant for a growing platform with internal applications. There had to be other, less obvious cost drivers. The question was: Where were they hiding?
Measure 3: The Storage Cost Trap – A Surprisingly Expensive Item
Analyzing the Azure costs, we noticed something striking: cloud storage expenses were unexpectedly high. Raw storage capacity is usually one of the cheaper items, so we went looking for the cause of the sheer volume.
Deeper analysis revealed the main driver: the sheer amount of log and tracing data from our Elasticsearch/Kibana solution, especially in the Development and Staging environments. Unfiltered terabytes of data, often from development experiments, generated storage and transfer costs in the mid five-figure euro range annually and strained the logging system.
Countermeasures & Result: We revised the logging strategies: Clear guidelines for log levels per environment, adjusted retention periods, and a developer-driven "Community of Practice" for more conscious logging were key. The reduced log volume directly lowered costs and relieved the Elasticsearch cluster.
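What "clear guidelines for log levels per environment" can mean in practice is sketched below as a shipper-side guard rail. The example assumes a Filebeat-style agent feeding Elasticsearch and that log levels are already parsed into a log.level field – both are assumptions about a stack the post does not spell out, so treat it as a pattern rather than a drop-in config:

```yaml
# filebeat.yml (sketch for Dev/Staging): ship container logs, but drop debug and
# trace events before they generate storage and transfer costs in Elasticsearch.
# Assumes log.level is populated by your parsing pipeline.
filebeat.inputs:
  - type: container
    paths:
      - /var/log/containers/*.log

processors:
  - drop_event:
      when:
        or:
          - equals:
              log.level: "debug"
          - equals:
              log.level: "trace"

output.elasticsearch:
  hosts: ["https://elasticsearch.example.internal:9200"]   # placeholder endpoint
```

The same effect can be achieved inside the applications themselves by raising the minimum log level per environment; the point is that noisy events are discarded before they ever hit storage, while shorter retention periods take care of what still gets through.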
Measure 4: Granular Cost Transparency – Knowing Who Consumes What
Simultaneously, we tackled the lack of cost allocation. For true FinOps, we needed to answer the question: "Which team/product causes which costs?". Only then can accountability and optimization incentives emerge.
Challenge & Implementation:
Allocation wasn't trivial (due to shared resources). We relied on:
- Consistent Azure Tagging: A mandatory schema (Team, Product, Environment, etc.), enforced via Azure Policies.
- Kubernetes Namespaces: Clear separation of workloads per team/product in AKS.
- Dedicated Log Indices: Separate Elasticsearch indices per team/product for cost allocation, analysis, and retention policies.
- Breaking Up Shared Services: Migration to dedicated databases per core application/team. This drastically improved transparency, stability, and autonomy. Remaining shared services were allocated via cost keys.
- Kubecost/OpenCost: Implemented for a granular cost breakdown within the Kubernetes clusters, based on resource usage (CPU, RAM, storage) per Pod/namespace and reconciled with the Azure costs of nodes and volumes – see the namespace sketch after this list.
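A minimal sketch of how the namespace-per-team setup and the tag schema fit together (label keys and names are our illustrative choice): Kubecost/OpenCost can aggregate its allocation data by namespace and by label, so mirroring the Azure tag schema keeps the cluster reports and the Azure reports on the same dimensions.

```yaml
# One namespace per team/product; labels mirror the Azure tag schema so that
# Kubecost/OpenCost allocation and Azure Cost Management reports line up.
apiVersion: v1
kind: Namespace
metadata:
  name: team-a-shop                                # hypothetical team/product
  labels:
    team: team-a
    product: shop
    environment: production
```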
Measure 5: Reporting and Insights – Turning Data into Decisions
The collected data from Azure Cost Management, Kubecost, etc., had to be made usable.
- Automated Reports & Visualization: Monthly, detailed cost reports per team/product were generated and visualized in Grafana and the central BI tool. Every team could see their costs, drivers, and trends live at any time.
- Empowering Teams: This transparency was key. Teams and Product Owners had a factual basis for cost discussions. They could estimate the financial impact of new features ("Feature X will cost us Y euros more on the DB"). The discussion shifted from blame to constructive considerations: "Is this feature worth the cost, or should we invest in optimization?".
This second phase of detective work and radical transparency was crucial. It uncovered hidden cost drivers like logging and established tools for sustainable cost management. The foundation for the targeted cost halving was laid, and teams were empowered to take responsibility for their cloud spending.

Grafana dashboard for cost transparency.
Challenges: Technology, Teams, and Culture – The Rocky Path to Efficiency
Sounds like a smooth plan? Far from it! Despite good results, the path to efficiency wasn't a walk in the park through the cloud. Every optimization required discussions, persuasion, and overcoming technical and cultural barriers.
- The Human in the Machine Room: Between Fear and Habit
Often, the biggest hurdle was human. Developers accustomed to resource abundance reacted skeptically to savings targets ("Will it hold up?"). Extensive communication, testing, and gentle pressure were needed to break old habits. Uncovering memory leaks through right-sizing also showed: Optimization improves quality. Nevertheless, the struggle for every megabyte often remained tough.
- When Technology (Doesn't Yet) Cooperate: The Missing Autoscaler
Sometimes the technology was simply missing, like Horizontal Pod Autoscaling (HPA). Without automatic scaling, we had to run expensive, oversized replica counts – fixed costs that only later architectural adaptations like HPA could reduce (a minimal example follows after this list). A clear sign: cost optimization and modernization go hand in hand.
- The Long Haul for Cultural Change: FinOps is a Process
The most persistent challenge was embedding genuine cost awareness within the company. Transparency was the first step, but it took time for teams to actively use the data. FinOps isn't a project; it's a continuous cultural shift that needs constant attention for cost awareness to become second nature.
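To make the "missing autoscaler" concrete, a minimal HorizontalPodAutoscaler of the kind that was introduced later might look like this – target names and thresholds are illustrative:

```yaml
# Scale the (hypothetical) frontend between 1 and 5 replicas based on CPU usage.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend
  namespace: team-a-prod                           # hypothetical namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70                   # scale out above ~70% of requested CPU
```

With a rule like this, the baseline can stay at a single replica, and additional Pods only appear – and cost money – when real load arrives.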
Overcoming these hurdles was just as important as the technology itself. Only through the interplay of technology, teams, and culture can the fruits of optimization be harvested sustainably. And the results showed: The effort was worth it.
The Results: Costs Halved, Quality Doubled
After intensive analysis and optimization, the results were in – impressive on paper and in practice.
Hard Facts: Mission Cost Halving Accomplished!
The goal of halving the high 6-figure Azure costs was exceeded. Through right-sizing, Reserved Instances, log optimization, and transparency, spending decreased by more than half – despite growing from four to over ten applications!
Unexpected Gains: More Than Just Savings
The kicker: The optimization turned out to be a free quality improvement program.
- More Stability: Right-sizing exposed weaknesses (memory leaks, inefficiencies). Fixing them led to significantly fewer outages and measurably higher availability (99.995% over a 12-month period). The leaner infrastructure was more robust.
- More Efficiency: Fewer resources meant less complexity, faster deployments, and easier management.
- FinOps Mindset Established: Thanks to transparency, teams actively tracked costs. The question "What does this cost?" became part of development and decision-making.
- More Conscious Developers: The team learned to code more resource-efficiently, log purposefully, and understand the cost implications.
In the end, we had a cheaper, better, more stable, and future-proof platform. Cost optimization was thus an investment in technical excellence and resilience.
Ensuring Sustainability: FinOps Isn't a Project, It's a Fitness Program
Costs halved, platform stable – mission accomplished? Not quite. Successful optimization is like a fitness program: Without ongoing training, you quickly end up back on the couch. FinOps isn't a one-time action but a continuous task for everyday practice.
So how do you maintain success and keep cost awareness alive?
- Transparency Remains Key: Dashboards are the daily compass. The question "What are we currently spending?" must be answerable with a click at any time for data-driven decisions.
- Activate Early Warning Systems: Automatic alerts fire when costs drift out of bounds. Unintended cost explosions (e.g., from debug logs) can then be detected and stopped early, before they blow up the monthly bill – see the alert-rule sketch after this list.
- Plan with Foresight: Cost data becomes the basis for budgets and forecasts. Teams can better estimate the financial impact of new features and have informed discussions about priorities.
- Anchor Responsibility in the Team: Transparency creates ownership. Teams are also responsible for their costs. The central architect or FinOps team acts as a coach, not a controller, providing tools and support. Regular cost reviews become routine.
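As a sketch of such an early-warning system: assuming a Prometheus Operator setup and the metrics exported by Kubecost/OpenCost (the metric name below, node_total_hourly_cost, comes from OpenCost's exporter – verify it against your installation), a simple budget alert could look like this:

```yaml
# Fire a warning when the summed hourly node cost stays above the agreed
# baseline for two hours – e.g., because a runaway workload or debug logging
# has silently scaled the cluster up.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: finops-cost-alerts
  namespace: monitoring                            # hypothetical monitoring namespace
spec:
  groups:
    - name: finops
      rules:
        - alert: ClusterHourlyCostAboveBaseline
          expr: sum(node_total_hourly_cost) > 40   # illustrative threshold per hour
          for: 2h
          labels:
            severity: warning
          annotations:
            summary: "Cluster node cost above the agreed baseline"
            description: "Summed hourly node cost has exceeded the budget threshold for 2 hours."
```

Azure Cost Management budgets with alert thresholds serve the same purpose at the subscription level; the point is that someone is notified within hours, not when the monthly invoice arrives.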
Establishing FinOps permanently means creating a culture where costs are as natural a consideration as performance or security. It requires the right tools, clear responsibilities, and keeping the topic visible. Only then does a project become a sustainable improvement that strengthens the company long-term – both financially and technically.
Conclusion: Excellence Over Mere Savings
Our journey began with a shock: Nearly €1M in cloud costs for a few internal apps and a risky infrastructure. The outcome: Halved expenses, a stable platform, >10 satisfied teams. Our learning: Cloud cost optimization is often disguised quality improvement. By following the money, we uncovered technical debt and inefficient processes – from memory leaks to log tsunamis. Cost pressure enforced discipline, better architecture, and smart solutions.
The result was more than just a nicer Azure bill: a more robust, efficient platform for more productive teams. The "unexpected" gains in stability and know-how were invaluable.
And how does it look on your end? Do you truly understand your cloud bill? Can you identify your cost drivers – and the technical skeletons in the closet? Question the status quo, create transparency! It pays off twice: for the budget and the technology. Because having costs under control usually means having the infrastructure under control too.