SLA vs Reality: How Data Centres Actually Deliver Uptime

Image: a signed service level agreement contrasted with data centre operations staff monitoring systems in a control room.

Service Level Agreements (SLAs) are everywhere in infrastructure. They appear in contracts, proposals, procurement checklists and board updates. They are often treated as a proxy for resilience.

From a data centre operator’s perspective, that assumption is understandable, but incomplete. An SLA can describe a target availability level without explaining the operational discipline required to achieve it. Uptime is not a statement. It is a sustained outcome delivered day after day, at 02:00 as much as at 14:00, through routine change as well as real disruption.

At Onix, we operate a Tier IV data centre in Accra. Tier IV is commonly associated with 99.995% availability, which translates to roughly 26 minutes of downtime per year.
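The arithmetic behind that figure is easy to check. The snippet below is a minimal sketch, assuming a 365-day year and treating availability as a simple fraction of total time; the function name is illustrative, and 99.9% is included only as a point of comparison.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * MINUTES_PER_YEAR


# Compare a common three-nines target with the 99.995% figure quoted above.
for target in (99.9, 99.995):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} minutes per year ({minutes / 60:.2f} hours)")
```

The budget is a yearly total. A real contract may measure it per month or per quarter instead, which changes how a single long incident is scored.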

What an SLA actually is (and isn’t)

An SLA is a formal commitment between a service provider and a customer that defines expected service performance, usually expressed through measurable targets such as availability, response times, or resolution windows.

It typically answers three questions:

  • How much uptime is being committed to?
  • How is that uptime measured and reported?
  • What happens if the target is missed? (Often service credits.)

An SLA is not a blueprint for reliability. It rarely tells you how resilience is engineered, how failures are prevented, how changes are controlled, or how incident response is rehearsed.

A useful way to think about it is:

  • The SLA is the scoreboard.
  • Reliability is the training ground.

A quick note on “the nines”

Availability percentages such as 99.9% and 99.995% look close on paper, but operationally they are worlds apart.

  • 99.9% availability allows for roughly 8.8 hours of downtime per year.
  • 99.995% availability allows for roughly 26.3 minutes per year.

The operational reality behind “always on”

Data centres do not deliver uptime through wording. They deliver it through routine, disciplined work that most customers never see.

Day-to-day reliability typically includes:

  • Monitoring and alerting: not just visibility, but tuned alerts that catch drift early, not only when systems fail loudly.
  • Escalation paths: clear ownership, response targets, and the authority to act quickly.
  • Maintenance planning and testing: uptime depends on planned work being done carefully, repeatedly, and with evidence.
  • Redundancy that actually works: power, cooling, and connectivity resilience only count if they can take a hit and keep running.
  • Human readiness: shift handovers, runbooks, training, drills. People are part of the system.
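The "catch drift early" idea in the monitoring bullet above can be sketched in a few lines. This is an illustration only, not a description of any particular monitoring product: the readings, the window size and both thresholds are hypothetical values chosen to show the difference between an alert that waits for a hard limit and one that notices a sustained rise.

```python
from statistics import mean


def hard_threshold_alert(readings: list[float], limit: float) -> bool:
    """Fire only when the latest reading has already reached the limit."""
    return readings[-1] >= limit


def drift_alert(readings: list[float], window: int, max_rise: float) -> bool:
    """Fire when the mean of the latest window exceeds the mean of the
    window before it by more than max_rise."""
    if len(readings) < 2 * window:
        return False  # not enough history to compare two windows
    recent = mean(readings[-window:])
    previous = mean(readings[-2 * window:-window])
    return recent - previous > max_rise


# Hypothetical supply-air temperatures in degrees Celsius, oldest first.
readings = [22.0, 22.1, 22.0, 22.1, 22.6, 23.0, 23.5, 24.1]

print("hard limit breached:", hard_threshold_alert(readings, limit=27.0))
print("drift detected:", drift_alert(readings, window=4, max_rise=1.0))
```

With these sample values the hard limit has not been reached, yet the drift check already flags the upward trend, which is the window in which an operator can still act calmly.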

This is where the distinction between design resilience and operational resilience matters. Design resilience is what a facility could withstand on paper. Operational resilience is what it does withstand in real life, with real workloads and real constraints.

Where the gap really lives

When organisations feel “we had an SLA, but we still went down”, the gap usually comes from a handful of predictable places.

1) Measurement and reporting windows

SLAs depend on how “uptime” is defined and where it is measured. Is it measured at the facility boundary, the network edge, a service demarcation point, or at the application layer? What counts as downtime versus degraded service?

If you do not know what is being measured, the SLA cannot protect you from the outage you actually care about.
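To see how much the definition matters, consider the same month scored two ways. The sketch below uses invented event durations and hypothetical category names ("outage", "degraded", "planned"); real SLAs define their own categories and exclusions, so treat this as an illustration of the effect rather than a model of any specific contract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    minutes: float
    kind: str  # hypothetical categories: "outage", "degraded", "planned"


def measured_availability(events, period_minutes, counted_kinds):
    """Availability (%) when only the given kinds of event count as downtime."""
    downtime = sum(e.minutes for e in events if e.kind in counted_kinds)
    return 100 * (1 - downtime / period_minutes)


PERIOD_MINUTES = 30 * 24 * 60  # a 30-day reporting window

# Invented events for one month.
events = [
    Event(minutes=10, kind="outage"),     # hard failure
    Event(minutes=120, kind="degraded"),  # slow but technically reachable
    Event(minutes=60, kind="planned"),    # maintenance excluded by the contract
]

# Contract view: only hard outages count as downtime.
contract_view = measured_availability(events, PERIOD_MINUTES, {"outage"})

# Customer view: any minute the service was not fully usable counts.
customer_view = measured_availability(
    events, PERIOD_MINUTES, {"outage", "degraded", "planned"}
)

print(f"contract view: {contract_view:.3f}%")
print(f"customer view: {customer_view:.3f}%")
```

The events are identical in both calculations. Only the definition of downtime changes, and that alone is enough to move the reported figure from well above 99.9% to below it.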

2) Exclusions and shared responsibility

Most SLAs include exclusions. Some are fair. Some are misunderstood. In many environments, the facility and core services can be operating normally while the customer architecture is the point of failure.

Mature organisations treat uptime as a shared design and operational responsibility, not a single-party promise.

3) Root cause discipline vs symptom logging

“Connectivity issue” is not a post-incident analysis. Neither is “power event” or “device failure”. Reliability improves when incidents are investigated properly, contributing factors are documented, and fixes are verified.

Without that discipline, you get repeat incidents dressed up as one-offs.

4) Change risk

A large portion of outages are not caused by disasters. They are caused by normal work: patching, upgrades, configuration changes, maintenance windows and vendor interventions.

The more complex the environment, the more likely it is that a seemingly minor change triggers an unexpected chain reaction.

5) Edge cases under real load

Systems behave differently at scale and under pressure. A configuration that holds in testing may buckle during peak throughput, unusual traffic patterns, or upstream instability.

Reliability requires engineering for these edge cases rather than assuming the “average day” is the standard.

What really keeps systems up

If you want to assess whether uptime is likely to be delivered in practice, look for these operational behaviours.

  • Proactive testing: failover tests, load tests, routine verification. Redundancy that is not tested is just duplicated uncertainty.
  • Parallel power philosophy: clearly engineered power paths, a well-understood UPS strategy, and maintained switching procedures.
  • Connectivity diversity: multiple carriers, physically diverse routes where possible, and clarity on what “diverse” means in practice.
  • Controlled change management: planning, peer review, maintenance windows, rollback plans, and strict execution.
  • Incident response rehearsal: tabletop exercises, response playbooks, and a culture that improves after each event.
  • Governance and accountability: ownership, audit trails, post-incident reviews, and continuous improvement.

These are not “nice to have”. They are the operational foundation that turns uptime from a statistic into a reality.

Why this matters for 2026 and beyond

Reliability is no longer just a technical KPI. It is a strategic asset.

In 2026 and beyond, infrastructure choices sit inside broader frameworks: business continuity, customer trust, regulatory confidence, cyber resilience and operational risk. Downtime is not only lost revenue. It is reputational damage and operational instability.

That is why it is worth moving beyond “What SLA do you offer?” to the stronger question:

What operational discipline makes that SLA credible?

Closing thought

An SLA matters. It sets expectations and accountability. But on its own, it is not reliability.

SLAs declare intent. Reliable operations deliver it.

If reliability matters to your strategy in 2026, it is worth reassessing how your infrastructure is actually operated, not only what its SLA promises.

If you want to understand how uptime is delivered in practice, not just how it is written into an SLA, our operations team is always open to a conversation. Reach out to us.