Lesson 09/12 · Advanced · 13 min read · 3 diagrams

Operating a Data Center

DCIM, BMS, capacity planning, change management, incident response. The day-to-day discipline that keeps a 99.99%-uptime facility actually running. This is where everything that was over-engineered in design pays off — or doesn't.

1 · DCIM — the operations control plane

DCIM = Data Center Infrastructure Management. The software layer that tracks every rack, U slot, cable, asset, power feed, and environmental sensor. Without DCIM, you can't answer basic questions like "how much spare power is in row 4?" or "which rack is server X-1234 in?".

  • Sunbird dcTrack — DCIM — asset + capacity tracking
  • Schneider EcoStruxure — DCIM + BMS — vendor-integrated
  • Vertiv Trellis — DCIM — strong on power chain
  • Nlyte — DCIM — workflow + asset management
  • NetBox (open-source) — IPAM/DCIM — free, dominant in cloud DCs
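The "how much spare power is in row 4?" question reduces to a query over the asset database. A minimal sketch, assuming an illustrative in-memory data model (rack names, capacities, and draws here are invented, not any vendor's actual schema):

```python
# Illustrative DCIM-style rack records; fields and values are hypothetical.
racks = [
    {"row": 4, "rack": "R4-01", "power_capacity_kw": 17.0, "power_draw_kw": 12.4},
    {"row": 4, "rack": "R4-02", "power_capacity_kw": 17.0, "power_draw_kw": 15.9},
    {"row": 5, "rack": "R5-01", "power_capacity_kw": 17.0, "power_draw_kw": 6.1},
]

def spare_power_kw(row: int) -> float:
    """Spare power in a row = sum of (capacity - draw) over its racks."""
    return sum(r["power_capacity_kw"] - r["power_draw_kw"]
               for r in racks if r["row"] == row)

print(round(spare_power_kw(4), 1))  # 5.7
```

A real deployment would issue the same query against the DCIM's API (NetBox, for instance, exposes racks and power feeds over REST) rather than a local list.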

2 · BMS — Building Management System

BMS controls and monitors the physical building: chillers, pumps, fans, dampers, fire suppression, access control. It's a separate system from DCIM (which focuses on IT) but they integrate at the dashboard layer. Common BMS platforms: Honeywell, Johnson Controls Metasys, Siemens Desigo, Schneider EcoStruxure.

3 · Capacity planning

Three resources need constant tracking: power, cooling, and space. The trick is that they don't deplete at the same rate. You'll often run out of cooling capacity in a row before space, or run out of power before cooling.

  • Power capacity — track PDU and UPS loading, project growth from procurement pipeline.
  • Cooling capacity — CDU loading, return temp deltas, chilled water flow.
  • Space capacity — U slots, rack slots, but also clearance for liquid manifolds.

The key metric is stranded capacity. If you have 100 MW of utility power but only 80 MW of cooling, your "real" capacity is 80 MW — the other 20 MW of power is stranded. Operators obsess over reducing stranded capacity.
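The stranded-capacity arithmetic is just a min over the three resources. A sketch using the 100 MW / 80 MW example from the text (the space figure is an invented placeholder, expressed as supportable MW):

```python
def usable_mw(power_mw: float, cooling_mw: float, space_mw: float) -> float:
    """Deliverable capacity is capped by the scarcest resource."""
    return min(power_mw, cooling_mw, space_mw)

# 100 MW of utility power, 80 MW of cooling; space figure is hypothetical.
power, cooling, space = 100.0, 80.0, 120.0
usable = usable_mw(power, cooling, space)
stranded_power = power - usable  # power you paid for but can't deliver

print(usable, stranded_power)  # 80.0 20.0
```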

4 · Incident response

Every outage starts with an alarm — usually from BMS, DCIM, or a customer ticket. Mature operations have a runbook for every common alarm and a tiered escalation:

  1. Tier 1 (NOC): Triage, runbook execution, "is something on fire?".
  2. Tier 2 (DC engineer): Hands-on, swap parts, repatch cables.
  3. Tier 3 (specialist): Power engineer, mechanical engineer, network architect.
  4. Vendor escalation: NVIDIA, Schneider, Cisco — call out to the manufacturer.

Every Tier-IV facility runs 24/7 on-site staffing. Colocation providers offer "remote hands" services for tenants who don't want their own staff in the building.
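The tiered escalation above can be sketched as a time-based policy. The thresholds here are invented for illustration; real runbooks set them per alarm class:

```python
# Hypothetical escalation thresholds (minutes unresolved -> owning tier).
ESCALATION = [
    (0,   "Tier 1 (NOC): triage, runbook execution"),
    (15,  "Tier 2 (DC engineer): hands-on response"),
    (60,  "Tier 3 (specialist): power / mechanical / network"),
    (240, "Vendor escalation: manufacturer support"),
]

def current_tier(minutes_unresolved: int) -> str:
    """Return the highest tier whose threshold has been reached."""
    tier = ESCALATION[0][1]
    for threshold, name in ESCALATION:
        if minutes_unresolved >= threshold:
            tier = name
    return tier

print(current_tier(90))  # Tier 3 (specialist): power / mechanical / network
```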

5 · Maintenance & change management

N+1 redundancy means you can take any one component out for service without losing capacity. Concurrently maintainable (Tier III) means you can do this without an outage. Fault tolerant (Tier IV) means even an unplanned failure during maintenance doesn't take you down.

All scheduled work happens through a Method of Procedure (MOP) — a written, peer-reviewed, step-by-step plan with defined rollback points. Large operators execute hundreds of MOPs per year.
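The defining property of a MOP — every step carries a rollback point, and nothing runs without review — can be captured as a small data model. A sketch with invented step text, not a real procedure:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    action: str
    rollback: str  # every step must name how to undo it

@dataclass
class MOP:
    title: str
    steps: list[Step] = field(default_factory=list)
    peer_reviewed: bool = False

    def approve(self) -> None:
        """Peer review gate: reject MOPs with missing rollback points."""
        if not self.steps or any(not s.rollback for s in self.steps):
            raise ValueError("every step needs a rollback point")
        self.peer_reviewed = True

# Hypothetical UPS maintenance MOP.
mop = MOP("Replace UPS module B", steps=[
    Step("Transfer load to module A", rollback="transfer load back to B"),
    Step("Swap module B", rollback="reinstall original module"),
])
mop.approve()
print(mop.peer_reviewed)  # True
```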

Source: The Uptime Institute Operations Sustainability framework; Schneider EcoStruxure documentation; ISO/IEC 22237 (data center facility infrastructure standards).

Lesson 09 — TL;DR

  • DCIM tracks every IT asset; BMS controls the building. Both feed the NOC.
  • Capacity = min(power, cooling, space). Reducing stranded capacity is a continuous job.
  • Every alarm has a runbook. Tiered escalation: NOC → DC eng → specialist → vendor.
  • N+1, concurrently maintainable, fault-tolerant — three levels of operational rigor.
  • MOPs are the universal currency of safe change.
