Introduction
Modular UPS architecture divides the power path into independent power, battery and control modules that can be hot-swapped while the system continues to supply the load. This design dramatically reduces Mean-Time-to-Repair (MTTR), but only if the operator can decide, within seconds, which battery module is unhealthy and why. The following text summarizes field-proven techniques that allow a technician to locate a defective battery module in less than five minutes (under ten including the confirming self-test) without removing the wrong pack or shutting the bus down.
Typical Failure Signature of a Battery Module
A battery module in a modular UPS is normally a 20–60 Ah, 42–54 V lithium-ion or VRLA string with its own Battery Management System (BMS). The failures most frequently seen in the field are:
a) Internal cell open-circuit (voltage collapses under load)
b) Cell short-circuit (module voltage below nominal by about one cell voltage: 2 V for VRLA, 3.6 V for lithium-ion)
c) BMS Hall sensor drift (current reading offset >3 %)
d) Blown fuse or open MOSFET inside the module (zero charge current)
e) Over-temperature shutdown (heat run, fan blocked, >60 °C)
f) Capacity fade (<70 % of nameplate after 400 cycles)
Each signature leaves a different footprint in the data. The art is to map the footprint to the physical module in the shortest possible time.
Four-Layer Diagnostic Model
Layer-1: System-level alarms (UPS LCD / SNMP)
Layer-2: Module-level telemetry (voltage, current, temp, impedance)
Layer-3: Cell-level trending (SOC imbalance, ΔV >300 mV)
Layer-4: Waveform capture (millisecond resolution during fault)
The technician should always start at Layer-1 and descend only as far as necessary to make the go/no-go decision.
Layer-1 – Use the UPS Alarm Register
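Modern modular UPS systems (Eaton 9PX, Vertiv Liebert APM, APC Symmetra PX) publish a binary alarm register as part of their SNMP MIB or Modbus map.
Key OIDs to poll (examples; OIDs are vendor-specific, so confirm them against the MIB file supplied with the UPS):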
batteryTestFail (1.3.6.1.4.1.935.1.1.100.5.1.6)
batteryModuleFault (1.3.6.1.4.1.935.1.1.100.5.2.1.5)
batteryModulePosition (1.3.6.1.4.1.935.1.1.100.5.2.1.2)
A single snmpget will return the slot number of the module that raised the flag. Record the slot; do not swap anything yet. Confirm that the alarm is present in two consecutive polls 30 s apart to avoid reacting to a spurious spike.
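The two-poll confirmation described above can be sketched as a small debounce routine. This is a minimal illustration, not a vendor implementation: `poll_slot` is a hypothetical callable that wraps whatever SNMP or Modbus read the site actually uses and returns the faulted slot number, or `None` when no alarm is set.

```python
import time
from typing import Callable, Optional


def confirm_alarm(poll_slot: Callable[[], Optional[int]],
                  interval_s: float = 30.0,
                  sleep: Callable[[float], None] = time.sleep) -> Optional[int]:
    """Return the faulted slot only if two consecutive polls agree.

    poll_slot is a hypothetical wrapper around the real SNMP/Modbus read.
    """
    first = poll_slot()
    if first is None:
        return None          # no alarm on the first poll
    sleep(interval_s)        # wait 30 s before the confirming poll
    second = poll_slot()
    # A spurious spike shows up as a mismatch between the two polls.
    return first if second == first else None


# Canned readings stand in for a real UPS; sleep is stubbed out.
readings = iter([3, 3])
confirmed = confirm_alarm(lambda: next(readings), sleep=lambda s: None)
print(confirmed)
```

Passing `sleep` as a parameter keeps the 30 s wait out of bench tests while leaving the default behaviour unchanged in the field.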
Layer-2 – Compare Module Telemetry
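Open the UPS web GUI or use the vendor’s software (Eaton IPM, Vertiv LIFE, APC StruxureWare). Export the real-time table that contains, for every battery module, the voltage, current, temperature and impedance.
Paste the table into Excel and calculate each module's voltage deviation from the median and its temperature rise above the coolest module: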
ΔV = |V_module − V_median|
ΔT = T_module − T_min
A module that simultaneously shows:
ΔV >0.5 V AND ΔT >5 °C AND Impedance >150 % of factory baseline
is flagged as suspect-1. Typically only one module meets all three criteria, giving a 60-second identification.
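For sites that prefer a script to a spreadsheet, the three-criteria rule can be sketched as follows. The dictionary keys (`slot`, `volts`, `temp_c`, `z_mohm`, `z_base_mohm`) and the sample values are illustrative assumptions, not a vendor export format; only the thresholds come from the rule above.

```python
from statistics import median


def flag_suspects(modules, dv_limit=0.5, dt_limit=5.0, z_limit=1.5):
    """Return slots that exceed all three Layer-2 thresholds at once."""
    v_median = median(m["volts"] for m in modules)
    t_min = min(m["temp_c"] for m in modules)
    suspects = []
    for m in modules:
        dv = abs(m["volts"] - v_median)            # deviation from median voltage
        dt = m["temp_c"] - t_min                   # rise above coolest module
        z_ratio = m["z_mohm"] / m["z_base_mohm"]   # impedance vs factory baseline
        if dv > dv_limit and dt > dt_limit and z_ratio > z_limit:
            suspects.append(m["slot"])
    return suspects


# Made-up telemetry for four modules; slot 3 is the outlier.
table = [
    {"slot": 1, "volts": 53.9, "temp_c": 27.0, "z_mohm": 8.2, "z_base_mohm": 8.0},
    {"slot": 2, "volts": 54.0, "temp_c": 28.0, "z_mohm": 8.4, "z_base_mohm": 8.0},
    {"slot": 3, "volts": 52.8, "temp_c": 35.5, "z_mohm": 13.1, "z_base_mohm": 8.0},
    {"slot": 4, "volts": 54.1, "temp_c": 27.5, "z_mohm": 8.1, "z_base_mohm": 8.0},
]
suspects = flag_suspects(table)
print(suspects)
```

Requiring all three conditions together is what keeps the false-positive rate low; a module that trips only one of them is better left in place and re-checked at Layer-3.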
Layer-3 – Deep Dive into Cell Imbalance
If the UPS allows cell-level access (most lithium-ion modules do), open the “cell voltage” page. A healthy module keeps the sixteen 3.6 V cells within 50 mV of each other during float.
Rule: max(cellV) − min(cellV) >300 mV → imbalance >8 % capacity.
If the imbalance is localised inside the previously flagged suspect-1 module, you have cross-verified the fault; if the imbalance is spread over two adjacent modules, the problem is more likely a loose interconnect than a single bad cell.
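The imbalance rule and the one-module versus adjacent-modules distinction can be expressed as a short check. This is a sketch under stated assumptions: the input layout (a dictionary of slot number to cell voltages) and the verdict strings are hypothetical, and slot numbers are assumed to be consecutive integers for physically adjacent drawers.

```python
def cell_spread_mv(cell_volts):
    """Spread between the highest and lowest cell, in millivolts."""
    return (max(cell_volts) - min(cell_volts)) * 1000.0


def classify_imbalance(spreads_by_slot, limit_mv=300.0):
    """Return (verdict, slots) for the slots whose spread exceeds the limit."""
    bad = sorted(s for s, mv in spreads_by_slot.items() if mv > limit_mv)
    if not bad:
        return "balanced", bad
    if len(bad) == 1:
        return "single module", bad
    if len(bad) == 2 and bad[1] - bad[0] == 1:
        # Two neighbouring drawers: suspect the interconnect first.
        return "check interconnect", bad
    return "multiple modules", bad


# Made-up cell voltages: slot 3 has one weak cell.
cells = {2: [3.60] * 16, 3: [3.60] * 15 + [3.25]}
spreads = {slot: cell_spread_mv(v) for slot, v in cells.items()}
verdict = classify_imbalance(spreads)
print(verdict)
```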
Layer-4 – Waveform Capture for Intermittent Faults
Some faults appear only during the millisecond transition from mains to battery. Use the built-in “fault recorder” function that is already present in many Chinese modular UPS platforms. The recorder continuously writes 500 µs samples to a RAM ring buffer; when the DSP throws a fault code it freezes 200 ms of post-fault data. Download the COMTRADE file and look at the per-module current traces.
A module whose current trace stays flat at 0 A while the others ramp up has an open fuse or MOSFET in its current path and must be pulled.
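Once the per-module current samples have been extracted from the recording, the flat-trace check is easy to automate. The sketch below assumes the samples are already available as plain lists of amperes per slot; it does not parse COMTRADE itself, and the two thresholds (`flat_limit_a`, `active_limit_a`) are illustrative values to be tuned per site.

```python
def find_open_modules(traces, flat_limit_a=0.5, active_limit_a=5.0):
    """Return slots whose current stays near 0 A while other modules ramp up.

    traces: dict of slot -> list of post-fault current samples in amperes.
    """
    peaks = {slot: max(abs(i) for i in samples)
             for slot, samples in traces.items()}
    # The comparison is only meaningful if some module actually took load.
    if not any(p >= active_limit_a for p in peaks.values()):
        return []
    return sorted(slot for slot, p in peaks.items() if p < flat_limit_a)


# Made-up traces: slots 1 and 3 ramp up, slot 2 stays flat.
traces = {
    1: [0.0, 4.0, 9.0, 12.0],
    2: [0.0, 0.1, 0.0, 0.1],
    3: [0.0, 5.0, 10.0, 12.0],
}
open_slots = find_open_modules(traces)
print(open_slots)
```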
Passive IR Scan – Optional but Fast
If the cabinet door can be opened safely, use a pocket thermal imager (FLIR ONE, ≤USD 300). Scan the battery drawers within 10 s. A module that is >6 °C hotter than its neighbours almost always contains a high-impedance cell or a balancing MOSFET that is running continuously. Mark the hot drawer with tape and keep the IR image; it can serve as supporting evidence for a warranty claim.
One-Minute “Swap-and-Watch” Test
When the above data still leave ambiguity (for example, two modules show similar ΔV), execute a minimally invasive test:
Note the instantaneous battery current on the UPS LCD.
Swap the positions of suspect-1 and its left neighbour (hot-swap, <30 s).
Watch how the current redistributes: if the alarm follows the module, the module is bad; if the alarm stays in the slot, the slot wiring or back-plane is bad.
This test costs one minute and prevents a good module from ending up in the scrap bin.
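The decision rule of the swap test fits in a few lines. In this sketch the function name and the return strings are hypothetical; the inputs are simply the slot that alarmed before the swap, the slot that alarms afterwards, and the neighbour slot into which the suspect module was moved.

```python
def swap_verdict(slot_before, slot_after, neighbour_slot):
    """Interpret a swap-and-watch test.

    slot_before:    slot that raised the alarm before the swap
    slot_after:     slot that raises the alarm after the swap (None if cleared)
    neighbour_slot: slot where the suspect module now sits
    """
    if slot_after == neighbour_slot:
        return "module"        # the alarm moved with the module
    if slot_after == slot_before:
        return "slot"          # the alarm stayed: wiring or back-plane
    return "inconclusive"      # alarm cleared or appeared elsewhere


# Suspect module moved from slot 3 to slot 2; the alarm now shows on slot 2.
result = swap_verdict(slot_before=3, slot_after=2, neighbour_slot=2)
print(result)
```

An "inconclusive" result, for example when the alarm clears after the swap, points towards a contact problem that re-seating happened to cure, so the slot is worth monitoring before anything is scrapped.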
Automated Battery Self-Test – Final Confirmation
Trigger the “battery capacity test” from the front panel. A 20 % discharge is low-risk yet sufficient to expose a 30 % capacity fade. A module that drops its voltage below 42 V (for 48 V lithium) before the test ends is tagged “Replace”. Abort the test immediately if any cell goes below 2.5 V to avoid deep-discharge damage.
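The pass, replace and abort criteria can be evaluated against logged test samples as shown below. This is a sketch only: the sample format, a time-ordered list of (module voltage, lowest cell voltage) pairs, is an assumption, and the limits are the 48 V lithium figures quoted above. The abort condition is checked first because it protects the cells.

```python
def evaluate_self_test(samples, module_limit_v=42.0, cell_abort_v=2.5):
    """Return (verdict, sample_index) for a logged discharge test.

    samples: time-ordered iterable of (module_volts, min_cell_volts).
    """
    for n, (module_v, min_cell_v) in enumerate(samples):
        if min_cell_v < cell_abort_v:
            return ("abort", n)      # stop the test to avoid deep discharge
        if module_v < module_limit_v:
            return ("replace", n)    # module sagged below the 42 V limit
    return ("pass", None)


# Made-up log: the module sags below 42 V on the third sample.
log = [(51.0, 3.2), (47.5, 3.0), (41.8, 2.7)]
outcome = evaluate_self_test(log)
print(outcome)
```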
Common Field Mistakes to Avoid
Do not trust a green LED on the module; the LED only reflects the BMS “present” signal, not capacity.
Do not rely on internal resistance alone; new VRLA modules may read 8 mΩ while aged but still functional modules read 12 mΩ—yet both are acceptable. Always combine at least two indicators.
Do not overlook the inter-module data cable; a loose RJ45 can simulate a “missing module” alarm. Re-seat cables first.
Do not perform a 100 % discharge test on the production floor; it stresses the remaining good modules and extends recharge time to hours.
Document and Close the Loop
After replacement, save the following in the CMMS (computerised maintenance management system):
Alarm snapshot (SNMP or screenshot)
Telemetry CSV file
Thermal image (if taken)
Serial number of removed module
Serial number of new module
Date-stamp and technician ID
This package builds a statistical base that can later be mined for predictive models.
Predictive Extension – Machine-Learning Overlay
Once 50 or more historical fault packages are available, train a gradient-boosting classifier using the features: ΔV, ΔT, impedance, cycle count, calendar age, ambient temperature. The model can forecast “probability of failure within 30 days” with ~87 % precision, allowing the site to order spare modules just-in-time and cut spare inventory by 40 %.
Summary Workflow (Pocket Card)
Read alarm register → get slot X (30 s)
Export telemetry → flag outliers (60 s)
Check cell imbalance → confirm (60 s)
Optional IR scan → mark hot drawer (30 s)
Swap-and-watch → fault follows module? (60 s)
20 % self-test → final proof (5 min)
Total elapsed time <10 min; system remains on-line throughout.
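As a quick check of the time budget, the step durations on the pocket card can be summed. The dictionary below simply restates the card; the step labels are shortened for illustration.

```python
# Step durations in seconds, taken from the pocket card above.
steps_s = {
    "read alarm register": 30,
    "export telemetry": 60,
    "check cell imbalance": 60,
    "optional IR scan": 30,
    "swap-and-watch": 60,
    "20 % self-test": 300,
}

# Localisation covers everything except the confirming self-test.
locate_s = sum(t for name, t in steps_s.items() if name != "20 % self-test")
total_s = sum(steps_s.values())

print(locate_s / 60, total_s / 60)  # 4.0 9.0
```

Localisation alone takes 4 minutes, and the full sequence including the self-test takes 9 minutes, which is consistent with the stated total of under 10 minutes.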
By rigorously following the four-layer diagnostic model and using the swap-and-watch test as the tie-breaker, any field technician—without specialised battery laboratories—can localise a defective battery module in a modular UPS in less time than it takes to find the screwdriver set.