Together, your Internet even better

DC failures: caused by the network?

While power outages remain a frequent cause of data center outages, they are no longer the only one. IT system failures and network errors account for a growing share of failures. That is why the Uptime Institute examined publicly known outages to find out what caused unplanned service interruptions. To do this, the organization analyzed 162 service interruptions reported in traditional and social media over the past three years.

The media reported 27 outages in 2016, 57 in 2017 and 78 in 2018. "Service outages are increasingly making headlines in the media," said Andy Lawrence, the Institute's Executive Director of Research. This does not necessarily mean that the number of failures is skyrocketing, but rather that downtime is attracting more and more attention. "It is clear that for users, the impact of outages is certainly more damaging today," he adds.

The study revealed that, overall, network and IT system problems are blamed for outages more often than power supply problems. This is because power supply systems are more reliable than in the past, so there are fewer power outages in data centers.

At the same time, the increasing complexity of IT environments is causing a growing number of IT and network problems. "Data is now dispersed in multiple locations, with critical dependencies on the network, on how applications are architected and on how databases replicate each other. It is a very complex system, and it now takes fewer events to disrupt its operation," said Todd Traver, Vice President of Optimization and IT Strategy at the Uptime Institute.

The trend becomes clearer when causes are compared from one year to the next. Power supply problems accounted for 28% of outages in 2017, compared with 11% the following year. IT system failures remained relatively constant: 32% in 2017 and 35% in 2018. Outages due to network problems increased significantly, from 19% in 2017 to 32% in 2018. "Things are linked not to one or two sites but to three, four or even more. The network plays an increasingly important role in IT resilience," says Todd Traver.
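The year-over-year shift can be worked out directly from the figures quoted above. The short Python sketch below is only an illustration: the dictionary layout and the `point_change` helper are assumptions made for this example, not part of the Uptime Institute's study. Note that the three quoted shares do not add up to 100% for either year; the article does not break down the remainder.

```python
# Shares of reported outages by cause, in percent, as quoted in the article.
# The data layout and helper name are illustrative assumptions.
shares = {
    "power": {2017: 28, 2018: 11},
    "IT systems": {2017: 32, 2018: 35},
    "network": {2017: 19, 2018: 32},
}


def point_change(by_year, start=2017, end=2018):
    """Return the change in percentage points between two years."""
    return by_year[end] - by_year[start]


changes = {cause: point_change(by_year) for cause, by_year in shares.items()}

# Print causes from the largest decrease to the largest increase.
for cause, delta in sorted(changes.items(), key=lambda item: item[1]):
    print(f"{cause}: {delta:+d} points")
```

Run as is, this shows power falling by 17 points, IT systems rising by 3 and the network rising by 13.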

To distinguish an interruption that can threaten a company's business from a merely disruptive failure, the Uptime Institute has developed an evaluation grid with a five-level scale:

  • Level 1: a negligible outage. The failure is recordable but there is little or no obvious impact on services and no service interruption.
  • Level 2: a minimal interruption of service. Services are disrupted, but the effect on users, customers or reputation is minimal.
  • Level 3: a service interruption that is significant to the company. These are interruptions in customer or user service, most often of limited scope, duration or impact. The financial impact is minimal or non-existent but there is some impact on reputation or compliance.
  • Level 4: a serious operational or service failure that disrupts services and/or operations, involving financial loss, non-compliance, reputation damage and possibly even security issues, with possible loss of customers.
  • Level 5: a critical failure for the company or its mission, resulting in a major and damaging interruption of services and/or operations, involving significant financial loss, security issues, non-compliance, customer losses and reputation damage.
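For teams that log incidents, the five levels above map naturally onto an ordered enumeration. The following Python sketch is one possible encoding, not an official Uptime Institute artifact: the member names, the `threatens_business` helper and, in particular, the threshold at Level 4 are assumptions chosen for illustration.

```python
from enum import IntEnum


class OutageSeverity(IntEnum):
    """Five-level scale described above; member names are illustrative."""
    NEGLIGIBLE = 1   # recordable, no service interruption
    MINIMAL = 2      # services disrupted, minimal effect on users
    SIGNIFICANT = 3  # limited scope, some reputation or compliance impact
    SERIOUS = 4      # financial loss, possible loss of customers
    CRITICAL = 5     # major, damaging interruption


def threatens_business(level: OutageSeverity) -> bool:
    # Assumption: Level 4 and above is treated as business-threatening.
    # The article does not state where that line is drawn.
    return level >= OutageSeverity.SERIOUS


incident = OutageSeverity(3)
print(incident.name, threatens_business(incident))
```

Because `IntEnum` members compare as integers, incidents can be sorted or filtered by severity without extra code.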

 

This analysis was further developed by researchers who specifically identified the origin of data center failures.

For the network, the most common causes are:

  • fiber cuts outside the datacenter and insufficient number of routing alternatives
  • intermittent failure of the main switches and absence of secondary routers
  • major switch failure without backup
  • incorrect traffic configuration during maintenance
  • incorrect configuration of routers and software-defined networks
  • power loss to individual components without backup power, such as switches and routers


For IT, the most common causes are:

  • poorly managed upgrade
  • failure and subsequent data corruption of a large number of disks or SAN storage systems
  • synchronization failure or programming errors in the load balancing or traffic management system
  • poorly programmed failover, synchronization or disaster recovery systems
  • power loss to individual components without backup power


For the power supply, the most common causes are:

  • lightning strikes causing power surges and power outages
  • intermittent failures of transfer switches and inability to start generators or transfer to a second datacenter
  • UPS failures and failure to transfer to secondary systems
  • the utility being unable to deliver the necessary power, followed by failure of the generator or UPS
  • damage to IT equipment caused by power surges

 

"In general, companies should pay more attention to the resilience of data centers. They need to know their architectures, to understand all the interdependencies, to identify the reasons for failures, to plan solutions in case of failure. However, this last aspect is often neglected," adds Todd Traver.

 


 

Source: Le Monde Informatique

 

 

 

 
