Researchers discover a cause of catastrophic computer failures and how to fix it

A fault called partial partitioning has been identified as the culprit behind catastrophic computer system failures. The good news is that researchers have not only identified it; they have also figured out how to fix it.

Computer scientists at the University of Waterloo identified the fault, which can cause data loss, system crashes, or data corruption in many computer systems.

“These failures can result in the shutting down of a banking system for hours, losing your data or photos, search engines being buggy, some messages or emails getting lost or the loss of data from your database,” said Samer Al-Kiswany, a professor in Waterloo’s David R. Cheriton School of Computer Science and co-author of the study.

A partial partition disrupts communication between some, but not all, of the computers in a cluster. A cluster is a set of connected computers that work together in a way that makes them appear to the user as a single system.
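To make the idea concrete, here is a minimal Python sketch of what such a fault looks like in a small cluster. It is an illustration only, not code from the study: the function name `find_partial_partitions`, the node names and the working definition (two nodes that cannot talk directly while a third node can still reach both) are assumptions made for this example.

```python
from itertools import combinations


def find_partial_partitions(nodes, broken_links):
    """Return (a, b, bridges) for every pair of nodes that cannot talk
    directly but still shares at least one node that reaches both.

    Hypothetical helper for illustration; not part of Nifty.
    """
    # Links are undirected, so store each one as an unordered pair.
    broken = {frozenset(link) for link in broken_links}

    def can_talk(x, y):
        return frozenset((x, y)) not in broken

    results = []
    for a, b in combinations(nodes, 2):
        if can_talk(a, b):
            continue  # this pair is healthy
        # Nodes that can still reach both sides of the broken link.
        bridges = [n for n in nodes
                   if n not in (a, b) and can_talk(n, a) and can_talk(n, b)]
        if bridges:
            results.append((a, b, bridges))
    return results


# n1 and n2 lose contact with each other, but both still reach n3.
print(find_partial_partitions(["n1", "n2", "n3"], [("n1", "n2")]))
```

In this toy cluster, n1 and n2 disagree with each other about who is alive while n3 sees both, which is the kind of inconsistent view that the article says leads to failures.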

To fix the problem, Al-Kiswany and his team developed a novel approach, called network partitioning fault tolerance layer (Nifty), to prevent these system failures. Nifty is a simple and transparent software solution that does not require changes to the existing system.

“Partial partitioning is a catastrophic failure that is easy to manifest and is deterministic, which means if a sequence of events happens, the failure is going to occur,” said Al-Kiswany. “We found that the partition in only one node was responsible for the manifestation of all failures, which is scary because even misconfiguring one firewall in a single node leads to a catastrophic failure.”

In undertaking the study, the researchers conducted a comprehensive review of system failures caused by partial partitioning in 12 popular systems. They found that 75 per cent of the failures they studied have a catastrophic impact, such as data loss and data corruption. They also discovered that 84 per cent of the failures don’t send users error or warning messages. The study further revealed that 24 per cent of the failures have a lasting impact, so even after fixing the partition, the failure will persist.

The team of computer scientists then dissected the design of eight popular systems and identified four principled approaches for tolerating partial partitions. Further analysis, however, revealed that the fault tolerance techniques these systems implement are inadequate.

“Our findings motivated us to build Nifty, a transparent communication layer that masks partial network partitions,” said Mohammed Alfatafta, one of the graduate students who worked on Nifty under the supervision of Al-Kiswany. “Nifty builds an overlay between nodes to detour signals around partial partitions. Our prototype evaluation with six popular systems shows that Nifty overcomes the shortcomings of current fault tolerance approaches and effectively masks partial partitions while imposing negligible overhead.”
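The detour idea described in the quote can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Nifty's actual implementation: the function `detour_path` is hypothetical, links are treated as undirected, and a plain breadth-first search stands in for whatever routing Nifty really uses.

```python
from collections import deque


def detour_path(nodes, broken_links, src, dst):
    """Return the shortest list of hops from src to dst that uses only
    working links, or None if dst is completely cut off.

    Hypothetical helper for illustration; not Nifty's API.
    """
    broken = {frozenset(link) for link in broken_links}
    queue = deque([[src]])
    visited = {src}
    while queue:
        path = queue.popleft()
        current = path[-1]
        if current == dst:
            return path
        for nxt in nodes:
            # Follow a hop only if the node is new and the link still works.
            if nxt not in visited and frozenset((current, nxt)) not in broken:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


# The direct n1-n2 link is down, so traffic is relayed through n3.
print(detour_path(["n1", "n2", "n3"], [("n1", "n2")], "n1", "n2"))
```

When the direct link works, the function returns the two-node path unchanged, so the detour only comes into play while a partial partition is present.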

The study, Toward a Generic Fault Tolerance Technique for Partial Network Partitioning, was authored by Al-Kiswany of Waterloo’s Faculty of Mathematics and his graduate students Alfatafta, Basil Alkhatib and Ahmed Alquraan. It was recently presented at the 14th USENIX Symposium on Operating Systems Design and Implementation. The researchers have made Nifty’s source code publicly available.
