When Failure Is Good News


A computing server corridor in CERN's main data centre. (Image: Anthony Grossir/CERN)

"We tried to break it - and failed." For several hours, the IT teams subjected the software managing CERN's IT workload to a stress test, sending a huge number of tasks to be processed in an attempt to crash the system … without success. "Even if this was just a test, and certainly not carried out in production conditions, the results achieved were surprisingly positive, especially considering that this is only the start of our journey towards ensuring we are prepared for the High-Luminosity LHC (HL-LHC)," says Antonio Delgado, the IT expert in charge of the test.

CERN's IT teams have started a series of tests to prepare the computing infrastructure for the huge amounts of data that are expected to be produced by the HL-LHC experiments. The machine, which is due to start operating in 2030, will generate many more collisions than its predecessor. The luminosity will jump from the current 125 inverse femtobarns per year with the LHC to 300 inverse femtobarns per year, or even higher, with the HL-LHC. Considering that 1 inverse femtobarn corresponds to around 100 million million (potential) collisions, the LHC's upgrade will generate massive amounts of data for CERN's IT infrastructure to process.

The test, carried out in October, targeted the software managing CERN's compute workload - in other words, the software that collects the requests sent by physicists and distributes them to the computers. During the stress test, the system successfully executed more than two million tasks ("jobs" in physicists' vocabulary) simultaneously over 13 hours. Some 16 800 jobs per minute were injected into the system, i.e. about 20 times the current average throughput. The whole system withstood the load, and the average job handling time remained reasonable for such a scale, at around 5 minutes.
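The quoted figures can be cross-checked with simple arithmetic. The injection rate of 16 800 jobs per minute is described as about 20 times the current average throughput, which implies an everyday rate of roughly 840 jobs per minute; sustaining the peak rate for the full 13-hour window would bound the total number of jobs injected:

```python
# Sanity-check arithmetic on the stress-test figures quoted above.
peak_rate = 16_800           # jobs injected per minute during the test
factor_over_average = 20     # "about 20 times the current average throughput"
implied_average = peak_rate / factor_over_average

# Upper bound on jobs injected if the peak rate held for all 13 hours
# (the article reports more than two million jobs executed).
duration_minutes = 13 * 60
upper_bound = peak_rate * duration_minutes

print(f"Implied everyday average: ~{implied_average:.0f} jobs/minute")
print(f"Upper bound over 13 h at peak rate: {upper_bound:,} jobs")
```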

CERN's data workload management system uses HTCondor, the open-source software originally developed at the Center for High Throughput Computing in the Department of Computer Sciences at the University of Wisconsin-Madison. The system relies on two core components: the collector daemon and the negotiator daemon, a daemon being a piece of software that runs as a background process. Together, the two components collect the job requests sent by users, monitor the compute resources available in the pool and use this information to match submitted jobs with suitable machines. "CERN has been using HTCondor for our batch processing since 2016," explains Ben Jones, leader of the team that manages this service. "The excellent relationship with the developers helped us - and, indeed, other high-energy physics sites - to scale the technology to match the needs of the experiments." All in all, HTCondor provides the job queueing mechanism, scheduling policy, priority scheme, resource monitoring and resource management.
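HTCondor's real negotiation cycle is far richer than this (ClassAd expressions, user priorities, preemption), but the core idea described above - matching each job's resource request against the slots advertised by machines in the pool - can be sketched as a toy matcher. All names below are illustrative, not HTCondor's actual API:

```python
from dataclasses import dataclass

@dataclass
class Job:            # a user's request, as the collector would see it
    name: str
    cpus: int
    memory_gb: int

@dataclass
class Machine:        # an advertised compute slot in the pool
    name: str
    free_cpus: int
    free_memory_gb: int

def negotiate(jobs, machines):
    """Toy negotiator: first-fit matching of jobs to available slots."""
    matches = []
    for job in jobs:
        for m in machines:
            if m.free_cpus >= job.cpus and m.free_memory_gb >= job.memory_gb:
                m.free_cpus -= job.cpus          # reserve the resources
                m.free_memory_gb -= job.memory_gb
                matches.append((job.name, m.name))
                break                            # job matched; next job
    return matches

pool = [Machine("node-a", 8, 32), Machine("node-b", 4, 16)]
queue = [Job("reco-1", 4, 8), Job("reco-2", 4, 8), Job("sim-1", 2, 4)]
print(negotiate(queue, pool))
# → [('reco-1', 'node-a'), ('reco-2', 'node-a'), ('sim-1', 'node-b')]
```

In the real system, the scheduling policy and priority scheme decide *which* job is considered first and *which* machine wins a tie; the sketch uses plain first-fit only to make the matchmaking step concrete.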

This test will be followed by many others, which will also involve CERN's disk-based system for storing the vast amounts of data produced by the scientific community.
