The CMS collaboration at CERN has openly released 18 new datasets, comprising proton-proton collision data recorded by CMS at the Large Hadron Collider (LHC) during the second half of 2011. LHC data are unique and are of interest to the scientific community as well as to those in education. Preserving the data and the knowledge needed to analyse them is critical. CMS has therefore committed to releasing its research data openly, with up to 100% being made available 10 years after they were recorded; the embargo gives the scientists working on CMS adequate time to analyse the data themselves.
The total data volume of this latest release is 96 terabytes. Not only does this batch complement the data from the first half of 2011, released back in 2016, it also provides additional tools, workflows and examples as well as improved documentation for analysing the data using cloud technologies. The data and related materials are available on the CERN Open Data portal, an open repository built using CERN’s home-grown and open source software, Invenio.
Previous releases from CMS included the full recorded data volume from 2010 and half the volumes from 2011 and 2012 (the first “run” of the LHC). Special “derived datasets”, some for education and others for data science, have allowed people around the world to “rediscover” the Higgs boson in CMS open data. Scientists unaffiliated with the collaboration have also published original papers using CMS data.
In the past, those interested in analysing CMS open data needed to install the CMS software onto virtual machines to re-create the appropriate analysis environment. This made it challenging to scale up a full analysis for research use, a task that requires considerable computing resources. With this batch, CMS has updated the documentation for using software containers with all the software pre-installed and added workflows running on them, allowing the data to be easily analysed in the cloud, either at universities or using commercial providers. Some of the new workflows are also integrated with REANA, the CERN platform for reusable analyses.
CMS and the CERN Open Data team have been working closely with current and potential users of the open data (in schools, in higher education and in research) to improve the services offered. The portal's search functionality has been updated based on feedback from teachers who participated in dedicated workshops at CERN in previous years; the documentation has been enhanced following conversations with research users; and a new online forum has been established to provide support. In September, CMS is organising a virtual workshop for theoretical physicists interested in using the open data.
“We are thrilled to be able to release these new data and tools from CMS into the public domain,” says Kati Lassila-Perini, who has co-led the CMS project for open data and data preservation since its inception. “We look forward to seeing how the steps we have taken to improve the usability of our public data are received by the community of users, be it in education or in research.”