Research Pinpoints Bugs In Popular Science Software

A go-to software platform scientists use to do their work could become less glitchy, thanks to University of Alberta research.

A comprehensive study of the vulnerabilities in Jupyter Notebook, a popular open-source web application researchers use to explore and analyze their study data, pinpoints the most common bugs in the software — a first step to improving it.

"By understanding Jupyter Notebook weaknesses, smarter, more reliable tools for users and developers can be created," says Thibaud Lutellier, assistant professor of computing science and mathematics at Augustana Campus, and lead author on the study.

Because data science underpins research in core industries like health care, finance and technology, accuracy in the field is vital, he adds, noting that investment in data science in Canada almost doubled in 10 years, with estimates rising from $15-$21 billion in 2008 to $29-$40 billion in 2018.

Widely used in data science and machine learning, Jupyter Notebook creates a single, interactive document that combines live code, results and explanatory notes for research studies, making it an effective all-in-one tool. It also offers more flexibility than traditional programming setups, because data can be loaded non-sequentially.

"It's an interactive way to do programming, to explore and interpret data, without having to reload everything; you can rewind a bit, which makes it very convenient," Lutellier says.

But that unique feature also makes Jupyter Notebook vulnerable to bugs, he notes.

"It's a lot easier to accidentally break something in the code or to set up the system incorrectly, because you're changing things all the time." 
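
As a rough illustration of what "changing things all the time" can look like in practice (this sketch is not taken from the study): a notebook file is stored as JSON, and each code cell records an execution count showing the order in which it was last run. When those counts do not increase from top to bottom, cells were run out of order, and re-running the notebook from the top may not reproduce the same results. The helper name `out_of_order_cells` and the three-cell notebook below are made up for the example.

```python
import json


def out_of_order_cells(notebook_json: str) -> list[int]:
    """Return indexes of code cells whose execution count is lower than
    that of a code cell above them, meaning they were run out of
    top-to-bottom order. (Hypothetical helper, for illustration only.)"""
    notebook = json.loads(notebook_json)
    flagged = []
    highest_seen = 0
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells are never executed
        count = cell.get("execution_count")
        if count is None:
            continue  # this code cell has not been run
        if count < highest_seen:
            flagged.append(index)
        highest_seen = max(highest_seen, count)
    return flagged


# Made-up notebook: the cell that defines the variables sits below the
# cell that uses them, but was run first (count 2, then count 3).
example = json.dumps({
    "cells": [
        {"cell_type": "code", "execution_count": 3, "source": "total = price * qty"},
        {"cell_type": "markdown", "source": "Notes"},
        {"cell_type": "code", "execution_count": 2, "source": "price, qty = 4, 5"},
    ]
})

print(out_of_order_cells(example))  # prints [2]
```

A notebook like this one works while the session is live but fails on a clean top-to-bottom run, because the first cell would then use `price` and `qty` before they are defined.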

And the software's wide range of users, many of them non-experts in computer science, increases the likelihood of defects and misconfigurations, says Lutellier.

Those vulnerabilities can cause problems such as data loss or inaccurate interpretation of results, and can even lead to ransomware attacks, he notes. 

To find out what factors contribute to bugs, the researchers collected and analyzed almost 9,000 Jupyter Notebooks from GitHub and Kaggle, two major online "filing cabinets" for software developers. 

Lutellier, Augustana undergraduate research participant Harsh Darji, and researchers from Concordia University and ETH Zurich explored whether certain traits, such as how complex a notebook was or the number of people who worked on it, were connected to having more bugs. They also created a detailed bug taxonomy to classify the different kinds they found, and reviewed security updates and reports to figure out the potential risks when using these notebooks. 

Their assessment showed that notebooks with multiple people working on them together were more likely to contain bugs, a surprising finding, says Darji.

"We'd thought that the problem would be code complexity, but what we found is that if a team of people work on the same piece of code with Jupyter Notebook, the code is more likely to be wrong. The more collaborators there are, the more likely it is that bugs will be introduced."

The research also uncovered two main types of bugs: those introduced when users improperly set up, or configure, their notebooks, and those caused by incorrect use of built-in features.

The vulnerabilities found in reviewing the Jupyter Notebook ecosystem show there's currently a trade-off between usability and security, Lutellier suggests.

"It's flexible and faster than other software, but the code written in it is likely going to be a lot more buggy, and it's going to be more difficult to work collaboratively. That raises concerns about the reproducibility, maintainability and security of projects done on Jupyter Notebook."

The study's insights highlight the need for software developers and AI engineers to build better configuration management and collaborative work tools around Jupyter Notebook, says Lutellier, whose research is now focused on developing a new AI tool to automatically detect those bugs. 

Providers should improve support tools to help large teams use notebooks safely, and as users, data scientists need to work carefully and make better use of collaborative tools and existing bug detection systems, he says. 

"By reducing these errors, notebooks become more reliable for everyone, helping data scientists focus on solving problems rather than fixing coding mistakes."

The study was funded through a Natural Sciences and Engineering Research Council of Canada Discovery Grant.

University of Alberta Release.