Software Test Instability Disrupts Multiple Projects

Kyushu University

Fukuoka, Japan—In a study published in IEEE Transactions on Software Engineering on May 26, 2026, researchers from Kyushu University have found that "flaky tests," which are unstable software tests that seem to randomly pass or fail, do not stay confined to the projects they originate in and often spread across entire ecosystems. After analyzing hundreds of interconnected projects in OpenStack, a widely used open-source cloud computing platform, the research team found that 55% of projects were affected by cross-project instability, resulting in a cumulative loss of 1,156 days of developer time.

Complex software systems, such as those used in cloud platforms, banking services, healthcare records, and government infrastructure, rely heavily on automated testing to ensure reliability. Each time a developer modifies code, automated tests run to confirm that nothing breaks. This process is known as Continuous Integration (CI) and allows software to evolve quickly while maintaining stability. Without it, even small errors could disrupt critical services that are used daily by millions of people.

However, not all test failures indicate real defects; "flaky tests" are a prime example. These tests behave unpredictably, passing in one run and failing in another without any code changes. As a result, developers are forced to spend time investigating false alarms and rerunning tests, which consumes significant effort and computational resources. While companies like Microsoft and Google have reported high costs associated with flaky tests, most research has focused on individual projects. This leaves an important question unanswered: what happens in large, interconnected ecosystems where many projects share code, dependencies, and testing infrastructure?

In this study, the research team conducted a comprehensive analysis of the OpenStack ecosystem. The team was led by Assistant Professor Tao Xiao and Professor Yasutaka Kamei from Kyushu University's Faculty of Information Science and Electrical Engineering, working in collaboration with the University of Waterloo, Canada, as part of the Adopting Sustainable Partnerships for Innovative Research Ecosystem (ASPIRE) project. They examined 649 projects, over 29,000 code reviews, and more than 73,000 code changes to understand how test instability behaves at scale.

The team found evidence of two key phenomena. The first is cross-project flakiness, where a single unstable test affects multiple projects. The second is inconsistent flakiness, where the same test behaves differently depending on the project in which it runs. In total, they identified 1,535 tests that caused failures across multiple projects and 1,105 cases in which flaky behavior varied across projects. Notably, around 70% of unit tests—which are typically designed to check small, isolated pieces of code—were found to exhibit cross-project instability, challenging assumptions about their reliability.
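To make the two categories concrete, here is a minimal Python sketch of how they might be told apart in CI logs. It is an illustration only, not the researchers' method: the record layout, the function name `classify_flakiness`, and the rule that a test is flaky in a project when the same code change produces both a pass and a fail are all assumptions made for this example.

```python
from collections import defaultdict

def classify_flakiness(runs):
    """Hypothetical classifier for the two categories described above.

    runs: iterable of (test, project, change, passed) tuples, where
    `passed` is a bool. This record layout is assumed for illustration.
    """
    # Collect every outcome seen for the same test on the same code change.
    outcomes = defaultdict(set)
    for test, project, change, passed in runs:
        outcomes[(test, project, change)].add(passed)

    seen_in = defaultdict(set)   # projects in which a test ran at all
    flaky_in = defaultdict(set)  # projects in which it both passed and failed
    for (test, project, _change), results in outcomes.items():
        seen_in[test].add(project)
        if results == {True, False}:
            flaky_in[test].add(project)

    # Cross-project: flaky in at least two projects.
    cross_project = {t for t, ps in flaky_in.items() if len(ps) >= 2}
    # Inconsistent: flaky in some project, stable in at least one other.
    inconsistent = {t for t, ps in flaky_in.items() if seen_in[t] - ps}
    return cross_project, inconsistent

# Made-up sample data: test_a is flaky in two projects, while test_b
# is flaky in proj1 but stable in proj3.
sample_runs = [
    ("test_a", "proj1", "c1", True), ("test_a", "proj1", "c1", False),
    ("test_a", "proj2", "c7", True), ("test_a", "proj2", "c7", False),
    ("test_b", "proj1", "c1", True), ("test_b", "proj1", "c1", False),
    ("test_b", "proj3", "c9", True), ("test_b", "proj3", "c9", True),
]
cross_project, inconsistent = classify_flakiness(sample_runs)
```

With this sample data, `test_a` lands in the cross-project set and `test_b` in the inconsistent set. A real analysis would need to handle reruns, skipped tests, and tests renamed across projects, none of which this sketch attempts.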

Importantly, the researchers found that instability was often caused by environmental and system-level factors rather than problems in the code itself. These included timing-related problems in CI systems, temporary server problems or resource availability issues, mismatches in software dependencies, and inconsistencies in testing configurations across projects. Because many of these factors are shared across projects, flakiness can propagate widely.

As Kamei explains, "Our findings show that test instability is not a local issue but an ecosystem-wide problem. Addressing it requires coordinated efforts across projects, rather than isolated fixes, to reduce wasted development time and computational resources."

The study also points toward practical improvements, such as standardizing CI environments, improving dependency management, and developing tools to detect and classify flaky tests early. These measures could help developers focus on real issues instead of repeatedly rerunning tests.

"Our work contributes to improving the reliability and efficiency of software development processes and paves the way for the development of intelligent, trustworthy testing infrastructures that support the growing demands of modern digital society," concludes Kamei.
