Published on
September 6, 2024

A Comprehensive Study on Quality Assurance Tools for Java

Written By
Ding Sun

How can we effectively detect and address quality issues in Java code bases to improve software quality and maintainability?

We recently sat down with NTU researchers Dr. Han Liu and Prof. Yang Liu to explore the challenges of ensuring code quality and how comprehensive studies like theirs provide critical insights. Their work, published as "A Comprehensive Study on Quality Assurance Tools for Java" and presented at ISSTA 2023, sheds light on the capabilities and limitations of a range of QA tools. Read on for key takeaways and expert insights on improving software quality and maintainability.

Scantist: Could you briefly introduce the motivation behind conducting a comprehensive study on Java QA tools? What gaps in existing research were you aiming to address?

Han Liu: Well, we noticed that previous research often didn't delve deeply into the coverage and granularity of scanning rules. There were also inconsistencies in how effective these tools were reported to be, likely due to different benchmarks and methodologies. We wanted to provide a more holistic view by evaluating these tools across various dimensions, including their performance on different types of bugs and their overall efficiency. Our goal was to help developers make more informed choices by understanding the strengths and limitations of each tool.

Scantist: In your evaluation of scanning rules, how did you ensure the mapping to CWE categories was accurate and comprehensive?

Han Liu: We were really thorough with this part. We used a method where one person would map the rules, and then two others would confirm it. This way, we minimized any potential bias or mistakes. The CWE framework was our go-to because it covers a wide range of software weaknesses. We wanted to ensure that we captured both the strengths and the gaps in the tools' coverage, especially since some areas, like domain-specific issues, tend to get overlooked.

Scantist: What were the most significant findings regarding the coverage of different QA tools? Were there any surprising gaps or strengths?

Han Liu: It was interesting to see that tools like SonarQube and Error Prone had pretty broad coverage, especially for common coding issues. But we were surprised by the gaps, particularly in specialized areas like user interface security. These tools are great for general issues, but they might miss some critical vulnerabilities if you're working in a specific domain. It really highlighted the importance of choosing the right tool for the job, depending on what kind of issues you're most concerned about.

Scantist: How did the granularity of scanning rules vary between the tools, and what impact does this have on their effectiveness?

Han Liu: The granularity really varied. For example, Infer and Semgrep were quite detailed, especially in areas like security and memory issues. This is great if you need in-depth analysis, but it can also mean they miss broader issues. On the other hand, tools like SonarQube and PMD are less detailed but cover a wider range of problems. So, if you're looking for comprehensive coverage, you might lean towards those. It's all about balancing the depth of analysis against the breadth of coverage.

Scantist: Can you discuss the methodology used to assess the detection rates of these tools across different benchmarks? What were the key challenges encountered?

Han Liu: Sure! We tested each tool on five different benchmarks, which together had 1,425 bugs. We made sure to use a consistent setup so that the comparisons would be fair. One of the big challenges was dealing with the different requirements each tool had: some needed source code, others required binaries, and some even needed to be run during compilation. Another challenge was the diversity of the bugs in these benchmarks. It really put the tools to the test, highlighting strengths and weaknesses depending on the bug types they encountered.

Scantist: Were there particular types of bugs or benchmarks where certain tools performed notably better or worse? What factors contributed to these differences?

Han Liu: Definitely. For example, PMD and SonarQube were really good at catching common coding issues across most benchmarks. However, Infer shone when it came to detecting complex, memory-related bugs. These differences largely came down to each tool's focus and the rule sets they used. Tools designed to target specific vulnerabilities did great in those areas but weren't as versatile overall. On the other hand, general-purpose tools offered broader coverage but sometimes missed more niche issues. It's a trade-off between specialization and versatility.
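To make that trade-off concrete, here is a small illustrative Java snippet of our own (the class and method names are invented, not drawn from the study's benchmarks). It contrasts the kind of deep resource-handling bug that analyzers such as Infer are designed to catch with the kind of general coding issue that broad linters such as PMD and SonarQube typically flag:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ConfigLoader {

    // Deep, flow-sensitive defect: if readLine() throws, the reader is never
    // closed. Analyzers that track resources across paths, such as Infer,
    // report this kind of leak; simple pattern-based rules often do not.
    static String firstLine(String path) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(path));
        String line = reader.readLine(); // may throw before close() runs
        reader.close();                  // skipped on exception -> resource leak
        return line;
    }

    // Shallow, general coding issue: an empty catch block that silently
    // swallows errors. Broad rule sets like those in PMD and SonarQube
    // commonly flag this pattern.
    static String firstLineOrEmpty(String path) {
        try {
            String line = firstLine(path);
            return line != null ? line : "";
        } catch (IOException ignored) {
        }
        return "";
    }
}

The fix for the first method would be a try-with-resources block; the point of the sketch is simply that a tool's rule focus determines which of these two issues it reports.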

Scantist: What did your study reveal about the accuracy and usefulness of the warnings generated by these tools?

Han Liu: We found that while a lot of warnings were accurate, they didn't always pinpoint specific bugs. Many just flagged general coding issues, which can be helpful but also overwhelming if you get too many false positives. However, warnings about serious issues like null pointer dereferences were usually spot-on and very useful. It's a mixed bag, really. The key takeaway is that while these warnings are helpful, developers need to use their judgment to prioritize which ones to act on, especially when they're trying to sift through a lot of information.
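As a hedged illustration of the kind of warning Dr. Liu describes as "spot-on", consider this short, hypothetical Java snippet (our example, not one of the benchmark bugs). Dereferencing a value that can legitimately be null is exactly the pattern that null-dereference checkers tend to report accurately:

import java.util.Map;

public class UserDirectory {

    private final Map<String, String> emailByUser;

    public UserDirectory(Map<String, String> emailByUser) {
        this.emailByUser = emailByUser;
    }

    // Map.get() returns null for unknown keys; dereferencing the result
    // without a check is the classic null pointer dereference that static
    // analyzers flag with high precision.
    public String emailDomain(String user) {
        String email = emailByUser.get(user);            // may be null
        return email.substring(email.indexOf('@') + 1);  // possible NPE here
    }
}

A warning on the substring call points to a concrete, fixable defect, which is very different from a generic style warning, and that difference is what makes triage and prioritization so important.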

Scantist: How can developers leverage the insights from these warnings to improve their code quality, even when the warnings do not directly indicate bugs?

Han Liu: Even if the warnings don't point to an immediate bug, they're still valuable. They often highlight areas of the code that could be risky or inefficient. For example, warnings about using deprecated methods or having too many nested loops can guide developers to clean up their code, making it more maintainable and less prone to future errors. It's all about using these tools as part of a broader quality improvement process. Addressing these warnings can prevent bigger issues down the line and help maintain a higher standard of code quality.
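To illustrate (again with invented code rather than anything from the study), here are those two warning categories in Java form: a call to a deprecated API and deeply nested loops. Neither is a bug today, but both are maintainability signals worth acting on:

import java.util.Date;
import java.util.List;

public class ReportBuilder {

    // Deprecated API usage: this Date constructor has been deprecated since
    // JDK 1.1 in favor of the java.time classes. It still compiles and runs,
    // but linters flag it as a maintainability and correctness risk.
    Date legacyReportDate() {
        return new Date(124, 0, 15); // deprecated: year is an offset from 1900
    }

    // Deep nesting: the logic is correct, but four levels of control flow
    // make the method hard to read and easy to break; typical rules suggest
    // extracting helper methods or flattening the structure.
    int countMatches(List<List<List<Integer>>> cube, int target) {
        int count = 0;
        for (List<List<Integer>> plane : cube) {
            for (List<Integer> row : plane) {
                for (int value : row) {
                    if (value == target) {
                        count++;
                    }
                }
            }
        }
        return count;
    }
}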

Scantist: How did the tools' time performance vary with the size and complexity of the projects? Were there any tools that stood out in terms of efficiency?

Han Liu: Time performance varied quite a bit. PMD was consistently quick, even with larger projects, making it the most efficient. Semgrep, on the other hand, was slower across the board, mostly because of its initial setup process. Larger projects definitely increased the time for most tools, especially SonarQube and Infer, which perform more detailed analyses. It really depends on what you prioritize: if you need a quick scan, PMD is great, but if you need a thorough analysis, you might need to budget more time.

Scantist: Based on your findings, what are the key areas for future research and development in QA tools? How can these tools evolve to better meet the needs of developers?

Han Liu: Moving forward, we should expand the coverage of these tools, especially in niche areas like UI security. Improving the precision of warnings to reduce false positives is also crucial. Integration with CI/CD pipelines would make these tools more practical for real-time feedback. Additionally, tools need to evolve to handle complex issues like logical errors better. Providing customizable rule sets will help developers tailor these tools to their specific needs, making them more versatile and useful in a variety of development environments.