Research
Published on
September 4, 2024

Comparison and Evaluation on Static Application Security Testing (SAST) Tools for Java

Written By
Ding Sun

How can we effectively detect and address known vulnerabilities in existing Java applications to enhance software security and reliability?

We recently interviewed NTU researchers Dr. Kaixuan Li and Prof. Yang Liu. Our discussion focused on the critical role of evaluating SAST tools and how their latest research sheds light on the strengths and limitations of these tools in identifying vulnerabilities. The work was published at ESEC/FSE 2023. Read on for key takeaways and expert perspectives on improving the reliability and security of software systems.

Scantist: Could you briefly introduce the motivation behind your study and the significance of evaluating SAST tools for Java?

Kaixuan: The impetus for our study arises from the critical necessity of early detection and remediation of security vulnerabilities within the software development lifecycle, particularly for Java, a widely utilized programming language. Our evaluation of Static Application Security Testing (SAST) tools aims to systematically assess their efficacy in identifying potential security flaws. The significance of this work lies in its potential to guide developers and security professionals in selecting appropriate tools, thus enhancing overall software security by providing empirical evidence of tool performance.

Scantist: What criteria did you use for selecting the seven SAST tools from the initial set of 161, and how did these criteria ensure a representative sample of the available tools?

Kaixuan: Our selection criteria were designed to capture tools that support Java, are available free of charge, are actively maintained, and offer a command-line interface. We prioritized tools with a primary focus on security vulnerabilities rather than general code-quality issues, and required comprehensive documentation of their detection rules. This approach ensured a representative sample by including tools that vary in underlying technique, ranging from semantic to syntactic analysis, and thus provides a holistic view of the capabilities and limitations in the current landscape of SAST tools.
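As a rough illustration (not an artifact of the study), the criteria Kaixuan describes can be encoded as a simple filter over candidate tools; the SastTool record and its field names below are hypothetical:

```java
// Hypothetical sketch: the SastTool record and its fields are illustrative
// encodings of the stated criteria, not part of the study's artifacts.
import java.util.List;
import java.util.stream.Collectors;

record SastTool(String name, boolean supportsJava, boolean free,
                boolean activelyMaintained, boolean hasCli,
                boolean securityFocused, boolean rulesDocumented) {}

class ToolSelection {
    // Keep only candidates that satisfy every criterion from the interview.
    static List<SastTool> select(List<SastTool> candidates) {
        return candidates.stream()
                .filter(t -> t.supportsJava() && t.free() && t.activelyMaintained()
                        && t.hasCli() && t.securityFocused() && t.rulesDocumented())
                .collect(Collectors.toList());
    }
}
```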

Scantist: Can you describe the process and challenges of constructing the Java CVE Benchmark, and why you chose the OWASP Benchmark for the synthetic dataset?

Kaixuan: The construction of the Java CVE Benchmark involved a rigorous process, starting with the identification of Java programs with disclosed CVEs, followed by the precise mapping of affected versions and method-level details. One of the primary challenges was ensuring the accuracy and relevance of this benchmark, which was addressed through cross-validation by domain experts. The OWASP Benchmark was selected for its comprehensive and regularly updated set of test cases, providing a reliable baseline for evaluating the tools' effectiveness in detecting synthetic vulnerabilities. This dual approach of using both real-world and synthetic datasets enables a more nuanced evaluation of the tools' practical applicability.
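For readers curious what a method-level entry in such a benchmark might capture, the sketch below is a hypothetical schema based only on the details mentioned above (CVE identifier, project, affected versions, and vulnerable methods); it is not the paper's actual data format:

```java
// Illustrative sketch of one Java CVE Benchmark entry; field names are
// assumptions, not the paper's real schema.
import java.util.List;

record VulnerableMethod(String className, String methodSignature) {}

record CveBenchmarkEntry(
        String cveId,              // disclosed CVE identifier
        String project,            // affected open-source Java project
        String affectedVersions,   // version range mapped during construction
        List<VulnerableMethod> vulnerableMethods // method-level ground truth
) {}
```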

Scantist: How did you measure the effectiveness of the SAST tools, and what were the key findings regarding their performance on synthetic versus real-world benchmarks?

Kaixuan: The effectiveness of the SAST tools was measured using established metrics such as recall, precision, and F1-score for the OWASP Benchmark, and the proportion of detected CVEs for the Java CVE Benchmark. Our key findings indicate a significant disparity between synthetic and real-world performance. While the tools demonstrated high efficacy on the OWASP Benchmark, detecting the majority of synthetic vulnerabilities, their performance on real-world vulnerabilities was markedly lower, with only 12.7% detection. This highlights the limitations of current SAST tools in handling the complexities of real-world software vulnerabilities.
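For concreteness, these are the standard metrics; the minimal sketch below shows how they are computed from true positives (TP), false positives (FP), and false negatives (FN), along with the detection rate used for the CVE benchmark. The counts themselves are placeholders:

```java
// Minimal sketch of the metrics used on the two benchmarks. The reported
// 12.7% figure corresponds to detectedCves / totalCves as a percentage.
class Metrics {
    static double precision(int tp, int fp) { return tp / (double) (tp + fp); }
    static double recall(int tp, int fn)    { return tp / (double) (tp + fn); }
    static double f1(int tp, int fp, int fn) {
        double p = precision(tp, fp), r = recall(tp, fn);
        return 2 * p * r / (p + r);
    }
    // Proportion of real-world CVEs a tool flags in the Java CVE Benchmark.
    static double cveDetectionRate(int detectedCves, int totalCves) {
        return detectedCves / (double) totalCves;
    }
}
```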

Scantist: What were the root causes identified for the varying detection results of SAST tools, particularly their performance discrepancies between synthetic and real-world benchmarks?

Kaixuan: The primary root causes for the discrepancies include the inadequate implementation of detection rules, the limited scope of rule coverage, and the inherent complexity of real-world software systems. Synthetic benchmarks typically present simplified and idealized scenarios, which are more easily addressed by existing detection algorithms. In contrast, real-world environments often involve intricate and less predictable patterns that these tools are not adequately equipped to handle. The study underscores the necessity for more sophisticated rule sets and enhanced analytical techniques to bridge this gap.
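To make this gap concrete, the snippet below (our own example, not drawn from either benchmark) contrasts a synthetic-style SQL injection, where source and sink sit in a single method, with a real-world-style flow where the tainted value crosses a field and helper methods before reaching the sink:

```java
// Illustrative contrast: the same SQL injection is easy for a taint analysis
// when source and sink are adjacent, and harder when the flow is indirect.
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

class SyntheticStyle {
    // Synthetic-benchmark style: the source (userInput) flows directly to the sink.
    void query(Connection conn, String userInput) throws SQLException {
        Statement st = conn.createStatement();
        st.executeQuery("SELECT * FROM users WHERE name = '" + userInput + "'"); // sink
    }
}

class RealWorldStyle {
    // Real-world style: the tainted value is stored in a field and only later
    // reaches the sink through an indirect call chain the tool must track.
    private String filter;
    void setFilter(String userInput) { this.filter = normalize(userInput); }
    private String normalize(String s) { return s.trim(); } // not a sanitizer
    void runReport(Connection conn) throws SQLException {
        Statement st = conn.createStatement();
        st.executeQuery(buildSql()); // sink, far from the original source
    }
    private String buildSql() {
        return "SELECT * FROM users WHERE name = '" + filter + "'";
    }
}
```

Taint trackers that handle the first case out of the box can lose the flow in the second once it passes through fields, wrappers, or framework callbacks, which is one reason real-world detection rates fall so sharply.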

Scantist: How consistent were the detection results among the tools, and how did these results compare with the tools' claimed capabilities?

Kaixuan: The study revealed substantial inconsistencies among the tools in terms of detection results. Many tools claimed broad support for detecting a wide range of vulnerabilities; however, their actual performance often did not align with these claims. This overstatement of capabilities points to a significant disparity between theoretical and practical performance, highlighting a need for more accurate representation of tool effectiveness in their documentation. Such inconsistencies necessitate a critical approach when evaluating tool claims and underscore the importance of empirical validation.

Scantist: What did the study reveal about the performance of the SAST tools in terms of time cost, and how did tool efficiency vary with program size?

Kaixuan: Our analysis demonstrated that the performance of SAST tools, in terms of time cost, varies notably with the size of the program being analyzed. Syntax-based tools generally exhibited superior performance in terms of speed, efficiently handling large codebases. In contrast, semantic-based tools, which often provide deeper and more thorough analyses, required significantly more time, particularly for programs exceeding 50k lines of code. This observation underscores a critical trade-off between the depth of analysis and performance efficiency, necessitating a balanced approach based on specific use-case requirements.
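Teams wanting to reproduce this kind of measurement on their own codebases can use a simple timing harness; in the sketch below, "sast-tool" and its arguments are placeholders rather than any specific tool's real command line:

```java
// Rough sketch of per-project scan timing; the command is hypothetical.
import java.io.IOException;
import java.time.Duration;
import java.time.Instant;

class ScanTimer {
    static Duration timeScan(String projectDir) throws IOException, InterruptedException {
        Instant start = Instant.now();
        Process p = new ProcessBuilder("sast-tool", "scan", projectDir) // placeholder CLI
                .inheritIO()
                .start();
        p.waitFor();
        return Duration.between(start, Instant.now());
    }
}
```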

Scantist: Based on your findings, what recommendations do you have for SAST tool developers, and what future research directions do you consider crucial in this field?

Kaixuan: We recommend that SAST tool developers focus on enhancing the precision and comprehensiveness of detection rules, particularly for complex real-world scenarios. Additionally, improving the scalability and efficiency of these tools for larger codebases is essential. Future research should prioritize the development of more comprehensive real-world benchmarks and explore hybrid analysis techniques that integrate static and dynamic methodologies. Moreover, the application of machine learning to enhance detection accuracy represents a promising direction. These advancements are critical for addressing current limitations and improving the practical utility of SAST tools.