How can we efficiently recover accurate software architecture in evolving C/C++ and Java codebases to improve system maintainability?
We recently interviewed NTU researchers Yiran Zhang and Prof. Liu Yang about architecture drift in large software systems and about their tool, SARIF, which integrates dependencies, code text, and folder structure for precise and scalable architecture recovery. The research was presented at ESEC/FSE 2023. Read on for key takeaways and expert perspectives on improving architecture accuracy and system maintainability through advanced recovery techniques.
Scantist: What motivated you to develop SARIF, and what specific issues in software architecture recovery were you aiming to address?
Zhang Yiran: SARIF addresses limitations in current architecture recovery tools, which typically rely on only one or two information sources and often apply them in a coarse-grained manner. These methods lack precision, leading to inaccurate results. SARIF integrates multiple information types—dependencies, code text, and folder structure—and refines these inputs to offer a more accurate and automated recovery of system architectures. The growing complexity of modern software systems, particularly on platforms like GitHub, drove the need for a tool like SARIF, which adapts to project-specific features.
Scantist: How did you determine the balance between dependencies, code text, and folder structure in SARIF’s architecture recovery process?
Zhang Yiran: SARIF’s dynamic weighting system adjusts the importance of each information source—dependencies, code text, and folder structure—based on the project. Dependencies provide structural insights, code text offers semantic context, and folder structure adds organizational hints. SARIF first recovers architecture from each source independently and compares the results. The most reliable information source is given a higher weight, ensuring a balanced and project-specific recovery process. This adaptive method improves SARIF’s accuracy and makes it versatile for different software systems.
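The weighting idea described above can be illustrated with a small sketch. It is not SARIF's actual formula: it assumes that "reliability" is approximated by how much one source's clustering agrees with the clusterings from the other sources, measured with a plain Rand index over file pairs. The function names, source labels, and file names are all hypothetical.

```python
from itertools import combinations


def pair_agreement(a, b):
    """Fraction of file pairs that both clusterings treat the same way
    (together in both, or apart in both). This is the plain Rand index."""
    pairs = list(combinations(sorted(a), 2))
    if not pairs:
        return 1.0
    same = sum((a[x] == a[y]) == (b[x] == b[y]) for x, y in pairs)
    return same / len(pairs)


def source_weights(clusterings):
    """clusterings maps a source name to a {file: cluster label} dict.

    Assumption for illustration: a source is considered more reliable
    when it agrees more with the other sources on average.
    """
    names = list(clusterings)
    if len(names) < 2:
        raise ValueError("need at least two information sources")
    scores = {}
    for name in names:
        others = [o for o in names if o != name]
        scores[name] = sum(
            pair_agreement(clusterings[name], clusterings[o]) for o in others
        ) / len(others)
    total = sum(scores.values())
    if total == 0:
        # No source agrees with any other: fall back to equal weights.
        return {name: 1 / len(names) for name in names}
    return {name: score / total for name, score in scores.items()}


# Hypothetical per-source clusterings of four files.
example = {
    "dependencies": {"a.c": 0, "b.c": 0, "c.c": 1, "d.c": 1},
    "code_text": {"a.c": 0, "b.c": 0, "c.c": 1, "d.c": 1},
    "folders": {"a.c": 0, "b.c": 1, "c.c": 0, "d.c": 1},
}
weights = source_weights(example)
print(weights)
```

In this toy input the folder-based clustering disagrees with the other two, so it receives the smallest weight, while the weights still sum to one.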
Scantist: How do the new metrics, ARI and a2a_adj, improve upon traditional measures like MoJoFM in evaluating architecture recovery accuracy?
Zhang Yiran: ARI and a2a_adj address the limitations of traditional metrics like MoJoFM, which tend to favor smaller clusters and have limited dynamic range. ARI evaluates how consistently components are clustered across the recovered and ground-truth architectures, without bias toward cluster size. a2a_adj provides a more granular analysis by breaking dissimilarity down into reassignment costs and cluster addition/removal costs. Together, these metrics offer a more nuanced and reliable assessment of architecture recovery, especially in complex, large-scale systems, which makes the evaluation of SARIF's results more trustworthy.
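To make the ARI idea concrete, here is a minimal sketch that assumes ARI refers to the standard Adjusted Rand Index, computed from the contingency counts of two clusterings. It is an illustration, not the evaluation code used for SARIF, and the file names and labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(recovered, truth):
    """Adjusted Rand Index between two {file: cluster label} mappings.

    1.0 means identical groupings, values near 0 mean chance-level
    agreement, and negative values mean worse than chance.
    """
    if set(recovered) != set(truth):
        raise ValueError("both clusterings must cover the same files")
    files = sorted(recovered)
    n = len(files)
    if n < 2:
        return 1.0  # nothing to compare with fewer than two files

    # Count files per (recovered cluster, truth cluster) cell and per cluster.
    cells = Counter((recovered[f], truth[f]) for f in files)
    rec_sizes = Counter(recovered[f] for f in files)
    tru_sizes = Counter(truth[f] for f in files)

    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rec = sum(comb(c, 2) for c in rec_sizes.values())
    sum_tru = sum(comb(c, 2) for c in tru_sizes.values())

    expected = sum_rec * sum_tru / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rec + sum_tru) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both put everything in one cluster
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical ground truth and two recovered architectures.
truth = {"a.c": "net", "b.c": "net", "c.c": "ui", "d.c": "ui"}
same_grouping = {"a.c": 1, "b.c": 1, "c.c": 2, "d.c": 2}
crossed_grouping = {"a.c": 1, "b.c": 2, "c.c": 1, "d.c": 2}

ari_same = adjusted_rand_index(same_grouping, truth)
ari_crossed = adjusted_rand_index(crossed_grouping, truth)
print(ari_same, ari_crossed)
```

Because the score depends only on which files share a cluster, relabeling clusters does not change it, and the chance correction lets a poor grouping score below zero rather than being squeezed into a narrow positive range.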
Scantist: SARIF demonstrated a 36.1% accuracy improvement over previous techniques. What contributed to this gain in accuracy?
Zhang Yiran: SARIF’s improved accuracy stems from its fine-grained analysis and multi-source fusion. By dynamically balancing dependencies, code text, and folder structure, SARIF gains a comprehensive view of system architecture. SARIF’s advanced techniques, such as community detection and topic modeling, allow it to precisely cluster system components. Additionally, the tool adapts to each project, ensuring that the most relevant information is weighted accordingly. This method consistently outperformed other state-of-the-art tools across multiple software projects, leading to SARIF’s 36.1% increase in architecture recovery accuracy.
Scantist: How did the three types of information (dependencies, code text, and folder structure) contribute to SARIF’s architecture recovery, and how did dynamic weighting impact the results?
Zhang Yiran: Dependencies provided critical structural insights into software architecture, while code text added semantic information, particularly in projects with strong naming conventions. Folder structure contributed when project organization was logical, though its usefulness varied. SARIF’s dynamic weighting system ensured the most relevant information was prioritized for each project. In cases like Bash, folder structure played a significant role, whereas in Libxml2, textual information was more important. This adaptability allowed SARIF to balance the sources, leading to more accurate recovery results across diverse projects.
Scantist: What were some of the biggest challenges you encountered during SARIF’s development, and how did you address them?
Zhang Yiran: Balancing the level of granularity in architecture recovery was a significant challenge. Different projects require varying levels of detail, so SARIF’s community detection algorithm was designed to allow flexible granularity. Scalability was another issue, particularly for large systems like Chromium. We optimized SARIF’s underlying algorithms to handle these datasets efficiently. Additionally, generalizing SARIF across different programming languages required extensive testing and fine-tuning, ensuring that the tool’s information fusion and dynamic weighting system would work across a range of software environments.
Scantist: How well does SARIF generalize across different software ecosystems, and what are its potential applications in industry?
Zhang Yiran: SARIF was tested on 900 GitHub projects and demonstrated generalizability across various ecosystems, including C, C++, and Java. In industry, SARIF could be applied in reverse engineering, software maintenance, and security audits, where accurate system architecture recovery is essential. Its adaptability to different types of projects and programming languages makes it a versatile tool for real-world applications. The tool’s automated and comprehensive recovery of architecture can help streamline maintenance and system understanding, reducing the effort required for manual intervention.
Scantist: Looking ahead, how do you see SARIF evolving, and what are some potential improvements you would like to make?
Zhang Yiran: We aim to expand SARIF’s support for additional programming languages and frameworks to increase its applicability. Refining metrics like ARI and a2a_adj is another priority, enhancing their ability to capture architecture recovery nuances. Additionally, improving SARIF’s performance on large-scale systems, such as those with millions of lines of code, is a key goal. We also plan to integrate SARIF with existing development tools like IDEs and CI/CD pipelines, making architecture recovery a more seamless part of the development workflow, and improving SARIF’s user interface for non-experts.