OSSFP: Precise and Scalable C/C++ Third-Party Library Detection using Fingerprinting Functions

Written By

Ding Sun

How can we effectively detect and address known vulnerabilities in existing C/C++ code bases to enhance software security and reliability?

We recently interviewed NTU researchers Dr. Wu Jiahui and Prof. Liu Yang about the impact of known vulnerabilities in existing code bases and how solutions like OSSFP can help address them. The work was published at ICSE 2023. Read on for key takeaways and expert perspectives on enhancing software security and reliability.

Scantist: What motivated you and your team to develop OSSFP?

Jiahui: We developed OSSFP to address the shortcomings of existing Software Composition Analysis (SCA) tools, which struggle with accurately detecting third-party libraries (TPLs) in C/C++ projects. These tools often fail with nested TPLs and common functions, leading to high false positive and negative rates. Our solution focuses on three key areas: generating precise signatures by emphasizing core functions, avoiding predefined thresholds that can cause false negatives, and ensuring scalability for large datasets. OSSFP filters out non-essential functions and creates unique fingerprints based on core functions, enhancing accuracy and eliminating the need for thresholds. It improves scalability by reducing feature size to 1.06% of total functions, boosting performance. Experimental results show OSSFP achieves 90.34% precision and 90.84% recall, outperforming tools like CENTRIS and Snyk CLI.

Scantist: What are the main contributions of OSSFP to the field of software composition analysis?

Jiahui: OSSFP significantly advances software composition analysis by introducing a novel framework for precise and scalable TPL detection in large-scale projects. First, it selects core functions and generates unique fingerprints, enabling rapid and accurate TPL identification. Second, we implemented OSSFP and built a comprehensive TPL fingerprinting database with 23,427 C/C++ repositories, including 585,683 versions and 90 billion lines of code. Third, our experiments demonstrate OSSFP's superior performance, achieving 90.34% precision and 90.84% recall in detecting 896 TPLs across 100 software projects. Additionally, OSSFP is highly scalable, identifying all TPLs per project in just 0.12 seconds on average, making it 22 times faster than CENTRIS.

Scantist: Can you describe the core methodology behind OSSFP and how it improves upon existing SCA tools?

Jiahui: OSSFP’s methodology comprises three offline phases: Feature Generation, Index Building, and Fingerprint Selection. In Feature Generation, we clone target GitHub repositories and generate features like hash values, lines of code, and cyclomatic complexity for each function. Index Building involves creating a library function hash index and a distinct function hash index by removing duplicate functions within and across libraries. Fingerprint Selection focuses on filtering functions to retain core functions only, using three filtering steps: clone function filtering, supporting function filtering, and common function filtering. This process generates unique fingerprints for precise TPL detection. OSSFP's structured approach enhances accuracy and scalability, outperforming existing SCA tools.

Scantist: What is unique about the feature generation process in OSSFP, and how does it contribute to accurate TPL detection?

Jiahui: OSSFP’s feature generation process provides a detailed profile of each function by cloning target GitHub repositories and utilizing git tags for version information. We generate features such as hash values, author times, lines of code, cyclomatic complexity, and Halstead volume. This thorough data collection ensures a rich dataset representing each function’s characteristics. The uniqueness of this process lies in its precision and comprehensiveness, which lay a strong foundation for the subsequent indexing and filtering phases. By capturing detailed function-level features, we enhance TPL detection accuracy, ensuring that the generated fingerprints are based on robust and representative data.
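
To make the idea of function-level features concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions and not OSSFP's implementation: the interview does not name a hash algorithm or a normalization scheme, so SHA-256 over comment-free, whitespace-free text is assumed, cyclomatic complexity is approximated by counting branch tokens, and the names `FunctionFeatures`, `normalize`, and `extract_features` are hypothetical. Author times and Halstead volume are left out.

```python
import hashlib
import re
from dataclasses import dataclass

# Branch-opening tokens; counting them is only a rough stand-in for a real
# cyclomatic-complexity analysis (assumption, not OSSFP's actual method).
_BRANCH = re.compile(r"\b(?:if|for|while|case|catch)\b|&&|\|\||\?")
# Naive C/C++ comment matcher; it does not handle comment markers inside strings.
_COMMENT = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)


@dataclass(frozen=True)
class FunctionFeatures:
    func_hash: str    # SHA-256 of the normalized body (assumed algorithm)
    loc: int          # non-blank lines after comment removal
    complexity: int   # 1 + number of branch tokens


def normalize(body: str) -> str:
    """Drop comments and all whitespace so formatting changes keep the same hash."""
    return re.sub(r"\s+", "", _COMMENT.sub("", body))


def extract_features(body: str) -> FunctionFeatures:
    """Compute the three illustrative features for one function body."""
    code = _COMMENT.sub(" ", body)
    loc = sum(1 for line in code.splitlines() if line.strip())
    complexity = 1 + len(_BRANCH.findall(code))
    digest = hashlib.sha256(normalize(body).encode("utf-8")).hexdigest()
    return FunctionFeatures(digest, loc, complexity)
```

Under this normalization, two copies of a function that differ only in layout or comments produce the same hash, which is what lets later phases treat them as one function.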

Scantist: What are the innovative elements of the index building phase in OSSFP, and how do they enhance the detection of third-party libraries?

Jiahui: The index building phase of OSSFP features two key steps: creating a library function hash index and a distinct function hash index. We remove duplicate functions within and across library versions, ensuring only unique functions are retained. This dual-level deduplication filters out noise and redundancy, significantly improving TPL detection accuracy. By maintaining a clean, precise dataset and accurately attributing each function hash to its original library, OSSFP lays the groundwork for generating precise fingerprints. This innovative approach enhances detection accuracy by ensuring that each function hash is unique and correctly attributed.
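
The two indexes can be pictured with a small Python sketch. This is a simplified illustration and not the OSSFP code: the record layout and the names `FunctionRecord`, `build_library_index`, and `build_distinct_index` are hypothetical, and attributing a shared hash to the library with the earliest author time is an assumption about how "original library" is decided.

```python
from collections import defaultdict
from typing import NamedTuple


class FunctionRecord(NamedTuple):
    library: str
    version: str
    func_hash: str
    author_time: int  # Unix timestamp of the commit that introduced the function


def build_library_index(records):
    """library -> set of function hashes, deduplicated across that library's versions."""
    index = defaultdict(set)
    for rec in records:
        index[rec.library].add(rec.func_hash)
    return dict(index)


def build_distinct_index(records):
    """func_hash -> (library, author_time) of the earliest known occurrence."""
    origin = {}
    for rec in records:
        seen = origin.get(rec.func_hash)
        # Keep the earliest author time as the assumed origin of the function.
        if seen is None or rec.author_time < seen[1]:
            origin[rec.func_hash] = (rec.library, rec.author_time)
    return origin
```

With these two structures, a hash that appears in several versions of one library is stored once, and a hash shared by several libraries points to a single owner.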

Scantist: How does the fingerprint selection process in OSSFP innovate the detection of third-party libraries, and what makes it unique?

Jiahui: The fingerprint selection process in OSSFP is key to its innovation. We use a three-step filtering process: clone function filtering, supporting function filtering, and common function filtering. Clone functions are removed to eliminate duplicated code. Supporting function filtering excludes simple, non-core functions, and common function filtering removes widely used functions not unique to any library. By focusing on core functions, we generate precise, unique fingerprints for TPL detection. This meticulous filtering reduces false positives and negatives, enhancing accuracy. OSSFP’s approach outperforms existing tools by ensuring the generated fingerprints are highly representative and effective for TPL detection.
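
The three filters can be sketched as one pass over a library's functions. The Python below is a hedged illustration only: the interview does not give the exact criteria, so the cutoffs `min_loc`, `min_complexity`, and `max_spread` are hypothetical placeholders, as are the function name `select_fingerprints` and the input dictionaries.

```python
from typing import Dict, Set, Tuple


def select_fingerprints(
    library: str,
    features: Dict[str, Tuple[int, int]],  # func_hash -> (loc, complexity)
    origin: Dict[str, str],                # func_hash -> library it first appeared in
    spread: Dict[str, int],                # func_hash -> number of libraries containing it
    min_loc: int = 5,                      # hypothetical cutoff
    min_complexity: int = 2,               # hypothetical cutoff
    max_spread: int = 3,                   # hypothetical cutoff
) -> Set[str]:
    """Return the hashes that survive all three filters for one library."""
    fingerprints = set()
    for func_hash, (loc, complexity) in features.items():
        if origin.get(func_hash, library) != library:
            continue  # clone function: first seen in another library
        if loc < min_loc or complexity < min_complexity:
            continue  # supporting function: too simple to be core logic
        if spread.get(func_hash, 1) > max_spread:
            continue  # common function: present in too many libraries
        fingerprints.add(func_hash)
    return fingerprints
```

In this sketch a function must pass every filter to become a fingerprint, so a small getter, a function copied from another library, and a widely shared utility are all dropped.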

Scantist: Ok, so what is the accuracy of OSSFP in detecting TPLs compared to related works, specifically CENTRIS?

Jiahui: OSSFP demonstrated high accuracy without utilizing thresholds, achieving 90.34% precision and 90.84% recall on the ground truth dataset. It outperformed CENTRIS by 3.71% in precision and 35.31% in recall. These results indicate that OSSFP meets two of the requirements set out in Section I of the paper: R1 (generating representative signatures for accurate TPL detection) and R2 (abandoning predefined thresholds).
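
For readers less familiar with the two metrics, the following Python sketch shows how precision and recall are computed from detected and ground-truth library sets. It uses the standard definitions; pooling counts over all projects (micro-averaging) is an assumption, since the interview does not say how the per-project results were aggregated, and the function name `precision_recall` is hypothetical.

```python
def precision_recall(detected, ground_truth):
    """Pooled precision and recall over all projects listed in ground_truth.

    Both arguments map a project name to the set of library names
    reported for it (detected) or known to be in it (ground_truth).
    """
    tp = fp = fn = 0
    for project, truth in ground_truth.items():
        found = detected.get(project, set())
        tp += len(found & truth)   # libraries correctly reported
        fp += len(found - truth)   # libraries reported but not present
        fn += len(truth - found)   # libraries present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Precision drops when a tool reports libraries that are not there, and recall drops when it misses libraries that are, so the two numbers quoted above describe different kinds of error.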

Scantist: How is the scalability of OSSFP in terms of time efficiency and data size?

Jiahui: OSSFP's selection of core functions reduces the feature size to 1.06% of the original. It identifies all TPLs in a project in just 0.12 seconds on average, 22 times faster than CENTRIS. Processing each library takes less than 30 seconds, demonstrating OSSFP's efficiency with large datasets. These results show OSSFP meets the R3 requirement in Section I of the paper, ensuring time efficiency and reduced feature size for large-scale applications.
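
One way to see why a small fingerprint set helps with speed is to treat detection as a dictionary lookup per function. The Python sketch below is an assumption-laden illustration and not OSSFP's matching code: it assumes each fingerprint hash belongs to exactly one library and that a single matching fingerprint is enough to report that library, which is one possible reading of "no thresholds". The names `build_lookup` and `detect` are hypothetical.

```python
from collections import defaultdict


def build_lookup(fingerprints):
    """Invert library -> fingerprint hashes into hash -> library."""
    lookup = {}
    for library, hashes in fingerprints.items():
        for func_hash in hashes:
            lookup[func_hash] = library
    return lookup


def detect(project_hashes, lookup):
    """Return library -> matched fingerprint hashes for one target project."""
    matches = defaultdict(set)
    for func_hash in project_hashes:
        library = lookup.get(func_hash)  # constant-time lookup on average
        if library is not None:
            matches[library].add(func_hash)
    return dict(matches)
```

Under these assumptions the work per project grows with the number of functions in the project and not with the size of the library database, and there is no similarity ratio to tune.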

Scantist: How does each function filtering step contribute to the accuracy improvement of OSSFP?

Jiahui: The ablation study on each filtering step revealed that non-core functions negatively impact accuracy. OSSFP's filtering of clone, supporting, and common functions significantly improves precision. By utilizing only core functions, OSSFP generates highly representative fingerprints for TPL detection. The ablation experiment confirms that features derived from core functions are accurate and representative, fulfilling the R1 requirement in Section I of the paper.
