How can we effectively detect and resolve license conflicts in open-source software to ensure compliance and reduce legal risks?
We recently interviewed Dr. Liu Tao, a leading researcher in software licensing and compliance, and Professor Liu Yang from NTU. Our discussion covered the complexities of license ambiguities and conflicts within SPDX licenses and how their research, Catch the Butterfly, aims to tackle these challenges. This work has been recognized and published, shedding light on how developers can better manage third-party library licenses. Read on for key insights and expert strategies on maintaining compliance and mitigating legal risks in open-source software development.
Scantist: Could you briefly explain the motivation behind this research and why addressing license conflicts in software development is critical?
Liu Tao: The motivation behind this research stems from the widespread adoption of third-party libraries (TPLs), which has accelerated software development but also introduced legal risks. Over 79% of modern software includes TPLs, but developers often lack expertise in handling the complexities of software licenses. Legal violations from license incompatibility can lead to serious issues like economic losses, as seen in cases like Cisco's violation of the GNU General Public License. Our research aims to bridge the gap by offering a high-quality dataset of 453 SPDX licenses, providing a clear, comprehensive framework for understanding and avoiding license conflicts in TPL ecosystems.
Scantist: Could you summarize the main challenges your research addresses and how you propose to solve them?
Liu Tao: Absolutely. We identified four main challenges: 1) There is no standardized model for interpreting licenses, as platforms like Choosealicense and TLDRLegal use different sets of terms. 2) License texts are often ambiguous, making them difficult for non-professionals to interpret. 3) Existing conflict models are inadequate, particularly when handling complex copyleft licenses. 4) Developers frequently use irregular license identifiers, complicating the management of TPLs. To tackle these, we developed a standardized set of 22 terms, clarified ambiguous language with expert review, enhanced conflict models to include new categories like copyleft conflicts, and emphasized the need for better license assignment practices.
Scantist: One of the challenges you identified was the lack of a standardized license model. How does your research propose to tackle the issue of inconsistent term sets across platforms like Choosealicense, TLDRLegal, and OpenEuler?
Liu Tao: Indeed, the inconsistency across these platforms created a major challenge. For example, Choosealicense defines 16 terms, TLDRLegal defines 23, and OpenEuler defines 18, making it hard for developers to understand licenses in a consistent way. To address this, we conducted a differential analysis and developed a unified set of 22 standardized terms, which covers the key aspects of mainstream licenses. These terms were validated with legal experts, achieving a high consistency rate of 83.68% across multiple reviewers. Our goal was to provide a reliable baseline that developers can use to interpret license terms uniformly across different platforms.
Scantist: Software licenses often include ambiguous legal language that is hard for non-professionals to interpret. How did your team address this ambiguity in the dataset you created?
Liu Tao: License text can be incredibly complex, with intricate legal provisions that non-professionals struggle to interpret. To resolve this, we manually examined and labeled each license term, working closely with legal experts to ensure accuracy. For instance, we found that platforms like TLDRLegal mislabeled up to 17.53% of terms. By involving domain experts, we reduced these ambiguities and created a clearer, more consistent framework. The result is a dataset where developers can easily understand terms without needing deep legal expertise. Our manual approach also improved accuracy in term labeling, which existing automated tools often overlook.
Scantist: You highlight the lack of a comprehensive model for identifying license conflicts. How did you improve upon existing models, and what does your new conflict model contribute to the field?
Liu Tao: The existing models were overly simplistic, focusing mostly on basic conflicts without accounting for more complex cases like copyleft licenses. To improve on this, we introduced three major categories of conflicts: rights conflicts, obligation conflicts, and copyleft conflicts. We developed these categories by comparing license terms at a granular level, which enabled us to identify 28,918 rights conflicts, 140,870 obligation conflicts, and 14,593 copyleft conflicts. This comprehensive model offers a clearer understanding of how different licenses interact and provides developers with a roadmap for avoiding legal violations, particularly when dealing with highly restrictive licenses.
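As a rough illustration of the three conflict categories described above, the following Python sketch compares two licenses term by term. The `License` class, the term names, the three conflict rules, and the MIT and GPL-3.0-only term sets are simplifying assumptions made for this example; they are not the definitions or the dataset from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class License:
    """Toy license model: hypothetical term names, not the study's 22 terms."""
    spdx_id: str
    can: frozenset = frozenset()      # rights the license grants
    cannot: frozenset = frozenset()   # rights the license withholds
    must: frozenset = frozenset()     # obligations the license imposes
    copyleft: bool = False


def find_conflicts(project: License, dependency: License) -> dict:
    """Compare a project license against a dependency license.

    Assumed rules (illustrative only):
      - rights conflict: the project grants a right the dependency withholds
      - obligation conflict: the dependency imposes an obligation the
        project license does not carry
      - copyleft conflict: the dependency is copyleft and the project
        uses a different license
    """
    return {
        "rights": sorted(project.can & dependency.cannot),
        "obligations": sorted(dependency.must - project.must),
        "copyleft": dependency.copyleft and project.spdx_id != dependency.spdx_id,
    }


# Illustrative term sets; a real audit would take these from a vetted dataset.
MIT = License(
    "MIT",
    can=frozenset({"distribute", "modify", "sublicense"}),
    must=frozenset({"include_copyright", "include_license"}),
)
GPL3 = License(
    "GPL-3.0-only",
    can=frozenset({"distribute", "modify"}),
    cannot=frozenset({"sublicense"}),
    must=frozenset({"include_copyright", "include_license",
                    "state_changes", "disclose_source"}),
    copyleft=True,
)

if __name__ == "__main__":
    # An MIT-licensed project that depends on a GPL-3.0-only library.
    print(find_conflicts(MIT, GPL3))
```

Keeping rights, obligations, and the copyleft flag as separate fields means each conflict category can be checked independently, which mirrors the three-way split described in the interview.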
Scantist: Irregular license assignment practices seem to complicate the identification of all TPLs. How do you think this problem can be mitigated moving forward?
Liu Tao: This is a widespread issue, as developers often assign licenses with irregular or incomplete identifiers, making it harder to track TPLs accurately. To mitigate this, we advocate for a more standardized approach to license assignment, where developers follow clear guidelines. Our research also highlighted the importance of automated tools for detecting irregular licenses early in the process. For example, we found that about 36.42% of SPDX licenses are not indexed correctly by platforms like Choosealicense. Standardization efforts in the community, along with better automated tools, will be key to improving license tracking and compliance.
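One way to catch irregular identifiers early is to normalize declared license strings to SPDX identifiers and flag anything that cannot be mapped. The sketch below is a minimal example of that idea. The `KNOWN_IDS` subset, the `ALIASES` table, and the `normalize_license` function are hypothetical; a real tool would load the full SPDX license list rather than a hand-written table.

```python
import re

# Small illustrative subset of SPDX identifiers (assumption: a real tool
# would load the complete SPDX license list instead).
KNOWN_IDS = {"MIT", "ISC", "Apache-2.0", "BSD-3-Clause"}

# Hypothetical table of irregular spellings seen in package metadata.
ALIASES = {
    "mit license": "MIT",
    "the mit license": "MIT",
    "isc license": "ISC",
    "apache 2.0": "Apache-2.0",
    "apache license 2.0": "Apache-2.0",
    "bsd 3-clause": "BSD-3-Clause",
}

# Case-insensitive lookup for identifiers that are already canonical.
_CANONICAL = {spdx_id.lower(): spdx_id for spdx_id in KNOWN_IDS}


def normalize_license(raw):
    """Return the SPDX identifier for a declared license string.

    Returns None when the value is missing or cannot be mapped, so the
    caller can route it to manual review instead of guessing.
    """
    if not isinstance(raw, str):
        return None
    key = re.sub(r"\s+", " ", raw.strip().lower())
    return _CANONICAL.get(key) or ALIASES.get(key)


if __name__ == "__main__":
    for declared in ["  MIT License ", "apache-2.0", "SEE LICENSE IN LICENSE.txt", None]:
        print(repr(declared), "->", normalize_license(declared))
```

Returning `None` for unknown values, rather than a best guess, keeps ambiguous declarations visible so they can be reviewed by a person.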
Scantist: In your research, what are the major terms found in SPDX licenses, and how do these terms contribute to the diversity of licenses? Could you also explain what factors make some licenses more prone to conflicts?
Liu Tao: The major terms we identified include core rights such as the ability to distribute, modify, and sublicense software, as well as obligations like including copyright and license notices. These terms are common across most SPDX licenses. However, the diversity in licenses arises from more specific obligations, like the requirement to state changes or give credit. For instance, while 67.8% of conflicts stem from rights like sublicensing, obligations such as including notices contribute to 85.3% of conflicts. The difference in how licenses handle these obligations is what makes some licenses more prone to conflicts, especially when integrating permissive and copyleft licenses together.
Scantist: Your study revisits the NPM ecosystem and its license usage. What did you find about the use of minority licenses, and does the adoption of MIT or ISC licenses protect maintainers from potential conflicts?
Liu Tao: In the NPM ecosystem, MIT and ISC licenses dominate, covering about 78% of all libraries. However, minority licenses, particularly copyleft ones like GPL, still account for significant use, leading to potential conflicts. For example, we identified that 14.8K libraries use GPL licenses, which are known for their restrictive nature. While adopting MIT or ISC licenses offers better protection, it doesn’t fully shield maintainers from conflicts. About 67.8% of rights conflicts in NPM occur due to the permissive nature of MIT and its interaction with more restrictive licenses, particularly around sublicensing and patent claims.
Scantist: What key lessons would you share with developers regarding best practices for license compliance and conflict avoidance?
Liu Tao: The most important lesson is that developers need to be proactive. First, always ensure your software is properly licensed; don’t rely on defaults. Second, regularly check your dependencies for potential license conflicts, especially when integrating third-party libraries. Our study found that 5.33 million direct dependencies in the NPM ecosystem had potential license conflicts, with the majority stemming from obligation conflicts. Using automated tools to audit these dependencies can help developers catch issues early and avoid legal risks down the line.
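The dependency check described above can be approximated with a small script. The following is a minimal sketch, not the tooling used in the study: it reads the `license` field from each installed package's `package.json` and flags entries that are missing a license or that appear in a watch list of copyleft identifiers. The `COPYLEFT_IDS` set and the function names are assumptions chosen for this example, and the result is a prompt for review rather than a legal determination.

```python
import json
from pathlib import Path

# Illustrative watch list (assumption); adjust to your own policy.
COPYLEFT_IDS = frozenset({"GPL-2.0-only", "GPL-3.0-only", "AGPL-3.0-only"})


def audit_manifests(manifests, copyleft_ids=COPYLEFT_IDS):
    """Flag manifests with a missing license or a watch-listed license.

    `manifests` is an iterable of dicts parsed from package.json files.
    Returns a list of (package_name, finding) tuples in input order.
    """
    findings = []
    for manifest in manifests:
        name = manifest.get("name", "<unnamed>")
        declared = manifest.get("license")
        if not isinstance(declared, str) or not declared.strip():
            findings.append((name, "missing-license"))
        elif declared.strip() in copyleft_ids:
            findings.append((name, "copyleft:" + declared.strip()))
    return findings


def load_manifests(node_modules):
    """Load package.json for top-level packages under node_modules.

    Simplification: scoped packages (node_modules/@scope/name) and nested
    node_modules directories are not visited in this sketch.
    """
    manifests = []
    for path in sorted(Path(node_modules).glob("*/package.json")):
        with path.open(encoding="utf-8") as handle:
            manifests.append(json.load(handle))
    return manifests


if __name__ == "__main__":
    for name, finding in audit_manifests(load_manifests("node_modules")):
        print(name, finding)
```

Separating `audit_manifests` from file loading keeps the check easy to test with in-memory data and easy to run in a CI job against a real `node_modules` directory.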
Scantist: What insights from your work would you emphasize to other researchers studying software licensing, and where do you see the most room for future research?
Liu Tao: For researchers, the biggest insight is the need for high-quality, validated data. Many studies rely on incomplete or inconsistent datasets, which can lead to skewed results. We also see a significant gap in research on how copyleft licenses interact with other license types. Our work has only scratched the surface in refining conflict models, especially when it comes to automating the detection of complex conflicts. Future research should focus on developing more sophisticated tools that can handle these nuances and provide real-time insights into license compatibility.
Scantist: What do you hope the broader impact of this research will be on the open-source community and software development practices moving forward?
Liu Tao: I hope this research raises awareness about the risks associated with license incompatibility. Our dataset and findings should help developers make more informed decisions about the licenses they use. In the long run, I’d like to see more standardization in license usage, which would reduce conflicts and improve compliance across the industry. We also hope this research encourages the development of more advanced tools that can automate much of the compliance process, making it easier for developers to avoid these legal pitfalls without needing deep legal expertise.