Professor Oscar (Quique) Goñi has investigated source code similarity detection in Large Language Model (LLM) outputs, using the osskb.org knowledge base service from the Software Transparency Foundation and open source software developed by SCANOSS.
While recent research has raised concerns about LLMs generating code that closely resembles their training data, the full extent of this similarity across the broader open-source ecosystem remains unexplored. Jerónimo, Professor Goñi’s colleague at the Software Transparency Foundation and SCANOSS, will describe the findings, which indicate that code similarity in LLM outputs may be more prevalent than previously reported when evaluated against a broader open-source code base.
Jerónimo will also describe how this study contributes to the ongoing discussion of the originality of LLM-generated code and its implications for software licensing compliance, and how it validates lightweight similarity detection algorithms as effective preliminary indicators for more comprehensive analysis.