What is an open source LLM?

A large language model (LLM) is a type of artificial intelligence (AI) model that uses machine learning techniques to understand and generate human language. Which characteristics an AI model needs in order to be considered “open source” is the subject of considerable debate.

The meaning of “open source” in the context of conventional software has been well settled for over 25 years. However, controversies occasionally arise over efforts to loosen or alter the established definition, which sometimes take the form of open-washing. The community understanding of open source developed organically over many decades. It has been distilled in the Open Source Definition (OSD) maintained by the Open Source Initiative as well as the Free Software Definition maintained by the Free Software Foundation.

At a high level, software must meet 2 criteria to be open source:

1. The source code must be available to recipients.

2. The software must be available under licensing terms that are sufficiently permissive, granting freedoms to run, copy, modify and redistribute. The OSD makes clear that source code must be “the preferred form in which a programmer would modify the program” (adapting phrasing from the GPL), and that open source licenses cannot discriminate against persons or groups or fields of endeavor. 

Unlike conventional open source code, which consists mainly of programming instructions, LLMs are created using:

  • Lots (and lots) of training data. This training data may contain copyrighted works or private data, which creates a legal issue when it comes to sharing.
  • Numerical parameters known as weights. These parameters determine how the input data is processed into a meaningful output and are key in shaping the model’s understanding of language. Think of weights as the building blocks that create a model's “brain” and determine how it prioritizes topics as it processes information.
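
To make the role of weights concrete, here is a minimal sketch in Python. It is a toy, not how any real LLM is implemented: the three-word vocabulary, the two-number input encoding, and both weight tables are invented for illustration. It only shows that the same input produces a different most-likely output when the weights change.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_probs(features, weights, vocab):
    """Score each candidate token as a weighted sum of the input features."""
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return dict(zip(vocab, softmax(scores)))


# Hypothetical toy setup: three candidate tokens, two input features.
vocab = ["cat", "dog", "car"]
features = [1.0, 0.5]  # stand-in numeric encoding of some input text

# Two invented weight tables, one row per token in `vocab`.
weights_a = [[2.0, 0.0], [0.5, 0.5], [-1.0, 0.0]]
weights_b = [[-1.0, 0.0], [0.5, 0.5], [2.0, 0.0]]

probs_a = next_token_probs(features, weights_a, vocab)
probs_b = next_token_probs(features, weights_b, vocab)

print(max(probs_a, key=probs_a.get))  # most likely token under weights_a
print(max(probs_b, key=probs_b.get))  # most likely token under weights_b
```

A real model has billions of such numbers arranged in many layers, which is why publishing the weights matters so much for anyone who wants to run or adapt the model.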

In other words, it’s not just about code anymore. LLMs are more complex than conventional software because creating them requires mathematical models and data sets in addition to code. While “open” LLMs may disclose model weights and starting code, they may not necessarily share each data source used to create the LLM in the first place. An open source LLM, on the other hand, would share each step and data source along with a permissive license to allow others to use, build upon, and further distribute that model.
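
The distinction between “open” and “open source” can be written down as a simple checklist. The sketch below is only an illustration of the framing in this article, not an official test from the Open Source Initiative or any other body. The `ModelRelease` fields, the `classify` function, and the example releases are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModelRelease:
    """Hypothetical summary of what a model publisher makes available."""
    name: str
    weights_available: bool
    code_available: bool
    training_data_disclosed: bool
    permissive_license: bool  # use, modify, redistribute; no field-of-use limits


def classify(release: ModelRelease) -> str:
    """Label a release using the criteria described in this article."""
    if not release.weights_available:
        return "closed"
    fully_shared = release.code_available and release.training_data_disclosed
    if fully_shared and release.permissive_license:
        return "open source"
    # Weights can be downloaded, but data or license terms fall short.
    return "open (weights available)"


# Invented examples, not descriptions of real models.
examples = [
    ModelRelease("example-api-only", False, False, False, False),
    ModelRelease("example-open-weights", True, True, False, False),
    ModelRelease("example-fully-open", True, True, True, True),
]

for release in examples:
    print(f"{release.name}: {classify(release)}")
```

In practice the lines are blurrier than three labels suggest, which is why openness is better thought of as a spectrum.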

When the recipes for LLMs are distributed for use without charge, individuals and organizations get the opportunity to build upon the work of others. This leads to many benefits, such as:

Collaborative improvement: Fostering collaboration from diverse sources is arguably the biggest benefit of open source LLMs. Creating more access to generative AI (gen AI) technologies allows for more experimentation and learning while reducing biases, increasing accuracy, and improving performance.

Transparency: If we don’t know how a model was trained, how can we trust the output? An open source LLM provides full transparency into how it was trained. This helps users understand how features work and gives them the information they need to decide how (or if) they’ll use the technology.

Less environmental impact: When models are transparent, we can see what work has already been done. This reduces redundant training and evaluation runs, which would otherwise require additional computation and produce additional emissions.

Financial accessibility: LLMs typically cost a lot of money to train from scratch and are overall resource intensive. If you access a proprietary LLM, you’re potentially responsible for licensing fees. The ability to build upon someone else’s finished work for free lowers the barrier to entry for organizations that otherwise couldn’t afford to develop an LLM.

Open source principles are responsible for many foundational aspects of the internet as we know it. The open source development model has led to some of the most important applications and cloud platforms in use today.

This spirit of freedom continues on a spectrum when it comes to large language models and how “open” or “closed” they are to the public. The term “open source” has been used colloquially to refer to any LLM that’s downloadable on platforms like Hugging Face free of charge. 

This is the case with Meta’s Llama 2 model. However, the terms for Llama 2 don’t fit the common definition of open source software, because the license agreement includes conditions and restrictions the user must agree to. First, Meta restricts how the model may be used through an acceptable use policy. Second, the license requires any organization whose products exceed a set number of monthly active users (700 million under the Llama 2 license) to request an additional license from Meta.

Open source-licensed models
The Granite family of models from IBM Research and several Mistral AI models are examples of LLMs available under the Apache 2.0 license. This means the models can be used commercially, modified, and redistributed without field-of-use restrictions, subject to the license’s attribution and notice conditions. However, even these models don’t make all their training data available for inspection, in some cases due to licensing restrictions.

Red Hat envisions a future where anyone can contribute, review, and build upon code from an open, trustworthy foundation. We believe using an open development model helps create more stable, secure, and innovative technologies. As AI continues to grow, our open source platforms can help you build, deploy, and monitor AI models and applications for your own needs with your own data.

Red Hat® Enterprise Linux® AI is a foundation model platform for harmoniously developing, testing, and running Granite family LLMs for enterprise applications. With the technological foundation of Linux, containers, and automation, Red Hat’s open hybrid cloud strategy gives you the flexibility to run your AI applications anywhere you need them.

Created by IBM and Red Hat, InstructLab is an open source project and community for enhancing LLMs following open source principles. The InstructLab project gathers a set of training data curated by humans, generates synthetic data based on the seed training data, then uses the synthetic data to retrain the base model. Community contributions can lead to regular iterative builds of enhanced LLMs. InstructLab is a cost-effective solution for improving the alignment of LLMs and opens the doors for those with minimal machine learning experience to contribute.

Built using open source technologies, Red Hat OpenShift® AI is an enterprise-ready AI application platform that helps teams build, operate, and scale with confidence. OpenShift AI supports data acquisition and preparation, model training and fine-tuning, model serving and monitoring, and hardware acceleration.