Many AI models branded as “open source” do not meet true open source standards
When companies label their AI models as “open source,” they’re often not being transparent. You see the tag, but you don’t get the full picture. Training data isn’t disclosed. Weights are locked down. Licenses include vague restrictions. And yet, these models are publicly pushed as “open.” That’s not how open source works.
Mike Lieberman, CTO of Kusari, put it clearly: many so-called open models don’t tell you what’s in the training data, which could include copyrighted or biased material. If you’re deploying one of these models and the data turns out to be problematic, legally or ethically, you or your company could carry the liability. In closed-source models, that risk is largely on the vendor. With open models, it’s shared, or entirely on you.
This isn’t hypothetical. A study examined over 40 fine-tuned large language models (LLMs) and found that very few qualified as truly open source by any accepted definition. Most lacked transparency around training data, model weights, or even usage rights. That’s a red flag.
If you’re running a company and thinking about integrating generative AI, take this seriously. Don’t assume open source equals safe, or even legal. This is about intellectual property, reputational risk, and long-term viability. Before integrating any “open” AI into your stack, vet the model’s transparency. Do a real license audit. Know what you’re building on.
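As a concrete starting point, here is a minimal sketch of that first transparency pass, assuming the model is published on the Hugging Face Hub and that its license and training-data disclosures are exposed as repository tags (a common convention, but not a guarantee). The repository ID is a placeholder, and a clean result is a prompt for deeper review, not a substitute for one.

```python
# Minimal transparency spot-check for a model hosted on the Hugging Face Hub.
# Assumes license and dataset disclosures appear as repo tags such as
# "license:apache-2.0" or "dataset:c4"; many repos follow this convention,
# but a missing tag is a finding to chase down, not proof of wrongdoing.
from huggingface_hub import HfApi

def transparency_report(repo_id: str) -> dict:
    info = HfApi().model_info(repo_id)  # public metadata only; no weights downloaded
    tags = info.tags or []
    licenses = [t.split(":", 1)[1] for t in tags if t.startswith("license:")]
    datasets = [t.split(":", 1)[1] for t in tags if t.startswith("dataset:")]
    return {
        "repo": repo_id,
        "declared_licenses": licenses or ["NONE DECLARED"],
        "declared_training_datasets": datasets or ["NOT DISCLOSED"],
        "needs_manual_review": not licenses or not datasets,
    }

# "some-org/some-model" is a placeholder; substitute the model you are evaluating.
print(transparency_report("some-org/some-model"))
```

Anything flagged here goes to legal and engineering for the real audit; the point is to make “we don’t know” visible early, not to automate the judgment.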
The definition of “open source AI” remains ambiguous and underdeveloped
AI doesn’t fit the traditional open source blueprint. In classic software, you release the source code, let others modify it and share it, and that’s pretty much it. With AI, the value also lies in the data you feed the model and in how you train it. That means you can’t judge openness on code alone.
There is no universal standard yet. The Open Source Initiative (OSI) is trying to define open source AI with version 0.0.8 of its evolving spec. Its current draft stresses that source code, model parameters, and training documentation must be available for a model to qualify as open source. But, and this is key, it leaves the actual training datasets optional. That’s a problem.
Marcus Edel, machine learning lead at Collabora, argues that real open source includes everything: the data, the code, the weights, all of it. Paul Harrison from Mattermost adds that the content needs to be legally clean: open source, Creative Commons–licensed, or explicitly approved for use. Otherwise, you risk legal exposure and community fragmentation.
Nathan Lambert, a scientist at AI2, acknowledges the confusion this creates in the market. Roman Shaposhnik, cofounder of Ainekko, offers a solution: return to genuine collaboration. To him, open source isn’t just access; it’s participation. Right now, most AI models are built inside a single vendor, using internal infrastructure, with limited contributor access. That’s not open source. That’s closed development, wrapped in an open label.
For executives, here’s the point: don’t rely on the label. Define internally what open source means for your organization and how it aligns with your legal and technical risk profile. If the data isn’t open, and the community can’t contribute, it’s not really open, and you shouldn’t bet core products on it without knowing where the seams are.
Many foundational AI models branded as open source conflict with traditional open source practices
Many of the most powerful AI models being marketed today as “open” are built behind closed doors. Companies develop these systems internally, often using proprietary datasets and exclusive compute infrastructure. The result is a product that looks open on the surface but doesn’t align with what open source actually means in software development: transparency, reproducibility, and broad community collaboration.
Nathan Benaich and Alex Chalmers from Air Street Capital have been clear about this. They point out that foundational models typically come from central teams with tight control. The contributor pool is narrow. External participation is often blocked or heavily gated. The development process isn’t transparent, and there’s little to no independent verification of what’s under the hood.
That makes these models hard to trust in critical environments. You can’t easily verify how they were built, what the real data provenance is, or who has influenced the output. For companies integrating AI at scale, this should be a signal to step back and reassess. If your business depends on a model, you need to know exactly how it works and what it’s built from.
If you’re in a leadership role and evaluating models for integration, or even internal use, ask about access. Ask about contributors. Who trained it? With what resources? Who else has validated those results? This isn’t just due diligence. It’s a baseline requirement. You wouldn’t roll out financial systems without understanding how they operate. Do the same with AI.
The risks of using purportedly open source AI hinge on transparency and licensing complexities
When the licensing on a model is unclear or restrictive, it’s not just a legal inconvenience; it’s a business risk. Many “open” models include limitations tucked into their terms of use, which can restrict commercial deployment, limit scale, or even bar certain industries from using them. If you’re unaware of those terms, you could unknowingly violate them or put your operation in legal jeopardy.
Take Meta’s Llama 2. It’s not truly open, even though it’s advertised that way. Its license requires any company with more than 700 million monthly active users to request a separate license from Meta, which Meta can refuse at its discretion. This means that if you’re at scale, the model isn’t simply available to you, regardless of technical ability or commercial use case. These restrictions are subtle, but they matter. Stefano Maffulli from the OSI flags these disguised limitations as a major concern. Companies assume openness but get caught by restrictive provisions later.
Mike Lieberman of Kusari explains that ambiguity in openness creates downstream risk. Vendors may shield themselves, but when you fine-tune that model or embed it into your systems, the exposure shifts to you. You’re now potentially liable for unseen licensing gaps or misuse of restricted data. Roman Shaposhnik warns that these liabilities aren’t evenly distributed. Once you move into custom fine-tuning, you’re on your own in terms of legal protection.
If you’re an executive approving AI deployments, enforce a process around license validation. Don’t accept generalized assurances. Ask legal to vet every clause. Ensure your compliance team understands how the license impacts long-term scaling. And make sure your engineering team can explain exactly where the training data and weights came from. Scrutiny now will prevent reputational and legal damage down the road.
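To make that process concrete, here is a rough sketch of how restrictions surfaced by a license review could be encoded as a machine-checkable deployment policy. The specific terms below (a commercial-use flag, a 700-million-user threshold, prohibited industries) are illustrative placeholders inspired by the Llama 2 example above, not a template for any real license; your counsel supplies the actual clauses.

```python
# Illustrative policy gate that turns license restrictions surfaced by legal review
# into machine-checkable rules. The thresholds and field names are placeholders.
from dataclasses import dataclass, field

@dataclass
class LicenseTerms:
    name: str
    allows_commercial_use: bool
    max_monthly_active_users: int | None = None   # None means no user-count cap
    prohibited_industries: set[str] = field(default_factory=set)

@dataclass
class Deployment:
    monthly_active_users: int
    commercial: bool
    industry: str

def violations(terms: LicenseTerms, deployment: Deployment) -> list[str]:
    problems = []
    if deployment.commercial and not terms.allows_commercial_use:
        problems.append("license does not permit commercial use")
    if (terms.max_monthly_active_users is not None
            and deployment.monthly_active_users > terms.max_monthly_active_users):
        problems.append("deployment exceeds the license's user-count threshold")
    if deployment.industry in terms.prohibited_industries:
        problems.append(f"license restricts use in the {deployment.industry} industry")
    return problems

# Example: a Llama-2-style grant checked against a large commercial deployment.
terms = LicenseTerms("llama-2-style", allows_commercial_use=True,
                     max_monthly_active_users=700_000_000)
print(violations(terms, Deployment(monthly_active_users=900_000_000,
                                   commercial=True, industry="retail")))
```

The value is less in the code than in forcing license terms into a form your compliance and engineering teams can both read and test against.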
Open source AI offers significant benefits through transparency, collaboration, and improved security
When AI models are made truly open, meaning not just the code but also the training data, parameters, and documentation are available, security improves quickly. More trained eyes on the model mean faster identification of flaws, biases, and hidden vulnerabilities. This kind of transparency can’t be replicated in closed models.
Mike Lieberman, CTO of Kusari, explained that with open access, organizations can evaluate and correct information or bias in the training data, something proprietary models don’t allow. This becomes especially relevant in regulated sectors or regions where ethical or legal standards differ from those of the model’s creators.
Beyond fixing errors, open source collaboration shortens innovation cycles. Teams can avoid duplicating work by building on community contributions instead of constantly reinventing the architecture and pipeline. It also reduces the environmental toll of retraining large-scale models when open checkpoints or tuned versions are already available.
Leonard Tang, cofounder of Haize Labs, commented that public visibility itself becomes a form of quality assurance. With more developers inspecting the codebase and training processes, reliability increases naturally. Executives looking to build trustworthy AI systems should prioritize open models, not for cost savings, but for the technical and compliance edge that transparency enables.
Evaluating open source AI requires thorough scrutiny of licensing, data provenance, and model behavior
You can’t just deploy an open source AI model and assume it’s safe or legal. It’s not like traditional open source libraries with years of well-documented use. These LLMs are new, large, and often assembled from multiple unknown components. Before you release one into production, you need to audit its origin, confirm its integrity, and test its behavior.
Licensing is the starting point. According to Mike Lieberman, engineering leadership must review not only the code license but also any terms applying to model weights and pre-trained parameters. These licenses can include limits on commercial use or embed obligations that are incompatible with your product or customer base. Stefano Maffulli, executive director of the OSI, stresses the importance of checking data provenance: where the data came from, how it was collected, and whether it contains personal or copyrighted material.
Evaluating the model’s behavior is just as important. Paul Harrison of Mattermost advises closely monitoring system output to validate accuracy and educating users about the uncertainty in AI-generated results. If you’re hosting the model yourself, secure your input pipeline. Use vetted data. Clean your logs. Prevent poorly filtered sources from weakening your model.
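On the log-hygiene point specifically, here is a deliberately minimal sketch of scrubbing obvious personal data from prompts before they are persisted or reused for tuning. The regular expressions are illustrative and will miss plenty; a production pipeline should lean on dedicated PII-detection tooling rather than two regexes.

```python
# Minimal log-hygiene sketch for a self-hosted model: strip obvious personal data
# (emails, phone-like numbers) from prompts before they are logged or reused for
# fine-tuning. The patterns are deliberately simple and will miss many cases.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub(text: str) -> str:
    text = EMAIL.sub("[REDACTED_EMAIL]", text)
    text = PHONE.sub("[REDACTED_PHONE]", text)
    return text

def log_prompt(prompt: str, log_path: str = "prompts.log") -> None:
    # Only the scrubbed prompt is persisted; the raw text never touches disk.
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(scrub(prompt) + "\n")

log_prompt("Contact jane.doe@example.com or +1 (555) 123-4567 about the invoice.")
```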
Nathan Lambert of AI2 notes that tracing data lineage is especially important if the model touches personal or sensitive content. Provenance isn’t optional; it’s foundational to any responsible or compliant AI deployment.
For C-level leaders, the action item is simple: bake this scrutiny into your evaluation process. Build a checklist that includes legal review, data verification, and security assessment before moving to production. By treating AI model adoption with the same level of governance as any other critical infrastructure, you reduce your exposure and improve your long-term leverage.
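One way to operationalize that checklist is a simple go/no-go gate that blocks promotion to production until each review area carries an explicit sign-off. The sketch below is a suggestion, not a standard; map the fields to whatever your legal, data, and security owners actually attest to.

```python
# Sketch of a pre-production gate for AI model adoption. The review areas mirror
# the checklist above (legal, data, security, behavior); field names are
# suggestions, and the sign-offs would come from your own review owners.
from dataclasses import dataclass

@dataclass
class ModelReview:
    model_name: str
    license_reviewed_by_legal: bool = False
    weights_license_compatible: bool = False
    training_data_provenance_documented: bool = False
    security_assessment_passed: bool = False
    output_behavior_tested: bool = False

    def blockers(self) -> list[str]:
        checks = {
            "legal has not signed off on the license": self.license_reviewed_by_legal,
            "weights license compatibility unconfirmed": self.weights_license_compatible,
            "training data provenance undocumented": self.training_data_provenance_documented,
            "security assessment incomplete": self.security_assessment_passed,
            "output behavior untested": self.output_behavior_tested,
        }
        return [issue for issue, passed in checks.items() if not passed]

review = ModelReview("candidate-model", license_reviewed_by_legal=True)
print(review.blockers() or "clear for production")
```

Wire a check like this into your release process and an unreviewed model simply cannot ship quietly.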
There is a pressing need for better governance and clearer definitions
Open source AI is scaling fast across industries, but the regulatory and governance frameworks haven’t kept pace. The U.S. National Telecommunications and Information Administration (NTIA), in its July 2024 report, made that point clear. It identified critical risks tied to making foundation model weights openly accessible: risks involving national security, privacy, civil rights, and misuse that current oversight structures can’t fully address.
That doesn’t mean open source AI should be restricted. It means the industry needs a structured path to ensure accountability, security, and transparency, whether models are open or not. Paul Harrison from Mattermost supports increased governance but is explicit about the method: avoid placing unreasonable burdens on open source maintainers. They don’t have the resources of commercial vendors, and asking them to solve governance alone is ineffective.
Mike Lieberman, CTO at Kusari, offers a solution. Instead of dumping responsibility on project communities, fund them and equip them with tools such as OpenSSF’s GUAC project to trace and manage software supply chains. These tools give maintainers and adopters the visibility needed to understand what’s in a model, how it behaves, and whether it complies with internal policy or external law.
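To illustrate the kind of visibility that tooling is meant to provide, here is a toy lineage record and a check for unresolved questions. The structure is hypothetical and is not GUAC’s schema; in practice GUAC builds a queryable graph from real SBOMs and signed attestations rather than a hand-written dictionary.

```python
# Toy illustration of the lineage information supply-chain tooling helps you
# assemble and query. Every identifier below is a placeholder.
model_lineage = {
    "deployed_model": "internal-support-bot-v3",
    "fine_tuned_from": "some-org/base-model-7b",          # placeholder identifier
    "fine_tuning_datasets": ["internal-tickets-2024"],    # placeholder dataset name
    "base_model_license": "UNKNOWN",                      # unresolved = a finding
    "attestations": [],                                   # e.g. signed build provenance
}

def open_questions(record: dict) -> list[str]:
    findings = []
    if record["base_model_license"] in ("UNKNOWN", None):
        findings.append("base model license has not been established")
    if not record["attestations"]:
        findings.append("no signed provenance attestations recorded")
    return findings

print(open_questions(model_lineage))
```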
The lack of clear legal definition also creates uneven risk distribution among developers, vendors, and users. Roman Shaposhnik from Ainekko stresses that until the industry establishes a clear, reliable standard for what qualifies as open source AI, legal ambiguity will persist, and that undermines confidence in adoption.
For executives, the path forward is straightforward. Don’t wait for regulation to force your hand. Invest early in governance, align procurement with transparent standards, and push your vendors for traceability across every layer of their AI models. Regulatory clarity is coming. The companies that take the lead on security, compliance, and transparency now will be the ones shaping how this tech evolves later.
In conclusion
If you’re making decisions around AI adoption, don’t get distracted by the “open source” label. It doesn’t always mean what it should, and in this space, unclear definitions create real business risk. Legal ambiguity, hidden training data, and inconsistent licensing aren’t theoretical problems. They’re operational liabilities.
The upside is clear: transparency breeds trust, improves security, and fuels innovation through community input. But with that potential comes accountability. As a leader, you’re responsible for asking hard questions. Is the model’s data traceable? Are the license terms acceptable? Can your team explain how the model was trained, and by whom?
Open source AI can be a strategic advantage, but only when governed with intent. Clarity, verification, and internal standards are what separate high-performing, compliant systems from those that quietly stockpile risk. The models you choose today will shape your tech stack, and your exposure, for years. Choose with precision.