It’s not clear if this is piracy. In the US, it’s obviously an ongoing fight. Basically, what you describe is “books3”, put together with scripts by Aaron Swartz.
It’s legal in Japan, if the purpose is only AI training and not enjoyment. I’m not sure if there are issues regarding DRM or such.
In the EU, the dataset and resulting model would be illegal. Any business offering the model would be in hot water, but I think internal use would be fine.
It probably says so wherever you downloaded the model. It's also in the model's metadata; I forget where that's displayed, maybe in the terminal window.
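If you're running it locally as a GGUF file (the llama.cpp format), you can also pull the name out of the file header yourself. A minimal sketch; the exact key names (like "general.name") and the file path are assumptions, not guaranteed:

```python
import struct

def read_string(f):
    # GGUF strings are a little-endian uint64 length followed by UTF-8 bytes.
    (length,) = struct.unpack("<Q", f.read(8))
    return f.read(length).decode("utf-8")

def gguf_string_metadata(path):
    """Return the string-valued metadata entries from a GGUF file header."""
    meta = {}
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        (version,) = struct.unpack("<I", f.read(4))
        tensor_count, kv_count = struct.unpack("<QQ", f.read(16))
        for _ in range(kv_count):
            key = read_string(f)
            (value_type,) = struct.unpack("<I", f.read(4))
            if value_type == 8:  # 8 = string in the GGUF spec
                meta[key] = read_string(f)
            else:
                # Parsing every value type is out of scope for this sketch;
                # stop at the first non-string entry.
                break
    return meta

# Usage (path is hypothetical):
# print(gguf_string_metadata("model.gguf").get("general.name"))
```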
Things you should know:
Llama 3 is probably not the right base for the task. Maybe Phi-3 or something from Cohere.
It remains a dangerous dead end. Any competent fraudster will remove the watermark or use a generator that doesn't add one. Giving people the idea that the absence of a watermark makes something trustworthy can only help bad actors.
> access to the training data
That’s just not realistic. There are too many legal problems with that.
Besides, Llama 3 was trained on 15 trillion tokens. Whatcha gonna do with something like that?
Reading the license, there are three things.
There must be attribution. Finetunes, merges, etc. need to have "Llama 3" at the beginning of the model name. This is probably consistent with FOSS.
Your use of Llama has to “adhere to the Acceptable Use Policy for the Llama Materials”. AFAIK, it’s an open question whether ethical licenses can be considered FOSS.
Finally, you may not use it if you had more than 700 million monthly active users in March 2024 (the calendar month before the release). I'm not sure about the legal definition of "active user", but I doubt it applies to very many companies. In practice, it's probably less of a restriction than copyleft, but still, strictly speaking, that's not FOSS.
Is that the 2B or the 7B?
@Mistral@lemmings.world Can you draw ascii art of a tractor?
This sounds like some weirdly petty political wrangling that would delight any full-blooded bureaucrat.
The desire to make demands about training data is weird. Open source has never included a requirement to provide documentation of any kind. If there were such a requirement, few would care; most would just do their own thing anyway. FOSS licenses facilitate sharing by giving people an easy way to make their code legally usable by others.
There’s nothing that quite matches source code + compiled binary. There are permissively licensed datasets and models. I’ll call either open source. Neither is equivalent to source code but either can be a source.
> looking at SF, California
There are a number of US states that have legalized it sooner and more thoroughly than California (not to mention Germany). What did SF specifically do?
> minimize their own liability
I think maybe that was behind some earlier releases, but Google is already operating Gemini. It's not going to help them with liability.
Ultimately, the cost of making and releasing these models is simply insignificant. Maybe they hope that people messing around with them will yield tips for Gemini. Maybe they hope to familiarize potential future employees with Google-specific AI, giving them a larger and more qualified hiring pool.
They may also hope for a bit of good will, but that may easily backfire.
It means that they want people to consult the code as a reference for how to best use the hardware acceleration.
If all software uses their cards to best effect, that makes the cards more useful and thus more valuable, making them money. If only their own frontend can do that, they lose out on most of that value, while also having to spend money to keep the rest of their software, like the UI, competitive.
via https://duckduckgo.com/?q=DuckDuckGo+AI+Chat&ia=chat&duckai=1 with GPT-4o mini