oh hey just as a PSA, any code any of y'all may have on github might have been scraped by @swheritage, a self-proclaimed "preservation" org which
- just hoovered up vast amounts of data without asking or telling anyone
- insists on deadnaming trans people forever for "integrity" reasons
- used it to build an LLM training data set
https://huggingface.co/datasets/bigcode/the-stack-v2 to check and for opt-out instructions
@outie @swheritage shit, they got all of my repos. 😡
@jaredwhite @outie Time to sue then. Those "BuT oPt-OuT iS a FoRm Of CoNsEnT!" shitfaces need to learn a valuable lesson, *especially* if what they do wouldn't be possible with opt-in. If @swheritage can't live without it then the organisation deserves to become history.
@ao @Natanox @jaredwhite @outie @swheritage theatre Open source isn’t public domain. Open source licenses only count if you follow their terms, which an AI doesn’t.
@arborelia @Natanox @jaredwhite @outie @swheritage
I was talking about them publishing the data, which is legally okay. Using it in an AI may be legally problematic, but providing a dataset that follows licenses is afaik legally okay. I'm just trying to say that there's no real reason to hope for a successful lawsuit against this particular project.
That said: They force you to agree to this in order to get access to it for AI purposes, which I think is way better than many other datasets with our data in them: https://www.softwareheritage.org/2023/10/19/swh-statement-on-llm-for-code/
@ao @Natanox @jaredwhite @outie @swheritage their dataset also _doesn’t_ follow licenses. they just took stuff with no license. They openly say this.
On the AI model side, asking users to obey licenses for them (so they don’t have to) sure is a gambit.
@arborelia @Natanox @jaredwhite @outie @swheritage hm yeah I missed that part. yeah guess that's a good point.
@ao @Natanox @jaredwhite @outie @swheritage disregard the extra word from phone keyboard