Already looking forward to the fallout of all this "AI" nonsense in 3-5 years, after they've run out of high-quality training data like Stack Overflow years earlier. At that point all you're going to have is "AI" trained on "AI" slop.
I had the same idea, but then I was told that they'll read manuals and code. I laughed a bit, but I can't really say if that's true or not.
What if it's closed-source code with shitty docs?
If you’re interacting with it, it’s reading your code and able to refine and train itself on it. It will then create its own comprehensive documentation if needed.
I mean, ideally these things would be trained on principles and foundations, like manuals and schematics, rather than a hodge-podge.
But we already know that with straight rips from proprietary/copyrighted material, you get answers that match, point for point, any number of the thousand "How To X in Lang Y" tutorials written by people who took those same introductory paragraphs and flipped and perverted them into some "crafty" way of doing it, which further separates you from the actual underlying understanding you'd have gotten just by reading the damn manual.
Those will also be more and more LLM-generated, so in a way it will be training on itself, which is not that valuable since those texts are just more of the same token structures.
Actually it's the other way around. Now humans are correcting what the AI has written, and it's being tested in prod too. So that's free data labeling.
I've always found this a very strange narrative.
I really can't see why people think LLMs are only years away from model collapse and there is nothing these researchers can do about it, as if they're not way smarter than all of us anyway.
Why can't they just let AIs use tools to execute code, and if the code runs successfully, use it as training data?
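Roughly what that filter could look like, as a minimal sketch (the candidate snippets and the exit-code-zero rule here are illustrative, not anyone's actual pipeline):

```python
# Hypothetical filter: keep only generated snippets that actually execute.
import os
import subprocess
import sys
import tempfile

def runs_successfully(code: str, timeout: float = 5.0) -> bool:
    """Run a candidate snippet in a subprocess; True if it exits cleanly."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

candidates = ["print(sum(range(10)))", "print(undefined_name)"]
training_data = [c for c in candidates if runs_successfully(c)]
print(training_data)  # only the snippet that actually ran survives
```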
Runs successfully according to what, the tests that the AI deleted? Or worse, the tests that the AI wrote?
So, machine learning? Just fire off random things till something works? LLMs are better suited for code since they guess the next part of a word based on the words before it, and they know the best match from having learned from all the training data.
So they do keep learning right now, but it's just more and more of the same. No real new ideas.
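For anyone wondering what "guess the next part based on what came before" means concretely, here's a toy character-level version built from plain counts (real models use deep networks over long contexts, but the prediction task has the same shape):

```python
# Toy bigram "next-character" predictor, for illustration only.
from collections import Counter, defaultdict

corpus = "for i in range(10): print(i)"  # stand-in for the training data
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1  # tally what tends to follow each character

def predict_next(ch: str) -> str:
    """Best guess: the character most often seen after `ch` in training."""
    return counts[ch].most_common(1)[0][0]

print(predict_next("("))  # what this tiny corpus suggests follows "("
```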
This plus other factors are already used in RLVR (reinforcement learning with verifiable rewards). I'm not sure why you're getting so many downvotes; this is an important part of post-training for modern SOTA models.
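For context, a verifiable reward is just a check the machine can run on its own output; a minimal sketch (the task and the exact-match rule are made up for illustration):

```python
# Hypothetical RLVR-style reward: no human labeler, the check is the supervision.
def verifiable_reward(model_answer: str, expected: str) -> float:
    """1.0 if the answer matches mechanically checkable ground truth, else 0.0."""
    return 1.0 if model_answer.strip() == expected.strip() else 0.0

# e.g. a math problem whose answer can be checked exactly
reward = verifiable_reward(model_answer="42", expected=str(6 * 7))
print(reward)  # 1.0 -> this completion gets reinforced in the RL update
```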
Not sure they'll keep training on AI-generated data. They may limit themselves to everything prior to the GPT-3 release and then get smarter about how to deal with that.
If you're interested in the answer, read on. I've been doing research, and then building a company, in AI for the past 15 years.
A [coding] LLM is trained on raw code, directly [stolen] from GitHub or other sources. Learning to code is an unsupervised task: you don't need quality data, you need AMOUNT of data, in uppercase. Quality is totally irrelevant in large-scale unsupervised tasks.
As more and more people use them and send them code, they now have access to a huge amount of data and direct feedback from willing users, on a scale 100x beyond SO, simply because only a negligible fraction of Stack Overflow visitors ever wrote answers; 99.9% didn't even have an account on the site. Every single LLM user is providing large amounts of first-party data to the LLM. Bad code or AI slop is irrelevant because it learns the language.
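To make the "quality is irrelevant at this stage" point concrete: the pretraining objective is just next-token prediction over whatever code you have. A minimal sketch, assuming PyTorch and a toy byte-level model (not any lab's actual setup):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

raw_code = b"def add(a, b):\n    return a + b\n"  # could be slop; the loss doesn't care
tokens = torch.tensor(list(raw_code), dtype=torch.long)

class TinyLM(nn.Module):
    """Toy model: predicts the next byte from the current one.
    (A real LLM conditions on the whole preceding context.)"""
    def __init__(self, vocab: int = 256, dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.embed(x))

model = TinyLM()
logits = model(tokens[:-1])                 # predict token t+1 from token t
loss = F.cross_entropy(logits, tokens[1:])  # the data itself is the label
print(loss.item())
```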
Initially, [first iterations of LLM] models were then finetuned [the supervised step] manually on tons of different crafted problems they were supposed to solve. This is where non-functional and non-working code was fixed; AI slop is irrelevant because this step exists.
Feedback has since become a real loop: they actually have users now and don't need to rely on weirdly specific, manually crafted problems, so they can be finetuned to solve the real-world problems of their real users. AI slop is even less of a problem because this step keeps getting better; the fact that some code was written using AI actually gives the AI an easier time.
"It's easier to fix your own code than someone else's", AI slop is a problem for us not the LLMs.
SO is no longer a relevant source for any LLM simply because its scale is too small. GitHub is still valuable, but for how long? With tools like Cursor being pushed by LLM providers, they'll slowly but surely get direct access to a shitton of private, indie, and amateur codebases. They will slowly outscale anything that currently exists in terms of first-party data.
I hadn't thought that the user response (= user validation) could be used to manually finetune the AI. But now that you point it out, it seems obvious! Thanks.
I'm wondering, did you use AI to format your answer?
I didn't use AI in any form in my answer; I'm simply a working professional who has produced numerous papers (I won't doxx myself or advertise my company, obviously).
AI initially acquired a specific writing style from learning to reproduce output crafted by academic researchers, most of which was directly written, edited, or approved by an AI research group. The initial iterations of LLMs had the writing style of their makers; I have the same writing style as AI researchers, hence you might find my style closely resembling earlier LLMs.
It reached the apex of the uncanny valley as it got better and better at sounding academic and researcher-like. Now that it's attempting to tackle harder problems, it's coming back to sounding more relatable and "human" as it improves in its ability to successfully convey a message to any specific user.
Nah, all you need is hundreds of thousands of people using your code editor and/or git hosting service.
Then you can just TAKE the data 😁 problem solved
microsoft would nev-
Exactly this. People seem to forget that AI is being fine-tuned based on its use. It's not just copy/pasting code from the internet; we're giving it human-feedback training just by interacting with it. It will learn from us until it doesn't need us at all.
AI needs humans!
For now.
All the stuff I looked up in the past month had zero answers on Stack Overflow, so ¯\_(ツ)_/¯
It’s hilarious watching this sub copeseethe over AI.
Yeah, this joke of a meme was probably true 3 years ago. These days I highly doubt that a company like Anthropic relies on outdated answers from Stack Overflow.
Documentation for new libraries will be AI-generated (it already is where I work, for new stuff), but it's reviewed by humans, so future AIs learning from it shouldn't be a problem.
The first Stack Overflow responders had to read the documentation.
Or just learn from its own usage
lol. Nearly 80% of all accepted answers on SO are wrong; it's usually the 3rd or 4th one down that's correct.
This subreddit has turned into an anti-AI circlejerk. The more anti-LLM memes I see here, the more it shows that people are panicking over being let go, or already were. Who knows what LLMs will really learn from in a couple of years, but as things stand we're already seeing significant downsizing at many firms while they still manage to retain the same levels of productivity.
I don't remember AI closing my chat as a duplicate and then linking to a deleted page / an unanswered question.
So what?
Without Stack Overflow, even artificial intelligence cries over a compilation error
Great, now AI will start responding
“What have you tried?”
This one, otoh, is fairly legit.
If you ever ask the same question twice, you get hit with "Duplicate question" and a link that leads to something barely relevant.