Already looking forward to the fallout of all this "AI" nonsense in 3-5 years, after they've run out of high-quality training data like Stack Overflow years earlier. At that point all you're going to have is "AI" trained on "AI" slop.
I had the same idea, but then I was told that they'll read manuals and code. I laughed a bit, but I can't really say if that's true or not.
What if it's closed-source code with shitty docs?
If you’re interacting with it, it’s reading your code and able to refine and train itself on it. It will then create its own comprehensive documentation if needed.
I mean, ideally these things would be trained on principles and foundations, like manuals and schematics, rather than a hodge-podge.
But we already know that with straight rips from proprietary/copyrighted material, you get answers that match, point for point, any number of the thousand "How To X in Lang Y" tutorials written by people who took those same introductory paragraphs and flipped and perverted them into some "crafty" way of doing it, which further separates you from the actual underlying understanding you'd have gotten just by reading the damn manual.
Those will also be more and more LLM-generated, so in a way it will be training on itself, which is not that valuable since those texts are just more of the same token structures.
Actually it's the other way around. Now humans are correcting what the AI has written, and it's being tested in prod too. So that's free data labeling.
I've always found this a very strange narrative.
I really can't see why people think LLMs are only years away from model collapse and there is nothing these researchers can do about it, as if they're not way smarter than all of us anyway.
Why can't they just let AIs use tools to execute code, and if the code runs successfully, use it as training data?
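Roughly what that filter could look like, as a minimal sketch (the candidate snippets and the exit-code-zero rule here are illustrative, not anyone's actual pipeline):

```python
# Hypothetical filter: keep only generated snippets that actually execute.
import os
import subprocess
import sys
import tempfile

def runs_successfully(code: str, timeout: float = 5.0) -> bool:
    """Run a candidate snippet in a subprocess; True if it exits cleanly."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

candidates = ["print(sum(range(10)))", "print(undefined_name)"]
training_data = [c for c in candidates if runs_successfully(c)]
print(training_data)  # only the snippet that actually ran survives
```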
Runs successfully according to what, the tests that the AI deleted? Or worse, the tests that the AI wrote?
So, machine learning? Just fire off random things till something works? LLMs are better suited for code since they guess the next part of a word based on the words before it, and they know the best match from having learned from all the training data.
So they do keep learning right now, but it's just more and more of the same. No real new ideas.
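For anyone wondering what "guess the next part based on what came before" means concretely, here's a toy character-level version built from plain counts (real models use deep networks over long contexts, but the prediction task has the same shape):

```python
# Toy bigram "next-character" predictor, for illustration only.
from collections import Counter, defaultdict

corpus = "for i in range(10): print(i)"  # stand-in for the training data
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1  # tally what tends to follow each character

def predict_next(ch: str) -> str:
    """Best guess: the character most often seen after `ch` in training."""
    return counts[ch].most_common(1)[0][0]

print(predict_next("("))  # what this tiny corpus suggests follows "("
```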
This plus other factors are already used in RLVR (reinforcement learning with verifiable rewards). I'm not sure why you're getting so many downvotes; this is an important part of post-training for modern SOTA models.
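For context, a verifiable reward is just a check the machine can run on its own output; a minimal sketch (the task and the exact-match rule are made up for illustration):

```python
# Hypothetical RLVR-style reward: no human labeler, the check is the supervision.
def verifiable_reward(model_answer: str, expected: str) -> float:
    """1.0 if the answer matches mechanically checkable ground truth, else 0.0."""
    return 1.0 if model_answer.strip() == expected.strip() else 0.0

# e.g. a math problem whose answer can be checked exactly
reward = verifiable_reward(model_answer="42", expected=str(6 * 7))
print(reward)  # 1.0 -> this completion gets reinforced in the RL update
```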
Not sure they'll keep training on AI-generated data. They may limit themselves to everything prior to the GPT-3 release and then get smarter about how to deal with that.
If you're interested in the answer, read on. I've been doing research, and then building a company, in AI for the past 15 years.
A [coding] LLM is trained on raw code, directly [stolen] from GitHub or other sources. Learning to code is an unsupervised task: you don't need quality data, you need AMOUNT of data, in uppercase. Quality is totally irrelevant in large-scale unsupervised tasks.
As more and more people use them and send them code, they now have access to a huge amount of data and direct feedback from willing users, on a scale 100x beyond SO, simply because only a negligible fraction of Stack Overflow visitors ever wrote answers; 99.9% didn't even have an account on the site. Every single LLM user is providing large amounts of first-party data to the LLM. Bad code or AI slop is irrelevant because it learns the language.
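To make the "quality is irrelevant at this stage" point concrete: the pretraining objective is just next-token prediction over whatever code you have. A minimal sketch, assuming PyTorch and a toy byte-level model (not any lab's actual setup):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

raw_code = b"def add(a, b):\n    return a + b\n"  # could be slop; the loss doesn't care
tokens = torch.tensor(list(raw_code), dtype=torch.long)

class TinyLM(nn.Module):
    """Toy model: predicts the next byte from the current one.
    (A real LLM conditions on the whole preceding context.)"""
    def __init__(self, vocab: int = 256, dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.embed(x))

model = TinyLM()
logits = model(tokens[:-1])                 # predict token t+1 from token t
loss = F.cross_entropy(logits, tokens[1:])  # the data itself is the label
print(loss.item())
```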
Initially, [first iterations of LLM] models were then finetuned [the supervised step] manually on tons of different crafted problems they were supposed to solve. This is where non-functional and non-working code was fixed; AI slop is irrelevant because this step exists.
Feedback has since become a real loop: they actually have users now and don't need to rely on weirdly specific, manually crafted problems, so they can be finetuned to solve the real-world problems of their real users. AI slop is even less of a problem because this step keeps getting better; the fact that some code was written using AI actually gives the AI an easier time.
"It's easier to fix your own code than someone else's", AI slop is a problem for us not the LLMs.
SO is no longer a relevant source for any LLM simply because its scale is too small. GitHub is still valuable, but for how long? With tools like Cursor being pushed by LLM providers, they'll slowly but surely get direct access to a shitton of private, indie, and amateur codebases. They will slowly outscale anything that currently exists in terms of first-party data.
I hadn't thought that the user response (= user validation) could be used to manually finetune the AI. But now that you point it out, it seems obvious! Thanks.
I'm wondering, did you use AI to format your answer?
I didn't use AI in any form in my answer; I'm simply a working professional who has produced numerous papers (I won't doxx myself or advertise my company, obviously).
AI initially acquired a specific writing style from learning to reproduce output crafted by academic researchers, most of which was directly written, edited, or approved by an AI research group. The initial iterations of LLMs had the writing style of their makers; I have the same writing style as AI researchers, hence you might find my style closely resembling earlier LLMs.
It reached the apex of the uncanny valley as it got better and better at sounding academic and researcher-like. Now that it's attempting to tackle harder problems, it's coming back to sounding more relatable and "human" as it improves in its ability to successfully convey a message to any specific user.
Nah, all you need is hundreds of thousands of people using your code editor and/or git hosting service.
Then you can just TAKE the data 😁 problem solved
microsoft would nev-
Exactly this. People seem to forget that AI is being fine-tuned based on its use. It's not just copy/pasting code from the internet; we're giving it human-feedback training just by interacting with it. It will learn from us until it doesn't need us at all.
AI needs humans!
For now.
All the stuff I looked up in the past month had zero answers on Stack Overflow, so ¯\_(ツ)_/¯
It’s hilarious watching this sub copeseethe over AI.
Yeah, this joke of a meme was probably true 3 years ago. These days I highly doubt that a company like Anthropic relies on outdated answers from Stack Overflow.
Documentation for new libraries will be AI-generated (it already is where I work, for new stuff), but it's reviewed by humans, so future AIs learning from it shouldn't be a problem.
The first Stack Overflow responders had to read the documentation.
Or just learn from its own usage
lol. Nearly 80% of all accepted answers on SO are wrong; it's usually the 3rd or 4th one down that's correct.
This subreddit has turned into an anti-AI circlejerk. The more anti-LLM memes I see here, the more it shows that people are panicking over being let go, or already were. Who knows what LLMs will really learn from in a couple of years, but as things stand we're already seeing significant downsizing at many firms while they still manage to retain the same levels of productivity.
I don't remember AI closing my chat as a duplicate and then linking to a deleted page / an unanswered question.
So what?
Without Stack Overflow, even artificial intelligence cries over a compilation error
Great, now AI will start responding
“What have you tried?”
This one, otoh, is fairly legit.
If you ever ask the same question twice, you get hit with "Duplicate question" and a link that leads to something barely relevant.