ProgrammerHumor

youNeedStackOverflowDespiteHavingAi

https://i.redd.it/pprmr7y9odcf1.jpeg

Discussion

RiceBroad4552

Already looking forward to the fallout of all this "AI" nonsense in 3-5 years, once they've run out of high-quality training data like StackOverflow years earlier. At that point all you're going to have is "AI" trained on "AI" slop.

3 hours ago
Professional_Top8485

I had the same idea, but then I was told that they will read manuals and code. I laughed a bit, but I can't really say whether that's true or not.

3 hours ago
firecorn22

What if it's closed-source code with shitty docs?

2 hours ago
_krinkled

Those will also be more and more LLM-generated, so in a way it will be training on itself, which is not that valuable since those texts are more of the same token structures.

2 hours ago
YaVollMeinHerr

Not sure they will keep training on AI-generated data. They may limit themselves to everything prior to the GPT-3 release and then get smarter about how to deal with that.

2 hours ago
SaltMaker23

If you're interested in the answer, read on. I've been doing research and then built a company with AI over the past 15 years.

A [coding] LLM is trained on raw code, directly [stolen] from GitHub or other sources. Learning to code is an unsupervised task: you don't need quality data, you need AMOUNT of data, in uppercase. Quality is totally irrelevant in large-scale unsupervised tasks.

As more and more people use them and send them code, they now have access to a huge amount of data and direct feedback from their willing users, on a scale 100x larger than SO, simply because only a negligible fraction of Stack Overflow visitors actually wrote answers; 99.9% didn't even have an account on the website. Every single LLM user is providing large amounts of first-party data to the LLM. Bad code or AI slop is irrelevant because the model is learning the language.

Initially [first iterations of LLMs], models were then finetuned [supervised step] manually on tons of different crafted problems they were supposed to solve. This is where non-functional and non-working code was fixed. AI slop is irrelevant because this step exists.

Now that feedback has become a real loop, since they actually have users and don't need to rely on weirdly specific manually crafted problems, they can be finetuned to solve the real-world problems of their real users. AI slop is even less of a problem because this step keeps getting better, and the fact that some of the code was written using AI makes things easier for the AI.

"It's easier to fix your own code than someone else's": AI slop is a problem for us, not the LLMs.

SO is no longer a relevant source for any LLM simply because its scale is too small. GitHub is still valuable, but for how long? With tools like Cursor being pushed by LLM providers, they'll slowly but surely get direct access to a shitton of private, indie and amateur codebases. They will slowly outscale anything that currently exists in terms of first-party data.

1 hour ago
YaVollMeinHerr

I didn't think that the user response (= user validation) could be used to manually finetune the AI. But now that you point it out, that makes it obvious! Thanks

I'm wondering, did you use AI to format your answer?

26 minutes ago
npquanh30402

Why can't they just let AIs use tools to execute code and if the code runs successfully, it will then be used as training data?
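
To make that idea concrete, here is a minimal sketch of an execution-based filter: run each generated snippet in a subprocess and keep only the ones that exit cleanly. The names `runs_successfully` and `filter_candidates` are made up for illustration, and this is an assumption about how such a filter could look, not how any LLM provider actually builds training data. A real setup would need proper sandboxing.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def runs_successfully(source: str, timeout_s: float = 5.0) -> bool:
    """Return True if `source` exits with status 0 within the timeout.

    Hypothetical helper. WARNING: this executes arbitrary code; a real
    pipeline would run it inside an isolated sandbox or container.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(source, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            # Hanging or very slow code is treated as a failure.
            return False
        return result.returncode == 0


def filter_candidates(candidates: list[str]) -> list[str]:
    """Keep only the snippets that run to completion without an error."""
    return [code for code in candidates if runs_successfully(code)]


if __name__ == "__main__":
    samples = [
        "print(sum(range(10)))",        # runs fine
        "def broken(:\n    pass",       # syntax error, rejected
        "raise ValueError('nope')",     # runtime error, rejected
    ]
    print(filter_candidates(samples))
```

Note that "exits with status 0" only proves the code ran, not that it does what was asked: `def add(a, b): return a - b` passes this filter just fine.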

3 hours ago
Reashu

Runs successfully according to what, the tests that the AI deleted? Or worse, the tests that the AI wrote?

3 hours ago
_krinkled

So machine learning? Just fire random things till it works? LLMs are better suited for code since they guess the next part of a word based on the words before. And it knows the best match by having learned from all the training data.

So it does keep learning right now, but it’s just more and more of the same. No real new ideas.
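
As a toy illustration of "guess the next part based on what came before", here is a bigram counter that predicts the most frequent follower of a token. This is a deliberately tiny stand-in: real LLMs use neural networks over long contexts, not lookup tables, and the names `train_bigram` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count, for every token, which tokens followed it in the training data."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, prev):
    """Return the most frequently seen follower of `prev`, or None if unseen."""
    followers = counts.get(prev)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


if __name__ == "__main__":
    # Tiny made-up "training corpus" of code tokens.
    corpus = "for i in range ( n ) : print ( i )".split()
    model = train_bigram(corpus)
    print(predict_next(model, "in"))     # the only token ever seen after "in"
    print(predict_next(model, "while"))  # never seen, so no prediction
```

The model can only ever reproduce continuations that appeared in its training data, which is the "more of the same" point in miniature.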

2 hours ago
vigbiorn

Great, now AI will start responding

Closed, off topic. Already answered [here](deadlink)

3 hours ago
Afsheen_dev

AI needs humans!

4 hours ago
sorryfortheessay

Nah all you need is hundreds of thousands of people using your code editor and/or git hosting service.

Then you can just TAKE the data 😁 problem solved

3 hours ago
i-am-called-glitchy

microsoft would nev-

1 minute ago
Devatator_

All the stuff I looked up in the past month had zero answers on stackoverflow so ¯\_(ツ)_/¯

2 hours ago
JuicyPeachWhispers

Without Stack Overflow, even artificial intelligence cries over a compilation error

2 hours ago
lurkingReeds

The first Stackoverflow responders had to read documentation 

27 minutes ago
Voxmaris

So what?

4 hours ago