That's why I was suggesting that at some point it will all come from Google's resources: newsgroups, forums, mail, chats, blah blah. Eventually they will team up with the likes of FB and share data as the smaller players are wiped out. If it's not in the big garden, it won't exist.
Fri Jun 13 2025 18:00:35 UTC from IGnatius T Foobar
Extrapolate that out far enough and all the source sites go out of business and Google has nothing left to train the AI.
I know we don't agree on this one, but it's more than just regurgitation. And they can (and do) get trained on actual technical documents. Operational manuals. Stuff like that. It's not just web crawling forums. Sure, that's a large percentage, but if it vanished, it would not kill things.
Fri Jun 13 2025 18:00:35 UTC from IGnatius T Foobar
But if all the developers move to pair programming with AI (which I admit I have been doing a bit of lately) there's no one left to contribute answers to Stack Exchange. Same end effect.
Forgot to add this to the above.
Case in point: go look up 'the pile'. Because of it, there were/are several lawsuits for IP infringement against some of the bigger players, as basically it's a huge database of pirated manuals/books/papers/etc. that was used in early model training.
But fast forward to today: for the most part no one is creating models from scratch anymore. Current-generation AI models are training their successors (this is called synthetic data and distillation), as well as refining what they have and adding new data to increase knowledge and accuracy, so I really do believe that the loss of 'public forums' would not matter at this point. Perhaps it would affect things like 'current popular TV' or useless garbage like that, but actual work (coding, medical, chip design, mathematics, chemistry, science stuff, etc.) can get by just fine with the models we have now plus curated, and often private, training content, so it won't hinder things at all in the bigger picture.
But I'm just an amateur in all this and know just enough to get by. I had a head start, since I was doing this stuff decades ago with "simple" (by today's standards) robotics when I was with GM, and with collision detection as hobby work. But if you want a true expert's view, you know who to talk to :)
Yes, the days of being reliant on web-scraping for core training are over. But also, I agree, if the plethora of data 'out there' was suddenly gone and all we had to work with was distillation, refined synthetic data and 'reference manuals', it would make the models take a hit with popular culture and current events. But overall, it would not be a deal killer or anything like that for advancement. And other than edge cases and a few exceptions, like this one, most work on models is starting to head more towards STEM and 'targeted' use, like robotics, IoT, and custom 'chat agents' with proprietary company data, and less towards 'general purpose'.
As I have mentioned, I do make my living in the AI world, and even I am not too fond of forcing users into that sort of thing. 1: let them choose. 2: don't be "creating" the content; just summarize the search and include all references. If they start truly "creating" the content, it will give them too much control over the results and they will, in effect, 'become the web'. And then, aside from the money, how does that go again? "Those who control the present."
Of course I fully feel that AI has its place in a lot of areas, but boxing users into only generated content, and letting a single source win over all others, is not one of them. Even if they were a 'good company'.
Yes, the days of being reliant on web-scraping for core training are
over. But also, I agree, if the plethora of data 'out there' was
suddenly gone and all we had to work with was distillation, refined
synthetic data and 'reference manuals', it would make the models take a
hit with popular culture and current events.
It's interesting to watch Grok 3 augment its vast corpus of data with the latest posts on X.
On the other hand, maybe the degrading feedback loop continues there too, if more X posts are made by AI. (Yes I know, some of you hate Elon, blah blah blah, that's not relevant here.) I suppose any source can be back-polluted that way. What bothers me the most is that half the stuff on YouTube these days is AI slop. I like to watch cooking videos, and I'm seeing ones now that are just a photo of the finished plate and a computer voice reciting the recipe. That's not even generative AI, it's just a straight-up crap factory. I'd like computer generated content to be flagged as computer generated, but Google doesn't care if half of what they feed out is complete garbage as long as the ad plays.
(See how I brought it back to the room topic?) And that is why, more than ever, I am PROUD to be using ad blockers.
That is an example of what I was talking about with Google. Their AI will eat data from all of its products.
Tue Jun 17 2025 00:42:44 UTC from IGnatius T Foobar
It's interesting to watch Grok 3 augment its vast corpus of data with the latest posts on X.
It can't just eat itself. Absorbing outside knowledge also becomes increasingly difficult as the Dead Internet Theory slowly becomes reality. As I mentioned here a few weeks ago, someone recently measured Internet traffic as having crossed the 50% mark for bot-originated volume. (No, I don't know what metric they used, and for the purpose of this discussion it doesn't matter.)
Don't get me wrong, I do agree that Generative AI is useful. However, it's currently turning the world's computer resources into gray goo. We need a correction. There is too much fatigue.
After all, why are we all here on this board? Because we know everyone here is human.
That is what "synthetic data" is all about.
Thu Jun 19 2025 21:18:22 UTC from IGnatius T Foobar
It can't just eat itself.
After all, why are we all here on this board? Because we know
everyone here is human.
Yeah, about that. I've been developing doubt in a couple of cases...
Are you sure?
Are any of us real? What is real?
Thu Jun 19 2025 21:18:22 UTC from IGnatius T Foobar
After all, why are we all here on this board? Because we know everyone here is human.
Like it or not, I'm real, unless some of my other theories pan out and nothing is real...
LoL
Mon Jun 23 2025 23:18:58 UTC from zelgomer
After all, why are we all here on this board? Because we know
everyone here is human.
Yeah, about that. I've been developing doubt in a couple of cases...
2025-06-24 21:32 from Nurb432 <nurb432@uncensored.citadel.org>
Like it or not, I'm real, unless some of my other theories pan out and
nothing is real...
LoL
Mon Jun 23 2025 23:18:58 UTC from zelgomer
After all, why are we all here on this board? Because we know
everyone here is human.
Yeah, about that. I've been developing doubt in a couple of cases...
Well for one, real as he may be, I suspect that darknetuser may in fact be a horse.
Well for one, real as he may be, I suspect that darknetuser may in
fact be a horse.
Don't care. Even if he is a horse he's still our friend.
For all you know I might be a brain in a vat with an ethernet cable.
Horses are cool.
Wed Jun 25 2025 02:26:07 UTC from zelgomer
Well for one, real as he may be, I suspect that darknetuser may in fact be a horse.