I wonder how much was scraped from knowyourmeme.com
I mean it still parsed the specific text in the meme and formulated a coherent explanation of this specific meme, not just the meme format
Or it matched the text with an existing explanation upon which it was indexed.
Lmao you think it found a specific explanation for this specific variation of this meme?
For each phrase, yes.
That’s not how GPTs work
That’s literally how they work
Man, the models can’t store their training data verbatim. The training data is turned into a model that is hundreds or thousands of times smaller than the original source data. If it were capable of simply recovering everything it was trained on, this would be some magical compression algorithm, and that by itself would be extremely impressive.
Congratulations on discovering compression
Oh ok, so you want to claim this compresses the entirety of the internet into a model that isn’t even 1 terabyte, and still be unimpressed? That is something.
But it isn’t compression. It is a mathematical fact that neural networks are universal function approximators; this is undisputed. The functions they approximate are continuous, so the network must be able to fill in the gaps between discrete data points by itself, which necessarily means spitting out data outside of the training set, data it has not seen.
TBF, compression is related to ML. Hence, the Hutter Prize. Thinking of LLMs as lossy compression algorithms is a decent analogy.
Not sure why you feel the need to put words in my mouth. It wasn’t trained on “the entirety of the Internet,” but rather less than a terabyte of it. So yeah, that would probably take up less than a terabyte.
They do not store anything verbatim; they instead store the directions in which various words and related concepts relate to one another in some gigantic multidimensional space.
I highly suggest you go learn what they actually do before you continue talking out of your ass about them
If you trained a GPT on a single phrase, all you’d get out of it would be the single phrase.
The mechanism of storage doesn’t need to be just the verbatim source material, which is not even close to what I said.
You said it matches text to its training data, which it does not do.
Your single-phrase statement only works for very short, non-repetitive phrases. As soon as your phrase repeats a token more than a few times, the statistics for the tokens change and could result in nonsensical output that repeats through subsections of the training data.
And even then, for a single non-repetitive phrase, the reason you would get that phrase back is not that it would be “matching on” the phrase. It is that the token weights would effectively encode that the statistical likelihood of the “next token” in the generated output is 100% for a given token when the evaluated token precedes it in the training phrase. Or in other words: your training data being a single phrase manipulates the statistics so that the most likely output is that single phrase.
However, that is a far cry from simple “matching” against the training data. Which is what you said it does.
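To make the statistics point concrete, here is a rough sketch. It uses a toy bigram counter as a stand-in, which is an assumption for illustration only: a real GPT is a transformer that conditions on much more than the previous token. The function names and the example phrases are made up.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count which token follows which, then normalise counts to probabilities."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return {
        prev: {tok: n / sum(c.values()) for tok, n in c.items()}
        for prev, c in counts.items()
    }

def generate(model, start, max_len=10):
    """Greedy decoding: always pick the most likely next token."""
    out = [start]
    while out[-1] in model and len(out) < max_len:
        probs = model[out[-1]]
        out.append(max(probs, key=probs.get))
    return out

# Hypothetical single training phrase with no repeated tokens.
phrase = "the cat sat on a mat".split()
model = train_bigram(phrase)
regenerated = generate(model, "the")  # every next-token probability is 1.0

# Hypothetical phrase that repeats the token "the" three times.
repetitive = "the cat saw the dog and the bird".split()
rep_model = train_bigram(repetitive)
looped = generate(rep_model, "the", max_len=7)  # "the" now has three successors
```

In the first case nothing is looked up or matched: each token simply has one successor with probability 1, so the phrase falls out of the statistics. In the second case "the" is followed by three different tokens with equal probability, and greedy decoding gets stuck cycling through one subsection of the training phrase.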
If it doesn’t use its training data, what’s the training data for?