The Depth of GPT Embeddings
Posted by bsstahl on 2023-10-03 and Filed Under: tools
I've been trying to get a handle on the number of representations possible in a GPT vector and thought others might find this interesting as well. For the purposes of this discussion, a GPT vector is a 1536-dimensional, unit-length structure encoded using the text-embedding-ada-002 embedding model.
We know that the number of theoretical representations is infinite, since there are infinitely many possible values between 0 and 1, and thus infinitely many values between -1 and +1. However, we are not working with truly infinite values, since we need to be able to represent them in a computer. This limits us to a finite number of decimal places, so we can approximate the number of possible values by looking at how many decimal places we can represent.
Calculating the number of possible states
I started by looking for a lower bound for the value and increasing the fidelity from there. We know that these embeddings, because they are unit-length, can take values from -1 to +1 in each dimension. If we assume temporarily that only integer values are used, we can say there are only 3 possible states for each of the 1536 dimensions of the vector (-1, 0, +1). That gives a base (B) of 3 with a digit count (D) of 1536, which can be supplied to the general equation for the number of possible values that can be represented:
V = B^D, or V = 3^1536
The result of this calculation is equivalent to 2^2435 or 10^733 or, if you prefer, a number of this form:
10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Already an insanely large number. For comparison, the number of atoms in the universe is roughly 10^80.
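If you'd like to verify the arithmetic yourself, here is a minimal sketch using Python's arbitrary-precision integers (the language choice is mine; nothing here depends on the embedding model itself):

```python
# Quick check of the integer-only case.
import math

B, D = 3, 1536              # three states (-1, 0, +1) across 1536 dimensions
V = B ** D                  # total number of representable vectors

print(V.bit_length())       # -> 2435, so V is roughly 2^2435
print(len(str(V)))          # -> 733 decimal digits, so V is roughly 10^733
print(D * math.log10(B))    # -> ~732.9, the base-10 exponent computed directly
```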
We now know that we have at least 10^733 possible states for each vector. But that is just using integer values. What happens if we start increasing the fidelity of our values? The next step is to assume that we can use values with a single decimal place. That is, the numbers in each dimension can take values such as 0.1 and -0.5. This increases the base in the above equation by a factor of 10, from 3 to 30. Our new values to plug in to the equation are:
V = 30^1536
This is equivalent to 2^7537 or 10^2269.
Another way of thinking about this value is that representing it would require a data structure not of 32 or 64 bits, but of 7537 bits. That is, we would need a data structure 7537 bits long to represent every possible state of a vector that uses just one decimal place in each dimension.
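As a quick sanity check on that bit count (again a Python sketch, assuming the same base-30 approximation):

```python
B, D = 30, 1536              # base 30: integers plus one decimal place per dimension
V = B ** D                   # every possible state of the vector

print(V.bit_length())        # -> 7537 bits needed to distinguish every state
print(V.bit_length() / 64)   # -> ~117.8, i.e. roughly 118 64-bit words
```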
We can continue this process for a few more decimal places, each time increasing the base by a factor of 10. The results are shown in the table below; the last two columns give the exponents x and y such that V is approximately 2^x and 10^y.
| Base (B) | Example precision | Base-2 exponent | Base-10 exponent |
|---|---|---|---|
| 3 | 1 | 2435 | 733 |
| 30 | 0.1 | 7537 | 2269 |
| 300 | 0.01 | 12639 | 3805 |
| 3000 | 0.001 | 17742 | 5341 |
| 30000 | 0.0001 | 22844 | 6877 |
| 300000 | 0.00001 | 27947 | 8413 |
| 3000000 | 0.000001 | 33049 | 9949 |
| 30000000 | 0.0000001 | 38152 | 11485 |
| 300000000 | 0.00000001 | 43254 | 13021 |
| 3000000000 | 0.000000001 | 48357 | 14557 |
| 30000000000 | 0.0000000001 | 53459 | 16093 |
| 300000000000 | 0.00000000001 | 58562 | 17629 |
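The table can be reproduced with a short script. This sketch assumes the same approximation used above, namely that the base grows by a factor of 10 for each additional decimal place:

```python
import math

D = 1536                                 # dimensions in the embedding vector
for places in range(12):                 # 0 through 11 decimal places
    B = 3 * 10 ** places                 # roughly -1..+1 in steps of 10^-places
    exp2 = round(D * math.log2(B))       # exponent when V is written as a power of 2
    exp10 = round(D * math.log10(B))     # exponent when V is written as a power of 10
    print(f"{B:>13,} | {10.0 ** -places:<8g} | {exp2:>6} | {exp10:>6}")
```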
This means that if we assume 7 decimal digits of precision in our data structures, we can represent 10^11485 distinct values in our vector.
This number is so large that all the computers in the world, churning out millions of values per second for the entire history (start to finish) of the universe, would not come close to generating all of the possible values of a single vector.
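To put rough numbers on that claim, here is an illustrative calculation; the counts of computers and values per second are assumptions chosen only to be generous:

```python
# Assume 10 billion computers, each generating a billion values per second,
# running for 10^18 seconds -- far longer than the ~4.3 x 10^17 seconds the
# universe has existed. These figures are assumptions for scale, not measurements.
values_ever_generated = 10**10 * 10**9 * 10**18    # = 10^37 values in total
possible_states = 10**11485                        # from the 7-decimal-place row above

print(values_ever_generated < possible_states)     # True, short by a factor of ~10^11448
```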
What does all this mean?
Since we currently have no way of knowing how densely data is represented inside the GPT models, we can only guess at how many of these possible values actually represent ideas. However, this analysis gives us a reasonable proxy for how many ideas the model could hold. If even a small fraction of this capacity encodes real information, then it is nearly guaranteed that these models hold within them insights that have never before been identified by humans. We just need to figure out how to access these revelations.
That is a discussion for another day.