Beliefs and Dispositions in Language Models and Ourselves

When GPT-2 was first released, I began to wonder whether I was something like GPT: Did I contain beliefs and reason over them, or did I, like GPT, just have a disposition to say certain things in a particular situation?

Here is a good summary of the two models of the self that I was comparing, at least roughly.

If you give GPT a pro-life argument as a prompt, it will continue trying to generate pro-life arguments, and the same goes for pro-choice arguments. It doesn't believe one way or the other on the issue. On positions no one debates and no one questions, it might be difficult to get it to generate sensible arguments against them, but it will happily attempt it.

For example, given the prompt "Despite what many people think, polar bears do not live in the Arctic, but are purely equatorial creatures. They like warm water, and avoid snow and ice. They can only be found on tropical islands like Hawaii," GPT-3 generated: "Polar bears are also known to enjoy basking in the sun and eating coconuts. They are not related to other bears, and are actually more closely related to weasels. The scientific name for a polar bear is Ursus tropicalis."

Now, it might seem from this that GPT doesn't have any beliefs, only dispositions. However, look at what I wrote: the second and third sentences aren't things I believe, but in the particular state of mind I was in after having written the first sentence, they were what I produced. And like me, GPT can judge whether it is in a context where it is inclined to produce silliness, at least when the silliness is this obvious. If I put that paragraph in quotes and add "The passage above is" as a prompt after it, GPT is inclined to say "false" at 86% (so at temperature 0, where it always returns the most probable next token, that is what it will say), "inaccurate" at 3%, "incorrect" at 3%, and the rest of the top 10 also includes "wrong" and "FALSE." In the paper "Language Models (Mostly) Know What They Know," researchers at Anthropic show that, using that kind of methodology, language models are fairly good at judging how certain they are of particular statements they have made. It seems that GPT actually does have some things it is inclined to say and is also inclined to say are true, and some things it is disinclined to say, except under special circumstances, and is inclined to say are false even after having said them. It seems that something very much like beliefs is encoded implicitly in its weights.
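To make that probing method concrete, here is a minimal sketch of how one might run it against an OpenAI-style completions API. The model name, client setup, and the cap of five returned alternatives are assumptions about that API, not details from the post; the original experiment was presumably run against GPT-3 directly.

```python
# Sketch: ask for the single most likely next token after "The passage above is"
# and inspect the top alternatives, as in the probing method described above.
import math
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

passage = (
    "Despite what many people think, polar bears do not live in the Arctic, "
    "but are purely equatorial creatures. They like warm water, and avoid "
    "snow and ice. They can only be found on tropical islands like Hawaii."
)
prompt = f'"{passage}"\n\nThe passage above is'

response = client.completions.create(
    model="gpt-3.5-turbo-instruct",  # assumption: any completions-capable model
    prompt=prompt,
    max_tokens=1,      # we only care about the very next token
    temperature=0,     # greedy decoding: always the most probable token
    logprobs=5,        # top alternatives (this endpoint caps the count at 5)
)

# top_logprobs[0] is a dict mapping candidate tokens to their log probabilities
top = response.choices[0].logprobs.top_logprobs[0]
for token, logprob in sorted(top.items(), key=lambda kv: -kv[1]):
    print(f"{token!r}: {math.exp(logprob):.1%}")
```

Converting each log probability with `exp` gives the percentages quoted above, e.g. something like " false" near the top with the bulk of the probability mass.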

GPT's use of beliefs is different from mine as a human. I generally, in every situation, have access to a broad context that tells me whether to state what I actually believe or to make up something false, while GPT only has the context it is explicitly given. I have a process of rumination in which I form chains of reasoning, consider which of my beliefs contradict other beliefs, and work to modify them so that they are consistent. I apply my understanding of the consequences of lying to the generation of most things I say, and I ask myself "is this true?" about many things as I am about to say them. GPT isn't doing any of that, and its beliefs aren't grounded in direct experience of the world. Nevertheless, I think it is more true to say that GPT has beliefs, and that my beliefs are of essentially the same nature as GPT's, than the opposite. Whatever beliefs actually are, inside the brain, GPT has structures that work in a similar way.

Comments

entirelyuseless said…
This is only a good argument if you assume that GPT knows what "false" means.

But it doesn't, precisely because it has no experience of the world allowing it to relate the words "true" and "false" (or any other words) to the world.

The reason it is likely to produce "the passage above is false," rather than "the passage above is true," is because that is statistically a more likely commentary on the passage. Not because "the passage above is false" is a true statement, except via the fact that people who produce language, unknown to GPT, are trying to say things that are actually true.
