Just a note for reference, because I find myself explaining this repeatedly.

When you listen to a conversation between two people, at any given point you may have your own idea of what you'd say in the participants' shoes. But you are usually not particularly surprised by anything that they actually say. You may well be surprised at the level of content, if at least one of the participants thinks differently from you, but you'll almost never be surprised at the level of response-relation. That is, if the two participants are human, and A says "Do you like the pizza?", you will hardly ever find a B that responds with "The last oil change was way more expensive than usual". This response simply doesn't make sense, and a human would recognise that immediately. Machines, however, usually don't. Recognising whether or not a conversation makes any sense at all is what I'll call "the problem of the third-person view".
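To make the distinction concrete, here is a deliberately crude sketch of that third-person judgement. Everything in it (the topic labels, the utterances beyond the pizza/oil-change pair from above) is invented for illustration; real coherence obviously can't be captured by a lookup table.

```python
# Toy third-person coherence check: the observer doesn't ask whether
# they agree with the response, only whether it relates to the
# utterance at all. Topic labels are invented for this example.
TOPICS = {
    "Do you like the pizza?": "food",
    "It's a bit too salty for me.": "food",
    "The last oil change was way more expensive than usual.": "cars",
}

def coherent(utterance: str, response: str) -> bool:
    """A response is coherent here iff it stays on the same topic."""
    return TOPICS.get(utterance) == TOPICS.get(response)

print(coherent("Do you like the pizza?",
               "It's a bit too salty for me."))        # related: coherent
print(coherent("Do you like the pizza?",
               "The last oil change was way more expensive than usual."))
```

The point of the toy is only that coherence is a property of the *pair* (utterance, response), not of either side alone, which is exactly what a third-person observer evaluates.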

In the third-person view, many responses are coherent at any given point in the conversation. In contrast, when somebody says "Hello" to you, you come up with one specific response, such as "Hello there!". You don't think "Well, I could now say 'Hello there!', or just 'Hello', or maybe 'Hi', and I'll make a random choice between those appropriate responses… 'Hello there!'". In any given situation you take one particular action, and you can think of yourself as a state-action mapping. Since you'll never be in the same state twice (you learn, your environment changes, you remember your own actions, etc.), for a human this map is very complicated, but it is usually coherent in the sense described above. When we build an assistant, we usually want it not just to give coherent responses (like GPT-3), but to give particular responses. I'll call this "the problem of the first-person view".
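The two views above can be contrasted in a few lines of code. This is a sketch under stated assumptions, not a claim about how any real model works: the response lists and the policy table are made up, and a real agent's state would include its whole history, not a single utterance.

```python
import random

# Third-person view: at any point, a whole set of responses is
# coherent, and a generative model may sample any of them.
COHERENT_RESPONSES = {"Hello": ["Hello there!", "Hello", "Hi"]}

def third_person_model(state: str) -> str:
    """Any member of the coherent set is an acceptable output."""
    return random.choice(COHERENT_RESPONSES[state])

# First-person view: the agent is a state-action mapping, so each
# state yields one particular action. The mapping is invented here.
POLICY = {"Hello": "Hello there!"}

def first_person_agent(state: str) -> str:
    """One state, one action: the agent never 'chooses at random'."""
    return POLICY[state]

print(first_person_agent("Hello"))   # always the same particular action
print(third_person_model("Hello"))   # any coherent response
```

The design point is that the first-person agent is a deterministic function of state, whereas the third-person view only constrains its output to lie within the coherent set.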