Why AI language models choke on too much text
This means that the total computing power required for attention grows quadratically with the total number of tokens. Suppose a 10-token prompt requires 414,720 attention operations. Then processing a 100-token prompt will require about 41.5 million attention operations: 100 times as many, even though the prompt is only 10 times longer.
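The quadratic scaling above can be sketched in a few lines of Python. This is a back-of-the-envelope model, not a real profiler: it takes the article's 414,720-operation figure for a 10-token prompt as a baseline and scales it by the square of the token count.

```python
BASE_TOKENS = 10
BASE_OPS = 414_720  # the article's figure for a 10-token prompt


def attention_ops(num_tokens: int) -> int:
    """Estimate attention operations for a prompt of num_tokens.

    Attention compares every token with every other token, so the
    operation count grows with the square of the prompt length.
    """
    return BASE_OPS * num_tokens**2 // BASE_TOKENS**2


for n in (10, 100, 1000):
    print(f"{n:>5} tokens -> {attention_ops(n):,} operations")
```

Doubling the prompt length quadruples the estimated work; a 10x longer prompt costs 100x as much, which is why long contexts become expensive so quickly.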