Linear Transformers Are Secretly Fast Weight Memory Systems

We show the formal equivalence of linearised self-attention mechanisms and fast weight memories from the early ’90s. From this observation we infer a memory capacity limitation of recent linearised softmax attention variants...

With finite memory, a desirable behaviour of fast weight memory models is to manipulate the contents of memory and dynamically interact with it. Inspired by previous work on fast weights, we propose to replace the update rule with an alternative rule yielding such behaviour. We also propose a new kernel function to linearise attention, balancing simplicity and effectiveness. We conduct experiments on synthetic retrieval

 

 

To finish reading, please visit source site