MobileBERT: BERT for Resource-Limited Devices
For a second, let’s focus solely on the teacher. If we continue along the path past the MHA block, things remain the same as in a vanilla transformer block until we reach the second “Add & Norm” operation. After this layer, we have a bottleneck transform, this time reducing the dimension back to that of the block input. This allows us to perform another Add & Norm operation with the transformer block input before feeding the result into the next block.
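To make this structure concrete, here is a minimal PyTorch sketch of such a teacher block. The class name, dimensions, and head count below are illustrative assumptions rather than the exact values from the MobileBERT paper; the point is the pattern described above: widen at the block input, run attention and the FFN at the wider width, then project back down so the final Add & Norm with the block input lines up.

```python
import torch
import torch.nn as nn

class BottleneckTransformerBlock(nn.Module):
    """Illustrative teacher block: operate at a wider intra-block width,
    then a linear bottleneck projects back to the inter-block (input)
    width so a final Add & Norm with the block input is possible.
    Dimensions and head count are assumptions for the sketch."""

    def __init__(self, inter_dim=512, intra_dim=1024, num_heads=4, ffn_dim=4096):
        super().__init__()
        # Input bottleneck: widen from the inter-block size to the intra-block size.
        self.proj_in = nn.Linear(inter_dim, intra_dim)
        self.attn = nn.MultiheadAttention(intra_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(intra_dim)
        self.ffn = nn.Sequential(
            nn.Linear(intra_dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, intra_dim)
        )
        self.norm2 = nn.LayerNorm(intra_dim)
        # Output bottleneck: reduce back to the dimension of the block input.
        self.proj_out = nn.Linear(intra_dim, inter_dim)
        self.norm3 = nn.LayerNorm(inter_dim)

    def forward(self, x):                       # x: (batch, seq, inter_dim)
        h = self.proj_in(x)                     # widen
        a, _ = self.attn(h, h, h)
        h = self.norm1(h + a)                   # first Add & Norm
        h = self.norm2(h + self.ffn(h))         # second Add & Norm
        out = self.proj_out(h)                  # bottleneck back to input width
        return self.norm3(out + x)              # final Add & Norm with the block input

# Quick shape check with dummy data.
block = BottleneckTransformerBlock()
y = block(torch.randn(2, 8, 512))               # -> (2, 8, 512)
```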
Stacked FFN
Let’s now turn our attention to (c), the student, in the figure above. The same analysis as above holds