MobileBERT: BERT for Resource-Limited Devices
For a second, let’s focus solely on the teacher. If we continue along the path past the MHA block, things remain the same as in a vanilla transformer block until we reach the second “Add & Norm” operation. After this layer, we have a bottleneck transform, this time reducing the dimension back to that of the block input. This allows us to perform another Add & Norm operation with the transformer block input before feeding the result into the next block.
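To make this structure concrete, here is a minimal PyTorch sketch of such a teacher block. The class name, dimensions, and head count below are illustrative assumptions rather than the exact values from the MobileBERT paper; the point is the pattern described above: widen at the block input, run attention and the FFN at the wider width, then project back down so the final Add & Norm with the block input lines up.

```python
import torch
import torch.nn as nn

class BottleneckTransformerBlock(nn.Module):
    """Illustrative teacher block: operate at a wider intra-block width,
    then a linear bottleneck projects back to the inter-block (input)
    width so a final Add & Norm with the block input is possible.
    Dimensions and head count are assumptions for the sketch."""

    def __init__(self, inter_dim=512, intra_dim=1024, num_heads=4, ffn_dim=4096):
        super().__init__()
        # Input bottleneck: widen from the inter-block size to the intra-block size.
        self.proj_in = nn.Linear(inter_dim, intra_dim)
        self.attn = nn.MultiheadAttention(intra_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(intra_dim)
        self.ffn = nn.Sequential(
            nn.Linear(intra_dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, intra_dim)
        )
        self.norm2 = nn.LayerNorm(intra_dim)
        # Output bottleneck: reduce back to the dimension of the block input.
        self.proj_out = nn.Linear(intra_dim, inter_dim)
        self.norm3 = nn.LayerNorm(inter_dim)

    def forward(self, x):                       # x: (batch, seq, inter_dim)
        h = self.proj_in(x)                     # widen
        a, _ = self.attn(h, h, h)
        h = self.norm1(h + a)                   # first Add & Norm
        h = self.norm2(h + self.ffn(h))         # second Add & Norm
        out = self.proj_out(h)                  # bottleneck back to input width
        return self.norm3(out + x)              # final Add & Norm with the block input

# Quick shape check with dummy data.
block = BottleneckTransformerBlock()
y = block(torch.randn(2, 8, 512))               # -> (2, 8, 512)
```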
Stacked FFN
Let’s now turn our attention to (c), the student, in the figure above. The same analysis as above holds