DeLighT: Very Deep and Light-weight Transformers
This repository contains the source code of our work on building efficient sequence models: DeFINE (ICLR’20) and DeLighT (preprint).
Overview
In this repository, we share the source code of our paper DeLighT, which delivers similar or better performance than
transformer-based models with significantly fewer parameters. DeLighT allocates parameters more efficiently both (1)
within each Transformer block, using DExTra, a deep and light-weight transformation, and (2) across blocks, using
block-wise scaling, which allows for shallower and narrower DeLighT blocks near the input and deeper and wider
DeLighT blocks near the output. Overall, DeLighT networks are 2.5 to 4 times deeper than standard transformer models
and yet have fewer parameters and operations. For details, see our papers: DeFINE and DeLighT.
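
To illustrate the block-wise scaling idea described above, here is a minimal sketch (not the repository's implementation) of how depth could be allocated across blocks: blocks near the input get fewer DExTra layers and blocks near the output get more. The function name, the `n_min`/`n_max` parameters, and the linear interpolation rule are assumptions for illustration only.

```python
# Hypothetical sketch of block-wise scaling: linearly interpolate the number
# of DExTra layers from n_min (near the input) to n_max (near the output).
# This is an illustration of the concept, not code from this repository.

def blockwise_depths(num_blocks: int, n_min: int = 4, n_max: int = 8) -> list:
    """Return the assumed number of DExTra layers for each of the B blocks."""
    if num_blocks == 1:
        return [n_max]
    return [
        round(n_min + (n_max - n_min) * b / (num_blocks - 1))
        for b in range(num_blocks)
    ]


if __name__ == "__main__":
    # e.g., 6 blocks scaling from 4 layers near the input to 8 near the output
    print(blockwise_depths(6, n_min=4, n_max=8))  # [4, 5, 6, 6, 7, 8]
```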