ULMFiT for Genomic Sequence Data

This is an implementation of ULMFiT for genomics classification using Pytorch and Fastai. The model architecture used is based on the AWD-LSTM model, consisting of an embedding, three LSTM layers, and a final set of linear layers.
The ULMFiT approach uses three training phases to produce a classification model:
- Train a language model on a large, unlabeled corpus
- Fine tune the language model on the classification corpus
- Use the fine tuned language model to initialize a classification model
This method is particularly advantageous for genomic data, where large amounts of unlabeled data is abundant and labeled data is scarce. The ULMFiT approach allows us to train a model on a large, unlabeled genomic corpus in an unsupervised fashion. The pre-trained language model serves as a