Large-scale Near-deduplication Behind BigCode
This post is for people who are interested in document-level near-deduplication at a large scale and have some understanding of hashing, graphs, and text processing.

Motivations

It is important to take care of our data before feeding it to the model, at least Large Language...