Measuring dataset similarity using optimal transport
Is FashionMNIST, a dataset of images of clothing items labeled by category, more similar to MNIST or to USPS, both of which are classification datasets of handwritten digits? This is a hard question to answer, but an answer could affect several aspects of machine learning. For example, it could inform how practitioners augment a dataset to improve model transfer across domains, or how they select a dataset to pretrain on, especially when labeled data from the target domain is scarce.
In our recent paper, “Geometric Dataset Distances via Optimal Transport,” we propose the Optimal Transport Dataset Distance, or the OTDD for short, an approach to defining and computing similarities, or distances, between classification datasets. The OTDD relies on optimal transport (OT), a flexible geometric method for comparing probability distributions, and can be used to compare any two datasets, regardless of whether their label sets are directly comparable. As a bonus, the OTDD returns a coupling of the two datasets being compared, which can be understood as a set of soft correspondences between individual items in the datasets. Correspondences can be