How to Prepare a French-to-English Dataset for Machine Translation
Last Updated on April 30, 2020
Machine translation is the challenging task of converting text from a source language into coherent and matching text in a target language.
Neural machine translation systems, such as encoder-decoder recurrent neural networks, are achieving state-of-the-art results for machine translation with a single end-to-end system trained directly on pairs of source and target text.
Standard datasets are needed to develop, explore, and become familiar with how to build neural machine translation systems.
In this tutorial, you will discover the Europarl standard machine translation dataset and how to prepare the data for modeling.
After completing this tutorial, you will know:
- The Europarl dataset comprises the proceedings of the European Parliament in 11 languages.
- How to load and clean the parallel French and English transcripts so they are ready for modeling in a neural machine translation system.
- How to reduce the vocabulary size of both the French and English data in order to reduce the complexity of the translation task (a minimal sketch of both steps follows this list).
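As a quick preview of the kind of preparation covered later, the sketch below shows one possible way to load, clean, and reduce the vocabulary of the English side of the corpus. The file name, the minimum word count, and the helper names (load_doc, clean_lines, reduce_vocab) are assumptions for illustration, not the exact code developed in the tutorial.

```python
import re
import string
from unicodedata import normalize
from collections import Counter

# load a Europarl file into memory (file name is an assumption; adjust to your download)
def load_doc(filename):
	with open(filename, mode='rt', encoding='utf-8') as file:
		return file.read()

# split the loaded document into sentences (one per line in the Europarl release)
def to_sentences(doc):
	return doc.strip().split('\n')

# clean each line: normalize unicode to ASCII, lowercase, strip punctuation and non-printable chars
def clean_lines(lines):
	cleaned = list()
	re_print = re.compile('[^%s]' % re.escape(string.printable))
	table = str.maketrans('', '', string.punctuation)
	for line in lines:
		line = normalize('NFD', line).encode('ascii', 'ignore').decode('UTF-8')
		tokens = line.split()
		tokens = [word.lower() for word in tokens]
		tokens = [word.translate(table) for word in tokens]
		tokens = [re_print.sub('', w) for w in tokens]
		tokens = [word for word in tokens if word.isalpha()]
		cleaned.append(' '.join(tokens))
	return cleaned

# reduce the vocabulary: keep words seen at least min_occurrence times, map the rest to 'unk'
def reduce_vocab(lines, min_occurrence=5):
	vocab = Counter(word for line in lines for word in line.split())
	keep = set(w for w, c in vocab.items() if c >= min_occurrence)
	return [' '.join(w if w in keep else 'unk' for w in line.split()) for line in lines]

# example usage (file name assumes the Europarl v7 French-English release)
english = clean_lines(to_sentences(load_doc('europarl-v7.fr-en.en')))
english = reduce_vocab(english, min_occurrence=5)
```

The same cleaning and vocabulary reduction would be applied to the French side of the corpus so that both halves of each sentence pair are prepared consistently.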
Kick-start your project with my new book Deep Learning for Natural Language Processing, including step-by-step tutorials and the Python source code files for all examples.
Let’s get started.