Neural Machine Translation

GradInit: Learning to Initialize Neural Networks for Stable and Efficient Training

Changes in neural architectures have fostered significant breakthroughs in language modeling and computer vision. Unfortunately, novel architectures often require re-thinking the choice of hyperparameters (e.g., learning rate, warmup schedule, and momentum coefficients) to maintain the stability of the optimizer… This optimizer instability is often the result of poor parameter initialization, and can be avoided by architecture-specific initialization schemes. In this paper, we present GradInit, an automated and architecture-agnostic method for initializing neural networks. GradInit is based on a simple heuristic; […]
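To make the idea concrete, here is a toy PyTorch sketch of such an initialization recipe, not the authors' implementation: each parameter tensor's initial scale is tuned so that a single SGD step on one batch lowers the loss as much as possible. The coordinate search over a few candidate scales, the helper names, and the omission of the paper's gradient-norm constraint are all simplifications of mine.

```python
import copy
import torch
import torch.nn as nn

def loss_after_one_step(model, batch, targets, lr):
    # Loss on the same batch after a single plain SGD step, taken on a copy.
    m = copy.deepcopy(model)
    loss = nn.functional.cross_entropy(m(batch), targets)
    grads = torch.autograd.grad(loss, list(m.parameters()))
    with torch.no_grad():
        for p, g in zip(m.parameters(), grads):
            p -= lr * g
        return nn.functional.cross_entropy(m(batch), targets).item()

def rescaled_loss(model, p, c, batch, targets, lr):
    # Temporarily multiply one parameter tensor by c and measure the effect.
    with torch.no_grad():
        p *= c
    val = loss_after_one_step(model, batch, targets, lr)
    with torch.no_grad():
        p /= c
    return val

def gradinit_like(model, batch, targets, lr=0.1,
                  candidates=(0.5, 1.0, 2.0), sweeps=3):
    # Coordinate search: for each tensor, keep the scale whose post-step loss
    # is lowest; a few sweeps let the scales adapt to each other.
    for _ in range(sweeps):
        for p in model.parameters():
            best = min(candidates,
                       key=lambda c: rescaled_loss(model, p, c,
                                                   batch, targets, lr))
            with torch.no_grad():
                p *= best

# Usage on a throwaway MLP and one synthetic batch.
torch.manual_seed(0)
net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
x, y = torch.randn(32, 16), torch.randint(0, 4, (32,))
gradinit_like(net, x, y)
```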

Read more

Sparsely Factored Neural Machine Translation

The standard approach to incorporating linguistic information into neural machine translation systems consists of maintaining separate vocabularies for each of the annotated features (e.g., POS tags, dependency relation labels), embedding them, and then aggregating them with the embedding of each subword of the word they belong to. This approach, however, cannot easily accommodate annotation schemes that are not dense for every word… We propose a method suited for such cases, showing large improvements on out-of-domain data and comparable quality […]
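For concreteness, here is a minimal PyTorch sketch of the standard dense-factor setup described above (all names are illustrative, not from the paper's code): one embedding table per annotated feature, with the feature embeddings summed into each subword embedding.

```python
import torch
import torch.nn as nn

class FactoredEmbedding(nn.Module):
    def __init__(self, subword_vocab, feature_vocabs, dim):
        super().__init__()
        self.subwords = nn.Embedding(subword_vocab, dim)
        # One embedding table per linguistic feature (e.g. POS, dep. label).
        self.features = nn.ModuleList(
            nn.Embedding(v, dim) for v in feature_vocabs)

    def forward(self, subword_ids, feature_ids):
        # subword_ids: (batch, seq); feature_ids: (batch, seq, n_features)
        out = self.subwords(subword_ids)
        for i, table in enumerate(self.features):
            out = out + table(feature_ids[..., i])
        return out

# Usage: two features (POS tags, dependency labels) on a batch of 3 sentences.
emb = FactoredEmbedding(subword_vocab=8000, feature_vocabs=[20, 40], dim=512)
sw = torch.randint(0, 8000, (3, 10))
feats = torch.stack([torch.randint(0, 20, (3, 10)),
                     torch.randint(0, 40, (3, 10))], dim=-1)
print(emb(sw, feats).shape)  # torch.Size([3, 10, 512])
```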

Read more

Linear Transformers Are Secretly Fast Weight Memory Systems

We show the formal equivalence of linearised self-attention mechanisms and fast weight memories from the early ’90s. From this observation we infer a memory capacity limitation of recent linearised softmax attention variants… With finite memory, a desirable behaviour of fast weight memory models is to manipulate the contents of memory and dynamically interact with it. Inspired by previous work on fast weights, we propose to replace the update rule with an alternative rule yielding such behaviour. We also propose a […]
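The claimed equivalence is easy to state in code. Below is a plain NumPy sketch under simplifying assumptions of my own (an elu+1 feature map, a fixed write strength beta): linearised attention accumulates a "fast weight" matrix of key-value outer products that queries read from, and a delta-style update rule, shown here in simplified form, first retrieves what a key currently stores and overwrites it rather than blindly adding to it.

```python
import numpy as np

def phi(x):
    # A positive feature map (elu + 1); any such map works for the sketch.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    W = np.zeros((K.shape[1], V.shape[1]))   # fast weight memory
    z = np.zeros(K.shape[1])                 # normaliser accumulator
    out = []
    for q, k, v in zip(Q, K, V):             # causal, step by step
        W += np.outer(phi(k), v)             # additive Hebbian write
        z += phi(k)
        out.append(phi(q) @ W / (phi(q) @ z + 1e-9))
    return np.stack(out)

def delta_rule_attention(Q, K, V, beta=0.5):
    W = np.zeros((K.shape[1], V.shape[1]))
    out = []
    for q, k, v in zip(Q, K, V):
        v_old = phi(k) @ W                       # what this key retrieves now
        W += beta * np.outer(phi(k), v - v_old)  # overwrite instead of add
        out.append(phi(q) @ W)
    return np.stack(out)

T, d = 6, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, T, d))
print(linear_attention(Q, K, V).shape, delta_rule_attention(Q, K, V).shape)
```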

Read more

Machine Translation Weekly 69: One-shot learning in MT

This week I will discuss a paper about the one-shot vocabulary learning abilities of machine translation models. The title of the paper is Continuous Learning in Neural Machine Translation using Bilingual Dictionaries, and it will be presented at EACL in May this year. A very similar idea is also presented in the paper Facilitating Terminology Translation with Target Lemma Annotations, which will be presented at the same conference. One-shot learning is the ability to learn from a single example. In the context […]

Read more

Machine Translation Weekly 68: Pre-editing of MT inputs

Today, I am going to comment on a paper that systematically explores something that probably many MT users do: pre-editing, i.e., editing the source sentence to get better output from an MT system that is treated as a black box. The title of the paper is Understanding Pre-Editing for Black-Box Neural Machine Translation; it is by authors from Nagoya University and NICT in Japan and will appear at this year’s EACL. Pre-editing is something I often do when I use automatic […]

Read more

Machine Translation Weekly 67: Where does the language neutrality of mBERT reside?

If someone had told me ten years ago, when I was a freshly graduated bachelor of computer science, that there would be models producing multilingual sentence representations that allow zero-shot model transfer, I would have hardly believed such a prediction. If they had added that the models would be total black boxes and we would not know why they worked, I would have thought they were insane. After all, one of the goals of the mathematization of stuff in science is to make […]

Read more

Machine Translation Weekly 66: Means against ends of sentences

This week I am going to revisit the mystery of decoding in neural machine translation one more time. It has been more than a year since Felix Stahlberg and Bill Byrne discovered a very disturbing feature of neural machine translation models – that the most probable target sentence is the empty sequence, and that it is a sort of luck that we decode good translations from the models (MT Weekly 20). The paper disproved the narrative of NMT […]
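To see why the empty sequence can win, consider a back-of-the-envelope calculation with made-up numbers (not from the paper): a sentence's probability is the product of its per-token probabilities, so every extra token multiplies in a factor below one, while the empty output costs only one EOS token.

```python
p_empty = 0.01                 # assumed P(EOS as the very first token)
avg_tok = 0.5                  # optimistic average per-token probability
for n in (5, 10, 20):
    p_sent = avg_tok ** (n + 1)          # n tokens plus the closing EOS
    winner = "<" if p_sent < p_empty else ">"
    print(f"len {n:2d}: P(sentence) = {p_sent:.1e} {winner} P(empty) = {p_empty}")
# Exact search therefore returns the empty string for realistic lengths;
# beam search avoids it largely thanks to its narrow beam.
```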

Read more

Machine Translation Weekly 65: Sequence-to-sequence models and substitution ciphers

Today, I am going to talk about a recent pre-print on sequence-to-sequence models for deciphering substitution ciphers. Doing such a thing had been somewhere at the bottom of my todo list for a few years; I suggested it as a thesis topic to several master's students, but no one wanted to do it, so I am glad that someone finally did the experiments. The title of the preprint is Can Sequence-to-Sequence Models Crack Substitution Ciphers? and the authors are from the […]
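As a quick illustration of the task itself (a toy example of mine, not from the paper): a substitution cipher replaces each letter with a fixed other letter, so decipherment can be framed as character-level sequence-to-sequence translation from ciphertext to plaintext.

```python
import random
import string

random.seed(0)
# A fixed random one-to-one mapping over the lowercase alphabet.
perm = dict(zip(string.ascii_lowercase,
                random.sample(string.ascii_lowercase, 26)))
encrypt = lambda s: "".join(perm.get(c, c) for c in s)

plain = "neural machine translation"
cipher = encrypt(plain)
print(cipher)   # the seq2seq model sees this ...
print(plain)    # ... and must recover this, without knowing `perm`
```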

Read more

Machine Translation Weekly 64: Non-autoregressive Models Strike Back

Half a year ago, I featured here (MT Weekly 45) a paper that questioned the contribution of non-autoregressive models to computational efficiency. It showed that a model with a deep encoder (which can be parallelized) and a shallow decoder (which works sequentially) reaches the same speed as NAR models with much better translation quality. A pre-print by Facebook AI and CMU published on New Year’s Eve, Fully Non-autoregressive Neural Machine Translation: Tricks of the Trade, presents a new fully non-autoregressive […]
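The speed argument is easiest to see schematically. The sketch below contrasts the two decoding regimes using a hypothetical model interface of my own, with a dummy stand-in so it runs: an autoregressive decoder needs one model call per output token, while a non-autoregressive decoder emits all positions in a single parallel call.

```python
import torch

@torch.no_grad()
def decode_autoregressive(model, src, max_len, bos=1, eos=2):
    out = [bos]
    for _ in range(max_len):                  # up to T sequential model calls
        logits = model(src, torch.tensor([out]))
        nxt = logits[0, -1].argmax().item()
        out.append(nxt)
        if nxt == eos:
            break
    return out[1:]

@torch.no_grad()
def decode_non_autoregressive(model, src, max_len):
    logits = model(src, length=max_len)       # one parallel model call
    return logits[0].argmax(dim=-1).tolist()  # all positions at once

# Dummy stand-in so the sketch runs: random logits over a 10-token vocabulary.
class Dummy(torch.nn.Module):
    def forward(self, src, tgt=None, length=None):
        n = length if length is not None else tgt.shape[1]
        return torch.randn(1, n, 10)

m, src = Dummy(), torch.zeros(1, 5, dtype=torch.long)
print(decode_autoregressive(m, src, max_len=8))
print(decode_non_autoregressive(m, src, max_len=8))
```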

Read more

Subformer: Exploring Weight Sharing for Parameter Efficiency in Generative Transformers

The advent of the Transformer can arguably be described as a driving force behind many of the recent advances in natural language processing. However, despite its sizeable performance improvements, the model has recently been shown to be severely over-parameterized, being parameter-inefficient and computationally expensive to train… Inspired by the success of parameter sharing in pretrained deep contextualized word representation encoders, we explore parameter-sharing methods in Transformers, with a specific focus on encoder-decoder models for sequence-to-sequence tasks such as neural machine translation. We […]
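As a rough illustration of what cross-layer parameter sharing buys (uniform ALBERT-style sharing here, standing in for the specific schemes the paper studies): reusing one encoder layer at every depth keeps the parameter count constant in depth.

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, d_model=512, nhead=8, depth=6):
        super().__init__()
        # A single layer's weights, reused at every depth.
        self.layer = nn.TransformerEncoderLayer(d_model, nhead,
                                                batch_first=True)
        self.depth = depth

    def forward(self, x):
        for _ in range(self.depth):
            x = self.layer(x)
        return x

shared = SharedEncoder()
unshared = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(512, 8, batch_first=True), num_layers=6)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(shared), "vs", count(unshared))  # roughly 6x fewer parameters
```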

Read more