Unsupervised Quality Estimation for Neural Machine Translation

August 31, 2020 By: Marina Fomicheva, Shuo Sun, Lisa Yankovskaya, Frédéric Blain, Francisco (Paco) Guzman, Mark Fishel, Nikolaos Aletras, Vishrav Chaudhary, Lucia Specia Abstract Quality Estimation (QE) is an important component in making Machine Translation (MT) useful in real-world applications, as it aims to inform the user about the quality of the MT output at test time. Existing approaches require large amounts of expert-annotated data, computation and time for training. As an alternative, we devise an unsupervised approach […]

Read more

Lemotif: An Affective Visual Journal Using Deep Neural Networks

Abstract We present Lemotif, an integrated natural language processing and image generation system that uses machine learning to (1) parse a text-based input journal entry describing the user’s day for salient themes and emotions and (2) visualize the detected themes and emotions in creative and appealing image motifs. Synthesizing approaches from artificial intelligence and psychology, Lemotif acts as an affective visual journal, encouraging users to regularly write and reflect on their daily experiences through visual reinforcement. By making patterns in […]

Read more

Weak-Attention Suppression For Transformer Based Speech Recognition

Abstract Transformers, originally proposed for natural language processing (NLP) tasks, have recently achieved great success in automatic speech recognition (ASR). However, adjacent acoustic units (i.e., frames) are highly correlated, and long-distance dependencies between them are weak, unlike text units. This suggests that ASR will likely benefit from sparse and localized attention. In this paper, we propose Weak-Attention Suppression (WAS), a method that dynamically induces sparsity in attention probabilities. We demonstrate that WAS leads to consistent Word Error Rate (WER) improvement […]
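The excerpt does not spell out the suppression rule, so the following is only a toy NumPy illustration of the general idea: attention probabilities below an assumed per-row threshold of mean minus gamma times standard deviation are zeroed out and the rows renormalised. The function name, the threshold form and the gamma value are assumptions, not the paper's formulation.

# Toy sketch (assumed threshold, not the paper's exact formulation): zero out
# "weak" attention probabilities and renormalise each row to sum to one.
import numpy as np

def weak_attention_suppression(attn, gamma=0.5):
    # attn: (query_len, key_len) attention probabilities produced by a softmax
    threshold = attn.mean(axis=-1, keepdims=True) - gamma * attn.std(axis=-1, keepdims=True)
    suppressed = np.where(attn < threshold, 0.0, attn)
    return suppressed / suppressed.sum(axis=-1, keepdims=True)

attn = np.array([[0.02, 0.05, 0.63, 0.30],
                 [0.25, 0.25, 0.25, 0.25]])
print(weak_attention_suppression(attn))  # first row becomes sparser, second is unchanged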

Read more

How to Develop LARS Regression Models in Python

Regression is a modeling task that involves predicting a numeric value given an input. Linear regression is the standard algorithm for regression that assumes a linear relationship between inputs and the target variable. An extension to linear regression involves adding penalties to the loss function during training that encourage simpler models that have smaller coefficient values. These extensions are referred to as regularized linear regression or penalized linear regression. Lasso Regression is a popular type of regularized linear regression that […]
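As a hedged illustration of the kind of model the article develops, here is a minimal sketch using scikit-learn's LassoLars, which fits the Lasso penalty with the LARS algorithm; the synthetic dataset, alpha value and evaluation setup are assumptions rather than the article's own code.

# Minimal sketch: LARS-based Lasso regression on synthetic data with scikit-learn.
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLars
from sklearn.model_selection import cross_val_score

# Synthetic data with a known linear structure.
X, y = make_regression(n_samples=200, n_features=20, noise=0.5, random_state=1)

# alpha controls the strength of the L1 penalty that shrinks coefficients.
model = LassoLars(alpha=0.01)
scores = cross_val_score(model, X, y, scoring="neg_mean_absolute_error", cv=5)
print("MAE: %.3f" % -scores.mean())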

Read more

Machine Translation Weekly 55: Social Polarization Seen through Word Embeddings

This week, I am going to have a closer look at a paper that creatively uses methods for bilingual word embeddings for social media analysis. The paper’s preprint was uploaded last week on arXiv. The title is “We Don’t Speak the Same Language: Interpreting Polarization through Machine Translation,” and most of the authors are from CMU in Pittsburgh. The paper’s central assumption is that the polarization of different opinion groups, especially in the USA, went so far that some words have totally […]

Read more

Generating Synthetic Data with Numpy and Scikit-Learn

Introduction In this tutorial, we’ll discuss the details of generating different synthetic datasets using Numpy and Scikit-learn libraries. We’ll see how different samples can be generated from various distributions with known parameters. We’ll also discuss generating datasets for different purposes, such as regression, classification, and clustering. At the end we’ll see how we can generate a dataset that mimics the distribution of an existing dataset. The Need for Synthetic Data In data science, synthetic data plays a very important role. […]
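A minimal sketch of the kinds of generators the tutorial covers, assuming NumPy's default_rng together with scikit-learn's make_regression, make_classification and make_blobs; the particular parameters are illustrative, not the tutorial's own.

# Minimal sketch: samples from a known distribution plus task-specific datasets.
import numpy as np
from sklearn.datasets import make_regression, make_classification, make_blobs

rng = np.random.default_rng(seed=42)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=1000)  # known parameters

X_reg, y_reg = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)
X_clf, y_clf = make_classification(n_samples=100, n_features=5, n_classes=2, random_state=42)
X_clu, y_clu = make_blobs(n_samples=100, centers=3, n_features=2, random_state=42)

print(normal_sample.mean(), X_reg.shape, X_clf.shape, X_clu.shape)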

Read more

Python: Get Number of Elements in a List

Introduction Getting the number of elements in a list in Python is a common operation. For example, you will need to know how many elements the list has whenever you iterate through it. Remember that lists can have a combination of integers, floats, strings, booleans, other lists, etc. as their elements: # List of just integers list_a = [12, 5, 91, 18] # List of integers, floats, strings, booleans list_b = [4, 1.2, "hello world", True] If we count the […]
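For illustration, here is a short sketch (not the article's exact code) showing that len() returns the number of top-level elements regardless of their types, and that a nested list counts as a single element.

# len() counts top-level elements, whatever their types.
list_a = [12, 5, 91, 18]
list_b = [4, 1.2, "hello world", True]
print(len(list_a))  # 4
print(len(list_b))  # 4

# A nested list is one element of the outer list.
list_c = [1, [2, 3], "four"]
print(len(list_c))  # 3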

Read more

Issue #103 – LEGAL-BERT: The Muppets straight out of Law School

16 Oct 2020 Author: Akshai Ramesh, Machine Translation Scientist @ Iconic Introduction BERT (Bidirectional Encoder Representations from Transformers) is a large-scale pre-trained autoencoding language model that has made a substantial contribution to natural language processing (NLP) and has been studied as a potentially promising way to further improve neural machine translation (NMT). “Given that BERT is based on a similar approach to neural MT in Transformers, there’s considerable interest and […]

Read more

Nearest Shrunken Centroids With Python

Nearest Centroids is a linear classification machine learning algorithm. It involves predicting a class label for new examples based on which class-based centroid the example is closest to from the training dataset. The Nearest Shrunken Centroids algorithm is an extension that involves shifting class-based centroids toward the centroid of the entire training dataset and removing those input variables that are less useful at discriminating the classes. As such, the Nearest Shrunken Centroids algorithm performs an automatic form of feature selection, […]
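A minimal sketch of the algorithm the article describes, assuming scikit-learn's NearestCentroid with its shrink_threshold parameter; the synthetic dataset and threshold value are illustrative choices, not the article's.

# Minimal sketch: Nearest Shrunken Centroids via scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import NearestCentroid

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=7)

# shrink_threshold moves each class centroid toward the overall data centroid;
# features whose shrunken components hit zero stop influencing predictions.
model = NearestCentroid(shrink_threshold=0.5)
scores = cross_val_score(model, X, y, cv=5)
print("Accuracy: %.3f" % scores.mean())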

Read more

Quick Guide: Steps To Perform Text Data Cleaning in Python

Introduction Twitter has become an indispensable channel for brand management. It has compelled brands to become more responsive to their customers. At the same time, the damage it can cause is hard to undo. The 140-character tweet has become a powerful tool for customers and users to convey messages directly to brands. For companies, these tweets carry a lot of information, such as sentiment, engagement, reviews and feedback on their products. However, mining these tweets isn’t easy. Why? Because, before you mine this data, you need […]
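As a hedged sketch of the kind of preprocessing the guide walks through, here is a small standard-library example that lowercases a tweet and strips URLs, mentions, hashtags and punctuation; the helper name and the exact steps are assumptions, not the guide's pipeline.

# Minimal sketch: basic tweet cleaning with the standard library only.
import re
import string

def clean_tweet(text):
    text = text.lower()                                   # normalise case
    text = re.sub(r"https?://\S+", "", text)              # drop URLs
    text = re.sub(r"[@#]\w+", "", text)                   # drop mentions and hashtags
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    return re.sub(r"\s+", " ", text).strip()              # collapse whitespace

print(clean_tweet("Loving the new phone from @BrandX!! #happy http://example.com"))
# -> "loving the new phone from"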

Read more