Multilingualism in Natural Language Processing: Targeting Low Resource Indian Languages

Introduction A language is a systematic form of communication that can take a variety of forms. Approximately 7,000 languages are believed to be spoken across the globe. Despite this diversity, the majority of the world’s population speaks only a fraction of these languages. In spite of such rich diversity, languages are still evolving over time, much like the society we live in. While the English language is uniform, having the distinct status of being the official language of […]

Machine Translation Weekly 63: Maximum Aposteriori vs. Minimum Bayes Risk decoding

This week I will have a look at the best paper from this year’s COLING that brings an interesting view on inference in NMT models. The title of the paper is “Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation” and its authors are from the University of Amsterdam. NMT models learn the conditional probability of the next word in a target sentence given the source sentence and the previous words in the target […]

What Is Meta-Learning in Machine Learning?

Meta-learning in machine learning refers to learning algorithms that learn from other learning algorithms. Most commonly, this means the use of machine learning algorithms that learn how to best combine the predictions from other machine learning algorithms in the field of ensemble learning. Nevertheless, meta-learning might also refer to the manual process of model selection and algorithm tuning performed by a practitioner on a machine learning project, which modern AutoML algorithms seek to automate. It also refers to learning across […]

Spelling Correction in Python with TextBlob

Introduction Spelling mistakes are common, and most people are used to software indicating if a mistake was made. From autocorrect on our phones, to red underlining in text editors, spell checking is an essential feature for many different products. The first program to implement spell checking was written in 1971 for the DEC PDP-10. Called SPELL, it was capable of performing only simple comparisons of words and detecting one or two letter differences. As hardware and software advanced, so have […]

Jump Search in Python

Introduction Finding the data we need is an age-old problem that predates computers. As developers, we create many search algorithms to retrieve data efficiently. Search algorithms can be divided into two broad categories: sequential and interval searches. Sequential searches check each element in a data structure. Interval searches check various points of the data (called intervals), reducing the time it takes to find an item, given a sorted dataset. In this article, you will cover Jump Search in Python – […]
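As a rough illustration of the interval-search idea described in this excerpt, here is a minimal sketch of jump search. It is not taken from the linked article: the function name `jump_search` and the choice of a block size near the square root of the list length are assumptions made for this example, and the input list is assumed to be sorted in ascending order.

```python
import math


def jump_search(items, target):
    """Return the index of target in the sorted list items, or -1 if absent."""
    n = len(items)
    if n == 0:
        return -1

    # Block size: the square root of n is a common choice (an assumption here).
    step = max(1, math.isqrt(n))
    start = 0

    # Jump forward one block at a time while the last element of the
    # current block is still smaller than the target.
    while start < n and items[min(start + step, n) - 1] < target:
        start += step

    # Linear scan inside the single block that could contain the target.
    for i in range(start, min(start + step, n)):
        if items[i] == target:
            return i
    return -1


if __name__ == "__main__":
    data = [1, 3, 5, 7, 9, 11, 13]
    print(jump_search(data, 9))
    print(jump_search(data, 4))
```

The sketch checks only the last element of each block before falling back to a sequential scan, so it inspects far fewer elements than a purely sequential search on a large sorted list.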

CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs

Abstract Cross-lingual document alignment aims to identify pairs of documents in two distinct languages that are of comparable content or translations of each other. In this paper, we exploit the signals embedded in URLs to label web documents at scale with an average precision of 94.5% across different language pairs. We mine sixty-eight snapshots of the Common Crawl corpus and identify web document pairs that are translations of each other. We release a new web dataset consisting of over 392 […]

Dense Passage Retrieval for Open-Domain Question Answering

November 16, 2020 By: Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih Abstract Open-domain question answering relies on efficient passage retrieval to select candidate contexts, where traditional sparse vector space models, such as TF-IDF or BM25, are the de facto method. In this work, we show that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder […]

Measuring the Similarity of Grammatical Gender Systems by Comparing Partitions

Abstract A grammatical gender system divides a lexicon into a small number of relatively fixed grammatical categories. How similar are these gender systems across languages? To quantify the similarity, we define gender systems extensionally, thereby reducing the problem of comparisons between languages’ gender systems to cluster evaluation. We borrow a rich inventory of statistical tools for cluster evaluation from the field of community detection (Driver and Kroeber, 1932; Cattell, 1945), that enable us to craft novel information-theoretic metrics for measuring […]

An Imitation Game for Learning Semantic Parsers from User Interaction

November 16, 2020 By: Ziyu Yao, Yiqi Tang, Wen-tau Yih, Huan Sun, Yu Su Abstract Despite the widely successful applications, building a semantic parser is still a tedious process in practice with challenges from costly data annotation and privacy risks. We suggest an alternative, human-in-the-loop methodology for learning semantic parsers directly from users. A semantic parser should be introspective of its uncertainties and prompt for user demonstrations when uncertain. In doing so it also gets to imitate the user behavior […]

Generating Fact Checking Briefs

Abstract Fact checking at scale is difficult—while the number of active fact checking websites is growing, it remains too small for the needs of the contemporary media ecosystem. However, despite good intentions, contributions from volunteers are often error-prone, and thus in practice restricted to claim detection. We investigate how to increase the accuracy and efficiency of fact checking by providing information about the claim before performing the check, in the form of natural language briefs. We investigate passage-based briefs, containing […]
