An Unsupervised method for OCR Post-Correction and Spelling Normalisation for Finnish

Historical corpora are known to contain errors introduced by the OCR (optical character recognition) methods used in the digitization process, and these errors are often said to degrade the performance of NLP systems. Correcting them manually is time-consuming, and a large share of the automatic approaches have relied on rules or supervised machine learning… We build on previous work on fully automatic unsupervised extraction of parallel data to train a character-based sequence-to-sequence NMT (neural machine translation) model to conduct OCR […]

Research Collection – Re-Inventing Storage for the Cloud Era

“One of the challenges for us as a company, and us as an industry, is that many of the technologies we rely on are beginning to get to the point where either they are at the end, or they’re starting to get to the point where you can see the end. Moore’s Law is a well-publicized one and we hit it some time ago. And that’s a great opportunity, because whenever you get that rollover, you get an opportunity to […]

Generating Command-Line Interfaces (CLI) with Fire in Python

Introduction A command-line interface (CLI) is a way to interact with computers using textual commands. Many tools that don’t require GUIs are written as CLI tools/utilities. Although Python has the built-in argparse module, other libraries with similar functionality exist. These libraries can help us write CLI scripts, providing services that range from parsing options and flags to much more advanced CLI functionality. This article discusses the Python Fire library, written by Google Inc., a useful tool to create […]
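
As a rough illustration of the option and flag parsing mentioned in this teaser, here is a minimal sketch that uses only the built-in argparse module. The `greet` function, the `name` argument and the `--shout` flag are hypothetical names chosen for the example; they do not come from the article.

```python
import argparse


def greet(name, shout=False):
    """Build the greeting; kept separate from parsing so it is easy to test."""
    message = f"Hello, {name}!"
    return message.upper() if shout else message


def build_parser():
    """Declare one positional argument and one optional boolean flag."""
    parser = argparse.ArgumentParser(description="Greet someone from the command line.")
    parser.add_argument("name", help="who to greet")
    parser.add_argument("--shout", action="store_true", help="use upper case")
    return parser


def main(argv=None):
    """Parse argv (defaults to sys.argv[1:] when None) and return the greeting."""
    args = build_parser().parse_args(argv)
    return greet(args.name, shout=args.shout)


# Simulate the command line `greet.py World --shout`.
print(main(["World", "--shout"]))
```

Libraries such as Fire aim to remove the parser declaration: its documented entry point, `fire.Fire(greet)`, derives a comparable interface from the function signature.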

Voice Separation with an Unknown Number of Multiple Speakers

Abstract We present a new method for separating a mixed audio sequence, in which multiple voices speak simultaneously. The new method employs gated neural networks that are trained to separate the voices at multiple processing steps, while maintaining the speaker in each output channel fixed. A different model is trained for every number of possible speakers, and the model with the largest number of speakers is employed to select the actual number of speakers in a given sample. Our method […]

Low-Resource Domain Adaptation for Compositional Task-Oriented Semantic Parsing

Abstract Task-oriented semantic parsing is a critical component of virtual assistants, which is responsible for understanding the user’s intents (set reminder, play music, etc.). Recent advances in deep learning have enabled several approaches to successfully parse more complex queries (Gupta et al., 2018; Rongali et al., 2020), but these models require a large amount of annotated training data to parse queries on new domains (e.g. reminder, music). In this paper, we focus on adapting task-oriented semantic parsers to low-resource domains, […]

Unsupervised Translation of Programming Languages

Abstract A transcompiler, also known as source-to-source translator, is a system that converts source code from a high-level programming language (such as C++ or Python) to another. Transcompilers are primarily used for interoperability, and to port codebases written in an obsolete or deprecated language (e.g. COBOL, Python 2) to a modern one. They typically rely on handcrafted rewrite rules, applied to the source code abstract syntax tree. Unfortunately, the resulting translations often lack readability, fail to respect the target language […]
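
To make the idea of a handcrafted rewrite rule applied to the abstract syntax tree concrete, here is a minimal sketch using Python's standard `ast` module (it assumes Python 3.9 or later for `ast.unparse`). The rule itself, turning calls to `xrange` into calls to `range`, and the names `XrangeToRange` and `translate` are illustrative assumptions; this is not the method proposed in the paper.

```python
import ast


class XrangeToRange(ast.NodeTransformer):
    """One handcrafted rule: rewrite calls xrange(...) as range(...)."""

    def visit_Call(self, node):
        # Rewrite nested calls first, then inspect this one.
        self.generic_visit(node)
        if isinstance(node.func, ast.Name) and node.func.id == "xrange":
            new_func = ast.Name(id="range", ctx=ast.Load())
            node.func = ast.copy_location(new_func, node.func)
        return node


def translate(source):
    """Parse the source, apply the rule to the tree, and emit code again."""
    tree = ast.parse(source)
    tree = XrangeToRange().visit(tree)
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


print(translate("for i in xrange(3):\n    total += i"))
```

A rule-based system needs one such transformer for every construct it handles, which hints at why this approach is costly to build and maintain.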

Text Mining hack: Subject Extraction made easy using Google API

Let’s do a simple exercise. You need to identify the subject and the sentiment in the following sentences:
- Google is the best resource for any kind of information.
- I came across a fabulous knowledge portal – Analytics Vidhya
- Messi played well but Argentina still lost the match
- Opera is not the best browser
- Yes, like UAE will win the Cricket World Cup.
Was this exercise simple? Even if this looks like a simple exercise, now imagine creating an algorithm to do this. How does that […]

Artificial Intelligence Demystified

Introduction Artificial Intelligence has become a very popular term today. There is sure to be at least one article in the newspaper daily on the revolutionary advancements made in the field. But there seems to be some confusion about what AI really is. Is it Robotics? Will the Terminator movie actually come true? Or is it something that has crept into our daily lives without us even realizing it? This article will give you a broad understanding of the buzzwords […]

Text Classification & Word Representations using FastText (An NLP library by Facebook)

Introduction If you put a status update on Facebook about purchasing a car, don’t be surprised if Facebook serves you a car ad on your screen. This is not black magic! This is Facebook leveraging the text data to serve you better ads. The picture below takes a jibe at a challenge while dealing with text data. Well, it clearly failed in the above attempt to deliver the right ad. It is all the more important to capture the context […]

Ultimate guide to deal with Text Data (using Python) – for Data Scientists and Engineers

Introduction One of the biggest breakthroughs required for achieving any level of artificial intelligence is to have machines which can process text data. Thankfully, the amount of text data being generated in this universe has exploded exponentially in the last few years. It has become imperative for an organization to have a structure in place to mine actionable insights from the text being generated. From social media analytics to risk management and cybercrime protection, dealing with text data has never […]
