A method for cleaning and classifying text using transformers
NLP Translation and Classification
The repository contains a method for classifying and cleaning text using NLP transformers.
Overview
The input data are web-scraped product names gathered from various e-shops. The products are either monitors or printers. Each product in the dataset has a scraped name containing information about the product brand, and product model name, but also unwanted noise – irrelevant information about the item. Additionally, only some records are relevant, meaning that they belong to the correct category: monitor or printer, while other records belong to unwanted categories like accessories or TVs.
The goal of the tasks is to preprocess web-scraped data by removing noisy records and cleaning product names. Preliminary experiments showed that classic machine learning methods like tf-idf vectorization and classification struggled to achieve good results. Instead NLP