A Web scraping library and command-line tool for text discovery and extraction
Description
Trafilatura is a Python package and command-line tool that downloads, parses, and scrapes web page data. It extracts metadata, main body text, and comments while preserving parts of the text formatting and page structure, and the output can be converted to different formats.
Distinguishing between a whole page and the page’s essential parts can help to alleviate many quality problems related to web text processing, by dealing with