Assignment 1 code for the Web Data Processing System

Web Data Processing System Assignment 1 – 2021 – Group 26

Zhining Bai, Bowen Lyu, Tianshi Chen, Yiming Xu

Description

This is a Python program that performs entity linking on WARC files. We recognize entities in web pages and link them to a knowledge base (Wikidata). The pipeline is as follows:

Read WARC
Use pyspark to read large-scale WARC files, so the program supports parallel computing. Extract the text from the HTML pages using beautifulsoup.

Named entity recognition
Extract […]
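As a rough illustration of the "Read WARC" step, the sketch below splits a decoded WARC archive into records and pulls the visible text out of each HTML payload. It is not the project's code: the function names (`split_warc_records`, `record_html`, `html_to_text`) are hypothetical, the record splitting is deliberately simplified, and Python's standard-library `html.parser` stands in for beautifulsoup so that the example runs without extra dependencies. In the pipeline described above, the records would be distributed with pyspark rather than processed in a plain loop.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def split_warc_records(raw):
    """Split a decoded WARC archive on its version line (simplified assumption)."""
    return ["WARC/1.0" + part for part in raw.split("WARC/1.0") if part.strip()]


def record_html(record):
    """Return the payload after the WARC and HTTP headers, or None if absent."""
    # Assumed layout: WARC headers, blank line, HTTP headers, blank line, body.
    parts = record.replace("\r\n", "\n").split("\n\n", 2)
    if len(parts) < 3:
        return None
    return parts[2]
```

A caller would iterate over `split_warc_records(raw)`, skip records for which `record_html` returns `None`, and pass the remaining payloads to `html_to_text` before the named entity recognition step.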
