Assignment 1 code for the Web Data Processing System

Web Data Processing System Assignment 1 – 2021 – Group 26
- Zhining Bai
- Bowen Lyu
- Tianshi Chen
- Yiming Xu
Description
This is a Python program that performs entity linking by processing WARC files. We recognize entities in web pages and link them to a knowledge base (Wikidata). The pipeline is as follows:
Read WARC
- Use pyspark to read large-scale WARC files, so the program supports parallel computing.
- Extract text information from HTML files by using beautifulsoup.
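The reading step can be illustrated with a minimal sketch. It is not the project's actual code: it uses only the Python standard library so that it runs without Spark, and it swaps in html.parser for beautifulsoup (with beautifulsoup the text extraction would typically be a call to get_text()). The function names, the "WARC/1.0" split, and the use of WARC-Record-ID as the page key are assumptions made for illustration. With pyspark, the same per-record functions could be mapped over an RDD of records.

```python
from html.parser import HTMLParser

# Assumption: every record starts with this version line, so it can serve
# as a record separator. A page body containing the same string would
# break this simple split.
RECORD_DELIMITER = "WARC/1.0"


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Stand-in for the beautifulsoup step: return the visible text of a page."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def split_records(warc_text):
    """Yield (record_id, html) for each record that carries an HTML payload."""
    for raw in warc_text.split(RECORD_DELIMITER):
        record_id = None
        for line in raw.splitlines():
            if line.startswith("WARC-Record-ID:"):
                record_id = line.split(":", 1)[1].strip()
                break
        start = raw.lower().find("<html")
        if record_id is None or start == -1:
            continue  # not a response record with an HTML page
        yield record_id, raw[start:]


def extract_pages(warc_text):
    """Return a list of (record_id, visible_text) pairs."""
    return [(rid, html_to_text(html)) for rid, html in split_records(warc_text)]


# Hypothetical single-record input, made up for this example.
sample = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Record-ID: <urn:uuid:doc-1>\r\n"
    "\r\n"
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html><head><title>Demo</title><style>p {color: red}</style></head>"
    "<body><p>Amsterdam is the capital of the <b>Netherlands</b></p>"
    "<script>var x = 1;</script></body></html>\r\n\r\n"
)

for record_id, text in extract_pages(sample):
    print(record_id, text)
```

Keeping the record splitting and the text extraction as separate pure functions means each can be tested on its own and later applied per record in a parallel job.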
Named entity recognition
- Extract entities by using the recognize_entities_bert model from sparknlp.
Disambiguation and NIL
We considered the popularity of the candidate page as well as the