Assignment 1 code for the Web Data Processing System

Web Data Processing System Assignment 1 – 2021 – Group 26

  • Zhining Bai
  • Bowen Lyu
  • Tianshi Chen
  • Yiming Xu

Description

This is a Python program that performs entity linking on WARC files. It recognizes entities in web pages and links them to a knowledge base (Wikidata). The pipeline consists of the following stages:

[Image: pipeline diagram]

Read WARC

  • Use PySpark to read large-scale WARC files, so that records can be processed in parallel.
  • Extract the text content from the HTML pages with BeautifulSoup.
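
The reading step can be illustrated with a minimal, standard-library-only sketch. It is not the code in this repository: it uses `html.parser` as a stand-in for BeautifulSoup and a plain Python loop as a stand-in for PySpark, and the names `iter_html_records`, `html_to_text` and `SAMPLE` are hypothetical. It assumes records are delimited by the `WARC/1.0` version line and that the page HTML never contains that string.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def iter_html_records(raw, marker="WARC/1.0"):
    """Yield (record_id, text) for every 'response' record in a WARC string."""
    raw = raw.replace("\r\n", "\n")
    for chunk in raw.split(marker):
        # Each record is: WARC headers, blank line, HTTP headers, blank line, body.
        parts = chunk.split("\n\n", 2)
        if len(parts) < 3:
            continue
        headers = {}
        for line in parts[0].strip().splitlines():
            key, sep, value = line.partition(":")
            if sep:
                headers[key.strip()] = value.strip()
        if headers.get("WARC-Type") != "response":
            continue
        yield headers.get("WARC-Record-ID", ""), html_to_text(parts[2])


# Hypothetical one-record input, for illustration only.
SAMPLE = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Record-ID: <urn:uuid:1>\r\n"
    "\r\n"
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html><head><title>Demo</title><script>var x = 1;</script></head>"
    "<body><p>Amsterdam is in the Netherlands.</p></body></html>\r\n"
    "\r\n"
)

if __name__ == "__main__":
    for record_id, text in iter_html_records(SAMPLE):
        print(record_id, text)
```

Because each record is handled independently of the others, the per-record work (here `html_to_text`) is the part that a parallel framework such as PySpark can distribute across workers.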

Named entity recognition

  • Extract entities with the recognize_entities_bert model from Spark NLP.

Disambiguation and NIL

We considered the popularity of the candidate page as well as the