Domain-specific language model pretraining for biomedical natural language processing
COVID-19 highlights a perennial problem facing scientists around the globe: how do we stay up to date with the cutting edge of scientific knowledge? In the few months since the pandemic emerged, tens of thousands of research papers have been published concerning COVID-19 and the SARS-CoV-2 virus. This explosive growth sparked the creation of the COVID-19 Open Research Dataset (CORD-19) to facilitate research and discovery. However, a pandemic is just one salient example of a prevailing challenge for the research community. PubMed, the standard repository for biomedical research articles, adds 4,000 new papers every day and over a million every year.
It is impossible to keep track of such rapid progress by manual effort alone. In the era of big data and precision medicine, the need has never been greater for natural language processing (NLP) methods that can help scientists stay abreast of this deluge of information. NLP can help researchers quickly identify and cross-reference important findings at scale, in papers both directly and tangentially related to their own research, instead of having to sift through papers manually for relevant findings or recall them from memory.
In this blog post, we describe our work on domain-specific language model pretraining for biomedical NLP.