PyTorch/HuggingFace Implementation of URLTran: Improving Phishing URL Detection Using Transformers

PyTorch implementation of the paper "Improving Phishing URL Detection via Transformers".

Data

The paper used ~1.8M URLs (a 90/10 split of benign vs. malicious). There are a few places to gather malicious URLs. My recommendation is to do the following:

Malicious URLs

OpenPhish will provide 500 malicious URLs for free in TXT form. You can access that data here.

Likewise, PhishTank is an excellent resource that provides a daily feed of malicious URLs in CSV or JSON format. You can gather ~5K through the following link.

Finally, there is an excellent open-source project, Phishing.Database, run by Mitchell Krog. A large amount of data is available there to augment your dataset.
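
Once the feeds are downloaded, they need to be combined into one deduplicated list of malicious URLs. The following is a minimal sketch of that step, not code from this repository. It assumes the TXT feed holds one URL per line and that the CSV feed has a header with a column named `url`; check the header of your own file, since the column name is an assumption. The helper names (`urls_from_txt`, `urls_from_csv`, `merge_unique`) and the sample URLs are hypothetical.

```python
import csv
import io


def urls_from_txt(text):
    """Parse a plain-text feed with one URL per line, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def urls_from_csv(text, column="url"):
    """Parse a CSV feed and return the values of a single column.

    The default column name "url" is an assumption about the feed's header.
    """
    reader = csv.DictReader(io.StringIO(text))
    return [row[column].strip() for row in reader if row.get(column)]


def merge_unique(*url_lists):
    """Concatenate URL lists, dropping exact duplicates in first-seen order."""
    seen = set()
    merged = []
    for urls in url_lists:
        for url in urls:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Tiny in-memory stand-ins for the downloaded files (made-up URLs).
txt_feed = "http://bad.example/login\n\nhttp://evil.example/verify\n"
csv_feed = "phish_id,url\n1,http://evil.example/verify\n2,http://phish.example/a\n"

malicious = merge_unique(urls_from_txt(txt_feed), urls_from_csv(csv_feed))
labeled = [(url, 1) for url in malicious]  # label 1 = malicious
```

With real files, read each one with `open(path, encoding="utf-8").read()` and pass the text to the matching parser. Deduplication here is by exact string match only; URLs that differ in case or trailing slashes are treated as distinct.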

Benign Data

I gathered benign URL data via two methods. The first was to
