PyTorch/HuggingFace Implementation of URLTran: Improving Phishing URL Detection Using Transformers
A PyTorch implementation of the paper "URLTran: Improving Phishing URL Detection Using Transformers".
Data
The paper used ~1.8M URLs, split roughly 90/10 between benign and malicious. There are only a few places to gather malicious URLs. My recommendation is to do the following:
Malicious URLs
OpenPhish provides 500 malicious URLs for free as a TXT file. You can access that data here.
Likewise, PhishTank is an excellent resource that provides a daily feed of malicious URLs in CSV or JSON format. You can gather ~5K through the following link.
Finally, there is an excellent open-source project, Phishing.Database, run by Mitchell Krog. There is a large amount of data available here to augment your dataset.
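The feeds above come in different formats, so they have to be merged and deduplicated before use. Below is a minimal sketch of one way to do that. It is not code from this repo: the helper names are hypothetical, the TXT feed is assumed to hold one URL per line, and the CSV feed is assumed to have a header row with a `url` column (check your download and adjust `column` if it differs). The `example` URLs are placeholders. `benign_needed` estimates how many benign URLs you would need to match the paper's roughly 90/10 split.

```python
import csv
import io
from typing import Iterable, List, TextIO


def read_url_lines(handle: TextIO) -> List[str]:
    """Read one URL per line, skipping blank lines (assumed TXT feed layout)."""
    return [line.strip() for line in handle if line.strip()]


def read_url_csv(handle: TextIO, column: str = "url") -> List[str]:
    """Read URLs from a CSV with a header row; `column` is an assumed name."""
    reader = csv.DictReader(handle)
    urls = []
    for row in reader:
        value = (row.get(column) or "").strip()
        if value:
            urls.append(value)
    return urls


def merge_unique(*sources: Iterable[str]) -> List[str]:
    """Concatenate sources, dropping exact duplicates, keeping first-seen order."""
    seen = set()
    merged = []
    for source in sources:
        for url in source:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


def benign_needed(n_malicious: int, benign_fraction: float = 0.9) -> int:
    """Benign URL count that makes benign URLs `benign_fraction` of the total."""
    return round(n_malicious * benign_fraction / (1.0 - benign_fraction))


if __name__ == "__main__":
    # In-memory stand-ins for downloaded files; replace with open("feed.txt") etc.
    txt_feed = io.StringIO("http://a.example/login\n\nhttp://b.example/verify\n")
    csv_feed = io.StringIO(
        "phish_id,url\n1,http://b.example/verify\n2,http://c.example/pay\n"
    )
    malicious = merge_unique(read_url_lines(txt_feed), read_url_csv(csv_feed))
    print(len(malicious), "unique malicious URLs")
    print(benign_needed(len(malicious)), "benign URLs for a 90/10 split")
```

Deduplication here is by exact string match only. URLs that differ in trailing slashes or letter case are treated as distinct, so you may want to normalize them first.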
Benign Data
I gathered benign URL data via two methods. The first was to