Fault-tolerant, stealthy, distributed web crawling with Pyppeteer
crawler-cluster
Distributed, fault-tolerant web crawling.
Multi-process, multiple workers:
The client process queues tasks in Redis. Worker nodes pull tasks from Redis, execute them, and store the results in Redis. The client process then pulls the results from Redis.
Pros:
Worker nodes can run on any machine.
Worker nodes can be added or removed at runtime without disrupting the system.
Fault tolerance is achieved through process isolation and monitoring.
Workers are run as systemd services, where each service is the smallest possible processing unit (either a single browser with a […]
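The task flow described above (the client queues tasks, workers pull and execute them, results go back through Redis) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual code: the key names `tasks` and `results`, the JSON task format, and the helpers `enqueue_tasks`, `run_worker_once`, and `collect_results` are all hypothetical. `InMemoryQueueStore` is a stand-in so the sketch runs without a server; a redis-py client (`redis.Redis`) offers `lpush` and `rpop` methods with the same names and could presumably be passed in its place.

```python
import json
from collections import defaultdict, deque

TASK_QUEUE = "tasks"      # hypothetical key name, not taken from the project
RESULT_QUEUE = "results"  # hypothetical key name, not taken from the project


class InMemoryQueueStore:
    """Stand-in for a Redis client: only lpush/rpop on named lists."""

    def __init__(self):
        self._lists = defaultdict(deque)

    def lpush(self, name, *values):
        # Push each value onto the head (left side) of the list.
        for value in values:
            self._lists[name].appendleft(value)
        return len(self._lists[name])

    def rpop(self, name):
        # Pop from the tail (right side); None when the list is empty.
        queue = self._lists[name]
        return queue.pop() if queue else None


def enqueue_tasks(store, urls):
    """Client side: queue one JSON-encoded task per URL."""
    for url in urls:
        store.lpush(TASK_QUEUE, json.dumps({"url": url}))


def run_worker_once(store, handler):
    """Worker side: pull one task, run it, store the result.

    Returns False when no task was available.
    """
    raw = store.rpop(TASK_QUEUE)
    if raw is None:
        return False
    task = json.loads(raw)
    result = {"url": task["url"], "result": handler(task["url"])}
    store.lpush(RESULT_QUEUE, json.dumps(result))
    return True


def collect_results(store):
    """Client side: drain every result currently in the result queue."""
    results = []
    while (raw := store.rpop(RESULT_QUEUE)) is not None:
        results.append(json.loads(raw))
    return results


def fake_fetch(url):
    """Placeholder for real crawling work (e.g. a browser page load)."""
    return len(url)
```

Because the client and the workers share nothing but the two queues, a worker could run in any process or on any machine that reaches the same Redis instance. A real worker would probably loop with a blocking pop rather than calling `run_worker_once` by hand.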