Fault-tolerant, stealthy, distributed web crawling with Pyppeteer


Distributed, Fault-Tolerant Web Crawling

Multi-process, multiple workers

  1. Client process queues tasks in Redis.
  2. Worker nodes pull tasks from Redis, execute them, and store the results in Redis.
  3. Client process pulls results from Redis.
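
The three steps above can be sketched as a pair of queues. This is a minimal illustration, not the author's implementation: the key names `crawl:tasks` and `crawl:results`, the JSON task format, and the helpers `enqueue_task`, `work_once`, `collect_results`, and `fake_crawl` are all assumptions made for the example. To keep it runnable without a Redis server, it uses a small in-memory stand-in that exposes `lpush` and `rpop`; a `redis.Redis` client provides commands with the same names and could presumably be passed in instead.

```python
import json

# Hypothetical key names; the source does not say which keys are used.
TASK_KEY = "crawl:tasks"
RESULT_KEY = "crawl:results"


class InMemoryQueue:
    """Stand-in exposing only the two list commands used below."""

    def __init__(self):
        self._lists = {}

    def lpush(self, key, value):
        # Push onto the head of the list, as the Redis LPUSH command does.
        self._lists.setdefault(key, []).insert(0, value)
        return len(self._lists[key])

    def rpop(self, key):
        # Pop from the tail; return None when the list is empty or missing.
        items = self._lists.get(key)
        return items.pop() if items else None


def enqueue_task(client, url):
    """Step 1: the client process queues a task."""
    client.lpush(TASK_KEY, json.dumps({"url": url}))


def work_once(client, handler):
    """Step 2: a worker pulls one task, executes it, and stores the result."""
    raw = client.rpop(TASK_KEY)
    if raw is None:
        return False  # nothing queued right now
    task = json.loads(raw)
    result = {"url": task["url"], "result": handler(task["url"])}
    client.lpush(RESULT_KEY, json.dumps(result))
    return True


def collect_results(client):
    """Step 3: the client process drains the results currently available."""
    results = []
    while (raw := client.rpop(RESULT_KEY)) is not None:
        results.append(json.loads(raw))
    return results


def fake_crawl(url):
    # Placeholder for the real browser or HTTP fetch.
    return f"fetched {url}"


if __name__ == "__main__":
    queue = InMemoryQueue()
    enqueue_task(queue, "https://example.com/a")
    while work_once(queue, fake_crawl):
        pass
    print(collect_results(queue))
```

Because the workers only ever touch the two queues, any number of them can call `work_once` in a loop on any machine that can reach the same Redis instance, which is what makes adding or removing workers at runtime possible.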


  • Worker nodes can run on any machine.
  • Add or remove worker nodes at runtime without disrupting the system.
  • Achieves fault-tolerance through process isolation and monitoring.
    • Workers are run as systemd services, where each service is the smallest possible processing unit (either a single browser with a single page, or a single vanilla HTTP client).
      Browsers with only a single page (a single open tab) are less prone to crashes. There is also no disadvantage in terms of system resource usage, as running n single-page browsers simultaneously will use almost the same resources as a single browser running



