mirror of https://github.com/openzim/zimit.git synced 2025-09-22 11:22:23 -04:00


Ilya Kreymer 1de577bd78 use puppeteeer-cluster for parallel crawling

use yargs to parse command-line args

2020-09-19 22:19:20 +00:00

.github

Github Kiwix Sponsoring page link

2020-02-01 18:14:09 +01:00

.gitignore

initial setup - single url capture with existing browser image, pywb, puppeteer and warc2zim

2020-09-19 17:38:52 +00:00

config.yaml

initial setup - single url capture with existing browser image, pywb, puppeteer and warc2zim

2020-09-19 17:38:52 +00:00

Dockerfile

initial setup - single url capture with existing browser image, pywb, puppeteer and warc2zim

2020-09-19 17:38:52 +00:00

index.js

use puppeteeer-cluster for parallel crawling

2020-09-19 22:19:20 +00:00

LICENSE

Added LICENSE document

2020-09-01 10:22:32 +02:00

package.json

use puppeteeer-cluster for parallel crawling

2020-09-19 22:19:20 +00:00

README.md

use puppeteeer-cluster for parallel crawling

2020-09-19 22:19:20 +00:00

run.sh

use puppeteeer-cluster for parallel crawling

2020-09-19 22:19:20 +00:00

uwsgi.ini

initial setup - single url capture with existing browser image, pywb, puppeteer and warc2zim

2020-09-19 17:38:52 +00:00

yarn.lock

use puppeteeer-cluster for parallel crawling

2020-09-19 22:19:20 +00:00

README.md

zimit

This version of Zimit runs a single-site, headless-Chrome-based crawl in a Docker container and produces a ZIM of the crawled content.

The system uses:

oldwebtoday/chrome - to install a recent version of Chrome (version 84)
puppeteer-cluster - for running Chrome browsers in parallel
pywb - in recording mode for capturing the content
warc2zim - to convert the crawled WARC files into a ZIM

The driver in index.js crawls a given URL using puppeteer-cluster.
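
As a rough illustration of a parallel single-site crawl, the sketch below keeps a queue of same-site URLs and visits up to a fixed number of pages at once. It is not the code in index.js: `loadPage` is a hypothetical callback standing in for a real browser visit (the role puppeteer-cluster plays in zimit), and `isSameSite` and `crawl` are names made up for this example.

```javascript
// Sketch only: `loadPage(url)` is a hypothetical stand-in for a browser visit.
// It must return a promise that resolves to an array of absolute link URLs.

// Returns true when `link` is on the same host as the seed URL.
function isSameSite(seedUrl, link) {
  try {
    return new URL(link).host === new URL(seedUrl).host;
  } catch (err) {
    return false; // not an absolute, parseable URL
  }
}

// Crawls outward from `seedUrl` with at most `workers` pages in flight.
// Resolves to the list of every URL that was queued for a visit.
function crawl(seedUrl, loadPage, workers = 2) {
  const seen = new Set([seedUrl]);
  const queue = [seedUrl];
  let active = 0;

  return new Promise((resolve) => {
    const pump = () => {
      // Finished when nothing is queued and no page is still loading.
      if (queue.length === 0 && active === 0) {
        resolve([...seen]);
        return;
      }
      // Start new visits until the worker limit is reached.
      while (active < workers && queue.length > 0) {
        const url = queue.shift();
        active += 1;
        Promise.resolve()
          .then(() => loadPage(url))
          .then((links) => {
            for (const link of links) {
              if (isSameSite(seedUrl, link) && !seen.has(link)) {
                seen.add(link);
                queue.push(link);
              }
            }
          })
          .catch(() => {}) // in this sketch, a failed page is simply skipped
          .then(() => {
            active -= 1;
            pump();
          });
      }
    };
    pump();
  });
}
```

Taking the page loader as a parameter keeps the queueing logic separate from the browser, so the sketch can be exercised with a fake `loadPage` that returns links from an in-memory map.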

After the crawl is done, warc2zim writes a ZIM file to the /output directory, which can be mounted as a volume.

Usage

zimit is intended to be run in Docker.

The --cap-add and --shm-size flags shown in the example command below are needed for Chrome.

The image accepts the following parameters:

"<URL>" - the URL to be crawled (required)
--workers N - number of crawl workers to be run in parallel
--wait-until - Puppeteer setting that determines when a page load is considered complete. See the page.goto waitUntil options. The default is load, but for static sites, --wait-until domcontentloaded may be used to speed up the crawl (for example, to avoid waiting for ads to load).

Example command:

docker run -d -e NAME=myzimfile -v /output:/output --cap-add=SYS_ADMIN --cap-add=NET_ADMIN --shm-size=1gb openzim/zimit "<URL>" --workers 2 --wait-until domcontentloaded
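
To make the parameters in this section concrete, here is a minimal, dependency-free sketch of how a driver might parse and validate them. It is an assumption-laden illustration, not zimit's parser: `parseArgs` is a hypothetical name, and the default of one worker is assumed because the parameter list does not state one. The accepted `--wait-until` values are the ones Puppeteer documents for `page.goto`.

```javascript
// Puppeteer's documented values for the page.goto `waitUntil` option.
const WAIT_UNTIL_VALUES = ['load', 'domcontentloaded', 'networkidle0', 'networkidle2'];

// Parses ["<URL>", "--workers", "N", "--wait-until", "EVENT"] into an options object.
// Assumption: workers defaults to 1. The "load" default is stated in the parameter list.
function parseArgs(argv) {
  const opts = { url: null, workers: 1, waitUntil: 'load' };

  for (let i = 0; i < argv.length; i += 1) {
    const arg = argv[i];
    if (arg === '--workers') {
      i += 1;
      opts.workers = Number.parseInt(argv[i], 10);
    } else if (arg === '--wait-until') {
      i += 1;
      opts.waitUntil = argv[i];
    } else if (!arg.startsWith('--') && opts.url === null) {
      opts.url = arg; // the first bare argument is the URL to crawl
    } else {
      throw new Error(`Unexpected argument: ${arg}`);
    }
  }

  if (!opts.url) {
    throw new Error('A URL to crawl is required');
  }
  if (!Number.isInteger(opts.workers) || opts.workers < 1) {
    throw new Error('--workers must be a positive integer');
  }
  if (!WAIT_UNTIL_VALUES.includes(opts.waitUntil)) {
    throw new Error(`--wait-until must be one of: ${WAIT_UNTIL_VALUES.join(', ')}`);
  }
  return opts;
}
```

Called with the arguments from the example command, `parseArgs` would yield an object holding the URL, `workers: 2`, and `waitUntil: 'domcontentloaded'`. It rejects a missing URL or an unknown wait event before any browser is started.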

Previous version

A first version of a generic HTTP scraper was created in 2016 during the Wikimania Esino Lario Hackathon.

That version is now considered outdated and is archived in the 2016 branch.