# zimit
This version of Zimit runs a single-site headless-Chrome based crawl in a Docker container and produces a ZIM of the crawled content.
The system uses:
- `oldwebtoday/chrome` - to install a recent version of Chrome (Chrome 84)
- `puppeteer-cluster` - for running Chrome browsers in parallel
- `pywb` - in recording mode for capturing the content
- `warc2zim` - to convert the crawled WARC files into a ZIM
The driver in `index.js` crawls a given URL using puppeteer-cluster. After the crawl is done, warc2zim is used to write a ZIM to the `/output` directory, which can be mounted as a volume.
## Usage
`zimit` is intended to be run in Docker.
To build locally run:

```
docker build -t openzim/zimit .
```
The image accepts the following parameters:
- `URL` - the URL to be crawled (required)
- `--workers N` - number of crawl workers to be run in parallel
- `--wait-until` - Puppeteer setting for how long to wait for page load. See `page.goto` waitUntil options. The default is `load`, but for static sites, `--wait-until domcontentloaded` may be used to speed up the crawl (to avoid waiting for ads to load, for example).
- `--name` - name of the ZIM file (defaults to the hostname of the URL)
- `--output` - output directory (defaults to `/output`)
- `--limit U` - limit capture to at most U URLs
- `--exclude <regex>` - skip URLs that match the regex when crawling; can be specified multiple times
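To illustrate the `--exclude` semantics, the filtering can be sketched as follows. This is a hypothetical Python sketch, not zimit's actual implementation (which lives in the `index.js` driver): a URL is skipped if it matches any of the supplied regexes.

```python
import re

def should_crawl(url, exclude_patterns):
    """Return True unless the URL matches one of the exclude regexes."""
    return not any(re.search(pattern, url) for pattern in exclude_patterns)

# Example: skip PDFs and anything under an /ads/ path.
excludes = [r"\.pdf$", r"/ads/"]
print(should_crawl("https://example.com/page.html", excludes))  # True
print(should_crawl("https://example.com/ads/banner", excludes))  # False
```

Because each pattern is an independent regex, passing `--exclude` multiple times simply grows this list of patterns.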
The following is an example usage. The `--cap-add` and `--shm-size` flags are needed to run Chrome in Docker.

Example command:

```
docker run -v /output:/output --cap-add=SYS_ADMIN --cap-add=NET_ADMIN --shm-size=1gb openzim/zimit URL --name myzimfile --workers 2 --wait-until domcontentloaded
```
puppeteer-cluster provides monitoring output, which is enabled by default and prints the crawl status to the Docker log.
## Previous version
A first version of a generic HTTP scraper was created in 2016 during the Wikimania Esino Lario Hackathon.
That version is now considered outdated and is archived in the `2016` branch.