mirror of https://github.com/openzim/zimit.git synced 2025-09-26 06:13:41 -04:00

Ilya Kreymer 91fe76c56e work on automated capture of vidoe (#9 )

- add autoplay behavior to reload known video sites to autoplay
- for video/audio on page, queue directly for loading if video.src or audio.src set to valid url, otherwise load through play in browser (may be slower)
- add extra wait if reloading for autoplay
- timeouts: set timeout for puppeteer-cluster double to timeout of page to avoid hitting that timeout during regular operation
- use browser from oldwebtoday/chrome:84 and puppeteer-core instead of puppeteer browser for consistent results
- temp testing: use custom wabac.js sw for testing (will use default from warc2zim), using warc2zim fuzzy-match branch for now

2020-10-21 06:09:10 +00:00

.github

Github Kiwix Sponsoring page link

2020-02-01 18:14:09 +01:00

.dockerignore

add .dockerignore to speed up docker build

2020-10-16 19:12:15 +00:00

.gitignore

initial setup - single url capture with existing browser image, pywb, puppeteer and warc2zim

2020-09-19 17:38:52 +00:00

autoplay.js

work on automated capture of vidoe (#9 )

2020-10-21 06:09:10 +00:00

config.yaml

replace run.sh with python runner zimit.py, as suggested in #28

2020-10-16 18:54:04 +00:00

crawler.js

work on automated capture of vidoe (#9 )

2020-10-21 06:09:10 +00:00

Dockerfile

work on automated capture of vidoe (#9 )

2020-10-21 06:09:10 +00:00

LICENSE

Added LICENSE document

2020-09-01 10:22:32 +02:00

package.json

work on automated capture of vidoe (#9 )

2020-10-21 06:09:10 +00:00

README.md

replace run.sh with python runner zimit.py, as suggested in #28

2020-10-16 18:54:04 +00:00

sw.js

work on automated capture of vidoe (#9 )

2020-10-21 06:09:10 +00:00

uwsgi.ini

config work: pass remaining config opts to warc2zim, fixes #13

2020-10-06 06:25:40 +00:00

yarn.lock

work on automated capture of vidoe (#9 )

2020-10-21 06:09:10 +00:00

zimit.py

work on automated capture of vidoe (#9 )

2020-10-21 06:09:10 +00:00

README.md

Zimit

Zimit is a scraper that creates a ZIM file from any website.

Technical background

This version of Zimit runs a single-site headless-Chrome based crawl in a Docker container and produces a ZIM of the crawled content.

The system uses:

oldwebtoday/chrome - to install a recent version of Chrome (currently Chrome 84)
puppeteer-cluster - for running Chrome browsers in parallel
pywb - in recording mode for capturing the content
warc2zim - to convert the crawled WARC files into a ZIM

The driver in crawler.js crawls a given URL using puppeteer-cluster.

After the crawl is done, warc2zim writes a ZIM file to the /output directory, which can be mounted as a volume.
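As a rough illustration of what "single-site" means for the crawl, the sketch below filters a list of discovered links down to the ones a crawl might keep. It is not taken from crawler.js: the function name select_urls, the same-hostname rule, and the de-duplication are assumptions made for the example. The exclude and limit arguments mirror the --exclude and --limit options described under Usage.

```python
import re
from urllib.parse import urlsplit


def select_urls(seed, discovered, exclude=(), limit=None):
    """Return the URLs a single-site crawl might keep, in discovery order.

    Hypothetical helper: the same-hostname rule is an assumption, not
    necessarily how the real crawler scopes a site.
    """
    host = urlsplit(seed).hostname
    patterns = [re.compile(p) for p in exclude]
    kept = []
    seen = set()
    for url in [seed, *discovered]:
        # Stop as soon as the URL budget (like --limit) is used up.
        if limit is not None and len(kept) >= limit:
            break
        if url in seen:
            continue
        seen.add(url)
        # Keep only pages on the same host as the seed URL.
        if urlsplit(url).hostname != host:
            continue
        # Skip anything matching an exclusion regex (like --exclude).
        if any(p.search(url) for p in patterns):
            continue
        kept.append(url)
    return kept
```

For example, select_urls("https://example.org/", links, exclude=[r"/login"], limit=100) would drop off-site links and any URL containing /login, and stop after 100 URLs.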

Usage

zimit is intended to be run in Docker.

To build locally run:

docker build -t openzim/zimit .

The image accepts the following parameters:

--url URL - the url to be crawled (required)
--workers N - number of crawl workers to be run in parallel
--wait-until - Puppeteer setting for when a page is considered loaded. See the page.goto waitUntil options. The default is load, but for static sites, --wait-until domcontentloaded may be used to speed up the crawl (for example, to avoid waiting for ads to load).
--name - Name of ZIM file (defaults to the hostname of the URL)
--output - output directory (defaults to /output)
--limit U - Limit capture to at most U URLs
--exclude <regex> - skip URLs that match the regex from crawling. Can be specified multiple times.
--scroll [N] - if set, activates a simple auto-scroll behavior on each page, scrolling for up to N seconds
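The options above map naturally onto an argparse definition. The following is a minimal sketch of such a parser, not the actual contents of zimit.py: the function names build_parser and parse_options are hypothetical, and DEFAULT_SCROLL_SECONDS is a placeholder, since the list above does not say what N defaults to when --scroll is given without a value.

```python
import argparse
from urllib.parse import urlsplit

# Placeholder only: the real default scroll duration is not documented above.
DEFAULT_SCROLL_SECONDS = 4


def build_parser():
    """Build a parser for the options listed above (illustrative sketch)."""
    parser = argparse.ArgumentParser(prog="zimit")
    parser.add_argument("--url", required=True, help="URL to be crawled")
    parser.add_argument("--workers", type=int, help="parallel crawl workers")
    parser.add_argument("--wait-until", default="load",
                        help="Puppeteer page.goto waitUntil value")
    parser.add_argument("--name", help="name of the ZIM file")
    parser.add_argument("--output", default="/output", help="output directory")
    parser.add_argument("--limit", type=int, help="capture at most this many URLs")
    # May be given several times; each regex is collected into a list.
    parser.add_argument("--exclude", action="append", metavar="REGEX")
    # Optional value: "--scroll" alone uses the placeholder duration.
    parser.add_argument("--scroll", nargs="?", type=int,
                        const=DEFAULT_SCROLL_SECONDS, default=None)
    return parser


def parse_options(argv):
    """Parse argv and fill in the defaults that depend on other options."""
    args = build_parser().parse_args(argv)
    if args.name is None:
        # --name defaults to the hostname of the URL.
        args.name = urlsplit(args.url).hostname
    args.exclude = args.exclude or []
    return args
```

With this sketch, parse_options(["--url", "https://example.org/docs"]) yields name "example.org", wait_until "load", and output "/output".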

The following is an example usage. The --cap-add and --shm-size flags are needed to run Chrome in Docker.

Example command:

docker run -v /output:/output --cap-add=SYS_ADMIN --cap-add=NET_ADMIN \
       --shm-size=1gb openzim/zimit --url URL --name myzimfile --workers 2 --wait-until domcontentloaded
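When the container is launched from a script rather than typed by hand, the same invocation can be assembled as an argument list. The sketch below is only an illustration: the helper name docker_command and its defaults are made up for this example, and it passes the URL through --url as described in the parameter list.

```python
import shlex


def docker_command(url, name, workers=2, wait_until="domcontentloaded",
                   host_output="/output"):
    """Return the docker run invocation as an argument list (hypothetical helper)."""
    return [
        "docker", "run",
        # Mount a host directory on the container's /output directory.
        "-v", f"{host_output}:/output",
        # Capabilities and shared memory needed to run Chrome in Docker.
        "--cap-add=SYS_ADMIN", "--cap-add=NET_ADMIN",
        "--shm-size=1gb",
        "openzim/zimit",
        "--url", url,
        "--name", name,
        "--workers", str(workers),
        "--wait-until", wait_until,
    ]


cmd = docker_command("https://example.org/", "myzimfile")
# shlex.join quotes arguments where necessary, giving a shell-safe string.
print(shlex.join(cmd))
```

The list can be handed to subprocess.run(cmd, check=True) to start the crawl. Building a list instead of a single string avoids shell quoting problems with URLs that contain characters such as & or ?.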

puppeteer-cluster provides monitoring output, which is enabled by default and prints the crawl status to the Docker log.

Nota bene

A first version of a generic HTTP scraper was created in 2016 during the Wikimania Esino Lario Hackathon.

That version is now considered outdated and is archived in the 2016 branch.

License

GPLv3 or later, see LICENSE for more details.