Additional README.md changes (#16)

Kelson 2020-09-25 12:02:43 +02:00 committed by GitHub
parent 252516e38c
commit bb5b7e48c1
GPG Key ID: 4AEE18F83AFDEB23


@@ -1,10 +1,15 @@
Zimit
=====
Zimit is a scraper that creates a ZIM file from any website.
[![CodeFactor](https://www.codefactor.io/repository/github/openzim/zimit/badge)](https://www.codefactor.io/repository/github/openzim/zimit)
[![Docker Build Status](https://img.shields.io/docker/build/openzim/zimit)](https://hub.docker.com/r/openzim/zimit)
[![Docker Build Status](https://img.shields.io/docker/cloud/build/openzim/zimit)](https://hub.docker.com/r/openzim/zimit)
[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
Technical background
--------------------
This version of Zimit runs a single-site headless-Chrome based crawl in a Docker container and produces a ZIM of the crawled content.
The system uses:
@@ -15,15 +20,17 @@ The system uses:
The driver in `index.js` crawls a given URL using puppeteer-cluster.
After the crawl is done, warc2zim is used to write a zim to the `/output` directory, which can be mounted as a volume.
After the crawl is done, warc2zim is used to write a zim to the
`/output` directory, which can be mounted as a volume.
## Usage
Usage
-----
`zimit` is intended to be run in Docker.
To build locally run:
```
```bash
docker build -t openzim/zimit .
```
@@ -37,25 +44,28 @@ The image accepts the following parameters:
- `--limit U` - Limit capture to at most U URLs
- `--exclude <regex>` - skip URLs that match the regex from crawling. Can be specified multiple times.
The following is an example usage. The `--cap-add` and `--shm-size` flags are needed to run Chrome in Docker.
The following is an example usage. The `--cap-add` and `--shm-size`
flags are needed to run Chrome in Docker.
Example command:
```
docker run -v /output:/output --cap-add=SYS_ADMIN --cap-add=NET_ADMIN --shm-size=1gb openzim/zimit URL --name myzimfile --workers 2 --wait-until domcontentloaded
```bash
docker run -v /output:/output --cap-add=SYS_ADMIN --cap-add=NET_ADMIN \
--shm-size=1gb openzim/zimit URL --name myzimfile --workers 2 --wait-until domcontentloaded
```
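As a hedged sketch of how the `--limit` and `--exclude` parameters listed above combine with the example command, the function below assembles a `docker run` call that caps the crawl and skips URLs matching two patterns. The function name `zimit_cmd`, the `DOCKER` variable, and both regexes are illustrative assumptions, not part of zimit.

```bash
# Sketch only: zimit_cmd, DOCKER, and the two regexes are hypothetical.
# DOCKER defaults to the real docker binary; set DOCKER=echo for a dry run.
DOCKER="${DOCKER:-docker}"

# zimit_cmd URL NAME LIMIT
# Runs one crawl of URL, capped at LIMIT URLs, writing the ZIM to /output.
zimit_cmd() {
    "$DOCKER" run -v /output:/output \
        --cap-add=SYS_ADMIN --cap-add=NET_ADMIN --shm-size=1gb \
        openzim/zimit "$1" --name "$2" \
        --limit "$3" \
        --exclude '/login' --exclude '/private/'
}
```

With `DOCKER=echo` set beforehand, `zimit_cmd https://example.com/ example 100` prints the assembled arguments instead of starting a container, which is a cheap way to check the flags before a long crawl.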
The puppeteer-cluster provides monitoring output which is enabled by default and prints the crawl status to the Docker log.
The puppeteer-cluster provides monitoring output which is enabled by
default and prints the crawl status to the Docker log.
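Because the crawl status goes to the Docker log, one possible workflow is to start the container detached and then follow its log. The sketch below assumes that workflow; the function names, the `DOCKER` variable, and the container name `zimit-crawl` are hypothetical, while `-d`, `--name` on `docker run`, and `docker logs -f` are standard Docker options. Note that Docker's `--name` (before the image) names the container, whereas zimit's `--name` (after the URL) names the ZIM file.

```bash
# Sketch only: start_crawl, follow_crawl, DOCKER, and the container name
# "zimit-crawl" are hypothetical. Set DOCKER=echo for a dry run.
DOCKER="${DOCKER:-docker}"

# start_crawl URL NAME
# Starts the crawl in the background (-d) under a fixed container name.
start_crawl() {
    "$DOCKER" run -d --name zimit-crawl -v /output:/output \
        --cap-add=SYS_ADMIN --cap-add=NET_ADMIN --shm-size=1gb \
        openzim/zimit "$1" --name "$2"
}

# follow_crawl
# Streams the container log, which carries the crawl status output.
follow_crawl() {
    "$DOCKER" logs -f zimit-crawl
}
```

Calling `start_crawl URL myzimfile` followed by `follow_crawl` would then stream the status until interrupted; stopping the log stream does not stop the crawl itself.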
Nota bene
---------
A first version of a generic HTTP scraper was created in 2016 during
the [Wikimania Esino Lario
Hackathon](https://wikimania2016.wikimedia.org/wiki/Programme/Kiwix-dedicated_Hackathon).
<hr>
## Previous version
A first version of a generic HTTP scraper was created in 2016 during the [Wikimania Esino Lario Hackathon](https://wikimania2016.wikimedia.org/wiki/Programme/Kiwix-dedicated_Hackathon).
That version is now considered outdated and [archived in `2016` branch](https://github.com/openzim/zimit/tree/2016).
That version is now considered outdated and [archived in `2016`
branch](https://github.com/openzim/zimit/tree/2016).
License
-------