From bb5b7e48c1422fb3879c95bc9fedb7fadd412136 Mon Sep 17 00:00:00 2001
From: Kelson
Date: Fri, 25 Sep 2020 12:02:43 +0200
Subject: [PATCH] Additional README.md changes (#16)

---
 README.md | 40 +++++++++++++++++++++++++---------------
 1 file changed, 25 insertions(+), 15 deletions(-)

diff --git a/README.md b/README.md
index 16f69d0..fc2b99d 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,15 @@
 Zimit
 =====
 
+Zimit is a scraper allowing to create ZIM file from any Web site.
+
 [![CodeFactor](https://www.codefactor.io/repository/github/openzim/zimit/badge)](https://www.codefactor.io/repository/github/openzim/zimit)
-[![Docker Build Status](https://img.shields.io/docker/build/openzim/zimit)](https://hub.docker.com/r/openzim/zimit)
+[![Docker Build Status](https://img.shields.io/docker/cloud/build/openzim/zimit)](https://hub.docker.com/r/openzim/zimit)
 [![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
 
+Technical background
+--------------------
+
 This version of Zimit runs a single-site headless-Chrome based crawl in a Docker container and produces a ZIM of the crawled content.
 
 The system uses:
@@ -15,15 +20,17 @@ The system uses:
 
 The driver in `index.js` crawls a given URL using puppeteer-cluster.
 
-After the crawl is done, warc2zim is used to write a zim to the `/output` directory, which can be mounted as a volume.
+After the crawl is done, warc2zim is used to write a zim to the
+`/output` directory, which can be mounted as a volume.
 
-## Usage
+Usage
+-----
 
 `zimit` is intended to be run in Docker.
 
 To build locally run:
 
-```
+```bash
 docker build -t openzim/zimit .
 ```
 
@@ -37,25 +44,28 @@ The image accepts the following parameters:
 - `--limit U` - Limit capture to at most U URLs
 - `--exclude ` - skip URLs that match the regex from crawling. Can be specified multiple times.
 
-The following is an example usage. The `--cap-add` and `--shm-size` flags are needed to run Chrome in Docker.
+The following is an example usage. The `--cap-add` and `--shm-size`
+flags are needed to run Chrome in Docker.
 
 Example command:
 
-```
-docker run -v /output:/output --cap-add=SYS_ADMIN --cap-add=NET_ADMIN --shm-size=1gb openzim/zimit URL --name myzimfile --workers 2 --wait-until domcontentloaded
+```bash
+docker run -v /output:/output --cap-add=SYS_ADMIN --cap-add=NET_ADMIN \
+ --shm-size=1gb openzim/zimit URL --name myzimfile --workers 2 --wait-until domcontentloaded
 ```
 
-The puppeteer-cluster provides monitoring output which is enabled by default and prints the crawl status to the Docker log.
+The puppeteer-cluster provides monitoring output which is enabled by
+default and prints the crawl status to the Docker log.
 
+Nota bene
+---------
 
+A first version of a generic HTTP scraper was created in 2016 during
+the [Wikimania Esino Lario
+Hackathon](https://wikimania2016.wikimedia.org/wiki/Programme/Kiwix-dedicated_Hackathon).
 
-
-
-## Previous version
-
-A first version of a generic HTTP scraper was created in 2016 during the [Wikimania Esino Lario Hackathon](https://wikimania2016.wikimedia.org/wiki/Programme/Kiwix-dedicated_Hackathon).
-
-That version is now considered outdated and [archived in `2016` branch](https://github.com/openzim/zimit/tree/2016).
+That version is now considered outdated and [archived in `2016`
+branch](https://github.com/openzim/zimit/tree/2016).
 
 License
 -------