mirror of
https://github.com/openzim/zimit.git
synced 2025-09-23 03:52:16 -04:00
Additional README.md changes (#16)
parent 252516e38c
commit bb5b7e48c1
40
README.md
@@ -1,10 +1,15 @@
Zimit
=====

Zimit is a scraper that creates a ZIM file from any website.

[](https://www.codefactor.io/repository/github/openzim/zimit)
[](https://hub.docker.com/r/openzim/zimit)
[](https://hub.docker.com/r/openzim/zimit)
[](https://www.gnu.org/licenses/gpl-3.0)

Technical background
--------------------

This version of Zimit runs a single-site headless-Chrome based crawl in a Docker container and produces a ZIM of the crawled content.

The system uses:
@@ -15,15 +20,17 @@ The system uses:

The driver in `index.js` crawls a given URL using puppeteer-cluster.

After the crawl is done, warc2zim is used to write a zim to the `/output` directory, which can be mounted as a volume.
After the crawl is done, warc2zim is used to write a zim to the
`/output` directory, which can be mounted as a volume.
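
Because the ZIM is written to `/output`, mounting a host directory there makes the result visible outside the container. The following is a minimal sketch, not part of the project, of how one might check that directory after a crawl. The helper name `list_zims` and the `ZIM_OUTPUT_DIR` variable are hypothetical, and the sketch only assumes that the produced file carries the usual `.zim` extension.

```bash
#!/usr/bin/env bash
# Sketch only: lists ZIM files in a host directory that was mounted at /output.
# `list_zims` and ZIM_OUTPUT_DIR are hypothetical names used for illustration.
set -euo pipefail

list_zims() {
  local dir="$1"
  # -maxdepth 1 keeps the search to the mounted directory itself
  find "$dir" -maxdepth 1 -type f -name '*.zim' | sort
}

# Host directory that was passed to `docker run -v <dir>:/output`
out_dir="${ZIM_OUTPUT_DIR:-/output}"
if [ -d "$out_dir" ]; then
  list_zims "$out_dir"
else
  echo "no such directory: $out_dir" >&2
fi
```

An empty listing after a finished run would suggest that the volume was mounted at a different path than the one being inspected.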

## Usage
Usage
-----

`zimit` is intended to be run in Docker.

To build locally run:

```
```bash
docker build -t openzim/zimit .
```

@@ -37,25 +44,28 @@ The image accepts the following parameters:
- `--limit U` - Limit capture to at most U URLs
- `--exclude <regex>` - skip URLs that match the regex from crawling. Can be specified multiple times.

The following is an example usage. The `--cap-add` and `--shm-size` flags are needed to run Chrome in Docker.
The following is an example usage. The `--cap-add` and `--shm-size`
flags are needed to run Chrome in Docker.

Example command:

```
docker run -v /output:/output --cap-add=SYS_ADMIN --cap-add=NET_ADMIN --shm-size=1gb openzim/zimit URL --name myzimfile --workers 2 --wait-until domcontentloaded
```bash
docker run -v /output:/output --cap-add=SYS_ADMIN --cap-add=NET_ADMIN \
--shm-size=1gb openzim/zimit URL --name myzimfile --workers 2 --wait-until domcontentloaded
```
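
The example above does not use `--limit` or `--exclude`. As a hedged illustration of how they might be combined with it, the sketch below assembles such a command and prints it rather than running it. The helper name `build_zimit_cmd`, the URL, the limit of 100 and the two patterns are all hypothetical placeholders. Only the flags themselves come from the parameter list above.

```bash
#!/usr/bin/env bash
# Sketch only: assembles a zimit invocation and prints it instead of running it.
# The URL, limit value and exclude patterns are hypothetical placeholders.
set -euo pipefail

build_zimit_cmd() {
  local url="$1" name="$2" limit="$3"
  shift 3
  local cmd=(docker run -v /output:/output
             --cap-add=SYS_ADMIN --cap-add=NET_ADMIN --shm-size=1gb
             openzim/zimit "$url" --name "$name" --limit "$limit")
  local pattern
  for pattern in "$@"; do
    # --exclude can be specified multiple times, once per regex
    cmd+=(--exclude "$pattern")
  done
  # %q quotes each argument so the printed line can be pasted into a shell
  printf '%q ' "${cmd[@]}"
  printf '\n'
}

build_zimit_cmd "https://example.com/" myzimfile 100 '/login' '\.pdf$'
```

Printing the command first is a cheap way to check the quoting of each regex before starting a crawl. How the crawler interprets the patterns is not specified here, so they should be tested against a small `--limit` before a full run.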

The puppeteer-cluster provides monitoring output which is enabled by default and prints the crawl status to the Docker log.
The puppeteer-cluster provides monitoring output which is enabled by
default and prints the crawl status to the Docker log.

Nota bene
---------

A first version of a generic HTTP scraper was created in 2016 during
the [Wikimania Esino Lario
Hackathon](https://wikimania2016.wikimedia.org/wiki/Programme/Kiwix-dedicated_Hackathon).

<hr>

## Previous version

A first version of a generic HTTP scraper was created in 2016 during the [Wikimania Esino Lario Hackathon](https://wikimania2016.wikimedia.org/wiki/Programme/Kiwix-dedicated_Hackathon).

That version is now considered outdated and [archived in `2016` branch](https://github.com/openzim/zimit/tree/2016).
That version is now considered outdated and [archived in `2016`
branch](https://github.com/openzim/zimit/tree/2016).

License
-------
