Ilya Kreymer
d9ba7e246f
add .bandit exclusion for codefactor
2020-10-16 19:21:37 +00:00
Ilya Kreymer
65b3b533b7
add .dockerignore to speed up docker build
2020-10-16 19:12:15 +00:00
Ilya Kreymer
2c1b401e93
fix codefactor complaints
2020-10-16 19:11:31 +00:00
Ilya Kreymer
c26fe5d4cd
replace run.sh with python runner zimit.py, as suggested in #28
...
should fix arg parsing issues in #28, #18
warc2zim now called directly from zimit.py, both for arg check and for actual zim creation
crawler renamed to crawler.js, no longer handles zim creation, only crawling
add signal handling to both zimit and crawler.js for smooth shutdown, should fix #25
pywb: update to latest dev version with dedup support, add redis for deduplication
2020-10-16 18:54:04 +00:00
rgaudin
9f62f52a02
Merge pull request #27 from openzim/opts-tweak
...
more options tweaking:
2020-10-09 15:52:17 +00:00
Ilya Kreymer
9046f21f53
quote repeated params, add space
2020-10-09 15:29:07 +00:00
Ilya Kreymer
1198d1b6b9
correct parsing of repeated params, fixes #26, #18
...
switch to 'zimit' CMD instead of ENTRYPOINT (#21)
2020-10-09 15:19:20 +00:00
rgaudin
0a494ee168
Merge pull request #24 from openzim/run-script-cleanup
...
Run Script + Param Validation Cleanup
2020-10-09 09:29:59 +00:00
Ilya Kreymer
bd8d90efa5
quote params passed to warc2zim, should fix #18
2020-10-09 05:37:30 +00:00
Ilya Kreymer
e608cbd71e
bump timeout to 90s, per suggestions in #20
2020-10-09 05:24:11 +00:00
Ilya Kreymer
9e9ec82ad1
param validation for warc2zim:
...
- ensure trailing slash is added #19
- better handling of boolean params #18
2020-10-09 04:42:46 +00:00
Ilya Kreymer
2b2f96d983
run script improvements:
...
- set permissions on volume dir to address #22
- propagate return code #21
- catch SIGTERM and SIGINT signals to clean up temp dir
2020-10-09 04:23:11 +00:00
renaud gaudin
736deccbb5
using tagged version for base image
...
so we have a known python version
so there are wheels suitable for the docker image (cp38)
2020-10-08 12:01:05 +00:00
renaud gaudin
901729a069
added link to /dev/shm info on readme
2020-10-07 13:56:21 +00:00
rgaudin
8842bf7048
Merge pull request #17 from openzim/config-opts
...
Config option improvements + warc2zim passing + SSL check disabled
2020-10-07 10:30:54 +00:00
renaud gaudin
1eee9ab633
added locales to image
2020-10-07 10:29:38 +00:00
Ilya Kreymer
880165ce91
fix Dockerfile per codefactor recommendations
2020-10-06 16:01:55 +00:00
Ilya Kreymer
94f0b7362d
Merge README changes from master
2020-10-06 15:53:22 +00:00
Ilya Kreymer
3519d32ba6
disable https checking for fetch() head check (pywb already ignores https certs for capture), should fix #10
2020-10-06 15:49:45 +00:00
Ilya Kreymer
24c843c4af
update to latest warc2zim (1.1.0)
2020-10-06 15:36:45 +00:00
Ilya Kreymer
daa2492655
config work: pass remaining config opts to warc2zim, fixes #13
...
warc2zim check: add runWarc2Zim() to validate warc2zim opts before running
run script: create temp dir in output dir to ensure all data is on the volume
run script: add --keep option to keep temp dir, otherwise delete
2020-10-06 06:25:40 +00:00
Ilya Kreymer
e4128c8183
add help text/validation for all config options, url now must be passed in with --url
...
add --scroll boolean option, which activates simple autoscroll behavior
use chrome user-agent for manual fetch
reenable pywb option
cleanup Dockerfile: update to warc2zim 1.0.1, install fonts-stix for math science sites
update README
2020-09-29 05:22:33 +00:00
Kelson
bb5b7e48c1
Additional README.md changes (#16)
2020-09-25 12:02:43 +02:00
rgaudin
252516e38c
Merge pull request #14 from openzim/kelson42-patch-1
...
Update README.md
2020-09-25 09:47:29 +00:00
Kelson
ac650bff05
Update README.md
2020-09-25 11:36:30 +02:00
rgaudin
01f2471ab8
Merge pull request #11 from openzim/develop
...
Initial prototype
2020-09-23 08:44:34 +00:00
renaud gaudin
71e94914aa
Added gevent update to prevent segfault in uwsgi
2020-09-23 08:42:08 +00:00
Ilya Kreymer
6a925748d5
excludes: fix no exclude default
2020-09-22 18:12:15 +00:00
Ilya Kreymer
f25b390f15
add regex exclusions
2020-09-22 17:48:09 +00:00
Ilya Kreymer
f252245983
try using regular puppeteer, only copy deps from chrome image
...
pywb: increase uwsgi processes, disable autoindex/autofetch for better perf
2020-09-22 06:09:33 +00:00
Ilya Kreymer
b00c4262a7
add --limit param for max URLs to be captured
...
add 'html check': only load HTML in browsers, load other content-types directly via pywb, esp. for PDFs (work on #8)
improved error handling
2020-09-21 07:16:26 +00:00
Ilya Kreymer
ff2773677c
crawling: move checking logic to shouldCrawl, remove hashtag before checking seen list
2020-09-19 23:19:21 +00:00
Ilya Kreymer
9b23de828b
Update README.md
2020-09-19 15:53:23 -07:00
Ilya Kreymer
4e04645e6b
move warc2zim to be launched by node process
2020-09-19 22:47:19 +00:00
Ilya Kreymer
1de577bd78
use puppeteer-cluster for parallel crawling
...
use yargs to parse command-line args
2020-09-19 22:19:20 +00:00
Ilya Kreymer
7346527a81
initial setup - single url capture with existing browser image, pywb, puppeteer and warc2zim
2020-09-19 17:38:52 +00:00
rgaudin
bdfd9be399
Added LICENSE document
2020-09-01 10:22:32 +02:00
renaud gaudin
15cf636ff3
reset master branch for 2020 codebase
2020-08-19 09:36:48 +02:00
Kelson
d178431e20
Github Kiwix Sponsoring page link
2020-02-01 18:14:09 +01:00
Kelson
77efa285e0
Create FUNDING.yml
proof-of-concept
2019-06-22 08:08:19 +02:00
Alexis Métaireau
6b10be5557
Add a few options to HTTrack
2016-06-21 15:47:15 +02:00
Alexis Métaireau
eee3447fd1
Rename logging to log_file
2016-06-20 19:17:21 +02:00
Alexis Métaireau
6c1f22ae96
Always append to log files.
2016-06-20 18:55:59 +02:00
Alexis Métaireau
df2d0ccada
Rename the "website" endpoint to "website-zim".
...
Fix #17
2016-06-20 18:54:45 +02:00
Alexis Métaireau
ddb0eb69e3
Add a status API. Fix #5
2016-06-20 18:46:30 +02:00
Alexis Métaireau
728a90a7dd
Fix markup
2016-06-20 15:36:40 +02:00
Alexis Métaireau
13d04caf5c
Define the exposed API in the README.
...
Fix #13
2016-06-20 15:22:15 +02:00
Alexis Métaireau
c84e6cc5d3
Refactor the ZimCreator class.
...
Fixes #12 #14
2016-06-20 14:48:16 +02:00
Alexis Métaireau
8ce39f00f9
Replace material design by bootstrap.
...
It's visually more pleasant :)
2016-06-20 09:59:31 +02:00
Alexis Métaireau
6d7affc01b
Add a frontend to start jobs.
2016-06-19 18:57:05 +02:00