228 Commits

Author SHA1 Message Date
rgaudin
f6282dbf14
Merge pull request #36 from openzim/video-capture-work
work on automated capture of video (#9)
2020-10-28 19:12:41 +00:00
Ilya Kreymer
ae9aba7a00 set default newContext to page as before 2020-10-28 18:19:27 +00:00
Ilya Kreymer
8ceabce0e9 update to warc2zim 1.3.0 2020-10-28 18:15:30 +00:00
Ilya Kreymer
a425cd6956 - add 'newContext' command line option to specify the context for each new url: new page, new session, or new browser
- convert the scope option to be a regex instead of just prefix
- remove custom wabac.js, now using released version in warc2zim
2020-10-27 18:00:44 +00:00
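Commit a425cd6956 changes the scope option from a plain prefix to a regex. The difference can be illustrated with a minimal Python sketch. The function names, the example URLs, and the choice of `re.search` are assumptions made for illustration; the crawler is JavaScript and its actual matching code is not shown in this log.

```python
import re


def in_scope_prefix(url: str, prefix: str) -> bool:
    """Old-style check: the URL must start with a fixed prefix."""
    return url.startswith(prefix)


def in_scope_regex(url: str, scope: str) -> bool:
    """Regex-style check: the URL must match the scope pattern somewhere.

    Anchor the pattern with ^ to get prefix-like behaviour.
    """
    return re.search(scope, url) is not None


# Hypothetical values: one prefix cannot cover both /docs/ and /blog/,
# while a single regex can.
PREFIX = "https://example.com/docs/"
SCOPE = r"^https?://example\.com/(docs|blog)/"
```

A regex scope is a superset of the prefix behaviour: any prefix can be expressed as an anchored, escaped pattern, while alternations such as `(docs|blog)` have no prefix equivalent.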
Ilya Kreymer
91fe76c56e work on automated capture of video (#9)
- add autoplay behavior to reload known video sites to autoplay
- for video/audio on page, queue directly for loading if video.src or audio.src set to valid url, otherwise load through play in browser (may be slower)
- add extra wait if reloading for autoplay
- timeouts: set the puppeteer-cluster timeout to double the page timeout to avoid hitting the cluster timeout during regular operation
- use browser from oldwebtoday/chrome:84 and puppeteer-core instead of puppeteer browser for consistent results
- temp testing: use custom wabac.js sw for testing (will use default from warc2zim), using warc2zim fuzzy-match branch for now
2020-10-21 06:09:10 +00:00
rgaudin
c6f27f3bf6
Merge pull request #30 from openzim/fix-typo
cleanup: fix typo in print msg
2020-10-20 16:32:29 +00:00
Ilya Kreymer
ab4e2e1a14 cleanup: fix typo in print msg 2020-10-20 15:55:25 +00:00
rgaudin
d8e313c492
Merge pull request #29 from openzim/python-runner
Replace shell script with python zimit.py + crawl dedup improvements
2020-10-20 08:08:44 +00:00
Ilya Kreymer
904c95963c update to warc2zim 1.2.0, fixes from code review:
- pass warc directory to warc2zim, supported in 1.2.0
- use Path for temp_root_dir
- use seconds instead of millis for page timeout, update help text
- fix help text for --scope
- restrict waitUntil to valid choices
2020-10-19 19:44:01 +00:00
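Two of the review fixes in 904c95963c (a page timeout given in seconds, and `waitUntil` restricted to valid choices) can be sketched with `argparse`. This is an illustration under stated assumptions, not the code in zimit.py: the flag spellings, the defaults, and the helper `timeout_millis` are hypothetical. The four choices are the `waitUntil` values Puppeteer accepts for navigation.

```python
import argparse

# Values Puppeteer accepts for the waitUntil navigation option.
WAIT_UNTIL_CHOICES = ["load", "domcontentloaded", "networkidle0", "networkidle2"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the two options (flag names are assumptions)."""
    parser = argparse.ArgumentParser(prog="zimit-sketch")
    parser.add_argument(
        "--waitUntil",
        choices=WAIT_UNTIL_CHOICES,  # argparse rejects any other value
        default="load",
        help="page event to wait for before treating the page as loaded",
    )
    parser.add_argument(
        "--timeout",
        type=int,
        default=90,  # assumed default, in seconds
        help="page load timeout, in seconds",
    )
    return parser


def timeout_millis(args: argparse.Namespace) -> int:
    """Convert the user-facing seconds into the milliseconds a browser API expects."""
    return args.timeout * 1000
```

Letting `argparse` enforce `choices` means an invalid value fails at startup with a usage message rather than part-way through a crawl.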
renaud gaudin
fb2232d8b1 not using partial paths 2020-10-19 15:14:46 +00:00
Ilya Kreymer
2e2db2f352 simplification: remove zimit user, su, and run chrome as root with --no-sandbox
log exclusion regex
2020-10-16 21:04:10 +00:00
Ilya Kreymer
5b3101f2d8 add missing crawler.js! 2020-10-16 20:29:51 +00:00
Ilya Kreymer
d9ba7e246f add .bandit exclusion for codefactor 2020-10-16 19:21:37 +00:00
Ilya Kreymer
65b3b533b7 add .dockerignore to speed up docker build 2020-10-16 19:12:15 +00:00
Ilya Kreymer
2c1b401e93 fix codefactor complaints 2020-10-16 19:11:31 +00:00
Ilya Kreymer
c26fe5d4cd replace run.sh with python runner zimit.py, as suggested in #28
should fix arg parsing issues in #28,#18
warc2zim now called directly from zimit.py, both for arg check and for actual zim creation
crawler renamed to crawler.js, no longer handles zim creation, only crawling
add signal handling to both zimit and crawler.js for smooth shutdown, should fix #25
pywb: update to latest dev version with dedup support, add redis for deduplication
2020-10-16 18:54:04 +00:00
rgaudin
9f62f52a02
Merge pull request #27 from openzim/opts-tweak
more options tweaking:
2020-10-09 15:52:17 +00:00
Ilya Kreymer
9046f21f53 quote repeated params, add space 2020-10-09 15:29:07 +00:00
Ilya Kreymer
1198d1b6b9 correct parsing of repeated params, fixes #26, #18
switch to 'zimit' CMD instead of ENTRYPOINT (#21)
2020-10-09 15:19:20 +00:00
rgaudin
0a494ee168
Merge pull request #24 from openzim/run-script-cleanup
Run Script + Param Validation Cleanup
2020-10-09 09:29:59 +00:00
Ilya Kreymer
bd8d90efa5 quote params passed to warc2zim, should fix #18 2020-10-09 05:37:30 +00:00
Ilya Kreymer
e608cbd71e bump timeout to 90s, per suggestions in #20 2020-10-09 05:24:11 +00:00
Ilya Kreymer
9e9ec82ad1 param validation for warc2zim:
- ensure trailing slash is added #19
- better handling of boolean params #18
2020-10-09 04:42:46 +00:00
Ilya Kreymer
2b2f96d983 run script improvements:
- set permissions on volume dir to address #22
- propagate return code #21
- catch SIGTERM and SIGINT signals to clean up temp dir
2020-10-09 04:23:11 +00:00
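The cleanup-on-signal behaviour described in 2b2f96d983 was implemented in the shell run script. A rough Python equivalent, as a sketch only, might look like the following. The function names are hypothetical, and exiting with `128 + signum` is the usual shell convention, assumed here rather than taken from the commit.

```python
import shutil
import signal
import sys
from pathlib import Path


def make_cleanup_handler(temp_dir: Path):
    """Return a signal handler that removes temp_dir and then exits."""

    def handler(signum, frame):
        # Remove the temp dir; ignore errors so shutdown never fails here.
        shutil.rmtree(temp_dir, ignore_errors=True)
        # Assumed convention: exit status 128 + signal number.
        sys.exit(128 + signum)

    return handler


def install_cleanup(temp_dir: Path):
    """Register the handler for SIGTERM and SIGINT (main thread only)."""
    handler = make_cleanup_handler(temp_dir)
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, handler)
    return handler
```

Calling `install_cleanup(temp_dir)` once, right after the temp dir is created, covers both `docker stop` (SIGTERM) and Ctrl-C (SIGINT).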
renaud gaudin
736deccbb5 using tagged version for base image
so we have a known python version
so there are wheels suitable for the docker image (cp38)
2020-10-08 12:01:05 +00:00
renaud gaudin
901729a069 added link to /dev/shm info on readme 2020-10-07 13:56:21 +00:00
rgaudin
8842bf7048
Merge pull request #17 from openzim/config-opts
Config option improvements + warc2zim passing + SSL check disabled
2020-10-07 10:30:54 +00:00
renaud gaudin
1eee9ab633 added locales to image 2020-10-07 10:29:38 +00:00
Ilya Kreymer
880165ce91 fix Dockerfile per codefactor recommendations 2020-10-06 16:01:55 +00:00
Ilya Kreymer
94f0b7362d Merge README changes from master 2020-10-06 15:53:22 +00:00
Ilya Kreymer
3519d32ba6 disable https checking for fetch() head check (pywb already ignores https certs for capture), should fix #10 2020-10-06 15:49:45 +00:00
Ilya Kreymer
24c843c4af update to latest warc2zim (1.1.0) 2020-10-06 15:36:45 +00:00
Ilya Kreymer
daa2492655 config work: pass remaining config opts to warc2zim, fixes #13
warc2zim check: add runWarc2Zim() to test warc2zim opts before running for validity
run script: create temp dir in output dir to ensure all data is on the volume
run script: add --keep option to keep temp dir, otherwise delete
2020-10-06 06:25:40 +00:00
Ilya Kreymer
e4128c8183 add help text/validation for all config options, url now must be passed in with --url
add --scroll boolean option, which activates simple autoscroll behavior
use chrome user-agent for manual fetch
reenable pywb option
cleanup Dockerfile: update to warc2zim 1.0.1, install fonts-stix for math science sites
update README
2020-09-29 05:22:33 +00:00
Kelson
bb5b7e48c1
Additional README.md changes (#16) 2020-09-25 12:02:43 +02:00
rgaudin
252516e38c
Merge pull request #14 from openzim/kelson42-patch-1
Update README.md
2020-09-25 09:47:29 +00:00
Kelson
ac650bff05
Update README.md 2020-09-25 11:36:30 +02:00
rgaudin
01f2471ab8
Merge pull request #11 from openzim/develop
Initial prototype
2020-09-23 08:44:34 +00:00
renaud gaudin
71e94914aa Added gevent update to prevent segfault in uwsgi 2020-09-23 08:42:08 +00:00
Ilya Kreymer
6a925748d5 excludes: fix no exclude default 2020-09-22 18:12:15 +00:00
Ilya Kreymer
f25b390f15 add regex exclusions 2020-09-22 17:48:09 +00:00
Ilya Kreymer
f252245983 try using regular puppeteer, only copy deps from chrome image
pywb: increase uwsgi processes, disable autoindex/autofetch for better perf
2020-09-22 06:09:33 +00:00
Ilya Kreymer
b00c4262a7 add --limit param for max URLs to be captured
add 'html check', only load HTML in browsers, load other content-types directly via pywb, esp for PDFs (work on #8)
improved error handling
2020-09-21 07:16:26 +00:00
Ilya Kreymer
ff2773677c crawling: move checking logic to shouldCrawl, remove hashtag before checking seen list 2020-09-19 23:19:21 +00:00
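Removing the hashtag before consulting the seen list means URLs that differ only in their fragment count as one page. A minimal Python sketch of the idea follows; the real `shouldCrawl` is JavaScript and applies further checks that are not reproduced here, and the name `should_crawl` and its signature are hypothetical.

```python
from urllib.parse import urldefrag


def should_crawl(url: str, seen: set) -> bool:
    """Return True only the first time a URL, minus its #fragment, is seen."""
    key, _fragment = urldefrag(url)  # strip "#..." before the lookup
    if key in seen:
        return False
    seen.add(key)
    return True
```

With this check, `https://example.com/a#top` and `https://example.com/a#bottom` are queued once rather than twice.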
Ilya Kreymer
9b23de828b
Update README.md 2020-09-19 15:53:23 -07:00
Ilya Kreymer
4e04645e6b move warc2zim to be launched by node process 2020-09-19 22:47:19 +00:00
Ilya Kreymer
1de577bd78 use puppeteer-cluster for parallel crawling
use yargs to parse command-line args
2020-09-19 22:19:20 +00:00
Ilya Kreymer
7346527a81 initial setup - single url capture with existing browser image, pywb, puppeteer and warc2zim 2020-09-19 17:38:52 +00:00
rgaudin
bdfd9be399
Added LICENSE document 2020-09-01 10:22:32 +02:00
renaud gaudin
15cf636ff3 reset master branch for 2020 codebase 2020-08-19 09:36:48 +02:00