162 Commits

Author SHA1 Message Date
renaud gaudin
76c92bdb4c Fixed #76: more flexible url redirects acceptance
- accepts redirects to same first-level domain
- accepts redirects matching scope
2021-01-15 12:50:53 +00:00
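A minimal sketch of the two acceptance rules listed in this commit, not the project's actual code. The function names are hypothetical. "First-level domain" is assumed here to mean the last two labels of the hostname, which is naive for suffixes such as `co.uk`. Applying the scope with `re.match` is also an assumption.

```python
import re
from urllib.parse import urlsplit


def first_level_domain(url):
    """Return the last two hostname labels (naive; ignores public suffixes)."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def redirect_is_acceptable(requested_url, final_url, scope=None):
    """Accept a redirect if it stays on the same domain or matches the scope."""
    # Rule 1: same first-level domain (e.g. example.com -> www.example.com)
    if first_level_domain(requested_url) == first_level_domain(final_url):
        return True
    # Rule 2: the final URL matches the crawl scope regex, when one is given
    if scope and re.match(scope, final_url):
        return True
    return False
```

Under these assumptions, a redirect from `http://example.com` to `https://www.example.com/home` is accepted, while a redirect to another domain is accepted only if it matches the supplied scope pattern.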
renaud gaudin
610ecc7e5c using docker publish v5 v1.1.2 2021-01-14 18:27:07 +00:00
rgaudin
a60f7a392f
Merge pull request #79 from openzim/custom-css
Add custom-css option support (warc2zim)
2021-01-14 18:24:26 +00:00
renaud gaudin
871f7ab58d Add custom-css option support (warc2zim) 2021-01-14 18:11:22 +00:00
rgaudin
e91cd7921e
Added domains blocklist (#77)
All domains from the 3 [anudeepND](https://github.com/anudeepND/blacklist) lists
are now blocked at local resolver level by updating /etc/hosts in entrypoint.

- this saves network and CPU resources by failing early.
- this is wanted in almost all cases
- can be bypassed by setting a blank entrypoint
2021-01-12 07:31:16 +01:00
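To illustrate how blocking at the resolver level works, here is a rough sketch that turns a list of domains into `/etc/hosts` lines. The commit performs this in the container entrypoint; the helper name `hosts_entries` and the `0.0.0.0` sink address are assumptions made for the example only.

```python
def hosts_entries(domains, sink_ip="0.0.0.0"):
    """Build hosts-file lines that resolve each blocked domain to sink_ip."""
    seen = set()
    lines = []
    for raw in domains:
        domain = raw.strip().lower()
        # Skip blanks, comment lines and duplicates
        if not domain or domain.startswith("#") or domain in seen:
            continue
        seen.add(domain)
        lines.append(f"{sink_ip} {domain}")
    return lines
```

Appending such lines to `/etc/hosts` makes lookups for those domains fail early, so no request to them leaves the container.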
renaud gaudin
f4c11dc948 using published version of action 2020-12-22 15:48:12 +00:00
renaud gaudin
01302d3885 added package assignment 2020-12-22 11:15:51 +00:00
renaud gaudin
f72caad35c added Docker publish GA 2020-12-22 11:10:53 +00:00
renaud gaudin
71603f8a15 fixed version number in changelog 2020-12-22 11:09:41 +00:00
rgaudin
ff5c6b3dc9
Merge pull request #68 from openzim/github-bots
GitHub bots
2020-12-15 11:23:28 +00:00
Emmanuel Engelhart
0cb3db6f16 Add move/stale bots configuration 2020-12-15 12:19:21 +01:00
Ilya Kreymer
508286ef78
Update to latest version of browsertrix-crawler (0.1.4) (#66)
to add autofetch support for srcset (and also stylesheets)
should fix (#63)
v1.1.1
2020-12-14 09:36:41 +01:00
renaud gaudin
56d319ce3f added changelog v1.1 2020-12-14 08:13:54 +00:00
rgaudin
f6d44314cd
Fixed #58: updated README with limitations 2020-12-12 13:58:32 +00:00
rgaudin
eb5ca99bfb
Merge pull request #62 from openzim/progres
Enhanced --statsFilename support
v1.0
2020-12-10 10:50:18 +00:00
renaud gaudin
85fad62b61 Updated test to new stats files
- verify output of crawl, warc2zim and zimit file
- using a simpler tag for CI test image so as not to confuse it with public image
2020-12-10 10:44:49 +00:00
renaud gaudin
3ffa34d46e Enhanced --statsFilename support
- `--statsFilename` now represents overall zimit progress and not just crawling
- Exposing a simpler (`done`, `total`) JSON format for progress
- Live converting each individual step's progress into this file
- using warc2zim 1.3.3 for its `--progress-file` support
- Currently arbitrarily assigning 90% to crawl and 10% to warc2zim
2020-12-10 10:44:39 +00:00
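The 90%/10% weighting described in this commit could be combined into a single (`done`, `total`) record roughly as follows. This is a sketch under assumptions: the function names, the fixed `total` of 100 and the rounding are illustrative and not taken from the zimit source.

```python
import json
from pathlib import Path

CRAWL_SHARE = 90      # percent of overall progress given to crawling
WARC2ZIM_SHARE = 10   # percent given to the warc2zim step


def overall_progress(crawl_done, crawl_total, warc2zim_done, warc2zim_total):
    """Merge per-step counters into one {"done", "total"} record out of 100."""

    def fraction(done, total):
        # A step that has not reported a total yet counts as 0% complete
        return min(done / total, 1.0) if total else 0.0

    done = round(
        CRAWL_SHARE * fraction(crawl_done, crawl_total)
        + WARC2ZIM_SHARE * fraction(warc2zim_done, warc2zim_total)
    )
    return {"done": done, "total": 100}


def write_progress(path, progress):
    """Write the progress record as JSON to the given file."""
    Path(path).write_text(json.dumps(progress))
```

With these weights, a crawl that is half finished and a warc2zim step that has not started are reported as 45 out of 100.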
rgaudin
b9ed1d00a2
Merge pull request #60 from openzim/stats
stats: add support for stats output after every page crawled, fixes #39
2020-12-04 11:21:44 +00:00
Ilya Kreymer
5084c54af6 stats: add support for stats output after every page crawled, fixes #39
tests: integration test checks for stats.json
2020-12-02 16:28:25 +00:00
rgaudin
9422defe86
Merge pull request #54 from openzim/mobile-useragent
Mobile Device + User Agent Support
2020-11-16 11:14:52 +00:00
Ilya Kreymer
c0bb0503b8 add support for --useSitemap <url> flag to load additional URLs, potentially fixing #34!
reformat
2020-11-14 22:01:36 +00:00
Ilya Kreymer
a801a1eef6 ci: improve tests, validate all UA, and check for at least one found 2020-11-14 20:50:03 +00:00
Ilya Kreymer
4723376ebc ci: add --keep to keep warc files 2020-11-14 20:33:36 +00:00
Ilya Kreymer
5e4b3d80b3 ci: path fix 2020-11-14 20:30:15 +00:00
Ilya Kreymer
82f0fae959 update to warc2zim 1.3.2
fix ci test command
2020-11-14 20:27:43 +00:00
Ilya Kreymer
a930542af8 mobile + user agent support:
- add support for custom user agent suffix +Zimit with email address specifiable via --adminEmail cmd arg #38
- add ability to crawl as mobile device with --mobileDevice flag (default to iPhone X)
add integration tests runnable in docker via github actions
logging: print temp dir, flush print statements for immediate logging
2020-11-14 20:10:16 +00:00
rgaudin
0e3af5124b
Merge pull request #46 from openzim/crawler-split
Split zimit from webrecorder/browsertrix-crawler
2020-11-10 09:16:46 +00:00
renaud gaudin
0082d313ae Code formatting
- Added requests as a dependency (although currently brought in by warc2zim)
- removed unused imports
- black code formatting and some cleanup
- revamped actual_url fetching
2020-11-10 09:12:34 +00:00
renaud gaudin
568068ecfc WARC2zim version update
- updated to latest warc2zim release
- fixed param name typo in README
- added creation of `/output` so container can run on default params even if /output
is not a mounted volume
2020-11-10 08:26:56 +00:00
Ilya Kreymer
989567e05e README: fix typos in example command 2020-11-10 06:10:12 +00:00
Ilya Kreymer
5b640f2f8b main page redirect check: check if specified URL is a redirect, and use final URL if it is. Reject if redirect goes to a different domain, as suggested in #42 2020-11-10 06:07:27 +00:00
Ilya Kreymer
88a280bc58 ci: add simple github action for building image, running crawl, verifying zim exists 2020-11-10 03:55:33 +00:00
Ilya Kreymer
6b5dbd20cb base image: use latest base image, warc2zim 2020-11-10 03:32:26 +00:00
Ilya Kreymer
c228c8300c split zimit from core browsertrix-crawler, which has been moved to https://github.com/webrecorder/browsertrix-crawler
use versioned browsertrix-crawler:0.1.0 image
part of #45
2020-11-03 17:21:54 +00:00
rgaudin
f6282dbf14
Merge pull request #36 from openzim/video-capture-work
work on automated capture of video (#9)
2020-10-28 19:12:41 +00:00
Ilya Kreymer
ae9aba7a00 set default newContext to page as before 2020-10-28 18:19:27 +00:00
Ilya Kreymer
8ceabce0e9 update to warc2zim 1.3.0 2020-10-28 18:15:30 +00:00
Ilya Kreymer
a425cd6956 - add 'newContext' command line option to specify the context for each new url: new page, new session, or new browser
- convert the scope option to be a regex instead of just prefix
- remove custom wabac.js, now using released version in warc2zim
2020-10-27 18:00:44 +00:00
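The difference between a prefix scope and a regex scope can be shown with a small sketch. The crawler itself is written in JavaScript; the Python helpers below are hypothetical and only illustrate the idea, and using `re.search` for the regex case is an assumption.

```python
import re


def in_scope_prefix(url, prefix):
    """Old behaviour (as described): URL must start with the scope string."""
    return url.startswith(prefix)


def in_scope_regex(url, pattern):
    """New behaviour (as described): URL must match the scope regex."""
    return re.search(pattern, url) is not None
```

A regex scope such as `^https://example\.com/(blog|news)/` covers several sections of a site at once, which a single prefix cannot express.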
Ilya Kreymer
91fe76c56e work on automated capture of video (#9)
- add autoplay behavior to reload known video sites to autoplay
- for video/audio on page, queue directly for loading if video.src or audio.src set to valid url, otherwise load through play in browser (may be slower)
- add extra wait if reloading for autoplay
- timeouts: set the puppeteer-cluster timeout to double the page timeout to avoid hitting that timeout during regular operation
- use browser from oldwebtoday/chrome:84 and puppeteer-core instead of puppeteer browser for consistent results
- temp testing: use custom wabac.js sw for testing (will use default from warc2zim), using warc2zim fuzzy-match branch for now
2020-10-21 06:09:10 +00:00
rgaudin
c6f27f3bf6
Merge pull request #30 from openzim/fix-typo
cleanup: fix typo in print msg
2020-10-20 16:32:29 +00:00
Ilya Kreymer
ab4e2e1a14 cleanup: fix typo in print msg 2020-10-20 15:55:25 +00:00
rgaudin
d8e313c492
Merge pull request #29 from openzim/python-runner
Replace shell script with python zimit.py + crawl dedup improvements
2020-10-20 08:08:44 +00:00
Ilya Kreymer
904c95963c update to warc2zim 1.2.0, fixes from code review:
- pass warc directory to warc2zim, supported in 1.2.0
- use Path for temp_root_dir
- use seconds instead of millis for page timeout, update help text
- fix help text for --scope
- restrict waitUntil to valid choices
2020-10-19 19:44:01 +00:00
renaud gaudin
fb2232d8b1 not using partial paths 2020-10-19 15:14:46 +00:00
Ilya Kreymer
2e2db2f352 simplification: remove zimit user, su, and run chrome as root with --no-sandbox
log exclusion regex
2020-10-16 21:04:10 +00:00
Ilya Kreymer
5b3101f2d8 add missing crawler.js! 2020-10-16 20:29:51 +00:00
Ilya Kreymer
d9ba7e246f add .bandit exclusion for codefactor 2020-10-16 19:21:37 +00:00
Ilya Kreymer
65b3b533b7 add .dockerignore to speed up docker build 2020-10-16 19:12:15 +00:00
Ilya Kreymer
2c1b401e93 fix codefactor complaints 2020-10-16 19:11:31 +00:00
Ilya Kreymer
c26fe5d4cd replace run.sh with python runner zimit.py, as suggested in #28
should fix arg parsing issues in #28,#18
warc2zim now called directly from zimit.py, both for arg check and for actual zim creation
crawler renamed to crawler.js, no longer handles zim creation, only crawling
add signal handling to both zimit and crawler.js for smooth shutdown, should fix #25
pywb: update to latest dev version with dedup support, add redis for deduplication
2020-10-16 18:54:04 +00:00