renaud gaudin
14ced5c481
fixed tests for new folder structure
2021-05-12 17:15:19 +00:00
renaud gaudin
2e9c129523
new crawler folder structure
v1.1.4
v.1.14
2021-05-12 17:03:48 +00:00
renaud gaudin
03abf6050a
updated warc2zim and browsertrix-crawler
2021-05-12 16:28:34 +00:00
renaud gaudin
f746f7b020
use same waitUntil defaults as current crawler
2021-03-04 10:40:12 +00:00
renaud gaudin
14fc8ffe0f
released v1.1.3
v1.1.3
2021-03-01 09:59:34 +00:00
rgaudin
ae820472de
Merge pull request #85 from openzim/limit-hit
...
capture and incorporate limit info from crawl
2021-02-15 17:23:42 +00:00
renaud gaudin
cfa4b0e7f8
capture and incorporate limit info from crawl
2021-02-15 17:20:43 +00:00
renaud gaudin
964746481f
using crawler 0.2.0
2021-02-15 17:15:54 +00:00
rgaudin
69892a215f
Merge pull request #84 from myt00seven/master
...
Update README.md with a --exclude example
2021-01-26 08:12:09 +00:00
lakesidethinks
6da4714cff
Update README.md
2021-01-25 12:31:09 -06:00
renaud gaudin
d0d51539fe
updated CHANGELOG
2021-01-15 12:59:00 +00:00
rgaudin
c3a7a02121
Merge pull request #80 from openzim/issue76
...
more flexible url redirects acceptance
2021-01-15 12:55:14 +00:00
renaud gaudin
76c92bdb4c
Fixed #76 : more flexible url redirects acceptance
...
- accepts redirects to same first-level domain
- accepts redirects matching scope
2021-01-15 12:50:53 +00:00
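The two acceptance rules listed in commit 76c92bdb4c can be sketched as follows. This is a minimal illustration, not the code from the repository: the function names are hypothetical, "first-level domain" is assumed to mean the last two labels of the hostname, and the scope is assumed to be a regular expression.

```python
import re
from urllib.parse import urlsplit


def first_level_domain(url):
    """Return the last two hostname labels (assumed meaning of 'first-level domain')."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.lower().split(".")[-2:])


def redirect_is_acceptable(requested_url, final_url, scope=None):
    """Accept a redirect to the same first-level domain or to a URL matching the scope."""
    requested_domain = first_level_domain(requested_url)
    if requested_domain and requested_domain == first_level_domain(final_url):
        return True
    # Hypothetical scope check: scope is treated as a regex searched in the final URL.
    return scope is not None and re.search(scope, final_url) is not None
```

With this sketch, a redirect from `http://example.com/` to `https://www.example.com/home` is accepted, while a redirect to `https://other.org/` is accepted only if a scope matching that URL is supplied.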
renaud gaudin
610ecc7e5c
using docker publish v5
v1.1.2
2021-01-14 18:27:07 +00:00
rgaudin
a60f7a392f
Merge pull request #79 from openzim/custom-css
...
Add custom-css option support (warc2zim)
2021-01-14 18:24:26 +00:00
renaud gaudin
871f7ab58d
Add custom-css option support (warc2zim)
2021-01-14 18:11:22 +00:00
rgaudin
e91cd7921e
Added domains blocklist (#77)
...
All domains from the 3 [anudeepND](https://github.com/anudeepND/blacklist) lists
are now blocked at local resolver level by updating /etc/hosts in entrypoint.
- this saves network and CPU resources by failing early.
- this is wanted in almost all cases
- can be bypassed by setting a blank entrypoint
2021-01-12 07:31:16 +01:00
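Blocking at resolver level through /etc/hosts, as described in commit e91cd7921e, amounts to mapping each listed domain to an unroutable address. The sketch below only renders such lines. The function name and the `0.0.0.0` sink address are assumptions made for illustration; the commit message does not state which address the entrypoint writes.

```python
def hosts_entries(domains, sink_ip="0.0.0.0"):
    """Render one hosts-file line per unique, non-empty domain."""
    seen = set()
    lines = []
    for domain in domains:
        name = domain.strip().lower()
        # Skip blank entries and duplicates so the hosts file stays compact.
        if name and name not in seen:
            seen.add(name)
            lines.append(f"{sink_ip} {name}")
    return "\n".join(lines)
```

Appending the returned text to /etc/hosts makes lookups for those domains fail early, which is the resource saving the commit message refers to.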
renaud gaudin
f4c11dc948
using published version of action
2020-12-22 15:48:12 +00:00
renaud gaudin
01302d3885
added package assignment
2020-12-22 11:15:51 +00:00
renaud gaudin
f72caad35c
added Docker publish GA
2020-12-22 11:10:53 +00:00
renaud gaudin
71603f8a15
fixed version number in changelog
2020-12-22 11:09:41 +00:00
rgaudin
ff5c6b3dc9
Merge pull request #68 from openzim/github-bots
...
GitHub bots
2020-12-15 11:23:28 +00:00
Emmanuel Engelhart
0cb3db6f16
Add move/stale bots configuration
2020-12-15 12:19:21 +01:00
Ilya Kreymer
508286ef78
Update to latest version of browsertrix-crawler (0.1.4) (#66)
...
to add autofetch support for srcset (and also stylesheets)
should fix (#63)
v1.1.1
2020-12-14 09:36:41 +01:00
renaud gaudin
56d319ce3f
added changelog
v1.1
2020-12-14 08:13:54 +00:00
rgaudin
f6d44314cd
Fixed #58 : updated README with limitations
2020-12-12 13:58:32 +00:00
rgaudin
eb5ca99bfb
Merge pull request #62 from openzim/progres
...
Enhanced --statsFilename support
v1.0
2020-12-10 10:50:18 +00:00
renaud gaudin
85fad62b61
Updated test to new stats files
...
- verify output of crawl, warc2zim and zimit file
- using a simpler tag for the CI test image so as not to confuse it with the public image
2020-12-10 10:44:49 +00:00
renaud gaudin
3ffa34d46e
Enhanced --statsFilename support
...
- `--statsFilename` now represents overall zimit progress and not just crawling
- Exposing a simpler (`done`, `total`) json format for progress
- Live-converting each individual step's progress into this file
- using warc2zim 1.3.3 for its `--progress-file` support
- Currently arbitrarily assigning 90% to crawl and 10% to warc2zim
2020-12-10 10:44:39 +00:00
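The 90% / 10% split from commit 3ffa34d46e can be illustrated with a small sketch that folds the two steps into a single (`done`, `total`) pair. The function name, its parameters and the fixed `total` of 100 are assumptions for illustration only; the commit message does not specify how the values are scaled.

```python
CRAWL_SHARE = 0.9     # share of overall progress assigned to the crawl
WARC2ZIM_SHARE = 0.1  # share assigned to warc2zim


def overall_progress(crawl_done, crawl_total, warc2zim_done, warc2zim_total, scale=100):
    """Combine per-step progress into one {'done', 'total'} mapping."""

    def ratio(done, total):
        # A step with an unknown (zero) total counts as not started.
        return min(done / total, 1.0) if total else 0.0

    fraction = (
        CRAWL_SHARE * ratio(crawl_done, crawl_total)
        + WARC2ZIM_SHARE * ratio(warc2zim_done, warc2zim_total)
    )
    return {"done": round(fraction * scale), "total": scale}
```

For example, a crawl that is halfway through with warc2zim not yet started gives `{"done": 45, "total": 100}` under these assumptions.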
rgaudin
b9ed1d00a2
Merge pull request #60 from openzim/stats
...
stats: add support for stats output after every page crawled, fixes #39
2020-12-04 11:21:44 +00:00
Ilya Kreymer
5084c54af6
stats: add support for stats output after every page crawled, fixes #39
...
tests: integration test checks for stats.json
2020-12-02 16:28:25 +00:00
rgaudin
9422defe86
Merge pull request #54 from openzim/mobile-useragent
...
Mobile Device + User Agent Support
2020-11-16 11:14:52 +00:00
Ilya Kreymer
c0bb0503b8
add support for --useSitemap <url> flag to load additional URLs, potentially fixing #34!
...
reformat
2020-11-14 22:01:36 +00:00
Ilya Kreymer
a801a1eef6
ci: improve tests, validate all UA, and check for at least one found
2020-11-14 20:50:03 +00:00
Ilya Kreymer
4723376ebc
ci: add --keep to keep warc files
2020-11-14 20:33:36 +00:00
Ilya Kreymer
5e4b3d80b3
ci: path fix
2020-11-14 20:30:15 +00:00
Ilya Kreymer
82f0fae959
update to warc2zim 1.3.2
...
fix ci test command
2020-11-14 20:27:43 +00:00
Ilya Kreymer
a930542af8
mobile + user agent support:
...
- add support for custom user agent suffix +Zimit with email address specifiable via --adminEmail cmd arg #38
- add ability to crawl as mobile device with --mobileDevice flag (default to iPhone X)
add integration tests runnable in docker via github actions
logging: print temp dir, flush print statements for immediate logging
2020-11-14 20:10:16 +00:00
rgaudin
0e3af5124b
Merge pull request #46 from openzim/crawler-split
...
Split zimit from webrecorder/browsertrix-crawler
2020-11-10 09:16:46 +00:00
renaud gaudin
0082d313ae
Code formatting
...
- Added requests as a dependency (although currently brought in by warc2zim)
- removed unused imports
- black code formatting and some cleanup
- revamped actual_url fetching
2020-11-10 09:12:34 +00:00
renaud gaudin
568068ecfc
WARC2zim version update
...
- updated to latest warc2zim release
- fixed param name typo in README
- added creation of `/output` so container can run on default params even if /output
is not a mounted volume
2020-11-10 08:26:56 +00:00
Ilya Kreymer
989567e05e
README: fix typos in example command
2020-11-10 06:10:12 +00:00
Ilya Kreymer
5b640f2f8b
main page redirect check: check if specified URL is a redirect, and use final URL if it is. Reject if redirect goes to a different domain, as suggested in #42
2020-11-10 06:07:27 +00:00
Ilya Kreymer
88a280bc58
ci: add simple github action for building image, running crawl, verifying zim exists
2020-11-10 03:55:33 +00:00
Ilya Kreymer
6b5dbd20cb
base image: use latest base image, warc2zim
2020-11-10 03:32:26 +00:00
Ilya Kreymer
c228c8300c
split zimit from core browsertrix-crawler, which has been moved to https://github.com/webrecorder/browsertrix-crawler
...
use versioned browsertrix-crawler:0.1.0 image
part of #45
2020-11-03 17:21:54 +00:00
rgaudin
f6282dbf14
Merge pull request #36 from openzim/video-capture-work
...
work on automated capture of video (#9)
2020-10-28 19:12:41 +00:00
Ilya Kreymer
ae9aba7a00
set default newContext to page as before
2020-10-28 18:19:27 +00:00
Ilya Kreymer
8ceabce0e9
update to warc2zim 1.3.0
2020-10-28 18:15:30 +00:00
Ilya Kreymer
a425cd6956
- add 'newContext' command line option to specify the context for each new url: new page, new session, or new browser
...
- convert the scope option to be a regex instead of just prefix
- remove custom wabac.js, now using released version in warc2zim
2020-10-27 18:00:44 +00:00