233 Commits

Author SHA1 Message Date
renaud gaudin
885e1763a1 updated CI test website URL 2022-06-13 09:57:37 +00:00
Kelson
80f3d3293f
Merge pull request #129 from openzim/release-badge
Release badge
2022-06-11 20:06:20 +02:00
Emmanuel Engelhart
0025901959
Replace Docker Hub build badge with CI badge 2022-06-11 11:56:18 +02:00
Emmanuel Engelhart
99f8fbafe1
Movebot does not exist anymore 2022-06-11 11:53:35 +02:00
Emmanuel Engelhart
3d3f4fb121
Add release tag 2022-06-11 11:52:48 +02:00
rgaudin
8bcd692462
Merge pull request #125 from JensKorte/patch-1
Update README.md
2022-05-30 22:07:10 +02:00
JensKorte
1f31d6c1a5
Update README.md
relative link didn't work and replaced by https://github.com/openzim/warc2zim
2022-05-30 21:45:18 +02:00
renaud gaudin
98587045b4 Updated readme: warc2zim params can be passed 2022-05-03 10:31:34 +00:00
renaud gaudin
efd8ca53b4 updating crawler and warc2zim v1.1.5 2021-06-10 14:14:11 +00:00
renaud gaudin
14ced5c481 fixed tests for new folder structure 2021-05-12 17:15:19 +00:00
renaud gaudin
2e9c129523 new crawler folder structure v1.1.4 v.1.14 2021-05-12 17:03:48 +00:00
renaud gaudin
03abf6050a updated warc2zim and browsertrix-crawler 2021-05-12 16:28:34 +00:00
renaud gaudin
f746f7b020 use same waitUntil defaults as current crawler 2021-03-04 10:40:12 +00:00
renaud gaudin
14fc8ffe0f released v1.1.3 v1.1.3 2021-03-01 09:59:34 +00:00
rgaudin
ae820472de
Merge pull request #85 from openzim/limit-hit
capture and incorporate limit info from crawl
2021-02-15 17:23:42 +00:00
renaud gaudin
cfa4b0e7f8 capture and incorporate limit info from crawl 2021-02-15 17:20:43 +00:00
renaud gaudin
964746481f using crawler 0.2.0 2021-02-15 17:15:54 +00:00
rgaudin
69892a215f
Merge pull request #84 from myt00seven/master
Update README.md with a --exclude example
2021-01-26 08:12:09 +00:00
lakesidethinks
6da4714cff Update README.md 2021-01-25 12:31:09 -06:00
renaud gaudin
d0d51539fe updated CHANGELOG 2021-01-15 12:59:00 +00:00
rgaudin
c3a7a02121
Merge pull request #80 from openzim/issue76
more flexible url redirects acceptance
2021-01-15 12:55:14 +00:00
renaud gaudin
76c92bdb4c Fixed #76: more flexible url redirects acceptance
- accepts redirects to same first-level domain
- accepts redirects matching scope
2021-01-15 12:50:53 +00:00
renaud gaudin
610ecc7e5c using docker publish v5 v1.1.2 2021-01-14 18:27:07 +00:00
rgaudin
a60f7a392f
Merge pull request #79 from openzim/custom-css
Add custom-css option support (warc2zim)
2021-01-14 18:24:26 +00:00
renaud gaudin
871f7ab58d Add custom-css option support (warc2zim) 2021-01-14 18:11:22 +00:00
rgaudin
e91cd7921e
Added domains blocklist (#77)
All domains from the 3 [anudeepND](https://github.com/anudeepND/blacklist) lists
are now blocked at local resolver level by updating /etc/hosts in entrypoint.

- this saves network and CPU resources by failing early.
- this is wanted in almost all cases
- can be bypassed by setting a blank entrypoint
2021-01-12 07:31:16 +01:00
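Blocking domains through `/etc/hosts` amounts to mapping each listed domain to an unroutable address so that lookups fail early. The sketch below only builds the text that would be appended to the hosts file. The function name `hosts_entries`, the sink address `0.0.0.0`, and the sample domains are assumptions, not details taken from the commit.

```python
def hosts_entries(domains, sink_ip: str = "0.0.0.0") -> str:
    """Build hosts-file lines pointing each domain at sink_ip (assumed value).

    Blank lines, comment lines and duplicate domains are skipped.
    """
    seen = set()
    lines = []
    for raw in domains:
        domain = raw.strip().lower()
        if not domain or domain.startswith("#") or domain in seen:
            continue
        seen.add(domain)
        lines.append(f"{sink_ip} {domain}")
    # Return an empty string when there is nothing to block.
    return "\n".join(lines) + "\n" if lines else ""
```

An entrypoint script could append this output to `/etc/hosts` before starting the crawl, which is why, per the commit message, overriding the entrypoint bypasses the blocklist.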
renaud gaudin
f4c11dc948 using published version of action 2020-12-22 15:48:12 +00:00
renaud gaudin
01302d3885 added package assignment 2020-12-22 11:15:51 +00:00
renaud gaudin
f72caad35c added Docker publish GA 2020-12-22 11:10:53 +00:00
renaud gaudin
71603f8a15 fixed version number in changelog 2020-12-22 11:09:41 +00:00
rgaudin
ff5c6b3dc9
Merge pull request #68 from openzim/github-bots
GitHub bots
2020-12-15 11:23:28 +00:00
Emmanuel Engelhart
0cb3db6f16 Add move/stale bots configuration 2020-12-15 12:19:21 +01:00
Ilya Kreymer
508286ef78
Update to latest version of browsertrix-crawler (0.1.4) (#66)
to add autofetch support for srcset (and also stylesheets)
should fix (#63)
v1.1.1
2020-12-14 09:36:41 +01:00
renaud gaudin
56d319ce3f added changelog v1.1 2020-12-14 08:13:54 +00:00
rgaudin
f6d44314cd
Fixed #58: updated README with limitations 2020-12-12 13:58:32 +00:00
rgaudin
eb5ca99bfb
Merge pull request #62 from openzim/progres
Enhanced --statsFilename support
v1.0
2020-12-10 10:50:18 +00:00
renaud gaudin
85fad62b61 Updated test to new stats files
- verify output of crawl, warc2zim and zimit file
- using a simpler tag for CI test image as to not confuse it with public image
2020-12-10 10:44:49 +00:00
renaud gaudin
3ffa34d46e Enhanced --statsFilename support
- `--statsFilename` to now represent overall zimit progress and not just crawling
- Exposing a simpler (`done`, `total`) json format for progress
- Live converting individual step's progress into this file
- using warc2zim 1.3.3 for its `--progress-file` support
- Currently arbitrarily assigning 90% to crawl and 10% to warc2zim
2020-12-10 10:44:39 +00:00
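The weighting described in this commit (90% for the crawl, 10% for warc2zim, exposed as a `done`/`total` pair) can be sketched as below. This is an illustration under stated assumptions rather than zimit's code: the names `step_fraction` and `overall_progress`, the per-step input dictionaries, and the normalized total of 100 are hypothetical.

```python
# Weights taken from the commit message; everything else is assumed.
CRAWL_WEIGHT = 0.9
WARC2ZIM_WEIGHT = 0.1


def step_fraction(done: int, total: int) -> float:
    """Completed fraction of one step, clamped to the range [0, 1]."""
    if total <= 0:
        return 0.0
    return max(0.0, min(1.0, done / total))


def overall_progress(crawl: dict, warc2zim: dict, scale: int = 100) -> dict:
    """Merge two per-step {"done", "total"} dicts into one overall dict."""
    fraction = (
        CRAWL_WEIGHT * step_fraction(crawl["done"], crawl["total"])
        + WARC2ZIM_WEIGHT * step_fraction(warc2zim["done"], warc2zim["total"])
    )
    return {"done": round(fraction * scale), "total": scale}
```

With these weights, a crawl that is half finished and a conversion that has not started would be reported as 45 out of 100.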
rgaudin
b9ed1d00a2
Merge pull request #60 from openzim/stats
stats: add support for stats output after every page crawled, fixes #39
2020-12-04 11:21:44 +00:00
Ilya Kreymer
5084c54af6 stats: add support for stats output after every page crawled, fixes #39
tests: integration test checks for stats.json
2020-12-02 16:28:25 +00:00
rgaudin
9422defe86
Merge pull request #54 from openzim/mobile-useragent
Mobile Device + User Agent Support
2020-11-16 11:14:52 +00:00
Ilya Kreymer
c0bb0503b8 add support for --useSitemap <url> flag to load additional URLs, potentially fixing #34!
reformat
2020-11-14 22:01:36 +00:00
Ilya Kreymer
a801a1eef6 ci: improve tests, validate all UA, and check for at least one found 2020-11-14 20:50:03 +00:00
Ilya Kreymer
4723376ebc ci: add --keep to keep warc files 2020-11-14 20:33:36 +00:00
Ilya Kreymer
5e4b3d80b3 ci: path fix 2020-11-14 20:30:15 +00:00
Ilya Kreymer
82f0fae959 update to warc2zim 1.3.2
fix ci test command
2020-11-14 20:27:43 +00:00
Ilya Kreymer
a930542af8 mobile + user agent support:
- add support for custom user agent suffix +Zimit with email address specifiable via --adminEmail cmd arg #38
- add ability to crawl as mobile device with --mobileDevice flag (default to iPhone X)
add integration tests runnable in docker via github actions
logging: print temp dir, flush print statements for immediate logging
2020-11-14 20:10:16 +00:00
rgaudin
0e3af5124b
Merge pull request #46 from openzim/crawler-split
Split zimit from webrecorder/browsertrix-crawler
2020-11-10 09:16:46 +00:00
renaud gaudin
0082d313ae Code formatting
- Added requests as a dependency (although currently brought in by warc2zim)
- removed unused imports
- black code formatting and some cleanup
- revamped actual_url fetching
2020-11-10 09:12:34 +00:00
renaud gaudin
568068ecfc WARC2zim version update
- updated to latest warc2zim release
- fixed param name typo in README
- added creation of `/output` so container can run on default params even if /output
is not a mounted volume
2020-11-10 08:26:56 +00:00