renaud gaudin
885e1763a1
updated CI test website URL
2022-06-13 09:57:37 +00:00
Kelson
80f3d3293f
Merge pull request #129 from openzim/release-badge
...
Release badge
2022-06-11 20:06:20 +02:00
Emmanuel Engelhart
0025901959
Replace Docker Hub build badge with CI badge
2022-06-11 11:56:18 +02:00
Emmanuel Engelhart
99f8fbafe1
Movebot does not exist anymore
2022-06-11 11:53:35 +02:00
Emmanuel Engelhart
3d3f4fb121
Add release tag
2022-06-11 11:52:48 +02:00
rgaudin
8bcd692462
Merge pull request #125 from JensKorte/patch-1
...
Update README.md
2022-05-30 22:07:10 +02:00
JensKorte
1f31d6c1a5
Update README.md
...
relative link didn't work; replaced with https://github.com/openzim/warc2zim
2022-05-30 21:45:18 +02:00
renaud gaudin
98587045b4
Updated readme: warc2zim params can be passed
2022-05-03 10:31:34 +00:00
renaud gaudin
efd8ca53b4
updating crawler and warc2zim
v1.1.5
2021-06-10 14:14:11 +00:00
renaud gaudin
14ced5c481
fixed tests for new folder structure
2021-05-12 17:15:19 +00:00
renaud gaudin
2e9c129523
new crawler folder structure
v1.1.4
v.1.14
2021-05-12 17:03:48 +00:00
renaud gaudin
03abf6050a
updated warc2zim and browsertrix-crawler
2021-05-12 16:28:34 +00:00
renaud gaudin
f746f7b020
use same waitUntil defaults as current crawler
2021-03-04 10:40:12 +00:00
renaud gaudin
14fc8ffe0f
released v1.1.3
v1.1.3
2021-03-01 09:59:34 +00:00
rgaudin
ae820472de
Merge pull request #85 from openzim/limit-hit
...
capture and incorporate limit info from crawl
2021-02-15 17:23:42 +00:00
renaud gaudin
cfa4b0e7f8
capture and incorporate limit info from crawl
2021-02-15 17:20:43 +00:00
renaud gaudin
964746481f
using crawler 0.2.0
2021-02-15 17:15:54 +00:00
rgaudin
69892a215f
Merge pull request #84 from myt00seven/master
...
Update README.md with a --exclude example
2021-01-26 08:12:09 +00:00
lakesidethinks
6da4714cff
Update README.md
2021-01-25 12:31:09 -06:00
renaud gaudin
d0d51539fe
updated CHANGELOG
2021-01-15 12:59:00 +00:00
rgaudin
c3a7a02121
Merge pull request #80 from openzim/issue76
...
more flexible url redirects acceptance
2021-01-15 12:55:14 +00:00
renaud gaudin
76c92bdb4c
Fixed #76 : more flexible url redirects acceptance
...
- accepts redirects to same first-level domain
- accepts redirects matching scope
2021-01-15 12:50:53 +00:00
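The two acceptance rules in the commit above can be illustrated with a minimal sketch. It is not the project's actual code: the function names are hypothetical, "first-level domain" is assumed to mean the last two labels of the hostname (a naive reading that ignores public suffixes such as `co.uk`), and the scope is assumed to be a regular expression.

```python
import re
from urllib.parse import urlsplit


def base_domain(url):
    """Return the last two hostname labels (naive assumption, no public suffix list)."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def redirect_is_acceptable(requested, target, scope=None):
    """Hypothetical check mirroring the two rules listed in the commit message."""
    requested_domain = base_domain(requested)
    # Rule 1: redirect stays on the same first-level domain.
    if requested_domain and requested_domain == base_domain(target):
        return True
    # Rule 2: redirect target matches the crawl scope (assumed to be a regex).
    return bool(scope and re.search(scope, target))
```

With these assumptions, a redirect from `http://example.com/` to `https://www.example.com/home` is accepted, while one to `https://other.org/` is accepted only if the scope pattern matches it.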
renaud gaudin
610ecc7e5c
using docker publish v5
v1.1.2
2021-01-14 18:27:07 +00:00
rgaudin
a60f7a392f
Merge pull request #79 from openzim/custom-css
...
Add custom-css option support (warc2zim)
2021-01-14 18:24:26 +00:00
renaud gaudin
871f7ab58d
Add custom-css option support (warc2zim)
2021-01-14 18:11:22 +00:00
rgaudin
e91cd7921e
Added domains blocklist (#77)
...
All domains from the 3 [anudeepND](https://github.com/anudeepND/blacklist) lists
are now blocked at local resolver level by updating /etc/hosts in entrypoint.
- this saves network and CPU resources by failing early.
- this is wanted in almost all cases
- can be bypassed by setting a blank entrypoint
2021-01-12 07:31:16 +01:00
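As a rough illustration of blocking at the local resolver level, the sketch below turns a list of domain names into `/etc/hosts` style entries. It is only a sketch under assumptions: the function name is hypothetical, the input is assumed to be one plain domain per line, and `0.0.0.0` is an assumed sink address that the commit message does not specify. It does not write to `/etc/hosts` itself.

```python
def hosts_entries(domains, sink="0.0.0.0"):
    """Build hosts-file lines that point each listed domain at a sink address."""
    seen = set()
    lines = []
    for raw in domains:
        # Drop trailing comments and surrounding whitespace, normalize case.
        domain = raw.split("#", 1)[0].strip().lower()
        if not domain or domain in seen:
            continue  # skip blanks, comment-only lines and duplicates
        seen.add(domain)
        lines.append(f"{sink} {domain}")
    return lines
```

Appending such lines to the hosts file makes lookups for those domains resolve locally to the sink address, so requests fail early instead of using network and CPU resources.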
renaud gaudin
f4c11dc948
using published version of action
2020-12-22 15:48:12 +00:00
renaud gaudin
01302d3885
added package assignment
2020-12-22 11:15:51 +00:00
renaud gaudin
f72caad35c
added Docker publish GA
2020-12-22 11:10:53 +00:00
renaud gaudin
71603f8a15
fixed version number in changelog
2020-12-22 11:09:41 +00:00
rgaudin
ff5c6b3dc9
Merge pull request #68 from openzim/github-bots
...
GitHub bots
2020-12-15 11:23:28 +00:00
Emmanuel Engelhart
0cb3db6f16
Add move/stale bots configuration
2020-12-15 12:19:21 +01:00
Ilya Kreymer
508286ef78
Update to latest version of browsertrix-crawler (0.1.4) (#66)
...
to add autofetch support for srcset (and also stylesheets)
should fix (#63)
v1.1.1
2020-12-14 09:36:41 +01:00
renaud gaudin
56d319ce3f
added changelog
v1.1
2020-12-14 08:13:54 +00:00
rgaudin
f6d44314cd
Fixed #58 : updated README with limitations
2020-12-12 13:58:32 +00:00
rgaudin
eb5ca99bfb
Merge pull request #62 from openzim/progres
...
Enhanced --statsFilename support
v1.0
2020-12-10 10:50:18 +00:00
renaud gaudin
85fad62b61
Updated test to new stats files
...
- verify output of crawl, warc2zim and zimit file
- using a simpler tag for CI test image so as not to confuse it with public image
2020-12-10 10:44:49 +00:00
renaud gaudin
3ffa34d46e
Enhanced --statsFilename support
...
- `--statsFilename` now represents overall zimit progress and not just crawling
- Exposing a simpler (`done`, `total`) json format for progress
- Live converting each individual step's progress into this file
- using warc2zim 1.3.3 for its `--progress-file` support
- Currently arbitrarily assigning 90% to crawl and 10% to warc2zim
2020-12-10 10:44:39 +00:00
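The 90%/10% split described in the commit above can be sketched as follows. This is an illustrative sketch, not the project's implementation: the function name and its parameters are hypothetical, and only the (`done`, `total`) shape and the 90/10 weighting come from the commit message.

```python
import json

CRAWL_WEIGHT = 90    # share of overall progress assigned to crawling
CONVERT_WEIGHT = 10  # share of overall progress assigned to warc2zim


def overall_progress(crawl_done, crawl_total, convert_done=0, convert_total=0):
    """Merge per-step counters into a single (done, total) progress record."""

    def fraction(done, total):
        # A step with an unknown or zero total contributes nothing yet.
        if total <= 0:
            return 0.0
        return min(done / total, 1.0)

    done = (CRAWL_WEIGHT * fraction(crawl_done, crawl_total)
            + CONVERT_WEIGHT * fraction(convert_done, convert_total))
    return {"done": round(done), "total": CRAWL_WEIGHT + CONVERT_WEIGHT}


if __name__ == "__main__":
    # Halfway through the crawl, conversion not started.
    print(json.dumps(overall_progress(5, 10)))
```

Under this weighting, a finished crawl with conversion not yet started reports 90 out of 100, and the remaining 10 points are filled in as warc2zim advances.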
rgaudin
b9ed1d00a2
Merge pull request #60 from openzim/stats
...
stats: add support for stats output after every page crawled, fixes #39
2020-12-04 11:21:44 +00:00
Ilya Kreymer
5084c54af6
stats: add support for stats output after every page crawled, fixes #39
...
tests: integration test checks for stats.json
2020-12-02 16:28:25 +00:00
rgaudin
9422defe86
Merge pull request #54 from openzim/mobile-useragent
...
Mobile Device + User Agent Support
2020-11-16 11:14:52 +00:00
Ilya Kreymer
c0bb0503b8
add support for --useSitemap <url> flag to load additional URLs, potentially fixing #34!
...
reformat
2020-11-14 22:01:36 +00:00
Ilya Kreymer
a801a1eef6
ci: improve tests, validate all UA, and check for at least one found
2020-11-14 20:50:03 +00:00
Ilya Kreymer
4723376ebc
ci: add --keep to keep warc files
2020-11-14 20:33:36 +00:00
Ilya Kreymer
5e4b3d80b3
ci: path fix
2020-11-14 20:30:15 +00:00
Ilya Kreymer
82f0fae959
update to warc2zim 1.3.2
...
fix ci test command
2020-11-14 20:27:43 +00:00
Ilya Kreymer
a930542af8
mobile + user agent support:
...
- add support for custom user agent suffix +Zimit with email address specifiable via --adminEmail cmd arg #38
- add ability to crawl as mobile device with --mobileDevice flag (default to iPhone X)
add integration tests runnable in docker via github actions
logging: print temp dir, flush print statements for immediate logging
2020-11-14 20:10:16 +00:00
rgaudin
0e3af5124b
Merge pull request #46 from openzim/crawler-split
...
Split zimit from webrecorder/browsertrix-crawler
2020-11-10 09:16:46 +00:00
renaud gaudin
0082d313ae
Code formatting
...
- Added requests as a dependency (although currently brought in by warc2zim)
- removed unused imports
- black code formatting and some cleanup
- revamped actual_url fetching
2020-11-10 09:12:34 +00:00
renaud gaudin
568068ecfc
WARC2zim version update
...
- updated to latest warc2zim release
- fixed param name typo in README
- added creation of `/output` so container can run on default params even if /output
is not a mounted volume
2020-11-10 08:26:56 +00:00