398 Commits

Author SHA1 Message Date
benoit74
de0720e301
Prepare for 2.0.3 2024-06-18 14:05:47 +00:00
benoit74
b73a3e04d0
Release 2.0.2 v2.0.2 2024-06-18 13:44:13 +00:00
benoit74
2f50db874d
Upgrade dependencies 2024-06-18 13:42:05 +00:00
benoit74
68f2ed14d6
Upgrade to warc2zim 2.0.2 2024-06-18 13:40:23 +00:00
benoit74
baa0d9ecc7
Prepare for next release 2024-06-13 11:42:17 +00:00
benoit74
2835c7b078
Release 2.0.1 v2.0.1 2024-06-13 11:32:13 +00:00
benoit74
e6a6560b85
Merge pull request #318 from openzim/upgrade_deps
Upgrade dependencies
2024-06-13 12:28:45 +02:00
benoit74
77747ec1d3
Upgrade dependencies 2024-06-13 10:26:04 +00:00
benoit74
c67ccb9528
Allow to run dev image update manually + use main warc2zim branch for zimit dev versions 2024-06-04 15:17:33 +00:00
benoit74
83690f410d
Prepare for 2.1.0 2024-06-04 15:14:43 +00:00
benoit74
d8e6d55f87
Release 2.0.0 v2.0.0 2024-06-03 19:59:04 +00:00
benoit74
ae6e5ffaf6
Merge pull request #309 from openzim/wait_until_choices
Fix `--waitUntil` crawler options
2024-06-03 17:17:34 +02:00
benoit74
59057bdbb1
Fix documentation about --waitUntil allowed values and drop choices checks
- add networkidle0, networkidle2 and drop networkidle to reflect crawler
  changes
- drop choices check since this is anyway checked right at scraper start
  in crawler startup (this is more permissive should one want
  to use a different crawler version than the one supported in the Docker
  image)
2024-06-03 15:11:48 +00:00
benoit74
7806aeba63
Merge pull request #310 from openzim/invalid_user_agent
Strip user-agent whitespaces and ignore empty user agents
2024-06-03 17:11:16 +02:00
benoit74
936666917c
Strip user-agent leading whitespaces and ignore empty user agents 2024-06-03 15:06:39 +00:00
benoit74
957e52c57f
Rebuild with warc2zim 2.0.0-dev9 2024-05-30 09:29:48 +00:00
benoit74
36f2fa5f2b
Rebuild with warc2zim 2.0.0-dev8 2024-05-27 08:56:32 +00:00
benoit74
8fdad5954e
Bump Github CI Actions versions 2024-05-24 14:16:53 +00:00
benoit74
9e6c998816
Bump zimit to 2.0.0-dev5 + use warc2zim2 branch + remove zimit2 image workflow 2024-05-24 14:10:19 +00:00
benoit74
4cf6e01669
Bump browsertrix crawler to 1.1.3 2024-05-24 14:07:40 +00:00
benoit74
ce49a5d4e9
Merge branch 'zimit2' 2024-05-24 14:07:05 +00:00
benoit74
1d54b20873
Upgrade to warc2zim 2.0.0-dev6 2024-05-06 09:55:38 +00:00
benoit74
9a7415a402
Upgrade to Browsertrix Crawler 1.1.1
Continue to use warc2zim 2.0.0-dev5 for now: there is a Docker build issue
with new stuff in warc2zim 2.0.0-dev6, which will be fixed later on
2024-05-06 06:00:14 +00:00
benoit74
d54aa22bb2
Upgrade to Browsertrix Crawler 1.1.0 2024-04-19 12:30:53 +00:00
rgaudin
f637c3fccc
Merge pull request #292 from openzim/ua_not_mandatory
Change crawler default settings around userAgent and mobileDevice
2024-03-27 15:51:14 +00:00
benoit74
728784d6bf
Upgrade Browsertrix Crawler to 1.0.3 2024-03-27 15:08:59 +00:00
benoit74
e24479945f
Remove trailing characters when retrieving Browsertrix Crawler version 2024-03-27 15:08:58 +00:00
benoit74
3070fe9724
Rollback previous changes around the presence of a default user-agent
- Remove default userAgent value
- Set a default mobileDevice
- Add back comments explaining that userAgent overrides other settings
- Add back logic around the computation of the userAgentSuffix instead
  of the userAgent
- Add new noMobileDevice argument to not set the default mobileDevice
2024-03-27 15:08:58 +00:00
benoit74
54732692ac
Bump dev version 2024-03-07 12:47:38 +00:00
benoit74
867d14fd00
Merge pull request #285 from openzim/crawler_beta5
Upgrade browsertrix crawler and remove redirect handling
2024-03-07 11:25:02 +01:00
benoit74
5c716747b4
Add CHANGELOG 2024-03-07 10:16:57 +00:00
benoit74
456219deb3
Fix tests: there are in fact only 7 items to be pushed to the ZIM
7 entries are expected:
https://isago.rskg.org/
https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/css/bootstrap.min.css
https://isago.rskg.org/static/favicon256.png
https://isago.rskg.org/conseils
https://isago.rskg.org/faq
https://isago.rskg.org/a-propos
https://isago.rskg.org/static/tarifs-isago.pdf

1 unexpected entry is not produced anymore by Browsertrix crawler:
https://dict.brave.com/edgedl/chrome/dict/en-us-10-1.bdic

This was a technical artifact
2024-03-07 10:16:51 +00:00
benoit74
a9769b2871
Upgrade to crawler 1.0.0-beta6 2024-03-07 08:00:31 +00:00
benoit74
a4cb27a793
Fix clean_url method name 2024-03-07 07:59:41 +00:00
benoit74
4d31f8eabb
Remove handling of redirects which are now done by browsertrix crawler 2024-03-07 07:59:40 +00:00
benoit74
b69f3d610f
Upgrade to crawler 1.0.0-beta5 2024-03-07 07:59:40 +00:00
benoit74
c2dc8c5ccc
Merge pull request #286 from openzim/upgrade_deps
Upgrade to Python 3.12, upgrade Python dependencies and add hatch-openzim plugin
2024-03-04 11:23:42 +01:00
benoit74
857ae5674d
Upgrade to Python 3.12 2024-03-01 14:03:25 +00:00
benoit74
89aea6b41e
Adopt hatch-openzim plugin 2024-03-01 14:03:24 +00:00
benoit74
a44c1a7c7f
Upgrade dependencies 2024-03-01 14:03:24 +00:00
benoit74
6ca9be48c7
Empty commit to release warc2zim2 commit 3c00da0 2024-02-16 10:03:04 +01:00
benoit74
01c5833c29
Empty commit to release warc2zim2 commit f837179 2024-02-09 11:10:57 +01:00
rgaudin
7caa355c31
Merge pull request #277 from openzim/scraper_suffix
Pass scraper suffix to warc2zim
2024-02-05 13:45:13 +00:00
benoit74
49da57c5b6
fixup! Set zimit and browsertrix crawler versions in final ZIM 'Scraper' metadata 2024-02-05 14:33:38 +01:00
benoit74
9244f2e69c
Set zimit and browsertrix crawler versions in final ZIM 'Scraper' metadata 2024-01-31 15:10:08 +01:00
benoit74
ef462b5024
Empty commit to release warc2zim2 commit ae18aed 2024-01-26 16:34:26 +01:00
benoit74
f4359022b2
Merge pull request #274 from openzim/add_logging 2024-01-25 08:38:35 +01:00
benoit74
a505df9fe0
Add support for --logging parameter of browsertrix crawler 2024-01-23 17:28:56 +01:00
benoit74
343d0040cf
Merge pull request #272 from openzim/adopt_bootstrap 2024-01-22 10:41:29 +01:00
benoit74
c7fdc1d11e
Simplify logger name code 2024-01-22 10:38:25 +01:00