388 Commits

Author SHA1 Message Date
benoit74
d8e6d55f87
Release 2.0.0 v2.0.0 2024-06-03 19:59:04 +00:00
benoit74
ae6e5ffaf6
Merge pull request #309 from openzim/wait_until_choices
Fix `--waitUntil` crawler options
2024-06-03 17:17:34 +02:00
benoit74
59057bdbb1
Fix documentation about --waitUntil allowed values and drop choices checks
- add networkidle0, networkidle2 and drop networkidle to reflect crawler
  changes
- drop choices check since this is anyway checked right at scraper start
  in crawler startup (this ensures to be more permissive should one want
  to use a different crawler version than the one supported in Docker
  image)
2024-06-03 15:11:48 +00:00
benoit74
7806aeba63
Merge pull request #310 from openzim/invalid_user_agent
Strip user-agent whitespaces and ignore empty user agents
2024-06-03 17:11:16 +02:00
benoit74
936666917c
Strip user-agent leading whitespaces and ignore empty user agents 2024-06-03 15:06:39 +00:00
benoit74
957e52c57f
Rebuild with warc2zim 2.0.0-dev9 2024-05-30 09:29:48 +00:00
benoit74
36f2fa5f2b
Rebuild with warc2zim 2.0.0-dev8 2024-05-27 08:56:32 +00:00
benoit74
8fdad5954e
Bump Github CI Actions versions 2024-05-24 14:16:53 +00:00
benoit74
9e6c998816
Bump zimit to 2.0.0-dev5 + use warc2zim2 branch + remove zimit2 image workflow 2024-05-24 14:10:19 +00:00
benoit74
4cf6e01669
Bump browsertrix crawler to 1.1.3 2024-05-24 14:07:40 +00:00
benoit74
ce49a5d4e9
Merge branch 'zimit2' 2024-05-24 14:07:05 +00:00
benoit74
1d54b20873
Upgrade to warc2zim 2.0.0-dev6 2024-05-06 09:55:38 +00:00
benoit74
9a7415a402
Upgrade to Browsertrix Crawler 1.1.1
Continue to use warc2zim 2.0.0-dev5 for now, Docker build issue with new
stuff in warc2zim 2.0.0-dev6, will be fixed later on
2024-05-06 06:00:14 +00:00
benoit74
d54aa22bb2
Upgrade to Browsertrix Crawler 1.1.0 2024-04-19 12:30:53 +00:00
rgaudin
f637c3fccc
Merge pull request #292 from openzim/ua_not_mandatory
Change crawler default settings around userAgent and mobileDevice
2024-03-27 15:51:14 +00:00
benoit74
728784d6bf
Upgrade Browsertrix Crawler to 1.0.3 2024-03-27 15:08:59 +00:00
benoit74
e24479945f
Remove trailing characters when retrieving Browsertrix Crawler version 2024-03-27 15:08:58 +00:00
benoit74
3070fe9724
Rollback previous changes around the presence of a default user-agent
- Remove default userAgent value
- Set a default mobileDevice
- Add back comments explaining that userAgent overrides other settings
- Add back logic around the computation of the userAgentSuffix instead
  of the userAgent
- Add new noMobileDevice argument to not set the default mobileDevice
2024-03-27 15:08:58 +00:00
benoit74
54732692ac
Bump dev version 2024-03-07 12:47:38 +00:00
benoit74
867d14fd00
Merge pull request #285 from openzim/crawler_beta5
Upgrade browsertrix crawler and remove redirect handling
2024-03-07 11:25:02 +01:00
benoit74
5c716747b4
Add CHANGELOG 2024-03-07 10:16:57 +00:00
benoit74
456219deb3
Fix tests: there are in fact only 7 items to be pushed to the ZIM
7 entries are expected:
https://isago.rskg.org/
https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/css/bootstrap.min.css
https://isago.rskg.org/static/favicon256.png
https://isago.rskg.org/conseils
https://isago.rskg.org/faq
https://isago.rskg.org/a-propos
https://isago.rskg.org/static/tarifs-isago.pdf

1 unexpected entry is not produced anymore by Browsertrix crawler:
https://dict.brave.com/edgedl/chrome/dict/en-us-10-1.bdic

This was a technical artifact
2024-03-07 10:16:51 +00:00
benoit74
a9769b2871
Upgrade to crawler 1.0.0-beta6 2024-03-07 08:00:31 +00:00
benoit74
a4cb27a793
Fix clean_url method name 2024-03-07 07:59:41 +00:00
benoit74
4d31f8eabb
Remove handling of redirects which are now done by browsertrix crawler 2024-03-07 07:59:40 +00:00
benoit74
b69f3d610f
Upgrade to crawler 1.0.0-beta5 2024-03-07 07:59:40 +00:00
benoit74
c2dc8c5ccc
Merge pull request #286 from openzim/upgrade_deps
Upgrade to Python 3.12, upgrade Python dependencies and add hatch-openzim plugin
2024-03-04 11:23:42 +01:00
benoit74
857ae5674d
Upgrade to Python 3.12 2024-03-01 14:03:25 +00:00
benoit74
89aea6b41e
Adopt hatch-openzim plugin 2024-03-01 14:03:24 +00:00
benoit74
a44c1a7c7f
Upgrade dependencies 2024-03-01 14:03:24 +00:00
benoit74
6ca9be48c7
Empty commit to release warc2zim2 commit 3c00da0 2024-02-16 10:03:04 +01:00
benoit74
01c5833c29
Empty commit to release warc2zim2 commit f837179 2024-02-09 11:10:57 +01:00
rgaudin
7caa355c31
Merge pull request #277 from openzim/scraper_suffix
Pass scraper suffix to warc2zim
2024-02-05 13:45:13 +00:00
benoit74
49da57c5b6
fixup! Set zimit and browsertrix crawler versions in final ZIM 'Scraper' metadata 2024-02-05 14:33:38 +01:00
benoit74
9244f2e69c
Set zimit and browsertrix crawler versions in final ZIM 'Scraper' metadata 2024-01-31 15:10:08 +01:00
benoit74
ef462b5024
Empty commit to release warc2zim2 commit ae18aed 2024-01-26 16:34:26 +01:00
benoit74
f4359022b2
Merge pull request #274 from openzim/add_logging 2024-01-25 08:38:35 +01:00
benoit74
a505df9fe0
Add support for --logging parameter of browsertrix crawler 2024-01-23 17:28:56 +01:00
benoit74
343d0040cf
Merge pull request #272 from openzim/adopt_bootstrap 2024-01-22 10:41:29 +01:00
benoit74
c7fdc1d11e
Simplify logger name code 2024-01-22 10:38:25 +01:00
benoit74
c0ffb74d8c
Adopt Python bootstrap conventions 2024-01-18 13:31:00 +01:00
benoit74
343fb7e770
Replace warning about service workers by a nota bene about their removal since 2.x 2024-01-18 13:28:11 +01:00
benoit74
909b6e3da8
Merge branch 'main' into zimit2 2024-01-18 09:27:00 +01:00
benoit74
f46f2568ff
Prepare for next release 2024-01-18 09:16:18 +01:00
benoit74
19b4898326
Release 1.6.3 v1.6.3 2024-01-18 09:12:36 +01:00
benoit74
10471c1ea9
Merge pull request #269 from openzim/prepare_1_6_3 2024-01-18 09:10:04 +01:00
benoit74
eebf26f7cb
Upgrade to browsertrix crawler 0.12.4 and warc2zim 1.5.5 2024-01-18 09:05:06 +01:00
benoit74
27f9dcc53f
Empty commit to release warc2zim2 commit aca2db3 2024-01-15 17:45:56 +01:00
benoit74
22551388e0
Merge pull request #264 from openzim/use_warc2zim2 2024-01-15 08:30:32 +01:00
benoit74
a352c0c402
Add temporary Github Actions workflow to build zimit2 image 2024-01-15 08:06:50 +01:00