rgaudin
f637c3fccc
Merge pull request #292 from openzim/ua_not_mandatory
...
Change crawler default settings around userAgent and mobileDevice
2024-03-27 15:51:14 +00:00
benoit74
728784d6bf
Upgrade Browsertrix Crawler to 1.0.3
2024-03-27 15:08:59 +00:00
benoit74
e24479945f
Remove trailing characters when retrieving Browsertrix Crawler version
2024-03-27 15:08:58 +00:00
benoit74
3070fe9724
Rollback previous changes around the presence of a default user-agent
...
- Remove default userAgent value
- Set a default mobileDevice
- Add back comments explaining that userAgent overrides other settings
- Add back logic around the computation of the userAgentSuffix instead
of the userAgent
- Add new noMobileDevice argument to not set the default mobileDevice
2024-03-27 15:08:58 +00:00
benoit74
54732692ac
Bump dev version
2024-03-07 12:47:38 +00:00
benoit74
867d14fd00
Merge pull request #285 from openzim/crawler_beta5
...
Upgrade browsertrix crawler and remove redirect handling
2024-03-07 11:25:02 +01:00
benoit74
5c716747b4
Add CHANGELOG
2024-03-07 10:16:57 +00:00
benoit74
456219deb3
Fix tests: there are in fact only 7 items to be pushed to the ZIM
...
7 entries are expected:
https://isago.rskg.org/
https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/css/bootstrap.min.css
https://isago.rskg.org/static/favicon256.png
https://isago.rskg.org/conseils
https://isago.rskg.org/faq
https://isago.rskg.org/a-propos
https://isago.rskg.org/static/tarifs-isago.pdf
1 unexpected entry is not produced anymore by Browsertrix crawler:
https://dict.brave.com/edgedl/chrome/dict/en-us-10-1.bdic
This was a technical artifact
2024-03-07 10:16:51 +00:00
benoit74
a9769b2871
Upgrade to crawler 1.0.0-beta6
2024-03-07 08:00:31 +00:00
benoit74
a4cb27a793
Fix clean_url method name
2024-03-07 07:59:41 +00:00
benoit74
4d31f8eabb
Remove handling of redirects, which is now done by Browsertrix Crawler
2024-03-07 07:59:40 +00:00
benoit74
b69f3d610f
Upgrade to crawler 1.0.0-beta5
2024-03-07 07:59:40 +00:00
benoit74
c2dc8c5ccc
Merge pull request #286 from openzim/upgrade_deps
...
Upgrade to Python 3.12, upgrade Python dependencies and add hatch-openzim plugin
2024-03-04 11:23:42 +01:00
benoit74
857ae5674d
Upgrade to Python 3.12
2024-03-01 14:03:25 +00:00
benoit74
89aea6b41e
Adopt hatch-openzim plugin
2024-03-01 14:03:24 +00:00
benoit74
a44c1a7c7f
Upgrade dependencies
2024-03-01 14:03:24 +00:00
benoit74
6ca9be48c7
Empty commit to release warc2zim2 commit 3c00da0
2024-02-16 10:03:04 +01:00
benoit74
01c5833c29
Empty commit to release warc2zim2 commit f837179
2024-02-09 11:10:57 +01:00
rgaudin
7caa355c31
Merge pull request #277 from openzim/scraper_suffix
...
Pass scraper suffix to warc2zim
2024-02-05 13:45:13 +00:00
benoit74
49da57c5b6
fixup! Set zimit and browsertrix crawler versions in final ZIM 'Scraper' metadata
2024-02-05 14:33:38 +01:00
benoit74
9244f2e69c
Set zimit and browsertrix crawler versions in final ZIM 'Scraper' metadata
2024-01-31 15:10:08 +01:00
benoit74
ef462b5024
Empty commit to release warc2zim2 commit ae18aed
2024-01-26 16:34:26 +01:00
benoit74
f4359022b2
Merge pull request #274 from openzim/add_logging
2024-01-25 08:38:35 +01:00
benoit74
a505df9fe0
Add support for --logging parameter of browsertrix crawler
2024-01-23 17:28:56 +01:00
benoit74
343d0040cf
Merge pull request #272 from openzim/adopt_bootstrap
2024-01-22 10:41:29 +01:00
benoit74
c7fdc1d11e
Simplify logger name code
2024-01-22 10:38:25 +01:00
benoit74
c0ffb74d8c
Adopt Python bootstrap conventions
2024-01-18 13:31:00 +01:00
benoit74
343fb7e770
Replace warning about service workers with a nota bene about their removal since 2.x
2024-01-18 13:28:11 +01:00
benoit74
909b6e3da8
Merge branch 'main' into zimit2
2024-01-18 09:27:00 +01:00
benoit74
f46f2568ff
Prepare for next release
2024-01-18 09:16:18 +01:00
benoit74
19b4898326
Release 1.6.3
v1.6.3
2024-01-18 09:12:36 +01:00
benoit74
10471c1ea9
Merge pull request #269 from openzim/prepare_1_6_3
2024-01-18 09:10:04 +01:00
benoit74
eebf26f7cb
Upgrade to browsertrix crawler 0.12.4 and warc2zim 1.5.5
2024-01-18 09:05:06 +01:00
benoit74
27f9dcc53f
Empty commit to release warc2zim2 commit aca2db3
2024-01-15 17:45:56 +01:00
benoit74
22551388e0
Merge pull request #264 from openzim/use_warc2zim2
2024-01-15 08:30:32 +01:00
benoit74
a352c0c402
Add temporary Github Actions workflow to build zimit2 image
2024-01-15 08:06:50 +01:00
benoit74
e034b08852
Update CHANGELOG
2024-01-15 08:06:50 +01:00
Matthieu Gautier
1c58bbe303
Adapt to warc2zim2 branch of warc2zim.
...
The `warc2zim2` branch creates ZIM files without a service worker.
2024-01-15 08:00:05 +01:00
benoit74
eab3d1f189
Merge pull request #262 from openzim/warc2zim_update
2024-01-15 07:59:05 +01:00
benoit74
bbc8a48bc9
Update CHANGELOG
2024-01-15 07:55:53 +01:00
Matthieu Gautier
7bc0ed9c02
Use main branch of warc2zim in dockerfile instead of released version.
...
This PR adapts to API changes made in the main branch of warc2zim, so we must
use it instead of the released version.
2024-01-14 10:32:52 +01:00
Matthieu Gautier
af0c93f1df
Update to new organization of warc2zim.
...
Older `warc2zim` method is now named `main`.
2024-01-12 12:17:35 +01:00
benoit74
cd6a55b179
Merge pull request #263 from openzim/cleanup
2024-01-08 17:13:26 +01:00
Matthieu Gautier
f80dbd11d9
Remove unwanted file.
...
Sounds like a vim mishandling.
2024-01-08 16:42:28 +01:00
rgaudin
a62f31ed0d
Merge pull request #254 from openzim/collections_param
...
Collections and temporary directory parameters
2023-11-23 14:50:35 +00:00
benoit74
d6c0c6ce63
Fixes following review + we need to create one subdir per run to not mix data / clean up correctly after run
2023-11-23 13:08:45 +01:00
benoit74
a2b4c71ec9
Display warc2zim call args
2023-11-23 09:02:33 +01:00
benoit74
b98e8f7027
Fix handling of '--collection' parameter + add '--tmp' + enhance logging
2023-11-23 09:02:08 +01:00
benoit74
79d5f8bc7b
Tidy code automatically
2023-11-23 08:50:59 +01:00
benoit74
216ac09d8c
Enhance .gitignore with toptal-generated one
2023-11-23 08:48:00 +01:00