Changelog
All notable changes to this project are documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning (as of version 1.2.0).
[Unreleased]
[3.0.5] - 2025-04-11
Changed
- Upgrade to browsertrix crawler 1.6.0 (#493)
[3.0.4] - 2025-04-04
Changed
- Upgrade to browsertrix crawler 1.5.10 (#491)
[3.0.3] - 2025-02-28
Changed
- Upgrade to browsertrix crawler 1.5.7 (#483)
[3.0.2] - 2025-02-27
Changed
- Upgrade to browsertrix crawler 1.5.6 (#482)
[3.0.1] - 2025-02-24
Changed
- Upgrade to browsertrix crawler 1.5.4 (#476)
[3.0.0] - 2025-02-17
Changed
- Change solution to report partial ZIM to the Zimfarm and other clients (#304)
- Keep temporary folder when crawler or warc2zim fails, even if not asked for (#468)
- Add many missing Browsertrix Crawler arguments; drop default overrides by zimit; drop the --noMobileDevice setting (not needed anymore) (#433)
- Document all Browsertrix Crawler default argument values (#416)
- Use preferred Browsertrix Crawler argument names: (part of #471)
  - --seeds instead of --url
  - --seedFile instead of --urlFile
  - --pageLimit instead of --limit
  - --pageLoadTimeout instead of --timeout
  - --scopeIncludeRx instead of --include
  - --scopeExcludeRx instead of --exclude
  - --pageExtraDelay instead of --delay
- Remove confusion between zimit, warc2zim and crawler stats filenames (part of #471)
  - --statsFilename is now the crawler stats file (since it is the same name, just like other arguments)
  - --zimit-progress-file is now the zimit stats location
  - --warc2zim-progress-file is the warc2zim stats location
  - all are optional values; if not set and needed, temporary files are used
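For scripts that still call zimit with the pre-3.0.0 flag names, the renames listed in this release can be applied mechanically. The following is a minimal sketch, not part of zimit: `RENAMED_FLAGS` and `migrate_args` are hypothetical names introduced here, and the mapping contains only the pairs named in this entry. It does not handle the changed meaning of --statsFilename, which needs a manual review.

```python
# Old zimit flag -> preferred Browsertrix Crawler name, as listed in the 3.0.0 entry.
RENAMED_FLAGS = {
    "--url": "--seeds",
    "--urlFile": "--seedFile",
    "--limit": "--pageLimit",
    "--timeout": "--pageLoadTimeout",
    "--include": "--scopeIncludeRx",
    "--exclude": "--scopeExcludeRx",
    "--delay": "--pageExtraDelay",
}


def migrate_args(args):
    """Return a copy of args with pre-3.0.0 flag names replaced (hypothetical helper)."""
    migrated = []
    for arg in args:
        # Split on the first "=" so both "--flag value" and "--flag=value" are handled.
        name, sep, value = arg.partition("=")
        # Anything that is not a renamed flag (values, other flags) passes through unchanged.
        migrated.append(RENAMED_FLAGS.get(name, name) + sep + value)
    return migrated


if __name__ == "__main__":
    print(migrate_args(["--url", "https://example.com/", "--limit=10", "--keep"]))
```

Only exact flag names are rewritten, so URLs or regular expressions passed as values are left untouched even when they contain an equals sign.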
Fixed
- Do not create the ZIM when crawl is incomplete (#444)
[2.1.8] - 2025-02-07
Changed
- Upgrade to browsertrix crawler 1.5.1, Python 3.13 and others (#462 + #464)
[2.1.7] - 2025-01-10
Changed
- Upgrade to browsertrix crawler 1.4.2 (#450)
- Upgrade to warc2zim 2.2.0
[2.1.6] - 2024-11-07
Changed
- Upgrade to browsertrix crawler 1.3.5 (#426)
[2.1.5] - 2024-11-01
Changed
- Upgrade to browsertrix crawler 1.3.4 and warc2zim 2.1.3 (#424)
[2.1.4] - 2024-10-11
Changed
- Upgrade to browsertrix crawler 1.3.3 (#411)
[2.1.3] - 2024-10-08
Changed
- Upgrade to browsertrix crawler 1.3.2, warc2zim 2.1.2 and other dependencies (#406)
Fixed
- Fix help (#393)
[2.1.2] - 2024-09-09
Changed
- Upgrade to browsertrix crawler 1.3.0-beta.1 (#387) (fixes "Ziming a website with huge assets (e.g. PDFs) is failing to proceed" - #380)
[2.1.1] - 2024-09-05
Added
- Add support for uncompressed tar archive in --warcs (#369)
Changed
- Upgrade to browsertrix crawler 1.3.0-beta.0 (#379), including upgrade to Ubuntu Noble (#307)
Fixed
- Stream files downloads to not exhaust memory (#373)
- Fix documentation on --diskUtilization setting (#375)
[2.1.0] - 2024-08-09
Added
- Add --custom-behaviors argument to support path/HTTP(S) URL custom behaviors to pass to the crawler (#313)
- Add daily automated end-to-end tests of a page with Youtube player (#330)
- Add --warcs option to directly process WARC files (#301)
Changed
- Make it clear that --profile argument can be an HTTP(S) URL (and not only a path) (#288)
- Fix README imprecisions + add back warc2zim availability in docker image (#314)
- Enhance integration test to assert final content of the ZIM (#287)
- Stop fetching and passing browsertrix crawler version as scraperSuffix to warc2zim (#354)
- Do not log number of WARC files found (#357)
- Upgrade dependencies (warc2zim 2.1.0)
Fixed
- Sort WARC directories found by modification time (#366)
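As an illustration of the modification-time ordering mentioned in the fix above, here is a minimal sketch. It is not zimit's actual code: `find_warc_dirs` is a hypothetical helper, the `collections/crawl-*/archive` pattern is borrowed from the 1.2.0 entry of this changelog and may not match the current layout, and the oldest-first direction is an assumption since the entry does not state it.

```python
from pathlib import Path


def find_warc_dirs(root):
    """Return WARC archive directories under root, oldest first (hypothetical helper)."""
    # Assumed layout, taken from the 1.2.0 changelog entry: collections/crawl-*/archive/
    candidates = [
        path
        for path in Path(root).glob("collections/crawl-*/archive")
        if path.is_dir()
    ]
    # Sort on modification time rather than on name, so ordering does not depend
    # on how the crawl-* directory names happen to compare alphabetically.
    return sorted(candidates, key=lambda path: path.stat().st_mtime)
```

Sorting on `st_mtime` gives a stable processing order when several crawl directories exist under the same build folder.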
[2.0.6] - 2024-08-02
Changed
- Upgraded Browsertrix Crawler to 1.2.6
[2.0.5] - 2024-07-24
Changed
- Upgraded Browsertrix Crawler to 1.2.5
- Upgraded warc2zim to 2.0.3
[2.0.4] - 2024-07-15
Changed
- Upgraded Browsertrix Crawler to 1.2.4 (fixes "retrieve automatically the assets present in a data-xxx tag" - #316)
[2.0.3] - 2024-06-24
Changed
- Upgraded Browsertrix Crawler to 1.2.0 (fixes Youtube videos issue #323)
[2.0.2] - 2024-06-18
Changed
- Upgrade dependencies (mainly warc2zim 2.0.2)
[2.0.1] - 2024-06-13
Changed
- Upgrade dependencies (especially warc2zim 2.0.1 and browsertrix crawler 1.2.0-beta.0) (#318)
Fixed
- Crawler is not correctly checking disk size / usage (#305)
[2.0.0] - 2024-06-04
Added
- New --version flag to display Zimit version (#234)
- New --logging flag to adjust Browsertrix Crawler logging (#273)
- Use new --scraper-suffix flag of warc2zim to enhance ZIM "Scraper" metadata (#275)
- New --noMobileDevice CLI argument
- Publish Docker image for linux/arm64 (in addition to linux/amd64) (#178)
Changed
- Use warc2zim version 2, which works without Service Worker anymore (#193)
- Upgraded Browsertrix Crawler to 1.1.3
- Adopt Python bootstrap conventions
- Upgrade to Python 3.12 + upgrade dependencies
- Removed handling of redirects by zimit, they are handled by browsertrix crawler and detected properly by warc2zim (#284)
- Drop initial check of URL in Python (#256)
- --userAgent CLI argument overrides again the --userAgentSuffix and --adminEmail values
- --userAgent CLI argument is not mandatory anymore
Fixed
- Fix support for Youtube videos (#291)
- Fix crawler --waitUntil values (#289)
[1.6.3] - 2024-01-18
Changed
- Adapt to new warc2zim code structure
- Using browsertrix-crawler 0.12.4
- Using warc2zim 1.5.5
Added
- New --build parameter (optional) to specify the directory holding Browsertrix files; if not set, --output directory is used; zimit creates one subdir of this folder per invocation to isolate datasets; subdir is kept only if --keep is set.
Fixed
- --collection parameter was not working (#252)
[1.6.2] - 2023-11-17
Changed
- Using browsertrix-crawler 0.12.3
Fixed
- Fix logic passing args to crawler to support value '0' (#245)
- Fix documentation about Chrome and headless (#248)
[1.6.1] - 2023-11-06
Changed
- Using browsertrix-crawler 0.12.1
[1.6.0] - 2023-11-02
Changed
- Scraper fails for all HTTP error codes returned when checking URL at startup (#223)
- User-Agent now has a default value (#228)
- Manipulation of spaces with UA suffix and adminEmail has been modified
- Same User-Agent is used for check_url (Python) and Browsertrix crawler (#227)
- Using browsertrix-crawler 0.12.0
[1.5.3] - 2023-10-02
Changed
- Using browsertrix-crawler 0.11.2
[1.5.2] - 2023-09-19
Changed
- Using browsertrix-crawler 0.11.1
[1.5.1] - 2023-09-18
Changed
- Using browsertrix-crawler 0.11.0
- Scraper stat file is not created empty (#211)
- Crawler statistics are not available anymore (#213)
- Using warc2zim 1.5.4
[1.5.0] - 2023-08-23
Added
- --long-description param
[1.4.1] - 2023-08-23
Changed
- Using browsertrix-crawler 0.10.4
- Using warc2zim 1.5.3
[1.4.0] - 2023-08-02
Added
- --title to set ZIM title
- --description to set ZIM description
- New crawler options: --maxPageLimit, --delay, --diskUtilization
- --zim-lang param to set warc2zim's --lang (ISO-639-3)
Changed
- Using browsertrix-crawler 0.10.2
- Default and accepted values for --waitUntil from crawler's update
- Using warc2zim 1.5.2
- Disabled Chrome updates to prevent incidental inclusion of update data in WARC/ZIM (#172)
- --failOnFailedSeed used unconditionally
- --lang now passed to crawler (ISO-639-1)
Removed
- --newContext from crawler's update
[1.3.1] - 2023-02-06
Changed
- Using browsertrix-crawler 0.8.0
- Using warc2zim version 1.5.1 with wabac.js 2.15.2
[1.3.0] - 2023-02-02
Added
- Initial URL check normalizes homepage redirects to standard ports (80/443) (#137)
Changed
- Using warc2zim version 1.5.0 with scope conflict fix and videos fix
- Using browsertrix-crawler 0.8.0-beta.1
- Fixed --allowHashUrls being a boolean param
- Increased check_url timeout (12s to connect, 27s to read) instead of 10s
[1.2.0] - 2022-06-21
Added
- --urlFile browsertrix crawler parameter
- --depth browsertrix crawler parameter
- --extraHops parameter
- --collection browsertrix crawler parameter
- --allowHashUrls browsertrix crawler parameter
- --userAgentSuffix browsertrix crawler parameter
- --behaviors parameter
- --behaviorTimeout browsertrix crawler parameter
- --profile browsertrix crawler parameter
- --sizeLimit browsertrix crawler parameter
- --timeLimit browsertrix crawler parameter
- --healthCheckPort parameter
- --overwrite parameter
Changed
- using browsertrix-crawler 0.6.0 and warc2zim 1.4.2
- default WARC location after crawl changed from collections/capture-*/archive/ to collections/crawl-*/archive/
Removed
- --scroll browsertrix crawler parameter (see --behaviors)
- --scope browsertrix crawler parameter (see --scopeType, --include and --exclude)
[1.1.5]
- using crawler 0.3.2 and warc2zim 1.3.6
[1.1.4]
- Defaults to load,networkidle0 for waitUntil param (same as crawler)
- Allows setting combinations of values for waitUntil param
- Updated warc2zim to 1.3.5
- Updated browsertrix-crawler to 0.3.1
- Warc to zim now written to {temp_root_dir}/collections/capture-*/archive/ where capture-* is dynamic and includes the datetime. (from browsertrix-crawler)
[1.1.3]
- allows same first-level-domain redirects
- fixed redirects to URL in scope
- updated crawler to 0.2.0
- statsFilename now informs whether limit was hit or not
[1.1.2]
- added support for --custom-css
- added domains block list (default)
[1.1.1]
- updated browsertrix-crawler to 0.1.4
- autofetcher script to be injected by defaultDriver to capture srcsets + URLs in dynamically added stylesheets
[1.0]
- initial version using browsertrix-crawler:0.1.3 and warc2zim:1.3.3