docs: README overhaul

Marcus Holland-Moritz 2025-08-14 21:50:49 +02:00
parent d2a1c00f04
commit dbf84a290a

--- a/README.md
+++ b/README.md

@@ -10,7 +10,7 @@
 The **D**eduplicating **W**arp-speed **A**dvanced **R**ead-only **F**ile **S**ystem.
-A fast high compression read-only file system for Linux and Windows.
+A fast high-compression read-only file system for Linux and Windows.
 ## Table of contents
@@ -59,7 +59,7 @@ A fast high compression read-only file system for Linux and Windows.
 ![Linux Screen Capture](doc/screenshot.gif?raw=true "DwarFS Linux")
 DwarFS is a read-only file system with a focus on achieving **very
-high compression ratios** in particular for very redundant data.
+high compression ratios**, particularly for highly redundant data.
 This probably doesn't sound very exciting, because if it's redundant,
 it *should* compress well. However, I found that other read-only,
@@ -67,10 +67,10 @@ compressed file systems don't do a very good job at making use of
 this redundancy. See [here](#comparison) for a comparison with other
 compressed file systems.
-DwarFS also **doesn't compromise on speed** and for my use cases I've
-found it to be on par with or perform better than SquashFS. For my
-primary use case, **DwarFS compression is an order of magnitude better
-than SquashFS compression**, it's **6 times faster to build the file
+DwarFS also **doesn't compromise on speed**; in my use cases, it
+performs on par with, or better than, SquashFS. For my primary use
+case, **DwarFS compression is an order of magnitude better than
+SquashFS compression**, it's **6 times faster to build the file
 system**, it's typically faster to access files on DwarFS and it uses
 less CPU resources.
@@ -83,7 +83,7 @@ So there's redundancy in both the video and audio data, but as the streams
 are interleaved and identical blocks are typically very far apart, it's
 challenging to make use of that redundancy for compression. SquashFS
 essentially fails to compress the source data at all, whereas DwarFS is
-able to reduce the size by almost a factor of 3, which is close to the
+able to reduce the size to nearly one-third, which is close to the
 theoretical maximum:
 ```
@@ -177,21 +177,25 @@ some rudimentary docs as well.
 ### Note to Package Maintainers
 DwarFS should usually build fine with minimal changes out of the box.
-If it doesn't, please file a issue. I've set up
-[CI jobs](https://github.com/mhx/dwarfs/actions/workflows/build.yml)
-using Docker images for Ubuntu ([22.04](https://github.com/mhx/dwarfs/blob/main/.docker/Dockerfile.ubuntu-2204)
-and [24.04](https://github.com/mhx/dwarfs/blob/main/.docker/Dockerfile.ubuntu)),
-[Fedora Rawhide](https://github.com/mhx/dwarfs/blob/main/.docker/Dockerfile.fedora)
-and [Arch](https://github.com/mhx/dwarfs/blob/main/.docker/Dockerfile.arch)
+If it doesn't, please file an issue. I've set up
+[CI jobs](actions/workflows/build.yml)
+using Docker images for Ubuntu ([22.04](.docker/Dockerfile.ubuntu-2204)
+and [24.04](.docker/Dockerfile.ubuntu)),
+[Fedora Rawhide](.docker/Dockerfile.fedora),
+[Arch Linux](.docker/Dockerfile.arch), and
+[Debian](.docker/Dockerfile.debian),
+as well as a setup script for [FreeBSD](.github/scripts/freebsd_setup_base.sh),
 that can help with determining an up-to-date set of dependencies.
 Note that building from the release tarball requires less dependencies
 than building from the git repository, notably the `ronn` tool as well
 as Python and the `mistletoe` Python module are not required when
-building from the release tarball.
+building from the release tarball. Also, the release tarball build
+doesn't require building the thrift compiler, which makes the build
+a lot faster.
 There are some things to be aware of:
-- There's a tendency to try and unbundle the [folly](https://github.com/facebook/folly/)
+- There's a tendency to try to unbundle the [folly](https://github.com/facebook/folly/)
   and [fbthrift](https://github.com/facebook/fbthrift) libraries that
   are included as submodules and are built along with DwarFS.
   While I agree with the sentiment, it's unfortunately a bad idea.
@@ -209,13 +213,13 @@ There are some things to be aware of:
   fbthrift headers are required to build against DwarFS' libraries.
 - Similar issues can arise when using a system-installed version
-  of GoogleTest. GoogleTest itself recommends that it is being
-  downloaded as part of the build. However, you can use the system
-  installed version by passing `-DPREFER_SYSTEM_GTEST=ON` to the
-  `cmake` call. Use at your own risk.
+  of GoogleTest. GoogleTest recommends downloading it as part of
+  the build. However, you can use the system-installed version by
+  passing `-DPREFER_SYSTEM_GTEST=ON` to the `cmake` call. Use at
+  your own risk.
 - For other bundled libraries (namely `fmt`, `parallel-hashmap`,
-  `range-v3`), the system installed version is used as long as it
+  `range-v3`), the system-installed version is used as long as it
   meets the minimum required version. Otherwise, the preferred
   version is fetched during the build.
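As an illustration of the GoogleTest option above, opting into the system-installed copy is just an extra cache variable on the configure step. This is a hedged sketch, not the project's documented build recipe: the `build` directory name is an arbitrary placeholder, and `-DPREFER_SYSTEM_GTEST=ON` is the only DwarFS-specific flag shown:

```sh
# Configure an out-of-source build ("build" is a placeholder directory name).
# -DPREFER_SYSTEM_GTEST=ON selects the system-installed GoogleTest instead of
# the copy GoogleTest recommends downloading as part of the build
# (use at your own risk, as noted above).
cmake -B build -S . -DPREFER_SYSTEM_GTEST=ON
cmake --build build
```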
@@ -233,18 +237,33 @@ In addition to the binary tarballs, there's a **universal binary**
 available for each architecture. These universal binaries contain
 *all* tools (`mkdwarfs`, `dwarfsck`, `dwarfsextract` and the `dwarfs`
 FUSE driver) in a single executable. These executables are compressed
-using [upx](https://github.com/upx/upx), so they are much smaller than
-the individual tools combined. However, it also means the binaries need
-to be decompressed each time they are run, which can have a significant
-overhead. If that is an issue, you can either stick to the "classic"
-individual binaries or you can decompress the universal binary, e.g.:
+using [upx](https://github.com/upx/upx) where possible, and using a
+custom self-extractor on all other platforms. This means they are much
+smaller than the individual tools combined. However, it also means the
+binaries need to be decompressed each time they are run, which can add
+significant overhead. If that is an issue, you can either stick to the
+"classic" individual binaries or you can decompress the universal binary.
+For upx compressed binaries, you can use:
 ```
-upx -d dwarfs-universal-0.7.0-Linux-aarch64
+$ upx -d dwarfs-universal-0.7.0-Linux-aarch64
 ```
-The universal binaries can be run through symbolic links named after
-the proper tool. e.g.:
+For the binaries that use the custom self-extractor, you can use:
+```
+$ ./dwarfs-universal-riscv64 --extract-wrapped-binary dwarfs-universal
+```
+Note that both self-extractors need at least Linux kernel 3.17 to work
+properly. If you want to use the FUSE driver, you'll need to install
+the fuse3 tools for your distribution. If you want to run the binaries
+on an older kernel, you can unpack the universal binary (unpacking does
+*not* require kernel 3.17). If you're stuck with fuse2, you must use the
+individual `dwarfs2` driver instead of the universal binary.
+You can run the universal binaries via symbolic links named after
+the tool. For example:
 ```
 $ ln -s dwarfs-universal-0.7.0-Linux-aarch64 mkdwarfs
@@ -289,10 +308,13 @@ space-efficient, memory-mappable and well defined format. It's also
 included as a submodule, and we only build the compiler and a very
 reduced library that contains just enough for DwarFS to work.
-Other than that, DwarFS really only depends on FUSE3 and on a set
-of compression libraries that Folly already depends on (namely
-[lz4](https://github.com/lz4/lz4), [zstd](https://github.com/facebook/zstd)
-and [liblzma](https://github.com/kobolabs/liblzma)).
+Beyond that, DwarFS depends on FUSE3 and a set of compression
+libraries (namely [lz4](https://github.com/lz4/lz4),
+[zstd](https://github.com/facebook/zstd),
+[brotli](https://github.com/google/brotli),
+[xz](https://github.com/tukaani-project/xz), and
+[flac](https://github.com/xiph/flac)). Except for `zstd`, these
+are all optional.
 The dependency on [googletest](https://github.com/google/googletest)
 will be automatically resolved if you build with tests.
@@ -392,7 +414,7 @@ $ ctest -j
 ```
 All binaries use [jemalloc](https://github.com/jemalloc/jemalloc)
-as a memory allocator by default, as it is typically uses much less
+as a memory allocator by default, as it typically uses much less
 system memory compared to the `glibc` or `tcmalloc` allocators.
 To disable the use of `jemalloc`, pass `-DUSE_JEMALLOC=0` on the
 `cmake` command line.
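A minimal sketch of such a configure call, using the `-DUSE_JEMALLOC=0` flag described above (the `build` directory name is a placeholder, not something the project prescribes):

```sh
# Configure with jemalloc disabled; the platform's default allocator
# (e.g. the glibc allocator) is used instead.
cmake -B build -S . -DUSE_JEMALLOC=0
cmake --build build
```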
@@ -484,10 +506,9 @@ pages using the `--man` option to each binary, e.g.:
 $ mkdwarfs --man
 ```
-The [dwarfs](doc/dwarfs.md) manual page also shows an example for setting
-up DwarFS with [overlayfs](https://www.kernel.org/doc/Documentation/filesystems/overlayfs.txt)
-in order to create a writable file system mount on top a read-only
-DwarFS image.
+The [dwarfs](doc/dwarfs.md) manual page also shows an example for setting up DwarFS
+with [overlayfs](https://www.kernel.org/doc/html/latest/filesystems/overlayfs.html)
+in order to create a writable file system mount on top of a read-only DwarFS image.
 A description of the DwarFS file system format can be found in
 [dwarfs-format](doc/dwarfs-format.md).
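As a rough sketch of the overlayfs setup mentioned above, using standard kernel overlayfs semantics (all image names and mount-point paths here are made-up placeholders; the dwarfs manual page has the authoritative example):

```sh
# Mount the DwarFS image read-only at a lower directory.
dwarfs image.dwarfs /mnt/ro

# overlayfs needs a writable upper directory and an empty work directory
# on the same writable file system.
mkdir -p /tmp/ovl/upper /tmp/ovl/work /mnt/rw

# Stack a writable view on top of the read-only image; writes go to
# the upper directory, reads fall through to the DwarFS mount.
sudo mount -t overlay overlay \
  -o lowerdir=/mnt/ro,upperdir=/tmp/ovl/upper,workdir=/tmp/ovl/work \
  /mnt/rw
```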
@@ -511,7 +532,7 @@ There are five individual libraries:
 - `dwarfs_reader` contains all code required to read data from a
   DwarFS image. The interfaces are defined in [`dwarfs/reader/`](include/dwarfs/reader).
-- `dwarfs_extractor` contains the ccode required to extract a DwarFS
+- `dwarfs_extractor` contains the code required to extract a DwarFS
   image using [`libarchive`](https://libarchive.org/). The interfaces
   are defined in [`dwarfs/utility/filesystem_extractor.h`](include/dwarfs/utility/filesystem_extractor.h).
@@ -549,9 +570,9 @@ There are a few things worth pointing out, though:
 - DwarFS supports both hardlinks and symlinks on Windows, just as it
   does on Linux. However, creating hardlinks and symlinks seems to
-  require admin privileges on Windows, so if you want to e.g. extract
-  a DwarFS image that contains links of some sort, you might run into
-  errors if you don't have the right privileges.
+  require admin privileges on Windows, so if, for example, you want to
+  extract a DwarFS image that contains links of some sort, you might
+  run into errors if you don't have the right privileges.
 - Due to a [problem](https://github.com/winfsp/winfsp/issues/454) in
   WinFsp, symlinks cannot currently point outside of the mounted file
@@ -593,7 +614,7 @@ You'll need to install:
   if it's not, you'll need to set `WINFSP_PATH` when running CMake via
   `cmake/win.bat`.
-Now you need to clone `vcpkg` and `dwarfs`:
+Clone `vcpkg` and `dwarfs`:
 ```
 > cd %HOMEPATH%
@@ -639,8 +660,8 @@ $ brew test dwarfs
 ```
 The macOS version of the DwarFS file system driver relies on the awesome
-[macFUSE](https://osxfuse.github.io/) project and is available from
-gromgit's [homebrew-fuse tap](https://github.com/gromgit/homebrew-fuse):
+[macFUSE](https://macfuse.io) project and is available via gromgit's
+[homebrew-fuse tap](https://github.com/gromgit/homebrew-fuse):
 ```
 $ brew tap gromgit/homebrew-fuse
@@ -652,7 +673,7 @@ $ brew install dwarfs-fuse-mac
 ### Astrophotography
 Astrophotography can generate huge amounts of raw image data. During a
-single night, it's not unlikely to end up with a few dozens of gigabytes
+single night, it's not unlikely to end up with a few dozen gigabytes
 of data. With most dedicated astrophotography cameras, this data ends up
 in the form of FITS images. These are usually uncompressed, don't compress
 very well with standard compression algorithms, and while there are certain
@@ -861,7 +882,7 @@ The source directory contained **1139 different Perl installations**
 from 284 distinct releases, a total of 47.65 GiB of data in 1,927,501
 files and 330,733 directories. The source directory was freshly
 unpacked from a tar archive to an XFS partition on a 970 EVO Plus 2TB
-NVME drive, so most of its contents were likely cached.
+NVMe drive, so most of its contents were likely cached.
 I'm using the same compression type and compression level for
 SquashFS that is the default setting for DwarFS:
@@ -1959,7 +1980,7 @@ $ ls -l perl-install-small.*fs
 I noticed that the `blockifying` step that took ages for the full dataset
 with `mkcromfs` ran substantially faster (in terms of MiB/second) on the
 smaller dataset, which makes me wonder if there's some quadratic complexity
-behaviour that's slowing down `mkcromfs`.
+behavior that's slowing down `mkcromfs`.
 In order to be completely fair, I also ran `mkdwarfs` with `-l 9` to enable
 LZMA compression (which is what `mkcromfs` uses by default):
@@ -2017,8 +2038,8 @@ it crashed right upon trying to list the directory after mounting.
 ### With EROFS
-[EROFS](https://github.com/erofs/erofs-utils) is a read-only compressed
-file system that has been added to the Linux kernel recently.
+[EROFS](https://github.com/erofs/erofs-utils) is another read-only
+compressed file system included in the Linux kernel.
 Its goals are different from those of DwarFS, though. It is designed to
 be lightweight (which DwarFS is definitely not) and to run on constrained
 hardware like embedded devices or smartphones. It is not designed to provide
@@ -2251,7 +2272,7 @@ sys 0m0.610s
 ```
 Turns out that `tar --zstd` is easily winning the compression speed
-test. Looking at the file sizes did actually blow my mind just a bit:
+test. Looking at the file sizes did genuinely surprise me:
 ```
 $ ll zerotest.* --sort=size
@@ -2526,11 +2547,15 @@ typically want to run on your "performance" cores.
 ### Specifying file system offset and size
-You can specify the byte offset at which the filesystem is located in the file using the `-o offset=N` option.
-This can be useful when mounting images where there is some preceding data before the filesystem or when mounting merged/concatenated images.
-When combined with the `-o imagesize=N` option you can mount merged filesystems, i.e. multiple filesystems stored in a single file.
+You can specify the byte offset at which the file system is located in the
+file using the `-o offset=N` option. This can be useful when mounting images
+where there is some preceding data before the file system or when mounting
+merged/concatenated images. When combined with the `-o imagesize=N` option
+you can mount merged file systems, i.e. multiple file systems stored in a
+single file.
 
-Here is an example, you have two filesystems concatenated into a single file and you want to mount both of them, you can achieve this by running
+Here is an example: you have two file systems concatenated into a single
+file and you want to mount both of them. You can achieve this by running:
 ```sh
 dwarfs merged.dwarfs /mnt/fs1 -o imagesize=9231
 dwarfs merged.dwarfs /mnt/fs2 -o offset=9231,imagesize=7999