anubis/docs/docs/admin/policies.mdx
Xe Iaso d6e8bc9b33
docs: remove JSON examples from policy file docs
Signed-off-by: Xe Iaso <me@xeiaso.net>
2025-08-02 20:28:55 +00:00

281 lines
14 KiB
Plaintext

---
title: Policy Definitions
---
import Tabs from "@theme/Tabs";
import TabItem from "@theme/TabItem";
Out of the box, Anubis is pretty heavy-handed. It will aggressively challenge everything that might be a browser (usually indicated by having `Mozilla` in its user agent). However, some bots are smart enough to get past the challenge. Some things that look like bots may actually be fine (IE: RSS readers). Some resources need to be visible no matter what. Some resources and remotes are fine to begin with.
Anubis lets you customize its configuration with a Policy File. This is a YAML document that spells out what actions Anubis should take when evaluating requests. The [default configuration](https://github.com/TecharoHQ/anubis/blob/main/data/botPolicies.yaml) explains everything, but this page contains an overview of everything you can do with it.
## Bot Policies
Bot policies let you customize the rules that Anubis uses to allow, deny, or challenge incoming requests. Currently you can set policies by the following matches:
- Request path
- User agent string
- HTTP request header values
- [Importing other configuration snippets](./configuration/import.mdx)
As of version v1.17.0 or later, configuration can be written in either JSON or YAML.
Here's an example rule that denies [Amazonbot](https://developer.amazon.com/en/amazonbot):
```yaml
- name: amazonbot
user_agent_regex: Amazonbot
action: DENY
```
When this rule is evaluated, Anubis will check the `User-Agent` string of the request. If it contains `Amazonbot`, Anubis will send an error page to the user saying that access is denied, but in such a way that makes scrapers think they have correctly loaded the webpage.
Right now the only kinds of policies you can write are bot policies. Other forms of policies will be added in the future.
Here is a minimal policy file that will protect against most scraper bots:
```yaml
bots:
- name: cloudflare-workers
headers_regex:
CF-Worker: .*
action: DENY
- name: well-known
path_regex: ^/.well-known/.*$
action: ALLOW
- name: favicon
path_regex: ^/favicon.ico$
action: ALLOW
- name: robots-txt
path_regex: ^/robots.txt$
action: ALLOW
- name: generic-browser
user_agent_regex: Mozilla
action: CHALLENGE
```
This allows requests to [`/.well-known`](https://en.wikipedia.org/wiki/Well-known_URI), `/favicon.ico`, `/robots.txt`, and challenges any request that has the word `Mozilla` in its User-Agent string. The [default policy file](https://github.com/TecharoHQ/anubis/blob/main/data/botPolicies.json) is a bit more cohesive, but this should be more than enough for most users.
If no rules match the request, it is allowed through. For more details on this default behavior and its implications, see [Default allow behavior](./default-allow-behavior.mdx).
### Writing your own rules
There are four actions that can be returned from a rule:
| Action | Effects |
| :---------- | :---------------------------------------------------------------------------------------------------------------------------------- |
| `ALLOW` | Bypass all further checks and send the request to the backend. |
| `DENY` | Deny the request and send back an error message that scrapers think is a success. |
| `CHALLENGE` | Show a challenge page and/or validate that clients have passed a challenge. |
| `WEIGH` | Change the [request weight](#request-weight) for this request. See the [request weight](#request-weight) docs for more information. |
Name your rules in lower case using kebab-case. Rule names will be exposed in Prometheus metrics.
### Challenge configuration
Rules can also have their own challenge settings. These are customized using the `"challenge"` key. For example, here is a rule that makes challenges artificially hard for connections with the substring "bot" in their user agent:
This rule has been known to have a high false positive rate in testing. Please use this with care.
```yaml
# Punish any bot with "bot" in the user-agent string
- name: generic-bot-catchall
user_agent_regex: (?i:bot|crawler)
action: CHALLENGE
challenge:
difficulty: 16 # impossible
report_as: 4 # lie to the operator
algorithm: slow # intentionally waste CPU cycles and time
```
Challenges can be configured with these settings:
| Key | Example | Description |
| :----------- | :------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `difficulty` | `4` | The challenge difficulty (number of leading zeros) for proof-of-work. See [Why does Anubis use Proof-of-Work?](/docs/design/why-proof-of-work) for more details. |
| `report_as` | `4` | What difficulty the UI should report to the user. Useful for messing with industrial-scale scraping efforts. |
| `algorithm` | `"fast"` | The algorithm used on the client to run proof-of-work calculations. This must be set to `"fast"` or `"slow"`. See [Proof-of-Work Algorithm Selection](./algorithm-selection) for more details. |
### Remote IP based filtering
The `remote_addresses` field of a Bot rule allows you to set the IP range that this ruleset applies to.
For example, you can allow a search engine to connect if and only if its IP address matches the ones they published:
```yaml
- name: qwantbot
user_agent_regex: \+https\://help\.qwant\.com/bot/
action: ALLOW
# https://help.qwant.com/wp-content/uploads/sites/2/2025/01/qwantbot.json
remote_addresses: ["91.242.162.0/24"]
```
This also works at an IP range level without any other checks:
```yaml
name: internal-network
action: ALLOW
remote_addresses:
- 100.64.0.0/10
```
## Imprint / Impressum support
Anubis has support for showing imprint / impressum information. This is defined in the `impressum` block of your configuration. See [Imprint / Impressum configuration](./configuration/impressum.mdx) for more information.
## Storage backends
Anubis needs to store temporary data in order to determine if a user is legitimate or not. Administrators should choose a storage backend based on their infrastructure needs. Each backend has its own advantages and disadvantages.
Anubis offers the following storage backends:
- [`memory`](#memory) -- A simple in-memory hashmap
- [`bbolt`](#bbolt) -- An on-disk key/value store backed by [bbolt](https://github.com/etcd-io/bbolt), an embedded key/value database for Go programs
- [`valkey`](#valkey) -- A remote in-memory key/value database backed by [Valkey](https://valkey.io/) (or another database compatible with the [RESP](https://redis.io/docs/latest/develop/reference/protocol-spec/) protocol)
If no storage backend is set in the policy file, Anubis will use the [`memory`](#memory) backend by default. This is equivalent to the following in the policy file:
```yaml
store:
backend: memory
parameters: {}
```
### `memory`
The memory backend is an in-memory cache. This backend works best if you don't use multiple instances of Anubis or don't have mutable storage in the environment you're running Anubis in.
| Should I use this backend? | Yes/no |
| :------------------------------------------------------------ | :----- |
| Are you running only one instance of Anubis for this service? | ✅ Yes |
| Does your service get a lot of traffic? | 🚫 No |
| Do you want to store data persistently when Anubis restarts? | 🚫 No |
| Do you run Anubis without mutable filesystem storage? | ✅ Yes |
The biggest downside is that there is not currently a limit to how much data can be stored in memory. This will be addressed at a later time.
:::warning
The in-memory backend exists mostly for validation, testing, and to ensure that the default configuration of Anubis works as expected. Do not use this persistently in production.
:::
#### Configuration
The memory backend does not require any configuration to use.
### `bbolt`
An on-disk storage layer powered by [bbolt](https://github.com/etcd-io/bbolt), a high performance embedded key/value database used by containerd, etcd, Kubernetes, and NATS. This backend works best if you're running Anubis on a single host and get a lot of traffic.
| Should I use this backend? | Yes/no |
| :------------------------------------------------------------ | :----- |
| Are you running only one instance of Anubis for this service? | ✅ Yes |
| Does your service get a lot of traffic? | ✅ Yes |
| Do you want to store data persistently when Anubis restarts? | ✅ Yes |
| Do you run Anubis without mutable filesystem storage? | 🚫 No |
When Anubis opens a bbolt database, it takes an exclusive lock on that database. Other instances of Anubis or other tools cannot view the bbolt database while it is locked by another instance of Anubis. If you run multiple instances of Anubis for different services, give each its own `bbolt` configuration.
#### Configuration
The `bbolt` backend takes the following configuration options:
| Name | Type | Example | Description |
| :----- | :--- | :----------------- | :--------------------------------------------------------------------------------------------------------------------------- |
| `path` | path | `/data/anubis.bdb` | The filesystem path for the Anubis bbolt database. Anubis requires write access to the folder containing the bbolt database. |
Example:
If you have persistent storage mounted to `/data`, then your store configuration could look like this:
```yaml
store:
backend: bbolt
parameters:
path: /data/anubis.bdb
```
### `valkey`
[Valkey](https://valkey.io/) is an in-memory key/value store that clients access over the network. This allows multiple instances of Anubis to share information and does not require each instance of Anubis to have persistent filesystem storage.
:::note
You can also use [Redis](http://redis.io/) with Anubis.
:::
This backend is ideal if you are running multiple instances of Anubis in a worker pool (eg: Kubernetes Deployments with a copy of Anubis in each Pod).
| Should I use this backend? | Yes/no |
| :------------------------------------------------------------ | :----- |
| Are you running only one instance of Anubis for this service? | 🚫 No |
| Does your service get a lot of traffic? | ✅ Yes |
| Do you want to store data persistently when Anubis restarts? | ✅ Yes |
| Do you run Anubis without mutable filesystem storage? | ✅ Yes |
| Do you have Redis or Valkey installed? | ✅ Yes |
#### Configuration
The `valkey` backend takes the following configuration options:
| Name | Type | Example | Description |
| :---- | :----- | :---------------------- | :----------------------------------------------------------------------------------------------------------------------------------------------- |
| `url` | string | `redis://valkey:6379/0` | The URL for the instance of Redis or Valkey that Anubis should store data in. This is in the same format as `REDIS_URL` in many cloud providers. |
Example:
If you have an instance of Valkey running with the hostname `valkey.int.techaro.lol`, then your store configuration could look like this:
```yaml
store:
backend: valkey
parameters:
url: "redis://valkey.int.techaro.lol:6379/0"
```
This would have the Valkey client connect to host `valkey.int.techaro.lol` on port `6379` with database `0` (the default database).
## Risk calculation for downstream services
In case your service needs it for risk calculation reasons, Anubis exposes information about the rules that any requests match using a few headers:
| Header | Explanation | Example |
| :---------------- | :--------------------------------------------------- | :--------------- |
| `X-Anubis-Rule` | The name of the rule that was matched | `bot/lightpanda` |
| `X-Anubis-Action` | The action that Anubis took in response to that rule | `CHALLENGE` |
| `X-Anubis-Status` | The status and how strict Anubis was in its checks | `PASS` |
Policy rules are matched using [Go's standard library regular expressions package](https://pkg.go.dev/regexp). You can mess around with the syntax at [regex101.com](https://regex101.com), make sure to select the Golang option.
## Request Weight
Anubis rules can also add or remove "weight" from requests, allowing administrators to configure custom levels of suspicion. For example, if your application uses session tokens named `i_love_gitea`:
```yaml
- name: gitea-session-token
action: WEIGH
expression:
all:
- '"Cookie" in headers'
- headers["Cookie"].contains("i_love_gitea=")
# Remove 5 weight points
weight:
adjust: -5
```
This would remove five weight points from the request, which would make Anubis present the [Meta Refresh challenge](./configuration/challenges/metarefresh.mdx) in the default configuration.
### Weight Thresholds
For more information on configuring weight thresholds, see [Weight Threshold Configuration](./configuration/thresholds.mdx)
### Advice
Weight is still very new and needs work. This is an experimental feature and should be treated as such. Here's some advice to help you better tune requests:
- The default weight for browser-like clients is 10. This triggers an aggressive challenge.
- Remove and add weight in multiples of five.
- Be careful with how you configure weight.