mirror of
https://github.com/TecharoHQ/anubis.git
synced 2025-08-03 01:38:14 -04:00
395 lines
15 KiB
Plaintext
395 lines
15 KiB
Plaintext
---
|
|
title: Policy Definitions
|
|
---
|
|
|
|
import Tabs from "@theme/Tabs";
|
|
import TabItem from "@theme/TabItem";
|
|
|
|
Out of the box, Anubis is pretty heavy-handed. It will aggressively challenge everything that might be a browser (usually indicated by having `Mozilla` in its user agent). However, some bots are smart enough to get past the challenge. Some things that look like bots may actually be fine (IE: RSS readers). Some resources need to be visible no matter what. Some resources and remotes are fine to begin with.
|
|
|
|
Bot policies let you customize the rules that Anubis uses to allow, deny, or challenge incoming requests. Currently you can set policies by the following matches:
|
|
|
|
- Request path
|
|
- User agent string
|
|
- HTTP request header values
|
|
- [Importing other configuration snippets](./configuration/import.mdx)
|
|
|
|
As of version v1.17.0 or later, configuration can be written in either JSON or YAML.
|
|
|
|
Here's an example rule that denies [Amazonbot](https://developer.amazon.com/en/amazonbot):
|
|
|
|
<Tabs>
|
|
<TabItem value="json" label="JSON" default>
|
|
|
|
```json
|
|
{
|
|
"name": "amazonbot",
|
|
"user_agent_regex": "Amazonbot",
|
|
"action": "DENY"
|
|
}
|
|
```
|
|
|
|
</TabItem>
|
|
<TabItem value="yaml" label="YAML">
|
|
|
|
```yaml
|
|
- name: amazonbot
|
|
user_agent_regex: Amazonbot
|
|
action: DENY
|
|
```
|
|
|
|
</TabItem>
|
|
</Tabs>
|
|
|
|
When this rule is evaluated, Anubis will check the `User-Agent` string of the request. If it contains `Amazonbot`, Anubis will send an error page to the user saying that access is denied, but in such a way that makes scrapers think they have correctly loaded the webpage.
|
|
|
|
Right now the only kinds of policies you can write are bot policies. Other forms of policies will be added in the future.
|
|
|
|
Here is a minimal policy file that will protect against most scraper bots:
|
|
|
|
<Tabs>
|
|
<TabItem value="json" label="JSON" default>
|
|
|
|
```json
|
|
{
|
|
"bots": [
|
|
{
|
|
"name": "cloudflare-workers",
|
|
"headers_regex": {
|
|
"CF-Worker": ".*"
|
|
},
|
|
"action": "DENY"
|
|
},
|
|
{
|
|
"name": "well-known",
|
|
"path_regex": "^/.well-known/.*$",
|
|
"action": "ALLOW"
|
|
},
|
|
{
|
|
"name": "favicon",
|
|
"path_regex": "^/favicon.ico$",
|
|
"action": "ALLOW"
|
|
},
|
|
{
|
|
"name": "robots-txt",
|
|
"path_regex": "^/robots.txt$",
|
|
"action": "ALLOW"
|
|
},
|
|
{
|
|
"name": "generic-browser",
|
|
"user_agent_regex": "Mozilla",
|
|
"action": "CHALLENGE"
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
</TabItem>
|
|
<TabItem value="yaml" label="YAML">
|
|
|
|
```yaml
|
|
bots:
|
|
- name: cloudflare-workers
|
|
headers_regex:
|
|
CF-Worker: .*
|
|
action: DENY
|
|
- name: well-known
|
|
path_regex: ^/.well-known/.*$
|
|
action: ALLOW
|
|
- name: favicon
|
|
path_regex: ^/favicon.ico$
|
|
action: ALLOW
|
|
- name: robots-txt
|
|
path_regex: ^/robots.txt$
|
|
action: ALLOW
|
|
- name: generic-browser
|
|
user_agent_regex: Mozilla
|
|
action: CHALLENGE
|
|
```
|
|
|
|
</TabItem>
|
|
</Tabs>
|
|
|
|
This allows requests to [`/.well-known`](https://en.wikipedia.org/wiki/Well-known_URI), `/favicon.ico`, `/robots.txt`, and challenges any request that has the word `Mozilla` in its User-Agent string. The [default policy file](https://github.com/TecharoHQ/anubis/blob/main/data/botPolicies.json) is a bit more cohesive, but this should be more than enough for most users.
|
|
|
|
If no rules match the request, it is allowed through. For more details on this default behavior and its implications, see [Default allow behavior](./default-allow-behavior.mdx).
|
|
|
|
## Writing your own rules
|
|
|
|
There are three actions that can be returned from a rule:
|
|
|
|
| Action | Effects |
|
|
| :---------- | :-------------------------------------------------------------------------------- |
|
|
| `ALLOW` | Bypass all further checks and send the request to the backend. |
|
|
| `DENY` | Deny the request and send back an error message that scrapers think is a success. |
|
|
| `CHALLENGE` | Show a challenge page and/or validate that clients have passed a challenge. |
|
|
|
|
Name your rules in lower case using kebab-case. Rule names will be exposed in Prometheus metrics.
|
|
|
|
### Challenge configuration
|
|
|
|
Rules can also have their own challenge settings. These are customized using the `"challenge"` key. For example, here is a rule that makes challenges artificially hard for connections with the substring "bot" in their user agent:
|
|
|
|
<Tabs>
|
|
<TabItem value="json" label="JSON" default>
|
|
|
|
This rule has been known to have a high false positive rate in testing. Please use this with care.
|
|
|
|
```json
|
|
{
|
|
"name": "generic-bot-catchall",
|
|
"user_agent_regex": "(?i:bot|crawler)",
|
|
"action": "CHALLENGE",
|
|
"challenge": {
|
|
"difficulty": 16,
|
|
"report_as": 4,
|
|
"algorithm": "slow"
|
|
}
|
|
}
|
|
```
|
|
|
|
</TabItem>
|
|
<TabItem value="yaml" label="YAML">
|
|
|
|
This rule has been known to have a high false positive rate in testing. Please use this with care.
|
|
|
|
```yaml
|
|
# Punish any bot with "bot" in the user-agent string
|
|
- name: generic-bot-catchall
|
|
user_agent_regex: (?i:bot|crawler)
|
|
action: CHALLENGE
|
|
challenge:
|
|
difficulty: 16 # impossible
|
|
report_as: 4 # lie to the operator
|
|
algorithm: slow # intentionally waste CPU cycles and time
|
|
```
|
|
|
|
</TabItem>
|
|
</Tabs>
|
|
|
|
Challenges can be configured with these settings:
|
|
|
|
| Key | Example | Description |
|
|
| :----------- | :------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
| `difficulty` | `4` | The challenge difficulty (number of leading zeros) for proof-of-work. See [Why does Anubis use Proof-of-Work?](/docs/design/why-proof-of-work) for more details. |
|
|
| `report_as` | `4` | What difficulty the UI should report to the user. Useful for messing with industrial-scale scraping efforts. |
|
|
| `algorithm` | `"fast"` | The algorithm used on the client to run proof-of-work calculations. This must be set to `"fast"` or `"slow"`. See [Proof-of-Work Algorithm Selection](./algorithm-selection) for more details. |
|
|
|
|
### Remote IP based filtering
|
|
|
|
The `remote_addresses` field of a Bot rule allows you to set the IP range that this ruleset applies to.
|
|
|
|
For example, you can allow a search engine to connect if and only if its IP address matches the ones they published:
|
|
|
|
<Tabs>
|
|
<TabItem value="json" label="JSON" default>
|
|
|
|
```json
|
|
{
|
|
"name": "qwantbot",
|
|
"user_agent_regex": "\\+https\\:\\/\\/help\\.qwant\\.com/bot/",
|
|
"action": "ALLOW",
|
|
"remote_addresses": ["91.242.162.0/24"]
|
|
}
|
|
```
|
|
|
|
</TabItem>
|
|
<TabItem value="yaml" label="YAML">
|
|
|
|
```yaml
|
|
- name: qwantbot
|
|
user_agent_regex: \+https\://help\.qwant\.com/bot/
|
|
action: ALLOW
|
|
# https://help.qwant.com/wp-content/uploads/sites/2/2025/01/qwantbot.json
|
|
remote_addresses: ["91.242.162.0/24"]
|
|
```
|
|
|
|
</TabItem>
|
|
</Tabs>
|
|
|
|
This also works at an IP range level without any other checks:
|
|
|
|
<Tabs>
|
|
<TabItem value="json" label="JSON" default>
|
|
|
|
```json
|
|
{
|
|
"name": "internal-network",
|
|
"action": "ALLOW",
|
|
"remote_addresses": ["100.64.0.0/10"]
|
|
}
|
|
```
|
|
|
|
</TabItem>
|
|
<TabItem value="yaml" label="YAML">
|
|
|
|
```yaml
|
|
name: internal-network
|
|
action: ALLOW
|
|
remote_addresses:
|
|
- 100.64.0.0/10
|
|
```
|
|
|
|
</TabItem>
|
|
</Tabs>
|
|
|
|
## Imprint / Impressum support
|
|
|
|
Anubis has support for showing imprint / impressum information. This is defined in the `impressum` block of your configuration. See [Imprint / Impressum configuration](./configuration/impressum.mdx) for more information.
|
|
|
|
## Storage backends
|
|
|
|
Anubis needs to store temporary data in order to determine if a user is legitimate or not. Administrators should choose a storage backend based on their infrastructure needs. Each backend has its own advantages and disadvantages.
|
|
|
|
Anubis offers the following storage backends:
|
|
|
|
- [`memory`](#memory) -- A simple in-memory hashmap
|
|
- [`bbolt`](#bbolt) -- An on-disk key/value store backed by [bbolt](https://github.com/etcd-io/bbolt), an embedded key/value database for Go programs
|
|
- [`valkey`](#valkey) -- A remote in-memory key/value database backed by [Valkey](https://valkey.io/) (or another database compatible with the [RESP](https://redis.io/docs/latest/develop/reference/protocol-spec/) protocol)
|
|
|
|
If no storage backend is set in the policy file, Anubis will use the [`memory`](#memory) backend by default. This is equivalent to the following in the policy file:
|
|
|
|
```yaml
|
|
store:
|
|
backend: memory
|
|
parameters: {}
|
|
```
|
|
|
|
### `memory`
|
|
|
|
The memory backend is an in-memory cache. This backend works best if you don't use multiple instances of Anubis or don't have mutable storage in the environment you're running Anubis in.
|
|
|
|
| Should I use this backend? | Yes/no |
|
|
| :------------------------------------------------------------ | :----- |
|
|
| Are you running only one instance of Anubis for this service? | ✅ Yes |
|
|
| Does your service get a lot of traffic? | 🚫 No |
|
|
| Do you want to store data persistently when Anubis restarts? | 🚫 No |
|
|
| Do you run Anubis without mutable filesystem storage? | ✅ Yes |
|
|
|
|
The biggest downside is that there is not currently a limit to how much data can be stored in memory. This will be addressed at a later time.
|
|
|
|
:::warning
|
|
|
|
The in-memory backend exists mostly for validation, testing, and to ensure that the default configuration of Anubis works as expected. Do not use this persistently in production.
|
|
|
|
:::
|
|
|
|
#### Configuration
|
|
|
|
The memory backend does not require any configuration to use.
|
|
|
|
### `bbolt`
|
|
|
|
An on-disk storage layer powered by [bbolt](https://github.com/etcd-io/bbolt), a high performance embedded key/value database used by containerd, etcd, Kubernetes, and NATS. This backend works best if you're running Anubis on a single host and get a lot of traffic.
|
|
|
|
| Should I use this backend? | Yes/no |
|
|
| :------------------------------------------------------------ | :----- |
|
|
| Are you running only one instance of Anubis for this service? | ✅ Yes |
|
|
| Does your service get a lot of traffic? | ✅ Yes |
|
|
| Do you want to store data persistently when Anubis restarts? | ✅ Yes |
|
|
| Do you run Anubis without mutable filesystem storage? | 🚫 No |
|
|
|
|
When Anubis opens a bbolt database, it takes an exclusive lock on that database. Other instances of Anubis or other tools cannot view the bbolt database while it is locked by another instance of Anubis. If you run multiple instances of Anubis for different services, give each its own `bbolt` configuration.
|
|
|
|
#### Configuration
|
|
|
|
The `bbolt` backend takes the following configuration options:
|
|
|
|
| Name | Type | Example | Description |
|
|
| :----- | :--- | :----------------- | :--------------------------------------------------------------------------------------------------------------------------- |
|
|
| `path` | path | `/data/anubis.bdb` | The filesystem path for the Anubis bbolt database. Anubis requires write access to the folder containing the bbolt database. |
|
|
|
|
Example:
|
|
|
|
If you have persistent storage mounted to `/data`, then your store configuration could look like this:
|
|
|
|
```yaml
|
|
store:
|
|
backend: bbolt
|
|
parameters:
|
|
path: /data/anubis.bdb
|
|
```
|
|
|
|
### `valkey`
|
|
|
|
[Valkey](https://valkey.io/) is an in-memory key/value store that clients access over the network. This allows multiple instances of Anubis to share information and does not require each instance of Anubis to have persistent filesystem storage.
|
|
|
|
:::note
|
|
|
|
You can also use [Redis](http://redis.io/) with Anubis.
|
|
|
|
:::
|
|
|
|
This backend is ideal if you are running multiple instances of Anubis in a worker pool (eg: Kubernetes Deployments with a copy of Anubis in each Pod).
|
|
|
|
| Should I use this backend? | Yes/no |
|
|
| :------------------------------------------------------------ | :----- |
|
|
| Are you running only one instance of Anubis for this service? | 🚫 No |
|
|
| Does your service get a lot of traffic? | ✅ Yes |
|
|
| Do you want to store data persistently when Anubis restarts? | ✅ Yes |
|
|
| Do you run Anubis without mutable filesystem storage? | ✅ Yes |
|
|
| Do you have Redis or Valkey installed? | ✅ Yes |
|
|
|
|
#### Configuration
|
|
|
|
The `valkey` backend takes the following configuration options:
|
|
|
|
| Name | Type | Example | Description |
|
|
| :---- | :----- | :---------------------- | :----------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
| `url` | string | `redis://valkey:6379/0` | The URL for the instance of Redis or Valkey that Anubis should store data in. This is in the same format as `REDIS_URL` in many cloud providers. |
|
|
|
|
Example:
|
|
|
|
If you have an instance of Valkey running with the hostname `valkey.int.techaro.lol`, then your store configuration could look like this:
|
|
|
|
```yaml
|
|
store:
|
|
backend: valkey
|
|
parameters:
|
|
url: "redis://valkey.int.techaro.lol:6379/0"
|
|
```
|
|
|
|
This would have the Valkey client connect to host `valkey.int.techaro.lol` on port `6379` with database `0` (the default database).
|
|
|
|
## Risk calculation for downstream services
|
|
|
|
In case your service needs it for risk calculation reasons, Anubis exposes information about the rules that any requests match using a few headers:
|
|
|
|
| Header | Explanation | Example |
|
|
| :---------------- | :--------------------------------------------------- | :--------------- |
|
|
| `X-Anubis-Rule` | The name of the rule that was matched | `bot/lightpanda` |
|
|
| `X-Anubis-Action` | The action that Anubis took in response to that rule | `CHALLENGE` |
|
|
| `X-Anubis-Status` | The status and how strict Anubis was in its checks | `PASS` |
|
|
|
|
Policy rules are matched using [Go's standard library regular expressions package](https://pkg.go.dev/regexp). You can mess around with the syntax at [regex101.com](https://regex101.com), make sure to select the Golang option.
|
|
|
|
## Request Weight
|
|
|
|
Anubis rules can also add or remove "weight" from requests, allowing administrators to configure custom levels of suspicion. For example, if your application uses session tokens named `i_love_gitea`:
|
|
|
|
```yaml
|
|
- name: gitea-session-token
|
|
action: WEIGH
|
|
expression:
|
|
all:
|
|
- '"Cookie" in headers'
|
|
- headers["Cookie"].contains("i_love_gitea=")
|
|
# Remove 5 weight points
|
|
weight:
|
|
adjust: -5
|
|
```
|
|
|
|
This would remove five weight points from the request, which would make Anubis present the [Meta Refresh challenge](./configuration/challenges/metarefresh.mdx) in the default configuration.
|
|
|
|
### Weight Thresholds
|
|
|
|
For more information on configuring weight thresholds, see [Weight Threshold Configuration](./configuration/thresholds.mdx)
|
|
|
|
### Advice
|
|
|
|
Weight is still very new and needs work. This is an experimental feature and should be treated as such. Here's some advice to help you better tune requests:
|
|
|
|
- The default weight for browser-like clients is 10. This triggers an aggressive challenge.
|
|
- Remove and add weight in multiples of five.
|
|
- Be careful with how you configure weight.
|