---
title: Policy Definitions
---

import Tabs from "@theme/Tabs";
import TabItem from "@theme/TabItem";

Out of the box, Anubis is pretty heavy-handed. It will aggressively challenge everything that might be a browser (usually indicated by having `Mozilla` in its user agent). However, some bots are smart enough to get past the challenge. Some things that look like bots may actually be fine (e.g. RSS readers). Some resources need to be visible no matter what. Some resources and remotes are fine to begin with.

Bot policies let you customize the rules that Anubis uses to allow, deny, or challenge incoming requests. Currently, policies can match on the following:

- Request path
- User agent string
- HTTP request header values
- [Importing other configuration snippets](./configuration/import.mdx)

As of version v1.17.0, configuration can be written in either JSON or YAML.

Here's an example rule that denies [Amazonbot](https://developer.amazon.com/en/amazonbot):

<Tabs>
<TabItem value="json" label="JSON" default>

```json
{
  "name": "amazonbot",
  "user_agent_regex": "Amazonbot",
  "action": "DENY"
}
```

</TabItem>
<TabItem value="yaml" label="YAML">

```yaml
- name: amazonbot
  user_agent_regex: Amazonbot
  action: DENY
```

</TabItem>
</Tabs>

When this rule is evaluated, Anubis will check the `User-Agent` string of the request. If it contains `Amazonbot`, Anubis will send back an error page saying that access is denied, but in a way that makes scrapers think they have correctly loaded the webpage.

Right now, bot policies are the only kind of policy you can write. Other forms of policies will be added in the future.

Here is a minimal policy file that will protect against most scraper bots:

<Tabs>
<TabItem value="json" label="JSON" default>

```json
{
  "bots": [
    {
      "name": "cloudflare-workers",
      "headers_regex": {
        "CF-Worker": ".*"
      },
      "action": "DENY"
    },
    {
      "name": "well-known",
      "path_regex": "^/.well-known/.*$",
      "action": "ALLOW"
    },
    {
      "name": "favicon",
      "path_regex": "^/favicon.ico$",
      "action": "ALLOW"
    },
    {
      "name": "robots-txt",
      "path_regex": "^/robots.txt$",
      "action": "ALLOW"
    },
    {
      "name": "generic-browser",
      "user_agent_regex": "Mozilla",
      "action": "CHALLENGE"
    }
  ]
}
```

</TabItem>
<TabItem value="yaml" label="YAML">

```yaml
bots:
  - name: cloudflare-workers
    headers_regex:
      CF-Worker: .*
    action: DENY
  - name: well-known
    path_regex: ^/.well-known/.*$
    action: ALLOW
  - name: favicon
    path_regex: ^/favicon.ico$
    action: ALLOW
  - name: robots-txt
    path_regex: ^/robots.txt$
    action: ALLOW
  - name: generic-browser
    user_agent_regex: Mozilla
    action: CHALLENGE
```

</TabItem>
</Tabs>

This allows requests to [`/.well-known`](https://en.wikipedia.org/wiki/Well-known_URI), `/favicon.ico`, and `/robots.txt`, denies requests from Cloudflare Workers, and challenges any request that has the word `Mozilla` in its User-Agent string. The [default policy file](https://github.com/TecharoHQ/anubis/blob/main/data/botPolicies.json) is a bit more comprehensive, but this should be more than enough for most users.

If no rules match the request, it is allowed through. For more details on this default behavior and its implications, see [Default allow behavior](./default-allow-behavior.mdx).

## Writing your own rules

There are three actions that can be returned from a rule:

| Action      | Effects                                                                            |
| :---------- | :--------------------------------------------------------------------------------- |
| `ALLOW`     | Bypass all further checks and send the request to the backend.                     |
| `DENY`      | Deny the request and send back an error message that scrapers think is a success.  |
| `CHALLENGE` | Show a challenge page and/or validate that clients have passed a challenge.        |

Name your rules in lowercase, using kebab-case. Rule names are exposed in Prometheus metrics.

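For example, a hypothetical rule allowing a self-hosted RSS reader might look like this; the rule name (`rss-reader-allowlist`) is the exact string that will show up in your metrics:

```yaml
# Hypothetical rule: the name is what appears in Prometheus metrics.
# Adjust the user agent regex to match the reader you actually run.
- name: rss-reader-allowlist
  user_agent_regex: FreshRSS
  action: ALLOW
```
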
### Challenge configuration

Rules can also have their own challenge settings. These are customized using the `"challenge"` key. For example, here is a rule that makes challenges artificially hard for connections with the substring "bot" in their user agent:

<Tabs>
<TabItem value="json" label="JSON" default>

This rule has been known to have a high false positive rate in testing. Please use this with care.

```json
{
  "name": "generic-bot-catchall",
  "user_agent_regex": "(?i:bot|crawler)",
  "action": "CHALLENGE",
  "challenge": {
    "difficulty": 16,
    "report_as": 4,
    "algorithm": "slow"
  }
}
```

</TabItem>
<TabItem value="yaml" label="YAML">

This rule has been known to have a high false positive rate in testing. Please use this with care.

```yaml
# Punish any bot with "bot" in the user-agent string
- name: generic-bot-catchall
  user_agent_regex: (?i:bot|crawler)
  action: CHALLENGE
  challenge:
    difficulty: 16 # impossible
    report_as: 4 # lie to the operator
    algorithm: slow # intentionally waste CPU cycles and time
```

</TabItem>
</Tabs>

Challenges can be configured with these settings:

| Key          | Example  | Description                                                                                                                                                                                    |
| :----------- | :------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `difficulty` | `4`      | The challenge difficulty (number of leading zeros) for proof-of-work. See [Why does Anubis use Proof-of-Work?](/docs/design/why-proof-of-work) for more details.                               |
| `report_as`  | `4`      | What difficulty the UI should report to the user. Useful for messing with industrial-scale scraping efforts.                                                                                   |
| `algorithm`  | `"fast"` | The algorithm used on the client to run proof-of-work calculations. This must be set to `"fast"` or `"slow"`. See [Proof-of-Work Algorithm Selection](./algorithm-selection) for more details. |

### Remote IP based filtering

The `remote_addresses` field of a bot rule allows you to set the IP ranges that the rule applies to.

For example, you can allow a search engine to connect if and only if its IP address matches the ranges it publishes:

<Tabs>
<TabItem value="json" label="JSON" default>

```json
{
  "name": "qwantbot",
  "user_agent_regex": "\\+https\\:\\/\\/help\\.qwant\\.com/bot/",
  "action": "ALLOW",
  "remote_addresses": ["91.242.162.0/24"]
}
```

</TabItem>
<TabItem value="yaml" label="YAML">

```yaml
- name: qwantbot
  user_agent_regex: \+https\://help\.qwant\.com/bot/
  action: ALLOW
  # https://help.qwant.com/wp-content/uploads/sites/2/2025/01/qwantbot.json
  remote_addresses: ["91.242.162.0/24"]
```

</TabItem>
</Tabs>

This also works at an IP range level without any other checks:

<Tabs>
<TabItem value="json" label="JSON" default>

```json
{
  "name": "internal-network",
  "action": "ALLOW",
  "remote_addresses": ["100.64.0.0/10"]
}
```

</TabItem>
<TabItem value="yaml" label="YAML">

```yaml
- name: internal-network
  action: ALLOW
  remote_addresses:
    - 100.64.0.0/10
```

</TabItem>
</Tabs>

## Imprint / Impressum support

Anubis has support for showing imprint / impressum information. This is defined in the `impressum` block of your configuration. See [Imprint / Impressum configuration](./configuration/impressum.mdx) for more information.

## Risk calculation for downstream services

If your service needs it for risk calculation, Anubis exposes information about the rule that each request matched using a few headers:

| Header            | Explanation                                           | Example          |
| :---------------- | :---------------------------------------------------- | :--------------- |
| `X-Anubis-Rule`   | The name of the rule that was matched                 | `bot/lightpanda` |
| `X-Anubis-Action` | The action that Anubis took in response to that rule  | `CHALLENGE`      |
| `X-Anubis-Status` | The status and how strict Anubis was in its checks    | `PASS`           |

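For example, a request that matched a rule named `bot/lightpanda`, was challenged, and passed the challenge would reach your backend with headers like these (illustrative values taken from the table above):

```text
X-Anubis-Rule: bot/lightpanda
X-Anubis-Action: CHALLENGE
X-Anubis-Status: PASS
```
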
Policy rules are matched using [Go's standard library regular expressions package](https://pkg.go.dev/regexp). You can experiment with the syntax at [regex101.com](https://regex101.com); make sure to select the Golang option.

## Request Weight

Anubis rules can also add or remove "weight" from requests, allowing administrators to configure custom levels of suspicion. For example, if your application uses a session cookie named `i_love_gitea`:

```yaml
- name: gitea-session-token
  action: WEIGH
  expression:
    all:
      - '"Cookie" in headers'
      - headers["Cookie"].contains("i_love_gitea=")
  # Remove 5 weight points
  weight:
    adjust: -5
```

This removes five weight points from the request, which makes Anubis present the [Meta Refresh challenge](./configuration/challenges/metarefresh.mdx) in the default configuration.

### Weight Thresholds

For more information on configuring weight thresholds, see [Weight Threshold Configuration](./configuration/thresholds.mdx).

### Advice

Weight is still very new and needs work. It is an experimental feature and should be treated as such. Here's some advice to help you better tune request weights:

- The default weight for browser-like clients is 10. This triggers an aggressive challenge.
- Remove and add weight in multiples of five (see the sketch after this list).
- Be careful with how you configure weight.

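As an illustration, here is a hypothetical rule (not part of the default policy) that adds five weight points to clients whose User-Agent advertises a headless browser, mirroring the `WEIGH` syntax shown above:

```yaml
# Hypothetical rule: treat headless browsers as more suspicious.
# The rule name and matched substring are examples; adjust them for your environment.
- name: headless-browser-suspicion
  action: WEIGH
  expression:
    all:
      - '"User-Agent" in headers'
      - headers["User-Agent"].contains("HeadlessChrome")
  # Add 5 weight points
  weight:
    adjust: 5
```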