docs: add architecture section and overhaul top-level README

- Move Simon's architecture documentation into architecture/
  (setup, variables, topology, dns, deploy, security, operations
  plus index and glossary). All cross-repo references point at
  https://git.digitalboard.ch/Digitalboard/{reference-ansible,dns-zones}
  via absolute URLs so the docs remain navigable from any context.
- Rewrite README.md as a documentation hub: introduction, platform
  Mermaid overview, comparison of the three repos
  (docs / digitalboard.core / reference-ansible) and a full table of
  contents covering architecture, contributing, infrastructure,
  keycloak, ms-entra and troubleshooting.

Addresses the open items from the WKS PoC review (2026-05-26):
docs README welcome text, overview diagram, and links to the two
other repos, plus moving the architecture documentation.
Simon Bärlocher 2026-05-28 14:25:27 +02:00
parent 8c2ea8cc72
commit 345cf4b319
No known key found for this signature in database
GPG key ID: 63DE20495932047A
9 changed files with 742 additions and 27 deletions

44
architecture/README.md Normal file

@ -0,0 +1,44 @@
<!-- markdownlint-disable MD013 MD060 MD051 -->
# Architecture — `reference-ansible`
This documentation describes the architecture of the `reference-ansible`
repository and uses the inventory `inventories/demo-gymburgdorf/` as a
running example. It serves both as onboarding documentation for new
engineers and as a reference when setting up additional demo tenants.
> **Demo-only.** All defaults in the roles (passwords, tokens, RPC
> secrets) are insecure and intended exclusively for demo setups. See
> [security.md](security.md).
**Last updated:** 2026-05-26 · **Owner:** @sbaerlocher
## Contents
| Section | File | Topics |
|---|---|---|
| Setup and repo layout | [setup.md](setup.md) | Repo layout, role provenance, control-node prerequisites, Bao login workflow |
| Variables | [variables.md](variables.md) | Ansible variable hierarchy, variable cheatsheet |
| Topology | [topology.md](topology.md) | Inventory groups, service layout per host, variable placement |
| DNS and ACME | [dns.md](dns.md) | Knot zones, NS-delegated vs. ACL-isolated ACME models, split-horizon FQDNs, TSIG/ACL |
| Deploy | [deploy.md](deploy.md) | Play sequence, Traefik DMZ/backend modes |
| Security | [security.md](security.md) | Bao lookup pattern, demo-only defaults, threat boundaries, production hardening |
| Operations | [operations.md](operations.md) | New-tenant walkthrough, known gaps and trade-offs |
## Glossary
| Term | Meaning |
|---|---|
| **OpenBao** | HashiCorp Vault fork. Single source of truth for secrets. Endpoint: `bao.digitalboard.ch`. |
| **Authentik** | Identity provider. Issues OIDC for SP services and LDAP via the Outpost. |
| **Outpost (Authentik)** | Separate Authentik sidecar that emulates LDAP/proxy protocols for legacy apps. Talks to Authentik via RPC + token. |
| **WOPI** | Web Application Open Platform Interface — protocol used by Nextcloud/Opencloud to hand office documents to Collabora. |
| **TSIG / RFC2136** | Authenticated DNS updates. Traefik uses TSIG-signed `nsupdate` calls for ACME DNS-01 challenges. |
| **DNS-01 (ACME)** | ACME challenge type used with Let's Encrypt: control of the domain is proven via a TXT record in DNS instead of HTTP. Required for wildcard certs. |
| **CNAME bridge** | `_acme-challenge.<fqdn>` points via CNAME into a dedicated update label (`<service>.demo-gymb._acme.digitalboard.ch`), keeping the TSIG key scoped to a narrow sub-tree. See [dns.md](dns.md). |
| **Knot DNS** | Authoritative DNS server used on `ns1.digitalboard.ch`. Config and zone files live in the separate [`dns-zones`](https://git.digitalboard.ch/Digitalboard/dns-zones) repo. |
| **DNSSEC** | Zones are signed with Ed25519, NSEC3 (no opt-out), KSK 1y / ZSK 90d rollovers, CDS/CDNSKEY published for automatic DS at the parent. |
| **Split horizon** | Two FQDN families per service: public `<svc>.gymb.souveredu.ch` → DMZ Traefik front-end IP, internal `<svc>.int.gymb.souveredu.ch` → the backend host directly. See [dns.md](dns.md). |
| **File provider / Docker provider** | Traefik configuration sources. The file provider reads static YAML; the Docker provider reads container labels via `/var/run/docker.sock`. |
| **STUN/TURN** | NAT-traversal protocols for WebRTC (e.g. for Nextcloud Talk). Runs on a separate host (`turn`). |
| **Garage** | S3-compatible object store (Rust). Backend for Nextcloud/Opencloud. |
| **FQCN** | Fully Qualified Collection Name, e.g. `digitalboard.core.traefik`. Recommended practice since Ansible 2.10 and the way plays in this repo reference collection roles. |

74
architecture/deploy.md Normal file

@ -0,0 +1,74 @@
<!-- markdownlint-disable MD013 MD060 MD051 -->
# Deploy flow and Traefik modes
← Back to [Architecture index](README.md)
## 6. Deploy flow
Sequence taken from [playbooks/site.yml](https://git.digitalboard.ch/Digitalboard/reference-ansible/src/branch/main/playbooks/site.yml):
```mermaid
sequenceDiagram
participant U as User
participant A as ansible-playbook
participant V as OpenBao
participant H as Hosts
U->>U: bao login + export VAULT_TOKEN
U->>A: make deploy_site_demo_gymburgdorf
A->>A: load vars: role defaults → group_vars/all → group_vars/&lt;groups&gt; → host_vars/&lt;host&gt;
A->>V: community.hashi_vault lookups<br/>(acme-tsig, service secrets)
V-->>A: secret values
A->>H: Play 1 — base (all hosts)
A->>H: Play 2 — traefik (all hosts: dmz on reverseproxy, backend elsewhere)
A->>H: Play 3 — httpbin
A->>H: Play 4 — 389ds
A->>H: Play 5 — keycloak
A->>H: Play 6 — garage (storage)
A->>H: Play 7 — collabora (application)
A->>H: Play 8 — authentik (application)
A->>H: Play 9 — authentik_outpost_ldap (application)
A->>H: Play 10 — nextcloud (application)
A->>H: Play 11 — drawio (application)
A->>H: Play 12 — send
A->>H: Play 13 — opnform
A->>H: Play 14 — homarr
A->>H: Play 15 — bookstack
A->>H: Play 16 — opencloud
```
Plays without matching group members (`httpbin_servers`,
`ds389_servers`, `keycloak_servers`, `send_servers`,
`opnform_servers`, `homarr_servers`, `bookstack_servers`,
`opencloud_servers` in this inventory) run as no-ops.
> **Role-name spelling traps:** the LDAP role is `389ds` (not
> `ds389`); the forms role is `opnform` (not `openforms`/`openform`).
> Inventory groups must match the names used in
> [playbooks/site.yml](https://git.digitalboard.ch/Digitalboard/reference-ansible/src/branch/main/playbooks/site.yml) exactly —
> `ds389_servers`, `opnform_servers`.
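
To illustrate how a group name ties a play to its role, a play in `site.yml` might look like the sketch below. The play name and layout are assumptions; only the group name `opnform_servers` and the FQCN pattern `digitalboard.core.<role>` come from this documentation.

```yaml
# Hypothetical excerpt; the real playbooks/site.yml may be structured differently.
- name: Deploy opnform                   # play name is illustrative
  hosts: opnform_servers                 # must match the inventory group name exactly
  roles:
    - role: digitalboard.core.opnform    # role referenced by FQCN
```

If `opnform_servers` is missing from the inventory, or is spelled `openform_servers`, this play matches no hosts and is skipped.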
The Makefile target passes `--diff`, so per-task changes are visible in the output.
## 7. Traefik modes (DMZ vs Backend)
**`traefik_mode: dmz`** — public-facing reverse proxy on `reverseproxy`:
- **File provider** with `services.yml` for static routing.
- No Docker socket mounted, no local containers.
- Routes to `backend_host` addresses on other machines.
- Backends are declared via `traefik_dmz_exposed_services` (a list in
`host_vars/reverseproxy/`). Backends can also be selected
individually via `traefik_backend_servers_to_proxy`.
**`traefik_mode: backend`** — application/storage:
- Mounts `/var/run/docker.sock`.
- **Docker provider**: auto-discovery via container labels
(`traefik.enable=true`).
- Services are exposed locally; the DMZ Traefik routes external
traffic to them in plaintext HTTP (see
[security.md](security.md)).
**Both modes** support ACME via RFC2136 DNS challenge or self-signed
(`traefik_cert_mode: acme | selfsigned`).
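
The split between the two modes comes down to a few variables. The sketch below shows where each one is set in `demo-gymburgdorf`, following the file locations listed in [variables.md](variables.md). The three YAML documents stand for three separate files, and `acme` is an assumed value chosen for illustration.

```yaml
# group_vars/backend_servers/traefik.yml (applies to application + storage)
traefik_mode: backend
---
# host_vars/reverseproxy/traefik.yml (DMZ host only)
traefik_mode: dmz
---
# group_vars/traefik_servers/traefik.yml (every Traefik instance)
traefik_cert_mode: acme   # assumed value; the alternative is selfsigned
```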

123
architecture/dns.md Normal file

@ -0,0 +1,123 @@
<!-- markdownlint-disable MD013 MD060 MD051 -->
# DNS topology and ACME zone layout
← Back to [Architecture index](README.md)
Authoritative DNS for everything described in this document runs on
**`ns1.digitalboard.ch`** (public `193.43.183.169`, DMZ `172.16.9.169`)
using **Knot DNS**. The zone files and Knot config live in the
[`dns-zones`](https://git.digitalboard.ch/Digitalboard/dns-zones) repo; this section explains how the
public service FQDNs, the internal "split-horizon" FQDNs, and the ACME
challenge sub-trees fit together.
## Authoritative zones on `ns1`
| Zone | Purpose | DNSSEC | Dynamic updates |
|---|---|---|---|
| `digitalboard.ch` | Production zone for the platform itself (`auth`, `cloud`, `office`, `bao`, …). | on | none (static zone file) |
| `_acme.digitalboard.ch` | Parent zone for ACME challenge labels. | on | yes, per-tenant TSIG ACLs (`demo-gymb`, `demo-phbe`, `demo-mbaz`) |
| `digitalboard._acme.digitalboard.ch` | **Delegated** child zone for `digitalboard.ch` ACME updates only. | off | yes, TSIG `acme_update_key_digitalboard` |
| `souveredu.ch` | Demo-tenant zone (`gymb`, `phbe`, `mbaz` sub-labels). | on | none (static zone file) |
| `demo-schulen.ch` | Reserve / unused so far. | on | none |
> **Two different ACME models live here.** This is the most common
> source of confusion when copying a tenant:
>
> - `digitalboard.ch` uses a **NS-delegated child zone**
> (`digitalboard._acme.digitalboard.ch.` has its own `NS` record in
> `_acme.digitalboard.ch`). The TSIG key writes into that delegated
> zone.
> - The demo tenants (`demo-gymb`, `demo-phbe`, `demo-mbaz`) **share
> the parent zone** `_acme.digitalboard.ch` and are isolated only
> by **Knot ACL `update-owner-name`** on the per-tenant sub-tree
> (`demo-gymb._acme.digitalboard.ch.` and below). There is no NS
> delegation for them.
>
> Both work for the ACME flow; the demo model is cheaper to manage but
> means tenant isolation depends on Knot ACLs, not zone boundaries.
## Naming pattern for `demo-gymb` (template for new tenants)
```text
Public, browser-facing:
cloud.gymb.souveredu.ch CNAME → rvp.gymb.souveredu.ch (193.43.183.131)
auth.gymb.souveredu.ch CNAME → rvp.gymb.souveredu.ch
office.gymb.souveredu.ch CNAME → rvp.gymb.souveredu.ch
s3.gymb.souveredu.ch CNAME → rvp.gymb.souveredu.ch
...
Internal, server-to-server (split horizon):
cloud.int.gymb.souveredu.ch A → 172.16.19.101 (application host)
auth.int.gymb.souveredu.ch A → 172.16.19.101
office.int.gymb.souveredu.ch A → 172.16.19.101
s3.int.gymb.souveredu.ch A → 172.16.19.102 (storage host)
...
Tenant entry IPs:
rvp.gymb.souveredu.ch A → 193.43.183.131 (DMZ Traefik public)
reverseproxy.int.gymb A → 172.16.9.111 (DMZ Traefik internal)
ACME challenge labels (writeable via TSIG acme_update_key_demo_gymb):
_acme-challenge.cloud.gymb CNAME → cloud.demo-gymb._acme.digitalboard.ch
_acme-challenge.cloud.int.gymb CNAME → cloud.int.demo-gymb._acme.digitalboard.ch
...
```
The `.int.` family is what makes Nextcloud → Garage, Nextcloud →
Authentik (OIDC), Nextcloud → Collabora (WOPI) etc. **bypass the DMZ
Traefik**: the backend host's local Traefik presents the right cert
directly, so traffic stays on the backend subnet. Without this,
server-to-server calls would either ride out through the DMZ and back
in, or hit a hostname mismatch on the cert.
## TSIG / ACL model
```mermaid
flowchart LR
classDef tenant fill:#dcfce7,stroke:#166534,color:#000
classDef zone fill:#dbeafe,stroke:#1e40af,color:#000
classDef acl fill:#fef3c7,stroke:#92400e,color:#000
subgraph KNOT["ns1.digitalboard.ch (Knot DNS)"]
Z1["_acme.digitalboard.ch<br/>(parent zone)"]:::zone
Z2["digitalboard._acme.digitalboard.ch<br/>(NS-delegated child)"]:::zone
A1["ACL acme_updates_digitalboard<br/>scope: digitalboard._acme.digitalboard.ch."]:::acl
A2["ACL acme_updates_demo_gymb<br/>scope: demo-gymb._acme.digitalboard.ch."]:::acl
A3["ACL acme_updates_demo_phbe<br/>scope: demo-phbe._acme.digitalboard.ch."]:::acl
A4["ACL acme_updates_demo_mbaz<br/>scope: demo-mbaz._acme.digitalboard.ch."]:::acl
end
DB["digitalboard.ch Traefik<br/>TSIG: acme_update_key_digitalboard"]:::tenant
GY["demo-gymb Traefik<br/>TSIG: acme_update_key_demo_gymb"]:::tenant
PH["demo-phbe Traefik<br/>TSIG: acme_update_key_demo_phbe"]:::tenant
MB["demo-mbaz Traefik<br/>TSIG: acme_update_key_demo_mbaz"]:::tenant
DB -- nsupdate TXT --> A1
GY -- nsupdate TXT --> A2
PH -- nsupdate TXT --> A3
MB -- nsupdate TXT --> A4
A1 -- writes into --> Z2
A2 -- writes into --> Z1
A3 -- writes into --> Z1
A4 -- writes into --> Z1
```
Each ACL is restricted to **`update-type: TXT`** and
**`update-owner-match: sub-or-equal`** under the tenant prefix, so a
leaked tenant key cannot write outside its own ACME sub-tree and cannot
modify non-TXT records (no A/CNAME/NS hijack).
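
For orientation, a Knot configuration implementing such an ACL could look roughly like the following. This is a sketch based on Knot's documented `key`, `acl` and `zone` sections, not a copy of the real `knot.conf`; the algorithm and the secret placeholder are assumptions.

```yaml
# Sketch only; the actual knot.conf in dns-zones may differ.
key:
  - id: acme_update_key_demo_gymb
    algorithm: hmac-sha256             # assumed algorithm
    secret: "<base64-secret>"          # placeholder, not a real value

acl:
  - id: acme_updates_demo_gymb
    key: acme_update_key_demo_gymb
    action: update                     # dynamic updates only
    update-type: TXT                   # no A/CNAME/NS records
    update-owner: name                 # match against the names listed below
    update-owner-match: sub-or-equal   # the name itself and everything beneath it
    update-owner-name: demo-gymb._acme.digitalboard.ch.

zone:
  - domain: _acme.digitalboard.ch
    acl: [acme_updates_demo_gymb]      # the real list also carries the other tenants
```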
## Traefik variables that bind to this layout
From `inventories/demo-gymburgdorf/group_vars/traefik_servers/traefik.yml`:
| Traefik variable | Value for `demo-gymb` | Bound to |
|---|---|---|
| `traefik_acme_dns_provider` | `rfc2136` | Knot dynamic-update endpoint |
| `traefik_acme_dns_zone` | `demo-gymb._acme.digitalboard.ch` | Per-tenant write scope on `ns1` |
| `traefik_acme_tsig_key_name` | `acme_update_key_demo_gymb` | Matches `key:` entry in [`knot.conf`](https://git.digitalboard.ch/Digitalboard/dns-zones/src/branch/main/knot/knot.conf) |
| `traefik_acme_tsig_secret` | Bao lookup | See [security.md](security.md) |
A tenant whose ACME zone does **not** match the Knot ACL
`update-owner-name` will get `REFUSED` on `nsupdate` and ACME issuance
will silently retry until the renewal window expires.
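
Put together, the values from the table could appear in the group vars file as sketched below. The lookup follows the pattern from [security.md](security.md); the field name `secret` inside the `acme-tsig` entry is an assumption.

```yaml
# inventories/demo-gymburgdorf/group_vars/traefik_servers/traefik.yml (sketch)
traefik_acme_dns_provider: rfc2136
traefik_acme_dns_zone: demo-gymb._acme.digitalboard.ch
traefik_acme_tsig_key_name: acme_update_key_demo_gymb
# Field name "secret" is hypothetical; check the actual Bao entry.
traefik_acme_tsig_secret: "{{ lookup('community.hashi_vault.hashi_vault',
  vault_mount + '/data/acme-tsig:secret',
  url=vault_addr) }}"
```

The zone value has to sit at or below the name in the tenant's Knot ACL, otherwise the update is rejected as described above.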

99
architecture/operations.md Normal file

@ -0,0 +1,99 @@
<!-- markdownlint-disable MD013 MD060 MD051 -->
# Operations — new tenants and known gaps
← Back to [Architecture index](README.md)
## 10. Walkthrough: creating a new demo tenant
Recommended template: **`demo-gymburgdorf`** (not `vagrant`, since its
group topology is incompatible).
1. **Copy the inventory:**
```bash
cp -r inventories/demo-gymburgdorf inventories/demo-<customer>
```
2. **Adjust `hosts.yml`:** IPs and hostnames per host.
3. **`group_vars/all/vault.yml`** — point `vault_mount` at the new
tenant mount (`demo-<customer>`).
4. **`group_vars/traefik_servers/traefik.yml`**: point
`traefik_acme_dns_zone` and the `traefik_acme_tsig_*` lookup paths
at the new zone / new Bao path.
5. **`host_vars/application/*.yml`** and
**`host_vars/storage/*.yml`** — walk through them: FQDNs to the new
domain pattern (e.g. `*.<customer>.souveredu.ch`), Bao lookup paths
to `demo-<customer>/data/…`.
6. **Prepare OpenBao** (out-of-band, not via Ansible):
- Create a new KV-v2 mount `demo-<customer>`.
- Write secrets: `acme-tsig`, `authentik`, `nextcloud`, `garage`, …
(see [security.md](security.md) for the mandatory-override list).
- Policy for the deploy token: read on `demo-<customer>/data/*`.
7. **DNS** (in the [`dns-zones`](https://git.digitalboard.ch/Digitalboard/dns-zones) repo, see
[dns.md](dns.md)):
- Add `key:` and `acl:` entries for the new tenant in
[`knot/knot.conf`](https://git.digitalboard.ch/Digitalboard/dns-zones/src/branch/main/knot/knot.conf), pattern
`acme_update_key_demo_<customer>` /
`acme_updates_demo_<customer>` scoped to
`demo-<customer>._acme.digitalboard.ch.`.
- Append the new ACL to the `_acme.digitalboard.ch` zone's `acl:`
list — the tenants share the parent zone, no NS delegation.
- In `zones/souveredu.ch.zone` (or the tenant's public zone) add
the public/internal A records (`rvp.<customer>`,
`reverseproxy.int.<customer>`, `application.int.<customer>`,
`storage.int.<customer>`, …), the service CNAMEs to
`rvp.<customer>`, and the `_acme-challenge.*` CNAMEs into
`demo-<customer>._acme.digitalboard.ch`. Bump the SOA serial.
- `make deploy_ns1` to push.
8. **Makefile** — add a new target modelled on
`deploy_site_demo_gymburgdorf` and wire it into
`deploy_site_demo`.
9. **Smoke test:**
`ansible all -i inventories/demo-<customer>/hosts.yml -m ping`.
10. **Deploy:** Bao login + `make deploy_site_demo_<customer>`.
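
As a rough sketch of steps 3 and 4, the two tenant-specific files might end up looking like this. `<customer>` is the placeholder used throughout the walkthrough; the `vault_addr` value is assumed to equal the `BAO_ADDR` shown in [setup.md](setup.md).

```yaml
# inventories/demo-<customer>/group_vars/all/vault.yml (sketch)
vault_addr: "https://bao.digitalboard.ch"   # assumed to match BAO_ADDR
vault_mount: "demo-<customer>"              # KV-v2 mount created in step 6
---
# inventories/demo-<customer>/group_vars/traefik_servers/traefik.yml (sketch)
traefik_acme_dns_zone: "demo-<customer>._acme.digitalboard.ch"
traefik_acme_tsig_key_name: "acme_update_key_demo_<customer>"   # must match knot.conf (step 7)
```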
## 11. Known gaps and trade-offs
- **Optional services without group bindings in `demo-gymburgdorf`:**
`opencloud`, `send`, `opnform`, `homarr`, and `bookstack` are
declared as plays in
[playbooks/site.yml](https://git.digitalboard.ch/Digitalboard/reference-ansible/src/branch/main/playbooks/site.yml) but have no
`<service>_servers` group in the inventory — those plays run as
no-ops. If needed, add the group + `host_vars/application/<svc>.yml`
as described in [topology.md](topology.md). Mind spelling:
`opnform_servers` (not `openform`/`openforms`).
- **`turn` host:** defined in the DMZ, but no STUN/TURN role in
[playbooks/site.yml](https://git.digitalboard.ch/Digitalboard/reference-ansible/src/branch/main/playbooks/site.yml). Currently provisioned only
via `base` + `traefik`.
- **Idempotency:** roles are Docker-Compose-based; re-runs may trigger
container restarts when compose inputs change. There is no dedicated
rollback mechanism — on failure, roll back manually to the previous
state.
- **TLS renewal:** handled internally by Traefik via ACME. There is no
external renewal cron in the repo.
- **CI / testing:** not present in the repo. Smoke test is
`make ping_demo`.
- **Logs:** Traefik runs with `traefik_log_level: DEBUG` in
`demo-gymburgdorf` and `vagrant` (role default is `INFO`) — reduce
to `INFO` or `WARN` before adapting for production.
- **TSIG secrets in `knot.conf`:** the `dns-zones` repo currently
stores all four ACME TSIG keys in plaintext in
[`knot/knot.conf`](https://git.digitalboard.ch/Digitalboard/dns-zones/src/branch/main/knot/knot.conf). The Ansible
side reads them from Bao, but the Knot side does not — anyone with
read on the `dns-zones` repo can write TXT records under the
matching tenant's ACME sub-tree. For prod, source the Knot keys
from a templated config + secret store, or restrict repo access.
- **Demo tenants share `_acme.digitalboard.ch`:** isolation is by
Knot ACL `update-owner-name`, not by zone delegation. A mis-edit
of the ACL list could break ACL-based isolation without breaking
DNS resolution — failure is silent. The production zone
(`digitalboard.ch`) uses a properly delegated child zone and is
not affected.

71
architecture/security.md Normal file
View file

@ -0,0 +1,71 @@
<!-- markdownlint-disable MD013 MD060 MD051 -->
# Security and demo-only defaults
← Back to [Architecture index](README.md)
> This repo is explicitly designed for **demo setups**. All default
> values in the roles are insecure and are overridden in `demo-*`
> inventories via Bao lookups or host_vars. For production deployments
> the hardening block further down also applies.
## Secret pattern (Bao lookup)
```yaml
# group_vars/.../<service>.yml or host_vars/.../<service>.yml
authentik_secret_key: "{{ lookup('community.hashi_vault.hashi_vault',
vault_mount + '/data/authentik:secret_key',
url=vault_addr) }}"
```
- `vault_mount` and `vault_addr` come from
[group_vars/all/vault.yml](https://git.digitalboard.ch/Digitalboard/reference-ansible/src/branch/main/inventories/demo-gymburgdorf/group_vars/all/vault.yml).
- KV-v2 paths require an explicit `/data/` segment — Ansible does not
resolve this automatically.
- `vault_mount` is unique per inventory (`demo-gymburgdorf`,
`demo-phbern`, …) → tenant isolation in Bao via mount + policy.
## Demo-only defaults — override required
These defaults in `digitalboard.core` are insecure. In any
**production-grade** deployment they must be overridden via Bao lookup
or host_var:
| Variable | Default | Where to override |
|---|---|---|
| `keycloak_admin_password` | `changeme` | host_vars `keycloak_servers` |
| `keycloak_postgres_password` | `changeme` | same |
| `authentik_secret_key` | `changeme-generate-a-random-string` | `host_vars/application/authentik.yml` |
| `authentik_postgres_password` | `changeme` | same |
| `nextcloud_admin_password` | `admin` | `host_vars/application/nextcloud.yml` |
| `nextcloud_postgres_password` | `changeme` | same |
| `nextcloud_s3_key` / `nextcloud_s3_secret` | `changeme` / `changeme` | same |
| `garage_webui_password` | `admin` | `host_vars/storage/garage.yml` |
| `garage_rpc_secret` | `0123…cdef` (64-hex constant) | same |
| `garage_admin_token` | identical to `rpc_secret` | same |
| `garage_metrics_token` | identical to `rpc_secret` | same |
> **Convention:** every value listed above **must** have a Bao lookup
> in `demo-*/host_vars/.../...yml` before the inventory is considered
> deploy-ready.
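
A minimal sketch of what such overrides could look like for two of the Garage values, reusing the lookup pattern shown earlier. The secret name `garage` appears in [operations.md](operations.md); the field names after the colon are hypothetical.

```yaml
# host_vars/storage/garage.yml (sketch; field names after the colon are hypothetical)
garage_rpc_secret: "{{ lookup('community.hashi_vault.hashi_vault',
  vault_mount + '/data/garage:rpc_secret',
  url=vault_addr) }}"
garage_admin_token: "{{ lookup('community.hashi_vault.hashi_vault',
  vault_mount + '/data/garage:admin_token',
  url=vault_addr) }}"
```

Using separate fields for each token also avoids the demo default where the admin and metrics tokens are identical to the RPC secret.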
## Threat boundaries (current demo state)
| Boundary | Status | Notes |
|---|---|---|
| DMZ ↔ Backend (172.16.9 ↔ 172.16.19) | **Plaintext HTTP** | Auth bearers, OIDC codes, session cookies travel unencrypted. Fine for demo; for prod use mTLS or a WireGuard overlay. |
| Host firewall | **missing** | The `base` role does not install UFW/nftables. Segmentation relies on the hypervisor/VLAN. |
| SSH | `ansible_user: root` | No bastion, no jump host. Key distribution out-of-band. |
| Authentik SPOF | **accepted** | IdP and SP services share the same host (`application`). An Authentik outage means a login outage including the LDAP outpost. No break-glass path. |
| ACME TSIG key | Bao lookup (in Ansible), **plaintext in [`knot.conf`](https://git.digitalboard.ch/Digitalboard/dns-zones/src/branch/main/knot/knot.conf)** on `ns1` side | One TSIG key per demo tenant, scoped via Knot ACL `update-owner-name` to the tenant's ACME sub-tree. Rotation is manual and must be done on both sides simultaneously (Bao + `knot.conf` + `knotc reload`). |
| Backup / DR | **out of scope** | Garage `replication_factor: 1` (default), no Postgres backup job, no Bao snapshot cron. |
## To adapt for production, add
- Host firewall (extend the `base` role or add a dedicated `firewall`
role).
- mTLS or WireGuard between DMZ and backend.
- Authentik on a separate host with a recovery admin token.
- Bao policies per inventory mount (read-only for the deploy token,
write-only for the bootstrap job).
- Backup cron for Postgres + Garage + Bao.
- SSH bastion + key rotation.

68
architecture/setup.md Normal file
View file

@ -0,0 +1,68 @@
<!-- markdownlint-disable MD013 MD060 MD051 -->
# Setup and repo layout
← Back to [Architecture index](README.md)
## 1. Repo layout and role provenance
```text
reference-ansible/
├── Makefile # Deploy targets, OIDC login, OBJC fork workaround
├── ansible.cfg # collections_path, remote_user=root, hashi_vault auth_method=token
├── requirements.yml # community.hashi_vault + digitalboard.core (Git)
├── playbooks/site.yml # Play sequence (16 plays, see deploy.md)
├── collections/ # ← installed by `make install`, gitignored
│ └── ansible_collections/
│ └── digitalboard/core/
│ └── roles/ # 🔑 Roles live HERE, NOT in the repo root
└── inventories/
├── demo-gymburgdorf/ # Inventory used throughout this document
├── demo-mbazürich/
├── demo-phbern/
└── vagrant/ # Local test inventory with its own topology
```
> **Important:** There is **no** `roles/` directory at the repo root.
> All roles come from the `digitalboard.core` collection (see
> [requirements.yml](https://git.digitalboard.ch/Digitalboard/reference-ansible/src/branch/main/requirements.yml)), installed via `make install`
> into `./collections/`. Plays reference them by FQCN
> `digitalboard.core.<role>`.
## 2. Setup and prerequisites
**Tools on the control node:**
- `ansible` (Core ≥ 2.15)
- `bao` CLI (OpenBao) — e.g. `sudo pacman -S openbao python-hvac` (Arch) or Homebrew
- `python-hvac` (for `community.hashi_vault` lookups)
- On macOS: `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES` (set in the
[Makefile](https://git.digitalboard.ch/Digitalboard/reference-ansible/src/branch/main/Makefile); without it Ansible forks crash on Bao lookups)
**Initial setup:**
```bash
git clone <repo>
cd reference-ansible
make install # Galaxy + digitalboard.core into ./collections/
```
**Before every deploy:** Bao login in the **same shell** that will then
run `ansible-playbook`:
```bash
export BAO_ADDR=https://bao.digitalboard.ch
bao login -method=oidc -path=Digitalboard
export VAULT_TOKEN=$(bao print token)
```
> ⚠️ `make bao` on its own is **not enough**: every `make` target spawns
> a new shell, and the `VAULT_TOKEN` exported in there only lives for
> the duration of `make bao` itself. Either run the three commands
> above manually, or invoke `make bao deploy_site_demo_gymburgdorf` as
> **one** call — otherwise the deploy has no token.
**Smoke test:**
```bash
make ping_demo # pings all three demo inventories
```

110
architecture/topology.md Normal file
View file

@ -0,0 +1,110 @@
<!-- markdownlint-disable MD013 MD060 MD051 -->
# Topology — inventory and services
← Back to [Architecture index](README.md)
## 4. Inventory topology (`demo-gymburgdorf`)
```mermaid
flowchart LR
classDef dmz fill:#fee2e2,stroke:#991b1b,color:#000
classDef app fill:#dcfce7,stroke:#166534,color:#000
classDef stor fill:#dbeafe,stroke:#1e40af,color:#000
classDef turn fill:#fef9c3,stroke:#854d0e,color:#000
subgraph ALL["group: all_servers"]
direction LR
subgraph DMZ["DMZ 172.16.9.0/24"]
RP["<b>reverseproxy</b><br/>172.16.9.111<br/>traefik_mode: dmz"]:::dmz
TURN["<b>turn</b><br/>172.16.9.112<br/>(no role in site.yml yet)"]:::turn
end
subgraph BE["Backend 172.16.19.0/24<br/>group: backend_servers"]
APP["<b>application</b><br/>172.16.19.101<br/>traefik_mode: backend<br/>+ authentik, authentik_outpost_ldap,<br/> nextcloud, collabora, drawio"]:::app
ST["<b>storage</b><br/>172.16.19.102<br/>traefik_mode: backend<br/>+ garage (S3)"]:::stor
end
end
RP -.HTTPS in, HTTP out.-> APP
RP -.HTTPS in, HTTP out.-> ST
```
**Group memberships (from [hosts.yml](https://git.digitalboard.ch/Digitalboard/reference-ansible/src/branch/main/inventories/demo-gymburgdorf/hosts.yml)):**
| Group | Members | Purpose |
|---|---|---|
| `all_servers` | `reverseproxy`, `application`, `storage`, `turn` | Base role for all hosts |
| `traefik_servers` | `children: all_servers` (= all 4 hosts) | Traefik everywhere; DMZ/backend via `traefik_mode` |
| `backend_servers` | `application`, `storage` | Sets `traefik_mode: backend` via group var |
| `garage_servers` | `storage` | Single-host wrapper for the Garage role |
| `nextcloud_servers`, `collabora_servers`, `drawio_servers`, `authentik_servers`, `authentik_outpost_ldap_servers` | `application` only | Single-host wrappers |
> **Difference vs. the `vagrant` inventory:** `vagrant` structures
> Traefik differently — via the children groups `traefik_servers_dmz`
> and `traefik_servers_backend` instead of `backend_servers` +
> `host_vars` override. The two topologies are **structurally
> incompatible**; a 1:1 mapping is not possible. See
> [operations.md](operations.md) for the recommended template.
## 5. Service layout and variable placement
```mermaid
flowchart TB
classDef rp fill:#fee2e2,stroke:#991b1b,color:#000
classDef ap fill:#dcfce7,stroke:#166534,color:#000
classDef st fill:#dbeafe,stroke:#1e40af,color:#000
classDef ext fill:#e9d5ff,stroke:#6b21a8,color:#000
Internet((Internet))
DNS["DNS ns1.digitalboard.ch<br/>RFC2136 TSIG<br/>Zone: demo-gymb._acme.digitalboard.ch<br/>CNAME bridge: _acme-challenge.*.gymb.souveredu.ch"]:::ext
BAO["OpenBao<br/>bao.digitalboard.ch<br/>mount: demo-gymburgdorf"]:::ext
subgraph RP["<b>reverseproxy</b> — traefik dmz"]
TRDMZ["traefik (file provider)<br/>📍 group_vars/traefik_servers/traefik.yml<br/>📍 host_vars/reverseproxy/traefik.yml<br/> → traefik_mode: dmz<br/> → traefik_dmz_exposed_services"]:::rp
end
subgraph APP["<b>application</b> — traefik backend"]
TRA["traefik (docker provider)<br/>📍 group_vars/backend_servers/traefik.yml"]:::ap
AK["authentik (OIDC + LDAP outpost backend)<br/>📍 host_vars/application/authentik.yml"]:::ap
AKO["authentik_outpost_ldap<br/>📍 host_vars/application/authentik_outpost_ldap.yml"]:::ap
NC["nextcloud<br/>📍 host_vars/application/nextcloud.yml"]:::ap
COL["collabora<br/>📍 host_vars/application/collabora.yml"]:::ap
DRW["drawio<br/>📍 host_vars/application/drawio.yml"]:::ap
end
subgraph ST["<b>storage</b> — traefik backend"]
TRS["traefik (docker provider)"]:::st
GAR["garage (S3)<br/>📍 host_vars/storage/garage.yml"]:::st
end
Internet -->|HTTPS :443| TRDMZ
TRDMZ -->|HTTP backend| TRA
TRDMZ -->|HTTP backend| TRS
TRA --> AK & AKO & NC & COL & DRW
TRS --> GAR
NC -. S3 .-> GAR
NC -. OIDC .-> AK
NC -. WOPI .-> COL
NC -. LDAP .-> AKO
AKO -. RPC + token .-> AK
TRDMZ -. ACME DNS-01 TSIG .-> DNS
TRDMZ -. hashi_vault acme-tsig .-> BAO
AK -. hashi_vault secrets .-> BAO
NC -. hashi_vault secrets .-> BAO
GAR -. hashi_vault secrets .-> BAO
```
> **Note:** `opencloud`, `send`, `opnform`, `homarr`, and `bookstack`
> are defined as plays in [playbooks/site.yml](https://git.digitalboard.ch/Digitalboard/reference-ansible/src/branch/main/playbooks/site.yml)
> but currently have no matching group in
> [hosts.yml](https://git.digitalboard.ch/Digitalboard/reference-ansible/src/branch/main/inventories/demo-gymburgdorf/hosts.yml) for
> `demo-gymburgdorf` — those plays therefore run as no-ops. If a
> tenant needs these services, add the corresponding
> `<service>_servers` group in `hosts.yml` and a
> `host_vars/application/<service>.yml` (mind the spelling — the
> forms role is `opnform`, the LDAP role is `389ds`).
>
> The `turn` host is in `all_servers` (and therefore in
> `traefik_servers`) but has **no** service group of its own —
> currently only the `base` and `traefik` roles run on it.
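
Enabling one of the optional services could look like the following inventory sketch. The surrounding structure of the real `hosts.yml` is an assumption; only the group name and the target host come from this documentation.

```yaml
# inventories/demo-gymburgdorf/hosts.yml (hypothetical excerpt)
all:
  children:
    opnform_servers:      # group name expected by the opnform play in site.yml
      hosts:
        application:      # service would run next to nextcloud, authentik, ...
```

The matching service variables would then go into `host_vars/application/opnform.yml`.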

58
architecture/variables.md Normal file
View file

@ -0,0 +1,58 @@
<!-- markdownlint-disable MD013 MD060 MD051 -->
# Variables — hierarchy and cheatsheet
← Back to [Architecture index](README.md)
## 3. Variable hierarchy
Ansible merges variables from multiple sources. Simplified model for
this repo (see the Ansible docs for the full precedence rules):
```mermaid
flowchart LR
classDef role fill:#fef3c7,stroke:#92400e,color:#000
classDef group fill:#dbeafe,stroke:#1e40af,color:#000
classDef host fill:#dcfce7,stroke:#166534,color:#000
classDef vault fill:#fee2e2,stroke:#991b1b,color:#000
R["<b>role defaults</b><br/>(lowest precedence)<br/>collections/.../roles/&lt;r&gt;/defaults/main.yml"]:::role
GA["<b>group_vars/all/</b><br/>vault.yml, docker.yml"]:::group
GG["<b>group_vars/&lt;group&gt;/</b><br/>traefik_servers/, backend_servers/<br/>(parallel groups, merged via<br/>ansible_group_priority)"]:::group
HV["<b>host_vars/&lt;host&gt;/</b><br/>(highest of the three inventory sources)"]:::host
BAO["<b>OpenBao</b><br/>lookup at runtime"]:::vault
R --> |"&lt;overridden by&gt;"| GA
GA --> |"&lt;overridden by&gt;"| GG
GG --> |"&lt;overridden by&gt;"| HV
HV -.community.hashi_vault.-> BAO
GG -.community.hashi_vault.-> BAO
```
**Key properties:**
- Multiple `group_vars/<group>/` are **parallel**, not hierarchically
nested. `traefik_servers` and `backend_servers` are merged by
`ansible_group_priority` (default 1); on conflict the
alphabetically-later group name wins.
- `host_vars/<host>/` beats any group.
- `host_vars/reverseproxy/traefik.yml` sets `traefik_mode: dmz` for
the DMZ host. Because `reverseproxy` is not a member of
`backend_servers`, the group default `traefik_mode: backend` never
applies to it; if the host were added to that group, the host var
would still win.
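
If two parallel groups ever define the same variable, the outcome can be made explicit instead of relying on alphabetical order. The sketch below raises the priority of `backend_servers`; the inventory structure around it is assumed, and the value `10` is arbitrary.

```yaml
# hosts.yml (hypothetical excerpt; surrounding structure is assumed)
all:
  children:
    backend_servers:
      vars:
        ansible_group_priority: 10   # higher value wins among groups at the same level
      hosts:
        application:
        storage:
```

Ansible only honours `ansible_group_priority` when it is set in the inventory source itself, not in `group_vars/` files, because the value is used while those files are being loaded.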
**Bao lookups** are not a precedence layer but **values** inside any
variable source. See [security.md](security.md) for the pattern.
## 9. Variable cheatsheet
| Variable | Where in `demo-gymburgdorf/` | Why |
|---|---|---|
| `vault_addr`, `vault_mount` | `group_vars/all/vault.yml` | Bao endpoint applies site-wide |
| `docker_registry_mirrors` | `group_vars/all/docker.yml` | Pulls from mirror on all hosts |
| `traefik_acme_*`, `traefik_use_ssl`, `traefik_cert_mode` | `group_vars/traefik_servers/traefik.yml` | Applies to every Traefik instance (dmz + backend) |
| `traefik_mode: backend` | `group_vars/backend_servers/traefik.yml` | Default for app + storage |
| `traefik_mode: dmz` | `host_vars/reverseproxy/traefik.yml` | Host-specific override |
| `traefik_dmz_exposed_services` | `host_vars/reverseproxy/` | DMZ backend list — only meaningful here |
| `nextcloud_*`, `authentik_*`, `collabora_*`, `drawio_*` | `host_vars/application/<service>.yml` | Service runs on `application` |
| `garage_*` | `host_vars/storage/garage.yml` | Service runs on `storage` |
| Secrets (passwords, tokens, keys) | inline variable using `lookup('community.hashi_vault.hashi_vault', …)` | Single source of truth via Bao |