docs: add architecture section and overhaul top-level README

- Move Simon's architecture documentation into architecture/
  (setup, variables, topology, dns, deploy, security, operations
  plus index and glossary). All cross-repo references point at
  https://git.digitalboard.ch/Digitalboard/{reference-ansible,dns-zones}
  via absolute URLs so the docs remain navigable from any context.
- Rewrite README.md as a documentation hub: introduction, platform
  Mermaid overview, comparison of the three repos
  (docs / digitalboard.core / reference-ansible) and a full table of
  contents covering architecture, contributing, infrastructure,
  keycloak, ms-entra and troubleshooting.

Addresses the open items from the WKS PoC review (2026-05-26):
docs README welcome text + overview diagram + links to the
two other repos, as well as moving the architecture documentation.
Simon Bärlocher 2026-05-28 14:25:27 +02:00
parent 8c2ea8cc72
commit 345cf4b319
GPG key ID: 63DE20495932047A
9 changed files with 742 additions and 27 deletions

122
README.md

@ -1,35 +1,103 @@
# 📚 Documentation Repository
<!-- markdownlint-disable MD013 MD060 -->
# 📚 Digitalboard Documentation
This repository contains documentation, guides, and reference material.
Welcome — this repository is the **central documentation hub** for the
Digitalboard platform. It collects architecture notes, operational
runbooks, integration guides, and troubleshooting recipes that span
multiple repositories, so they have one stable home instead of being
scattered across READMEs.
## 📖 Available Documentation
## 🏛️ Platform at a glance
- **[Contribution guidelines](./contributing/)**
Documentation and guides related to infrastructure configuration and best practices.
- [Git](./contributing/git.md)
Guidelines for contributing using git
```mermaid
flowchart LR
classDef docs fill:#dbeafe,stroke:#1e40af,color:#000
classDef core fill:#dcfce7,stroke:#166534,color:#000
classDef ans fill:#fef3c7,stroke:#92400e,color:#000
classDef ext fill:#e9d5ff,stroke:#6b21a8,color:#000
- **[Infrastructure](./infrastructure/)**
Documentation and guides related to infrastructure configuration and best practices.
- [ACME](./infrastructure/acme.md)
Documentation of the ACME concept.
- [IPV6](./infrastructure/ipv6.md)
Documentation of the ipv6 concept.
User((Operator / Engineer))
- **[Keycloak](./keycloak/)**
Documentation and guides related to Keycloak configuration and best practices.
- [Enforce OTP 2FA for Internal Users](./keycloak/enforce-otp-internal.md)
Step-by-step instructions for enforcing OTP-based two-factor authentication for internal users, while excluding external Microsoft Entra users.
- [Integrate MS Entra in Keycloak as IDP](./keycloak/idp-ms-entra.md)
Step-by-step instructions for integrating MS Entra as identity-provider.
subgraph REPOS["Digitalboard repositories"]
DOCS["<b>docs</b><br/>📖 architecture, runbooks,<br/>integration guides<br/>(this repo)"]:::docs
CORE["<b>digitalboard.core</b><br/>⚙️ Ansible collection<br/>= all roles<br/>(traefik, authentik, nextcloud,<br/>garage, keycloak, …)"]:::core
REF["<b>reference-ansible</b><br/>🚀 inventories + playbooks<br/>(demo-gymburgdorf,<br/>demo-phbern, demo-mbazürich,<br/>vagrant)"]:::ans
end
- **[Microsoft Entra](./ms-entra/)**
Documentation and guides related to Microsoft Entra configuration and best practices.
- [Enterprise App Integration with Keycloak](./ms-entra/enterprise-app-keycloak.md)
Step-by-step instructions for creating an Enterprise Application in Microsoft Entra (Azure AD) as an identity provider for Keycloak.
subgraph PLATFORM["Runtime targets"]
BAO["OpenBao<br/>bao.digitalboard.ch<br/>(secrets)"]:::ext
DNS["Knot DNS<br/>ns1.digitalboard.ch<br/>(ACME / split-horizon)"]:::ext
HOSTS["Tenant VMs<br/>(reverseproxy · application ·<br/>storage · turn)"]:::ext
end
- **[Troubleshooting](./troubleshooting/)**
Encountered & solved problems.
- [Nextcloud File Locking](./troubleshooting/nextcloud-file-locking.md)
Preventing sync conflicts when multiple users edit the same file via the Nextcloud desktop client.
User -->|reads| DOCS
User -->|runs `make deploy_…`| REF
REF -->|requires| CORE
REF -.->|hashi_vault lookups| BAO
REF -->|ansible-playbook| HOSTS
HOSTS -.->|nsupdate TSIG / ACME DNS-01| DNS
HOSTS -.->|hashi_vault lookups| BAO
DOCS -.documents.-> REF
DOCS -.documents.-> CORE
```
**The three repos at a glance:**
| Repo | Role | Link |
|---|---|---|
| **`docs`** *(here)* | Architecture, integration guides, runbooks, troubleshooting. The "why" and the "how it fits together." | [git.digitalboard.ch/Digitalboard/docs](https://git.digitalboard.ch/Digitalboard/docs) |
| **`digitalboard.core`** | Ansible collection — every reusable role (Traefik, Authentik, Keycloak, Nextcloud, Garage, …). The "what runs on a host." | [git.digitalboard.ch/Digitalboard/digitalboard.core](https://git.digitalboard.ch/Digitalboard/digitalboard.core) |
| **`reference-ansible`** | Inventories + playbooks for the demo tenants and the `vagrant` test setup. The "what gets deployed where, with which variables." | [git.digitalboard.ch/Digitalboard/reference-ansible](https://git.digitalboard.ch/Digitalboard/reference-ansible) |
> 🚀 **Want to deploy something?** Start in
> [`reference-ansible`](https://git.digitalboard.ch/Digitalboard/reference-ansible) —
> its README covers the Bao login, the `make` targets, and the available
> playbooks. Come back here for the architectural background
> ([architecture/](./architecture/)) or for solved problems
> ([troubleshooting/](./troubleshooting/)).
## 📖 Contents
- **[Architecture](./architecture/)** — How the `reference-ansible`
deployment is structured, using `demo-gymburgdorf` as the running example.
- [Index & glossary](./architecture/README.md)
- [Setup and repo layout](./architecture/setup.md) — control-node prerequisites, Bao login workflow
- [Variables](./architecture/variables.md) — Ansible variable hierarchy and cheatsheet
- [Topology](./architecture/topology.md) — Inventory groups, service layout per host
- [DNS and ACME](./architecture/dns.md) — Knot zones, TSIG/ACL model, split-horizon FQDNs
- [Deploy](./architecture/deploy.md) — Play sequence, Traefik DMZ vs. backend modes
- [Security](./architecture/security.md) — Bao lookup pattern, demo-only defaults, production hardening
- [Operations](./architecture/operations.md) — New-tenant walkthrough, known gaps
- **[Contributing](./contributing/)** — Conventions for collaborating on this codebase.
- [Git](./contributing/git.md) — Guidelines for contributing using git
- **[Infrastructure](./infrastructure/)** — Infrastructure-level concepts that apply across services.
- [ACME](./infrastructure/acme.md) — Documentation of the ACME concept
- [IPv6](./infrastructure/ipv6.md) — Documentation of the IPv6 concept
- **[Keycloak](./keycloak/)** — Keycloak configuration and best practices.
- [Account Linking](./keycloak/account-linking.md) — How to link existing accounts to a federated identity
- [Enforce OTP 2FA for Internal Users](./keycloak/enforce-otp-internal.md) — OTP-based 2FA for internal users, excluding external MS Entra users
- [Integrate MS Entra in Keycloak as IDP](./keycloak/idp-ms-entra.md) — MS Entra as identity provider
- **[Microsoft Entra](./ms-entra/)** — Microsoft Entra configuration and best practices.
- [Enterprise App Integration with Keycloak](./ms-entra/enterprise-app-keycloak.md) — Enterprise App in MS Entra (Azure AD) as IDP for Keycloak
- **[Troubleshooting](./troubleshooting/)** — Encountered & solved problems.
- [Nextcloud File Locking](./troubleshooting/nextcloud-file-locking.md) — Preventing sync conflicts when multiple users edit the same file via the Nextcloud desktop client
## 🧭 Where to look
| If you want to… | Go to |
|---|---|
| Understand how a tenant is wired up | [architecture/topology.md](./architecture/topology.md) |
| Set up a new demo tenant | [architecture/operations.md](./architecture/operations.md) |
| Look up a variable's correct home | [architecture/variables.md](./architecture/variables.md) |
| Understand why two ACME models coexist | [architecture/dns.md](./architecture/dns.md) |
| Plug an identity provider into Keycloak | [keycloak/](./keycloak/) |
| Solve a recurring runtime issue | [troubleshooting/](./troubleshooting/) |
---
📝 Contributions follow the guidelines in [contributing/git.md](./contributing/git.md).

44
architecture/README.md Normal file

@ -0,0 +1,44 @@
<!-- markdownlint-disable MD013 MD060 MD051 -->
# Architecture — `reference-ansible`
This documentation describes the architecture of the `reference-ansible`
repository and uses the inventory `inventories/demo-gymburgdorf/` as a
running example. It serves both as onboarding documentation for new
engineers and as a reference when setting up additional demo tenants.
> **Demo-only.** All defaults in the roles (passwords, tokens, RPC
> secrets) are insecure and intended exclusively for demo setups. See
> [security.md](security.md).
**Last updated:** 2026-05-26 · **Owner:** @sbaerlocher
## Contents
| Section | File | Topics |
|---|---|---|
| Setup and repo layout | [setup.md](setup.md) | Repo layout, role provenance, control-node prerequisites, Bao login workflow |
| Variables | [variables.md](variables.md) | Ansible variable hierarchy, variable cheatsheet |
| Topology | [topology.md](topology.md) | Inventory groups, service layout per host, variable placement |
| DNS and ACME | [dns.md](dns.md) | Knot zones, NS-delegated vs. ACL-isolated ACME models, split-horizon FQDNs, TSIG/ACL |
| Deploy | [deploy.md](deploy.md) | Play sequence, Traefik DMZ/backend modes |
| Security | [security.md](security.md) | Bao lookup pattern, demo-only defaults, threat boundaries, production hardening |
| Operations | [operations.md](operations.md) | New-tenant walkthrough, known gaps and trade-offs |
## Glossary
| Term | Meaning |
|---|---|
| **OpenBao** | HashiCorp Vault fork. Single source of truth for secrets. Endpoint: `bao.digitalboard.ch`. |
| **Authentik** | Identity provider. Issues OIDC for SP services and LDAP via the Outpost. |
| **Outpost (Authentik)** | Separate Authentik sidecar that emulates LDAP/proxy protocols for legacy apps. Talks to Authentik via RPC + token. |
| **WOPI** | Web Application Open Platform Interface — protocol used by Nextcloud/Opencloud to hand office documents to Collabora. |
| **TSIG / RFC2136** | Authenticated DNS updates. Traefik uses TSIG-signed `nsupdate` calls for ACME DNS-01 challenges. |
| **DNS-01 (ACME)** | ACME challenge type, as offered by Let's Encrypt: control of the domain is proven via a TXT record in DNS instead of over HTTP. Required for wildcard certs. |
| **CNAME bridge** | `_acme-challenge.<fqdn>` points via CNAME into a dedicated update label (`<service>.demo-gymb._acme.digitalboard.ch`), keeping the TSIG key scoped to a narrow sub-tree. See [dns.md](dns.md). |
| **Knot DNS** | Authoritative DNS server used on `ns1.digitalboard.ch`. Config and zone files live in the separate [`dns-zones`](https://git.digitalboard.ch/Digitalboard/dns-zones) repo. |
| **DNSSEC** | Zones are signed with Ed25519, NSEC3 (no opt-out), KSK 1y / ZSK 90d rollovers, CDS/CDNSKEY published for automatic DS at the parent. |
| **Split horizon** | Two FQDN families per service: public `<svc>.gymb.souveredu.ch` → DMZ Traefik front-end IP, internal `<svc>.int.gymb.souveredu.ch` → directly the backend host. See [dns.md](dns.md). |
| **File provider / Docker provider** | Traefik configuration sources. The file provider reads static YAML; the Docker provider reads container labels via `/var/run/docker.sock`. |
| **STUN/TURN** | NAT-traversal protocols for WebRTC (e.g. for Nextcloud Talk). Runs on a separate host (`turn`). |
| **Garage** | S3-compatible object store (Rust). Backend for Nextcloud/Opencloud. |
| **FQCN** | Fully Qualified Collection Name, e.g. `digitalboard.core.traefik`. The standard way to reference collection content since Ansible 2.10. |

74
architecture/deploy.md Normal file

@ -0,0 +1,74 @@
<!-- markdownlint-disable MD013 MD060 MD051 -->
# Deploy flow and Traefik modes
← Back to [Architecture index](README.md)
## 6. Deploy flow
Sequence taken from [playbooks/site.yml](https://git.digitalboard.ch/Digitalboard/reference-ansible/src/branch/main/playbooks/site.yml):
```mermaid
sequenceDiagram
participant U as User
participant A as ansible-playbook
participant V as OpenBao
participant H as Hosts
U->>U: bao login + export VAULT_TOKEN
U->>A: make deploy_site_demo_gymburgdorf
A->>A: load vars: role defaults → group_vars/all → group_vars/&lt;groups&gt; → host_vars/&lt;host&gt;
A->>V: community.hashi_vault lookups<br/>(acme-tsig, service secrets)
V-->>A: secret values
A->>H: Play 1 — base (all hosts)
A->>H: Play 2 — traefik (all hosts: dmz on reverseproxy, backend elsewhere)
A->>H: Play 3 — httpbin
A->>H: Play 4 — 389ds
A->>H: Play 5 — keycloak
A->>H: Play 6 — garage (storage)
A->>H: Play 7 — collabora (application)
A->>H: Play 8 — authentik (application)
A->>H: Play 9 — authentik_outpost_ldap (application)
A->>H: Play 10 — nextcloud (application)
A->>H: Play 11 — drawio (application)
A->>H: Play 12 — send
A->>H: Play 13 — opnform
A->>H: Play 14 — homarr
A->>H: Play 15 — bookstack
A->>H: Play 16 — opencloud
```
Plays without matching group members (`httpbin_servers`,
`ds389_servers`, `keycloak_servers`, `send_servers`,
`opnform_servers`, `homarr_servers`, `bookstack_servers`,
`opencloud_servers` in this inventory) run as no-ops.
> **Role-name spelling traps:** the LDAP role is `389ds` (not
> `ds389`); the forms role is `opnform` (not `openforms`/`openform`).
> Inventory groups must match the names used in
> [playbooks/site.yml](https://git.digitalboard.ch/Digitalboard/reference-ansible/src/branch/main/playbooks/site.yml) exactly —
> `ds389_servers`, `opnform_servers`.
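Because a misspelled group only makes a play skip its hosts, a typo is easy to miss. The following is a minimal sketch for spotting absent groups, assuming the inventory is a YAML `hosts.yml` in which each group appears as a key on its own line. `missing_groups` is a hypothetical helper and not part of `reference-ansible`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print every group name that does not occur as a
# YAML key ("<name>:") in the given inventory file.
missing_groups() {
  local inventory="$1" group
  shift
  for group in "$@"; do
    grep -Eq "^[[:space:]]*${group}:" "$inventory" || printf '%s\n' "$group"
  done
}

# Example call (inventory path as used elsewhere in these docs):
# missing_groups inventories/demo-gymburgdorf/hosts.yml ds389_servers opnform_servers
```

An absent group is not an error in itself, since the matching play is simply skipped; the output is only a prompt to check the spelling against `playbooks/site.yml`.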
The `make` target runs with `--diff` enabled, so per-task changes are visible in the output.
## 7. Traefik modes (DMZ vs Backend)
**`traefik_mode: dmz`** — public-facing reverse proxy on `reverseproxy`:
- **File provider** with `services.yml` for static routing.
- No Docker socket mounted, no local containers.
- Routes to `backend_host` addresses on other machines.
- Backends are declared via `traefik_dmz_exposed_services` (a list in
`host_vars/reverseproxy/`). Selective backend selection is also
possible via `traefik_backend_servers_to_proxy`.
**`traefik_mode: backend`** — application/storage:
- Mounts `/var/run/docker.sock`.
- **Docker provider**: auto-discovery via container labels
(`traefik.enable=true`).
- Services are exposed locally; the DMZ Traefik routes external
traffic to them in plaintext HTTP (see
[security.md](security.md)).
**Both modes** support ACME via RFC2136 DNS challenge or self-signed
(`traefik_cert_mode: acme | selfsigned`).

123
architecture/dns.md Normal file

@ -0,0 +1,123 @@
<!-- markdownlint-disable MD013 MD060 MD051 -->
# DNS topology and ACME zone layout
← Back to [Architecture index](README.md)
Authoritative DNS for everything described in this document runs on
**`ns1.digitalboard.ch`** (public `193.43.183.169`, DMZ `172.16.9.169`)
using **Knot DNS**. The zone files and Knot config live in the
[`dns-zones`](https://git.digitalboard.ch/Digitalboard/dns-zones) repo; this section explains how the
public service FQDNs, the internal "split-horizon" FQDNs, and the ACME
challenge sub-trees fit together.
## Authoritative zones on `ns1`
| Zone | Purpose | DNSSEC | Dynamic updates |
|---|---|---|---|
| `digitalboard.ch` | Production zone for the platform itself (`auth`, `cloud`, `office`, `bao`, …). | on | none (static zone file) |
| `_acme.digitalboard.ch` | Parent zone for ACME challenge labels. | on | yes, per-tenant TSIG ACLs (`demo-gymb`, `demo-phbe`, `demo-mbaz`) |
| `digitalboard._acme.digitalboard.ch` | **Delegated** child zone for `digitalboard.ch` ACME updates only. | off | yes, TSIG `acme_update_key_digitalboard` |
| `souveredu.ch` | Demo-tenant zone (`gymb`, `phbe`, `mbaz` sub-labels). | on | none (static zone file) |
| `demo-schulen.ch` | Reserve / unused so far. | on | none |
> **Two different ACME models live here.** This is the most common
> source of confusion when copying a tenant:
>
> - `digitalboard.ch` uses a **NS-delegated child zone**
> (`digitalboard._acme.digitalboard.ch.` has its own `NS` record in
> `_acme.digitalboard.ch`). The TSIG key writes into that delegated
> zone.
> - The demo tenants (`demo-gymb`, `demo-phbe`, `demo-mbaz`) **share
> the parent zone** `_acme.digitalboard.ch` and are isolated only
> by **Knot ACL `update-owner-name`** on the per-tenant sub-tree
> (`demo-gymb._acme.digitalboard.ch.` and below). There is no NS
> delegation for them.
>
> Both work for the ACME flow; the demo model is cheaper to manage but
> means tenant isolation depends on Knot ACLs, not zone boundaries.
## Naming pattern for `demo-gymb` (template for new tenants)
```text
Public, browser-facing:
cloud.gymb.souveredu.ch CNAME → rvp.gymb.souveredu.ch (193.43.183.131)
auth.gymb.souveredu.ch CNAME → rvp.gymb.souveredu.ch
office.gymb.souveredu.ch CNAME → rvp.gymb.souveredu.ch
s3.gymb.souveredu.ch CNAME → rvp.gymb.souveredu.ch
...
Internal, server-to-server (split horizon):
cloud.int.gymb.souveredu.ch A → 172.16.19.101 (application host)
auth.int.gymb.souveredu.ch A → 172.16.19.101
office.int.gymb.souveredu.ch A → 172.16.19.101
s3.int.gymb.souveredu.ch A → 172.16.19.102 (storage host)
...
Tenant entry IPs:
rvp.gymb.souveredu.ch A → 193.43.183.131 (DMZ Traefik public)
reverseproxy.int.gymb A → 172.16.9.111 (DMZ Traefik internal)
ACME challenge labels (writeable via TSIG acme_update_key_demo_gymb):
_acme-challenge.cloud.gymb CNAME → cloud.demo-gymb._acme.digitalboard.ch
_acme-challenge.cloud.int.gymb CNAME → cloud.int.demo-gymb._acme.digitalboard.ch
...
```
The `.int.` family is what makes Nextcloud → Garage, Nextcloud →
Authentik (OIDC), Nextcloud → Collabora (WOPI) etc. **bypass the DMZ
Traefik**: the backend host's local Traefik presents the right cert
directly, so traffic stays on the backend subnet. Without this,
server-to-server calls would either ride out through the DMZ and back
in, or hit a hostname mismatch on the cert.
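The naming pattern above can be sketched as a small shell function. This is an illustration only: `tenant_names` is a hypothetical helper, and the base domain `souveredu.ch` and ACME parent `_acme.digitalboard.ch` are hard-coded from the `demo-gymb` example.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the four names the pattern yields for one service.
# Arguments: <service> <tenant label> <ACME label>, e.g. cloud gymb demo-gymb
tenant_names() {
  local svc="$1" tenant="$2" acme="$3"
  local base="souveredu.ch" acme_zone="_acme.digitalboard.ch"
  printf '%s.%s.%s\n' "$svc" "$tenant" "$base"        # public FQDN
  printf '%s.int.%s.%s\n' "$svc" "$tenant" "$base"    # internal FQDN
  # CNAME bridges: challenge name, then its target inside the ACME sub-tree
  printf '_acme-challenge.%s.%s.%s CNAME %s.%s.%s\n' \
    "$svc" "$tenant" "$base" "$svc" "$acme" "$acme_zone"
  printf '_acme-challenge.%s.int.%s.%s CNAME %s.int.%s.%s\n' \
    "$svc" "$tenant" "$base" "$svc" "$acme" "$acme_zone"
}

tenant_names cloud gymb demo-gymb
```

For a new tenant, calling the function with the new labels gives a quick checklist of the records that the zone file needs per service.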
## TSIG / ACL model
```mermaid
flowchart LR
classDef tenant fill:#dcfce7,stroke:#166534,color:#000
classDef zone fill:#dbeafe,stroke:#1e40af,color:#000
classDef acl fill:#fef3c7,stroke:#92400e,color:#000
subgraph KNOT["ns1.digitalboard.ch (Knot DNS)"]
Z1["_acme.digitalboard.ch<br/>(parent zone)"]:::zone
Z2["digitalboard._acme.digitalboard.ch<br/>(NS-delegated child)"]:::zone
A1["ACL acme_updates_digitalboard<br/>scope: digitalboard._acme.digitalboard.ch."]:::acl
A2["ACL acme_updates_demo_gymb<br/>scope: demo-gymb._acme.digitalboard.ch."]:::acl
A3["ACL acme_updates_demo_phbe<br/>scope: demo-phbe._acme.digitalboard.ch."]:::acl
A4["ACL acme_updates_demo_mbaz<br/>scope: demo-mbaz._acme.digitalboard.ch."]:::acl
end
DB["digitalboard.ch Traefik<br/>TSIG: acme_update_key_digitalboard"]:::tenant
GY["demo-gymb Traefik<br/>TSIG: acme_update_key_demo_gymb"]:::tenant
PH["demo-phbe Traefik<br/>TSIG: acme_update_key_demo_phbe"]:::tenant
MB["demo-mbaz Traefik<br/>TSIG: acme_update_key_demo_mbaz"]:::tenant
DB -- nsupdate TXT --> A1
GY -- nsupdate TXT --> A2
PH -- nsupdate TXT --> A3
MB -- nsupdate TXT --> A4
A1 -- writes into --> Z2
A2 -- writes into --> Z1
A3 -- writes into --> Z1
A4 -- writes into --> Z1
```
Each ACL is restricted to **`update-type: TXT`** and
**`update-owner-match: sub-or-equal`** under the tenant prefix, so a
leaked tenant key cannot write outside its own ACME sub-tree and cannot
modify non-TXT records (no A/CNAME/NS hijack).
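The effect of the two restrictions can be modelled in a few lines of shell. This is a sketch of the rule as described above, not of how Knot evaluates ACLs internally; `acl_allows` is a hypothetical name, and names are assumed to be lowercase.

```bash
#!/usr/bin/env bash
# Hypothetical model of the ACL rule: allow an update only if the record
# type is TXT and the owner name equals the scope or lies below it.
# Arguments: <owner name> <record type> <scope>; trailing dots are optional.
acl_allows() {
  local owner="${1%.}" rtype="$2" scope="${3%.}"
  [ "$rtype" = "TXT" ] || return 1
  case "$owner" in
    "$scope" | *".$scope") return 0 ;;   # equal, or a sub-name of the scope
    *) return 1 ;;
  esac
}

scope="demo-gymb._acme.digitalboard.ch."
acl_allows "cloud.demo-gymb._acme.digitalboard.ch." TXT "$scope" && echo "own sub-tree: allowed"
acl_allows "cloud.demo-phbe._acme.digitalboard.ch." TXT "$scope" || echo "other tenant: refused"
acl_allows "cloud.demo-gymb._acme.digitalboard.ch." CNAME "$scope" || echo "non-TXT: refused"
```

The leading dot in the sub-name pattern matters: without it, a name such as `xdemo-gymb._acme.digitalboard.ch` would wrongly count as being inside the tenant scope.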
## Traefik variables that bind to this layout
From `inventories/demo-gymburgdorf/group_vars/traefik_servers/traefik.yml`:
| Traefik variable | Value for `demo-gymb` | Bound to |
|---|---|---|
| `traefik_acme_dns_provider` | `rfc2136` | Knot dynamic-update endpoint |
| `traefik_acme_dns_zone` | `demo-gymb._acme.digitalboard.ch` | Per-tenant write scope on `ns1` |
| `traefik_acme_tsig_key_name` | `acme_update_key_demo_gymb` | Matches `key:` entry in [`knot.conf`](https://git.digitalboard.ch/Digitalboard/dns-zones/src/branch/main/knot/knot.conf) |
| `traefik_acme_tsig_secret` | Bao lookup | See [security.md](security.md) |
A tenant whose ACME zone does **not** match the Knot ACL
`update-owner-name` will get `REFUSED` on `nsupdate` and ACME issuance
will silently retry until the renewal window expires.
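To check the write path by hand before a deploy, one option is a throwaway TXT record sent with `nsupdate`. The sketch below only prints the batch; `acme_probe_batch` and the `probe` label are hypothetical, and how the TSIG key file is obtained is not covered here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print an nsupdate batch that adds or deletes a probe
# TXT record below the given ACME scope. It only prints text; nothing is sent.
# Arguments: add|delete <scope> [server]
acme_probe_batch() {
  local action="$1" scope="${2%.}" server="${3:-ns1.digitalboard.ch}"
  local name="probe.${scope}." update
  case "$action" in
    add)    update="update add ${name} 60 TXT \"tsig-probe\"" ;;
    delete) update="update delete ${name} TXT" ;;
    *)      echo "usage: acme_probe_batch add|delete <scope> [server]" >&2
            return 2 ;;
  esac
  printf 'server %s\n%s\nsend\n' "$server" "$update"
}

acme_probe_batch add demo-gymb._acme.digitalboard.ch
```

Piping the output into `nsupdate -k <keyfile>` would send the update. A `REFUSED` answer then points at a mismatch between the configured zone and the ACL scope; after a successful probe, the `delete` variant removes the record again.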


@ -0,0 +1,99 @@
<!-- markdownlint-disable MD013 MD060 MD051 -->
# Operations — new tenants and known gaps
← Back to [Architecture index](README.md)
## 10. Walkthrough: creating a new demo tenant
Recommended template: **`demo-gymburgdorf`** (not `vagrant`, since its
group topology is incompatible).
1. **Copy the inventory:**
```bash
cp -r inventories/demo-gymburgdorf inventories/demo-<customer>
```
2. **Adjust `hosts.yml`:** IPs and hostnames per host.
3. **`group_vars/all/vault.yml`** — point `vault_mount` at the new
tenant mount (`demo-<customer>`).
4. **`group_vars/traefik_servers/traefik.yml`:** point
`traefik_acme_dns_zone` and the `traefik_acme_tsig_*` lookup paths
at the new zone and the new Bao path.
5. **`host_vars/application/*.yml`** and
**`host_vars/storage/*.yml`:** walk through both and change the
FQDNs to the new domain pattern (e.g. `*.<customer>.souveredu.ch`)
and the Bao lookup paths to `demo-<customer>/data/…`.
6. **Prepare OpenBao** (out-of-band, not via Ansible):
- Create a new KV-v2 mount `demo-<customer>`.
- Write secrets: `acme-tsig`, `authentik`, `nextcloud`, `garage`, …
(see [security.md](security.md) for the mandatory-override list).
- Policy for the deploy token: read on `demo-<customer>/data/*`.
7. **DNS** (in the [`dns-zones`](https://git.digitalboard.ch/Digitalboard/dns-zones) repo, see
[dns.md](dns.md)):
- Add `key:` and `acl:` entries for the new tenant in
[`knot/knot.conf`](https://git.digitalboard.ch/Digitalboard/dns-zones/src/branch/main/knot/knot.conf), pattern
`acme_update_key_demo_<customer>` /
`acme_updates_demo_<customer>` scoped to
`demo-<customer>._acme.digitalboard.ch.`.
- Append the new ACL to the `_acme.digitalboard.ch` zone's `acl:`
list — the tenants share the parent zone, no NS delegation.
- In `zones/souveredu.ch.zone` (or the tenant's public zone) add
the public/internal A records (`rvp.<customer>`,
`reverseproxy.int.<customer>`, `application.int.<customer>`,
`storage.int.<customer>`, …), the service CNAMEs to
`rvp.<customer>`, and the `_acme-challenge.*` CNAMEs into
`demo-<customer>._acme.digitalboard.ch`. Bump the SOA serial.
- `make deploy_ns1` to push.
8. **Makefile** — add a new target modelled on
`deploy_site_demo_gymburgdorf` and wire it into
`deploy_site_demo`.
9. **Smoke test:**
`ansible all -i inventories/demo-<customer>/hosts.yml -m ping`.
10. **Deploy:** Bao login + `make deploy_site_demo_<customer>`.
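After steps 1 to 5, references to the template tenant are easy to overlook. A minimal sketch for listing them, assuming the template is `demo-gymburgdorf` with the labels `demo-gymb` and `gymb.souveredu.ch`; `leftover_refs` is a hypothetical helper:

```bash
#!/usr/bin/env bash
# Hypothetical helper: list lines in a copied inventory that still mention
# the template tenant (Bao mount, ACME label, or domain label).
# Note: 'demo-gymb' also matches 'demo-gymburgdorf'.
leftover_refs() {
  local dir="$1"
  # grep exit status 1 means "no match", which counts as clean here;
  # any other non-zero status (e.g. unreadable directory) is passed on.
  grep -rnE 'demo-gymb|gymb\.souveredu\.ch' "$dir" || [ "$?" -eq 1 ]
}

# Example call:
# leftover_refs inventories/demo-<customer>
```

An empty result suggests the copy no longer points at the template tenant's zone, domain, or Bao mount; it does not prove that the new values are correct.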
## 11. Known gaps and trade-offs
- **Optional services without group bindings in `demo-gymburgdorf`:**
`opencloud`, `send`, `opnform`, `homarr`, and `bookstack` are
declared as plays in
[playbooks/site.yml](https://git.digitalboard.ch/Digitalboard/reference-ansible/src/branch/main/playbooks/site.yml) but have no
`<service>_servers` group in the inventory — those plays run as
no-ops. If needed, add the group + `host_vars/application/<svc>.yml`
as described in [topology.md](topology.md). Mind spelling:
`opnform_servers` (not `openform`/`openforms`).
- **`turn` host:** defined in the DMZ, but no STUN/TURN role in
[playbooks/site.yml](https://git.digitalboard.ch/Digitalboard/reference-ansible/src/branch/main/playbooks/site.yml). Currently provisioned only
via `base` + `traefik`.
- **Idempotency:** roles are Docker-Compose-based; re-runs may trigger
container restarts when compose inputs change. There is no dedicated
rollback mechanism — on failure, roll back manually to the previous
state.
- **TLS renewal:** handled internally by Traefik via ACME. There is no
external renewal cron in the repo.
- **CI / testing:** not present in the repo. Smoke test is
`make ping_demo`.
- **Logs:** Traefik runs with `traefik_log_level: DEBUG` in
`demo-gymburgdorf` and `vagrant` (role default is `INFO`) — reduce
to `INFO` or `WARN` before adapting for production.
- **TSIG secrets in `knot.conf`:** the `dns-zones` repo currently
stores all four ACME TSIG keys in plaintext in
[`knot/knot.conf`](https://git.digitalboard.ch/Digitalboard/dns-zones/src/branch/main/knot/knot.conf). The Ansible
side reads them from Bao, but the Knot side does not — anyone with
read on the `dns-zones` repo can write TXT records under the
matching tenant's ACME sub-tree. For prod, source the Knot keys
from a templated config + secret store, or restrict repo access.
- **Demo tenants share `_acme.digitalboard.ch`:** isolation is by
Knot ACL `update-owner-name`, not by zone delegation. A mis-edit
of the ACL list could break ACL-based isolation without breaking
DNS resolution — failure is silent. The production zone
(`digitalboard.ch`) uses a properly delegated child zone and is
not affected.

71
architecture/security.md Normal file

@ -0,0 +1,71 @@
<!-- markdownlint-disable MD013 MD060 MD051 -->
# Security and demo-only defaults
← Back to [Architecture index](README.md)
> This repo is explicitly designed for **demo setups**. All default
> values in the roles are insecure and are overridden in `demo-*`
> inventories via Bao lookups or host_vars. For production deployments
> the hardening block further down also applies.
## Secret pattern (Bao lookup)
```yaml
# group_vars/.../<service>.yml or host_vars/.../<service>.yml
authentik_secret_key: "{{ lookup('community.hashi_vault.hashi_vault',
vault_mount + '/data/authentik:secret_key',
url=vault_addr) }}"
```
- `vault_mount` and `vault_addr` come from
[group_vars/all/vault.yml](https://git.digitalboard.ch/Digitalboard/reference-ansible/src/branch/main/inventories/demo-gymburgdorf/group_vars/all/vault.yml).
- KV-v2 paths require an explicit `/data/` segment — Ansible does not
resolve this automatically.
- `vault_mount` is unique per inventory (`demo-gymburgdorf`,
`demo-phbern`, …) → tenant isolation in Bao via mount + policy.
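The path convention can be made explicit with a small helper. This is a sketch only: `kv2_lookup_path` is a hypothetical function that reproduces the string built in the YAML example above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the "<mount>/data/<secret>:<field>" term that
# the lookup in the YAML example expects for a KV-v2 mount.
kv2_lookup_path() {
  local mount="$1" secret="$2" field="$3"
  printf '%s/data/%s:%s\n' "$mount" "$secret" "$field"
}

kv2_lookup_path demo-gymburgdorf authentik secret_key
```

Keeping the `/data/` segment in one place makes it harder to forget when a new secret is added for a tenant.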
## Demo-only defaults — override required
These defaults in `digitalboard.core` are insecure. In any
**production-grade** deployment they must be overridden via Bao lookup
or host_var:
| Variable | Default | Where to override |
|---|---|---|
| `keycloak_admin_password` | `changeme` | host_vars `keycloak_servers` |
| `keycloak_postgres_password` | `changeme` | same |
| `authentik_secret_key` | `changeme-generate-a-random-string` | `host_vars/application/authentik.yml` |
| `authentik_postgres_password` | `changeme` | same |
| `nextcloud_admin_password` | `admin` | `host_vars/application/nextcloud.yml` |
| `nextcloud_postgres_password` | `changeme` | same |
| `nextcloud_s3_key` / `nextcloud_s3_secret` | `changeme` / `changeme` | same |
| `garage_webui_password` | `admin` | `host_vars/storage/garage.yml` |
| `garage_rpc_secret` | `0123…cdef` (64-hex constant) | same |
| `garage_admin_token` | identical to `rpc_secret` | same |
| `garage_metrics_token` | identical to `rpc_secret` | same |
> **Convention:** every value listed above **must** have a Bao lookup
> in `demo-*/host_vars/.../...yml` before the inventory is considered
> deploy-ready.
## Threat boundaries (current demo state)
| Boundary | Status | Notes |
|---|---|---|
| DMZ ↔ Backend (172.16.9 ↔ 172.16.19) | **Plaintext HTTP** | Auth bearers, OIDC codes, session cookies travel unencrypted. Fine for demo; for prod use mTLS or a WireGuard overlay. |
| Host firewall | **missing** | The `base` role does not install UFW/nftables. Segmentation relies on the hypervisor/VLAN. |
| SSH | `ansible_user: root` | No bastion, no jump host. Key distribution out-of-band. |
| Authentik SPOF | **accepted** | IdP and SP services share the same host (`application`). An Authentik outage means a login outage including the LDAP outpost. No break-glass path. |
| ACME TSIG key | Bao lookup (in Ansible), **plaintext in [`knot.conf`](https://git.digitalboard.ch/Digitalboard/dns-zones/src/branch/main/knot/knot.conf)** on `ns1` side | One TSIG key per demo tenant, scoped via Knot ACL `update-owner-name` to the tenant's ACME sub-tree. Rotation is manual and must be done on both sides simultaneously (Bao + `knot.conf` + `knotc zone-reload`). |
| Backup / DR | **out of scope** | Garage `replication_factor: 1` (default), no Postgres backup job, no Bao snapshot cron. |
## To adapt for production, add
- Host firewall (extend the `base` role or add a dedicated `firewall`
role).
- mTLS or WireGuard between DMZ and backend.
- Authentik on a separate host with a recovery admin token.
- Bao policies per inventory mount (read-only for the deploy token,
write-only for the bootstrap job).
- Backup cron for Postgres + Garage + Bao.
- SSH bastion + key rotation.

68
architecture/setup.md Normal file

@ -0,0 +1,68 @@
<!-- markdownlint-disable MD013 MD060 MD051 -->
# Setup and repo layout
← Back to [Architecture index](README.md)
## 1. Repo layout and role provenance
```text
reference-ansible/
├── Makefile # Deploy targets, OIDC login, OBJC fork workaround
├── ansible.cfg # collections_path, remote_user=root, hashi_vault auth_method=token
├── requirements.yml # community.hashi_vault + digitalboard.core (Git)
├── playbooks/site.yml # Play sequence (16 plays, see deploy.md)
├── collections/ # ← installed by `make install`, gitignored
│ └── ansible_collections/
│ └── digitalboard/core/
│ └── roles/ # 🔑 Roles live HERE, NOT in the repo root
└── inventories/
├── demo-gymburgdorf/ # Inventory used throughout this document
├── demo-mbazürich/
├── demo-phbern/
└── vagrant/ # Local test inventory with its own topology
```
> **Important:** There is **no** `roles/` directory at the repo root.
> All roles come from the `digitalboard.core` collection (see
> [requirements.yml](https://git.digitalboard.ch/Digitalboard/reference-ansible/src/branch/main/requirements.yml)), installed via `make install`
> into `./collections/`. Plays reference them by FQCN
> `digitalboard.core.<role>`.
## 2. Setup and prerequisites
**Tools on the control node:**
- `ansible` (Core ≥ 2.15)
- `bao` CLI (OpenBao) — e.g. `sudo pacman -S openbao python-hvac` (Arch) or Homebrew
- `python-hvac` (for `community.hashi_vault` lookups)
- On macOS: `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES` (set in the
[Makefile](https://git.digitalboard.ch/Digitalboard/reference-ansible/src/branch/main/Makefile); without it Ansible forks crash on Bao lookups)
**Initial setup:**
```bash
git clone <repo>
cd reference-ansible
make install # Galaxy + digitalboard.core into ./collections/
```
**Before every deploy:** Bao login in the **same shell** that will then
run `ansible-playbook`:
```bash
export BAO_ADDR=https://bao.digitalboard.ch
bao login -method=oidc -path=Digitalboard
export VAULT_TOKEN=$(bao print token)
```
> ⚠️ `make bao` on its own is **not enough**: every `make` target spawns
> a new shell, and the `VAULT_TOKEN` exported in there only lives for
> the duration of `make bao` itself. Either run the three commands
> above manually, or invoke `make bao deploy_site_demo_gymburgdorf` as
> **one** call — otherwise the deploy has no token.
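The underlying shell behaviour can be reproduced without `make` or Bao. In this sketch `DEMO_TOKEN` is a placeholder variable standing in for `VAULT_TOKEN`:

```bash
#!/usr/bin/env bash
# DEMO_TOKEN is a placeholder; no real token is involved.
unset DEMO_TOKEN

# A child shell exports the variable and exits: the parent never sees it.
# This mirrors a recipe that exports a token inside its own shell.
sh -c 'export DEMO_TOKEN=from-child'
echo "after child export: ${DEMO_TOKEN:-unset}"

# Exported in the parent, the variable is inherited by every child process.
export DEMO_TOKEN=from-parent
sh -c 'echo "child inherits: $DEMO_TOKEN"'
```

Environment variables only flow from parent to child, which is why the token has to be exported in the shell that later starts the deploy.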
**Smoke test:**
```bash
make ping_demo # pings all three demo inventories
```

110
architecture/topology.md Normal file

@ -0,0 +1,110 @@
<!-- markdownlint-disable MD013 MD060 MD051 -->
# Topology — inventory and services
← Back to [Architecture index](README.md)
## 4. Inventory topology (`demo-gymburgdorf`)
```mermaid
flowchart LR
classDef dmz fill:#fee2e2,stroke:#991b1b,color:#000
classDef app fill:#dcfce7,stroke:#166534,color:#000
classDef stor fill:#dbeafe,stroke:#1e40af,color:#000
classDef turn fill:#fef9c3,stroke:#854d0e,color:#000
subgraph ALL["group: all_servers"]
direction LR
subgraph DMZ["DMZ 172.16.9.0/24"]
RP["<b>reverseproxy</b><br/>172.16.9.111<br/>traefik_mode: dmz"]:::dmz
TURN["<b>turn</b><br/>172.16.9.112<br/>(no role in site.yml yet)"]:::turn
end
subgraph BE["Backend 172.16.19.0/24<br/>group: backend_servers"]
APP["<b>application</b><br/>172.16.19.101<br/>traefik_mode: backend<br/>+ authentik, authentik_outpost_ldap,<br/> nextcloud, collabora, drawio"]:::app
ST["<b>storage</b><br/>172.16.19.102<br/>traefik_mode: backend<br/>+ garage (S3)"]:::stor
end
end
RP -.HTTPS in, HTTP out.-> APP
RP -.HTTPS in, HTTP out.-> ST
```
**Group memberships (from [hosts.yml](https://git.digitalboard.ch/Digitalboard/reference-ansible/src/branch/main/inventories/demo-gymburgdorf/hosts.yml)):**
| Group | Members | Purpose |
|---|---|---|
| `all_servers` | `reverseproxy`, `application`, `storage`, `turn` | Base role for all hosts |
| `traefik_servers` | `children: all_servers` (= all 4 hosts) | Traefik everywhere; DMZ/backend via `traefik_mode` |
| `backend_servers` | `application`, `storage` | Sets `traefik_mode: backend` via group var |
| `garage_servers` | `storage` | Single-host wrapper for the Garage role |
| `nextcloud_servers`, `collabora_servers`, `drawio_servers`, `authentik_servers`, `authentik_outpost_ldap_servers` | `application` only | Single-host wrappers |
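
To experiment with these memberships without touching the real inventory, the table can be reproduced in a scratch directory. The YAML below is inferred from the table and the diagram only. The actual `hosts.yml` may be laid out differently, and the use of `ansible_host` for the addresses is an assumption.

```bash
# Write a hypothetical inventory into a scratch directory.
# Layout inferred from the membership table; the real hosts.yml may differ.
scratch="$(mktemp -d)"
cat > "$scratch/hosts.yml" <<'EOF'
all_servers:
  hosts:
    reverseproxy: { ansible_host: 172.16.9.111 }
    turn:         { ansible_host: 172.16.9.112 }
    application:  { ansible_host: 172.16.19.101 }
    storage:      { ansible_host: 172.16.19.102 }
traefik_servers:
  children:
    all_servers:
backend_servers:
  hosts:
    application:
    storage:
garage_servers:
  hosts:
    storage:
nextcloud_servers:
  hosts:
    application:
EOF

# If Ansible is installed, print the resulting group tree.
if command -v ansible-inventory >/dev/null 2>&1; then
  ansible-inventory -i "$scratch/hosts.yml" --graph
fi
```

The remaining single-host wrappers (`collabora_servers`, `drawio_servers`, and so on) would follow the same pattern as `nextcloud_servers`.
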
> **Difference vs. the `vagrant` inventory:** `vagrant` structures
> Traefik differently — via the children groups `traefik_servers_dmz`
> and `traefik_servers_backend` instead of `backend_servers` +
> `host_vars` override. The two topologies are **structurally
> incompatible**; a 1:1 mapping is not possible. See
> [operations.md](operations.md) for the recommended template.
## 5. Service layout and variable placement
```mermaid
flowchart TB
classDef rp fill:#fee2e2,stroke:#991b1b,color:#000
classDef ap fill:#dcfce7,stroke:#166534,color:#000
classDef st fill:#dbeafe,stroke:#1e40af,color:#000
classDef ext fill:#e9d5ff,stroke:#6b21a8,color:#000
Internet((Internet))
DNS["DNS ns1.digitalboard.ch<br/>RFC2136 TSIG<br/>Zone: demo-gymb._acme.digitalboard.ch<br/>CNAME bridge: _acme-challenge.*.gymb.souveredu.ch"]:::ext
BAO["OpenBao<br/>bao.digitalboard.ch<br/>mount: demo-gymburgdorf"]:::ext
subgraph RP["<b>reverseproxy</b> — traefik dmz"]
TRDMZ["traefik (file provider)<br/>📍 group_vars/traefik_servers/traefik.yml<br/>📍 host_vars/reverseproxy/traefik.yml<br/> → traefik_mode: dmz<br/> → traefik_dmz_exposed_services"]:::rp
end
subgraph APP["<b>application</b> — traefik backend"]
TRA["traefik (docker provider)<br/>📍 group_vars/backend_servers/traefik.yml"]:::ap
AK["authentik (OIDC + LDAP outpost backend)<br/>📍 host_vars/application/authentik.yml"]:::ap
AKO["authentik_outpost_ldap<br/>📍 host_vars/application/authentik_outpost_ldap.yml"]:::ap
NC["nextcloud<br/>📍 host_vars/application/nextcloud.yml"]:::ap
COL["collabora<br/>📍 host_vars/application/collabora.yml"]:::ap
DRW["drawio<br/>📍 host_vars/application/drawio.yml"]:::ap
end
subgraph ST["<b>storage</b> — traefik backend"]
TRS["traefik (docker provider)"]:::st
GAR["garage (S3)<br/>📍 host_vars/storage/garage.yml"]:::st
end
Internet -->|HTTPS :443| TRDMZ
TRDMZ -->|HTTP backend| TRA
TRDMZ -->|HTTP backend| TRS
TRA --> AK & AKO & NC & COL & DRW
TRS --> GAR
NC -. S3 .-> GAR
NC -. OIDC .-> AK
NC -. WOPI .-> COL
NC -. LDAP .-> AKO
AKO -. RPC + token .-> AK
TRDMZ -. ACME DNS-01 TSIG .-> DNS
TRDMZ -. hashi_vault acme-tsig .-> BAO
AK -. hashi_vault secrets .-> BAO
NC -. hashi_vault secrets .-> BAO
GAR -. hashi_vault secrets .-> BAO
```
> **Note:** `opencloud`, `send`, `opnform`, `homarr`, and `bookstack`
> are defined as plays in [playbooks/site.yml](https://git.digitalboard.ch/Digitalboard/reference-ansible/src/branch/main/playbooks/site.yml)
> but currently have no matching group in
> [hosts.yml](https://git.digitalboard.ch/Digitalboard/reference-ansible/src/branch/main/inventories/demo-gymburgdorf/hosts.yml) for
> `demo-gymburgdorf`, so Ansible skips those plays (no hosts matched). If a
> tenant needs these services, add the corresponding
> `<service>_servers` group in `hosts.yml` and a
> `host_vars/application/<service>.yml` (mind the spelling — the
> forms role is `opnform`, the LDAP role is `389ds`).
>
> The `turn` host is in `all_servers` (and therefore in
> `traefik_servers`) but has **no** service group of its own —
> currently only the `base` and `traefik` roles run on it.
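
The two steps from the note (a new `<service>_servers` group plus a `host_vars` file) can be scaffolded with a few lines of shell. This is a sketch under assumptions: `scaffold_service` is a hypothetical helper, it only creates a placeholder variables file, and the group snippet it prints still has to be merged into `hosts.yml` by hand.

```bash
# Create a placeholder host_vars file for a service on the `application` host
# and print the inventory group that still has to be added to hosts.yml.
# $1: inventory directory, $2: service/role name (e.g. bookstack)
scaffold_service() {
  local inv_dir="$1" service="$2"
  local vars_file="$inv_dir/host_vars/application/$service.yml"

  mkdir -p "$(dirname "$vars_file")"
  # Never overwrite an existing variables file.
  if [ ! -e "$vars_file" ]; then
    printf '%s\n' '---' "# ${service}_* variables go here" > "$vars_file"
  fi

  # Group snippet in Ansible's YAML inventory format.
  printf '%s_servers:\n  hosts:\n    application:\n' "$service"
}
```

For example, `scaffold_service inventories/demo-gymburgdorf bookstack`. Afterwards, `ansible-playbook -i inventories/demo-gymburgdorf/hosts.yml playbooks/site.yml --list-hosts` lists the hosts each play matches, which shows whether the play is still skipped.
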

58
architecture/variables.md Normal file

@ -0,0 +1,58 @@
<!-- markdownlint-disable MD013 MD060 MD051 -->
# Variables — hierarchy and cheatsheet
← Back to [Architecture index](README.md)
## 3. Variable hierarchy
Ansible merges variables from multiple sources. Simplified model for
this repo (see the Ansible docs for the full precedence rules):
```mermaid
flowchart LR
classDef role fill:#fef3c7,stroke:#92400e,color:#000
classDef group fill:#dbeafe,stroke:#1e40af,color:#000
classDef host fill:#dcfce7,stroke:#166534,color:#000
classDef vault fill:#fee2e2,stroke:#991b1b,color:#000
R["<b>role defaults</b><br/>(lowest precedence)<br/>collections/.../roles/&lt;r&gt;/defaults/main.yml"]:::role
GA["<b>group_vars/all/</b><br/>vault.yml, docker.yml"]:::group
GG["<b>group_vars/&lt;group&gt;/</b><br/>traefik_servers/, backend_servers/<br/>(parallel groups, merged via<br/>ansible_group_priority)"]:::group
HV["<b>host_vars/&lt;host&gt;/</b><br/>(highest of the three inventory sources)"]:::host
BAO["<b>OpenBao</b><br/>lookup at runtime"]:::vault
R --> |"&lt;overridden by&gt;"| GA
GA --> |"&lt;overridden by&gt;"| GG
GG --> |"&lt;overridden by&gt;"| HV
HV -.community.hashi_vault.-> BAO
GG -.community.hashi_vault.-> BAO
```
**Key properties:**
- Multiple `group_vars/<group>/` are **parallel**, not hierarchically
nested. `traefik_servers` and `backend_servers` have the same
`ansible_group_priority` (default 1), so they are merged in
alphabetical order: on conflict the alphabetically later group wins
(here `traefik_servers`).
- `host_vars/<host>/` beats any group.
- `host_vars/reverseproxy/traefik.yml: traefik_mode: dmz` does not
compete with the default from `group_vars/backend_servers/`:
`reverseproxy` is not a member of `backend_servers`, so that default
never applies to it. The host var sets the mode explicitly, and it
would still win if the host were ever added to the group.
**Bao lookups** are not a precedence layer but **values** inside any
variable source. See [security.md](security.md) for the pattern.
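
To see which value actually wins for a given host, the merged inventory variables can be printed with an ad-hoc `debug` call. The wrapper below is a minimal sketch; `show_var` is a hypothetical name, and only the `ansible` command line inside it is standard.

```bash
# Print the value of one variable as Ansible resolves it for a host pattern.
# $1: inventory file, $2: host or group pattern, $3: variable name
show_var() {
  local inventory="$1" pattern="$2" var="$3"
  # The debug module runs on the control node, so no SSH connection is needed.
  ansible -i "$inventory" "$pattern" -m ansible.builtin.debug -a "var=$var"
}
```

If the placement described here holds, `show_var inventories/demo-gymburgdorf/hosts.yml all_servers traefik_mode` should report `dmz` for `reverseproxy` and `backend` for `application` and `storage`. Role defaults are not loaded by an ad-hoc call, and a variable whose value contains a Bao lookup triggers that lookup, so a valid token is needed for those.
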
## 9. Variable cheatsheet
| Variable | Where in `demo-gymburgdorf/` | Why |
|---|---|---|
| `vault_addr`, `vault_mount` | `group_vars/all/vault.yml` | Bao endpoint applies site-wide |
| `docker_registry_mirrors` | `group_vars/all/docker.yml` | Pulls from mirror on all hosts |
| `traefik_acme_*`, `traefik_use_ssl`, `traefik_cert_mode` | `group_vars/traefik_servers/traefik.yml` | Applies to every Traefik instance (dmz + backend) |
| `traefik_mode: backend` | `group_vars/backend_servers/traefik.yml` | Default for app + storage |
| `traefik_mode: dmz` | `host_vars/reverseproxy/traefik.yml` | Host-specific override |
| `traefik_dmz_exposed_services` | `host_vars/reverseproxy/` | DMZ backend list — only meaningful here |
| `nextcloud_*`, `authentik_*`, `collabora_*`, `drawio_*` | `host_vars/application/<service>.yml` | Service runs on `application` |
| `garage_*` | `host_vars/storage/garage.yml` | Service runs on `storage` |
| Secrets (passwords, tokens, keys) | inline variable using `lookup('community.hashi_vault.hashi_vault', …)` | Single source of truth via Bao |