diff --git a/README.md b/README.md index 6a5b32e..d49b004 100644 --- a/README.md +++ b/README.md @@ -1,35 +1,103 @@ -# πŸ“š Documentation Repository + +# πŸ“š Digitalboard Documentation -This repository contains documentation, guides, and reference material. +Welcome β€” this repository is the **central documentation hub** for the +Digitalboard platform. It collects architecture notes, operational +runbooks, integration guides, and troubleshooting recipes that span +multiple repositories, so they have one stable home instead of being +scattered across READMEs. -## πŸ“– Available Documentation +## πŸ›οΈ Platform at a glance -- **[Contribution guidelines](./contributing/)** - Documentation and guides related to infrastructure configuration and best practices. - - [Git](./contributing/git.md) - Guidelines for contributing using git +```mermaid +flowchart LR + classDef docs fill:#dbeafe,stroke:#1e40af,color:#000 + classDef core fill:#dcfce7,stroke:#166534,color:#000 + classDef ans fill:#fef3c7,stroke:#92400e,color:#000 + classDef ext fill:#e9d5ff,stroke:#6b21a8,color:#000 -- **[Infrastructure](./infrastructure/)** - Documentation and guides related to infrastructure configuration and best practices. - - [ACME](./infrastructure/acme.md) - Documentation of the ACME concept. - - [IPV6](./infrastructure/ipv6.md) - Documentation of the ipv6 concept. + User((Operator / Engineer)) -- **[Keycloak](./keycloak/)** - Documentation and guides related to Keycloak configuration and best practices. - - [Enforce OTP 2FA for Internal Users](./keycloak/enforce-otp-internal.md) - Step-by-step instructions for enforcing OTP-based two-factor authentication for internal users, while excluding external Microsoft Entra users. - - [Integrate MS Entra in Keycloak as IDP](./keycloak/idp-ms-entra.md) - Step-by-step instructions for integrating MS Entra as identity-provider. + subgraph REPOS["Digitalboard repositories"] + DOCS["docs
📖 architecture, runbooks,
integration guides
(this repo)"]:::docs + CORE["digitalboard.core
⚙️ Ansible collection
= all roles
(traefik, authentik, nextcloud,
garage, keycloak, …)"]:::core + REF["reference-ansible
🚀 inventories + playbooks
(demo-gymburgdorf,
demo-phbern, demo-mbazürich,
vagrant)"]:::ans + end -- **[Microsoft Entra](./ms-entra/)** - Documentation and guides related to Microsft Entra configuration and best practices. - - [Enterprise App Integration with Keycloak](./ms-entra/enterprise-app-keycloak.md) - Step-by-step instructions for creating an Enterprise Application in Microsoft Entra (Azure AD) as an identity provider for Keycloak. + subgraph PLATFORM["Runtime targets"] + BAO["OpenBao
bao.digitalboard.ch
(secrets)"]:::ext + DNS["Knot DNS
ns1.digitalboard.ch
(ACME / split-horizon)"]:::ext + HOSTS["Tenant VMs
(reverseproxy · application ·
storage Β· turn)"]:::ext + end -- **[Troubleshooting](./troubleshooting/)** - Encountered & solved problems. - - [Nextcloud File Locking](./troubleshooting/nextcloud-file-locking.md) - Preventing sync conflicts when multiple users edit the same file via the Nextcloud desktop client. + User -->|reads| DOCS + User -->|runs `make deploy_…`| REF + REF -->|requires| CORE + REF -.->|hashi_vault lookups.-> BAO + REF -->|ansible-playbook| HOSTS + HOSTS -.->|nsupdate TSIG / ACME DNS-01.-> DNS + HOSTS -.->|hashi_vault lookups.-> BAO + DOCS -.documents.-> REF + DOCS -.documents.-> CORE +``` +**The three repos at a glance:** + +| Repo | Role | Link | +|---|---|---| +| **`docs`** *(here)* | Architecture, integration guides, runbooks, troubleshooting. The "why" and the "how it fits together." | [git.digitalboard.ch/Digitalboard/docs](https://git.digitalboard.ch/Digitalboard/docs) | +| **`digitalboard.core`** | Ansible collection β€” every reusable role (Traefik, Authentik, Keycloak, Nextcloud, Garage, …). The "what runs on a host." | [git.digitalboard.ch/Digitalboard/digitalboard.core](https://git.digitalboard.ch/Digitalboard/digitalboard.core) | +| **`reference-ansible`** | Inventories + playbooks for the demo tenants and the `vagrant` test setup. The "what gets deployed where, with which variables." | [git.digitalboard.ch/Digitalboard/reference-ansible](https://git.digitalboard.ch/Digitalboard/reference-ansible) | + +> πŸš€ **Want to deploy something?** Start in +> [`reference-ansible`](https://git.digitalboard.ch/Digitalboard/reference-ansible) β€” +> its README covers the Bao login, the `make` targets, and the available +> playbooks. Come back here for the architectural background +> ([architecture/](./architecture/)) or for solved problems +> ([troubleshooting/](./troubleshooting/)). + +## πŸ“– Contents + +- **[Architecture](./architecture/)** β€” How the `reference-ansible` + deployment is structured, using `demo-gymburgdorf` as the running example. 
+ - [Index & glossary](./architecture/README.md) + - [Setup and repo layout](./architecture/setup.md) β€” control-node prerequisites, Bao login workflow + - [Variables](./architecture/variables.md) β€” Ansible variable hierarchy and cheatsheet + - [Topology](./architecture/topology.md) β€” Inventory groups, service layout per host + - [DNS and ACME](./architecture/dns.md) β€” Knot zones, TSIG/ACL model, split-horizon FQDNs + - [Deploy](./architecture/deploy.md) β€” Play sequence, Traefik DMZ vs. backend modes + - [Security](./architecture/security.md) β€” Bao lookup pattern, demo-only defaults, production hardening + - [Operations](./architecture/operations.md) β€” New-tenant walkthrough, known gaps + +- **[Contributing](./contributing/)** β€” Conventions for collaborating on this codebase. + - [Git](./contributing/git.md) β€” Guidelines for contributing using git + +- **[Infrastructure](./infrastructure/)** β€” Infrastructure-level concepts that apply across services. + - [ACME](./infrastructure/acme.md) β€” Documentation of the ACME concept + - [IPv6](./infrastructure/ipv6.md) β€” Documentation of the IPv6 concept + +- **[Keycloak](./keycloak/)** β€” Keycloak configuration and best practices. + - [Account Linking](./keycloak/account-linking.md) β€” How to link existing accounts to a federated identity + - [Enforce OTP 2FA for Internal Users](./keycloak/enforce-otp-internal.md) β€” OTP-based 2FA for internal users, excluding external MS Entra users + - [Integrate MS Entra in Keycloak as IDP](./keycloak/idp-ms-entra.md) β€” MS Entra as identity provider + +- **[Microsoft Entra](./ms-entra/)** β€” Microsoft Entra configuration and best practices. + - [Enterprise App Integration with Keycloak](./ms-entra/enterprise-app-keycloak.md) β€” Enterprise App in MS Entra (Azure AD) as IDP for Keycloak + +- **[Troubleshooting](./troubleshooting/)** β€” Encountered & solved problems. 
+ - [Nextcloud File Locking](./troubleshooting/nextcloud-file-locking.md) β€” Preventing sync conflicts when multiple users edit the same file via the Nextcloud desktop client + +## 🧭 Where to look + +| If you want to… | Go to | +|---|---| +| Understand how a tenant is wired up | [architecture/topology.md](./architecture/topology.md) | +| Set up a new demo tenant | [architecture/operations.md](./architecture/operations.md) | +| Look up a variable's correct home | [architecture/variables.md](./architecture/variables.md) | +| Understand why two ACME models coexist | [architecture/dns.md](./architecture/dns.md) | +| Plug an identity provider into Keycloak | [keycloak/](./keycloak/) | +| Solve a recurring runtime issue | [troubleshooting/](./troubleshooting/) | + +--- + +πŸ“ Contributions follow the guidelines in [contributing/git.md](./contributing/git.md). diff --git a/architecture/README.md b/architecture/README.md new file mode 100644 index 0000000..df6317f --- /dev/null +++ b/architecture/README.md @@ -0,0 +1,44 @@ + +# Architecture β€” `reference-ansible` + +This documentation describes the architecture of the `reference-ansible` +repository and uses the inventory `inventories/demo-gymburgdorf/` as a +running example. It serves both as onboarding documentation for new +engineers and as a reference when setting up additional demo tenants. + +> **Demo-only.** All defaults in the roles (passwords, tokens, RPC +> secrets) are insecure and intended exclusively for demo setups. See +> [security.md](security.md). 
+ +**Last updated:** 2026-05-26 Β· **Owner:** @sbaerlocher + +## Contents + +| Section | File | Topics | +|---|---|---| +| Setup and repo layout | [setup.md](setup.md) | Repo layout, role provenance, control-node prerequisites, Bao login workflow | +| Variables | [variables.md](variables.md) | Ansible variable hierarchy, variable cheatsheet | +| Topology | [topology.md](topology.md) | Inventory groups, service layout per host, variable placement | +| DNS and ACME | [dns.md](dns.md) | Knot zones, NS-delegated vs. ACL-isolated ACME models, split-horizon FQDNs, TSIG/ACL | +| Deploy | [deploy.md](deploy.md) | Play sequence, Traefik DMZ/backend modes | +| Security | [security.md](security.md) | Bao lookup pattern, demo-only defaults, threat boundaries, production hardening | +| Operations | [operations.md](operations.md) | New-tenant walkthrough, known gaps and trade-offs | + +## Glossary + +| Term | Meaning | +|---|---| +| **OpenBao** | HashiCorp Vault fork. Single source of truth for secrets. Endpoint: `bao.digitalboard.ch`. | +| **Authentik** | Identity provider. Issues OIDC for SP services and LDAP via the Outpost. | +| **Outpost (Authentik)** | Separate Authentik sidecar that emulates LDAP/proxy protocols for legacy apps. Talks to Authentik via RPC + token. | +| **WOPI** | Web Application Open Platform Interface β€” protocol used by Nextcloud/Opencloud to hand office documents to Collabora. | +| **TSIG / RFC2136** | Authenticated DNS updates. Traefik uses TSIG-signed `nsupdate` calls for ACME DNS-01 challenges. | +| **DNS-01 (ACME)** | Let's Encrypt challenge type: certificate ownership is proven via a TXT record in DNS instead of HTTP. Required for wildcard certs. | +| **CNAME bridge** | `_acme-challenge.` points via CNAME into a dedicated update label (`.demo-gymb._acme.digitalboard.ch`), keeping the TSIG key scoped to a narrow sub-tree. See [dns.md](dns.md). | +| **Knot DNS** | Authoritative DNS server used on `ns1.digitalboard.ch`. 
Config and zone files live in the separate [`dns-zones`](https://git.digitalboard.ch/Digitalboard/dns-zones) repo. |
+| **DNSSEC** | Zones are signed with Ed25519, NSEC3 (no opt-out), KSK 1y / ZSK 90d rollovers, CDS/CDNSKEY published for automatic DS at the parent. |
+| **Split horizon** | Two FQDN families per service: public `.gymb.souveredu.ch` → DMZ Traefik front-end IP, internal `.int.gymb.souveredu.ch` → directly to the backend host. See [dns.md](dns.md). |
+| **File provider / Docker provider** | Traefik configuration sources. The file provider reads static YAML; the Docker provider reads container labels via `/var/run/docker.sock`. |
+| **STUN/TURN** | NAT-traversal protocols for WebRTC (e.g. for Nextcloud Talk). Runs on a separate host (`turn`). |
+| **Garage** | S3-compatible object store (Rust). Backend for Nextcloud/Opencloud. |
+| **FQCN** | Fully Qualified Collection Name, e.g. `digitalboard.core.traefik`. Recommended for all content since Ansible 2.10 and used here to reference every role from the collection. |
diff --git a/architecture/deploy.md b/architecture/deploy.md
new file mode 100644
index 0000000..13f1ab8
--- /dev/null
+++ b/architecture/deploy.md
@@ -0,0 +1,74 @@
+
+# Deploy flow and Traefik modes
+
+← Back to [Architecture index](README.md)
+
+## 6. Deploy flow
+
+Sequence taken from [playbooks/site.yml](https://git.digitalboard.ch/Digitalboard/reference-ansible/src/branch/main/playbooks/site.yml):
+
+```mermaid
+sequenceDiagram
+    participant U as User
+    participant A as ansible-playbook
+    participant V as OpenBao
+    participant H as Hosts
+
+    U->>U: bao login + export VAULT_TOKEN
+    U->>A: make deploy_site_demo_gymburgdorf
+    A->>A: load vars: role defaults → group_vars/all → group_vars/<groups> → host_vars/<host>
+    A->>V: community.hashi_vault lookups
(acme-tsig, service secrets) + V-->>A: secret values + A->>H: Play 1 β€” base (all hosts) + A->>H: Play 2 β€” traefik (all hosts: dmz on reverseproxy, backend elsewhere) + A->>H: Play 3 β€” httpbin + A->>H: Play 4 β€” 389ds + A->>H: Play 5 β€” keycloak + A->>H: Play 6 β€” garage (storage) + A->>H: Play 7 β€” collabora (application) + A->>H: Play 8 β€” authentik (application) + A->>H: Play 9 β€” authentik_outpost_ldap (application) + A->>H: Play 10 β€” nextcloud (application) + A->>H: Play 11 β€” drawio (application) + A->>H: Play 12 β€” send + A->>H: Play 13 β€” opnform + A->>H: Play 14 β€” homarr + A->>H: Play 15 β€” bookstack + A->>H: Play 16 β€” opencloud +``` + +Plays without matching group members (`httpbin_servers`, +`ds389_servers`, `keycloak_servers`, `send_servers`, +`opnform_servers`, `homarr_servers`, `bookstack_servers`, +`opencloud_servers` in this inventory) run as no-ops. + +> **Role-name spelling traps:** the LDAP role is `389ds` (not +> `ds389`); the forms role is `opnform` (not `openforms`/`openform`). +> Inventory groups must match the names used in +> [playbooks/site.yml](https://git.digitalboard.ch/Digitalboard/reference-ansible/src/branch/main/playbooks/site.yml) exactly β€” +> `ds389_servers`, `opnform_servers`. + +`--diff` is enabled in the target β†’ per-task changes are visible. + +## 7. Traefik modes (DMZ vs Backend) + +**`traefik_mode: dmz`** β€” public-facing reverse proxy on `reverseproxy`: + +- **File provider** with `services.yml` for static routing. +- No Docker socket mounted, no local containers. +- Routes to `backend_host` addresses on other machines. +- Backends are declared via `traefik_dmz_exposed_services` (a list in + `host_vars/reverseproxy/`). Selective backend selection is also + possible via `traefik_backend_servers_to_proxy`. + +**`traefik_mode: backend`** β€” application/storage: + +- Mounts `/var/run/docker.sock`. +- **Docker provider**: auto-discovery via container labels + (`traefik.enable=true`). 
+- Services are exposed locally; the DMZ Traefik routes external + traffic to them in plaintext HTTP (see + [security.md](security.md)). + +**Both modes** support ACME via RFC2136 DNS challenge or self-signed +(`traefik_cert_mode: acme | selfsigned`). diff --git a/architecture/dns.md b/architecture/dns.md new file mode 100644 index 0000000..226dd5c --- /dev/null +++ b/architecture/dns.md @@ -0,0 +1,123 @@ + +# DNS topology and ACME zone layout + +← Back to [Architecture index](README.md) + +Authoritative DNS for everything described in this document runs on +**`ns1.digitalboard.ch`** (public `193.43.183.169`, DMZ `172.16.9.169`) +using **Knot DNS**. The zone files and Knot config live in the +[`dns-zones`](https://git.digitalboard.ch/Digitalboard/dns-zones) repo; this section explains how the +public service FQDNs, the internal "split-horizon" FQDNs, and the ACME +challenge sub-trees fit together. + +## Authoritative zones on `ns1` + +| Zone | Purpose | DNSSEC | Dynamic updates | +|---|---|---|---| +| `digitalboard.ch` | Production zone for the platform itself (`auth`, `cloud`, `office`, `bao`, …). | on | none (static zone file) | +| `_acme.digitalboard.ch` | Parent zone for ACME challenge labels. | on | yes, per-tenant TSIG ACLs (`demo-gymb`, `demo-phbe`, `demo-mbaz`) | +| `digitalboard._acme.digitalboard.ch` | **Delegated** child zone for `digitalboard.ch` ACME updates only. | off | yes, TSIG `acme_update_key_digitalboard` | +| `souveredu.ch` | Demo-tenant zone (`gymb`, `phbe`, `mbaz` sub-labels). | on | none (static zone file) | +| `demo-schulen.ch` | Reserve / unused so far. | on | none | + +> **Two different ACME models live here.** This is the most common +> source of confusion when copying a tenant: +> +> - `digitalboard.ch` uses a **NS-delegated child zone** +> (`digitalboard._acme.digitalboard.ch.` has its own `NS` record in +> `_acme.digitalboard.ch`). The TSIG key writes into that delegated +> zone. 
+> - The demo tenants (`demo-gymb`, `demo-phbe`, `demo-mbaz`) **share +> the parent zone** `_acme.digitalboard.ch` and are isolated only +> by **Knot ACL `update-owner-name`** on the per-tenant sub-tree +> (`demo-gymb._acme.digitalboard.ch.` and below). There is no NS +> delegation for them. +> +> Both work for the ACME flow; the demo model is cheaper to manage but +> means tenant isolation depends on Knot ACLs, not zone boundaries. + +## Naming pattern for `demo-gymb` (template for new tenants) + +```text +Public, browser-facing: + cloud.gymb.souveredu.ch CNAME β†’ rvp.gymb.souveredu.ch (193.43.183.131) + auth.gymb.souveredu.ch CNAME β†’ rvp.gymb.souveredu.ch + office.gymb.souveredu.ch CNAME β†’ rvp.gymb.souveredu.ch + s3.gymb.souveredu.ch CNAME β†’ rvp.gymb.souveredu.ch + ... + +Internal, server-to-server (split horizon): + cloud.int.gymb.souveredu.ch A β†’ 172.16.19.101 (application host) + auth.int.gymb.souveredu.ch A β†’ 172.16.19.101 + office.int.gymb.souveredu.ch A β†’ 172.16.19.101 + s3.int.gymb.souveredu.ch A β†’ 172.16.19.102 (storage host) + ... + +Tenant entry IPs: + rvp.gymb.souveredu.ch A β†’ 193.43.183.131 (DMZ Traefik public) + reverseproxy.int.gymb A β†’ 172.16.9.111 (DMZ Traefik internal) + +ACME challenge labels (writeable via TSIG acme_update_key_demo_gymb): + _acme-challenge.cloud.gymb CNAME β†’ cloud.demo-gymb._acme.digitalboard.ch + _acme-challenge.cloud.int.gymb CNAME β†’ cloud.int.demo-gymb._acme.digitalboard.ch + ... +``` + +The `.int.` family is what makes Nextcloud β†’ Garage, Nextcloud β†’ +Authentik (OIDC), Nextcloud β†’ Collabora (WOPI) etc. **bypass the DMZ +Traefik**: the backend host's local Traefik presents the right cert +directly, so traffic stays on the backend subnet. Without this, +server-to-server calls would either ride out through the DMZ and back +in, or hit a hostname mismatch on the cert. 
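
The naming pattern above is regular enough to generate rather than type by hand. The following is a minimal Python sketch of that derivation. The helper `tenant_names` and the `TenantNames` container are hypothetical and exist in neither `reference-ansible` nor `dns-zones`; the zone names are taken from the `demo-gymb` example and are assumed to be the defaults for other demo tenants.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TenantNames:
    """Names derived for one service of one tenant (illustrative only)."""
    public_fqdn: str                       # browser-facing, CNAME to the DMZ Traefik
    internal_fqdn: str                     # split-horizon name, A record on the backend host
    acme_cname_public: tuple[str, str]     # (owner, target) of the public ACME bridge
    acme_cname_internal: tuple[str, str]   # (owner, target) of the internal ACME bridge


def tenant_names(
    service: str,
    label: str,    # sub-label in the public zone, e.g. "gymb"
    tenant: str,   # ACME sub-tree / TSIG scope, e.g. "demo-gymb"
    public_zone: str = "souveredu.ch",          # assumption: the demo-tenant zone
    acme_zone: str = "_acme.digitalboard.ch",   # assumption: the shared ACME parent zone
) -> TenantNames:
    public = f"{service}.{label}.{public_zone}"
    internal = f"{service}.int.{label}.{public_zone}"
    return TenantNames(
        public_fqdn=public,
        internal_fqdn=internal,
        # _acme-challenge.<fqdn> points into the tenant's writeable sub-tree.
        acme_cname_public=(
            f"_acme-challenge.{public}",
            f"{service}.{tenant}.{acme_zone}",
        ),
        acme_cname_internal=(
            f"_acme-challenge.{internal}",
            f"{service}.int.{tenant}.{acme_zone}",
        ),
    )


if __name__ == "__main__":
    names = tenant_names("cloud", "gymb", "demo-gymb")
    print(names.public_fqdn)
    print(names.internal_fqdn)
    print(" CNAME ".join(names.acme_cname_public))
    print(" CNAME ".join(names.acme_cname_internal))
```

Every CNAME target ends in `<tenant>._acme.digitalboard.ch`, which is the scope the tenant's TSIG key is allowed to write to. A helper like this could serve as a consistency check when copying a tenant, but the zone files remain the source of truth.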
+ +## TSIG / ACL model + +```mermaid +flowchart LR + classDef tenant fill:#dcfce7,stroke:#166534,color:#000 + classDef zone fill:#dbeafe,stroke:#1e40af,color:#000 + classDef acl fill:#fef3c7,stroke:#92400e,color:#000 + + subgraph KNOT["ns1.digitalboard.ch (Knot DNS)"] + Z1["_acme.digitalboard.ch
(parent zone)"]:::zone + Z2["digitalboard._acme.digitalboard.ch
(NS-delegated child)"]:::zone + A1["ACL acme_updates_digitalboard
scope: digitalboard._acme.digitalboard.ch."]:::acl + A2["ACL acme_updates_demo_gymb
scope: demo-gymb._acme.digitalboard.ch."]:::acl + A3["ACL acme_updates_demo_phbe
scope: demo-phbe._acme.digitalboard.ch."]:::acl + A4["ACL acme_updates_demo_mbaz
scope: demo-mbaz._acme.digitalboard.ch."]:::acl + end + + DB["digitalboard.ch Traefik
TSIG: acme_update_key_digitalboard"]:::tenant + GY["demo-gymb Traefik
TSIG: acme_update_key_demo_gymb"]:::tenant + PH["demo-phbe Traefik
TSIG: acme_update_key_demo_phbe"]:::tenant + MB["demo-mbaz Traefik
TSIG: acme_update_key_demo_mbaz"]:::tenant + + DB -- nsupdate TXT --> A1 + GY -- nsupdate TXT --> A2 + PH -- nsupdate TXT --> A3 + MB -- nsupdate TXT --> A4 + A1 -- writes into --> Z2 + A2 -- writes into --> Z1 + A3 -- writes into --> Z1 + A4 -- writes into --> Z1 +``` + +Each ACL is restricted to **`update-type: TXT`** and +**`update-owner-match: sub-or-equal`** under the tenant prefix, so a +leaked tenant key cannot write outside its own ACME sub-tree and cannot +modify non-TXT records (no A/CNAME/NS hijack). + +## Traefik variables that bind to this layout + +From `inventories/demo-gymburgdorf/group_vars/traefik_servers/traefik.yml`: + +| Traefik variable | Value for `demo-gymb` | Bound to | +|---|---|---| +| `traefik_acme_dns_provider` | `rfc2136` | Knot dynamic-update endpoint | +| `traefik_acme_dns_zone` | `demo-gymb._acme.digitalboard.ch` | Per-tenant write scope on `ns1` | +| `traefik_acme_tsig_key_name` | `acme_update_key_demo_gymb` | Matches `key:` entry in [`knot.conf`](https://git.digitalboard.ch/Digitalboard/dns-zones/src/branch/main/knot/knot.conf) | +| `traefik_acme_tsig_secret` | Bao lookup | See [security.md](security.md) | + +A tenant whose ACME zone does **not** match the Knot ACL +`update-owner-name` will get `REFUSED` on `nsupdate` and ACME issuance +will silently retry until the renewal window expires. diff --git a/architecture/operations.md b/architecture/operations.md new file mode 100644 index 0000000..58ce8b9 --- /dev/null +++ b/architecture/operations.md @@ -0,0 +1,99 @@ + +# Operations β€” new tenants and known gaps + +← Back to [Architecture index](README.md) + +## 10. Walkthrough: creating a new demo tenant + +Recommended template: **`demo-gymburgdorf`** (not `vagrant`, since its +group topology is incompatible). + +1. **Copy the inventory:** + + ```bash + cp -r inventories/demo-gymburgdorf inventories/demo- + ``` + +2. **Adjust `hosts.yml`:** IPs and hostnames per host. + +3. 
**`group_vars/all/vault.yml`** β€” point `vault_mount` at the new + tenant mount (`demo-`). + +4. **`group_vars/traefik_servers/traefik.yml`** β€” bend + `traefik_acme_dns_zone` and the `traefik_acme_tsig_*` lookup paths + to the new zone / new Bao path. + +5. **`host_vars/application/*.yml`** and + **`host_vars/storage/*.yml`** β€” walk through them: FQDNs to the new + domain pattern (e.g. `*..souveredu.ch`), Bao lookup paths + to `demo-/data/…`. + +6. **Prepare OpenBao** (out-of-band, not via Ansible): + - Create a new KV-v2 mount `demo-`. + - Write secrets: `acme-tsig`, `authentik`, `nextcloud`, `garage`, … + (see [security.md](security.md) for the mandatory-override list). + - Policy for the deploy token: read on `demo-/data/*`. + +7. **DNS** (in the [`dns-zones`](https://git.digitalboard.ch/Digitalboard/dns-zones) repo, see + [dns.md](dns.md)): + - Add `key:` and `acl:` entries for the new tenant in + [`knot/knot.conf`](https://git.digitalboard.ch/Digitalboard/dns-zones/src/branch/main/knot/knot.conf), pattern + `acme_update_key_demo_` / + `acme_updates_demo_` scoped to + `demo-._acme.digitalboard.ch.`. + - Append the new ACL to the `_acme.digitalboard.ch` zone's `acl:` + list β€” the tenants share the parent zone, no NS delegation. + - In `zones/souveredu.ch.zone` (or the tenant's public zone) add + the public/internal A records (`rvp.`, + `reverseproxy.int.`, `application.int.`, + `storage.int.`, …), the service CNAMEs to + `rvp.`, and the `_acme-challenge.*` CNAMEs into + `demo-._acme.digitalboard.ch`. Bump the SOA serial. + - `make deploy_ns1` to push. + +8. **Makefile** β€” add a new target modelled on + `deploy_site_demo_gymburgdorf` and wire it into + `deploy_site_demo`. + +9. **Smoke test:** + `ansible all -i inventories/demo-/hosts.yml -m ping`. + +10. **Deploy:** Bao login + `make deploy_site_demo_`. + +## 11. 
Known gaps and trade-offs + +- **Optional services without group bindings in `demo-gymburgdorf`:** + `opencloud`, `send`, `opnform`, `homarr`, and `bookstack` are + declared as plays in + [playbooks/site.yml](https://git.digitalboard.ch/Digitalboard/reference-ansible/src/branch/main/playbooks/site.yml) but have no + `_servers` group in the inventory β€” those plays run as + no-ops. If needed, add the group + `host_vars/application/.yml` + as described in [topology.md](topology.md). Mind spelling: + `opnform_servers` (not `openform`/`openforms`). +- **`turn` host:** defined in the DMZ, but no STUN/TURN role in + [playbooks/site.yml](https://git.digitalboard.ch/Digitalboard/reference-ansible/src/branch/main/playbooks/site.yml). Currently provisioned only + via `base` + `traefik`. +- **Idempotency:** roles are Docker-Compose-based; re-runs may trigger + container restarts when compose inputs change. There is no dedicated + rollback mechanism β€” on failure, roll back manually to the previous + state. +- **TLS renewal:** handled internally by Traefik via ACME. There is no + external renewal cron in the repo. +- **CI / testing:** not present in the repo. Smoke test is + `make ping_demo`. +- **Logs:** Traefik runs with `traefik_log_level: DEBUG` in + `demo-gymburgdorf` and `vagrant` (role default is `INFO`) β€” reduce + to `INFO` or `WARN` before adapting for production. +- **TSIG secrets in `knot.conf`:** the `dns-zones` repo currently + stores all four ACME TSIG keys in plaintext in + [`knot/knot.conf`](https://git.digitalboard.ch/Digitalboard/dns-zones/src/branch/main/knot/knot.conf). The Ansible + side reads them from Bao, but the Knot side does not β€” anyone with + read on the `dns-zones` repo can write TXT records under the + matching tenant's ACME sub-tree. For prod, source the Knot keys + from a templated config + secret store, or restrict repo access. 
+- **Demo tenants share `_acme.digitalboard.ch`:** isolation is by + Knot ACL `update-owner-name`, not by zone delegation. A mis-edit + of the ACL list could break ACL-based isolation without breaking + DNS resolution β€” failure is silent. The production zone + (`digitalboard.ch`) uses a properly delegated child zone and is + not affected. diff --git a/architecture/security.md b/architecture/security.md new file mode 100644 index 0000000..64dc6ec --- /dev/null +++ b/architecture/security.md @@ -0,0 +1,71 @@ + +# Security and demo-only defaults + +← Back to [Architecture index](README.md) + +> This repo is explicitly designed for **demo setups**. All default +> values in the roles are insecure and are overridden in `demo-*` +> inventories via Bao lookups or host_vars. For production deployments +> the hardening block further down also applies. + +## Secret pattern (Bao lookup) + +```yaml +# group_vars/.../.yml or host_vars/.../.yml +authentik_secret_key: "{{ lookup('community.hashi_vault.hashi_vault', + vault_mount + '/data/authentik:secret_key', + url=vault_addr) }}" +``` + +- `vault_mount` and `vault_addr` come from + [group_vars/all/vault.yml](https://git.digitalboard.ch/Digitalboard/reference-ansible/src/branch/main/inventories/demo-gymburgdorf/group_vars/all/vault.yml). +- KV-v2 paths require an explicit `/data/` segment β€” Ansible does not + resolve this automatically. +- `vault_mount` is unique per inventory (`demo-gymburgdorf`, + `demo-phbern`, …) β†’ tenant isolation in Bao via mount + policy. + +## Demo-only defaults β€” override required + +These defaults in `digitalboard.core` are insecure. 
In any +**production-grade** deployment they must be overridden via Bao lookup +or host_var: + +| Variable | Default | Where to override | +|---|---|---| +| `keycloak_admin_password` | `changeme` | host_vars `keycloak_servers` | +| `keycloak_postgres_password` | `changeme` | same | +| `authentik_secret_key` | `changeme-generate-a-random-string` | `host_vars/application/authentik.yml` | +| `authentik_postgres_password` | `changeme` | same | +| `nextcloud_admin_password` | `admin` | `host_vars/application/nextcloud.yml` | +| `nextcloud_postgres_password` | `changeme` | same | +| `nextcloud_s3_key` / `nextcloud_s3_secret` | `changeme` / `changeme` | same | +| `garage_webui_password` | `admin` | `host_vars/storage/garage.yml` | +| `garage_rpc_secret` | `0123…cdef` (64-hex constant) | same | +| `garage_admin_token` | identical to `rpc_secret` | same | +| `garage_metrics_token` | identical to `rpc_secret` | same | + +> **Convention:** every value listed above **must** have a Bao lookup +> in `demo-*/host_vars/.../...yml` before the inventory is considered +> deploy-ready. + +## Threat boundaries (current demo state) + +| Boundary | Status | Notes | +|---|---|---| +| DMZ ↔ Backend (172.16.9 ↔ 172.16.19) | **Plaintext HTTP** | Auth bearers, OIDC codes, session cookies travel unencrypted. Fine for demo; for prod use mTLS or a WireGuard overlay. | +| Host firewall | **missing** | The `base` role does not install UFW/nftables. Segmentation relies on the hypervisor/VLAN. | +| SSH | `ansible_user: root` | No bastion, no jump host. Key distribution out-of-band. | +| Authentik SPOF | **accepted** | IdP and SP services share the same host (`application`). An Authentik outage means a login outage including the LDAP outpost. No break-glass path. 
|
+| ACME TSIG key | Bao lookup (in Ansible), **plaintext in [`knot.conf`](https://git.digitalboard.ch/Digitalboard/dns-zones/src/branch/main/knot/knot.conf)** on the `ns1` side | One TSIG key per demo tenant, scoped via Knot ACL `update-owner-name` to the tenant's ACME sub-tree. Rotation is manual and must be done on both sides simultaneously (Bao + `knot.conf` + `knotc zone-reload`). |
+| Backup / DR | **out of scope** | Garage `replication_factor: 1` (default), no Postgres backup job, no Bao snapshot cron. |
+
+## To adapt for production, add
+
+- Host firewall (extend the `base` role or add a dedicated `firewall`
+  role).
+- mTLS or WireGuard between DMZ and backend.
+- Authentik on a separate host with a recovery admin token.
+- Bao policies per inventory mount (read-only for the deploy token,
+  write-only for the bootstrap job).
+- Backup cron for Postgres + Garage + Bao.
+- SSH bastion + key rotation.
diff --git a/architecture/setup.md b/architecture/setup.md
new file mode 100644
index 0000000..0c75d18
--- /dev/null
+++ b/architecture/setup.md
@@ -0,0 +1,68 @@
+
+# Setup and repo layout
+
+← Back to [Architecture index](README.md)
+
+## 1. Repo layout and role provenance
+
+```text
+reference-ansible/
+├── Makefile              # Deploy targets, OIDC login, OBJC fork workaround
+├── ansible.cfg           # collections_path, remote_user=root, hashi_vault auth_method=token
+├── requirements.yml      # community.hashi_vault + digitalboard.core (Git)
+├── playbooks/site.yml    # Play sequence (16 plays, see deploy.md)
+├── collections/          # ← installed by `make install`, gitignored
+│   └── ansible_collections/
+│       └── digitalboard/core/
+│           └── roles/    # 🔑 Roles live HERE, NOT in the repo root
+└── inventories/
+    ├── demo-gymburgdorf/ # Inventory used throughout this document
+    ├── demo-mbazürich/
+    ├── demo-phbern/
+    └── vagrant/          # Local test inventory with its own topology
+```
+
+> **Important:** There is **no** `roles/` directory at the repo root.
+> All roles come from the `digitalboard.core` collection (see
+> [requirements.yml](https://git.digitalboard.ch/Digitalboard/reference-ansible/src/branch/main/requirements.yml)), installed via `make install`
+> into `./collections/`. Plays reference them by FQCN
+> `digitalboard.core.`.
+
+## 2. Setup and prerequisites
+
+**Tools on the control node:**
+
+- `ansible` (Core ≥ 2.15)
+- `bao` CLI (OpenBao), e.g.
`sudo pacman -S openbao python-hvac` (Arch) or Homebrew +- `python-hvac` (for `community.hashi_vault` lookups) +- On macOS: `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES` (set in the + [Makefile](https://git.digitalboard.ch/Digitalboard/reference-ansible/src/branch/main/Makefile); without it Ansible forks crash on Bao lookups) + +**Initial setup:** + +```bash +git clone +cd reference-ansible +make install # Galaxy + digitalboard.core into ./collections/ +``` + +**Before every deploy:** Bao login in the **same shell** that will then +run `ansible-playbook`: + +```bash +export BAO_ADDR=https://bao.digitalboard.ch +bao login -method=oidc -path=Digitalboard +export VAULT_TOKEN=$(bao print token) +``` + +> ⚠️ `make bao` on its own is **not enough**: every `make` target spawns +> a new shell, and the `VAULT_TOKEN` exported in there only lives for +> the duration of `make bao` itself. Either run the three commands +> above manually, or invoke `make bao deploy_site_demo_gymburgdorf` as +> **one** call β€” otherwise the deploy has no token. + +**Smoke test:** + +```bash +make ping_demo # pings all three demo inventories +``` diff --git a/architecture/topology.md b/architecture/topology.md new file mode 100644 index 0000000..c884c4d --- /dev/null +++ b/architecture/topology.md @@ -0,0 +1,110 @@ + +# Topology β€” inventory and services + +← Back to [Architecture index](README.md) + +## 4. Inventory topology (`demo-gymburgdorf`) + +```mermaid +flowchart LR + classDef dmz fill:#fee2e2,stroke:#991b1b,color:#000 + classDef app fill:#dcfce7,stroke:#166534,color:#000 + classDef stor fill:#dbeafe,stroke:#1e40af,color:#000 + classDef turn fill:#fef9c3,stroke:#854d0e,color:#000 + + subgraph ALL["group: all_servers"] + direction LR + subgraph DMZ["DMZ 172.16.9.0/24"] + RP["reverseproxy
172.16.9.111
traefik_mode: dmz"]:::dmz + TURN["turn
172.16.9.112
(no role in site.yml yet)"]:::turn + end + subgraph BE["Backend 172.16.19.0/24
group: backend_servers"] + APP["application
172.16.19.101
traefik_mode: backend
+ authentik, authentik_outpost_ldap,
nextcloud, collabora, drawio"]:::app + ST["storage
172.16.19.102
traefik_mode: backend
+ garage (S3)"]:::stor + end + end + + RP -.HTTPS in, HTTP out.-> APP + RP -.HTTPS in, HTTP out.-> ST +``` + +**Group memberships (from [hosts.yml](https://git.digitalboard.ch/Digitalboard/reference-ansible/src/branch/main/inventories/demo-gymburgdorf/hosts.yml)):** + +| Group | Members | Purpose | +|---|---|---| +| `all_servers` | `reverseproxy`, `application`, `storage`, `turn` | Base role for all hosts | +| `traefik_servers` | `children: all_servers` (= all 4 hosts) | Traefik everywhere; DMZ/backend via `traefik_mode` | +| `backend_servers` | `application`, `storage` | Sets `traefik_mode: backend` via group var | +| `garage_servers` | `storage` | Single-host wrapper for the Garage role | +| `nextcloud_servers`, `collabora_servers`, `drawio_servers`, `authentik_servers`, `authentik_outpost_ldap_servers` | `application` only | Single-host wrappers | + +> **Difference vs. the `vagrant` inventory:** `vagrant` structures +> Traefik differently, via the children groups `traefik_servers_dmz` +> and `traefik_servers_backend` instead of `backend_servers` + +> `host_vars` override. The two topologies are **structurally +> incompatible**; a 1:1 mapping is not possible. See +> [operations.md](operations.md) for the recommended template. + +## 5. Service layout and variable placement + +```mermaid +flowchart TB + classDef rp fill:#fee2e2,stroke:#991b1b,color:#000 + classDef ap fill:#dcfce7,stroke:#166534,color:#000 + classDef st fill:#dbeafe,stroke:#1e40af,color:#000 + classDef ext fill:#e9d5ff,stroke:#6b21a8,color:#000 + + Internet((Internet)) + DNS["DNS ns1.digitalboard.ch
RFC2136 TSIG
Zone: demo-gymb._acme.digitalboard.ch
CNAME bridge: _acme-challenge.*.gymb.souveredu.ch"]:::ext + BAO["OpenBao
bao.digitalboard.ch
mount: demo-gymburgdorf"]:::ext + + subgraph RP["reverseproxy - traefik dmz"] + TRDMZ["traefik (file provider)
📁 group_vars/traefik_servers/traefik.yml
📁 host_vars/reverseproxy/traefik.yml
→ traefik_mode: dmz
→ traefik_dmz_exposed_services"]:::rp + end + + subgraph APP["application - traefik backend"] + TRA["traefik (docker provider)
📁 group_vars/backend_servers/traefik.yml"]:::ap + AK["authentik (OIDC + LDAP outpost backend)
📁 host_vars/application/authentik.yml"]:::ap + AKO["authentik_outpost_ldap
📁 host_vars/application/authentik_outpost_ldap.yml"]:::ap + NC["nextcloud
📁 host_vars/application/nextcloud.yml"]:::ap + COL["collabora
📁 host_vars/application/collabora.yml"]:::ap + DRW["drawio
📁 host_vars/application/drawio.yml"]:::ap + end + + subgraph ST["storage - traefik backend"] + TRS["traefik (docker provider)"]:::st + GAR["garage (S3)
📁 host_vars/storage/garage.yml"]:::st + end + + Internet -->|HTTPS :443| TRDMZ + TRDMZ -->|HTTP backend| TRA + TRDMZ -->|HTTP backend| TRS + TRA --> AK & AKO & NC & COL & DRW + TRS --> GAR + + NC -. S3 .-> GAR + NC -. OIDC .-> AK + NC -. WOPI .-> COL + NC -. LDAP .-> AKO + AKO -. RPC + token .-> AK + + TRDMZ -. ACME DNS-01 TSIG .-> DNS + TRDMZ -. hashi_vault acme-tsig .-> BAO + AK -. hashi_vault secrets .-> BAO + NC -. hashi_vault secrets .-> BAO + GAR -. hashi_vault secrets .-> BAO +``` + +> **Note:** `opencloud`, `send`, `opnform`, `homarr`, and `bookstack` +> are defined as plays in [playbooks/site.yml](https://git.digitalboard.ch/Digitalboard/reference-ansible/src/branch/main/playbooks/site.yml) +> but currently have no matching group in +> [hosts.yml](https://git.digitalboard.ch/Digitalboard/reference-ansible/src/branch/main/inventories/demo-gymburgdorf/hosts.yml) for +> `demo-gymburgdorf`; those plays therefore run as no-ops. If a +> tenant needs these services, add the corresponding +> `_servers` group in `hosts.yml` and a +> `host_vars/application/.yml` (mind the spelling: the +> forms role is `opnform`, the LDAP role is `389ds`). +> +> The `turn` host is in `all_servers` (and therefore in +> `traefik_servers`) but has **no** service group of its own; +> currently only the `base` and `traefik` roles run on it. diff --git a/architecture/variables.md b/architecture/variables.md new file mode 100644 index 0000000..66bb648 --- /dev/null +++ b/architecture/variables.md @@ -0,0 +1,58 @@ + +# Variables: hierarchy and cheatsheet + +← Back to [Architecture index](README.md) + +## 3. Variable hierarchy + +Ansible merges variables from multiple sources. 
Simplified model for +this repo (see the Ansible docs for the full precedence rules): + +```mermaid +flowchart LR + classDef role fill:#fef3c7,stroke:#92400e,color:#000 + classDef group fill:#dbeafe,stroke:#1e40af,color:#000 + classDef host fill:#dcfce7,stroke:#166534,color:#000 + classDef vault fill:#fee2e2,stroke:#991b1b,color:#000 + + R["role defaults
(lowest precedence)
collections/.../roles/<r>/defaults/main.yml"]:::role + GA["group_vars/all/
vault.yml, docker.yml"]:::group + GG["group_vars/<group>/
traefik_servers/, backend_servers/
(parallel groups, merged via
ansible_group_priority)"]:::group + HV["host_vars/<host>/
(highest of the three inventory sources)"]:::host + BAO["OpenBao
lookup at runtime"]:::vault + + R --> |"<overridden by>"| GA + GA --> |"<overridden by>"| GG + GG --> |"<overridden by>"| HV + HV -.community.hashi_vault.-> BAO + GG -.community.hashi_vault.-> BAO +``` + +**Key properties:** + +- Multiple `group_vars//` are **parallel**, not hierarchically + nested. `traefik_servers` and `backend_servers` are merged in order of + `ansible_group_priority` (default 1); when priorities are equal, the + alphabetically later group name wins on conflict. +- `host_vars//` beats any group. +- `host_vars/reverseproxy/traefik.yml` sets `traefik_mode: dmz`. As a + host variable it would win over `group_vars/backend_servers/` in any + case, but `reverseproxy` is not a member of `backend_servers`, so the + `backend` default never applies to that host in the first place. + +**Bao lookups** are not a precedence layer but **values** inside any +variable source. See [security.md](security.md) for the pattern. + +## 9. Variable cheatsheet + +| Variable | Where in `demo-gymburgdorf/` | Why | +|---|---|---| +| `vault_addr`, `vault_mount` | `group_vars/all/vault.yml` | Bao endpoint applies site-wide | +| `docker_registry_mirrors` | `group_vars/all/docker.yml` | Pulls from mirror on all hosts | +| `traefik_acme_*`, `traefik_use_ssl`, `traefik_cert_mode` | `group_vars/traefik_servers/traefik.yml` | Applies to every Traefik instance (dmz + backend) | +| `traefik_mode: backend` | `group_vars/backend_servers/traefik.yml` | Default for app + storage | +| `traefik_mode: dmz` | `host_vars/reverseproxy/traefik.yml` | Host-specific override | +| `traefik_dmz_exposed_services` | `host_vars/reverseproxy/` | DMZ backend list; only meaningful here | +| `nextcloud_*`, `authentik_*`, `collabora_*`, `drawio_*` | `host_vars/application/.yml` | Service runs on `application` | +| `garage_*` | `host_vars/storage/garage.yml` | Service runs on `storage` | +| Secrets (passwords, tokens, keys) | inline variable using `lookup('community.hashi_vault.hashi_vault', …)` | Single source of truth via Bao |
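
The placement rules in the cheatsheet follow from the merge order described under "Variable hierarchy". As a rough illustration only, the Python sketch below models that order for two hosts. It is a simplification and not how Ansible is implemented: it ignores group nesting depth and every variable source other than the four shown, the `resolve` function and the data tables are hypothetical, and the value `role-default` is an invented placeholder for whatever the role actually ships.

```python
from typing import Any, Dict, List, Tuple

Vars = Dict[str, Any]

# Hypothetical values; only the variable and group names come from the cheatsheet.
ROLE_DEFAULTS: Vars = {"traefik_mode": "role-default", "traefik_use_ssl": False}
GROUP_ALL: Vars = {"vault_mount": "demo-gymburgdorf"}

# group name -> (ansible_group_priority, group variables)
GROUPS: Dict[str, Tuple[int, Vars]] = {
    "traefik_servers": (1, {"traefik_use_ssl": True}),
    "backend_servers": (1, {"traefik_mode": "backend"}),
}

# host -> (group memberships, host_vars)
HOSTS: Dict[str, Tuple[List[str], Vars]] = {
    "reverseproxy": (["traefik_servers"], {"traefik_mode": "dmz"}),
    "application": (["traefik_servers", "backend_servers"], {}),
}


def resolve(host: str) -> Vars:
    """Merge variable sources for one host, lowest precedence first."""
    member_of, host_vars = HOSTS[host]
    merged: Vars = dict(ROLE_DEFAULTS)      # role defaults: lowest precedence
    merged.update(GROUP_ALL)                # group_vars/all/
    # Parallel groups: lower priority first, ties broken by name,
    # so the alphabetically later group overrides the earlier one.
    for name in sorted(member_of, key=lambda g: (GROUPS[g][0], g)):
        merged.update(GROUPS[name][1])
    merged.update(host_vars)                # host_vars: highest of the three inventory sources
    return merged


if __name__ == "__main__":
    for host in HOSTS:
        print(host, resolve(host)["traefik_mode"])
```

Run as a script, this prints `reverseproxy dmz` and `application backend`. To inspect what the real inventory resolves to, `ansible-inventory -i inventories/demo-gymburgdorf --host reverseproxy` should show the merged inventory variables for that host (role defaults are not part of that output).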