docs: add architecture section and overhaul top-level README

- Move Simon's architecture documentation into architecture/
  (setup, variables, topology, dns, deploy, security, operations
  plus index and glossary). All cross-repo references point at
  https://git.digitalboard.ch/Digitalboard/{reference-ansible,dns-zones}
  via absolute URLs so the docs remain navigable from any context.
- Rewrite README.md as a documentation hub: introduction, platform
  Mermaid overview, comparison of the three repos
  (docs / digitalboard.core / reference-ansible) and a full table of
  contents covering architecture, contributing, infrastructure,
  keycloak, ms-entra and troubleshooting.

Addresses the open items from the WKS PoC review (2026-05-26):
welcome text for the docs README, an overview graphic, links to the
two other repos, and moving the architecture documentation.
Simon Bärlocher 2026-05-28 14:25:27 +02:00
parent 8c2ea8cc72
commit 345cf4b319
GPG key ID: 63DE20495932047A
9 changed files with 742 additions and 27 deletions

architecture/dns.md (new file, 123 lines)

@@ -0,0 +1,123 @@
<!-- markdownlint-disable MD013 MD060 MD051 -->
# DNS topology and ACME zone layout
← Back to [Architecture index](README.md)
Authoritative DNS for everything described in this document runs on
**`ns1.digitalboard.ch`** (public `193.43.183.169`, DMZ `172.16.9.169`)
using **Knot DNS**. The zone files and Knot config live in the
[`dns-zones`](https://git.digitalboard.ch/Digitalboard/dns-zones) repo; this section explains how the
public service FQDNs, the internal "split-horizon" FQDNs, and the ACME
challenge sub-trees fit together.
## Authoritative zones on `ns1`
| Zone | Purpose | DNSSEC | Dynamic updates |
|---|---|---|---|
| `digitalboard.ch` | Production zone for the platform itself (`auth`, `cloud`, `office`, `bao`, …). | on | none (static zone file) |
| `_acme.digitalboard.ch` | Parent zone for ACME challenge labels. | on | yes, per-tenant TSIG ACLs (`demo-gymb`, `demo-phbe`, `demo-mbaz`) |
| `digitalboard._acme.digitalboard.ch` | **Delegated** child zone for `digitalboard.ch` ACME updates only. | off | yes, TSIG `acme_update_key_digitalboard` |
| `souveredu.ch` | Demo-tenant zone (`gymb`, `phbe`, `mbaz` sub-labels). | on | none (static zone file) |
| `demo-schulen.ch` | Reserve / unused so far. | on | none |
> **Two different ACME models live here.** This is the most common
> source of confusion when copying a tenant:
>
> - `digitalboard.ch` uses an **NS-delegated child zone**
> (`digitalboard._acme.digitalboard.ch.` has its own `NS` record in
> `_acme.digitalboard.ch`). The TSIG key writes into that delegated
> zone.
> - The demo tenants (`demo-gymb`, `demo-phbe`, `demo-mbaz`) **share
> the parent zone** `_acme.digitalboard.ch` and are isolated only
> by **Knot ACL `update-owner-name`** on the per-tenant sub-tree
> (`demo-gymb._acme.digitalboard.ch.` and below). There is no NS
> delegation for them.
>
> Both work for the ACME flow; the demo model is cheaper to manage but
> means tenant isolation depends on Knot ACLs, not zone boundaries.
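
The difference between the two models can be made concrete with a small Python sketch. The `AcmeTarget` type, the `acme_target` helper and the `DELEGATED_TENANTS` set are hypothetical names used for illustration only; they do not exist in either repo. The zone names are taken from the zone table above.

```python
from dataclasses import dataclass

ACME_PARENT = "_acme.digitalboard.ch"

# Assumption: only digitalboard has an NS-delegated child zone (per the zone table).
DELEGATED_TENANTS = {"digitalboard"}


@dataclass(frozen=True)
class AcmeTarget:
    zone: str         # zone on ns1 that actually stores the TXT record
    owner_scope: str  # sub-tree the tenant's TSIG key may write to
    isolation: str    # what keeps this tenant apart from the others


def acme_target(tenant: str) -> AcmeTarget:
    """Describe where a tenant's ACME TXT records land (illustrative only)."""
    scope = f"{tenant}.{ACME_PARENT}"
    if tenant in DELEGATED_TENANTS:
        # Delegated model: the tenant sub-tree is a zone of its own.
        return AcmeTarget(zone=scope, owner_scope=scope, isolation="zone boundary")
    # Shared model: records live in the parent zone, fenced off by a Knot ACL.
    return AcmeTarget(zone=ACME_PARENT, owner_scope=scope, isolation="Knot ACL")


if __name__ == "__main__":
    for name in ("digitalboard", "demo-gymb"):
        print(name, acme_target(name))
```

In both cases the write scope has the same shape (`<tenant>._acme.digitalboard.ch`); only the zone that holds the records and the isolation mechanism differ.
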
## Naming pattern for `demo-gymb` (template for new tenants)
```text
Public, browser-facing:
  cloud.gymb.souveredu.ch         CNAME → rvp.gymb.souveredu.ch   (193.43.183.131)
  auth.gymb.souveredu.ch          CNAME → rvp.gymb.souveredu.ch
  office.gymb.souveredu.ch        CNAME → rvp.gymb.souveredu.ch
  s3.gymb.souveredu.ch            CNAME → rvp.gymb.souveredu.ch
  ...

Internal, server-to-server (split horizon):
  cloud.int.gymb.souveredu.ch     A → 172.16.19.101   (application host)
  auth.int.gymb.souveredu.ch      A → 172.16.19.101
  office.int.gymb.souveredu.ch    A → 172.16.19.101
  s3.int.gymb.souveredu.ch        A → 172.16.19.102   (storage host)
  ...

Tenant entry IPs:
  rvp.gymb.souveredu.ch           A → 193.43.183.131  (DMZ Traefik public)
  reverseproxy.int.gymb           A → 172.16.9.111    (DMZ Traefik internal)

ACME challenge labels (writeable via TSIG acme_update_key_demo_gymb):
  _acme-challenge.cloud.gymb      CNAME → cloud.demo-gymb._acme.digitalboard.ch
  _acme-challenge.cloud.int.gymb  CNAME → cloud.int.demo-gymb._acme.digitalboard.ch
  ...
```
The `.int.` names let server-to-server calls (Nextcloud → Garage,
Nextcloud → Authentik for OIDC, Nextcloud → Collabora for WOPI, and so
on) **bypass the DMZ Traefik**: the backend host's local Traefik
presents a certificate that matches the `.int.` hostname, so traffic
stays on the backend subnet. Without these names, server-to-server
calls would either leave through the DMZ and come back in, or fail
with a hostname mismatch on the certificate.
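
When copying the pattern for a new tenant, the names can be derived mechanically. The following Python sketch reproduces the `demo-gymb` pattern shown above; the function `tenant_names` and its parameters `tenant_label` (for example `gymb`) and `tenant_id` (for example `demo-gymb`) are hypothetical names for illustration, and the sketch assumes a new tenant follows exactly the same scheme.

```python
ACME_PARENT = "_acme.digitalboard.ch"


def tenant_names(service: str, tenant_label: str, tenant_id: str,
                 base: str = "souveredu.ch") -> dict[str, str]:
    """Derive public, internal and ACME names for one service of one tenant."""
    suffix = f"{tenant_label}.{base}"
    names = {
        "public": f"{service}.{suffix}",
        "internal": f"{service}.int.{suffix}",
    }
    for kind in ("public", "internal"):
        fqdn = names[kind]
        # Everything left of the tenant suffix, e.g. "cloud" or "cloud.int".
        prefix = fqdn[: -len(suffix) - 1]
        names[f"{kind}_challenge"] = f"_acme-challenge.{fqdn}"
        names[f"{kind}_challenge_target"] = f"{prefix}.{tenant_id}.{ACME_PARENT}"
    return names


if __name__ == "__main__":
    for key, value in tenant_names("cloud", "gymb", "demo-gymb").items():
        print(f"{key:28} {value}")
```

Each `*_challenge` name is the CNAME owner in the tenant zone, and the matching `*_challenge_target` is the label under `_acme.digitalboard.ch` that the tenant's TSIG key writes to.
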
## TSIG / ACL model
```mermaid
flowchart LR
classDef tenant fill:#dcfce7,stroke:#166534,color:#000
classDef zone fill:#dbeafe,stroke:#1e40af,color:#000
classDef acl fill:#fef3c7,stroke:#92400e,color:#000
subgraph KNOT["ns1.digitalboard.ch (Knot DNS)"]
Z1["_acme.digitalboard.ch<br/>(parent zone)"]:::zone
Z2["digitalboard._acme.digitalboard.ch<br/>(NS-delegated child)"]:::zone
A1["ACL acme_updates_digitalboard<br/>scope: digitalboard._acme.digitalboard.ch."]:::acl
A2["ACL acme_updates_demo_gymb<br/>scope: demo-gymb._acme.digitalboard.ch."]:::acl
A3["ACL acme_updates_demo_phbe<br/>scope: demo-phbe._acme.digitalboard.ch."]:::acl
A4["ACL acme_updates_demo_mbaz<br/>scope: demo-mbaz._acme.digitalboard.ch."]:::acl
end
DB["digitalboard.ch Traefik<br/>TSIG: acme_update_key_digitalboard"]:::tenant
GY["demo-gymb Traefik<br/>TSIG: acme_update_key_demo_gymb"]:::tenant
PH["demo-phbe Traefik<br/>TSIG: acme_update_key_demo_phbe"]:::tenant
MB["demo-mbaz Traefik<br/>TSIG: acme_update_key_demo_mbaz"]:::tenant
DB -- nsupdate TXT --> A1
GY -- nsupdate TXT --> A2
PH -- nsupdate TXT --> A3
MB -- nsupdate TXT --> A4
A1 -- writes into --> Z2
A2 -- writes into --> Z1
A3 -- writes into --> Z1
A4 -- writes into --> Z1
```
Each ACL is restricted to **`update-type: TXT`** and
**`update-owner-match: sub-or-equal`** under the tenant prefix, so a
leaked tenant key cannot write outside its own ACME sub-tree and cannot
modify non-TXT records (no A/CNAME/NS hijack).
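
The effect of these two restrictions can be approximated in a few lines of Python. This is a simplified model of the rule described above, not Knot's implementation; `update_allowed` is a hypothetical helper, and the comparison is done label by label so that a name such as `xdemo-gymb._acme.digitalboard.ch` does not pass as a sub-name of `demo-gymb._acme.digitalboard.ch`.

```python
def _labels(name: str) -> list[str]:
    """Split a DNS name into lowercase labels, ignoring a trailing dot."""
    return name.rstrip(".").lower().split(".")


def update_allowed(owner: str, rtype: str, acl_owner: str) -> bool:
    """Model an ACL with update-type TXT and update-owner-match sub-or-equal."""
    if rtype.upper() != "TXT":
        return False  # no A/CNAME/NS writes, whatever the owner name is
    want, got = _labels(acl_owner), _labels(owner)
    # sub-or-equal: the owner must end with every label of the ACL owner name.
    return len(got) >= len(want) and got[-len(want):] == want


if __name__ == "__main__":
    scope = "demo-gymb._acme.digitalboard.ch."
    print(update_allowed("cloud.demo-gymb._acme.digitalboard.ch.", "TXT", scope))
    print(update_allowed("cloud.demo-phbe._acme.digitalboard.ch.", "TXT", scope))
    print(update_allowed("cloud.demo-gymb._acme.digitalboard.ch.", "CNAME", scope))
```
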
## Traefik variables that bind to this layout
From `inventories/demo-gymburgdorf/group_vars/traefik_servers/traefik.yml`:
| Traefik variable | Value for `demo-gymb` | Bound to |
|---|---|---|
| `traefik_acme_dns_provider` | `rfc2136` | Knot dynamic-update endpoint |
| `traefik_acme_dns_zone` | `demo-gymb._acme.digitalboard.ch` | Per-tenant write scope on `ns1` |
| `traefik_acme_tsig_key_name` | `acme_update_key_demo_gymb` | Matches `key:` entry in [`knot.conf`](https://git.digitalboard.ch/Digitalboard/dns-zones/src/branch/main/knot/knot.conf) |
| `traefik_acme_tsig_secret` | Bao lookup | See [security.md](security.md) |
A tenant whose ACME zone does **not** match the Knot ACL
`update-owner-name` will get `REFUSED` on `nsupdate`, and ACME issuance
will retry silently until the renewal window expires.
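
Because the failure is silent, it may be worth checking the values before a deploy. The sketch below is a hypothetical pre-flight check, not part of the playbooks. It assumes the naming convention visible in the table and diagram above: the zone is `<tenant>._acme.digitalboard.ch` and the key name is `acme_update_key_` followed by the tenant name with hyphens replaced by underscores. It only compares strings and does not contact `ns1`.

```python
ACME_PARENT = "_acme.digitalboard.ch"
KEY_PREFIX = "acme_update_key_"


def check_acme_vars(tenant: str, variables: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the values look consistent."""
    problems = []
    expected_zone = f"{tenant}.{ACME_PARENT}"
    expected_key = KEY_PREFIX + tenant.replace("-", "_")

    zone = variables.get("traefik_acme_dns_zone", "").rstrip(".")
    if zone != expected_zone:
        problems.append(
            f"traefik_acme_dns_zone is {zone!r}, expected {expected_zone!r}")

    key = variables.get("traefik_acme_tsig_key_name", "")
    if key != expected_key:
        problems.append(
            f"traefik_acme_tsig_key_name is {key!r}, expected {expected_key!r}")

    if variables.get("traefik_acme_dns_provider") != "rfc2136":
        problems.append("traefik_acme_dns_provider should be 'rfc2136'")
    return problems


if __name__ == "__main__":
    copied = {
        "traefik_acme_dns_provider": "rfc2136",
        # Typical copy-paste mistake: zone still points at another tenant.
        "traefik_acme_dns_zone": "demo-gymb._acme.digitalboard.ch",
        "traefik_acme_tsig_key_name": "acme_update_key_demo_phbe",
    }
    for problem in check_acme_vars("demo-phbe", copied):
        print(problem)
```
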