docs/architecture/operations.md
Simon Bärlocher 345cf4b319
docs: add architecture section and overhaul top-level README
- Move Simon's architecture documentation into architecture/
  (setup, variables, topology, dns, deploy, security, operations
  plus index and glossary). All cross-repo references point at
  https://git.digitalboard.ch/Digitalboard/{reference-ansible,dns-zones}
  via absolute URLs so the docs remain navigable from any context.
- Rewrite README.md as a documentation hub: introduction, platform
  Mermaid overview, comparison of the three repos
  (docs / digitalboard.core / reference-ansible) and a full table of
  contents covering architecture, contributing, infrastructure,
  keycloak, ms-entra and troubleshooting.

Addresses the open items from the WKS PoC review (2026-05-26):
docs README welcome text + overview diagram + links to the
two other repos, as well as moving the architecture documentation.
2026-05-28 14:25:27 +02:00

<!-- markdownlint-disable MD013 MD060 MD051 -->
# Operations — new tenants and known gaps
← Back to [Architecture index](README.md)
## 10. Walkthrough: creating a new demo tenant
Recommended template: **`demo-gymburgdorf`** (not `vagrant`, since its
group topology is incompatible).
1. **Copy the inventory:**
   ```bash
   cp -r inventories/demo-gymburgdorf inventories/demo-<customer>
   ```
2. **Adjust `hosts.yml`:** set the IP address and hostname of each host.
3. **`group_vars/all/vault.yml`** — point `vault_mount` at the new
tenant mount (`demo-<customer>`).
4. **`group_vars/traefik_servers/traefik.yml`:** point
`traefik_acme_dns_zone` at the new zone and the `traefik_acme_tsig_*`
lookup paths at the new Bao path.
5. **`host_vars/application/*.yml`** and
**`host_vars/storage/*.yml`:** go through each file and change the FQDNs
to the new domain pattern (e.g. `*.<customer>.souveredu.ch`) and the Bao
lookup paths to `demo-<customer>/data/…`.
6. **Prepare OpenBao** (out-of-band, not via Ansible):
   - Create a new KV-v2 mount `demo-<customer>`.
   - Write the secrets: `acme-tsig`, `authentik`, `nextcloud`, `garage`, …
     (see [security.md](security.md) for the mandatory-override list).
   - Grant the deploy token's policy read access on `demo-<customer>/data/*`.
7. **DNS** (in the [`dns-zones`](https://git.digitalboard.ch/Digitalboard/dns-zones) repo, see
   [dns.md](dns.md)):
   - Add `key:` and `acl:` entries for the new tenant in
     [`knot/knot.conf`](https://git.digitalboard.ch/Digitalboard/dns-zones/src/branch/main/knot/knot.conf), following the naming pattern
     `acme_update_key_demo_<customer>` /
     `acme_updates_demo_<customer>` and scoped to
     `demo-<customer>._acme.digitalboard.ch.`.
   - Append the new ACL to the `acl:` list of the `_acme.digitalboard.ch`
     zone. The tenants share this parent zone; there is no NS delegation.
   - In `zones/souveredu.ch.zone` (or the tenant's public zone), add
     the public/internal A records (`rvp.<customer>`,
     `reverseproxy.int.<customer>`, `application.int.<customer>`,
     `storage.int.<customer>`, …), the service CNAMEs pointing to
     `rvp.<customer>`, and the `_acme-challenge.*` CNAMEs pointing into
     `demo-<customer>._acme.digitalboard.ch`. Bump the SOA serial.
   - Run `make deploy_ns1` to push the changes.
8. **Makefile** — add a new target modelled on
`deploy_site_demo_gymburgdorf` and wire it into
`deploy_site_demo`.
9. **Smoke test:**
`ansible all -i inventories/demo-<customer>/hosts.yml -m ping`.
10. **Deploy:** log in to Bao, then run `make deploy_site_demo_<customer>`.
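
Steps 1 to 5 can be partly mechanised. The following is a minimal sketch, not part of the repo: the function name `new_tenant_inventory` is hypothetical, and it assumes that the tenant-specific values in the template contain the string `gymburgdorf`. It copies the template inventory and then lists every line that still mentions the template tenant, so the remaining manual edits (vault mount, ACME zone, FQDNs, Bao paths) are easy to find.

```shell
# Hypothetical helper, not part of reference-ansible.
# Usage: new_tenant_inventory <customer> [inventories-dir]
new_tenant_inventory() {
  local customer="${1:?usage: new_tenant_inventory <customer> [inventories-dir]}"
  local root="${2:-inventories}"
  local src="$root/demo-gymburgdorf"
  local dst="$root/demo-$customer"

  if [ ! -d "$src" ]; then
    echo "template inventory not found: $src" >&2
    return 1
  fi
  if [ -e "$dst" ]; then
    echo "target already exists, refusing to overwrite: $dst" >&2
    return 1
  fi

  # Step 1: copy the template inventory.
  cp -r "$src" "$dst"

  # Steps 2-5: show the lines that still reference the template tenant.
  # Assumption: tenant-specific values contain the string "gymburgdorf".
  # grep exits non-zero when nothing matches, which is not an error here.
  grep -rn 'gymburgdorf' "$dst" || true
}
```

Run from the repository root, `new_tenant_inventory <customer>` creates `inventories/demo-<customer>` and prints the leftovers as `file:line:text`. Host IPs in `hosts.yml` do not show up in this list and still need to be checked by hand.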
## 11. Known gaps and trade-offs
- **Optional services without group bindings in `demo-gymburgdorf`:**
`opencloud`, `send`, `opnform`, `homarr`, and `bookstack` are
declared as plays in
[playbooks/site.yml](https://git.digitalboard.ch/Digitalboard/reference-ansible/src/branch/main/playbooks/site.yml) but have no
`<service>_servers` group in the inventory — those plays run as
no-ops. If needed, add the group + `host_vars/application/<svc>.yml`
as described in [topology.md](topology.md). Mind the spelling:
`opnform_servers` (not `openform`/`openforms`).
- **`turn` host:** defined in the DMZ, but
[playbooks/site.yml](https://git.digitalboard.ch/Digitalboard/reference-ansible/src/branch/main/playbooks/site.yml) contains no STUN/TURN role. The host is currently provisioned only
via `base` + `traefik`.
- **Idempotency:** roles are Docker-Compose-based; re-runs may trigger
container restarts when compose inputs change. There is no dedicated
rollback mechanism — on failure, roll back manually to the previous
state.
- **TLS renewal:** handled internally by Traefik via ACME. There is no
external renewal cron in the repo.
- **CI / testing:** not present in the repo. Smoke test is
`make ping_demo`.
- **Logs:** Traefik runs with `traefik_log_level: DEBUG` in
`demo-gymburgdorf` and `vagrant` (role default is `INFO`) — reduce
to `INFO` or `WARN` before adapting for production.
- **TSIG secrets in `knot.conf`:** the `dns-zones` repo currently
stores all four ACME TSIG keys in plaintext in
[`knot/knot.conf`](https://git.digitalboard.ch/Digitalboard/dns-zones/src/branch/main/knot/knot.conf). The Ansible
side reads them from Bao, but the Knot side does not, so anyone with
read access to the `dns-zones` repo can write TXT records under the
matching tenant's ACME sub-tree. For production, source the Knot keys
from a templated config + secret store, or restrict repo access.
- **Demo tenants share `_acme.digitalboard.ch`:** isolation relies on the
Knot ACL `update-owner-name` setting, not on zone delegation. A mis-edit
of the ACL list could break this isolation without breaking
DNS resolution, so the failure would be silent. The production zone
(`digitalboard.ch`) uses a properly delegated child zone and is
not affected.
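
The no-op plays and the **Logs** gap above lend themselves to a quick pre-flight check. The sketch below is an illustration only, not part of the repo: both function names are hypothetical, and it assumes a YAML inventory in which each group appears as a `<service>_servers:` key and the log level is set as an unquoted `traefik_log_level: DEBUG`. `missing_service_groups` prints the groups that a given `hosts.yml` lacks (their plays would run as no-ops), and `debug_log_settings` lists the places where Traefik is still set to `DEBUG`.

```shell
# Hypothetical pre-flight helpers, not part of reference-ansible.

# Print every <service>_servers group that is absent from a YAML inventory.
# Usage: missing_service_groups <hosts.yml> <service>...
missing_service_groups() {
  local hosts_file="${1:?usage: missing_service_groups <hosts.yml> <service>...}"
  shift
  local svc
  for svc in "$@"; do
    # Assumption: a group is declared as a mapping key, e.g. "opnform_servers:".
    if ! grep -Eq "^[[:space:]]*${svc}_servers:" "$hosts_file"; then
      echo "${svc}_servers"
    fi
  done
}

# List the files and lines in an inventory that set Traefik to DEBUG.
# Usage: debug_log_settings <inventory-dir>
debug_log_settings() {
  local inventory="${1:?usage: debug_log_settings <inventory-dir>}"
  # grep exits non-zero when nothing matches, which is the good case here.
  grep -rnE 'traefik_log_level:[[:space:]]*DEBUG' "$inventory" || true
}
```

Given the gap described above, `missing_service_groups inventories/demo-gymburgdorf/hosts.yml opencloud send opnform homarr bookstack` would be expected to print all five group names. An empty result from `debug_log_settings` means no `DEBUG` setting was found under that directory.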