docs/architecture/operations.md
Simon Bärlocher 345cf4b319
docs: add architecture section and overhaul top-level README
- Move Simon's architecture documentation into architecture/
  (setup, variables, topology, dns, deploy, security, operations
  plus index and glossary). All cross-repo references point at
  https://git.digitalboard.ch/Digitalboard/{reference-ansible,dns-zones}
  via absolute URLs so the docs remain navigable from any context.
- Rewrite README.md as a documentation hub: introduction, platform
  Mermaid overview, comparison of the three repos
  (docs / digitalboard.core / reference-ansible) and a full table of
  contents covering architecture, contributing, infrastructure,
  keycloak, ms-entra and troubleshooting.

Addresses the open items from the WKS PoC review (2026-05-26):
docs README welcome text + overview diagram + links to the
two other repos, as well as moving the architecture documentation.
2026-05-28 14:25:27 +02:00


Operations — new tenants and known gaps


10. Walkthrough: creating a new demo tenant

Recommended template: demo-gymburgdorf (not vagrant, since its group topology is incompatible).

  1. Copy the inventory:

    cp -r inventories/demo-gymburgdorf inventories/demo-<customer>
    
  2. Adjust hosts.yml: IPs and hostnames per host.

  3. group_vars/all/vault.yml — point vault_mount at the new tenant mount (demo-<customer>).

  4. In group_vars/traefik_servers/traefik.yml, point traefik_acme_dns_zone and the traefik_acme_tsig_* lookup paths at the new zone and the new Bao path.

  5. In host_vars/application/*.yml and host_vars/storage/*.yml, go through every file and change the FQDNs to the new domain pattern (e.g. *.<customer>.souveredu.ch) and the Bao lookup paths to demo-<customer>/data/….

  6. Prepare OpenBao (out-of-band, not via Ansible):

    • Create a new KV-v2 mount demo-<customer>.
    • Write secrets: acme-tsig, authentik, nextcloud, garage, … (see security.md for the mandatory-override list).
    • Give the deploy token a policy that grants read on demo-<customer>/data/*.
  7. DNS (in the dns-zones repo, see dns.md):

    • Add key: and acl: entries for the new tenant in knot/knot.conf, pattern acme_update_key_demo_<customer> / acme_updates_demo_<customer> scoped to demo-<customer>._acme.digitalboard.ch..
    • Append the new ACL to the _acme.digitalboard.ch zone's acl: list — the tenants share the parent zone, no NS delegation.
    • In zones/souveredu.ch.zone (or the tenant's public zone) add the public/internal A records (rvp.<customer>, reverseproxy.int.<customer>, application.int.<customer>, storage.int.<customer>, …), the service CNAMEs to rvp.<customer>, and the _acme-challenge.* CNAMEs into demo-<customer>._acme.digitalboard.ch. Bump the SOA serial.
    • Run make deploy_ns1 to push the changes.
  8. Makefile — add a new target modelled on deploy_site_demo_gymburgdorf and wire it into deploy_site_demo.

  9. Smoke test: ansible all -i inventories/demo-<customer>/hosts.yml -m ping.

  10. Deploy: log in to Bao, then run make deploy_site_demo_<customer>.
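
After steps 1 to 5, a quick search for the template customer name can catch values that were missed during the walk-through. The following is a minimal sketch and not part of the repo: the helper name list_template_leftovers and the TEMPLATE variable are illustrative, and it assumes that every remaining mention of gymburgdorf in the copied inventory is a value that still needs adapting.

```shell
# Sketch only: TEMPLATE and the helper name are illustrative, not repo code.
# Assumption: the template customer name appears literally in every value
# (vault mount, FQDN, Bao lookup path) that has to change for a new tenant.
TEMPLATE="gymburgdorf"

# list_template_leftovers DIR
# Prints "file:line:text" for every remaining mention of the template
# customer name under DIR. Prints nothing when no mention is left.
list_template_leftovers() {
  # -r recurse into DIR, -n show line numbers, -F match the name literally.
  # grep exits with status 1 when nothing matches; "|| true" keeps an
  # empty result from looking like a failure to the caller.
  grep -rnF -- "$TEMPLATE" "$1" || true
}
```

Calling list_template_leftovers inventories/demo-<customer> before the smoke test in step 9 lists each file and line that still points at the template tenant. An empty result only shows that the name is gone; it does not prove the new values are correct.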

11. Known gaps and trade-offs

  • Optional services without group bindings in demo-gymburgdorf: opencloud, send, opnform, homarr, and bookstack are declared as plays in playbooks/site.yml but have no <service>_servers group in the inventory — those plays run as no-ops. If needed, add the group + host_vars/application/<svc>.yml as described in topology.md. Mind spelling: opnform_servers (not openform/openforms).
  • turn host: the host is defined in the DMZ, but playbooks/site.yml contains no STUN/TURN role for it. It is currently provisioned only via base + traefik.
  • Idempotency: roles are Docker-Compose-based; re-runs may trigger container restarts when compose inputs change. There is no dedicated rollback mechanism — on failure, roll back manually to the previous state.
  • TLS renewal: handled internally by Traefik via ACME. There is no external renewal cron in the repo.
  • CI / testing: not present in the repo. Smoke test is make ping_demo.
  • Logs: Traefik runs with traefik_log_level: DEBUG in demo-gymburgdorf and vagrant (role default is INFO) — reduce to INFO or WARN before adapting for production.
  • TSIG secrets in knot.conf: the dns-zones repo currently stores all four ACME TSIG keys in plaintext in knot/knot.conf. The Ansible side reads them from Bao, but the Knot side does not, so anyone with read access to the dns-zones repo can write TXT records under the matching tenant's ACME sub-tree. For production, source the Knot keys from a templated config + secret store, or restrict repo access.
  • Demo tenants share _acme.digitalboard.ch: isolation is by Knot ACL update-owner-name, not by zone delegation. A mis-edit of the ACL list could break ACL-based isolation without breaking DNS resolution — failure is silent. The production zone (digitalboard.ch) uses a properly delegated child zone and is not affected.
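
The first gap above (plays that silently run as no-ops) can be made visible with a small check against the inventory. This is a hedged sketch rather than repo code: the helper name missing_service_groups is hypothetical, and it assumes that hosts.yml declares each group as a YAML key of the form <service>_servers: on its own line.

```shell
# Sketch only: helper name is hypothetical, not part of the repo.
# Assumption: groups appear in hosts.yml as YAML keys "<service>_servers:".

# missing_service_groups HOSTS_FILE SERVICE...
# Prints, one per line, each SERVICE that has no "<service>_servers:" key
# in HOSTS_FILE. Prints nothing when every service has a group.
missing_service_groups() {
  hosts_file="$1"
  shift
  for svc in "$@"; do
    # Match the group name as a key at any indentation level.
    if ! grep -Eq "^[[:space:]]*${svc}_servers:" "$hosts_file"; then
      printf '%s\n' "$svc"
    fi
  done
}
```

For example, missing_service_groups inventories/demo-<customer>/hosts.yml opencloud send opnform homarr bookstack lists the optional services whose plays would run as no-ops. A misspelled group such as openform_servers also shows up here, because opnform is then reported as missing.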