Certbot DNS-01 Cloudflare Wildcard
Anyone who has run a public-facing service knows the dance: a certificate expires at 3 a.m., nginx starts serving the wrong cert for a subdomain you forgot existed, and the on-call engineer is suddenly Googling ACME challenge types from a phone. The HTTP-01 challenge papers over the problem for simple single-host setups, but the moment you need a wildcard certificate for *.example.com — to cover preview environments, tenant subdomains, or a fleet of internal services behind one load balancer — HTTP-01 stops working and you have to switch to DNS-01. That switch is where most teams get stuck, because suddenly your TLS pipeline depends on a DNS provider API, a credentials file with the wrong permissions can leak your whole zone, and a single typo in a systemd unit means silent renewal failure until the cert actually expires.
This article walks through a complete, reproducible setup using certbot, the certbot-dns-cloudflare plugin, nginx, and systemd on a Debian-family server, with Cloudflare as the DNS provider. By the end you will have a Git-tracked project repo holding a locked-down credentials file, a wildcard certificate issued via the DNS-01 challenge against a scoped Cloudflare API token, an nginx server block that terminates TLS and redirects HTTP to HTTPS, and a systemd timer that renews the certificate and reloads nginx through a deploy hook. Every command is shown with the exact flags used, and the renewal path is verified with a dry-run before it ever has to fire for real.
The article is aimed at backend and platform engineers who already run nginx in production but have only ever issued single-host certs through HTTP-01. After working through it you will be able to issue and rotate wildcard certificates without touching the web server, diagnose the two failure modes that account for almost every DNS-01 outage (token scope and DNS propagation lag), and apply the same pattern to any other ACME-compatible DNS provider with only the credentials block changed.
Step 1: Scaffolding the Toolkit and Forging a Zone-Scoped Cloudflare Token
A wildcard certificate from Let's Encrypt is only safe when the credential that proves DNS control is itself safe. Before we ever invoke certbot, we need a Cloudflare API token that can edit one and only one thing — DNS records on the exact zones we want to certify — plus a tiny verifier that refuses to keep working with a token broader than that.
This first step lays the groundwork: a cfwild Python package, a pytest harness, and a token module that encodes the Cloudflare verify-endpoint contract so a mis-scoped token fails loudly during setup instead of silently surviving until a production renewal.
Setup
The repository starts from an empty directory. We add a pyproject.toml, a src/cfwild package, a tests/ folder, and a .gitignore to keep build noise out of git. We introduce no runtime dependencies beyond the Python standard library; the only dev dependency is pytest>=8.0, so we can drive a real test suite from day one.
The resulting layout looks like this:
codebase/
├── pyproject.toml
├── README.md
├── src/
│   └── cfwild/
│       ├── __init__.py
│       └── token.py
└── tests/
    ├── __init__.py
    └── test_token.py
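We use a src/ layout because it keeps the package off the repository root's import path, so nothing imports cfwild by accident from the working directory. The pyproject.toml declares the hatchling build backend, requires Python >=3.9, and wires pythonpath = ["src"] into pytest so the suite runs cleanly without an editable install on CI. The trade-off is that with this setting the tests import the source tree directly; to catch packaging bugs, run the suite against an installed build with that setting removed.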
Implementation
We start with the package metadata. Nothing exotic — a single project, a single package, and a single dev extra:
[project]
name = "cfwild"
version = "0.1.0"
description = "Helpers for issuing wildcard certs via certbot + Cloudflare DNS-01."
requires-python = ">=3.9"
dependencies = []
[project.optional-dependencies]
dev = ["pytest>=8.0"]
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[tool.hatch.build.targets.wheel]
packages = ["src/cfwild"]
[tool.pytest.ini_options]
testpaths = ["tests"]
pythonpath = ["src"]
addopts = "-ra"
Keeping dependencies = [] is intentional. The whole article series is built on the principle that issuing a wildcard cert needs the standard library plus certbot — nothing else — so the helper package must not silently pull in HTTP clients or shell wrappers that operators would have to audit.
Next comes the heart of step 1: src/cfwild/token.py. We pin the Cloudflare token shape first: v4 API tokens are 40 characters drawn from [A-Za-z0-9_-]. A regex catches the easy mistake of pasting a legacy Global API Key (a 37-character hex string) into a place that wants a token:
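import re
from collections.abc import Iterable, Mapping, Sequence

TOKEN_REGEX = re.compile(r"^[A-Za-z0-9_-]{40}$")

class TokenSpecError(ValueError):
    pass

def validate_token_format(token: object) -> None:
    if not isinstance(token, str):
        raise TokenSpecError("token must be a string")
    # fullmatch, not match: "$" also matches just before a trailing newline,
    # so match() would accept a token pasted with a stray "\n".
    if not TOKEN_REGEX.fullmatch(token):
        raise TokenSpecError(
            "token must be 40 characters of [A-Za-z0-9_-]"
        )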
TokenSpecError is a thin ValueError subclass. Subclassing lets callers catch just token-spec problems without swallowing unrelated errors from certbot or the network layer in later steps.
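As a quick illustration of the format check, the sketch below restates the regex and runs it against a few made-up strings. The helper name looks_like_api_token and both sample values are placeholders invented for this example, not real credentials; the sketch assumes fullmatch semantics so that a trailing newline is rejected.

```python
import re

TOKEN_REGEX = re.compile(r"^[A-Za-z0-9_-]{40}$")


def looks_like_api_token(candidate: object) -> bool:
    """Return True when candidate has the 40-character token shape."""
    return isinstance(candidate, str) and TOKEN_REGEX.fullmatch(candidate) is not None


# Placeholder values invented for illustration only.
sample_token = "A" * 20 + "b" * 18 + "_-"   # 40 characters from the allowed set
sample_hex_key = "0123456789abcdef" * 2      # a shorter hex string, like a legacy key

print(looks_like_api_token(sample_token))          # True
print(looks_like_api_token(sample_hex_key))        # False
print(looks_like_api_token(sample_token + "\n"))   # False
print(looks_like_api_token(None))                  # False
```

Rejecting the trailing newline matters because the token is later written verbatim into an INI file.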
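The next block teaches the module how to read a token's policies from a Cloudflare /user/tokens/verify payload. Cloudflare describes a token as a policies list; each policy has permission_groups (what the policy grants) and resources (where it grants it). Cloudflare's API reference documents only id, status, and validity dates in the verify result and returns policies from the token-details endpoint, so check which response your setup script actually feeds in. For DNS-01 we only ever want the DNS Write permission group, and we only ever want resources of the form com.cloudflare.api.account.zone.<zone_id>: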
DNS_EDIT_PERMISSION_NAME = "DNS Write"
ZONE_RESOURCE_PREFIX = "com.cloudflare.api.account.zone."
ACCOUNT_RESOURCE_PREFIX = "com.cloudflare.api.account."

def _policy_grants_dns_edit(policy: Mapping[str, object]) -> bool:
    groups = policy.get("permission_groups") or []
    if not isinstance(groups, Sequence):
        return False
    for entry in groups:
        if isinstance(entry, Mapping) and entry.get("name") == DNS_EDIT_PERMISSION_NAME:
            return True
    return False

def _policy_zone_targets(policy: Mapping[str, object]) -> list[str]:
    resources = policy.get("resources") or {}
    if not isinstance(resources, Mapping):
        return []
    zones: list[str] = []
    for key in resources.keys():
        if not isinstance(key, str):
            continue
        if key.startswith(ZONE_RESOURCE_PREFIX):
            zones.append(key[len(ZONE_RESOURCE_PREFIX):])
    return zones
Two design choices are worth calling out. First, the helpers are tolerant of malformed shapes (missing keys, wrong types) and return empty rather than raising — Cloudflare adds fields over time, and a verifier that explodes on every new key is a verifier that operators will disable. Second, the zone IDs are returned as bare strings, stripped of the com.cloudflare.api.account.zone. prefix, so the rest of the codebase can compare them against the IDs printed in the Cloudflare dashboard without an extra translation step.
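To make the tolerant behaviour concrete, here is a compact, self-contained restatement of the zone-extraction idea run against a hand-written policy. The function name zone_targets, the payload fragment, and the zone ID are all placeholders made up for this sketch.

```python
from collections.abc import Mapping

ZONE_RESOURCE_PREFIX = "com.cloudflare.api.account.zone."


def zone_targets(policy: Mapping[str, object]) -> list[str]:
    """Return bare zone IDs named in a policy's resources, or [] if malformed."""
    resources = policy.get("resources") or {}
    if not isinstance(resources, Mapping):
        return []
    return [
        key[len(ZONE_RESOURCE_PREFIX):]
        for key in resources
        if isinstance(key, str) and key.startswith(ZONE_RESOURCE_PREFIX)
    ]


# Hypothetical payload fragment; the IDs are made-up placeholders.
policy = {
    "effect": "allow",
    "permission_groups": [{"id": "placeholder-id", "name": "DNS Write"}],
    "resources": {"com.cloudflare.api.account.zone.0123abcd": "*"},
}

print(zone_targets(policy))                 # ['0123abcd']
print(zone_targets({"resources": "oops"}))  # []
print(zone_targets({}))                     # []
```

Malformed input yields an empty list rather than an exception, which leaves the decision to reject with the strict checker.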
The strict checker lives in has_only_dns_edit. It is the function that an operator will wire into their setup script: pass it the verify response plus the list of zone IDs the wildcard cert should cover, and it raises TokenSpecError for anything broader — including the subtle case where a token grants DNS Write on an account-wide scope, which would let an attacker pivot to any zone the account owns:
def has_only_dns_edit(
    verify_response: Mapping[str, object],
    allowed_zone_ids: Iterable[str],
) -> None:
    allowed = set(allowed_zone_ids)
    if not allowed:
        raise TokenSpecError("allowed_zone_ids must not be empty")
    policies = verify_response.get("policies") or []
    if not isinstance(policies, Sequence) or not policies:
        raise TokenSpecError("verify_response carries no policies")
    seen: set[str] = set()
    for policy in policies:
        if not isinstance(policy, Mapping):
            raise TokenSpecError("policy entry is not an object")
        _enforce_dns_edit_only(policy, allowed, seen)
    missing = allowed - seen
    if missing:
        raise TokenSpecError(
            f"token does not cover all required zones: {sorted(missing)}"
        )
The per-policy enforcement is delegated to a helper to keep nesting shallow. The codebase rule is two levels of conditional max, no nested try/except, and pushing the policy check into _enforce_dns_edit_only keeps has_only_dns_edit at a clean linear flow:
def _enforce_dns_edit_only(
    policy: Mapping[str, object],
    allowed: set[str],
    seen: set[str],
) -> None:
    if _policy_is_account_wide(policy):
        raise TokenSpecError(
            "token grants account-wide scope; DNS-01 needs zone scope"
        )
    if not _policy_grants_dns_edit(policy):
        raise TokenSpecError(
            "policy lacks 'DNS Write'; least-privilege issuer needs DNS Write only"
        )
    zones = set(_policy_zone_targets(policy))
    rogue = zones - allowed
    if rogue:
        raise TokenSpecError(
            f"policy grants unexpected zones: {sorted(rogue)}"
        )
    seen.update(zones)
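One gap is worth noting. As written, _policy_grants_dns_edit returns True as soon as one permission group is named DNS Write, so a policy that bundles DNS Write with an unrelated group would still pass. A stricter variant might look like the sketch below. The function name grants_only_dns_write is hypothetical and not part of the module shown above, and the second group name in the example is only illustrative.

```python
from collections.abc import Mapping, Sequence

DNS_EDIT_PERMISSION_NAME = "DNS Write"


def grants_only_dns_write(policy: Mapping[str, object]) -> bool:
    """True only if the policy lists at least one group and every group is DNS Write."""
    groups = policy.get("permission_groups") or []
    # A str is also a Sequence, so exclude it explicitly.
    if isinstance(groups, str) or not isinstance(groups, Sequence) or not groups:
        return False
    return all(
        isinstance(entry, Mapping) and entry.get("name") == DNS_EDIT_PERMISSION_NAME
        for entry in groups
    )


narrow = {"permission_groups": [{"name": "DNS Write"}]}
bundled = {"permission_groups": [{"name": "DNS Write"}, {"name": "Zone Settings Write"}]}

print(grants_only_dns_write(narrow))   # True
print(grants_only_dns_write(bundled))  # False
print(grants_only_dns_write({}))       # False
```

Whether to adopt the stricter rule is a policy decision; it trades tolerance of unexpected groups for a tighter least-privilege guarantee.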
Finally, src/cfwild/__init__.py re-exports the public surface so callers can write from cfwild import has_only_dns_edit instead of reaching into the module path. The test file in tests/test_token.py covers the four failure modes we care about: account-wide scope, missing required zones, extra unexpected zones, and a policy that has the right resources but the wrong permission group.
Verification
Run the suite from the codebase/ directory:
pytest
The expected output:
============================= test session starts ==============================
platform darwin -- Python 3.12.5, pytest-9.0.3, pluggy-1.6.0
rootdir: .../certbot-dns-01-cloudflare-wildcard/codebase
configfile: pyproject.toml
testpaths: tests
collected 18 items
tests/test_token.py .................. [100%]
============================== 18 passed in 0.03s ==============================
Eighteen passing tests is the floor for step 1, split across three target functions.
- Six tests exercise validate_token_format: canonical shape, allowed special characters, too short, too long, illegal characters, and non-string input.
- Four tests exercise the zone-extraction helper: single zone, multiple zones, ignoring non-DNS policies, and handling empty payloads.
- Eight tests exercise has_only_dns_edit: the four rejection paths above (account-wide scope, missing zones, rogue zones, wrong permission group), plus exact-match, multi-zone, empty allowed set, and empty policy list.
What we built
We now have an installable Python package, cfwild, whose only job in step 1 is to refuse a Cloudflare token that does more than it should. That refusal is mechanical — a deterministic function over the verify-endpoint payload — which means it can sit at the top of every later script with negligible cost.
The package layout reflects a deliberate choice: a src/ tree so the package is never importable by accident from the repository root, a tiny dev extra so contributors only need pip install -e .[dev], and zero runtime dependencies so the audit surface stays close to the standard library. The pytest config wires the path manipulation into pyproject.toml so a fresh clone runs the suite with a single pytest invocation.
The token module enforces three invariants that matter for the rest of the series. The token must look like a v4 Cloudflare token; the verify response must contain at least one policy; and every policy must grant DNS Write on an explicit zone scope — never the account-wide bucket. Each invariant has both a positive and a negative test, so a future refactor that silently weakens one of them will fail CI before it lands.
What this unlocks: in step 2 we install certbot's Cloudflare DNS plugin and write the credentials file. We will plug has_only_dns_edit straight into the bootstrap script so that an operator who pastes the wrong token at 2 AM gets a precise, actionable error — instead of a successful install followed by a failed renewal a week later.
Repository
The state of the code after this step: e58c000
Step 2: Installing the dns-cloudflare Authenticator and Hardening the Credentials INI
Step 1 left us with cfwild, a Python package whose only job so far is to reject a Cloudflare API token that grants more than DNS Write on a specific zone. That guard is useless until something actually consumes the token, so step 2 wires up the consumer: certbot, with the certbot-dns-cloudflare authenticator plugin enabled and a credentials INI that the operating system itself protects.
Two new modules show up in this step. cfwild.install knows how to read certbot plugins and certbot --version output and reject a host that lacks the DNS-01 authenticator or runs an outdated certbot. cfwild.credentials writes the Cloudflare INI file at the correct path, with the correct key, and — critically — with mode 0600 so neither group nor world can read the token off the disk.
Setup
We add two source files and two test files to the package laid down in step 1. No new runtime dependencies are introduced; both modules stick to the standard library (os, stat, pathlib, re) so the audit surface stays small. On the operating-system side the operator installs certbot plus the plugin via the recommended path for their distro — APT for Debian/Ubuntu, the snap channel for distros without a packaged plugin, or a pinned virtualenv when neither is acceptable.
The new package layout:
codebase/
├── pyproject.toml
├── README.md
├── src/
│   └── cfwild/
│       ├── __init__.py        # re-exports the new public surface
│       ├── credentials.py     # NEW: INI writer + lockdown checker
│       ├── install.py         # NEW: plugin + version checks
│       └── token.py
└── tests/
    ├── __init__.py
    ├── test_credentials.py    # NEW
    ├── test_install.py        # NEW
    └── test_token.py
For the system-level install we standardise on the following invocation. It is intentionally distro-agnostic: the operator picks the form their host supports, but the artifact — a working certbot binary that lists dns-cloudflare in certbot plugins — is identical across paths.
# Debian / Ubuntu — recent enough for the bundled plugin
sudo apt-get update
sudo apt-get install -y certbot python3-certbot-dns-cloudflare
# Or, via snap (works on hosts without packaged plugins)
sudo snap install --classic certbot
sudo snap set certbot trust-plugin-with-root=ok
sudo snap install certbot-dns-cloudflare
sudo ln -sf /snap/bin/certbot /usr/bin/certbot
We pick python3-certbot-dns-cloudflare over pip install certbot certbot-dns-cloudflare on long-lived servers because the distro package is patched in lockstep with certbot itself. Snap is the fallback for hosts whose package archive is too old to ship the plugin; we avoid the third option (pip into a venv) unless the operator already maintains a venv-based deploy story, because mixing system certbot with pip certbot is a well-known foot-gun.
Implementation
The first module to land is src/cfwild/install.py. Its job is to turn two strings — the output of certbot plugins and the output of certbot --version — into two assertions: the dns-cloudflare authenticator is registered, and the certbot binary is recent enough to support modern ACME features. We deliberately keep the I/O (running the certbot subprocess) outside this module so the parsing is trivially unit-testable.
import re
from collections.abc import Iterable

DNS_CLOUDFLARE_PLUGIN = "dns-cloudflare"
MIN_CERTBOT_VERSION = (1, 22, 0)
_PLUGIN_LINE = re.compile(r"^\*\s+([A-Za-z0-9._-]+)\s*$")
_VERSION_LINE = re.compile(r"^certbot\s+(\d+)\.(\d+)(?:\.(\d+))?")

def parse_plugin_names(certbot_plugins_output: str) -> list[str]:
    names: list[str] = []
    for raw_line in certbot_plugins_output.splitlines():
        match = _PLUGIN_LINE.match(raw_line.strip())
        if match is None:
            continue
        names.append(match.group(1))
    return names
The _PLUGIN_LINE regex is deliberately narrow: certbot prints each plugin name on a line that begins with * followed by whitespace, and everything else (description, interfaces, entry point, separator rules) does not. By matching the exact shape rather than the looser "line containing dns-cloudflare" we also reject the separator rules made of - - - - - and stay future-proof when certbot adds new metadata lines.
The version parser is shaped around what real certbot prints. Different installs emit different trailing tokens — a snap-installed certbot adds a build tag, a pip-installed one prints just certbot 2.7.4 — so we anchor on the leading numbers and ignore the rest of the line.
def parse_certbot_version(certbot_version_output: str) -> tuple[int, int, int]:
    line = certbot_version_output.strip().splitlines()[0] if certbot_version_output.strip() else ""
    match = _VERSION_LINE.match(line)
    if match is None:
        raise InstallError(
            f"could not parse certbot version from output: {certbot_version_output!r}"
        )
    major = int(match.group(1))
    minor = int(match.group(2))
    patch = int(match.group(3)) if match.group(3) is not None else 0
    return major, minor, patch
Why pin MIN_CERTBOT_VERSION = (1, 22, 0)? It is the floor this series assumes for --preferred-chain handling and for DNS-01 propagation behaviour. Anything older may still issue certificates, but the renewal hooks and propagation timeouts behave differently enough that we refuse to support them rather than guess.
The companion assert_* helpers translate parsed data into raised exceptions. They are intentionally single-purpose: callers compose them rather than receiving a giant "preflight" function whose internals are hard to swap out.
def assert_dns_cloudflare_available(plugins: Iterable[str]) -> None:
    available = list(plugins)
    if DNS_CLOUDFLARE_PLUGIN in available:
        return
    raise InstallError(
        f"'{DNS_CLOUDFLARE_PLUGIN}' authenticator not found; "
        f"plugins reported: {sorted(available)}"
    )

def assert_certbot_version_at_least(
    actual: tuple[int, int, int],
    minimum: tuple[int, int, int] = MIN_CERTBOT_VERSION,
) -> None:
    if actual < minimum:
        actual_str = ".".join(str(part) for part in actual)
        minimum_str = ".".join(str(part) for part in minimum)
        raise InstallError(
            f"certbot {actual_str} is older than minimum {minimum_str}"
        )
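The comparator works on integer tuples rather than strings for a reason that is easy to show. The helper version_tuple below is a simplified stand-in written for this sketch, not the module's parser.

```python
def version_tuple(text: str) -> tuple[int, int, int]:
    """Turn 'major.minor[.patch]' into a 3-tuple, padding a missing patch with 0."""
    parts = [int(piece) for piece in text.split(".")]
    while len(parts) < 3:
        parts.append(0)
    return parts[0], parts[1], parts[2]


MINIMUM = (1, 22, 0)

# String comparison goes character by character, so "9" sorts after "2".
print("1.9.0" < "1.22.0")                 # False
# Tuple comparison treats each component as an integer: 9 < 22.
print(version_tuple("1.9.0") < MINIMUM)   # True
print(version_tuple("2.7") >= MINIMUM)    # True
```

A string-based check would wave certbot 1.9.0 through as "newer" than 1.22.0, which is exactly the kind of host the floor is meant to reject.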
The second module, src/cfwild/credentials.py, builds and protects the Cloudflare INI file. The certbot-dns-cloudflare plugin reads an API token from a key named dns_cloudflare_api_token. The legacy alternative, a dns_cloudflare_email and dns_cloudflare_api_key pair carrying the Global API Key, would downgrade the credential to a non-least-privilege one, so the writer never emits it. We encode the key constant in the module so callers never spell it wrong.
CREDENTIALS_FILE_MODE = 0o600
CREDENTIALS_HEADER_COMMENT = (
    "# Cloudflare API token used by certbot's DNS-01 authenticator.\n"
    "# This file MUST be chmod 600 so only the certbot user can read it.\n"
)
CREDENTIALS_TOKEN_KEY = "dns_cloudflare_api_token"

def build_cloudflare_ini(api_token: str) -> str:
    validate_token_format(api_token)
    return f"{CREDENTIALS_HEADER_COMMENT}{CREDENTIALS_TOKEN_KEY} = {api_token}\n"
Notice that build_cloudflare_ini immediately delegates token-shape checking to validate_token_format from step 1. That keeps a single source of truth for what a "well-formed Cloudflare token" looks like: every consumer that writes a credentials file inherits the rejection of Global API Keys and short pasted strings for free.
The actual file write is split into a public function and two private helpers, each doing one thing. This keeps the public path linear and limits nesting to a single conditional, well inside the codebase rule of at most two levels of conditional nesting.
def write_credentials_file(path: Path, api_token: str) -> Path:
    target = Path(path)
    content = build_cloudflare_ini(api_token)
    _ensure_parent_dir(target)
    _write_locked(target, content)
    return target

def _ensure_parent_dir(target: Path) -> None:
    parent = target.parent
    if parent and not parent.exists():
        parent.mkdir(parents=True, exist_ok=True)

def _write_locked(target: Path, content: str) -> None:
    # Create the file at 0600 so the token never sits in a looser-mode file.
    fd = os.open(target, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, CREDENTIALS_FILE_MODE)
    with os.fdopen(fd, "w") as handle:
        # The mode given to os.open only applies on creation; tighten a
        # pre-existing file before any bytes are written.
        os.fchmod(handle.fileno(), CREDENTIALS_FILE_MODE)
        handle.write(content)
The ordering is important: the file is created with mode 0600 before any content is written. Writing first and tightening with chmod afterwards would leave a brief window in which the token sits in a file carrying umask-default permissions (usually 0644). Creating the file locked down means the bytes only ever reach a tightened inode, and if either step fails the caller gets a real exception instead of a partially secured artifact.
verify_credentials_file_locked exists as a separate guard the operator can run after the file is in place, for example as a startup probe or a cron-driven audit. It refuses anything that is not a regular file, and it uses stat.S_IRWXG | stat.S_IRWXO as the forbidden-bit mask so any group or world permission (read, write, or execute) fails the check. Owner bits are not inspected, so a stricter mode such as 0400 also passes.
def verify_credentials_file_locked(path: Path) -> None:
    target = Path(path)
    if not target.exists():
        raise CredentialsError(f"credentials file does not exist: {target}")
    if not target.is_file():
        raise CredentialsError(f"credentials path is not a regular file: {target}")
    mode = stat.S_IMODE(target.stat().st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        raise CredentialsError(
            f"credentials file is group/other-readable (mode {oct(mode)}); expected 0o600"
        )
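The effect of the forbidden-bit mask can be seen by cycling one scratch file through a few modes. This is a standalone sketch for a POSIX host; the helper name has_group_or_other_bits and the file name are made up for the example.

```python
import os
import stat
import tempfile
from pathlib import Path

FORBIDDEN_BITS = stat.S_IRWXG | stat.S_IRWXO  # any group or other permission


def has_group_or_other_bits(path: Path) -> bool:
    """True when the file grants any permission to group or other."""
    mode = stat.S_IMODE(path.stat().st_mode)
    return bool(mode & FORBIDDEN_BITS)


results = {}
with tempfile.TemporaryDirectory() as tmp:
    ini = Path(tmp) / "cloudflare.ini"
    ini.write_text("# placeholder, no real token\n")
    for mode in (0o600, 0o640, 0o604, 0o400):
        os.chmod(ini, mode)
        results[oct(mode)] = has_group_or_other_bits(ini)

print(results)
# {'0o600': False, '0o640': True, '0o604': True, '0o400': False}
```

Note that 0400 passes: the mask forbids group and world bits but does not require the owner bits to be exactly read-write.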
Finally, parse_credentials_file reads a previously-written INI back into a token string, so a deploy script can re-validate the token against Cloudflare on every renewal cycle. We accept comment lines and ignore them, but require the dns_cloudflare_api_token key — a missing key is a configuration bug, not a soft warning.
def parse_credentials_file(path: Path) -> str:
    target = Path(path)
    if not target.exists():
        raise CredentialsError(f"credentials file does not exist: {target}")
    for raw_line in target.read_text().splitlines():
        token = _extract_token_from_line(raw_line)
        if token is not None:
            return token
    raise CredentialsError(
        f"credentials file missing '{CREDENTIALS_TOKEN_KEY}' key: {target}"
    )
_extract_token_from_line is a tiny pure function that returns None for blank lines and comments, None for any line whose key is not ours, and the stripped value otherwise. Pulling it out keeps parse_credentials_file flat and makes it trivial to add new keys later without rewriting the loop.
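The body of that helper is not shown in the article, so the following is only one plausible shape for it, written under the behaviour described above. Treating ";" as a comment marker is an extra assumption of this sketch.

```python
from typing import Optional

CREDENTIALS_TOKEN_KEY = "dns_cloudflare_api_token"


def extract_token_from_line(raw_line: str) -> Optional[str]:
    """Return the token value if the line sets our key, else None."""
    line = raw_line.strip()
    # Blank lines and comments carry no value.
    if not line or line.startswith(("#", ";")):
        return None
    key, separator, value = line.partition("=")
    # Lines without "=" or with a different key are not ours.
    if not separator or key.strip() != CREDENTIALS_TOKEN_KEY:
        return None
    return value.strip()


print(extract_token_from_line("# a comment"))                           # None
print(extract_token_from_line("dns_cloudflare_email = a@example.org"))  # None
print(extract_token_from_line("dns_cloudflare_api_token = abc123"))     # abc123
```

Because the function is pure, each of the three outcomes can be unit-tested without touching the filesystem.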
Verification
Run the test suite from the codebase/ directory:
pytest
The expected output:
============================= test session starts ==============================
platform darwin -- Python 3.12.5, pytest-9.0.3, pluggy-1.6.0
rootdir: .../certbot-dns-01-cloudflare-wildcard/codebase
configfile: pyproject.toml
testpaths: tests
collected 54 items
tests/test_credentials.py .................. [ 33%]
tests/test_install.py .................. [ 66%]
tests/test_token.py .................. [100%]
============================== 54 passed in 0.14s ==============================
Fifty-four passing tests — the eighteen from step 1 plus eighteen each for the two new modules.
The credentials suite covers the INI body, the file mode after a write (asserting mode 0600, rejecting 0640, rejecting 0604), parent-directory creation, idempotent overwrite, comment handling on parse, and the four "bad input" rejection paths.
The install suite walks the parser through real certbot plugins output both with and without the dns-cloudflare authenticator, version strings with and without a patch component, and the version comparator against the (1, 22, 0) floor.
For the system-level check, an operator runs the real binaries:
certbot --version
certbot plugins
The first command should print a line of the form certbot 2.7.4 (any version >= 1.22.0); the second should list * dns-cloudflare alongside whatever HTTP authenticators the install ships. Both outputs are exactly the strings cfwild.install knows how to parse, so a deploy script can pipe them straight into parse_certbot_version and parse_plugin_names without massaging.
What we built
Two cooperating modules now sit on top of the step-1 token validator. cfwild.install turns ambient system state — what certbot is installed and which authenticators it knows about — into deterministic assertions that fail loudly when the host is misconfigured. cfwild.credentials turns a validated token into a filesystem artifact that the operating system itself protects.
The credentials writer enforces three invariants worth restating. The token must already pass validate_token_format before a single byte hits disk. The file is forced to mode 0600 before write_credentials_file returns, so a successful call never leaves the token behind in a file with looser permissions. And verify_credentials_file_locked provides an independent check that the file stayed locked down: a deploy can run it on every boot and refuse to start if someone has loosened the mode out of band.
The install module is intentionally a parser, not a runner. It does not shell out to certbot; it accepts the strings that certbot would print and returns either a structured tuple or a precise InstallError. That separation lets us unit-test the boundary cases (snap builds, missing patch numbers, plugin descriptions that look almost-but-not-quite like a plugin name) without mocking subprocess, and lets the orchestration script in a later step compose those checks in whichever order suits the host.
What this unlocks: step 3 can assume that on any host the package considers "ready", the dns-cloudflare authenticator is present, the certbot binary is new enough, and a chmod-0600 INI file containing the step-1 token sits at a known path. With those guarantees the next step can issue an actual wildcard certificate against the Let's Encrypt staging endpoint without having to re-litigate any of the preflight concerns.
Repository
The state of the code after this step: dc8e801
Step 3: Issuing the First Wildcard Certificate Against Let's Encrypt Staging
Steps 1 and 2 left us with two boring-on-purpose guarantees: a token that has nothing but DNS Write on the chosen zone, and a chmod 0600 INI file at a known path that holds it. Step 3 finally cashes those guarantees in and asks Let's Encrypt for an actual wildcard certificate via the DNS-01 challenge.
The new module, cfwild.issue, contains zero subprocess calls. It builds a certbot certonly argument vector, normalises the domain list into [apex, *.apex], and parses certbot's stdout back into a typed IssuanceResult. We deliberately target the staging ACME endpoint first so a misconfigured zone never burns a real rate-limit slot — production is a one-line server-URL swap once staging issuance succeeds end to end.
Setup
One new source file (src/cfwild/issue.py) and one new test file (tests/test_issue.py) land in this step, plus the __init__.py re-exports growing to include the new public surface. No new runtime dependencies — the module stays on re, dataclasses, and pathlib so the auditable surface remains the standard library.
codebase/
├── pyproject.toml
├── README.md
├── src/
│   └── cfwild/
│       ├── __init__.py        # re-exports the new issuance surface
│       ├── credentials.py
│       ├── install.py
│       ├── issue.py           # NEW: certbot command + output parser
│       └── token.py
└── tests/
    ├── __init__.py
    ├── test_credentials.py
    ├── test_install.py
    ├── test_issue.py          # NEW
    └── test_token.py
The runtime prerequisites are everything the previous two steps established: a certbot binary >= 1.22.0 with the dns-cloudflare authenticator registered, and a chmod 0600 Cloudflare INI on disk with the validated token inside. If any of those preconditions fail, the helpers from steps 1 and 2 raise a precise exception long before we reach this module.
Implementation
The public surface is two dataclasses plus four functions. CertbotIssueRequest is the input — apex domain, credentials path, contact email, optional extra Subject Alternative Names, an ACME server URL that defaults to staging, and a propagation timeout. IssuanceResult is the output — typed paths to fullchain.pem and privkey.pem plus an expiry date string we can compare against a renewal threshold later.
import re
from collections.abc import Sequence
from dataclasses import dataclass, field
from pathlib import Path

STAGING_ACME_SERVER = "https://acme-staging-v02.api.letsencrypt.org/directory"
PRODUCTION_ACME_SERVER = "https://acme-v02.api.letsencrypt.org/directory"
DEFAULT_PROPAGATION_SECONDS = 60
MIN_PROPAGATION_SECONDS = 10

@dataclass(frozen=True)
class CertbotIssueRequest:
    apex_domain: str
    credentials_path: Path
    contact_email: str
    server: str = STAGING_ACME_SERVER
    propagation_seconds: int = DEFAULT_PROPAGATION_SECONDS
    dry_run: bool = False
    extra_sans: Sequence[str] = field(default_factory=tuple)
Two constants behind the dataclass defaults deserve a word of justification. DEFAULT_PROPAGATION_SECONDS = 60 is a deliberately generous wait for Cloudflare's authoritative DNS to serve the new record once the API returns; we have seen propagation finish in under five seconds on a quiet zone and never seen it miss sixty. MIN_PROPAGATION_SECONDS = 10 is the floor we refuse to go below: any shorter and the ACME server's validation query occasionally arrives before the TXT record is being served, and a single missed challenge stretches a debug session into hours.
The domain normaliser is intentionally trivial: a wildcard cert that omits the apex means example.org itself is uncovered, which is almost never what the operator wants.
def normalize_wildcard_domains(apex: str) -> list[str]:
    _validate_apex_domain(apex)
    return [apex, f"*.{apex}"]
The validator beneath it rejects the four mistakes we keep seeing in tickets: an empty string, an apex that is itself wildcarded (*.example.org), a single-label string like localhost, and labels with underscores or other characters that hostnames do not permit. Catching them here means build_certbot_command never produces an argv that certbot will reject thirty seconds later with a less helpful error.
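The validator's body is not reproduced in the article. A minimal sketch of the four checks, under the assumption that a label follows the usual hostname rules (letters, digits, inner hyphens, at most 63 characters), could look like this. The function name apex_problem and its return-a-message style are choices made for the example, not the module's API.

```python
import re
from typing import Optional

# One hostname label: starts and ends alphanumeric, hyphens only inside, 1-63 chars.
_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def apex_problem(apex: object) -> Optional[str]:
    """Return a description of what is wrong with apex, or None if it looks valid."""
    if not isinstance(apex, str) or not apex:
        return "apex must be a non-empty string"
    if apex.startswith("*."):
        return "apex must not itself be a wildcard"
    labels = apex.split(".")
    if len(labels) < 2:
        return "apex needs at least two labels"
    if not all(_LABEL.fullmatch(label) for label in labels):
        return "apex contains a label that is not a valid hostname label"
    return None


for candidate in ("example.org", "", "*.example.org", "localhost", "my_zone.org"):
    print(repr(candidate), "->", apex_problem(candidate))
```

Only the first candidate comes back as None; each of the other four trips a different check.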
build_certbot_command is the argv assembler. It is a flat function — input validation up front, then a single linear list of arguments, then optional appends — so the shape of the command stays obvious to a reviewer.
def build_certbot_command(request: CertbotIssueRequest) -> list[str]:
    _validate_apex_domain(request.apex_domain)
    _validate_email(request.contact_email)
    _validate_propagation(request.propagation_seconds)
    _validate_server_url(request.server)
    _validate_extra_sans(request.apex_domain, request.extra_sans)
    domains = _all_domains(request)
    command = [
        "certbot",
        "certonly",
        "--non-interactive",
        "--agree-tos",
        "--email",
        request.contact_email,
        "--server",
        request.server,
        "--authenticator",
        "dns-cloudflare",
        "--dns-cloudflare-credentials",
        str(request.credentials_path),
        "--dns-cloudflare-propagation-seconds",
        str(request.propagation_seconds),
        "--preferred-challenges",
        "dns-01",
    ]
    for domain in domains:
        command.extend(["-d", domain])
    if request.dry_run:
        command.append("--dry-run")
    return command
Each flag in that block has a specific reason to be there. certonly asks certbot to issue without installing into any webserver; the certificate is consumed by a reverse proxy or another non-certbot consumer, so deploy hooks belong in a later step. --non-interactive plus --agree-tos plus --email is the trio that makes the call safe to run from cron. --preferred-challenges dns-01 is strictly redundant, because the dns-cloudflare authenticator can only answer DNS-01 challenges, but it documents the intent in the command itself: HTTP-01 cannot satisfy a wildcard, and stating the challenge type explicitly heads off "certbot picked the wrong challenge" confusion if someone later swaps the authenticator.
The extra-SAN validator is the most defensive helper in the module because the failure mode is misleading. Certbot will happily request a SAN for api.other.org even when we only have DNS Write on example.org, and the resulting issuance just fails partway through with a misleading "could not solve challenge" error.
def _validate_san_entry(apex: str, san: object) -> None:
    if not isinstance(san, str) or not san:
        raise IssueError(f"extra SAN must be a non-empty string, got {san!r}")
    candidate = san[2:] if san.startswith("*.") else san
    _validate_apex_domain(candidate)
    if not (candidate == apex or candidate.endswith("." + apex)):
        raise IssueError(
            f"extra SAN {san!r} is not within apex zone {apex!r}"
        )
Anchoring on the apex means a SAN like *.api.example.org is accepted (it is inside the zone) but api.other.org is rejected even before the certbot subprocess starts. We strip a leading *. only for the in-zone check; the original string is what gets passed to certbot via -d.
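The leading dot in the suffix test is what keeps look-alike domains out. The short sketch below isolates that one comparison; within_zone is a name invented for the example.

```python
def within_zone(candidate: str, apex: str) -> bool:
    """True when candidate is the apex itself or a name underneath it."""
    return candidate == apex or candidate.endswith("." + apex)


apex = "example.org"
print(within_zone("api.example.org", apex))   # True
print(within_zone("example.org", apex))       # True
print(within_zone("notexample.org", apex))    # False: shares a suffix, not a zone
print("notexample.org".endswith(apex))        # True: why a bare suffix test is unsafe
print(within_zone("api.other.org", apex))     # False
```

A plain endswith(apex) would accept notexample.org, a domain the token has no authority over.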
Parsing the success output back into structured data is the other half of the module. Certbot prints three lines we care about on a successful issuance, and we anchor on the human-readable prefix of each rather than on column positions because certbot's whitespace varies between versions.
_CERT_PATH_LINE = re.compile(r"Certificate is saved at:\s*(.+)$")
_KEY_PATH_LINE = re.compile(r"Key is saved at:\s*(.+)$")
_EXPIRY_LINE = re.compile(r"This certificate expires on\s+(\d{4}-\d{2}-\d{2})")

def parse_issuance_output(certbot_stdout: str) -> IssuanceResult:
    fullchain = _scan_unique(certbot_stdout, _CERT_PATH_LINE, "Certificate is saved at")
    privkey = _scan_unique(certbot_stdout, _KEY_PATH_LINE, "Key is saved at")
    expiry = _scan_unique(certbot_stdout, _EXPIRY_LINE, "This certificate expires on")
    return IssuanceResult(
        fullchain_path=Path(fullchain),
        privkey_path=Path(privkey),
        expires_on=expiry,
    )
_scan_unique walks lines once, returns the first match, and raises a precise IssueError if the expected line is absent. We chose "first match" over "exactly one match" because certbot occasionally repeats the path line in a debug-mode footer, and a strict "exactly one" check would falsely reject a perfectly good issuance.
Finally, assert_issuance_covers_wildcard is the post-flight check. It does not re-read the certificate file — that is the job of an x509-aware consumer downstream — but it does refuse to declare success unless the paths look right.
def assert_issuance_covers_wildcard(result: IssuanceResult, apex_domain: str) -> None:
    _validate_apex_domain(apex_domain)
    name = result.fullchain_path.parent.name
    if name != apex_domain:
        raise IssueError(
            f"fullchain path lives under '{name}', expected '{apex_domain}'"
        )
    if result.fullchain_path.name != "fullchain.pem":
        raise IssueError(
            f"expected fullchain.pem, got '{result.fullchain_path.name}'"
        )
    if result.privkey_path.name != "privkey.pem":
        raise IssueError(
            f"expected privkey.pem, got '{result.privkey_path.name}'"
        )
The directory check exists because certbot occasionally suffixes a -0001 to the live directory when a previous lineage with the same name was archived; our orchestrator wants that situation surfaced loudly rather than silently published into the proxy.
Verification
Run the full test suite from the codebase/ directory:
pytest
The expected output:
============================= test session starts ==============================
platform darwin -- Python 3.12.5, pytest-9.0.3, pluggy-1.6.0
rootdir: .../certbot-dns-01-cloudflare-wildcard/codebase
configfile: pyproject.toml
testpaths: tests
collected 88 items
tests/test_credentials.py .................. [ 20%]
tests/test_install.py .................. [ 40%]
tests/test_issue.py .................................. [ 79%]
tests/test_token.py .................. [100%]
============================== 88 passed in 0.13s ==============================
Eighty-eight tests pass — fifty-four carried over from steps 1 and 2 plus thirty-four for the new issuance module. The new file covers six categories:
- normalize_wildcard_domains happy path plus its four rejection paths.
- Argv assembly across staging, production, dry-run, propagation, and extra-SAN combinations.
- Input validators for email, server URL, propagation type, and SAN-zone scoping.
- parse_issuance_output round-trip against a real certbot success blob.
- The three missing-line error paths in parsing.
- assert_issuance_covers_wildcard filename and directory checks.
For the live issuance on a staging-connected host the operator runs the assembled command. On a healthy zone with the chmod 0600 INI from step 2 in place, certbot prints something close to:
certbot certonly \
--non-interactive --agree-tos \
--email ops@example.org \
--server https://acme-staging-v02.api.letsencrypt.org/directory \
--authenticator dns-cloudflare \
--dns-cloudflare-credentials /etc/letsencrypt/secrets/cloudflare.ini \
--dns-cloudflare-propagation-seconds 60 \
--preferred-challenges dns-01 \
-d example.org -d '*.example.org'
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Requesting a certificate for example.org and *.example.org
Waiting 60 seconds for DNS changes to propagate
Successfully received certificate.
Certificate is saved at: /etc/letsencrypt/live/example.org/fullchain.pem
Key is saved at: /etc/letsencrypt/live/example.org/privkey.pem
This certificate expires on 2026-09-01.
These files will be updated when the certificate renews.
NEXT STEPS:
- The certificate will need to be renewed before it expires.
Those three "saved at / expires on" lines are exactly the strings parse_issuance_output is shaped around, so a deploy script can pipe certbot's stdout straight into it without massaging the text.
What we built
The cfwild.issue module is the first piece of code in this codebase that talks the certbot dialect end to end. It accepts a typed CertbotIssueRequest, produces an argv that targets the DNS-01 challenge against the Let's Encrypt staging endpoint, and turns the resulting stdout back into a typed IssuanceResult with paths and an expiry.
Three design choices keep the module composable. It is a pure parser-and-builder — no subprocess calls live here, so unit tests run in microseconds and CI does not need network. It defaults to the staging ACME endpoint so a forgotten flag cannot accidentally burn a production rate limit. And it validates inputs aggressively up front, so by the time certbot actually runs every value in the argv has already survived the kind of "looks plausible, fails at runtime" mistake that drains hours from on-call.
The post-flight assert_issuance_covers_wildcard adds an independent sanity check. It does not trust certbot's exit code alone; it inspects the result paths to confirm the certificate landed in the lineage directory matching the apex and that the filenames are the canonical fullchain.pem and privkey.pem. A lineage suffix like example.org-0001 — usually a sign of a stale archived cert from a previous attempt — fails loudly here rather than silently shipping to the proxy.
What this unlocks: once a staging issuance succeeds end to end on a real host, swapping STAGING_ACME_SERVER for PRODUCTION_ACME_SERVER in the request gives us a live wildcard certificate, and the same IssuanceResult shape feeds the renewal-clock checks and deploy hooks that follow in later steps.
Repository
The state of the code after this step: bf25bd9
Step 4: Rendering an Nginx Reverse-Proxy Block with HSTS and a Forced HTTPS Redirect
Step 3 left us with a typed IssuanceResult pointing at fullchain.pem and privkey.pem under /etc/letsencrypt/live/<apex>/. That certificate is only useful once a real server hands it to clients during the TLS handshake. Step 4 builds the smallest piece of code that turns those two paths into a working nginx vhost — without ever calling nginx -s reload itself.
The new module, cfwild.nginx, is a pure renderer plus a pair of asserters. It accepts a typed NginxServerBlockSpec, returns a config string that contains both a port-80 redirector and a port-443 reverse-proxy block, and exposes assert_redirect_present and parse_server_block so the operator can verify a rendered or hand-edited file before pushing it to disk. Subprocess calls to nginx -t belong in a deploy hook layer that lands later — this module is shaped to be safely unit-testable in microseconds.
Setup
One new source file (src/cfwild/nginx.py) and one new test file (tests/test_nginx.py) join the package, and __init__.py re-exports the new public surface. No runtime dependencies are added — the module stays on re, dataclasses, and pathlib so the auditable surface remains the standard library.
codebase/
├── pyproject.toml
├── README.md
├── src/
│   └── cfwild/
│       ├── __init__.py          # re-exports nginx surface
│       ├── credentials.py
│       ├── install.py
│       ├── issue.py
│       ├── nginx.py             # NEW: server-block renderer + parser
│       └── token.py
└── tests/
    ├── __init__.py
    ├── test_credentials.py
    ├── test_install.py
    ├── test_issue.py
    ├── test_nginx.py            # NEW
    └── test_token.py
The runtime preconditions are exactly what step 3 produced: absolute paths to a fullchain.pem and privkey.pem that belong to a lineage covering the chosen apex. If those paths point anywhere else, the validators in this module raise before nginx ever sees the config.
Implementation
The public input is NginxServerBlockSpec — apex domain, primary hostname, certificate path pair, an upstream in host:port form, optional extra server names, and HSTS controls. The output of render_server_block is a string containing two stacked server {} directives: an HTTP-only redirector and the HTTPS terminator.
@dataclass(frozen=True)
class NginxServerBlockSpec:
    apex_domain: str
    hostname: str
    fullchain_path: Path
    privkey_path: Path
    upstream: str
    extra_server_names: Sequence[str] = field(default_factory=tuple)
    enable_hsts: bool = True
    hsts_max_age: int = DEFAULT_HSTS_MAX_AGE
DEFAULT_HSTS_MAX_AGE is set to two years (63072000 seconds), the value the public HSTS preload list recommends; its hard minimum for submission is one year (31536000 seconds). We default enable_hsts=True because a wildcard cert almost always wants the same policy applied to every subdomain, and we make the max-age overridable to a smaller window for first rollouts, where a misconfiguration would otherwise trap browsers for years.
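To make the override concrete, here is a small sketch of how the HSTS line could be rendered for a given max-age. The function name, its int-only signature, the four-space indent, and the use of nginx's always parameter are assumptions for illustration; the module's own _render_hsts_header takes the spec and is not reproduced here.

```python
DEFAULT_HSTS_MAX_AGE = 63072000  # two years, the default discussed above


def render_hsts_header(max_age: int = DEFAULT_HSTS_MAX_AGE) -> str:
    """Return one nginx add_header line for the HTTPS server block (sketch)."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(max_age, bool) or not isinstance(max_age, int) or max_age < 0:
        raise ValueError(f"hsts max_age must be a non-negative int, got {max_age!r}")
    return (
        "    add_header Strict-Transport-Security "
        f'"max-age={max_age}; includeSubDomains" always;\n'
    )


default_header = render_hsts_header()
rollout_header = render_hsts_header(300)  # five-minute window for a first rollout
```

A short window such as 300 seconds lets a first rollout fail safely: if the HTTPS side is misconfigured, browsers forget the policy within minutes instead of years.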
The render entry point is intentionally flat: validate the spec, join the server names once, then stitch the two blocks together.
def render_server_block(spec: NginxServerBlockSpec) -> str:
    _validate_spec(spec)
    names = _join_server_names(spec)
    http_block = _render_http_redirect(names)
    https_block = _render_https_block(spec, names)
    return f"{http_block}\n{https_block}"
_render_http_redirect is the four-line server block whose only job is the 301. It emits both listen 80; and listen [::]:80; so IPv6 clients are not silently stuck on HTTP, and it points return 301 https://$host$request_uri; so the redirect preserves the original host header even when the operator points multiple names at the same vhost.
def _render_http_redirect(server_names: str) -> str:
    return (
        "server {\n"
        f" listen {HTTP_PORT};\n"
        f" listen [::]:{HTTP_PORT};\n"
        f" server_name {server_names};\n"
        " return 301 https://$host$request_uri;\n"
        "}\n"
    )
The HTTPS block is where the wildcard cert finally gets used. ssl_certificate and ssl_certificate_key are filled with the absolute paths from IssuanceResult, ssl_protocols is pinned to TLSv1.2 TLSv1.3 to drop SSLv3 and the old TLS versions entirely, and ssl_prefer_server_ciphers off lets modern client negotiation pick the AEAD suite without us second-guessing it.
def _render_https_block(spec: NginxServerBlockSpec, server_names: str) -> str:
    hsts = _render_hsts_header(spec) if spec.enable_hsts else ""
    return (
        "server {\n"
        f" listen {HTTPS_PORT} ssl http2;\n"
        f" listen [::]:{HTTPS_PORT} ssl http2;\n"
        f" server_name {server_names};\n"
        "\n"
        f" ssl_certificate {spec.fullchain_path};\n"
        f" ssl_certificate_key {spec.privkey_path};\n"
        f" ssl_protocols {' '.join(TLS_PROTOCOLS)};\n"
        f" ssl_ciphers {TLS_CIPHERS};\n"
        " ssl_prefer_server_ciphers off;\n"
        " ssl_session_timeout 1d;\n"
        " ssl_session_cache shared:cfwild:10m;\n"
        f"{hsts}"
        "\n"
        " location / {\n"
        f" proxy_pass http://{spec.upstream};\n"
        " proxy_set_header Host $host;\n"
        " proxy_set_header X-Real-IP $remote_addr;\n"
        " proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;\n"
        " proxy_set_header X-Forwarded-Proto https;\n"
        f" proxy_connect_timeout {DEFAULT_PROXY_CONNECT_TIMEOUT};\n"
        f" proxy_read_timeout {DEFAULT_PROXY_READ_TIMEOUT};\n"
        " }\n"
        "}\n"
    )
The location / block forwards everything to the configured upstream. The X-Forwarded-Proto https header is the one most upstreams care about: it lets the application behind the proxy distinguish the plain-HTTP hop it actually receives (http://upstream:8080/) from the original client scheme, so it can generate correct absolute URLs in redirects.
The validators are where the module earns its keep. The most important is assert_hostname_within_apex, which prevents an extra server name from leaking outside the wildcard's coverage:
def assert_hostname_within_apex(apex: str, hostname: str) -> None:
    _validate_apex_domain(apex)
    _validate_dns_name(hostname, label="hostname")
    if hostname == apex:
        return
    if hostname.endswith("." + apex):
        return
    raise NginxConfigError(
        f"hostname {hostname!r} is not covered by wildcard for apex {apex!r}"
    )
The two early returns handle the legitimate cases — the apex itself, and any subdomain at any depth. The "." + apex suffix check is deliberate: a naive endswith(apex) would accept notexample.test for apex example.test, which is exactly the substring trick the test suite pins down.
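The difference between the two suffix checks is easiest to see side by side. The two helper names below are hypothetical and exist only for this comparison; dotted_within mirrors the two early returns above, and naive_within is deliberately wrong.

```python
def naive_within(apex: str, hostname: str) -> bool:
    # Buggy on purpose: a bare suffix match ignores DNS label boundaries.
    return hostname.endswith(apex)


def dotted_within(apex: str, hostname: str) -> bool:
    # The apex itself, or any name that ends in ".<apex>".
    return hostname == apex or hostname.endswith("." + apex)


apex = "example.test"
# Maps each hostname to (naive result, dotted result).
results = {
    name: (naive_within(apex, name), dotted_within(apex, name))
    for name in ("example.test", "a.b.example.test", "notexample.test")
}
```

Only the last hostname splits the two checks: the naive version accepts notexample.test, while the dotted version rejects it because the character before the apex is not a label separator.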
_validate_upstream rejects anything that is not host:port, blocks shell-metachar injection (;, newline, braces, spaces), and refuses ports outside 1..65535. The point is to make sure the rendered config never contains a directive that nginx will reject at nginx -t time, or worse, that lets an attacker-controlled upstream value smuggle a second directive into the block.
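A validator with those properties can be sketched in a few lines. This is an illustrative version under stated assumptions, not the module's code: the NginxConfigError stand-in, the function name, the error messages, and the decision to skip IPv6 literals are all choices made for this sketch.

```python
import re


class NginxConfigError(ValueError):
    """Stand-in for the module's error type (assumed to be an Exception subclass)."""


# Hostname or IPv4 characters, a colon, then one to five digits.
# An allow-list like this excludes ';', braces, spaces, and newlines by construction.
_UPSTREAM = re.compile(r"([A-Za-z0-9.-]+):([0-9]{1,5})")


def validate_upstream(upstream: object) -> None:
    if not isinstance(upstream, str):
        raise NginxConfigError(f"upstream must be a string, got {upstream!r}")
    # fullmatch (rather than match with '$') also rejects a trailing newline.
    match = _UPSTREAM.fullmatch(upstream)
    if match is None:
        raise NginxConfigError(f"upstream must be host:port, got {upstream!r}")
    port = int(match.group(2))
    if not 1 <= port <= 65535:
        raise NginxConfigError(f"upstream port {port} is outside 1..65535")


def is_rejected(value: object) -> bool:
    """Small helper for the examples: True when validation raises."""
    try:
        validate_upstream(value)
    except NginxConfigError:
        return True
    return False
```

Checking against an allow-list of characters, instead of hunting for known-bad ones, means a new injection trick has to fit inside a hostname and a port number before it can reach the rendered config.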
Finally, assert_redirect_present is the parser used to confirm an existing config (rendered or hand-edited) still has the HTTP→HTTPS hop intact. It is a one-pass line scan against a regex anchored on return 30[12] https://, so a return 301 /healthz or a 200-only HTTPS block both fail the check loudly rather than silently shipping.
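As a rough sketch of that line scan, assuming the regex is anchored at the start of the directive (which also skips commented-out lines), the check could look like the following. The boolean helper name is hypothetical; the module's assert_redirect_present raises instead of returning False.

```python
import re

# "return", a 301 or 302 status, then an absolute https:// target.
_REDIRECT_LINE = re.compile(r"^\s*return\s+30[12]\s+https://")


def has_https_redirect(config_text: str) -> bool:
    """Return True if any line is a 301/302 redirect to an https:// URL."""
    return any(_REDIRECT_LINE.search(line) for line in config_text.splitlines())


rendered = (
    "server {\n"
    "    listen 80;\n"
    "    server_name app.example.org;\n"
    "    return 301 https://$host$request_uri;\n"
    "}\n"
)
internal_only = "server {\n    listen 80;\n    return 301 /healthz;\n}\n"
commented_out = "server {\n    # return 301 https://$host$request_uri;\n}\n"
```

The internal redirect and the commented-out directive both fail the check, which is the behaviour the article relies on to catch a hand-edited vhost that quietly lost its HTTPS hop.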
Verification
Run the full test suite from the codebase/ directory:
pytest
The expected output:
============================= test session starts ==============================
platform darwin -- Python 3.12.5, pytest-9.0.3, pluggy-1.6.0
rootdir: .../certbot-dns-01-cloudflare-wildcard/codebase
configfile: pyproject.toml
testpaths: tests
collected 121 items
tests/test_credentials.py .................. [ 14%]
tests/test_install.py .................. [ 29%]
tests/test_issue.py .................................. [ 57%]
tests/test_nginx.py ................................. [ 85%]
tests/test_token.py .................. [100%]
============================= 121 passed in 0.20s ==============================
A hundred and twenty-one tests pass — the eighty-eight from steps 1 through 3 plus thirty-three new ones for the nginx module. The new file covers six axes:
- Render output assertions for the redirect, cert paths, modern TLS, HSTS on/off/custom, upstream proxy headers, and IPv4+IPv6 listeners.
- Hostname-within-apex acceptance and rejection, including the notexample.test substring trick.
- Spec validation for relative paths, missing ports, injection characters, out-of-range ports, wildcard literals, and negative HSTS values.
- assert_redirect_present against rendered output, a 302 variant, an HTTPS-only config, and an internal-redirect false positive.
- The parse_server_block round trip, including the missing-cert path.
For the operator who wants to wire this into a real host, the rendered string drops straight into /etc/nginx/conf.d/<hostname>.conf and the standard reload dance applies:
sudo nginx -t && sudo systemctl reload nginx
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
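Once the reload completes, a curl -I http://app.example.org/ returns HTTP/1.1 301 Moved Permanently with a Location: https://app.example.org/ header. A follow-up curl -I https://app.example.org/ completes the TLS handshake without a certificate error, which confirms a valid chain signed by Let's Encrypt (this assumes the production certificate is installed, since curl does not trust staging certificates by default), and the response includes the Strict-Transport-Security: max-age=63072000; includeSubDomains header.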
What we built
The cfwild.nginx module is now the bridge between the certbot lineage and a public-facing reverse proxy. It accepts a typed spec, validates every input that could plausibly break nginx or leak shell metachars, and renders a two-block config that pairs an HTTPS terminator with a port-80 redirector.
Two design choices keep this module composable. It is a pure renderer — no subprocess.run("nginx", ...) calls live here, so the same code drives unit tests, dry-run previews, and the eventual deploy hook. And the spec validators run before any string interpolation, so a value that would corrupt the config (an injection-chars upstream, a wildcard literal hostname, a hostname outside the apex zone) raises a precise NginxConfigError instead of producing a config that nginx silently rejects later.
The companion parse_server_block and assert_redirect_present helpers give the operator a way to verify a config they did not generate from scratch. A hand-edited vhost that drops the redirect, a teammate's PR that swaps return 301 https://... for an internal redirect, or a stale block that points at a defunct cert all surface as exceptions rather than as silent regressions.
What this unlocks for the next step: with a working vhost shape that consumes IssuanceResult.fullchain_path and IssuanceResult.privkey_path directly, the renewal flow only needs to add a deploy hook that calls nginx -s reload after certbot renew succeeds — the rendered config does not change between renewals, only the bytes behind the cert paths do.
Repository
The state of the code after this step: 1f048f8
Step 6: Catching Renewal Failures Early with a Dry-Run Diagnostics Module
Step 5 left us with a systemd timer that will quietly fire certbot renew twice a day and a deploy hook that reloads nginx when a new chain lands on disk. That pipeline is only as trustworthy as our ability to know — before it fires unattended — that the credentials still work, that the Cloudflare token still has the right scope, and that propagation still completes inside the window we configured. Step 6 adds the safety net.
The new module, cfwild.troubleshoot, runs the certbot renew --dry-run path against the ACME staging endpoint and turns its noisy output into a typed DryRunResult with one or more DryRunDiagnosis records. Each diagnosis carries a stable code, a one-line summary, and an exact remediation. The point is that an operator who runs the new dry-run once a week gets either a clean exit or a specific, actionable failure — never a raw certbot traceback that has to be re-parsed by hand.
Setup
One new source file (src/cfwild/troubleshoot.py) and one new test file (tests/test_troubleshoot.py) join the package, and __init__.py grows to re-export the new public surface. No new runtime dependencies — the module stays on re, dataclasses, and pathlib so the auditable surface remains the standard library and the unit tests run without spawning a single subprocess.
codebase/
├── pyproject.toml
├── README.md
├── src/
│   └── cfwild/
│       ├── __init__.py          # re-exports troubleshoot surface
│       ├── credentials.py
│       ├── install.py
│       ├── issue.py
│       ├── nginx.py
│       ├── renew.py
│       ├── token.py
│       └── troubleshoot.py      # NEW: dry-run runner + diagnosis classifier
└── tests/
    ├── __init__.py
    ├── test_credentials.py
    ├── test_install.py
    ├── test_issue.py
    ├── test_nginx.py
    ├── test_renew.py
    ├── test_token.py
    └── test_troubleshoot.py     # NEW
The runtime preconditions are exactly what steps 1 through 3 produced: a chmod 0600 credentials INI at /etc/letsencrypt/secrets/cloudflare.ini, a lineage already issued at least once so certbot has something to renew, and a certbot binary at /usr/bin/certbot. If those preconditions slip, the validators in this module fail loudly before certbot is ever invoked.
Implementation
The public input is DryRunRequest — a lineage name plus three absolute paths the caller can override for testing or for non-standard installs.
@dataclass(frozen=True)
class DryRunRequest:
    lineage_name: str
    certbot_path: Path = DEFAULT_CERTBOT_PATH
    config_dir: Path | None = None
    logs_dir: Path | None = None
build_dry_run_command turns that record into the exact argv we will hand to the runner. It hard-codes renew --cert-name <lineage> --dry-run --non-interactive because every other flag is a footgun in an unattended context. The optional --config-dir and --logs-dir flags are only added when the caller supplies them, and each value must be an absolute path, so the typical operator call collapses to the renew subcommand plus three flags.
def build_dry_run_command(request: DryRunRequest) -> list[str]:
    _validate_lineage_name(request.lineage_name)
    _validate_absolute_path_value(request.certbot_path, "certbot_path")
    if request.config_dir is not None:
        _validate_absolute_path_value(request.config_dir, "config_dir")
    if request.logs_dir is not None:
        _validate_absolute_path_value(request.logs_dir, "logs_dir")
    command = [
        str(request.certbot_path),
        "renew",
        "--cert-name",
        request.lineage_name,
        "--dry-run",
        "--non-interactive",
    ]
    if request.config_dir is not None:
        command.extend(["--config-dir", str(request.config_dir)])
    if request.logs_dir is not None:
        command.extend(["--logs-dir", str(request.logs_dir)])
    return command
The lineage validator pins the name to ^[A-Za-z0-9_.-]+$. That regex is intentionally narrower than what certbot itself accepts — it blocks newlines, semicolons, backticks, and spaces, which is exactly the surface someone would use to smuggle a second shell directive into a value that an unwary operator copy-pastes from a CMDB. The path validators add the same shape check plus an absolute-path requirement so a relative certbot on PATH can never silently shadow /usr/bin/certbot and run an attacker-controlled binary.
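One Python detail matters when a pattern like this is meant to block newlines. The sketch below uses a stand-in DryRunError and a hypothetical function name, and it assumes nothing about whether the real validator uses match, fullmatch, or \Z. It shows that '$' combined with match still lets a single trailing newline through, while fullmatch does not.

```python
import re


class DryRunError(ValueError):
    """Stand-in for the module's error type (assumed to be an Exception subclass)."""


_LINEAGE_NAME = re.compile(r"[A-Za-z0-9_.-]+")


def validate_lineage_name(name: object) -> None:
    # fullmatch requires the entire string to consist of allowed characters.
    if not isinstance(name, str) or _LINEAGE_NAME.fullmatch(name) is None:
        raise DryRunError(f"invalid lineage name: {name!r}")


def is_rejected(name: object) -> bool:
    """Small helper for the examples: True when validation raises."""
    try:
        validate_lineage_name(name)
    except DryRunError:
        return True
    return False


# For comparison: '$' also matches just before a trailing newline.
_LOOSE = re.compile(r"^[A-Za-z0-9_.-]+$")
loose_accepts_newline = _LOOSE.match("app.example.org\n") is not None
```

If the anchored form is preferred for readability, replacing '$' with \Z gives the same strictness as fullmatch.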
The classifier is the second half of the module. classify_dry_run_failure takes the captured stdout and stderr and runs four families of compiled regexes against the concatenated text. Each family that matches contributes one DryRunDiagnosis with a stable code and a remediation string.
def classify_dry_run_failure(stdout: str, stderr: str) -> tuple[DryRunDiagnosis, ...]:
    _validate_text(stdout, label="stdout")
    _validate_text(stderr, label="stderr")
    combined = f"{stdout}\n{stderr}"
    findings: list[DryRunDiagnosis] = []
    if _any_match(_TOKEN_PATTERNS, combined):
        findings.append(_token_diagnosis())
    if _any_match(_MISSING_TXT_PATTERNS, combined):
        findings.append(_missing_txt_diagnosis())
    if _any_match(_PROPAGATION_PATTERNS, combined):
        findings.append(_propagation_diagnosis())
    if _any_match(_RATE_LIMIT_PATTERNS, combined):
        findings.append(_rate_limit_diagnosis())
    return tuple(findings)
The four families map to the four DNS-01 failure modes that account for almost every renewal outage we have ever seen in the wild. NXDOMAIN looking up TXT, No TXT record found, and Incorrect TXT record all roll up to missing-txt-record — the validator could not find the challenge value, usually because the Cloudflare API call to create it never reached the authoritative nameserver. SERVFAIL, hasn't propagated, and self-check failed roll up to propagation-timeout — the record exists but the recursive resolver did not see it inside the configured window. Invalid API Token, HTTP 401, HTTP 403, and Unable to determine zone_id roll up to token-rejected. rateLimited and too many certificates roll up to rate-limited.
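The pattern tables themselves are not shown in this section, so the following is a sketch of how the phrases listed above could be grouped and matched. Treating each phrase as a case-insensitive literal, the single dictionary layout, and the matched_codes helper are assumptions for illustration; the module keeps four separate tuples and returns DryRunDiagnosis records rather than bare codes.

```python
import re

# Phrases quoted in the article, keyed by diagnosis code, in classifier order.
_FAMILIES: dict[str, tuple[str, ...]] = {
    "token-rejected": (
        "Invalid API Token", "HTTP 401", "HTTP 403", "Unable to determine zone_id",
    ),
    "missing-txt-record": (
        "NXDOMAIN looking up TXT", "No TXT record found", "Incorrect TXT record",
    ),
    "propagation-timeout": ("SERVFAIL", "hasn't propagated", "self-check failed"),
    "rate-limited": ("rateLimited", "too many certificates"),
}

# Assumption: literal, case-insensitive matching is enough for these phrases.
_COMPILED = {
    code: tuple(re.compile(re.escape(phrase), re.IGNORECASE) for phrase in phrases)
    for code, phrases in _FAMILIES.items()
}


def _any_match(patterns: tuple[re.Pattern[str], ...], text: str) -> bool:
    return any(pattern.search(text) for pattern in patterns)


def matched_codes(stdout: str, stderr: str) -> list[str]:
    """Return every diagnosis code whose family matches the combined output."""
    combined = f"{stdout}\n{stderr}"
    return [code for code, patterns in _COMPILED.items() if _any_match(patterns, combined)]
```

Because each family is tested independently, output that mentions both a rejected token and a missing TXT record yields two codes, in the same order the classifier appends them.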
Because the families are independent, a single failure can match more than one — for example, a token that was rotated mid-run can produce both a token-rejected finding and a missing-txt-record finding, and the test suite pins this multi-diagnosis behaviour so a future refactor cannot quietly drop one of them. When the exit code is non-zero but nothing matches, summarize_dry_run returns a single unknown-failure diagnosis pointing the operator at the full certbot log instead of a silent empty list.
def summarize_dry_run(
    exit_code: int,
    stdout: str,
    stderr: str,
    lineage_name: str,
) -> DryRunResult:
    _validate_exit_code(exit_code)
    _validate_lineage_name(lineage_name)
    if exit_code == DRY_RUN_EXIT_OK:
        return DryRunResult(
            succeeded=True,
            exit_code=exit_code,
            lineage_name=lineage_name,
        )
    diagnoses = classify_dry_run_failure(stdout, stderr)
    if not diagnoses:
        diagnoses = (_unknown_diagnosis(),)
    return DryRunResult(
        succeeded=False,
        exit_code=exit_code,
        lineage_name=lineage_name,
        diagnoses=diagnoses,
    )
format_diagnosis_report is the human-facing wrapper. On success it prints a single line. On failure it prints the exit code, the lineage name, and one - [code] summary / fix: remediation pair per diagnosis. Every remediation references a concrete next action — reissue a Zone.DNS:Edit token, raise --dns-cloudflare-propagation-seconds, or stay on staging until the rate-limit window clears — never a generic "check your config".
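A formatter with that shape might look like the sketch below. The DryRunResult fields follow the constructor calls shown earlier; the DryRunDiagnosis field names (code, summary, remediation), the empty-tuple default, and the two-space indent before "fix:" are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DryRunDiagnosis:
    code: str
    summary: str
    remediation: str


@dataclass(frozen=True)
class DryRunResult:
    succeeded: bool
    exit_code: int
    lineage_name: str
    diagnoses: tuple[DryRunDiagnosis, ...] = ()


def format_diagnosis_report(result: DryRunResult) -> str:
    """Render a one-line success message or a per-diagnosis failure report."""
    if result.succeeded:
        return f"Dry-run for {result.lineage_name} succeeded."
    lines = [
        f"Dry-run for {result.lineage_name} FAILED with exit code {result.exit_code}."
    ]
    for diagnosis in result.diagnoses:
        lines.append(f"- [{diagnosis.code}] {diagnosis.summary}")
        lines.append(f"  fix: {diagnosis.remediation}")
    return "\n".join(lines)


failed = DryRunResult(
    succeeded=False,
    exit_code=1,
    lineage_name="app.example.org",
    diagnoses=(
        DryRunDiagnosis(
            code="token-rejected",
            summary="Cloudflare API rejected the token used by certbot",
            remediation="Reissue a token scoped to Zone.DNS:Edit + Zone.Zone:Read.",
        ),
    ),
)
report = format_diagnosis_report(failed)
```

Keeping the report a plain string means the same function can feed a terminal, a notification body, or a log line without any extra formatting layer.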
Verification
Run the full test suite from the codebase/ directory:
pytest
The expected output:
============================= test session starts ==============================
platform darwin -- Python 3.12.5, pytest-9.0.3, pluggy-1.6.0
rootdir: .../certbot-dns-01-cloudflare-wildcard/codebase
configfile: pyproject.toml
testpaths: tests
collected 219 items
tests/test_credentials.py .................. [ 8%]
tests/test_install.py .................. [ 16%]
tests/test_issue.py .................................. [ 31%]
tests/test_nginx.py ................................. [ 47%]
tests/test_renew.py .................................................... [ 70%]
....... [ 73%]
tests/test_token.py .................. [ 82%]
tests/test_troubleshoot.py ....................................... [100%]
============================= 219 passed in 0.28s ==============================
Two hundred and nineteen tests pass — the one hundred and eighty from steps 1 through 5, plus thirty-nine new ones for the troubleshoot module. The new file covers six axes: argv construction with and without optional --config-dir / --logs-dir, every classifier family (missing TXT, propagation timeout, token rejection, rate limit) against representative real-world certbot strings, multi-diagnosis output when more than one family matches, the unknown-failure fallback, validator rejection for empty lineage names and shell-metachar injection, and the public-surface re-exports on the cfwild package.
For an operator who wants to wire this into an actual host, the dry-run is one line:
cfwild-dry-run --lineage app.example.org
Dry-run for app.example.org succeeded.
If the dry-run fails because the Cloudflare token was rotated without updating the credentials INI, the same command prints:
Dry-run for app.example.org FAILED with exit code 1.
- [token-rejected] Cloudflare API rejected the token used by certbot
fix: Reissue a token scoped to Zone.DNS:Edit + Zone.Zone:Read on the target zones, then rewrite the credentials .ini file.
What we built
The cfwild.troubleshoot module is now the pre-flight check that stands between an operator pushing a config change and the unattended systemd timer discovering the regression at 3am. It accepts a typed DryRunRequest, validates every input that could plausibly smuggle a shell directive past subprocess, and emits a typed DryRunResult that downstream tooling can branch on without re-parsing strings.
The classifier is intentionally narrow. Four families of regexes, each backed by real-world certbot output captured during outages, each producing a stable diagnosis code with a remediation that names the exact next command. When the patterns do not match, the result falls back to a single unknown-failure finding that points at /var/log/letsencrypt/letsencrypt.log — never an empty result that hides the failure mode behind a successful classify call.
The module stays a pure parser and argv builder. No subprocess.run("certbot", ...) lives here, so the same code drives unit tests, an interactive CLI, the systemd OnFailure= hook from step 5, and a future Prometheus exporter. The runner that actually spawns certbot belongs in the orchestration layer above; this module is shaped to be the source of truth for what counts as a renewal failure and what to do about it.
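For readers wondering what the orchestration-layer runner might look like, here is a minimal sketch. The function name and the timeout default are hypothetical, and the example spawns the current Python interpreter as a harmless stand-in for certbot; in a real setup the argv would come from build_dry_run_command and the returned triple would be passed to summarize_dry_run.

```python
import subprocess
import sys
from collections.abc import Sequence


def run_captured(argv: Sequence[str], timeout_seconds: float = 900.0) -> tuple[int, str, str]:
    """Run argv without a shell and return (exit_code, stdout, stderr)."""
    completed = subprocess.run(
        list(argv),
        capture_output=True,  # collect stdout and stderr instead of inheriting them
        text=True,            # decode output to str
        check=False,          # a non-zero exit is data for the classifier, not an exception
        timeout=timeout_seconds,
    )
    return completed.returncode, completed.stdout, completed.stderr


# Stand-in child process: prints one line and exits with status 3.
exit_code, stdout, stderr = run_captured(
    [sys.executable, "-c", "import sys; print('ok'); sys.exit(3)"]
)
```

Passing the argv as a list with no shell involved is what makes the validators upstream sufficient: there is no shell to interpret a stray semicolon or backtick.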
What this unlocks for the next step: with a typed diagnosis surface in place, the next iteration of the systemd unit can pipe its captured output straight into summarize_dry_run, attach the formatted report to the OnFailure= notification, and let the operator answer "what broke and what do I do" from a single notification body instead of an after-the-fact log dive.
Repository
The state of the code after this step: 3a89e86
Repository
Full source at https://github.com/vytharion/certbot-dns-01-cloudflare-wildcard.
Walk the lessons by stepping through the git commits in the repo — each major step has its own commit you can git checkout and rerun.