Certbot Auto-Renewal Failure Modes: The 6 Ways It Silently Breaks (and How to Catch Them Early)
Renewal is the part indie operators forget about until a visitor screenshots a red padlock and DMs them about it. The first cert install gets the attention. The second one, three months later, runs at 03:17 from a systemd timer nobody has looked at since the install script finished. When that timer fails silently, you have between zero and 89 days to notice before browsers start refusing the connection.
This article walks through the six ways certbot auto-renewal silently fails on a single-server polyglot stack, in rough order of how often each one shows up in the wild. Every section ends with a detection recipe you can run today, before the cert is anywhere near expiry. The closing section pulls those detections together into a 10-line monitoring script that runs once a day and pings you on Telegram or email if any of the six failure modes is present.
Why "auto-renewal works" is a dangerous belief
The cron entry or systemd timer that ships with certbot does one thing: it calls certbot renew twice a day. That call is non-interactive. It succeeds quietly. It also fails quietly. There is no built-in alerting. If you do not actively watch the timer's exit status, you have no idea whether last night's run renewed five certs, renewed zero certs, or printed a stack trace into a log nobody reads.
A renewal failure at day 89 is too late. A renewal failure at day 60 gives you a month to react. The discipline this article pushes is: assume the renewer is broken, and prove it works at least once a week.
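One way to start proving it is to ask systemd what the renewer's last run actually did. The sketch below is a minimal example, assuming a systemd host; the unit names (certbot.service for apt installs, snap.certbot.renew.service for snap installs) are assumptions to adjust for your system.

```shell
#!/bin/bash
# Sketch: turn the renewer's last recorded result into a one-word verdict.
# Unit names are assumptions: certbot.service (apt),
# snap.certbot.renew.service (snap).

# Map systemd's Result property to ok / failed / unknown.
renewer_verdict() {
  case "$1" in
    success) echo "ok" ;;
    "")      echo "unknown" ;;  # systemctl missing or returned nothing
    *)       echo "failed" ;;   # exit-code, timeout, signal, ...
  esac
}

# Query systemd for the last result of a unit and print the verdict.
last_renewer_run() {
  local unit="${1:-certbot.service}" result
  result=$(systemctl show "$unit" --property=Result --value 2>/dev/null || true)
  echo "$unit: $(renewer_verdict "$result")"
}
```

Call last_renewer_run with no argument for the apt unit, or pass the snap unit name. A unit that has never run also reports success, so treat this as a complement to checking that the timer is scheduled, not a replacement for it.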
Mode 1: the systemd timer masked by a package upgrade
The most common silent failure is the timer never firing at all. Distros update certbot through apt or snap, and on a few transitional releases the post-install scripts have been observed to mask certbot.timer (set its enabled state to masked) rather than re-enable it. The unit file still exists. The binary still works when called by hand. The timer just never wakes up.
Detect it directly:
systemctl list-timers certbot.timer
systemctl is-enabled certbot.timer
systemctl status certbot.timer | head -n 5
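If list-timers shows no row for certbot, the timer is not scheduled. If is-enabled returns masked, unmask it with systemctl unmask certbot.timer && systemctl enable --now certbot.timer. On a healthy host the NEXT column of list-timers shows a time within the next 12 hours. Snap installs name the unit snap.certbot.renew.timer instead, so run the same three commands against that name if certbot came from snap.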
Recovery is a 30-second fix. The damage is from not noticing for 89 days.
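The three commands above can be folded into one check that covers both packaging variants. This is a sketch, not a definitive implementation: the two timer names are the ones commonly used by the apt and snap packages, and the verdict labels are made up for illustration.

```shell
#!/bin/bash
# Sketch: classify the enabled-state of the known certbot timers.
# Timer names are assumptions: certbot.timer (apt),
# snap.certbot.renew.timer (snap).

# Map `systemctl is-enabled` output to a verdict.
timer_verdict() {
  case "$1" in
    enabled|enabled-runtime) echo "ok" ;;
    masked|masked-runtime)   echo "masked" ;;
    *)                       echo "not-scheduled" ;;  # disabled, not-found, empty
  esac
}

# Print one line per candidate timer.
check_certbot_timers() {
  local t state
  for t in certbot.timer snap.certbot.renew.timer; do
    state=$(systemctl is-enabled "$t" 2>/dev/null || true)
    echo "$t: $(timer_verdict "$state")"
  done
}
```

On a healthy host exactly one of the two lines printed by check_certbot_timers should end in ok. Two ok lines suggest both packages are installed, which is its own problem.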
Mode 2: an nginx location block shadows the ACME challenge path
Most indie operators run nginx in front of their app and rely on the HTTP-01 challenge: certbot writes a token to /var/www/letsencrypt/.well-known/acme-challenge/<token> and Let's Encrypt fetches it over plain HTTP on port 80. The path served at /.well-known/acme-challenge/ MUST map to exactly that directory, and the request that starts on port 80 MUST end at the token file. Let's Encrypt follows redirects, including a redirect to HTTPS, so a redirect is only safe if its target serves the same directory.
The failure mode: months after install, you add a new server block or a new location block (often a regex location or a catch-all 301 redirect to HTTPS) that wins the match for /.well-known/acme-challenge/ over the certbot snippet. The new block returns a 404, a proxy error, or a redirect to a vhost that hands the request to your app. Let's Encrypt never receives the token and fails the challenge.
Detect it with a deliberate request that mimics what Let's Encrypt sees:
curl -sSI -o /dev/null -w "%{http_code}\n" \
http://your-site.example.com/.well-known/acme-challenge/probe-$(date +%s)
A healthy stack returns 404 from certbot's webroot (the token doesn't exist, so 404 is correct). A broken stack returns 403, 502, or 000 (curl's code for an empty reply, which is what nginx's return 444 produces). Any of those means the next renewal will fail. A 301 is a warning rather than a verdict: rerun the command with -L and confirm the redirect target also ends in the webroot's 404. Compare the raw response against the output of nginx -T | grep -A 3 'acme-challenge' to see which location block is winning the match.
Two structural fixes outperform clever regex: serve the challenge from a dedicated server block that listens only on port 80, OR put a location ^~ /.well-known/acme-challenge/ block inside every HTTP server block. The ^~ modifier tells nginx to skip regex locations once this prefix matches, so regex locations added later cannot shadow it.
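A 404 from the webroot and a 404 from your app look identical to curl, so a stricter probe writes a real file into the challenge directory and fetches it back. The sketch below assumes the webroot path used in this article and root access to write there; the function names are hypothetical.

```shell
#!/bin/bash
# Sketch: end-to-end HTTP-01 probe using a real file in the webroot.
# The default webroot is the one used in this article; adjust to your config.

# Compare the fetched body with the token we wrote.
acme_probe_verdict() {
  local expected="$1" fetched="$2"
  if [ "$fetched" = "$expected" ]; then echo "served"; else echo "shadowed"; fi
}

# Write a throwaway token, fetch it over HTTP, clean up, print the verdict.
acme_probe() {
  local domain="$1" webroot="${2:-/var/www/letsencrypt}"
  local dir="$webroot/.well-known/acme-challenge"
  local name="probe-$(date +%s)" token="probe-$RANDOM-$RANDOM" body
  mkdir -p "$dir" || return 2
  printf '%s' "$token" > "$dir/$name" || return 2
  # -L follows redirects and -k skips certificate checks on HTTPS hops,
  # which approximates how the Let's Encrypt validator behaves.
  body=$(curl -sSLk --max-time 10 \
    "http://$domain/.well-known/acme-challenge/$name" 2>/dev/null || true)
  rm -f -- "$dir/$name"
  acme_probe_verdict "$token" "$body"
}
```

Run acme_probe your-site.example.com as root. Anything other than served means some other block is answering for the challenge path.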
Mode 3: the DNS-01 hook that exits 0 without propagating
Wildcard certificates require DNS-01 instead of HTTP-01. A custom --manual-auth-hook script publishes a _acme-challenge.example.com TXT record, waits for propagation, then exits. The script you wrote during install exits 0 the moment your API call returns 200. The DNS provider takes 30 to 240 seconds to actually propagate that record to all of its authoritative nameservers, which are what Let's Encrypt's validator queries. Sometimes the record never appears at all because the API silently rate-limited you.
A hook script that exits 0 too early causes the validator to query DNS while the record is still missing. The validation fails, and certbot gives up on that certificate until the next scheduled run. Certbot does exit non-zero when a renewal fails, but an unattended timer or cron job just records that status in a log nobody reads.
Detect it by polling DNS from your auth hook BEFORE returning. The version below polls public resolvers as a proxy for what the validator will see:
#!/bin/bash
# DNS-01 auth hook with a propagation check.
# certbot exports CERTBOT_DOMAIN and CERTBOT_VALIDATION to the hook.
expected="$CERTBOT_VALIDATION"
domain="_acme-challenge.${CERTBOT_DOMAIN}"

# your_provider_publish is a placeholder for your DNS provider's API call.
your_provider_publish "$domain" "$expected" || exit 1

# Poll until 4 of the 6 public resolvers return the expected value.
resolvers="1.1.1.1 8.8.8.8 9.9.9.9 208.67.222.222 8.8.4.4 1.0.0.1"
for attempt in $(seq 1 60); do
  hits=0
  for r in $resolvers; do
    # -F matches literally; -- protects values that start with a dash.
    # +time/+tries keep one dead resolver from stalling the whole loop.
    if dig +short +time=2 +tries=1 @"$r" TXT "$domain" | grep -qF -- "$expected"; then
      hits=$((hits + 1))
    fi
  done
  [ "$hits" -ge 4 ] && exit 0
  sleep 5
done
echo "DNS-01 propagation timeout after 5 minutes" >&2
exit 1
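Sixty attempts at 5 seconds each is roughly a 5-minute ceiling. In practice propagation completes inside 90 seconds for most providers, but the 5-minute ceiling protects against silent API failures. The choice between 4 of 6 resolvers and 6 of 6 is pragmatic: requiring all six means one slow or unreachable resolver fails the renewal, which raises your false-failure rate without a meaningful improvement in correctness. Note that the six addresses belong to only four operators (Cloudflare and Google each appear twice), so 4 of 6 still requires agreement from at least two independent operators.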
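Let's Encrypt's validator resolves the record from the zone's authoritative nameservers rather than from public resolvers, so a complementary check is to ask each authoritative server directly. The following is a sketch under stated assumptions: the zone argument (for example example.com) is a hypothetical parameter you supply yourself, and the function names are illustrative.

```shell
#!/bin/bash
# Sketch: confirm every authoritative nameserver serves the challenge value.
# The zone argument (e.g. example.com) is a hypothetical parameter.

# Succeed if a `dig +short TXT` answer contains the expected value.
txt_answer_has() {
  printf '%s\n' "$1" | grep -qF -- "$2"
}

# Print "propagated" only when all nameservers of the zone agree.
authoritative_check() {
  local zone="$1" name="$2" expected="$3" ns answer total=0
  for ns in $(dig +short NS "$zone"); do
    total=$((total + 1))
    answer=$(dig +short +time=2 +tries=1 @"$ns" TXT "$name" || true)
    if ! txt_answer_has "$answer" "$expected"; then
      echo "missing on $ns"
      return 1
    fi
  done
  if [ "$total" -eq 0 ]; then
    echo "no nameservers found for $zone"
    return 1
  fi
  echo "propagated"
}
```

Inside an auth hook this would be called as authoritative_check example.com "_acme-challenge.$CERTBOT_DOMAIN" "$CERTBOT_VALIDATION" in a retry loop. Requiring every authoritative server is stricter than the 4-of-6 quorum, which is reasonable here because a validator may land on any one of them.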
Mode 4: snap-vs-apt binary path drift after a distro upgrade
The official certbot install guide on https://eff-certbot.readthedocs.io/ recommends the snap distribution because it auto-updates. Many older installs predate that guidance and use the apt package. After a distro upgrade (Ubuntu 22.04 to 24.04, for example), both binaries can end up installed simultaneously. The cron job calls /usr/bin/certbot (apt). The snap version at /snap/bin/certbot holds the renewal config. Renewals fail because the apt binary cannot find the snap binary's account credentials at /var/snap/certbot/....
Detect it:
which -a certbot
ls -la /etc/letsencrypt/renewal/
sudo /usr/bin/certbot renew --dry-run 2>&1 | grep -i "error"
sudo /snap/bin/certbot renew --dry-run 2>&1 | grep -i "error"
If both binaries exist, pick one and remove the other. Snap is the upstream-recommended path for Ubuntu and Debian today, so the cleaner fix on those distros is apt remove certbot and reinstall via snap. Update your cron entry or systemd timer to call the surviving binary explicitly with its full path, not the bare command, because a future PATH change can flip which binary wins.
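Counting the entries printed by which -a can overstate the problem, because the same file may appear under several PATH directories (for example when /bin is a symlink to /usr/bin). A small sketch that resolves symlinks before counting; the function name is hypothetical and readlink -f assumes GNU coreutils.

```shell
#!/bin/bash
# Sketch: count distinct executables behind the paths given on stdin.

count_distinct_binaries() {
  local p
  while IFS= read -r p; do
    # Resolve symlinks so aliases of one file are counted once.
    if [ -n "$p" ]; then readlink -f -- "$p"; fi
  done | sort -u | wc -l | tr -d ' '
}

# Usage on a real host (not run here):
#   which -a certbot | count_distinct_binaries
```

A result of 1 is what you want. A result of 2 or more means the apt and snap copies are both present and one of them should go.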
Mode 5: a disk-full /var/log/letsencrypt aborts renewal
Certbot writes detailed logs to /var/log/letsencrypt/ on every run. The default logrotate config keeps a week of compressed logs. On a busy host that also writes nginx access logs, dpkg logs, and journal data to the same partition, you can run /var to 100% in a couple of months without noticing. Certbot's renewal step opens its log file for writing BEFORE it talks to the ACME server, so a full /var causes an immediate exit with a non-obvious error.
Detect it preemptively:
df -P /var | awk 'NR==2 {print $5}' | tr -d '%'
If the number is above 85, you have a window of weeks before the next renewal fails. The right structural fix is to give /var its own partition or LVM volume with a monitored alert at 80% usage. The wrong fix is to clear /var/log/letsencrypt/ by hand every few weeks, because that hides the underlying disk-pressure problem rather than solving it.
A defensive cron addition is to run find /var/log/letsencrypt -type f -mtime +30 -delete once a week, as a backstop against logrotate misconfiguration. Treat it as a backstop only; it does not replace the disk alert.
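The usage number becomes more useful once it is mapped to the thresholds discussed in this section. A minimal sketch, assuming the 80% and 85% levels mentioned above; the function names and verdict labels are illustrative.

```shell
#!/bin/bash
# Sketch: turn `df -P` output into a disk-pressure verdict.

# Read `df -P <path>` output on stdin and print the usage percentage.
parse_df_pct() {
  awk 'NR==2 { gsub(/%/, "", $5); print $5 }'
}

# Compare a percentage against warning and critical thresholds.
disk_verdict() {
  local pct="$1" warn="${2:-80}" crit="${3:-85}"
  if [ "$pct" -ge "$crit" ]; then
    echo "critical"
  elif [ "$pct" -ge "$warn" ]; then
    echo "warning"
  else
    echo "ok"
  fi
}

# Usage on a real host (not run here):
#   disk_verdict "$(df -P /var | parse_df_pct)"
```

The -P flag matters because it keeps each filesystem on one line even when the device name is long, so the second line is always the data row.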
Mode 6: the rate-limit hit from too-frequent dry runs
The Let's Encrypt production environment enforces rate limits documented at https://letsencrypt.org/docs/rate-limits/. The two that bite hardest are 50 certificates per registered domain per week, and 5 duplicate certificates per week. You will almost never hit these on a normal renewal cadence. A genuine certbot renew --dry-run is safe here, because certbot sends dry runs to the staging environment. You absolutely will hit the limits if the "dry run" wired into a CI pipeline that runs on every commit is not actually dry: the flag gets dropped, or the job calls certbot renew --force-renewal or certbot certonly, and every push issues a real certificate from production.
The damage is asymmetric: a single mistake can lock you out of renewing the affected domain for a full week. A renewal scheduled inside that window will fail and there is no way to override.
Detect the dangerous pattern BEFORE it bites by auditing your renewal config and CI:
sudo certbot certificates 2>&1 | grep -E "Domains|Expiry"
grep -r "certbot renew" .github/ /etc/cron.d/ /etc/systemd/system/ 2>/dev/null
Confirm every CI invocation either passes --dry-run or targets staging explicitly with --staging or --server https://acme-staging-v02.api.letsencrypt.org/directory. The staging environment has separate, much looser rate limits and issues untrusted certs that are perfect for CI. If you find a CI job that issues from production, change it to staging the same day; the next time someone pushes to that branch ten times in an hour, you will be glad you did.
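The audit can be partly automated by classifying each certbot invocation found in CI files. This is a rough sketch, not a complete linter: it treats any line containing --dry-run, --staging, --test-cert, or the staging URL as safe and flags every other certbot line. The function names are hypothetical.

```shell
#!/bin/bash
# Sketch: flag certbot invocations in CI files that would hit production.

# Classify one line of text as none / staging / production.
classify_certbot_line() {
  case "$1" in
    *certbot*) ;;
    *) echo "none"; return ;;
  esac
  case "$1" in
    *--dry-run*|*--staging*|*--test-cert*|*acme-staging-v02*) echo "staging" ;;
    *) echo "production" ;;
  esac
}

# Print every certbot line under a directory that is not pinned to staging.
audit_ci_dir() {
  local dir="$1" line
  grep -rn -- 'certbot' "$dir" 2>/dev/null | while IFS= read -r line; do
    if [ "$(classify_certbot_line "$line")" = "production" ]; then
      echo "$line"
    fi
  done
}
```

Run audit_ci_dir .github from the repository root and review each hit by hand. Commands split across lines with backslashes can be misclassified, so an empty result is a hint rather than proof.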
The 10-line monitoring script that catches all six modes
A daily check that runs out-of-band from the renewer itself gives you weeks of lead time on every failure mode above. The script below does four things: it queries the live cert with openssl s_client, parses the expiry, compares against today, and sends a Telegram or generic webhook ping when the cert is under 30 days OR the timer is not scheduled OR a forced dry-run fails.
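#!/bin/bash
# Daily certificate monitor. Run as root from cron (the dry run needs it).
# No "set -e": every check should run even if an earlier alert fails to send.
set -uo pipefail
DOMAIN="${1:?usage: cert-monitor.sh <domain> <webhook-url>}"
WEBHOOK="${2:?usage: cert-monitor.sh <domain> <webhook-url>}"
# -f makes curl fail on HTTP errors; --data-urlencode keeps the message intact.
alert() { curl -fsS -X POST --data-urlencode "text=$1" "$WEBHOOK" >/dev/null; }
# Read the expiry date of the certificate the server actually presents.
exp=$(echo | openssl s_client -servername "$DOMAIN" -connect "$DOMAIN":443 2>/dev/null \
  | openssl x509 -noout -enddate | cut -d= -f2) \
  || { alert "[$DOMAIN] could not read live cert"; exit 1; }
days=$(( ( $(date -d "$exp" +%s) - $(date +%s) ) / 86400 ))
[ "$days" -lt 30 ] && alert "[$DOMAIN] cert expires in $days days"
systemctl is-active --quiet certbot.timer || alert "[$DOMAIN] certbot.timer not active"
certbot renew --dry-run >/dev/null 2>&1 || alert "[$DOMAIN] dry-run failed"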
Wire it to a daily cron entry at 09:00 local time. The webhook URL can be a Telegram bot's sendMessage endpoint, an Apprise gateway, or any generic incoming-webhook URL that accepts a text field. Running once a day rather than once an hour keeps you well clear of any rate limits and still gives a 30-day warning window.
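For Telegram specifically, the webhook URL can be assembled from a bot token and a chat ID. The sketch below assumes the Bot API accepts chat_id in the query string while the text arrives in the POST body; the token, chat ID, and script path shown are placeholders.

```shell
#!/bin/bash
# Sketch: build a Telegram Bot API sendMessage URL for the monitor's webhook.
# The token and chat ID are placeholders you obtain from your own bot.

telegram_webhook_url() {
  local token="$1" chat_id="$2"
  echo "https://api.telegram.org/bot${token}/sendMessage?chat_id=${chat_id}"
}

# Example root crontab line (path, domain, and URL values are placeholders):
#   0 9 * * * /usr/local/bin/cert-monitor.sh example.com 'https://api.telegram.org/bot<token>/sendMessage?chat_id=<chat-id>'
```

Send yourself one test message by hand before trusting the cron entry, because a monitor whose alerts never arrive is just another silent failure.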
When you combine all six checks
A reasonable target for an indie single-server stack is: the cert monitor runs daily at 09:00, the renewer runs twice a day on certbot's default schedule, and the --dry-run portion of the monitor rehearses a real renewal against the staging environment, which is where certbot sends dry runs by default. That makes the daily check a true end-to-end smoke test without spending production rate-limit budget.
The headline number: in a 90-day cert lifetime, this setup gives you up to 60 days of advance notice on most failure modes and a minimum of 30 days of notice on the rest. That is the difference between fixing a misconfiguration on a Tuesday afternoon and rolling out an emergency patch at 23:00 on a Sunday.