
Panic in certificate worker when renewal fails #7366

@st3fan

Description


Issue Details

We run an internal ACME server (Step CA), and sometimes Caddy fails to renew certificates against it. Why this happens is not known yet; it is something I am still debugging. However, I did find something that I think is an actual bug in Caddy.

In short: when renewal fails under a certain condition, Caddy's internal renewal process (goroutine?) panics and then never renews anything again until Caddy is restarted.

Here is the trace from my logs. I can trigger this bug by repeatedly running:

systemctl stop caddy; rm -rf /var/lib/caddy/; sleep 3; systemctl start caddy

Again, I am debugging what is likely a non-deterministic network issue in our own infrastructure, and I can trigger it with the above. I am including that snippet because it shows that I start Caddy with a completely empty state (data directory wiped).

{
  "level": "info",
  "ts": 1764090510.8839772,
  "logger": "tls",
  "msg": "cleaning storage unit",
  "description": "FileStorage:/var/lib/caddy/.local/share/caddy"
}
{
  "level": "info",
  "ts": 1764090510.884073,
  "logger": "tls",
  "msg": "finished cleaning storage units"
}
{
  "level": "info",
  "ts": 1764090510.8842258,
  "logger": "tls.obtain",
  "msg": "acquiring lock",
  "identifier": "www.example.com"
}
{
  "level": "info",
  "ts": 1764090510.8894386,
  "logger": "tls.obtain",
  "msg": "lock acquired",
  "identifier": "www.example.com"
}
{
  "level": "info",
  "ts": 1764090510.889603,
  "logger": "tls.obtain",
  "msg": "obtaining certificate",
  "identifier": "www.example.com"
}
{
  "level": "info",
  "ts": 1764090510.9287097,
  "logger": "http",
  "msg": "waiting on internal rate limiter",
  "identifiers": [
    "www.example.com"
  ],
  "ca": "https://acme.internal.systems/acme/example/directory",
  "account": ""
}
{
  "level": "info",
  "ts": 1764090510.92878,
  "logger": "http",
  "msg": "done waiting on internal rate limiter",
  "identifiers": [
    "www.example.com"
  ],
  "ca": "https://acme.internal.systems/acme/example/directory",
  "account": ""
}
{
  "level": "info",
  "ts": 1764090510.9819884,
  "logger": "http.acme_client",
  "msg": "trying to solve challenge",
  "identifier": "www.example.com",
  "challenge_type": "tls-alpn-01",
  "ca": "https://acme.internal.systems/acme/example/directory"
}

Up to here everything looks normal.

But then the following happens:

{
  "level": "error",
  "ts": 1764090665.8923006,
  "logger": "http.acme_client",
  "msg": "deactivating authorization",
  "identifier": "www.example.com",
  "authz": "https://acme.internal.systems/acme/example/authz/zzpfwpadkwsc0eium6ogz9LDYkmWh4xa",
  "error": "request to https://acme.internal.systems/acme/example/authz/zzpfwpadkwsc0eium6ogz9LDYkmWh4xa failed after 1 attempts: HTTP 0 urn:ietf:params:acme:error:malformed - The request message was malformed"
}

This is very interesting: it says "the request message was malformed", but I put a tcpdump on the network and verified that the ACME server is not actually talking to Caddy on port 80 - there is no traffic. This is probably the internal network/infra issue I am dealing with and not a Caddy issue.

{
  "level": "info",
  "ts": 1764090665.892389,
  "logger": "tls.obtain",
  "msg": "releasing lock",
  "identifier": "www.example.com"
}

And then, immediately after releasing the lock, this happens:

{
  "level": "info",
  "ts": 1764090665.8927844,
  "msg": "panic: certificate worker: interface conversion: interface {} is nil, not acme.Authorization\ngoroutine 51 [running]:\ngithub.com/caddyserver/certmagic.(*jobManager).worker.func1()\n\tgithub.com/caddyserver/certmagic/async.go:58 +0x65\npanic({0x183b1c0?, 0xc000447b30?})\n\truntime/panic.go:792 +0x132\ngithub.com/mholt/acmez.(*Client).ObtainCertificateUsingCSR(0xc000580ae0, {0x1d3f020, 0xc00052b3e0}, {{0xc000537e80, 0x5}, {0x0, 0x0, 0x0}, 0x1, {0x0, ...}, ...}, ...)\n\tgithub.com/mholt/acmez/client.go:137 +0x1587\ngithub.com/caddyserver/certmagic.(*ACMEIssuer).doIssue(0xc00057f7a0, {0x1d3f020, 0xc00052b3e0}, 0xc000148008, 0x0)\n\tgithub.com/caddyserver/certmagic/acmeissuer.go:385 +0x19c\ngithub.com/caddyserver/certmagic.(*ACMEIssuer).Issue(0xc00057f7a0, {0x1d3f020, 0xc00052b3e0}, 0xc000148008)\n\tgithub.com/caddyserver/certmagic/acmeissuer.go:314 +0xa9\ngithub.com/caddyserver/caddy/v2/modules/caddytls.(*ACMEIssuer).Issue(0x29c3d10?, {0x1d3f020?, 0xc00052b3e0?}, 0xc000512210?)\n\tgithub.com/caddyserver/caddy/v2/modules/caddytls/acmeissuer.go:233 +0x25\ngithub.com/caddyserver/certmagic.(*Config).obtainCert.func2({0x1d3f020, 0xc00052b3e0})\n\tgithub.com/caddyserver/certmagic/config.go:554 +0x98d\ngithub.com/caddyserver/certmagic.doWithRetry({0x1d3f058, 0xc00007e640}, 0xc00025e620, 0xc000525af0)\n\tgithub.com/caddyserver/certmagic/async.go:104 +0x2f8\ngithub.com/caddyserver/certmagic.(*Config).obtainCert(0xc000570420, {0x1d3f058, 0xc00007e640}, {0xc0004a2540, 0x33}, 0x0)\n\tgithub.com/caddyserver/certmagic/config.go:611 +0x68a\ngithub.com/caddyserver/certmagic.(*Config).ObtainCertAsync(...)\n\tgithub.com/caddyserver/certmagic/config.go:462\ngithub.com/caddyserver/certmagic.(*Config).manageOne.func1()\n\tgithub.com/caddyserver/certmagic/config.go:359 +0x73\ngithub.com/caddyserver/certmagic.(*jobManager).worker(0x29df3a0)\n\tgithub.com/caddyserver/certmagic/async.go:73 +0x11c\ncreated by github.com/caddyserver/certmagic.(*jobManager).Submit in goroutine 1\n\tgithub.com/caddyserver/certmagic/async.go:50 +0x279"
}

Decoded, the panic is the following:

panic: certificate worker: interface conversion: interface {} is nil, not acme.Authorization
goroutine 51 [running]:
github.com/caddyserver/certmagic.(*jobManager).worker.func1()
        github.com/caddyserver/certmagic/async.go:58 +0x65
panic({0x183b1c0?, 0xc000447b30?})
        runtime/panic.go:792 +0x132
github.com/mholt/acmez.(*Client).ObtainCertificateUsingCSR(0xc000580ae0, {0x1d3f020, 0xc00052b3e0}, {{0xc000537e80, 0x5}, {0x0, 0x0, 0x0}, 0x1, {0x0, ...}, ...}, ...)
        github.com/mholt/acmez/client.go:137 +0x1587
github.com/caddyserver/certmagic.(*ACMEIssuer).doIssue(0xc00057f7a0, {0x1d3f020, 0xc00052b3e0}, 0xc000148008, 0x0)
        github.com/caddyserver/certmagic/acmeissuer.go:385 +0x19c
github.com/caddyserver/certmagic.(*ACMEIssuer).Issue(0xc00057f7a0, {0x1d3f020, 0xc00052b3e0}, 0xc000148008)
        github.com/caddyserver/certmagic/acmeissuer.go:314 +0xa9
github.com/caddyserver/caddy/v2/modules/caddytls.(*ACMEIssuer).Issue(0x29c3d10?, {0x1d3f020?, 0xc00052b3e0?}, 0xc000512210?)
        github.com/caddyserver/caddy/v2/modules/caddytls/acmeissuer.go:233 +0x25
github.com/caddyserver/certmagic.(*Config).obtainCert.func2({0x1d3f020, 0xc00052b3e0})
        github.com/caddyserver/certmagic/config.go:554 +0x98d
github.com/caddyserver/certmagic.doWithRetry({0x1d3f058, 0xc00007e640}, 0xc00025e620, 0xc000525af0)
        github.com/caddyserver/certmagic/async.go:104 +0x2f8
github.com/caddyserver/certmagic.(*Config).obtainCert(0xc000570420, {0x1d3f058, 0xc00007e640}, {0xc0004a2540, 0x33}, 0x0)
        github.com/caddyserver/certmagic/config.go:611 +0x68a
github.com/caddyserver/certmagic.(*Config).ObtainCertAsync(...)
        github.com/caddyserver/certmagic/config.go:462
github.com/caddyserver/certmagic.(*Config).manageOne.func1()
        github.com/caddyserver/certmagic/config.go:359 +0x73
github.com/caddyserver/certmagic.(*jobManager).worker(0x29df3a0)
        github.com/caddyserver/certmagic/async.go:73 +0x11c
created by github.com/caddyserver/certmagic.(*jobManager).Submit in goroutine 1
        github.com/caddyserver/certmagic/async.go:50 +0x279
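
For reference, this is the failure mode that message describes. The following is a minimal, self-contained sketch, not the actual acmez code at client.go:137; the type and variable names are made up for illustration. A bare type assertion on a nil interface value panics, whereas the comma-ok form lets the caller handle the missing value as an error:

package main

import "fmt"

// Authorization stands in for acme.Authorization in this illustration.
type Authorization struct {
	Status string
}

func main() {
	// Hypothetical: the result of polling an authorization, stored as an
	// interface{}. If an error path leaves it unset, it is a nil interface.
	var result interface{} // nil

	// Safe: the comma-ok form reports failure instead of panicking.
	if authz, ok := result.(Authorization); ok {
		fmt.Println("authorization status:", authz.Status)
	} else {
		fmt.Println("no authorization available; treat this as an error")
	}

	// Unsafe: a bare assertion on a nil interface value panics with
	// "interface conversion: interface {} is nil, not main.Authorization".
	_ = result.(Authorization)
}

I have not dug into the acmez source, so I do not know exactly which value ends up nil there; the sketch only shows the general pattern that produces this panic message.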

After this panic, the renewal worker (goroutine?) seems dead and no further ACME renewal attempts are made until Caddy is restarted.
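
To illustrate why I think the worker never comes back, here is a hypothetical sketch of a job-queue worker - not certmagic's actual async.go code, and the names are invented. If the deferred recover logs the panic and then simply returns, the worker loop ends and any jobs submitted afterwards are never picked up:

package main

import (
	"log"
	"time"
)

// worker is a hypothetical sketch. A panic inside a job is recovered and
// logged, but the deferred function returning also ends the range loop,
// so no later jobs are ever processed by this goroutine.
func worker(jobs <-chan func()) {
	defer func() {
		if err := recover(); err != nil {
			log.Printf("panic: certificate worker: %v", err)
		}
	}()
	for job := range jobs {
		job() // a panic here unwinds to the deferred recover above
	}
}

func main() {
	jobs := make(chan func(), 3)
	go worker(jobs)

	jobs <- func() { log.Println("renewal attempt 1") }
	jobs <- func() { panic("interface conversion: interface {} is nil, not acme.Authorization") }
	// Submitted after the panic: never runs, because the single worker
	// goroutine has already returned.
	jobs <- func() { log.Println("renewal attempt 2") }

	time.Sleep(500 * time.Millisecond)
}

Whether certmagic actually works this way I cannot say; I am only noting that the behaviour I see (Caddy stays up, but no further renewal attempts appear in the logs) is consistent with this pattern.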

Forget about the weird renewal problem that I have - if that turns out to be a Caddy issue then I will report it separately. The more important part is that there is a panic in what is likely an edge case, and that panic stops the renewal process completely.

I am running the Caddy package that ships with Debian 13 (2.6.2-12), but I have also manually upgraded to 2.10.2 and observed the same panic.

Assistance Disclosure

If AI was used, describe the extent to which it was used.

No response
