One of our customers is sending support form submissions to their support email, and we didn’t have any issues sending them until this morning. I just noticed ~400 messages transiently failed (some with 5 attempts now), where the transient failure peer_address is:
"peer_address": {
"name": "",
"addr": ""
}
response.content is:
KumoMTA internal: failed to connect to any candidate hosts: connect to ResolvedAddress { name: \"bitpin.ir.\", addr: 185.143.233.120 } port 25 and read initial banner: deadline has elapsed, connect to ResolvedAddress { name: \"bitpin.ir.\", addr: 185.143.234.120 } port 25 and read initial banner: deadline has elapsed
The email is being sent to support@bitpin.ir. dig says the MX record is mailer.bitpin.io, which resolves to 135.181.110.161. The IP address 185.143.233.120 shown above in response.content is actually the IP of the A record for bitpin.ir. For some reason Kumo is using that instead of the MX record?
It seems like kumo.dns.lookup_mx is finding the MX record and returning the correct IP (I’m testing with a script that I’m running with kumod --policy /tmp/dns.lua --user kumod).
The SMTP spec (RFC 5321, section 5.1) says that when an MX lookup returns no MX records for a domain, the sender should fall back to the domain's own address (A) record and treat it as an implicit MX. It sounds like that might be what happened here?
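For reference, here is a rough Python sketch of that fallback rule. It is only an illustration of the RFC 5321 behavior, not KumoMTA's actual code: `candidate_hosts`, `lookup_mx`, and `TemporaryDNSError` are hypothetical names, and the preference value 10 in the usage lines is made up. It also skips the randomization the RFC asks for among equal-preference records.

```python
class TemporaryDNSError(Exception):
    """Hypothetical error: the lookup itself failed (timeout, SERVFAIL)."""


def candidate_hosts(domain, lookup_mx):
    """Return the hosts to try for `domain`, in delivery order.

    `lookup_mx` is a hypothetical callable that returns a list of
    (preference, hostname) tuples. An empty list means the domain
    has no MX records. It may raise TemporaryDNSError.
    """
    records = lookup_mx(domain)  # a lookup error propagates to the caller
    if not records:
        # No MX records: treat the domain itself as an implicit MX,
        # so the sender connects to the domain's A record.
        return [domain]
    # Lowest preference value is tried first.
    return [host for _pref, host in sorted(records)]


# MX record present: deliver to the MX host.
print(candidate_hosts("bitpin.ir", lambda d: [(10, "mailer.bitpin.io")]))
# MX answer empty: fall back to the domain's own address record.
print(candidate_hosts("bitpin.ir", lambda d: []))
```

If a resolver hiccup is reported as an empty answer rather than as an error, the second case applies and the sender connects to the A record for bitpin.ir, which would match the addresses in the log above.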
Can you switch to the hickory resolver (the non-unbound one)? We had another customer experience some transient DNS failures with the unbound resolver recently, which makes me suspicious.
I’ve had a lot of good experience with the embedded unbound resolver in the past, but it seems like there might be something funky here.