Error calling msg:from_header(): Data is not ASCII or UTF-8

This is probably caused by a bad from header on some inbound emails, but I’ve started noticing this error in the output of journalctl -u kumomta:

Feb 24 17:17:32 send kumod[1382402]: 2024-02-24T17:17:32.204443Z ERROR localset-1 run{socket=PollEvented { io: Some(TcpStream { addr: 5.78.73.137:25, peer: 162.55.128.206:38772, fd: 50 }) }}: kumod::smtp_server: Error in SmtpServer: callback error
Feb 24 17:17:32 send kumod[1382402]: stack traceback:
Feb 24 17:17:32 send kumod[1382402]:         [C]: in method 'from_header'
Feb 24 17:17:32 send kumod[1382402]:         [string "/opt/kumomta/etc/policy/init.lua"]:339: in function <[string "/opt/kumomta/etc/policy/init.lua"]:337>
Feb 24 17:17:32 send kumod[1382402]: caused by: Data is not ASCII or UTF-8

init.lua line 339 is:

kumo.on('smtp_server_message_received', function(msg)
  -- the next line is line 339
  local tenant = cached_tenant_id(msg:from_header().domain)
  -- the rest of the code...

Is there a way for me to figure out what the from header actually is and why/how it’s invalid?

My first thought is to add a pcall and if there’s an error, log the msg:get_data(), but there are quite a few of these errors happening and I’d prefer a less verbose method to prevent polluting the logs.

Yeah, use pcall to detect the error, then I’d suggest logging the header with get_first_named_header_value (see the KumoMTA docs). It may have a similar problem with a broken header, though, in which case I’d suggest calling msg:get_data() and splitting the result at the first \r\n\r\n so that you log the full set of headers, or going a bit further and using Lua’s string.match to find the From: header in the data by hand.
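A rough sketch of the "by hand" approach, in case it helps. The helper name raw_from_header is made up for illustration; it only uses plain Lua string functions and assumes the message data uses CRLF line endings:

```lua
-- Hypothetical helper: pull the raw From: header out of message data by hand.
local function raw_from_header(data)
  -- Headers end at the first blank line; fall back to the whole input.
  local header_end = data:find("\r\n\r\n", 1, true)
  local headers = header_end and data:sub(1, header_end + 1) or data
  -- Unfold continuation lines (CRLF followed by a space or tab).
  headers = headers:gsub("\r\n([ \t])", "%1")
  -- Prefix with CRLF so a From: on the very first line also matches.
  for name, value in ("\r\n" .. headers):gmatch("\r\n([^:\r\n]+):[ \t]*([^\r\n]*)") do
    if name:lower() == "from" then
      return value
    end
  end
  return nil
end
```

In the pcall failure branch, something like print("bad From header:", raw_from_header(msg:get_data())) would then log a single line instead of the whole message.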

I think we could try to do a better job at logging the offending header in this error case

I added this to log the emails with problematic From headers:

kumo.on('smtp_server_message_received', function(msg)
  -- Assign tenant based on "from" domain.
  -- pcall returns false plus the error if the From header cannot be parsed.
  local ok, result = pcall(function()
    return msg:from_header().domain
  end)
  -- Declared outside the if so the rest of the handler can see them.
  local tenant, domain
  if ok then
    tenant = cached_tenant_id(result)
    domain = cached_domain_id(result)
  else
    print("from_header", ok, result)
    print(msg:get_data())
    kumo.reject(451, "Internal server error, please try again later.")
  end
  -- The rest of the code...
end)

Here’s the log: https://paste.mozilla.org/eGCaX4CX

The email that’s causing the error is a bounce message

I think it’s the same one I sent you in DM @free-spirited-yorksh

but I’m not sure why it’s saying Data is not ASCII or UTF-8 - the From headers look okay to me.

My theory is that the incoming message is not 7-bit clean. When parsing out the headers, we have to treat the incoming message bytes as either ASCII or UTF-8. I think the error is triggered by something later in the message body. Since the bounce message appears to come from exim, and it includes data from the original message, that implies either that the original message relayed through your system was not 7-bit clean, or that there is a 7-bit cleanliness problem in the system generating the report (this seems less likely).
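One way to test that theory on a captured message is to look for the first byte with the high bit set and see whether it falls in the headers or the body. This is only a sketch in plain Lua; first_8bit_byte is a made-up name, and it assumes CRLF line endings:

```lua
-- Hypothetical helper: report where the first 8-bit byte of a message sits.
local function first_8bit_byte(data)
  -- Lua patterns work on bytes, so this class matches any byte >= 0x80.
  local pos = data:find("[\128-\255]")
  if not pos then
    return nil -- the message is 7-bit clean
  end
  -- The header block ends at the first blank line.
  local header_end = data:find("\r\n\r\n", 1, true)
  local where = (header_end and pos > header_end) and "body" or "headers"
  -- Returns the 1-based position, the section, and the byte value.
  return pos, where, data:byte(pos)
end
```

Note that an 8-bit byte is not necessarily invalid UTF-8, so a hit here only shows that the message is not 7-bit clean, not that it is the exact byte the parser rejected.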

I’ve pushed KumoCorp/kumomta@81a79ac (add more context to "Data is not ASCII or UTF-8" errors), which should give us slightly more context, but I’d recommend two things:

  1. In the error case, capture the message to a file so that we can analyze its bytes without having spaces or other context mangled by the pastebin. You can use something like this to write it to a file:
-- assert raises an error if the file cannot be opened
local f = assert(io.open("/tmp/bad-message.eml", "w"))
f:write(msg:get_data())
f:close()
  2. To prevent relaying badly formed content in the first instance, consider applying the NEEDS_TRANSFER_ENCODING option from check_fix_conformance (see the KumoMTA docs) to either reject (recommended!) or attempt to fix up 7-bit cleanliness issues.
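For the second point, a sketch of what the reject variant might look like. It assumes check_fix_conformance takes the set of issues to check for first and the set of issues to fix second, and returns a description when a check fails; the 552 code and the message text are placeholders, so verify all of this against the KumoMTA docs before relying on it:

```lua
kumo.on('smtp_server_message_received', function(msg)
  -- Assumed signature: (issues to reject on, issues to fix up).
  local failed = msg:check_fix_conformance(
    -- Reject messages that are not 7-bit clean...
    'NEEDS_TRANSFER_ENCODING',
    -- ...and do not attempt to fix anything.
    ''
  )
  if failed then
    -- Placeholder response code and text.
    kumo.reject(552, string.format('5.6.0 %s', failed))
  end
  -- The rest of the code...
end)
```

To have kumo attempt the fixup instead, the flag would move from the first argument to the second.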

Thanks @free-spirited-yorksh - updated to the latest version and added the code for saving the bad message, I’ll wait for the error to happen again so that I can collect some data and then I’ll add the check_fix_conformance.

Feb 28 14:12:32 send kumod[1491108]: Failed to parse the from header:        69cd9fd3d64311eeb1b7960002cafe7c        callback error
Feb 28 14:12:32 send kumod[1491108]: stack traceback:
Feb 28 14:12:32 send kumod[1491108]:         [C]: in method 'from_header'
Feb 28 14:12:32 send kumod[1491108]:         [string "/opt/kumomta/etc/policy/init.lua"]:343: in function <[string "/opt/kumomta/etc/policy/init.lua"]:342>
Feb 28 14:12:32 send kumod[1491108]:         [C]: in function 'pcall'
Feb 28 14:12:32 send kumod[1491108]:         [string "/opt/kumomta/etc/policy/init.lua"]:342: in function <[string "/opt/kumomta/etc/policy/init.lua"]:337>
Feb 28 14:12:32 send kumod[1491108]: caused by: Header::parser_headers: Data is not ASCII or UTF-8: invalid utf-8 sequence of 1 bytes from index 5640
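To see what actually sits at that index, a hex dump of the surrounding bytes from the saved file can help. A plain Lua sketch follows; hex_context is a made-up name, and it assumes the reported index is a zero-based offset into the same bytes that msg:get_data() returned:

```lua
-- Hypothetical helper: hex-dump the bytes around a zero-based byte offset.
local function hex_context(data, index, radius)
  radius = radius or 8
  -- The reported offset is zero-based, Lua strings are one-based.
  local pos = index + 1
  local first = math.max(1, pos - radius)
  local last = math.min(#data, pos + radius)
  local parts = {}
  for i = first, last do
    local hex = string.format("%02x", data:byte(i))
    -- Bracket the byte at the reported offset.
    parts[#parts + 1] = (i == pos) and ("[" .. hex .. "]") or hex
  end
  return table.concat(parts, " ")
end
```

For example, after reading /tmp/bad-message.eml into a string with io.open(path, "rb") and f:read("a"), print(hex_context(data, 5640)) would show the offending byte in brackets with its neighbours on either side.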

looks like that content-preview stuff is raw binary

is that something generated by your system?

That X-Ham-Report header needs to be RFC 2047 encoded (like the Subject header is) for that email to be conformant. I think it is being added by the destination system, so that is technically a bug on their part.

We could make kumo a little smarter in this case and ignore binary data in the body portion of the mail if we’re just trying to parse headers, however, the message is badly formed and cannot be relayed as-is because of the 8-bit data

Yes, it’s not done on our side, it’s the destination system adding it.

What’s your recommendation? Should I just reject these messages with some error message, or try to fix them?

I don’t think you can work around it from Lua, so KumoCorp/kumomta@aa46113 (mailparsing: improve handling of bad messages) should make this easy to reconcile.

wow, thanks @free-spirited-yorksh! Just upgraded to the latest dev version and added check_fix_conformance to handle this situation, I’ll send an update tomorrow with the results.