Check_fix_conformance breaks attachments when body is in UTF-8

It looks like calling msg:check_fix_conformance breaks attachments when body contains UTF-8.

kumo.on('smtp_server_message_received', function(msg, conn_meta)
  if msg:sender().email == 'test@ahasend.com' then
    local file = io.open("/tmp/msg-before-fix.eml", "w")
    file:write(msg:get_data())
    file:close()
  end


  local failed = msg:check_fix_conformance(
    -- check for and reject messages with these issues:
    'NON_CANONICAL_LINE_ENDINGS',
    -- fix messages with these issues:
    'NEEDS_TRANSFER_ENCODING|MISSING_DATE_HEADER|MISSING_MESSAGE_ID_HEADER'
  )
  if failed then
    kumo.reject(552, string.format('5.6.0 %s', failed))
  end

  if msg:sender().email == 'test@ahasend.com' then
    local file = io.open("/tmp/msg-after-fix.eml", "w")
    file:write(msg:get_data())
    file:close()
  end
  -- the rest of init...

Please see the msg-before-fix.eml and msg-after-fix.eml files attached below. The attachment content (an ics file in this case) changes after calling check_fix_conformance, which results in a broken event being shown in Gmail.

Sending the same email with the body set to test instead of تست does not reproduce the problem: the attachment is left unchanged.
msg-after-fix.eml (19.7 KB)
msg-before-fix.eml (18 KB)

[Image: broken attachment]

[Image: correct attachment when body is test (no UTF-8 content)]

This is on Kumo 2025.12.02-67ee9e96

UTF-8. Le sigh.
Looking

What’s happening here is tricky.

There are two parts in the message:

  • the HTML content, supplied explicitly as binary UTF-8 text. This requires a transfer encoding in order to be relayed successfully via SMTP
  • the ics attachment, implicitly labelled as US-ASCII (because there is no charset= parameter and the content type is text/*), although the content inside the base64 transfer encoding is actually binary UTF-8.

The first part triggers the NEEDS_TRANSFER_ENCODING check in check_fix_conformance, and it is successfully rewritten with transfer encoding.

The second part is also rewritten by the rebuild that was triggered by the first part. When we extract its content, since it is text, we try to decode the part:

  • US-ASCII is treated as equivalent to windows-1252 in the Rust charset crate. This is a pragmatic choice on the part of the crate authors; US-ASCII is a subset of windows-1252, and windows-1252 is close enough to ISO-8859-1 for most purposes.
  • The actually-UTF-8 bytes in that part are not strictly 7-bit ASCII, but do happen to be technically valid bytes for windows-1252
  • The byte stream is therefore successfully decoded from windows-1252 into UTF-8, but the result is bogus because the data isn’t actually windows-1252
  • The resulting data is then put into the rebuilt message and labelled as UTF-8
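
To make the effect of those steps concrete, here is a minimal standalone Lua sketch (Lua 5.3 or later, for the utf8 library). It is not the mailparsing code and uses no KumoMTA API. The helper name decode_1252ish_to_utf8 is hypothetical, and it simplifies windows-1252 by treating every byte as its ISO-8859-1 code point, which only matches real windows-1252 for bytes outside 0x80-0x9F. The sample bytes are the UTF-8 encoding of تست, used purely as an illustration; the actual bytes in the ics attachment are not shown in this thread.

```lua
-- Sketch only: decode bytes as if they were windows-1252, then emit UTF-8.
-- Simplification (assumption): each byte is mapped to the code point with
-- the same value, as in ISO-8859-1. Real windows-1252 differs for bytes
-- 0x80-0x9F; the sample below only uses bytes >= 0xA0, where both agree.
local function decode_1252ish_to_utf8(bytes)
  local out = {}
  for i = 1, #bytes do
    out[#out + 1] = utf8.char(bytes:byte(i))
  end
  return table.concat(out)
end

-- UTF-8 bytes of "تست" (U+062A U+0633 U+062A)
local original = "\xD8\xAA\xD8\xB3\xD8\xAA"
local mangled = decode_1252ish_to_utf8(original)

print(#original, utf8.len(original)) -- 6 bytes, 3 code points
print(#mangled, utf8.len(mangled))   -- 12 bytes, 6 code points
print(mangled)                       -- ØªØ³Øª
```

Every input byte decodes without error, so nothing flags the problem, yet the output is a different and longer string that is then labelled as UTF-8.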

Fundamentally, the encoding on the input message is ambiguous. Do you control that input? I would recommend that both parts explicitly set the correct charset and transfer encoding.
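
If you do control the input, one rough way to pick the label for a text part before generating the message is to look at its bytes. The following is a standalone Lua sketch (Lua 5.3 or later), not a KumoMTA function: classify_text is a hypothetical helper, and the check is only a heuristic, since bytes that happen to be valid UTF-8 could in principle have been meant as something else.

```lua
-- Hypothetical helper, not part of the KumoMTA API.
-- Returns a charset label suggestion for a string of raw bytes.
local function classify_text(bytes)
  -- No byte >= 0x80 means the content really is 7-bit US-ASCII.
  if not bytes:find("[\128-\255]") then
    return "us-ascii"
  end
  -- utf8.len returns nil (plus a position) for invalid UTF-8.
  if utf8.len(bytes) then
    return "utf-8"
  end
  -- 8-bit data that is not valid UTF-8: needs a known, explicit charset.
  return "unknown-8bit"
end

print(classify_text("BEGIN:VCALENDAR"))      -- us-ascii
print(classify_text("SUMMARY:\xD8\xAA"))     -- utf-8
print(classify_text("caf\xE9"))              -- unknown-8bit
```

A part that comes back as utf-8 should carry an explicit charset=utf-8 parameter on its Content-Type rather than relying on the US-ASCII default for text parts.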

I tried using the charset detection described on the check_fix_conformance page of the KumoMTA Docs on this message, but this particular conformance issue isn’t detectable in its current form.

The commit "mailparsing: improve charset detection/conformance fixing" (KumoCorp/kumomta@fead711 on GitHub) should make this situation better.

Thanks Wez, I’ll test it out over the weekend!