Webhook events not being processed

Hi, since the upgrade to the latest stable version of KumoMTA, some webhook events are not being delivered, they just get stuck in the queue, some will get processed after restarting kumo, but some will not. Delivery has slowed down in general- especially to Google / GMail where most of our messages go.

root@send:/opt/kumomta/sbin# ./kcli provider-summary | head 
PROVIDER                                                 D   T  F C     Q
webhook.log_hook                                     8,186   0  0 0 6,182
(alt1|alt2|alt3|alt4)?.gmail-smtp-in.l.google.com        0   0  0 0 4,500
google                                               2,044 181 29 4     0
...

Our configuration is valid (the only change I’ve made since since our upgrade from the last stable release is to use the new provider syntax in our shaping files):

root@send:/opt/kumomta/sbin# ./kumod --user kumod --validate 
Issues found in the combined set of shaping files:
 - /opt/kumomta/share/policy-extras/shaping.toml
 - /opt/kumomta/etc/policy/config/shaping.toml
 - http://127.0.0.1:8008/get_config_v1/shaping.toml
WARNING: Ignoring http://127.0.0.1:8008/get_config_v1/shaping.toml because skip_remote is set to true

This is how I’ve defined the webhook:

log_hooks:new_json {
  name = "webhook",
  -- log_parameters are combined with the name and
  -- passed through to kumo.configure_log_hook
  log_parameters = {
    headers = { 'Subject', 'From', 'Message-Id', 'References', 'In-Reply-To' },
    meta = { 'tenant', 'domain_id', 'customer_id', 'direction', 'metadata_retention', 'data_retention', 'tags' }
  },
  -- queue config are passed to kumo.make_queue_config.
  -- You can use these to override the retry parameters
  -- if you wish.
  -- The defaults are shown below.
  queue_config = {
    retry_interval = "1m",
    max_retry_interval = "24h",
    max_message_rate = '100000/s',
  },
  -- The URL to POST the JSON to
  url = LOG_WEBHOOK_URL
}
-- ... some other stuff
local queue_helper = queue_module:setup({'/opt/kumomta/etc/policy/config/queue.toml', '/opt/kumomta/etc/policy/config/queue-defaults.toml'})
-- ... the rest of the file

And I have this in shaping.toml:

["webhook.log_hook"]
mx_rollup = false
connection_limit = 1000
max_deliveries_per_connection = 1
max_message_rate = "10000/s"
max_ready = 2048
max_connection_rate = "10000/s"

(I’ve also tried changing max_deliveries_per_connection to a large value - didn’t make a difference).

I’ve set the log level to the debug, but nothing stands out in the logs either.

./kcli set-log-filter debug 

The logs from the past 7 hours
logs.txt.xz (1.18 MB)

downgraded to 2024.09.02.062344.c5476b89 and the webhook.log_hook queue was processed and is now empty. Same for the queued up messaages to gmail

The shaping file I use with 2024.09.02.062344.c5476b89
shaping.toml (10.8 KB)

shaping file for 2024.11.08.103708.d383b033
shaping-new.toml (9.22 KB)

OK, I think I’ve found the issue, I had a bug in the code for throttling customers that caused webhook calls to get throttled. This didn’t happen in the previous stable version, I think because according to the changelog The throttle_insert_ready_queue event was not correctly registered and would never trigger, so the buggy code was never called.

I’m not 100% sure, but it seems like throttle_insert_ready_queue event gets called for a webhook message, which does not have a tenant. My code in aha.per_tenant_throttle would default to 200/hr if tenant was nil, which ended up throttling webhooks.

Previous code:

kumo.on('throttle_insert_ready_queue', function(msg)
  local throttle = aha.per_tenant_throttle(tenant)
  throttle:delay_message_if_throttled(msg)
end)
-- ...
mod.per_tenant_throttle = function (tenant_id)
  local rate = mod.cached_tenant_throttle(tenant_id)
  return kumo.make_throttle(
    string.format('tenant-send-limit-%s', tenant_id),
    rate
  )
end
-- ...
local function get_tenant_throttle(tenant_id)
  local rate = '200/hr'
  local ok, db = pcall(sqlite.open, "/opt/kumomta/etc/policy/config.db")
  if not ok then
      return rate
  end

  local ok, result = pcall(function ()
      return db:execute("SELECT throttle FROM accounts WHERE id = :id", {
          id = tenant_id
      })
  end)
  if not ok then
      return rate
  end

  return result[1] .. "/hr"
end

Updated the code to this and now the webhook events are being processed without any issues:

kumo.on('throttle_insert_ready_queue', function(msg)
  local ok, tenant = pcall(function()
    return msg:get_meta('tenant')
  end)
  if ok then
    local throttle = aha.per_tenant_throttle(tenant)
    throttle:delay_message_if_throttled(msg)
  end
end)

local function get_tenant_throttle(tenant_id)
  if not tenant_id then
    -- This function should'nt be called with a nil tenant ID
    -- but just in case it is, better safe than sorry...
    return "10000000/hour"
  end

  local rate = '200/hr'
  local ok, db = pcall(sqlite.open, "/opt/kumomta/etc/policy/config.db")
  if not ok then
      return rate
  end

  local ok, result = pcall(function ()
      return db:execute("SELECT throttle FROM accounts WHERE id = :id", {
          id = tenant_id
      })
  end)
  if not ok then
      return rate
  end

  return result[1] .. "/hr"
end

Happened again this morning under load… Will be doing some more investigation today to see what’s going on.

Removed the throttle_insert_ready_queue event handler for now and things are back to normal, I’ll have to do more debugging to figure out why the code throttles webhooks as opposed to just regular tenants

queue.rs: fix occasional accidental delay · KumoCorp/kumomta@4a0a4d6 · GitHub is an issue that I found and fixed yesterday that might be a factor here

@original-baboon

Hmmm…
I had to stop testing webhooks.
I wonder if it’s the same issue.

I was getting duplicate outputs for each event.
The last event I checked had generated 19 webhook updates for just 3 events.

Webhook endpoint was Google Run, so it was able to listen thousands of requests just fine. But it didn’t cut…

And the server was using 100% cpu to deliver webhook events due to amplified load.

Hope it’s related and this update fixes that too.

How many webhooks had you defined? @earnest-seal

Just 1 @original-baboon

Are the actual duplicates or are they multiple records for the same event?
Like is it one reception and 18 transfails for the same message? Do the timestamps all say the same number?

I would expect 2 webhooks calls per event per message, so for Reception you’d have one for the actual reception of the message by the SMTP server (with reception_protocol set to ESMTP) and one for delivering the webhook related to the reception event (with delivery_protocol set to Lua), so for a single message that gets delivered on first try you’d get 4 webhook calls (two for Reception, and two for Delivery), while it might go higher than that for messages that have Transient Failures. 19 webhook calls for 3 events shouldn’t be happening (unless you by 3 events you mean 3 messages with transient failures).

Can you grep those events for the timestamp and message ID to confirm that they are indeed duplicates?