Debugging throttles

After updating to kumod 2025.01.29-833f82a8 I started getting these errors in the logs:

Feb 28 11:11:42 send.ahasend.com kumod[3824241]: 2025-02-28T11:11:42.590263Z ERROR  spoolin-61 kumod::spool: failed to insert Message 3b8329e0f54b11efbd819c6b004d4752 to queue webhook.log_hook: invalid ThrottleSpec `0/day`: limit must be greater than 0!
Feb 28 11:11:42 send.ahasend.com kumod[3824241]: stack traceback:
Feb 28 11:11:42 send.ahasend.com kumod[3824241]:         [C]: in function 'kumo.make_throttle'
Feb 28 11:11:42 send.ahasend.com kumod[3824241]:         /opt/kumomta/etc/policy/ahasend.lua:379: in function 'ahasend.per_tenant_throttle'
Feb 28 11:11:42 send.ahasend.com kumod[3824241]:         [string "/opt/kumomta/etc/policy/init.lua"]:282: in function <[string "/opt/kumomta/etc/policy/init.lua"]:268>
Feb 28 11:11:42 send.ahasend.com kumod[3824241]: . Ignoring message until kumod is restarted

I don’t want to throttle webhooks, and I don’t have a throttle set for them in the sqlite database (though every tenant does have a valid value). As a temporary workaround, I updated the config to check specifically for invalid values (nil or 0) and substitute a high limit when the value is not valid. But even with this change, I still see ~100k webhook messages stuck in the queue in the output of kcli queue-summary.

Is there a way to see what throttles are set for each queue? I’m trying to figure out the best way to debug this.

The relevant parts of the configuration:

-- in init.lua
kumo.on('throttle_insert_ready_queue', function(msg)
  local ok, tenant = pcall(function()
    return msg:get_meta('tenant')
  end)

  local ok2, direction = pcall(function()
    return msg:get_meta('direction')
  end)

  if ok2 and direction == 'inbound' then
    return
  end

  if ok then
    local throttle = ahasend.per_tenant_throttle(tenant)
    throttle:delay_message_if_throttled(msg)
  end
end)

-- in ahasend.lua. Excuse the mess, I've been trying to get around the issue by checking for all sorts of invalid values as a temporary workaround.
local function get_tenant_throttle(tenant_id)
  if tenant_id == nil then
    return "1000000000/hour"
  end
  local rate = '1000/day'
  local ok, db = pcall(sqlite.open, "/opt/kumomta/etc/policy/config.db")
  if not ok then
    return rate
  end

  local ok, result = pcall(function ()
    return db:execute("SELECT throttle FROM accounts WHERE id = :id", {
      id = tenant_id
    })
  end)
  if not ok then
    return rate
  end
  if tenant_id == nil or result[1] == nil then
    return "10000000/hour"
  end
  if result[1] == 0 then
    return "10000000/hour"
  end
  return result[1] .. "/day"
end
mod.cached_tenant_throttle = kumo.memoize(get_tenant_throttle, {
  name = 'tenant_throttles',
  ttl = '10 minutes',
  capacity = 1000,
})

mod.per_tenant_throttle = function (tenant_id)
  local rate = mod.cached_tenant_throttle(tenant_id)
  if rate == "0/day" then
     rate = "1000000000/hour"
  end
  return kumo.make_throttle(
    string.format('tenant-send-limit-%s', tenant_id),
    rate
  )
end

Also, I don’t remember exactly what happened, but the msg param passed to the handler for throttle_insert_ready_queue sometimes has an issue with the get_meta() call and raises an error - this was already happening in the previous versions and I don’t remember the exact error message, but that’s why I have those pcall calls in there.

It took some time, but the queued-up webhooks have now been processed. Am I doing this right, though? There’s no mention of webhook messages getting throttled in the documentation, and I’d initially followed the basic example from this page.

But from what I understand, that example will throttle webhook calls as well?

FWIW, if you don’t want get_tenant_throttle to throttle, I’d suggest having it return nil instead of a throttle spec. Then you could do something like:

 local rate = mod.cached_tenant_throttle(tenant_id)
 if not rate then
    return nil
 end
 return kumo.make_throttle(
   string.format('tenant-send-limit-%s', tenant_id),
   rate
 )

and:

    local throttle = ahasend.per_tenant_throttle(tenant)
    if throttle then
      throttle:delay_message_if_throttled(msg)
    end
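A minimal sketch of the validation side of that suggestion, in plain Lua. The helper name to_throttle_spec is hypothetical (it is not a KumoMTA function), and it assumes the stored limit is a daily message count, as in the code earlier in the thread. It turns whatever the lookup returned into either an N/day spec or nil, meaning "don't throttle":

```lua
-- Hypothetical helper, not a KumoMTA API: converts a raw limit value
-- into a throttle spec string, or nil when no throttle should apply.
local function to_throttle_spec(raw)
  -- tonumber accepts numbers and numeric strings; anything else gives nil
  local n = tonumber(raw)
  -- reject missing, NaN, non-positive and infinite values
  if n == nil or n ~= n or n <= 0 or n == math.huge then
    return nil
  end
  -- %.0f prints a whole number for both integer and float inputs
  return string.format('%.0f/day', math.floor(n))
end
```

With something like this in place, per_tenant_throttle can return nil whenever the helper does, and the event handler skips delay_message_if_throttled exactly as in the snippet above.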

re: get_meta, what is the error message that you see? is memory running low around that time?
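For what it's worth, pcall returns the error value as its second result when the call fails, so the message can probably be captured without removing the wrappers. A rough sketch in plain Lua: safe_get_meta and its log parameter are hypothetical names, and the two stand-in message tables only imitate the real message object for illustration. Inside a policy script you would pass whatever logging call you already use.

```lua
-- Hypothetical wrapper: returns the meta value, or nil plus the error text.
-- `log` is any function taking a string; it defaults to print here.
local function safe_get_meta(msg, key, log)
  log = log or print
  local ok, value_or_err = pcall(function()
    return msg:get_meta(key)
  end)
  if not ok then
    -- on failure, pcall's second result is the error value
    local err = tostring(value_or_err)
    log(string.format('get_meta(%q) failed: %s', key, err))
    return nil, err
  end
  return value_or_err
end

-- Stand-in message objects, for illustration only
local good = { get_meta = function(self, key) return 'tenant-42' end }
local bad = { get_meta = function(self, key) error('simulated failure') end }

-- Collect log lines in a table instead of printing them
local logged = {}
local function collect(line)
  table.insert(logged, line)
end

local tenant = safe_get_meta(good, 'tenant', collect)
local missing, err = safe_get_meta(bad, 'tenant', collect)
```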

keep in mind that throttle hooks are additive with other shaping rules that you might have, so if you have a default block in your shaping setup that limits max_message_rate, and you don’t explicitly have shaping defined for your webhook with an override for its message rate, then you’ll inherit that default value

all of those sources of throttles are considered, resulting in the effective rate being the smallest rate allowed by one of those throttle specs
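To make the "smallest rate wins" point concrete, here is a rough sketch in plain Lua. The parser is a simplified illustration that only understands N/unit with a handful of unit spellings; it is not KumoMTA's own ThrottleSpec parser, and rate_per_second and effective_rate are hypothetical names. The three specs are the ones mentioned in this thread, treated here as if all of them applied to the same message.

```lua
-- Seconds per period for the unit spellings this sketch understands
local SECONDS = {
  s = 1, sec = 1, second = 1,
  m = 60, min = 60, minute = 60,
  h = 3600, hr = 3600, hour = 3600,
  d = 86400, day = 86400,
}

-- Convert a spec such as '1000/day' into messages per second
local function rate_per_second(spec)
  local count, unit = spec:match('^(%d+)/(%a+)$')
  local period = SECONDS[unit]
  if not count or not period then
    error('unsupported spec: ' .. tostring(spec))
  end
  return tonumber(count) / period
end

-- Return the most restrictive spec and its per-second rate
local function effective_rate(specs)
  local best_spec, best_rate
  for _, spec in ipairs(specs) do
    local rate = rate_per_second(spec)
    if best_rate == nil or rate < best_rate then
      best_spec, best_rate = spec, rate
    end
  end
  return best_spec, best_rate
end

local spec = effective_rate({ '100/s', '10000000/min', '1000/day' })
```

Here spec ends up as '1000/day', because one thousand messages per day works out to far less than one message per second.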

Thanks Wez!

I’ll have to check this again. I’ll need to remove the pcalls to make it happen again; I’ll do that and get back to you.

The default value for max_message_rate is 100/s but I have an override for the webhooks:

["webhook.log_hook"]
connection_limit = 10000
max_deliveries_per_connection = 1000
max_message_rate = "10000000/min"
max_ready = 100
max_connection_rate = "10000/s"

max_ready seems very small. Keep in mind that in a through-and-through high throughput scenario you will have 1 Reception and 1 Delivery log record per message transiting the system. So if you have say 1000 msgs/s throughput you will have 2000 log msgs/s going through the webhook. I generally suggest that you take 2x that peak throughput as the starting size for max_ready, which would be 4000 in that scenario.

The consequence of max_ready being too small is that those log event messages will get pushed into the scheduled queue and be delayed for a randomized duration of approx 1 minute.
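The sizing rule of thumb above, written out as arithmetic in plain Lua. suggested_max_ready is a hypothetical name, and the sketch assumes exactly one Reception and one Delivery record per message, so treat the result as a starting point rather than a tuned value.

```lua
-- Rule of thumb from the discussion: 2 log records per message
-- (Reception + Delivery), then 2x headroom over that peak.
local function suggested_max_ready(peak_msgs_per_sec)
  local log_records_per_sec = peak_msgs_per_sec * 2
  return log_records_per_sec * 2
end

-- 1000 msgs/s -> 2000 log records/s -> starting max_ready of 4000
local starting_point = suggested_max_ready(1000)
```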