Debugging throttles

fhf · February 28, 2025, 11:44am

After updating to kumod 2025.01.29-833f82a8 I started getting these errors in the logs:

Feb 28 11:11:42 send.ahasend.com kumod[3824241]: 2025-02-28T11:11:42.590263Z ERROR  spoolin-61 kumod::spool: failed to insert Message 3b8329e0f54b11efbd819c6b004d4752 to queue webhook.log_hook: invalid ThrottleSpec `0/day`: limit must be greater than 0!
Feb 28 11:11:42 send.ahasend.com kumod[3824241]: stack traceback:
Feb 28 11:11:42 send.ahasend.com kumod[3824241]:         [C]: in function 'kumo.make_throttle'
Feb 28 11:11:42 send.ahasend.com kumod[3824241]:         /opt/kumomta/etc/policy/ahasend.lua:379: in function 'ahasend.per_tenant_throttle'
Feb 28 11:11:42 send.ahasend.com kumod[3824241]:         [string "/opt/kumomta/etc/policy/init.lua"]:282: in function <[string "/opt/kumomta/etc/policy/init.lua"]:268>
Feb 28 11:11:42 send.ahasend.com kumod[3824241]: . Ignoring message until kumod is restarted

I don’t want to throttle webhooks, and I don’t have a throttle set for them in the SQLite database (though I do have a valid value for every tenant). As a temporary workaround, I updated the config to check specifically for invalid values (nil or 0) and substitute a high value. Even with this change, I still see ~100k webhook messages stuck in the queue in the output of kcli queue-summary.

Is there a way to see what throttles are set for each queue? I’m trying to figure out the best way for debugging this.

fhf · February 28, 2025, 11:45am

The relevant parts of the configuration:

-- in init.lua
kumo.on('throttle_insert_ready_queue', function(msg)
  local ok, tenant = pcall(function()
    return msg:get_meta('tenant')
  end)

  local ok2, direction = pcall(function()
    return msg:get_meta('direction')
  end)

  if ok2 and direction == 'inbound' then
    return
  end

  if ok then
    local throttle = ahasend.per_tenant_throttle(tenant)
    throttle:delay_message_if_throttled(msg)
  end
end)

-- in ahasend.lua. Excuse the mess, I've been trying to get around the issue by checking for all sorts of invalid values as a temporary workaround.
local function get_tenant_throttle(tenant_id)
  if tenant_id == nil then
      return "1000000000/hour"
  end
  local rate = '1000/day'
  local ok, db = pcall(sqlite.open, "/opt/kumomta/etc/policy/config.db")
  if not ok then
      return rate
  end

  local ok, result = pcall(function ()
      return db:execute("SELECT throttle FROM accounts WHERE id = :id", {
          id = tenant_id
      })
  end)
  if not ok then
      return rate
  end
  if tenant_id == nil or result[1] == nil then
    return "10000000/hour"
  end
  if result[1] == 0 then
    return "10000000/hour"
  end
  return result[1] .. "/day"
end
mod.cached_tenant_throttle = kumo.memoize(get_tenant_throttle, {
  name = 'tenant_throttles',
  ttl = '10 minutes',
  capacity = 1000,
})

mod.per_tenant_throttle = function (tenant_id)
  local rate = mod.cached_tenant_throttle(tenant_id)
  if rate == "0/day" then
     rate = "1000000000/hour"
  end
  return kumo.make_throttle(
    string.format('tenant-send-limit-%s', tenant_id),
    rate
  )
end

fhf · February 28, 2025, 11:49am

Also, I don’t remember exactly what happened, but the msg param passed to the handler for throttle_insert_ready_queue sometimes has an issue with the get_meta() call and raises an error - this was already happening in the previous versions and I don’t remember the exact error message, but that’s why I have those pcall calls in there.

fhf · February 28, 2025, 1:38pm

It took some time, but the queued-up webhooks have now been processed. Am I doing this right, though? The documentation doesn’t mention webhook messages being throttled, and I’d initially followed the basic example from this page

But from what I understand, that example will throttle webhook calls as well?

wez · February 28, 2025, 3:28pm

FWIW, if you don’t want get_tenant_throttle to throttle, I’d suggest having it return nil instead of a throttle spec. Then you could do something like:

 local rate = mod.cached_tenant_throttle(tenant_id)
 if not rate then
    return nil
 end
 return kumo.make_throttle(
   string.format('tenant-send-limit-%s', tenant_id),
   rate
 )

and:

    local throttle = ahasend.per_tenant_throttle(tenant)
    if throttle then
      throttle:delay_message_if_throttled(msg)
    end
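
One way to get a nil out of get_tenant_throttle for every unusable value is to centralize the check in a small helper. The sketch below is only an illustration: throttle_spec_from_limit is a hypothetical name that does not appear in the original policy, and it assumes the database column holds a per-day integer limit.

```lua
-- Hypothetical helper (not part of the original policy): convert a raw
-- per-day limit into a throttle spec string, or return nil to mean
-- "do not throttle this tenant".
local function throttle_spec_from_limit(raw)
  local n = tonumber(raw)
  -- nil, non-numeric, zero, negative, infinite and fractional values
  -- are all treated as "no throttle configured".
  if not n or n <= 0 or n == math.huge or n ~= math.floor(n) then
    return nil
  end
  return string.format('%d/day', n)
end
```

With this in place, get_tenant_throttle could end with `return throttle_spec_from_limit(result[1])`, so a spec such as `0/day` is never handed to kumo.make_throttle.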

wez · February 28, 2025, 3:29pm

re: get_meta, what is the error message that you see? is memory running low around that time?

wez · February 28, 2025, 3:31pm

keep in mind that throttle hooks are additive with other shaping rules that you might have, so if you have a default block in your shaping setup that limits max_message_rate, and you don’t explicitly have shaping defined for your webhook with an override for its message rate, then you’ll inherit that default value

wez · February 28, 2025, 3:32pm

all of those sources of throttles are considered, resulting in the effective rate being the smallest rate allowed by one of those throttle specs
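
To reason about which of several specs is likely to be the limiting one, it can help to normalize them to a common unit. The following is a rough standalone sketch, not how kumod evaluates throttles internally: per_second and most_restrictive are hypothetical helpers, only plain `N/unit` specs are handled, and comparing long-run average rates is a simplifying assumption.

```lua
-- Seconds per unit for a simplified subset of throttle spec units.
local SECONDS = {
  s = 1, sec = 1, second = 1,
  m = 60, min = 60, minute = 60,
  h = 3600, hr = 3600, hour = 3600,
  d = 86400, day = 86400,
}

-- Convert a spec such as '1000/day' into an average messages-per-second rate.
local function per_second(spec)
  local limit, unit = spec:match('^(%d+)/(%a+)$')
  local period = SECONDS[unit]
  if not limit or not period then
    error('unsupported throttle spec: ' .. tostring(spec))
  end
  return tonumber(limit) / period
end

-- Return the spec with the lowest average rate, or nil for an empty list.
local function most_restrictive(specs)
  local best, best_rate
  for _, spec in ipairs(specs) do
    local rate = per_second(spec)
    if not best_rate or rate < best_rate then
      best, best_rate = spec, rate
    end
  end
  return best
end
```

For example, given a default max_message_rate of `100/s`, a queue override of `10000000/min` and a tenant throttle of `1000/day`, most_restrictive picks the tenant throttle as the limiting spec.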

fhf · March 8, 2025, 9:31am

Thanks Wez!

fhf · March 8, 2025, 9:32am

I’ll have to check this again. I’ll need to remove the pcalls to make it happen again; I’ll do that and get back to you.

fhf · March 8, 2025, 9:33am

The default value for max_message_rate is 100/s but I have an override for the webhooks:

["webhook.log_hook"]
connection_limit = 10000
max_deliveries_per_connection = 1000
max_message_rate = "10000000/min"
max_ready = 100
max_connection_rate = "10000/s"

wez · March 8, 2025, 10:59am

max_ready seems very small. Keep in mind that in a through-and-through high throughput scenario you will have 1 Reception and 1 Delivery log record per message transiting the system. So if you have say 1000 msgs/s throughput you will have 2000 log msgs/s going through the webhook. I generally suggest that you take 2x that peak throughput as the starting size for max_ready, which would be 4000 in that scenario.

The consequence of max_ready being too small is that those log event messages will get pushed into the scheduled queue and be delayed for a randomized duration of approx 1 minute.
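
The sizing rule of thumb above can be written out as a short calculation. This is only a sketch of the arithmetic: suggested_max_ready is a hypothetical helper, and the default of two log records per message (one Reception, one Delivery) is taken from the scenario described in this post.

```lua
-- Starting value for max_ready: twice the peak number of log records per
-- second flowing through the webhook queue.
local function suggested_max_ready(peak_msgs_per_sec, records_per_msg)
  records_per_msg = records_per_msg or 2 -- Reception + Delivery
  local peak_log_records_per_sec = peak_msgs_per_sec * records_per_msg
  return 2 * peak_log_records_per_sec
end
```

Calling suggested_max_ready(1000) gives 4000, matching the figure quoted for a throughput of 1000 msgs/s.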