Too many open files when using the SOCKS proxy

Hi team,

I use the Kumo SOCKS proxy server.

Version: 2025.05.06-b29689af (I haven’t had time to upgrade to the latest version yet.)

Recently, I noticed that the value of open files keeps increasing, so I’m wondering if some connections in the SOCKS service aren’t being closed in time.

ss -ant | awk '{print $1}' | sort | uniq -c

    259 CLOSE-WAIT
      4 CLOSING
   7745 ESTAB
    105 FIN-WAIT-1
  47201 FIN-WAIT-2
     63 LAST-ACK
      9 LISTEN
      1 State
    111 SYN-SENT
   3515 TIME-WAIT
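
Side note on reading this output: the `1 State` row is just the `ss` header line being counted along with the sockets. A minimal sketch that skips the header and sorts by count, assuming the default `ss -ant` column layout (the function name `count_tcp_states` is made up for illustration):

```shell
# Count TCP sockets per state from `ss -ant` output read on stdin.
# NR > 1 skips the header line, so no bogus "State" row appears.
count_tcp_states() {
    awk 'NR > 1 { n[$1]++ }
         END { for (s in n) printf "%8d %s\n", n[s], s }' |
        sort -k1,1nr
}
```

Usage would be `ss -ant | count_tcp_states`.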

ls /proc/$(pidof proxy-server)/fd 2>/dev/null | wc -l
This returns 211182.
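
To see what those descriptors actually are (sockets, pipes, regular files), something like the following could help. It is only a sketch: it assumes a Linux /proc layout and enough privileges to read the process's fd directory, and `fd_summary` is a hypothetical helper name.

```shell
# Summarize the kinds of file descriptors in a /proc/<pid>/fd directory.
# $1 is the fd directory; each entry there is a symlink such as
# "socket:[12345]", "pipe:[678]" or a filesystem path.
fd_summary() {
    ls -l "$1" |
        awk -F' -> ' 'NF == 2 { print $2 }' |       # keep only the link target
        sed -E 's/:\[[0-9]+\]$//; s#^/.*#file#' |   # drop inode numbers, group paths as "file"
        sort | uniq -c | sort -rn
}
```

Usage would be `fd_summary /proc/$(pidof proxy-server)/fd`. If `socket` dominates, the count can be compared with the per-state totals from `ss`.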

I don’t know your specific workload, Kumo configuration, distribution of contacted domains, shaping, etc., so I’m sharing the notes below in general terms. They are notes I meant to post in another thread some time ago but never did, so I’m taking the opportunity now :slightly_smiling_face: I hope they’re useful to others as well.

These are practical tips from years (and years) of working with **Momentum** and from recent load testing on Kumo. Treat them as information and apply them only if you understand the trade-offs, and only after proper validation. Some of this appears in the System Preparation docs, but IMO it’s easy to miss.

Systemd limits

  • For kumo / kumo-proxy (depending on your cluster), raise the following limits, as already mentioned in other threads:
    • LimitNOFILE
    • LimitNPROC
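
To confirm that a raised limit actually reached the running process (a changed systemd limit only takes effect once the unit is restarted), the effective value can be read from /proc. A minimal sketch; `nofile_usage` is a hypothetical helper, and it takes the /proc/<pid> directory as its argument:

```shell
# Print the number of open descriptors next to the soft "Max open files" limit.
# $1 is a /proc/<pid> directory containing "limits" and "fd".
nofile_usage() {
    proc_dir=$1
    # Line format: "Max open files  <soft>  <hard>  files", so field 4 is the soft limit
    soft=$(awk '/^Max open files/ { print $4 }' "$proc_dir/limits")
    used=$(ls "$proc_dir/fd" | wc -l | tr -d ' ')
    echo "open_fds=$used soft_limit=$soft"
}
```

Usage would be `nofile_usage /proc/$(pidof proxy-server)`.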

TCP /proc/sys/net/ipv4 tuning for outbound (heavy) workloads

  • net.ipv4.tcp_tw_reuse affects outbound connections

    • Newer kernels often default to 2 (loopback only), which does not help here
    • Set it to 1 and ensure net.ipv4.tcp_timestamps = 1 (they work as a pair)
    • Leaving it at 2 can contribute to symptoms like “Too many open files” or “Cannot open connection” under load and during peaks of TIME_WAIT connections
  • A quick note on tcp_fin_timeout

    • If you see many connections stuck in FIN-WAIT-2, it’s reasonable to lower it to 30s (default 60s) and monitor, if you haven’t already
    • Don’t push it too low in production
  • If you hit something like “Cannot assign requested address” under very high concurrency, consider extending the ephemeral port range:

    • net.ipv4.ip_local_port_range (default: 32768 60999)
    • Widen it (e.g. 1024 65535) but check for conflicts with:
      • net.ipv4.ip_unprivileged_port_start
      • net.ipv4.ip_local_reserved_ports
    • Do a proper review before changing these in production
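
Before changing anything, it can help to print the current values in one place. The sketch below is read-only and only flags the combinations mentioned in the notes above; `review_tcp_sysctls` is a hypothetical helper, and the optional directory argument exists only so it can be pointed at a copy of the files.

```shell
# Print the TCP settings discussed above and flag the combinations from the notes.
# $1 is the directory holding the sysctl files (default: /proc/sys/net/ipv4).
review_tcp_sysctls() {
    dir=${1:-/proc/sys/net/ipv4}
    tw_reuse=$(cat "$dir/tcp_tw_reuse")
    timestamps=$(cat "$dir/tcp_timestamps")
    fin_timeout=$(cat "$dir/tcp_fin_timeout")
    # The range file holds two whitespace-separated numbers: low and high port
    read -r lo hi < "$dir/ip_local_port_range"

    echo "tcp_tw_reuse=$tw_reuse tcp_timestamps=$timestamps tcp_fin_timeout=$fin_timeout"
    echo "ephemeral_ports=$((hi - lo + 1)) ($lo-$hi)"

    if [ "$tw_reuse" != "1" ]; then
        echo "note: tcp_tw_reuse is not 1"
    fi
    if [ "$tw_reuse" = "1" ] && [ "$timestamps" = "0" ]; then
        echo "note: tcp_tw_reuse=1 needs tcp_timestamps enabled"
    fi
}
```

With the default range this reports 28232 ephemeral ports, against 64512 for 1024 65535. Applying a change would then be a separate, reviewed step, for example with `sysctl -w` or a file under /etc/sysctl.d.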

Thank you very much for sharing.

  1. Before raising this issue, we had already set LimitNOFILE and LimitNPROC to large values, since there were indeed problems at the beginning.

  2. tcp_fin_timeout has also been set to 30 seconds, and we even tried 20 seconds, but neither reduced the number of connections in the FIN-WAIT-2 state.

Next, I’ll try using net.ipv4.tcp_tw_reuse = 1 and net.ipv4.tcp_timestamps = 1 to see if they have any effect.

It might take a long time, or it may have no effect on the existing FIN-WAIT-2 connections: the number hasn’t changed (in fact, it has even increased slightly). :joy:
I’m not planning to restart the proxy server yet, so that I can keep troubleshooting the issue.

PS: note that tcp_tw_reuse works on **TIME_WAIT** sockets (not on FIN-WAIT-2)

Yes, I understand. For now, I’m just trying anything that might help; net.ipv4.tcp_fin_timeout is already set to 30 seconds.

Just for info, have you tried checking which peers are involved in the FIN-WAIT-2 state? Just to see if the issue can be isolated to a specific host or domain?

Like using ss or tcpdump?

Of course, we’ve analyzed that. Since our KumoMTA deployment is in China, and it’s well known that emails sent from China are often treated differently, we see a large number of connection issues.

ss -ant state fin-wait-2

0                    0                                    myip:36737                               203.138.180.112:25
xxxx
many
many
many
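
For isolating specific hosts, the FIN-WAIT-2 list can be grouped by peer address. A minimal sketch, assuming the four-column layout shown above (Recv-Q, Send-Q, local, peer) that `ss` prints when a state filter is used; `top_fin_wait2_peers` is a hypothetical helper name:

```shell
# Group `ss -ant state fin-wait-2` output (read on stdin) by peer IP.
# $1 optionally limits how many peers are printed (default 20).
top_fin_wait2_peers() {
    awk 'NR > 1 {
             peer = $4
             sub(/:[^:]*$/, "", peer)   # strip the trailing :port
             n[peer]++
         }
         END { for (p in n) printf "%8d %s\n", n[p], p }' |
        sort -k1,1nr | head -n "${1:-20}"
}
```

Usage would be `ss -ant state fin-wait-2 | top_fin_wait2_peers`. A handful of peers holding most of the sockets would point at specific destinations; a flat distribution would point back at the proxy.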

I also looked it up on DeepWiki, and they suspect that the proxy server may have connection leaks in certain situations. That’s why I reported it here for review.

Connections in the FIN-WAIT-2 state are controlled by the Linux kernel’s tcp_fin_timeout parameter, which is typically set to 60 seconds by default. This means that even if the application layer doesn’t properly close the connection, the operating system will automatically clean it up after the timeout.

So, I’m not quite sure where the problem actually lies.
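
One thing that may be relevant here: as far as I understand the kernel documentation, tcp_fin_timeout covers orphaned connections, that is, sockets no longer referenced by any application. A FIN-WAIT-2 socket that still has an open descriptor in the proxy would not be expired by that timer, which would fit an fd count that grows together with FIN-WAIT-2. A rough way to check, assuming `ss -p` is run with enough privileges to show process information (`fin_wait2_ownership` is a hypothetical helper name):

```shell
# Split `ss -antp state fin-wait-2` output (read on stdin) into sockets that
# still belong to a process (they carry a users:(...) field) and orphaned ones.
fin_wait2_ownership() {
    awk 'NR > 1 { if ($0 ~ /users:\(/) owned++; else orphaned++ }
         END { printf "owned=%d orphaned=%d\n", owned, orphaned }'
}
```

Usage would be `sudo ss -antp state fin-wait-2 | fin_wait2_ownership`. A large `owned` count would suggest the descriptors are still held by the application rather than waiting on the kernel timer.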

I upgraded the proxy server. After restarting, no issues were found. I’ll continue to monitor it.