Follow-up / Additional information
The issue happened again after the initial recovery.
Before restarting gromox-http, I captured the state of the process while EWS was not responding.
At that time, gromox-http was still active and running:
PID: 16652
User: gromox
Elapsed time: 01:20:29
Threads: 47
Command: /usr/libexec/gromox/http
The local EWS endpoint was still timing out:
curl -sS -o /dev/null -w "HTTP=%{http_code} TIME=%{time_total}\n" --max-time 10 http://127.0.0.1:10080/EWS/Exchange.asmx
Result:
HTTP=000 TIME=10.001916
curl: (28) Operation timed out after 10001 milliseconds with 0 bytes received
During the failure, ss showed many sockets owned by the gromox-http process stuck in CLOSE-WAIT on port 10443.
Examples, anonymized:
CLOSE-WAIT ... [::1]:10443 [::1]:xxxxx users:(("http",pid=16652,fd=xx))
CLOSE-WAIT ... [::ffff:127.0.0.1]:10443 [::ffff:127.0.0.1]:xxxxx users:(("http",pid=16652,fd=xx))
There was also at least one CLOSE-WAIT connection on port 10080:
CLOSE-WAIT ... [::ffff:127.0.0.1]:10080 [::ffff:127.0.0.1]:xxxxx users:(("http",pid=16652,fd=xx))
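To put a number on "many", the CLOSE-WAIT sockets can be tallied per local port. This is a minimal sketch, not something I ran during the failure. The function name count_close_wait_by_port is made up for illustration, and it assumes input in the unfiltered `ss -Htanp` layout, where the state is the first column and the local address is the fourth.

```shell
#!/bin/sh
# Hypothetical helper: count CLOSE-WAIT sockets per local port.
# Reads `ss -Htanp` output on stdin (state in column 1, local address in column 4).
# Optional $1: only count lines whose process column mentions this PID.
# The process column is normally only filled in when ss runs as root.
count_close_wait_by_port() {
    pid_filter="${1:-}"
    awk -v pid="$pid_filter" '
        $1 == "CLOSE-WAIT" && (pid == "" || index($0, "pid=" pid ",") > 0) {
            # The port is whatever follows the last colon, which also
            # works for bracketed IPv6 addresses such as [::1]:10443.
            n = split($4, parts, ":")
            count[parts[n]]++
        }
        END { for (port in count) print port, count[port] }
    ' | sort -n
}

# Example use during a hang (16652 was the PID in this report):
#   ss -Htanp | count_close_wait_by_port 16652
```

Each output line is a local port followed by its count, so a growing figure for 10443 over repeated runs would show whether the sockets accumulate gradually or appear all at once.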
Some internal connections were still established, for example to [::1]:5000 and [::1]:6666.
The gromox-http threads were mostly waiting in:
- hrtimer_nanosleep
- futex_wait_queue
- do_epoll_wait
- do_sys_poll
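A per-thread summary like the one above can be collected from /proc in one step. The sketch below is only an illustration under some assumptions: summarize_wchan is a hypothetical name, it relies on the Linux /proc/PID/task/TID/wchan files being readable (usually as root), and the second argument exists only so the function can be pointed at a different directory for testing.

```shell
#!/bin/sh
# Hypothetical helper: count how many threads of a process sit in each
# kernel wait channel, most common first.
# $1: PID to inspect
# $2: proc root, defaults to /proc (assumption: Linux procfs with wchan support)
summarize_wchan() {
    pid="$1"
    proc_root="${2:-/proc}"
    for f in "$proc_root/$pid"/task/*/wchan; do
        # Skip unreadable entries and the unmatched glob pattern itself.
        [ -r "$f" ] || continue
        # wchan files have no trailing newline, so add one per thread.
        printf '%s\n' "$(cat "$f")"
    done | sort | uniq -c | sort -rn
}

# Example use during a hang (16652 was the PID in this report):
#   summarize_wchan 16652
```

A thread that is running rather than blocked typically reports 0 here, so this only describes where the sleeping threads are waiting, not why.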
The journal still showed repeated messages like:
exmdb-audit: truncated message /var/lib/gromox/user/example.tld/example-user:f1966080:m1966091 (rewrite)
exmdb-audit: truncated message /var/lib/gromox/user/example.tld/example-user:f1966080:m1966116 (rewrite)
I do not know whether these exmdb-audit messages are related to the EWS hang, but they appear repeatedly around the same period.
Restarting only gromox-http immediately restored EWS again:
systemctl restart gromox-http
After restart:
HTTP=405 TIME=0.000407
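The two probe results differ only in the HTTP code: curl reports 000 when it received no HTTP reply at all, while any real status (here 405, presumably because the probe is a plain GET) means the server answered. A small sketch of how a monitoring script could tell the two apart follows. The function name classify_probe and its three labels are my own invention, and it assumes the exact `HTTP=<code> TIME=<seconds>` format used in the curl command above.

```shell
#!/bin/sh
# Hypothetical helper: classify one line of curl -w output of the form
# "HTTP=<code> TIME=<seconds>".
# Prints:
#   no-response  when the code is 000 (curl got no HTTP reply, e.g. a timeout)
#   responding   for any other three-digit code
#   unparsed     when the line does not match the expected format
classify_probe() {
    code=$(printf '%s\n' "$1" | sed -n 's/^HTTP=\([0-9][0-9][0-9]\) .*/\1/p')
    case "$code" in
        "")  echo "unparsed" ;;
        000) echo "no-response" ;;
        *)   echo "responding" ;;
    esac
}

# Example use with the same probe as in this report:
#   line=$(curl -sS -o /dev/null -w 'HTTP=%{http_code} TIME=%{time_total}' \
#          --max-time 10 http://127.0.0.1:10080/EWS/Exchange.asmx 2>/dev/null)
#   classify_probe "$line"
```

Run periodically, a check like this could trigger the diagnostic capture before anyone restarts the service.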
So the issue is reproducible on this system: gromox-http remains active, but EWS stops responding locally until the service is restarted.
The large number of CLOSE-WAIT sockets on 10443 may be relevant. Please let me know if there are specific debug logs, a backtrace, gcore, strace, lsof output, or any other diagnostic commands you would like me to capture if the issue happens again.