• Solved / As designed
  • Slow web interface / nginx requests & 504 Gateway Timeout with nginx reverse proxy

Hi there,

While migrating a company to grommunio I noticed that the web interface (which has to be reached through nginx as a reverse proxy) is terribly slow – most requests end in "504 Gateway Timeout". Looking at the browser console I see things like this:

This is not a big mailbox – it only contains a few mails, and there are no shared mailboxes attached to it. So the number of mails cannot be the cause.

If I click through the folders (Inbox, Outbox, Sent and so on), the spinner spins almost the whole time, and every click adds another four requests to the queue, all of which time out after exactly 60 seconds:

There are no errors in any of grommunio's logs, so the root cause seems to be the reverse proxy setup or grommunio's nginx being slow. Grommunio itself is mostly idle.

Here's the reverse proxy configuration in front of grommunio. I have already played around with the keepalive, keepalive_timeout, proxy_buffering and proxy_request_buffering parameters according to some tutorials on the web. Result: nothing changed.

upstream grommunio01-web1 {
        server 10.0.0.30:443;
        keepalive 2;
        keepalive_timeout 3h;
#       keepalive_requests 50;
#       keepalive_timeout 60s;
}

upstream grommunio01-web2 {
        server 10.0.0.30:8443;
        keepalive 2;
        keepalive_timeout 3h;
#        keepalive_requests 50;
#        keepalive_timeout 60s;
}

# Redirect HTTP requests to HTTPS
server {
        listen 80;
        listen [::]:80;

        server_name                             mail.domain.de autodiscover.domain.de;

        error_log       /var/log/nginx/error_80_mail.domain.de.log;
        access_log      /var/log/nginx/access_80_mail.domain.de.log;

        return 301 https://$server_name$request_uri;
}

server {
        listen 443 ssl http2;
        listen [::]:443 ssl http2;

        # CHANGE-SERVER-NAME-HERE
        server_name                             mail.domain.de autodiscover.domain.de;

        # !!! WILDCARD SSL CERTIFICATE !!!
        ssl_certificate                 /etc/ssl/mail.domain.de.pem;
        ssl_certificate_key             /etc/ssl/mail.domain.de.key;

        include ssl_params;

        # CHANGE-SERVER-NAME-HERE
        error_log       /var/log/nginx/error_443_mail.domain.de.log;
        access_log      /var/log/nginx/access_443_mail.domain.de.log;

        charset utf-8;
#       client_max_body_size 50m;

        # Set global proxy settings
        proxy_read_timeout    3h;
        proxy_http_version    1.1;
        proxy_buffering        off; # Some tutorial says this is not recommended
        proxy_request_buffering    off; # Some tutorial says this is not recommended

        proxy_pass_request_headers    on;

        proxy_pass_header    Date;
        proxy_pass_header    Server;
        proxy_pass_header    Authorization;

        proxy_set_header    Host $host;
        proxy_set_header    X-Real-IP $remote_addr;
        proxy_set_header    X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header    X-Forwarded-Proto $scheme;
        proxy_set_header    Accept-Encoding "";
        proxy_set_header    Connection "Keep-Alive";

        #more_set_input_headers    'Authorization: $http_authorization';
        #more_set_headers    -s 401 'WWW-Authenticate: Basic realm="10.0.0.30"';

        client_max_body_size    0;

        location /web        { proxy_pass https://grommunio01-web1/web; }
        location /chat       { proxy_pass https://grommunio01-web1/chat; }
        location /meet       { proxy_pass https://grommunio01-web1/meet; }
        location /files      { proxy_pass https://grommunio01-web1/files; }
        location /archive    { proxy_pass https://grommunio01-web1/archive; }

        location /           { proxy_pass https://grommunio01-web2/; }
        location /owa        { proxy_pass https://grommunio01-web2/owa; }
        location /OWA        { proxy_pass https://grommunio01-web2/owa; }
        location /EWS        { proxy_pass https://grommunio01-web2/EWS; }
        location /ews        { proxy_pass https://grommunio01-web2/EWS; }
        location /Microsoft-Server-ActiveSync { proxy_pass https://grommunio01-web2/Microsoft-Server-ActiveSync; }
        location /mapi       { proxy_pass https://grommunio01-web2/mapi; }
        location /MAPI       { proxy_pass https://grommunio01-web2/mapi; }
        location /rpc        { proxy_pass https://grommunio01-web2/Rpc; }
        location /RPC        { proxy_pass https://grommunio01-web2/Rpc; }
        location /oab        { proxy_pass https://grommunio01-web2/OAB; }
        location /OAB        { proxy_pass https://grommunio01-web2/OAB; }
        location /autodiscover  { proxy_pass https://grommunio01-web2/Autodiscover; }
        location /Autodiscover  { proxy_pass https://grommunio01-web2/Autodiscover; }
}
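Two side notes on the config above, both taken from the nginx documentation (not necessarily the cause of the timeouts):

```nginx
# 1) With several names in server_name, $server_name expands to the FIRST
#    one, so the port-80 redirect sends autodiscover.domain.de users to
#    mail.domain.de. $host preserves the name the client actually asked for:
return 301 https://$host$request_uri;

# 2) For upstream keepalive the docs call for HTTP/1.1 plus a cleared
#    Connection header (rather than forcing "Keep-Alive"):
proxy_http_version 1.1;
proxy_set_header   Connection "";
```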

And of course I did restart nginx after changing the configuration with systemctl restart nginx.

Any hint appreciated 🥰

Some of the tutorials / guides I followed:
https://community.grommunio.com/d/291-grommunio-with-a-nginx-reverse-proxy-in-front
https://community.grommunio.com/d/91-solved-nginx-reverse-proxy
https://docs.nginx.com/nginx/deployment-guides/load-balance-third-party/microsoft-exchange/
https://www.nginx.com/blog/avoiding-top-10-nginx-configuration-mistakes/

Try doing a packet trace between the reverse proxy and the grommunio server. Do you see any errors in the trace?

    Same here, but I have an even bigger problem than this.
    The web interface is so slow that it is unusable. The problem only happens under heavier traffic or while Outlook syncs in cached mode. When Outlook is closed, everything works well.

    But it is the same when an online profile is used…
    I will check this when Outlook works again without trouble.

    WalterH I already captured this a few hours ago – not really helpful from my perspective. I will post the trace as soon as the copying of the big PST files has finished, as it is affecting the overall performance of the system.

    OK, one thing is terribly wrong in my whole config … I don't really know why I redirected all Outlook/Exchange requests to 8443 instead of 443 😅🙈 Will fix that first.


      sweetgood

      My customers also reported this problem relatively often today. It occurs when the system is overloaded or when several loading operations happen at the same time, and also when the mailbox is large.
      The loading symbol spins, nginx reports errors, and then the browser shows a 503 with the red box.

      After waiting 5 minutes everything works normally again; a new login is required.
      This currently happens about every 20 minutes for my customers.

      Although I'm German, switching between English and German in posts is not very helpful for non-Germans, I think. So I'll stick to English 😃

      This case is clear to me:
      Every client (💻️) sends two requests to /grommunio.php?subsystem=webapp_0815XXX0815 on every folder action inside the web client.
      This action is passed through my nginx reverse proxy (⚙️) to grommunio's nginx (📩) – I can see that from the logs. BUT:
      As logs are written after success/failure and not at the time of the request, I added a little "echo to file" routine at the beginning of grommunio.php to find out whether the file is called and how often.

      So far, so good.

      My nginx reverse proxy (⚙️) closes the upstream request to grommunio's nginx (📩) after the configured number of seconds, and grommunio's nginx (📩) logs a 499 (client closed connection), which is true. Grommunio's nginx (📩) also reports the error upstream timed out (110: Connection timed out) while reading response header from upstream in /var/log/nginx/nginx-web-error.log.
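      To quantify this, the timeouts can be counted straight from that log (path as quoted above):

```shell
# Count occurrences of the quoted upstream-timeout error in grommunio's
# nginx error log; the path is the one mentioned in the post.
log=/var/log/nginx/nginx-web-error.log
grep -c 'upstream timed out (110' "$log" 2>/dev/null || true
```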

      And why? Because the PHP requests take way too long.

      After logging in, ALL requests to grommunio.php get through fast and reliably. Always.

      As soon as I start clicking through three or four folders, the requests queue up and nginx waits for the timeout because the PHP processes finish veeeeery late. And again – the mailbox is mostly empty.

      Some eye candy:

      Clicking a folder in the web client results in two requests to grommunio.php, where the second one did not finish within the 5 s timeframe (I set that for testing purposes):

      Same procedure with clicking through some more folders:

      I increased the timeout and as you can see the requests are "stuck" but were sent to grommunio.php just fine:

      And they finished after around 13–50 s (!!!). If that's the case while NOBODY else is using the system and it's just idling, I can imagine that the whole system breaks down as soon as more than one person uses the web client 😂😅
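      A quick way to tell whether the time is spent in PHP or somewhere in the proxy chain is to time the request against the backend directly; host and path below are taken from this thread, while the subsystem value is made up:

```shell
# Time a request straight at the grommunio host, bypassing the reverse
# proxy; run the same against the proxy URL and compare the numbers.
curl -ks -o /dev/null \
  -w 'connect=%{time_connect}s  ttfb=%{time_starttransfer}s  total=%{time_total}s\n' \
  'https://10.0.0.30/web/grommunio.php?subsystem=webapp_test' \
  || echo 'backend not reachable from this machine'
```

      If ttfb is already in the tens of seconds here, the reverse proxy is off the hook.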

      I know these kinds of "problems" from Nextcloud, where the only solution was to increase the number of PHP pool workers. I tried that here, but it didn't change anything, so it must be a database request or something else that simply takes this long on each request.
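      For reference, those worker experiments happen in the php-fpm pool configuration (file location varies by distro; the values below are purely illustrative, not grommunio defaults). The slowlog directives are the more useful part for this problem, because they write a PHP backtrace for every request that exceeds the threshold:

```ini
; Illustrative php-fpm pool settings (tune to the machine, not a recipe):
pm                      = dynamic
pm.max_children         = 32
pm.start_servers        = 8
pm.min_spare_servers    = 4
pm.max_spare_servers    = 16
; Write a backtrace for every request slower than 5 s:
request_slowlog_timeout = 5s
slowlog                 = /var/log/php-fpm-slow.log
```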

      The only solution that comes to my mind is to abort PHP requests as soon as someone clicks on another folder. Without this, lots of requests can pile up on the server, which just slows down the whole machine for everyone. And since I have already clicked on another folder, the old request is useless by the time it finishes (after 30–45 s 😂), because I have clicked on five other folders in the meantime.

      I increased the nginx timeout to 120s for now – let's see what the next day in production brings.

      Will ask Grommunio GmbH for support here, as this should be solved somehow.

      EDIT: As the pictures are just too small here (too bad), I uploaded them again here: https://next.sweetba.se/s/ZCn353zSJoaCBsc

      The connection could also be related to other effects. When I start the WebAPP or DeskAPP and the content has loaded, it initially appears that I can start working. However, when I click on an element, the "rotating pause wheel" (known elsewhere as an hourglass) appears first, and I have to wait until it is gone; only then can I work and jump through the directories. It seems as if the connection to the database is only established at that point, but this may also have something to do with the issue described here. Another factor could be the synchronization problems with Outlook that are being discussed here.

      These are just my guesses, meant as food for thought.

      NOBODY else is using the system

      Makes timing tests much easier. 😉

      zcore_log_level=6
      zrpc_debug=2

      Set that in zcore.cfg, SIGHUP it (systemctl reload gromox-zcore), and watch the log for any unreasonable outliers in the timing information. This gives an indication of which direction to look in: "reasonable" zrpc execution times mean the issue is likely higher up (php-fpm), while high execution times mean the issue is likely in the lower levels (zcore, gxhttp, sqlite). Perhaps it's also apparent by running /usr/bin/top with a 0.1 s interval and just seeing who's pulling the CPU time during longer requests – on the odd chance that something is downloading 100000 rows at once.

      I want to say a HUGE THANKS to the whole grommunio team and especially the support team. They helped me immediately, and we tracked the issue down to one thing: IO waits... 🥳

      For others who might have similar problems – it's NOT a grommunio issue at all.

      Some background information:
      The machine is hosted at Hetzner (it's an EX42) and has enterprise HDDs (ST4000NM0245-1Z2) inside. But as you might know, spinning disks are nowhere near as fast as SSDs, even on SATA 6 Gb/s. Proxmox is the virtualisation host, and apart from grommunio (which is the only real production VM) there are four LXC containers (2x Nextcloud, 1x proxy, 1x Vaultwarden) on the machine. For migration purposes I also set up two Win10 VMs to be able to copy backed-up PST files faster – but they will be shut down soon.

      Now the overall IO wait fluctuates heavily and has consistently been between 10% and 50%. This causes grommunio's nginx to run into its 65 s timeout (/etc/nginx/nginx.conf: keepalive_timeout 65).

      So everything would be fine IF the devops guy had done his job right 😅🙈

      "Again what learned" – and maybe a hint for others who run into similar issues.

      Apart from that, it's helpful to use tools like vmstat 1 or iotop to track down IO issues.
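      Without extra tools, iowait can also be sampled directly from /proc/stat (Linux only; a rough sketch that ignores irq/steal time):

```shell
# Read the aggregate "cpu" line twice, one second apart; field 6 holds
# the iowait jiffies. The percentage is approximate (irq/steal ignored).
read -r _ u1 n1 s1 i1 w1 _ < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 w2 _ < /proc/stat
total=$(( (u2 + n2 + s2 + i2 + w2) - (u1 + n1 + s1 + i1 + w1) ))
wait=$(( w2 - w1 ))
echo "iowait: $(( 100 * wait / total ))%"
```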

      And I've seen that it's possible to upgrade the drives on several servers from SATA to NVMe:
      https://docs.hetzner.com/robot/dedicated-server/general-information/root-server-hardware/

      I have this problem too, after upgrading the appliance to 2023.11.1. A ticket has been open for about a week now, currently with no response.
      Grommunio is running on SAS drives; there are other Java-based web applications with more load, but they don't have this problem.

      Florian

      This is how the request times look after moving the VM from SATA HDD to NVMe SSD:

      © 2020-2024 grommunio GmbH. All rights reserved. | https://grommunio.com | Data Protection | Legal notice