Kimmo L writes:
Our Mailman + HyperKitty setup runs on RHEL 8 and yes, all the services are systemd units. The host has 8 GB of swap at the moment. In addition to the above-mentioned services, there are also NGINX and Postfix running.
OK. That's helpful to know.
Have you modified any web pages associated with Mailman, including any of the Django templates? Are there any pages with unusually long load times?
No, and usually all pages are very smooth. Only if I open the archives does it sometimes take a second or three to load.
That's what I would expect.
I also noticed that after midnight, once one of the mailman-web cronjobs finished, the free memory jumped from 100 MB to 1 GB.
Search indexes are quite large, and constructing them keeps a lot of data active in memory at once. I'm not at all surprised that some of the cronjobs take ~1 GB.
The thing about Linux is that it hates free memory. It thinks that memory should be used. If you open a data file, then close it, Linux will leave it in memory along with the metadata needed to find that memory without reading the file again. (Consider the case of a script where the first process writes a temporary file, closes it, and exits, then the second process opens the file and reads it. Big win, and this is not an uncommon case.) This is what is meant by "cache" in Linux kernel memory statistics.
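If you want to see that in action, something like the following rough sketch will do (the temp file path is made up, and the numbers will be noisy on a busy host):

    #!/usr/bin/env python3
    # Watch the kernel's "Cached" figure grow as a file passes through
    # the page cache: one process writes it and exits, a second reads it.
    import os
    import subprocess

    def cached_kb():
        # The "Cached:" line in /proc/meminfo, in kB.
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("Cached:"):
                    return int(line.split()[1])

    PATH = "/tmp/cache-demo.bin"  # hypothetical temporary file

    before = cached_kb()
    # "First process": write ~200 MB, close, exit.
    subprocess.run(
        ["python3", "-c",
         "open('{}', 'wb').write(b'x' * 200_000_000)".format(PATH)],
        check=True)
    # "Second process" (this one): read it back.  The pages are still
    # in the cache the writer left behind, so no disk read is needed.
    data = open(PATH, "rb").read()
    print("Cached grew by ~{} MB".format((cached_kb() - before) // 1024))
    os.remove(PATH)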
And then it will start to decrease again.
Because of caching, free memory is not the right statistic to look at to check for memory leaks when you're running Linux. It's free + cache. If that total consistently decreases, you have either a leak or a process that's building a large data structure. For example, when the cronjobs go on their midnight run, I'm sure you see not only free memory decrease, but cache as well. Then when they're done, boom! you see free memory jump by ~1 GB.
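On Linux you can get both numbers from /proc/meminfo. A minimal sketch (strictly speaking, Buffers and SReclaimable are reclaimable too, but free + Cached is close enough for spotting a trend):

    #!/usr/bin/env python3
    # Print MemFree, Cached, and their sum from /proc/meminfo, in MB.
    def meminfo_kb():
        fields = {}
        with open("/proc/meminfo") as f:
            for line in f:
                # Lines look like "MemFree:      123456 kB".
                key, rest = line.split(":", 1)
                fields[key] = int(rest.split()[0])
        return fields

    m = meminfo_kb()
    free, cache = m["MemFree"], m["Cached"]
    print("free {} MB, cache {} MB, free+cache {} MB".format(
        free // 1024, cache // 1024, (free + cache) // 1024))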
When a program builds a large structure in memory, it does so because it needs to write to it a lot. It's definitely volatile during the run, and normally it varies from run to run. So there's no provision for caching it: when a program that creates a large data structure exits, that memory is returned to the free pool. In fact, how do programs cache such structures? They write them to files! So the first run doesn't use the kernel cache for that data, but the second run checks for the file, reads it, and after that it will be in the kernel's cache until some process demands the memory.
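That pattern looks something like this (a sketch; build_index and the cache path are invented for the example):

    #!/usr/bin/env python3
    # Cache an expensive in-memory structure by writing it to a file.
    # The first run pays full price; later runs read the file back, and
    # the kernel keeps those pages cached until something needs the memory.
    import os
    import pickle

    CACHE = "/tmp/index.pickle"  # hypothetical cache file

    def build_index():
        # Stand-in for expensive work, e.g. building a search index.
        return {n: str(n) * 10 for n in range(100_000)}

    if os.path.exists(CACHE):
        with open(CACHE, "rb") as f:
            index = pickle.load(f)
    else:
        index = build_index()
        with open(CACHE, "wb") as f:
            pickle.dump(index, f)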
I will try to investigate a little bit more and play a little with the services to be sure whether the issue is related to Mailman, to Gunicorn, or to something else.
Write a script that prints free + cache and the top ten processes from ps, sorted by VM size and also by RSS. Run it every 15 minutes or so; it should help catch the culprit.
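Something along these lines would do (a sketch, not tested on your box; the ps sort keys and /proc/meminfo fields are standard procps/Linux, and the paths in the crontab line are made up):

    #!/usr/bin/env python3
    # Log free + cache, then the top ten processes by VSZ and by RSS.
    import subprocess
    import time

    def meminfo_kb(field):
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith(field + ":"):
                    return int(line.split()[1])
        return 0

    print(time.strftime("%Y-%m-%d %H:%M:%S"))
    free = meminfo_kb("MemFree")
    cache = meminfo_kb("Cached")
    print("free {} MB, cache {} MB, free+cache {} MB".format(
        free // 1024, cache // 1024, (free + cache) // 1024))

    for key in ("vsz", "rss"):
        print("top 10 by {}:".format(key.upper()))
        out = subprocess.check_output(
            ["ps", "-eo", "pid,vsz,rss,comm", "--sort", "-" + key],
            universal_newlines=True)
        # Line 1 of ps output is the header; keep it plus ten rows.
        print("\n".join(out.splitlines()[:11]))

and a crontab entry like

    */15 * * * * /usr/local/bin/memwatch >> /var/log/memwatch.log 2>&1

should give you a record to compare against the midnight jump.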
Steve