My django container keeps getting shut down by oom-killer
Greetings,
Ok, I just updated the docker containers that are running my Mailman 3 installation over the weekend. mailman-core is now 3.3.10, with mailman-nginx at 1.28.0, mailman-postfix at 3.10.5-r0 and mailman-django-uwsgi at "latest".
I think I have everything working normally: messages seem to be processed correctly, and I can access the web interface. I have also updated various scripts to use --run-as-root until I can get around to reconfiguring mailman-core to use a non-root user.
Anyway, since then, the EC2 instance that the containers are running on periodically becomes "unreachable", to use AWS's parlance. I can restore functionality by rebooting the instance. According to AWS, this is an application or operating system issue, not a problem with the instance itself, which means auto-recovery doesn't get triggered.
This always happens overnight, usually after midnight, though one time it was around 10:40. I don't see any crontab entries that correlate, and the after-midnight occurrences don't always happen at the same time, so it seems unlikely that a crontab job is causing it.
I finally looked at the host (rather than the logs of each container) and realized that the oom-killer was killing django-admin. As a temporary workaround, I've limited memory on the mailman-django-uwsgi container to 4 GiB (though, since I just did that this morning, I won't know until after midnight (presumably) if it worked or not), and set the RestartPolicy to always.
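Roughly the following, assuming the container is managed directly with docker rather than through compose:

    # cap the container at 4 GiB (keeping the swap ceiling equal to memory so it
    # can't page its way past the limit) and have docker restart it if it does get
    # killed; container name as in my setup
    docker update --memory 4g --memory-swap 4g --restart always mailman-django-uwsgi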
I'm thinking that this points to a memory leak of some kind?
I did have to add a bunch of packages to my Dockerfile when I rebuilt the container, since the last build was like 5 years ago, so I suppose it could be that django is just using more memory than it used to, though at the moment, I see that it is around 1 GiB:
FROM python:3.12-alpine3.22
RUN apk update && apk add --no-cache --virtual build-deps \
        cargo gcc libffi-dev musl-dev && \
    apk add --no-cache gettext postgresql-dev sassc build-base pcre-dev python3-dev libc-dev linux-headers && \
    pip install --upgrade pip setuptools wheel && \
    pip --no-cache-dir install django-haystack==3.3.0 hyperkitty==1.3.12 \
        postorius==1.3.13 psycopg2 uwsgi==2.0.25 whoosh && \
    apk update && apk del --no-cache build-deps
WORKDIR /opt/mailman/mailman-suite/mailman-suite_project
USER mail
EXPOSE 8000
CMD ["uwsgi", "--ini", "uwsgi.ini"]
Any insights, ideas or assistance would be appreciated.
Thanks!
Pat Hirayama
Pronouns: he/him/his
Systems Engineer
IT | Systems Engineering - Infrastructure
Fred Hutch Cancer Center
O 206.667.4856
phirayam@fredhutch.org
Hirayama, Pat writes:
I finally looked at the host (rather than the logs of each container) and realized that the oom-killer was killing django-admin.
I don't have a django-admin process in my installation as far as I know. (I didn't check to see if some process renames itself 'django-admin' though.) Are you sure that's what got killed?
As a temporary workaround, I've limited memory on the mailman-django-uwsgi container to 4 GiB (though, since I just did that this morning, I won't know until after midnight (presumably) if it worked or not), and set the RestartPolicy to always.
What do you mean by "work"? If you've got a process blowing past 4GB, that's probably going to die of ENOMEM. I hope it doesn't manage to try to allocate 10GB.
I'm thinking that this points to a memory leak of some kind?
I would think not. Something's allocating gobs of memory and it's not getting collected, but I doubt it's a process forgetting to delete garbage. I think it's just a runaway.
I did have to add a bunch of packages to my Dockerfile when I rebuilt the container, since the last build was like 5 years ago, so I suppose it could be that django is just using more memory than it used to, though at the moment, I see that it is around 1 GiB:
I've run a Debian system in 1GB, including the kernel, cloud admin stuff, PostgreSQL, Postfix, nginx, Xapian, and the Mailman suite. It always OOM'd during a Xapian reindexing run. :-) Upped the droplet to 2GB, and it's fine on memory.
I have seen reports that uwsgi systems use a lot more memory than gunicorn systems. I don't have hands-on experience to confirm or analyze why, though. I'm not sure using Whoosh (instead of Xapian, Elasticsearch, or Solr) is a good idea -- I found it to be *extremely* slow on initial indexing of a system with lots of archives migrated from Mailman 2, and I wouldn't be surprised if it uses a lot of memory (since then I have stuck to Xapian, so no confirmation or analysis).
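If it does turn out to be a uwsgi worker ballooning, uwsgi itself has knobs to recycle oversized workers before the kernel's oom-killer steps in. A minimal sketch of the relevant uwsgi.ini settings (values arbitrary, not something I've tuned for the Mailman suite):

    [uwsgi]
    processes = 2
    # recycle a worker once its RSS exceeds this many megabytes
    reload-on-rss = 1024
    # also recycle workers after this many requests, which papers over slow leaks
    max-requests = 1000
    # abort any single request that runs longer than 60 seconds
    harakiri = 60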
--
GNU Mailman consultant (installation, migration, customization)
Sirius Open Source https://www.siriusopensource.com/
Software systems consulting in Europe, North America, and Japan
On Thu, Nov 13, 2025 at 6:35 PM Stephen J. Turnbull <steve@turnbull.jp> wrote:
Hirayama, Pat writes:
I finally looked at the host (rather than the logs of each container) and realized that the oom-killer was killing django-admin.
[snip]
I have seen reports that uwsgi systems use a lot more memory than gunicorn systems. I don't have hands-on experience to confirm or analyze why, though. I'm not sure using Whoosh (instead of Xapian, Elasticsearch, or Solr) is a good idea -- I found it to be *extremely* slow on initial indexing of a system with lots of archives migrated from Mailman 2, and I wouldn't be surprised if it uses a lot of memory (since then I have stuck to Xapian, so no confirmation or analysis).
I can attest to the fact that gunicorn uses much less memory than uwsgi. On my server, I switched from uwsgi to gunicorn and the memory consumption dropped drastically - to almost nothing! Well, not nothing, but the constant 2GB that the mailman (uwsgi) process was using suddenly disappeared from my btop radar.
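For the Dockerfile posted earlier in the thread, the switch is roughly a two-line change; a sketch (the WSGI module path is a guess -- use whatever <project>/wsgi.py sits in that project directory):

    # drop uwsgi from the pip install line (or leave it) and add gunicorn, then:
    RUN pip --no-cache-dir install gunicorn
    CMD ["gunicorn", "--workers", "2", "--bind", "0.0.0.0:8000", "wsgi:application"]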
--
Best regards,
Odhiambo WASHINGTON,
Nairobi, KE
+254 7 3200 0004/+254 7 2274 3223
In an Internet failure case, the #1 suspect is a constant: DNS.
"Oh, the cruft.", egrep -v '^$|^.*#' ¯\_(ツ)_/¯ :-)
[How to ask smart questions: http://www.catb.org/~esr/faqs/smart-questions.html]
-----Original Message-----
From: Stephen J. Turnbull <steve@turnbull.jp>
Sent: Thursday, November 13, 2025 7:35 AM

Hirayama, Pat writes:
I finally looked at the host (rather than the logs of each container) and realized that the oom-killer was killing django-admin.
I don't have a django-admin process in my installation as far as I know. (I didn't check to see if some process renames itself 'django-admin' though.) Are you sure that's what got killed?
Pretty sure:

Nov 12 00:08:00 lists kernel: [60897.148840] systemd invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Nov 12 00:08:00 lists kernel: [60897.149647] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/docker/bae673e621f3e3c380c360ca891c5737803384f8b87006af73bb9a679909cea6,task=django-admin,pid=552788,uid=8
Sure enough to focus my attention on my django container.
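The task_memcg hash in that second line maps straight back to the container, e.g.:

    # confirm which container owns the cgroup id from the oom-kill line
    docker inspect --format '{{.Name}}' bae673e621f3e3c380c360ca891c5737803384f8b87006af73bb9a679909cea6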
As a temporary workaround, I've limited memory on the mailman-django-uwsgi container to 4 GiB (though, since I just did that this morning, I won't know until after midnight (presumably) if it worked or not), and set the RestartPolicy to always.
What do you mean by "work"? If you've got a process blowing past 4GB, that's probably going to die of ENOMEM. I hope it doesn't manage to try to allocate 10GB.
It seems to take time to have an impact. The instance remains available for several hours -- usually becoming unavailable while I'm sleeping.
I'm thinking that this points to a memory leak of some kind?
I would think not. Something's allocating gobs of memory and it's not getting collected, but I doubt it's a process forgetting to delete garbage. I think it's just a runaway.
That's a good suggestion.
FWIW, the instance has remained available all night, so limiting memory on the container seems to be working for now. <snip>
I have seen reports that uwsgi systems use a lot more memory than gunicorn systems. I don't have hands-on experience to confirm or analyze why, though. I'm not sure using Whoosh (instead of Xapian, Elasticsearch, or Solr) is a good idea -- I found it to be *extremely* slow on initial indexing of a system with lots of archives migrated from Mailman 2, and I wouldn't be surprised if it uses a lot of memory (since then I have stuck to Xapian, so no confirmation or analysis).
I'll take a look at that. Thanks, Stephen!
-p
Pat Hirayama
Pronouns: he/him/his
Systems Engineer
IT | Systems Engineering - Infrastructure
Fred Hutch Cancer Center
O 206.667.4856
phirayam@fredhutch.org
On 11/12/25 17:10, Hirayama, Pat wrote:
This always happens overnight, usually after midnight, though one time it was around 10:40. I don't see any crontab entries that correlate, and the after-midnight occurrences don't always happen at the same time, so it seems unlikely that a crontab job is causing it.
I finally looked at the host (rather than the logs of each container) and realized that the oom-killer was killing django-admin. As a temporary workaround, I've limited memory on the mailman-django-uwsgi container to 4 GiB (though, since I just did that this morning, I won't know until after midnight (presumably) if it worked or not), and set the RestartPolicy to always.
The above suggests it's one of the Django daily jobs that's hitting the OOM. These jobs are
appname - jobname - when - help
django_extensions - cache_cleanup - daily - Cache (db) cleanup Job
django_extensions - daily_cleanup - daily - Django Daily Cleanup Job
hyperkitty - orphan_emails - daily - Reattach orphan emails
hyperkitty - recent_threads_cache - daily - Refresh the recent threads cache
hyperkitty - sync_mailman - daily - Sync user and list properties with Mailman
I don't offhand know which of these might be a memory hog. Further, if one of these dies from running out of memory, I would expect that to affect only that job, not the entire service.
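One way to narrow it down would be to run each daily job on its own under GNU time and compare peak RSS. A rough sketch (the settings module, the pythonpath and the availability of GNU time inside that container are all assumptions):

    # run each of the daily jobs listed above by itself and report peak memory
    cd /opt/mailman/mailman-suite/mailman-suite_project
    for job in cache_cleanup daily_cleanup orphan_emails recent_threads_cache sync_mailman; do
        echo "== $job =="
        /usr/bin/time -v django-admin runjob "$job" --pythonpath=. --settings=settings 2>&1 \
            | grep 'Maximum resident set size'
    done

Whichever job shows a runaway number is the one to dig into.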
--
Mark Sapiro <mark@msapiro.net>
San Francisco Bay Area, California
"The highway is for gamblers, better use your sense" - B. Dylan
participants (4)
- Hirayama, Pat
- Mark Sapiro
- Odhiambo Washington
- Stephen J. Turnbull