Mailman 3 deployment: ok to use same database instance for core, web, and hyperkitty tables?
Hello,
We are currently working on a migration plan for Mailman -> Mailman 3. We have over 3000 mailing lists with a total of 50K members and 500 worth of archived data.
Our plan is to have core+web (instance 1) and Hyperkitty (instance 2). With the archives on the FS local to the Hyperkitty. And all 3 tables (core, web, Hyperkitty) pointing to the same database on a remote SQL (mysql-enterprise) instance. Using class mailman.database.mysql.MySQLDatabase (and whatever necessary on the Django/Hyperkitty instance).
Is this feasible, and if so what would be the recommended compute/storage requirements? If not, what other alternatives have we got? A remote or local PostgreSQL instance is also possible if recommended.
Thank you very much in advance and all the best !
Regards, Serban
Hi Serban,
You may or may not be on the right track: based on that email message, your use of terminology is not always clear.
Consider these three concepts:
- Database server (a server running PostgreSQL or MySQL)
- A database (one server can host multiple databases)
- A table (one database contains many tables)
When you wrote "And all 3 tables" you probably meant "all 3 apps"? When you wrote "pointing to the same database" you probably meant "pointing to (multiple) databases on the same database server instance"?
All the apps (HyperKitty, Postorius, core) could share the same database server, as long as the core and web components each have their own database: one DB for core and another DB for web. So, two databases, potentially on the same hardware instance.
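As a purely hypothetical sketch of that layout (the server host, database names, and credentials below are placeholders), core gets its own database and Postorius + HyperKitty share a second one on the same server:

    # One database server, two databases -- all names below are placeholders.
    #
    # Mailman core side (mailman.cfg) would carry something like:
    #   [database]
    #   class: mailman.database.mysql.MySQLDatabase
    #   url: mysql+pymysql://mailman:SECRET@db.example.org/mailman
    #
    # Django side (Postorius + HyperKitty), in settings.py:
    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.mysql",
            "NAME": "mailmanweb",        # second database on the same server
            "HOST": "db.example.org",
            "USER": "mailman",
            "PASSWORD": "SECRET",
        }
    }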
The new HyperKitty archives are usually stored in the database, not "FS local" (on the local filesystem).
Next, about this part, "core+web (instance 1) and Hyperkitty (instance 2)": the main division is between Web (Postorius + HyperKitty) and Core (mailman3-core). In other words, HyperKitty is part of the web layer, so you wouldn't usually split web and HyperKitty onto separate instances.
Hi Serban,
s.dobrau@ucl.ac.uk writes:
We are currently working on a migration plan for Mailman -> Mailman 3. We have over 3000 mailing lists with a total of 50K members
Medium-size. I guess a 2-CPU/4GB Linode or similar should handle the Mailman load (including Postorius and HyperKitty); you might want more memory and maybe more CPUs if you're running the RDBMS on the same host. In the 23k lists/50k users/100k inbound messages/day migration described below, we ended up reconfiguring the VMs 3 times before declaring the system stable. IMO the 2 x 8CPU/16GB final setup was way overpowered for the need, but we didn't try to calibrate finer than that.
and 500 worth of archived data.
500 whats? 500MB, 500GB, 500TB? Assuming 500GB,
Our plan is to have core+web (instance 1) and Hyperkitty (instance 2). With the archives on the FS local to the Hyperkitty.
The archives are in the database as BLOBs. There's no reason to have a separate instance for HyperKitty. It's not obvious to me that you need more than that 2x4 Linode for both Mailman and RDBMS, but if you're going to split, it's Mailman vs RDBMS, not Mailman vs HyperKitty.
And all 3 tables (core, web, Hyperkitty) pointing to the same database on a remote SQL (mysql-enterprise) instance. Using class mailman.database.mysql.MySQLDatabase (and whatever necessary on the Django/Hyperkitty instance).
Make sure your MySQL database(s) for Mailman are configured with the utf8mb4 option for 4 byte UTF-8 support, or it will choke on emojis and the like. I don't think there are any other gotchas for MySQL.
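As a hedged example of where that option actually goes (database name, host, and credentials are placeholders; check your driver's docs for the exact connection arguments), the 4-byte setting shows up in three places: when the databases are created, in core's connection URL, and in the Django connection options:

    # Sketch only -- adjust names and drivers to your environment.
    #
    # 1. Create the databases with a 4-byte character set, e.g.:
    #      CREATE DATABASE mailman CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
    #
    # 2. Mailman core (mailman.cfg) -- request utf8mb4 in the SQLAlchemy URL:
    #      url: mysql+pymysql://mailman:SECRET@db.example.org/mailman?charset=utf8mb4
    #
    # 3. Django (settings.py) -- pass the charset as a connection option:
    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.mysql",
            "NAME": "mailmanweb",
            # ... HOST / USER / PASSWORD as usual ...
            "OPTIONS": {"charset": "utf8mb4"},
        }
    }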
Note there are a lot more than 3 tables in Mailman's databases. The "traditional" configuration is a "mailman" database for core and a "mailmanweb" database for Django and the archives, but Mark says that as the tables are disjoint across Mailman, Django, and the archives there's no reason not to use a single "mailman" database for all of them. I believe that's how this list's host is configured.
Is this feasible,
Yes.
if so what would be the recommended compute/storage requirements.
I would start with the spec above (2CPU/4GB). If that doesn't perform, bump it to taste. Mailman itself uses about a GB of RAM for all its runner processes. (I'm not a cloud expert, I just assume you can do that the way we did, very simply.) The Python code (including compiled .pycs) is about 300MB. The database storage requirements are nominal for Mailman and Django (maybe 100MB total in the database for a 23k lists, 50k users installation I worked on recently), and I think the database usage for the archives was approximately 1:1 vs. the Mailman2 mboxes. Mark may have a better estimate on that.
There is also the full-text search index for the archives. I recommend Xapian; in my experience Whoosh is really slow.[1] I don't recall offhand how big those indexes were, but I would assume they're no smaller than 1/10 the size of the mboxes, and maybe quite a bit bigger.
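For what it's worth, the backend choice is just a django-haystack setting; a rough sketch assuming the xapian-haystack package (the index path below is a placeholder) looks like this:

    # settings.py -- full-text search backend for the archives (django-haystack).
    # Assumes the xapian-haystack package is installed; the path is a placeholder.
    import os

    HAYSTACK_CONNECTIONS = {
        "default": {
            "ENGINE": "xapian_backend.XapianEngine",
            "PATH": os.path.join("/var/lib/mailman3/web", "xapian_index"),
        }
    }
    # Whoosh would be "haystack.backends.whoosh_backend.WhooshEngine" instead,
    # but as noted above it is much slower on archives this size.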
Importing list configurations, users, and subscriptions into Mailman 3 is pretty fast, as long as you're not using the stock Postfix support.[2] The problem is that generating the Postfix alias files for the lists seems to be noticeably linear in the number of lists, which means it's quadratic for the mass import. IIRC with 5000 lists you'd be looking at >1m/list just to keep regenerating Postfix's alias database, i.e. 5000 minutes. I found two solutions, both of which required patching Mailman. I hope to get both into the next release. I think we got it down to <5s/list, for 5000 lists that's < 1hr.
Importing archives is ... yeeech. The client decided they didn't want a ton of archives anyway (you know corporations, if it's not required by law, shred it after 6 months). I don't recall the size estimate accurately, but we kept 6 months x 4500 lists, maybe 100GB out of 2.3TB of mboxes. That took 24 hours to import into HyperKitty, without doing the full-text indexing. I do know the original conservative estimate was 20 days to import the whole 2.3TB. The full-text indexing took more than a week if I remember correctly.
The easiest thing to do is no migration at all: keep the old archive online and start populating the new archive when you switch over. It's possible to do both the HyperKitty import and full-text indexing incrementally with a bit of planning.
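A rough sketch of what "incremental" can look like, assuming HyperKitty's hyperkitty_import management command and haystack's update_index, run from within the Django environment (list address and paths below are placeholders, not a prescription):

    # Import one list's monthly mboxes in small batches, then refresh the
    # full-text index once per batch instead of per message.
    # Run with the mailman-web Django settings loaded (e.g. via django-admin shell).
    from glob import glob
    from django.core.management import call_command

    LIST_ADDRESS = "announce@lists.example.org"        # placeholder
    MBOX_GLOB = "/srv/mm2-archives/announce/*.mbox"    # placeholder

    for mbox in sorted(glob(MBOX_GLOB)):
        call_command("hyperkitty_import", mbox, list_address=LIST_ADDRESS)
        print("imported", mbox)

    # Update the search index for everything imported in this batch.
    call_command("update_index")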
It's also quite possible to migrate incrementally, a few lists at a time. Mailman 2 and Mailman 3 can coexist happily on the same host. Mark can advise on that, I think.
The only other advice I remember offhand is that Mailman's outgoing smart host should be in the same datacenter as Mailman. We never did debug it, but the system described above with 23k lists had problems with the out queue stalling when smtp.client.com resolved to a different datacenter 500km away. :-) The problem went away as soon as they pinned the MX to the MTAs in the same datacenter.
Footnotes: [1] If you're familiar with another full-text search engine, check the django-haystack docs. Haystack supports several other engines; your preference may be among them.
[2] Exim4 does not have this problem. I don't know about other supported MTAs.
Stephen J. Turnbull wrote:
s.dobrau@ucl.ac.uk writes:
We are currently working on a migration plan for Mailman -> Mailman 3. We have over 3000 mailing lists with a total of 50K members
Medium-size. I guess a 2-CPU/4GB Linode or similar should handle the Mailman load (including Postorius and HyperKitty); you might want more memory and maybe more CPUs if you're running the RDBMS on the same host. In the 23k lists/50k users/100k inbound messages/day migration described below, we ended up reconfiguring the VMs 3 times before declaring the system stable. IMO the 2 x 8CPU/16GB final setup was way overpowered for the need, but we didn't try to calibrate finer than that.
Yep, I have helped with a migration with around 150 lists but not as much archived data, and ended up with a 2-CPU, 8GB machine. I think we could have gone lower, but found we needed the larger VM to help us import the archives. I used a managed PostgreSQL instance from Amazon for the database.
Our plan is to have core+web (instance 1) and Hyperkitty (instance 2). With the archives on the FS local to the Hyperkitty.
The archives are in the database as BLOBs. There's no reason to have a separate instance for HyperKitty. It's not obvious to me that you need more than that 2x4 Linode for both Mailman and RDBMS, but if you're going to split, it's Mailman vs RDBMS, not Mailman vs HyperKitty.
I would agree. I think the issue here is that there are several ways of splitting out Mailman, and people get bogged down in whether they should run separate parts on different instances. I have always run Mailman, Postorius and HyperKitty on the same host, and I think it works better that way. For example, if you run Mailman and the web components on different hosts, you need to make sure the Mailman REST interface is secured, whereas if it's on the same host the REST interface can just listen on localhost. The Dockerised instances do use separate containers for Mailman Core and the web components, but that is the only real case where I have seen the components split.
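To illustrate (host, port, and credentials below are placeholders), on a single host the web side just needs the standard REST settings pointing at loopback, and core's mailman.cfg [webservice] section binds to the same address:

    # Django settings.py for Postorius/HyperKitty -- talk to core's REST API
    # over loopback only (all values are placeholders).
    MAILMAN_REST_API_URL = "http://127.0.0.1:8001"
    MAILMAN_REST_API_USER = "restadmin"
    MAILMAN_REST_API_PASS = "CHANGE_ME"

    # The matching core side lives in mailman.cfg:
    #   [webservice]
    #   hostname: 127.0.0.1
    #   port: 8001
    #   admin_user: restadmin
    #   admin_pass: CHANGE_ME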
And all 3 tables (core, web, Hyperkitty) pointing to the same database on a remote SQL (mysql-enterprise) instance. Using class mailman.database.mysql.MySQLDatabase (and whatever necessary on the Django/Hyperkitty instance).
Make sure your MySQL database(s) for Mailman are configured with the utf8mb4 option for 4 byte UTF-8 support, or it will choke on emojis and the like. I don't think there are any other gotchas for MySQL.
Are there any performance benchmarks for running MySQL vs PostgreSQL? I tend to fall back to PostgreSQL because it seems to be what is used in a lot of places; I have never recommended going with MySQL for a Mailman instance.
Note there are a lot more than 3 tables in Mailman's databases. The "traditional" configuration is a "mailman" database for core and a "mailmanweb" database for Django and the archives, but Mark says that as the tables are disjoint across Mailman, Django, and the archives there's no reason not to use a single "mailman" database for all of them. I believe that's how this list's host is configured.
Yep, the original installation guide created one database for Mailman, with the tables for all components in the same database; this was how my larger install was done. Later revisions of the guide have us split the core and web components into different databases. I don't think there is any performance hit either way.
Importing list configurations, users, and subscriptions into Mailman 3 is pretty fast, as long as you're not using the stock Postfix support.[2] The problem is that generating the Postfix alias files for the lists seems to be noticeably linear in the number of lists, which means it's quadratic for the mass import. IIRC with 5000 lists you'd be looking at >1m/list just to keep regenerating Postfix's alias database, i.e. 5000 minutes. I found two solutions, both of which required patching Mailman. I hope to get both into the next release. I think we got it down to <5s/list, for 5000 lists that's < 1hr.
That is very interesting; I had a similar issue with the Postfix alias generation in my larger setup but didn't have time to identify the root cause. In the end I used Exim, routing all mail for the list domain to Mailman, which is the setup I use elsewhere and doesn't rely on the alias file generation.
Importing archives is ... yeeech. The client decided they didn't want a ton of archives anyway (you know corporations, if it's not required by law, shred it after 6 months). I don't recall the size estimate accurately, but we kept 6 months x 4500 lists, maybe 100GB out of 2.3TB of mboxes. That took 24 hours to import into HyperKitty, without doing the full-text indexing. I do know the original conservative estimate was 20 days to import the whole 2.3TB. The full-text indexing took more than a week if I remember correctly.
Yes, I had the same problem with the archives. I did import all of them; in our case it took several months to get the archives in so as not to overload the box. Full-text indexing really caused an issue for me: I ended up disabling it while we were importing archives and then generated a clean index once the import was done. This wasn't a good user experience, and I would plan it out better next time.
It's also quite possible to migrate incrementally, a few lists at a time. Mailman 2 and Mailman 3 can coexist happily on the same host. Mark can advise on that, I think.
It's worth noting this is more difficult to do now on Debian/Ubuntu installs, as Python 2 has been removed. This was something I ran into myself when trying to run both installs side by side on the same box.
Andrew.
Andrew Hodgson writes:
Are there any performance benchmarks for running MySQL vs PostgreSQL?
I think the Mailman devs like PostgreSQL because it tries harder to be a theoretically-correct RDBMS and because there was a mild stink of closed source around MySQL at one point. I recently saw a claim that PostgreSQL is now more performant than MySQL on some standard SQL benchmarks, but I don't have a cite so take that as a weak statement. (May have been a Xeet from the PostgreSQL devs. ;-)
I don't think there's a big difference either way. AFAIK there are only a few subqueries and joins in the normal Mailman workload and they're all reasonably optimized by both RDBMSes. I've not heard anybody complain of the DB performance as such.
It's worth noting this is more difficult to do now on Debian/Ubuntu installs, as Python 2 has been removed. This was something I ran into myself when trying to run both installs side by side on the same box.
Mark has Python 3 versions of the archive cleaning tools in his contrib directory. We didn't need a Python 2 installation, we just rsync'ed the mboxes across.
I do recommend using a tool to break large archives into smaller mboxes (our case was monthly anyway, but there were a few 10GB+ months in there that took many hours), and keeping good logs. We had a "pulled the wrong plug" incident, and recovering the archives from that was painful even though we had a pretty good idea of what was done and what wasn't. We ended up deleting a couple of posts from the HyperKitty archive and half the posts in the list/month mbox in question, and everything worked, but it was scary, and being sure of everything we had and didn't have was tedious.
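In case it's useful, a minimal sketch of the kind of splitting tool I mean, using only the Python standard library (paths are placeholders, and real archives will need more careful handling of broken Date headers):

    # Split one large mbox into per-month mboxes so each import run is small
    # and restartable. Messages with an unparseable Date go to an "unknown"
    # bucket for manual inspection. Paths are placeholders.
    import mailbox
    from email.utils import parsedate_to_datetime

    src = mailbox.mbox("/srv/mm2-archives/biglist.mbox")
    buckets = {}
    for msg in src:
        try:
            date = parsedate_to_datetime(msg["Date"])
            key = "%04d-%02d" % (date.year, date.month)
        except (TypeError, ValueError):
            key = "unknown"
        if key not in buckets:
            buckets[key] = mailbox.mbox("/srv/mm2-split/biglist-%s.mbox" % key)
        buckets[key].add(msg)
    for box in buckets.values():
        box.close()      # flush each per-month mbox to disk
    src.close()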
Steve
participants (4)
- Andrew Hodgson
- s.dobrau@ucl.ac.uk
- samuel.d.darwin@gmail.com
- Stephen J. Turnbull