Hi Serban,
s.dobrau@ucl.ac.uk writes:
We are currently working on a migration plan for Mailman -> Mailman 3. We have around over 3000 mailing lists with a total of 50K members
Medium-size. I guess a 2-CPU/4GB Linode or similar should handle the Mailman load (including Postorius and HyperKitty). You might want more memory and maybe more CPUs if you're running the RDBMS on the same host. In the 23k lists/50k users/100k inbound messages/day migration described below, we ended up reconfiguring the VMs 3 times before declaring the system stable. IMO the final setup of 2 x 8CPU/16GB was way overpowered for the need, but we didn't try to calibrate finer than that.
and 500 worth of archived data.
500 whats? 500MB, 500GB, 500TB? Assuming 500GB,
Our plan is to have core+web (instance 1) and Hyperkitty (instance 2). With the archives on the FS local to the Hyperkitty.
The archives are in the database as BLOBs. There's no reason to have a separate instance for HyperKitty. It's not obvious to me that you need more than that 2-CPU/4GB Linode for both Mailman and RDBMS, but if you're going to split, it's Mailman vs RDBMS, not Mailman vs HyperKitty.
And all 3 tables (core, web, Hyperkitty) pointing to the same database on a remote SQL (mysql-enterprise) instance. Using class mailman.database.mysql.MySQLDatabase (and whatever necessary on the Django/Hyperkitty instance).
Make sure your MySQL database(s) for Mailman are configured with the utf8mb4 character set for 4-byte UTF-8 support, or they will choke on emojis and the like. I don't think there are any other gotchas for MySQL.
Note there are a lot more than 3 tables in Mailman's databases. The "traditional" configuration is a "mailman" database for core and a "mailmanweb" database for Django and the archives, but Mark says that as the tables are disjoint across Mailman, Django, and the archives there's no reason not to use a single "mailman" database for all of them. I believe that's how this list's host is configured.
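For the Django side (Postorius and HyperKitty), a minimal sketch of what the single-database, utf8mb4 setup might look like in the web settings file follows. The host name, user, and password are placeholders I made up, and the database name "mailman" just follows the single-database layout described above. Mailman core is configured separately in mailman.cfg and needs its own charset setting there; that part is not shown.

```python
# Hypothetical fragment of the Django settings used by Postorius/HyperKitty.
# Host, user, and password are placeholders, not values from this thread.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "mailman",            # one database shared with core (assumption)
        "USER": "mailman",
        "PASSWORD": "change-me",
        "HOST": "db.example.org",     # the remote MySQL instance
        "PORT": "3306",
        # Ask the MySQL client library for 4-byte UTF-8 so emoji survive.
        "OPTIONS": {"charset": "utf8mb4"},
    }
}
```

The database and its tables also have to be created with utf8mb4 on the server side; the client option alone does not convert existing tables.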
Is this feasible,
Yes.
if so what would be the recommended compute/storage requirements.
I would start with the spec above (2CPU/4GB). If that doesn't perform, bump it to taste. (I'm not a cloud expert, I just assume you can do that the way we did, very simply.) Mailman itself uses about a GB of RAM for all its runner processes. The Python code (including compiled .pycs) is about 300MB. The database storage requirements are nominal for Mailman and Django (maybe 100MB total in the database for a 23k lists, 50k users installation I worked on recently), and I think the database usage for the archives was approximately 1:1 vs. the Mailman 2 mboxes. Mark may have a better estimate on that.
There is also the full-text search index for the archives. I recommend Xapian; in my experience Whoosh is really slow.[1] I don't recall offhand how big the indexes were, but I would assume they're no smaller than 1/10 the size of the mboxes, and maybe quite a bit bigger.
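Selecting Xapian is done through the Haystack settings in the same Django settings file. This is only a sketch: it assumes the xapian-haystack package (which provides the `xapian_backend` module) and the Xapian Python bindings are installed, and the index path is a made-up example that should point at a disk with room for the index.

```python
# Hypothetical Haystack configuration selecting the Xapian backend.
# Assumes xapian-haystack is installed; the PATH below is a placeholder.
HAYSTACK_CONNECTIONS = {
    "default": {
        "ENGINE": "xapian_backend.XapianEngine",
        "PATH": "/opt/mailman/web/var/fulltext_index",
    },
}
```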
Importing list configurations, users, and subscriptions into Mailman 3 is pretty fast, as long as you're not using the stock Postfix support.[2] The problem is that generating the Postfix alias files seems to be noticeably linear in the number of lists, and since they are regenerated for every list created, the mass import is quadratic overall. IIRC with 5000 lists you'd be looking at >1min/list just to keep regenerating Postfix's alias database, i.e. more than 5000 minutes. I found two solutions, both of which required patching Mailman. I hope to get both into the next release. I think we got it down to <5s/list; for 5000 lists that's under 7 hours.
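To put the per-list figures in wall-clock terms, here is a back-of-the-envelope calculation. It takes the recalled numbers (5000 lists, 60s and 5s per list) at face value and treats them as flat averages, so the first result is a lower bound and the second an upper bound; the function name is just for illustration.

```python
def total_hours(num_lists, seconds_per_list):
    """Total import time in hours at a flat per-list cost."""
    return num_lists * seconds_per_list / 3600


# Figures recalled above: >1 min/list with stock Postfix support,
# <5 s/list after patching, both for 5000 lists.
stock = total_hours(5000, 60)
patched = total_hours(5000, 5)

print(f"stock: more than {stock:.1f} h, patched: less than {patched:.1f} h")
```

That is roughly three and a half days versus an overnight run.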
Importing archives is ... yeeech. The client decided they didn't want a ton of archives anyway (you know corporations, if it's not required by law, shred it after 6 months). I don't recall the size estimate accurately but we kept 6 months x 4500 lists, maybe 100GB out of 2.3TB of mboxes. That took 24 hours to import into HyperKitty, without doing the full-text indexing. I do know the original conservative estimate was 20 days to import the whole 2.3TB. The full-text indexing took more than a week if I remember correctly.
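As a rough sanity check on those numbers, the sketch below extrapolates linearly from the "maybe 100GB in 24 hours" import. Constant throughput is an assumption, the 100GB figure is itself a guess, and the 500GB case is only the assumed reading of your archive size; full-text indexing is not included.

```python
def import_days(archive_gb, sample_gb=100, sample_hours=24):
    """Days to import archive_gb, scaling linearly from a measured sample."""
    return archive_gb * sample_hours / sample_gb / 24


whole_archive = import_days(2300)   # the full 2.3TB of mboxes
assumed_500gb = import_days(500)    # if "500" means 500GB

print(f"2.3TB: about {whole_archive:.0f} days, 500GB: about {assumed_500gb:.0f} days")
```

The 2.3TB result lands close to the 20-day estimate mentioned above, so linear scaling seems like a reasonable first guess.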
The easiest thing to do is to just do no migration, keep the old archive online, and start populating the new archive when you switch over. It's possible to do both the HyperKitty import and full-text indexing incrementally with a bit of planning.
It's also quite possible to migrate incrementally, a few lists at a time. Mailman 2 and Mailman 3 can coexist happily on the same host. Mark can advise on that, I think.
The only other advice I remember offhand is that Mailman's outgoing smart host should be in the same datacenter as Mailman. We never did debug it, but the system described above with 23k lists had problems with the out queue stalling when smtp.client.com resolved to a different datacenter 500km away. :-) The problem went away as soon as they pinned the MX to the MTAs in the same datacenter.
Footnotes: [1] If you're familiar with another full-text search engine, check the django-haystack docs. Haystack supports several other engines, and your preference may be among them.
[2] Exim4 does not have this problem. I don't know about other supported MTAs.