Race condition when mass subscribing same email addresses to two lists?
We are testing the latest Mailman core 3.3.5 and Postorius 1.3.6. My colleagues reported that some email addresses cannot subscribe to the lists when two of them are using mass subscribe feature to add email addresses to the list. Then I try whether I can repeat the same issue. I did this:
- Create two lists with default setting
- Prepare a list of 1000 fresh email addresses not added to the system before
- Mass subscribe both lists of these 1000 email addresses around the same time with all three Pre (confirm, approved, verified) checked
There are quite a lot of similar errors in mailman.log. Did not study the code yet. Which part is causing the issue? SQLAlchemy 1.3.24 psycopg2-binary 2.8.6
[2021-10-28 08:38:30 +0000] [20141] [ERROR] Error handling request /3.1/members Traceback (most recent call last): File "/opt/mailman/venv/lib/python3.6/site-packages/gunicorn/workers/sync.py", line 136, in handle self.handle_request(listener, req, client, addr) File "/opt/mailman/venv/lib/python3.6/site-packages/gunicorn/workers/sync.py", line 179, in handle_request respiter = self.wsgi(environ, resp.start_response) File "/opt/mailman/venv/lib/python3.6/site-packages/mailman/database/transaction.py", line 51, in wrapper config.db.commit() File "/opt/mailman/venv/lib/python3.6/site-packages/mailman/database/base.py", line 52, in commit self.store.commit() File "/opt/mailman/venv/lib/python3.6/site-packages/sqlalchemy/orm/session.py", line 1046, in commit self.transaction.commit() File "/opt/mailman/venv/lib/python3.6/site-packages/sqlalchemy/orm/session.py", line 502, in commit self._assert_active(prepared_ok=True) File "/opt/mailman/venv/lib/python3.6/site-packages/sqlalchemy/orm/session.py", line 296, in _assert_active code="7s2a", sqlalchemy.exc.InvalidRequestError: This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely) (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "ix_address_email" DETAIL: Key (email)=(testemail999@example.com) already exists.
Alan So writes:
- Create two lists with default setting
- Prepare a list of 1000 fresh email addresses not added to the system before
- Mass subscribe both lists of these 1000 email addresses around the same time with all three Pre (confirm, approved, verified) checked [...] (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "ix_address_email" DETAIL: Key (email)=(testemail999@example.com) already exists.
I think there's some kind of race condition. I would bet it's in Mailman core, not in the RDMBS code or the ORM. The process is
- Check if the address is known.
- If yes: a. Get the address's user. b. Add the subscription pair (address, list) to the user. c. Add the address to the list.
- If no: a. Create the address object. b. Create a user for the address, and link them together. c. Add the subscription pair (address, list) to the user. d. Add the address to the list.
Each line is a separate database query, I suspect, so the race is between 1 and 3a. If two requests for the same new address arrive at the "same" time, both will try to create the address, only one can succeed.
I guess we should catch the error and retry. Raising and handling exceptions in Python is relatively slow, so even in your well constructed worst case, this shouldn't happen on every address, so I don't think having a separate queue or putting the whole thing in a transaction would be better. If you still have the log, I'd be curious to know how many unique errors you got.
I need to do other work for the next 48 hours, but a patch would be very welcome. I don't know offhand where this code lives, sorry.
Steve
On Oct 28, 2021, at 6:46 AM, Stephen J. Turnbull <stephenjturnbull@gmail.com> wrote:
Alan So writes:
- Create two lists with default setting
- Prepare a list of 1000 fresh email addresses not added to the system before
- Mass subscribe both lists of these 1000 email addresses around the same time with all three Pre (confirm, approved, verified) checked [...] (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "ix_address_email" DETAIL: Key (email)=(testemail999@example.com) already exists.
I think there's some kind of race condition. I would bet it's in Mailman core, not in the RDMBS code or the ORM. The process is
- Check if the address is known.
- If yes: a. Get the address's user. b. Add the subscription pair (address, list) to the user. c. Add the address to the list.
- If no: a. Create the address object. b. Create a user for the address, and link them together. c. Add the subscription pair (address, list) to the user. d. Add the address to the list.
Each line is a separate database query, I suspect, so the race is between 1 and 3a. If two requests for the same new address arrive at the "same" time, both will try to create the address, only one can succeed.
I guess we should catch the error and retry. Raising and handling exceptions in Python is relatively slow, so even in your well constructed worst case, this shouldn't happen on every address, so I don't think having a separate queue or putting the whole thing in a transaction would be better. If you still have the log, I'd be curious to know how many unique errors you got.
So, at this point, the mass subscribe feature will call the API once for each address. Each REST call in Core is wrapped in a transaction, so when one address is already created by a separate web worker, it will fail the transaction when others try to create.
I am not sure if we have an easy way to handle this kind of races. From the purposes of the API code, they both were able to successfully subscribe the user and create the address but the database rejected the changes from being committed and transaction was rolled back as it violated the constraints. By the time this exception is raised, the entire API code is done executing, so where do handle the psycopg2.errors.UniqueViolation exception is a big question.
With the multiprocessing model of runners and multiple web workers, this kind of situation is basically what we would want where the integrity of the database is preserved by constraints we put in the table definitions and the runners/web workers can continue to work assuming they have the full control of the database without separately synchronizing with each other.
The code for this lives here1, which subscribes a new address to a mailing list. It is the POST /3.1/members endpoint handler in API.
This is how we wrap every call to the WSGI app, i.e. each API call into a transaction2.
Being able to prevent this kind of race condition is difficult if we want to continue the support for multiple web workers for performance. I’ll think more about how we can re-try on such errors though. It could be either a client side re-try if we can figure out a way to signal the client that this error was re-tryable.
Whether or not we are able to translate a UniqueViolation directly into a retryable error code for Client really depends on whether there is code in Core that relies on EAFP from database for functioning correctly, since in those situations, the error, if raised, wouldn’t really be re-tryable IMO. Fun problem to solve!
A pretty obvious workaround is to subscribed users serially instead of parallell.
-- thanks, Abhilash Raj (maxking)
Stephen J. Turnbull wrote:
I guess we should catch the error and retry. Raising and handling exceptions in Python is relatively slow, so even in your well constructed worst case, this shouldn't happen on every address, so I don't think having a separate queue or putting the whole thing in a transaction would be better. If you still have the log, I'd be curious to know how many unique errors you got. I need to do other work for the next 48 hours, but a patch would be very welcome. I don't know offhand where this code lives, sorry. Steve
For this case, there are 837 unique errors. Verified with the number of members of the two lists: 576 & 587 576 + 587 + 837 = 1000 * 2
Alan So writes:
For this case, there are 837 unique errors. Verified with the number of members of the two lists: 576 & 587 576 + 587 + 837 = 1000 * 2
Thanks. That looks very much like a race condition. Maybe Abhilash or I can get to it later this week (I'm definitely booked until at least Wednesday, and Abhilash is always busy...).
Steve
Thanks a lot. Just curious whether it is a new issue of 3.3.5 or just an old issue that rarely happen practically. We have been using 3.2 for some time and we are evaluating whether we should go for an upgrade now to resolve some other issues.
Alan
On 11/11/21 5:50 PM, Alan So wrote:
Thanks a lot. Just curious whether it is a new issue of 3.3.5 or just an old issue that rarely happen practically. We have been using 3.2 for some time and we are evaluating whether we should go for an upgrade now to resolve some other issues.
It is not a new issue. It has always been there. It's easy enough to avoid, just don't run parallel mass subscribes.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
Mark Sapiro wrote:
It is not a new issue. It has always been there. It's easy enough to avoid, just don't run parallel mass subscribes.
Thanks. Since we have many administrators working on different lists and they normally remove all memberships & mass subscribe during academic term start, it is possible that they may mass subscribe at the same time. The chance to hit this issue is slim anyway since they need to subscribe the same email addresses around the same time and these email addresses never exist on Mailman.
Alan
Alan So writes:
Thanks. Since we have many administrators working on different lists and they normally remove all memberships & mass subscribe during academic term start, it is possible that they may mass subscribe at the same time. The chance to hit this issue is slim anyway since they need to subscribe the same email addresses around the same time and these email addresses never exist on Mailman.
If I understand your situation and how Mailman works correctly, you should be able to run parallel mass subscribes if the User objects and Address objects already exist. So make a "Universal" list that subscribes all these addresses before starting on the individual lists. Or perhaps you already have lists that form a partition of all addresses (freshmen, sophomores, juniors, seniors, faculty, staff). The universal list will cost a small amount of space overhead (the Subscription objects in the database, but Users and Addresses are not duplicated since you need them for the other lists).
Somebody will have to get up at 4am on the first day of term to populate the universal list before the rest of the staff wakes up, I guess, but that's probably better than having dozens of staff frustrated because half their assigned lists are only half populated.
BTW, I know my university already has a universal list for announcements in a disaster. It's implemented as an umbrella list for a set of lists that partitions university members.
Steve
Stephen J. Turnbull wrote:
Thanks. Since we have many administrators working on different lists and they normally remove all memberships & mass subscribe during academic term start, it is possible that they may mass subscribe at the same time. The chance to hit this issue is slim anyway since they need to subscribe the same email addresses around the same time and these email addresses never exist on Mailman. If I understand your situation and how Mailman works correctly, you should be able to run parallel mass subscribes if the User objects and Address objects already exist. So make a "Universal" list that subscribes all these addresses before starting on the individual
Alan So writes: lists. Or perhaps you already have lists that form a partition of all addresses (freshmen, sophomores, juniors, seniors, faculty, staff). The universal list will cost a small amount of space overhead (the Subscription objects in the database, but Users and Addresses are not duplicated since you need them for the other lists). Somebody will have to get up at 4am on the first day of term to populate the universal list before the rest of the staff wakes up, I guess, but that's probably better than having dozens of staff frustrated because half their assigned lists are only half populated. BTW, I know my university already has a universal list for announcements in a disaster. It's implemented as an umbrella list for a set of lists that partitions university members. Steve
Thanks a lot for your suggestion. I would discuss with my colleagues about how to prevent the issue from happening as much as possible. And yes, pre-populating new email addresses is feasible.
participants (4)
-
Abhilash Raj
-
Alan So
-
Mark Sapiro
-
Stephen J. Turnbull