On Oct 28, 2021, at 6:46 AM, Stephen J. Turnbull <stephenjturnbull@gmail.com> wrote:
Alan So writes:
- Create two lists with default setting
- Prepare a list of 1000 fresh email addresses not added to the system before
- Mass subscribe these 1000 email addresses to both lists at around the same time, with all three Pre options (confirm, approved, verified) checked
[...]
(psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "ix_address_email" DETAIL: Key (email)=(testemail999@example.com) already exists.
I think there's some kind of race condition. I would bet it's in Mailman core, not in the RDBMS code or the ORM. The process is:
1. Check if the address is known.
2. If yes:
   a. Get the address's user.
   b. Add the subscription pair (address, list) to the user.
   c. Add the address to the list.
3. If no:
   a. Create the address object.
   b. Create a user for the address, and link them together.
   c. Add the subscription pair (address, list) to the user.
   d. Add the address to the list.
Each line is a separate database query, I suspect, so the race is between steps 1 and 3a. If two requests for the same new address arrive at the "same" time, both will try to create the address, but only one can succeed.
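To make the suspected window concrete, here is a minimal sketch of the check-then-create race. It is not Mailman code: it uses an in-memory SQLite table as a stand-in for PostgreSQL, a single connection with a hand-written interleaving instead of two real workers, and hypothetical helper names (`is_known`, `create_address`). Only the unique index name is taken from the error message above.

```python
import sqlite3

# Hypothetical stand-in for the address table; only the unique index matters.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE address (email TEXT)")
db.execute("CREATE UNIQUE INDEX ix_address_email ON address (email)")


def is_known(email):
    """Step 1: check whether the address already exists."""
    row = db.execute(
        "SELECT 1 FROM address WHERE email = ?", (email,)
    ).fetchone()
    return row is not None


def create_address(email):
    """Step 3a: create the address row."""
    db.execute("INSERT INTO address (email) VALUES (?)", (email,))


email = "testemail999@example.com"

# Simulated interleaving: both requests run step 1 before either runs 3a,
# so both conclude that the address is new.
seen_by_a = is_known(email)
seen_by_b = is_known(email)

create_address(email)  # request A creates the address first
try:
    create_address(email)  # request B acted on a stale check
    race_error = None
except sqlite3.IntegrityError as exc:
    race_error = exc  # the unique index rejects the second insert
```

The unique index is what turns the stale check into a visible error rather than a silently duplicated row.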
I guess we should catch the error and retry. Raising and handling exceptions in Python is relatively slow, but even in your well-constructed worst case, this shouldn't happen on every address, so I don't think having a separate queue or putting the whole thing in a transaction would be better. If you still have the log, I'd be curious to know how many unique errors you got.
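A rough sketch of what "catch the error and retry" could look like, under stated assumptions: `retry_on_unique_violation` and `subscribe_once` are hypothetical names, and `sqlite3.IntegrityError` stands in for whatever exception the real database layer raises for a unique violation. The important design point is that the retried callable covers the whole check-and-create sequence, so the second run sees the address the other request created.

```python
import sqlite3


def retry_on_unique_violation(operation, attempts=2):
    """Run operation(); if it hits a unique-constraint error, run it again.

    `operation` is a hypothetical callable covering the whole unit of work
    (check, create, subscribe), so a second run takes the "address is
    known" branch instead of trying to create the address again.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except sqlite3.IntegrityError:
            # With a real database session, roll back here before retrying.
            if attempt == attempts - 1:
                raise  # still failing on the last attempt: give up


# Demonstration with a fake operation that loses the race exactly once.
calls = []


def subscribe_once():
    calls.append("try")
    if len(calls) == 1:
        raise sqlite3.IntegrityError("UNIQUE constraint failed")
    return "subscribed"


result = retry_on_unique_violation(subscribe_once)
```

Since a lost race should be rare, a small fixed number of attempts is probably enough; an error that persists after the retry is re-raised unchanged.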
So, at this point, the mass subscribe feature calls the API once for each address. Each REST call in Core is wrapped in a transaction, so when an address has already been created by a separate web worker, the transaction fails for any other worker that tries to create it.
I am not sure if we have an easy way to handle this kind of race. From the perspective of the API code, both requests were able to successfully subscribe the user and create the address, but the database rejected the changes at commit time and the transaction was rolled back because it violated the constraints. By the time this exception is raised, the entire API code is done executing, so where to handle the psycopg2.errors.UniqueViolation exception is a big question.
With the multiprocessing model of runners and multiple web workers, this kind of situation is basically what we would want: the integrity of the database is preserved by the constraints we put in the table definitions, and the runners/web workers can continue to work assuming they have full control of the database, without separately synchronizing with each other.
The code for this lives here [1], which subscribes a new address to a mailing list. It is the POST /3.1/members endpoint handler in the API.
This is how we wrap every call to the WSGI app, i.e. each API call, into a transaction [2].
Preventing this kind of race condition is difficult if we want to continue supporting multiple web workers for performance. I’ll think more about how we can retry on such errors though. It could be a client-side retry, if we can figure out a way to signal to the client that this error is retryable.
Whether or not we are able to translate a UniqueViolation directly into a retryable error code for Client really depends on whether there is code in Core that relies on EAFP from the database to function correctly, since in those situations the error, if raised, wouldn’t really be retryable IMO. Fun problem to solve!
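For illustration only, one way to signal a retryable failure is a WSGI wrapper that maps the constraint error to a status the client can act on. This is a sketch, not how Core does it: the `retryable_on_conflict` name, the choice of 503 with a Retry-After header, and the use of `sqlite3.IntegrityError` as the error type are all assumptions, and it sidesteps the EAFP caveat above by treating every such error as retryable.

```python
import sqlite3


def retryable_on_conflict(app, error_type=sqlite3.IntegrityError):
    """Wrap a WSGI app so a constraint error becomes a retryable response.

    Assumption: the error surfaces before the wrapped app has started its
    response. A real transaction wrapper would also roll back here.
    """
    def wrapper(environ, start_response):
        try:
            return app(environ, start_response)
        except error_type:
            body = b"conflict, please retry"
            start_response(
                "503 Service Unavailable",
                [
                    ("Retry-After", "1"),  # hint: retry after one second
                    ("Content-Type", "text/plain"),
                    ("Content-Length", str(len(body))),
                ],
            )
            return [body]
    return wrapper


# Demonstration with a fake app that always loses the race.
def failing_app(environ, start_response):
    raise sqlite3.IntegrityError("duplicate key value")


captured = {}


def fake_start_response(status, headers, exc_info=None):
    captured["status"] = status
    captured["headers"] = dict(headers)


response_body = retryable_on_conflict(failing_app)({}, fake_start_response)
```

A client that sees this status could simply resend the same subscribe request, at which point the address already exists and the "known address" path applies.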
A pretty obvious workaround is to subscribe users serially instead of in parallel.
-- thanks, Abhilash Raj (maxking)