[MM3-users] Re: Race condition when mass subscribing same email addresses to two lists?

Oct. 28, 2021

On Oct 28, 2021, at 6:46 AM, Stephen J. Turnbull <stephenjturnbull@gmail.com> wrote:
Alan So writes:
...

Create two lists with default setting
Prepare a list of 1000 fresh email addresses not added to the
system before
Mass subscribe both lists of these 1000 email addresses around
the same time with all three Pre (confirm, approved, verified)
checked
[...]
(psycopg2.errors.UniqueViolation) duplicate key value violates
unique constraint "ix_address_email"
DETAIL:  Key (email)=(testemail999@example.com) already exists.

I think there's some kind of race condition.  I would bet it's in
Mailman core, not in the RDBMS code or the ORM.  The process is

1.  Check if the address is known.
2.  If yes:
    a.  Get the address's user.
    b.  Add the subscription pair (address, list) to the user.
    c.  Add the address to the list.
3.  If no:
    a.  Create the address object.
    b.  Create a user for the address, and link them together.
    c.  Add the subscription pair (address, list) to the user.
    d.  Add the address to the list.

Each line is a separate database query, I suspect, so the race is
between 1 and 3a.  If two requests for the same new address arrive at
the "same" time, both will try to create the address, only one can
succeed.
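
The check-then-create race described above can be reproduced in miniature. This is only a sketch under loud assumptions: it uses SQLite and a single connection to force the interleaving deterministically, and the table layout and helper names (is_known, create_address, simulate_race) are hypothetical, not Mailman's actual schema or code.

```python
import sqlite3


def make_db():
    # Toy schema: one table with a unique index on the email column.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE address (id INTEGER PRIMARY KEY, email TEXT)")
    conn.execute("CREATE UNIQUE INDEX ix_address_email ON address (email)")
    return conn


def is_known(conn, email):
    # Step 1: check whether the address is known.
    row = conn.execute(
        "SELECT 1 FROM address WHERE email = ?", (email,)
    ).fetchone()
    return row is not None


def create_address(conn, email):
    # Step 3a: create the address object.
    conn.execute("INSERT INTO address (email) VALUES (?)", (email,))


def simulate_race(email):
    conn = make_db()
    # Both workers run step 1 before either one reaches step 3a.
    a_known = is_known(conn, email)
    b_known = is_known(conn, email)
    outcomes = []
    for name, known in (("A", a_known), ("B", b_known)):
        if known:
            outcomes.append((name, "existing"))
            continue
        try:
            create_address(conn, email)
            outcomes.append((name, "created"))
        except sqlite3.IntegrityError:
            # The unique index rejects the second insert.
            outcomes.append((name, "unique violation"))
    return outcomes


print(simulate_race("testemail999@example.com"))
```

Both workers see "not known" in step 1, so both attempt the insert; only the first one succeeds.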
I guess we should catch the error and retry.  Raising and handling
exceptions in Python is relatively slow, but even in your well
constructed worst case, this shouldn't happen on every address, so I
don't think having a separate queue or putting the whole thing in a
transaction would be better.  If you still have the log, I'd be
curious to know how many unique errors you got.
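
A minimal sketch of the "catch the error and retry" idea, again using SQLite as a stand-in. The helpers get_or_create_address and with_retry, and the before_insert hook that simulates a competing worker, are hypothetical names invented for this illustration. On the second attempt the lookup finds the row the competitor created, so no second insert is attempted.

```python
import sqlite3


def get_or_create_address(conn, email, before_insert=None):
    row = conn.execute(
        "SELECT id FROM address WHERE email = ?", (email,)
    ).fetchone()
    if row is not None:
        return row[0]
    if before_insert is not None:
        # Illustration hook: lets a "competing worker" win the race.
        before_insert()
    cur = conn.execute("INSERT INTO address (email) VALUES (?)", (email,))
    return cur.lastrowid


def with_retry(operation, attempts=3):
    # Re-run the whole check-then-create operation on a unique violation.
    for attempt in range(attempts):
        try:
            return operation()
        except sqlite3.IntegrityError:
            if attempt == attempts - 1:
                raise


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE address (id INTEGER PRIMARY KEY, email TEXT UNIQUE)"
    )
    email = "testemail999@example.com"

    def competitor():
        # Another worker creates the same address between check and insert.
        conn.execute("INSERT INTO address (email) VALUES (?)", (email,))

    address_id = with_retry(
        lambda: get_or_create_address(conn, email, competitor)
    )
    count = conn.execute("SELECT COUNT(*) FROM address").fetchone()[0]
    return address_id, count


print(demo())
```

The retry only pays the exception cost when a collision actually happens, which matches the expectation that it should be rare.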
So, at this point, the mass subscribe feature calls the API once for
each address. Each REST call in Core is wrapped in a transaction, so
when an address has already been created by a separate web worker, the
transaction fails for any other worker that tries to create it.
I am not sure we have an easy way to handle this kind of race. From
the point of view of the API code, both requests successfully subscribed
the user and created the address, but the database refused to commit the
changes and the transaction was rolled back because it violated the
constraint. By the time this exception is raised, the entire API code is
done executing, so where to handle the psycopg2.errors.UniqueViolation
exception is a big question.
With the multiprocessing model of runners and multiple web workers,
this is basically the behavior we want: the integrity of the database
is preserved by the constraints we put in the table definitions, and
the runners/web workers can continue to work as if they had full
control of the database, without separately synchronizing with
each other.
The code that subscribes a new address to a mailing list lives here [1].
It is the handler for the POST /3.1/members endpoint in the API.
This is how we wrap every call to the WSGI app, i.e. each API call, in
a transaction [2].
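
To illustrate why the error only shows up after the handler has finished, here is a toy sketch of a per-call transaction wrapper. Everything in it is an assumption made for illustration: the Session class, the transactional decorator, and the local UniqueViolation exception are stand-ins, not Core's actual code, and mapping the violation to "409 Conflict" is just one hypothetical way to signal a retryable error.

```python
class UniqueViolation(Exception):
    """Stand-in for psycopg2.errors.UniqueViolation."""


class Session:
    """Toy session: writes are buffered and only checked at commit time."""

    def __init__(self, table):
        self.table = table  # shared set of committed emails
        self.pending = []

    def add(self, email):
        self.pending.append(email)

    def commit(self):
        try:
            for email in self.pending:
                if email in self.table:
                    raise UniqueViolation(email)
            self.table.update(self.pending)
        finally:
            self.pending = []

    def rollback(self):
        self.pending = []


def transactional(handler):
    def wrapper(session, email):
        try:
            status = handler(session, email)  # handler finishes without error
            session.commit()                  # the constraint is enforced here
            return status
        except UniqueViolation:
            session.rollback()
            return "409 Conflict"             # hypothetical "retryable" signal
    return wrapper


@transactional
def subscribe(session, email):
    session.add(email)
    return "201 Created"


def demo():
    table = set()
    worker_a, worker_b = Session(table), Session(table)
    email = "testemail999@example.com"
    first = subscribe(worker_a, email)
    second = subscribe(worker_b, email)
    return first, second, len(table)


print(demo())
```

In this sketch the second handler also returns "201 Created" internally; only the wrapper, at commit time, is in a position to see the violation and decide what to tell the client.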
Preventing this kind of race condition is difficult if we want
to keep supporting multiple web workers for performance. I'll
think more about how we can retry on such errors, though. It could
be a client-side retry, if we can figure out a way to signal to the
client that the error is retryable.
Whether or not we are able to translate a UniqueViolation directly into
a retryable error code for Client really depends on whether there is
code in Core that relies on EAFP from database for functioning correctly,
since in those situations, the error, if raised, wouldn’t really be re-tryable
IMO. Fun problem to solve!
A pretty obvious workaround is to subscribe users serially instead of
in parallel.
--
thanks,
Abhilash Raj (maxking)