API errors making member changes to one list but not others
We had an issue with the UI getting API timeouts for a single list for specific functions but not other lists or other functions. This started happening after we accidentally filled the disk partition used by most of the system including the database.
There was a single person testing when the disk filled up, and only that one list was ‘damaged’. After disk cleanup and restarting the docker images, that list would give API errors ("HTTP Error 400: HTTPConnectionPool(host='mailman-web', port=8000): Read timed out. (read timeout=5)") on the screen and log errors (below) whenever list members were subscribed or removed.
The issue was repeatable in Postorius and did not ‘heal’. Oddly, the issue was resolved by making seemingly the same API calls from the command line via curl. Making the call below to add a member succeeded, and afterwards adding members in the UI worked again. Similarly, sending a DELETE call from curl worked where unsubscribe requests from Postorius (either mass removal or selecting addresses in the member list and unsubscribing them) had failed. Again, the unsubscribe via the curl DELETE call ‘healed’ the list such that subsequent unsubscribes in Postorius worked.
curl --user user:password -X POST -d 'list_id=laura-test.mm3.aca-aws.s.uw.edu' -d 'subscriber=ssw@uw.edu' -d 'pre_verified=true' -d 'pre_confirmed=true' -d 'pre_approved=true' http://172.19.199.2:8001/3.1/members
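For anyone who wants to script the same call rather than use curl, here is a minimal Python sketch of an equivalent request using only the standard library. The helper name build_subscribe_request is hypothetical, and the credentials and address are simply the values from the curl command above. It only builds the request; sending it is left to the caller.

```python
import base64
import urllib.parse
import urllib.request


def build_subscribe_request(base_url, user, password, list_id, subscriber):
    """Build (but do not send) a POST like the curl command above.

    base_url is the REST root including the API version,
    e.g. "http://172.19.199.2:8001/3.1".
    """
    # Same form fields as the -d options passed to curl.
    form = urllib.parse.urlencode({
        "list_id": list_id,
        "subscriber": subscriber,
        "pre_verified": "true",
        "pre_confirmed": "true",
        "pre_approved": "true",
    }).encode("ascii")
    # Equivalent of curl's --user user:password (HTTP basic auth).
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    req = urllib.request.Request(f"{base_url}/members", data=form, method="POST")
    req.add_header("Authorization", f"Basic {token}")
    return req


# To actually send it (needs network access to the REST API):
#   req = build_subscribe_request("http://172.19.199.2:8001/3.1", "user", "password",
#                                 "laura-test.mm3.aca-aws.s.uw.edu", "ssw@uw.edu")
#   with urllib.request.urlopen(req, timeout=30) as resp:
#       print(resp.status)
```

The 30 second timeout in the usage comment is an arbitrary choice for illustration, not a value taken from Mailman.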
We’re running the current rolling docker images (up to date as of 25Jan2021)
Obviously, people should run their servers better than letting disk partitions fill up, but it’s going to happen, and mailman3 should do a better job of reporting errors and recovering when accidents happen.
thanks, steve
I’ve included mailman.log lines from going to the mass subscribe
[26/Jan/2021:20:42:49 +0000] "GET /3.1/lists/laura-test.mm3.aca-aws.s.uw.edu HTTP/1.1" 200 434 "-" "GNU Mailman REST client v3.3.2"
[26/Jan/2021:20:42:49 +0000] "GET /3.1/lists/laura-test@mm3.aca-aws.s.uw.edu/requests/count?token_owner=moderator HTTP/1.1" 200 73 "-" "GNU Mailman REST client v3.3.2"
[26/Jan/2021:20:42:49 +0000] "GET /3.1/lists/laura-test@mm3.aca-aws.s.uw.edu/held/count HTTP/1.1" 200 73 "-" "GNU Mailman REST client v3.3.2"
[26/Jan/2021:20:43:01 +0000] "GET /3.1/lists/laura-test.mm3.aca-aws.s.uw.edu HTTP/1.1" 200 434 "-" "GNU Mailman REST client v3.3.2"
Jan 26 20:43:07 2021 (48) deque: Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/urllib3/connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/lib/python3.8/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.8/http/client.py", line 1347, in getresponse
    response.begin()
  File "/usr/lib/python3.8/http/client.py", line 307, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.8/http/client.py", line 268, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/usr/lib/python3.8/site-packages/urllib3/connectionpool.py", line 724, in urlopen
    retries = retries.increment(
  File "/usr/lib/python3.8/site-packages/urllib3/util/retry.py", line 403, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/lib/python3.8/site-packages/urllib3/packages/six.py", line 735, in reraise
    raise value
  File "/usr/lib/python3.8/site-packages/urllib3/connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "/usr/lib/python3.8/site-packages/urllib3/connectionpool.py", line 428, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/lib/python3.8/site-packages/urllib3/connectionpool.py", line 335, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='mailman-web', port=8000): Read timed out. (read timeout=5)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/mailman/app/workflow.py", line 69, in __next__
    return step()
  File "/usr/lib/python3.8/site-packages/mailman/app/subscriptions.py", line 343, in _step_do_subscription
    self.member = self.mlist.subscribe(
  File "/usr/lib/python3.8/site-packages/mailman/database/transaction.py", line 85, in wrapper
    return function(args[0], config.db.store, *args[1:], **kws)
  File "/usr/lib/python3.8/site-packages/mailman/model/mailinglist.py", line 507, in subscribe
    notify(SubscriptionEvent(
  File "/usr/lib/python3.8/site-packages/zope/event/__init__.py", line 32, in notify
    subscriber(event)
  File "/usr/lib/python3.8/site-packages/mailman/app/membership.py", line 176, in handle_SubscriptionEvent
    send_welcome_message(mlist, member, member.preferred_language)
  File "/usr/lib/python3.8/site-packages/mailman/app/notifications.py", line 51, in send_welcome_message
    welcome_message = wrap(getUtility(ITemplateLoader).get(
  File "/usr/lib/python3.8/site-packages/mailman/model/template.py", line 188, in get
    contents = getUtility(ITemplateManager).get(
  File "/usr/lib/python3.8/site-packages/mailman/database/transaction.py", line 85, in wrapper
    return function(args[0], config.db.store, *args[1:], **kws)
  File "/usr/lib/python3.8/site-packages/mailman/model/template.py", line 110, in get
    contents = protocols.get(actual_uri, **auth)
  File "/usr/lib/python3.8/site-packages/mailman/utilities/protocols.py", line 38, in get
    response = requests.get(url, timeout=REQUEST_TIMEOUT, **kws)
  File "/usr/lib/python3.8/site-packages/requests/api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/lib/python3.8/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python3.8/site-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3.8/site-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3.8/site-packages/requests/adapters.py", line 529, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='mailman-web', port=8000): Read timed out. (read timeout=5)
[26/Jan/2021:20:43:07 +0000] "POST /3.1/members HTTP/1.1" 400 130 "-" "GNU Mailman REST client v3.3.2"
[26/Jan/2021:20:43:07 +0000] "GET /3.1/lists/laura-test@mm3.aca-aws.s.uw.edu/requests/count?token_owner=moderator HTTP/1.1" 200 73 "-" "GNU Mailman REST client v3.3.2"
[26/Jan/2021:20:43:07 +0000] "GET /3.1/lists/laura-test@mm3.aca-aws.s.uw.edu/held/count HTTP/1.1" 200 73 "-" "GNU Mailman REST client v3.3.2"
Then the results of a successful subscribe:
[26/Jan/2021:20:46:20 +0000] "GET /3.1/lists/list2.mm3.aca-aws.s.uw.edu HTTP/1.1" 200 388 "-" "GNU Mailman REST client v3.3.2"
[26/Jan/2021:20:46:20 +0000] "POST /3.1/members HTTP/1.1" 201 0 "-" "GNU Mailman REST client v3.3.2"
[26/Jan/2021:20:46:20 +0000] "GET /3.1/lists/list2@mm3.aca-aws.s.uw.edu/requests/count?token_owner=moderator HTTP/1.1" 200 73 "-" "GNU Mailman REST client v3.3.2"
[26/Jan/2021:20:46:20 +0000] "GET /3.1/lists/list2@mm3.aca-aws.s.uw.edu/held/count HTTP/1.1" 200 73 "-" "GNU Mailman REST client v3.3.2"
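To pick the failing calls out of a longer mailman.log, something like the following Python sketch may help. It is only an illustration: the regular expression is an assumption based on the access-log lines quoted in this message, and the function name failed_requests is made up for the example.

```python
import re

# Matches the access-log entries quoted above, e.g.
#   [26/Jan/2021:20:43:07 +0000] "POST /3.1/members HTTP/1.1" 400 130 ...
# The pattern is inferred from those samples and may not cover every log format.
ACCESS_RE = re.compile(
    r'\[(?P<when>[^\]]+)\] "(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3}) '
)


def failed_requests(log_text):
    """Return (timestamp, method, path, status) for every response with status >= 400."""
    return [
        (m["when"], m["method"], m["path"], int(m["status"]))
        for m in ACCESS_RE.finditer(log_text)
        if int(m["status"]) >= 400
    ]


# Usage, assuming the log is readable at this (hypothetical) path:
#   with open("mailman.log") as fh:
#       for entry in failed_requests(fh.read()):
#           print(entry)
```

Because it scans with finditer rather than line by line, it also copes with log entries that have been run together onto one line.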
Stephen Willey Mats Mats writes:
mailman3 should do a better job of reporting errors and recovering when accidents happen.
It appears that the side which is logging the errors is working fine. It expects to receive data which is not arriving, so there's a timeout. So probably it's on the Postorius side, which is not in these logs. I don't recall offhand if Django keeps its own logs. If it does and you have them, they would help to understand the problem and design a mitigation. But since it's a network connection, it could be any number of things in the middle, too -- even a kernel problem.
If you have ideas about how better to report, fine, but it seems to me that in case of a major system incident, it's better to leave the diagnosis of faults in the interactions of such complex systems to the humans.
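One way to narrow down where the delay comes from is to time a plain GET against the web side from inside the core container, outside of Mailman. The sketch below is an assumption-laden illustration: the function name probe is invented, and the URL you pass is up to you (the logs show only the host 'mailman-web' and port 8000, not the path being fetched). The 5 second default mirrors the "read timeout=5" in the traceback.

```python
import socket
import time
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Issue one GET and return (outcome, elapsed_seconds).

    outcome is "HTTP <code>", "timeout", or "error: <reason>".
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
            outcome = f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        outcome = f"HTTP {exc.code}"
    except (socket.timeout, TimeoutError):
        # Timed out while waiting for or reading the response.
        outcome = "timeout"
    except urllib.error.URLError as exc:
        # Timeouts while connecting arrive wrapped in URLError.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            outcome = "timeout"
        else:
            outcome = f"error: {exc.reason}"
    return outcome, time.monotonic() - start


# Usage (the path "/" is a placeholder, not taken from the logs):
#   print(probe("http://mailman-web:8000/", timeout=5.0))
```

If the probe times out as well, the problem is likely between the containers or on the web side; if it answers quickly, the specific URL Mailman requests is the more likely suspect.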
Steve
participants (2)
- Stephen J. Turnbull
- Stephen Willey Mats Mats