`failover.status` not responding for several minutes
Description
Problem/Justification
Impact
Attachments
- 06 Feb 2023, 06:15 PM
Activity
Bug Clerk February 21, 2023 at 1:06 PM
22.12.2 PR: https://github.com/truenas/middleware/pull/10725
Bug Clerk February 21, 2023 at 12:12 PM
Caleb February 13, 2023 at 4:05 PM
Please use m40g3-137.dc1.ixsystems.net (root/abcd1234), as it's currently experiencing the problem. Also, this happens often; it's not a transient problem.
Caleb February 13, 2023 at 4:04 PM
@Vladimir Vinogradenko It seems we might have a regression on Cobia from our websocket changes (or it could be something else).
From my client PC, I have a script that connects to our public websocket endpoint and calls failover.status using the session id provided in the response after establishing a connection. The results are below.
The first time I ran the script, it responded within the same second:
caleb@caleb-win10:/mnt/c/Users/caleb/Downloads$ python3 websocket-test.py
CONNECTING TO: 'ws://m40g3-137.dc1.ixsystems.net:80/websocket'
getting failover.status 10:38:41
response: {'id': '4278aa14-0eba-4da8-9898-5c137f6da374', 'msg': 'result', 'result': 'MASTER'} 10:38:41
I ran the same thing again, and this time it took ~16 seconds:
CONNECTING TO: 'ws://m40g3-137.dc1.ixsystems.net:80/websocket'
getting failover.status 10:38:57
response: {'id': '0d2aa347-9497-4247-a7e1-3f16ea0ff1a9', 'msg': 'result', 'result': 'MASTER'} 10:39:13
I ran it again shortly after that one responded, and it still hasn't returned (it has been sitting for 15 minutes at this point):
python3 websocket-test.py
CONNECTING TO: 'ws://m40g3-137.dc1.ixsystems.net:80/websocket'
getting failover.status 10:39:40
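The websocket-test.py script itself is not attached to this ticket. For anyone trying to reproduce the timings above, here is a minimal sketch of the message handling such a script might use. It assumes the DDP-style frames used by the middleware websocket API (a "connect" handshake, then a "method" call matched to its reply by id). The helper names connect_message, method_message, and timed_call are hypothetical, and the send/recv callables stand in for whatever websocket client is used.

```python
import json
import time
import uuid
from typing import Callable, List, Optional, Tuple


def connect_message() -> str:
    # Handshake frame. The exact shape is an assumption (DDP-style),
    # not something taken from this ticket.
    return json.dumps({"msg": "connect", "version": "1", "support": ["1"]})


def method_message(method: str, params: Optional[List] = None) -> Tuple[str, str]:
    # Build a method-call frame and return (call_id, frame).
    # The call id is what lets us match the 'result' reply to this call.
    call_id = str(uuid.uuid4())
    frame = json.dumps(
        {"id": call_id, "msg": "method", "method": method, "params": params or []}
    )
    return call_id, frame


def timed_call(
    send: Callable[[str], None],
    recv: Callable[[], str],
    method: str = "failover.status",
) -> Tuple[object, float]:
    # Send one call and block until the matching 'result' frame arrives.
    # Returns (result, seconds_elapsed). Note: this waits forever if the
    # server never answers, which is the hang described in this ticket.
    call_id, frame = method_message(method)
    start = time.monotonic()
    send(frame)
    while True:
        reply = json.loads(recv())
        if reply.get("msg") == "result" and reply.get("id") == call_id:
            return reply.get("result"), time.monotonic() - start
```

With a real websocket client, send and recv would be that client's send and receive methods, called after sending connect_message() and reading the handshake reply. Printing the elapsed time from timed_call gives the same kind of measurement as the timestamps in the output above.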
Ievgen Stepanovych February 7, 2023 at 11:01 AM
It’s doing it again.
This time it stopped working as I was using the machine, i.e. it worked for ~30 minutes and then stopped.
Seen on http://10.238.238.199/ (API tests in TrueNAS SCALE - Cobia Jenkins plan).
The WebUI waits for a response to `failover.status` before showing the login form. On the machine in question, this call gets no response even after minutes of waiting, so the login form is never shown.
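When reproducing this from a test client, it can help to bound the wait so that a hung call is reported instead of blocking indefinitely. The following is a minimal sketch using asyncio.wait_for from the Python standard library. The helper name call_with_timeout is hypothetical, and this is only a diagnostic aid for test scripts, not the fix from the linked PR and not how the WebUI is implemented.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def call_with_timeout(
    call: Callable[[], Awaitable[str]], timeout_s: float
) -> Optional[str]:
    # Await the call, but give up after timeout_s seconds.
    # Returns the call's result, or None if it did not answer in time.
    try:
        return await asyncio.wait_for(call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the pending call before raising.
        return None
```

Here call would be a coroutine function that performs the `failover.status` request. A None return then marks the "no response" case that otherwise shows up as a test client sitting idle for minutes.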