Middleware disconnects still persisting in BETA 2.1

Description

This is a continuation of NAS-107338, which was closed with the comment: "There were two mitigations, one for BETA2 and the other scheduled for RC1."

Since then, BETA2.1 has been released. It appears to include bugfixes made after BETA2, so I'd guess it includes the fixes referred to above. But middleware disconnections still persist.

I noticed this on the console within the last hour. A replication ended at 2020-06-09 02.01 (UK time) and another one started at 02.38, 37 minutes later. Between those two times the server was utterly idle: no scrubs, no send/recv, no SMB or iSCSI activity whatsoever. Only SSH and console sessions were active, and those were virtually idle.

As the middlewared log shows, in that brief 37 minute idle span, middleware lost its self-connection twice, at 02.13.10 and again at 02.28.10 (and then a third time, after send/recv recommenced, at 02.43.10). All of these were exactly 15 minutes apart.

After that, further disconnections occurred at 02.48.10, 02.53.10 and 02.58.10, again exactly 5 minutes apart to the second.
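
As a quick check on the spacing described above, here is a minimal sketch that computes the gaps between the six disconnect times. The list is typed in by hand from the times quoted in this description, not parsed from the middlewared log, and the variable names are only illustrative.

```python
from datetime import datetime

# Disconnect times as quoted in the description (HH.MM.SS, UK time).
stamps = ["02.13.10", "02.28.10", "02.43.10", "02.48.10", "02.53.10", "02.58.10"]

# Parse each time; the date part is irrelevant because all fall on the same night.
times = [datetime.strptime(s, "%H.%M.%S") for s in stamps]

# Gap in seconds between each disconnect and the next one.
gaps = [int((later - earlier).total_seconds())
        for earlier, later in zip(times, times[1:])]

print(gaps)
print(all(gap % 300 == 0 for gap in gaps))
```

Every gap comes out as a whole multiple of 300 seconds, which matches the "to the second" regularity noted above.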

In each case the traceback seemed to point to collectd_pyplugins/disktemp.py and then self.sock.recv. Doubtless the logs have more detail.

I'll have to break the debug file into 2-3 parts to upload, as it has some large files in it that won't fit within the 50M upload limit. I'll upload it shortly, or tomorrow, as it's late.

Problem/Justification

None

Impact

None

Activity

William GryzbowskiOctober 16, 2020 at 6:45 PM

Did you notice the bunch of SCSI errors from several disks on mpr?

StilezOctober 15, 2020 at 2:36 PM
Edited

Sadly, while a lot better than BETA2/2.1, it's become obvious it's not fixed. I was away from midday 10 Oct to midday 15 Oct, during which time the server (RC1) was up but at close to zero usage. If family watched 2 movies from it the entire time, that's a lot. In addition, iSCSI/SMB connections were made but sat idle for some time on 14 Oct, when a desktop was turned on (but not used). Despite that, there were plenty of middleware disconnects. New debug attached ("3rd debug dump").

Example, sporadically:

Oct 11 14:12:54 fsz-1 1 2020-10-11T13:12:54.184184+00:00 MYNAS devd 2021 - - notify_clients: send() failed; dropping unresponsive client
Oct 11 14:12:54 fsz-1 1 2020-10-11T13:12:54.185869+00:00 MYNAS devd 2021 - - notify_clients: send() failed; dropping unresponsive client
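
For anyone wanting to tally these devd messages across a longer log, a rough sketch follows. It assumes every relevant line has the same layout as the two entries quoted above. The regular expression and the function name drops_per_day are illustrative only, and the sample text is just those two entries pasted in.

```python
import re
from collections import Counter

# Sample input: the two entries quoted in the comment above, one per line.
sample = """\
Oct 11 14:12:54 fsz-1 1 2020-10-11T13:12:54.184184+00:00 MYNAS devd 2021 - - notify_clients: send() failed; dropping unresponsive client
Oct 11 14:12:54 fsz-1 1 2020-10-11T13:12:54.185869+00:00 MYNAS devd 2021 - - notify_clients: send() failed; dropping unresponsive client
"""

# Capture the date part of the ISO timestamp on lines where devd drops a client.
# Assumed layout: "<ISO timestamp> <hostname> devd ... dropping unresponsive client".
pattern = re.compile(
    r"(\d{4}-\d{2}-\d{2})T\S+ \S+ devd .*dropping unresponsive client"
)

def drops_per_day(text):
    """Count devd 'dropping unresponsive client' messages per UTC date."""
    return Counter(match.group(1) for match in pattern.finditer(text))

counts = drops_per_day(sample)
print(dict(counts))
```

Counting by the UTC date in the inner timestamp avoids the ambiguity of the year-less local timestamp at the start of each line.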

William GryzbowskiOctober 6, 2020 at 1:43 PM

Thanks for the update, let us know if it changes

StilezOctober 5, 2020 at 11:14 AM

It's been a few days, no sign of issues yet. I haven't used the UI much or done much, so don't take that as definitive, but so far fingers crossed, etc.

StilezOctober 2, 2020 at 2:02 AM

Aha! Something's happened! Solved the kernel panic on RC1 boot! It was a Chelsio driver bug in the new cxgbe version merged in after BETA2.1, now reported upstream.

So now hopefully I will be able to answer your question in a while!

Details

Impact

Medium

Created September 6, 2020 at 2:03 AM
Updated July 1, 2022 at 4:55 PM
Resolved October 6, 2020 at 1:43 PM