Active Directory causing stopped SMB service on DC failover + WBC_ERR_WINBIND_NOT_AVAILABLE

Description

I have a setup with two Windows Server 2025 domain controllers and a TrueNAS machine called “vertex”.

The “primary” domain controller “prime-win” is at 192.168.1.13 → set as the primary DNS in TN
The secondary domain controller “vertex-win” is at 192.168.1.12 → set as the secondary DNS in TN

The “vertex” TrueNAS machine can reach both just fine and AD replication is working successfully between the two domain controllers. TrueNAS can also join the domain just fine.
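For anyone trying to reproduce this, the reachability claim above could be checked with something like the following. It is only a minimal sketch, not what TrueNAS or winbind does internally. The port numbers (88 for Kerberos, 389 for LDAP) and the helper names `port_open` and `check_dcs` are assumptions made for illustration.

```python
import socket

# Assumed ports: 88 = Kerberos (KDC), 389 = LDAP.
DC_PORTS = {"kerberos": 88, "ldap": 389}


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_dcs(dcs, probe=port_open):
    """Map each DC name to {service: reachable}, using the given probe function."""
    return {
        name: {svc: probe(ip, port) for svc, port in DC_PORTS.items()}
        for name, ip in dcs.items()
    }
```

Calling `check_dcs({"prime-win": "192.168.1.13", "vertex-win": "192.168.1.12"})` from the TrueNAS shell would show, per DC, whether each port accepts TCP connections. A successful TCP connect only shows that the port is open, not that AD itself is healthy.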

At this point, running directory_service activedirectory domain_info via the CLI returns an LDAP server name and a KDC server that both point to the name/IP of “prime-win”.

Now, if the primary DC “prime-win” goes offline for a few minutes (e.g. for Windows Updates), I get a WBC_ERR_WINBIND_NOT_AVAILABLE error from “vertex” and the SMB service is turned off (“Stopped”) after a couple of minutes. The AD service is marked as “Faulted” at that point. This is a screenshot of the alert:

After a few more minutes (here: 7 minutes after the initial alert), the above WBC_ERR_WINBIND_NOT_AVAILABLE error is cleared automatically and AD shows as “Healthy” again. However, the SMB service is still stopped and needs to be started manually again. This is unexpected.

Now, running directory_service activedirectory domain_info via the CLI returns an LDAP server name and a KDC server that both point to the name/IP of “vertex-win”.

 

I’d also think that the WBC_ERR_WINBIND_NOT_AVAILABLE error should not happen and/or that “recovery/failover” shouldn’t take 7 minutes.

I guess some expected behaviors could be (1) to keep the SMB service running during the “faulted” AD state (mostly everything should be cached?), or (2) restart the SMB service automatically afterwards, or (3) not even cause this long faulted state (of 7 minutes) in the first place.
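To make option (2) more concrete, here is a rough sketch of the logic I have in mind. It is purely illustrative and not how the TrueNAS middleware is implemented. The class name `SmbWatchdog`, the `observe` method and the `start_smb` hook are all hypothetical, and how the AD health and SMB state would actually be read is left open.

```python
class SmbWatchdog:
    """Restart SMB after an AD fault clears, but only if it was running before.

    start_smb is a hypothetical hook that starts the SMB service.
    """

    def __init__(self, start_smb):
        self._start_smb = start_smb
        self._smb_wanted = False  # SMB state seen while AD was last healthy
        self._in_fault = False    # True while an AD fault is ongoing

    def observe(self, ad_healthy, smb_running):
        """Feed one status sample; return True if an SMB start was triggered."""
        if not ad_healthy:
            # Remember the fault; do nothing until AD recovers.
            self._in_fault = True
            return False

        restarted = False
        if self._in_fault:
            # AD just recovered: bring SMB back if it was up before the fault.
            if self._smb_wanted and not smb_running:
                self._start_smb()
                restarted = True
        else:
            # Steady healthy state: track whether SMB is meant to be running.
            self._smb_wanted = smb_running
        self._in_fault = False
        return restarted
```

The point of tracking the pre-fault state is that an SMB service the admin stopped on purpose stays stopped, while one that was only taken down by the AD fault is started again once AD is “Healthy”.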

I’ll attach two debug files: The first one was taken during the “AD fault” with one domain controller down and the SMB service already shut down after I got the WBC_ERR_WINBIND_NOT_AVAILABLE error.
The second file was taken after AD “failed over” and was “Healthy” again with the above alert cleared, but the SMB service still shut down.
(There’ll probably be a lot of noise in the logs before, as I did some testing and also left and re-joined the domain at some points to verify the issue. You can see the timestamp of the alert (“2025-03-18 05:10:39”) in the screenshot. Everything around that timestamp should be “good”.)

 

(On another note, I’m often still seeing winbind core dumps when restarting TrueNAS with Active Directory set up. Guessing it’s still the same issue as . Would love to have more information on that as well.)

Problem/Justification

The problem to be solved is that one AD domain controller failure, with another online DC available, should not result in a state where the SMB service has to be manually restarted on the TrueNAS machine.

Impact

None

Attachments

1


Created last week
Updated 3 days ago