Strange errors and gui not reachable
Description
Problem/Justification
Impact
Attachments
- 21 Aug 2023, 10:09 AM
Activity
Alexandra Bain August 21, 2023 at 8:11 PM
@nautilus7 please update to the newer release and follow the steps described below. If you still have issues after that, please submit a new ticket with the debug.
nautilus7 August 21, 2023 at 3:39 PM (Edited)
Guys, thanks for the input. Allow me a few comments.
I was not doing anything on the machine when these alerts started popping up in my email. The machine has been running for a month now without any interruption, doing nothing more than running kubernetes apps and some samba file transfers. So, I have no way to reproduce the problem…
Yes, I have quite a few apps, all from truecharts, but I think they are not too many or too heavy for this hardware. At least cpu and memory wise, the system seems to handle them quite easily. In addition, many of the apps are extremely lightweight (like tailscale, cloudflared, mosquitto, unifi controller, adguard-home, etc.) and others are just kubernetes operator apps from truecharts. The “real” apps are frigate (video recorder for my cameras), plex (which is rarely used), and homeassistant (which is a fresh installation that is not yet set up at all).
I hear the comment about my boot drive being stressed. While the rest of the hardware is new, the boot drive is an old ssd I had lying around. It might not be on par with the rest of the hardware. I have been thinking about replacing it with an nvme for quite some time now, as some gui pages seem a little unresponsive from time to time, but I was not sure if the boot drive is the problem or not. I have no way to tell if, and how much, the boot drive is stressed. The scale hardware recommendation is an ssd for the boot drive, but it doesn't say that an nvme may be needed when many apps are running, etc.
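One rough way to see how busy a boot drive is on a Linux system such as SCALE is to sample /proc/diskstats twice and compare the "time spent doing I/Os" counter (the tenth statistics field after the device name). The sketch below is only an illustration and not a TrueNAS tool: the function names are invented for this example, and which device is the boot drive (for instance sda) depends on the system.

```python
import time


def parse_diskstats(text):
    """Map device name -> milliseconds spent doing I/O.

    Each /proc/diskstats line is: major, minor, name, then the statistics
    fields. The tenth statistics field (index 12 after splitting) is the
    cumulative time the device had I/O in flight.
    """
    busy_ms = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 13:
            continue  # skip blank or unexpectedly short lines
        busy_ms[parts[2]] = int(parts[12])
    return busy_ms


def utilisation(before, after, interval_ms):
    """Percentage of the interval each device spent busy."""
    return {
        dev: 100.0 * (after[dev] - before[dev]) / interval_ms
        for dev in after
        if dev in before
    }


def sample_utilisation(interval_s=5.0, path="/proc/diskstats"):
    """Take two snapshots interval_s apart and return busy % per device."""
    with open(path) as f:
        first = parse_diskstats(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        second = parse_diskstats(f.read())
    return utilisation(first, second, interval_s * 1000.0)
```

Calling sample_utilisation(5.0) on the NAS returns a percentage per device. A boot device that stays close to 100% while the gui is sluggish would support the overloaded boot drive theory, while a low figure would point elsewhere.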
One of my HDDs has 3 pending sectors. I have known that for at least 6 months now. I have seen the failed smart tests, but I’ve also seen the smart tests succeed afterwards. At some point I decided to take the disk out and run it through “badblocks” looking for bad sectors. Unfortunately, none were found, so I decided to put it back in. I suspect that it might be a problem with some cabling or even my hba controller. I am still investigating.
I don’t really understand what a “core dump” is and why qbittorrent is doing that. That seems to have happened on Aug 3, which is well before yesterday.
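For context, the kernel line quoted in Caleb's comment already says a fair amount about the crash. A segfault line records the address the process tried to access, the instruction pointer, and a page-fault error code. The sketch below decodes those fields; it assumes an x86-64 kernel (where bit 0 of the error code means protection violation, bit 1 means write, and bit 2 means user mode), and decode_segfault is a name made up for this example.

```python
import re

# Matches the kernel's "segfault at ..." message; the trailing
# "in object[base+size]" part is optional.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def decode_segfault(line):
    """Return a dict describing a kernel segfault line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m["err"], 16)  # the kernel prints the error code in hex
    ip = int(m["ip"], 16)
    info = {
        "process": m["comm"],
        "pid": int(m["pid"]),
        "fault_address": int(m["addr"], 16),
        # x86 page-fault error code bits (assumption: x86-64 host)
        "cause": "protection violation" if err & 0x1 else "page not present",
        "access": "write" if err & 0x2 else "read",
        "mode": "user" if err & 0x4 else "kernel",
    }
    if m["base"] is not None:
        info["object"] = m["obj"]
        # Offset of the faulting instruction inside the mapped object
        info["offset"] = ip - int(m["base"], 16)
    return info


LINE = (
    "Aug 3 20:32:41 atlas kernel: qbittorrent-nox[17398]: segfault at 20 "
    "ip 000000000045d3e1 sp 00007ffd6aacbae0 error 4 in "
    "qbittorrent-nox[400000+19b9000]"
)
info = decode_segfault(LINE)
```

For the line in this ticket, the result describes a user-mode read of address 0x20 on a page that is not mapped. That pattern is typical of a dereference through a null or near-null pointer inside the application, although that is an interpretation rather than something the log proves.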
I didn’t update to the latest 22.12.3 version as that only seems to fix a zfs replication bug, which does not affect me.
Do you want me to update to the latest version (and reboot) and create a new debug for you to compare?
I will replace my boot drive with an nvme in the meantime, hoping this will improve things.
Bonnie Follweiler August 21, 2023 at 1:55 PM (Edited)
Good Morning @nautilus7.
If possible, upgrade to 22.12.3.3 and investigate the issues that were highlighted in the comments below. If you are able to reproduce the issue (and have resolved the points made below), then please raise a new ticket.
Caleb August 21, 2023 at 1:49 PM
@nautilus7 you’re not really giving any actionable data but instead are just complaining about a free product. However, we appreciate the debug, so I was able to investigate a bit to see if I could spot anything. Here are a few areas that I would suggest you investigate:
1. I see you’re running quite a few apps. If you’re running apps from a 3rd party then there is nothing we can do for support. I’m not blaming 3rd party apps, but since you’ve not given any troubleshooting steps you’ve taken I would probably investigate this first. Maybe try to isolate the problem to an app or turn off ones that you don’t need/use.
2. The alerts that you’re getting are simply a side-effect of something occurring at a lower level. However, I do see an alert stating that you have a drive with unreadable sectors. This means, more than likely, you have a drive that is having problems, and I would suggest you take a closer look at it. I was also able to confirm that this particular drive actually failed an extended offline test.
#21 Extended offline Completed: read failure 90% 2463 2567022136
1 of 1 failed self-tests are outdated by newer successful extended offline self-test # 2
3. You’re running 22.12.3.2, which is not the latest release. We’ve released 22.12.3.3, which I would suggest upgrading to.
4. I’m also seeing that one of your apps is core dumping. That’s a pretty bad problem and could be bad software, bad hardware, etc. Specifically this is the app that core dumped.
Aug 3 20:32:41 atlas kernel: qbittorrent-nox[17398]: segfault at 20 ip 000000000045d3e1 sp 00007ffd6aacbae0 error 4 in qbittorrent-nox[400000+19b9000]
Aug 3 20:32:41 atlas kernel: Code: 48 03 82 80 00 00 00 8b 10 31 c0 e8 f1 c9 fe ff 48 8b 7c 24 08 31 f6 e8 bd 4a 85 00 0f b7 43 06 4c 8b 4c 24 08 e9 5b b0 98 00 <48> 8b 04 25 20 00 00 00 0f 0b 90 4c 89 e7 e8 2c f8 81 00 48 89 df
Aug 3 20:32:41 atlas kernel: Process 7(qbittorrent-nox) has RLIMIT_CORE set to 1
Aug 3 20:32:41 atlas kernel: Aborting core
5. I’m also seeing log messages like this, which makes me think (potentially) that the boot drive is being worked to death or something else is utilizing all the resources for the boot drive. Again it’s hard to say, but the message below is low-level socket message handling, so the fact it’s taking a while makes me think this is a side-effect of something else going on. I would see if your boot drive is potentially getting overwhelmed or maybe look at the system while it’s stuck in a bad state to see if you can identify an outlier.
[2023/08/21 01:20:43] (WARNING) middlewared._loop_monitor_thread():1754 - Task seems blocked:
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1395, in call
    return await self._call(
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1344, in _call
    return await methodobj(*prepared_call.args)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/kubernetes_linux/events.py", line 26, in setup_k8s_events
    await self.k8s_events_internal()
  File "/usr/lib/python3/dist-packages/middlewared/plugins/kubernetes_linux/events.py", line 50, in k8s_events_internal
    self.middleware.send_event(
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1507, in send_event
    wsclient.send_event(name, event_type, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 313, in send_event
    self._send(event)
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 137, in _send
    asyncio.run_coroutine_threadsafe(self.response.send_str(serialized), loop=self.loop)
  File "/usr/lib/python3.9/asyncio/tasks.py", line 931, in run_coroutine_threadsafe
    loop.call_soon_threadsafe(callback)
  File "/usr/lib/python3.9/asyncio/base_events.py", line 797, in call_soon_threadsafe
    self._write_to_self()
  File "/usr/lib/python3.9/asyncio/selector_events.py", line 140, in _write_to_self
    csock.send(b'\0')
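To illustrate what a "Task seems blocked" style warning is getting at, here is a generic sketch of how a watchdog thread can notice that an asyncio event loop has stopped making progress. It is not middlewared's actual implementation, which is not shown in this ticket; loop_is_responsive and demo are names invented for this example. The idea is to ask the loop to run a trivial callback and see whether it gets to it within a timeout.

```python
import asyncio
import threading
import time


def loop_is_responsive(loop, timeout):
    """Return True if the loop runs a trivial callback within `timeout` seconds."""
    ran = threading.Event()
    # Thread-safe way to queue a callback on a loop owned by another thread
    loop.call_soon_threadsafe(ran.set)
    return ran.wait(timeout)


def demo():
    loop = asyncio.new_event_loop()
    thread = threading.Thread(target=loop.run_forever, daemon=True)
    thread.start()

    results = []
    # 1. Idle loop: the callback runs almost immediately.
    results.append(loop_is_responsive(loop, 0.5))

    # 2. Block the loop with a synchronous sleep, standing in for slow I/O
    #    or any other call that does not yield back to the loop.
    loop.call_soon_threadsafe(time.sleep, 1.0)
    time.sleep(0.1)  # give the blocking call time to start
    results.append(loop_is_responsive(loop, 0.3))

    # 3. Once the blocking call has finished, the loop responds again.
    time.sleep(1.0)
    results.append(loop_is_responsive(loop, 0.5))

    loop.call_soon_threadsafe(loop.stop)
    thread.join()
    loop.close()
    return results
```

In this sketch the middle check fails because a synchronous call is holding the loop, which is the same kind of symptom the warning above reports. A real monitor would additionally log the stack of the loop thread, as the traceback in this comment does, so that the blocking call can be identified.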
Automation for Jira August 21, 2023 at 10:07 AM
Thank you for submitting this TrueNAS Bug Report! So that we can quickly investigate your issue, please attach a Debug file and any other information related to this issue through our secure and private upload service below. Debug files can be generated in the UI by navigating to System -> Advanced -> Save Debug.
https://ixsystems.atlassian.net/servicedesk/customer/portal/15/group/37/create/153
Yesterday, I started receiving a series of error messages in my email (see attached file).
Apparently some python code crashed, which caused the gui to stop working correctly. Some parts/pages did not load completely or at all. At least that is what I saw.
Maybe the problem is some other code or service, which I don’t know. I am not sure if all error messages are the same or not, so I included them all. Please ignore the message about the pending sectors for one of my hard disks.
So, during that time yesterday I tried creating a debug log, but the system failed to do so. I tried a second time, but then the gui became completely unavailable. Other services (kubernetes apps) were working and I was able to access them, while at the same time the gui was not reachable.
Anyway, I was going to reboot the system today to see if the problem goes away, but it seems to be working fine now, even without a reboot. The gui is again working and I was able to create a debug log, which is attached.
Note: It’s been almost a year now since I set up truenas scale on this hardware (specifically bought for this purpose) and it still doesn’t feel stable enough for production. I still cannot trust it to take over as my main NAS system in my office. I have come across numerous problems, mostly with the kubernetes cluster, a couple of weeks back with the network, now with the gui. It’s been very frustrating and feels like we users are still beta testing the scale operating system, even though it’s supposed to be production ready.
Regards.