Description
Steps to Reproduce
Expected Result
Actual Result
Environment
Hardware Health
Error Message (if applicable)
Activity
Bug Clerk last week
This issue has now been closed. Comments made after this point may not be viewed by the TrueNAS Teams. Please open a new issue if you have found a problem or need to re-engage with the TrueNAS Engineering Teams.
Stavros Kois last week
Nope, nothing changed in the last week.
I’ll close this for now; if you can reproduce it, please let me know!
Thanks
Stuart Espey last week
Hmmm, today I have been unable to reproduce. I also tried restarting and moving the docker dataset back to a spinning-rust pool.
These are the current logs:
```
2024-11-20 01:23:28.354952+00:00 s6-rc: info: service s6rc-oneshot-runner: starting
2024-11-20 01:23:28.361735+00:00 s6-rc: info: service s6rc-oneshot-runner successfully started
2024-11-20 01:23:28.362139+00:00 s6-rc: info: service fix-attrs: starting
2024-11-20 01:23:28.369012+00:00 s6-rc: info: service fix-attrs successfully started
2024-11-20 01:23:28.369161+00:00 s6-rc: info: service legacy-cont-init: starting
2024-11-20 01:23:28.375233+00:00 cont-init: info: running /etc/cont-init.d/01-timezone
2024-11-20 01:23:29.156574+00:00 cont-init: info: /etc/cont-init.d/01-timezone exited 0
2024-11-20 01:23:29.156915+00:00 cont-init: info: running /etc/cont-init.d/50-cron-config
2024-11-20 01:23:29.176801+00:00 cont-init: info: /etc/cont-init.d/50-cron-config exited 0
2024-11-20 01:23:29.178608+00:00 s6-rc: info: service legacy-cont-init successfully started
2024-11-20 01:23:29.178872+00:00 s6-rc: info: service legacy-services: starting
2024-11-20 01:23:29.191433+00:00 services-up: info: copying legacy longrun collector-once (no readiness notification)
2024-11-20 01:23:29.195471+00:00 services-up: info: copying legacy longrun cron (no readiness notification)
2024-11-20 01:23:29.198311+00:00 services-up: info: copying legacy longrun influxdb (no readiness notification)
2024-11-20 01:23:29.201443+00:00 services-up: info: copying legacy longrun scrutiny (no readiness notification)
2024-11-20 01:23:29.207430+00:00 s6-rc: info: service legacy-services successfully started
```
And I note that the difference seems to be that 01-timezone is not timing out this time:
```
2024-11-20 01:23:28.375233+00:00 cont-init: info: running /etc/cont-init.d/01-timezone
2024-11-20 01:23:29.156574+00:00 cont-init: info: /etc/cont-init.d/01-timezone exited 0
```
vs
```
2024-11-13 03:15:15.032061+00:00 cont-init: info: running /etc/cont-init.d/01-timezone
2024-11-13 03:15:20.015043+00:00 s6-rc: fatal: timed out
```
I wonder if anything changed in the app to fix it?
Stavros Kois last week
Sorry for the delay.
I’ve been trying to reproduce, even dropped startup to 1s and timeout to 1s as well. It still starts fine.
Can you please set S6_VERBOSITY to 5 via additional environment variables? Let’s see if we can see something there.
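In compose terms that should end up looking roughly like this (assuming the additional environment variables are passed straight through to the container environment):
```
environment:
  S6_VERBOSITY: "5"
```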
Stuart Espey November 14, 2024 at 4:11 AM
Btw, I left it deploying for 2 hours. No change.
I found that when I installed the Scrutiny app from the catalog, if I specified more than one disk, it would time out when starting.
After modifying the healthcheck timeout from 5s to 10s in the YAML, it started correctly, i.e.:
```
healthcheck:
  interval: 10s
  retries: 30
  start_period: 10s
  test: >-
    curl --silent --output /dev/null --show-error --fail
    http://127.0.0.1:8080/api/health
  timeout: 5s
```
`timeout: 5s` -> `timeout: 10s`
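For reference, the working block after that one-line change:
```
healthcheck:
  interval: 10s
  retries: 30
  start_period: 10s
  test: >-
    curl --silent --output /dev/null --show-error --fail
    http://127.0.0.1:8080/api/health
  timeout: 10s
```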
It seems that this timeout is specified in the base library, in healthcheck.py:
```
class Healthcheck:
    def __init__(self, render_instance: "Render"):
        self._render_instance = render_instance
        self._test: str | list[str] = ""
        self._interval_sec: int = 10
        self._timeout_sec: int = 5
        self._retries: int = 30
        self._start_period_sec: int = 10
        self._disabled: bool = False
```
It may be that a default timeout of 5s is too short.
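If so, maybe the fix on the library side is just to raise that default, something along these lines (only a sketch of what I mean, not an actual patch against the base library):
```
class Healthcheck:
    def __init__(self, render_instance: "Render"):
        self._render_instance = render_instance
        self._test: str | list[str] = ""
        self._interval_sec: int = 10
        self._timeout_sec: int = 10  # was 5; probing many disks can take longer
        self._retries: int = 30
        self._start_period_sec: int = 10
        self._disabled: bool = False
```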
When it fails, the app log shows:
```
2024-11-13 03:15:15.012553+00:00 s6-rc: info: service s6rc-oneshot-runner: starting
2024-11-13 03:15:15.018647+00:00 s6-rc: info: service s6rc-oneshot-runner successfully started
2024-11-13 03:15:15.018961+00:00 s6-rc: info: service fix-attrs: starting
2024-11-13 03:15:15.025921+00:00 s6-rc: info: service fix-attrs successfully started
2024-11-13 03:15:15.026223+00:00 s6-rc: info: service legacy-cont-init: starting
2024-11-13 03:15:15.032061+00:00 cont-init: info: running /etc/cont-init.d/01-timezone
2024-11-13 03:15:20.015043+00:00 s6-rc: fatal: timed out
2024-11-13 03:15:20.019007+00:00 s6-sudoc: fatal: unable to get exit status from server: Operation timed out
```
Session ID: f1d8785f-02dc-6cfc-b538-ccdb6b1b3ffd