Alert Rules
Dashboard ships with 38 built-in alert rules covering OS health, storage, networking, hardware, ZFS, security, and service health. Each rule is evaluated on every metrics push (default: every 300 seconds). When a threshold is crossed, Dashboard fires a notification to all configured channels.
You can override default thresholds per server in `/etc/glassmkr/collector.yaml` or globally in Dashboard under Settings > Alert Defaults.
Tip — read the alert's own `fix_commands` first. Every fired alert includes a `fix_commands` array in its `evidence` JSON with the most current concrete commands for that specific case. This documentation page covers the rule generally; the alert's `fix_commands` are tailored to the device or unit that tripped the rule (e.g., the right `smartctl -a /dev/sdX` with your actual disk name, the right `systemctl status <unit>` with your actual failing service). When in doubt, the alert's `fix_commands` override what this page shows.
Table of contents
- Rule categories
- Alert priorities (P1-P4)
- Alert muting
- Alert tabs
- OS rules (9)
- Storage rules (8)
- Network rules (5)
- Hardware / IPMI rules (5)
- ZFS rules (2)
- Security rules (6)
- Service Health rules (3)
- Global alert settings
Rule categories
| Category | Count | Rules |
|---|---|---|
| OS | 9 | ram_high, cpu_high, load_high, cpu_iowait_high, oom_kills, clock_drift, swap_high, ntp_not_synced, unexpected_reboot |
| Storage | 8 | disk_space_high, smart_failing, nvme_wear_high, raid_degraded, disk_latency_high, filesystem_readonly, inode_high, disk_io_errors |
| Network | 5 | interface_errors, link_speed_mismatch, interface_saturation, conntrack_exhaustion, bond_slave_down |
| Hardware / IPMI | 5 | cpu_temperature_high, ecc_errors, psu_redundancy_loss, ipmi_sel_critical, ipmi_fan_failure |
| ZFS | 2 | zfs_pool_unhealthy, zfs_scrub_errors |
| Security | 6 | ssh_root_password, no_firewall, pending_security_updates, kernel_vulnerabilities, kernel_needs_reboot, unattended_upgrades_disabled |
| Service Health | 3 | systemd_service_failed, fd_exhaustion, server_unreachable |
Alert priorities (P1-P4)
Every alert is assigned a priority level based on its severity and urgency. Priority badges appear on alert cards in the dashboard and in notification messages.
| Priority | Meaning | Examples |
|---|---|---|
| P1 | Critical, immediate action required. Data loss or service outage is imminent or occurring. | raid_degraded, smart_failing, oom_kills, ecc_errors (uncorrectable) |
| P2 | High, action needed soon. Significant degradation or risk. | disk_space_high (critical threshold), cpu_temperature_high (critical), psu_redundancy_loss |
| P3 | Medium, investigate when convenient. Performance impact or early warning. | ram_high, cpu_high, disk_latency_high, inode_high |
| P4 | Low, informational. Proactive recommendations. | pending_security_updates, unattended_upgrades_disabled, nvme_wear_high |
Alert cards in the dashboard show the priority badge (P1-P4), a one-line summary, evidence links to relevant charts, and copy-pasteable fix commands you can run on the server.
Evidence path attribution
Some rules have multiple data sources (for example, ECC errors come from either named IPMI sensors or SEL events; CPU temperature comes from hwmon or IPMI fallback). For those rules, the alert's evidence.path field records which source fired. If you script against the API and need to act differently on different paths — or if you're investigating why a rule fired the way it did — read this field. The per-rule sections below name the values each rule emits.
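For API scripting, dispatching on `evidence.path` can be sketched like this. The field name follows this page; the payload shape and the handler strings are illustrative assumptions, not a real API schema.

```python
import json

def route_by_path(alert_json: str) -> str:
    # Dispatch on evidence.path when scripting against the API.
    # The field name follows this page; the payload shape and the
    # handler strings are illustrative assumptions, not a real schema.
    alert = json.loads(alert_json)
    path = alert.get("evidence", {}).get("path")
    if path == "ipmi":
        return "inspect-bmc"       # reading came from the IPMI fallback
    if path in ("sel", "both"):
        return "inspect-sel"       # ECC count was SEL-derived
    return "inspect-hwmon"         # default: kernel-side sources

print(route_by_path('{"evidence": {"path": "ipmi"}}'))  # inspect-bmc
```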
Alert muting
You can mute specific alert rules on a per-server basis. Muted rules stop firing and stop sending notifications for that server. This is useful during maintenance windows or when a known condition is expected.
To mute a rule, go to the server detail page, open the Alerts tab, and click the mute icon next to the rule. You can also mute rules via the API or in the configuration file:
muted_rules:
  - disk_space_high   # mute during disk migration
  - cpu_iowait_high   # mute during RAID rebuild
Muted rules are re-evaluated on the next ingest cycle after unmuting. They do not fire retroactively for conditions that occurred while muted.
Alert tabs
The server detail page provides three alert tabs for filtering:
- Active: alerts currently firing. These need attention.
- Acknowledged: alerts that have been acknowledged but not yet resolved. Notifications are silenced.
- All: complete alert history including resolved alerts, filterable by date range and rule.
OS rules (9)
1. ram_high
What it means
The server's physical RAM usage has exceeded the configured threshold. This is calculated as (total - available) / total * 100, where "available" includes buffers and cache that the kernel can reclaim under pressure.
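A minimal sketch of that calculation against `/proc/meminfo` text (standard kernel field names; `MemAvailable` requires kernel 3.14+):

```python
# Minimal sketch of the documented formula. Field names are the standard
# /proc/meminfo ones; MemAvailable requires kernel 3.14+.
def ram_used_percent(meminfo_text: str) -> float:
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields[key] = int(rest.split()[0])  # values are in kB
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return (total - available) / total * 100

sample = "MemTotal: 16384000 kB\nMemAvailable: 4096000 kB"
print(ram_used_percent(sample))  # 75.0
```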
Why it matters
Sustained high memory usage leaves little headroom for traffic spikes or new processes. If RAM fills completely, the Linux OOM killer will start terminating processes, potentially taking down critical services.
What to do
- Identify the top memory consumers: `ps aux --sort=-%mem | head -20`
- Check for memory leaks in long-running processes by comparing RSS over time.
- Consider adding swap as a safety net (though swap is not a substitute for adequate RAM).
- If usage is consistently high, upgrade the server's memory or redistribute workloads.
Configuration
alerts:
ram_high:
enabled: true
threshold: 90
duration: 300 # seconds the condition must persist before firing

2. cpu_high
What it means
The aggregate CPU utilization (user + system + iowait) has exceeded the threshold for the configured duration. On servers with per-core monitoring enabled (Crucible 0.3.0+), the alert also reports which cores are saturated.
Why it matters
Sustained high CPU usage means the server is at capacity. New requests queue, response times increase, and background tasks (cron jobs, log rotation) may not complete on time. If steal time is also high, the hypervisor is overcommitting CPU resources.
What to do
- Identify CPU-heavy processes: `top -bn1 | head -20`
- Check per-core usage in Dashboard to see if the load is evenly distributed or pinned to specific cores.
- Look for runaway processes or infinite loops.
- Consider scaling horizontally or upgrading CPU resources.
Configuration
alerts:
cpu_high:
enabled: true
threshold: 90
duration: 300

3. cpu_iowait_high
What it means
The percentage of CPU time spent waiting for I/O operations to complete has exceeded the threshold. High iowait indicates that the CPU is idle because it is waiting for disk or network I/O.
Why it matters
Elevated iowait is a strong signal that storage is the bottleneck. Applications that depend on disk reads or writes will experience increased latency. This often correlates with slow database queries, sluggish log processing, or degraded RAID rebuilds.
What to do
- Identify processes generating I/O: `iotop -oP`
- Check disk latency with `iostat -x 1 5` and look at the `await` column.
- If a RAID array is rebuilding, iowait is expected and will resolve on its own.
- Consider moving heavy I/O workloads to faster storage (NVMe).
- Tune the I/O scheduler or increase the filesystem's commit interval for write-heavy workloads.
Configuration
alerts:
cpu_iowait_high:
enabled: true
threshold: 20
duration: 180

4. oom_kills
What it means
The Linux kernel's Out-of-Memory killer has terminated one or more processes since the last check. Crucible reads this from /proc/vmstat (the oom_kill counter) and from kernel log messages.
Why it matters
OOM kills mean the server ran out of memory and the kernel had to sacrifice processes to keep the system alive. The killed process may be your database, web server, or another critical service. OOM events frequently cause cascading failures.
What to do
- Check which process was killed: `dmesg | grep -i "oom-killer"`
- Review memory usage trends in Dashboard to identify the growth pattern.
- Set memory limits on containers or systemd services using `MemoryMax=` to prevent a single process from consuming all RAM.
- Add or increase swap as a safety buffer.
- If OOM kills recur, the server needs more RAM or the workload needs to be reduced.
Configuration
alerts:
oom_kills:
enabled: true
threshold: 1 # number of new OOM kills to trigger

5. load_high
What it means
The system's 5-minute load average has exceeded the threshold, which defaults to twice the number of CPU cores. A load average above the core count means processes are waiting for CPU time.
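The auto-threshold rule can be sketched as follows; a minimal illustration of the documented default, not Crucible's actual code:

```python
import os

# Sketch of the documented default: threshold 0 means "auto", i.e. twice
# the CPU core count; any other value is used as a fixed threshold.
def load_high_threshold(configured: float) -> float:
    if configured == 0:
        return 2 * os.cpu_count()
    return configured

def load_high_firing(load_5min: float, configured: float = 0) -> bool:
    return load_5min > load_high_threshold(configured)
```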
Why it matters
High load averages cause increased latency for all processes. Unlike CPU percentage, load average counts processes waiting for both CPU and I/O, so it captures bottlenecks that pure CPU metrics miss.
What to do
- Check current load and CPU count: `uptime` and `nproc`
- Identify processes in D state (uninterruptible sleep, usually I/O): `ps aux | awk '$8 ~ /D/'`
- If load is high but CPU usage is low, the bottleneck is likely disk I/O. Check with `iostat -x 1 5`.
- If load is high and CPU is also high, the server is CPU-bound. Reduce workload or add capacity.
Configuration
alerts:
load_high:
enabled: true
threshold: 0 # 0 = auto (2x core count). Set a fixed number to override.
duration: 300

6. clock_drift
What it means
The system clock has drifted more than the configured threshold from the expected time. Crucible compares the local clock against NTP reference data from timedatectl or chronyc.
Why it matters
Clock drift breaks TLS certificate validation, causes log timestamps to be unreliable, desynchronizes distributed systems (databases, consensus protocols), and can cause authentication failures with time-sensitive tokens (TOTP, Kerberos). Even small drifts compound over time if NTP is misconfigured.
What to do
- Check current drift: `timedatectl status` or `chronyc tracking`
- Verify NTP is running: `systemctl status chronyd` or `systemctl status systemd-timesyncd`
- Force a sync: `chronyc makestep` or `timedatectl set-ntp true`
- Check that NTP servers are reachable from the server's network.
Configuration
alerts:
clock_drift:
enabled: true
threshold: 500 # milliseconds

7. swap_high
What it means
Swap space usage has exceeded the configured threshold. Crucible reads swap usage from /proc/meminfo. High swap usage means the system is actively paging memory to disk.
Why it matters
Swap exists as a safety net, not as a primary memory source. When a server is actively swapping, performance degrades significantly because disk I/O is orders of magnitude slower than RAM access. Database queries slow down, application response times spike, and the system can enter a thrashing state where it spends more time swapping than doing useful work.
What to do
- Check swap usage: `free -h` and `swapon --show`
- Identify processes using swap: `for f in /proc/*/status; do awk '/VmSwap/{swap=$2} /Name/{name=$2} END{if(swap>0) print swap,name}' "$f" 2>/dev/null; done | sort -rn | head -20`
- Check if RAM is the bottleneck: review memory usage trends in Dashboard.
- If swap usage is sustained, the server likely needs more RAM or the workload needs to be reduced.
Configuration
alerts:
swap_high:
enabled: true
threshold: 80 # percentage of total swap

8. ntp_not_synced
What it means
The system's NTP synchronization is not active. Crucible checks timedatectl for "NTP synchronized: yes" and verifies that an NTP daemon (chrony, ntpd, or systemd-timesyncd) is running.
Why it matters
Without active NTP synchronization, the system clock will drift over time. Hardware clocks are imprecise and can drift seconds per day. This leads to the same issues as clock_drift but is a more fundamental problem: the server has no mechanism to correct its time at all.
What to do
- Check NTP status: `timedatectl status`
- Enable time sync: `sudo timedatectl set-ntp true`
- If using chrony: `sudo systemctl enable --now chronyd`
- If using systemd-timesyncd: `sudo systemctl enable --now systemd-timesyncd`
- Verify NTP servers are configured in `/etc/chrony.conf` or `/etc/systemd/timesyncd.conf`.
Configuration
alerts:
ntp_not_synced:
enabled: true

9. unexpected_reboot
What it means
The server's uptime has decreased since the last snapshot, indicating a reboot occurred between collection intervals. Crucible detects this by comparing the current uptime against the previous snapshot's uptime value.
Why it matters
Unexpected reboots can indicate hardware instability (kernel panics, power loss, watchdog timer expiry), firmware issues, or someone rebooting the server without coordination. Even planned reboots should be tracked for audit purposes. Repeated unexpected reboots are a strong signal of a failing component.
What to do
- Check the reboot cause: `last reboot` and `journalctl --boot=-1 -e`
- Check for kernel panics: `dmesg | grep -i panic`
- Check IPMI SEL for power events: `ipmitool sel list`
- If reboots recur, investigate hardware (PSU, memory, thermal shutdown) and check for watchdog timer kills.
Auto-resolution
`unexpected_reboot` alerts automatically resolve after 24 hours of continuous stable uptime. If a server reboots unexpectedly the alert fires; if the server then runs for 24 hours without another reboot, the alert resolves with `resolution_reason: auto_decay_stable_24h`. The original incident remains in the resolved-alerts history.
This avoids manual ack work on transient reboot events while still surfacing the original incident. If you want a different decay window for a specific server, set `unexpected_reboot_decay_hours` in the server's `config_overrides` (positive integer, hours).
Use `sudo glassmkr-crucible mark-reboot --reason "..."` before any deliberate reboot (kernel upgrade, microcode load, maintenance) to suppress the alert entirely. The agent records a single-use expected-reboot marker, and the next snapshot post-reboot fires nothing.
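The lifecycle described above (fire on uptime regression, suppress on an expected-reboot marker, auto-resolve after the decay window) can be sketched as follows; names are illustrative, not the actual Crucible/Dashboard internals:

```python
from datetime import datetime, timedelta
from typing import Optional

def evaluate_reboot(prev_uptime_s: float, curr_uptime_s: float,
                    fired_at: Optional[datetime], now: datetime,
                    expected_marker: bool = False,
                    decay_hours: int = 24) -> str:
    # Uptime going backwards between snapshots means a reboot happened.
    if curr_uptime_s < prev_uptime_s:
        # A single-use expected-reboot marker suppresses the alert.
        return "suppressed" if expected_marker else "fire"
    # Stable for the full decay window: auto-resolve
    # (resolution_reason: auto_decay_stable_24h).
    if fired_at and now - fired_at >= timedelta(hours=decay_hours):
        return "resolve"
    return "hold"

print(evaluate_reboot(90000, 300, None, datetime(2024, 1, 2)))  # fire
```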
Configuration
alerts:
unexpected_reboot:
enabled: true
# Triggers when uptime decreases between consecutive snapshots
# Per-server: unexpected_reboot_decay_hours in config_overrides (default 24)

Storage rules (8)
10. disk_space_high
What it means
A mounted filesystem has exceeded the configured disk usage threshold. Dashboard monitors all mounted filesystems except tmpfs, devtmpfs, and other virtual mounts.
Why it matters
When a filesystem fills to 100%, writes fail. This can crash databases, corrupt logs, prevent SSH logins (if /var or /tmp are full), and make the server difficult to recover remotely. The reserved blocks for root (typically 5% on ext4) provide a small buffer but are not a long-term solution.
What to do
- Find large files: `du -h --max-depth=2 /var | sort -hr | head -20`
- Clean up old logs: `journalctl --vacuum-time=7d`
- Remove old package caches: `apt clean` or `dnf clean all`
- Check for core dumps or stale temporary files in /tmp and /var/tmp.
- If the filesystem is consistently near capacity, expand the volume or move data to a larger disk.
Configuration
alerts:
disk_space_high:
enabled: true
threshold: 90
critical_threshold: 95
exclude_mounts:
- /mnt/backup # ignore specific mount points

11. smart_failing
What it means
A disk's SMART self-assessment has reported a failing status, or one or more critical SMART attributes (Reallocated Sector Count, Current Pending Sector, Offline Uncorrectable) have crossed their vendor-defined thresholds. Crucible uses smartctl to read these values. The dashboard displays the drive model name, power-on days, reallocated sector count, and temperature.
Why it matters
SMART failures are a strong predictor of imminent disk failure. A disk reporting "FAILING" can die within hours or weeks. Data loss is a real risk, especially if no RAID or backup is in place.
Read evidence.triggering_signals first
The rule fires on three independent conditions: aggregate SMART health != PASSED, `reallocated_sectors > 0`, or `pending_sectors > 0`. You may see this alert fire with `evidence.health: "PASSED"` — that means a sector-level condition tripped while the aggregate self-test still passes. The `evidence.triggering_signals[]` array names exactly which condition(s) fired with a per-signal reason string. Read that array before reacting to health alone.
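Reading the signals in a script might look like this; field names follow this page, but the exact per-signal schema is an assumption for illustration:

```python
import json

def smart_signal_reasons(evidence_json: str) -> list:
    # Read triggering_signals[] before reacting to the aggregate health
    # field. Field names follow this page; the exact per-signal schema
    # is an assumption for illustration.
    ev = json.loads(evidence_json)
    return [s.get("reason", "") for s in ev.get("triggering_signals", [])]

ev = '{"health": "PASSED", "triggering_signals": [{"reason": "pending_sectors > 0"}]}'
print(smart_signal_reasons(ev))  # ['pending_sectors > 0']
```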
What to do
- Check the SMART report: `smartctl -a /dev/sdX`
- Back up the disk immediately if backups are not current.
- If the disk is part of a RAID array, replace it as soon as possible and let the array rebuild.
- Order a replacement drive. Do not wait for the disk to fail completely.
- If you are in a data center, open a hardware ticket with your provider.
Configuration
alerts:
smart_failing:
enabled: true
# No threshold - any SMART failure triggers this alert
ignore_disks:
- /dev/sda # optionally ignore specific disks

12. nvme_wear_high
What it means
An NVMe drive's "Percentage Used" indicator (from the NVMe health log) has exceeded the threshold. This value estimates how much of the drive's rated write endurance has been consumed. A value of 100% means the drive has reached its rated endurance, though many drives continue operating beyond this point.
Why it matters
NVMe flash cells have a finite number of program/erase cycles. As wear increases, the drive's internal spare cells are consumed. Eventually the drive will transition to read-only mode or fail entirely. Planning a replacement before 100% wear avoids unexpected downtime.
What to do
- Check current wear: `smartctl -a /dev/nvme0 | grep "Percentage Used"`
- Review Data Units Written to estimate remaining lifespan based on your write rate.
- If wear is above 90%, order a replacement drive and schedule a migration.
- Reduce unnecessary writes (disable access time updates with `noatime`, move logs to a different drive).
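The lifespan estimate above is a simple linear extrapolation. A hedged sketch (the helper and its inputs are illustrative; the values come from the NVMe health log):

```python
def days_to_rated_endurance(percentage_used: float,
                            data_units_written: int,
                            units_per_day: float) -> float:
    # Linear extrapolation: assumes wear keeps scaling with data written
    # at the current daily rate. Inputs come from the NVMe health log
    # (Percentage Used, Data Units Written); the helper is illustrative.
    units_per_percent = data_units_written / percentage_used
    remaining_units = units_per_percent * (100 - percentage_used)
    return remaining_units / units_per_day

# e.g. 80% used after 800M data units, currently ~200k units/day:
print(days_to_rated_endurance(80, 800_000_000, 200_000))  # 1000.0
```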
Configuration
alerts:
nvme_wear_high:
enabled: true
threshold: 80 # percentage used

13. disk_latency_high
What it means
The average I/O latency for a block device has exceeded the threshold. Crucible measures this from /sys/block/*/stat by computing the average time per completed I/O operation over the collection interval.
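Assuming the standard kernel field layout for `/sys/block/<dev>/stat`, the average-latency computation can be sketched as follows (an illustration of the approach, not Crucible's code):

```python
def avg_io_latency_ms(prev: list, curr: list) -> float:
    # Average time per completed I/O between two reads of
    # /sys/block/<dev>/stat. Per the kernel's block-layer stat docs:
    # field 0 = reads completed, field 3 = ms spent reading,
    # field 4 = writes completed, field 7 = ms spent writing.
    # A sketch of the approach, not Crucible's actual code.
    ops = (curr[0] - prev[0]) + (curr[4] - prev[4])
    ms = (curr[3] - prev[3]) + (curr[7] - prev[7])
    return ms / ops if ops else 0.0

prev = [0] * 11
curr = [100, 0, 0, 500, 100, 0, 0, 500, 0, 0, 0]
print(avg_io_latency_ms(prev, curr))  # 5.0
```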
Why it matters
High disk latency directly impacts application performance. Database queries slow down, file operations block, and services become unresponsive. For NVMe drives, latency should typically be under 1 ms. For SATA SSDs, under 5 ms. For spinning disks, under 20 ms. Anything above 50 ms is a clear sign of trouble.
What to do
- Check per-device latency: `iostat -x 1 5` (look at `await`).
- Identify I/O-heavy processes: `iotop -oP`
- If the disk is healthy, latency may be caused by I/O saturation. Reduce concurrent I/O or upgrade to faster storage.
- Check if a RAID rebuild or filesystem check is running in the background.
- If latency is intermittent, check SMART data for signs of failing hardware.
Configuration
alerts:
disk_latency_high:
enabled: true
threshold: 50 # milliseconds
duration: 120
exclude_devices:
- loop0
- loop1

14. disk_io_errors
What it means
Kernel-level I/O errors have been reported in dmesg or syslog. These indicate hardware-level read/write failures that the drive's firmware could not recover from.
Why it matters
Kernel I/O errors are a strong signal of imminent drive failure. Unlike SMART warnings which are predictive, I/O errors mean data operations are already failing. Applications may experience silent corruption.
What to do
- Check `dmesg | grep -i "i/o error"` for the affected device.
- Run `smartctl -a /dev/sdX` for the device mentioned in the errors.
- Back up data from the affected device immediately.
- Schedule drive replacement.
Configuration
alerts:
disk_io_errors:
enabled: true
# Triggers on any kernel I/O error in the collection interval

15. filesystem_readonly
What it means
A filesystem that should be read-write has been remounted as read-only by the kernel. This typically happens when the kernel detects filesystem corruption or I/O errors and remounts the filesystem to prevent further damage.
Why it matters
A read-only filesystem means all write operations fail. Applications crash, logs stop writing, and databases become unavailable. This is usually a sign of underlying hardware failure or filesystem corruption.
What to do
- Check mount options: `mount | grep "ro,"`
- Check kernel logs for the cause: `dmesg | grep -i "remount\|error\|readonly"`
- If caused by disk errors, check SMART data and plan a replacement.
- If the filesystem is corrupted, run `fsck` from a rescue environment.
Configuration
alerts:
filesystem_readonly:
enabled: true
exclude_mounts:
- /mnt/cdrom # ignore intentionally read-only mounts

16. inode_high
What it means
A filesystem's inode usage has exceeded the threshold. Inodes track file metadata; when they run out, no new files can be created even if free space remains.
Why it matters
Inode exhaustion is a subtle failure mode. Disk usage may show plenty of free space, but the server cannot create new files. This breaks log rotation, temp file creation, and application writes. It is common on filesystems with many small files (mail spools, cache directories, container layers).
What to do
- Check inode usage: `df -i`
- Find directories with many small files: `find / -xdev -printf '%h\n' | sort | uniq -c | sort -rn | head -20`
- Clean up unnecessary small files (session files, cache entries, old mail).
- If the filesystem was created with too few inodes, it must be reformatted with a higher inode ratio.
Configuration
alerts:
inode_high:
enabled: true
threshold: 90
exclude_mounts: []

17. raid_degraded
What it means
A software RAID array (mdadm) or hardware RAID controller has reported a degraded state. This means one or more member disks have failed or been removed from the array. Crucible reads /proc/mdstat for software RAID and uses vendor tools (MegaCLI, storcli) for hardware RAID when available.
Why it matters
A degraded array has lost its redundancy. If another disk fails before the array is rebuilt, data loss is likely (or certain, depending on the RAID level). RAID 1 with one failed disk has zero redundancy. RAID 5 with one failed disk cannot survive another failure. RAID 6 with one failed disk is reduced to RAID 5 levels of protection.
What to do
- Identify the failed disk: `cat /proc/mdstat` or `mdadm --detail /dev/md0`
- Replace the failed disk as soon as possible.
- Add the replacement to the array: `mdadm --add /dev/md0 /dev/sdX`
- Monitor the rebuild progress: `watch cat /proc/mdstat`
- Avoid heavy I/O during the rebuild to speed up reconstruction.
Configuration
alerts:
raid_degraded:
enabled: true
# No threshold - any degradation triggers this alert
arrays:
- /dev/md0
- /dev/md1

Network rules (5)
18. interface_errors
What it means
A network interface is reporting errors (RX errors, TX errors, drops, or overruns) above the threshold rate. Crucible reads these counters from /sys/class/net/*/statistics/.
Why it matters
Network errors cause packet retransmissions, increased latency, and reduced throughput. Persistent errors often indicate a hardware problem: a bad cable, a failing NIC, or a misconfigured switch port. Drops can also be caused by receive buffer exhaustion under high traffic.
Drops vs errors: look at evidence.driver first
The alert's `evidence.driver` field discriminates the two cases: `driver: "errors"` means real RX/TX errors (almost always hardware: cable, SFP, NIC, switch port). `driver: "drops"` means packets dropped by the kernel before reaching userspace, which has a different and often non-hardware root cause.
Drops are often firewall rules doing their job. On an internet-facing host with UFW, iptables, or nftables, every blocked packet shows up in the `rx_dropped` counter. If your `evidence.driver` is `"drops"`, check the alert's `fix_commands` array: it ships a specific `dmesg -T | grep -i "UFW BLOCK\|DROP\|REJECT"` probe to tell you whether the drops are the firewall doing what you asked. Other drop sources: receive buffer exhaustion under high traffic (fix with larger ring buffers), VLAN mismatch, MTU mismatch.
Third case: the driver is dropping protocol-mismatch frames on purpose
If the firewall probe above shows zero blocks and you don't see RX errors either, but drops still accumulate, the NIC driver itself may be discarding frames by design — a protocol mismatch between what arrives on the wire and what the interface is configured to accept. The driver writes a one-line explanation to the kernel log. Surface it:
sudo dmesg -T | grep -iE 'drop|mismatch|tag|vlan|s-tag|qinq' | tail -20
Common patterns:
- Mellanox `mlx5_core`: a line like `S-tagged traffic will be dropped while C-tag vlan stripping is enabled` means the driver is dropping QinQ frames by design. Either turn off C-tag stripping on the interface (`sudo ethtool -K <iface> rxvlan off`) or accept the drops and raise the per-interface threshold.
- Intel `ixgbe`/`igb`: VLAN filter table mismatches show up as drops on frames whose VLAN tag isn't in the filter table. Inspect with `bridge vlan show` and `ip -d link show type vlan` for stale configs and remove unused tags.
- Broadcom `bnxt_en`: similar protocol-filtering messages; the dmesg line names the protocol field that mismatched.
If the drops really are intentional and you're not going to change the configuration (e.g., upstream sends C-tagged frames you genuinely don't want), the right action is a per-interface threshold override rather than a real fix — see Configuration below.
What to do
- Check error counters: `ip -s link show eth0`
- If `driver: "drops"`: first check whether the drops are firewall-blocked packets via the alert's `fix_commands` dmesg probe. If they are, the alert is fine: increment the threshold or disable per-interface. If they aren't, investigate buffer / MTU / VLAN.
- If `driver: "errors"`: inspect the cable and SFP modules. Reseat connections. Check switch port counters and logs for CRC errors or alignment errors.
- Increase ring buffer sizes: `ethtool -G eth0 rx 4096 tx 4096`
- If the NIC is faulty, replace it.
Configuration
alerts:
interface_errors:
enabled: true
threshold: 10 # errors per minute
exclude_interfaces:
- lo
- docker0

19. link_speed_mismatch
What it means
A network interface is operating at a lower speed than expected. For example, a 10 Gbps NIC negotiating at 1 Gbps. Crucible reads the link speed from /sys/class/net/*/speed and compares it against the configured expected speed.
Why it matters
A link speed mismatch means you are getting a fraction of the bandwidth you are paying for or that your network design requires. This is usually caused by a bad cable, a damaged SFP module, or a switch port that auto-negotiated to a lower speed.
What to do
- Check current link speed: `ethtool eth0 | grep Speed`
- Reseat the cable and SFP module.
- Try a different cable. Cat5e cables cannot support 10 Gbps; use Cat6a or fiber.
- Check the switch port configuration. Force the expected speed if auto-negotiation is failing.
- If the NIC supports multiple speeds, verify the firmware is up to date.
Configuration
alerts:
link_speed_mismatch:
enabled: true
interfaces:
eth0:
expected_speed: 10000 # Mbps
eth1:
expected_speed: 1000

20. interface_saturation
What it means
A network interface's throughput has exceeded the configured percentage of its link speed. Crucible measures bytes transmitted and received over the collection interval and compares the rate to the interface's reported link speed.
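The comparison can be sketched as follows (the counter paths in the comment are the usual sysfs locations; the helper itself is illustrative):

```python
def saturation_percent(prev_bytes: int, curr_bytes: int,
                       interval_s: float, link_speed_mbps: int) -> float:
    # Throughput over the interval as a percentage of the link speed.
    # Byte counters are cumulative, as read from e.g.
    # /sys/class/net/<iface>/statistics/rx_bytes and tx_bytes (sketch).
    bits_per_s = (curr_bytes - prev_bytes) * 8 / interval_s
    return bits_per_s / (link_speed_mbps * 1_000_000) * 100

# 6 GB transferred in 60 s on a 1 Gbps link is roughly 80% utilization:
print(round(saturation_percent(0, 6_000_000_000, 60, 1000), 1))  # 80.0
```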
Why it matters
A saturated network link causes packet queuing, increased latency, and dropped packets. Services that depend on network throughput (file servers, databases with replication, backup jobs) will degrade. Saturation at 80% is a warning because TCP throughput collapses well before reaching 100% utilization due to protocol overhead and buffering.
What to do
- Identify traffic sources: `iftop -i eth0` or `nload eth0`
- Check if a backup job or large transfer is running.
- Implement traffic shaping or QoS to prioritize critical traffic.
- Consider bonding multiple interfaces or upgrading to a faster link.
- Move bulk transfers to off-peak hours.
Configuration
alerts:
interface_saturation:
enabled: true
threshold: 80 # percentage of link speed
duration: 60
exclude_interfaces:
- lo

21. conntrack_exhaustion
What it means
The kernel's connection tracking (conntrack) table is approaching capacity. Crucible reads /proc/sys/net/netfilter/nf_conntrack_count and /proc/sys/net/netfilter/nf_conntrack_max to calculate the usage percentage.
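A sketch of the usage classification with the documented default thresholds (an illustrative helper, not Dashboard's code; inputs come from the two `/proc` files named above):

```python
def conntrack_state(count: int, maximum: int,
                    threshold: float = 80, critical: float = 95) -> str:
    # Classification with the documented defaults (80 warning / 95 critical).
    # Inputs come from nf_conntrack_count and nf_conntrack_max; sketch only.
    pct = count / maximum * 100
    if pct >= critical:
        return "critical"
    if pct >= threshold:
        return "warning"
    return "ok"

print(conntrack_state(250_000, 262_144))  # critical
```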
Why it matters
When the conntrack table fills up, the kernel drops new connections silently. This affects all stateful firewall rules (iptables, nftables) and NAT. Services appear unreachable, but the server looks healthy otherwise. This is a common failure mode on busy NAT gateways, load balancers, and servers with many short-lived connections.
What to do
- Check current usage: `cat /proc/sys/net/netfilter/nf_conntrack_count` and `cat /proc/sys/net/netfilter/nf_conntrack_max`
- Increase the limit temporarily: `sysctl -w net.netfilter.nf_conntrack_max=262144`
- Make it permanent in `/etc/sysctl.d/99-conntrack.conf`
- Reduce timeouts for idle connections: `sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30`
- If the server does not need connection tracking, consider using stateless firewall rules.
Configuration
alerts:
conntrack_exhaustion:
enabled: true
threshold: 80
critical_threshold: 95

22. bond_slave_down
What it means
A network interface that is part of a bond (e.g. bond0) has gone down. Crucible reads /sys/class/net/{iface}/operstate and detects bond membership from /proc/net/bonding/*. Requires Crucible 0.6.5 or newer.
Why it matters
Bond interfaces provide network redundancy. When one slave goes down, the bond continues working but with reduced capacity and no redundancy. A second failure would cause a full network outage. This is often caused by a failed cable, SFP transceiver, or switch port.
What to do
- Check bond status: `cat /proc/net/bonding/bond0`
- Check the slave interface: `ip link show enp1s0f0` and `ethtool enp1s0f0`
- Try bringing it back up: `sudo ip link set enp1s0f0 up`
- If the interface won't stay up, check the physical connection (cable, SFP, switch port).
- Check kernel messages: `dmesg -T | grep -i "enp1s0f0" | tail -10`
Hardware / IPMI rules (5)
23. cpu_temperature_high
What it means
The CPU package temperature has exceeded the threshold. Crucible 0.8.0+ reads from kernel hwmon first (`/sys/class/hwmon/`, via `snap.thermal.max_cpu_celsius`) and falls back to IPMI sensor data only when hwmon is unavailable. The path that fired is recorded in `evidence.path` as `"hwmon"` or `"ipmi"`. Temperatures are displayed in Celsius (e.g., 85 C).
Why hwmon-first: hwmon readings come straight from the CPU vendor's on-die digital thermal sensor and use standardised sensor naming. The previous IPMI-only path was misleading on some platforms (notably Gigabyte AMD boards with BMC firmware 12.61, where the `CPU<N>_DTS` sensor reads about 30 C hotter than the actual die). Crucible 0.9.1 also drops `CPU<N>_DTS` from the IPMI fallback when a sibling `CPU<N>_TEMP` sensor exists on the same socket.
Why it matters
CPUs throttle their clock speed when they get too hot, which reduces performance. At extreme temperatures (above Tjunction max, typically 100-105 C), the CPU will shut down to protect itself, causing an unclean server restart. Sustained high temperatures also reduce the CPU's lifespan.
What to do
- Check current temperatures: `sensors` (from the lm-sensors package).
- Verify that fans are running: `ipmitool sdr type Fan`
- Clean dust from heatsinks and fans.
- Check that the thermal paste between the CPU and heatsink is not dried out.
- If in a data center, check the room temperature and airflow. Verify hot/cold aisle separation.
- Reduce CPU load temporarily if temperatures are critical.
Configuration
alerts:
cpu_temperature_high:
enabled: true
threshold: 85 # warning threshold in Celsius
critical_threshold: 95

The `sensor:` hint is no longer needed; hwmon CPU readings are auto-classified by chip and label.
24. ecc_errors (correctable, rate-based)
What it means
The server's ECC memory has reported correctable single-bit errors. These are silently fixed by the ECC hardware but logged. The rule fires when more than the configured threshold of correctable errors is observed within a rolling time window (default 10 errors per 24 hours). The underlying counter source is the higher of two paths:
- Named-sensor path (Supermicro, ASRockRack, Gigabyte): the BMC exposes a numeric "Correctable ECC" / "Uncorrectable ECC" sensor; Crucible reports the cumulative count in `snap.ipmi.ecc_errors`.
- SEL-derived path (Dell iDRAC, some HPE iLO firmwares): the BMC reports ECC events only via the System Event Log on the Memory entity, never as named sensors. Crucible 0.8.0+ counts ECC entries in the SEL and reports them in `snap.ipmi.ecc_errors_from_sel`.
Dashboard computes new errors as the difference between the latest snapshot and the oldest snapshot inside the rate window. This avoids false alerts on long-running healthy hosts where the BMC's cumulative counter has accumulated background noise over months.
`evidence.delta_correctable` is the count over the window, `evidence.window_hours` the window length, `evidence.threshold` the configured trigger, and `evidence.path` records which underlying source was authoritative (`"named"`, `"sel"`, or `"both"`).
Counter reset detection
If the underlying counter regressed between the oldest-in-window snapshot and the current one (SEL clear, BMC reboot, host reboot zeroing named-sensor accumulators), Dashboard skips that evaluation cycle. The next snapshot resumes evaluation against a fresh oldest-in-window row. No false alerts from the reset itself.
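The windowed-delta plus reset-skip behaviour described above can be sketched in a few lines. This is an illustration, not Dashboard's actual code; `evaluate_ecc` and its argument order are invented for the example:

```shell
# Sketch of the rate-window evaluation: compare the oldest and latest
# cumulative counter values inside the window against the threshold, and
# skip the cycle entirely when the counter regressed (reset detected).
evaluate_ecc() {
    oldest=$1; latest=$2; threshold=$3
    if [ "$latest" -lt "$oldest" ]; then
        echo "skip"                          # SEL clear / BMC reboot: no false alert
    elif [ $((latest - oldest)) -gt "$threshold" ]; then
        echo "fire delta=$((latest - oldest))"
    else
        echo "ok delta=$((latest - oldest))"
    fi
}

evaluate_ecc 120 125 10   # ok delta=5    (background noise)
evaluate_ecc 120 145 10   # fire delta=25 (25 new errors in the window)
evaluate_ecc 120 3   10   # skip          (counter reset)
```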
Why it matters
Occasional correctable ECC errors are normal over long periods. A sustained rate of correctable errors on a single DIMM is a strong predictor of imminent failure. Rate-based evaluation surfaces that pattern without false-positive noise from healthy long-uptime hosts.
What to do
- Check named-sensor counts: `ipmitool sdr type Memory` or `ipmitool sensor list | grep -i ecc`.
- On Dell iDRAC or HPE iLO, inspect the SEL: `ipmitool sel list | grep -i memory`.
- If errors persist after a DIMM swap, check the slot itself.
- Run a memory test (`memtest86+`) during the next maintenance window.
Per-server overrides
```yaml
# Wider window for a chronically-noisy host (errors/week instead of /24h):
ecc_rate_window_hours: 168
# Higher threshold to suppress (e.g. allow up to 50 per day):
ecc_correctable_rate_warning: 50
```
Both fields live in the server's `config_overrides` JSON. The legacy `ecc_correctable_warning` field (pre-Phase 7 P1) is automatically migrated to `ecc_correctable_rate_warning` by migration 014 with the same numeric value; this means an old override of 100 (originally meant as a lifetime ceiling) is now read as "100 errors per 24h" — far above realistic per-day rates, so it effectively still suppresses noise.
24. ecc_errors (uncorrectable)
What it means
The server's ECC memory has reported uncorrectable multi-bit errors. These cannot be repaired by ECC and may cause data corruption or application crashes.
Why it matters
Uncorrectable errors are serious. Corrupted data was delivered to the CPU, which can cause application crashes, data corruption, or silent data damage. This DIMM should be replaced immediately.
What to do
- Identify the affected DIMM: `edac-util -v`
- Replace the DIMM immediately.
- Check application data integrity, especially database checksums.
- Run `memtest86+` to confirm the diagnosis.
Configuration
```yaml
alerts:
  ecc_errors:
    critical_on_uncorrectable: true
```

25. psu_redundancy_loss
What it means
A redundant power supply unit has failed or been disconnected. Crucible reads PSU state from IPMI only (no hwmon path). Two-tier evaluation:
- Aggregate path (Dell PowerEdge): Crucible's vendor classifier surfaces the BMC's overall redundancy status as `snap.ipmi.psu_redundancy_state` (`fully_redundant` / `redundancy_lost` / `redundancy_degraded` / `unknown`). When the value is meaningful, the rule fires from this single signal.
- Per-PSU path (Supermicro, Gigabyte, ASRockRack, others): Crucible iterates each individual PSU sensor in `snap.ipmi.sensors` and looks for fault states.
`evidence.path` records which source fired: `"aggregate-redundancy"`, `"per-psu-fault"`, `"discrete-status-ok"`, or `"all-healthy"`. In a typical 1+1 redundant configuration, the server continues running on the remaining PSU, but it has lost its power redundancy.
Why it matters
Servers with redundant PSUs are designed to survive a single PSU failure. Once one PSU is down, you are running without a safety net. If the remaining PSU fails, the server goes down immediately with no graceful shutdown.
What to do
- Check PSU status: `ipmitool sdr type "Power Supply"`
- On Dell, read the aggregate redundancy field directly: `ipmitool sdr | grep -i redundan`.
- Verify that the failed PSU is receiving power (check the outlet and PDU).
- If the PSU has a fault LED, note the error pattern.
- Replace the failed PSU. Most servers support hot-swap PSU replacement.
- If in a data center, open a hardware ticket immediately.
Configuration
```yaml
alerts:
  psu_redundancy_loss:
    enabled: true
    # No threshold — any PSU failure triggers this alert.
```

26. ipmi_fan_failure
What it means
An IPMI-monitored fan has stopped spinning or dropped below the minimum RPM threshold. Crucible reads fan RPM values from IPMI SDR records. Fan speeds are displayed with proper units (RPM).
Why it matters
Fan failure leads to rising temperatures, which cause CPU throttling, component damage, and eventually thermal shutdown. In servers with redundant fans, a single failure reduces cooling capacity and puts stress on the remaining fans.
What to do
- Check fan status: `ipmitool sdr type Fan`
- Inspect the fan for physical damage or cable disconnection.
- If the server is in a data center, open a hardware ticket for fan replacement.
- Monitor CPU temperatures closely until the fan is replaced.
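For scripting around this rule, fan readings can be pulled out of the SDR listing with a little awk. A sketch under assumptions: the sample lines mimic a typical Supermicro-style `ipmitool sdr type Fan` layout (exact columns vary by BMC), and `check_fans` is a name invented here:

```shell
# Flag fans reading below min_rpm in `ipmitool sdr type Fan`-style output.
# Field 5 holds the reading ("4700 RPM"); "$5 + 0" extracts the number.
check_fans() {
    awk -F' *\| *' -v min="$1" '
        $5 ~ /RPM/ {
            rpm = $5 + 0
            if (rpm < min) print $1 " below min_rpm (" rpm " RPM)"
        }'
}

sample='FAN1 | 41h | ok | 29.1 | 4700 RPM
FAN2 | 42h | ok | 29.2 | 300 RPM
FAN3 | 43h | ns | 29.3 | No Reading'

echo "$sample" | check_fans 500   # prints: FAN2 below min_rpm (300 RPM)
```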
Configuration
```yaml
alerts:
  ipmi_fan_failure:
    enabled: true
    min_rpm: 500  # fans below this RPM are considered failed
```

27. ipmi_sel_critical
What it means
A critical event has been logged in the IPMI System Event Log (SEL) within the rolling time window (default 30 days). This includes events like machine check exceptions, PCI-E fatal errors, and power unit failures. Crucible reads the SEL via `ipmitool sel elist`.
Why it matters
Critical SEL events indicate hardware-level problems that may not be visible through OS-level monitoring. These events are logged by the BMC independently of the operating system and can indicate problems that the OS cannot detect on its own.
Time window (default 30 days)
The rule only counts events asserted within the last `ipmi_sel_critical_window_days` days (default 30). Older events stay in evidence as `events_outside_window` for context but don't fire the alert. This prevents the rule from firing forever on a year-old transient (e.g., a power-supply AC-loss event that was paired with a deassertion the same minute but never cleared from the SEL).
Each event in `evidence.critical_events[]` carries an `age_days` field. `null` means the timestamp couldn't be parsed (older Crucible agents on certain BMC firmwares emit non-ISO date strings; the event is included anyway, fail-open). Upgrade to Crucible 0.9.2+ to normalise the emission.
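The age computation is easy to reproduce when triaging by hand. A sketch assuming GNU `date` and the `MM/DD/YYYY` date format that `ipmitool sel elist` typically prints; `age_days` and the fixed reference time are invented for the example:

```shell
# Compute an event's age in days from an `ipmitool sel elist` date/time
# pair, echoing "null" for unparseable timestamps (fail-open, as the rule
# does). A fixed "now" keeps the demo deterministic.
now=$(date -d "2025-06-01 00:00:00" +%s)
window_days=30

age_days() {
    evt=$(date -d "$1 $2" +%s 2>/dev/null) || { echo "null"; return 0; }
    echo $(( (now - evt) / 86400 ))
}

age_days 05/20/2025 10:32:11   # 11  -> inside the 30-day window
age_days 05/20/2024 10:32:11   # 376 -> kept only as events_outside_window
```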
What to do
- Read recent events: `ipmitool sel elist | tail -30` (the alert's `fix_commands` ships this exact command).
- Look at the `age_days` on each event in evidence. A cluster of recent events points at a current incident; one ancient event in a window-of-many means the rule is right but the underlying problem may already be resolved.
- If the event indicates a component failure, schedule replacement.
- Clearing the SEL erases the audit trail permanently. Only run `ipmitool sel clear` after you've recorded what was in there (the alert's `critical_events` are a snapshot — they don't go away when the SEL is cleared, but the live BMC view does). For most ops workflows, you don't need to clear; the time-window filter handles aging.
Configuration
```yaml
alerts:
  ipmi_sel_critical:
    enabled: true
    # Per-server override of the rolling window (default 30 days):
    # ipmi_sel_critical_window_days: 90
```

ZFS rules (2)
28. zfs_pool_unhealthy
What it means
A ZFS pool health status is something other than ONLINE. This includes DEGRADED (redundancy lost), FAULTED (data loss possible), and UNAVAIL (pool cannot be accessed).
Why it matters
A non-ONLINE ZFS pool means either redundancy is lost (DEGRADED) or data may already be inaccessible (FAULTED/UNAVAIL). Immediate action is required to prevent data loss.
What to do
- Check pool status: `zpool status`
- If DEGRADED: identify the failed vdev and replace the drive with `zpool replace`
- If FAULTED: attempt `zpool clear` then investigate the cause.
- Never reboot a FAULTED pool without understanding the failure first.
Configuration
```yaml
alerts:
  zfs_pool_unhealthy:
    enabled: true
    # Triggers when any zpool reports non-ONLINE state
```

29. zfs_scrub_errors
What it means
Checksum or data errors were found during ZFS scrub operations. ZFS scrubs verify every block of data against its checksum to detect silent data corruption (bit rot).
Why it matters
Scrub errors mean data on disk does not match its checksum. On redundant pools, ZFS auto-repairs from good copies. On non-redundant pools, this is data corruption. Either way, it signals failing hardware.
What to do
- Check scrub results: `zpool status -v`
- If on a mirror/raidz: ZFS auto-repaired. Identify the drive with errors and plan replacement.
- If on a single vdev: data corruption occurred. Restore affected files from backup.
- Run `smartctl -a` on the underlying device to check for hardware issues.
Configuration
```yaml
alerts:
  zfs_scrub_errors:
    enabled: true
    # Triggers when zpool scrub reports any errors
```

Security rules (6)
30. ssh_root_password
What it means
The SSH daemon is configured to allow root login with a password. Crucible checks `/etc/ssh/sshd_config` for `PermitRootLogin yes`, or for `PermitRootLogin prohibit-password` not being set.
Why it matters
Root login via password is a common attack vector. Brute-force SSH attacks target root constantly. Key-based authentication is much more secure.
What to do
- Verify key-based access works before changing anything. Locking yourself out is the failure mode this rule's fix can cause. The alert's `fix_commands` array ships an explicit probe — run it from a separate shell first:

  ```shell
  ls -la ~/.ssh/authorized_keys
  ssh -o PasswordAuthentication=no root@localhost exit 2>/dev/null && echo "Key auth working" || echo "Key auth FAILED — do NOT disable password login yet"
  ```

  The probe prints `Key auth working` when key-based auth works, and `Key auth FAILED` when it doesn't.
- Once key access is confirmed, set `PermitRootLogin prohibit-password` in `/etc/ssh/sshd_config`.
- Restart SSH from one shell while keeping another shell open: `sudo systemctl restart sshd`. If you lose the new connection but keep the old, you can roll back from the still-open session.
Configuration
```yaml
alerts:
  ssh_root_password:
    enabled: true
```

31. no_firewall
What it means
No active firewall was detected. Crucible checks for `iptables` rules, `nftables`, `ufw`, and `firewalld`. If all are empty or inactive, this alert fires.
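The detection idea can be illustrated for the ufw case; the real check also covers the other stacks, and `ufw_active` is a name invented for this sketch:

```shell
# A host counts as protected if any one firewall stack is active.
# For ufw, `ufw status` prints "Status: active" when enabled; the sketch
# classifies a status string rather than calling ufw, so it runs anywhere.
ufw_active() {
    case "$1" in
        *"Status: active"*) return 0 ;;
        *)                  return 1 ;;
    esac
}

if ufw_active "Status: active"; then
    echo "firewall detected"      # rule stays quiet
fi
if ! ufw_active "Status: inactive"; then
    echo "no active firewall"     # rule fires (if the other stacks are empty too)
fi
```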
Why it matters
A server without a firewall exposes all listening services to the internet. Even services bound to localhost can be exposed if a misconfiguration changes the bind address.
Running on a cloud provider with an external firewall?
If your server sits behind a cloud security group (AWS, GCP, Azure, Hetzner Cloud Firewall, DigitalOcean Cloud Firewalls, etc.) and you've set inbound rules there, the host-level firewall is optional. Disable this rule for the affected servers: running both a host firewall and a cloud firewall is belt-and-braces, but keep in mind that an empty ufw stack cannot paper over a misconfigured cloud security group either. See "Configuration" below for the disable snippet.
What to do (host-level firewall path)
Pick the firewall stack your distro ships:
- Debian / Ubuntu (ufw): `sudo apt install ufw && sudo ufw default deny incoming && sudo ufw allow ssh && sudo ufw enable` — note the `allow ssh` ordering: enable that BEFORE the default-deny takes effect or you may lose your current session.
- RHEL / Rocky / AlmaLinux / Fedora (firewalld): on a minimal-server install `firewalld` may not be present, so install it if missing and then enable it: `sudo dnf install -y firewalld && sudo systemctl enable --now firewalld && sudo firewall-cmd --permanent --add-service=ssh && sudo firewall-cmd --reload`. The `dnf install` is a no-op when the package is already present, so this one-liner is safe across full and minimal installs. Verify with `sudo firewall-cmd --list-all`.
- Or configure iptables/nftables directly with appropriate rules for your services.
Configuration
```yaml
alerts:
  no_firewall:
    # set to false on cloud hosts with an upstream security group:
    enabled: true
```

32. pending_security_updates
What it means
The package manager has pending security updates that have not been installed. Crucible checks apt (Debian/Ubuntu) or dnf (RHEL/Rocky/Alma) for available security patches.
Why it matters
Unpatched security vulnerabilities are one of the most common attack vectors. Security updates should be applied promptly, especially for internet-facing services.
What to do
- Review pending updates: `apt list --upgradable` or `dnf check-update --security`
- Apply security updates: `sudo apt upgrade` or `sudo dnf update --security`
- Consider enabling automatic security updates (see `unattended_upgrades_disabled` below).
Configuration
```yaml
alerts:
  pending_security_updates:
    enabled: true
```

33. kernel_vulnerabilities
What it means
The running kernel has known vulnerabilities that are mitigatable or patchable. Crucible checks /sys/devices/system/cpu/vulnerabilities/ for Spectre, Meltdown, and other CPU/kernel vulnerabilities.
Why it matters
Kernel vulnerabilities can allow privilege escalation, container escapes, or data leaks between processes. While some mitigations are applied automatically, others require a kernel update and reboot.
Read the status text first
Each unmitigated entry in the alert's `evidence.unmitigated[]` array has a `status` field that names the actual gap. The fix depends on what the status says:

- `"Vulnerable"` (no qualifier) — usually means a kernel-level mitigation isn't enabled. Update the kernel and reboot. This is the common case.
- `"Vulnerable: ... no microcode"` — the kernel has the mitigation code but the CPU is missing the microcode update it needs to apply it. Kernel upgrades alone will not fix this. You need either (a) a BIOS / UEFI firmware update from your motherboard or server vendor, which ships the microcode in firmware, or (b) a userland microcode package (`intel-microcode` on Debian/Ubuntu, `microcode_ctl` or `linux-firmware` on RHEL-family) that loads on boot before the kernel.
- `"Mitigation: ..."` — the mitigation is active. No action needed; this row shouldn't appear in `unmitigated[]` unless something looks off.
What to do
- Check vulnerability status: `grep . /sys/devices/system/cpu/vulnerabilities/*`
- If the status is plain `"Vulnerable"`: update the kernel (`sudo apt upgrade linux-image-generic` or `sudo dnf upgrade kernel`) and reboot.
- If the status mentions `"no microcode"`: install the microcode package (`sudo apt install intel-microcode` for Intel, `sudo apt install amd64-microcode` for AMD, or `sudo dnf install microcode_ctl linux-firmware` on RHEL-family) and reboot.
- Debian users on AMD: `amd64-microcode` lives in the `non-free-firmware` component, which is not enabled by default. If `apt install amd64-microcode` reports "Package 'amd64-microcode' has no installation candidate", add `non-free-firmware` to your sources first. On Debian 12+, the default sources live in `/etc/apt/sources.list.d/debian.sources` (deb822 format) — edit that file and add `non-free-firmware` to the `Components:` line for the main + security + updates entries, then `sudo apt update`. On Debian 11 / older setups with a populated `/etc/apt/sources.list`, the legacy form works:

  ```shell
  sudo sed -i 's/main$/main non-free-firmware/' /etc/apt/sources.list
  sudo apt update
  sudo apt install amd64-microcode
  ```

  Then reboot to load. If the package is already installed and the status still says `no microcode`, check your BIOS/UEFI for a firmware update from the motherboard vendor — some microcode is only delivered via BIOS.
- Debian users on Intel: Intel CPUs use `intel-microcode`. On Debian 12+ this lives in `non-free-firmware` just like `amd64-microcode`, so the deb822 step above applies. Install with `sudo apt install -y intel-microcode` then reboot. On Debian 11 / older where it was in `main`, no `non-free-firmware` step is needed.
- If this host runs VMs (Proxmox, KVM/libvirt, vSphere, etc.), the reboot will take guests with it. Drain or migrate first, or run inside a scheduled maintenance window.
- Some vulnerabilities have no fix and never will (older CPUs that are EOL for microcode). For those, mute the rule on the affected server (see Alert muting) and document the residual risk.
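The status shapes above can be triaged mechanically. A sketch run against sample lines in the `grep . /sys/devices/system/cpu/vulnerabilities/*` format; `classify` is invented for this example and is not part of Crucible:

```shell
# Map each "<file>:<status>" line to the action described above.
classify() {
    awk -F: '{
        if ($2 ~ /^Mitigation/)       verdict = "mitigated, no action"
        else if ($0 ~ /no microcode/) verdict = "needs microcode package or BIOS update"
        else                          verdict = "needs kernel update + reboot"
        sub(".*/", "", $1)            # keep just the vulnerability name
        print $1 ": " verdict
    }'
}

printf '%s\n' \
  '/sys/devices/system/cpu/vulnerabilities/meltdown:Mitigation: PTI' \
  '/sys/devices/system/cpu/vulnerabilities/spectre_v2:Vulnerable' \
  '/sys/devices/system/cpu/vulnerabilities/mmio_stale_data:Vulnerable: Clear CPU buffers attempted, no microcode' |
classify
# meltdown: mitigated, no action
# spectre_v2: needs kernel update + reboot
# mmio_stale_data: needs microcode package or BIOS update
```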
Configuration
```yaml
alerts:
  kernel_vulnerabilities:
    enabled: true
```

34. kernel_needs_reboot
What it means
A kernel update has been installed but the server is still running the old kernel. Crucible detects this by comparing the running kernel version against the installed version and by checking for /var/run/reboot-required.
Why it matters
Security patches in the new kernel are not active until the server reboots. The server remains vulnerable to patched exploits until the reboot occurs.
What to do
- Mark the reboot first so Crucible doesn't fire `unexpected_reboot` on the next boot. Run `sudo glassmkr-crucible mark-reboot --reason "kernel update"` on the box. The agent writes a single-use marker that suppresses the unexpected-reboot alert on the very next snapshot post-reboot. (If you forget, the alert fires once and you can resolve it manually; it's not destructive, just noisy.) Or run `sudo glassmkr-crucible reboot --reason "kernel update"` to do both the mark and the reboot in one step.
- Schedule a maintenance window and reboot the server.
- If this host runs VMs (Proxmox, KVM/libvirt, vSphere, etc.), the reboot will take guests with it. Drain or migrate before mark-reboot, or run inside a scheduled maintenance window.
- Verify the new kernel is running after reboot: `uname -r`
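The detection boils down to a version comparison plus the marker file. A sketch with sample version strings standing in for `uname -r` and the newest installed kernel package (illustrative, not Crucible's exact check):

```shell
# `sort -V` orders version strings naturally; if the newest of
# {running, installed} is not the running one, a reboot is pending.
running="5.15.0-91-generic"
installed="5.15.0-94-generic"

latest=$(printf '%s\n%s\n' "$running" "$installed" | sort -V | tail -n 1)
if [ "$latest" != "$running" ]; then
    echo "reboot required: running $running, newest installed $installed"
fi
# Crucible additionally treats the presence of /var/run/reboot-required
# as a pending-reboot signal.
```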
Configuration
```yaml
alerts:
  kernel_needs_reboot:
    enabled: true
```

35. unattended_upgrades_disabled
What it means
Automatic security updates are not configured. On Debian/Ubuntu, Crucible checks whether the unattended-upgrades package is installed and enabled. On RHEL-based systems, it checks for dnf-automatic.
Why it matters
Without automatic security updates, critical patches sit uninstalled until someone manually runs the update. For servers that are not actively maintained, this can leave known vulnerabilities open for weeks or months.
What to do
- Install and enable automatic updates. The install creates `/etc/apt/apt.conf.d/20auto-upgrades` with sensible defaults; the systemctl line ensures the service is enabled and running:

  ```shell
  sudo apt install unattended-upgrades
  sudo systemctl enable --now unattended-upgrades
  ```

  Only if you previously disabled auto-updates (e.g. set `Enable=0` in `50unattended-upgrades`), run `sudo dpkg-reconfigure -plow unattended-upgrades` interactively to re-enable.
- Or on RHEL: `sudo dnf install dnf-automatic && sudo systemctl enable --now dnf-automatic.timer`
- If you prefer manual updates, you can disable this rule.
Configuration
```yaml
alerts:
  unattended_upgrades_disabled:
    enabled: true
```

Service Health rules (3)
36. systemd_service_failed
What it means
One or more systemd services have entered the "failed" state. Crucible runs `systemctl list-units --state=failed` on each collection cycle and reports any units that are not running as expected.
Why it matters
Failed services may include databases, web servers, monitoring agents, or critical system daemons. A service in the failed state is not running and will not restart automatically unless configured to do so. Operators often do not notice failed services until users report problems.
Read evidence.journal_excerpts first (Crucible 0.9.2+)
The alert evidence now includes a `journal_excerpts` field (an object mapping each failed unit name to its last 5 journal lines), collected by Crucible at snapshot time. For most failures the root cause is in those 5 lines (a config error, a missing dependency, a permission issue). Read those before SSHing to the box. If the field is empty for a unit, your Crucible is pre-0.9.2 — upgrade with `sudo npm install -g @glassmkr/crucible@latest` and the field will populate on the next ingest cycle.
What to do
- Read the per-unit journal excerpt in `evidence.journal_excerpts`. The first line is usually enough to diagnose the root cause.
- If you need more context: `sudo journalctl -u <unit> --no-pager -n 50`
- Attempt a restart: `sudo systemctl restart <unit>`
- If the service fails repeatedly with a config error visible in the excerpt: fix the config and restart. See the common patterns below.
- For services you intentionally disabled, add them to the ignore list.
Common journal-excerpt patterns
Most failed-service cases fall into one of these shapes. The fix for each is small and bounded — but only if you know to look for the right line in the excerpt.
fail2ban.service: "Have not found any log file for sshd jail"
Symptom: `ERROR Failed during configuration: Have not found any log file for sshd jail` followed by `Async configuration of server failed`. Common on Debian 12+ / Ubuntu 22.04+ where sshd logs to journald only (no `/var/log/auth.log`). Fix: tell fail2ban to read from journald instead of a log file. Create `/etc/fail2ban/jail.d/sshd-systemd.local` with:

```ini
[sshd]
backend = systemd
```

Then `sudo systemctl restart fail2ban`. The journal excerpt should now show the jail starting cleanly.
NetworkManager-wait-online.service / systemd-networkd-wait-online.service: "Failed with result 'exit-code'"
These are "wait until the network is fully up" oneshots. They fail when one or more configured-but-disconnected interfaces (a second NIC without a cable, an inactive bond slave, a VLAN that won't link) make the unit hit its timeout. The host is usually fine — only the wait-online unit is failed.
Two options:
- Easy: add to the ignore list in your collector config (see Configuration below). These units are oneshots that don't affect actual networking once boot completes.
- Fix-properly: restrict the unit to interfaces you actually expect to be up. For NetworkManager: `sudo systemctl edit NetworkManager-wait-online.service` and override `ExecStart=/usr/bin/nm-online -s -q --timeout=30 --interfaces=<your-real-iface>`.
Multi-NIC RHEL-family boxes: usually it's stale NetworkManager profiles. On Rocky / Alma / RHEL with several NICs, the failure is most often that NetworkManager has connection profiles bound to interfaces that no longer exist or are currently unplugged. wait-online waits for every "auto-connect" profile to come up; one stale profile is enough to time out the whole unit.
Recipe to find and clear stale profiles:
```shell
# List active connections (the ones that did come up)
nmcli connection show --active
# List every configured connection, including inactive ones
nmcli connection show
# For each row that is bound to a non-existent or unplugged interface and
# that you do NOT need, delete it:
sudo nmcli connection delete "<connection-name>"
# Retry wait-online and confirm it's clean
sudo systemctl restart NetworkManager-wait-online
sudo systemctl status NetworkManager-wait-online --no-pager
```
If a profile is bound to an interface you genuinely need but which is currently unplugged (a redundant cable path, a NIC that comes up on demand), don't delete it — either lower its autoconnect-retries so wait-online gives up sooner (`nmcli connection modify <name> connection.autoconnect-retries 1`), or mask the wait-online unit entirely if nothing else in your boot ordering needs the dependency (`sudo systemctl mask NetworkManager-wait-online.service`).
"Address already in use" / port conflict
Symptom: excerpt contains `bind() ... failed (98: Address already in use)` or similar. Another process holds the port. Find what holds it: `sudo ss -tlnp | grep ':<port>'`. Either kill the other process or change one of their port assignments.
"Failed to start ... dependency"
Symptom: excerpt contains `Failed to start <some-unit>.service - ... Job ... failed because of unavailable resources.` or `Dependency failed for ...`. Look at the named dependency unit's own status (`sudo systemctl status <dep>`) — the real failure is one level deeper. Fix that one and the dependent unit will recover.
Configuration
```yaml
alerts:
  systemd_service_failed:
    enabled: true
    ignore_services:
      - bluetooth.service                    # ignore services that are not relevant
      - ModemManager.service
      - NetworkManager-wait-online.service   # see "Common patterns" above
```

37. fd_exhaustion
What it means
The system's file descriptor usage has exceeded the configured percentage of the maximum allowed. Crucible reads /proc/sys/fs/file-nr to get the current allocation and the system-wide limit.
Why it matters
File descriptors are used for open files, sockets, pipes, and other I/O handles. When the system runs out of file descriptors, processes cannot open new files or establish new network connections. This causes cascading failures: databases refuse connections, web servers return errors, and logging stops working.
What to do
- Check current usage: `cat /proc/sys/fs/file-nr` (allocated, unused, max)
- Find processes with many open FDs: `for pid in /proc/[0-9]*; do echo "$(ls "$pid/fd" 2>/dev/null | wc -l) $(cat "$pid/comm" 2>/dev/null)"; done | sort -rn | head -20`
- Increase the system limit temporarily: `sysctl -w fs.file-max=1048576`
- Make it permanent in `/etc/sysctl.d/99-file-max.conf`
- Check per-process limits with `cat /proc/PID/limits` and adjust with systemd `LimitNOFILE=`.
- Investigate if a process is leaking file descriptors (opening without closing).
Configuration
```yaml
alerts:
  fd_exhaustion:
    enabled: true
    threshold: 80
    critical_threshold: 95
```

38. server_unreachable
What it means
The server has stopped sending snapshots to Dashboard. Crucible is an agent-based collector; if the server goes down, the agent goes down with it and Dashboard stops receiving data. This rule runs server-side on a schedule (every 2 minutes), not as part of the snapshot evaluation.
Why it matters
A server that stops reporting may be down, rebooting, or have a crashed Crucible service. Without this rule, the only signal would be the "Last seen X minutes ago" label on the dashboard, which is easy to miss.
How it works
- Threshold: 2x the server's collection interval (default 300s, so 10 minutes).
- Scales with custom intervals: if a server pushes every 600s, the threshold is 20 minutes.
- Onboarding grace: servers younger than 10 minutes never fire this alert.
- Servers that have never sent a snapshot are not alerted on.
- Auto-resolves when the server sends its next snapshot.
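The bullet points above reduce to a couple of comparisons. A sketch of the decision; `is_unreachable` and its argument order are invented for this example:

```shell
# interval:  the server's push interval in seconds
# last_seen: seconds since the last snapshot arrived
# age:       seconds since the server was onboarded
is_unreachable() {
    interval=$1; last_seen=$2; age=$3
    threshold=$((interval * 2))
    if [ "$age" -lt 600 ]; then return 1; fi    # onboarding grace: < 10 min old
    [ "$last_seen" -gt "$threshold" ]
}

if is_unreachable 300 700 86400; then echo "fires: 700s > 600s threshold"; fi
if ! is_unreachable 600 700 86400; then echo "quiet: 600s interval -> 1200s threshold"; fi
if ! is_unreachable 300 700 120; then echo "quiet: server is 2 minutes old"; fi
```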
What to do
- Check if the server is reachable: `ping {server_ip}`
- If reachable, check Crucible: `ssh {server} sudo systemctl status glassmkr-crucible`
- Check logs: `ssh {server} sudo journalctl -u glassmkr-crucible -n 20 --no-pager`
- If not reachable, check your hosting panel for IPMI or KVM access.
Global alert settings
These settings apply to all alert rules and can be set in the configuration file or the dashboard:
```yaml
alerts:
  global:
    cooldown: 3600         # seconds between repeated notifications for the same alert
    resolve_notify: true   # send a notification when an alert resolves
    channels:
      - telegram
      - email
```