Alert Rules

Dashboard ships with 38 built-in alert rules covering OS health, storage, networking, hardware, ZFS, security, and service health. Each rule is evaluated on every metrics push (default: every 300 seconds / 5 minutes). When a threshold is crossed, Dashboard fires a notification to all configured channels.

You can override default thresholds per server in /etc/glassmkr/collector.yaml or globally in Dashboard under Settings > Alert Defaults.

Tip — read the alert's own fix_commands first. Every fired alert includes a fix_commands array in its evidence JSON with the most current concrete commands for that specific case. This documentation page covers the rule generally; the alert's fix_commands are tailored to the device or unit that tripped the rule (e.g., the right smartctl -a /dev/sdX with your actual disk name, the right systemctl status <unit> with your actual failing service). When in doubt, the alert's fix_commands override what this page shows.
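
For orientation, here is roughly what a fired alert's payload looks like. Treat the layout as an illustrative sketch rather than a schema reference; the per-rule sections below name the evidence fields each rule actually emits, and everything else here (summary wording, the signal/reason keys) is assumed for the example:

{
  "rule": "smart_failing",
  "priority": "P1",
  "summary": "SMART: 12 reallocated sectors on /dev/sdb",
  "evidence": {
    "health": "PASSED",
    "triggering_signals": [
      { "signal": "reallocated_sectors", "reason": "reallocated_sectors=12 > 0" }
    ]
  },
  "fix_commands": [
    "smartctl -a /dev/sdb",
    "cat /proc/mdstat"
  ]
}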

Rule categories

Category        | Count | Rules
OS              | 9     | ram_high, cpu_high, load_high, cpu_iowait_high, oom_kills, clock_drift, swap_high, ntp_not_synced, unexpected_reboot
Storage         | 8     | disk_space_high, smart_failing, nvme_wear_high, raid_degraded, disk_latency_high, filesystem_readonly, inode_high, disk_io_errors
Network         | 5     | interface_errors, link_speed_mismatch, interface_saturation, conntrack_exhaustion, bond_slave_down
Hardware / IPMI | 5     | cpu_temperature_high, ecc_errors, psu_redundancy_loss, ipmi_sel_critical, ipmi_fan_failure
ZFS             | 2     | zfs_pool_unhealthy, zfs_scrub_errors
Security        | 6     | ssh_root_password, no_firewall, pending_security_updates, kernel_vulnerabilities, kernel_needs_reboot, unattended_upgrades_disabled
Service Health  | 3     | systemd_service_failed, fd_exhaustion, server_unreachable

Alert priorities (P1-P4)

Every alert is assigned a priority level based on its severity and urgency. Priority badges appear on alert cards in the dashboard and in notification messages.

Priority | Meaning                                                                                     | Examples
P1       | Critical, immediate action required. Data loss or service outage is imminent or occurring. | raid_degraded, smart_failing, oom_kills, ecc_errors (uncorrectable)
P2       | High, action needed soon. Significant degradation or risk.                                  | disk_space_high (critical threshold), cpu_temperature_high (critical), psu_redundancy_loss
P3       | Medium, investigate when convenient. Performance impact or early warning.                   | ram_high, cpu_high, disk_latency_high, inode_high
P4       | Low, informational. Proactive recommendations.                                              | pending_security_updates, unattended_upgrades_disabled, nvme_wear_high

Alert cards in the dashboard show the priority badge (P1-P4), a one-line summary, evidence links to relevant charts, and copy-pasteable fix commands you can run on the server.

Evidence path attribution

Some rules have multiple data sources (for example, ECC errors come from either named IPMI sensors or SEL events; CPU temperature comes from hwmon or IPMI fallback). For those rules, the alert's evidence.path field records which source fired. If you script against the API and need to act differently on different paths — or if you're investigating why a rule fired the way it did — read this field. The per-rule sections below name the values each rule emits.
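
For example, a minimal shell sketch that branches on the path (the endpoint URL, auth header, and response shape here are assumptions for illustration; consult your deployment's API reference for the real ones):

# Hypothetical endpoint and response shape; adjust to your deployment.
curl -s -H "Authorization: Bearer $API_TOKEN" \
  "https://dashboard.example.com/api/v1/alerts?rule=ecc_errors&status=active" \
| jq -r '.alerts[] | [.server, .evidence.path] | @tsv' \
| while IFS=$'\t' read -r server path; do
    case "$path" in
      sel)        echo "$server: ECC counted from SEL entries" ;;
      named|both) echo "$server: ECC from named BMC sensors" ;;
    esac
  done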

Alert muting

You can mute specific alert rules on a per-server basis. Muted rules stop firing and stop sending notifications for that server. This is useful during maintenance windows or when a known condition is expected.

To mute a rule, go to the server detail page, open the Alerts tab, and click the mute icon next to the rule. You can also mute rules via the API or in the configuration file:

muted_rules:
  - disk_space_high    # mute during disk migration
  - cpu_iowait_high    # mute during RAID rebuild

Muted rules are re-evaluated on the next ingest cycle after unmuting. They do not fire retroactively for conditions that occurred while muted.

Alert tabs

The server detail page provides three alert tabs for filtering:

  • Active: alerts currently firing. These need attention.
  • Acknowledged: alerts that have been acknowledged but not yet resolved. Notifications are silenced.
  • All: complete alert history including resolved alerts, filterable by date range and rule.

OS rules (9)

1. ram_high

Category: OS | Severity: Warning | Default threshold: 90%

What it means

The server's physical RAM usage has exceeded the configured threshold. This is calculated as (total - available) / total * 100, where "available" includes buffers and cache that the kernel can reclaim under pressure.
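
You can reproduce the agent's figure straight from /proc/meminfo as a sanity check (MemTotal and MemAvailable are in kB):

awk '/^MemTotal/ {t=$2} /^MemAvailable/ {a=$2} END {printf "RAM used: %.1f%%\n", (t-a)/t*100}' /proc/meminfo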

Why it matters

Sustained high memory usage leaves little headroom for traffic spikes or new processes. If RAM fills completely, the Linux OOM killer will start terminating processes, potentially taking down critical services.

What to do

  • Identify the top memory consumers: ps aux --sort=-%mem | head -20
  • Check for memory leaks in long-running processes by comparing RSS over time.
  • Consider adding swap as a safety net (though swap is not a substitute for adequate RAM).
  • If usage is consistently high, upgrade the server's memory or redistribute workloads.

Configuration

alerts:
  ram_high:
    enabled: true
    threshold: 90
    duration: 300  # seconds the condition must persist before firing

2. cpu_high

Category: OS | Severity: Warning | Default threshold: 90%

What it means

The aggregate CPU utilization (user + system + iowait) has exceeded the threshold for the configured duration. On servers with per-core monitoring enabled (Crucible 0.3.0+), the alert also reports which cores are saturated.

Why it matters

Sustained high CPU usage means the server is at capacity. New requests queue, response times increase, and background tasks (cron jobs, log rotation) may not complete on time. If steal time is also high, the hypervisor is overcommitting CPU resources.

What to do

  • Identify CPU-heavy processes: top -bn1 | head -20
  • Check per-core usage in Dashboard to see if the load is evenly distributed or pinned to specific cores.
  • Look for runaway processes or infinite loops.
  • Consider scaling horizontally or upgrading CPU resources.

Configuration

alerts:
  cpu_high:
    enabled: true
    threshold: 90
    duration: 300

3. cpu_iowait_high

Category: OS | Severity: Warning | Default threshold: 20%

What it means

The percentage of CPU time spent waiting for I/O operations to complete has exceeded the threshold. High iowait indicates that the CPU is idle because it is waiting for disk or network I/O.

Why it matters

Elevated iowait is a strong signal that storage is the bottleneck. Applications that depend on disk reads or writes will experience increased latency. This often correlates with slow database queries, sluggish log processing, or degraded RAID rebuilds.

What to do

  • Identify processes generating I/O: iotop -oP
  • Check disk latency with iostat -x 1 5 and look at the await column.
  • If a RAID array is rebuilding, iowait is expected and will resolve on its own.
  • Consider moving heavy I/O workloads to faster storage (NVMe).
  • Tune the I/O scheduler or increase the filesystem's commit interval for write-heavy workloads.

Configuration

alerts:
  cpu_iowait_high:
    enabled: true
    threshold: 20
    duration: 180

4. oom_kills

Category: OS | Severity: Critical | Default threshold: 1 (any OOM kill)

What it means

The Linux kernel's Out-of-Memory killer has terminated one or more processes since the last check. Crucible reads this from /proc/vmstat (the oom_kill counter) and from kernel log messages.

Why it matters

OOM kills mean the server ran out of memory and the kernel had to sacrifice processes to keep the system alive. The killed process may be your database, web server, or another critical service. OOM events frequently cause cascading failures.

What to do

  • Check which process was killed: dmesg | grep -i "oom-killer"
  • Review memory usage trends in Dashboard to identify the growth pattern.
  • Set memory limits on containers or systemd services using MemoryMax= to prevent a single process from consuming all RAM.
  • Add or increase swap as a safety buffer.
  • If OOM kills recur, the server needs more RAM or the workload needs to be reduced.

Configuration

alerts:
  oom_kills:
    enabled: true
    threshold: 1  # number of new OOM kills to trigger

5. load_high

Category: OS | Severity: Warning | Default threshold: 2x CPU core count

What it means

The system's 5-minute load average has exceeded the threshold, which defaults to twice the number of CPU cores. A load average above the core count means processes are waiting for CPU time.
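
To check the same condition by hand, compare the 5-minute figure (field 2 of /proc/loadavg) against twice the core count:

awk -v c="$(nproc)" '{printf "5-min load %.2f vs threshold %d (%s)\n", $2, 2*c, (($2 > 2*c) ? "over" : "ok")}' /proc/loadavg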

Why it matters

High load averages cause increased latency for all processes. Unlike CPU percentage, load average counts processes waiting for both CPU and I/O, so it captures bottlenecks that pure CPU metrics miss.

What to do

  • Check current load and CPU count: uptime and nproc
  • Identify processes in D state (uninterruptible sleep, usually I/O): ps aux | awk '$8 ~ /D/'
  • If load is high but CPU usage is low, the bottleneck is likely disk I/O. Check with iostat -x 1 5.
  • If load is high and CPU is also high, the server is CPU-bound. Reduce workload or add capacity.

Configuration

alerts:
  load_high:
    enabled: true
    threshold: 0  # 0 = auto (2x core count). Set a fixed number to override.
    duration: 300

6. clock_drift

Category: OS | Severity: Warning | Default threshold: 500 ms

What it means

The system clock has drifted more than the configured threshold from the expected time. Crucible compares the local clock against NTP reference data from timedatectl or chronyc.

Why it matters

Clock drift breaks TLS certificate validation, causes log timestamps to be unreliable, desynchronizes distributed systems (databases, consensus protocols), and can cause authentication failures with time-sensitive tokens (TOTP, Kerberos). Even small drifts compound over time if NTP is misconfigured.

What to do

  • Check current drift: timedatectl status or chronyc tracking
  • Verify NTP is running: systemctl status chronyd or systemctl status systemd-timesyncd
  • Force a sync: chronyc makestep or timedatectl set-ntp true
  • Check that NTP servers are reachable from the server's network.

Configuration

alerts:
  clock_drift:
    enabled: true
    threshold: 500  # milliseconds

7. swap_high

Category: OS | Severity: Warning | Default threshold: 80%

What it means

Swap space usage has exceeded the configured threshold. Crucible reads swap usage from /proc/meminfo. High swap usage means the system has been under enough memory pressure to push pages out to disk, and under continued pressure it will be actively paging.
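
The percentage matches what you can compute from /proc/meminfo by hand:

awk '/^SwapTotal/ {t=$2} /^SwapFree/ {f=$2} END {if (t > 0) printf "swap used: %.1f%%\n", (t-f)/t*100; else print "no swap configured"}' /proc/meminfo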

Why it matters

Swap exists as a safety net, not as a primary memory source. When a server is actively swapping, performance degrades significantly because disk I/O is orders of magnitude slower than RAM access. Database queries slow down, application response times spike, and the system can enter a thrashing state where it spends more time swapping than doing useful work.

What to do

  • Check swap usage: free -h and swapon --show
  • Identify processes using swap: for f in /proc/*/status; do awk '/VmSwap/{swap=$2} /Name/{name=$2} END{if(swap>0) print swap,name}' "$f" 2>/dev/null; done | sort -rn | head -20
  • Check if RAM is the bottleneck: review memory usage trends in Dashboard.
  • If swap usage is sustained, the server likely needs more RAM or the workload needs to be reduced.

Configuration

alerts:
  swap_high:
    enabled: true
    threshold: 80  # percentage of total swap

8. ntp_not_synced

Category: OS | Severity: Warning | Default: NTP synchronization not active

What it means

The system's NTP synchronization is not active. Crucible checks timedatectl for "NTP synchronized: yes" and verifies that an NTP daemon (chrony, ntpd, or systemd-timesyncd) is running.

Why it matters

Without active NTP synchronization, the system clock will drift over time. Hardware clocks are imprecise and can drift seconds per day. This leads to the same issues as clock_drift but is a more fundamental problem: the server has no mechanism to correct its time at all.

What to do

  • Check NTP status: timedatectl status
  • Enable time sync: sudo timedatectl set-ntp true
  • If using chrony: sudo systemctl enable --now chronyd
  • If using systemd-timesyncd: sudo systemctl enable --now systemd-timesyncd
  • Verify NTP servers are configured in /etc/chrony.conf or /etc/systemd/timesyncd.conf.

Configuration

alerts:
  ntp_not_synced:
    enabled: true

9. unexpected_reboot

Category: OS | Severity: Warning | Default: uptime decreased between snapshots

What it means

The server's uptime has decreased since the last snapshot, indicating a reboot occurred between collection intervals. Crucible detects this by comparing the current uptime against the previous snapshot's uptime value.

Why it matters

Unexpected reboots can indicate hardware instability (kernel panics, power loss, watchdog timer expiry), firmware issues, or someone rebooting the server without coordination. Even planned reboots should be tracked for audit purposes. Repeated unexpected reboots are a strong signal of a failing component.

What to do

  • Check the reboot cause: last reboot and journalctl --boot=-1 -e
  • Check for kernel panics: dmesg | grep -i panic
  • Check IPMI SEL for power events: ipmitool sel list
  • If reboots recur, investigate hardware (PSU, memory, thermal shutdown) and check for watchdog timer kills.

Auto-resolution

unexpected_reboot alerts automatically resolve after 24 hours of continuous stable uptime. If a server reboots unexpectedly the alert fires; if the server then runs for 24 hours without another reboot, the alert resolves with resolution_reason: auto_decay_stable_24h. The original incident remains in the resolved-alerts history.

This avoids manual ack work on transient reboot events while still surfacing the original incident. If you want a different decay window for a specific server, set unexpected_reboot_decay_hours in the server's config_overrides (positive integer, hours).

Use sudo glassmkr-crucible mark-reboot --reason "..." before any deliberate reboot (kernel upgrade, microcode load, maintenance) to suppress the alert entirely. The agent records a single-use expected-reboot marker, and the next snapshot post-reboot fires nothing.

Configuration

alerts:
  unexpected_reboot:
    enabled: true
    # Triggers when uptime decreases between consecutive snapshots
    # Per-server: unexpected_reboot_decay_hours in config_overrides (default 24)

Storage rules (8)

10. disk_space_high

Category: Storage | Severity: Warning (90%), Critical (95%) | Default threshold: 90%

What it means

A mounted filesystem has exceeded the configured disk usage threshold. Dashboard monitors all mounted filesystems except tmpfs, devtmpfs, and other virtual mounts.

Why it matters

When a filesystem fills to 100%, writes fail. This can crash databases, corrupt logs, prevent SSH logins (if /var or /tmp are full), and make the server difficult to recover remotely. The reserved blocks for root (typically 5% on ext4) provide a small buffer but are not a long-term solution.

What to do

  • Find large files: du -h --max-depth=2 /var | sort -hr | head -20
  • Clean up old logs: journalctl --vacuum-time=7d
  • Remove old package caches: apt clean or dnf clean all
  • Check for core dumps or stale temporary files in /tmp and /var/tmp.
  • If the filesystem is consistently near capacity, expand the volume or move data to a larger disk.

Configuration

alerts:
  disk_space_high:
    enabled: true
    threshold: 90
    critical_threshold: 95
    exclude_mounts:
      - /mnt/backup  # ignore specific mount points

11. smart_failing

Category: Storage | Severity: Critical | Default threshold: any SMART failure

What it means

A disk's SMART self-assessment has reported a failing status, or one or more critical SMART attributes (Reallocated Sector Count, Current Pending Sector, Offline Uncorrectable) have crossed their vendor-defined thresholds. Crucible uses smartctl to read these values. The dashboard displays the drive model name, power-on days, reallocated sector count, and temperature.

Why it matters

SMART failures are a strong predictor of imminent disk failure. A disk reporting "FAILING" can die within hours or weeks. Data loss is a real risk, especially if no RAID or backup is in place.

Read evidence.triggering_signals first

The rule fires on three independent conditions: aggregate SMART health != PASSED, reallocated_sectors > 0, or pending_sectors > 0. You may see this alert fire with evidence.health: "PASSED" — that means a sector-level condition tripped while the aggregate self-test still passes. The evidence.triggering_signals[] array names exactly which condition(s) fired with a per-signal reason string. Read that array before reacting to health alone.
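
If you handle these alerts in scripts, read the array before branching on health. A sketch, assuming the alert JSON is saved as alert.json (the signal and reason key names follow the payload sketch at the top of this page and may differ in your version):

jq -r '.evidence.triggering_signals[] | "\(.signal): \(.reason)"' alert.json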

What to do

  • Check the SMART report: smartctl -a /dev/sdX
  • Back up the disk immediately if backups are not current.
  • If the disk is part of a RAID array, replace it as soon as possible and let the array rebuild.
  • Order a replacement drive. Do not wait for the disk to fail completely.
  • If you are in a data center, open a hardware ticket with your provider.

Configuration

alerts:
  smart_failing:
    enabled: true
    # No threshold - any SMART failure triggers this alert
    ignore_disks:
      - /dev/sda  # optionally ignore specific disks

12. nvme_wear_high

Category: Storage | Severity: Warning | Default threshold: 80% (percentage used)

What it means

An NVMe drive's "Percentage Used" indicator (from the NVMe health log) has exceeded the threshold. This value estimates how much of the drive's rated write endurance has been consumed. A value of 100% means the drive has reached its rated endurance, though many drives continue operating beyond this point.

Why it matters

NVMe flash cells have a finite number of program/erase cycles. As wear increases, the drive's internal spare cells are consumed. Eventually the drive will transition to read-only mode or fail entirely. Planning a replacement before 100% wear avoids unexpected downtime.

What to do

  • Check current wear: smartctl -a /dev/nvme0 | grep "Percentage Used"
  • Review Data Units Written to estimate remaining lifespan based on your write rate.
  • If wear is above 90%, order a replacement drive and schedule a migration.
  • Reduce unnecessary writes (disable access time updates with noatime, move logs to a different drive).

Configuration

alerts:
  nvme_wear_high:
    enabled: true
    threshold: 80  # percentage used

13. disk_latency_high

Category: Storage | Severity: Warning | Default threshold: 50 ms (average)

What it means

The average I/O latency for a block device has exceeded the threshold. Crucible measures this from /sys/block/*/stat by computing the average time per completed I/O operation over the collection interval.
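
You can approximate the measurement with two samples of the stat file. In /sys/block/<dev>/stat, fields 1 and 5 are completed reads/writes and fields 4 and 8 are the milliseconds spent on them. A sketch, not the agent's exact code:

dev=sda   # substitute the device from the alert
read -r r1 _ _ rt1 w1 _ _ wt1 _ < "/sys/block/$dev/stat"
sleep 5
read -r r2 _ _ rt2 w2 _ _ wt2 _ < "/sys/block/$dev/stat"
ios=$(( (r2 - r1) + (w2 - w1) ))
ticks=$(( (rt2 - rt1) + (wt2 - wt1) ))
if [ "$ios" -gt 0 ]; then echo "avg latency: $(( ticks / ios )) ms over $ios I/Os"; else echo "no I/O completed"; fi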

Why it matters

High disk latency directly impacts application performance. Database queries slow down, file operations block, and services become unresponsive. For NVMe drives, latency should typically be under 1 ms. For SATA SSDs, under 5 ms. For spinning disks, under 20 ms. Anything above 50 ms is a clear sign of trouble.

What to do

  • Check per-device latency: iostat -x 1 5 (look at await).
  • Identify I/O-heavy processes: iotop -oP
  • If the disk is healthy, latency may be caused by I/O saturation. Reduce concurrent I/O or upgrade to faster storage.
  • Check if a RAID rebuild or filesystem check is running in the background.
  • If latency is intermittent, check SMART data for signs of failing hardware.

Configuration

alerts:
  disk_latency_high:
    enabled: true
    threshold: 50  # milliseconds
    duration: 120
    exclude_devices:
      - loop0
      - loop1

14. disk_io_errors

Category: Storage | Severity: Critical | Default threshold: any kernel I/O errors

What it means

Kernel-level I/O errors have been reported in dmesg or syslog. These indicate hardware-level read/write failures that the drive's firmware could not recover from.

Why it matters

Kernel I/O errors are a strong signal of imminent drive failure. Unlike SMART warnings which are predictive, I/O errors mean data operations are already failing. Applications may experience silent corruption.

What to do

  • Check dmesg | grep -i "i/o error" for the affected device.
  • Run smartctl -a /dev/sdX for the device mentioned in the errors.
  • Back up data from the affected device immediately.
  • Schedule drive replacement.

Configuration

alerts:
  disk_io_errors:
    enabled: true
    # Triggers on any kernel I/O error in the collection interval

15. filesystem_readonly

Category: Storage | Severity: Critical | Default threshold: any read-only remount

What it means

A filesystem that should be read-write has been remounted as read-only by the kernel. This typically happens when the kernel detects filesystem corruption or I/O errors and remounts the filesystem to prevent further damage.

Why it matters

A read-only filesystem means all write operations fail. Applications crash, logs stop writing, and databases become unavailable. This is usually a sign of underlying hardware failure or filesystem corruption.

What to do

  • Check which filesystems are currently mounted read-only: findmnt -O ro
  • Check kernel logs for the cause: dmesg | grep -i "remount\|error\|readonly"
  • If caused by disk errors, check SMART data and plan a replacement.
  • If the filesystem is corrupted, run fsck from a rescue environment.

Configuration

alerts:
  filesystem_readonly:
    enabled: true
    exclude_mounts:
      - /mnt/cdrom  # ignore intentionally read-only mounts

16. inode_high

Category: Storage | Severity: Warning | Default threshold: 90%

What it means

A filesystem's inode usage has exceeded the threshold. Inodes track file metadata; when they run out, no new files can be created even if free space remains.

Why it matters

Inode exhaustion is a subtle failure mode. Disk usage may show plenty of free space, but the server cannot create new files. This breaks log rotation, temp file creation, and application writes. It is common on filesystems with many small files (mail spools, cache directories, container layers).

What to do

  • Check inode usage: df -i
  • Find directories with many small files: find / -xdev -printf '%h\n' | sort | uniq -c | sort -rn | head -20
  • Clean up unnecessary small files (session files, cache entries, old mail).
  • If the filesystem was created with too few inodes, it must be reformatted with a higher inode ratio.

Configuration

alerts:
  inode_high:
    enabled: true
    threshold: 90
    exclude_mounts: []

17. raid_degraded

Category: Storage | Severity: Critical | Default threshold: any degradation

What it means

A software RAID array (mdadm) or hardware RAID controller has reported a degraded state. This means one or more member disks have failed or been removed from the array. Crucible reads /proc/mdstat for software RAID and uses vendor tools (MegaCLI, storcli) for hardware RAID when available.

Why it matters

A degraded array has lost its redundancy. If another disk fails before the array is rebuilt, data loss is likely (or certain, depending on the RAID level). RAID 1 with one failed disk has zero redundancy. RAID 5 with one failed disk cannot survive another failure. RAID 6 with one failed disk is reduced to RAID 5 levels of protection.

What to do

  • Identify the failed disk: cat /proc/mdstat or mdadm --detail /dev/md0
  • Replace the failed disk as soon as possible.
  • Add the replacement to the array: mdadm --add /dev/md0 /dev/sdX
  • Monitor the rebuild progress: watch cat /proc/mdstat
  • Avoid heavy I/O during the rebuild to speed up reconstruction.

Configuration

alerts:
  raid_degraded:
    enabled: true
    # No threshold - any degradation triggers this alert
    arrays:
      - /dev/md0
      - /dev/md1

Network rules (5)

18. interface_errors

Category: Network | Severity: Warning | Default threshold: 10 errors/minute

What it means

A network interface is reporting errors (RX errors, TX errors, drops, or overruns) above the threshold rate. Crucible reads these counters from /sys/class/net/*/statistics/.
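
To inspect the raw counters the rule rates against (substitute your interface for eth0):

for f in /sys/class/net/eth0/statistics/{rx_errors,tx_errors,rx_dropped,tx_dropped}; do
  printf '%s: %s\n' "${f##*/}" "$(cat "$f")"
done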

Why it matters

Network errors cause packet retransmissions, increased latency, and reduced throughput. Persistent errors often indicate a hardware problem: a bad cable, a failing NIC, or a misconfigured switch port. Drops can also be caused by receive buffer exhaustion under high traffic.

Drops vs errors: look at evidence.driver first

The alert's evidence.driver field discriminates the two cases: driver: "errors" means real RX/TX errors (almost always hardware — cable, SFP, NIC, switch port). driver: "drops" means packets dropped by the kernel before reaching userspace, which has a different and often non-hardware root cause.

Drops are often firewall rules doing their job. On an internet-facing host with UFW, iptables, or nftables, every blocked packet shows up in the rx_dropped counter. If your evidence.driver is "drops", check the alert's fix_commands array — it ships a specific dmesg -T | grep -i "UFW BLOCK\|DROP\|REJECT" probe to tell you whether the drops are the firewall doing what you asked. Other drop sources: receive buffer exhaustion under high traffic (fix with larger ring buffers), VLAN mismatch, MTU mismatch.

Third case: the driver is dropping protocol-mismatch frames on purpose

If the firewall probe above shows zero blocks and you don't see RX errors either, but drops still accumulate, the NIC driver itself may be discarding frames by design — a protocol mismatch between what arrives on the wire and what the interface is configured to accept. The driver writes a one-line explanation to the kernel log. Surface it:

sudo dmesg -T | grep -iE 'drop|mismatch|tag|vlan|s-tag|qinq' | tail -20

Common patterns:

  • Mellanox mlx5_core: a line like S-tagged traffic will be dropped while C-tag vlan stripping is enabled means the driver is dropping QinQ frames by design. Either turn off C-tag stripping on the interface (sudo ethtool -K <iface> rxvlan off) or accept the drops and raise the per-interface threshold.
  • Intel ixgbe / igb: VLAN filter table mismatches show up as drops on frames whose VLAN tag isn't in the filter table. Inspect with bridge vlan show and ip -d link show type vlan for stale configs and remove unused tags.
  • Broadcom bnxt_en: similar protocol-filtering messages; the dmesg line names the protocol field that mismatched.

If the drops really are intentional and you're not going to change the configuration (e.g., upstream sends C-tagged frames you genuinely don't want), the right action is a per-interface threshold override rather than a real fix — see Configuration below.

What to do

  • Check error counters: ip -s link show eth0
  • If driver: "drops": first check whether the drops are firewall-blocked packets via the alert's fix_commands dmesg probe. If they are, the alert is fine — increment the threshold or disable per-interface. If they aren't, investigate buffer / MTU / VLAN.
  • If driver: "errors": inspect the cable and SFP modules. Reseat connections. Check switch port counters and logs for CRC errors or alignment errors.
  • Increase ring buffer sizes: ethtool -G eth0 rx 4096 tx 4096
  • If the NIC is faulty, replace it.

Configuration

alerts:
  interface_errors:
    enabled: true
    threshold: 10  # errors per minute
    exclude_interfaces:
      - lo
      - docker0

19. interface_saturation

Category: Network | Severity: Warning | Default threshold: 80%

What it means

A network interface's throughput has exceeded the configured percentage of its link speed. Crucible measures bytes transmitted and received over the collection interval and compares the rate to the interface's reported link speed.
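
The same arithmetic by hand, sampling byte counters against the kernel's reported link speed. A sketch: virtual interfaces report a speed of -1, so the division below assumes a physical NIC:

iface=eth0                                      # substitute your interface
speed=$(cat "/sys/class/net/$iface/speed")      # link speed in Mbit/s
rx1=$(cat "/sys/class/net/$iface/statistics/rx_bytes")
tx1=$(cat "/sys/class/net/$iface/statistics/tx_bytes")
sleep 10
rx2=$(cat "/sys/class/net/$iface/statistics/rx_bytes")
tx2=$(cat "/sys/class/net/$iface/statistics/tx_bytes")
mbit=$(( ((rx2 - rx1) + (tx2 - tx1)) * 8 / 10 / 1000000 ))
echo "$iface: ${mbit} Mbit/s = $(( mbit * 100 / speed ))% of ${speed} Mbit/s link"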

Why it matters

A saturated network link causes packet queuing, increased latency, and dropped packets. Services that depend on network throughput (file servers, databases with replication, backup jobs) will degrade. The warning fires at 80% because queuing delay and loss climb sharply well before a link reaches 100% utilization.

What to do

  • Identify traffic sources: iftop -i eth0 or nload eth0
  • Check if a backup job or large transfer is running.
  • Implement traffic shaping or QoS to prioritize critical traffic.
  • Consider bonding multiple interfaces or upgrading to a faster link.
  • Move bulk transfers to off-peak hours.

Configuration

alerts:
  interface_saturation:
    enabled: true
    threshold: 80  # percentage of link speed
    duration: 60
    exclude_interfaces:
      - lo

20. conntrack_exhaustion

Category: Network | Severity: Warning (80%), Critical (95%) | Default threshold: 80%

What it means

The kernel's connection tracking (conntrack) table is approaching capacity. Crucible reads /proc/sys/net/netfilter/nf_conntrack_count and /proc/sys/net/netfilter/nf_conntrack_max to calculate the usage percentage.
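
The percentage the rule computes, as a one-liner:

awk -v max="$(cat /proc/sys/net/netfilter/nf_conntrack_max)" '{printf "conntrack: %d / %d (%.1f%%)\n", $1, max, $1 / max * 100}' /proc/sys/net/netfilter/nf_conntrack_count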

Why it matters

When the conntrack table fills up, the kernel drops new connections silently. This affects all stateful firewall rules (iptables, nftables) and NAT. Services appear unreachable, but the server looks healthy otherwise. This is a common failure mode on busy NAT gateways, load balancers, and servers with many short-lived connections.

What to do

  • Check current usage: cat /proc/sys/net/netfilter/nf_conntrack_count and cat /proc/sys/net/netfilter/nf_conntrack_max
  • Increase the limit temporarily: sysctl -w net.netfilter.nf_conntrack_max=262144
  • Make it permanent in /etc/sysctl.d/99-conntrack.conf
  • Reduce timeouts for idle connections: sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30
  • If the server does not need connection tracking, consider using stateless firewall rules.

Configuration

alerts:
  conntrack_exhaustion:
    enabled: true
    threshold: 80
    critical_threshold: 95

21. bond_slave_down

Category: Network | Severity: Critical | Priority: P1 (urgent)

What it means

A network interface that is part of a bond (e.g. bond0) has gone down. Crucible reads /sys/class/net/{iface}/operstate and detects bond membership from /proc/net/bonding/*. Requires Crucible 0.6.5 or newer.

Why it matters

Bond interfaces provide network redundancy. When one slave goes down, the bond continues working but with reduced capacity and no redundancy. A second failure would cause a full network outage. This is often caused by a failed cable, SFP transceiver, or switch port.

What to do

  • Check bond status: cat /proc/net/bonding/bond0
  • Check the slave interface: ip link show enp1s0f0 and ethtool enp1s0f0
  • Try bringing it back up: sudo ip link set enp1s0f0 up
  • If the interface won't stay up, check the physical connection (cable, SFP, switch port).
  • Check kernel messages: dmesg -T | grep -i "enp1s0f0" | tail -10

Hardware / IPMI rules (5)

22. cpu_temperature_high

Category: Hardware | Severity: Warning (85 C), Critical (95 C) | Default threshold: 85 C

What it means

The CPU package temperature has exceeded the threshold. Crucible 0.8.0+ reads from kernel hwmon first (/sys/class/hwmon/ via snap.thermal.max_cpu_celsius) and falls back to IPMI sensor data only when hwmon is unavailable. The path that fired is recorded in evidence.path as "hwmon" or "ipmi". Temperatures are displayed in Celsius (e.g., 85 C).

Why hwmon-first: hwmon readings come straight from the CPU vendor's on-die digital thermal sensor and use standardised sensor naming. The previous IPMI-only path was misleading on some platforms (notably Gigabyte AMD boards with BMC firmware 12.61, where the CPU<N>_DTS sensor reads about 30 C hotter than the actual die). Crucible 0.9.1 also drops CPU<N>_DTS from the IPMI fallback when a sibling CPU<N>_TEMP sensor exists on the same socket.
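
To see the raw hwmon readings the hwmon path classifies (values are millidegrees Celsius):

for h in /sys/class/hwmon/hwmon*; do
  chip=$(cat "$h/name" 2>/dev/null)
  for t in "$h"/temp*_input; do
    [ -e "$t" ] || continue
    label=$(cat "${t%_input}_label" 2>/dev/null || echo "${t##*/}")
    echo "$chip/$label: $(( $(cat "$t") / 1000 )) C"
  done
done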

Why it matters

CPUs throttle their clock speed when they get too hot, which reduces performance. At extreme temperatures (above Tjunction max, typically 100-105 C), the CPU will shut down to protect itself, causing an unclean server restart. Sustained high temperatures also reduce the CPU's lifespan.

What to do

  • Check current temperatures: sensors (from lm-sensors package).
  • Verify that fans are running: ipmitool sdr type Fan
  • Clean dust from heatsinks and fans.
  • Check that the thermal paste between the CPU and heatsink is not dried out.
  • If in a data center, check the room temperature and airflow. Verify hot/cold aisle separation.
  • Reduce CPU load temporarily if temperatures are critical.

Configuration

alerts:
  cpu_temperature_high:
    enabled: true
    threshold: 85           # warning threshold in Celsius
    critical_threshold: 95

The sensor: hint is no longer needed; hwmon CPU readings are auto-classified by chip and label.

23. ecc_errors (correctable, rate-based)

Category: Hardware | Severity: Warning | Default threshold: 10 correctable errors per 24h

What it means

The server's ECC memory has reported correctable single-bit errors. These are silently fixed by the ECC hardware but logged. The rule fires when more than the configured threshold of correctable errors is observed within a rolling time window (default 10 errors per 24 hours). The underlying counter source is the higher of two paths:

  • Named-sensor path (Supermicro, ASRockRack, Gigabyte): the BMC exposes a numeric "Correctable ECC" / "Uncorrectable ECC" sensor; Crucible reports the cumulative count in snap.ipmi.ecc_errors.
  • SEL-derived path (Dell iDRAC, some HPE iLO firmwares): the BMC reports ECC events only via the System Event Log on the Memory entity, never as named sensors. Crucible 0.8.0+ counts ECC entries in the SEL and reports them in snap.ipmi.ecc_errors_from_sel.

Dashboard computes new errors as the difference between the latest snapshot and the oldest snapshot inside the rate window. This avoids false alerts on long-running healthy hosts where the BMC's cumulative counter has accumulated background noise over months.

evidence.delta_correctable is the count over the window, evidence.window_hours the window length, evidence.threshold the configured trigger, and evidence.path records which underlying source was authoritative ("named", "sel", or "both").

Counter reset detection

If the underlying counter regressed between the oldest-in-window snapshot and the current one (SEL clear, BMC reboot, host reboot zeroing named-sensor accumulators), Dashboard skips that evaluation cycle. The next snapshot resumes evaluation against a fresh oldest-in-window row. No false alerts from the reset itself.

Why it matters

Occasional correctable ECC errors are normal over long periods. A sustained rate of correctable errors on a single DIMM is a strong predictor of imminent failure. Rate-based evaluation surfaces that pattern without false-positive noise from healthy long-uptime hosts.

What to do

  • Check named-sensor counts: ipmitool sdr type Memory or ipmitool sensor list | grep -i ecc.
  • On Dell iDRAC or HPE iLO, inspect the SEL: ipmitool sel list | grep -i memory.
  • If errors persist after a DIMM swap, check the slot itself.
  • Run a memory test (memtest86+) during the next maintenance window.

Per-server overrides

# Wider window for a chronically-noisy host (errors/week instead of /24h):
ecc_rate_window_hours: 168

# Higher threshold to suppress (e.g. allow up to 50 per day):
ecc_correctable_rate_warning: 50

Both fields live in the server's config_overrides JSON. The legacy ecc_correctable_warning field (pre-Phase 7 P1) is automatically migrated to ecc_correctable_rate_warning by migration 014 with the same numeric value; this means an old override of 100 (originally meant as a lifetime ceiling) is now read as "100 errors per 24h" — far above realistic per-day rates, so it effectively still suppresses noise.

24. ecc_errors (uncorrectable)

Category: Hardware | Severity: Critical | Default threshold: 1 (any uncorrectable error)

What it means

The server's ECC memory has reported uncorrectable multi-bit errors. These cannot be repaired by ECC and may cause data corruption or application crashes.

Why it matters

Uncorrectable errors are serious. Corrupted data was delivered to the CPU, which can cause application crashes, data corruption, or silent data damage. This DIMM should be replaced immediately.

What to do

  • Identify the affected DIMM: edac-util -v
  • Replace the DIMM immediately.
  • Check application data integrity, especially database checksums.
  • Run memtest86+ to confirm the diagnosis.

Configuration

alerts:
  ecc_errors:
    critical_on_uncorrectable: true

25. psu_redundancy_loss

Category: Hardware | Severity: Critical | Default threshold: any PSU failure

What it means

A redundant power supply unit has failed or been disconnected. Crucible reads PSU state from IPMI only (no hwmon path). Two-tier evaluation:

  • Aggregate path (Dell PowerEdge): Crucible's vendor classifier surfaces the BMC's overall redundancy status as snap.ipmi.psu_redundancy_state (fully_redundant / redundancy_lost / redundancy_degraded / unknown). When the value is meaningful, the rule fires from this single signal.
  • Per-PSU path (Supermicro, Gigabyte, ASRockRack, others): Crucible iterates each individual PSU sensor in snap.ipmi.sensors and looks for fault states.

evidence.path records which source fired: "aggregate-redundancy", "per-psu-fault", "discrete-status-ok", or "all-healthy". In a typical 1+1 redundant configuration, the server continues running on the remaining PSU, but it has lost its power redundancy.

Why it matters

Servers with redundant PSUs are designed to survive a single PSU failure. Once one PSU is down, you are running without a safety net. If the remaining PSU fails, the server goes down immediately with no graceful shutdown.

What to do

  • Check PSU status: ipmitool sdr type "Power Supply"
  • On Dell, read the aggregate redundancy field directly: ipmitool sdr | grep -i redundan.
  • Verify that the failed PSU is receiving power (check the outlet and PDU).
  • If the PSU has a fault LED, note the error pattern.
  • Replace the failed PSU. Most servers support hot-swap PSU replacement.
  • If in a data center, open a hardware ticket immediately.

Configuration

alerts:
  psu_redundancy_loss:
    enabled: true
    # No threshold — any PSU failure triggers this alert.

26. ipmi_fan_failure

Category: Hardware | Severity: Critical | Default threshold: any fan failure or RPM below minimum

What it means

An IPMI-monitored fan has stopped spinning or dropped below the minimum RPM threshold. Crucible reads fan RPM values from IPMI SDR records and displays fan speeds in RPM.

Why it matters

Fan failure leads to rising temperatures, which cause CPU throttling, component damage, and eventually thermal shutdown. In servers with redundant fans, a single failure reduces cooling capacity and puts stress on the remaining fans.

What to do

  • Check fan status: ipmitool sdr type Fan
  • Inspect the fan for physical damage or cable disconnection.
  • If the server is in a data center, open a hardware ticket for fan replacement.
  • Monitor CPU temperatures closely until the fan is replaced.

Configuration

alerts:
  ipmi_fan_failure:
    enabled: true
    min_rpm: 500  # fans below this RPM are considered failed

27. ipmi_sel_critical

Category: Hardware | Severity: Critical | Default threshold: any critical SEL event in the last 30 days

What it means

A critical event has been logged in the IPMI System Event Log (SEL) within the rolling time window (default 30 days). This includes events like machine check exceptions, PCI-E fatal errors, and power unit failures. Crucible reads the SEL via ipmitool sel elist.

Why it matters

Critical SEL events indicate hardware-level problems that may not be visible through OS-level monitoring. These events are logged by the BMC independently of the operating system and can indicate problems that the OS cannot detect on its own.

Time window (default 30 days)

The rule only counts events asserted within the last ipmi_sel_critical_window_days days (default 30). Older events stay in evidence as events_outside_window for context but don't fire the alert. This prevents the rule from firing forever on a year-old transient (e.g., a power-supply AC-loss event that was paired with a deassertion the same minute but never cleared from the SEL).

Each event in evidence.critical_events[] carries an age_days field. null means the timestamp couldn't be parsed (older Crucible agents on certain BMC firmwares emit non-ISO date strings; the event is included anyway, fail-open). Upgrade to Crucible 0.9.2+ to normalise the emission.
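
If you triage from the payload, a jq sketch for separating recent events from stale ones (assumes the alert JSON is saved as alert.json; fields other than age_days follow the description above and may vary):

# Events asserted within the last 7 days; null ages are kept, fail-open
jq '.evidence.critical_events[] | select(.age_days == null or .age_days <= 7)' alert.json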

What to do

  • Read recent events: ipmitool sel elist | tail -30 (the alert's fix_commands ships this exact command).
  • Look at the age_days on each event in evidence. A cluster of recent events points at a current incident; one ancient event in a window-of-many means the rule is right but the underlying problem may already be resolved.
  • If the event indicates a component failure, schedule replacement.
  • Clearing the SEL erases the audit trail permanently. Only run ipmitool sel clear after you've recorded what was in there (the alert's critical_events are a snapshot — they don't go away when the SEL is cleared, but the live BMC view does). For most ops workflows, you don't need to clear; the time-window filter handles aging.

Configuration

alerts:
  ipmi_sel_critical:
    enabled: true
    # Per-server override of the rolling window (default 30 days):
    # ipmi_sel_critical_window_days: 90

ZFS rules (2)

28. zfs_pool_unhealthy

Category: ZFS | Severity: Critical | Default threshold: pool state != ONLINE

What it means

A ZFS pool's health status is something other than ONLINE. This includes DEGRADED (redundancy lost), FAULTED (data loss possible), and UNAVAIL (pool cannot be accessed).

Why it matters

A non-ONLINE ZFS pool means either redundancy is lost (DEGRADED) or data may already be inaccessible (FAULTED/UNAVAIL). Immediate action is required to prevent data loss.

What to do

  • Check pool status: zpool status
  • If DEGRADED: identify the failed vdev and replace the drive with zpool replace
  • If FAULTED: attempt zpool clear then investigate the cause.
  • Never reboot a FAULTED pool without understanding the failure first.

Configuration

alerts:
  zfs_pool_unhealthy:
    enabled: true
    # Triggers when any zpool reports non-ONLINE state

29. zfs_scrub_errors

Category: ZFS | Severity: Warning | Default threshold: any scrub errors

What it means

Checksum or data errors were found during ZFS scrub operations. ZFS scrubs verify every block of data against its checksum to detect silent data corruption (bit rot).

Why it matters

Scrub errors mean data on disk does not match its checksum. On redundant pools, ZFS auto-repairs from good copies. On non-redundant pools, this is data corruption. Either way, it signals failing hardware.

What to do

  • Check scrub results: zpool status -v
  • If on a mirror/raidz: ZFS auto-repaired. Identify the drive with errors and plan replacement.
  • If on a single vdev: data corruption occurred. Restore affected files from backup.
  • Run smartctl -a on the underlying device to check for hardware issues.

Configuration

alerts:
  zfs_scrub_errors:
    enabled: true
    # Triggers when zpool scrub reports any errors

Security rules (6)

30. ssh_root_password

Category: Security | Severity: Warning | Default: detects PermitRootLogin with password

What it means

The SSH daemon is configured to allow root login with a password. Crucible checks /etc/ssh/sshd_config and fires when PermitRootLogin is set to yes, or when it is not explicitly set to prohibit-password (or stricter).

Why it matters

Root login via password is a common attack vector. Brute-force SSH attacks target root constantly. Key-based authentication is much more secure.

What to do

  1. Verify key-based access works before changing anything. Locking yourself out is the failure mode this rule's fix can cause. The alert's fix_commands array ships an explicit probe — run it from a separate shell first:
    ls -la ~/.ssh/authorized_keys
    ssh -o PasswordAuthentication=no root@localhost exit 2>/dev/null && echo "Key auth working" || echo "Key auth FAILED — do NOT disable password login yet"
    The probe prints "Key auth working" when key-based auth succeeds and "Key auth FAILED" when it doesn't.
  2. Once key access is confirmed, set PermitRootLogin prohibit-password in /etc/ssh/sshd_config.
  3. Restart SSH from one shell while keeping another shell open: sudo systemctl restart sshd. If you lose the new connection but keep the old, you can roll back from the still-open session.

Configuration

alerts:
  ssh_root_password:
    enabled: true

31. no_firewall

Category: Security | Severity: Warning | Default: detects no active firewall

What it means

No active firewall was detected. Crucible checks for iptables rules, nftables, ufw, and firewalld. If all are empty or inactive, this alert fires.

Why it matters

A server without a firewall exposes all listening services to the internet. Even services bound to localhost can be exposed if a misconfiguration changes the bind address.

Running on a cloud provider with an external firewall?

If your server sits behind a cloud security group (AWS, GCP, Azure, Hetzner Cloud Firewall, DigitalOcean Cloud Firewalls, etc.) and inbound rules are enforced there, a host-level firewall is optional: running both is belt-and-braces. If you trust the security group rules, disable this rule for the affected servers; just remember that an empty host firewall offers no backstop if the security group is ever misconfigured. See "Configuration" below for the disable snippet.

What to do (host-level firewall path)

Pick the firewall stack your distro ships:

  • Debian / Ubuntu (ufw): sudo apt install ufw && sudo ufw default deny incoming && sudo ufw allow ssh && sudo ufw enable. Note the ordering: the allow ssh rule must be in place before ufw enable activates the default deny, or you may lose your current session.
  • RHEL / Rocky / AlmaLinux / Fedora (firewalld): on a minimal-server install firewalld may not be present, so install it if missing and then enable it — sudo dnf install -y firewalld && sudo systemctl enable --now firewalld && sudo firewall-cmd --permanent --add-service=ssh && sudo firewall-cmd --reload. The dnf install is a no-op when the package is already present, so this one-liner is safe across full and minimal installs. Verify with sudo firewall-cmd --list-all.
  • Or configure iptables/nftables directly with appropriate rules for your services.

Configuration

alerts:
  no_firewall:
    # set to false on cloud hosts with an upstream security group:
    enabled: true

32. pending_security_updates

Category: Security | Severity: Warning | Default: any pending security update

What it means

The package manager has pending security updates that have not been installed. Crucible checks apt (Debian/Ubuntu) or dnf (RHEL/Rocky/Alma) for available security patches.

Why it matters

Unpatched security vulnerabilities are one of the most common attack vectors. Security updates should be applied promptly, especially for internet-facing services.

What to do

  • Review pending updates: apt list --upgradable or dnf check-update --security
  • Apply security updates: sudo apt upgrade or sudo dnf update --security
  • Consider enabling automatic security updates (see unattended_upgrades_disabled below).

Configuration

alerts:
  pending_security_updates:
    enabled: true

33. kernel_vulnerabilities

Category: Security | Severity: Warning | Default: any known kernel vulnerability

What it means

The running kernel has known vulnerabilities that are mitigatable or patchable. Crucible checks /sys/devices/system/cpu/vulnerabilities/ for Spectre, Meltdown, and other CPU/kernel vulnerabilities.

Why it matters

Kernel vulnerabilities can allow privilege escalation, container escapes, or data leaks between processes. While some mitigations are applied automatically, others require a kernel update and reboot.

Read the status text first

Each unmitigated entry in the alert's evidence.unmitigated[] array has a status field that names the actual gap. The fix depends on what the status says:

  • "Vulnerable" (no qualifier) — usually means a kernel-level mitigation isn't enabled. Update the kernel and reboot. This is the common case.
  • "Vulnerable: ... no microcode" — the kernel has the mitigation code but the CPU is missing the microcode update it needs to apply. Kernel upgrades alone will not fix this. You need either (a) a BIOS / UEFI firmware update from your motherboard or server vendor, which ships the microcode in firmware, or (b) a userland microcode package (intel-microcode on Debian/Ubuntu, microcode_ctl or linux-firmware on RHEL-family) that loads on boot before the kernel.
  • "Mitigation: ..." — the mitigation is active. No action needed; this row shouldn't appear in unmitigated[] unless something looks off.

What to do

  • Check vulnerability status: grep . /sys/devices/system/cpu/vulnerabilities/*
  • If the status is plain "Vulnerable": update the kernel (sudo apt upgrade linux-image-generic or sudo dnf upgrade kernel) and reboot.
  • If the status mentions "no microcode": install the microcode package (sudo apt install intel-microcode for Intel, sudo apt install amd64-microcode for AMD, or sudo dnf install microcode_ctl linux-firmware on RHEL-family) and reboot.
    Debian users on AMD: amd64-microcode lives in the non-free-firmware component, which is not enabled by default. If apt install amd64-microcode reports "Package 'amd64-microcode' has no installation candidate", add non-free-firmware to your sources first. On Debian 12+, the default sources live in /etc/apt/sources.list.d/debian.sources (deb822 format): edit that file and add non-free-firmware to the Components: line for the main + security + updates entries, then sudo apt update. On Debian 12 systems that still use a legacy populated /etc/apt/sources.list (common after upgrades), the one-line form works:
    sudo sed -i 's/main$/main non-free-firmware/' /etc/apt/sources.list
    sudo apt update
    sudo apt install amd64-microcode
    Then reboot to load. On Debian 11 and older, the component is named non-free rather than non-free-firmware; substitute it in the commands above. If the package is already installed and the status still says no microcode, check your BIOS/UEFI for a firmware update from the motherboard vendor; some microcode is only delivered via BIOS.
  • Debian users on Intel: Intel CPUs use intel-microcode. On Debian 12+ it lives in non-free-firmware just like amd64-microcode, so the deb822 step above applies; on Debian 11 and older it lives in non-free. Install with sudo apt install -y intel-microcode, then reboot.
  • If this host runs VMs (Proxmox, KVM/libvirt, vSphere, etc.), the reboot will take guests with it. Drain or migrate before mark-reboot, or run inside a scheduled maintenance window.
  • Some vulnerabilities have no fix and never will (older CPUs that are EOL for microcode). For those, mute the rule on the affected server (see Alert muting) and document the residual risk.

Configuration

alerts:
  kernel_vulnerabilities:
    enabled: true

34. kernel_needs_reboot

Category: Security | Severity: Warning | Default: reboot required after kernel update

What it means

A kernel update has been installed but the server is still running the old kernel. Crucible detects this by comparing the running kernel version against the installed version and by checking for /var/run/reboot-required.

Why it matters

Security patches in the new kernel are not active until the server reboots. The server remains vulnerable to patched exploits until the reboot occurs.

What to do

  1. Mark the reboot first so Crucible doesn't fire unexpected_reboot on the next boot. Run sudo glassmkr-crucible mark-reboot --reason "kernel update" on the box. The agent writes a single-use marker that suppresses the unexpected-reboot alert on the very next snapshot post-reboot. (If you forget, the alert fires once and you can resolve it manually; it's not destructive, just noisy.)
    Or run sudo glassmkr-crucible reboot --reason "kernel update" to do both the mark and the reboot in one step.
  2. Schedule a maintenance window and reboot the server.
  3. If this host runs VMs (Proxmox, KVM/libvirt, vSphere, etc.), the reboot will take guests with it. Drain or migrate before mark-reboot, or run inside a scheduled maintenance window.
  4. Verify the new kernel is running after reboot: uname -r

Configuration

alerts:
  kernel_needs_reboot:
    enabled: true

35. unattended_upgrades_disabled

Category: Security | Severity: Warning | Default: detects disabled automatic security updates

What it means

Automatic security updates are not configured. On Debian/Ubuntu, Crucible checks whether the unattended-upgrades package is installed and enabled. On RHEL-based systems, it checks for dnf-automatic.

Why it matters

Without automatic security updates, critical patches sit uninstalled until someone manually runs the update. For servers that are not actively maintained, this can leave known vulnerabilities open for weeks or months.

What to do

  • Install and enable automatic updates. The install creates /etc/apt/apt.conf.d/20auto-upgrades with sensible defaults; the systemctl line ensures the service is enabled and running:
    sudo apt install unattended-upgrades
    sudo systemctl enable --now unattended-upgrades
    Only if you previously disabled auto-updates (e.g. set Enable=0 in 50unattended-upgrades), run sudo dpkg-reconfigure -plow unattended-upgrades interactively to re-enable.
  • Or on RHEL: sudo dnf install dnf-automatic && sudo systemctl enable --now dnf-automatic.timer
  • If you prefer manual updates, you can disable this rule.

Configuration

alerts:
  unattended_upgrades_disabled:
    enabled: true

Service Health rules (3)

36. systemd_service_failed

Category: Service Health | Severity: Warning | Default: any failed systemd service

What it means

One or more systemd services have entered the "failed" state. Crucible runs systemctl list-units --state=failed on each collection cycle and reports any units that are not running as expected.

Why it matters

Failed services may include databases, web servers, monitoring agents, or critical system daemons. A service in the failed state is not running and will not restart automatically unless configured to do so. Operators often do not notice failed services until users report problems.

Read evidence.journal_excerpts first (Crucible 0.9.2+)

The alert evidence now includes a journal_excerpts field (an object mapping each failed unit name to its last 5 journal lines), collected by Crucible at snapshot time. For most failures the root cause is in those 5 lines (a config error, a missing dependency, a permission issue). Read those before SSHing to the box. If the field is empty for a unit, your Crucible is pre-0.9.2 — upgrade with sudo npm install -g @glassmkr/crucible@latest and the field will populate on the next ingest cycle.

What to do

  • Read the per-unit journal excerpt in evidence.journal_excerpts. The first line is usually enough to diagnose the root cause.
  • If you need more context: sudo journalctl -u <unit> --no-pager -n 50
  • Attempt a restart: sudo systemctl restart <unit>
  • If the service fails repeatedly with a config error visible in the excerpt: fix the config and restart. See the common patterns below.
  • For services you intentionally disabled, add them to the ignore list.

Common journal-excerpt patterns

Most failed-service cases fall into one of these shapes. The fix for each is small and bounded — but only if you know to look for the right line in the excerpt.

fail2ban.service: "Have not found any log file for sshd jail"

Symptom: ERROR Failed during configuration: Have not found any log file for sshd jail followed by Async configuration of server failed. Common on Debian 12+ / Ubuntu 22.04+ where sshd logs to journald only (no /var/log/auth.log). Fix: tell fail2ban to read from journald instead of a log file. Create /etc/fail2ban/jail.d/sshd-systemd.local with:

[sshd]
backend = systemd

Then sudo systemctl restart fail2ban. The journal excerpt should now show the jail starting cleanly.

NetworkManager-wait-online.service / systemd-networkd-wait-online.service: "Failed with result 'exit-code'"

These are "wait until the network is fully up" oneshots. They fail when one or more configured-but-disconnected interfaces (a second NIC without a cable, an inactive bond slave, a VLAN that won't link) make the unit hit its timeout. The host is usually fine — only the wait-online unit is failed.

Two options:

  • Easy: add to the ignore list in your collector config (see Configuration below). These units are oneshots that don't affect actual networking once boot completes.
  • Fix-properly: restrict the wait to interfaces you actually expect to come up. For systemd-networkd: sudo systemctl edit systemd-networkd-wait-online.service and override ExecStart (clear it with an empty ExecStart= line, then set ExecStart=/lib/systemd/systemd-networkd-wait-online --interface=<your-real-iface>). For NetworkManager, nm-online has no per-interface flag; instead stop the offending profile from autoconnecting (nmcli connection modify <name> connection.autoconnect no) or use the stale-profile recipe below.

Multi-NIC RHEL-family boxes: usually it's stale NetworkManager profiles. On Rocky / Alma / RHEL with several NICs, the failure is most often that NetworkManager has connection profiles bound to interfaces that no longer exist or are currently unplugged. wait-online waits for every "auto-connect" profile to come up; one stale profile is enough to time out the whole unit.

Recipe to find and clear stale profiles:

# List active connections (the ones that did come up)
nmcli connection show --active

# List every configured connection, including inactive ones
nmcli connection show

# For each row that is bound to a non-existent or unplugged interface and
# that you do NOT need, delete it:
sudo nmcli connection delete "<connection-name>"

# Retry wait-online and confirm it's clean
sudo systemctl restart NetworkManager-wait-online
sudo systemctl status NetworkManager-wait-online --no-pager

If a profile is bound to an interface you genuinely need but which is currently unplugged (a redundant cable path, a NIC that comes up on demand), don't delete it — either lower its autoconnect-retries so wait-online gives up sooner (nmcli connection modify <name> connection.autoconnect-retries 1), or mask the wait-online unit entirely if nothing else in your boot ordering needs the dependency (sudo systemctl mask NetworkManager-wait-online.service).

"Address already in use" / port conflict

Symptom: excerpt contains bind() ... failed (98: Address already in use) or similar. Another process holds the port. Find what: sudo ss -tlnp | grep ':<port>'. Either kill the other process or change one of their port assignments.

"Failed to start ... dependency"

Symptom: excerpt contains Failed to start <some-unit>.service - ... Job ... failed because of unavailable resources. or Dependency failed for .... Look at the named dependency unit's own status (sudo systemctl status <dep>) — the real failure is one level deeper. Fix that one and the dependent unit will recover.

Configuration

alerts:
  systemd_service_failed:
    enabled: true
    ignore_services:
      - bluetooth.service   # ignore services that are not relevant
      - ModemManager.service
      - NetworkManager-wait-online.service   # see "Common patterns" above

37. fd_exhaustion

Category: Service Health | Severity: Warning (80%), Critical (95%) | Default threshold: 80%

What it means

The system's file descriptor usage has exceeded the configured percentage of the maximum allowed. Crucible reads /proc/sys/fs/file-nr to get the current allocation and the system-wide limit.
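
The same percentage by hand; /proc/sys/fs/file-nr holds three fields (allocated, unused, max):

awk '{printf "file descriptors: %d / %d (%.1f%%)\n", $1, $3, $1 / $3 * 100}' /proc/sys/fs/file-nr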

Why it matters

File descriptors are used for open files, sockets, pipes, and other I/O handles. When the system runs out of file descriptors, processes cannot open new files or establish new network connections. This causes cascading failures: databases refuse connections, web servers return errors, and logging stops working.

What to do

  • Check current usage: cat /proc/sys/fs/file-nr (allocated, unused, max)
  • Find processes with many open FDs: for pid in /proc/[0-9]*; do echo "$(ls "$pid/fd" 2>/dev/null | wc -l) $(cat "$pid/comm" 2>/dev/null)"; done | sort -rn | head -20
  • Increase the system limit temporarily: sysctl -w fs.file-max=1048576
  • Make it permanent in /etc/sysctl.d/99-file-max.conf
  • Check per-process limits with cat /proc/PID/limits and adjust with systemd LimitNOFILE=.
  • Investigate if a process is leaking file descriptors (opening without closing).

Configuration

alerts:
  fd_exhaustion:
    enabled: true
    threshold: 80
    critical_threshold: 95

38. server_unreachable

Category: Service Health | Severity: Critical | Priority: P1 (urgent)

What it means

The server has stopped sending snapshots to Dashboard. Crucible is an agent-based collector; if the server goes down, the agent goes down with it and Dashboard stops receiving data. This rule runs server-side on a schedule (every 2 minutes), not as part of the snapshot evaluation.

Why it matters

A server that stops reporting may be down, rebooting, or have a crashed Crucible service. Without this rule, the only signal would be the "Last seen X minutes ago" label on the dashboard, which is easy to miss.

How it works

  • Threshold: 2x the server's collection interval (default 300s, so 10 minutes).
  • Scales with custom intervals: if a server pushes every 600s, the threshold is 20 minutes.
  • Onboarding grace: servers younger than 10 minutes never fire this alert.
  • Servers that have never sent a snapshot are not alerted on.
  • Auto-resolves when the server sends its next snapshot.

What to do

  • Check if the server is reachable: ping {server_ip}
  • If reachable, check Crucible: ssh {server} sudo systemctl status glassmkr-crucible
  • Check logs: ssh {server} sudo journalctl -u glassmkr-crucible -n 20 --no-pager
  • If not reachable, check your hosting panel for IPMI or KVM access.

Global alert settings

These settings apply to all alert rules and can be set in the configuration file or the dashboard:

alerts:
  global:
    cooldown: 3600          # seconds between repeated notifications for the same alert
    resolve_notify: true    # send a notification when an alert resolves
    channels:
      - telegram
      - email