Changelog

Behaviour changes, operational improvements, and notable fixes shipped to Dashboard. Most recent at top.

To add a new entry, append a new <section class="release" id="YYYY-MM-DD"> block at the top of the list below.

2026-05-13

Crucible 0.9.4

Fixed. PSU sensor classification now works across every observed BMC vendor (Supermicro, Gigabyte, ASRockRack, ASUS). Previously the PS<N> name shape was Dell-gated and four of five PSU-having boxes were silently filtered out, so per-PSU alerts never fired regardless of state. The bitmask interpretation now follows IPMI 2.0 spec table 42-3 (Failure detected, AC lost, predictive, inactive).

Changed. When the agent cannot probe IPMI at all (no ipmitool, no /dev/ipmi0, etc.), the snapshot now emits null for ipmi.ecc_errors and sel_entries_count rather than stub zeros. The Dashboard dashboard renders this as "no signal (BMC not probed)" instead of the misleading "0 / 0". The header reads "IPMI: Not detected (ipmitool not installed)" with a one-click link to the troubleshooting page.

Changed. IPMI capability detection now re-runs every hour. Installing ipmitool after the agent started is picked up automatically; no service restart needed.

Added. glassmkr-crucible doctor ipmi subcommand for customer self-diagnosis. Runs the same probes the agent runs and prints actionable guidance per failure mode (no_ipmitool_binary, permission_denied, no_bmc_device, execution_failed). See /docs/troubleshooting/ipmi.

Upgrade. sudo npm install -g @glassmkr/crucible@latest && sudo systemctl restart glassmkr-crucible

Tier-gating policy for the programmatic API

The programmatic API surface (account keys) is now uniformly gated by Pro plan. Free customers retain full web-dashboard access and full read API access for their own data; programmatic writes (channel CRUD, alert acknowledge/resolve, server CRUD, mutes, restore endpoints, key management) and Pro features (AI analysis, trend warnings) now return 402 pro_required when called via an API key on a Free plan. Web-dashboard sessions are unaffected at any tier. Full breakdown at /docs/api/tier-gating.

Unexpected-reboot alerts now auto-resolve

An unexpected_reboot alert that has been firing for at least 24 hours of continuous stable uptime now auto-resolves with resolution_reason: auto_decay_stable_24h. The original incident remains in the resolved-alerts history. Tunable per-server via config_overrides.unexpected_reboot_decay_hours. See /docs/alerts#unexpected_reboot.

2026-05-12 (later still)

Rate-based ECC error detection

The ecc_errors alert rule now uses a rolling 24-hour rate window instead of a cumulative threshold. Default: a warning fires when more than 10 correctable errors are observed in 24 hours. Uncorrectable ECC errors continue to fire critical immediately on any non-zero count.

This eliminates false-positive alerts on long-running hosts where BMC error counters accumulate over months without indicating an active hardware issue.

If you previously set a per-server ecc_correctable_warning override to suppress a noisy host, the override has been preserved with the new rate semantic (your threshold now applies to errors per 24 hours, not lifetime count). For most hosts this is the right behaviour; if you want to adjust, the new field names are ecc_correctable_rate_warning (count threshold) and ecc_rate_window_hours (window length, default 24).

When the BMC counter is cleared (SEL reset, BMC reboot), the rule skips one evaluation cycle and resumes on the next snapshot — no false alerts from the reset itself.

2026-05-12 (later)

API key management UI + scopes

Pro customers can now manage API keys via the dashboard at /settings/keys. New keys can be created with one of three scopes (Read, Write, or Admin), an optional expiry date (up to 5 years), and graceful 48-hour rotation. The old key keeps working through the grace window so automation can be updated without downtime, then auto-revokes. Emergency revoke is one click away with the "Revoke immediately" option for compromised keys.

Reminder emails fire 7 and 1 days before key expiry, plus a notification when a key is auto-revoked on expiry. Audit log view is at /settings/audit with date-range and action filters (Pro plan; 365-day retention).

Existing keys retain full Admin access and no expiry — no action required to keep working.

2026-05-12

Pro-tier API gating

Programmatic API access (creating account API keys, rotating account keys, reading the audit log, plus programmatic server management via account keys: create, rename, delete, rotate collector keys) is now restricted to Pro-plan customers. Free-plan accounts can continue to ingest snapshots from their 3 free servers and use the dashboard to manage them; programmatic management requires Pro.

If you encounter a 402 response on an endpoint that previously worked, your account may have been downgraded or never had Pro access. Existing API keys created before this change keep working for ingest; manage them via the dashboard if you need to rotate or delete.

2026-05-09

Hardware visibility on dashboard

Server tiles and detail pages now show the hardware vendor and product detected by the monitoring agent (e.g., "GIGABYTE / R292-4S1-00"), plus an IPMI badge when sensor data is available. This helps identify what's actually on each box, especially when server names don't match the underlying hardware.

2026-05-08

Stripe billing enforcement

Pro customers without a payment method on file now have servers beyond the 3-server free quota disabled at the end of their billing period. The oldest 3 servers stay active. Snapshot ingest continues for disabled servers, so historical data is preserved and restoration is instant once a card is added.

Affected customers receive a sequence of warning emails: when a payment method is removed, 3 days before disable, 1 day before disable, and at the moment of disable. Restoration is a single-click operation in the dashboard once a card is on file: Settings → Disabled servers → Restore all. Servers can also be deleted individually from that screen if you'd rather drop into the free quota than pay.

The 3-server free quota is preserved on Pro plan customers without a card to avoid abrupt service loss.

2026-05-07

Behaviour changes

`cpu_temperature_high` reads hwmon directly, with IPMI fallback

Most customers see no change. The evaluator now reads CPU thermals from the kernel's hwmon interface (more accurate, vendor-agnostic), falling back to IPMI sensor data when hwmon isn't available.

Customers running on Gigabyte AMD platforms (B650, B660, X670, EPYC) will see fewer false-positive alerts. Their BMC firmware (12.61, the shipping default on these boards) reports a CPU<N>_DTS IPMI sensor that runs about 30°C hotter than the actual CPU die temperature exposed via the kernel. The fix lands in two places:

Dashboard side (this release) reads hwmon first, treats the BMC reading as a fallback only.
Agent side (Crucible 0.9.1, on npm and Docker now) drops the inflated CPU<N>_DTS sensor whenever a CPU<N>_TEMP sibling exists on the same socket.

If you previously had elevated noise from CPU temperature alerts on Gigabyte hosts, this release should reduce them.

ECC alerts now fire on Dell and HPE iDRAC platforms too

Previously, ECC alerts only fired when the BMC exposed a named "correctable / uncorrectable" sensor. Dell iDRAC and some HPE iLO firmwares report ECC events only via the System Event Log (SEL), not as named sensors, so customers on those platforms were missing alerts silently.

This release adds a SEL-derived counter as a second source. If your hardware uses iDRAC or HPE iLO and has had any ECC events in its SEL history, you may now see correctable or uncorrectable ECC alerts that were previously silent. This is the intended fix.

If you'd been relying on ECC monitoring elsewhere on these hosts and don't need the duplicate signal, you can mute the rule per-server from the Settings page.

ECC threshold is unchanged this release

The underlying counter is cumulative since the BMC's last SEL clear, which makes any static threshold misbehave on long-running hosts. We're shipping a rate-based design (alerts based on correctable errors per time window) in the next release. Until then, the warning threshold remains as before: any non-zero correctable count fires a warning.

If you see noisy ECC warnings from a host that has accumulated correctable events over months or years of uptime, set ecc_correctable_warning to a higher number on the Settings page (for example, 100 to ignore historical accumulation up to 100 events).

Critical alerts on uncorrectable ECC errors are unaffected. They fire at the first uncorrectable event on either path, regardless of this threshold.

Operational changes

`install.sh` simplified

The bootstrap installer at https://glassmkr.com/install.sh now delegates configuration and systemd setup to the new glassmkr-crucible init subcommand (Crucible 0.9.1+). Net effect: the same end state, with explicit API key validation and consistent binary-path detection across Ubuntu, Debian, and Proxmox.

The --dashboard-key argument is preserved as an alias for --api-key so existing automation (Ansible playbooks, Terraform modules) keeps working.

Existing installations are unaffected. The change applies only to newly bootstrapped hosts.

Crucible 0.9.1 released to npm and Docker

Customers on auto-update receive 0.9.1 on the next service restart.

To upgrade manually:

npm install -g @glassmkr/crucible@latest
systemctl restart glassmkr-crucible

To pin to a specific version: npm install -g @glassmkr/[email protected].

Docker images are at ghcr.io/glassmkr/crucible:0.9.1 and glassmkr/crucible:0.9.1.

What's new in 0.9.1:

glassmkr-crucible init — first-run setup wizard. Validates your API key, optionally probes the ingest endpoint to confirm it's accepted, writes /etc/glassmkr/collector.yaml (mode 0600), writes the systemd unit with the dynamically-detected binary path, and enables the service. Run glassmkr-crucible init --help for the full flag list. Supports --api-key - to read the key from stdin for password-manager pipes.
BMC CPU<N>_DTS filter — drops the inflated DTS sensor when a CPU<N>_TEMP sibling is present. See the cpu_temperature_high section above.
README revision — init is documented as the canonical first-run path. The manual install flow is preserved for ops engineers customising the systemd unit.

0.9.0 → 0.9.1 is a drop-in replacement, no config changes required.

Migration notes

None required. Existing 0.9.0 installs with hand-written collector.yaml and systemd units continue working unchanged.

What's coming

The ECC rate-based threshold redesign is the next main-line work, expected in the next Dashboard release. PSU redundancy alerting follows; design will incorporate the cumulative-counter lesson from ECC.