Troubleshooting
This page covers common issues with the Crucible agent and Dashboard dashboard, along with step-by-step solutions.
Topic pages
- IPMI: how Crucible detects IPMI, why "Not detected" can be correct behaviour, using glassmkr-crucible doctor ipmi, per-vendor notes.
Common issues
Crucible service fails to start
Symptom: systemctl status glassmkr-crucible shows failed or inactive (dead).
Steps:
- Check the service logs:
journalctl -u glassmkr-crucible --no-pager -n 50
- If you see a YAML parse error, re-run the init wizard with the same key to rewrite the config from scratch:
sudo glassmkr-crucible init --api-key <your_collector_key>
The wizard validates the key against Dashboard before writing the config, so a typo in the key surfaces immediately. Common YAML mistakes include using tabs instead of spaces, missing quotes around strings with special characters, and incorrect indentation.
- If you see permission denied, ensure the configuration file is readable:
ls -la /etc/glassmkr/collector.yaml
The file should be owned by root with mode 0600.
- If you see bind: address already in use, another instance may be running:
pgrep -a glassmkr-crucible
Kill the stale process and try again.
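The three log signatures above can be triaged mechanically. The following is a minimal sketch, not part of Crucible: classify_start_failure is a hypothetical helper name, and the match patterns are assumptions about how the messages are worded in the journal.

```shell
# Hypothetical helper: read journal output on stdin and name the likely cause.
# The patterns are assumptions about the log wording, not Crucible's exact output.
# "permission denied" is tested before "yaml" because the config path itself
# (collector.yaml) contains the string "yaml".
classify_start_failure() {
  log=$(cat)
  if printf '%s\n' "$log" | grep -qi 'permission denied'; then
    echo "permission-denied"     # check owner and mode of collector.yaml
  elif printf '%s\n' "$log" | grep -qi 'address already in use'; then
    echo "stale-process"         # look for a second instance with pgrep
  elif printf '%s\n' "$log" | grep -qi 'yaml'; then
    echo "yaml-error"            # re-run the init wizard
  else
    echo "unknown"
  fi
}
```

To use it, pipe the log check into the function: journalctl -u glassmkr-crucible --no-pager -n 50 | classify_start_failure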
Server shows "offline" in the dashboard
Symptom: The server card in Dashboard shows a gray status indicator and "last seen" is more than 5 minutes ago.
Steps:
- Check that Crucible is running:
systemctl status glassmkr-crucible
- Check network connectivity to the API:
curl -s -o /dev/null -w "%{http_code}" https://app.glassmkr.com/api/v1/health
You should get 200. If not, check DNS resolution, firewall rules, and proxy settings.
- Check if the token is valid:
sudo journalctl -u glassmkr-crucible --since "5 min ago" --no-pager
If you see auth error: 401, generate a new token in the Dashboard dashboard and update /etc/glassmkr/collector.yaml.
- Check for network-level blocks. Some firewalls or security groups block outbound HTTPS. Verify that port 443 to app.glassmkr.com is open:
nc -zv app.glassmkr.com 443
- If you are behind a proxy, configure it in collector.yaml:
proxy:
  https: http://proxy.internal:3128
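If you script the health check, the status code can be mapped to a next step. This is a minimal sketch; interpret_health_code is a hypothetical helper, and the only codes it distinguishes are the expected 200 and the 000 that curl prints when it never receives an HTTP reply.

```shell
# Hypothetical helper: turn the status code printed by the curl check into a label.
interpret_health_code() {
  case "$1" in
    200) echo "ok" ;;
    000) echo "no-response" ;;     # no HTTP reply: check DNS, firewall rules, proxy
    *)   echo "unexpected-$1" ;;   # the API answered, but not with 200
  esac
}

# Example wiring (commented out so the sketch makes no network calls when sourced):
# code=$(curl -s -o /dev/null -w "%{http_code}" https://app.glassmkr.com/api/v1/health)
# interpret_health_code "$code"
```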
Metrics are delayed or missing
Symptom: The dashboard shows gaps in charts or data arrives minutes late.
Steps:
- Check the agent's push timing:
sudo journalctl -u glassmkr-crucible --since "5 min ago" --no-pager
The "Last push" value should be close to the configured interval (default: 300 seconds).
- If pushes are slow, check the agent log for timeout errors:
grep -i "timeout\|retry" /var/log/glassmkr/crucible.log | tail -20
- If the server's clock is significantly off, metrics may be dropped. Verify NTP is working:
timedatectl status
The system clock should be synchronized. If not, enable NTP:
sudo timedatectl set-ntp true
- If specific collectors are slow (e.g., SMART queries on many disks), they can delay the entire push. Check collector timing:
sudo journalctl -u glassmkr-crucible -f
Consider increasing the collection interval or disabling slow collectors.
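The clock check above can be scripted. This is a minimal sketch, assuming the "System clock synchronized: yes" line that recent systemd releases print; older releases word the line differently, so treat a failed match as "verify by hand" rather than proof of drift. clock_is_synced is a hypothetical helper name.

```shell
# Succeeds when timedatectl output on stdin reports a synchronized clock.
# Assumes the "System clock synchronized: yes" line printed by recent systemd releases.
clock_is_synced() {
  grep -Eq '^[[:space:]]*System clock synchronized:[[:space:]]*yes'
}
```

For example: timedatectl status | clock_is_synced || echo "clock not reported as synchronized"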
SMART data is not appearing
Symptom: The Disk tab in the dashboard shows no SMART information.
Steps:
- Ensure smartmontools is installed:
# Debian/Ubuntu
sudo apt install smartmontools
# RHEL/Rocky/Alma
sudo dnf install smartmontools
- Verify that smartctl can read your drives:
sudo smartctl -a /dev/sda
If this fails with a permission error, Crucible needs to run as root (which is the default for the systemd service).
- For hardware RAID controllers, drives behind the controller are not visible to smartctl without the -d flag. Check if your controller is supported:
sudo smartctl -a /dev/sda -d megaraid,0
- Verify the SMART collector is enabled in collector.yaml:
collectors:
  smart:
    enabled: true
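Behind a MegaRAID controller, each physical drive is addressed by its own device ID, so the single probe above usually needs repeating. This is a minimal sketch that only prints the commands; print_megaraid_probes is a hypothetical helper, and the assumption that IDs run contiguously from 0 may not hold on your controller.

```shell
# Print (do not run) one smartctl probe per MegaRAID device ID, 0 through count-1.
# Dry-run on purpose: review the commands, then run the ones you want with sudo.
print_megaraid_probes() {
  dev=$1
  count=$2
  i=0
  while [ "$i" -lt "$count" ]; do
    echo "smartctl -a $dev -d megaraid,$i"
    i=$((i + 1))
  done
}
```

For example, print_megaraid_probes /dev/sda 4 lists the probes for device IDs 0 to 3.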
Telegram notifications are not arriving
Symptom: Alerts fire in the dashboard but no Telegram messages are received.
Steps:
- Test the channel from the dashboard or API:
curl -X POST https://app.glassmkr.com/api/v1/channels/CHANNEL_ID/test \
  -H "Authorization: Bearer YOUR_TOKEN"
- If the test fails with 401 Unauthorized, the bot token is invalid. Create a new bot with BotFather or regenerate the token.
- If the test fails with 400 Bad Request: chat not found, the chat ID is wrong. Common mistakes:
  - Missing the -100 prefix for supergroups.
  - The bot was removed from the group after setup.
  - The bot has not received any messages in the chat yet (send a message to the bot first).
- If the test succeeds but real alerts do not arrive, check the channel routing. Go to Settings > Alert Defaults and verify that your Telegram channel is listed.
- Check the alert cooldown. By default, Dashboard only sends one notification per alert per hour. If you acknowledged the alert or it was recently notified, additional notifications are suppressed.
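Of the chat ID mistakes listed above, the missing -100 prefix is the one you can check locally before testing the channel. This is a minimal sketch; is_supergroup_chat_id is a hypothetical helper that checks only the shape of the ID, not whether the chat exists or the bot is a member.

```shell
# Succeeds when the argument has the shape of a supergroup chat ID:
# the literal prefix -100 followed by one or more digits.
is_supergroup_chat_id() {
  printf '%s\n' "$1" | grep -Eq '^-100[0-9]+$'
}
```

For example: is_supergroup_chat_id "$CHAT_ID" || echo "not a supergroup-style ID; check the -100 prefix"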
Email notifications go to spam
Symptom: Test emails arrive in the spam folder.
Steps:
- Check the spam folder and mark messages as "not spam" to train your mail provider.
- Add [email protected] to your contacts or safe senders list.
- If you control the recipient domain, add an SPF record allowing Glassmkr's mail servers. Contact support for the current IP ranges.
- For better deliverability, use a custom SMTP server with your own domain. See the Channels page for setup instructions.
Temperature or IPMI data is missing
Symptom: The Hardware tab shows no temperature, fan, or PSU data.
Steps:
- Install lm-sensors for hwmon data:
# Debian/Ubuntu
sudo apt install lm-sensors
sudo sensors-detect --auto
- For IPMI data, install ipmitool:
sudo apt install ipmitool
Verify it works:
sudo ipmitool sdr list
- If IPMI is not available (common on consumer hardware and many cloud VMs), Crucible reads thermal data from hwmon directly. Virtual machines typically have no thermal sensors at all.
- Check that the thermal collector is not disabled:
collectors:
  thermal:
    enabled: true
    source: auto
- If you're on a Gigabyte AMD board (B650, B660, X670, EPYC) and CPU temperature alerts have been noisy, the BMC's CPU<N>_DTS sensor reads about 30 °C hotter than the actual die. Crucible 0.9.1+ filters that sensor when a sibling CPU<N>_TEMP exists, and Dashboard's cpu_temperature_high rule reads hwmon first, so the inflated DTS reading is no longer the source of truth. Upgrade the agent (sudo npm install -g @glassmkr/crucible@latest && sudo systemctl restart glassmkr-crucible) if you're still seeing noise.
High CPU usage by Crucible
Symptom: The Crucible process uses more than 1-2% CPU consistently.
Steps:
- Check which collectors are running:
sudo journalctl -u glassmkr-crucible -f
- SMART queries on many disks can be expensive. If you have more than 20 disks, increase the interval or limit which disks are scanned:
collectors:
  smart:
    devices:
      - /dev/sda
      - /dev/sdb
- Per-core CPU metrics on machines with 64+ cores generate a lot of data. Disable per-core reporting if you do not need it:
collectors:
  cpu:
    per_core: false
- If the collection interval is set very low (e.g., 10 seconds), increase it to reduce overhead:
collectors:
  interval: 300
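To confirm the symptom before changing the configuration, you can compare the agent's CPU figure against a threshold. This is a minimal sketch; cpu_above is a hypothetical helper. Note that the %cpu value from procps ps is CPU time divided by the process's running time, so it reflects the long-run average rather than a momentary spike.

```shell
# Succeeds when the CPU percentage in $1 is strictly above the threshold in $2.
# awk does the comparison because POSIX shell arithmetic is integer-only.
cpu_above() {
  awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 > t + 0) }'
}

# Example wiring (commented out; needs a running agent):
# pid=$(pgrep -o glassmkr-crucible)
# cpu_above "$(ps -o %cpu= -p "$pid")" 2 && echo "Crucible is above 2% CPU"
```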
Registration fails with "server limit reached"
Symptom: the Dashboard dashboard ("+ Add Server") returns an error about the server limit.
Steps:
- The Free plan allows 3 servers. Pro is unmetered above 3 ($3/node/month, first 3 free).
- If you have decommissioned servers still registered, delete them from the dashboard to free up slots.
- To upgrade your plan, go to Settings > Billing.
My servers are disabled (lock icon, "no payment method on file")
Symptom: some server tiles on the dashboard show a lock-icon overlay and "Manage in Settings". Notifications stopped firing for those servers.
Why: on the Pro plan, servers beyond the 3-server free quota are disabled at the end of the billing period (or 30 days after account creation, whichever is later) when no payment method is on file. The first 3 servers always stay active. Disabled servers continue to ingest snapshots so historical data is preserved; they just stop firing notifications.
Steps:
- Add a payment method: Settings > Billing > Add card (opens the Stripe portal).
- Restore in bulk: Settings > Disabled servers > Restore all. Restoration is instant once a card is on file.
- If you'd rather drop into the free quota than pay, delete individual servers from the same screen.
Dashboard sends a sequence of warning emails before disable: when the payment method is removed, 3 days before disable, 1 day before disable, and at the moment of disable. If you're not seeing these, check your spam folder and confirm the account email is correct.
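The disable rule described under "Why" is a "whichever is later" comparison between two dates. The sketch below only restates that rule as arithmetic on epoch seconds; disable_at is a hypothetical helper, and how Dashboard computes the date internally is not documented here.

```shell
# Print the later of two epoch timestamps: the end of the billing period ($1)
# and account creation ($2) plus 30 days (30 * 86400 seconds).
disable_at() {
  period_end=$1
  grace_end=$(( $2 + 30 * 86400 ))
  if [ "$period_end" -ge "$grace_end" ]; then
    echo "$period_end"
  else
    echo "$grace_end"
  fi
}
```

For an account created at epoch 0 whose billing period ends 1000 seconds later, the 30-day mark wins; once the period end moves past day 30, the period end wins.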
ECC correctable-error count keeps increasing on a long-running host
Symptom: the ecc_errors warning fires repeatedly on a server with months or years of uptime, even though no new errors are happening.
Why: the underlying counter (whether IPMI named-sensor or SEL-derived) is cumulative since the BMC's last SEL clear, not a rate. The current threshold fires at any non-zero count, so historical accumulation triggers warnings on long-running hosts. A rate-based redesign is in flight; until it ships, raise the per-server threshold to ignore historical accumulation.
Steps:
- Open the server detail page in Dashboard, scroll to the Alerts section.
- Open the per-server settings for ecc_errors and set ecc_correctable_warning to a value above the current cumulative count (e.g. 100). The warning will only fire on net-new errors past that point.
- Critical alerts on uncorrectable ECC are unaffected: they still fire at the first uncorrectable event regardless of this threshold.
Configuration changes are not taking effect
Symptom: You edited collector.yaml but Crucible still uses the old settings.
Steps:
- Restart the service after any configuration change:
sudo systemctl restart glassmkr-crucible
- Verify the running config by inspecting the agent's startup banner:
sudo journalctl -u glassmkr-crucible --since "1 min ago" --no-pager
The first lines after restart print the resolved interval, enabled collectors, and Dashboard URL.
- Check that you edited the correct file. The systemd unit may pin a non-default config path:
systemctl show glassmkr-crucible -p Environment
- Environment variables override the config file. Check if any GLASSMKR_* variables are set in the systemd unit or the shell environment.
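To list any overrides set in the unit, you can filter the Environment property for the GLASSMKR_ prefix. This is a minimal sketch; glassmkr_overrides is a hypothetical helper, and splitting on spaces is a simplification that would break values containing spaces.

```shell
# Read "systemctl show -p Environment" output on stdin and print only the
# GLASSMKR_* assignments, one per line.
# Splitting on spaces is a simplification: values that contain spaces are split too.
glassmkr_overrides() {
  sed 's/^Environment=//' | tr ' ' '\n' | grep '^GLASSMKR_'
}
```

For the unit: systemctl show glassmkr-crucible -p Environment | glassmkr_overrides. For the current shell: env | grep '^GLASSMKR_'.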
Per-core CPU data is not showing
Symptom: The per-core CPU chart does not appear in the expanded CPU view, or per-core data is missing from AI analysis.
Steps:
- Per-core monitoring requires Crucible 0.3.0 or later. Check your version:
glassmkr-crucible --version
- Ensure per-core monitoring is enabled in the configuration:
collectors:
  cpu:
    per_core: true
- Restart Crucible after changing the configuration:
sudo systemctl restart glassmkr-crucible
- Wait for the next collection interval (default: 5 minutes) for data to appear.
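The version requirement in the first step can be checked in a script. This is a minimal sketch; version_ge is a hypothetical helper that relies on the version sort in GNU sort, and it assumes you pass it a bare version string (the exact output format of glassmkr-crucible --version is not shown here).

```shell
# Succeeds when version $1 is greater than or equal to version $2.
# Relies on "sort -V" (version sort) from GNU coreutils: if the smaller of the
# two versions is $2, then $1 is at least $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

For example: version_ge "$installed" 0.3.0 || echo "upgrade Crucible for per-core monitoring", where installed holds the version number you read from the agent.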
Muted rules are still firing
Symptom: You muted a rule but it continues to fire alerts or send notifications.
Steps:
- Muting takes effect on the next ingest cycle. Wait for at least one full collection interval (default: 5 minutes) after muting.
- If you muted via the configuration file, restart Crucible for the change to take effect:
sudo systemctl restart glassmkr-crucible
- If you muted via the dashboard, no restart is needed, but the change applies on the next push from that server.
- Verify the rule is muted in the dashboard under the server's Alerts tab. Muted rules show a mute icon.
Getting help
If your issue is not covered here:
- Capture an hour of agent logs:
sudo journalctl -u glassmkr-crucible --since "1 hour ago" --no-pager > crucible.log
Attach it when contacting support.
- Email [email protected] with your server ID and a description of the issue.