Troubleshooting

This page covers common issues with the Crucible agent and Dashboard, along with step-by-step solutions.

Topic pages

  • IPMI: how Crucible detects IPMI, why "Not detected" can be correct behaviour, using glassmkr-crucible doctor ipmi, per-vendor notes.

Common issues

Crucible service fails to start

Symptom: systemctl status glassmkr-crucible shows failed or inactive (dead).

Steps:

  1. Check the service logs:
    journalctl -u glassmkr-crucible --no-pager -n 50
  2. If you see a YAML parse error, re-run the init wizard with the same key to rewrite the config from scratch:
    sudo glassmkr-crucible init --api-key <your_collector_key>
    The wizard validates the key against Dashboard before writing the config, so a typo in the key surfaces immediately. Common YAML mistakes include using tabs instead of spaces, missing quotes around strings with special characters, and incorrect indentation.
  3. If you see permission denied, ensure the configuration file is readable:
    ls -la /etc/glassmkr/collector.yaml
    The file should be owned by root with mode 0600 (shown as -rw------- in the ls output).
  4. If you see bind: address already in use, another instance may be running:
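    pgrep -af glassmkr-crucible
    The -f flag matches against the full command line; without it, pgrep compares only the process name, which the kernel truncates to 15 characters, so the longer name glassmkr-crucible never matches. Stop the stale process with sudo kill <PID>, using the PID that pgrep prints, then start the service again.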
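
The log checks in steps 2 to 4 can be sketched as a small shell helper. This is a minimal sketch, not part of Crucible: classify_crucible_log is a hypothetical name, and the match strings assume the log wording quoted in the steps above.

```shell
# Hypothetical helper: reads log text on stdin and prints which step applies.
# The match strings are assumptions based on the error text quoted above.
classify_crucible_log() (
  # Lowercase the input so matching is case-insensitive.
  log=$(tr '[:upper:]' '[:lower:]')
  matched=0
  case "$log" in
    *yaml*) echo "yaml: re-run the init wizard (step 2)"; matched=1 ;;
  esac
  case "$log" in
    *"permission denied"*) echo "permissions: check the config file mode (step 3)"; matched=1 ;;
  esac
  case "$log" in
    *"address already in use"*) echo "port: look for a stale process (step 4)"; matched=1 ;;
  esac
  [ "$matched" -eq 1 ] || echo "unknown: no known pattern matched"
)

# Usage:
#   journalctl -u glassmkr-crucible --no-pager -n 50 | classify_crucible_log
```

If the helper prints "unknown", read the full log output manually; the three patterns cover only the cases listed in this section.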

Server shows "offline" in the dashboard

Symptom: The server card in Dashboard shows a gray status indicator and "last seen" is more than 5 minutes ago.

Steps:

  1. Check that Crucible is running:
    systemctl status glassmkr-crucible
  2. Check network connectivity to the API:
    curl -s -o /dev/null -w "%{http_code}" https://app.glassmkr.com/api/v1/health
    You should get 200. If not, check DNS resolution, firewall rules, and proxy settings.
  3. Check if the token is valid:
    sudo journalctl -u glassmkr-crucible --since "5 min ago" --no-pager
    If you see auth error: 401, generate a new token in Dashboard, update /etc/glassmkr/collector.yaml, and restart the service.
  4. Check for network-level blocks. Some firewalls or security groups block outbound HTTPS. Verify that port 443 to app.glassmkr.com is open:
    nc -zv app.glassmkr.com 443
  5. If you are behind a proxy, configure it in collector.yaml:
    proxy:
      https: http://proxy.internal:3128
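
The health check in step 2 can be wrapped so that the status code maps to a next action. This is a minimal sketch: interpret_health_code is a hypothetical name, and the only behaviour assumed beyond the steps above is that curl prints 000 when it receives no HTTP response at all.

```shell
# Hypothetical helper: maps the HTTP status printed by curl to a next step.
interpret_health_code() {
  case "$1" in
    200) echo "ok: API reachable" ;;
    000) echo "no response: check DNS resolution, firewall rules, and proxy settings" ;;
    *)   echo "unexpected status $1: check proxy settings and the agent log" ;;
  esac
}

# Usage (requires network access):
#   code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 \
#     https://app.glassmkr.com/api/v1/health)
#   interpret_health_code "$code"
```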

Metrics are delayed or missing

Symptom: The dashboard shows gaps in charts or data arrives minutes late.

Steps:

  1. Check the agent's push timing:
    sudo journalctl -u glassmkr-crucible --since "5 min ago" --no-pager
    The "Last push" value should be close to the configured interval (default: 300 seconds).
  2. If pushes are slow, check the agent log for timeout errors:
    grep -i "timeout\|retry" /var/log/glassmkr/crucible.log | tail -20
  3. If the server's clock is significantly off, metrics may be dropped. Verify NTP is working:
    timedatectl status
    The system clock should be synchronized. If not, enable NTP:
    sudo timedatectl set-ntp true
  4. If specific collectors are slow (e.g., SMART queries on many disks), they can delay the entire push. Check collector timing:
    sudo journalctl -u glassmkr-crucible -f
    Consider increasing the collection interval or disabling slow collectors.
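
The clock check in step 3 can also be scripted. This is a minimal sketch, assuming a systemd version that supports timedatectl show; clock_status is a hypothetical helper name.

```shell
# Hypothetical helper: interprets the NTPSynchronized property of timedatectl.
clock_status() {
  case "$1" in
    yes) echo "clock synchronized" ;;
    no)  echo "clock not synchronized: run 'sudo timedatectl set-ntp true'" ;;
    *)   echo "unknown state: inspect 'timedatectl status' manually" ;;
  esac
}

# Usage:
#   clock_status "$(timedatectl show -p NTPSynchronized --value)"
```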

SMART data is not appearing

Symptom: The Disk tab in the dashboard shows no SMART information.

Steps:

  1. Ensure smartmontools is installed:
    # Debian/Ubuntu
    sudo apt install smartmontools
    
    # RHEL/Rocky/Alma
    sudo dnf install smartmontools
  2. Verify that smartctl can read your drives:
    sudo smartctl -a /dev/sda
    If this fails with a permission error, Crucible needs to run as root (which is the default for the systemd service).
  3. For hardware RAID controllers, smartctl cannot see the drives behind the controller unless you pass the controller type with the -d flag. For example, to query the first drive behind a MegaRAID controller:
    sudo smartctl -a /dev/sda -d megaraid,0
  4. Verify the SMART collector is enabled in collector.yaml:
    collectors:
      smart:
        enabled: true

Telegram notifications are not arriving

Symptom: Alerts fire in the dashboard but no Telegram messages are received.

Steps:

  1. Test the channel from the dashboard or API:
    curl -X POST https://app.glassmkr.com/api/v1/channels/CHANNEL_ID/test \
      -H "Authorization: Bearer YOUR_TOKEN"
  2. If the test fails with 401 Unauthorized, the bot token is invalid. Create a new bot with BotFather or regenerate the token.
  3. If the test fails with 400 Bad Request: chat not found, the chat ID is wrong. Common mistakes:
    • Missing the -100 prefix for supergroups.
    • The bot was removed from the group after setup.
    • The bot has not received any messages in the chat yet (send a message to the bot first).
  4. If the test succeeds but real alerts do not arrive, check the channel routing. Go to Settings > Alert Defaults and verify that your Telegram channel is listed.
  5. Check the alert cooldown. By default, Dashboard sends only one notification per alert per hour. If you acknowledged the alert or a notification for it was sent recently, further notifications are suppressed.
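
The missing -100 prefix from step 3 is easy to check before saving the channel. This is a minimal sketch for supergroup IDs only: supergroup_chat_id is a hypothetical helper, and it rejects anything that is not a bare numeric ID or an already prefixed one.

```shell
# Hypothetical helper: prints the full chat ID for a Telegram supergroup.
# Accepts either the bare numeric ID or an ID that already has the -100 prefix.
supergroup_chat_id() (
  id=$1
  bare=${id#-100}   # strip the prefix if it is already present
  case "$bare" in
    ''|*[!0-9]*)
      printf 'invalid supergroup ID: %s\n' "$id" >&2
      exit 1
      ;;
  esac
  printf '%s\n' "-100$bare"
)

# Usage:
#   supergroup_chat_id 1234567890      # prints -1001234567890
#   supergroup_chat_id -1001234567890  # prints -1001234567890
```

Private chats and basic groups use different ID formats, so this check does not apply to them.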

Email notifications go to spam

Symptom: Test emails arrive in the spam folder.

Steps:

  1. Check the spam folder and mark messages as "not spam" to train your mail provider.
  2. Add [email protected] to your contacts or safe senders list.
  3. If you administer the recipient mail system, allowlist Glassmkr's mail servers in your spam filter. Contact support for the current IP ranges.
  4. For better deliverability, use a custom SMTP server with your own domain. See the Channels page for setup instructions.

Temperature or IPMI data is missing

Symptom: The Hardware tab shows no temperature, fan, or PSU data.

Steps:

  1. Install lm-sensors for hwmon data:
    # Debian/Ubuntu
    sudo apt install lm-sensors
    sudo sensors-detect --auto
  2. For IPMI data, install ipmitool:
    sudo apt install ipmitool
    Verify it works:
    sudo ipmitool sdr list
  3. If IPMI is not available (common on consumer hardware and many cloud VMs), Crucible reads thermal data from hwmon directly. Virtual machines typically have no thermal sensors at all.
  4. Check that the thermal collector is not disabled:
    collectors:
      thermal:
        enabled: true
        source: auto
  5. If you're on a Gigabyte AMD board (B650, B660, X670, EPYC) and CPU temperature alerts have been noisy, note that the BMC's CPU<N>_DTS sensor reads about 30 °C hotter than the actual die. Crucible 0.9.1+ filters that sensor when a sibling CPU<N>_TEMP exists, and Dashboard's cpu_temperature_high rule reads hwmon first, so the inflated DTS reading is no longer the source of truth. If you're still seeing noise, upgrade the agent:
    sudo npm install -g @glassmkr/crucible@latest && sudo systemctl restart glassmkr-crucible

High CPU usage by Crucible

Symptom: The Crucible process uses more than 1-2% CPU consistently.

Steps:

  1. Check which collectors are running:
    sudo journalctl -u glassmkr-crucible -f
  2. SMART queries on many disks can be expensive. If you have more than 20 disks, increase the interval or limit which disks are scanned:
    collectors:
      smart:
        devices:
          - /dev/sda
          - /dev/sdb
  3. Per-core CPU metrics on machines with 64+ cores generate a lot of data. Disable per-core reporting if you do not need it:
    collectors:
      cpu:
        per_core: false
  4. If the collection interval is set very low (e.g., 10 seconds), increase it to reduce overhead:
    collectors:
      interval: 300
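
To see whether the 20-disk guideline in step 2 applies, you can count the whole disks on the host. This is a minimal sketch that uses lsblk as the source; count_disks is a hypothetical helper, and Crucible may enumerate devices differently.

```shell
# Hypothetical helper: counts whole disks in `lsblk -dn -o NAME,TYPE` output.
count_disks() {
  # Column 2 is the device type; partitions are excluded by lsblk -d.
  awk '$2 == "disk" { n++ } END { print n + 0 }'
}

# Usage:
#   disks=$(lsblk -dn -o NAME,TYPE | count_disks)
#   if [ "$disks" -gt 20 ]; then
#     echo "$disks disks: consider limiting collectors.smart.devices"
#   fi
```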

Registration fails with "server limit reached"

Symptom: Clicking "+ Add Server" in Dashboard returns an error about the server limit.

Steps:

  1. The Free plan allows 3 servers. The Pro plan has no server limit: the first 3 servers are free, and each additional server costs $3/node/month.
  2. If you have decommissioned servers still registered, delete them from the dashboard to free up slots.
  3. To upgrade your plan, go to Settings > Billing.

My servers are disabled (lock icon, "no payment method on file")

Symptom: Some server tiles on the dashboard show a lock-icon overlay and "Manage in Settings". Notifications have stopped firing for those servers.

Why: on the Pro plan, servers beyond the 3-server free quota are disabled at the end of the billing period (or 30 days after account creation, whichever is later) when no payment method is on file. The first 3 servers always stay active. Disabled servers continue to ingest snapshots so historical data is preserved; they just stop firing notifications.

Steps:

  1. Add a payment method: Settings > Billing > Add card (opens the Stripe portal).
  2. Restore in bulk: Settings > Disabled servers > Restore all. Restoration is instant once a card is on file.
  3. If you'd rather drop into the free quota than pay, delete individual servers from the same screen.

Dashboard sends a sequence of warning emails before it disables servers: when the payment method is removed, 3 days before the disable date, 1 day before, and at the moment the servers are disabled. If you're not seeing these, check your spam folder and confirm the account email is correct.

ECC correctable-error count keeps increasing on a long-running host

Symptom: The ecc_errors warning fires repeatedly on a server with months or years of uptime, even though no new errors are happening.

Why: the underlying counter (whether IPMI named-sensor or SEL-derived) is cumulative since the BMC's last SEL clear, not a rate. The current threshold fires at any non-zero count, so historical accumulation triggers warnings on long-running hosts. A rate-based redesign is in flight; until it ships, raise the per-server threshold to ignore historical accumulation.

Steps:

  1. Open the server detail page in Dashboard, scroll to the Alerts section.
  2. Open the per-server settings for ecc_errors and set ecc_correctable_warning to a value above the current cumulative count (e.g. 100). The warning will only fire on net-new errors past that point.
  3. Critical alerts on uncorrectable ECC are unaffected: they still fire at the first uncorrectable event regardless of this threshold.

Configuration changes are not taking effect

Symptom: You edited collector.yaml but Crucible still uses the old settings.

Steps:

  1. Restart the service after any configuration change:
    sudo systemctl restart glassmkr-crucible
  2. Verify the running config by inspecting the agent's startup banner:
    sudo journalctl -u glassmkr-crucible --since "1 min ago" --no-pager
    The first lines after restart print the resolved interval, enabled collectors, and Dashboard URL.
  3. Check that you edited the correct file. The systemd unit may pin a non-default config path:
    systemctl show glassmkr-crucible -p Environment
  4. Environment variables override the config file. Check if any GLASSMKR_* variables are set in the systemd unit or the shell environment.
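
The environment check in step 4 can be sketched as a filter that prints only the variable names, so secret values do not end up in your terminal history or a support ticket. list_glassmkr_overrides is a hypothetical helper, and splitting the systemd output on spaces is a rough approximation that breaks if a value itself contains spaces.

```shell
# Hypothetical helper: reads KEY=VALUE lines on stdin and prints the names
# of any GLASSMKR_* variables, omitting their values.
list_glassmkr_overrides() {
  sed -n 's/^\(GLASSMKR_[A-Za-z0-9_]*\)=.*/\1/p'
}

# Usage:
#   env | list_glassmkr_overrides
#   systemctl show glassmkr-crucible -p Environment --value \
#     | tr ' ' '\n' | list_glassmkr_overrides
```

If either command prints a name, that variable takes precedence over the matching setting in collector.yaml.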

Per-core CPU data is not showing

Symptom: The per-core CPU chart does not appear in the expanded CPU view, or per-core data is missing from AI analysis.

Steps:

  1. Per-core monitoring requires Crucible 0.3.0 or later. Check your version:
    glassmkr-crucible --version
  2. Ensure per-core monitoring is enabled in the configuration:
    collectors:
      cpu:
        per_core: true
  3. Restart Crucible after changing the configuration:
    sudo systemctl restart glassmkr-crucible
  4. Wait for the next collection interval (default: 5 minutes) for data to appear.
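
The version requirement in step 1 can be checked in a script. This is a minimal sketch: version_ge and extract_version are hypothetical helpers, sort -V requires GNU coreutils or a compatible sort, and the exact output format of glassmkr-crucible --version is an assumption (the first x.y.z string is taken as the version).

```shell
# Hypothetical helper: succeeds when version $1 is greater than or equal to $2.
version_ge() {
  # sort -V orders version strings; if $2 sorts first (or ties), $1 >= $2.
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: prints the first x.y.z string found on stdin.
extract_version() {
  grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1
}

# Usage:
#   v=$(glassmkr-crucible --version | extract_version)
#   if version_ge "$v" 0.3.0; then
#     echo "per-core monitoring is supported"
#   else
#     echo "upgrade Crucible to 0.3.0 or later"
#   fi
```

A plain string comparison would rank 0.10.0 below 0.3.0, which is why the sketch relies on version-aware sorting.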

Muted rules are still firing

Symptom: You muted a rule but it continues to fire alerts or send notifications.

Steps:

  1. Muting takes effect on the next ingest cycle. Wait for at least one full collection interval (default: 5 minutes) after muting.
  2. If you muted via the configuration file, restart Crucible for the change to take effect:
    sudo systemctl restart glassmkr-crucible
  3. If you muted via the dashboard, no restart is needed, but the change applies on the next push from that server.
  4. Verify the rule is muted in the dashboard under the server's Alerts tab. Muted rules show a mute icon.

Getting help

If your issue is not covered here:

  • Capture an hour of agent logs: sudo journalctl -u glassmkr-crucible --since "1 hour ago" --no-pager > crucible.log. Attach it when contacting support.
  • Email [email protected] with your server ID and a description of the issue.