Monitoring

Overview

The lab uses a self-hosted monitoring stack to track CPU, GPU, memory, disk, network, and per-process resource usage across all lab servers. Metrics are visualised in Grafana, which is available at https://grafana.lab.pyarelal.xyz. Log in with your lab account via the Sign in with Kanidm button.
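If the page does not load, it can help to check whether Grafana itself is responding before contacting a sysadmin. The sketch below is not part of the lab tooling. It assumes that Grafana's standard unauthenticated /api/health endpoint is reachable at the URL above, which may not hold if a proxy in front of Grafana blocks it. The function names are hypothetical.

```python
import json
import urllib.request

# Base URL taken from this page.
GRAFANA_URL = "https://grafana.lab.pyarelal.xyz"


def parse_health(body: bytes) -> bool:
    """Return True if a Grafana /api/health body reports a healthy database."""
    payload = json.loads(body)
    # Grafana's health endpoint returns a JSON object with a "database" field.
    return isinstance(payload, dict) and payload.get("database") == "ok"


def grafana_is_healthy(base_url: str = GRAFANA_URL, timeout: float = 5.0) -> bool:
    """Fetch /api/health; treat any network or parse error as unhealthy.

    Assumption: the endpoint is exposed without authentication.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/api/health", timeout=timeout) as resp:
            return parse_health(resp.read())
    except (OSError, ValueError):
        # URLError/HTTPError are OSError subclasses; bad JSON raises ValueError.
        return False
```

Calling grafana_is_healthy() from a machine on the lab network returns True when Grafana answers and reports its database as ok, and False on any error.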

What is monitored

  • CPU usage (by type: user, system, iowait, etc.)
  • RAM usage (used, cached, buffers)
  • Network traffic (sent and received)
  • Disk I/O (read and write)
  • GPU utilisation, memory, temperature, and power draw (on GPU-equipped hosts)
  • Top processes by CPU and memory

Monitored hosts

Host        GPU monitoring
orca        Yes (NVIDIA)
kraken      Yes (NVIDIA)
leviathan   Yes (NVIDIA)
starfish    No
eel         No

Using the dashboard

After logging in, open the Infrastructure Overview dashboard. Use the Host dropdown at the top to switch between servers. The time range selector in the top right controls how far back the graphs show.
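Grafana dashboards can also be opened with the host and time range preselected through URL query parameters (var-<name> for a template variable, from and to for the time range). The sketch below builds such a link. The dashboard UID, the URL slug, and the variable name host are all assumptions, since this page does not state them. Copy the real values from the browser address bar after opening the dashboard.

```python
from urllib.parse import quote, urlencode

# Base URL taken from this page.
GRAFANA_URL = "https://grafana.lab.pyarelal.xyz"


def dashboard_url(
    host: str,
    time_from: str = "now-6h",
    time_to: str = "now",
    uid: str = "DASHBOARD_UID",               # hypothetical placeholder, not the real UID
    slug: str = "infrastructure-overview",    # assumed slug for the dashboard
    variable: str = "host",                   # assumed name of the Host dropdown variable
) -> str:
    """Build a Grafana dashboard link with a preselected host and time range."""
    query = urlencode({f"var-{variable}": host, "from": time_from, "to": time_to})
    return f"{GRAFANA_URL}/d/{quote(uid)}/{quote(slug)}?{query}"


if __name__ == "__main__":
    # Link to the last 24 hours for orca.
    print(dashboard_url("orca", time_from="now-24h"))
```

Relative ranges such as now-24h follow the same syntax as the time range selector in the dashboard.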

The dashboard is divided into three sections:

  • System — CPU, RAM, network, and disk panels visible for all hosts
  • GPU — GPU panels, populated only for GPU-equipped hosts
  • Processes — top 10 processes by CPU and memory usage
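To make the Processes section concrete, the sketch below shows one way a "top 10 by CPU" or "top 10 by memory" ranking can be computed from per-process samples. This page does not describe how the monitoring stack does it, so this is an illustration only. The ProcSample type, its field names, and the sample values are hypothetical.

```python
import heapq
from typing import Iterable, List, NamedTuple


class ProcSample(NamedTuple):
    """One per-process measurement (hypothetical shape, for illustration)."""
    pid: int
    name: str
    cpu_percent: float
    rss_bytes: int


def top_processes(samples: Iterable[ProcSample], key: str, n: int = 10) -> List[ProcSample]:
    """Return the n samples with the largest value of the given field."""
    return heapq.nlargest(n, samples, key=lambda s: getattr(s, key))


if __name__ == "__main__":
    # Made-up sample values, not real lab data.
    samples = [
        ProcSample(101, "python", 95.0, 8_000_000_000),
        ProcSample(202, "sshd", 0.1, 10_000_000),
        ProcSample(303, "jupyter", 12.5, 12_000_000_000),
    ]
    for proc in top_processes(samples, key="cpu_percent", n=2):
        print(proc.pid, proc.name, proc.cpu_percent)
```

Ranking by cpu_percent and by rss_bytes gives two separate lists, which is why the same process can appear high in one panel and low in the other.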

Access

Access to Grafana is controlled via the grafana_users Kanidm group. Contact a sysadmin if you need access.