Comprehensive Guide to Linux GPU Monitoring Tools

Monitoring GPU performance is crucial for tasks ranging from machine learning to gaming. Having real-time insights into resource usage helps you proactively identify and resolve issues. Here’s a guide to some essential utilities for tracking GPU activity on Linux.

NVIDIA-SMI

NVIDIA-SMI (System Management Interface) is the official command-line utility included with NVIDIA drivers, offering detailed real-time monitoring of GPU performance. It provides hardware-level information on usage, temperature, power consumption, and active processes. This tool is ideal for advanced users and administrators seeking to optimize performance and integrate GPU monitoring into system management workflows.

Step 1: Ensure NVIDIA drivers are installed. NVIDIA-SMI is automatically installed with the drivers.

Step 2: Open a terminal.

Step 3: To view GPU information, run:

nvidia-smi

Step 4: For continuous monitoring, use the watch command:

watch -n 1 nvidia-smi
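
For scripted checks, nvidia-smi can also emit machine-readable CSV through its --query-gpu and --format options. The following is a minimal Python sketch rather than an official client: the helper names parse_gpu_csv and query_gpus, and the particular list of fields, are illustrative assumptions.

```python
import subprocess

# Properties passed to --query-gpu; this selection is an illustrative choice.
FIELDS = ["index", "name", "utilization.gpu", "memory.used", "memory.total", "temperature.gpu"]


def parse_gpu_csv(text):
    """Turn `--format=csv,noheader,nounits` output into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) != len(FIELDS):
            continue  # skip blank or malformed lines
        row = {"index": int(values[0]), "name": values[1]}
        for key, raw in zip(FIELDS[2:], values[2:]):
            # Unsupported readings appear as text such as "[N/A]"; store None.
            try:
                row[key] = float(raw)
            except ValueError:
                row[key] = None
        gpus.append(row)
    return gpus


def query_gpus():
    """Run nvidia-smi once and return parsed readings (needs an NVIDIA driver)."""
    cmd = ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS), "--format=csv,noheader,nounits"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)
```

On a machine with the driver installed, calling query_gpus() should return one dictionary per GPU, which can then be logged or compared against a threshold of your choosing.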

More information: NVIDIA-SMI Documentation.

NVTOP

NVTOP (Neat Videocard TOP) provides a dynamic, htop-like interface for monitoring GPUs. It supports multiple GPUs, displaying real-time load and temperature data in an easily readable format. This makes it particularly useful on multi-GPU systems, where all devices can be watched side by side in a single terminal.

Step 1: For Ubuntu/Debian systems, use the following command:

sudo apt install nvtop

Step 2: If the above command fails, add the repository first:

sudo add-apt-repository ppa:flexiondotorg/nvtop

Step 3: Update the package list:

sudo apt update

Step 4: Then install nvtop:

sudo apt install nvtop

More information: NVTOP GitHub page.

NVITOP

NVITOP is an interactive tool specifically designed for NVIDIA GPUs. It combines a live view of the GPUs with detailed process management and an extensible API, which makes it well-suited for developers and system administrators aiming to incorporate GPU monitoring data into custom solutions or dashboards.

Step 1: Install using pip:

pip install nvitop
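
As an illustration of the API mentioned above, the sketch below reads a few values from each GPU. It assumes that the nvitop package exposes a Device class with Device.all(), an index attribute, and the methods name(), gpu_utilization(), memory_used(), memory_total(), and temperature(); check these against the documentation for your installed version. The helper names summarize and snapshot are hypothetical.

```python
def summarize(device):
    """Collect a few readings from an nvitop Device-like object into a plain dict."""
    return {
        "index": device.index,
        "name": device.name(),
        "gpu_util_percent": device.gpu_utilization(),
        "memory_used_bytes": device.memory_used(),
        "memory_total_bytes": device.memory_total(),
        "temperature_c": device.temperature(),
    }


def snapshot():
    """Summarize every visible GPU (requires nvitop and an NVIDIA driver)."""
    # Imported lazily so summarize() can be used and tested without nvitop installed.
    from nvitop import Device

    return [summarize(device) for device in Device.all()]
```

Returning plain dictionaries keeps the result easy to serialize as JSON or to forward to whatever dashboard you already use.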

More information: NVITOP GitHub page.

GPUStat

GPUStat caters to users seeking a lightweight and straightforward method for monitoring NVIDIA GPUs. It prints a compact, color-coded snapshot of GPU usage to the terminal, making it ideal for rapid checks and troubleshooting without consuming significant system resources. Note that GPUStat exclusively supports NVIDIA devices.

Step 1: Install GPUStat from PyPI (a system-wide installation may require root privileges):

pip install gpustat

Step 2: Alternatively, install it within your user namespace if you lack root privileges:

pip install --user gpustat
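
GPUStat can also print its snapshot as JSON, which is convenient for quick scripted checks. The sketch below assumes the gpustat --json option and a report layout with a top-level "gpus" list whose entries carry "index" and "utilization.gpu" keys; verify both against your installed version. The function names are hypothetical.

```python
import json
import subprocess


def busiest_gpu(report):
    """Return (index, utilization) of the most heavily used GPU, or None if there are no GPUs."""
    gpus = report.get("gpus", [])
    if not gpus:
        return None
    # Treat a missing or null utilization reading as 0 when comparing.
    top = max(gpus, key=lambda g: g.get("utilization.gpu") or 0)
    return top.get("index"), top.get("utilization.gpu")


def read_gpustat():
    """Run `gpustat --json` and decode its output (needs gpustat and an NVIDIA driver)."""
    result = subprocess.run(["gpustat", "--json"], capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```

With gpustat installed, busiest_gpu(read_gpustat()) gives a one-line answer to the question of which GPU is currently under the most load.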

More information: GPUstat GitHub page.

ROCm

For users with AMD GPUs, ROCm is AMD’s open-source software stack for GPU computing, and it includes command-line tools such as rocm-smi for monitoring and managing GPU performance on AMD hardware. It features comprehensive documentation and active community support, making it a valuable resource for developers and administrators focused on optimizing performance and troubleshooting on AMD platforms.

Step 1: Consult the detailed installation instructions for your specific distribution on the ROCm documentation site.

More information: ROCm Documentation.

AI-Z

AI-Z provides a unified view of hardware resource utilization across both NVIDIA and AMD GPUs. Its simple interface and cross-platform compatibility make it an appealing choice for users working with mixed GPU environments, allowing them to monitor their entire system without relying on multiple specialized tools.

More information: AI-Z Website.

Worthy Mentions

Besides the aforementioned tools, several other options merit consideration, each offering unique features based on specific use cases.

nvidia_gpu_exporter

nvidia_gpu_exporter is a tool designed to gather NVIDIA GPU metrics and present them in Prometheus format. It fetches real-time metrics from NVIDIA GPUs and serves them through an HTTP endpoint, so an existing Prometheus and Grafana setup can track GPU performance alongside other system metrics.

Step 1: Clone the repository:

git clone https://github.com/utkuozdemir/nvidia_gpu_exporter.git

Step 2: Navigate to the cloned directory:

cd nvidia_gpu_exporter

Step 3: Build the exporter using Go:

go build -o nvidia_gpu_exporter

Step 4: Run the exporter:

./nvidia_gpu_exporter
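
Once the exporter is running, its HTTP endpoint can be checked by hand before Prometheus is pointed at it. The sketch below parses the Prometheus text format with the standard library only. The URL http://localhost:9835/metrics and the nvidia_smi_ metric prefix are assumptions about the exporter’s defaults, so adjust them to match your setup; the function names are hypothetical.

```python
import urllib.request


def parse_metrics(text, prefix="nvidia_smi_"):
    """Parse Prometheus text-format lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name, rest = line.split("{", 1)
            # Split on the last "}" so braces inside label values do not matter.
            labels, _, tail = rest.rpartition("}")
        else:
            name, _, tail = line.partition(" ")
            labels = ""
        if not name.startswith(prefix):
            continue  # ignore metrics outside the assumed prefix
        # The first token after the labels is the value; a timestamp may follow.
        value = float(tail.split()[0])
        samples.append((name, labels, value))
    return samples


def scrape(url="http://localhost:9835/metrics"):
    """Fetch the endpoint once and return the parsed GPU samples."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

This is only a spot check; in normal operation Prometheus scrapes the same endpoint on its own schedule.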

More information: nvidia_gpu_exporter GitHub page.

jupyterlab-nvdashboard

NVDashboard integrates GPU usage metrics directly into your JupyterLab environment, letting developers and data scientists monitor hardware performance without leaving their interactive workflow. This is especially useful when training machine learning models or running data analysis, because the charts sit next to the notebooks that generate the load.

Step 1: Install the JupyterLab extension using pip:

pip install jupyterlab-nvdashboard

More information: jupyterlab-nvdashboard GitHub page.

Glances

Glances distinguishes itself by providing a holistic overview of your system, consolidating CPU, memory, disk, and GPU statistics into a single interface. As a cross-platform system monitoring tool, it supports a wide array of plugins, including GPU stats, which makes it a good fit for users who want one monitoring solution across diverse hardware configurations and usage scenarios.

Step 1: Install Glances via pip:

pip install glances

Step 2: Alternatively, install it through your distribution’s package manager (Ubuntu/Debian):

sudo apt install glances

More information: Glances Website.

btm (bottom)

btm (bottom) is a modern system monitor written in Rust, featuring a visually appealing and highly customizable terminal interface. While configuring it to display GPU temperatures alongside CPU, memory, and disk usage might require some initial setup, its speed and aesthetics appeal to power users and system administrators.

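Step 1: On distributions where btm is available as a package, use the command below (many distributions name the package bottom, so substitute that name if btm is not found):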

sudo apt install btm

Step 2: Or, utilize Rust’s package manager, Cargo:

cargo install bottom

More information: btm GitHub page.


Selecting the appropriate GPU monitoring tool depends on your specific requirements; whether you prioritize lightweight simplicity, interactive process management, system-wide overviews, or in-depth hardware insights, a tool exists to meet your needs.