Monitoring GPU performance is crucial for tasks ranging from machine learning to gaming. Having real-time insights into resource usage helps you proactively identify and resolve issues. Here’s a guide to some essential utilities for tracking GPU activity on Linux.
NVIDIA-SMI
NVIDIA-SMI (System Management Interface) is the official command-line utility included with NVIDIA drivers, offering detailed real-time monitoring of GPU performance. It provides hardware-level information on usage, temperature, power consumption, and active processes. This tool is ideal for advanced users and administrators seeking to optimize performance and integrate GPU monitoring into system management workflows.
Step 1: Ensure NVIDIA drivers are installed. NVIDIA-SMI is automatically installed with the drivers.
Step 2: Open a terminal.
Step 3: To view GPU information, run:
nvidia-smi
Step 4: For continuous monitoring, use the watch command:
watch -n 1 nvidia-smi
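For scripting or logging, the full table is often more than you need. The sketch below asks nvidia-smi for a handful of fields in CSV form and prints one summary line per GPU. The helper name format_gpu_csv and the choice of fields are illustrative assumptions, not part of nvidia-smi itself; run nvidia-smi --help-query-gpu to see which fields your driver version supports.

```shell
# Fields to request; an illustrative selection, not an exhaustive one.
FIELDS="index,name,utilization.gpu,memory.used,memory.total,temperature.gpu"

# format_gpu_csv (hypothetical helper): reads nvidia-smi CSV rows on stdin
# (fields separated by ", ") and prints one readable summary line per GPU.
format_gpu_csv() {
  awk -F', ' '{ printf "GPU %s (%s): %s%% util, %s/%s MiB, %s C\n", $1, $2, $3, $4, $5, $6 }'
}

if command -v nvidia-smi >/dev/null 2>&1; then
  # noheader/nounits keep the output to bare values that are easy to parse.
  nvidia-smi --query-gpu="$FIELDS" --format=csv,noheader,nounits | format_gpu_csv
else
  echo "nvidia-smi not found; is the NVIDIA driver installed?" >&2
fi
```

Adding -l 1 to the nvidia-smi call repeats the query every second, which can stand in for watch when the output is being piped or logged.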
More information: NVIDIA-SMI Documentation.
NVTOP
NVTOP (Neat Video card TOP) provides a dynamic, htop-like interface for monitoring GPUs. It supports multiple GPUs, displaying real-time load and temperature data in an easily readable format. This tool is particularly useful for users managing systems with multiple GPUs, offering clear and simultaneous performance monitoring across complex environments.
Step 1: For Ubuntu/Debian systems, use the following command:
sudo apt install nvtop
Step 2: If the above command fails, add the repository first:
sudo add-apt-repository ppa:flexiondotorg/nvtop
Step 3: Update the package list:
sudo apt update
Step 4: Then install nvtop:
sudo apt install nvtop
More information: NVTOP GitHub page.
NVITOP
NVITOP is an interactive monitoring tool designed specifically for NVIDIA GPUs. It combines an interactive view of GPU state with detailed process management and an extensible API, which makes it well-suited for developers and system administrators who want to incorporate GPU monitoring data into custom solutions or dashboards.
Step 1: Install using pip:
pip install nvitop
More information: NVITOP GitHub page.
GPUStat
GPUStat caters to users seeking a lightweight and straightforward method for monitoring NVIDIA GPUs. Its ncurses-based interface delivers a quick snapshot of GPU usage, making it ideal for rapid checks and troubleshooting without consuming significant system resources. Note that GPUStat exclusively supports NVIDIA devices.
Step 1: Install GPUStat from PyPI (a system-wide install may require root privileges):
pip install gpustat
Step 2: Alternatively, install it for your user account only if you lack root privileges:
pip install --user gpustat
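Once installed, a plain gpustat call prints a one-shot summary. The sketch below wraps a few of its documented options in a small helper; the name gpustat_snapshot and the GPUSTAT override variable are assumptions made for illustration, and the option set is just one reasonable choice.

```shell
# Assumption: gpustat is on PATH; set GPUSTAT to point at a different location.
GPUSTAT="${GPUSTAT:-gpustat}"

# gpustat_snapshot (hypothetical helper): one-shot report that also lists each
# GPU process's command name, PID, and owning user. Extra arguments are passed through.
gpustat_snapshot() {
  "$GPUSTAT" --show-cmd --show-pid --show-user "$@"
}

if command -v "$GPUSTAT" >/dev/null 2>&1; then
  gpustat_snapshot || echo "gpustat could not query the GPU" >&2
  # gpustat_snapshot --json          # same data as JSON, for scripts
  # gpustat_snapshot --interval 2    # refresh every 2 seconds until interrupted
else
  echo "gpustat not found; install it with pip first" >&2
fi
```

The JSON form is the more convenient one to feed into other tooling, while the interval form covers the same ground as wrapping the command in watch.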
More information: GPUstat GitHub page.
ROCm
For users with AMD GPUs, ROCm is AMD's open software stack for GPU computing, and it includes tools, such as the rocm-smi command-line utility, for monitoring and managing GPU performance on AMD hardware. It features comprehensive documentation and active community support, making it a valuable resource for developers and administrators focused on optimizing performance and troubleshooting on AMD platforms.
Step 1: Consult the detailed installation instructions for your specific distribution on the ROCm documentation site.
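After installation, the rocm-smi utility that ships with ROCm plays a role similar to nvidia-smi. The following is a minimal sketch, assuming rocm-smi is on your PATH; the helper name rocm_snapshot and the ROCM_SMI override variable are hypothetical, and the selection of flags is only one possible combination.

```shell
# Assumption: rocm-smi is on PATH; set ROCM_SMI if it lives elsewhere (e.g. under /opt/rocm/bin).
ROCM_SMI="${ROCM_SMI:-rocm-smi}"

# rocm_snapshot (hypothetical helper): GPU utilization, temperature, and VRAM
# usage in a single call. Extra arguments are passed through to rocm-smi.
rocm_snapshot() {
  "$ROCM_SMI" --showuse --showtemp --showmeminfo vram "$@"
}

if command -v "$ROCM_SMI" >/dev/null 2>&1; then
  rocm_snapshot || echo "rocm-smi could not query the GPU" >&2
else
  echo "rocm-smi not found; is ROCm installed?" >&2
fi
```

For a continuously refreshing view, the same command can be wrapped in watch, for example watch -n 1 rocm-smi.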
More information: ROCm Documentation.
AI-Z
AI-Z provides a unified view of hardware resource utilization across both NVIDIA and AMD GPUs. Its simple interface and cross-platform compatibility make it an appealing choice for users working with mixed GPU environments, allowing them to monitor their entire system without relying on multiple specialized tools.
More information: AI-Z Website.
Worthy Mentions
Besides the aforementioned tools, several other options merit consideration, each offering unique features based on specific use cases.
nvidia_gpu_exporter
nvidia_gpu_exporter gathers NVIDIA GPU metrics and presents them in Prometheus format. It fetches real-time metrics from NVIDIA GPUs and serves them through an HTTP endpoint, so a Prometheus and Grafana setup can track GPU performance alongside other system metrics.
Step 1: Clone the repository:
git clone https://github.com/utkuozdemir/nvidia_gpu_exporter.git
Step 2: Navigate to the cloned directory:
cd nvidia_gpu_exporter
Step 3: Build the exporter using Go:
go build -o nvidia_gpu_exporter
Step 4: Run the exporter:
./nvidia_gpu_exporter
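With the exporter running in another terminal, you can confirm that it is serving metrics before pointing Prometheus at it. The sketch below makes two assumptions that you should check against your own setup: that the exporter listens on localhost port 9835, and that its metric names start with nvidia_smi_. The helper name gpu_metric_lines is hypothetical.

```shell
# Assumption: default listen address; adjust if you started the exporter differently.
EXPORTER_URL="${EXPORTER_URL:-http://localhost:9835/metrics}"

# gpu_metric_lines (hypothetical helper): reads Prometheus text format on stdin
# and keeps only sample lines whose metric name starts with the given prefix.
# "# HELP" and "# TYPE" comment lines never match, so they are dropped too.
gpu_metric_lines() {
  prefix="${1:-nvidia_smi_}"   # assumed metric name prefix
  grep "^${prefix}"
}

if command -v curl >/dev/null 2>&1; then
  curl -fsS --max-time 5 "$EXPORTER_URL" | gpu_metric_lines \
    || echo "no GPU metrics returned from $EXPORTER_URL" >&2
else
  echo "curl not found" >&2
fi
```

If lines are printed, the same URL can be used as a scrape target in your Prometheus configuration.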
More information: nvidia_gpu_exporter GitHub page.
jupyterlab-nvdashboard
NVDashboard integrates GPU usage metrics directly into your JupyterLab environment, letting developers and data scientists monitor hardware performance without leaving their interactive workflow. This is particularly useful when training machine learning models or conducting data analysis, since monitoring sits alongside the development environment.
Step 1: Install the JupyterLab extension using pip:
pip install jupyterlab-nvdashboard
More information: jupyterlab-nvdashboard GitHub page.
Glances
Glances distinguishes itself by providing a holistic overview of your system, consolidating CPU, memory, disk, and GPU statistics into a single interface. As a cross-platform system monitoring tool, it supports a wide array of plugins, including GPU stats, rendering it ideal for users in need of an all-encompassing monitoring solution adaptable to diverse hardware configurations and usage scenarios.
Step 1: Install Glances via pip:
pip install glances
Step 2: Alternatively, install it through your distribution’s package manager (Ubuntu/Debian):
sudo apt install glances
More information: Glances Website.
btm (bottom)
btm (bottom) is a modern system monitor written in Rust, featuring a visually appealing and highly customizable terminal interface. While configuring it to display GPU temperatures alongside CPU, memory, and disk usage might require some initial setup, its speed and aesthetics appeal to power users and system administrators.
Step 1: On distributions where btm is available as a package, use:
sudo apt install btm
Step 2: Or, utilize Rust’s package manager, Cargo:
cargo install bottom
More information: btm GitHub page.
Selecting the appropriate GPU monitoring tool depends on your specific requirements. Whether you prioritize lightweight simplicity, interactive process management, system-wide overviews, or in-depth hardware insights, a tool exists to meet your needs.