Useful nvidia-smi Queries

Updated 11/06/2017 07:32 AM

What are useful nvidia-smi queries for troubleshooting?

VBIOS Version

Query the VBIOS version of each device:

$ nvidia-smi --query-gpu=gpu_name,gpu_bus_id,vbios_version --format=csv
name, pci.bus_id, vbios_version
GRID K2, 0000:87:00.0, 80.04.D4.00.07
GRID K2, 0000:88:00.0, 80.04.D4.00.08
Query Description
timestamp The timestamp of when the query was made, in the format "YYYY/MM/DD HH:MM:SS.msec".
gpu_name The official product name of the GPU. This is an alphanumeric string. For all products.
gpu_bus_id PCI bus id as "domain:bus:device.function", in hex.
vbios_version The VBIOS (GPU board BIOS) version.
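
If you only need the value in a script, the same query can be limited to a single GPU with "-i <index>" and the CSV header suppressed (the index 0 used here is only an example):

$ nvidia-smi -i 0 --query-gpu=vbios_version --format=csv,noheader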

Query GPU metrics for host-side logging

This query is useful for monitoring hypervisor-side GPU metrics and works on both ESXi and XenServer.

$ nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 5

When adding additional parameters to a query, ensure that no spaces are added between the query options.

Query Description
timestamp The timestamp of when the query was made, in the format "YYYY/MM/DD HH:MM:SS.msec".
name The official product name of the GPU. This is an alphanumeric string. For all products.
pci.bus_id PCI bus id as "domain:bus:device.function", in hex.
driver_version The version of the installed NVIDIA display driver. This is an alphanumeric string.
pstate The current performance state for the GPU. States range from P0 (maximum performance) to P12 (minimum performance).
pcie.link.gen.max The maximum PCI-E link generation possible with this GPU and system configuration. For example, if the GPU supports a higher PCIe generation than the system supports, this reports the system PCIe generation.
pcie.link.gen.current The current PCI-E link generation. This may be reduced when the GPU is not in use.
temperature.gpu Core GPU temperature, in degrees C.
utilization.gpu Percent of time over the past sample period during which one or more kernels were executing on the GPU. The sample period may be between 1 second and 1/6 second depending on the product.
utilization.memory Percent of time over the past sample period during which global (device) memory was being read or written. The sample period may be between 1 second and 1/6 second depending on the product.
memory.total Total installed GPU memory.
memory.free Total free memory.
memory.used Total memory allocated by active contexts.

You can get a complete list of the query arguments by issuing: nvidia-smi --help-query-gpu
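
When the samples are meant to be parsed by another tool rather than read directly, the header row and units can also be suppressed. A minimal sketch, assuming the same field list as above:

$ nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv,noheader,nounits -l 5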

nvidia-smi Usage for logging

Short-term logging

Add the option "-f <filename>" to redirect the output to a file.

Prepend "timeout -t " to run the query for and stop logging.

Ensure that your query granularity is appropriately sized for the required use:

Fine-grain GPU behavior: "-l 5" (5-second interval) with "timeout -t 600" (10-minute duration)
General GPU behavior: "-l 60" (1-minute interval) with "timeout -t 3600" (1-hour duration)
Broad GPU behavior: "-l 3600" (1-hour interval) with "timeout -t 86400" (24-hour duration)
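
Putting these pieces together, a fine-grain 10-minute capture might look like the sketch below. The shortened field list is only an example, and note that the "-t" flag belongs to the BusyBox timeout applet; GNU coreutils timeout takes the duration directly (e.g. "timeout 600 ...").

$ timeout -t 600 nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used --format=csv -l 5 -f gpu-log.csv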

Long-term logging

Create a shell script that automates creation of the log file, adding a timestamp to the filename and setting the query parameters (a sketch is shown below).

Add a custom cron job to /var/spool/cron/crontabs to call the script at the intervals required.
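
A minimal sketch of such a script and crontab entry follows; the log directory, field list, sampling interval, and installation path are assumptions and should be adapted as required.

#!/bin/bash
# gpu-log.sh - write a timestamped nvidia-smi log (paths and field list are examples)
LOGDIR=/var/log/gpu
mkdir -p "$LOGDIR"
LOGFILE="$LOGDIR/gpu-$(date +%Y%m%d-%H%M%S).csv"
# Sample once a minute for 24 hours, then exit (GNU coreutils timeout syntax)
timeout 86400 nvidia-smi \
    --query-gpu=timestamp,name,pci.bus_id,temperature.gpu,utilization.gpu,utilization.memory,memory.used \
    --format=csv -l 60 -f "$LOGFILE"

A matching crontab entry (assuming the script is installed as /usr/local/bin/gpu-log.sh) would then start a fresh log every day at midnight:

0 0 * * * /usr/local/bin/gpu-log.sh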

Additional low-level commands for clocks and power

Enable Persistence Mode

Any settings below for clocks and power get reset between program runs unless you enable persistence mode (PM) for the driver.

Also note that nvidia-smi commands run much faster if persistence mode is enabled.

nvidia-smi -pm 1 — Make clock, power and other settings persist across program runs / driver invocations

Clocks

Command Detail
nvidia-smi -q -d SUPPORTED_CLOCKS View the clock pairs supported by the GPU
nvidia-smi -ac <MEM clock,Graphics clock> Set application clocks to one of the supported <memory,graphics> clock pairs
nvidia-smi -q -d CLOCK View current clocks
nvidia-smi --auto-boost-default=ENABLED -i 0 Enable boosting GPU clocks (K80 and later)
nvidia-smi --rac Reset application clocks back to base
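
As an example of that workflow, you can list the supported pairs, pin the GPU to one of them, and later reset it. The 2505,875 pair shown here is only an illustration taken from a Tesla K80; replace it with a pair reported by SUPPORTED_CLOCKS on your GPU:

$ nvidia-smi -q -d SUPPORTED_CLOCKS
$ nvidia-smi -ac 2505,875
$ nvidia-smi --rac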

Power

Command Detail
nvidia-smi -pl N Set power cap (maximum wattage the GPU will use)
nvidia-smi -pm 1 Enable persistence mode
nvidia-smi stats -i <device#> -d pwrDraw Continuously monitor detailed stats such as power draw
nvidia-smi --query-gpu=index,timestamp,power.draw,clocks.sm,clocks.mem,clocks.gr --format=csv -l 1 Continuously provide time-stamped power and clock data
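
For example, to cap GPU 0 at 150 W with the setting kept persistent (150 is an arbitrary value; the valid range for your board is reported by "nvidia-smi -q -d POWER"):

$ nvidia-smi -pm 1
$ nvidia-smi -i 0 -pl 150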