What are useful nvidia-smi queries for troubleshooting?
Query the VBIOS version of each device:
$ nvidia-smi --query-gpu=gpu_name,gpu_bus_id,vbios_version --format=csv
name, pci.bus_id, vbios_version
GRID K2, 0000:87:00.0, 80.04.D4.00.07
GRID K2, 0000:88:00.0, 80.04.D4.00.08
Query | Description |
---|---|
timestamp | The timestamp of when the query was made, in the format "YYYY/MM/DD HH:MM:SS.msec". |
gpu_name | The official product name of the GPU. This is an alphanumeric string. For all products. |
gpu_bus_id | PCI bus id as "domain:bus:device.function", in hex. |
vbios_version | The VBIOS version of the GPU board. |
This query is good for monitoring hypervisor-side GPU metrics, and it works on both ESXi and XenServer:
$ nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 5
When adding additional parameters to a query, ensure that no spaces are added between the query options.
Query | Description |
---|---|
timestamp | The timestamp of when the query was made, in the format "YYYY/MM/DD HH:MM:SS.msec". |
name | The official product name of the GPU. This is an alphanumeric string. For all products. |
pci.bus_id | PCI bus id as "domain:bus:device.function", in hex. |
driver_version | The version of the installed NVIDIA display driver. This is an alphanumeric string. |
pstate | The current performance state for the GPU. States range from P0 (maximum performance) to P12 (minimum performance). |
pcie.link.gen.max | The maximum PCI-E link generation possible with this GPU and system configuration. For example, if the GPU supports a higher PCIe generation than the system supports then this reports the system PCIe generation. |
pcie.link.gen.current | The current PCI-E link generation. This may be reduced when the GPU is not in use. |
temperature.gpu | Core GPU temperature, in degrees C. |
utilization.gpu | Percent of time over the past sample period during which one or more kernels was executing on the GPU. The sample period may be between 1 second and 1/6 second depending on the product. |
utilization.memory | Percent of time over the past sample period during which global (device) memory was being read or written. The sample period may be between 1 second and 1/6 second depending on the product. |
memory.total | Total installed GPU memory. |
memory.free | Total free memory. |
memory.used | Total memory allocated by active contexts. |
You can get a complete list of the query arguments by issuing: nvidia-smi --help-query-gpu
Add the option "-f <filename>" to redirect the output to a file.
Prepend "timeout -t " to run the query for and stop logging.
Ensure that your query granularity is appropriately sized for the intended use:
Purpose | nvidia-smi "-l" value | Interval | timeout "-t" value | Duration |
---|---|---|---|---|
Fine-grain GPU behavior | 5 | 5 seconds | 600 | 10 minutes |
General GPU behavior | 60 | 1 minute | 3600 | 1 hour |
Broad GPU behavior | 3600 | 1 hour | 86400 | 24 hours |
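For example, to capture fine-grain GPU behavior for 10 minutes using the values in the first row (the query fields and filename here are illustrative; also note that "-t" belongs to BusyBox-style timeout, while GNU coreutils timeout takes the duration directly, e.g. "timeout 600"):
$ timeout -t 600 nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used --format=csv -l 5 -f gpu-fine.csv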
Create a shell script to automate the creation of the log file, with timestamp data added to the filename, and to hold the query parameters (a sketch is shown after the next step).
Add a custom cron job to /var/spool/cron/crontabs to call the script at the intervals required.
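A minimal sketch of such a script, assuming nvidia-smi is on the PATH and that /var/log/gpu is a writable directory of your choosing (the paths, filename pattern, and query fields are all examples; adjust them to suit your environment):
#!/bin/bash
# Log GPU metrics once per minute for one hour to a timestamped CSV file.
LOGDIR=/var/log/gpu
LOGFILE="$LOGDIR/gpu-stats-$(date +%Y%m%d-%H%M%S).csv"
mkdir -p "$LOGDIR"
# GNU coreutils timeout shown here; on BusyBox-based systems use "timeout -t 3600" instead.
timeout 3600 nvidia-smi \
    --query-gpu=timestamp,name,pci.bus_id,pstate,temperature.gpu,utilization.gpu,utilization.memory,memory.used \
    --format=csv -l 60 -f "$LOGFILE"
A matching per-user crontab entry (the script path /usr/local/bin/gpu-log.sh is an assumption) could then start the script at the top of every hour:
0 * * * * /usr/local/bin/gpu-log.sh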
Any settings below for clocks and power get reset between program runs unless you enable persistence mode (PM) for the driver.
Also note that the nvidia-smi command runs much faster when persistence mode is enabled.
nvidia-smi -pm 1 — Make clock, power and other settings persist across program runs / driver invocations
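To confirm that persistence mode is taking effect, one quick check (persistence_mode is one of the fields listed by --help-query-gpu) is:
$ nvidia-smi --query-gpu=name,persistence_mode --format=csv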
Command | Detail |
---|---|
nvidia-smi -q -d SUPPORTED_CLOCKS | View clocks supported |
nvidia-smi -ac <MEM clock, Graphics clock> | Set one of supported clocks |
nvidia-smi -q -d CLOCK | View current clock |
nvidia-smi --auto-boost-default=ENABLED -i 0 | Enable boosting GPU clocks (K80 and later) |
nvidia-smi --rac | Reset clocks back to base |
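A typical sequence is to list the supported clock pairs and then apply one of them. The 2505,875 MHz pair below is only an example reported by some Tesla boards; use a pair that your own GPU lists:
$ nvidia-smi -q -d SUPPORTED_CLOCKS
$ nvidia-smi -ac 2505,875
$ nvidia-smi -q -d CLOCK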
Command | Detail |
---|---|
nvidia-smi -pl N | Set power cap (maximum wattage the GPU will use) |
nvidia-smi -pm 1 | Enable persistence mode |
nvidia-smi stats -i <device#> -d pwrDraw | Command that provides continuous monitoring of detail stats such as power |
nvidia-smi --query-gpu=index,timestamp,power.draw,clocks.sm,clocks.mem,clocks.gr --format=csv -l 1 | Continuously provide time stamped power and clock |
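For example, to enable persistence mode, cap the board at 150 W (an illustrative value that must fall within the power limits your GPU reports), and watch the effect on power draw:
$ nvidia-smi -pm 1
$ nvidia-smi -pl 150
$ nvidia-smi --query-gpu=index,timestamp,power.draw,power.limit --format=csv -l 1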