Interpreting the nvidia-smi Command
程序员文章站
2022-07-15 09:47:06
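The nvidia-smi command is used frequently to check how the GPUs are being used. The fields in its output are explained below, following reference tutorial 1.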
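The first row shows the machine's current time.
The second row shows the driver version.
The third row lists the GPU name, persistence mode, Bus-Id, and Disp.A.
The fourth row shows the actual usage figures: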
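1) Fan: the fan speed, which varies between 0 and 100%. This is the speed the machine intends the fan to run at; if the fan is physically blocked, the real speed may fall short of the displayed value. Some devices report no fan speed (shown as N/A) because they do not rely on a fan and are kept cool by other means (for example, our lab's servers stay in an air-conditioned room all year).
2) Temp: the temperature, in degrees Celsius.
3) Perf: the performance state, from P0 to P12. P0 is maximum performance and P12 is minimum performance.
4) Pwr: the power draw. Persistence-M, shown above it, is the persistence mode state. Persistence mode uses more power, but new GPU applications take less time to start. In the output described here it is Off.
5) Bus-Id: the GPU's PCI bus address, in the form domain:bus:device.function.
6) Disp.A: Display Active, which indicates whether a display has been initialized on the GPU.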
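The domain:bus:device.function layout of Bus-Id can be illustrated with a small parser. This is only a sketch: the names `PciAddress` and `parse_bus_id` are made up for this example, and it assumes each field is hexadecimal and that the domain is printed with 4 to 8 digits.

```python
import re
from typing import NamedTuple


class PciAddress(NamedTuple):
    """Hypothetical container for the four parts of a Bus-Id."""
    domain: int
    bus: int
    device: int
    function: int


# Assumed layout: domain (4-8 hex digits), bus (2), device (2), function (1 digit, 0-7).
_BUS_ID = re.compile(
    r"^([0-9A-Fa-f]{4,8}):([0-9A-Fa-f]{2}):([0-9A-Fa-f]{2})\.([0-7])$"
)


def parse_bus_id(bus_id: str) -> PciAddress:
    """Split a string such as '0000:01:00.0' into its numeric fields."""
    match = _BUS_ID.match(bus_id.strip())
    if match is None:
        raise ValueError(f"not a PCI bus id: {bus_id!r}")
    # Every field is interpreted as a hexadecimal number.
    domain, bus, device, function = (int(part, 16) for part in match.groups())
    return PciAddress(domain, bus, device, function)
```

For instance, `parse_bus_id("0000:01:00.0")` yields domain 0, bus 1, device 0, function 0.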
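Memory-Usage, at the bottom of the fifth and sixth columns, is the amount of video memory in use.
The seventh column is the volatile (fluctuating) GPU utilization.
The top of the eighth column reports ECC information.
Compute M., at the bottom of the eighth column, is the compute mode.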
The table below it shows the video memory used by each process.
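The per-process figures can also be requested in CSV form with --query-compute-apps, which is easier to process than the table. The sketch below assumes the command was run as `nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader,nounits` and that every memory value is a plain number of MiB. The sample text and the helper name `memory_by_pid` are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Invented sample of the CSV output; one row per process per GPU.
SAMPLE_APPS = """\
1234, python, 2048
1234, python, 512
5678, ./train, 4096
"""


def memory_by_pid(csv_text: str) -> dict:
    """Sum the used video memory (MiB) of each PID across all GPUs."""
    totals = defaultdict(int)
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        if not row:  # skip blank lines
            continue
        pid, _name, used_mib = row
        totals[int(pid)] += int(used_mib)
    return dict(totals)
```

A value that is not numeric (some platforms may not report per-process memory) would raise a ValueError here, so a real script would need to handle that case.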
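Video memory usage and GPU utilization are two different things. A graphics card is made up of the GPU, video memory, and other components, and the relationship between video memory and the GPU is roughly like that between RAM and the CPU. Reportedly, Caffe code tends to use little video memory but a lot of GPU, while TensorFlow code tends to use a lot of video memory but little GPU.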
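To see the two numbers side by side, the selective query output can be parsed directly. This sketch assumes the text came from `nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits`. The sample values and the function name `parse_gpu_rows` are made up for this example.

```python
import csv
import io

# Invented sample: GPU 0 is busy but uses little memory, GPU 1 is the opposite.
SAMPLE_GPUS = """\
0, 95, 1024, 8192
1, 3, 6144, 8192
"""


def parse_gpu_rows(csv_text: str) -> list:
    """Return GPU utilization and memory occupancy (both in percent) per GPU."""
    rows = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not row:  # skip blank lines
            continue
        index, util, used_mib, total_mib = row
        rows.append({
            "index": int(index),
            "gpu_util_percent": int(util),
            "memory_percent": round(100 * int(used_mib) / int(total_mib), 1),
        })
    return rows
```

In the sample, GPU 0 shows 95% utilization with 12.5% of its memory occupied, while GPU 1 shows 3% utilization with 75.0% of its memory occupied, so neither number predicts the other.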
Running nvidia-smi -h > log.txt exports the help text to log.txt:
NVIDIA System Management Interface -- v375.39
NVSMI provides monitoring information for Tesla and select Quadro devices.
The data is presented in either a plain text or an XML format, via stdout or a file.
NVSMI also provides several management operations for changing the device state.
Note that the functionality of NVSMI is exposed through the NVML C-based
library. See the NVIDIA developer website for more information about NVML.
Python wrappers to NVML are also available. The output of NVSMI is
not guaranteed to be backwards compatible; NVML and the bindings are backwards
compatible.
http://developer.nvidia.com/nvidia-management-library-nvml/
http://pypi.python.org/pypi/nvidia-ml-py/
Supported products:
- Full Support
    - All Tesla products, starting with the Fermi architecture
    - All Quadro products, starting with the Fermi architecture
    - All GRID products, starting with the Kepler architecture
    - GeForce Titan products, starting with the Kepler architecture
- Limited Support
    - All Geforce products, starting with the Fermi architecture
nvidia-smi [OPTION1 [ARG1]] [OPTION2 [ARG2]] ...
-h, --help Print usage information and exit.
LIST OPTIONS:
-L, --list-gpus Display a list of GPUs connected to the system.
SUMMARY OPTIONS:
<no arguments> Show a summary of GPUs connected to the system.
[plus any of]
-i, --id= Target a specific GPU.
-f, --filename= Log to a specified file, rather than to stdout.
-l, --loop= Probe until Ctrl+C at specified second interval.
QUERY OPTIONS:
-q, --query Display GPU or Unit info.
[plus any of]
-u, --unit Show unit, rather than GPU, attributes.
-i, --id= Target a specific GPU or Unit.
-f, --filename= Log to a specified file, rather than to stdout.
-x, --xml-format Produce XML output.
--dtd When showing xml output, embed DTD.
-d, --display= Display only selected information: MEMORY,
UTILIZATION, ECC, TEMPERATURE, POWER, CLOCK,
COMPUTE, PIDS, PERFORMANCE, SUPPORTED_CLOCKS,
PAGE_RETIREMENT, ACCOUNTING.
Flags can be combined with comma e.g. ECC,POWER.
Sampling data with max/min/avg is also returned
for POWER, UTILIZATION and CLOCK display types.
Doesn't work with -u or -x flags.
-l, --loop= Probe until Ctrl+C at specified second interval.
-lms, --loop-ms= Probe until Ctrl+C at specified millisecond interval.
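The XML produced by `nvidia-smi -q -x` is convenient for scripts. The sketch below reads a few values with Python's standard ElementTree module. The sample document is a heavily trimmed, invented stand-in for real output, and the element names it relies on (`gpu`, `product_name`, `fb_memory_usage/used`, `utilization/gpu_util`) are assumed to match the installed driver's output. `summarize` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Invented, trimmed stand-in for the output of "nvidia-smi -q -x".
SAMPLE_XML = """\
<nvidia_smi_log>
  <driver_version>375.39</driver_version>
  <gpu id="0000:01:00.0">
    <product_name>Tesla K80</product_name>
    <fb_memory_usage>
      <total>11439 MiB</total>
      <used>2048 MiB</used>
    </fb_memory_usage>
    <utilization>
      <gpu_util>40 %</gpu_util>
    </utilization>
  </gpu>
</nvidia_smi_log>
"""


def summarize(xml_text: str) -> list:
    """Collect name, used memory, and utilization for every <gpu> element."""
    root = ET.fromstring(xml_text)
    summary = []
    for gpu in root.iter("gpu"):
        summary.append({
            "id": gpu.get("id"),
            "name": gpu.findtext("product_name"),
            # Values are kept as text, including their units.
            "memory_used": gpu.findtext("fb_memory_usage/used"),
            "gpu_util": gpu.findtext("utilization/gpu_util"),
        })
    return summary
```

Because the help text warns that NVSMI output is not guaranteed to be backwards compatible, `findtext` is used so that a missing element yields `None` rather than an exception.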
SELECTIVE QUERY OPTIONS:
Allows the caller to pass an explicit list of properties to query.
[one of]
--query-gpu= Information about GPU.
Call --help-query-gpu for more info.
--query-supported-clocks= List of supported clocks.
Call --help-query-supported-clocks for more info.
--query-compute-apps= List of currently active compute processes.
Call --help-query-compute-apps for more info.
--query-accounted-apps= List of accounted compute processes.
Call --help-query-accounted-apps for more info.
--query-retired-pages= List of device memory pages that have been retired.
Call --help-query-retired-pages for more info.
[mandatory]
--format= Comma separated list of format options:
csv - comma separated values (MANDATORY)
noheader - skip the first line with column headers
nounits - don't print units for numerical
values
[plus any of]
-i, --id= Target a specific GPU or Unit.
-f, --filename= Log to a specified file, rather than to stdout.
-l, --loop= Probe until Ctrl+C at specified second interval.
-lms, --loop-ms= Probe until Ctrl+C at specified millisecond interval.
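The selective query options combine as one property list, the mandatory --format, and optional --id and --loop. A minimal sketch of assembling such a command line follows. The function name `build_query_command` is hypothetical, and the property names passed in must be ones that `nvidia-smi --help-query-gpu` lists on the target machine.

```python
def build_query_command(fields, gpu_id=None, loop_seconds=None):
    """Build an argument list for a --query-gpu call with CSV output."""
    if not fields:
        raise ValueError("at least one property is required")
    command = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(fields),
        # csv is mandatory; noheader and nounits simplify parsing.
        "--format=csv,noheader,nounits",
    ]
    if gpu_id is not None:
        command.append(f"--id={gpu_id}")  # target a specific GPU
    if loop_seconds is not None:
        command.append(f"--loop={loop_seconds}")  # repeat until Ctrl+C
    return command
```

On a machine with the NVIDIA driver installed, the returned list could be passed to `subprocess.run(command, capture_output=True, text=True, check=True)`. With --loop the process runs until interrupted, so it would need to be read incrementally instead.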
DEVICE MODIFICATION OPTIONS:
[any one of]
-pm, --persistence-mode= Set persistence mode: 0/DISABLED, 1/ENABLED
-e, --ecc-config= Toggle ECC support: 0/DISABLED, 1/ENABLED
-p, --reset-ecc-errors= Reset ECC error counts: 0/VOLATILE, 1/AGGREGATE
-c, --compute-mode= Set MODE for compute applications:
0/DEFAULT, 1/EXCLUSIVE_PROCESS,
2/PROHIBITED
--gom= Set GPU Operation Mode:
0/ALL_ON, 1/COMPUTE, 2/LOW_DP
-r --gpu-reset Trigger reset of the GPU.
Can be used to reset the GPU HW state in situations
that would otherwise require a machine reboot.
Typically useful if a double bit ECC error has
occurred.
Reset operations are not guarenteed to work in
all cases and should be used with caution.
--id= switch is mandatory for this switch
-vm --virt-mode= Switch GPU Virtualization Mode:
Sets GPU virtualization mode to 3/VGPU or 4/VSGA
Virtualization mode of a GPU can only be set when
it is running on a hypervisor.
-ac --applications-clocks= Specifies <memory,graphics> clocks as a
pair (e.g. 2000,800) that defines GPU's
speed in MHz while running applications on a GPU.
-rac --reset-applications-clocks
Resets the applications clocks to the default values.
-acp --applications-clocks-permission=
Toggles permission requirements for -ac and -rac commands:
0/UNRESTRICTED, 1/RESTRICTED
-pl --power-limit= Specifies maximum power management limit in watts.
-am --accounting-mode= Enable or disable Accounting Mode: 0/DISABLED, 1/ENABLED
-caa --clear-accounted-apps
Clears all the accounted PIDs in the buffer.
--auto-boost-default= Set the default auto boost policy to 0/DISABLED
or 1/ENABLED, enforcing the change only after the
last boost client has exited.
--auto-boost-permission=
Allow non-admin/root control over auto boost mode:
0/UNRESTRICTED, 1/RESTRICTED
[plus optional]
-i, --id= Target a specific GPU.
UNIT MODIFICATION OPTIONS:
-t, --toggle-led= Set Unit LED state: 0/GREEN, 1/AMBER
[plus optional]
-i, --id= Target a specific Unit.
SHOW DTD OPTIONS:
--dtd Print device DTD and exit.
[plus optional]
-f, --filename= Log to a specified file, rather than to stdout.
-u, --unit Show unit, rather than device, DTD.
--debug= Log encrypted debug information to a specified file.
STATISTICS: (EXPERIMENTAL)
stats Displays device statistics. "nvidia-smi stats -h" for more information.
Device Monitoring:
dmon Displays device stats in scrolling format.
"nvidia-smi dmon -h" for more information.
daemon Runs in background and monitor devices as a daemon process.
This is an experimental feature.
"nvidia-smi daemon -h" for more information.
replay Used to replay/extract the persistent stats generated by daemon.
This is an experimental feature.
"nvidia-smi replay -h" for more information.
Process Monitoring:
pmon Displays process stats in scrolling format.
"nvidia-smi pmon -h" for more information.
TOPOLOGY:
topo Displays device/system topology. "nvidia-smi topo -h" for more information.
DRAIN STATES:
drain Displays/modifies GPU drain states for power idling. "nvidia-smi drain -h" for more information.
NVLINK:
nvlink Displays device nvlink information. "nvidia-smi nvlink -h" for more information.
CLOCKS:
clocks Control and query clock information. "nvidia-smi clocks -h" for more information.
GRID vGPU:
vgpu Displays vGPU information. "nvidia-smi vgpu -h" for more information.
Please see the nvidia-smi(1) manual page for more detailed information.