Modern multi-socket servers exhibit NUMA characteristics that may hurt application performance if ignored. On a NUMA (Non-Uniform Memory Access) system, all memory is shared among the processors. Each processor has access to its own memory (local memory) as well as to memory that is local to another processor (remote memory), but the memory access time (latency) depends on where the memory is relative to the processor. Since a processor can access its local memory faster than remote memory, these varying memory latencies play a big role in application performance.
Solaris organizes hardware resources (CPUs, memory and I/O devices) into one or more logical groups based on their proximity to each other, in such a way that all hardware resources in a group are considered local to that group. These groups are referred to as locality groups or NUMA nodes. In other words, a locality group (lgroup) is an abstraction that describes which hardware resources are near each other on a NUMA system. Each locality group has at least one processor and possibly some associated memory and/or I/O devices. To minimize the impact of NUMA characteristics, Solaris takes the lgroup-based physical topology into account when mapping threads and data to CPUs and memory.
Note that even though Solaris attempts to provide good performance out of the box, some applications may still suffer from NUMA effects, whether due to hardware or software misconfiguration or for some other reason. Engineered systems such as Oracle SuperCluster go to great lengths in setting up customer environments to minimize the impact of NUMA so that applications perform as expected in a predictable manner. Still, application developers and system/application administrators need to take the NUMA factor into account when developing for and managing applications on large systems. The tools and APIs provided by Solaris can be used to observe, diagnose and control issues related to locality and latency, and even to correct or fix them. The rest of this post is about the tools that can be used to examine the locality of cores, memory and I/O devices.
Sample outputs were collected from a SPARC T4-4 server.
Locality Group Hierarchy
The lgrpinfo utility prints information about the lgroup hierarchy and its contents. It is useful for understanding the context in which the OS is trying to optimize applications for locality, and also for figuring out which CPUs are close to each other, how much memory is near them, and the relative latencies between the CPUs and different memory blocks.
e.g.,
# lgrpinfo -a
lgroup 0 (root):
Children: 1-4
CPUs: 0-255
Memory: installed 1024G, allocated 75G, free 948G
Lgroup resources: 1-4 (CPU); 1-4 (memory)
Latency: 18
lgroup 1 (leaf):
Children: none, Parent: 0
CPUs: 0-63
Memory: installed 256G, allocated 18G, free 238G
Lgroup resources: 1 (CPU); 1 (memory)
Load: 0.0227
Latency: 12
lgroup 2 (leaf):
Children: none, Parent: 0
CPUs: 64-127
Memory: installed 256G, allocated 15G, free 241G
Lgroup resources: 2 (CPU); 2 (memory)
Load: 0.000153
Latency: 12
lgroup 3 (leaf):
Children: none, Parent: 0
CPUs: 128-191
Memory: installed 256G, allocated 20G, free 236G
Lgroup resources: 3 (CPU); 3 (memory)
Load: 0.016
Latency: 12
lgroup 4 (leaf):
Children: none, Parent: 0
CPUs: 192-255
Memory: installed 256G, allocated 23G, free 233G
Lgroup resources: 4 (CPU); 4 (memory)
Load: 0.00824
Latency: 12
Lgroup latencies:
------------------
| 0 1 2 3 4
------------------
0 | 18 18 18 18 18
1 | 18 12 18 18 18
2 | 18 18 12 18 18
3 | 18 18 18 12 18
4 | 18 18 18 18 12
------------------
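The same information is available programmatically through the liblgrp(3LIB) interfaces (lgrp_init(), lgrp_children(), lgrp_cpus(), lgrp_mem_size(), lgrp_latency_cookie()). Here is a minimal C sketch, not a replacement for lgrpinfo, that walks the lgroup hierarchy and prints per-leaf-lgroup CPU counts, memory sizes and CPU-to-memory latency (the file name is arbitrary):
/*
 * walk_lgrps.c -- minimal sketch using liblgrp(3LIB) to walk the lgroup
 * hierarchy and print, per leaf lgroup, roughly what lgrpinfo reports:
 * CPU count, installed/free memory and CPU-to-memory latency.
 * Compile on Solaris with:  cc walk_lgrps.c -o walk_lgrps -llgrp
 */
#include <sys/lgrp_user.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	/* LGRP_VIEW_OS: all resources, not just the caller's processor set */
	lgrp_cookie_t cookie = lgrp_init(LGRP_VIEW_OS);
	lgrp_id_t root, *leaves;
	int i, nleaves;

	if (cookie == LGRP_COOKIE_NONE) {
		perror("lgrp_init");
		return (1);
	}

	root = lgrp_root(cookie);
	nleaves = lgrp_children(cookie, root, NULL, 0);	/* count first */
	leaves = calloc(nleaves, sizeof (lgrp_id_t));
	(void) lgrp_children(cookie, root, leaves, nleaves);

	for (i = 0; i < nleaves; i++) {
		int ncpus = lgrp_cpus(cookie, leaves[i], NULL, 0,
		    LGRP_CONTENT_DIRECT);
		lgrp_mem_size_t inst = lgrp_mem_size(cookie, leaves[i],
		    LGRP_MEM_SZ_INSTALLED, LGRP_CONTENT_DIRECT);
		lgrp_mem_size_t freemem = lgrp_mem_size(cookie, leaves[i],
		    LGRP_MEM_SZ_FREE, LGRP_CONTENT_DIRECT);
		/* latency between CPUs in this lgroup and its own memory */
		int lat = lgrp_latency_cookie(cookie, leaves[i], leaves[i],
		    LGRP_LAT_CPU_TO_MEM);

		printf("lgroup %d: %d CPUs, %lld MB installed, "
		    "%lld MB free, latency %d\n", (int)leaves[i], ncpus,
		    (long long)(inst >> 20), (long long)(freemem >> 20), lat);
	}

	free(leaves);
	(void) lgrp_fini(cookie);
	return (0);
}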
CPU Locality
The lgrpinfo utility shown above already presents CPU locality clearly. Here is another way to retrieve the association between CPU IDs and lgroups.
# echo ::lgrp -p | mdb -k
LGRPID PSRSETID LOAD #CPU CPUS
1 0 17873 64 0-63
2 0 17755 64 64-127
3 0 2256 64 128-191
4 0 18173 64 192-255
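For completeness, the same CPU-to-lgroup association can be retrieved from a user program with lgrp_cpus(3LGRP). A minimal sketch that mirrors the CPUS column of the ::lgrp output above (file name arbitrary):
/*
 * cpu_lgrps.c -- minimal sketch: list the CPU IDs in each leaf lgroup
 * with lgrp_cpus(3LGRP), mirroring the CPUS column of ::lgrp above.
 * Compile with:  cc cpu_lgrps.c -o cpu_lgrps -llgrp
 */
#include <sys/lgrp_user.h>
#include <sys/processor.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	lgrp_cookie_t cookie = lgrp_init(LGRP_VIEW_OS);
	lgrp_id_t *leaves;
	int i, j, nleaves;

	if (cookie == LGRP_COOKIE_NONE) {
		perror("lgrp_init");
		return (1);
	}

	nleaves = lgrp_children(cookie, lgrp_root(cookie), NULL, 0);
	leaves = calloc(nleaves, sizeof (lgrp_id_t));
	(void) lgrp_children(cookie, lgrp_root(cookie), leaves, nleaves);

	for (i = 0; i < nleaves; i++) {
		/* first call sizes the array, second call fills it */
		int ncpus = lgrp_cpus(cookie, leaves[i], NULL, 0,
		    LGRP_CONTENT_DIRECT);
		processorid_t *cpus = calloc(ncpus, sizeof (processorid_t));

		(void) lgrp_cpus(cookie, leaves[i], cpus, ncpus,
		    LGRP_CONTENT_DIRECT);
		printf("lgroup %d:", (int)leaves[i]);
		for (j = 0; j < ncpus; j++)
			printf(" %d", (int)cpus[j]);
		printf("\n");
		free(cpus);
	}

	free(leaves);
	(void) lgrp_fini(cookie);
	return (0);
}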
Memory Locality
The lgrpinfo utility shown above reports the total memory that belongs to each locality group, but it does not show exactly which memory blocks belong to which locality groups. One of mdb's debugger commands (dcmds) helps retrieve this information.
1. List memory blocks
# ldm list-devices -a memory
MEMORY
PA SIZE BOUND
0xa00000 32M _sys_
0x2a00000 96M _sys_
0x8a00000 374M _sys_
0x20000000 1048064M primary
2. Print the physical memory layout of the system
# echo ::syslayout | mdb -k
STARTPA ENDPA SIZE MG MN STL ETL
20000000 200000000 7.5g 0 0 4 40
200000000 400000000 8g 1 1 800 840
400000000 600000000 8g 2 2 1000 1040
600000000 800000000 8g 3 3 1800 1840
800000000 a00000000 8g 0 0 40 80
a00000000 c00000000 8g 1 1 840 880
c00000000 e00000000 8g 2 2 1040 1080
e00000000 1000000000 8g 3 3 1840 1880
1000000000 1200000000 8g 0 0 80 c0
1200000000 1400000000 8g 1 1 880 8c0
1400000000 1600000000 8g 2 2 1080 10c0
1600000000 1800000000 8g 3 3 1880 18c0
...
...
The values in the MN (memory node) column can be treated as lgroup numbers after adding 1. For example, a value of 0 under MN translates to lgroup 1, a value of 1 translates to lgroup 2, and so on. Better yet, the ::mnode debugger command lists the mapping of mnodes to lgroups, as shown below.
# echo ::mnode | mdb -k
MNODE ID LGRP ASLEEP UTOTAL UFREE UCACHE KTOTAL KFREE KCACHE
2075ad80000 0 1 - 249g 237g 114m 5.7g 714m -
2075ad802c0 1 2 - 240g 236g 288m 15g 4.8g -
2075ad80580 2 3 - 246g 234g 619m 9.6g 951m -
2075ad80840 3 4 - 247g 231g 24m 9g 897m -
Unrelated notes:
- Main memory on the T4-4 is interleaved across all memory banks with an 8 GB interleave size, meaning the first 8 GB chunk (excluding the _sys_ blocks) is populated in lgroup 1, closer to processor #1; the second 8 GB chunk in lgroup 2, closer to processor #2; the third 8 GB chunk in lgroup 3, closer to processor #3; the fourth 8 GB chunk in lgroup 4, closer to processor #4; then the fifth 8 GB chunk again in lgroup 1, closer to processor #1; and so on. Memory is not interleaved on T5 and M6 systems (confirm by running the ::syslayout dcmd). Conceptually, memory interleaving is similar to disk striping. A quick arithmetic check of this layout is sketched after these notes.
- Keep in mind that debugger commands (dcmds) are not committed interfaces, so there is no guarantee that they will continue to work on future versions of Solaris. Some of these dcmds may not work on some of the existing versions of Solaris.
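As a back-of-the-envelope illustration of the interleaving described above, the following tiny program maps a physical address to the lgroup its 8 GB chunk would land in under a plain round-robin scheme across the four lgroups of this T4-4. The first chunk is a bit smaller than 8 GB because the _sys_ blocks are carved out of it, so treat the result as approximate and cross-check against the ::syslayout output.
/*
 * interleave_check.c -- back-of-the-envelope check of the 8 GB round-robin
 * interleave described above (4 leaf lgroups on this T4-4).
 */
#include <stdio.h>

#define	CHUNK	(8ULL << 30)	/* 8 GB interleave size */
#define	NLGRPS	4		/* leaf lgroups on a T4-4 */

/* map a physical address to the lgroup its 8 GB chunk is placed in (1..4) */
static int
pa_to_lgrp(unsigned long long pa)
{
	return ((int)((pa / CHUNK) % NLGRPS) + 1);
}

int
main(void)
{
	/*
	 * 0x800000000 (32 GB) starts the fifth chunk -> back to lgroup 1,
	 * matching the MN 0 row at that address in the ::syslayout output.
	 */
	printf("PA 0x800000000 -> lgroup %d\n", pa_to_lgrp(0x800000000ULL));
	return (0);
}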
I/O Device Locality
The -d option of the lgrpinfo utility accepts the path to an I/O device and returns the lgroup ID(s) closest to that device. Each I/O device on the system can be connected to one or more NUMA nodes, so it is not uncommon to see more than one lgroup ID returned by lgrpinfo.
e.g.,
# lgrpinfo -d /dev/dsk/c1t0d0
lgroup ID : 1
# dladm show-phys | grep 10000
net4 Ethernet up 10000 full ixgbe0
# lgrpinfo -d /dev/ixgbe0
lgroup ID : 1
# dladm show-phys | grep ibp0
net12 Infiniband up 32000 unknown ibp0
# lgrpinfo -d /dev/ibp0
lgroup IDs : 1-4
NUMA IO Groups
The ::numaio_group debugger command shows information about all NUMA I/O groups.
# dladm show-phys | grep up
net0 Ethernet up 1000 full igb0
net12 Ethernet up 10 full usbecm2
net4 Ethernet up 10000 full ixgbe0
# echo ::numaio_group | mdb -k
ADDR GROUP_NAME CONSTRAINT
10050e1eba48 net4 lgrp : 1
10050e1ebbb0 net0 lgrp : 1
10050e1ebd18 usbecm2 lgrp : 1
10050e1ebe80 scsi_hba_ngrp_mpt_sas1 lgrp : 4
10050e1ebef8 scsi_hba_ngrp_mpt_sas0 lgrp : 1
Relying on prtconf is another way to find the NUMA I/O locality of an I/O device.
e.g.,
# dladm show-phys | grep up | grep ixgbe
net4 Ethernet up 10000 full ixgbe0
== Find the device path for the network interface ==
# grep ixgbe /etc/path_to_inst | grep " 0 "
"/pci@400/pci@1/pci@0/pci@4/network@0" 0 "ixgbe"
== Find NUMA IO Lgroups ==
# prtconf -v /devices/pci@400/pci@1/pci@0/pci@4/network@0
...
Hardware properties:
...
name='numaio-lgrps' type=int items=1
value=00000001
...
Process, Thread Locality
- The -H option of the prstat command shows the home lgroup of active user processes and threads.
- The -H option of the ps command shows the home lgroup of all user processes and threads; the -h option can be used to list only the processes that are homed in a certain locality group.
- [Related] Solaris assigns a thread to an lgroup when the thread is created. That lgroup is called the thread's home lgroup. Solaris runs the thread on the CPUs in the thread's home lgroup and allocates memory from that lgroup whenever possible.
- The plgrp tool shows the placement of threads among locality groups. The same tool can be used to set the home locality group and lgroup affinities for one or more processes, threads, or LWPs.
- The -L option of the pmap command shows the lgroup that contains the physical memory backing some virtual memory. [Related] Breakdown of Oracle SGA into Solaris Locality Groups
- Memory placement among lgroups can be influenced with the pmadvise command while the application is running, or with the madvise() system call during development; both provide advice to the kernel's virtual memory subsystem, which uses the hint to decide how to allocate memory for the specified address range (see the sketch after this list). This mechanism is most beneficial when administrators and developers understand the target application's data access patterns. Note that it is not possible to specify memory placement locality for OSM and ISM segments using pmadvise or madvise() (DISM is an exception).
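To illustrate the madvise() route in code, here is a minimal sketch (the file name, the 64 MB anonymous mapping and the choice of MADV_ACCESS_LWP are just for illustration): it maps a region, advises the kernel that the next LWP to touch the range will be its primary consumer, touches the range, and prints the calling thread's home lgroup with lgrp_home(3LGRP).
/*
 * madvise_access.c -- minimal sketch of the madvise(3C) advice discussed
 * above: map an anonymous region, declare that the next LWP to touch it
 * will be its primary consumer, touch it, and report the home lgroup.
 * Compile with:  cc madvise_access.c -o madvise_access -llgrp
 */
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/lgrp_user.h>
#include <sys/procset.h>	/* P_LWPID, P_MYID */
#include <stdio.h>
#include <string.h>

#define	BUFSZ	(64UL * 1024 * 1024)	/* 64 MB example region */

int
main(void)
{
	/* anonymous mapping standing in for application data */
	caddr_t buf = mmap(NULL, BUFSZ, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANON, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return (1);
	}

	/*
	 * MADV_ACCESS_LWP: the next LWP to touch this range is its primary
	 * consumer, so place (or migrate) the backing pages near that LWP.
	 */
	if (madvise(buf, BUFSZ, MADV_ACCESS_LWP) != 0)
		perror("madvise");

	/* first touch from this thread; pages land near its home lgroup */
	memset(buf, 0, BUFSZ);

	printf("home lgroup of this LWP: %d\n",
	    (int)lgrp_home(P_LWPID, P_MYID));

	(void) munmap(buf, BUFSZ);
	return (0);
}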
Examples:
# prstat -H
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU LGRP PROCESS/NLWP
1865 root 420M 414M sleep 59 0 447:51:13 0.1% 2 java/108
3659 oracle 1428M 1413M sleep 38 0 68:39:28 0.0% 4 oracle/1
1814 oracle 155M 110M sleep 59 0 70:45:17 0.0% 4 gipcd.bin/9
8 root 0K 0K sleep 60 - 70:52:21 0.0% 0 vmtasks/257
3765 root 447M 413M sleep 59 0 29:24:20 0.0% 3 crsd.bin/43
3949 oracle 505M 456M sleep 59 0 0:59:42 0.0% 2 java/124
10825 oracle 1097M 1074M sleep 59 0 18:13:27 0.0% 3 oracle/1
3941 root 210M 184M sleep 59 0 20:03:37 0.0% 4 orarootagent.bi/14
3743 root 119M 98M sleep 110 - 24:53:29 0.0% 1 osysmond.bin/13
3324 oracle 266M 225M sleep 110 - 19:52:31 0.0% 4 ocssd.bin/34
1585 oracle 122M 91M sleep 59 0 18:06:34 0.0% 3 evmd.bin/10
3918 oracle 168M 144M sleep 58 0 14:35:31 0.0% 1 oraagent.bin/28
3427 root 112M 80M sleep 59 0 12:34:28 0.0% 4 octssd.bin/12
3635 oracle 1425M 1406M sleep 101 - 13:55:31 0.0% 4 oracle/1
1951 root 183M 161M sleep 59 0 9:26:51 0.0% 4 orarootagent.bi/21
Total: 251 processes, 2414 lwps, load averages: 1.37, 1.46, 1.47
== Locality group 2 is the home lgroup of the java process with pid 1865 ==
# plgrp 1865
PID/LWPID HOME
1865/1 2
1865/2 2
...
...
1865/22 4
1865/23 4
...
...
1865/41 1
1865/42 1
...
...
1865/60 3
1865/61 3
...
...
# plgrp 1865 | awk '{print $2}' | grep 2 | wc -l
30
# plgrp 1865 | awk '{print $2}' | grep 1 | wc -l
25
# plgrp 1865 | awk '{print $2}' | grep 3 | wc -l
25
# plgrp 1865 | awk '{print $2}' | grep 4 | wc -l
28
== Let's reset the home lgroup of the java process id 1865 to 4 ==
# plgrp -H 4 1865
PID/LWPID HOME
1865/1 2 => 4
1865/2 2 => 4
1865/3 2 => 4
1865/4 2 => 4
...
...
1865/184 1 => 4
1865/188 4 => 4
# plgrp 1865 | awk '{print $2}' | egrep "1|2|3" | wc -l
0
# plgrp 1865 | awk '{print $2}' | grep 4 | wc -l
108
# prstat -H -p 1865
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU LGRP PROCESS/NLWP
1865 root 420M 414M sleep 59 0 447:57:30 0.1% 4 java/108
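The same re-homing can be done programmatically with lgrp_affinity_set(3LGRP): setting a strong affinity to an lgroup also makes it the thread's home lgroup. A minimal sketch, reusing the PID and target lgroup from the plgrp example above:
/*
 * rehome.c -- programmatic counterpart of "plgrp -H 4 1865": give a process
 * a strong affinity for lgroup 4, which also makes 4 its home lgroup.
 * The PID and lgroup ID simply mirror the example above.
 * Compile with:  cc rehome.c -o rehome -llgrp
 */
#include <sys/types.h>
#include <sys/lgrp_user.h>
#include <sys/procset.h>	/* P_PID */
#include <stdio.h>

int
main(void)
{
	pid_t pid = 1865;	/* target process from the example above */
	lgrp_id_t target = 4;	/* desired home lgroup */

	if (lgrp_affinity_set(P_PID, pid, target, LGRP_AFF_STRONG) != 0) {
		perror("lgrp_affinity_set");
		return (1);
	}
	printf("home lgroup of pid %ld set to %d\n", (long)pid, (int)target);
	return (0);
}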
== List the home lgroup of all processes ==
# ps -aeH
PID LGRP TTY TIME CMD
0 0 ? 0:11 sched
5 0 ? 4:47 zpool-rp
1 4 ? 21:04 init
8 0 ? 4253:54 vmtasks
75 4 ? 0:13 ipmgmtd
11 3 ? 3:09 svc.star
13 4 ? 2:45 svc.conf
3322 1 ? 301:51 cssdagen
...
11155 3 ? 0:52 oracle
13091 4 ? 0:00 sshd
13124 3 pts/5 0:00 bash
24703 4 pts/8 0:00 bash
12812 2 pts/3 0:00 bash
...
== Find out the lgroups which shared memory segments are allocated from ==
# pmap -Ls 24513 | egrep "Lgrp|256M|2G"
Address Bytes Pgsz Mode Lgrp Mapped File
0000000400000000 33554432K 2G rwxs- 1 [ osm shmid=0x78000047 ]
0000000C00000000 262144K 256M rwxs- 3 [ osm shmid=0x78000048 ]
0000000C10000000 524288K 256M rwxs- 2 [ osm shmid=0x78000048 ]
0000000C30000000 262144K 256M rwxs- 3 [ osm shmid=0x78000048 ]
0000000C40000000 524288K 256M rwxs- 1 [ osm shmid=0x78000048 ]
0000000C60000000 262144K 256M rwxs- 2 [ osm shmid=0x78000048 ]
== Apply MADV_ACCESS_LWP policy advice to a segment at a specific address ==
# pmap -Ls 1865 | grep anon
00000007DAC00000 20480K 4M rw--- 4 [ anon ]
00000007DC000000 4096K - rw--- - [ anon ]
00000007DFC00000 90112K 4M rw--- 4 [ anon ]
00000007F5400000 110592K 4M rw--- 4 [ anon ]
# pmadvise -o 7F5400000=access_lwp 1865
# pmap -Ls 1865 | grep anon
00000007DAC00000 20480K 4M rw--- 4 [ anon ]
00000007DC000000 4096K - rw--- - [ anon ]
00000007DFC00000 90112K 4M rw--- 4 [ anon ]
00000007F5400000 73728K 4M rw--- 4 [ anon ]
00000007F9C00000 28672K - rw--- - [ anon ]
00000007FB800000 8192K 4M rw--- 4 [ anon ]
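As an aside, the per-address lgroup information that pmap -L reports can also be obtained from within a process via the meminfo(2) system call. A minimal sketch (buffer size and file name are arbitrary):
/*
 * vaddr_lgrp.c -- minimal sketch: ask meminfo(2) which lgroup the physical
 * memory backing a virtual address of the calling process came from.
 * Compile with:  cc vaddr_lgrp.c -o vaddr_lgrp
 */
#include <sys/types.h>
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

static char buf[8192];		/* example region; any valid address works */

int
main(void)
{
	uint64_t inaddr, out = 0;
	uint_t req = MEMINFO_VLGRP;	/* lgroup of the backing memory */
	uint_t valid = 0;

	/* touch the page so it is actually backed by physical memory */
	memset(buf, 1, sizeof (buf));
	inaddr = (uint64_t)(uintptr_t)buf;

	if (meminfo(&inaddr, 1, &req, 1, &out, &valid) != 0) {
		perror("meminfo");
		return (1);
	}

	/* bit 0 of validity: address valid; bit 1: MEMINFO_VLGRP answered */
	if (valid & 0x2)
		printf("%p is backed by memory in lgroup %llu\n",
		    (void *)buf, (unsigned long long)out);
	else
		printf("lgroup information not available for %p\n",
		    (void *)buf);
	return (0);
}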
SEE ALSO:
- Man pages of lgrpinfo(1), plgrp(1), pmap(1), prstat(1M), ps(1), pmadvise(1), madvise(3C), madv.so.1(1), mdb(1)
- Web search keywords: NUMA, cc-NUMA, locality group, lgroup, lgrp, Memory Placement Optimization, MPO