Linux Kernel Parameters

Background

Recently I have been working with Linux kernel parameters. They are new to me, and in my case they are key to the performance of the database (DB2).

According to the DB2 documentation, Kernel parameter requirements (Linux), the database manager uses a formula to automatically tune kernel parameter settings, which eliminates the need to update these settings manually.

So DB2 relies on well-chosen IPC kernel parameters to perform well.

When instances are started, if an interprocess communication (IPC) kernel parameter is below the enforced minimum value, the database manager updates it to the enforced minimum value.
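
The "raise to the enforced minimum" behaviour described above can be pictured with a small sketch. The parameter names are real, but the function name and the numbers below are hypothetical placeholders for illustration, not the values DB2 derives from its formula.

```python
def enforce_minimums(current, minimums):
    """Return a copy of `current` where every value below its minimum is raised.

    Values already at or above the minimum are left untouched.
    """
    adjusted = dict(current)
    for name, minimum in minimums.items():
        if adjusted.get(name, 0) < minimum:
            adjusted[name] = minimum
    return adjusted


# Hypothetical numbers, for illustration only.
current = {"kernel.msgmni": 1024, "kernel.shmmni": 4096}
minimums = {"kernel.msgmni": 16384, "kernel.shmmni": 4096}
print(enforce_minimums(current, minimums))
```

Only msgmni changes in this example, because shmmni already meets its (assumed) minimum.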

Several Linux IPC kernel parameters need to be adjusted for DB2:

kernel.shmmni (SHMMNI)
kernel.shmmax (SHMMAX)
kernel.shmall (SHMALL)
kernel.sem (SEMMNI)
kernel.sem (SEMMSL)
kernel.sem (SEMMNS)
kernel.sem (SEMOPM)
kernel.msgmni (MSGMNI)
kernel.msgmax (MSGMAX)
kernel.msgmnb (MSGMNB)

At the time, we had to run DB2 as root in a container (because it tunes kernel parameters itself). To minimize the root user's privileges, we decided to drop some Linux capabilities: SYS_RESOURCE and SYS_ADMIN. Since dropping these capabilities might affect the kernel parameter tuning, we ran the test suite to expose failures and errors on the xmeta pods. In hindsight, I should have checked which capabilities the DB2 processes actually use, for example with the pscap or getpcaps command.

For example, the CAP_SYS_RESOURCE entry in the capabilities(7) manual page includes:

CAP_SYS_RESOURCE

* raise msg_qbytes limit for a System V message queue above
the limit in /proc/sys/kernel/msgmnb (see msgop(2) and
msgctl(2));
* use F_SETPIPE_SZ to increase the capacity of a pipe above
the limit specified by /proc/sys/fs/pipe-max-size;
* override /proc/sys/fs/mqueue/queues_max limit when creating
POSIX message queues (see mq_overview(7));

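Without SYS_RESOURCE, msgmnb (and perhaps other kernel parameters) supposedly cannot be changed properly. I actually doubt this after comparing the results with and without SYS_RESOURCE. Note that the manual text above is about raising a queue's msg_qbytes above the msgmnb limit, not about writing msgmnb itself.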
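
One way to check whether a process holds a given capability is to decode the CapEff bitmask in /proc/<pid>/status. The sketch below assumes the bit numbers from linux/capability.h (21 for CAP_SYS_ADMIN, 24 for CAP_SYS_RESOURCE); the function names are my own, and this is not a replacement for pscap or getpcaps.

```python
CAP_SYS_ADMIN = 21      # bit numbers as defined in <linux/capability.h>
CAP_SYS_RESOURCE = 24


def has_capability(mask_hex, cap_bit):
    """Return True if bit `cap_bit` is set in a hex capability mask."""
    return bool(int(mask_hex, 16) & (1 << cap_bit))


def effective_caps_mask(pid="self"):
    """Read the CapEff line from /proc/<pid>/status and return its hex string."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("CapEff:"):
                return line.split()[1]
    raise RuntimeError("CapEff not found; is procfs mounted?")


if __name__ == "__main__":
    mask = effective_caps_mask()
    print("CAP_SYS_RESOURCE:", has_capability(mask, CAP_SYS_RESOURCE))
    print("CAP_SYS_ADMIN:", has_capability(mask, CAP_SYS_ADMIN))
```

Passing the PID of a DB2 process instead of "self" would show what that process is actually running with, provided you have permission to read its status file.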

About IPC

Let’s first understand what IPC is.

IPC Mechanisms on Linux - Introduction

The original post seems to be disappearing, so I am reproducing it here. (Reading through it, I still recognize some terms from my 402 Operating Systems course, but I have forgotten the details.)

Inter-Process Communication (or IPC for short) refers to the mechanisms provided by the kernel to allow processes to communicate with each other. On modern systems, IPC mechanisms form the web that binds together the processes within a large-scale software architecture.

The Linux kernel provides the following IPC mechanisms:

  1. Signals
  2. Anonymous Pipes
  3. Named Pipes or FIFOs
  4. SysV Message Queues
  5. POSIX Message Queues
  6. SysV Shared memory
  7. POSIX Shared memory
  8. SysV semaphores
  9. POSIX semaphores
  10. FUTEX locks
  11. File-backed and anonymous shared memory using mmap
  12. UNIX Domain Sockets
  13. Netlink Sockets
  14. Network Sockets
  15. Inotify mechanisms
  16. FUSE subsystem
  17. D-Bus subsystem

While the above list seems long, each IPC mechanism in it is tailored to work better for a particular use-case scenario.

  • SIGNALS Signals are the cheapest forms of IPC provided by Linux. Their primary use is to notify processes of changes in state or events that occur within the kernel or other processes. We use signals in the real world to convey messages with the least overhead - think of hand and body gestures. For example, in a crowded gathering, we raise a hand to gain attention, wave a hand at a friend to greet them, and so on.

    On Linux, the kernel notifies a process when an event or state change occurs by interrupting the process’s normal flow of execution and invoking either one of the signal handler functions registered by the process or one of the default signal dispositions supplied by the kernel for that event.

  • ANONYMOUS PIPES Anonymous pipes (or simply pipes, for short) provide a mechanism for one process to stream data to another. A pipe has two ends associated with a pair of file descriptors - making it a one-to-one messaging or communication mechanism. One end of the pipe is the read-end which is associated with a file-descriptor that can only be read, and the other end is the write-end which is associated with a file descriptor that can only be written. This design means that pipes are essentially half-duplex.

    Anonymous pipes can be set up and used only between processes that share a parent-child relationship. Generally the parent process creates a pipe and then forks child processes. Each child process gets access to the pipe created by the parent process via the file descriptors that are duplicated into the child. This allows the parent to communicate with its children, or the children to communicate with each other, using the shared pipe.

    Pipes are generally used to implement Producer-Consumer design amongst processes - where one or more processes would produce data and stream them on one end of the pipe, while other processes would consume the data stream from the other end of the pipe.

  • NAMED PIPES OR FIFO Named pipes (or FIFOs) are variants of pipes that allow communication between processes that are not related to each other. The processes communicate using named pipes by opening a special file known as a FIFO file. One process opens the FIFO file for writing while the other process opens the same file for reading. Thus any data written by the former process gets streamed through a pipe to the latter process. The FIFO file on disk acts as the contract between the two processes that wish to communicate.

  • MESSAGE QUEUES Message queues are analogous to mailboxes. One process writes a message packet on the message queue and exits. Another process can access the message packet from the same message queue at a later point in time. The advantage of message queues over pipes/FIFOs is that the sender (or writer) processes do not have to wait for the receiver (or reader) processes to connect. Think of communication using pipes as similar to two people communicating over the phone, while message queues are similar to two people communicating using mail or other messaging services.

    There are two standard specifications for message queues.

    • SysV message queues. The AT&T SysV message queues support message channeling. Each message packet sent by a sender carries a message number. A receiver can choose to receive messages that match a particular message number, to receive all messages except those with a particular message number, or to receive all messages.

    • POSIX message queues. The POSIX message queues support message priorities. Each message packet sent by a sender carries a priority number along with the message payload. The messages are ordered in the message queue by priority number. When the receiver reads a message at a later point in time, the messages with higher priority numbers are delivered first. POSIX message queues also support asynchronous message delivery using thread- or signal-based notification.

    Linux supports both of the above standards for message queues.

  • SHARED MEMORY As the name implies, this IPC mechanism allows one process to share a region of memory in its address space with another. This allows two or more processes to communicate data more efficiently amongst themselves with minimal kernel intervention.

    There are two standard specifications for Shared memory.

    • SysV Shared memory. Many applications even today use this mechanism for historical reasons. It follows some of the artifacts of SysV IPC semantics.

    • POSIX Shared memory. The POSIX specifications provide a more elegant approach to implementing a shared memory interface. On Linux, POSIX Shared memory is actually implemented using files backed by a RAM-based filesystem. I recommend using this mechanism over the SysV one because of its more elegant file-based semantics.

  • SEMAPHORES Semaphores are locking and synchronization mechanisms, used most widely when processes share resources. Linux supports both SysV semaphores and POSIX semaphores. POSIX semaphores provide a simpler and more elegant implementation and are thus more widely used than SysV semaphores on Linux.

  • FUTEXES Futexes are high-performance, low-overhead locking mechanisms provided by the kernel. Direct use of futexes is highly discouraged in system programs. Futexes are used internally by the POSIX threading API for its mutex and condition variable implementations.

  • UNIX DOMAIN SOCKETS UNIX Domain Sockets provide a mechanism for implementing applications that communicate using the Client-Server architecture. They support both stream and datagram oriented communication, are full-duplex and support a variety of options. They are very widely used for developing many large-scale frameworks.

  • NETLINK SOCKETS Netlink sockets are similar to UNIX Domain Sockets in their API semantics, but are used mainly for two purposes:

    • For communication between a process in user-space and a thread in kernel-space.

    • For communication amongst processes in user-space using broadcast mode.

  • NETWORK SOCKETS Based on the same API semantics as UNIX Domain Sockets, the Network Sockets API provides mechanisms for communication between processes that run on different hosts on a network. Linux has rich support for features and various protocol stacks for using the network sockets API. For all kinds of network programming and distributed programming, the network socket APIs form the core interface.

  • INOTIFY MECHANISMS The Inotify API on Linux provides a method for processes to know of any changes on a monitored file or a directory asynchronously. By adding a file to inotify watch-list, a process will be notified by the kernel on any changes to the file like open, read, write, changes to file stat, deleting a file and so on.

  • FUSE SUBSYSTEM FUSE provides a method to implement a fully functional filesystem in user-space. Various operations on the mounted FUSE filesystem would trigger functions registered by the user-space filesystem handler process. This technique can also be used as an IPC mechanism to implement Client-Server architecture without using socket API semantics.

  • D-BUS SUBSYSTEM D-Bus is a high-level IPC mechanism, generally built on top of the socket API, that lets multiple processes communicate with each other using various messaging patterns. D-Bus is a standard specification for inter-process communication and is very widely used today by GUI implementations on Linux that follow the Freedesktop.org specifications.
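
To make the SIGNALS item above concrete, here is a minimal sketch in which a process registers a handler for SIGUSR1 and then sends that signal to itself. The function and variable names are my own, and the short polling loop is only there to make the example robust; it is not how a real program would wait for a signal.

```python
import os
import signal
import time

received = []


def on_usr1(signum, frame):
    """Handler run by the interpreter once SIGUSR1 has been delivered."""
    received.append(signum)


def send_signal_to_self():
    """Register a handler, send SIGUSR1 to this process, wait for delivery."""
    received.clear()
    previous = signal.signal(signal.SIGUSR1, on_usr1)
    try:
        os.kill(os.getpid(), signal.SIGUSR1)
        deadline = time.monotonic() + 2.0
        while not received and time.monotonic() < deadline:
            time.sleep(0.01)
    finally:
        if previous is not None:
            signal.signal(signal.SIGUSR1, previous)  # restore old disposition
    return list(received)


if __name__ == "__main__":
    print("handler saw:", send_signal_to_self())
```

Without the registered handler, the default disposition of SIGUSR1 would terminate the process, which is the "default signal disposition" case mentioned above.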
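
The parent-creates-pipe-then-forks pattern from the ANONYMOUS PIPES item can be sketched as follows. The function name and message are illustrative; each side closes the end it does not use, which is what makes the pipe half-duplex in practice.

```python
import os


def parent_reads_from_child(message=b"hello from child"):
    """Create a pipe, fork, let the child write and the parent read."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: only writes, so close the inherited read end first.
        try:
            os.close(read_fd)
            os.write(write_fd, message)
            os.close(write_fd)
        finally:
            os._exit(0)  # leave immediately, skipping the parent's cleanup
    # Parent: only reads; closing its write end lets read() reach EOF.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        data = reader.read()
    os.waitpid(pid, 0)
    return data


if __name__ == "__main__":
    print(parent_reads_from_child())
```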
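
For the NAMED PIPES item, the sketch below creates a FIFO file in a temporary directory and passes a message through it. The file name demo.fifo is hypothetical, and fork is used only to keep the example self-contained; the point of a FIFO is that the writer and reader could equally be unrelated programs that agree on the path.

```python
import os
import tempfile


def send_through_fifo(message=b"hello via fifo"):
    """Write `message` into a FIFO from one process and read it in another."""
    with tempfile.TemporaryDirectory() as tmp:
        fifo_path = os.path.join(tmp, "demo.fifo")  # hypothetical file name
        os.mkfifo(fifo_path)
        pid = os.fork()
        if pid == 0:
            # Writer process: open() blocks until a reader opens the FIFO.
            try:
                with open(fifo_path, "wb") as writer:
                    writer.write(message)
            finally:
                os._exit(0)
        # Reader process: read() returns once the writer closes its end.
        with open(fifo_path, "rb") as reader:
            data = reader.read()
        os.waitpid(pid, 0)
        return data


if __name__ == "__main__":
    print(send_through_fifo())
```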
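
For the SHARED MEMORY item, Python's multiprocessing.shared_memory module (Python 3.8+) offers a convenient way to try POSIX shared memory on Linux. This is a sketch under the assumption that a RAM-backed /dev/shm is available; the second handle is attached in the same process purely for brevity, where normally another process would attach by name.

```python
from multiprocessing import shared_memory


def share_bytes(payload=b"db2"):
    """Write `payload` into a shared segment and read it via a second handle."""
    owner = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        owner.buf[:len(payload)] = payload
        # Attach by name, the way a cooperating process would.
        peer = shared_memory.SharedMemory(name=owner.name)
        try:
            return bytes(peer.buf[:len(payload)])
        finally:
            peer.close()
    finally:
        owner.close()
        owner.unlink()  # remove the segment once nobody needs it


if __name__ == "__main__":
    print(share_bytes())
```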
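
The full-duplex nature of UNIX DOMAIN SOCKETS can be seen with a connected socket pair, where each end both sends and receives. This is a minimal single-process sketch with made-up message contents; a real client and server would use bind, listen and connect on a socket path instead.

```python
import socket


def ping_pong():
    """Exchange one message in each direction over a UNIX socket pair."""
    left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with left, right:
        left.sendall(b"ping")
        request = right.recv(4)
        right.sendall(b"pong")  # same socket, opposite direction
        reply = left.recv(4)
    return request, reply


if __name__ == "__main__":
    print(ping_pong())
```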

You can use the ipcs command to show information about IPC facilities: shared memory segments, message queues, and semaphore arrays.

ipcs -l

------ Shared Memory Limits --------
max number of segments = 4096 // SHMMNI
max seg size (kbytes) = 32768 // SHMMAX
max total shared memory (kbytes) = 8388608 // SHMALL
min seg size (bytes) = 1

------ Semaphore Limits --------
max number of arrays = 1024 // SEMMNI
max semaphores per array = 250 // SEMMSL
max semaphores system wide = 256000 // SEMMNS
max ops per semop call = 32 // SEMOPM
semaphore max value = 32767

------ Messages: Limits --------
max queues system wide = 1024 // MSGMNI
max size of message (bytes) = 65536 // MSGMAX
default max size of queue (bytes) = 65536 // MSGMNB

You can also use the sysctl command to view kernel parameters:

#sysctl -a | grep -i shmmni
kernel.shmmni = 4096

or

#sysctl kernel.shmmni
kernel.shmmni = 4096
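
The same values are exposed as files under /proc/sys, with the dots in the sysctl name replaced by directory separators. The sketch below shows that mapping; the function names are my own, and the simple split on "." is an assumption that holds for the kernel.* parameters discussed here but not for names whose components themselves contain dots.

```python
import os


def sysctl_to_path(name, proc_root="/proc/sys"):
    """Map a sysctl name such as 'kernel.shmmni' to its procfs path."""
    return os.path.join(proc_root, *name.split("."))


def read_sysctl(name, proc_root="/proc/sys"):
    """Return the raw string value of a kernel parameter."""
    with open(sysctl_to_path(name, proc_root)) as handle:
        return handle.read().strip()


if __name__ == "__main__":
    print(sysctl_to_path("kernel.shmmni"))
```

On a Linux host, read_sysctl("kernel.shmmni") should return the same number that sysctl kernel.shmmni prints.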

Modify Kernel Parameters

The following is from the post Db2 Modify Kernel Parameters.

Modify the kernel parameters that you have to adjust by editing the /etc/sysctl.conf file. If this file does not exist, create it. The following lines are examples of what must be placed into the file:

kernel.shmmni=4096
kernel.shmmax=17179869184
kernel.shmall=8388608
#kernel.sem=<SEMMSL> <SEMMNS> <SEMOPM> <SEMMNI>
kernel.sem=4096 1024000 250 4096
kernel.msgmni=16384
kernel.msgmax=65536
kernel.msgmnb=65536
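
Two details in these lines are easy to misread: kernel.sem packs four limits into one value, and kernel.shmall is counted in pages while kernel.shmmax is in bytes. The sketch below unpacks both, using the example values above and assuming a 4096-byte page size (os.sysconf("SC_PAGE_SIZE") reports the real one); the helper names are my own.

```python
SEM_FIELDS = ("SEMMSL", "SEMMNS", "SEMOPM", "SEMMNI")  # order used by kernel.sem


def parse_sem(value):
    """Split a kernel.sem value into its four named limits."""
    numbers = [int(part) for part in value.split()]
    if len(numbers) != len(SEM_FIELDS):
        raise ValueError(f"expected 4 fields, got {len(numbers)}")
    return dict(zip(SEM_FIELDS, numbers))


def shmall_bytes(shmall_pages, page_size=4096):
    """Convert kernel.shmall (counted in pages) to bytes."""
    return shmall_pages * page_size


if __name__ == "__main__":
    print(parse_sem("4096 1024000 250 4096"))
    print("shmall:", shmall_bytes(8388608) // 2**30, "GiB in total")
    print("shmmax:", 17179869184 // 2**30, "GiB per segment")
```

With these example values, the system-wide total (shmall) works out to twice the largest allowed single segment (shmmax).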

Reload settings from the default file /etc/sysctl.conf:

sysctl -p

On Red Hat, the rc.sysinit initialization script reads the /etc/sysctl.conf file automatically after each reboot.

You can also change a parameter temporarily (the change is lost on reboot), for example:

sysctl -w kernel.shmmni=4096
sysctl -w kernel.sem="4096 1024000 250 4096"

Or write it directly to the procfs file:

echo "4096 1024000 250 4096" > /proc/sys/kernel/sem