Let's first see how cron is defined on the wiki:

The software utility cron is a time-based job scheduler in Unix-like computer operating systems. People who set up and maintain software environments use cron to schedule jobs (commands or shell scripts) to run periodically at fixed times, dates, or intervals.

cron is most suitable for scheduling repetitive tasks. For example, it runs log file rotation utilities to ensure that your hard drive doesn’t fill up with old log files. You should know how to use cron because it’s just plain useful.

Also see CronJob in K8s.

Install crontab

In CentOS or RedHat, you can run:

yum install cronie

If you are not sure, try yum provides crontab to see which package will provide this service.

To check whether the cron service is running:

systemctl status crond

If inactive, enable and restart it.

crontab File

Cron is driven by a crontab (cron table) file, a configuration file that specifies shell commands to run periodically on a given schedule.

The program running through cron is called a cron job. To install a cron job, you’ll create an entry line in your crontab file, usually by running the crontab command.

Each user can have their own crontab file, which means a system may have multiple crontabs, usually found in the /var/spool/cron/ folder. The crontab command installs, lists, edits, and removes a user's crontab.

crontab Commands

For example, running as root, I want to set up a recurring task for user dsadm:

crontab -u dsadm -e

Then edit like this:

00 21 * * * /home/dsadm/test.sh > /tmp/cron-log 2>&1

This means that every day at 9:00 PM, user dsadm will run test.sh and redirect its output to the /tmp/cron-log file. You can also put the entries into a file and run:

crontab -u dsadm <entry file>

The meaning of the entry is:

# ┌───────────── minute (0 - 59)
# │ ┌───────────── hour (0 - 23)
# │ │ ┌───────────── day of the month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12)
# │ │ │ │ ┌───────────── day of the week (0 - 6) (Sunday to Saturday;
# │ │ │ │ │ 7 is also Sunday on some systems)
# │ │ │ │ │
# │ │ │ │ │
# * * * * * command to execute

A * in any field means to match every value.
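For reference, a few common schedule patterns (a sketch; the script paths are hypothetical):

# every 15 minutes
*/15 * * * * /home/dsadm/poll.sh
# at 6:30 AM on weekdays (Monday through Friday)
30 6 * * 1-5 /home/dsadm/report.sh
# at midnight on the 1st of every month
0 0 1 * * /home/dsadm/rotate.sh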

Now, if you check the /var/spool/cron directory, the dsadm crontab file has been created there.

To list the dsadm cron job:

crontab -u dsadm -l

To remove dsadm cron job:

crontab -u dsadm -r

Run as Non-Root

If you want to run crontab as dsadm itself, you must set up the cron permissions:

  • /etc/cron.allow - If this file exists, it must contain your username for you to use cron jobs.
  • /etc/cron.deny - If cron.allow does not exist but /etc/cron.deny does, then you must not be listed in /etc/cron.deny in order to use cron jobs.

So, if you put dsadm in the /etc/cron.allow file, dsadm can use crontab directly.
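For example, a minimal sketch run as root:

# allow dsadm to use crontab
echo "dsadm" >> /etc/cron.allow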

System crontab File

Linux distributions normally have an /etc/crontab file. You can also edit here, but the format is a little bit different: it has an extra user-name field.

# Example of job definition:
# .---------------- minute (0 - 59)
# | .------------- hour (0 - 23)
# | | .---------- day of month (1 - 31)
# | | | .------- month (1 - 12) OR jan,feb,mar,apr ...
# | | | | .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# | | | | |
# * * * * * user-name command to be executed
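For example, the earlier dsadm job would look like this as a system crontab entry, with the extra user field (same hypothetical script as above):

00 21 * * * dsadm /home/dsadm/test.sh > /tmp/cron-log 2>&1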

This blog is about the Network File System (NFS) with only one server (newer NFS versions support multi-server setups too). For distributed file systems (multiple servers): OpenAFS, GlusterFS (native support on RedHat/CentOS). For cluster file systems: GFS2 (native Linux support).

So, what is the difference among network, distributed, and cluster file systems? One more note: distributed or network file systems also ensure data availability across multiple server nodes and can usually handle nodes being added and removed more gracefully. Don't make assumptions about how well this will work; be sure to test not only for performance and latency, but also for the impact of changes to the cluster and for failure scenarios.

NFS is a concept (a protocol), not a file system type; it is a way of sharing a file system, and by default the local Linux file system type is used underneath. NFS allows remote hosts to mount file systems (of any type) over a network and interact with those file systems as though they were mounted locally.

NFS lets you leverage storage space in a different location and allows you to write onto the same space from multiple servers or clients in an effortless manner. It, thus, works fairly well for directories that users need to access frequently.

A common use case is sharing home directories: if you switch to a different host, you just mount them. NFS also supports diskless machines. But be careful with UIDs and GIDs; ensure they map to the same person on each machine.

Server Setup

Actually, there are many other packages that may be needed; this video may give you more details.

# install nfs package
yum -y install nfs-utils

Then enable and start the NFS server:

# RPC should be on by default, but may not be
# RPC stands for Remote Procedure Call; the NFS server needs it
# to access and operate on other hosts
systemctl enable rpcbind
systemctl start rpcbind

systemctl enable nfs-server.service
systemctl start nfs-server.service
# create the server-side shared folder
mkdir -p /data
chmod -R 755 /data

Edit the /etc/exports file to expose the shared folder:

vi /etc/exports 

# any client can access this folder
/data *(rw,insecure,async,no_root_squash)
# or specific client can access this folder
/data <client ip>(rw,insecure,async,no_root_squash)

Then export the shared folder:

exportfs -a
systemctl restart nfs-server.service

Check that the shared folder is up with showmount -e <server ip>:

showmount -e localhost

If you change the content of /etc/exports, you need to reload:

exportfs -ra

Check mount options:

exportfs -v

Client Setup

First, create a folder (or use an existing one) as the mount point; here I use the /mntiis folder on the client machine:

# install nfs package
yum -y install nfs-utils

# create client side shared folder
mkdir -p /mntiis
chmod -R 0755 /mntiis

If you use a non-persistent mount on the command line, the mount will disappear after rebooting:

mount <server ip>:/data     /mntiis    

For a persistent mount, add this line to the /etc/fstab file:

<server ip>:/data /mntiis nfs defaults 0 0
# or with other mount options
<server ip>:/data /mntiis nfs defaults,timeo=10,retrans=3,rsize=1048576,wsize=1048576 0 0

Enable mount:

mount /mntiis
# or reload all in fstab file
mount -a

To verify everything is set, check that /mntiis appears in the output:

df -hk

When you remove the entry from the /etc/fstab file, also unmount the folder; otherwise you will see /mntiis marked with ? in the file system listing:

umount /mntiis

Mount On-demand

You can set up on-demand NFS mounting via autofs; this avoids wasting resources by unmounting the mount point when it is not in use and mounting it again when it is accessed.

yum install -y autofs
systemctl start autofs
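A minimal sketch of the autofs configuration (the map file name and timeout are assumptions; check your distribution's defaults):

# /etc/auto.master: manage mounts under /mnt using the map below,
# unmount after 60 seconds of inactivity
/mnt /etc/auto.nfs --timeout=60

# /etc/auto.nfs: accessing /mnt/data triggers the NFS mount
data -fstype=nfs,rw <server ip>:/data

Then restart autofs (systemctl restart autofs) and simply cd into /mnt/data; the mount happens on demand.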

Here I list the music I have collected and liked over the past decade or so... (it hasn't been updated for a while)

Piano pieces I play 😃

On June 17, 2019, I received the YAMAHA P-125 I had bought, and my self-taught piano journey began from there; this is a record of my progress and joy.

Most favorite 😍

Favorites 🌹

🎤奔跑 🎤嘻唰唰 🎤浪漫满屋 🎤快乐崇拜 🎤数码宝贝Brave Heart 🎤数码宝贝Butterfly 🎤你的微笑 🎹梦中的婚礼 🎤不死之身 🎤江南 🎹星空 🎤Lydia 🎤Soledad 🎤Evergreen 🎤Black Black Heart 🎤7 Days 🎹Summer 🎤紫藤花 🎤I swear 🎹River Flows in You 🎤Brave 🎹忧伤还是快乐 🎤留在我身边 🎤情非得已 🎹The Sound of Silence 🎤残酷な天使のテーゼ 🎤分开旅行 🎤老男孩 🎹幽灵公主 🎤心跳 🎹🎻Santorini 🎤The Fox 🎤Counting Stars 🎤Stronger 🎤转动命运之轮 🎤下个,路口,见 🎹Run Away With Me 🎻Love Me Like You Do 🎤生来倔强 🎤飞-致我们的星辰大海 🎤追梦赤子心 🎤小苹果 🎤荷塘月色 🎤温柔 🎤远走高飞 🎤美人鱼 🎤杀破狼 🎤三国恋 🎤时间飞了 🎤曾经的你 🎤稻香 🎤偏爱 🎤Adventure Of A Lifetime

Casual notes 😆

🎤没那么简单 🎤王妃 🎤最炫民族风 🎤猪之歌 🎤我相信 🎤燃烧你的卡路里 🎤Rolling in the Deep 🎤幸福糊涂虫 🎤No Promises 🎤Whataya Want from Me 🎤 🎤羅曼蒂克的愛情 🎤江湖笑 🎤你不是真正的快乐 🎤新贵妃醉酒 🎤Attention 🎤最初的梦想 🎤给我你的爱 🎤咱们结婚吧 🎤最好的舞台 🎤最美情侣 🎹Horizon 🎤空空如也 🎤渴望光荣 🎤那么骄傲 🎤CAN’T STOP THE FEELING 🎤明天会更好 🎤Love Me Again 🎤平凡之路 🎤怒放的生命 🎤新的心跳 🎤Be what You Wanna Be 🎤回到过去 🎤青花瓷 🎤夜的第七章 🎤不仅仅是喜欢 🎤夜曲 🎤只要有你 🎤有點甜 🎤Nevada 🎤光年之外 🎤戰火榮耀

In non-root development for DataStage, in order to launch applications as a non-root user, I use su - dsadm in my commands. Later I occasionally noticed that badal uses su dsadm... So what are the differences?

There are two main shell instance types: interactive and noninteractive, but of those, only interactive shells are of interest because noninteractive shells (such as those that run shell scripts) usually don’t read any startup files.

Interactive shells are the ones you use to run commands from a terminal; they can be classified as login or non-login. There are lots of startup files under each user's home directory: how are they invoked, and in what order?

Login Shell

Logging in remotely with SSH will give you a login shell (because we actually need credentials to log in).

You can tell if a shell is a login shell by running echo $0; if the first character is a -, the shell’s a login shell.

# if not sure it's login or non-login, check it
echo $0
-bash

When Bash is invoked as a Login shell:

  1. Login process calls /etc/profile (this is for all users)
  2. /etc/profile calls the scripts in /etc/profile.d/
  3. Login process calls $HOME/.bash_profile, $HOME/.bash_login, and $HOME/.profile in order; the first file found is run and the rest are ignored (most Linux distributions use only one or two of these three startup files). Notice that $HOME/.bashrc is not in the list; it is typically run from one of these files.

Login shells are also created by explicitly asking to log in; there is a -, -l, or --login flag:

su -
su -l
su --login
su USERNAME -
su -l USERNAME
su --login USERNAME
sudo -i

Non-login Shell

When bash is invoked as a non-login shell (for example, you just run bash or sh without logging in; if you first SSH into the system and then run bash, that is a non-login shell):

  1. Non-login process (shell) calls /etc/bashrc
  2. then calls $HOME/.bashrc (remember!)

Non-login shells are created using the commands below:

su
su USERNAME

Of course, you can source $HOME/.bashrc from $HOME/.bash_profile to satisfy both login and non-login shells: add . $HOME/.bashrc to $HOME/.bash_profile (it is usually there by default).
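The typical stanza looks like this:

# in $HOME/.bash_profile: source .bashrc so login shells get the same setup
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi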

The reasoning behind the two different sets of startup files is that in the old days, users logged in through a traditional terminal with a login shell, then started non-login subshells with windowing systems or the screen program. For the non-login subshells, it was deemed a waste to repeatedly set the user environment and run a bunch of programs that had already been run. With login shells, you could run fancy startup commands in a file such as .bash_profile, leaving only aliases and other "lightweight" things to your .bashrc.

This explains why, if you use a non-login shell like su dsadm, the parent's exported environment variables are still there in the env scope, but if you run su - dsadm, they are gone.
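A quick way to see this yourself (run as root; MARKER is a hypothetical variable):

export MARKER=hello
su dsadm -c 'echo $MARKER'    # prints "hello": the non-login shell keeps the environment
su - dsadm -c 'echo $MARKER'  # prints an empty line: the login shell resets the environment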

Bash Parameters

# Make bash act as if it had been invoked as a login shell
# (it is not a real login shell though, as `echo $0` shows)
bash --login
bash -l

# -c string: if the -c option is present, commands are read from string.
# If there are arguments after the string,
# they are assigned to the positional parameters, starting with $0.
# This is not an interactive shell.
bash -c /tmp/test.sh hello world!

# do not run ~/.bashrc; by default same as `sh`
bash --norc

# only useful for login shells: skip all profile initialization
bash --noprofile

# specify another script to replace .bashrc
bash --rcfile <path to file>

# do syntax check only
bash -n <script>

Interesting question: how to start a shell in a clean environment.

~/.bash_logout will be executed when exiting a login shell.

Question

Does the init process of a Docker container or K8s pod initialize ~/.bashrc? Is this init process a login shell or a non-login shell? Here is how I understand it: when a K8s pod or Docker container runs, you can set environment variables for it, and those variables can be used by scripts. Environment variables are independent of the login/non-login concept: as long as there is a bash environment, environment variables exist and can be used. The init process is neither login nor non-login; it is just an ordinary internal program.

cURL stands for Client URL; it is a command-line tool for getting or sending data, including files, using URL syntax. Since cURL uses libcurl, it supports a range of common network protocols, currently including HTTP, HTTPS, FTP, FTPS, SCP, SFTP, TFTP, LDAP, DAP, DICT, TELNET, FILE, IMAP, POP3, SMTP and RTSP.

Here is an open-source book: Everything curl.

About curl proxy environment variables (see the man page for more details):

  • http_proxy (lower case only) to specify the HTTP proxy.
  • https_proxy to specify the HTTPS proxy (curl talks to the proxy over ssl/tls).
  • NO_PROXY or --noproxy '*' to skip the proxy for all hosts, or give a list of hosts to skip.

When using curl to download files, make sure it actually downloads the file rather than an HTML page; you can check the file type via the file command, for example:

curl -Ss <URL> -o xxx.tar.gz
# Check if it is a real tar gz file
file xxx.tar.gz | grep gzip

If the URL is wrong, curl will download the HTML error page instead, while wget will show a 404 error and fail.
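To make curl itself fail on an HTTP error instead of saving the error page, you can add -f (a sketch with a placeholder URL):

# -f, --fail: exit with an error (code 22) on HTTP 4xx/5xx instead of saving the body
curl -fsS <URL> -o xxx.tar.gz || echo "download failed"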

Date and Note

05/23/2019 download file
06/16/2019 http request verbose
06/22/2020 redirect, PUT/GET/DELETE
09/02/2020 check headers only
09/03/2020 use specific forward proxy
09/06/2020 resume download
09/07/2020 limit rate
09/08/2020 fetch headers only
09/09/2020 proxy tunnel
10/25/2022 name resolve tricks
04/07/2022 retry
04/01/2023 post request with json payload
04/03/2023 trace and see request body

05/23/2019

If the file server needs a user name/password (it usually prompts when you open the URL in a browser):

# -O: downloads the file and saves it with the same name as in the URL.
# -u: specify "user:password"
# -k: explicitly allows curl to perform 'insecure' SSL connections and transfers.
# -L: allow redirect
# -o: custom file name
# -s: silent, suppress all output
# -S: used with silent, still show error messages
USER="username"
PASSWD="passwd"
USER_PWD="$USER:$PASSWD"
STREAMURL="https://xxx/target.tar.gz"
# Download file with user/password.
curl -k -u ${USER_PWD} -LO ${STREAMURL}

# Download file with the custom name /tmp/archive.tgz, silently.
curl -sS <url> -o /tmp/archive.tgz

If you don't turn off the certificate check (-k), you will get an error message and fail:

curl: (60) Peer's Certificate has expired.
More details here: http://curl.haxx.se/docs/sslcerts.html

curl performs SSL certificate verification by default, using a "bundle"
of Certificate Authority (CA) public keys (CA certs). If the default
bundle file isn't adequate, you can specify an alternate file
using the --cacert option.
If this HTTPS server uses a certificate signed by a CA represented in
the bundle, the certificate verification probably failed due to a
problem with the certificate (it might be expired, or the name might
not match the domain name in the URL).
If you'd like to turn off curl's verification of the certificate, use
the -k (or --insecure) option.

06/16/2019

When I was working with the Docker Registry API, I primarily used curl to do the job.

# -v: Makes the fetching more verbose/talkative. Mostly useful for debugging. A
# line starting with `>` means `header data` sent by curl, `<` means `header
# data` received by curl that is hidden in normal cases, and a line starting
# with `*` means additional info provided by curl.

# -X: (HTTP) Specifies a custom request method to use when communicating with
# the HTTP server.
curl -v -k -XGET http://localhost:5000/v2/_catalog

06/22/2020

# -L: redirect
# -i: include response header info to output
curl -iL http://...

# -v: verbose, to see better error message
curl -v http://...

# -d|--data: value to be put
# -X, --request
curl -X PUT -d '50s' http://localhost:8500/v1/kv/prod/portal/haproxy/timeout-server
curl -X DELETE http://localhost:8500/v1/kv/prod/portal/haproxy/timeout-server
curl -X GET http://localhost:8500/v1/kv/prod/portal/haproxy/timeout-server?pretty

09/02/2020

Add additional headers, only show header info:

# -H,--header: add header
# -L: redirect
# -I: fetch headers only
# -v: verbose, to see better error message
# -s: hide progress bar, silent
# > /dev/null: hide output, show only the -v output
curl --header "Host: chengdol.github.io" \
--header "..." \
-L -Ivs http://185.199.110.153 > /dev/null

09/03/2020

Using HTTP proxy (forward) with proxy authentication, learned from Envoy.

# -x, --proxy: use the specific forward proxy
curl -v -x <proxy:port> http://www.example.com
# the same as
http_proxy=<proxy:port> curl -v http://www.example.com

# -U, --proxy-user: user/password for proxy itself.
# this option overrides the existing proxy environment variable
curl -v -U <user:password> -x <proxy:port> http://www.example.com

By default curl uses the Basic authentication scheme. Some proxies will require another authentication scheme (and the headers that are returned when you get a 407 response will tell you which).

## --proxy-anyauth: ask curl to use any method the proxy wants
curl -U <user:password> --proxy-anyauth -x myproxy:80 http://example.com

09/06/2020

Resume the download:

curl <url> -o archive
# then break and resume
# -C -: automatically resume from the break point
curl <url> -C - -o archive

09/07/2020

You can limit the download/upload speed if you have limited bandwidth.

# limit rate as 1m/second, for example: 10k, 1g
curl <url> -O --limit-rate 1m

09/08/2020

Only fetch header information, no body:

# -I,--head: (HTTP FTP FILE) Fetch the headers only!
# note that -i includes the response headers in the output; they are different
curl -I <url>

09/09/2020

Non-HTTP protocols over an HTTP proxy: most HTTP proxies allow clients to "tunnel through" to a server on the other side. That's exactly what's done every time you use HTTPS through an HTTP proxy.

# -p, --proxytunnel: make curl tunnel through the proxy
# used with -x, --proxy options
# here tunnel ftp protocol
curl -p -x http://proxy.example.com:80 ftp://ftp.example.com/file.txt

10/25/2022

This entry covers curl name resolve tricks; see also another post about it.

This feature is primarily used for HTTP server development and testing the server locally to mimic real world situations.

# First start a simple http server within python virtualenv:
python3 -m http.server
Serving HTTP on :: port 8000 (http://[::]:8000/) ...
# Override the default Host header otherwise the Host header value
# will be localhost:8000
# This is not enough for HTTPS server due to SNI, see next --resolve.
curl -vI -H "Host: www.example.com:80" http://localhost:8000
# The header info:
> HEAD / HTTP/1.1
> Host: www.example.com:80
> User-Agent: curl/7.84.0
> Accept: */*

# www.myfakelist.com does not exist; resolve it to 127.0.0.1
# port 8000 must be the same.
curl -vI --resolve www.myfakelist.com:8000:127.0.0.1 http://www.myfakelist.com:8000
# The header info:
> GET / HTTP/1.1
> Host: www.myfakelist.com:8000
> User-Agent: curl/7.84.0
> Accept: */*
# works for HTTPS server as well
curl -kvI --resolve www.myfakelist.com:443:127.0.0.1 https://www.myfakelist.com

# www.myfakelist.com does not exist; map the name:port
# to localhost:8000, which points to the fake python http server.
curl -vI --connect-to www.myfakelist.com:80:localhost:8000 http://www.myfakelist.com
# The header info:
> GET / HTTP/1.1
> Host: www.myfakelist.com
> User-Agent: curl/7.84.0
> Accept: */*
# works for HTTPS server as well
curl -kv --connect-to www.myfakelist.com:443:127.0.0.1:443 https://www.myfakelist.com

04/07/2022

Retry until the condition limits are reached:

# --retry: retries on a timeout, an FTP 4xx response code or an HTTP 408, 429,
# 500, 502, 503 or 504 response code.
# --retry-all-errors: retry on any error. If you want to retry on all response
# codes that indicate HTTP errors (4xx and 5xx) then combine with `-f`, `--fail`.
# --retry-delay: if not set, curl doubles the waiting time between retries
# until it reaches 10 minutes.
# --retry-max-time: total time limit for retries.
# --retry-connrefused: also retry when the connection is refused.
# -m, --max-time: timeout for the whole operation.
# --connect-timeout: timeout for curl's connection to take.
curl -kI \
--max-time 3.55 \
--connect-timeout 2.12 \
--retry-all-errors \
--retry 3 \
--retry-delay 1 \
--retry-max-time 10 \
https://no-op

# output
curl: (6) Could not resolve host: no-op
Warning: Problem : timeout. Will retry in 2 seconds. 3 retries left.
curl: (6) Could not resolve host: no-op
Warning: Problem : timeout. Will retry in 2 seconds. 2 retries left.
curl: (6) Could not resolve host: no-op
Warning: Problem : timeout. Will retry in 2 seconds. 1 retries left.
curl: (6) Could not resolve host: no-op

04/01/2023

For example, a POST request with a JSON payload looks like:

curl -vs -X POST \
--http1.1 \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $TOKEN" \
-d @payload.json \
https://xxx.com/v1/do-something

Regardless of the format in payload.json, curl with -d will always remove the line returns, one-line the JSON content, and send it. This is different from Postman, which sends the payload with line returns preserved (and that could be flagged as a security issue by firewall rules).
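If the payload must be sent exactly as it is in the file, line returns included, --data-binary avoids this stripping; a sketch reusing the request above:

# --data-binary @file: post the file content as-is, without stripping newlines
curl -vs -X POST \
--http1.1 \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $TOKEN" \
--data-binary @payload.json \
https://xxx.com/v1/do-something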

04/03/2023

To check the request payload sent, use --trace-ascii:

curl -vs -X POST \
--http1.1 \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $TOKEN" \
-d @payload.json \
https://xxx.com/v1/do-something \
--trace-ascii /dev/stdout

This blog is a follow-up of <<Docker Run Reference>>.

When building an image from a Dockerfile or by committing a running container, we can set startup parameters for the new image.

Four of the Dockerfile commands cannot be overridden at runtime: FROM, MAINTAINER, RUN, and ADD. Everything else has a corresponding override in docker run command.

CMD

The CMD can be the default startup command for a container or the arguments for entrypoint.

docker run [OPTIONS] IMAGE[:TAG|@DIGEST] [COMMAND] [ARG...]

If the image has an ENTRYPOINT specified, then the CMD or COMMAND is appended as arguments to the ENTRYPOINT (see the next section).
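A quick sketch of this interplay, assuming a hypothetical image whose ENTRYPOINT is ["/bin/echo"] and CMD is ["hello"]:

docker run <image>         # runs: /bin/echo hello
docker run <image> world   # runs: /bin/echo world (COMMAND replaces CMD)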

For example, to override the CMD in busybox with /bin/sh -c ls -ltr:

docker run -it busybox /bin/sh -c ls -ltr

You can use inspect to check the default CMD of an image; it shows that the default CMD for busybox is [sh]:

# There is also a "ContainerConfig" section, but it is not related to CMD.
docker inspect -f "{{.Config.Cmd}}" busybox

If you override it with /bin/sh -c ls -ltr as in the example above, you can see under the COMMAND column that it changes to /bin/sh -c ls -ltr, easy to verify:

# --no-trunc: no truncate output
docker ps -a --no-trunc

ENTRYPOINT

The ENTRYPOINT is the default start point of the running container.

# Overwrite the default entrypoint set by the image
--entrypoint="":

The ENTRYPOINT of an image is similar to a COMMAND because it specifies what executable to run when the container starts, but it is (purposely) more difficult to override. The ENTRYPOINT gives a container its default nature or behavior, so that when you set an ENTRYPOINT you can run the container as if it was that binary, complete with default options, and you can pass in more options via the COMMAND.

Check the default entrypoint of an image by:

docker inspect -f "{{.Config.Entrypoint}}" <image or container>

To override the entrypoint with /bin/sh and pass it the arguments -c "tail -f /dev/null":

docker run -d \
--entrypoint=/bin/sh \
<image>:<tag> \
-c "tail -f /dev/null"

NOTE: --entrypoint will clear out any default command in image.

EXPOSE

The EXPOSE instruction is for incoming traffic; it takes effect when the ports are published.

--expose=[]: Expose a port or a range of ports inside the container.
These are additional to those exposed by the `EXPOSE` instruction
-P : Publish all exposed ports to the host interfaces
-p=[] : Publish a container's port or a range of ports to the host
format: ip:hostPort:containerPort | ip::containerPort | hostPort:containerPort | containerPort
Both hostPort and containerPort can be specified as a
range of ports. When specifying ranges for both, the
number of container ports in the range must match the
number of host ports in the range, for example:
-p 1234-1236:1234-1236/tcp

When specifying a range for hostPort only, the
containerPort must not be a range. In this case the
container port is published somewhere within the
specified hostPort range. (e.g., `-p 1234-1236:1234/tcp`)

(use 'docker port' to see the actual mapping)

--link="" : Add link to another container (<name or id>:alias or <name or id>)

With the exception of the EXPOSE directive, an image developer hasn’t got much control over networking. The EXPOSE instruction defines the initial incoming ports (listens on specific network ports) that provide services. These ports are available to processes inside the container. An operator can use the --expose option to add to the exposed ports.

NOTE: EXPOSE alone does not allow communication between a container and the host, or with containers on a different network. To allow this, you need to publish the ports.

NOTE: use -P or -p rather than --net=host for incoming traffic.

To expose a container's internal port, use the -P or -p flag. The exposed port is then accessible by any client that can access the host.

NOTE: in K8s, if the pods are in the same namespace, they can communicate with each other; no additional config is needed unless you want to access the pods from outside the cluster.

USER

-u="", --user="": Sets the username or UID used and optionally the groupname o
GID for the specified command.

The followings examples are all valid:
--user=[ user | user:group | uid | uid:gid | user:gid | uid:group ]

root (id = 0) is the default user in a container. The developer can create additional users.

ENV

Docker automatically sets some environment variables when creating a Linux container.

The following environment variables are set for Linux containers:

  • HOME: Set based on the value of USER
  • HOSTNAME: The hostname associated with the container
  • PATH: Includes popular directories, for example: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
  • TERM: xterm if the container is allocated a pseudo-TTY

Additionally, the operator can set any environment variable in the container by using one or more -e flags. If the operator names an environment variable without specifying a value, then the current value of the named variable is populated into the container’s environment.
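For example (MY_FLAG is a hypothetical variable name):

# set an explicit value inside the container
docker run --rm -e MY_FLAG=on busybox env | grep MY_FLAG
# name it without a value to pass through the current value from the host shell
export MY_FLAG=on
docker run --rm -e MY_FLAG busybox env | grep MY_FLAG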

VOLUME

-v, --volume=[host-src:]container-dest[:<options>]: Bind mount a volume.
The comma-delimited `options` are [rw|ro], [z|Z],
[[r]shared|[r]slave|[r]private], and [nocopy].
The 'host-src' is an absolute path or a name value.

If neither 'rw' or 'ro' is specified then the volume is mounted in
read-write mode.

The `nocopy` mode is used to disable automatically copying the requested volume
path in the container to the volume storage location.
For named volumes, `copy` is the default mode. Copy modes are not supported
for bind-mounted volumes.

--volumes-from="": Mount all volumes from the given container(s)

The volumes commands are complex enough to have their own documentation.

The container-dest must always be an absolute path such as /src/docs. The host-src can either be an absolute path or a name value. If you supply an absolute path for the host-src, Docker bind-mounts to the path you specify. If you supply a name, Docker creates a named volume by that name.

For example, you can specify either /foo or foo for a host-src value. If you supply the /foo value, Docker creates a bind mount. If you supply the foo specification, Docker creates a named volume.
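A sketch of both forms (reusing the /src/docs destination from above):

# bind mount: the host directory /foo appears at /src/docs in the container
docker run --rm -v /foo:/src/docs busybox ls /src/docs
# named volume: Docker creates and manages a volume called foo
docker run --rm -v foo:/src/docs busybox ls /src/docs
docker volume ls    # the named volume foo shows up here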

Other Resources

  • Docker run reference
  • Dockerfile reference
  • Expose vs publish: Docker port commands explained simply

This is a summary from Docker run reference

Docker runs processes in isolated containers. A container is a process which runs on a host. The host may be local or remote. When an operator executes docker run, the container process that runs is isolated in that it has its own file system, its own networking, and its own isolated process tree separate from the host.

docker run [OPTIONS] IMAGE[:TAG|@DIGEST] [COMMAND] [ARG...]

The docker run command can override nearly all the defaults set by the Docker image.

Let’s first see the docker run command I encountered:

docker run --detach \
--name=${DB2_XMETA_HOST} \
--restart=always \
--privileged=false \
--cap-add=SYS_NICE \
--cap-add=IPC_OWNER \
--cap-add=SETFCAP \
--user 1000 \
-e MY_POD_NAMESPACE=${MY_POD_NAMESPACE} \
-e SHARED_VOL=${SHARED_REPOS_VOLPATH} \
--hostname=${DB2_XMETA_HOST} \
-p ${DB2_XMETA_PORT}:${DB2_XMETA_PORT} \
-v ${SHARED_VOL}:${SHARED_VOL} \
is-xmetadocker:11.7.1

Detached [-d]

To start a container in detached mode, you use -d=true or just -d option. By design, containers started in detached mode exit when the root process used to run the container exits, unless you also specify the --rm option. If you use -d with --rm, the container is removed when it exits or when the daemon exits, whichever happens first.

This is why we specify tail -f /dev/null at the end of the container's start script.

Foreground

In foreground mode (the default when -d is not specified), docker run can start the process in the container and attach the console to the process’s standard input, output, and standard error. It can even pretend to be a TTY (this is what most command line executables expect) and pass along signals.

For interactive processes (like a shell), you must use -it together in order to allocate a tty for the container process.

For example:

docker run -it --rm busybox /bin/sh

This directly opens a shell to operate in the container; once you exit, the container is removed.

Name [--name]

Specify container name. If you do not assign a container name with the --name option, then the daemon generates a random string name for you. Defining a name can be a handy way to add meaning to a container.

IPC settings [--ipc]

--ipc="MODE"  : Set the IPC mode for the container

IPC (POSIX/SysV IPC) namespace provides separation of named shared memory segments, semaphores and message queues.

Shared memory segments are used to accelerate inter-process communication at memory speed, rather than through pipes or through the network stack.

--ipc=<Value>

The possible values are:

  • “”: Use daemon’s default.
  • “none”: Own private IPC namespace, with /dev/shm not mounted.
  • “private”: Own private IPC namespace.
  • “shareable”: Own private IPC namespace, with a possibility to share it with other containers.
  • “container:<name-or-ID>”: Join another (“shareable”) container’s IPC namespace.
  • “host”: Use the host system’s IPC namespace.

If not specified, daemon default is used, which can either be private or shareable, depending on the daemon version and configuration.

If these types of applications are broken into multiple containers, you might need to share the IPC mechanisms of the containers, using “shareable” mode for the main container, and container:<donor-name-or-ID> for other containers.
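A sketch of that pattern (container names are made up):

# the main container owns a shareable IPC namespace
docker run -d --name main --ipc=shareable busybox sleep 3600
# the worker joins the main container's IPC namespace
docker run -d --name worker --ipc=container:main busybox sleep 3600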

Network settings

--dns=[]           : Set custom dns servers for the container
--network="bridge" : Connect a container to a network
'bridge': create a network stack on the default Docker bridge
'none': no networking
# set this to join other's network
'container:<name|id>': reuse another container's network stack
'host': use the Docker host network stack
'<network-name>|<network-id>': connect to a user-defined network
--network-alias=[] : Add network-scoped alias for the container
--add-host="" : Add a line to /etc/hosts (host:IP)
--mac-address="" : Sets the container's Ethernet device's MAC address
--ip="" : Sets the container's Ethernet device's IPv4 address
--ip6="" : Sets the container's Ethernet device's IPv6 address
--link-local-ip=[] : Sets one or more container's Ethernet device's link local IPv4/IPv6 addresses

I came across the --add-host flag in the services container's docker run command:

--add-host="${SERVICES_HOST} ${DB2_XMETA_HOST} ${ENGINE_HOST}":${SERVICES_HOST_IP} \

Restart policies (--restart)

Using the --restart flag on Docker run you can specify a restart policy for how a container should or should not be restarted on exit.

When a restart policy is active on a container, it will be shown as either Up or Restarting in docker ps.
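For example (policy values from the Docker run reference; the image name is a placeholder):

# always restart, no matter how the container exits
docker run -d --restart=always <image>
# restart only on non-zero exit, at most 3 attempts
docker run -d --restart=on-failure:3 <image>
# never restart (the default)
docker run -d --restart=no <image>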

Exit Status

The exit code from docker run gives information about why the container failed to run or why it exited. When docker run exits with a non-zero code, the exit codes follow the chroot standard.
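A couple of these codes are easy to reproduce with busybox (examples adapted from the Docker run reference):

docker run busybox /etc; echo $?     # 126: contained command cannot be invoked
docker run busybox foobar; echo $?   # 127: contained command cannot be found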

Clean up [--rm]

By default a container’s file system persists even after the container exits. This makes debugging a lot easier (since you can inspect the final state) and you retain all your data by default. But if you are running short-term foreground processes, these container file systems can really pile up. If instead you’d like Docker to automatically clean up the container and remove the file system when the container exits, you can add the --rm flag.

HOSTNAME [--hostname]

--hostname="xxx"		Container host name

Set the hostname of the container.

Runtime privilege and Linux capabilities

I separate this section to the blog <<Docker Capability>> since it’s important to me.

Logging drivers [--log-driver]

The container can have a different logging driver than the Docker daemon. Use the --log-driver=VALUE with the docker run command to configure the container’s logging driver.

The default logging driver is json-file. The docker logs command is available only for the json-file and journald logging drivers.

Overriding Dockerfile image defaults

I separate this section to the blog <<Docker Image Defaults>> since it’s important to me.

-p

Remember, the first part of the -p value is the host port and the second part is the port within the container.

VOLUME (shared filesystems)

When the -v option bind mounts a volume from the host machine into the container, any contents the container originally had in the mount target folder are hidden by the mount and replaced by the contents of the host source folder.

Note that docker commit will not include any data contained in volumes mounted inside the container.

In my blog <<Linux Capability>>, I cover the basic and general knowledge about capabilities. This blog focuses on capabilities in Docker containers.

In docker run command, there are some flags about runtime privilege and capabilities:

--cap-add: Add Linux capabilities
--cap-drop: Drop Linux capabilities
--privileged=false: Give extended privileges to this container
--device=[]: Allows you to run devices inside the container without the --privileged flag.

By default, Docker containers are unprivileged and cannot, for example, run a Docker daemon inside a Docker container. This is because by default a container is not allowed to access any devices (/dev) on host, but a “privileged” container is given access to all devices on host.

The --privileged flag gives all capabilities to the container, and it also lifts all the limitations enforced by the device cgroup controller. In other words, the container can then do almost everything that the host can do. This flag exists to allow special use-cases, like running Docker within Docker.

How to verify? You can run busybox with --privileged enabled or not. First, try enabling it:

docker run --rm -it --privileged busybox sh

Then let's check the init process capabilities (busybox doesn't have getpcaps):

# cat /proc/1/status | grep -i cap

CapInh: 0000001fffffffff
CapPrm: 0000001fffffffff
CapEff: 0000001fffffffff
CapBnd: 0000001fffffffff
CapAmb: 0000000000000000

Then decode it on another machine; we can see the full capability set here:

# capsh --decode=0000001fffffffff

0x0000001fffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,
cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,
cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,
cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,
cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,
cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,
35,36

If not enabled, you only see the default ones:

# capsh --decode=00000000a80425fb

0x00000000a80425fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,
cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,
cap_audit_write,cap_setfcap

Docker keeps a default list of capabilities. The following list shows the Linux capability options which are allowed by default and can be dropped.

  1. SETPCAP: Modify process capabilities.
  2. MKNOD: Create special files using mknod(2).
  3. AUDIT_WRITE: Write records to kernel auditing log.
  4. CHOWN: Make arbitrary changes to file UIDs and GIDs (see chown(2)).
  5. NET_RAW: Use RAW and PACKET sockets.
  6. DAC_OVERRIDE: Bypass file read, write, and execute permission checks.
  7. FOWNER: Bypass permission checks on operations that normally require the file system UID of the process to match the UID of the file.
  8. FSETID: Don’t clear set-user-ID and set-group-ID permission bits when a file is modified.
  9. KILL: Bypass permission checks for sending signals.
  10. SETGID: Make arbitrary manipulations of process GIDs and supplementary GID list.
  11. SETUID: Make arbitrary manipulations of process UIDs.
  12. NET_BIND_SERVICE: Bind a socket to internet domain privileged ports (port numbers less than 1024).
  13. SYS_CHROOT: Use chroot(2), change root directory.
  14. SETFCAP: Set file capabilities.

Further reference information is available on the capabilities(7) - Linux man page

Resource

  • Docker run reference
  • Docker security

Starting with kernel 2.2, Linux divides the privileges traditionally associated with the superuser into distinct units, known as capabilities, which can be independently enabled and disabled. This way the full set of privileges is reduced, decreasing the risk of exploitation.

This story started with removing the SYS_ADMIN and SYS_RESOURCE Linux capabilities from the K8s container which hosts DB2. Why did we remove them? Because at the time DB2 had to run as root (to tune kernel parameters), we wanted to minimize the privileges of the root user by removing some risky Linux capabilities from the pod/container.

And a container is really just a process running on the system, separated using cgroups and namespaces in the kernel. This means that capabilities can be assigned to the container in just the same way as with any other process and this is handled by the container runtime when it creates the container.

How to add/remove Linux capabilities in K8s

Secure Your Containers with this One Weird Trick: The way I describe it is that most people think of root as being all powerful. This isn't the whole picture: the root user with all capabilities is all powerful.

Linux capabilities in Kubernetes

For the purpose of performing permission checks, traditional UNIX implementations distinguish two categories of processes: privileged processes (whose effective user ID is 0, referred to as superuser or root), and unprivileged processes (whose effective UID is nonzero).

Basic Capability Thing

Header File

Linux capabilities are defined in a header file with the non-surprising name capability.h, in /usr/include/linux/capability.h. They're pretty self-explanatory and well commented.

Capability Number

To see the highest capability number for your kernel, use the data from the /proc file system.

# cat /proc/sys/kernel/cap_last_cap
36

Current Capabilities

To see the current capabilities list, run capsh --print, for example, as normal user dsadm:

# capsh --print

Current: =
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,
cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,
cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,
cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,
cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,
cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,
35,36
Securebits: 00/0x0/1'b0
secure-noroot: no (unlocked)
secure-no-suid-fixup: no (unlocked)
secure-keep-caps: no (unlocked)
uid=1002(dsadm)
gid=1002(dsadm)
groups=1002(dsadm)

You see that Current: = is empty, but if you run as the root user:

$ capsh --print

Current: = cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,
cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,
cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,
cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,
cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,
cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,
35,36+ep
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,
cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,
cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,
cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,
cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,
cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,
35,36
Securebits: 00/0x0/1'b0
secure-noroot: no (unlocked)
secure-no-suid-fixup: no (unlocked)
secure-keep-caps: no (unlocked)
uid=0(root)
gid=0(root)
groups=0(root)

To see the capabilities for a particular process, run cat /proc/<PID>/status | grep -i cap:

# cat /proc/1/status | grep -i cap

CapInh: 00000000a884a5fb
CapPrm: 00000000a884a5fb
CapEff: 00000000a884a5fb
CapBnd: 00000000a884a5fb
CapAmb: 0000000000000000

This is the bit map of capabilities; the meaning of each field is:

  • CapInh = Inherited capabilities
  • CapPrm = Permitted capabilities
  • CapEff = Effective capabilities
  • CapBnd = Bounding set
  • CapAmb = Ambient capabilities set

The CapBnd set defines the upper bound of available capabilities. While a process runs, no capabilities can be added to this list; only the capabilities in the bounding set can be added to the inheritable set, via the capset() system call. If a capability is dropped from the bounding set, that process and its children can no longer have access to it.

Using the capsh utility we can decode them into capability names:

# capsh --decode=00000000a884a5fb

0x00000000a884a5fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,
cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_ipc_owner,cap_sys_chroot,
cap_sys_nice,cap_mknod,cap_audit_write,cap_setfcap

Another easy way is to use the getpcaps utility:

# getpcaps 1965

Capabilities for `1965': = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_ipc_owner,
cap_sys_chroot,cap_sys_nice,cap_mknod,cap_audit_write,cap_setfcap+eip

It is also interesting to see the capabilities of a set of processes that have a relationship.

# getpcaps $(pgrep db2)

Capabilities for `1965': = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_ipc_owner,
cap_sys_chroot,cap_sys_nice,cap_mknod,cap_audit_write,cap_setfcap+eip
Capabilities for `2151': = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_ipc_owner,
cap_sys_chroot,cap_sys_nice,cap_mknod,cap_audit_write,cap_setfcap+i
Capabilities for `2245': = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_ipc_owner,
cap_sys_chroot,cap_sys_nice,cap_mknod,cap_audit_write,cap_setfcap+eip
Capabilities for `2246': = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_ipc_owner,
cap_sys_chroot,cap_sys_nice,cap_mknod,cap_audit_write,cap_setfcap+eip
Capabilities for `2247': = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_ipc_owner,
cap_sys_chroot,cap_sys_nice,cap_mknod,cap_audit_write,cap_setfcap+eip
Capabilities for `2249': = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_ipc_owner,
cap_sys_chroot,cap_sys_nice,cap_mknod,cap_audit_write,cap_setfcap+i
Capabilities for `2614': = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_ipc_owner,
cap_sys_chroot,cap_sys_nice,cap_mknod,cap_audit_write,cap_setfcap+i
Capabilities for `4213': = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_ipc_owner,
cap_sys_chroot,cap_sys_nice,cap_mknod,cap_audit_write,cap_setfcap+i
Capabilities for `4238': = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_ipc_owner,
cap_sys_chroot,cap_sys_nice,cap_mknod,cap_audit_write,cap_setfcap+i

Limit Capability

You can test what happens when a particular capability is dropped by using the capsh utility. This is a way to see what capabilities a particular program may need to function correctly. The capsh command can run a particular process and restrict the set of available capabilities.

capsh --print -- -c "/bin/ping -c 1 localhost"

After dropping cap_net_raw, ping is not permitted:

capsh --drop=cap_net_raw --print -- -c "/bin/ping -c 1 localhost"

Capability Meet

Here are the capabilities I have come across so far:

  • CAP_SYS_ADMIN: without it, I cannot run the hostname command in a Docker container in K8s.
  • CAP_SYS_RESOURCE: this is for adjusting DB2 kernel parameters.

These 3 are necessary for DB2:

  • CAP_SETFCAP: set arbitrary capabilities on a file (this is actually a default in an unprivileged Docker container).
  • CAP_SYS_NICE
  • CAP_IPC_OWNER: bypass permission checks for operations on System V IPC objects.

My Questions

  1. Are capabilities granted to a user or a process? First, ensure the environment (container) has enough Linux caps. Then grant the process certain caps with a command (root privilege is needed to do that); after that, you can run the process as an ordinary user.

  2. Do privileged processes bypass all kernel permission checks? Does that mean Linux capabilities are only for non-privileged users or processes? I think there is a global or default capability set in the system that determines what any process on the system is allowed to do; then you can fine-tune it for unprivileged processes.

  3. If we have both root and a normal user in a Docker container, are capabilities applied to root, the normal user, or both? After testing and comparing capsh --print output as different users in the xmeta container, I think capabilities are applied to all users in the K8s environment.

Later I posted a blog about <<Capability in Docker>>.

Resources

  • Linux Programmer’s Manual
  • Linux capabilities 101

I have seen something like this in permission bits. What do these s/S and t/T stand for?

# s bit
-r-sr-sr-x 1 root db2iadm1 115555 Mar 21 03:19 db2start
# s/t/T bit
drwxrwsr-t 1 db2inst1 db2iadm1 212 May 10 17:16 sqllib
drwxr-x--T 1 wasadmin dstage 21 Mar 21 03:56 usr

Basic Permission Bits

  • Read - a readable permission allows the contents of the file to be viewed. A read permission on a directory allows you to list the contents of a directory.
  • Write - a write permission on a file allows you to modify the contents of that file. For a directory, the write permission allows you to edit the contents of a directory (e.g. add/delete files).
  • Execute - for a file, the executable permission allows you to run the file and execute a program or script. For a directory, the execute permission allows you to change to a different directory and make it your current working directory.

Setuid and Setgid Bits

Note that setuid and setgid have an effect only on binary executable files and not on scripts (e.g., Bash, Perl, Python), see wiki.

Most often the setuid bit is assigned to a few binaries owned by the superuser. When an ordinary user runs a program that is setuid root, the program runs with the effective privileges of the superuser. Linux capabilities are a great alternative for reducing the use of setuid.

Capabilities break up root privileges into smaller units, so full root access is no longer needed. Most binaries that have the setuid flag can be changed to use capabilities instead.

Apply on File

When the setuid or setgid attributes are set on an executable file, any user able to execute the file automatically executes it with the privileges of the file's owner (usually root) and/or the file's group, depending on the flags set. This may pose potential security risks in some cases, and executables should be properly evaluated before these bits are set.

For example, the passwd binary:

# ordinary user can run this binary effectively as root
# this is a binary executable
-rwsr-xr-x. 1 root root 27832 Jan 29 2014 /usr/bin/passwd

Set setuid on a binary executable:

chmod u+s <executable>
# remove setuid on file:
chmod u-s <executable>

This post explains why setuid/setgid is usually ignored on shell scripts.

Another thing to note is ping: it is a binary executable, and nowadays it may not have setuid:

-rwxr-xr-x 1 root root 81608 Feb  5 04:37 /usr/bin/ping

Instead, it may carry a file capability; see hardening Linux binaries by removing setuid:

# getcap /usr/bin/ping
/usr/bin/ping cap_net_raw=ep
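If you wanted to set this up yourself, setcap grants or removes file capabilities (a sketch; needs root, and the ping path may differ on your system):

# grant the capability instead of relying on setuid
setcap cap_net_raw+ep /usr/bin/ping
getcap /usr/bin/ping    # /usr/bin/ping cap_net_raw=ep
# remove it again
setcap -r /usr/bin/ping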

I have also written a summary blog about Linux capabilities.

Or it may have neither; see ping without SETUID and Capabilities: creating (normal) ICMP packets does not require special permissions anymore.

Apply on Directory

Setting the setgid permission on a directory causes new files and subdirectories created within it to inherit its group ID, rather than the primary group ID of the user who created the file (the owner ID is never affected, only the group ID).

Set setgid on a directory:

chmod g+s dir
# remove setgid on directory:
chmod g-s dir
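A quick demonstration of the inheritance (devs is a hypothetical group that already exists):

mkdir shared
chgrp devs shared
chmod g+s shared
touch shared/newfile
ls -l shared/newfile    # the group is devs, not the creator's primary group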

The setuid permission set on a directory is ignored on most UNIX and Linux systems. However, FreeBSD can be configured to interpret setuid in a manner similar to setgid, in which case it forces all files and sub-directories created in a directory to be owned by that directory's owner, a simple form of inheritance.

Note that both the setuid and setgid bits have no effect if the executable bit is not set; in that case, s shows up as S.

Chown Removes Setuid

If you run chown on a setuid file, you will find that the s is gone. This is a reasonable design; otherwise the s would apply to the new owner, a big security hole.

The chown command sometimes clears the set-user-ID or set-group-ID permission bits. This behavior depends on the policy and functionality of the underlying chown system call, which may make system-dependent file mode modifications outside the control of the chown command. For example, the chown command might not affect those bits when invoked by a user with appropriate privileges, or when the bits signify some function other than executable permission (e.g., mandatory locking). When in doubt, check the underlying system behavior.

Note that if you want the setuid bit and owner to stay unchanged when copying (for example when I last dealt with sqllib), preserve them when you do the copy:

1
/bin/cp -rfp /home/dfdcdc/sqllib /tmp

Sticky Bit

When set on a file or directory, the sticky bit, or +t mode, means that only the owner (or root) can delete the file (or files under the directory), regardless of which users have write access to this file or directory by way of group membership or ownership! This is often used to control access to a shared directory, such as /tmp.

This is useful when a file or directory is owned by a group through which a number of users share write access to a given set of files.

Set sticky bit:

chmod +t script.sh

Remove the sticky bit (note that to change the sticky bit, you need to be either root or the file owner; the root user can delete files regardless of the status of the sticky bit):

chmod -t script.sh

Sometimes you see T instead of t. Usually t sits where the x would be, but if the executable bit is not set then the t is flagged up as a capital, for example:

touch file
chmod u=rwx,go=rx file # "-rwxr-xr-x 1 roaima 0 Sep 10 23:13 file"
chmod +t file # "-rwxr-xr-t 1 roaima 0 Sep 10 23:13 file"
chmod o-x file # "-rwxr-xr-T 1 roaima 0 Sep 10 23:13 file"
chmod u=rwx,go=,+t file # "-rwx-----T 1 roaima 0 Sep 10 23:13 file"

Now if a user is not in that group, they cannot even enter the directory.

Resources

  • wiki setuid and setgid
  • how to set T bit
  • chown remove setuid
