Hypernetes: Bringing Security and Multi-tenancy to Kubernetes Jun 06, 2016

Notes: this post is copied from http://blog.kubernetes.io/2016/05/hypernetes-security-and-multi-tenancy-in-kubernetes.html.

Today’s guest post is written by Harry Zhang and Pengfei Ni, engineers at HyperHQ, describing a new hypervisor based container called HyperContainer

While many developers and security professionals are comfortable with Linux containers as an effective boundary, many users need a stronger degree of isolation, particularly for those running in a multi-tenant environment. Sadly, today, those users are forced to run their containers inside virtual machines, even one VM per container.

...
How docker 1.11 share network accross containers May 11, 2016

Docker 1.11 has moved to runc with containerd, I am interested in how it processing shared netns accross containers.

For example, I have already running a container 75599a6f387b7842c6da57efd38f9742b2ca621782f891402f83852c66dbd706. A new container within same netns can be created with cmd:

docker run -itd --net=container:75599a6f387b alpine sh

This will generate a runc config.json as follows:

{
    "ociVersion": "0.6.0-dev",
    "platform": {
        "os": "linux",
        "arch": "amd64"
    },
    "process": {
        "terminal": true,
        "user": {
            "additionalGids": [
                0,
                1,
                2,
                3,
                4,
                6,
                10,
                11,
                20,
                26,
                27
            ]
        },
        "args": [
            "sh"
        ],
        "env": [
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
            "HOSTNAME=75599a6f387b",
            "TERM=xterm"
        ],
        "cwd": "/",
        "capabilities": [
            "CAP_CHOWN",
            "CAP_DAC_OVERRIDE",
            "CAP_FSETID",
            "CAP_FOWNER",
            "CAP_MKNOD",
            "CAP_NET_RAW",
            "CAP_SETGID",
            "CAP_SETUID",
            "CAP_SETFCAP",
            "CAP_SETPCAP",
            "CAP_NET_BIND_SERVICE",
            "CAP_SYS_CHROOT",
            "CAP_KILL",
            "CAP_AUDIT_WRITE"
        ]
    },
    "root": {
        "path": "/var/lib/docker/devicemapper/mnt/d33c7932917e64bde482b437fc3ccaad9a00a04e0cf49e39f9d3be5d71991db6/rootfs",
        "readonly": false
    },
    "hostname": "75599a6f387b",
    "mounts": [
        {
            "destination": "/proc",
            "type": "proc",
            "source": "proc",
            "options": [
                "nosuid",
                "noexec",
                "nodev"
            ]
        },
        {
            "destination": "/dev",
            "type": "tmpfs",
            "source": "tmpfs",
            "options": [
                "nosuid",
                "strictatime",
                "mode=755"
            ]
        },
        {
            "destination": "/dev/pts",
            "type": "devpts",
            "source": "devpts",
            "options": [
                "nosuid",
                "noexec",
                "newinstance",
                "ptmxmode=0666",
                "mode=0620",
                "gid=5"
            ]
        },
        {
            "destination": "/sys",
            "type": "sysfs",
            "source": "sysfs",
            "options": [
                "nosuid",
                "noexec",
                "nodev",
                "ro"
            ]
        },
        {
            "destination": "/sys/fs/cgroup",
            "type": "cgroup",
            "source": "cgroup",
            "options": [
                "ro",
                "nosuid",
                "noexec",
                "nodev"
            ]
        },
        {
            "destination": "/dev/mqueue",
            "type": "mqueue",
            "source": "mqueue",
            "options": [
                "nosuid",
                "noexec",
                "nodev"
            ]
        },
        {
            "destination": "/etc/resolv.conf",
            "type": "bind",
            "source": "/var/lib/docker/containers/75599a6f387b7842c6da57efd38f9742b2ca621782f891402f83852c66dbd706/resolv.conf",
            "options": [
                "rbind",
                "rprivate"
            ]
        },
        {
            "destination": "/etc/hostname",
            "type": "bind",
            "source": "/var/lib/docker/containers/75599a6f387b7842c6da57efd38f9742b2ca621782f891402f83852c66dbd706/hostname",
            "options": [
                "rbind",
                "rprivate"
            ]
        },
        {
            "destination": "/etc/hosts",
            "type": "bind",
            "source": "/var/lib/docker/containers/75599a6f387b7842c6da57efd38f9742b2ca621782f891402f83852c66dbd706/hosts",
            "options": [
                "rbind",
                "rprivate"
            ]
        },
        {
            "destination": "/dev/shm",
            "type": "bind",
            "source": "/var/lib/docker/containers/d8230e57e88d15515a94138ef512a4271e31d03bb6fb257b3d57a847e70b5c68/shm",
            "options": [
                "rbind",
                "rprivate"
            ]
        }
    ],
    "hooks": {},
    "linux": {
        "resources": {
            "devices": [
                {
                    "allow": false,
                    "access": "rwm"
                },
                {
                    "allow": true,
                    "type": "c",
                    "major": 1,
                    "minor": 5,
                    "access": "rwm"
                },
                {
                    "allow": true,
                    "type": "c",
                    "major": 1,
                    "minor": 3,
                    "access": "rwm"
                },
                {
                    "allow": true,
                    "type": "c",
                    "major": 1,
                    "minor": 9,
                    "access": "rwm"
                },
                {
                    "allow": true,
                    "type": "c",
                    "major": 1,
                    "minor": 8,
                    "access": "rwm"
                },
                {
                    "allow": true,
                    "type": "c",
                    "major": 5,
                    "minor": 0,
                    "access": "rwm"
                },
                {
                    "allow": true,
                    "type": "c",
                    "major": 5,
                    "minor": 1,
                    "access": "rwm"
                },
                {
                    "allow": false,
                    "type": "c",
                    "major": 10,
                    "minor": 229,
                    "access": "rwm"
                }
            ],
            "disableOOMKiller": false,
            "oomScoreAdj": 0,
            "memory": {
                "kernelTCP": null,
                "swappiness": 18446744073709551615
            },
            "cpu": {},
            "pids": {
                "limit": 0
            },
            "blockIO": {
                "blkioWeight": 0
            }
        },
        "cgroupsPath": "/docker/d8230e57e88d15515a94138ef512a4271e31d03bb6fb257b3d57a847e70b5c68",
        "namespaces": [
            {
                "type": "mount"
            },
            {
                "type": "network",
                "path": "/proc/14702/ns/net"
            },
            {
                "type": "uts"
            },
            {
                "type": "pid"
            },
            {
                "type": "ipc"
            }
        ],
        "devices": [
            {
                "path": "/dev/zero",
                "type": "c",
                "major": 1,
                "minor": 5,
                "fileMode": 438,
                "uid": 0,
                "gid": 0
            },
            {
                "path": "/dev/null",
                "type": "c",
                "major": 1,
                "minor": 3,
                "fileMode": 438,
                "uid": 0,
                "gid": 0
            },
            {
                "path": "/dev/urandom",
                "type": "c",
                "major": 1,
                "minor": 9,
                "fileMode": 438,
                "uid": 0,
                "gid": 0
            },
            {
                "path": "/dev/random",
                "type": "c",
                "major": 1,
                "minor": 8,
                "fileMode": 438,
                "uid": 0,
                "gid": 0
            },
            {
                "path": "/dev/fuse",
                "type": "c",
                "major": 10,
                "minor": 229,
                "fileMode": 438,
                "uid": 0,
                "gid": 0
            }
        ],
        "maskedPaths": [
            "/proc/kcore",
            "/proc/latency_stats",
            "/proc/timer_stats",
            "/proc/sched_debug"
        ],
        "readonlyPaths": [
            "/proc/asound",
            "/proc/bus",
            "/proc/fs",
            "/proc/irq",
            "/proc/sys",
            "/proc/sysrq-trigger"
        ]
    }
}

So, it is very clear how it works:

...
Go performance optimize May 06, 2016

**Go性能优化技巧(By 雨痕)

  1. 字符串(string)作为一种不可变类型,在与字节数组(slice, [ ]byte)转换时需付出 “沉重” 代价,根本原因是对底层字节数组的复制。
package main

import (
    "fmt"
    "unsafe"
)

func str2bytes(s string) []byte {
    ptr := (*[2]uintptr)(unsafe.Pointer(&s))
    btr := [3]uintptr{ptr[0], ptr[1], ptr[1]}
    return *(*[]byte)(unsafe.Pointer(&btr))
}

func bytes2str(b []byte) string {
    return *(*string)(unsafe.Pointer(&b))
}

func main() {
    s := "abcdefghi"
    b := str2bytes(s)
    s2 := bytes2str(b)

    fmt.Println(s, b, s2)
}
  1. Go Proverbs: A little copying is better than a little dependency. 对于一些短小的对象,复制成本远小于在堆上分配和回收操作。

    ...
The Rise of Cloud Computing Systems - Jeff Dean May 05, 2016

{% pdf http://feiskyer.github.io/assets/ccs.pdf %}

Reading notes of week 17 Apr 29, 2016

SIG-Networking: Kubernetes Network Policy APIs Coming in 1.3

One problem many users have is that the open access network policy of Kubernetes is not suitable for applications that need more precise control over the traffic that accesses a pod or service. Today, this could be a multi-tier application where traffic is only allowed from a tier’s neighbor. But as new Cloud Native applications are built by composing microservices, the ability to control traffic as it flows among these services becomes even more critical.

...
runc and runV Apr 28, 2016

runc is a CLI tool for spawning and running containers according to the OCI specification, while runV is a hypervisor-based runtime for OCI. Both of them are recommanded (implementations](https://github.com/opencontainers/runtime-spec/blob/master/implementations.md) of OCI.

Playing with runc

Install runc:

yum install -y libseccomp-devel
mkdir -p $GOPATH/src/github.com/opencontainers
cd $GOPATH/src/github.com/opencontainers
git clone https://github.com/opencontainers/runc
cd runc
make
sudo make install

Run busybox:

$ docker pull busybox
$ mkdir rootfs
$ docker export $(docker create busybox) | tar -C rootfs -xvf -
$ runc spec .
$ runc start test
/ # ps
PID   USER     COMMAND
1 root     sh
9 root     ps

Playing with docker-containerd

docker-containerd is installed togather with docker 1.11.

...
Container runtime in Docker v1.11 Apr 28, 2016

Docker v1.11正式集成了runc(终于支持OCI了),并将原来的一个二进制文件拆分为多个,同时还保持docker CLI和API不变:

  • docker
  • docker-containerd
  • docker-containerd-shim
  • docker-runc
  • docker-containerd-ctr

...
DPDK Introduction Apr 24, 2016

DPDK Introduction

Intel DPDK全称Intel Data Plane Development Kit,是intel提供的数据平面开发工具集,为Intel architecture(IA)处理器架构下用户空间高效的数据包处理提供库函数和驱动的支持,它不同于Linux系统以通用性设计为目的,而是专注于网络应用中数据包的高性能处理。DPDK应用程序是运行在用户空间上利用自身提供的数据平面库来收发数据包,绕过了Linux内核协议栈对数据包处理过程。Linux内核将DPDK应用程序看作是一个普通的用户态进程,包括它的编译、连接和加载方式和普通程序没有什么两样。DPDK程序启动后只能有一个主线程,然后创建一些子线程并绑定到指定CPU核心上运行。

...
Tips for cgo Apr 24, 2016

cgo的一些tips

基本类型

The standard C numeric types are available under the names C.char, C.schar (signed char), C.uchar (unsigned char), C.short, C.ushort (unsigned short), C.int, C.uint (unsigned int), C.long, C.ulong (unsigned long), C.longlong (long long), C.ulonglong (unsigned long long), C.float, C.double, C.complexfloat (complex float), and C.complexdouble (complex double). The C type void* is represented by Go’s unsafe.Pointer. The C types __int128_t and __uint128_t are represented by [16]byte.

...
cgo in go 1.6 Apr 19, 2016

The major change is the definition of rules for sharing Go pointers with C code, to ensure that such C code can coexist with Go’s garbage collector. Briefly, Go and C may share memory allocated by Go when a pointer to that memory is passed to C as part of a cgo call, provided that the memory itself contains no pointers to Go-allocated memory, and provided that C does not retain the pointer after the call returns. These rules are checked by the runtime during program execution: if the runtime detects a violation, it prints a diagnosis and crashes the program. The checks can be disabled by setting the environment variable GODEBUG=cgocheck=0, but note that the vast majority of code identified by the checks is subtly incompatible with garbage collection in one way or another. Disabling the checks will typically only lead to more mysterious failure modes. Fixing the code in question should be strongly preferred over turning off the checks.

...