Analysis of Podman Integration into HySDS core
Currently, HySDS uses docker as the default container engine for running the verdi job worker container (verdi) as well as the job containers themselves. However, as of Red Hat Enterprise Linux 8 and its various open-source community variants (Rocky Linux, AlmaLinux, Oracle Linux, and so on), support for Docker has been removed and replaced with Podman, an alternative container engine that emphasizes security. From https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/building_running_and_managing_containers/assembly_starting-with-containers_building-running-and-managing-containers#con_running-containers-without-docker_assembly_starting-with-containers:
Red Hat removed the Docker container engine and the docker command from RHEL 8.
If you still want to use Docker in RHEL, you can get Docker from different upstream projects, but it is unsupported in RHEL 8.
- You can install the podman-docker package so that every time you run a docker command, it actually runs a podman command.
- Podman also supports the Docker Socket API, so the podman-docker package also sets up a link between /var/run/docker.sock and /var/run/podman/podman.sock. As a result, you can continue to run your Docker API commands with docker-py and docker-compose tools without requiring the Docker daemon. Podman will service the requests.
- The podman command, like the docker command, can build container images from a Containerfile or Dockerfile. The available commands that are usable inside a Containerfile and a Dockerfile are equivalent.
- Options to the docker command that are not supported by podman include network, node, plugin (podman does not support plugins), rename (use rm and create to rename containers with podman), secret, service, stack, and swarm (podman does not support Docker Swarm). The container and image options are used to run subcommands that are used directly in podman.
It is also important to note that some features of docker may not be implemented in podman. Please reference the podman documentation to verify the existence of any docker feature in podman.
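The Docker-compatible socket API mentioned above can be exercised with any HTTP client. A minimal sketch of such an invocation (the API version path is illustrative, and the socket path assumes the podman.socket service is active; this constructs and prints the command rather than executing it):

```shell
# The podman-docker package links /var/run/docker.sock to this socket,
# so Docker API clients keep working with podman servicing the requests.
SOCK=/var/run/podman/podman.sock
CMD="curl --unix-socket $SOCK http://d/v1.40/info"
echo "$CMD"
```

Tools like docker-py and docker-compose speak this same API, which is why they continue to work without the Docker daemon.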
In this document, we will:
- describe HySDS core's current capability in using docker, using an example from NISAR
- describe two designs for integrating Podman into HySDS core based on Podman's features and security enhancements
- propose the design to move forward with
NOTE: The screenshots of the diagrams below are extracted from this presentation, which includes animations.
Current Support for Docker
The following diagram shows the layout of a worker (verdi) EC2 instance:
Note the following assumptions, which are based on NISAR's adaptation requirements:
- a verdi worker instance is an instance of an AWS autoscaling group which, when instantiated:
  - ensures the docker daemon is running
  - creates an EBS volume from a snapshot which contains the verdi container image cached along with the project-specific container images (nisar_pcm, l0a_pge, etc.)
  - utilizes an EBS volume for the root work directory: /data/work
Upon bootup, the verdi worker instance starts up the verdi container, which is configured to pull jobs and publish products to its configured HySDS cluster (mozart, metrics, grq, factotum). The celery worker running in this hysds/verdi container is the process that pulls a job from the mozart RabbitMQ queue it's listening to. One of its main functions is to create a unique job work directory under /data/work:
Note that to ensure containers run on the instance are able to read and write output to their respective work directories, the user on the host that starts up the verdi container must pass in its UID:GID via the docker option -u UID:GID. This ensures that the user running in the container has the appropriate ownership attributes within the job work directory.
Afterwards, the celery worker is ready to start up the actual job as a container. Note that, in the case of docker, this is not DinD (Docker-in-Docker); instead, the verdi worker is able to communicate with the docker daemon on the host to spawn off other containers. Here, the job that verdi pulled starts up the nisar_pcm container, which contains all the logic (pre-conditions, post-conditions, configuration, input localization, etc.) needed to successfully run a SAS (scientific algorithm software) container:
In the diagram above we see the verdi container talk to the docker daemon running on the host to start up the nisar_pcm container, which:
- starts off in the job work directory that was created for it by the verdi container
- is given the UID:GID of the user on the host that owns the job work directory (via the -u UID:GID option)
- starts downloading inputs and creates any run configuration files needed to run the actual SAS container to produce science products
Once the nisar_pcm container has done what it needs to do to prep for the execution of the SAS container, it itself talks to the docker daemon on the host to start up the SAS container, in this case l0a_pge:
In the diagram above, and similar to the previous step, we see the nisar_pcm container talk to the docker daemon running on the host to start up the l0a_pge container, which:
- also starts off in the job work directory that was created by the verdi container
- is given the UID:GID of the user on the host that owns the job work directory (via the -u UID:GID option)
- utilizes the run configuration files created and the inputs downloaded by the nisar_pcm container to generate output products
Once the l0a_pge container is done creating products, it exits and the container is removed. Similarly, the nisar_pcm container will perform some final steps (e.g. create datasets, clean up large data files, etc.) as post-conditions, and it too will exit and have its container removed:
Finally, as shown above, the verdi container will publish any output datasets it recognizes in the job work directory and pull another job to begin the whole cycle over again. Before pulling the next job, verdi will check that there is enough space to run it and, if not, delete old work directories.
Podman
There is extensive information on podman out on the web, so we will defer to those resources. There is also a nice write-up on Podman by @Dustin Lo here: https://hysds-core.atlassian.net/wiki/spaces/HYS/pages/1972895745.
Regarding integrating podman into HySDS as an alternative container runtime/engine, we will utilize the use case described in the “Current Support for Docker” section above as the baseline capability that podman needs to support. The core requirements we can extract from that use case are:
- a container should be able to CRUD (create/read/update/delete) files and directories under the host
- a container (parent) should be able to run another container (child)
- a container should be able to CRUD files and directories that were created by its ancestor containers
- and vice versa: a container should be able to CRUD files and directories that were created by its descendant containers
In docker, fulfilling these requirements is done via the following options:
- -v <host mount>:<container mount>
  - mounts a host directory into the container at some mount point
  - can also be used to mount the docker socket (/var/run/docker.sock) into the container, which can then be used by the docker command in the container to communicate with the host's docker daemon
- -u <UID>:<GID>
  - overrides the UID and GID of the USER in the docker container with the values passed in
HySDS is able to utilize docker to fulfill the above use case by using a set of specific options that complement each other and propagating those options to subsequent docker container instantiations. That is, as long as these are true for all containers that run on a host, the above use case can be executed successfully:
- UID:GID of the host user is passed to the docker container
- host directories/files being mounted into the docker container are owned by the UID:GID of the host user
- the /var/run/docker.sock file on the host is mounted into the docker container at /var/run/docker.sock
- docker is installed in the docker image
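A minimal sketch of a docker run invocation satisfying all four of these conditions (the image name and mount paths follow the NISAR example above; the sketch constructs and prints the command rather than executing it):

```shell
HOST_UID_GID="$(id -u):$(id -g)"     # condition 1: host user's UID:GID
WORK_DIR=/data/work                  # condition 2: host-owned work directory
DOCKER_SOCK=/var/run/docker.sock     # condition 3: host docker socket
# condition 4: the hysds/verdi image itself has docker installed
CMD="docker run --rm -u $HOST_UID_GID -v $WORK_DIR:$WORK_DIR -v $DOCKER_SOCK:$DOCKER_SOCK hysds/verdi"
echo "$CMD"
```

Because the work directory and socket mounts use identical host and container paths, a container started this way can pass the same options along verbatim when it spawns its own child container.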
Podman, however, introduces a kink because it uses a Linux feature called user namespaces to isolate processes and provide better security. Although podman touts being able to just alias docker=podman, that isn't necessarily true for a more complicated use case such as the one we described above. Although the syntax of podman commands mirrors that of docker commands, the semantics of some of these podman options are different.
For a good introduction to podman’s use of user namespaces and how it interacts with volume mounts, see . The following image from that page gives a high-level view of this:
Additionally, podman does the following things differently from docker:
- all docker operations go through the docker daemon (running as root), whereas podman by default is daemonless and podman operations are isolated by user
- container images (podman pull) are stored in user-specific areas (by default $HOME/.local/share/containers), whereas docker (docker pull) stores images in a global location (/var/lib/docker) irrespective of which host user pulls them
  - this can be modified via podman configuration files under /etc/containers
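As a sketch of that configuration hook, podman's storage settings live in /etc/containers/storage.conf (keys per containers-storage.conf(5)). For example, an additional read-only image store can be declared so that users share pre-pulled images; the path below is hypothetical:

```toml
# /etc/containers/storage.conf (excerpt)
[storage]
driver = "overlay"

[storage.options]
# Hypothetical shared, read-only image store that every user's podman
# can read from in addition to its own $HOME/.local/share/containers.
additionalimagestores = [
  "/var/lib/shared-containers/storage",
]
```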
Podman Integration: Design #1
PINP (Podman in Podman) is possible; however, because each podman container needs to download the container image it's going to run, by default it needs to do that every single time, even for the same image. We can play games with mounting $HOME/.local/share/containers into each and every container we start. However, the bigger problem is accounting for the subuid/subgid configurations of each container with respect to its parent container as well as its child container, and how that interplays with the host mounts. More investigation needs to be done to determine the final viability of this design, but initially it looks like it deviates quite a bit from the generality of the docker design and HySDS' integration of docker. The following figure shows the above use case using PINP (Podman in Podman):
Podman Integration: Design #2
Although podman touts being daemonless, installing podman does install (though not enable) the podman.socket systemd service, which enables the Podman 2.x API. In short, this API provides two sets of methods: one compatible with the docker API (docker daemon) and one specific to the libpod API used by podman. What this means is that with a few modifications to the OS running on the host, we can have the podman daemon run at system bootup:
$ sudo systemctl enable --now podman
$ sudo chmod 777 /var/run/podman/podman.sock
The above commands will enable the podman service, and a socket file will be created at /var/run/podman/podman.sock. Note, however, that the podman service run this way runs under the root user's environment, and thus the sock file will only be readable and writable by root. We would subsequently need to chmod the sock file so that all users on the host can communicate with the API.
In order to start up a container using this API, bypassing the default behavior, we would need to call podman with the following options: --remote --url unix:/var/run/podman/podman.sock. The following table shows the differences when running podman with and without these options:
| command | storage location |
|---|---|
| podman pull docker.io/hysds/verdi | $HOME/.local/share/containers |
| podman --remote --url unix:/var/run/podman/podman.sock pull docker.io/hysds/verdi | /var/lib/containers (because the API is running as root) |
Alternatively, the podman API could be run via systemd on a per-user basis:
$ sudo loginctl enable-linger $USER
$ export XDG_RUNTIME_DIR=/run/user/$(id -u)
$ systemctl --user --now enable podman.socket
Created symlink /home/ops/.config/systemd/user/sockets.target.wants/podman.socket → /usr/lib/systemd/user/podman.socket.
$ systemctl --user status podman.socket
● podman.socket - Podman API Socket
Loaded: loaded (/usr/lib/systemd/user/podman.socket; enabled; vendor preset: enabled)
Active: active (listening) since Mon 2022-02-14 18:15:42 UTC; 36s ago
Docs: man:podman-system-service(1)
Listen: /run/user/1001/podman/podman.sock (Stream)
CGroup: /user.slice/user-1001.slice/user@1001.service/podman.socket
$ export PODMAN_SOCK=/run/user/1001/podman/podman.sock
The above commands will enable the podman service in userspace, and a socket file will be created for our host user at /run/user/1001/podman/podman.sock. Note that the podman service run this way runs under the host user's environment, and thus the sock file will only be readable and writable by that user. We don't need to chmod the sock file because our host user owns it by default and can already communicate with the API.
In order to start up a container using this API, bypassing the default behavior, we would need to call podman with the following options: --remote --url unix:/run/user/1001/podman/podman.sock. The following table shows the differences when running podman with and without these options:
| command | storage location |
|---|---|
| podman pull docker.io/hysds/verdi | $HOME/.local/share/containers |
| podman --remote --url unix:/run/user/1001/podman/podman.sock pull docker.io/hysds/verdi | $HOME/.local/share/containers |
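As a convenience, rather than passing --remote --url on every invocation, podman also honors the CONTAINER_HOST environment variable, so the socket can be configured once per shell. A sketch, with the path mirroring the user-service example above and the UID resolved at runtime:

```shell
# Point subsequent podman invocations at the user-level API socket.
PODMAN_SOCK="/run/user/$(id -u)/podman/podman.sock"
export CONTAINER_HOST="unix://$PODMAN_SOCK"
echo "$CONTAINER_HOST"
```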
The following diagram shows how we could utilize the podman API to fulfill the test case described above:
However, we will still have to deal with user namespaces and the fact that podman utilizes subuids and subgids to compartmentalize user containers. The following sections show experimentation with podman and how we can get most of the way there. However, there remains one outstanding issue that needs to be resolved in podman in order for it to be a drop-in replacement for docker.
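To make the subuid/subgid bookkeeping concrete: given a hypothetical /etc/subuid entry of ops:100000:65536, a rootless container's UID 0 maps to the host user's own UID, while container UID N (for N >= 1) maps to subuid_start + N - 1. A quick sketch of that arithmetic:

```shell
# Hypothetical /etc/subuid entry: ops:100000:65536
SUBUID_START=100000
CONTAINER_UID=9999    # e.g. the ops user inside the pge-base image
HOST_UID=$((SUBUID_START + CONTAINER_UID - 1))
echo "$HOST_UID"      # container UID 9999 appears on the host as UID 109998
```

This is why files written by a container user can show up on the host owned by a large, unfamiliar UID unless options like --userns=keep-id are used.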
Testing/Experimenting with podman using vagrant (Oracle Linux 8 VM)
Start VM and ssh into it as vagrant user
$ cd ~/tmp
$ mkdir podman_testing
$ vagrant init hysds/base
$ vagrant up
$ vagrant ssh
Create ops user with UID:GID that is not 1000:1000 and has sudo
Become the ops user and confirm UID:GID is not 1000:1000
Depending on your VM setup, you may need to become root first prior to becoming the ops user:
Install podman
Start up the podman service (2 OPTIONS)
(OPTION 1) Start up the podman-service as root user and allow all users to write to it
(OR OPTION 2) Start up the podman-service as ops user
NOTE: The example commands below will make reference to PODMAN_SOCK to account for running the podman socket service as either root or as a user.
Pull hysds/verdi:develop-podman image
verdi docker image variant using rockylinux:8 as image base
ops user as UID:GID 1000:1000
uninstalled docker
installed podman
Pull hysds/pge-base:develop-9999-podman image
pge-base docker image variant using rockylinux:8 as image base
ops user as UID:GID 9999:9999
uninstalled docker
installed podman
Run verdi container mounting in the root work directory
Create /data/work directory owned by ops
Check UID:GID
Check $HOME
Check ownership of /data/work
Try creating job directory and file
What do the permissions look like on host:
Unshared (in user namespace):
Do not run the podman unshare command if you're using a podman socket started as a user. Doing so resulted in the following error, which also broke any subsequent podman commands run afterwards:
To resolve this, at this time, you'd have to run a “podman system reset” to get podman back into a usable state.
Check what pods are running on host:
Can the user on the host remove the work dir? Yes.
Back to verdi container, restore the work dir and test file:
Start the pge-base:develop-9999-podman container using the remote socket
Where are we and what is home?
source its .bash_profile
Try to write a file and directory in the work directory from pge-base container:
What does permission look like on host?
What does permissions look like on host in user namespace (unshare)?
Do not run the podman unshare command if you're using a podman socket started as a user. Doing so resulted in the following error, which also broke any subsequent podman commands run afterwards:
To resolve this, at this time, you'd have to run a “podman system reset” to get podman back into a usable state.
Check what containers are running on host:
Let’s run a third container (from the PGE container) using a hysds/verdi:develop-podman container (ops user is 1000):
Where is this third container we’re in?
Source its .bash_profile
Try to write a file and directory in the scratch directory created from the previous container:
Exit out of all containers back to verdi, clean out the work directory, and exit verdi
Outstanding issues
need this issue fixed so that the rewrite of HOME and the sourcing-of-.bash_profile hack are not needed: option to prevent --userns=keep-id from setting the value of the --workdir option as the HOME · Issue #13185 · containers/podman · GitHub
NOTE: This issue has been resolved, and the above commands reflect usage of the new --passwd-entry flag to set $HOME to /home/ops.
Should we run verdi using the --remote --url unix:/var/run/podman/podman.sock options as well so that it too runs under root?
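Tying the pieces together, a podman invocation combining the remote socket, --userns=keep-id, and the --passwd-entry fix might look like the following sketch (flag template per podman-run(1); the socket variable and mounts follow the examples above, and the sketch constructs and prints the command rather than executing it):

```shell
# keep-id preserves the host UID:GID inside the container; --passwd-entry
# forces the container's /etc/passwd HOME field to /home/ops (podman expands
# $USERNAME/$UID/$GID at runtime, so single quotes keep them unexpanded here).
CMD='podman --remote --url unix:$PODMAN_SOCK run -it --rm --userns=keep-id --passwd-entry="$USERNAME:*:$UID:$GID::/home/ops:/bin/bash" -v /data/work:/data/work hysds/verdi:develop-podman'
echo "$CMD"
```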
Things to note
- podman containers run per user whereas docker containers run globally, so with docker one user could exec into another user's container; with podman that won't be straightforward (possible?)
- podman images (podman pull) are saved per user under $HOME/.local/share/containers; enabling a global cache of images will require modifying the podman configuration (/etc/containers/storage.conf)