Analysis of Podman Integration into HySDS core
Currently, HySDS uses docker as the default container engine for running the verdi job worker container (verdi) as well as the job containers themselves. However, as of Red Hat Enterprise Linux 8 and its various open-source community variants (Rocky Linux, AlmaLinux, Oracle Linux, and so on), support for Docker has been removed and replaced with Podman, an alternative container engine that emphasizes security. From https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/building_running_and_managing_containers/assembly_starting-with-containers_building-running-and-managing-containers#con_running-containers-without-docker_assembly_starting-with-containers:
Red Hat removed the Docker container engine and the docker command from RHEL 8.
If you still want to use Docker in RHEL, you can get Docker from different upstream projects, but it is unsupported in RHEL 8.
- You can install the podman-docker package so that every time you run a docker command, it actually runs a podman command.
- Podman also supports the Docker Socket API, so the podman-docker package also sets up a link between /var/run/docker.sock and /var/run/podman/podman.sock. As a result, you can continue to run your Docker API commands with docker-py and docker-compose tools without requiring the Docker daemon. Podman will service the requests.
- The podman command, like the docker command, can build container images from a Containerfile or Dockerfile. The available commands that are usable inside a Containerfile and a Dockerfile are equivalent.
- Options to the docker command that are not supported by podman include network, node, plugin (podman does not support plugins), rename (use rm and create to rename containers with podman), secret, service, stack, and swarm (podman does not support Docker Swarm). The container and image options are used to run subcommands that are used directly in podman.
It is also important to note that some features of docker may not be implemented in podman. Please reference the podman documentation to verify the existence of any docker feature in podman.
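The Docker-compatible socket API mentioned above can be exercised with any HTTP client. A minimal sketch of such an invocation (the API version path is illustrative, and the socket path assumes the podman.socket service is active; this constructs and prints the command rather than executing it):

```shell
# The podman-docker package links /var/run/docker.sock to this socket,
# so Docker API clients keep working with podman servicing the requests.
SOCK=/var/run/podman/podman.sock
CMD="curl --unix-socket $SOCK http://d/v1.40/info"
echo "$CMD"
```

Tools like docker-py and docker-compose speak this same API, which is why they continue to work without the Docker daemon.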
In this document, we will:
- describe HySDS core's current capability in using docker, using an example from NISAR
- describe two designs for integrating Podman into HySDS core based on Podman's features and security enhancements
- propose the design to move forward with
NOTE: The screenshots of the diagrams below are extracted from this presentation, which includes animations.
Current Support for Docker
The following diagram shows the layout of a worker (verdi) EC2 instance:
Note the following assumptions, which are based on NISAR's adaptation requirements:
- a verdi worker instance is an instance of an AWS autoscaling group which, when instantiated:
  - ensures the docker daemon is running
  - creates an EBS volume from a snapshot which contains the verdi container image cached along with the project-specific container images (nisar_pcm, l0a_pge, etc.)
  - utilizes an EBS volume for the root work directory: /data/work
Upon bootup, the verdi worker instance starts up the verdi container, which is configured to pull jobs and publish products to its configured HySDS cluster (mozart, metrics, grq, factotum). The celery worker running in this hysds/verdi container is the process that pulls a job from the mozart RabbitMQ queue it's listening to. One of its main functions is to create a unique job work directory under /data/work:
Note that to ensure containers run on the instance are able to read and write output to their respective work directories, the user on the host that starts up the verdi container must pass in its UID:GID via the docker option -u UID:GID. This ensures that the user running in the container has the appropriate ownership attributes within the job work directory.
Afterwards, the celery worker is ready to start up the actual job as a container. Note that, in the case of docker, this is not DinD (Docker-in-Docker); instead, the verdi worker is able to communicate with the docker daemon on the host to spawn off other containers. Here, the job that verdi pulled starts up the nisar_pcm container, which contains all the logic (pre-conditions, post-conditions, configuration, input localization, etc.) needed to successfully run a SAS (scientific algorithm software) container:
In the diagram above we see the verdi container talk to the docker daemon running on the host to start up the nisar_pcm container, which:
- starts off in the job work directory that was created for it by the verdi container
- is given the UID:GID of the user on the host that owns the job work directory (via the -u UID:GID option)
- starts downloading inputs and creates any run configuration files needed to run the actual SAS container to produce science products
Once the nisar_pcm container has done what it needs to do to prep for the execution of the SAS container, it itself talks to the docker daemon on the host to start up the SAS container, in this case l0a_pge:
In the diagram above, and similar to the previous step, we see the nisar_pcm container talk to the docker daemon running on the host to start up the l0a_pge container, which:
- also starts off in the job work directory that was created by the verdi container
- is given the UID:GID of the user on the host that owns the job work directory (via the -u UID:GID option)
- utilizes the run configuration files created and the inputs downloaded by the nisar_pcm container to generate output products
Once the l0a_pge container is done creating products, it exits and the container is removed. Similarly, the nisar_pcm container will perform some final steps (e.g. create datasets, clean up large data files, etc.) as post-conditions, and it too will exit and have its container removed:
Finally, as shown above, the verdi container will publish any output datasets it recognizes in the job work directory and pull another job to begin the whole cycle over again. Before pulling the next job, verdi will check that there is enough space to run it and, if not, delete old work directories.
Podman
There is extensive information on podman out on the web, so we will defer to those resources. There is also a nice write-up on Podman by @Dustin Lo here: https://hysds-core.atlassian.net/wiki/spaces/HYS/pages/1972895745.
Regarding integrating podman into HySDS as an alternative container runtime/engine, we will utilize the use case described in the “Current Support for Docker” section above as the baseline capability that podman needs to support. The core requirements we can extract from that use case are:
- a container should be able to CRUD (create/read/update/delete) files and directories under the host
- a container (parent) should be able to run another container (child)
- a container should be able to CRUD files and directories that were created by its ancestor containers
- and vice versa: a container should be able to CRUD files and directories that were created by its descendant containers
In docker, fulfilling these requirements is done via the following options:
- -v <host mount>:<container mount>
  - mounts a host directory into the container at some mount point
  - can also be used to mount the docker socket (/var/run/docker.sock) into the container, which can then be used by the docker command in the container to communicate with the host's docker daemon
- -u <UID>:<GID>
  - overrides the UID and GID of the USER in the docker container with the values passed in
HySDS is able to utilize docker to fulfill the above use case by using a set of specific options that complement each other and propagating those options to subsequent docker container instantiations. That is, as long as these are true for all containers that run on a host, the above use case can be executed successfully:
- UID:GID of the host user is passed to the docker container
- host directories/files being mounted into the docker container are owned by the UID:GID of the host user
- the /var/run/docker.sock file on the host is mounted into the docker container at /var/run/docker.sock
- docker is installed in the docker image
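A minimal sketch of a docker run invocation satisfying all four of these conditions (the image name and mount paths follow the NISAR example above; the sketch constructs and prints the command rather than executing it):

```shell
HOST_UID_GID="$(id -u):$(id -g)"     # condition 1: host user's UID:GID
WORK_DIR=/data/work                  # condition 2: host-owned work directory
DOCKER_SOCK=/var/run/docker.sock     # condition 3: host docker socket
# condition 4: the hysds/verdi image itself has docker installed
CMD="docker run --rm -u $HOST_UID_GID -v $WORK_DIR:$WORK_DIR -v $DOCKER_SOCK:$DOCKER_SOCK hysds/verdi"
echo "$CMD"
```

Because the work directory and socket mounts use identical host and container paths, a container started this way can pass the same options along verbatim when it spawns its own child container.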
Podman, however, introduces a kink because it uses a Linux feature called user namespaces to isolate processes and provide better security. Although podman touts being able to just alias docker=podman, that isn't necessarily true for a more complicated use case such as the one we described above. Although the syntax of podman commands mirrors that of docker commands, the semantics of some of these podman options are different.
For a good introduction to podman’s use of user namespaces and how it interacts with volume mounts, see . The following image from that page gives a high-level view of this:
Additionally, podman does the following things differently from docker:
- all docker operations go through the docker daemon (running as root), whereas podman by default is daemonless and podman operations are isolated by user
- container images (podman pull) are stored in user-specific areas (by default $HOME/.local/share/containers), whereas docker (docker pull) stores images in a global location (/var/lib/docker) irrespective of which host user pulls them
  - this can be modified via podman configuration files under /etc/containers
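As a sketch of that configuration hook, podman's storage settings live in /etc/containers/storage.conf (keys per containers-storage.conf(5)). For example, an additional read-only image store can be declared so that users share pre-pulled images; the path below is hypothetical:

```toml
# /etc/containers/storage.conf (excerpt)
[storage]
driver = "overlay"

[storage.options]
# Hypothetical shared, read-only image store that every user's podman
# can read from in addition to its own $HOME/.local/share/containers.
additionalimagestores = [
  "/var/lib/shared-containers/storage",
]
```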
Podman Integration: Design #1
PINP (Podman in Podman) is possible; however, because each podman container needs to download the container image it's going to run, by default it needs to do that every single time, even for the same image. We can play games with mounting $HOME/.local/share/containers into each and every container we start. However, the bigger problem is accounting for the subuid/subgid configurations of each container with respect to its parent container as well as its child container, and how that interplays with the host mounts. More investigation needs to be done to determine the final viability of this design, but initially it looks like it deviates quite a bit from the generality of the docker design and HySDS' integration of docker. The following figure shows the above use case using PINP (Podman in Podman):
Podman Integration: Design #2
Although podman touts being daemonless, installing podman does install (though not enable) the podman.socket systemd service, which enables the Podman 2.x API. In short, this API provides two sets of methods: one compatible with the docker API (docker daemon) and one specific to the libpod API used by podman. What this means is that with a few modifications to the OS running on the host, we can have the podman daemon run at system bootup:
$ sudo systemctl enable --now podman
$ sudo chmod 777 /var/run/podman/podman.sock
The above commands will enable the podman service, and a socket file will be created at /var/run/podman/podman.sock. Note, however, that the podman service run this way runs under the root user's environment, and thus the sock file will only be readable and writable by root. We would subsequently need to chmod the sock file so that all users on the host can communicate with the API.
In order to start up a container using this API, bypassing the default behavior, we would need to call podman with the following options: --remote --url unix:/var/run/podman/podman.sock. The following table shows the differences when running podman with and without these options:
| command | storage location |
|---|---|
| podman pull docker.io/hysds/verdi | $HOME/.local/share/containers |
| podman --remote --url unix:/var/run/podman/podman.sock pull docker.io/hysds/verdi | /var/lib/containers (because the API is running as root) |
Alternatively, the podman API could be run via systemd on a per-user basis:
$ sudo loginctl enable-linger $USER
$ export XDG_RUNTIME_DIR=/run/user/$(id -u)
$ systemctl --user --now enable podman.socket
Created symlink /home/ops/.config/systemd/user/sockets.target.wants/podman.socket → /usr/lib/systemd/user/podman.socket.
$ systemctl --user status podman.socket
● podman.socket - Podman API Socket
Loaded: loaded (/usr/lib/systemd/user/podman.socket; enabled; vendor preset: enabled)
Active: active (listening) since Mon 2022-02-14 18:15:42 UTC; 36s ago
Docs: man:podman-system-service(1)
Listen: /run/user/1001/podman/podman.sock (Stream)
CGroup: /user.slice/user-1001.slice/user@1001.service/podman.socket
$ export PODMAN_SOCK=/run/user/1001/podman/podman.sock
The above commands will enable the podman service in userspace, and a socket file will be created for our host user at /run/user/1001/podman/podman.sock. Note that the podman service run this way runs under the host user's environment, and thus the sock file will only be readable and writable by that user. We don't need to chmod the sock file because our host user owns it by default and can already communicate with the API.
In order to start up a container using this API, bypassing the default behavior, we would need to call podman with the following options: --remote --url unix:/run/user/1001/podman/podman.sock. The following table shows the differences when running podman with and without these options:
| command | storage location |
|---|---|
| podman pull docker.io/hysds/verdi | $HOME/.local/share/containers |
| podman --remote --url unix:/run/user/1001/podman/podman.sock pull docker.io/hysds/verdi | $HOME/.local/share/containers |
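As a convenience, rather than passing --remote --url on every invocation, podman also honors the CONTAINER_HOST environment variable, so the socket can be configured once per shell. A sketch, with the path mirroring the user-service example above and the UID resolved at runtime:

```shell
# Point subsequent podman invocations at the user-level API socket.
PODMAN_SOCK="/run/user/$(id -u)/podman/podman.sock"
export CONTAINER_HOST="unix://$PODMAN_SOCK"
echo "$CONTAINER_HOST"
```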
The following diagram shows how we could utilize the podman API to fulfill the test case described above:
However, we will still have to deal with user namespaces and the fact that podman utilizes subuids and subgids to compartmentalize user containers. The following sections show experimentation with podman and how we can get most of the way there. However, there remains one outstanding issue that needs to be resolved in podman in order for it to be a drop-in replacement for docker.
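To make the subuid/subgid bookkeeping concrete: given a hypothetical /etc/subuid entry of ops:100000:65536, a rootless container's UID 0 maps to the host user's own UID, while container UID N (for N >= 1) maps to subuid_start + N - 1. A quick sketch of that arithmetic:

```shell
# Hypothetical /etc/subuid entry: ops:100000:65536
SUBUID_START=100000
CONTAINER_UID=9999    # e.g. the ops user inside the pge-base image
HOST_UID=$((SUBUID_START + CONTAINER_UID - 1))
echo "$HOST_UID"      # container UID 9999 appears on the host as UID 109998
```

This is why files written by a container user can show up on the host owned by a large, unfamiliar UID unless options like --userns=keep-id are used.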
Testing/Experimenting with podman using vagrant (Oracle Linux 8 VM)
Start VM and ssh into it as vagrant user
$ cd ~/tmp
$ mkdir podman_testing
$ vagrant init hysds/base
$ vagrant up
$ vagrant ssh
Create ops user with UID:GID that is not 1000:1000 and has sudo
Become the ops user and confirm UID:GID is not 1000:1000
Depending on your VM setup, you may need to become root first prior to becoming the ops user:
Install podman
Start up the podman service (2 OPTIONS)
(OPTION 1) Start up the podman-service as root user and allow all users to write to it
(OR OPTION 2) Start up the podman-service as ops user
NOTE: The example commands below will make reference to PODMAN_SOCK to account for running the podman socket service as either root or as a user.
Pull hysds/verdi:develop-podman image
verdi docker image variant using rockylinux:8 as image base
ops user as UID:GID 1000:1000
uninstalled docker
installed podman
Pull hysds/pge-base:develop-9999-podman image
pge-base docker image variant using rockylinux:8 as image base
ops user as UID:GID 9999:9999
uninstalled docker
installed podman
Run verdi container mounting in the root work directory
Create /data/work directory owned by ops
Check UID:GID
Check $HOME
Check ownership of /data/work
Try creating job directory and file
What do the permissions look like on host:
Unshared (in user namespace):
Do not run the podman unshare command if you're using a podman socket started as a user. Doing so resulted in the following error, which also broke any subsequent podman commands run afterwards:
To resolve this, at this time, you'd have to run a “podman system reset” to get podman back into a usable state.
Check what pods are running on host:
Can the user on the host remove the work dir? Yes.
Back to verdi container, restore the work dir and test file:
Start the pge-base:develop-9999-podman container using the remote socket
Where are we and what is home?
source its .bash_profile
Try to write a file and directory in the work directory from pge-base container:
What does permission look like on host?
What does permissions look like on host in user namespace (unshare)?
Do not run the podman unshare command if you're using a podman socket started as a user. Doing so resulted in the following error, which also broke any subsequent podman commands run afterwards:
To resolve this, at this time, you'd have to run a “podman system reset” to get podman back into a usable state.
Check what containers are running on host:
Let’s run a third container (from the PGE container) using a hysds/verdi:develop-podman container (ops user is 1000):
Where is this third container we’re in?
Source its .bash_profile
Try to write a file and directory in the scratch directory created from the previous container:
Exit out of all containers back to verdi, clean out the work directory, and exit verdi
Outstanding issues
need this issue fixed so that the rewrite of HOME and the sourcing-of-.bash_profile hack are not needed: option to prevent --userns=keep-id from setting the value of the --workdir option as the HOME · Issue #13185 · containers/podman · GitHub
NOTE: This issue has been resolved, and the above commands reflect usage of the new --passwd-entry flag to set $HOME to /home/ops.
Should we run verdi using the --remote --url unix:/var/run/podman/podman.sock options as well so that it too runs under root?
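Tying the pieces together, a podman invocation combining the remote socket, --userns=keep-id, and the --passwd-entry fix might look like the following sketch (flag template per podman-run(1); the socket variable and mounts follow the examples above, and the sketch constructs and prints the command rather than executing it):

```shell
# keep-id preserves the host UID:GID inside the container; --passwd-entry
# forces the container's /etc/passwd HOME field to /home/ops (podman expands
# $USERNAME/$UID/$GID at runtime, so single quotes keep them unexpanded here).
CMD='podman --remote --url unix:$PODMAN_SOCK run -it --rm --userns=keep-id --passwd-entry="$USERNAME:*:$UID:$GID::/home/ops:/bin/bash" -v /data/work:/data/work hysds/verdi:develop-podman'
echo "$CMD"
```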
Things to note
- podman containers run per user whereas docker containers run globally, so with docker one user could exec into another user's container; with podman that won't be straightforward (possible?)
- podman images (podman pull) are saved per user under $HOME/.local/share/containers; enabling a global cache of images will require modifying the podman configuration (/etc/containers/storage.conf)