This post is based on two articles written by Andrea Zonca (the first two links in the References below). These articles helped me a lot when implementing my cluster, but I ran into many problems because the frameworks have changed their configuration since the articles were written. Here I update the information for the current framework versions:
- Docker version 18.03.1-ce
- Jupyterhub 0.8.1
- nvidia-docker2
In my particular case, I need an internal cluster for research, and my site will not be public, so I removed the authentication part and implemented my own authentication class. The cluster runs under a single Ubuntu user, and authentication grants access to specific usernames (this post only shows a dummy authenticator for simplicity). I created a shared folder outside my user home (/export); you can change this as in Zonca's articles if you wish. I am not using OpenStack, but I hope to integrate it later.
As of this writing, nvidia-docker2 does not support Docker swarm mode, so I used the standalone (classic) Docker Swarm instead.
We start from this point:
1) Main Server
Setup Docker Swarm
You must log in as root.
Configure the file /etc/init/docker.conf and replace DOCKER_OPTS= in the start section with:
DOCKER_OPTS="-H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock"
This allows the server to communicate with the nodes. Then you can restart the docker service:
systemctl daemon-reload
systemctl restart docker
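Note: /etc/init/docker.conf is an Upstart file; on newer Ubuntu releases that use systemd it may be ignored. In that case, a common alternative (an assumption on my part, check your distribution) is to create a systemd drop-in with:
sudo systemctl edit docker
and put the same options in the override:
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock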
You can check that your configuration is OK with the command:
service docker status
The docker service and its subprocesses will be listed; the dockerd daemon must appear like this:
CGroup: /system.slice/docker.service
├─12764 /usr/bin/dockerd -H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock
├─12776 docker-containerd --config /var/run/docker/containerd/containerd.toml
Tip: if after the restart your service docker status does not look like this, you can stop the docker service and execute these commands:
service docker stop
dockerd -H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock
service docker start
Now we need to run two Swarm services:
- Consul: a distributed key-value store listening on port 8500. It will store the information about the available nodes (a quick check of it follows below).
docker run --restart=always -d -p 8500:8500 --name=consul progrium/consul -server -bootstrap
- Swarm Manager: provides the interface to Docker Swarm:
HUB_LOCAL_IP=<THE IP IN YOUR PRIVATE NETWORK>
docker run --restart=always -d -p 4000:4000 dockerswarm/swarm:master manage -H :4000 --replication --advertise $HUB_LOCAL_IP:4000 consul://$HUB_LOCAL_IP:8500
I recommend using your internal (private network) IP for HUB_LOCAL_IP.
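You can also check that Consul is answering on its HTTP API (a quick sanity check; it assumes curl is installed):
curl http://$HUB_LOCAL_IP:8500/v1/status/leader
It should print the address of the Consul leader.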
You can check if the containers are running with:
docker ps -a
and then you can check whether the connection to Docker Swarm works on port 4000:
docker -H :4000 ps -a
Setup Jupyterhub
Create a user; in my case the username is user.
On your host you must install Jupyterhub. I installed it following the step-by-step instructions from Zonca:
wget --no-check-certificate https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
Use all defaults, and answer "yes" to modify PATH.
sudo apt-get install npm nodejs-legacy
sudo npm install -g configurable-http-proxy
conda install traitlets tornado jinja2 sqlalchemy
pip install jupyterhub
Then you must install dockerspawner:
pip install dockerspawner
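You can optionally confirm that the installed version matches the one listed at the top of this post:
jupyterhub --version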
You need a jupyterhub_config.py to configure the connection with Docker. You can use my configuration (a minimal sketch follows this list):
- I configure the nvidia runtime and include some examples of volumes (to share folders).
- I add some constraints to limit the number of CPU cores and the amount of memory.
- I use a DummyAuthenticator as an example. You can change this for your specific case.
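Here is a minimal sketch of what such a jupyterhub_config.py can look like. The image name, IPs, limits, and allowed usernames are placeholders, not my real values, and I assume a dockerspawner of the 0.9 era:
```
# jupyterhub_config.py -- a minimal sketch, not my exact file.
# Image name, IPs, limits, and usernames are placeholders: adjust them.
import os
from tornado import gen
from jupyterhub.auth import Authenticator

c = get_config()

# Jupyterhub listens on port 9000 (the port used at the end of this post)
c.JupyterHub.port = 9000
# IP that the spawned containers use to reach back to the hub
c.JupyterHub.hub_ip = '0.0.0.0'
c.JupyterHub.hub_connect_ip = os.environ.get('HUB_LOCAL_IP', '10.0.0.1')

# Spawn single-user servers as Docker containers. dockerspawner talks to
# whatever endpoint DOCKER_HOST points at, which must be the Swarm manager.
c.JupyterHub.spawner_class = 'dockerspawner.DockerSpawner'
c.DockerSpawner.image = 'mygpu/singleuser:latest'  # placeholder image

# Volumes to share: host path -> path inside the container
c.DockerSpawner.volumes = {'/export': '/export'}

# nvidia runtime plus CPU-core and memory constraints
c.DockerSpawner.extra_host_config = {
    'runtime': 'nvidia',
    'cpuset_cpus': '0-3',  # example: limit to 4 cores
    'mem_limit': '16g',    # example: limit to 16 GB
}

# Dummy authentication: accept a fixed set of usernames with any password.
class DummyAuthenticator(Authenticator):
    @gen.coroutine
    def authenticate(self, handler, data):
        allowed = {'user'}
        if data['username'] in allowed:
            return data['username']

c.JupyterHub.authenticator_class = DummyAuthenticator
```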
Share user home via NFS
Install NFS with package manager:
sudo apt-get install nfs-kernel-server
Create the folder /export/nfs and add this line to /etc/exports:
/export *(rw,sync,no_subtree_check)
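After editing /etc/exports, reload the export table so the nodes can mount the folder:
sudo exportfs -ra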
2) Nodes
Setup the Docker Swarm nodes
Configure the file /etc/init/docker.conf and replace DOCKER_OPTS= in the start section with:
DOCKER_OPTS="-H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock"
You must check that the DOCKER_OPTS are working, just as in the first part.
Then run the container that interfaces with Swarm:
HUB_LOCAL_IP=10.XX.XX.XX
NODE_LOCAL_IP=$(ip route get 8.8.8.8 | awk 'NR==1 {print $NF}')
docker run --restart=always -d swarm join --advertise=$NODE_LOCAL_IP:2375 consul://$HUB_LOCAL_IP:8500
HUB_LOCAL_IP: the LOCAL IP of your manager machine.
NODE_LOCAL_IP: the LOCAL IP of your node machine (if the awk extraction does not return the right IP on your system, just set NODE_LOCAL_IP by hand).
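Back on the main server you can verify that the node has registered; with classic Swarm the joined nodes appear in the output of:
docker -H :4000 info
It can take a few seconds for a new node to show up.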
Setup mounting the home filesystem
sudo apt-get install autofs
Mount the folder that will be shared between the nodes and the server:
sudo mount $HUB_LOCAL_IP:/export /export
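Finally, back on the main server, start Jupyterhub itself with Docker pointing at the Swarm manager; dockerspawner uses the standard DOCKER_HOST variable to decide which Docker endpoint to talk to. A minimal sketch, assuming the configuration file shown earlier:
export DOCKER_HOST=tcp://$HUB_LOCAL_IP:4000
jupyterhub -f jupyterhub_config.py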
After all of this, you can open your Jupyterhub server in a browser (MYIP:9000 in my case) and enjoy!
References
- https://zonca.github.io/2016/10/dockerspawner-cuda.html
- https://zonca.github.io/2016/05/jupyterhub-docker-swarm.html
- https://github.com/jupyterhub/dockerspawner
- https://hub.docker.com/_/swarm/
- https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0)
- https://docs.docker.com/install/
Author: Cristian Muñoz
e-mail: crisstrink@gmail.com