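Building a Ray Cluster with Docker

This post starts from the Ubuntu 16.04 Docker image, installs Miniconda in it, and sets up a Ray cluster environment on top of conda.

Building the Docker image

Write a Dockerfile that switches the apt sources to the Aliyun mirror, then build the image from it: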

FROM ubuntu:16.04
RUN mv /etc/apt/sources.list.d /etc/apt/sources.list.d.bak
RUN mv /etc/apt/sources.list /etc/apt/sources.list.bak && \
echo "deb http://mirrors.aliyun.com/ubuntu/ xenial main restricted" >>/etc/apt/sources.list && \
echo "deb-src http://mirrors.aliyun.com/ubuntu/ xenial main restricted multiverse universe" >>/etc/apt/sources.list && \
echo "deb http://mirrors.aliyun.com/ubuntu/ xenial-updates main restricted" >>/etc/apt/sources.list && \
echo "deb-src http://mirrors.aliyun.com/ubuntu/ xenial-updates main restricted multiverse universe" >>/etc/apt/sources.list && \
echo "deb http://mirrors.aliyun.com/ubuntu/ xenial universe" >>/etc/apt/sources.list && \
echo "deb http://mirrors.aliyun.com/ubuntu/ xenial-updates universe" >>/etc/apt/sources.list && \
echo "deb http://mirrors.aliyun.com/ubuntu/ xenial multiverse" >>/etc/apt/sources.list && \
echo "deb http://mirrors.aliyun.com/ubuntu/ xenial-updates multiverse" >>/etc/apt/sources.list && \
echo "deb http://mirrors.aliyun.com/ubuntu/ xenial-backports main restricted universe multiverse" >>/etc/apt/sources.list && \
echo "deb-src http://mirrors.aliyun.com/ubuntu/ xenial-backports main restricted universe multiverse" >>/etc/apt/sources.list && \
echo "deb http://mirrors.aliyun.com/ubuntu/ xenial-security main restricted" >>/etc/apt/sources.list && \
echo "deb-src http://mirrors.aliyun.com/ubuntu/ xenial-security main restricted multiverse universe" >>/etc/apt/sources.list && \
echo "deb http://mirrors.aliyun.com/ubuntu/ xenial-security universe" >>/etc/apt/sources.list && \
echo "deb http://mirrors.aliyun.com/ubuntu/ xenial-security multiverse" >>/etc/apt/sources.list
RUN apt update
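
The fourteen echo lines above follow a regular pattern: entries for the suites xenial, xenial-updates, xenial-backports and xenial-security, each with a set of components. As a minimal sketch, and not part of the original setup, the deb entries could be generated in Python. The helper name mirror_sources is hypothetical; its output merges all components onto one line per suite and omits the deb-src entries.

```python
# Hypothetical helper: regenerates the "deb" entries written by the Dockerfile.
MIRROR = "http://mirrors.aliyun.com/ubuntu/"
SUITES = ["xenial", "xenial-updates", "xenial-backports", "xenial-security"]
COMPONENTS = ["main", "restricted", "universe", "multiverse"]


def mirror_sources(mirror=MIRROR, suites=SUITES, components=COMPONENTS):
    """Return one 'deb' line per suite, listing every component."""
    return [f"deb {mirror} {suite} {' '.join(components)}" for suite in suites]


if __name__ == "__main__":
    # Print the lines; redirect to a file to use them as a sources.list.
    print("\n".join(mirror_sources()))
```

The printed lines can be written to a file and copied into the image instead of chaining echo commands.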

Place the Dockerfile in your home directory and run the following commands there to build the image:

docker build -t awebone/ubuntu16.04 .
docker image ls -a # list the images


Basic preparation

Start a container in the background and open a shell inside it:

docker run --name ray -itd awebone/ubuntu16.04 /bin/bash
docker exec -it ray /bin/bash

Install the basic packages:

apt install python-software-properties -y
apt install software-properties-common -y

Install commonly used tools:

apt install -y vim wget git


Installing Miniconda

Install conda:

wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda3-py37_4.8.2-Linux-x86_64.sh
bash Miniconda3-py37_4.8.2-Linux-x86_64.sh
source ~/.bashrc
conda -V

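Configuring the conda mirror

The mirror is configured in the .condarc file in the user's home directory. Run conda config --set show_channel_urls yes first to generate the file, then edit it as follows: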

channels:
  - defaults
show_channel_urls: true
channel_alias: https://mirrors.tuna.tsinghua.edu.cn/anaconda
default_channels:
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/pro
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
custom_channels:
  conda-forge: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  msys2: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  bioconda: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  menpo: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  pytorch: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  simpleitk: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
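
To illustrate what custom_channels does: conda resolves a channel name listed there to the mapped base URL followed by the channel name, and names without an entry fall back to channel_alias. The following is a simplified sketch of that lookup, not conda's actual code; resolve_channel is a hypothetical helper and only two of the channels above are reproduced.

```python
# Values copied from the .condarc above (subset of custom_channels).
CHANNEL_ALIAS = "https://mirrors.tuna.tsinghua.edu.cn/anaconda"
CUSTOM_CHANNELS = {
    "conda-forge": "https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud",
    "pytorch": "https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud",
}


def resolve_channel(name):
    """Simplified model: base URL from custom_channels, else channel_alias."""
    base = CUSTOM_CHANNELS.get(name, CHANNEL_ALIAS)
    return f"{base.rstrip('/')}/{name}"


if __name__ == "__main__":
    for channel in ("conda-forge", "pytorch", "some-other-channel"):
        print(channel, "->", resolve_channel(channel))
```

With this configuration, a command such as conda install -c conda-forge would therefore fetch from the mirror rather than from the upstream server.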

Run conda clean -i to clear the index cache so that the index served by the mirror is used.

Creating the environment

conda create -n ray python=3.6
conda activate ray # activate the ray environment
conda deactivate # leave the environment
conda env list # list all environments
conda info --envs # also lists all environments


Installing Ray

Configuring the pip mirror

Activate the conda environment, then point pip at the Tsinghua mirror:

conda activate ray # activate the ray environment
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

Installing the ray packages

pip install ray # Ray core
pip install "ray[tune]" # Ray Tune extras (quoted so the shell does not expand the brackets)
pip install "ray[rllib]" # RLlib extras
pip install tensorflow # TensorFlow (CPU)
pip install requests # requests


Saving the container

Save the configured container as an image:

docker commit 37639cc72d75 ubuntu-conda-ray # 37639cc72d75 is the container ID; use your own ID or the container name (ray)


Running and using the Ray cluster

Starting the containers

docker run --shm-size 1000m --name ray1 -itd ubuntu-conda-ray /bin/bash
docker run --shm-size 1000m --name ray2 -itd ubuntu-conda-ray /bin/bash
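
Starting more than two nodes by hand gets repetitive. As a rough sketch, the docker run commands above could be generated in Python; node_run_command is a hypothetical helper, and the image name and --shm-size value are simply the ones used above.

```python
import shlex


def node_run_command(index, image="ubuntu-conda-ray", shm_size="1000m"):
    """Build the docker run argument list for the container named ray<index>."""
    return ["docker", "run", "--shm-size", shm_size,
            "--name", f"ray{index}", "-itd", image, "/bin/bash"]


if __name__ == "__main__":
    # Print the commands for two nodes; nothing is executed here.
    for i in (1, 2):
        print(" ".join(shlex.quote(arg) for arg in node_run_command(i)))
```

To actually start the containers, each argument list can be passed to subprocess.run(cmd, check=True).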

Open a shell in each container from two separate terminals:

docker exec -it ray1 /bin/bash
docker exec -it ray2 /bin/bash

Activate the conda environment in both containers:

conda activate ray # activate the ray environment

Starting Ray

On ray1: ray start --head --port=6379 # start the head node
On ray2: ray start --address 172.16.0.2:6379 # join the cluster; the address must be the head node's IP and port (172.16.0.2 is assumed here to be ray1, check the actual value)
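
The --address value must be the IP address of the container running the head node. As a sketch, assuming the docker CLI is on the PATH, the head container is named ray1, and it is attached to a single network, the address can be read with docker inspect and turned into the join command. The names inspect_command, container_ip and join_command are hypothetical helpers.

```python
import subprocess

# Go template that prints the IP address of each network the container is on.
# With a single network (assumed here) this is exactly one address.
IP_TEMPLATE = "{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}"


def inspect_command(container):
    """Argument list for reading a container's IP address."""
    return ["docker", "inspect", "-f", IP_TEMPLATE, container]


def container_ip(container):
    """Run docker inspect and return the container's IP address."""
    result = subprocess.run(inspect_command(container), check=True,
                            stdout=subprocess.PIPE, universal_newlines=True)
    return result.stdout.strip()


def join_command(head_ip, port=6379):
    """Argument list a worker node runs to join the head node."""
    return ["ray", "start", "--address", f"{head_ip}:{port}"]


if __name__ == "__main__":
    # Requires a running container named ray1.
    print(" ".join(join_command(container_ip("ray1"))))
```

Run on the Docker host, this prints the ray start command to paste into the worker container.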

Testing the Ray cluster

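# -*- coding: utf-8 -*-
import time

import ray

# Connect to the cluster that was started with "ray start".
ray.init(address="auto")


def f1():
    time.sleep(1)


@ray.remote
def f2():
    time.sleep(1)


# Sequential: 50 calls of 1 second each take about 50 seconds.
time1 = time.time()
[f1() for _ in range(50)]
print(time.time() - time1)

# Parallel: takes about 1 second if the cluster has at least 50 CPUs;
# with fewer CPUs the tasks run in batches and it takes longer.
time2 = time.time()
ray.get([f2.remote() for _ in range(50)])
print(time.time() - time2)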
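
The parallel timing in the test follows from simple arithmetic: each Ray task reserves one CPU by default, so N one-second tasks on C CPUs run in roughly ceil(N / C) batches. The sketch below estimates this lower bound; estimated_seconds is a hypothetical helper, and it ignores scheduling and worker startup overhead.

```python
import math


def estimated_seconds(num_tasks, num_cpus, task_seconds=1.0):
    """Lower bound on wall time when each task occupies one CPU."""
    batches = math.ceil(num_tasks / num_cpus)
    return batches * task_seconds


if __name__ == "__main__":
    for cpus in (1, 10, 50):
        print(cpus, "CPUs:", estimated_seconds(50, cpus), "seconds")
```

Comparing the measured time of the remote calls with this estimate is a quick way to see whether both containers are contributing CPUs to the cluster.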

Using Ray RLlib

rllib train \
--run=PG \
--env=CartPole-v0 \
--config='{"output": "/tmp/cartpole-out", "output_max_file_size": 5000000}' \
--stop='{"timesteps_total": 100000}'
ls -l /tmp/cartpole-out
rllib train \
--run=DQN \
--env=CartPole-v0 \
--config='{
"input": "/tmp/cartpole-out",
"input_evaluation": [],
"explore": false}'

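The --config and --stop arguments are JSON strings, which are easy to misquote in a shell. As a sketch, they can be built with json.dumps and quoted with shlex.quote. The option names and values are the ones from the first command above; rllib_train_command is a hypothetical helper.

```python
import json
import shlex


def rllib_train_command(run, env, config, stop=None):
    """Build the rllib train argument list, serialising dicts to JSON."""
    cmd = ["rllib", "train", f"--run={run}", f"--env={env}",
           "--config=" + json.dumps(config)]
    if stop is not None:
        cmd.append("--stop=" + json.dumps(stop))
    return cmd


if __name__ == "__main__":
    cmd = rllib_train_command(
        "PG", "CartPole-v0",
        {"output": "/tmp/cartpole-out", "output_max_file_size": 5000000},
        {"timesteps_total": 100000},
    )
    # Print a shell-safe version of the command; nothing is executed here.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Python's False is serialised as the JSON literal false, so the "explore": false setting of the second command can be written as {"explore": False}.
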
Reinforcement learning in practice: the PPO algorithm

Write the algorithm file ppo.py:

import gym
from gym.spaces import Discrete, Box
from ray import tune


class SimpleCorridor(gym.Env):
    """Corridor in which the agent must walk right to reach the exit."""

    def __init__(self, config):
        self.end_pos = config["corridor_length"]
        self.cur_pos = 0
        self.action_space = Discrete(2)  # 0 = move left, 1 = move right
        self.observation_space = Box(0.0, self.end_pos, shape=(1,))

    def reset(self):
        self.cur_pos = 0
        return [self.cur_pos]

    def step(self, action):
        if action == 0 and self.cur_pos > 0:
            self.cur_pos -= 1
        elif action == 1:
            self.cur_pos += 1
        done = self.cur_pos >= self.end_pos
        # Reward 1 only on the step that reaches the end of the corridor.
        return [self.cur_pos], 1 if done else 0, done, {}


tune.run(
    "PPO",
    config={
        "env": SimpleCorridor,
        "num_workers": 4,
        "env_config": {"corridor_length": 5},
    },
)

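Before starting a training run it can help to check the environment's dynamics by hand. The sketch below mirrors the position and reward logic of SimpleCorridor in plain Python, without gym or Ray, and walks right until the episode ends. CorridorLogic and rollout_right are hypothetical names used only for this illustration.

```python
class CorridorLogic:
    """Plain-Python copy of SimpleCorridor's state transitions."""

    def __init__(self, corridor_length):
        self.end_pos = corridor_length
        self.cur_pos = 0

    def reset(self):
        self.cur_pos = 0
        return [self.cur_pos]

    def step(self, action):
        if action == 0 and self.cur_pos > 0:
            self.cur_pos -= 1
        elif action == 1:
            self.cur_pos += 1
        done = self.cur_pos >= self.end_pos
        return [self.cur_pos], 1 if done else 0, done, {}


def rollout_right(env):
    """Always choose action 1 (right); return (steps taken, total reward)."""
    env.reset()
    steps, total_reward, done = 0, 0, False
    while not done:
        _, reward, done, _ = env.step(1)
        steps += 1
        total_reward += reward
    return steps, total_reward


if __name__ == "__main__":
    print(rollout_right(CorridorLogic(5)))  # prints (5, 1)
```

With corridor_length set to 5, the best possible episode therefore takes 5 steps and earns a total reward of 1, which is what a trained PPO policy should approach.
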
Run ppo.py and check the training status:

[Figure: PPO training status]

Check the CPU usage:

[Figure: htop output on ray1]


References

https://docs.docker.com/engine/reference/run/

https://www.jianshu.com/p/2a5cd519e583

https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/

https://blog.csdn.net/u011552182/article/details/80054899

https://github.com/ray-project/rl-experiments

https://www.twblogs.net/a/5db5755fbd9eee310da07c50?lang=zh-cn

https://blog.csdn.net/luanpeng825485697/article/details/88242020


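Author: Awebone

Published: 2020-07-13 15:07

Last updated: 2020-07-18 16:07

Original link: https://www.awebone.com/posts/101d1957/

License: Attribution-NonCommercial-NoDerivatives 4.0 International. Please keep the original link and author when reposting.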