# ElasticDL: A Kubernetes-native Deep Learning Framework
## Development Docker Image
Note that Docker 17.05 or higher is required to build the Docker images,
because the Dockerfile uses multi-stage builds.
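The version requirement can be checked before building. The following is a minimal sketch, not part of the ElasticDL tooling: the function names `version_ge` and `check_docker_version` are made up for illustration, and the comparison assumes a `sort` that supports `-V` (as in GNU coreutils).

```bash
# Succeeds when version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort), available in GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Compares the Docker daemon version against the 17.05 minimum.
check_docker_version() {
  required="17.05"
  current=$(docker version --format '{{.Server.Version}}') || return 1
  if version_ge "$current" "$required"; then
    echo "Docker $current supports multi-stage builds"
  else
    echo "Docker $current is older than $required" >&2
    return 1
  fi
}
```

Calling `check_docker_version` before `docker build` gives an early, readable failure instead of a Dockerfile parse error.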
The development Docker image contains the dependencies for ElasticDL
development. In the repo's root directory, run the following command:
```bash
docker build \
--target dev \
-t elasticdl:dev \
-f elasticdl/docker/Dockerfile .
```
To build the Docker image with GPU support, run the following command:
```bash
docker build \
--target dev \
-t elasticdl:dev-gpu \
-f elasticdl/docker/Dockerfile \
--build-arg BASE_IMAGE=tensorflow/tensorflow:2.1.0-gpu-py3 .
```
Note that since ElasticDL depends on TensorFlow, the base image must have
TensorFlow installed.
If you have difficulty downloading from the main PyPI site or the Golang site,
you can pass extra build arguments to `docker build`: `EXTRA_PYPI_INDEX` for an
additional PyPI index and `GO_MIRROR_URL` for a mirror of the Golang
installation package:
```bash
docker build \
--build-arg EXTRA_PYPI_INDEX=https://mirrors.aliyun.com/pypi/simple \
--build-arg GO_MIRROR_URL=http://mirrors.ustc.edu.cn/golang \
--target dev \
-t elasticdl:dev \
-f elasticdl/docker/Dockerfile .
```
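If the mirrors are only needed on some machines, the build arguments can be added conditionally. This is a sketch under the assumption that `EXTRA_PYPI_INDEX` and `GO_MIRROR_URL` are set as shell variables on the host when a mirror is wanted; the wrapper name `build_dev_image` is hypothetical.

```bash
# Builds the dev image, passing mirror build arguments only when the
# corresponding host variables are set and non-empty.
build_dev_image() {
  # Start from the arguments that are always required.
  set -- --target dev -t elasticdl:dev -f elasticdl/docker/Dockerfile .
  if [ -n "${GO_MIRROR_URL:-}" ]; then
    set -- --build-arg "GO_MIRROR_URL=${GO_MIRROR_URL}" "$@"
  fi
  if [ -n "${EXTRA_PYPI_INDEX:-}" ]; then
    set -- --build-arg "EXTRA_PYPI_INDEX=${EXTRA_PYPI_INDEX}" "$@"
  fi
  docker build "$@"
}
```

Run `build_dev_image` from the repo's root directory, since the build context is `.`.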
To develop in the Docker container, run the following command to mount your
cloned `elasticdl` git repo directory (e.g. `EDL_REPO` below) to the `/edl_dir`
directory in the container and start the container:
```bash
EDL_REPO=<your_elasticdl_git_repo>
docker run --rm -u $(id -u):$(id -g) -it \
-v $EDL_REPO:/edl_dir \
-w /edl_dir \
elasticdl:dev
```
## Continuous Integration Docker Image
The continuous integration Docker image contains everything from the
development Docker image, processed demo data in RecordIO format, and the
ElasticDL source code. It is used to run continuous integration with the latest
version of the source code. In the repo's root directory, run the following
command:
```bash
docker build \
--target ci \
-t elasticdl:ci \
-f elasticdl/docker/Dockerfile .
```
## Test and Debug
### Pre-commit Check
We have set up pre-commit checks in the GitHub repo for pull requests, which can
catch some Python style problems. However, to avoid waiting in the Travis CI
queue, you can run the pre-commit checks locally:
```bash
docker run --rm -it -v $EDL_REPO:/edl_dir -w /edl_dir \
elasticdl:dev \
bash -c "pre-commit run -a"
```
### Unit Tests
In the root directory of the `elasticdl` repo inside the dev Docker container,
run the following:
```bash
make -f elasticdl/Makefile && K8S_TESTS=False pytest elasticdl/python/tests
```
You can also start the Docker container and run the unit tests in a single
command:
```bash
docker run --rm -u $(id -u):$(id -g) -it \
-v $EDL_REPO:/edl_dir \
-w /edl_dir \
elasticdl:dev \
bash -c "make -f elasticdl/Makefile && K8S_TESTS=False pytest elasticdl/python/tests"
```
Note that some unit tests require a running Kubernetes cluster. To include
those unit tests, run the following:
```bash
make -f elasticdl/Makefile && pytest elasticdl/python/tests
```
[MaxCompute](https://www.alibabacloud.com/product/maxcompute)-related tests
require additional environment variables. To run those tests, execute the
following:
```bash
docker run --rm -it -v $PWD:/edl_dir -w /edl_dir \
-e MAXCOMPUTE_PROJECT=xxx \
-e MAXCOMPUTE_AK=xxx \
-e MAXCOMPUTE_SK=xxx \
-e MAXCOMPUTE_ENDPOINT=xxx \
elasticdl:dev bash -c "make -f elasticdl/Makefile && \
K8S_TESTS=False pytest elasticdl/python/tests/odps_* \
elasticdl/python/tests/data_reader_test.py"
```
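A missing MaxCompute variable tends to surface only as a test failure inside the container, so it can help to check the host environment first. The sketch below assumes the four variables are exported on the host; the function name `check_maxcompute_env` is hypothetical.

```bash
# Verifies that every MaxCompute variable is exported and non-empty.
# Prints the missing names to stderr and returns 1 if any are absent.
check_maxcompute_env() {
  missing=""
  for name in MAXCOMPUTE_PROJECT MAXCOMPUTE_AK MAXCOMPUTE_SK MAXCOMPUTE_ENDPOINT; do
    # printenv prints nothing when the variable is not in the environment.
    if [ -z "$(printenv "$name")" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "Missing environment variables:$missing" >&2
    return 1
  fi
}
```

With the variables exported, `docker run` can forward them by name alone (`-e MAXCOMPUTE_PROJECT` and so on, without `=value`), which keeps the access keys out of the command line.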
### Test in Docker
In a terminal, start the master to distribute MNIST training tasks:
```bash
docker run --net=host --rm -it -v $EDL_REPO:/edl_dir -w /edl_dir \
elasticdl:dev \
bash -c "python -m elasticdl.python.master.main \
--model_zoo=model_zoo \
--model_def=mnist.mnist_functional_api.custom_model \
--job_name=test \
--training_data=/data/mnist/train \
--validation_data=/data/mnist/test \
--evaluation_steps=15 \
--num_epochs=2 \
--checkpoint_steps=2 \
--grads_to_wait=2 \
--minibatch_size=10 \
--num_minibatches_per_task=10 \
--log_level=INFO"
```
In another terminal, start a worker:
```bash
docker run --net=host --rm -it -v $EDL_REPO:/edl_dir -w /edl_dir \
elasticdl:dev \
bash -c "python -m elasticdl.python.worker.main \
--worker_id=1 \
--model_zoo=model_zoo \
--model_def=mnist.mnist_functional_api.custom_model \
--minibatch_size=10 \
--job_type=training_with_evaluation \
--master_addr=localhost:50001 \
--log_level=INFO"
```
This trains a model defined in
[model_zoo/mnist_functional_api/mnist_functional_api.py](../model_zoo/mnist_functional_api/mnist_functional_api.py)
on MNIST data for 2 epochs. Note that the master saves model checkpoints in a
local directory `checkpoint_dir`.
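When scripting the two steps instead of using two terminals, the worker should start only after the master is reachable. The helper below is a sketch, not part of ElasticDL: it assumes the master listens on the port given in `--master_addr` (`localhost:50001` above), the name `wait_for_port` and the 30-second default are arbitrary, and it relies on bash's `/dev/tcp` pseudo-device.

```bash
# Waits until a TCP port accepts connections, or gives up after a timeout.
# Uses bash's /dev/tcp pseudo-device, so it must run under bash.
wait_for_port() {
  host=$1
  port=$2
  timeout=${3:-30}   # seconds; 30 is an arbitrary default
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # The subshell opens the connection and closes it again on exit.
    if (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; then
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 1
}
```

For example, `wait_for_port localhost 50001 60` can precede the worker's `docker run` command, joined with `&&`.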
If you run into issues related to proto definitions, run the following command
to build the latest proto components.
```bash
make -f elasticdl/Makefile
```
### Test with Kubernetes
We can also test an ElasticDL job in a Kubernetes cluster using the previously
built [image](#development-docker-image).
First make sure the built image has been pushed to a Docker registry, then run
the following command to launch the job.
```bash
kubectl apply -f manifests/elasticdl.yaml
```
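Pushing the image to a registry could look like the sketch below. The registry prefix is a placeholder you must supply, the function name `push_dev_image` is hypothetical, and the image name referenced in the manifest has to match whatever tag you push.

```bash
# Tags the locally built dev image for a registry and pushes it.
# $1 is the registry prefix, e.g. registry.example.com/myteam (placeholder).
push_dev_image() {
  registry=$1
  docker tag elasticdl:dev "${registry}/elasticdl:dev" &&
    docker push "${registry}/elasticdl:dev"
}
```

You may need to run `docker login` against the registry first.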
You might want to change the value of the `imagePullPolicy` property to
`Always` or `Never` in your trial.
If you find a permission error in the main pod log, e.g., `"pods is forbidden:
User \"system:serviceaccount:default:default\" cannot create resource
\"pods\""`, you need to grant pod-related permissions to the default service
account.
```bash
kubectl apply -f manifests/examples/elasticdl-rbac.yaml
```
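To confirm that the permissions took effect, `kubectl auth can-i` can impersonate the service account named in the error message. This is a small sketch; the wrapper name `can_create_pods` is hypothetical, and impersonation itself requires that your own user is allowed to use `--as`.

```bash
# Asks the API server whether a service account may create pods.
# Defaults match the account in the error message: default/default.
can_create_pods() {
  namespace=${1:-default}
  account=${2:-default}
  kubectl auth can-i create pods \
    --namespace "$namespace" \
    --as "system:serviceaccount:${namespace}:${account}"
}
```

The command prints `yes` or `no` and exits non-zero when the action is not allowed.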
### Test on Travis CI
All tests will be executed on [Travis
CI](https://travis-ci.org/sql-machine-learning/elasticdl), which includes:
- Pre-commit checks
- Unit tests
- Integration tests
The unit tests and integration tests also contain tests running on a local
Kubernetes cluster via
[Minikube](https://kubernetes.io/docs/setup/learning-environment/minikube/) and
tests that require data sources from
[MaxCompute](https://www.alibabacloud.com/product/maxcompute). Please refer to
[Travis configuration file](../.travis.yml) for more details.
Note that tests related to MaxCompute will not be executed on pull requests
created from forks since the MaxCompute access information has been secured on
Travis and only those who have write access can retrieve it. Developers who have
write access to this repo are encouraged to submit pull requests from branches
instead of forks if any code related to MaxCompute has been modified.
Also note that two of the integration test cases involve loading a
checkpoint. It is not easy to generate checkpoints automatically during
integration tests, so we currently keep a checkpoint file in the [test data
folder](python/tests/testdata) of the ElasticDL GitHub repository and use it
for integration tests. You therefore need to regenerate the checkpoint file if
your PR modifies the definition of the Model protocol buffer.
If you want to trigger Travis builds without submitting a pull request, you can
do so by developing on a branch and adding the branch name to the list in the
`branches` section of the [Travis configuration file](../.travis.yml). Note that
you can also trigger Travis builds from forks, but this requires additional work
such as activating Travis for the forked repo, and MaxCompute-related tests will
be skipped as mentioned earlier.