Training - Amazon Deep Learning Containers

Training

This section shows how to run training on Amazon Deep Learning Containers for Amazon Elastic Container Service (Amazon ECS) using TensorFlow, TensorFlow 2, Apache MXNet, and PyTorch.

For a complete list of Deep Learning Containers, see Deep Learning Containers Images.

Note

MKL users: Read the Amazon Deep Learning Containers Intel Math Kernel Library (MKL) Recommendations to get the best training or inference performance.

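Important

If your account has already created the Amazon ECS service-linked role, that role is used by default for your service unless you specify a role here. The service-linked role is required if your task definition uses the awsvpc network mode or if the service is configured to use service discovery. The role is also required if the service uses an external deployment controller, multiple target groups, or Elastic Inference accelerators, in which case you should not specify a role here. For more information, see Using Service-Linked Roles for Amazon ECS in the Amazon ECS Developer Guide.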
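
The conditions in the note above can be restated as a small predicate. The following Python sketch is illustrative only: the function name and its parameters are assumptions made for this example and are not part of any Amazon ECS API.

```python
def service_linked_role_required(
    network_mode: str,
    uses_service_discovery: bool = False,
    uses_external_deployment_controller: bool = False,
    target_group_count: int = 0,
    uses_elastic_inference: bool = False,
) -> bool:
    """Return True when the note above says the service-linked role is required.

    Hypothetical helper: it only restates the documented conditions.
    """
    return (
        network_mode == "awsvpc"
        or uses_service_discovery
        or uses_external_deployment_controller
        or target_group_count > 1  # "multiple target groups"
        or uses_elastic_inference
    )


# The task definitions in this section use the "bridge" network mode.
print(service_linked_role_required("bridge"))
print(service_linked_role_required("awsvpc"))
```

With the bridge network mode and none of the other features enabled, the predicate is false, which matches the task definitions used in the procedures that follow.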

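TensorFlow training

Before you can run tasks on your Amazon ECS cluster, you must register a task definition. A task definition is a list of containers grouped together. The following example uses a sample Docker image that adds training scripts to Deep Learning Containers. You can use this script with either TensorFlow or TensorFlow 2. To use it with TensorFlow 2, change the Docker image to a TensorFlow 2 image.

  1. Create a file named ecs-deep-learning-container-training-taskdef.json with the following contents.

    • For CPU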

      {
        "requiresCompatibilities": ["EC2"],
        "containerDefinitions": [
          {
            "command": [
              "mkdir -p /test && cd /test && git clone https://github.com/fchollet/keras.git && chmod +x -R /test/ && python keras/examples/mnist_cnn.py"
            ],
            "entryPoint": ["sh", "-c"],
            "name": "tensorflow-training-container",
            "image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:1.15.2-cpu-py36-ubuntu18.04",
            "memory": 4000,
            "cpu": 256,
            "essential": true,
            "portMappings": [{ "containerPort": 80, "protocol": "tcp" }],
            "logConfiguration": {
              "logDriver": "awslogs",
              "options": {
                "awslogs-group": "awslogs-tf-ecs",
                "awslogs-region": "us-east-1",
                "awslogs-stream-prefix": "tf",
                "awslogs-create-group": "true"
              }
            }
          }
        ],
        "volumes": [],
        "networkMode": "bridge",
        "placementConstraints": [],
        "family": "TensorFlow"
      }
    • For GPU

      {
        "requiresCompatibilities": ["EC2"],
        "containerDefinitions": [
          {
            "command": [
              "mkdir -p /test && cd /test && git clone https://github.com/fchollet/keras.git && chmod +x -R /test/ && python keras/examples/mnist_cnn.py"
            ],
            "entryPoint": ["sh", "-c"],
            "name": "tensorflow-training-container",
            "image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:1.15.2-gpu-py37-cu100-ubuntu18.04",
            "memory": 6111,
            "cpu": 256,
            "resourceRequirements": [{ "type": "GPU", "value": "1" }],
            "essential": true,
            "portMappings": [{ "containerPort": 80, "protocol": "tcp" }],
            "logConfiguration": {
              "logDriver": "awslogs",
              "options": {
                "awslogs-group": "awslogs-tf-ecs",
                "awslogs-region": "us-east-1",
                "awslogs-stream-prefix": "tf",
                "awslogs-create-group": "true"
              }
            }
          }
        ],
        "volumes": [],
        "networkMode": "bridge",
        "placementConstraints": [],
        "family": "tensorflow-training"
      }
  2. Register the task definition. Note the revision number in the output and use it in the next step.

    aws ecs register-task-definition --cli-input-json file://ecs-deep-learning-container-training-taskdef.json
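  3. Create a task using the task definition. You need the revision number from the previous step and the name of the cluster that you created during setup. Pass the task definition as family:revision, where family is the "family" value in the file that you registered: TensorFlow for the CPU example, or tensorflow-training for the GPU example.

    aws ecs run-task --cluster ecs-ec2-training-inference --task-definition TensorFlow:1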
  4. Open the Amazon ECS console at https://console.aws.amazon.com/ecs/.

  5. Select the ecs-ec2-training-inference cluster.

  6. On the Cluster page, choose Tasks.

  7. After your task is in a RUNNING state, choose the task identifier.

  8. Under Containers, expand the container details.

  9. Under Log Configuration, choose View logs in CloudWatch. This takes you to the CloudWatch console to view the training progress logs.
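
Steps 2 and 3 are linked by the revision number: the run-task command identifies a task definition as family:revision. The following Python sketch shows one way to derive that identifier from the registration output. The sample response is a hypothetical, heavily abridged excerpt with placeholder values, and the helper names are assumptions made for this example. The command is assembled but not executed.

```python
import json
from typing import List

# Hypothetical excerpt of the JSON printed by `aws ecs register-task-definition`.
# A real response has many more fields; the values here are placeholders.
SAMPLE_RESPONSE = """
{
  "taskDefinition": {
    "family": "TensorFlow",
    "revision": 1,
    "status": "ACTIVE"
  }
}
"""


def task_definition_id(response_text: str) -> str:
    """Return the family:revision string for a registered task definition."""
    task_def = json.loads(response_text)["taskDefinition"]
    return f"{task_def['family']}:{task_def['revision']}"


def run_task_command(cluster: str, response_text: str) -> List[str]:
    """Assemble, without executing, the run-task command from step 3."""
    return [
        "aws", "ecs", "run-task",
        "--cluster", cluster,
        "--task-definition", task_definition_id(response_text),
    ]


command = run_task_command("ecs-ec2-training-inference", SAMPLE_RESPONSE)
print(" ".join(command))
```

Reading the family and revision from the registration output avoids hard-coding a revision that changes each time the same family is registered again.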

Next steps

To learn about inference on Amazon ECS using TensorFlow with Deep Learning Containers, see TensorFlow inference.

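Apache MXNet (Incubating) training

Before you can run a task on your Amazon Elastic Container Service (Amazon ECS) cluster, you must register a task definition. A task definition is a list of containers grouped together. The following example uses a sample Docker image that adds training scripts to Deep Learning Containers.

  1. Create a file named ecs-deep-learning-container-training-taskdef.json with the following contents.

    • For CPU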

      {
        "requiresCompatibilities": ["EC2"],
        "containerDefinitions": [
          {
            "command": [
              "git clone -b 1.4 https://github.com/apache/incubator-mxnet.git && python /incubator-mxnet/example/image-classification/train_mnist.py"
            ],
            "entryPoint": ["sh", "-c"],
            "name": "mxnet-training",
            "image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-training:1.6.0-cpu-py36-ubuntu16.04",
            "memory": 4000,
            "cpu": 256,
            "essential": true,
            "portMappings": [{ "containerPort": 80, "protocol": "tcp" }],
            "logConfiguration": {
              "logDriver": "awslogs",
              "options": {
                "awslogs-group": "/ecs/mxnet-training-cpu",
                "awslogs-region": "us-east-1",
                "awslogs-stream-prefix": "mnist",
                "awslogs-create-group": "true"
              }
            }
          }
        ],
        "volumes": [],
        "networkMode": "bridge",
        "placementConstraints": [],
        "family": "mxnet"
      }
    • For GPU

      {
        "requiresCompatibilities": ["EC2"],
        "containerDefinitions": [
          {
            "command": [
              "git clone -b 1.4 https://github.com/apache/incubator-mxnet.git && python /incubator-mxnet/example/image-classification/train_mnist.py --gpus 0"
            ],
            "entryPoint": ["sh", "-c"],
            "name": "mxnet-training",
            "image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-training:1.6.0-gpu-py36-cu101-ubuntu16.04",
            "memory": 4000,
            "cpu": 256,
            "resourceRequirements": [{ "type": "GPU", "value": "1" }],
            "essential": true,
            "portMappings": [{ "containerPort": 80, "protocol": "tcp" }],
            "logConfiguration": {
              "logDriver": "awslogs",
              "options": {
                "awslogs-group": "/ecs/mxnet-training-gpu",
                "awslogs-region": "us-east-1",
                "awslogs-stream-prefix": "mnist",
                "awslogs-create-group": "true"
              }
            }
          }
        ],
        "volumes": [],
        "networkMode": "bridge",
        "placementConstraints": [],
        "family": "mxnet-training"
      }
  2. Register the task definition. Note the revision number in the output and use it in the next step.

    aws ecs register-task-definition --cli-input-json file://ecs-deep-learning-container-training-taskdef.json
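  3. Create a task using the task definition. You need the revision number from the previous step. Pass the task definition as family:revision, where family is the "family" value in the file that you registered: mxnet for the CPU example, or mxnet-training for the GPU example.

    aws ecs run-task --cluster ecs-ec2-training-inference --task-definition mxnet:1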
  4. Open the Amazon ECS console at https://console.aws.amazon.com/ecs/.

  5. Select the ecs-ec2-training-inference cluster.

  6. On the Cluster page, choose Tasks.

  7. After your task is in a RUNNING state, choose the task identifier.

  8. Under Containers, expand the container details.

  9. Under Log Configuration, choose View logs in CloudWatch. This takes you to the CloudWatch console to view the training progress logs.
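
The CPU and GPU container definitions in step 1 differ in only a few fields: the image, the GPU resource requirement, the --gpus 0 argument, and the log group. The Python sketch below derives a single-GPU container definition from an abridged copy of the CPU one to make those differences explicit. The helper name is an assumption made for this example, and the log group and family are left out for brevity.

```python
import copy

# CPU container definition, abridged from the example above.
cpu_container = {
    "command": [
        "git clone -b 1.4 https://github.com/apache/incubator-mxnet.git && "
        "python /incubator-mxnet/example/image-classification/train_mnist.py"
    ],
    "entryPoint": ["sh", "-c"],
    "name": "mxnet-training",
    "image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/"
             "mxnet-training:1.6.0-cpu-py36-ubuntu16.04",
    "memory": 4000,
    "cpu": 256,
}

# GPU image taken from the GPU example above.
GPU_IMAGE = ("763104351884.dkr.ecr.us-east-1.amazonaws.com/"
             "mxnet-training:1.6.0-gpu-py36-cu101-ubuntu16.04")


def to_single_gpu(container: dict, gpu_image: str) -> dict:
    """Return a copy of a CPU container definition adjusted for one GPU."""
    gpu = copy.deepcopy(container)  # leave the CPU definition untouched
    gpu["image"] = gpu_image
    # Reserve one GPU; the example passes the count as a string.
    gpu["resourceRequirements"] = [{"type": "GPU", "value": "1"}]
    # The GPU example appends "--gpus 0" to the training command.
    gpu["command"] = [container["command"][0] + " --gpus 0"]
    return gpu


gpu_container = to_single_gpu(cpu_container, GPU_IMAGE)
print(gpu_container["command"][0])
```

Working on a deep copy keeps the CPU definition unchanged, so both variants can be written to separate files and registered under different families.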

Next steps

To learn about inference on Amazon ECS using MXNet with Deep Learning Containers, see Apache MXNet (Incubating) inference.

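PyTorch training

Before you can run a task on your Amazon ECS cluster, you must register a task definition. A task definition is a list of containers grouped together. The following example uses a sample Docker image that adds training scripts to Deep Learning Containers.

  1. Create a file named ecs-deep-learning-container-training-taskdef.json with the following contents.

    • For CPU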

      {
        "requiresCompatibilities": ["EC2"],
        "containerDefinitions": [
          {
            "command": [
              "git clone https://github.com/pytorch/examples.git && python examples/mnist/main.py --no-cuda"
            ],
            "entryPoint": ["sh", "-c"],
            "name": "pytorch-training-container",
            "image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.1-cpu-py36-ubuntu16.04",
            "memory": 4000,
            "cpu": 256,
            "essential": true,
            "portMappings": [{ "containerPort": 80, "protocol": "tcp" }],
            "logConfiguration": {
              "logDriver": "awslogs",
              "options": {
                "awslogs-group": "/ecs/pytorch-training-cpu",
                "awslogs-region": "us-east-1",
                "awslogs-stream-prefix": "mnist",
                "awslogs-create-group": "true"
              }
            }
          }
        ],
        "volumes": [],
        "networkMode": "bridge",
        "placementConstraints": [],
        "family": "pytorch"
      }
    • For GPU

      {
        "requiresCompatibilities": ["EC2"],
        "containerDefinitions": [
          {
            "command": [
              "git clone https://github.com/pytorch/examples.git && python examples/mnist/main.py"
            ],
            "entryPoint": ["sh", "-c"],
            "name": "pytorch-training-container",
            "image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.1-gpu-py36-cu101-ubuntu16.04",
            "memory": 6111,
            "cpu": 256,
            "resourceRequirements": [{ "type": "GPU", "value": "1" }],
            "essential": true,
            "portMappings": [{ "containerPort": 80, "protocol": "tcp" }],
            "logConfiguration": {
              "logDriver": "awslogs",
              "options": {
                "awslogs-group": "/ecs/pytorch-training-gpu",
                "awslogs-region": "us-east-1",
                "awslogs-stream-prefix": "mnist",
                "awslogs-create-group": "true"
              }
            }
          }
        ],
        "volumes": [],
        "networkMode": "bridge",
        "placementConstraints": [],
        "family": "pytorch-training"
      }
  2. Register the task definition. Note the revision number in the output and use it in the next step.

    aws ecs register-task-definition --cli-input-json file://ecs-deep-learning-container-training-taskdef.json
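  3. Create a task using the task definition. You need the revision number from the previous step. Pass the task definition as family:revision, where family is the "family" value in the file that you registered: pytorch for the CPU example, or pytorch-training for the GPU example.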

    aws ecs run-task --cluster ecs-ec2-training-inference --task-definition pytorch:1
  4. Open the Amazon ECS console at https://console.aws.amazon.com/ecs/.

  5. Select the ecs-ec2-training-inference cluster.

  6. On the Cluster page, choose Tasks.

  7. After your task is in a RUNNING state, choose the task identifier.

  8. Under Containers, expand the container details.

  9. Under Log Configuration, choose View logs in CloudWatch. This takes you to the CloudWatch console to view the training progress logs.
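
The logs opened in step 9 are located by the logConfiguration options in the task definition. With the awslogs log driver and a stream prefix, Amazon ECS names each log stream prefix-name/container-name/ecs-task-id. The Python sketch below computes the log group and log stream for the CPU example. The helper name is an assumption made for this example, and the task ID is a made-up placeholder.

```python
from typing import Tuple

# Logging settings copied from the CPU task definition above.
container = {
    "name": "pytorch-training-container",
    "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
            "awslogs-group": "/ecs/pytorch-training-cpu",
            "awslogs-region": "us-east-1",
            "awslogs-stream-prefix": "mnist",
            "awslogs-create-group": "true",
        },
    },
}


def log_location(container: dict, task_id: str) -> Tuple[str, str]:
    """Return (log group, log stream) for a container that uses awslogs."""
    options = container["logConfiguration"]["options"]
    # Stream name format: prefix-name/container-name/ecs-task-id
    stream = "/".join(
        [options["awslogs-stream-prefix"], container["name"], task_id]
    )
    return options["awslogs-group"], stream


# Hypothetical task ID, used only for illustration.
group, stream = log_location(container, "0123456789abcdef0123456789abcdef")
print(group)
print(stream)
```

Knowing the group and stream names makes it possible to find the same logs directly in the CloudWatch console without navigating through the Amazon ECS console.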

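Amazon S3 Plugin for PyTorch

Deep Learning Containers include a plugin that enables you to use data from an Amazon S3 bucket for PyTorch training.

  1. To begin using the Amazon S3 plugin in Amazon ECS, set the AWS_REGION environment variable to the Region of your choice.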

    export AWS_REGION=us-east-1
  2. Create a file named ecs-deep-learning-container-pytorch-s3-plugin-taskdef.json with the following contents.

    • For CPU

      {
        "requiresCompatibilities": ["EC2"],
        "containerDefinitions": [
          {
            "command": [
              "git clone https://github.com/aws/amazon-s3-plugin-for-pytorch.git && python amazon-s3-plugin-for-pytorch/examples/s3_imagenet_example.py"
            ],
            "entryPoint": ["sh", "-c"],
            "name": "pytorch-s3-plugin-container",
            "image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.8.1-cpu-py36-ubuntu18.04-v1.6",
            "memory": 4000,
            "cpu": 256,
            "essential": true,
            "portMappings": [{ "containerPort": 80, "protocol": "tcp" }],
            "logConfiguration": {
              "logDriver": "awslogs",
              "options": {
                "awslogs-group": "/ecs/pytorch-s3-plugin-cpu",
                "awslogs-region": "us-east-1",
                "awslogs-stream-prefix": "imagenet",
                "awslogs-create-group": "true"
              }
            }
          }
        ],
        "volumes": [],
        "networkMode": "bridge",
        "placementConstraints": [],
        "family": "pytorch-s3-plugin"
      }
    • For GPU

      {
        "requiresCompatibilities": ["EC2"],
        "containerDefinitions": [
          {
            "command": [
              "git clone https://github.com/aws/amazon-s3-plugin-for-pytorch.git && python amazon-s3-plugin-for-pytorch/examples/s3_imagenet_example.py"
            ],
            "entryPoint": ["sh", "-c"],
            "name": "pytorch-s3-plugin-container",
            "image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.8.1-gpu-py36-cu111-ubuntu18.04-v1.7",
            "memory": 6111,
            "cpu": 256,
            "resourceRequirements": [{ "type": "GPU", "value": "1" }],
            "essential": true,
            "portMappings": [{ "containerPort": 80, "protocol": "tcp" }],
            "logConfiguration": {
              "logDriver": "awslogs",
              "options": {
                "awslogs-group": "/ecs/pytorch-s3-plugin-gpu",
                "awslogs-region": "us-east-1",
                "awslogs-stream-prefix": "imagenet",
                "awslogs-create-group": "true"
              }
            }
          }
        ],
        "volumes": [],
        "networkMode": "bridge",
        "placementConstraints": [],
        "family": "pytorch-s3-plugin"
      }
  3. Register the task definition. Note the revision number in the output and use it in the next step.

    aws ecs register-task-definition --cli-input-json file://ecs-deep-learning-container-pytorch-s3-plugin-taskdef.json
  4. Create a task using the task definition. You need the revision number from the previous step.

    aws ecs run-task --cluster ecs-pytorch-s3-plugin --task-definition pytorch-s3-plugin:1
  5. Open the Amazon ECS console at https://console.aws.amazon.com/ecs/.

  6. Select the ecs-pytorch-s3-plugin cluster.

  7. On the Cluster page, choose Tasks.

  8. After your task is in a RUNNING state, choose the task identifier.

  9. Under Containers, expand the container details.

  10. Under Log Configuration, choose View logs in CloudWatch. This takes you to the CloudWatch console to view the Amazon S3 plugin example logs.

For more information and additional examples, see the Amazon S3 Plugin for PyTorch repository.
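
The procedure above names a Region in several places: the AWS_REGION variable, the image URI, and the awslogs-region option. If you choose a Region other than us-east-1, the Python sketch below can help spot settings that still point elsewhere. Treating a mismatch as a problem is an assumption of this sketch, not a documented requirement, and the helper name and message format are invented for this example.

```python
import re
from typing import List

# Amazon ECR image URIs have the form
# <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
ECR_IMAGE = re.compile(
    r"^\d{12}\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com(?:\.cn)?/"
    r"[^:]+:.+$"
)


def region_mismatches(container: dict, region: str) -> List[str]:
    """List the container settings whose Region differs from `region`."""
    problems = []
    match = ECR_IMAGE.match(container["image"])
    if match is None:
        problems.append("image: not an Amazon ECR image URI")
    elif match.group("region") != region:
        problems.append("image: " + match.group("region"))
    log_region = container["logConfiguration"]["options"]["awslogs-region"]
    if log_region != region:
        problems.append("awslogs-region: " + log_region)
    return problems


# Abridged from the CPU task definition above.
container = {
    "image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/"
             "pytorch-training:1.8.1-cpu-py36-ubuntu18.04-v1.6",
    "logConfiguration": {"options": {"awslogs-region": "us-east-1"}},
}

print(region_mismatches(container, "us-east-1"))
print(region_mismatches(container, "us-west-2"))
```

An empty list means every checked setting names the chosen Region. Each entry otherwise names the setting and the Region it currently uses.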

Next steps

To learn about inference on Amazon ECS using PyTorch with Deep Learning Containers, see PyTorch inference.