train¶
Train ML models on AWS SageMaker.
Synopsis¶
easy_sm [--docker-tag TAG] train --base-job-name NAME --ec2-type TYPE \
--input-s3-dir S3_PATH --output-s3-dir S3_PATH [OPTIONS]
Description¶
The train command submits a training job to AWS SageMaker using your Docker image. It creates a training job that:
- Pulls your Docker image from ECR
- Downloads training data from S3
- Runs your training code
- Uploads the trained model to S3
The command outputs the S3 location of the trained model, making it easy to pipe into deployment commands.
Options¶
| Option | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
| `--base-job-name` | `-n` | string | Yes | - | Prefix for the SageMaker training job |
| `--ec2-type` | `-e` | string | Yes | - | EC2 instance type (e.g., `ml.m5.large`) |
| `--input-s3-dir` | `-i` | string | Yes | - | S3 location for input training data |
| `--output-s3-dir` | `-o` | string | Yes | - | S3 location to save trained model |
| `--iam-role-arn` | `-r` | string | No | From `SAGEMAKER_ROLE` env var | AWS IAM role ARN for SageMaker |
| `--app-name` | `-a` | string | No | Auto-detected | App name for configuration |
| `--instance-count` | `-c` | integer | No | 1 | Number of EC2 instances for training |
| `--docker-tag` | `-t` | string | No | `latest` | Docker image tag (global option) |
Examples¶
Basic training job¶
export SAGEMAKER_ROLE=arn:aws:iam::123456789012:role/SageMakerRole
easy_sm train \
-n my-training-job \
-e ml.m5.large \
-i s3://my-bucket/training-data \
-o s3://my-bucket/models
Output: the S3 path of the trained model artifact (see Output Format below).
Training with specific IAM role¶
easy_sm train \
-n my-training-job \
-e ml.m5.large \
-i s3://my-bucket/data \
-o s3://my-bucket/output \
-r arn:aws:iam::123456789012:role/CustomRole
Training with multiple instances¶
For distributed training:
easy_sm train \
-n distributed-job \
-e ml.m5.xlarge \
-c 4 \
-i s3://my-bucket/data \
-o s3://my-bucket/output
Training with specific Docker tag¶
easy_sm -t v1.2.0 train \
-n production-training \
-e ml.m5.large \
-i s3://my-bucket/data \
-o s3://my-bucket/output
Complete workflow: build, push, train¶
export SAGEMAKER_ROLE=arn:aws:iam::123456789012:role/SageMakerRole
# Build and push
easy_sm -t v1.0.0 build
easy_sm -t v1.0.0 push
# Train
easy_sm -t v1.0.0 train \
-n my-job-v1 \
-e ml.m5.large \
-i s3://my-bucket/data \
-o s3://my-bucket/models
Output Format¶
The command outputs the S3 path to the trained model, in the form `{output_s3_dir}/{job_name}/output/model.tar.gz`.
This output is designed for piping into other commands:
# Save model path
MODEL=$(easy_sm train -n my-job -e ml.m5.large \
-i s3://bucket/data -o s3://bucket/output)
# Deploy the model
easy_sm deploy -n my-endpoint -e ml.m5.large -m $MODEL
Prerequisites¶
- Docker image pushed to ECR with `easy_sm push`
- Training data uploaded to S3 (use `upload-data` if needed)
- IAM role with SageMaker permissions (either in the `SAGEMAKER_ROLE` env var or via the `-r` flag)
- IAM role must have:
  - Trust relationship with `sagemaker.amazonaws.com` (see the example policy below)
  - Permissions: `s3:GetObject`, `s3:PutObject`, `ecr:GetDownloadUrlForLayer`, etc.
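The trust relationship lets SageMaker assume the execution role. A standard trust policy document looks like this:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "sagemaker.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}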
Training Data Structure¶
Your training data in S3 should live under a single prefix: the value you pass to `--input-s3-dir` (an illustrative layout is sketched below).
In the training container, this data is available at `/opt/ml/input/data/training/`.
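For example, a simple single-file dataset might be laid out as follows (bucket, prefix, and file name are illustrative only):
s3://my-bucket/training-data/
└── train.csv
The training container would then see this file as /opt/ml/input/data/training/train.csv.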
Training Code Requirements¶
Your training code at `training/training.py` should implement:
import os
import joblib

def train(input_data_path, model_save_path):
    """
    Train model on SageMaker.

    Args:
        input_data_path: /opt/ml/input/data/training
        model_save_path: /opt/ml/model
    """
    # Load training data
    train_data = load_data(input_data_path)

    # Train model
    model = train_model(train_data)

    # Save model
    joblib.dump(model, os.path.join(model_save_path, 'model.mdl'))
    print("Training completed")
Container Paths¶
SageMaker uses these standard paths:
| Path | Purpose |
|---|---|
| `/opt/ml/input/data/training/` | Input training data |
| `/opt/ml/model/` | Save trained model here |
| `/opt/ml/output/` | Training metrics and logs |
Model Output¶
After training, SageMaker automatically:
- Creates a tarball of `/opt/ml/model/`
- Uploads it to S3 as `model.tar.gz`
- The resulting path is `{output_s3_dir}/{job_name}/output/model.tar.gz`

This model can be used directly with `deploy`.
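To inspect the artifact locally, you can download and extract it with the AWS CLI (bucket and job names are placeholders):
# Download the trained model artifact
aws s3 cp s3://my-bucket/models/my-training-job/output/model.tar.gz .
# List its contents; model.mdl is whatever you saved under /opt/ml/model/
tar -tzf model.tar.gz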
Instance Types¶
Common instance types for training:
| Instance Type | vCPUs | Memory | Use Case |
|---|---|---|---|
| `ml.m5.large` | 2 | 8 GB | Small datasets, testing |
| `ml.m5.xlarge` | 4 | 16 GB | Medium datasets |
| `ml.m5.2xlarge` | 8 | 32 GB | Large datasets |
| `ml.c5.xlarge` | 4 | 8 GB | Compute-intensive |
| `ml.c5.2xlarge` | 8 | 16 GB | Heavy compute |
| `ml.p3.2xlarge` | 8 | 61 GB + GPU | Deep learning |
| `ml.p3.8xlarge` | 32 | 244 GB + 4 GPUs | Large deep learning |
See AWS documentation for full list and pricing.
Distributed Training¶
For distributed training across multiple instances:
easy_sm train \
-n distributed-job \
-e ml.m5.xlarge \
-c 4 \
-i s3://bucket/data \
-o s3://bucket/output
Your training code needs to handle distribution. SageMaker provides:
- `SM_HOSTS`: List of all hosts
- `SM_CURRENT_HOST`: Current host name
- `SM_NUM_GPUS`: Number of GPUs available
Example:
import os
import json

def train(input_data_path, model_save_path):
    # Get distributed training info
    hosts = json.loads(os.environ.get('SM_HOSTS', '[]'))
    current_host = os.environ.get('SM_CURRENT_HOST', '')
    num_gpus = int(os.environ.get('SM_NUM_GPUS', '0'))

    if len(hosts) > 1:
        # Distributed training logic
        train_distributed(hosts, current_host)
    else:
        # Single-instance training
        train_single()
Monitoring Training¶
After submitting a training job, monitor it in:
- AWS Console: SageMaker → Training → Training jobs
- CloudWatch Logs: the `/aws/sagemaker/TrainingJobs` log group
- CLI: the AWS CLI, as shown below
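The AWS CLI can report status and stream logs; for example (the job name is a placeholder, and aws logs tail requires AWS CLI v2):
# Check the job's current status
aws sagemaker describe-training-job --training-job-name my-training-job \
  --query TrainingJobStatus
# Follow the job's CloudWatch logs
aws logs tail /aws/sagemaker/TrainingJobs --follow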
Troubleshooting¶
"SAGEMAKER_ROLE environment variable not set"¶
Problem: IAM role not provided.
Solution: Set the environment variable, or pass the role explicitly with the `-r` flag; both forms are shown below.
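For example, with placeholder account and role values:
export SAGEMAKER_ROLE=arn:aws:iam::123456789012:role/SageMakerRole
Or per invocation:
easy_sm train -n my-job -e ml.m5.large \
-i s3://my-bucket/data -o s3://my-bucket/output \
-r arn:aws:iam::123456789012:role/SageMakerRole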
Training job fails immediately¶
Problem: Docker image not found in ECR.
Solution: Build and push the image first, using the same Docker tag you pass to train, as shown below.
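For example (the tag is illustrative):
easy_sm -t v1.0.0 build
easy_sm -t v1.0.0 push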
"Access Denied" errors¶
Problem: IAM role lacks S3 or ECR permissions.
Solution: Add required permissions to the SageMaker execution role:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::my-bucket/*",
"arn:aws:s3:::my-bucket"
]
},
{
"Effect": "Allow",
"Action": [
"ecr:GetDownloadUrlForLayer",
"ecr:BatchGetImage",
"ecr:BatchCheckLayerAvailability"
],
"Resource": "*"
}
]
}
Out of memory errors¶
Problem: Model too large for instance memory.
Solution: Use a larger instance type, as shown below.
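For example, moving from ml.m5.large to ml.m5.2xlarge (names and paths are illustrative):
easy_sm train \
-n my-training-job \
-e ml.m5.2xlarge \
-i s3://my-bucket/data \
-o s3://my-bucket/output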
Training takes too long¶
Problem: Slow training on small instances.
Solution:
1. Use compute-optimized or GPU instances
2. Use distributed training with the `-c` flag
3. Optimize your training code
upload-data¶
Upload local data to S3 for training.
Synopsis¶
easy_sm upload-data --input-dir DIR --target-dir S3_PATH [OPTIONS]
Description¶
Uploads a local directory to S3 for use as training input data.
Options¶
| Option | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
| `--input-dir` | `-i` | path | Yes | - | Local directory containing data |
| `--target-dir` | `-t` | string | Yes | - | S3 location (e.g., `s3://bucket/data`) |
| `--iam-role-arn` | `-r` | string | No | From `SAGEMAKER_ROLE` | IAM role ARN |
| `--app-name` | `-a` | string | No | Auto-detected | App name |
Examples¶
# Upload training data
easy_sm upload-data \
-i ./local-data \
-t s3://my-bucket/training-data
# Output: s3://my-bucket/training-data
Output¶
The command outputs the S3 path where data was uploaded, as shown in the example above.
list-training-jobs¶
List recent SageMaker training jobs.
Synopsis¶
easy_sm list-training-jobs [--max-results N] [--names-only] [OPTIONS]
Description¶
Lists recent training jobs with their status and creation time. Supports pipe-friendly output for automation.
Options¶
| Option | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
| `--max-results` | `-m` | integer | No | 5 | Maximum number of jobs to return |
| `--names-only` | `-n` | boolean | No | false | Output only job names (one per line) |
| `--iam-role-arn` | `-r` | string | No | From `SAGEMAKER_ROLE` | IAM role ARN |
| `--app-name` | `-a` | string | No | Auto-detected | App name |
Examples¶
List recent training jobs¶
easy_sm list-training-jobs
Output:
my-training-job-1 Completed 2024-01-15 10:23:45+00:00
my-training-job-2 InProgress 2024-01-15 11:30:12+00:00
my-training-job-3 Failed 2024-01-14 09:15:33+00:00
List more jobs¶
Use `-m` to return more results, for example:
easy_sm list-training-jobs -m 20
Names only (pipe-friendly)¶
easy_sm list-training-jobs -n
Output: the job names only, one per line.
Get latest job name¶
easy_sm list-training-jobs -n -m 1
Output: the name of the most recent training job.
Pipe-Friendly Usage¶
# Get latest completed job name
JOB=$(easy_sm list-training-jobs -m 10 | grep Completed | head -1 | awk '{print $1}')
# Get all production job names
easy_sm list-training-jobs -n -m 20 | grep "prod-"
# Get latest model and deploy
MODEL=$(easy_sm get-model-artifacts -j $(easy_sm list-training-jobs -n -m 1))
easy_sm deploy -n my-endpoint -e ml.m5.large -m $MODEL
get-model-artifacts¶
Get S3 model path from a training job.
Synopsis¶
easy_sm get-model-artifacts --training-job-name NAME [OPTIONS]
Description¶
Retrieves the S3 location of the model artifacts (model.tar.gz) for a completed training job. Essential for deployment workflows.
Options¶
| Option | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
| `--training-job-name` | `-j` | string | Yes | - | Training job name |
| `--iam-role-arn` | `-r` | string | No | From `SAGEMAKER_ROLE` | IAM role ARN |
| `--app-name` | `-a` | string | No | Auto-detected | App name |
Examples¶
Get model path¶
easy_sm get-model-artifacts -j my-training-job
Output: the S3 path to the job's model.tar.gz, e.g. s3://my-bucket/models/my-training-job/output/model.tar.gz.
Use in deployment pipeline¶
# Get latest job
JOB=$(easy_sm list-training-jobs -n -m 1)
# Get its model
MODEL=$(easy_sm get-model-artifacts -j $JOB)
# Deploy
easy_sm deploy -n my-endpoint -e ml.m5.large -m $MODEL
One-liner deployment¶
easy_sm deploy -n my-endpoint -e ml.m5.large \
-m $(easy_sm get-model-artifacts -j $(easy_sm list-training-jobs -n -m 1))
Troubleshooting¶
Problem: "Training job not found"
Solution: Verify the job name with `easy_sm list-training-jobs`.
Problem: Job not completed yet
Solution: Wait for the training job to finish and retry; you can poll its status as sketched below.
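A simple illustrative polling loop built on list-training-jobs output (the job name and interval are placeholders):
# Poll until the job reports a Completed status
until easy_sm list-training-jobs -m 20 | grep my-training-job | grep -q Completed; do
  sleep 60
done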
Complete Training Workflow¶
End-to-End Example¶
# 1. Set up environment
export SAGEMAKER_ROLE=arn:aws:iam::123456789012:role/SageMakerRole
# 2. Build and push Docker image
easy_sm -t v1.0.0 build
easy_sm -t v1.0.0 push
# 3. Upload training data
easy_sm upload-data \
-i ./training-data \
-t s3://my-bucket/data
# 4. Train model
easy_sm -t v1.0.0 train \
-n my-training-job \
-e ml.m5.large \
-i s3://my-bucket/data \
-o s3://my-bucket/models
# 5. List training jobs
easy_sm list-training-jobs
# 6. Get model artifacts
MODEL=$(easy_sm get-model-artifacts -j my-training-job)
# 7. Deploy
easy_sm -t v1.0.0 deploy \
-n my-endpoint \
-e ml.m5.large \
-m $MODEL
Automated Pipeline¶
#!/bin/bash
set -e
export SAGEMAKER_ROLE=arn:aws:iam::123456789012:role/SageMakerRole
VERSION="v$(date +%Y%m%d-%H%M%S)"
# Build and push
easy_sm -t $VERSION build
easy_sm -t $VERSION push
# Train
easy_sm -t $VERSION train \
-n training-$VERSION \
-e ml.m5.xlarge \
-i s3://my-bucket/data \
-o s3://my-bucket/models
# Get model and deploy
MODEL=$(easy_sm get-model-artifacts -j training-$VERSION)
easy_sm -t $VERSION deploy \
-n production-endpoint \
-e ml.m5.large \
-m $MODEL
echo "Deployed $VERSION to production-endpoint"
Related Commands¶
- `build` - Build Docker image
- `push` - Push image to ECR
- `upload-data` - Upload training data to S3
- `deploy` - Deploy trained models
- `local train` - Test training locally first