train¶
Train ML models on AWS SageMaker.
Synopsis¶
easy_sm [--docker-tag TAG] train --base-job-name NAME --ec2-type TYPE \
--input-s3-dir S3_PATH --output-s3-dir S3_PATH [OPTIONS]
Description¶
The train command submits a training job to AWS SageMaker using your Docker image. It creates a training job that:
- Pulls your Docker image from ECR
- Downloads training data from S3
- Runs your training code
- Uploads the trained model to S3
The command outputs the S3 location of the trained model, making it easy to pipe into deployment commands.
Options¶
| Option | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
| `--base-job-name` | `-n` | string | Yes | - | Prefix for the SageMaker training job |
| `--ec2-type` | `-e` | string | Yes | - | EC2 instance type (e.g., `ml.m5.large`) |
| `--input-s3-dir` | `-i` | string | Yes | - | S3 location for input training data |
| `--output-s3-dir` | `-o` | string | Yes | - | S3 location to save trained model |
| `--iam-role-arn` | `-r` | string | No | From `SAGEMAKER_ROLE` env var | AWS IAM role ARN for SageMaker |
| `--app-name` | `-a` | string | No | Auto-detected | App name for configuration |
| `--instance-count` | `-c` | integer | No | 1 | Number of EC2 instances for training |
| `--docker-tag` | `-t` | string | No | `latest` | Docker image tag (global option) |
Examples¶
Basic training job¶
export SAGEMAKER_ROLE=arn:aws:iam::123456789012:role/SageMakerRole
easy_sm train \
-n my-training-job \
-e ml.m5.large \
-i s3://my-bucket/training-data \
-o s3://my-bucket/models
Output: the S3 path of the trained model artifact (see Output Format below).
Training with specific IAM role¶
easy_sm train \
-n my-training-job \
-e ml.m5.large \
-i s3://my-bucket/data \
-o s3://my-bucket/output \
-r arn:aws:iam::123456789012:role/CustomRole
Training with multiple instances¶
For distributed training:
easy_sm train \
-n distributed-job \
-e ml.m5.xlarge \
-c 4 \
-i s3://my-bucket/data \
-o s3://my-bucket/output
Training with specific Docker tag¶
easy_sm -t v1.2.0 train \
-n production-training \
-e ml.m5.large \
-i s3://my-bucket/data \
-o s3://my-bucket/output
Complete workflow: build, push, train¶
export SAGEMAKER_ROLE=arn:aws:iam::123456789012:role/SageMakerRole
# Build and push
easy_sm -t v1.0.0 build
easy_sm -t v1.0.0 push
# Train
easy_sm -t v1.0.0 train \
-n my-job-v1 \
-e ml.m5.large \
-i s3://my-bucket/data \
-o s3://my-bucket/models
Output Format¶
The command outputs the S3 path to the trained model, in the form `{output_s3_dir}/{job_name}/output/model.tar.gz`.
This output is designed for piping into other commands:
# Save model path
MODEL=$(easy_sm train -n my-job -e ml.m5.large \
-i s3://bucket/data -o s3://bucket/output)
# Deploy the model
easy_sm deploy -n my-endpoint -e ml.m5.large -m $MODEL
Prerequisites¶
- Docker image pushed to ECR with `easy_sm push`
- Training data uploaded to S3 (use `upload-data` if needed)
- IAM role with SageMaker permissions (either in the `SAGEMAKER_ROLE` env var or via the `-r` flag)
- IAM role must have:
  - Trust relationship with `sagemaker.amazonaws.com` (see the example policy below)
  - Permissions: `s3:GetObject`, `s3:PutObject`, `ecr:GetDownloadUrlForLayer`, etc.
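The trust relationship lets SageMaker assume the execution role. A standard trust policy document looks like this:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "sagemaker.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}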
Training Data Structure¶
Your training data in S3 should live under a single prefix: the value you pass to `--input-s3-dir` (an illustrative layout is sketched below).
In the training container, this data is available at `/opt/ml/input/data/training/`.
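For example, a simple single-file dataset might be laid out as follows (bucket, prefix, and file name are illustrative only):
s3://my-bucket/training-data/
└── train.csv
The training container would then see this file as /opt/ml/input/data/training/train.csv.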
Training Code Requirements¶
Your training code at `training/training.py` should implement:
import os
import joblib

def train(input_data_path, model_save_path):
    """
    Train model on SageMaker.

    Args:
        input_data_path: /opt/ml/input/data/training
        model_save_path: /opt/ml/model
    """
    # Load training data
    train_data = load_data(input_data_path)

    # Train model
    model = train_model(train_data)

    # Save model
    joblib.dump(model, os.path.join(model_save_path, 'model.mdl'))
    print("Training completed")
Container Paths¶
SageMaker uses these standard paths:
| Path | Purpose |
|---|---|
| `/opt/ml/input/data/training/` | Input training data |
| `/opt/ml/model/` | Save trained model here |
| `/opt/ml/output/` | Training metrics and logs |
Model Output¶
After training, SageMaker automatically:
- Creates a tarball of `/opt/ml/model/`
- Uploads it to S3 as `model.tar.gz`
- The resulting path is `{output_s3_dir}/{job_name}/output/model.tar.gz`

This model can be used directly with `deploy`.
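To inspect the artifact locally, you can download and extract it with the AWS CLI (bucket and job names are placeholders):
# Download the trained model artifact
aws s3 cp s3://my-bucket/models/my-training-job/output/model.tar.gz .
# List its contents; model.mdl is whatever you saved under /opt/ml/model/
tar -tzf model.tar.gz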
Instance Types¶
Common instance types for training:
| Instance Type | vCPUs | Memory | Use Case |
|---|---|---|---|
| `ml.m5.large` | 2 | 8 GB | Small datasets, testing |
| `ml.m5.xlarge` | 4 | 16 GB | Medium datasets |
| `ml.m5.2xlarge` | 8 | 32 GB | Large datasets |
| `ml.c5.xlarge` | 4 | 8 GB | Compute-intensive |
| `ml.c5.2xlarge` | 8 | 16 GB | Heavy compute |
| `ml.p3.2xlarge` | 8 | 61 GB + GPU | Deep learning |
| `ml.p3.8xlarge` | 32 | 244 GB + 4 GPUs | Large deep learning |
See AWS documentation for full list and pricing.
Distributed Training¶
For distributed training across multiple instances:
easy_sm train \
-n distributed-job \
-e ml.m5.xlarge \
-c 4 \
-i s3://bucket/data \
-o s3://bucket/output
Your training code needs to handle distribution. SageMaker provides:
- `SM_HOSTS`: List of all hosts
- `SM_CURRENT_HOST`: Current host name
- `SM_NUM_GPUS`: Number of GPUs available
Example:
import os
import json

def train(input_data_path, model_save_path):
    # Get distributed training info
    hosts = json.loads(os.environ.get('SM_HOSTS', '[]'))
    current_host = os.environ.get('SM_CURRENT_HOST', '')
    num_gpus = int(os.environ.get('SM_NUM_GPUS', '0'))

    if len(hosts) > 1:
        # Distributed training logic
        train_distributed(hosts, current_host)
    else:
        # Single-instance training
        train_single()
Monitoring Training¶
After submitting a training job, monitor it in:
- AWS Console: SageMaker → Training → Training jobs
- CloudWatch Logs: the `/aws/sagemaker/TrainingJobs` log group
- CLI: the AWS CLI, as shown below
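The AWS CLI can report status and stream logs; for example (the job name is a placeholder, and aws logs tail requires AWS CLI v2):
# Check the job's current status
aws sagemaker describe-training-job --training-job-name my-training-job \
  --query TrainingJobStatus
# Follow the job's CloudWatch logs
aws logs tail /aws/sagemaker/TrainingJobs --follow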
Troubleshooting¶
"SAGEMAKER_ROLE environment variable not set"¶
Problem: IAM role not provided.
Solution: Set the environment variable, or pass the role explicitly with the `-r` flag; both forms are shown below.
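For example, with placeholder account and role values:
export SAGEMAKER_ROLE=arn:aws:iam::123456789012:role/SageMakerRole
Or per invocation:
easy_sm train -n my-job -e ml.m5.large \
-i s3://my-bucket/data -o s3://my-bucket/output \
-r arn:aws:iam::123456789012:role/SageMakerRole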
Training job fails immediately¶
Problem: Docker image not found in ECR.
Solution: Build and push the image first, using the same Docker tag you pass to train, as shown below.
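For example (the tag is illustrative):
easy_sm -t v1.0.0 build
easy_sm -t v1.0.0 push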
"Access Denied" errors¶
Problem: IAM role lacks S3 or ECR permissions.
Solution: Add required permissions to the SageMaker execution role:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::my-bucket/*",
"arn:aws:s3:::my-bucket"
]
},
{
"Effect": "Allow",
"Action": [
"ecr:GetDownloadUrlForLayer",
"ecr:BatchGetImage",
"ecr:BatchCheckLayerAvailability"
],
"Resource": "*"
}
]
}
Out of memory errors¶
Problem: Model too large for instance memory.
Solution: Use a larger instance type, as shown below.
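For example, moving from ml.m5.large to ml.m5.2xlarge (names and paths are illustrative):
easy_sm train \
-n my-training-job \
-e ml.m5.2xlarge \
-i s3://my-bucket/data \
-o s3://my-bucket/output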
Training takes too long¶
Problem: Slow training on small instances.
Solution:
1. Use compute-optimized or GPU instances
2. Use distributed training with the `-c` flag
3. Optimize your training code
upload-data¶
Upload local data to S3 for training.
Synopsis¶
easy_sm upload-data --input-dir DIR --target-dir S3_PATH [OPTIONS]
Description¶
Uploads a local directory to S3 for use as training input data.
Options¶
| Option | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
| `--input-dir` | `-i` | path | Yes | - | Local directory containing data |
| `--target-dir` | `-t` | string | Yes | - | S3 location (e.g., `s3://bucket/data`) |
| `--iam-role-arn` | `-r` | string | No | From `SAGEMAKER_ROLE` | IAM role ARN |
| `--app-name` | `-a` | string | No | Auto-detected | App name |
Examples¶
# Upload training data
easy_sm upload-data \
-i ./local-data \
-t s3://my-bucket/training-data
# Output: s3://my-bucket/training-data
Output¶
The command outputs the S3 path where data was uploaded, as shown in the example above.
list-training-jobs¶
List recent SageMaker training jobs.
Synopsis¶
easy_sm list-training-jobs [--max-results N] [--names-only] [OPTIONS]
Description¶
Lists recent training jobs with their status and creation time. Supports pipe-friendly output for automation.
Options¶
| Option | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
| `--max-results` | `-m` | integer | No | 5 | Maximum number of jobs to return |
| `--names-only` | `-n` | boolean | No | false | Output only job names (one per line) |
| `--iam-role-arn` | `-r` | string | No | From `SAGEMAKER_ROLE` | IAM role ARN |
| `--app-name` | `-a` | string | No | Auto-detected | App name |
Examples¶
List recent training jobs¶
easy_sm list-training-jobs
Output:
my-training-job-1 Completed 2024-01-15 10:23:45+00:00
my-training-job-2 InProgress 2024-01-15 11:30:12+00:00
my-training-job-3 Failed 2024-01-14 09:15:33+00:00
List more jobs¶
Use `-m` to return more results, for example:
easy_sm list-training-jobs -m 20
Names only (pipe-friendly)¶
easy_sm list-training-jobs -n
Output: the job names only, one per line.
Get latest job name¶
easy_sm list-training-jobs -n -m 1
Output: the name of the most recent training job.
Pipe-Friendly Usage¶
# Get latest completed job name
JOB=$(easy_sm list-training-jobs -m 10 | grep Completed | head -1 | awk '{print $1}')
# Get all production job names
easy_sm list-training-jobs -n -m 20 | grep "prod-"
# Get latest model and deploy
MODEL=$(easy_sm get-model-artifacts -j $(easy_sm list-training-jobs -n -m 1))
easy_sm deploy -n my-endpoint -e ml.m5.large -m $MODEL
get-model-artifacts¶
Get S3 model path from a training job.
Synopsis¶
easy_sm get-model-artifacts --training-job-name NAME [OPTIONS]
Description¶
Retrieves the S3 location of the model artifacts (model.tar.gz) for a completed training job. Essential for deployment workflows.
Options¶
| Option | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
| `--training-job-name` | `-j` | string | Yes | - | Training job name |
| `--iam-role-arn` | `-r` | string | No | From `SAGEMAKER_ROLE` | IAM role ARN |
| `--app-name` | `-a` | string | No | Auto-detected | App name |
Examples¶
Get model path¶
easy_sm get-model-artifacts -j my-training-job
Output: the S3 path to the job's model.tar.gz, e.g. s3://my-bucket/models/my-training-job/output/model.tar.gz.
Use in deployment pipeline¶
# Get latest job
JOB=$(easy_sm list-training-jobs -n -m 1)
# Get its model
MODEL=$(easy_sm get-model-artifacts -j $JOB)
# Deploy
easy_sm deploy -n my-endpoint -e ml.m5.large -m $MODEL
One-liner deployment¶
easy_sm deploy -n my-endpoint -e ml.m5.large \
-m $(easy_sm get-model-artifacts -j $(easy_sm list-training-jobs -n -m 1))
Troubleshooting¶
Problem: "Training job not found"
Solution: Verify the job name with `easy_sm list-training-jobs`.
Problem: Job not completed yet
Solution: Wait for the training job to finish and retry; you can poll its status as sketched below.
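A simple illustrative polling loop built on list-training-jobs output (the job name and interval are placeholders):
# Poll until the job reports a Completed status
until easy_sm list-training-jobs -m 20 | grep my-training-job | grep -q Completed; do
  sleep 60
done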
Complete Training Workflow¶
End-to-End Example¶
# 1. Set up environment
export SAGEMAKER_ROLE=arn:aws:iam::123456789012:role/SageMakerRole
# 2. Build and push Docker image
easy_sm -t v1.0.0 build
easy_sm -t v1.0.0 push
# 3. Upload training data
easy_sm upload-data \
-i ./training-data \
-t s3://my-bucket/data
# 4. Train model
easy_sm -t v1.0.0 train \
-n my-training-job \
-e ml.m5.large \
-i s3://my-bucket/data \
-o s3://my-bucket/models
# 5. List training jobs
easy_sm list-training-jobs
# 6. Get model artifacts
MODEL=$(easy_sm get-model-artifacts -j my-training-job)
# 7. Deploy
easy_sm -t v1.0.0 deploy \
-n my-endpoint \
-e ml.m5.large \
-m $MODEL
Automated Pipeline¶
#!/bin/bash
set -e
export SAGEMAKER_ROLE=arn:aws:iam::123456789012:role/SageMakerRole
VERSION="v$(date +%Y%m%d-%H%M%S)"
# Build and push
easy_sm -t $VERSION build
easy_sm -t $VERSION push
# Train
easy_sm -t $VERSION train \
-n training-$VERSION \
-e ml.m5.xlarge \
-i s3://my-bucket/data \
-o s3://my-bucket/models
# Get model and deploy
MODEL=$(easy_sm get-model-artifacts -j training-$VERSION)
easy_sm -t $VERSION deploy \
-n production-endpoint \
-e ml.m5.large \
-m $MODEL
echo "Deployed $VERSION to production-endpoint"
Related Commands¶
- `build` - Build Docker image
- `push` - Push image to ECR
- `upload-data` - Upload training data to S3
- `deploy` - Deploy trained models
- `local train` - Test training locally first