Tech World by Yashrajsinh

AWS ECS Container Service

Yashrajsinh
17 min read · Intermediate

Amazon Elastic Container Service (ECS) is the fully managed container orchestration platform on AWS that lets you run, scale, and secure containerized applications without operating your own orchestration control plane. ECS handles container placement, scheduling, health monitoring, and integration with the broader AWS ecosystem while you focus on building and deploying your application containers. Whether you choose the serverless Fargate launch type or manage your own EC2 instances, ECS provides a production-grade orchestration layer that scales from a single container to thousands of tasks across multiple Availability Zones.

Container orchestration solves the fundamental challenge of running distributed applications reliably at scale. When you move beyond a single container on a single host, you need a system that decides where to place containers, restarts them when they fail, distributes traffic across healthy instances, scales capacity based on demand, and rolls out updates without downtime. ECS provides all of these capabilities as a managed service, eliminating the operational complexity of running your own orchestration platform while integrating natively with AWS networking, security, and observability services.

This guide covers everything you need to deploy and operate containerized applications on ECS. We start with the core concepts of task definitions, tasks, and services, move through launch type selection between Fargate and EC2, explain service configuration with load balancing and auto-scaling, walk through deployment strategies, and finish with production best practices that keep your containers running reliably. If you are following the AWS services roadmap, ECS is the natural next step after mastering Docker fundamentals and pushing images to ECR.

What You Will Learn

After reading this guide, you will have a thorough understanding of AWS ECS and how to use it for production container workloads. Specifically, you will learn:

  • How ECS organizes containerized applications through clusters, task definitions, tasks, and services, and how these abstractions map to running containers
  • How task definitions declare container configurations including images, resource limits, port mappings, environment variables, logging, and health checks
  • How to choose between Fargate and EC2 launch types based on workload characteristics, cost optimization, and operational requirements
  • How ECS services maintain desired task count, integrate with Application Load Balancers for traffic distribution, and perform rolling deployments
  • How auto-scaling policies adjust task count based on CPU utilization, memory usage, custom CloudWatch metrics, or request count per target
  • How ECS networking works with awsvpc mode giving each task its own elastic network interface and security group for fine-grained network isolation
  • How deployment configurations control rolling updates, circuit breakers, and blue-green deployments through CodeDeploy integration
  • How IAM roles at the task level and execution level provide least-privilege access to AWS services without embedding credentials in containers
  • How CloudWatch Container Insights, log drivers, and health checks provide observability into container performance and application behavior

Each section builds on the previous one, giving you a coherent path from understanding ECS fundamentals to operating production container workloads confidently.

Prerequisites

Before working through this guide, make sure you have the following in place:

  • An active AWS account with permissions to create ECS clusters, task definitions, services, and associated resources including IAM roles, security groups, and load balancers
  • The AWS CLI installed and configured with credentials using aws configure, so you can provision and manage ECS resources from your terminal
  • A VPC configured with public and private subnets across multiple Availability Zones, because ECS tasks need network placement and load balancers require subnets in at least two AZs
  • Familiarity with Docker concepts including images, containers, Dockerfiles, port mappings, and environment variables, since ECS orchestrates Docker containers
  • Container images pushed to Amazon ECR or another container registry accessible from your AWS account, so ECS can pull images when launching tasks
  • Understanding of IAM roles and policies for configuring task execution roles that pull images and task roles that grant containers access to AWS services

No prior ECS experience is required. If you have used Docker Compose or Kubernetes before, you will find ECS concepts familiar but with tighter AWS integration and less operational overhead for the orchestration layer itself.

Concept Overview

ECS organizes container workloads through a hierarchy of abstractions: clusters, task definitions, tasks, and services. Understanding how these relate to each other is essential before you start deploying containers.

A cluster is the logical grouping of resources where your containers run. It acts as a logical boundary for your workloads: IAM policies can be scoped to a cluster, and capacity providers and their default strategy are configured per cluster. You might have separate clusters for production, staging, and development, or separate clusters for different teams or applications. With Fargate, a cluster is purely logical because AWS manages the underlying compute. With EC2 launch type, the cluster also contains the EC2 instances that provide compute capacity.

A task definition is the blueprint for your application. It declares one or more container definitions, each specifying the image to pull, CPU and memory limits, port mappings, environment variables, logging configuration, health check commands, and dependencies between containers. Task definitions are versioned and immutable. When you update a task definition, ECS creates a new revision rather than modifying the existing one, giving you a clear audit trail and the ability to roll back to any previous revision.

A task is a running instantiation of a task definition. It represents one or more containers running together on the same host with shared resources. Tasks are ephemeral by nature. They start, run, and eventually stop because the application exits, a health check fails, or the service scales in. When tasks belong to a service, ECS manages their lifecycle, replacing them when they fail and placing them on hosts with available capacity. Standalone tasks run once and are not restarted.

A service is the long-running controller that maintains a desired number of tasks. It ensures that the specified count of healthy tasks is always running, replaces tasks that fail health checks, integrates with load balancers to register and deregister tasks as targets, and performs rolling deployments when you update the task definition. Services are the primary abstraction for running web applications, APIs, and background workers that need to stay running continuously.

The relationship between these concepts flows naturally: a cluster contains services, each service references a task definition, and the service maintains the desired count of tasks based on that definition. When you deploy a new version, you create a new task definition revision, update the service to reference it, and the service gradually replaces old tasks with new ones.
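The relationship between a service, its task definition revision, and its tasks can be pictured as a small control loop. The sketch below is a toy model, not ECS code: the `Service` and `Task` classes and the `reconcile` method are hypothetical names invented for illustration, and the real scheduler also handles placement, draining, and deployments.

```python
from dataclasses import dataclass, field
from itertools import count

_task_ids = count(1)  # hypothetical ID generator for this toy model


@dataclass
class Task:
    task_id: int
    revision: int          # task definition revision the task was started from
    healthy: bool = True


@dataclass
class Service:
    name: str
    desired_count: int
    revision: int          # task definition revision the service references
    tasks: list = field(default_factory=list)

    def reconcile(self):
        """One pass of a simplified loop: drop unhealthy tasks, then match desired_count."""
        self.tasks = [t for t in self.tasks if t.healthy]
        while len(self.tasks) < self.desired_count:
            self.tasks.append(Task(next(_task_ids), self.revision))
        # Stop surplus tasks if desired_count was lowered.
        del self.tasks[self.desired_count:]


service = Service(name="web-api-service", desired_count=3, revision=1)
service.reconcile()                # launches three tasks
service.tasks[0].healthy = False   # simulate a failed health check
service.reconcile()                # replaces the failed task
print(len(service.tasks), [t.task_id for t in service.tasks])
```

After the second pass the service is back at three tasks, with the failed task replaced by a new one rather than repaired in place.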

Step-by-Step Explanation

This section walks through the essential implementation steps in order. Each step builds on the previous one, providing a clear path from initial configuration to a production-ready setup that follows AWS best practices.

Creating an ECS Cluster

The first step is creating a cluster to host your container workloads. With Fargate, cluster creation is straightforward because you do not need to provision any compute instances.

# Create an ECS cluster with Container Insights enabled
aws ecs create-cluster \
  --cluster-name production-cluster \
  --capacity-providers FARGATE FARGATE_SPOT \
  --default-capacity-provider-strategy \
    capacityProvider=FARGATE,weight=1,base=2 \
    capacityProvider=FARGATE_SPOT,weight=3 \
  --settings name=containerInsights,value=enabled \
  --tags key=Environment,value=production key=Team,value=platform
 
# Verify the cluster is active
aws ecs describe-clusters \
  --clusters production-cluster \
  --query 'clusters[0].{Name:clusterName,Status:status,Providers:capacityProviders}'

This creates a cluster with two capacity providers: standard Fargate for baseline tasks and Fargate Spot for cost-optimized tasks that can tolerate interruption. The default strategy ensures at least two tasks always run on standard Fargate (the base parameter) while additional tasks prefer Spot capacity at a 3:1 ratio. Container Insights enables detailed metrics collection for CPU, memory, network, and storage at the task and service level.
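To get a feel for how base and weight interact, the following sketch estimates how a given number of tasks would be split between the two providers. It is an approximation under a stated assumption: base counts are filled first and the remainder is divided by weight with largest-remainder rounding. The function name `split_tasks` is hypothetical, and the rounding ECS applies in practice may differ.

```python
def split_tasks(total, strategy):
    """Approximate task placement across capacity providers.

    Assumption: base counts are filled first, then the remainder is divided in
    proportion to the weights with largest-remainder rounding. ECS's real
    placement may round differently.
    """
    placed = {item["capacityProvider"]: 0 for item in strategy}
    remaining = total
    for item in strategy:
        take = min(item.get("base", 0), remaining)
        placed[item["capacityProvider"]] += take
        remaining -= take

    total_weight = sum(item["weight"] for item in strategy)
    if remaining > 0 and total_weight > 0:
        # Integer quotient and remainder of each provider's proportional share.
        parts = [divmod(remaining * item["weight"], total_weight) for item in strategy]
        leftover = remaining - sum(q for q, _ in parts)
        # Give leftover tasks to the providers with the largest remainders.
        by_remainder = sorted(range(len(parts)), key=lambda i: parts[i][1], reverse=True)
        for rank, i in enumerate(by_remainder):
            extra = 1 if rank < leftover else 0
            placed[strategy[i]["capacityProvider"]] += parts[i][0] + extra
    return placed


# Mirrors the default strategy in the create-cluster command above.
strategy = [
    {"capacityProvider": "FARGATE", "weight": 1, "base": 2},
    {"capacityProvider": "FARGATE_SPOT", "weight": 3},
]
print(split_tasks(10, strategy))  # {'FARGATE': 4, 'FARGATE_SPOT': 6}
```

With ten tasks, two land on standard Fargate because of the base, and the remaining eight split 2 to 6 according to the 1:3 weights.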

Writing a Task Definition

Task definitions declare everything ECS needs to run your containers. They are JSON documents that you register with ECS and reference by family name and revision number.

{
  "family": "web-api",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/web-api-task-role",
  "containerDefinitions": [
    {
      "name": "web-api",
      "image": "123456789012.dkr.ecr.ap-south-1.amazonaws.com/web-api:v1.2.0",
      "essential": true,
      "portMappings": [
        {
          "containerPort": 8080,
          "protocol": "tcp"
        }
      ],
      "environment": [
        { "name": "NODE_ENV", "value": "production" },
        { "name": "PORT", "value": "8080" }
      ],
      "secrets": [
        {
          "name": "DATABASE_URL",
          "valueFrom": "arn:aws:secretsmanager:ap-south-1:123456789012:secret:prod/db-url-AbCdEf"
        }
      ],
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 60
      },
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/web-api",
          "awslogs-region": "ap-south-1",
          "awslogs-stream-prefix": "ecs",
          "awslogs-create-group": "true"
        }
      },
      "linuxParameters": {
        "initProcessEnabled": true
      }
    }
  ],
  "runtimePlatform": {
    "cpuArchitecture": "ARM64",
    "operatingSystemFamily": "LINUX"
  }
}

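Several decisions in this task definition are worth calling out. The awsvpc network mode gives each task its own ENI and private IP address. The ARM64 architecture targets Graviton-based Fargate, which typically offers better price-performance, provided the image is built for arm64. The execution role (used by ECS to pull images and write logs) is separate from the task role (used by the application to access AWS services). Sensitive configuration comes from Secrets Manager references instead of plaintext environment variables. Two details are easy to miss: the health check command runs inside the container, so the image must include curl, and awslogs-create-group only works if the execution role is allowed to call logs:CreateLogGroup.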
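Fargate only accepts certain CPU and memory pairings, so it can help to check a task definition locally before registering it. The sketch below is one way to do that. The table reflects the Fargate task sizes as published at the time of writing and should be verified against current AWS documentation; `check_task_size` is a hypothetical helper, not part of any AWS SDK.

```python
# Fargate task sizes: CPU units -> allowed memory values in MiB.
# Assumption: values as published by AWS at the time of writing.
FARGATE_SIZES = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
    8192: list(range(16384, 61440 + 1, 4096)),
    16384: list(range(32768, 122880 + 1, 8192)),
}


def check_task_size(task_def):
    """Return a list of problems found in a Fargate task definition dict."""
    problems = []
    cpu = int(task_def.get("cpu", 0))
    memory = int(task_def.get("memory", 0))
    if cpu not in FARGATE_SIZES:
        problems.append(f"cpu {cpu} is not a Fargate task size")
    elif memory not in FARGATE_SIZES[cpu]:
        problems.append(f"memory {memory} MiB is not valid with cpu {cpu}")
    if task_def.get("networkMode") != "awsvpc":
        problems.append("Fargate tasks require networkMode awsvpc")
    return problems


web_api = {"family": "web-api", "networkMode": "awsvpc", "cpu": "512", "memory": "1024"}
print(check_task_size(web_api))  # []
```

The task definition above (512 CPU units with 1024 MiB) passes, while a pairing such as 512 CPU units with 5120 MiB would be reported as invalid.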

Register the task definition with ECS:

# Register the task definition
aws ecs register-task-definition \
  --cli-input-json file://task-definition.json
 
# List revisions of the task definition family
aws ecs list-task-definitions \
  --family-prefix web-api \
  --sort DESC

Creating a Service with Load Balancing

Services maintain your desired task count and integrate with load balancers to distribute traffic across healthy tasks. Before creating the service, you need an Application Load Balancer with a target group configured for the IP target type (required for awsvpc network mode).

# Create the ECS service with ALB integration
aws ecs create-service \
  --cluster production-cluster \
  --service-name web-api-service \
  --task-definition web-api:1 \
  --desired-count 3 \
  --launch-type FARGATE \
  --platform-version LATEST \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-private-1a,subnet-private-1b],securityGroups=[sg-ecs-tasks],assignPublicIp=DISABLED}" \
  --load-balancers "targetGroupArn=arn:aws:elasticloadbalancing:ap-south-1:123456789012:targetgroup/web-api-tg/abc123,containerName=web-api,containerPort=8080" \
  --health-check-grace-period-seconds 120 \
  --deployment-configuration "deploymentCircuitBreaker={enable=true,rollback=true},maximumPercent=200,minimumHealthyPercent=100" \
  --enable-execute-command \
  --tags key=Service,value=web-api
 
# Verify the service is running
aws ecs describe-services \
  --cluster production-cluster \
  --services web-api-service \
  --query 'services[0].{Status:status,Running:runningCount,Desired:desiredCount,Deployments:deployments[*].{Status:status,Running:runningCount}}'

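The service configuration places tasks in private subnets without public IPs, meaning traffic reaches containers only through the load balancer. The deployment circuit breaker automatically rolls back a deployment if new tasks fail to stabilize, preventing bad deployments from taking down the entire service. The health check grace period tells the ECS scheduler to ignore failing load balancer health checks for the first 120 seconds after a task starts, which is important for applications with slow startup times like JVM-based services. Note that passing --launch-type FARGATE means the cluster's default capacity provider strategy is not applied to this service; omit --launch-type (or pass --capacity-provider-strategy) if you want the Fargate and Fargate Spot split defined on the cluster.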

Setting minimumHealthyPercent=100 and maximumPercent=200 means ECS launches new tasks before stopping old ones during deployments, ensuring zero capacity reduction during rollouts. The service always has at least the desired count of healthy tasks running.
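The two percentages translate into concrete task counts. As a minimal sketch, assuming the documented rounding (the minimum healthy count rounds up, the maximum count rounds down), the bounds can be computed like this. `rolling_bounds` is a hypothetical helper name.

```python
def rolling_bounds(desired, min_healthy_pct, max_pct):
    """Return (minimum healthy tasks, maximum total tasks) during a rolling update."""
    min_tasks = -(-desired * min_healthy_pct // 100)  # ceiling division
    max_tasks = desired * max_pct // 100              # floor division
    return min_tasks, max_tasks


print(rolling_bounds(3, 100, 200))  # (3, 6)
print(rolling_bounds(3, 50, 150))   # (2, 4)
```

For the service above (desired count 3, 100/200), ECS keeps at least three healthy tasks and may run up to six during the rollout. With 50/150 it could drop to two healthy tasks and would never exceed four.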

Configuring Auto-Scaling

ECS service auto-scaling adjusts the desired task count based on metrics, ensuring your application handles traffic fluctuations without manual intervention. Auto-scaling uses Application Auto Scaling with target tracking or step scaling policies.

# Register the service as a scalable target
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/production-cluster/web-api-service \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 2 \
  --max-capacity 20
 
# Create a target tracking policy based on CPU utilization
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/production-cluster/web-api-service \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name web-api-cpu-scaling \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 60.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    },
    "ScaleInCooldown": 300,
    "ScaleOutCooldown": 60
  }'
 
# Create a second policy based on ALB request count per target
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/production-cluster/web-api-service \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name web-api-request-scaling \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 1000.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ALBRequestCountPerTarget",
      "ResourceLabel": "app/web-api-alb/abc123/targetgroup/web-api-tg/def456"
    },
    "ScaleInCooldown": 300,
    "ScaleOutCooldown": 30
  }'
 
# View current scaling activities
aws application-autoscaling describe-scaling-activities \
  --service-namespace ecs \
  --resource-id service/production-cluster/web-api-service

Target tracking policies work by keeping a metric near the specified target value. When average CPU utilization stays above 60 percent, Application Auto Scaling raises the service's desired count to bring utilization back down. When utilization drops well below the target, it lowers the desired count, waiting out the scale-in cooldown between reductions. The asymmetric cooldowns in the CPU policy (60 seconds for scale-out, 300 seconds for scale-in) ensure the service responds quickly to traffic increases but avoids thrashing during brief traffic dips.

Using multiple scaling policies together provides broader coverage. CPU-based scaling handles compute-intensive workloads, while request-count scaling responds to traffic volume regardless of per-request CPU cost. When several policies are active, Application Auto Scaling acts on the one that calls for the highest desired count, so the service scales out if any policy needs more capacity and scales in only when all of them agree.
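The arithmetic behind target tracking can be approximated with a proportional rule. The sketch below is a simplified model, not the actual Application Auto Scaling algorithm: it ignores CloudWatch alarm evaluation periods and cooldowns, and the function names are hypothetical.

```python
import math


def proportional_target(current_tasks, metric_value, target_value):
    """Task count that would bring the metric back to target, assuming load spreads evenly."""
    return math.ceil(current_tasks * metric_value / target_value)


def desired_count(current_tasks, readings, min_capacity, max_capacity):
    """readings: one (metric_value, target_value) pair per scaling policy.

    The largest proposal wins, then the result is clamped to the
    registered min and max capacity.
    """
    proposals = [proportional_target(current_tasks, m, t) for m, t in readings]
    return max(min_capacity, min(max_capacity, max(proposals)))


# Four tasks at 90% CPU (target 60) and 1200 requests per target (target 1000).
print(desired_count(4, [(90, 60), (1200, 1000)], min_capacity=2, max_capacity=20))  # 6
```

Here the CPU policy proposes six tasks and the request-count policy proposes five, so the higher value is used.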

Fargate vs EC2 Launch Types

The launch type determines who manages the compute infrastructure that runs your containers. This decision affects cost, operational complexity, and the level of control you have over the underlying hosts.

Fargate is the serverless option where AWS manages all compute infrastructure. You specify CPU and memory requirements in your task definition, and Fargate provisions exactly the right amount of compute for each task. You pay per-second for the vCPU and memory your tasks consume, with no charges for idle capacity. Fargate eliminates the need to provision, patch, or scale EC2 instances, making it the right choice when you want to focus entirely on your application without managing infrastructure.

EC2 launch type runs tasks on EC2 instances that you manage within your cluster. You control the instance types, AMIs, and scaling of the underlying fleet. This gives you access to GPU instances, higher per-task resource limits, persistent local storage, and the ability to run privileged containers. EC2 launch type is cost-effective for steady-state workloads where you can right-size instances and achieve high utilization, and it is required for workloads that need capabilities Fargate does not support.

Choose Fargate when your workloads are variable or bursty, when you want zero infrastructure management, when tasks need at most 16 vCPU and 120 GB of memory, and when you do not need GPU access or privileged containers. Choose EC2 when you need GPU instances for machine learning inference, when you have steady-state workloads where reserved instances provide significant savings, when tasks need more resources than Fargate supports, or when you need access to the host operating system for specialized configurations.
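These rules of thumb can be written down as a small decision helper. The sketch below only encodes the hard constraints from the paragraph above (GPU, privileged containers, host access, size limits); cost considerations such as reserved capacity are left out. The `Workload` class and `suggest_launch_type` function are hypothetical names.

```python
from dataclasses import dataclass

FARGATE_MAX_VCPU = 16
FARGATE_MAX_MEMORY_GB = 120


@dataclass
class Workload:
    vcpu: float
    memory_gb: float
    needs_gpu: bool = False
    needs_privileged: bool = False
    needs_host_access: bool = False


def suggest_launch_type(w):
    """Return (launch_type, reasons) based on hard constraints only."""
    reasons = []
    if w.needs_gpu:
        reasons.append("GPU required")
    if w.needs_privileged:
        reasons.append("privileged containers required")
    if w.needs_host_access:
        reasons.append("host OS access required")
    if w.vcpu > FARGATE_MAX_VCPU or w.memory_gb > FARGATE_MAX_MEMORY_GB:
        reasons.append("task exceeds Fargate size limits")
    if reasons:
        return "EC2", reasons
    return "FARGATE", ["fits Fargate limits with no host-level requirements"]


print(suggest_launch_type(Workload(vcpu=0.5, memory_gb=1)))
print(suggest_launch_type(Workload(vcpu=8, memory_gb=32, needs_gpu=True)))
```

A workload that passes these checks is a Fargate candidate; whether EC2 is still cheaper for a steady-state fleet is a separate calculation.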

Fargate Spot provides up to 70 percent cost savings compared to standard Fargate by running tasks on spare capacity that AWS can reclaim with a two-minute warning. Use Spot for fault-tolerant workloads like batch processing, queue workers, and development environments where occasional task interruption is acceptable.

Networking and Security

ECS networking with awsvpc mode gives each task its own elastic network interface with a private IP address from your VPC subnet. This provides the same networking capabilities as EC2 instances: security groups, VPC flow logs, and direct addressability within the VPC. Each task can have its own security group, enabling fine-grained network policies at the container level rather than the host level.

Tasks in private subnets access the internet through a NAT gateway for pulling images from public registries or calling external APIs. For pulling images from ECR without internet access, configure interface VPC endpoints for ECR (both ecr.api and ecr.dkr) and CloudWatch Logs, plus a gateway endpoint for S3, where image layers are stored. If the task definition references Secrets Manager secrets, add an endpoint for Secrets Manager as well. This keeps traffic within the AWS network and avoids NAT gateway data processing charges for image pulls.

Security in ECS follows the principle of least privilege through two distinct IAM roles. The task execution role grants ECS permission to pull container images from ECR, retrieve secrets from Secrets Manager or Parameter Store, and write logs to CloudWatch. The task role grants the running application permission to call AWS services like S3, DynamoDB, or SQS. Separating these roles means a compromised application cannot modify its own task definition or pull different images, limiting the blast radius of a security incident.
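To make the split between the two roles concrete, the sketch below builds the policy documents as plain Python dictionaries. Both roles trust the ecs-tasks.amazonaws.com service principal. The bucket name is hypothetical, the secret ARN is the example from the task definition earlier in this guide, and the function names are illustrative only.

```python
import json


def ecs_tasks_trust_policy():
    """Trust policy shared by the execution role and the task role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "ecs-tasks.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }


def execution_role_secret_policy(secret_arn):
    """Extra permission the execution role needs to inject one secret at launch."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["secretsmanager:GetSecretValue"],
            "Resource": secret_arn,
        }],
    }


def task_role_policy(bucket):
    """Application permissions: read-only access to a single (hypothetical) bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": f"arn:aws:s3:::{bucket}/*",
        }],
    }


secret_arn = "arn:aws:secretsmanager:ap-south-1:123456789012:secret:prod/db-url-AbCdEf"
print(json.dumps(execution_role_secret_policy(secret_arn), indent=2))
print(json.dumps(task_role_policy("example-uploads-bucket"), indent=2))
```

The execution role policy mentions only what ECS needs before the container starts, while the task role policy lists only what the application calls at runtime.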

Service-to-service communication within ECS can use AWS Cloud Map for service discovery. When you enable service discovery on an ECS service, each task registers its IP address in a Cloud Map namespace, and other services resolve the service name through DNS to get the current set of healthy task IPs. For straightforward internal communication between microservices, this removes the need for manual endpoint management and can make a full service mesh unnecessary.

Deployment Strategies

ECS supports multiple deployment strategies that balance speed, safety, and resource efficiency during application updates.

Rolling updates are the default strategy. When you update a service's task definition, ECS launches new tasks with the updated definition and drains connections from old tasks before stopping them. The minimumHealthyPercent and maximumPercent parameters control how aggressively ECS replaces tasks. A configuration of 100/200 means ECS can temporarily double the task count during deployment, ensuring zero reduction in serving capacity. A configuration of 50/100 means ECS can stop up to half the old tasks before launching new ones, using no extra capacity but briefly reducing serving capacity.

The deployment circuit breaker monitors new tasks during deployment and automatically rolls back if they fail to reach a healthy state. When enabled with rollback, ECS tracks the number of tasks that fail to start or fail health checks. If failures exceed a threshold, ECS stops the deployment and reverts the service to the previous stable task definition revision. This prevents bad deployments from progressing and taking down the entire service.
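The failure threshold scales with the size of the service. As a rough sketch, assuming the formula described in AWS documentation at the time of writing (half the desired count, rounded up, bounded between 3 and 200), it can be estimated as follows. Treat the bounds as assumptions to verify, and `failure_threshold` as a hypothetical helper name.

```python
import math


def failure_threshold(desired_count, minimum=3, maximum=200):
    """Estimated number of failed task launches that trips the circuit breaker.

    Assumption: threshold = ceil(0.5 * desired_count), clamped to [minimum, maximum].
    """
    return max(minimum, min(maximum, math.ceil(desired_count / 2)))


for count in (1, 25, 400):
    print(count, failure_threshold(count))
```

Under this assumption a three-task service trips the breaker after three failed launches, while very large services are capped at 200.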

Blue-green deployments through AWS CodeDeploy provide the safest deployment strategy by running the new version alongside the old version and shifting traffic between them. The service uses the CODE_DEPLOY deployment controller, and the load balancer is configured with two target groups. CodeDeploy launches the updated tasks as a replacement task set registered with the second target group, shifts traffic to it all at once or in canary or linear increments, monitors CloudWatch alarms during the shift, and either completes the deployment or rolls back. This strategy is ideal for critical services where you need the ability to roll back quickly by shifting traffic back to the original target group.

Real-World Use Cases

ECS serves as the container orchestration platform for a wide range of production workloads, from simple web applications to complex microservice architectures.

Web applications and APIs are the most common ECS workload. A typical deployment runs multiple tasks behind an Application Load Balancer with path-based routing, auto-scaling based on request count, and rolling deployments for zero-downtime updates. The ALB performs health checks against each task and removes unhealthy tasks from the target group automatically.

Microservice architectures deploy each service as a separate ECS service within a shared cluster. Services communicate through internal load balancers or Cloud Map service discovery. Each service scales independently based on its own metrics, deploys independently with its own task definition, and can use different resource allocations appropriate to its workload characteristics.

Background workers and queue processors run as ECS services without load balancers. They pull work from SQS queues, process events from Kinesis streams, or run scheduled tasks through EventBridge. Auto-scaling for workers typically uses custom CloudWatch metrics like queue depth or processing lag rather than CPU utilization.
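A common way to scale workers on queue depth is the backlog-per-task calculation: decide how many messages one task can have waiting while still meeting a latency target, then divide the queue depth by that number. The sketch below shows the arithmetic only. Publishing the result as a custom CloudWatch metric or feeding it to a scaling policy is left out, and `desired_workers` and its parameters are hypothetical names.

```python
def desired_workers(queue_depth, target_latency_ms, avg_processing_ms, min_tasks, max_tasks):
    """Estimate the worker count needed to clear the queue within the latency target."""
    # Messages one task can have queued and still finish them within the target.
    acceptable_backlog = max(1, target_latency_ms // avg_processing_ms)
    needed = -(-queue_depth // acceptable_backlog)  # ceiling division
    return max(min_tasks, min(max_tasks, needed))


# 1500 queued messages, 10 s latency target, 100 ms per message.
print(desired_workers(1500, 10_000, 100, min_tasks=1, max_tasks=50))  # 15
```

Each task can absorb a backlog of 100 messages under these numbers, so 1500 queued messages call for 15 workers.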

Batch processing workloads use ECS standalone tasks (not services) launched by Step Functions or EventBridge rules. Each task processes a unit of work and exits when complete. Fargate Spot is particularly cost-effective for batch workloads because the tasks are inherently fault-tolerant and can be retried if interrupted.

Machine learning inference endpoints run on ECS with EC2 launch type using GPU instances. The task definition requests GPU resources, and ECS places tasks on instances with available GPUs. Auto-scaling adjusts the number of inference tasks based on prediction request latency or queue depth.

Best Practices

These practices represent production-tested patterns for operating ECS workloads reliably and efficiently:

Always use awsvpc network mode for Fargate tasks, where it is required, and prefer it for EC2 tasks. It provides task-level network isolation through security groups, simplifies service discovery, and gives each task a routable IP address within your VPC. Bridge or host mode mainly makes sense on EC2 for legacy applications that cannot adapt to awsvpc networking, or where per-instance ENI limits constrain task density.

Set appropriate health check parameters in your task definition. The startPeriod should be long enough for your application to complete initialization, including database migrations, cache warming, and dependency checks. A health check that fails during startup causes ECS to kill and restart the task in a loop, never allowing it to become healthy.

Use Secrets Manager or Parameter Store for all sensitive configuration rather than plaintext environment variables. The task definition stores only a reference to the secret (its ARN); the value is injected when the container starts, so it does not appear in the task definition or the ECS console. Because the value is resolved at launch, running tasks keep the old value after a rotation. Force a new deployment, or have the application fetch the secret at runtime, to pick up the rotated value.

Enable the deployment circuit breaker with rollback on every production service. Without it, a bad deployment continues launching failing tasks until you manually intervene. The circuit breaker detects failures automatically and reverts to the last known good configuration, limiting the impact of deployment issues.

Right-size task CPU and memory allocations based on actual usage rather than guessing. Use Container Insights metrics to observe peak CPU and memory consumption over a representative period, then set limits with appropriate headroom. Over-provisioning wastes money on Fargate where you pay for allocated resources, while under-provisioning causes OOM kills and CPU throttling.

Use multiple capacity providers to optimize cost. Run baseline tasks on standard Fargate with a base count, and scale additional tasks on Fargate Spot. For EC2 launch type, mix On-Demand instances for baseline capacity with Spot instances for burst capacity. The capacity provider strategy handles placement automatically.

Implement graceful shutdown in your application containers. When ECS stops a task, it sends SIGTERM and waits for the stopTimeout period (default 30 seconds) before sending SIGKILL. Your application should catch SIGTERM, stop accepting new requests, finish processing in-flight requests, close database connections cleanly, and exit. Enable initProcessEnabled in the task definition to ensure signal delivery works correctly with PID 1.
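A minimal Python sketch of this shutdown sequence is shown below. It assumes a worker-style loop where `process_one` is a hypothetical callable that handles one unit of work and returns True if it did something. A web server would instead stop accepting connections and wait for in-flight requests, but the signal handling is the same.

```python
import signal
import threading

shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    # ECS sends SIGTERM first; record it and let the main loop wind down.
    shutdown_requested.set()


signal.signal(signal.SIGTERM, handle_sigterm)


def serve(process_one, poll_interval=0.1):
    """Process work until SIGTERM arrives, finishing the in-flight unit before returning."""
    handled = 0
    while not shutdown_requested.is_set():
        if process_one():
            handled += 1
        else:
            # Nothing to do: wait briefly, but wake immediately on shutdown.
            shutdown_requested.wait(poll_interval)
    # Close database connections and flush buffers here, well within stopTimeout.
    return handled
```

The loop checks the flag only between units of work, so a unit that is already running completes before the process exits. Cleanup has to fit inside the stopTimeout window, or ECS follows up with SIGKILL.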

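Tag all ECS resources consistently with environment, team, service, and cost-center tags. Once activated as cost allocation tags, they let you break down spend per service in Cost Explorer, and they let you group and filter resources with resource groups. Tags on a service are not copied to its tasks unless you set the service's propagateTags option, so enable it if you want task-level cost attribution.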

Common Mistakes

These mistakes appear frequently in ECS deployments and understanding them helps you avoid outages and operational problems:

Setting the health check grace period too short causes ECS to kill tasks before they finish starting. JVM-based applications, applications that run database migrations on startup, or services that warm caches need a grace period that exceeds their worst-case startup time. Compare the service's healthCheckGracePeriodSeconds setting against measured startup duration and add a buffer.

Leaving the target group's deregistration delay unreviewed causes problems during deployments. When ECS stops a task, the load balancer needs time to drain existing connections before removing the target. The default of 300 seconds is safe but slows every deployment, while a value shorter than your longest request cuts off in-flight work and causes connection errors. Set the deregistration delay to match your longest expected request duration, typically 30 to 60 seconds for web APIs and longer for WebSocket connections.

Using the same security group for tasks and the load balancer leads to overly permissive network rules, because any rule that opens the ALB to the internet also applies to the tasks. Create separate security groups: one for the ALB that allows inbound traffic from the internet on ports 80 and 443, and one for ECS tasks that allows inbound traffic only from the ALB security group on the container port. This ensures tasks are never directly accessible from the internet.

Ignoring container log configuration leads to lost logs and difficult debugging. Always configure the awslogs log driver with a dedicated log group per service. Set appropriate log retention periods to control costs, and consider structured JSON logging so you can query logs effectively with CloudWatch Logs Insights.
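As an illustration of structured logging, the sketch below emits one JSON object per log line to stdout, which is what the awslogs driver forwards to CloudWatch. It uses only the Python standard library. The field names and the `build_logger` helper are arbitrary choices for this example, not a required schema.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name="web-api"):
    handler = logging.StreamHandler(sys.stdout)  # the log driver captures stdout/stderr
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = build_logger()
log.info("order %s created", "42")
```

With lines in this shape, a CloudWatch Logs Insights query can filter on fields such as level or logger instead of matching free text.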

Running containers as root without necessity violates the principle of least privilege. Configure your Dockerfile to run as a non-root user, and use the readonlyRootFilesystem parameter in the task definition to prevent containers from writing to the root filesystem. If the application needs to write temporary files, mount a writable volume at the specific path required: a tmpfs mount on EC2 launch type, or a bind-mounted task volume backed by ephemeral storage on Fargate, which does not support tmpfs.

Not setting resource limits causes a single misbehaving container to consume all available CPU or memory on the host, affecting other tasks. On Fargate, resource limits are mandatory and enforced. On EC2 launch type, always set both cpu and memory limits in container definitions to prevent resource contention between tasks sharing the same instance.

Deploying without a circuit breaker means a broken image or configuration error causes ECS to continuously launch failing tasks, consuming resources and generating noise in logs and alerts. The circuit breaker detects this pattern and stops the deployment automatically, but only if you enable it explicitly in the deployment configuration.

Summary

AWS ECS provides a production-grade container orchestration platform that handles the complexity of running distributed containerized applications while integrating deeply with AWS networking, security, and observability services. The service abstracts away cluster management through Fargate or gives you full control through EC2 launch type, letting you choose the operational model that matches your team's capabilities and workload requirements.

The key concepts to internalize are task definitions as immutable application blueprints, services as the controllers that maintain desired state and perform deployments, Fargate as the serverless compute layer that eliminates infrastructure management, and auto-scaling as the mechanism that matches capacity to demand automatically. With these fundamentals in place, you can deploy containerized applications that scale elastically, recover from failures automatically, and update without downtime.

Your next steps should include exploring ECS Service Connect for more complex service-to-service communication (AWS has announced end of support for App Mesh), investigating ECS Anywhere for running ECS tasks on on-premises infrastructure, and setting up CI/CD pipelines that build images, push to ECR, and trigger ECS deployments automatically. As you continue through the AWS services roadmap, you will find ECS integrating with ECR for image storage, VPC for network isolation, IAM for security, and CloudWatch for observability, making container orchestration knowledge foundational for building modern application architectures on AWS.
