Hello everyone. In this tutorial, we will continue building our receipt extraction application by creating an API on Amazon Elastic Container Service (Amazon ECS). We will use the receipt extraction image that we pushed to Amazon ECR in the previous tutorial. Amazon ECS is a fully managed container orchestration service that lets you build, manage, and run containers without the overhead of managing complex infrastructure.
Requirements
Before proceeding, ensure you have completed or installed the following:
- Amazon SageMaker AI prerequisite tutorial.
- An active AWS account.
- Terraform installed on your local machine to support Infrastructure as Code (IaC).
- (Optional) Streamlit for building the front-end user interface.
ECS Express Mode
ECS Express Mode lets you deploy a containerized service from a private or public Amazon ECR image (used as the primary container). It requires only two IAM roles: an execution role with the AmazonECSTaskExecutionRolePolicy managed policy attached, and an infrastructure role with the AmazonECSInfrastructureRoleforExpressGatewayServices managed policy attached.
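As a rough illustration, the two required roles might be declared in iam.tf as follows. The resource names `execution` and `infrastructure` match the references used later in ecs.tf. The role names, the `service-role/` path in the policy ARNs, and the trust principals (`ecs-tasks.amazonaws.com` and `ecs.amazonaws.com`) are assumptions that should be checked against the current AWS documentation.

```hcl
# iam.tf (sketch): execution and infrastructure roles for ECS Express Mode

# Trust policy that lets ECS tasks assume the execution role
data "aws_iam_policy_document" "ecs_tasks_assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "execution" {
  name               = "fastapi-ecs-execution" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_assume.json
}

# Lets ECS pull the image from ECR and write container logs
resource "aws_iam_role_policy_attachment" "execution" {
  role       = aws_iam_role.execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# Trust policy for the infrastructure role (assumed principal: the ECS service)
data "aws_iam_policy_document" "ecs_assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["ecs.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "infrastructure" {
  name               = "fastapi-ecs-infrastructure" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.ecs_assume.json
}

# Lets Express Mode manage the supporting resources it creates on our behalf
resource "aws_iam_role_policy_attachment" "infrastructure" {
  role       = aws_iam_role.infrastructure.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSInfrastructureRoleforExpressGatewayServices"
}
```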
Additional configurations, such as the IAM task role, are optional. In this setup, we utilize an IAM task role to authorize our container to invoke the SageMaker endpoint. While ECS Express Mode defaults to utilizing the default VPC, we have defined a custom VPC to maintain granular control over our networking topology.
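The task role described above could be sketched like this in iam.tf. The resource name `task` matches the reference in ecs.tf, and `local.region` and `local.account_id` are the same locals that ecs.tf relies on. The variable `sagemaker_endpoint_name`, the role name, and the policy name are hypothetical and introduced only for illustration.

```hcl
# iam.tf (sketch): task role that lets the container call the SageMaker endpoint

# Hypothetical variable: name of the endpoint created in the prerequisite tutorial
variable "sagemaker_endpoint_name" {
  type        = string
  description = "Name of the SageMaker endpoint the API is allowed to invoke"
}

resource "aws_iam_role" "task" {
  name = "fastapi-ecs-task" # hypothetical name

  # The running task (not the ECS agent) assumes this role
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ecs-tasks.amazonaws.com" }
    }]
  })
}

# Least-privilege inline policy: invoke one endpoint only
resource "aws_iam_role_policy" "invoke_sagemaker" {
  name = "invoke-sagemaker-endpoint" # hypothetical name
  role = aws_iam_role.task.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = "sagemaker:InvokeEndpoint"
      Resource = "arn:aws:sagemaker:${local.region}:${local.account_id}:endpoint/${var.sagemaker_endpoint_name}"
    }]
  })
}
```

Scoping the policy to a single endpoint ARN, rather than `*`, keeps the container from invoking other models in the account.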
To deploy, create the following Terraform configuration files in a single directory: iam.tf, main.tf, vpc.tf, and ecs.tf. The AWS Console will be used primarily to monitor and verify the deployed resources.
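For orientation, main.tf might contain little more than the provider setup and the shared values that the other files reference. This is a sketch under assumptions: the provider version constraint and the `us-east-1` region are placeholders, and the exact minimum provider release that ships `aws_ecs_express_gateway_service` should be confirmed in the provider changelog. The `local.account_id` and `local.region` names are taken from the references in ecs.tf.

```hcl
# main.tf (sketch): provider configuration and shared locals

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 6.0" # assumption: a recent 6.x release is needed for Express Mode
    }
  }
}

provider "aws" {
  region = "us-east-1" # placeholder: use the region that hosts your ECR image
}

# Look up the current account and region instead of hard-coding them
data "aws_caller_identity" "current" {}
data "aws_region" "current" {}

locals {
  account_id = data.aws_caller_identity.current.account_id
  region     = data.aws_region.current.region
}
```

With all four files in place, the usual `terraform init`, `terraform plan`, and `terraform apply` sequence deploys the stack.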
Below is the ecs.tf configuration file for setting up the ECS Cluster and the ECS Express Service connected to our Gemma-based receipt extraction image:
# Create the ECS cluster
resource "aws_ecs_cluster" "fastapiecs" {
  name = "fastapiecs"
}

# Create the ECS Express service that runs the receipt extraction ECR image
resource "aws_ecs_express_gateway_service" "fastapi" {
  cluster                 = aws_ecs_cluster.fastapiecs.name
  execution_role_arn      = aws_iam_role.execution.arn
  infrastructure_role_arn = aws_iam_role.infrastructure.arn
  task_role_arn           = aws_iam_role.task.arn
  health_check_path       = "/health"
  cpu                     = "256"
  memory                  = "512"
  region                  = data.aws_region.current.region

  primary_container {
    image          = "${local.account_id}.dkr.ecr.${local.region}.amazonaws.com/receipt-extraction-gemma-4:latest"
    container_port = 8000
  }

  network_configuration {
    subnets         = aws_subnet.public[*].id
    security_groups = [aws_security_group.alb_sg.id]
  }

  scaling_target {
    auto_scaling_metric       = "AVERAGE_CPU"
    auto_scaling_target_value = 70
    min_task_count            = 1
    max_task_count            = 3
  }
}
Terraform Configuration Breakdown
Here is an explanation of the core blocks in the ecs.tf file:
- execution_role_arn, infrastructure_role_arn, and task_role_arn: Reference the corresponding IAM role ARNs defined in iam.tf to grant the execution, infrastructure, and SageMaker invocation permissions.
- health_check_path: Defines the endpoint used to monitor the health of our FastAPI container.
- container_port: Specifies port 8000 as the listening port for our API container.
- network_configuration: Sets the subnets and security groups that handle inbound and outbound traffic.
- scaling_target: Configures auto-scaling on average CPU utilization with a 70% target, scaling between 1 and 3 tasks to manage load.
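The `aws_subnet.public` and `aws_security_group.alb_sg` references in network_configuration come from vpc.tf. A minimal sketch of that file is shown here. Only those two resource names are taken from ecs.tf. The CIDR ranges, the subnet count, the remaining resource names, and the security group rules are assumptions to adapt to your own networking requirements.

```hcl
# vpc.tf (sketch): custom VPC with two public subnets and one security group

data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16" # assumed range
  enable_dns_support   = true
  enable_dns_hostnames = true
}

# Two public subnets in different Availability Zones
resource "aws_subnet" "public" {
  count                   = 2
  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

# Route all outbound traffic from the public subnets through the internet gateway
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_route_table_association" "public" {
  count          = 2
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

resource "aws_security_group" "alb_sg" {
  name   = "fastapi-alb-sg" # hypothetical name
  vpc_id = aws_vpc.main.id

  # Assumed rule: allow HTTPS from anywhere; tighten for production use
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # Allow all outbound traffic (image pulls, SageMaker calls)
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```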
Deploying specialized fine-tuned models like Gemma for receipt extraction on Amazon ECS Express Mode reflects the ongoing shift toward modular, serverless-like architectures for AI agents. Rather than relying on monolithic LLM frameworks, agentic systems benefit from specialized microservices that can be spun up, scaled, and torn down on demand. Compared to a full Kubernetes (EKS) setup, ECS Express Mode lowers operational complexity, because it provisions and configures the Application Load Balancer for you. For AI agent developers, wrapping parsing capabilities (like receipt OCR and structured data extraction) in standard REST APIs managed with Terraform is a repeatable pattern that keeps permissions explicit through IAM roles. It lets multi-agent pipelines call task-specific tools over HTTP while relying on auto-scaling and health checks for resilience.