
Deploying Gemma-based Receipt Extraction API via AWS ECS Express Mode

Hello everyone. In this tutorial, we continue building our receipt extraction application by exposing it as an API on Amazon Elastic Container Service (Amazon ECS). We will use the receipt extraction image that we pushed to Amazon ECR in the previous setup. Amazon ECS is a fully managed container orchestration service that lets you deploy, manage, and scale containers without the overhead of complex infrastructure management.

Requirements

Before proceeding, ensure you have completed or installed the following:

  • Amazon SageMaker AI prerequisite tutorial.
  • An active AWS account.
  • Terraform installed on your local machine to support Infrastructure as Code (IaC).
  • (Optional) Streamlit for building the front-end user interface.

ECS Express Mode

ECS Express Mode lets you deploy a containerized service from a private or public Amazon ECR image (the primary container) with only two required IAM roles: an execution role with the AmazonECSTaskExecutionRolePolicy managed policy attached, and an infrastructure role with the AmazonECSInfrastructureRoleforExpressGatewayServices managed policy attached.

Additional settings, such as the IAM task role, are optional. In this setup, we use a task role to authorize the container to invoke the SageMaker endpoint. ECS Express Mode uses the default VPC unless told otherwise; we define a custom VPC instead to keep granular control over the network topology.

To deploy, create the following Terraform configuration files in a single directory: iam.tf, main.tf, vpc.tf, and ecs.tf. The AWS Console will be used primarily to monitor and verify the deployed resources.
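
The three roles described above could be declared in iam.tf roughly as follows. This is a minimal sketch, not the author's actual file: the role names and the inline policy name are hypothetical, the trust principal for the infrastructure role (ecs.amazonaws.com) and the wildcard endpoint resource are assumptions, and local.region and local.account_id are assumed to be defined elsewhere (ecs.tf references them too). The managed policies are looked up by name so that no ARN path has to be guessed.

```hcl
# iam.tf (sketch). Resource labels match the references used in ecs.tf.

# Trust policy for roles assumed by ECS tasks (execution and task roles).
data "aws_iam_policy_document" "ecs_tasks_assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

# Trust policy for the infrastructure role.
# Assumption: the ECS service principal is the one that assumes this role.
data "aws_iam_policy_document" "ecs_assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["ecs.amazonaws.com"]
    }
  }
}

# Look up the AWS managed policies by name instead of hard-coding ARNs.
data "aws_iam_policy" "task_execution" {
  name = "AmazonECSTaskExecutionRolePolicy"
}

data "aws_iam_policy" "express_infrastructure" {
  name = "AmazonECSInfrastructureRoleforExpressGatewayServices"
}

# Execution role: pulls the image from ECR and writes logs.
resource "aws_iam_role" "execution" {
  name               = "receipt-api-execution" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_assume.json
}

resource "aws_iam_role_policy_attachment" "execution" {
  role       = aws_iam_role.execution.name
  policy_arn = data.aws_iam_policy.task_execution.arn
}

# Infrastructure role: lets Express Mode manage supporting resources.
resource "aws_iam_role" "infrastructure" {
  name               = "receipt-api-infrastructure" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.ecs_assume.json
}

resource "aws_iam_role_policy_attachment" "infrastructure" {
  role       = aws_iam_role.infrastructure.name
  policy_arn = data.aws_iam_policy.express_infrastructure.arn
}

# Task role: the identity the running container uses to call SageMaker.
resource "aws_iam_role" "task" {
  name               = "receipt-api-task" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_assume.json
}

resource "aws_iam_role_policy" "invoke_sagemaker" {
  name = "invoke-sagemaker-endpoint" # hypothetical name
  role = aws_iam_role.task.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = "sagemaker:InvokeEndpoint"
      # Assumption: narrow the wildcard to your endpoint name in practice.
      Resource = "arn:aws:sagemaker:${local.region}:${local.account_id}:endpoint/*"
    }]
  })
}
```

Keeping the SageMaker permission on the task role rather than the execution role means only the application code inside the container, not the ECS agent, can invoke the endpoint.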

Below is the ecs.tf configuration file for setting up the ECS Cluster and the ECS Express Service connected to our Gemma-based receipt extraction image:

# Create ECS Cluster
resource "aws_ecs_cluster" "fastapiecs" {
  name = "fastapiecs"
}

# Create the ECS Express service linked to the receipt extraction ECR image
resource "aws_ecs_express_gateway_service" "fastapi" {
  cluster                 = aws_ecs_cluster.fastapiecs.name
  execution_role_arn      = aws_iam_role.execution.arn
  infrastructure_role_arn = aws_iam_role.infrastructure.arn
  task_role_arn           = aws_iam_role.task.arn
  health_check_path       = "/health"
  cpu                     = "256"
  memory                  = "512"
  region                  = data.aws_region.current.region

  primary_container {
    image          = "${local.account_id}.dkr.ecr.${local.region}.amazonaws.com/receipt-extraction-gemma-4:latest"
    container_port = 8000
  }

  network_configuration {
    subnets         = aws_subnet.public[*].id
    security_groups = [aws_security_group.alb_sg.id]
  }

  scaling_target {
    auto_scaling_metric       = "AVERAGE_CPU"
    auto_scaling_target_value = 70
    min_task_count            = 1
    max_task_count            = 3
  }
}
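
The ecs.tf file above references data.aws_region.current, local.account_id, and local.region, none of which are defined in it. A plausible main.tf that supplies them might look like the sketch below. The provider version constraint and the region value are assumptions; the Express service resource needs a recent AWS provider release, so check the provider changelog for the exact minimum version.

```hcl
# main.tf (sketch). Provider setup plus the lookups that ecs.tf relies on.

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 6.0" # assumption: pin to the release you have tested
    }
  }
}

provider "aws" {
  region = "us-east-1" # assumption: use the region that hosts your ECR image
}

# Identity and region of the credentials Terraform is running with.
data "aws_caller_identity" "current" {}
data "aws_region" "current" {}

locals {
  account_id = data.aws_caller_identity.current.account_id
  region     = data.aws_region.current.region
}
```

Deriving the account ID and region from data sources keeps the ECR image URI in ecs.tf free of hard-coded account details.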

Terraform Configuration Breakdown

Here is an explanation of the core blocks in the ecs.tf file:

  • execution_role_arn, infrastructure_role_arn, and task_role_arn: Reference the IAM role ARNs defined in iam.tf. The execution and infrastructure roles let ECS run and manage the service, and the task role lets the container invoke the SageMaker endpoint.
  • health_check_path: Defines the endpoint that is polled to determine whether the FastAPI container is healthy.
  • container_port: Specifies port 8000 as the listening port for our API container.
  • network_configuration: Places the service in our custom VPC's public subnets and attaches the security group that controls inbound and outbound traffic.
  • scaling_target: Configures auto-scaling on average CPU utilization with a target of 70%, keeping the service between 1 and 3 running tasks.
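
The network_configuration block depends on aws_subnet.public and aws_security_group.alb_sg, which live in vpc.tf. The following is a minimal sketch of such a file, not the author's actual configuration: the CIDR ranges, the subnet count, the resource names other than public and alb_sg, and the ingress ports are all assumptions to adjust for your environment.

```hcl
# vpc.tf (sketch). A small custom VPC with two public subnets.

data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16" # assumption
  enable_dns_support   = true
  enable_dns_hostnames = true
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

# Two public subnets in different Availability Zones.
resource "aws_subnet" "public" {
  count                   = 2
  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true
}

# Route all outbound traffic through the internet gateway.
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_route_table_association" "public" {
  count          = 2
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

resource "aws_security_group" "alb_sg" {
  name   = "receipt-api-sg" # hypothetical name
  vpc_id = aws_vpc.main.id

  # Assumption: clients reach the service over HTTPS.
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # Assumption: traffic to the container port stays inside the VPC.
  ingress {
    from_port   = 8000
    to_port     = 8000
    protocol    = "tcp"
    cidr_blocks = [aws_vpc.main.cidr_block]
  }

  # Allow all outbound traffic (image pulls, SageMaker calls).
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

Using count on aws_subnet.public is what makes the splat expression aws_subnet.public[*].id in ecs.tf return a list of subnet IDs.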

AgentUpdate Depth Analysis

Deploying a specialized fine-tuned model such as Gemma for receipt extraction via AWS ECS Express Mode reflects the ongoing shift toward modular, serverless-like architectures for AI agents. Rather than relying on monolithic LLM frameworks, agentic systems increasingly depend on specialized microservices that can be spun up, scaled, and torn down on demand. Compared to a full Kubernetes (EKS) setup, ECS Express Mode lowers operational complexity, in particular by removing the need to configure an Application Load Balancer by hand. For AI agent developers, wrapping parsing capabilities (such as receipt OCR and structured data extraction) in standard REST APIs managed via Terraform is a repeatable pattern: multi-agent pipelines can call these services as task-specific tools while the platform handles health checks and auto-scaling.
