Orchestrating dbt Workflows: The Duel of Apache Airflow and AWS Step Functions
Think of data pipeline orchestration as the backstage crew of a theater, ensuring every scene flows seamlessly into the next. In the data world, tools like Apache Airflow and AWS Step Functions are the unsung heroes that keep the show running smoothly and make sure the right data is available at the right time. Both are often used alongside dbt (data build tool), which has emerged as a powerful tool for transforming data in a warehouse.
In this article, we introduce dbt, Apache Airflow, and AWS Step Functions, and then weigh the pros and cons of using Airflow and Step Functions to orchestrate data pipelines that involve dbt. Note that dbt comes in a paid version, dbt Cloud, and a free open-source version, dbt-core; we focus on dbt-core.
dbt (Data Build Tool)
dbt-core is an open-source command-line tool that enables data analysts and engineers to transform data in their warehouses more effectively. It allows users to write modular SQL queries, which it then runs on top of the data warehouse in the appropriate order with respect to their dependencies.
Key Features
- Version control: It integrates with Git to help track changes, collaborate, and deploy code.
- Documentation: Autogenerated documentation and a searchable data catalog are created based on the dbt project.
- Modularity: Reusable SQL models can be referenced and combined to build complex transformations.
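To make the idea of running models "in the appropriate order with respect to their dependencies" concrete, here is a minimal Python sketch of dependency ordering. It is not dbt's actual implementation; the model names and the dependency map are hypothetical, and the standard-library graphlib module (Python 3.9+) stands in for whatever graph logic dbt uses internally.

```python
from graphlib import TopologicalSorter

# Hypothetical models, each mapped to the upstream models it references.
model_dependencies = {
    "stg_orders": set(),
    "stg_customers": set(),
    "sales": {"stg_orders", "stg_customers"},
    "sales_summary": {"sales"},
}


def execution_order(dependencies):
    """Return model names ordered so that every model comes after its upstream models."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(model_dependencies)
print(order)
```

The two staging models may appear in either order, but "sales" always follows both of them and "sales_summary" always comes last.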
Airflow vs. AWS Step Functions for dbt Orchestration
Apache Airflow
Apache Airflow is an open-source tool that helps to create, schedule, and monitor workflows. It is used by data engineers and analysts to manage complex data pipelines.
Key Features
- Extensibility: Custom operators, executors, and hooks can be written to extend Airflow’s functionality.
- Scalability: Offers dynamic pipeline generation and can scale to handle multiple data pipeline workflows.
Example: DAG
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
# Requires the apache-airflow-providers-slack package
from airflow.providers.slack.operators.slack import SlackAPIPostOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    # Fixed placeholder date; a moving value such as datetime.now() is discouraged
    'start_date': datetime(2024, 1, 1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'dbt_daily_job',
    default_args=default_args,
    description='A simple DAG to run dbt jobs',
    schedule_interval=timedelta(days=1),
    catchup=False,  # do not backfill runs between start_date and today
)

# Build and test the "sales" model with dbt
dbt_run = BashOperator(
    task_id='dbt_run',
    bash_command='dbt build --select sales',
    dag=dag,
)

slack_notify = SlackAPIPostOperator(
    task_id='slack_notify',
    dag=dag,
    # Replace with your actual Slack notification code
)

# Notify Slack only after the dbt task has succeeded
dbt_run >> slack_notify
Pros
- Flexibility: Apache Airflow offers unparalleled flexibility with the ability to define custom operators and is not limited to AWS resources.
- Community support: A vibrant open-source community actively contributes plugins and operators that provide extended functionalities.
- Complex workflows: Better suited to complex task dependencies and can manage task orchestration across various systems.
Cons
- Operational overhead: Requires management of underlying infrastructure unless managed services like Astronomer or Google Cloud Composer are used.
- Learning curve: The rich feature set comes with a complexity that may present a steeper learning curve for some users.
AWS Step Functions
AWS Step Functions is a fully managed service provided by Amazon Web Services that makes it easier to orchestrate microservices, serverless applications, and complex workflows. It uses a state machine model to define and execute workflows, which can consist of various AWS services like Lambda, ECS, SageMaker, and more.
Key Features
- Serverless operation: No need to manage infrastructure as AWS provides a managed service.
- Integration with AWS Services: Seamless connection to AWS services is supported for complex orchestration.
Example: State Machine CloudFormation Template (Step Functions)
AWSTemplateFormatVersion: '2010-09-09'
Description: State Machine to run a dbt job
Resources:
  DbtStateMachine:
    Type: 'AWS::StepFunctions::StateMachine'
    Properties:
      StateMachineName: DbtStateMachine
      RoleArn: !Sub 'arn:aws:iam::${AWS::AccountId}:role/service-role/StepFunctions-ECSTaskRole'
      # The definition is written in Amazon States Language (JSON)
      DefinitionString: !Sub |
        {
          "Comment": "A Step Functions state machine that executes a dbt job using an ECS task.",
          "StartAt": "RunDbtJob",
          "States": {
            "RunDbtJob": {
              "Type": "Task",
              "Resource": "arn:aws:states:::ecs:runTask.sync",
              "Parameters": {
                "Cluster": "arn:aws:ecs:${AWS::Region}:${AWS::AccountId}:cluster/MyECSCluster",
                "TaskDefinition": "arn:aws:ecs:${AWS::Region}:${AWS::AccountId}:task-definition/MyDbtTaskDefinition",
                "LaunchType": "FARGATE",
                "NetworkConfiguration": {
                  "AwsvpcConfiguration": {
                    "Subnets": [
                      "subnet-0193156582abfef1",
                      "subnet-abcjkl0890456789"
                    ],
                    "AssignPublicIp": "ENABLED"
                  }
                }
              },
              "End": true
            }
          }
        }
Outputs:
  StateMachineArn:
    Description: The ARN of the dbt state machine
    Value: !Ref DbtStateMachine
When running dbt workflows on Amazon ECS with AWS Fargate, you can define the dbt command in the task definition (MyDbtTaskDefinition in the template above), but it is also common to build a Docker image that contains both the dbt environment and the specific dbt commands you wish to run.
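As one way to picture the image-based approach, the following Python sketch shows what an entrypoint script baked into such an image might look like. It is an assumption-laden illustration rather than a prescribed setup: the environment variable names DBT_COMMAND, DBT_SELECT, and DBT_TARGET are hypothetical, and only the standard dbt CLI options --select and --target are used.

```python
import shlex
import subprocess


def build_dbt_command(env):
    """Build the dbt CLI argument list from environment-style settings."""
    # DBT_COMMAND, DBT_SELECT and DBT_TARGET are hypothetical variable names.
    command = ["dbt", env.get("DBT_COMMAND", "build")]
    select = env.get("DBT_SELECT")
    if select:
        # Allow several space-separated selectors, e.g. "sales tag:daily"
        command += ["--select", *shlex.split(select)]
    target = env.get("DBT_TARGET")
    if target:
        command += ["--target", target]
    return command


def main(env):
    """Run dbt and return its exit code so the container can exit with it."""
    command = build_dbt_command(env)
    print("Running:", shlex.join(command))
    return subprocess.run(command).returncode


# In the image's entrypoint script one would finish with:
#   import os, sys
#   sys.exit(main(os.environ))
```

Passing the selector through the task's environment means the same image can serve several state machine steps; exiting with dbt's return code is intended to let the orchestrator see a failed run as a failed task.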
Pros
- Fully managed service: AWS manages the scaling and operation under the hood, leading to reduced operational burden.
- AWS integration: Natural fit for AWS-centric environments, allowing easy integration of various AWS services.
- Reliability: Step Functions provide a high level of reliability and support, backed by AWS SLA.
Cons
- Cost: Pricing may be higher for high-volume workflows than for a self-hosted or cloud-provider-managed Airflow instance, because Step Functions (Standard workflows) charges based on the number of state transitions.
- Locked-in with AWS: Tightly coupled with AWS services, which can be a downside if you're aiming for a cloud-agnostic architecture.
- Complexity in handling large workflows: While capable, it can become difficult to manage larger, more complex workflows compared to using Airflow's DAGs. There are limitations on the number of parallel executions of a State Machine.
- Learning curve: The service also presents a learning curve with specific paradigms, such as the Amazon States Language.
- Scheduling: AWS Step Functions has no built-in scheduler and relies on other AWS services, such as Amazon EventBridge, to start executions on a schedule.
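To illustrate the scheduling point, here is a hedged Python sketch that uses the EventBridge put_rule and put_targets calls (as exposed by boto3) to start a state machine on a schedule. The function name, the target Id, and all ARNs are hypothetical; the client is passed in as an argument so the function stays independent of how credentials and regions are configured.

```python
import json


def schedule_state_machine(events_client, rule_name, schedule_expression,
                           state_machine_arn, invoke_role_arn, execution_input=None):
    """Create or update an EventBridge rule that starts the state machine on a schedule."""
    events_client.put_rule(
        Name=rule_name,
        # For example "rate(1 day)" or "cron(0 6 * * ? *)"
        ScheduleExpression=schedule_expression,
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{
            "Id": "dbt-state-machine",  # hypothetical target identifier
            "Arn": state_machine_arn,
            # Role that EventBridge assumes; it needs states:StartExecution
            "RoleArn": invoke_role_arn,
            "Input": json.dumps(execution_input or {}),
        }],
    )


# Usage, assuming boto3 is installed and the ARNs exist:
#   import boto3
#   schedule_state_machine(boto3.client("events"), "dbt-daily", "rate(1 day)",
#                          "<state machine ARN>", "<EventBridge role ARN>")
```

The same rule could equally be declared in the CloudFormation template; the imperative version is shown only because it makes the two required pieces, a rule and a target, easy to see.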
Summary
Choosing the right tool for orchestrating dbt workflows comes down to assessing specific features and how they align with a team's needs. The main attributes that inform this decision include customization, cloud alignment, infrastructure flexibility, managed services, and cost considerations.
Customization and Extensibility
Apache Airflow is highly customizable and extends well, allowing teams to create tailored operators and workflows for complex requirements.
Integration With AWS
AWS Step Functions is the clear winner for teams operating solely within AWS, offering deep integration with the broader AWS ecosystem.
Infrastructure Flexibility
Apache Airflow supports a wide array of environments, making it ideal for multi-cloud or on-premises deployments.
Managed Services
Here, it’s a tie. For managed services, teams can opt for Amazon Managed Workflows for Apache Airflow (MWAA) for an AWS-centric approach or a vendor like Astronomer for hosting Airflow in different environments. There are also platforms like Dagster that offer similar features to Airflow and can be managed as well. This category is highly competitive, and the choice comes down to the level of integration required and vendor preference.
Cost at Scale
Apache Airflow may prove more cost-effective at scale, given its open-source nature and the potential for optimized cloud or on-premises deployment. AWS Step Functions may be more economical at smaller scales or for teams with existing AWS infrastructure.
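A rough back-of-the-envelope calculation can help locate the crossover point. The Python sketch below estimates the monthly state-transition charge only; the function and its parameters are hypothetical, the price is passed in because it varies by region and over time, and the figure used in the example is illustrative rather than a quoted AWS price. Compute costs (for example Fargate) and Airflow hosting costs are deliberately left out.

```python
def monthly_transition_cost(runs_per_day, transitions_per_run,
                            price_per_1000_transitions,
                            days_per_month=30, free_transitions=0):
    """Estimate the monthly charge for state transitions alone."""
    total = runs_per_day * transitions_per_run * days_per_month
    # Subtract any free allowance, never going below zero
    billable = max(total - free_transitions, 0)
    return billable / 1000 * price_per_1000_transitions


# Illustrative inputs only: hourly runs, 10 transitions each,
# and an assumed price of 0.025 per 1,000 transitions.
estimate = monthly_transition_cost(24, 10, 0.025)
print(round(estimate, 2))
```

Comparing this number against the fixed monthly cost of an Airflow deployment, whichever hosting option applies, gives a first indication of which side of the trade-off a workload sits on.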
Conclusion
The choice between Apache Airflow and AWS Step Functions for orchestrating dbt workflows is nuanced.
For operations deeply rooted in AWS with a preference for serverless execution and minimal maintenance, AWS Step Functions is the recommended choice.
For those requiring robust customizability, diverse infrastructure support, or cost-effective scalability, Apache Airflow, whether self-managed or run via a platform like Astronomer or MWAA (AWS-managed), emerges as the optimal solution.