Orchestrating dbt Workflows: The Duel of Apache Airflow and AWS Step Functions

Think of data pipeline orchestration as the backstage crew of a theater, ensuring every scene flows seamlessly into the next. In the data world, tools like Apache Airflow and AWS Step Functions are the unsung heroes that keep the show running smoothly, making sure the right data is available at the right time. Both are often used alongside dbt (data build tool), which has emerged as a powerful way to transform data inside a warehouse.

In this article, we will introduce dbt, Apache Airflow, and AWS Step Functions, and then delve into the pros and cons of using Airflow and Step Functions to orchestrate dbt data pipelines. Note that dbt comes in two flavors: the paid dbt Cloud and the free, open-source dbt-core. We are focusing on dbt-core, the free version of dbt.

dbt (Data Build Tool)

dbt-core is an open-source command-line tool that enables data analysts and engineers to transform data in their warehouses more effectively. Users write modular SQL queries, and dbt runs them against the data warehouse in the correct order based on their dependencies.

Key Features

- Modular SQL: transformations are written as SELECT statements that reference one another via ref()
- Dependency management: dbt infers a DAG from those references and runs models in the correct order
- Testing and documentation: schema tests and generated documentation live alongside the models
- Version control friendly: a dbt project is plain text, so it fits naturally into Git workflows
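
To make the dependency handling concrete, here is a minimal sketch of invoking dbt-core programmatically from Python. It assumes dbt-core 1.5 or later (which introduced the dbtRunner API) and a hypothetical model named sales; dbt resolves the model's ref() dependencies and builds them in order.

Python
 
# A minimal sketch, assuming dbt-core >= 1.5 and a dbt project
# in the current working directory.
from dbt.cli.main import dbtRunner, dbtRunnerResult

runner = dbtRunner()

# `invoke` accepts the same arguments as the dbt CLI.
# `+sales` selects the hypothetical `sales` model plus all of its
# upstream dependencies, which dbt builds in DAG order.
result: dbtRunnerResult = runner.invoke(["build", "--select", "+sales"])

if not result.success:
    raise RuntimeError(f"dbt build failed: {result.exception}")

The same selection syntax works on the command line (dbt build --select +sales), which is how the orchestrators below typically invoke dbt.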

Airflow vs. AWS Step Functions for dbt Orchestration

Apache Airflow

Apache Airflow is an open-source tool that helps create, schedule, and monitor workflows. It is used by data engineers and analysts to manage complex data pipelines.

Key Features

- Workflows as code: pipelines are defined as Python DAGs that can be reviewed, tested, and versioned
- Flexible scheduling, with backfill and catch-up support
- A large ecosystem of operators and provider packages (Bash, Slack, AWS, and many more)
- A web UI for monitoring DAG runs, inspecting logs, and retrying tasks

Example: DAG

Python
 
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.slack.operators.slack import SlackAPIPostOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    # Prefer a static start_date; dynamic values such as datetime.now()
    # can confuse the scheduler.
    'start_date': datetime(2024, 1, 1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('dbt_daily_job',
          default_args=default_args,
          description='A simple DAG to run dbt jobs',
          schedule_interval=timedelta(days=1),
          catchup=False)  # Avoid backfilling runs from the static start_date

# Build the `sales` model (dbt resolves and orders its dependencies).
dbt_run = BashOperator(
    task_id='dbt_run',
    bash_command='dbt build --select sales',
    dag=dag,
)

slack_notify = SlackAPIPostOperator(
    task_id='slack_notify',
    channel='#data-alerts',           # Replace with your Slack channel
    text='dbt_daily_job finished.',   # Replace with your notification text
    dag=dag,
)

# Notify Slack only after the dbt run completes.
dbt_run >> slack_notify

Pros

- Highly customizable and extensible: custom operators, hooks, and sensors can be written in Python
- Infrastructure flexibility: runs on-premises, in any cloud, or via managed offerings such as MWAA and Astronomer
- Open source, with a large community and a broad ecosystem of integrations
- Can be more cost-effective at scale

Cons

- Operational overhead: a self-managed deployment means running a scheduler, webserver, metadata database, and workers
- Steeper learning curve than a fully managed workflow service
- Scaling and upgrades are your responsibility unless you use a managed platform

AWS Step Functions

AWS Step Functions is a fully managed service from Amazon Web Services that makes it easier to orchestrate microservices, serverless applications, and complex workflows. It uses a state machine model to define and execute workflows, which can invoke various AWS services such as Lambda, ECS, and SageMaker.

Key Features

- Serverless and fully managed: no orchestration infrastructure to provision or patch
- Declarative workflows defined in Amazon States Language (JSON)
- Direct integrations with AWS services such as Lambda, ECS/Fargate, and SageMaker
- Built-in error handling: retries, catch paths, and timeouts are part of the state definition
- A visual console for designing workflows and inspecting executions

Example: State Machine CloudFormation Template (Step Functions)

YAML
 
AWSTemplateFormatVersion: '2010-09-09'
Description: State machine to run a dbt job

Resources:
  DbtStateMachine:
    Type: 'AWS::StepFunctions::StateMachine'
    Properties:
      StateMachineName: DbtStateMachine
      RoleArn: !Sub 'arn:aws:iam::${AWS::AccountId}:role/service-role/StepFunctions-ECSTaskRole'
      # DefinitionString expects the Amazon States Language definition
      # as a JSON string.
      DefinitionString:
        !Sub |
          {
            "Comment": "A Step Functions state machine that executes a dbt job using an ECS task.",
            "StartAt": "RunDbtJob",
            "States": {
              "RunDbtJob": {
                "Type": "Task",
                "Resource": "arn:aws:states:::ecs:runTask.sync",
                "Parameters": {
                  "Cluster": "arn:aws:ecs:${AWS::Region}:${AWS::AccountId}:cluster/MyECSCluster",
                  "TaskDefinition": "arn:aws:ecs:${AWS::Region}:${AWS::AccountId}:task-definition/MyDbtTaskDefinition",
                  "LaunchType": "FARGATE",
                  "NetworkConfiguration": {
                    "AwsvpcConfiguration": {
                      "Subnets": ["subnet-0193156582abfef1", "subnet-abcjkl0890456789"],
                      "AssignPublicIp": "ENABLED"
                    }
                  }
                },
                "End": true
              }
            }
          }

Outputs:
  StateMachineArn:
    Description: The ARN of the dbt state machine
    Value: !Ref DbtStateMachine

When using Amazon ECS with AWS Fargate to run dbt workflows, you can define the dbt command in the ECS task definition (MyDbtTaskDefinition above), but it is also common to create a Docker image that contains not only the dbt environment but also the specific dbt commands you wish to run.
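
Regardless of where the dbt command lives, something still has to start executions; EventBridge schedules are a common trigger, but a run can also be kicked off directly. Below is a minimal sketch using boto3's start_execution; the region, account ID, and input payload are illustrative, and in practice the ARN would come from the stack's StateMachineArn output.

Python
 
# A minimal sketch: start an execution of the DbtStateMachine defined above.
import json

import boto3

sfn = boto3.client("stepfunctions", region_name="us-east-1")

response = sfn.start_execution(
    # Illustrative ARN; read it from the CloudFormation stack's
    # StateMachineArn output in practice.
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine/DbtStateMachine",
    # Optional JSON input, available to the states in the execution context.
    input=json.dumps({"trigger": "manual", "job": "dbt_daily_job"}),
)

print("Started execution:", response["executionArn"])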

Pros

- Serverless with minimal maintenance: no orchestration infrastructure to manage
- Deep integration with the AWS ecosystem (Lambda, ECS/Fargate, EventBridge, and more)
- Built-in retries, error handling, and execution history
- Economical at smaller scales with pay-per-use pricing

Cons

- Tied to AWS: not an option for multi-cloud or on-premises deployments
- Less customizable than code-first orchestrators; workflows are constrained to Amazon States Language
- Per-state-transition pricing can become expensive for large workflows

Summary

Choosing the right tool for orchestrating dbt workflows comes down to assessing specific features and how they align with a team's needs. The main attributes that inform this decision include customization, cloud alignment, infrastructure flexibility, managed services, and cost considerations.

Customization and Extensibility

Apache Airflow is highly customizable and extensible, allowing teams to create tailored operators and workflows for complex requirements.

Integration With AWS

AWS Step Functions is the clear winner for teams operating solely within AWS, offering deep integration with the broader AWS ecosystem.

Infrastructure Flexibility

Apache Airflow supports a wide array of environments, making it ideal for multi-cloud or on-premises deployments.

Managed Services

Here, it’s a tie. For managed services, teams can opt for Amazon Managed Workflows for Apache Airflow (MWAA) for an AWS-centric approach, or a vendor like Astronomer for hosting Airflow in other environments. There are also platforms such as Dagster that offer similar features to Airflow and can likewise be managed. This category is highly competitive, and the decision will come down to the required level of integration and vendor preference.

Cost at Scale

Apache Airflow may prove more cost-effective at scale, given its open-source nature and the potential for optimized cloud or on-premises deployment. AWS Step Functions may be more economical at smaller scales or for teams with existing AWS infrastructure.

Conclusion

The choice between Apache Airflow and AWS Step Functions for orchestrating dbt workflows is nuanced.

For operations deeply rooted in AWS with a preference for serverless execution and minimal maintenance, AWS Step Functions is the recommended choice. 

For those requiring robust customizability, diverse infrastructure support, or cost-effective scalability, Apache Airflow—whether self-managed or via a platform like Astronomer or MWAA (AWS-managed)—emerges as the optimal solution.
