What Is Airflow?

Airflow is a platform to programmatically author, schedule, and monitor workflows. Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed. When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative.
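As a taste of what workflows-as-code look like, here is a minimal sketch of a DAG file (the dag_id, task names, and commands are hypothetical; assumes Airflow 2.x is installed):

```python
# Minimal sketch of an Airflow DAG file (hypothetical names, Airflow 2.x).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_pipeline",         # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")

    # ">>" declares the dependency: extract must succeed before load runs.
    extract >> load
```

Placing this file in Airflow's DAGs folder is enough for the scheduler to pick it up and run it once per day.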

Airflow Architecture

Airflow is composed of three main components:

  • The metadata database
  • The scheduler
  • The workers

Metadata Database

The metadata database stores credentials, connections, task history, and configuration state. It is shared by all components of an Airflow deployment and is typically a SQL database such as PostgreSQL or MySQL, which Airflow accesses through the SQLAlchemy library.


Scheduler

The scheduler is a process that uses DAG definitions, together with the state of tasks in the metadata database, to decide which tasks need to be run and in what order of priority. It then triggers the executor to run those tasks.


Workers

Workers are the processes that actually execute the operations defined in each DAG. Workers pick up the tasks that the scheduler has queued for them, and when a worker finishes a task, the task's final state is recorded in the metadata database.

Airflow Concepts


DAGs

A Directed Acyclic Graph (DAG) is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. A DAG is defined in a Python script, which represents the DAG's structure (tasks and their dependencies) as code.
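The "acyclic" property is what guarantees that the tasks can always be put into a valid execution order. As a toy illustration of the underlying idea (plain Python, not Airflow's internal implementation):

```python
# Toy illustration of why an acyclic dependency graph can always be
# executed in order (not Airflow's internal code, just the concept).
def topological_order(dependencies):
    """dependencies maps each task to the set of tasks it depends on."""
    order = []
    remaining = {task: set(deps) for task, deps in dependencies.items()}
    while remaining:
        # Tasks whose dependencies are all satisfied can run now.
        ready = [t for t, deps in remaining.items() if not deps]
        if not ready:
            raise ValueError("cycle detected: not a valid DAG")
        for task in sorted(ready):
            order.append(task)
            del remaining[task]
        for deps in remaining.values():
            deps.difference_update(ready)
    return order

# extract -> transform / quality_check -> load
print(topological_order({
    "extract": set(),
    "transform": {"extract"},
    "quality_check": {"extract"},
    "load": {"transform", "quality_check"},
}))
# → ['extract', 'quality_check', 'transform', 'load']
```

A cycle (task A depends on B, B depends on A) would make such an ordering impossible, which is why Airflow rejects cyclic graphs.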


Operators

Operators define the work to be done. Airflow provides many types of operators, such as BashOperator for executing a bash command, PythonOperator for executing a Python callable, and more. An operator describes a single task in a workflow.
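As an illustrative sketch (the dag_id, task names, and callable are hypothetical; assumes Airflow 2.x), the two operator types just mentioned might be declared like this:

```python
# Sketch of two common operator types (hypothetical names, Airflow 2.x).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def _transform():
    print("transforming data")  # stand-in for real transformation logic

with DAG(dag_id="operator_examples", start_date=datetime(2023, 1, 1),
         schedule_interval=None) as dag:
    # BashOperator runs a shell command.
    run_script = BashOperator(task_id="run_script", bash_command="echo hello")
    # PythonOperator calls a Python function.
    transform = PythonOperator(task_id="transform", python_callable=_transform)
```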


Tasks

A task is a parameterized instance of an operator, that is, an operator invoked with specific arguments. For example, BashOperator is a type of operator, while a BashOperator whose task_id is set to print_date is a task.
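The print_date example above might look like the following sketch (assumes Airflow 2.x; in a real DAG file the operator would be created inside a DAG context):

```python
from airflow.operators.bash import BashOperator

# BashOperator is the operator (a template for work); this parameterized
# instance, with task_id "print_date", is a task.
print_date = BashOperator(task_id="print_date", bash_command="date")
```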

Task Instances

A task instance represents a specific run of a task and is identified by the combination of a DAG, a task, and a point in time. Task instances also have an indicative state, which could be one of queued, running, success, or failed, among others.
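As a toy illustration of that identity (plain Python, not Airflow's actual data model), a task instance can be thought of as a (DAG, task, point-in-time) key with an associated state:

```python
# Toy model (not Airflow's actual classes): a task instance is identified
# by a DAG, a task, and a point in time, and carries a state.
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class TaskInstanceKey:
    dag_id: str
    task_id: str
    logical_date: datetime

# Hypothetical state store mapping each task instance to its current state.
states = {}

key = TaskInstanceKey("example_pipeline", "extract", datetime(2023, 1, 1))
states[key] = "queued"    # scheduler queues the instance
states[key] = "running"   # a worker picks it up
states[key] = "success"   # final state recorded after the run
```

The same task on a different date is a different key, hence a different task instance.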


Scheduling

The process of allocating resources to different tasks over time is known as scheduling. The scheduler is responsible for determining when to run each task instance in the system.
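A much-simplified sketch of the core idea (Airflow's real scheduler is considerably more involved): with an interval-based schedule, a run exists for each interval and becomes due once that interval has fully elapsed. The due_runs helper below is hypothetical:

```python
# Simplified sketch of interval-based scheduling (hypothetical helper,
# not Airflow's scheduler): a run becomes due once its interval has
# fully elapsed.
from datetime import datetime, timedelta

def due_runs(start_date, interval, now):
    """Logical dates of all runs whose interval has completed by `now`."""
    runs = []
    logical_date = start_date
    while logical_date + interval <= now:
        runs.append(logical_date)
        logical_date += interval
    return runs

runs = due_runs(datetime(2023, 1, 1), timedelta(days=1), datetime(2023, 1, 4))
print(runs)  # runs for Jan 1, Jan 2, and Jan 3 are due by Jan 4
```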


Executors

The executor is the mechanism by which task instances get run. Airflow supports different executors, such as the SequentialExecutor, the LocalExecutor, and the CeleryExecutor.
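The executor is selected in Airflow's configuration, for example in airflow.cfg (LocalExecutor is shown here purely as an illustration):

```ini
[core]
# One of: SequentialExecutor (default with SQLite), LocalExecutor,
# CeleryExecutor, among others.
executor = LocalExecutor
```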

Airflow Features

  • Dynamic: Airflow pipelines are defined as Python code, which allows pipelines to be generated dynamically; for example, a loop can instantiate many similar tasks or DAGs.
  • Extensible: Easily define your own operators, executors, and extend the library so that it fits the level of abstraction that suits your environment.
  • Elegant: Airflow pipelines are lean and explicit. Parameterizing your scripts is built into the core of Airflow using the powerful Jinja templating engine.
  • Scalable: Airflow has a modular architecture and uses a message queue to orchestrate an arbitrary number of workers. Airflow is ready to scale to infinity.
  • Easy to use: Airflow has a rich command line interface that allows users to perform complex operations on DAGs quickly.
  • Integrations: Airflow has many integrations with external systems, such as Hive, Presto, HDFS, and more. Airflow can be extended to integrate with more systems.
  • UI: Airflow provides a beautiful user interface for visualizing, running, monitoring, and troubleshooting complex workflows.
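For instance, the Jinja templating mentioned above lets a command reference runtime values such as {{ ds }}, the run's logical date. A sketch assuming Airflow 2.x (the dag_id and task_id are hypothetical):

```python
# Sketch of Jinja templating in a task (hypothetical names, Airflow 2.x).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="templated_example", start_date=datetime(2023, 1, 1),
         schedule_interval="@daily") as dag:
    templated = BashOperator(
        task_id="print_logical_date",
        # "{{ ds }}" is rendered at runtime to the run's logical
        # date in YYYY-MM-DD form.
        bash_command="echo 'processing data for {{ ds }}'",
    )
```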

Airflow Use Cases

  • Data Engineering: Airflow is a great tool for data engineering tasks, such as data migration, data partitioning, data transformation, and data quality validation.
  • Machine Learning: Airflow can be used to orchestrate machine learning workflows, such as model training, model evaluation, and model serving.
  • Monitoring: Airflow can be used to monitor your data pipelines, set up alerts, and integrate with third-party systems.
  • ETL: Airflow can be used to orchestrate ETL workflows, such as ingesting data from multiple sources, transforming the data, and loading it into a data warehouse.

Airflow vs. Other Tools

  • Airflow vs. Luigi: Both Airflow and Luigi are workflow management systems, but Airflow is more mature and has a larger community. Airflow has a more powerful UI and more integrations with external systems.
  • Airflow vs. Oozie: Oozie is a workflow scheduler built around the Hadoop ecosystem, while Airflow is more general-purpose, has a more powerful UI, and has a more active community.
  • Airflow vs. Azkaban: Azkaban is a batch workflow scheduler created at LinkedIn; compared with it, Airflow offers a more modern architecture, more integrations, and a more active community.


Conclusion

Airflow is a powerful and flexible workflow management system that lets you author, schedule, and monitor workflows as code. It is well suited to data engineering, machine learning, monitoring, and ETL workloads, and it combines ease of use with scalability and a rich feature set. If you are looking for a workflow management system, Airflow is a great choice.

