Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
125 views
in Technique[技术] by (71.8m points)

Why is airflow not a data streaming solution?

Learning about Airflow and want to understand why it is not a data streaming solution.

https://airflow.apache.org/docs/apache-airflow/1.10.1/#beyond-the-horizon

Airflow is not a data streaming solution. Tasks do not move data from one to the other (though tasks can exchange metadata!)

Not sure what the it means task do not move data from one to another. So I can't have tasks like Extract Data >> Calculate A >> Calculate B using data from prev step >> another step that depends on prev task's result >> ... ?

Furthermore read the following a few times and still doesn't click with me. What's an example of a static vs dynamic workflow?

Workflows are expected to be mostly static or slowly changing. You can think of the structure of the tasks in your workflow as slightly more dynamic than a database structure would be. Airflow workflows are expected to look similar from a run to the next, this allows for clarity around unit of work and continuity.

Can someone help provide an alternative explanation or example that can walk through why Airflow is not a good data streaming solution?

question from:https://stackoverflow.com/questions/65895690/why-is-airflow-not-a-data-streaming-solution

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Not sure what the it means task do not move data from one to another. So I can't have tasks like Extract Data >> Calculate A >> Calculate B using data from prev step >> another step that depends on prev task's result >> ... ?

You can use XCom (cross-communication) to use data from previous step BUT bear in mind that XCom is stored in the metadata DB meaning that you should not use XCom for holding data that is being processed. You should use XCom for holding a reference to data (e.g.: S3/HDFS/GCS path, BigQuery table, etc).

So instead of passing data between tasks, in Airflow you pass reference to data between tasks using XCom. Airflow is an orchestrator, heavy tasks should be off-loaded to other data processing cluster (can be Spark cluster, BigQuery, etc).


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...