mumoshu/kube-airflow: A docker image and kubernetes config files to run Airflow ...

原作者: [db:作者] 来自: 网络收藏邀请

开源软件名称（OpenSource Name）：

mumoshu/kube-airflow

开源软件地址(OpenSource Url)：

https://github.com/mumoshu/kube-airflow

开源编程语言(OpenSource Language)：

Python 44.2%

开源软件介绍(OpenSource Introduction)：

kube-airflow (Celery Executor)

kube-airflow provides a set of tools to run Airflow in a Kubernetes cluster. This is useful when you'd want:

Easy high availability of the Airflow scheduler
- Running multiple schedulers for high availability isn't safe so it isn't the way to go in the first place. Someone in the internet tried to implement a wrapper to implement leader election on top of the scheduler so that only one scheduler executes the tasks at a time. It is possible but can't we just utilize a kind of cluster manager here? This is where Kubernetes comes into play.
Easy parallelism of task executions
- The common way to scale out workers in Airflow is to utilize Celery. However, managing a H/A backend database and Celery workers just for parallelising task executions sounds like a hassle. This is where Kubernetes comes into play, again. If you already had a K8S cluster, just let K8S manage them for you.
- If you have ever considered to avoid Celery for task parallelism, yes, K8S can still help you for a while. Just keep using LocalExecutor instead of CeleryExecutor and delegate actual tasks to Kubernetes by calling e.g. kubectl run --restart=Never ... from your tasks. It will work until the concurrent kubectl run executions(up to the concurrency implied by scheduler's max_threads and LocalExecutor's parallelism. See this SO question for gotchas) consumes all the resources a single airflow-scheduler pod provides, which will be after the pretty long time.

This repository contains:

Dockerfile(.template) of airflow for Docker images published to the public Docker Hub Registry.
airflow.all.yaml for manual creating Kubernetes services and deployments to run Airflow on Kubernetes
Helm Chart in ./airflow for deployments using Helm

Informations

Fork of mumoshu/kube-airflow
Highly inspired by the great work puckel/docker-airflow
Based on Debian Stretch official Image debian:stretch and uses the official Postgres as backend and RabbitMQ as queue
Following the Airflow release from Python Package Index

Manual Installation

Create all the deployments and services to run Airflow on Kubernetes:

kubectl create -f airflow.all.yaml

It will create deployments for:

postgres
rabbitmq
airflow-webserver
airflow-scheduler
airflow-flower
airflow-worker

and services for:

postgres
rabbitmq
airflow-webserver
airflow-flower

Helm Deployment (recommended)

Ensure your helm installation is done, you may need to have TILLER_NAMESPACE set as environment variable.

Deploy to Kubernetes using:

make helm-install NAMESPACE=yournamespace HELM_VALUES=/path/to/you/values.yaml

Helm ingresses

The Chart provides ingress configuration to allow customization the installation by adapting the config.yaml depending on your setup.

Prefix

This Helm chart allows using a "prefix" string that will be added to every Kubernetes names. That allows instantiating several, independent Airflow clusters in the same namespace.

Note:

Do NOT use characters such as " (double quote), ' (simple quote), / (slash) or \ (backslash)
in your passwords and prefix and keep it as small as possible.

DAGs deployment: embedded DAGs or git-sync

This chart provide basically two way of deploying DAGs in your Airflow installation:

embedded DAGs
Git-Sync

This helm chart provide support for Persistant Storage but not for sidecar git-sync pod. If you are willing to contribute, do not hesitate to do a Pull Request !

Using embedded Git-Sync

Git-sync is the easiest way to automatically update your DAGs. It simply checks periodically (by default every minute) a Git project on a given branch and check this new version out when available. Scheduler and worker see changes almost real-time. There is no need to other tool and complex rolling-update procedure.

While it is extremely cool to see its DAG appears on Airflow 60s after merge on this project, you should be aware of some limitations Airflow has with dynamic DAG updates:

If the scheduler reloads a dag in the middle of a dagrun then the dagrun will actually start
using the new version of the dag in the middle of execution.

This is a known issue with airflow and it means it's unsafe in general to use a git-sync like solution with airflow without:

using explicit locking, ie never pull down a new dag if a dagrun is in progress
make dags immutable, never modify your dag always make a new one

Also keep in mind using git-sync may not be scalable at all in production if you have lot of DAGs. The best way to deploy you DAG is to build a new docker image containing all the DAG and their dependencies. To do so, fork this project

Airflow.cfg as ConfigMap

By default, we use the configuration file airflow.cfg hardcoded in the docker image. This file uses a custom templating system to apply some environmnet variable and feed the airflow processes with (basically it is just some sed).

If you want to use your own airflow.cfg file without having to rebuild a complete docker image, for example when testing new settings, there is a way to define this file in a Kubernetes configuration map:

you need to define your own Value file you will feed to helm with helm install -f myvalue.yaml
you need to enable init the node airflow.airflow_cfg.enable: true
you need to store the content of your airflow.cfg in the node airflow.airflow_cfg.data You can see at airflow/myvalue-with-airflowcfg-configmap.yaml for an example on how to set it in your config.yaml file
note it is important to keep the custom templating in your airflow.cfg (ex: {{ POSTGRES_CREDS }}) or at least keep it aligned with the configuration applyied in your Kubernetes Cluster.

Worker Statefulset

As you can see, Celery workers uses StatefulSet instead of deployment. It is used to freeze their DNS using a Kubernetes Headless Service, and allow the webserver to requests the logs from each workers individually. This requires to expose a port (8793) and ensure the pod DNS is accessible to the web server pod, which is why StatefulSet is for.

Embedded DAGs

If you want more control on the way you deploy your DAGs, you can use embedded DAGs, where DAGs are burned inside the Docker container deployed as Scheduler and Workers.

Be aware this requirement more heavy tooling than using git-sync, especially if you use CI/CD:

your CI/CD should be able to build a new docker image each time your DAGs are updated.
your CI/CD should be able to control the deployment of this new image in your kubernetes cluster

Example of procedure:

Fork this project
Place your DAG inside the dags folder of this project, update requirements-dags.txt to install new dependencies if needed (see bellow)
Add build script connected to your CI that will build the new docker image
Deploy on your Kubernetes cluster

You can avoid forking this project by:

keep a git-project dedicated to storing only your DAGs + dedicated requirements.txt
you can gate any change to DAGs in your CI (unittest, pip install -r requirements-dags.txt,.. )

have your CI/CD makes a new docker image after each successful merge using

DAG_PATH=$PWD
cd /path/to/kube-aiflow
make ENBEDDED_DAGS_LOCATION=$DAG_PATH

trigger the deployment on this new image on your Kubernetes infrastructure

Python dependencies

If you want to add specific python dependencies to use in your DAGs, you simply declare them inside the requirements/dags.txt file. They will be automatically installed inside the container during build, so you can directly use these library in your DAGs.

To use another file, call:

make REQUIREMENTS_TXT_LOCATION=/path/to/you/dags/requirements.txt

Please note this requires you set up the same tooling environment in your CI/CD that when using Embedded DAGs.

Helm configuration customization

Helm allow to overload the configuration to adapt to your environment. You probably want to specify your own ingress configuration for instance.

Build Docker image

git clone this repository and then just run:

make build

Run with minikube

You can browse the Airflow dashboard via running:

minikube start
make browse-web

the Flower dashboard via running:

make browse-flower

If you want to use Ad hoc query, make sure you've configured connections: Go to Admin -> Connections and Edit "mysql_default" set this values (equivalent to values in config/airflow.cfg) :

Host : mysql
Schema : airflow
Login : airflow
Password : airflow

Check Airflow Documentation

Run the test "tutorial"

    kubectl exec web-<id> --namespace airflow-dev airflow backfill tutorial -s 2015-05-01 -e 2015-06-01

Scale the number of workers

For now, update the value for the replicas field of the deployment you want to scale and then:

    make apply

Wanna help?

Fork, improve and PR. ;-)

鲜花

握手

雷人

路过

鸡蛋

该文章已有0人参与评论

请发表评论

全部评论

专题导读

More+

10-27 六六分期app的软件客服如何联系？(六六分期

11-06 可心卡盟:win10系统火狐flash插件崩溃怎么

11-06 亲亲特价:怎么删除回收站图标

11-06 济南大学虚拟社区:鲁大师节能降温的具体办

11-06 xlueops.exe:无线网络安装向导

11-06 女斗合众国:win7系统cf与主机连接不稳定怎

11-06 0xc000022-[cf烟雾头]cf怎么调烟雾头

11-06 qizideyouhuo:应用程序无法正常启动0xc0000

11-06 ipz-185:win7系统vcf文件怎么打开

11-06 傻哥蹦迪:win10系统s4怎么打开usb调试

11-06 八神浩树gtaste:回收站清空了怎么恢复

11-06 妖尾之黑色守护:win10系统电脑没有1440x900

11-06 校园至尊魔王小说:win7系统浏览网页时字体

11-06 女斗合众国:win10系统访问共享文件夹提示请

11-06 tokyo hot n0654:恢复win7系统默认字体一招

11-06 雨酷仙境:设置win7系统转移临时文件夹腾出

11-06 阿穆纳伊之杖:win7系统开始菜单在右边还原

11-06 tunespotting:win10系统火狐flash插件总是

11-06 甘尔葛分析师：计谋网站seo关键词暴涨有什

11-06 蔡贵霖: 计谋网站seo关键词暴涨有什么秘密

11-06 博益网首页:ao3网页版进入不了解决方法

11-06 漏斗子专栏: 网站数据分析小白易懂精华篇

11-06 见证双虹怎么做:win7系统开启telnet命令的

11-06 颾狐蝶蜋:系统资源不足无法完成请求的服务

11-06 国光中学校歌:提交网站到alexa查询详细步骤

11-06 西安有情天:静态网页和动态网页的区别

11-06 红木雅尚斋:外部链接构造对网站的好处

11-06 前官礼遇：防止域名劫持–增强域安全性的10

11-06 密传二转答案: 中文分词算法有哪些

11-06 金泉家园邮编:百度快照劫持的表现及应对方

k0sproject/k0s: k0s - The Zero Friction Kubernetes by Team Lens发布时间：2022-07-07

ContainerSolutions/kubernetes-examples: Minimal self-contained examples of stand ...发布时间：2022-07-07

剪的笔顺,诠释剪的笔画,认识剪的部首

1 六六分期app的软件客服如何联系？(六六分期

六六分期app的软件客服如何联系？不知道吗？加qq群【895510560】即可！标题：六六分期

阅读：18344|2023-10-27

2 可心卡盟:win10系统火狐flash插件崩溃怎么

今天小编告诉大家如何处理win10系统火狐flash插件总是崩溃的问题，可能很多用户都不知

阅读：9706|2022-11-06

3 亲亲特价:怎么删除回收站图标

今天小编告诉大家如何对win10系统删除桌面回收站图标进行设置，可能很多用户都不知道

阅读：8196|2022-11-06

4 济南大学虚拟社区:鲁大师节能降温的具体办

今天小编告诉大家如何对win10系统电脑设置节能降温的设置方法，想必大家都遇到过需要

阅读：8563|2022-11-06

5 xlueops.exe:无线网络安装向导

我们在使用xp系统的过程中,经常需要对xp系统无线网络安装向导设置进行设置，可能很多

阅读：8475|2022-11-06

6 女斗合众国:win7系统cf与主机连接不稳定怎

今天小编告诉大家如何处理win7系统玩cf老是与主机连接不稳定的问题，可能很多用户都不

阅读：9418|2022-11-06

7 0xc000022-[cf烟雾头]cf怎么调烟雾头

电脑对日常生活的重要性小编就不多说了，可是一旦碰到win7系统设置cf烟雾头的问题，很

阅读：8447|2022-11-06

8 qizideyouhuo:应用程序无法正常启动0xc0000

我们在日常使用电脑的时候，有的小伙伴们可能在打开应用的时候会遇见提示应用程序无法

阅读：7878|2022-11-06

9 ipz-185:win7系统vcf文件怎么打开

今天小编告诉大家如何对win7系统打开vcf文件进行设置，可能很多用户都不知道怎么对win

阅读：8433|2022-11-06

10 傻哥蹦迪:win10系统s4怎么打开usb调试

今天小编告诉大家如何对win10系统s4开启USB调试模式进行设置，可能很多用户都不知道怎

阅读：7406|2022-11-06

客服电话

电子邮件

mumoshu/kube-airflow: A docker image and kubernetes config files to run Airflow ...

开源软件名称（OpenSource Name）：

开源软件地址(OpenSource Url)：

开源编程语言(OpenSource Language)：

开源软件介绍(OpenSource Introduction)：

kube-airflow (Celery Executor)

Informations

Manual Installation

Helm Deployment (recommended)

Helm ingresses

Prefix

DAGs deployment: embedded DAGs or git-sync

Using embedded Git-Sync

Airflow.cfg as ConfigMap

Worker Statefulset

Embedded DAGs

Python dependencies

Helm configuration customization

Build Docker image

Run with minikube

Run the test "tutorial"

Scale the number of workers

Wanna help?

请发表评论

全部评论

上一篇：

下一篇：

连衣裙、衬衫、裤子尺码对照表(买衣服更省

PacktPublishing/Python-Machine-Learning-

sussillo/hfopt-matlab: A parallel, cpu-b

鲁东大学一米网:Win7系统USB驱动器RAM的操

emersion/go-ostatus: An OStatus library

剪的笔顺,诠释剪的笔画,认识剪的部首

六六分期app的软件客服如何联系？(六六分期

florent37/ViewAnimator: A fluent Android

florent37/Shrine-MaterialDesign2: implem

CVE-2020-36276

SimpleSoftwareIO/simple-sms: Send and re

关于我们

产品与服务

解决方案

139-2527-9053