philips-labs/terraform-aws-github-runner: Terraform module for scalable GitHub a ...

原作者: [db:作者] 来自: 网络收藏邀请

开源软件名称：

philips-labs/terraform-aws-github-runner

开源软件地址：

https://github.com/philips-labs/terraform-aws-github-runner

开源编程语言：

TypeScript 50.6%

开源软件介绍：

Terraform module for scalable self hosted GitHub action runners

This Terraform module creates the required infrastructure needed to host GitHub Actions self-hosted, auto-scaling runners on AWS spot instances. It provides the required logic to handle the life cycle for scaling up and down using a set of AWS Lambda functions. Runners are scaled down to zero to avoid costs when no workflows are active.

BREAKING CHANGE: The module is upgraded to Terraform AWS provider 4.x. All new development will only support the new AWS Terraform provider. We keep a branch terraform-aws-provider-3 to witch we welcome backports to AWS Terraform 3.x provider. Besides reviewing PR's we will do not any active checking on maintance on this branch. We strongly advise to update your deployment to the new provider version. For more details about upgrading see the upgrade guide.

Motivation
Overview
- Major configuration options.
  - ARM64 support via Graviton/Graviton2 instance-types
Usages
Examples
Sub modules
- ARM64 configuration for submodules
Debugging
Requirements
Providers
Modules
Resources
Inputs
Outputs
Contribution
Philips Forest

Motivation

GitHub Actions self-hosted runners provide a flexible option to run CI workloads on the infrastructure of your choice. Currently, no option is provided to automate the creation and scaling of action runners. This module creates the AWS infrastructure to host action runners on spot instances. It provides lambda modules to orchestrate the life cycle of the action runners.

Lambda is chosen as the runtime for two major reasons. First, it allows the creation of small components with minimal access to AWS and GitHub. Secondly, it provides a scalable setup with minimal costs that works on repo level and scales to organization level. The lambdas will create Linux based EC2 instances with Docker to serve CI workloads that can run on Linux and/or Docker. The main goal is to support Docker-based workloads.

A logical question would be, why not Kubernetes? In the current approach, we stay close to how the GitHub action runners are available today. The approach is to install the runner on a host where the required software is available. With this setup, we stay quite close to the current GitHub approach. Another logical choice would be AWS Auto Scaling groups. However, this choice would typically require much more permissions on instance level to GitHub. And besides that, scaling up and down is not trivial.

Overview

The moment a GitHub action workflow requiring a self-hosted runner is triggered, GitHub will try to find a runner which can execute the workload. This module reacts to GitHub's check_run event or workflow_job event for the triggered workflow and creates a new runner if necessary.

For receiving the check_run or workflow_job event by the webhook (lambda), a webhook needs to be created in GitHub. The workflow_job is the preferred option, and the check_run option will be maintained for backward compatibility. The advantage of the workflow_job event is that the runner checks if the received event can run on the configured runners by matching the labels, which avoid instances being scaled up and never used. The following options are available:

workflow_job: (preferred option) create a webhook on enterprise, org or app level. Select this option for ephemeral runners.
check_run: create a webhook on enterprise, org, repo or app level. When using the app option, the app needs to be installed to repo's are using the self-hosted runners.
a Webhook needs to be created. The webhook hook can be defined on enterprise, org, repo, or app level.

In AWS a API gateway endpoint is created that is able to receive the GitHub webhook events via HTTP post. The gateway triggers the webhook lambda which will verify the signature of the event. This check guarantees the event is sent by the GitHub App. The lambda only handles workflow_job or check_run events with status queued and matching the runner labels (only for workflow_job). The accepted events are posted on a SQS queue. Messages on this queue will be delayed for a configurable amount of seconds (default 30 seconds) to give the available runners time to pick up this build.

The "scale up runner" lambda listens to the SQS queue and picks up events. The lambda runs various checks to decide whether a new EC2 spot instance needs to be created. For example, the instance is not created if the build is already started by an existing runner, or the maximum number of runners is reached.

The Lambda first requests a registration token from GitHub, which is needed later by the runner to register itself. This avoids that the EC2 instance, which later in the process will install the agent, needs administration permissions to register the runner. Next, the EC2 spot instance is created via the launch template. The launch template defines the specifications of the required instance and contains a user_data script. This script will install the required software and configure it. The registration token for the action runner is stored in the parameter store (SSM), from which the user data script will fetch it and delete it once it has been retrieved. Once the user data script is finished, the action runner should be online, and the workflow will start in seconds.

Scaling down the runners is at the moment brute-forced, every configurable amount of minutes a lambda will check every runner (instance) if it is busy. In case the runner is not busy it will be removed from GitHub and the instance terminated in AWS. At the moment there seems no other option to scale down more smoothly.

Downloading the GitHub Action Runner distribution can be occasionally slow (more than 10 minutes). Therefore a lambda is introduced that synchronizes the action runner binary from GitHub to an S3 bucket. The EC2 instance will fetch the distribution from the S3 bucket instead of the internet.

Secrets and private keys are stored in SSM Parameter Store. These values are encrypted using the default KMS key for SSM or passing in a custom KMS key.

Permission are managed on several places. Below the most important ones. For details check the Terraform sources.

The GitHub App requires access to actions and publish workflow_job events to the AWS webhook (API gateway).
The scale up lambda should have access to EC2 for creating and tagging instances.
The scale down lambda should have access to EC2 to terminate instances.

Besides these permissions, the lambdas also need permission to CloudWatch (for logging and scheduling), SSM and S3. For more details about the required permissions see the documentation of the IAM module which uses permission boundaries.

Major configuration options.

To be able to support a number of use-cases the module has quite a lot of configuration options. We try to choose reasonable defaults. The several examples also show for the main cases how to configure the runners.

Org vs Repo level. You can configure the module to connect the runners in GitHub on an org level and share the runners in your org. Or set the runners on repo level and the module will install the runner to the repo. There can be multiple repos but runners are not shared between repos.
Checkrun vs Workflow job event. You can configure the webhook in GitHub to send checkrun or workflow job events to the webhook. Workflow job events are introduced by GitHub in September 2021 and are designed to support scalable runners. We advise when possible using the workflow job event, you can set runner_enable_workflow_job_labels_check = true to let the webhook only accept jobs based on the labels configured. The webhook will check the custom labels provided via the variable runner_extra_labels and the GitHub managed labels, "self-hosted", OS and architecture. The OS and architecture are derived from the settings. By default the check is disabled.
Linux vs Windows. you can configure the OS types linux and win. Linux will be used by default.
Re-use vs Ephemeral. By default runners are re-used for till detected idle. Once idle they will be removed from the pool. To improve security we are introducing ephemeral runners. Those runners are only used for one job. Ephemeral runners are only working in combination with the workflow job event. We also suggest using a pre-build AMI to improve the start time of jobs.
GitHub Cloud vs GitHub Enterprise Server (GHES). The runner support GitHub Cloud as well GitHub Enterprise Server. For GHES we rely on our community to test and support. We have no possibility to test ourselves on GHES.
Spot vs on-demand. The runners use either the EC2 spot or on-demand life cycle. Runners will be created via the AWS CreateFleet API. The module (scale up lambda) will request via the CreateFleet API to create instances in one of the subnets and of the specified instance types.

ARM64 support via Graviton/Graviton2 instance-types

When using the default example or top-level module, specifying instance_types that match a Graviton/Graviton 2 (ARM64) architecture (e.g. a1, t4g or any 6th-gen g or gd type), you must also specify runner_architecture = "arm64" and the sub-modules will be automatically configured to provision with ARM64 AMIs and leverage GitHub's ARM64 action runner. See below for more details.

Usages

Examples are provided in the example directory. Please ensure you have installed the following tools.

Terraform, or tfenv.
Bash shell or compatible
Docker (optional, to build lambdas without node).
AWS cli (optional)
Node and yarn (for lambda development).

The module supports two main scenarios for creating runners. On repository level a runner will be dedicated to only one repository, no other repository can use the runner. On organization level you can use the runner(s) for all the repositories within the organization. See GitHub self-hosted runner instructions for more information. Before starting the deployment you have to choose one option.

The setup consists of running Terraform to create all AWS resources and manually configuring the GitHub App. The Terraform module requires configuration from the GitHub App and the GitHub app requires output from Terraform. Therefore you first create the GitHub App and configure the basics, then run Terraform, and afterwards finalize the configuration of the GitHub App.

Setup GitHub App (part 1)

Go to GitHub and create a new app. Beware you can create apps your organization or for a user. For now we support only organization level apps.

Create app in Github
Choose a name
Choose a website (mandatory, not required for the module).
Disable the webhook for now (we will configure this later or create an alternative webhook).
Permissions for all runners:
- Repository:
  - Actions: Read-only (check for queued jobs)
  - Checks: Read-only (receive events for new builds)
  - Metadata: Read-only (default/required)
Permissions for repo level runners only:
- Repository:
  - Administration: Read & write (to register runner)
Permissions for organization level runners only:
- Organization
  - Self-hosted runners: Read & write (to register runner)
Save the new app.
On the General page, make a note of the "App ID" and "Client ID" parameters.
Generate a new private key and save the app.private-key.pem file.

Setup terraform module

Download lambdas

To apply the terraform module, the compiled lambdas (.zip files) need to be available either locally or in an S3 bucket. They can be either downloaded from the GitHub release page or build locally.

To read the files from S3, set the lambda_s3_bucket variable and the specific object key for each lambda.

The lambdas can be downloaded manually from the release page or using the download-lambda terraform module (requires curl to be installed on your machine). In the download-lambda directory, run terraform init && terraform apply. The lambdas will be saved to the same directory.

For local development you can build all the lambdas at once using .ci/build.sh or individually using yarn dist.

Service-linked role

To create spot instances the AWSServiceRoleForEC2Spot role needs to be added to your account. You can do that manually by following the AWS docs. To use terraform for creating the role, either add the following resource or let the module manage the the service linked role by setting create_service_linked_role_spot to true. Be aware this is an account global role, so maybe you don't want to manage it via a specific deployment.

resource "aws_iam_service_linked_role" "spot" {
  aws_service_name = "spot.amazonaws.com"
}

Terraform module

Next create a second terraform workspace and initiate the module, or adapt one of the examples.

Note that github_app.key_base64 needs to be a base64-encoded string of the .pem file i.e. the output of base64 app.private-key.pem. The decoded string can either be a multiline value or a single line value with new lines represented with literal \n characters.

module "github-runner" {
  source  = "philips-labs/github-runner/aws"
  version = "REPLACE_WITH_VERSION"

  aws_region = "eu-west-1"
  vpc_id     = "vpc-123"
  subnet_ids = ["subnet-123", "subnet-456"]

  environment = "gh-ci"

  github_app = {
    key_base64     = "base64string"
    id             = "1"
    webhook_secret = "webhook_secret"
  }

  webhook_lambda_zip                = "lambdas-download/webhook.zip"
  runner_binaries_syncer_lambda_zip = "lambdas-download/runner-binaries-syncer.zip"
  runners_lambda_zip                = "lambdas-download/runners.zip"
  enable_organization_runners = true
}

Run terraform by using the following commands

terraform init
terraform apply

The terraform output displays the API gateway url (endpoint) and secret, which you need in the next step.

The lambda for syncing the GitHub distribution to S3 is triggered via CloudWatch (by default once per hour). After deployment the function is triggered via S3 to ensure the distribution is cached.

Setup the webhook / GitHub App (part 2)

At this point you have 2 options. Either create a separate webhook (enterprise, org, or repo), or create webhook in the App.

Option 1: Webhook

Create a new webhook on repo level for repo level for repo level runner, or org (or enterprise level) for an org level runner.
Provide the webhook url, should be part of the output of terraform.
Provide the webhook secret (terraform output -raw <NAME_OUTPUT_VAR>).
Ensure content type as application/json.
In the "Permissions & Events" section and then "Subscribe to Events" subsection, check either "Workflow Job" or "Check Run" (choose only 1 option!!!).
In the "Install App" section, install the App in your organization, either in all or in selected repositories.

Option 2: App

Go back to the GitHub App and update the following settings.

Enable the webhook.
Provide the webhook url, should be part of the output of terraform.
Provide the webhook secret (terraform output -raw <NAME_OUTPUT_VAR>).
In the "Permissions & Events" section and then "Subscribe to Events" subsection, check either "Workflow Job" or "Check Run" (choose only 1 option!!!).

Install app

Finally you need to ensure the app is installed to all or selected repositories.

Go back to the GitHub App and update the following settings.

In the "Install App" section, install the App in your organization, either in all or in selected repositories.

Encryption

The module support 2 scenarios to manage environment secrets and private key of the Lambda functions.

Encrypted via a module managed KMS key (default)

This is the default, no additional configuration is required.

Encrypted via a provided KMS key

You have to create an configure you KMS key. The module will use the context with key: Environment and value var.environment as encryption context.

resource "aws_kms_key" "github" {
  is_enabled = true
}

module "runners" {

  ...
  kms_key_arn = aws_kms_key.github.arn
  ...

Pool

The module basically supports two options for keeping a pool of runners. One is via a pool which only supports org-level runners, the second option is keeping runners idle.

The pool is introduced in combination with the ephemeral runners and is primary meant to ensure if any event is unexpected dropped, and no runner was created the pool can pick up the job. The pool is maintained by a lambda. Each time the lambda is triggered a check is preformed if the number of idler runners managed by the module are meeting the expected pool size. If not, the pool will be adjusted. Keep in mind that the scale down function is still active and will terminate instances that are detected as idle.

pool_runner_owner = "my-org"                  # Org to which the runners are added
pool_config = [{
  size                = 20                    # size of the pool
  schedule_expression = "cron(* * * * ? *)"   # cron expression to trigger the adjustment of the pool
}]

The pool is NOT enabled by default and can be enabled by setting at least one object of the pool config list. The ephemeral example contains configuration options (commented out).

Idle runners

The module will scale down to zero runners by default. By specifying a idle_config config, idle runners can be kept active. The scale down lambda checks if any of the cron expressions matches the current time with a margin of 5 seconds. When there is a match, the number of runners specified in the idle config will be kept active. In case multiple cron expressions matches, only the first one is taken into account. Below is an idle configuration for keeping runners active from 9 to 5 on working days.

idle_config = [{
   cron      = "* * 9-17 * * 1-5"
   timeZone  = "Europe/Amsterdam"
   idleCount = 2
}]

Note: When using Windows runners it's recommended to keep a few runners warmed up due to the minutes-long cold start time.

Supported config

Cron expressions are parsed by cron-parser. The supported syntax.

*    *    *    *    *    *
┬    ┬    ┬    ┬    ┬    ┬
│    │    │    │    │    |
│    │    │    │    │    └ day of week (0 - 7) (0 or 7 is Sun)
│    │    │    │    └───── month (1 - 12)
│    │    │    └────────── day of month (1 - 31)
│    │    └─────────────── hour (0 - 23)
│    └──────────────────── minute (0 - 59)
└───────────────────────── second (0 - 59, optional)

For time zones please check TZ database name column for the supported values.

Ephemeral runners

Currently a beta feature! You can configure runners to be ephemeral, runners will be used only for one job. The feature should be used in conjunction with listening for the workflow job event. Please consider the following:

The scale down lambda is still active, and should only remove orphan instances. But there is no strict check in place. So ensure you configure the minimum_running_time_in_minutes to a value that is high enough to got your runner booted and connected to avoid it got terminated before executing a job.
The messages sent from the webhook lambda to scale-up lambda are by default delayed delayed by SQS, to give available runners to option to start the job before the decision is made to scale more runners. For ephemeral runners there is no need to wait. Set delay_webhook_event to 0.
All events on the queue will lead to a new runner crated by the lambda. By setting enable_job_queued_check to true you can enforce only create a runner if the event has a correlated queued job. Setting this can avoid creating useless runners, for example whn jobs got cancelled before a runner is created. We suggest to use this in combination with a pool.
To ensure runners are created in the same order GitHub sends the events we use by default a FIFO queue, this is mainly relevant for repo level runners. For ephemeral runners you can set fifo_build_queue to false.
Error related to scaling should be retried via SQS. You can configure job_queue_retention_in_seconds redrive_build_queue to tune the behavior. We have no mechanism to avoid events will never processed, which means potential no runner could be created and the job in GitHub can time out in 6 hours.

The example for ephemeral runners is based on the default example. Have look on the diff to see the major configuration differences.

Prebuilt Images

This module also allows you to run agents from a prebuilt AMI to gain faster startup times. You can find more information in the image README.md

Examples

Examples are located in the examples directory. The following examples are provided:

Default: The default example of the module
ARM64: Example usage with ARM64 architecture
Ubuntu: Example usage of creating a runner using Ubuntu AMIs.
Windows: Example usage of creating a runner using Windows as the OS.
Ephemeral: Example usages of ephemeral runners based on the default example.
Prebuilt Images: Example usages of deploying runners with a custom prebuilt image.
Permissions boundary: Example usages of permissions boundaries.

Sub modules

The module contains several submodules, you can use the module via the main module or assemble your own setup by initializing the submodules yourself.

The following submodules are the core of the module and are mandatory:

runner-binaries-syncer - Syncs the action runner distribution.
runners - Scales the action runners up and down
webhook - Handles GitHub webhooks

The following sub modules are optional and are provided as example or utility:

download-lambda - Utility module to download lambda artifacts from GitHub Release
setup-iam-permissions - Example module to setup permission boundaries

ARM64 configuration for submodules

When using the top level module configure runner_architecture = "arm64" and ensure the list of instance_types matches. When not using the top-level, ensure these properties are set on the submodules.

Debugging

In case the setup does not work as intended follow the trace of events:

In the GitHub App configuration, the Advanced page displays all webhook events that were sent.
In AWS CloudWatch, every lambda has a log group. Look at the logs of the webhook and scale-up lambdas.
In AWS SQS you can see messages available or in flight.
Once an EC2 instance is running, you can connect to it in the EC2 user interface using Session Manager (use enable_ssm_on_runners = true). Check the user data script using cat /var/log/user-data.log. By default several log files of the instances are streamed to AWS CloudWatch, look for a log group named <environment>/runners. In the log group you should see at least the log streams for the user data installation and runner agent.
Registered instances should show up in the Settings - Actions page of the repository or organization (depending on the installation mode).

Requirements

Name	Version
terraform	>= 0.14.1
aws	~> 4.0

Providers

Name	Version
aws	~> 4.0
random	n/a

Modules

Name	Source	Version
runner_binaries	./modules/runner-binaries-syncer	n/a
runners	./modules/runners	n/a
ssm	./modules/ssm	n/a
webhook	./modules/webhook	n/a

Resources

Name	Type
aws_resourcegroups_group.resourcegroups_group	resource
aws_sqs_queue.queued_builds	resource
aws_sqs_queue.queued_builds_dlq	resource
aws_sqs_queue_policy.build_queue_dlq_policy	resource
aws_sqs_queue_policy.build_queue_policy	resource
random_string.random	resource
aws_iam_policy_document.deny_unsecure_transport	data source

Inputs

Name	Description	Type	Default	Required
ami_filter	List of maps used to create the AMI filter for the action runner AMI. By default amazon linux 2 is used.	`map(list(string))`	`null`	no
ami_owners	The list of owners used to select the AMI of action runner instances.	`list(string)`	[ "amazon" ]	no
aws_partition	(optiona) partition in the arn namespace to use if not 'aws'	`string`	`"aws"`	no
aws_region	AWS region.	`string`	n/a	yes
block_device_mappings	The EC2 instance block device configuration. Takes the following keys: `device_name`, `delete_on_termination`, `volume_type`, `volume_size`, `encrypted`, `iops`	list(object({ device_name = string<

鲜花

握手

雷人

路过

鸡蛋

该文章已有0人参与评论

请发表评论

全部评论

专题导读

More+

10-27 六六分期app的软件客服如何联系？(六六分期

11-06 可心卡盟:win10系统火狐flash插件崩溃怎么

11-06 亲亲特价:怎么删除回收站图标

11-06 济南大学虚拟社区:鲁大师节能降温的具体办

11-06 xlueops.exe:无线网络安装向导

11-06 女斗合众国:win7系统cf与主机连接不稳定怎

11-06 0xc000022-[cf烟雾头]cf怎么调烟雾头

11-06 qizideyouhuo:应用程序无法正常启动0xc0000

11-06 ipz-185:win7系统vcf文件怎么打开

11-06 傻哥蹦迪:win10系统s4怎么打开usb调试

11-06 八神浩树gtaste:回收站清空了怎么恢复

11-06 妖尾之黑色守护:win10系统电脑没有1440x900

11-06 校园至尊魔王小说:win7系统浏览网页时字体

11-06 女斗合众国:win10系统访问共享文件夹提示请

11-06 tokyo hot n0654:恢复win7系统默认字体一招

11-06 雨酷仙境:设置win7系统转移临时文件夹腾出

11-06 阿穆纳伊之杖:win7系统开始菜单在右边还原

11-06 tunespotting:win10系统火狐flash插件总是

11-06 甘尔葛分析师：计谋网站seo关键词暴涨有什

11-06 蔡贵霖: 计谋网站seo关键词暴涨有什么秘密

11-06 博益网首页:ao3网页版进入不了解决方法

11-06 漏斗子专栏: 网站数据分析小白易懂精华篇

11-06 见证双虹怎么做:win7系统开启telnet命令的

11-06 颾狐蝶蜋:系统资源不足无法完成请求的服务

11-06 国光中学校歌:提交网站到alexa查询详细步骤

11-06 西安有情天:静态网页和动态网页的区别

11-06 红木雅尚斋:外部链接构造对网站的好处

11-06 前官礼遇：防止域名劫持–增强域安全性的10

11-06 密传二转答案: 中文分词算法有哪些

11-06 金泉家园邮编:百度快照劫持的表现及应对方

learn-co-curriculum/git-version-control-getting-code-with-git发布时间：2022-06-11

IBM/Kubernetes-container-service-GitLab-sample: This code shows how a common mul ...发布时间：2022-06-11

剪的笔顺,诠释剪的笔画,认识剪的部首

1 六六分期app的软件客服如何联系？(六六分期

六六分期app的软件客服如何联系？不知道吗？加qq群【895510560】即可！标题：六六分期

阅读：18001|2023-10-27

2 可心卡盟:win10系统火狐flash插件崩溃怎么

今天小编告诉大家如何处理win10系统火狐flash插件总是崩溃的问题，可能很多用户都不知

阅读：9581|2022-11-06

3 亲亲特价:怎么删除回收站图标

今天小编告诉大家如何对win10系统删除桌面回收站图标进行设置，可能很多用户都不知道

阅读：8132|2022-11-06

4 济南大学虚拟社区:鲁大师节能降温的具体办

今天小编告诉大家如何对win10系统电脑设置节能降温的设置方法，想必大家都遇到过需要

阅读：8516|2022-11-06

5 xlueops.exe:无线网络安装向导

我们在使用xp系统的过程中,经常需要对xp系统无线网络安装向导设置进行设置，可能很多

阅读：8420|2022-11-06

6 女斗合众国:win7系统cf与主机连接不稳定怎

今天小编告诉大家如何处理win7系统玩cf老是与主机连接不稳定的问题，可能很多用户都不

阅读：9318|2022-11-06

7 0xc000022-[cf烟雾头]cf怎么调烟雾头

电脑对日常生活的重要性小编就不多说了，可是一旦碰到win7系统设置cf烟雾头的问题，很

阅读：8382|2022-11-06

8 qizideyouhuo:应用程序无法正常启动0xc0000

我们在日常使用电脑的时候，有的小伙伴们可能在打开应用的时候会遇见提示应用程序无法

阅读：7815|2022-11-06

9 ipz-185:win7系统vcf文件怎么打开

今天小编告诉大家如何对win7系统打开vcf文件进行设置，可能很多用户都不知道怎么对win

阅读：8370|2022-11-06

10 傻哥蹦迪:win10系统s4怎么打开usb调试

今天小编告诉大家如何对win10系统s4开启USB调试模式进行设置，可能很多用户都不知道怎

阅读：7366|2022-11-06

扫描微信二维码

查看手机版网站

随时了解更新最新资讯

139-2527-9053

在线客服（服务时间 9:00～18:00）

在线QQ客服

地址：深圳市南山区西丽大学城创智工业园

电邮：jeky_zhao#qq.com

移动电话：139-2527-9053

客服电话

电子邮件