• 设为首页
  • 点击收藏
  • 手机版
    手机扫一扫访问
    迪恩网络手机版
  • 关注官方公众号
    微信扫一扫关注
    迪恩网络公众号

philips-labs/terraform-aws-github-runner: Terraform module for scalable GitHub a ...

原作者: [db:作者] 来自: 网络 收藏 邀请

开源软件名称:

philips-labs/terraform-aws-github-runner

开源软件地址:

https://github.com/philips-labs/terraform-aws-github-runner

开源编程语言:

TypeScript 50.6%

开源软件介绍:

Terraform module for scalable self hosted GitHub action runners

awesome-runnersTerraform registry Terraform checks Lambda Webhook Lambda Runners Lambda Syncer

This Terraform module creates the required infrastructure needed to host GitHub Actions self-hosted, auto-scaling runners on AWS spot instances. It provides the required logic to handle the life cycle for scaling up and down using a set of AWS Lambda functions. Runners are scaled down to zero to avoid costs when no workflows are active.

BREAKING CHANGE: The module is upgraded to Terraform AWS provider 4.x. All new development will only support the new AWS Terraform provider. We keep a branch terraform-aws-provider-3 to witch we welcome backports to AWS Terraform 3.x provider. Besides reviewing PR's we will do not any active checking on maintance on this branch. We strongly advise to update your deployment to the new provider version. For more details about upgrading see the upgrade guide.

Motivation

GitHub Actions self-hosted runners provide a flexible option to run CI workloads on the infrastructure of your choice. Currently, no option is provided to automate the creation and scaling of action runners. This module creates the AWS infrastructure to host action runners on spot instances. It provides lambda modules to orchestrate the life cycle of the action runners.

Lambda is chosen as the runtime for two major reasons. First, it allows the creation of small components with minimal access to AWS and GitHub. Secondly, it provides a scalable setup with minimal costs that works on repo level and scales to organization level. The lambdas will create Linux based EC2 instances with Docker to serve CI workloads that can run on Linux and/or Docker. The main goal is to support Docker-based workloads.

A logical question would be, why not Kubernetes? In the current approach, we stay close to how the GitHub action runners are available today. The approach is to install the runner on a host where the required software is available. With this setup, we stay quite close to the current GitHub approach. Another logical choice would be AWS Auto Scaling groups. However, this choice would typically require much more permissions on instance level to GitHub. And besides that, scaling up and down is not trivial.

Overview

The moment a GitHub action workflow requiring a self-hosted runner is triggered, GitHub will try to find a runner which can execute the workload. This module reacts to GitHub's check_run event or workflow_job event for the triggered workflow and creates a new runner if necessary.

For receiving the check_run or workflow_job event by the webhook (lambda), a webhook needs to be created in GitHub. The workflow_job is the preferred option, and the check_run option will be maintained for backward compatibility. The advantage of the workflow_job event is that the runner checks if the received event can run on the configured runners by matching the labels, which avoid instances being scaled up and never used. The following options are available:

  • workflow_job: (preferred option) create a webhook on enterprise, org or app level. Select this option for ephemeral runners.
  • check_run: create a webhook on enterprise, org, repo or app level. When using the app option, the app needs to be installed to repo's are using the self-hosted runners.
  • a Webhook needs to be created. The webhook hook can be defined on enterprise, org, repo, or app level.

In AWS a API gateway endpoint is created that is able to receive the GitHub webhook events via HTTP post. The gateway triggers the webhook lambda which will verify the signature of the event. This check guarantees the event is sent by the GitHub App. The lambda only handles workflow_job or check_run events with status queued and matching the runner labels (only for workflow_job). The accepted events are posted on a SQS queue. Messages on this queue will be delayed for a configurable amount of seconds (default 30 seconds) to give the available runners time to pick up this build.

The "scale up runner" lambda listens to the SQS queue and picks up events. The lambda runs various checks to decide whether a new EC2 spot instance needs to be created. For example, the instance is not created if the build is already started by an existing runner, or the maximum number of runners is reached.

The Lambda first requests a registration token from GitHub, which is needed later by the runner to register itself. This avoids that the EC2 instance, which later in the process will install the agent, needs administration permissions to register the runner. Next, the EC2 spot instance is created via the launch template. The launch template defines the specifications of the required instance and contains a user_data script. This script will install the required software and configure it. The registration token for the action runner is stored in the parameter store (SSM), from which the user data script will fetch it and delete it once it has been retrieved. Once the user data script is finished, the action runner should be online, and the workflow will start in seconds.

Scaling down the runners is at the moment brute-forced, every configurable amount of minutes a lambda will check every runner (instance) if it is busy. In case the runner is not busy it will be removed from GitHub and the instance terminated in AWS. At the moment there seems no other option to scale down more smoothly.

Downloading the GitHub Action Runner distribution can be occasionally slow (more than 10 minutes). Therefore a lambda is introduced that synchronizes the action runner binary from GitHub to an S3 bucket. The EC2 instance will fetch the distribution from the S3 bucket instead of the internet.

Secrets and private keys are stored in SSM Parameter Store. These values are encrypted using the default KMS key for SSM or passing in a custom KMS key.

Architecture

Permission are managed on several places. Below the most important ones. For details check the Terraform sources.

  • The GitHub App requires access to actions and publish workflow_job events to the AWS webhook (API gateway).
  • The scale up lambda should have access to EC2 for creating and tagging instances.
  • The scale down lambda should have access to EC2 to terminate instances.

Besides these permissions, the lambdas also need permission to CloudWatch (for logging and scheduling), SSM and S3. For more details about the required permissions see the documentation of the IAM module which uses permission boundaries.

Major configuration options.

To be able to support a number of use-cases the module has quite a lot of configuration options. We try to choose reasonable defaults. The several examples also show for the main cases how to configure the runners.

  • Org vs Repo level. You can configure the module to connect the runners in GitHub on an org level and share the runners in your org. Or set the runners on repo level and the module will install the runner to the repo. There can be multiple repos but runners are not shared between repos.
  • Checkrun vs Workflow job event. You can configure the webhook in GitHub to send checkrun or workflow job events to the webhook. Workflow job events are introduced by GitHub in September 2021 and are designed to support scalable runners. We advise when possible using the workflow job event, you can set runner_enable_workflow_job_labels_check = true to let the webhook only accept jobs based on the labels configured. The webhook will check the custom labels provided via the variable runner_extra_labels and the GitHub managed labels, "self-hosted", OS and architecture. The OS and architecture are derived from the settings. By default the check is disabled.
  • Linux vs Windows. you can configure the OS types linux and win. Linux will be used by default.
  • Re-use vs Ephemeral. By default runners are re-used for till detected idle. Once idle they will be removed from the pool. To improve security we are introducing ephemeral runners. Those runners are only used for one job. Ephemeral runners are only working in combination with the workflow job event. We also suggest using a pre-build AMI to improve the start time of jobs.
  • GitHub Cloud vs GitHub Enterprise Server (GHES). The runner support GitHub Cloud as well GitHub Enterprise Server. For GHES we rely on our community to test and support. We have no possibility to test ourselves on GHES.
  • Spot vs on-demand. The runners use either the EC2 spot or on-demand life cycle. Runners will be created via the AWS CreateFleet API. The module (scale up lambda) will request via the CreateFleet API to create instances in one of the subnets and of the specified instance types.

ARM64 support via Graviton/Graviton2 instance-types

When using the default example or top-level module, specifying instance_types that match a Graviton/Graviton 2 (ARM64) architecture (e.g. a1, t4g or any 6th-gen g or gd type), you must also specify runner_architecture = "arm64" and the sub-modules will be automatically configured to provision with ARM64 AMIs and leverage GitHub's ARM64 action runner. See below for more details.

Usages

Examples are provided in the example directory. Please ensure you have installed the following tools.

  • Terraform, or tfenv.
  • Bash shell or compatible
  • Docker (optional, to build lambdas without node).
  • AWS cli (optional)
  • Node and yarn (for lambda development).

The module supports two main scenarios for creating runners. On repository level a runner will be dedicated to only one repository, no other repository can use the runner. On organization level you can use the runner(s) for all the repositories within the organization. See GitHub self-hosted runner instructions for more information. Before starting the deployment you have to choose one option.

The setup consists of running Terraform to create all AWS resources and manually configuring the GitHub App. The Terraform module requires configuration from the GitHub App and the GitHub app requires output from Terraform. Therefore you first create the GitHub App and configure the basics, then run Terraform, and afterwards finalize the configuration of the GitHub App.

Setup GitHub App (part 1)

Go to GitHub and create a new app. Beware you can create apps your organization or for a user. For now we support only organization level apps.

  1. Create app in Github
  2. Choose a name
  3. Choose a website (mandatory, not required for the module).
  4. Disable the webhook for now (we will configure this later or create an alternative webhook).
  5. Permissions for all runners:
    • Repository:
      • Actions: Read-only (check for queued jobs)
      • Checks: Read-only (receive events for new builds)
      • Metadata: Read-only (default/required)
  6. Permissions for repo level runners only:
    • Repository:
      • Administration: Read & write (to register runner)
  7. Permissions for organization level runners only:
    • Organization
      • Self-hosted runners: Read & write (to register runner)
  8. Save the new app.
  9. On the General page, make a note of the "App ID" and "Client ID" parameters.
  10. Generate a new private key and save the app.private-key.pem file.

Setup terraform module

Download lambdas

To apply the terraform module, the compiled lambdas (.zip files) need to be available either locally or in an S3 bucket. They can be either downloaded from the GitHub release page or build locally.

To read the files from S3, set the lambda_s3_bucket variable and the specific object key for each lambda.

The lambdas can be downloaded manually from the release page or using the download-lambda terraform module (requires curl to be installed on your machine). In the download-lambda directory, run terraform init && terraform apply. The lambdas will be saved to the same directory.

For local development you can build all the lambdas at once using .ci/build.sh or individually using yarn dist.

Service-linked role

To create spot instances the AWSServiceRoleForEC2Spot role needs to be added to your account. You can do that manually by following the AWS docs. To use terraform for creating the role, either add the following resource or let the module manage the the service linked role by setting create_service_linked_role_spot to true. Be aware this is an account global role, so maybe you don't want to manage it via a specific deployment.

resource "aws_iam_service_linked_role" "spot" {
  aws_service_name = "spot.amazonaws.com"
}

Terraform module

Next create a second terraform workspace and initiate the module, or adapt one of the examples.

Note that github_app.key_base64 needs to be a base64-encoded string of the .pem file i.e. the output of base64 app.private-key.pem. The decoded string can either be a multiline value or a single line value with new lines represented with literal \n characters.

module "github-runner" {
  source  = "philips-labs/github-runner/aws"
  version = "REPLACE_WITH_VERSION"

  aws_region = "eu-west-1"
  vpc_id     = "vpc-123"
  subnet_ids = ["subnet-123", "subnet-456"]

  environment = "gh-ci"

  github_app = {
    key_base64     = "base64string"
    id             = "1"
    webhook_secret = "webhook_secret"
  }

  webhook_lambda_zip                = "lambdas-download/webhook.zip"
  runner_binaries_syncer_lambda_zip = "lambdas-download/runner-binaries-syncer.zip"
  runners_lambda_zip                = "lambdas-download/runners.zip"
  enable_organization_runners = true
}

Run terraform by using the following commands

terraform init
terraform apply

The terraform output displays the API gateway url (endpoint) and secret, which you need in the next step.

The lambda for syncing the GitHub distribution to S3 is triggered via CloudWatch (by default once per hour). After deployment the function is triggered via S3 to ensure the distribution is cached.

Setup the webhook / GitHub App (part 2)

At this point you have 2 options. Either create a separate webhook (enterprise, org, or repo), or create webhook in the App.

Option 1: Webhook

  1. Create a new webhook on repo level for repo level for repo level runner, or org (or enterprise level) for an org level runner.
  2. Provide the webhook url, should be part of the output of terraform.
  3. Provide the webhook secret (terraform output -raw <NAME_OUTPUT_VAR>).
  4. Ensure content type as application/json.
  5. In the "Permissions & Events" section and then "Subscribe to Events" subsection, check either "Workflow Job" or "Check Run" (choose only 1 option!!!).
  6. In the "Install App" section, install the App in your organization, either in all or in selected repositories.

Option 2: App

Go back to the GitHub App and update the following settings.

  1. Enable the webhook.
  2. Provide the webhook url, should be part of the output of terraform.
  3. Provide the webhook secret (terraform output -raw <NAME_OUTPUT_VAR>).
  4. In the "Permissions & Events" section and then "Subscribe to Events" subsection, check either "Workflow Job" or "Check Run" (choose only 1 option!!!).

Install app

Finally you need to ensure the app is installed to all or selected repositories.

Go back to the GitHub App and update the following settings.

  1. In the "Install App" section, install the App in your organization, either in all or in selected repositories.

Encryption

The module support 2 scenarios to manage environment secrets and private key of the Lambda functions.

Encrypted via a module managed KMS key (default)

This is the default, no additional configuration is required.

Encrypted via a provided KMS key

You have to create an configure you KMS key. The module will use the context with key: Environment and value var.environment as encryption context.

resource "aws_kms_key" "github" {
  is_enabled = true
}

module "runners" {

  ...
  kms_key_arn = aws_kms_key.github.arn
  ...

Pool

The module basically supports two options for keeping a pool of runners. One is via a pool which only supports org-level runners, the second option is keeping runners idle.

The pool is introduced in combination with the ephemeral runners and is primary meant to ensure if any event is unexpected dropped, and no runner was created the pool can pick up the job. The pool is maintained by a lambda. Each time the lambda is triggered a check is preformed if the number of idler runners managed by the module are meeting the expected pool size. If not, the pool will be adjusted. Keep in mind that the scale down function is still active and will terminate instances that are detected as idle.

pool_runner_owner = "my-org"                  # Org to which the runners are added
pool_config = [{
  size                = 20                    # size of the pool
  schedule_expression = "cron(* * * * ? *)"   # cron expression to trigger the adjustment of the pool
}]

The pool is NOT enabled by default and can be enabled by setting at least one object of the pool config list. The ephemeral example contains configuration options (commented out).

Idle runners

The module will scale down to zero runners by default. By specifying a idle_config config, idle runners can be kept active. The scale down lambda checks if any of the cron expressions matches the current time with a margin of 5 seconds. When there is a match, the number of runners specified in the idle config will be kept active. In case multiple cron expressions matches, only the first one is taken into account. Below is an idle configuration for keeping runners active from 9 to 5 on working days.

idle_config = [{
   cron      = "* * 9-17 * * 1-5"
   timeZone  = "Europe/Amsterdam"
   idleCount = 2
}]

Note: When using Windows runners it's recommended to keep a few runners warmed up due to the minutes-long cold start time.

Supported config

Cron expressions are parsed by cron-parser. The supported syntax.

*    *    *    *    *    *
┬    ┬    ┬    ┬    ┬    ┬
│    │    │    │    │    |
│    │    │    │    │    └ day of week (0 - 7) (0 or 7 is Sun)
│    │    │    │    └───── month (1 - 12)
│    │    │    └────────── day of month (1 - 31)
│    │    └─────────────── hour (0 - 23)
│    └──────────────────── minute (0 - 59)
└───────────────────────── second (0 - 59, optional)

For time zones please check TZ database name column for the supported values.

Ephemeral runners

Currently a beta feature! You can configure runners to be ephemeral, runners will be used only for one job. The feature should be used in conjunction with listening for the workflow job event. Please consider the following:

  • The scale down lambda is still active, and should only remove orphan instances. But there is no strict check in place. So ensure you configure the minimum_running_time_in_minutes to a value that is high enough to got your runner booted and connected to avoid it got terminated before executing a job.
  • The messages sent from the webhook lambda to scale-up lambda are by default delayed delayed by SQS, to give available runners to option to start the job before the decision is made to scale more runners. For ephemeral runners there is no need to wait. Set delay_webhook_event to 0.
  • All events on the queue will lead to a new runner crated by the lambda. By setting enable_job_queued_check to true you can enforce only create a runner if the event has a correlated queued job. Setting this can avoid creating useless runners, for example whn jobs got cancelled before a runner is created. We suggest to use this in combination with a pool.
  • To ensure runners are created in the same order GitHub sends the events we use by default a FIFO queue, this is mainly relevant for repo level runners. For ephemeral runners you can set fifo_build_queue to false.
  • Error related to scaling should be retried via SQS. You can configure job_queue_retention_in_seconds redrive_build_queue to tune the behavior. We have no mechanism to avoid events will never processed, which means potential no runner could be created and the job in GitHub can time out in 6 hours.

The example for ephemeral runners is based on the default example. Have look on the diff to see the major configuration differences.

Prebuilt Images

This module also allows you to run agents from a prebuilt AMI to gain faster startup times. You can find more information in the image README.md

Examples

Examples are located in the examples directory. The following examples are provided:

  • Default: The default example of the module
  • ARM64: Example usage with ARM64 architecture
  • Ubuntu: Example usage of creating a runner using Ubuntu AMIs.
  • Windows: Example usage of creating a runner using Windows as the OS.
  • Ephemeral: Example usages of ephemeral runners based on the default example.
  • Prebuilt Images: Example usages of deploying runners with a custom prebuilt image.
  • Permissions boundary: Example usages of permissions boundaries.

Sub modules

The module contains several submodules, you can use the module via the main module or assemble your own setup by initializing the submodules yourself.

The following submodules are the core of the module and are mandatory:

The following sub modules are optional and are provided as example or utility:

ARM64 configuration for submodules

When using the top level module configure runner_architecture = "arm64" and ensure the list of instance_types matches. When not using the top-level, ensure these properties are set on the submodules.

Debugging

In case the setup does not work as intended follow the trace of events:

  • In the GitHub App configuration, the Advanced page displays all webhook events that were sent.
  • In AWS CloudWatch, every lambda has a log group. Look at the logs of the webhook and scale-up lambdas.
  • In AWS SQS you can see messages available or in flight.
  • Once an EC2 instance is running, you can connect to it in the EC2 user interface using Session Manager (use enable_ssm_on_runners = true). Check the user data script using cat /var/log/user-data.log. By default several log files of the instances are streamed to AWS CloudWatch, look for a log group named <environment>/runners. In the log group you should see at least the log streams for the user data installation and runner agent.
  • Registered instances should show up in the Settings - Actions page of the repository or organization (depending on the installation mode).

Requirements

Name Version
terraform >= 0.14.1
aws ~> 4.0

Providers

Name Version
aws ~> 4.0
random n/a

Modules

Name Source Version
runner_binaries ./modules/runner-binaries-syncer n/a
runners ./modules/runners n/a
ssm ./modules/ssm n/a
webhook ./modules/webhook n/a

Resources

Name Type
aws_resourcegroups_group.resourcegroups_group resource
aws_sqs_queue.queued_builds resource
aws_sqs_queue.queued_builds_dlq resource
aws_sqs_queue_policy.build_queue_dlq_policy resource
aws_sqs_queue_policy.build_queue_policy resource
random_string.random resource
aws_iam_policy_document.deny_unsecure_transport data source

Inputs

Name Description Type Default Required
ami_filter List of maps used to create the AMI filter for the action runner AMI. By default amazon linux 2 is used. map(list(string)) null no
ami_owners The list of owners used to select the AMI of action runner instances. list(string)
[
"amazon"
]
no
aws_partition (optiona) partition in the arn namespace to use if not 'aws' string "aws" no
aws_region AWS region. string n/a yes
block_device_mappings The EC2 instance block device configuration. Takes the following keys: device_name, delete_on_termination, volume_type, volume_size, encrypted, iops
list(object({
device_name = string<

鲜花

握手

雷人

路过

鸡蛋
该文章已有0人参与评论

请发表评论

全部评论

专题导读
热门推荐
阅读排行榜

扫描微信二维码

查看手机版网站

随时了解更新最新资讯

139-2527-9053

在线客服(服务时间 9:00~18:00)

在线QQ客服
地址:深圳市南山区西丽大学城创智工业园
电邮:jeky_zhao#qq.com
移动电话:139-2527-9053

Powered by 互联科技 X3.4© 2001-2213 极客世界.|Sitemap