• 设为首页
  • 点击收藏
  • 手机版
    手机扫一扫访问
    迪恩网络手机版
  • 关注官方公众号
    微信扫一扫关注
    迪恩网络公众号

aws/aws-node-termination-handler: Gracefully handle EC2 instance shutdown within ...

原作者: [db:作者] 来自: 网络 收藏 邀请

开源软件名称(OpenSource Name):

aws/aws-node-termination-handler

开源软件地址(OpenSource Url):

https://github.com/aws/aws-node-termination-handler

开源编程语言(OpenSource Language):

Go 55.1%

开源软件介绍(OpenSource Introduction):

AWS Node Termination Handler

Gracefully handle EC2 instance shutdown within Kubernetes

kubernetes go-version license build-status docker-pulls


Project Summary

This project ensures that the Kubernetes control plane responds appropriately to events that can cause your EC2 instance to become unavailable, such as EC2 maintenance events, EC2 Spot interruptions, ASG Scale-In, ASG AZ Rebalance, and EC2 Instance Termination via the API or Console. If not handled, your application code may not stop gracefully, take longer to recover full availability, or accidentally schedule work to nodes that are going down.

The aws-node-termination-handler (NTH) can operate in two different modes: Instance Metadata Service (IMDS) or the Queue Processor.

The aws-node-termination-handler Instance Metadata Service Monitor will run a small pod on each host to perform monitoring of IMDS paths like /spot or /events and react accordingly to drain and/or cordon the corresponding node.

The aws-node-termination-handler Queue Processor will monitor an SQS queue of events from Amazon EventBridge for ASG lifecycle events, EC2 status change events, Spot Interruption Termination Notice events, and Spot Rebalance Recommendation events. When NTH detects an instance is going down, we use the Kubernetes API to cordon the node to ensure no new work is scheduled there, then drain it, removing any existing work. The termination handler Queue Processor requires AWS IAM permissions to monitor and manage the SQS queue and to query the EC2 API.

You can run the termination handler on any Kubernetes cluster running on AWS, including self-managed clusters and those created with Amazon Elastic Kubernetes Service.

Major Features

Instance Metadata Service Processor

  • Monitors EC2 Metadata for Scheduled Maintenance Events
  • Monitors EC2 Metadata for Spot Instance Termination Notifications
  • Monitors EC2 Metadata for Rebalance Recommendation Notifications
  • Helm installation and event configuration support
  • Webhook feature to send shutdown or restart notification messages
  • Unit & Integration Tests

Queue Processor

  • Monitors an SQS Queue for:
    • EC2 Spot Interruption Notifications
    • EC2 Instance Rebalance Recommendation
    • EC2 Auto-Scaling Group Termination Lifecycle Hooks to take care of ASG Scale-In, AZ-Rebalance, Unhealthy Instances, and more!
    • EC2 Status Change Events
    • EC2 Scheduled Change events from AWS Health
  • Helm installation and event configuration support
  • Webhook feature to send shutdown or restart notification messages
  • Unit & Integration Tests

Which one should I use?

Feature IMDS Processor Queue Processor
K8s DaemonSet
K8s Deployment
Spot Instance Interruptions (ITN)
Scheduled Events
EC2 Instance Rebalance Recommendation
ASG Lifecycle Hooks
EC2 Status Changes
Setup Required

Installation and Configuration

The aws-node-termination-handler can operate in two different modes: IMDS Processor and Queue Processor. The enableSqsTerminationDraining helm configuration key or the ENABLE_SQS_TERMINATION_DRAINING environment variable are used to enable the Queue Processor mode of operation. If enableSqsTerminationDraining is set to true, then IMDS paths will NOT be monitored. If the enableSqsTerminationDraining is set to false, then IMDS Processor Mode will be enabled. Queue Processor Mode and IMDS Processor Mode cannot be run at the same time.

IMDS Processor Mode allows for a fine-grained configuration of IMDS paths that are monitored. There are currently 3 paths supported that can be enabled or disabled by using the following helm configuration keys:

  • enableSpotInterruptionDraining
  • enableRebalanceMonitoring
  • enableScheduledEventDraining

By default, IMDS mode will only Cordon in response to a Rebalance Recommendation event (all other events are Cordoned and Drained). Cordon is the default for a rebalance event because it's not known if an ASG is being utilized and if that ASG is configured to replace the instance on a rebalance event. If you are using an ASG w/ rebalance recommendations enabled, then you can set the enableRebalanceDraining flag to true to perform a Cordon and Drain when a rebalance event is received.

The enableSqsTerminationDraining must be set to false for these configuration values to be considered.

The Queue Processor Mode does not allow for fine-grained configuration of which events are handled through helm configuration keys. Instead, you can modify your Amazon EventBridge rules to not send certain types of events to the SQS Queue so that NTH does not process those events. All events when operating in Queue Processor mode are Cordoned and Drained unless the cordon-only flag is set to true.

The enableSqsTerminationDraining flag turns on Queue Processor Mode. When Queue Processor Mode is enabled, IMDS mode cannot be active. NTH cannot respond to queue events AND monitor IMDS paths. Queue Processor Mode still queries for node information on startup, but this information is not required for normal operation, so it is safe to disable IMDS for the NTH pod.

AWS Node Termination Handler - IMDS Processor

Installation and Configuration

The termination handler DaemonSet installs into your cluster a ServiceAccount, ClusterRole, ClusterRoleBinding, and a DaemonSet. All four of these Kubernetes constructs are required for the termination handler to run properly.

Kubectl Apply

You can use kubectl to directly add all of the above resources with the default configuration into your cluster.

kubectl apply -f https://github.com/aws/aws-node-termination-handler/releases/download/v1.16.5/all-resources.yaml

For a full list of releases and associated artifacts see our releases page.

Helm

The easiest way to configure the various options of the termination handler is via helm. The chart for this project is hosted in the eks-charts repository.

To get started you need to add the eks-charts repo to helm

helm repo add eks https://aws.github.io/eks-charts

Once that is complete you can install the termination handler. We've provided some sample setup options below.

Zero Config:

helm upgrade --install aws-node-termination-handler \
  --namespace kube-system \
  eks/aws-node-termination-handler

Enabling Features:

helm upgrade --install aws-node-termination-handler \
  --namespace kube-system \
  --set enableSpotInterruptionDraining="true" \
  --set enableRebalanceMonitoring="true" \
  --set enableScheduledEventDraining="false" \
  eks/aws-node-termination-handler

The enable* configuration flags above enable or disable IMDS monitoring paths.

Running Only On Specific Nodes:

helm upgrade --install aws-node-termination-handler \
  --namespace kube-system \
  --set nodeSelector.lifecycle=spot \
  eks/aws-node-termination-handler

Webhook Configuration:

helm upgrade --install aws-node-termination-handler \
  --namespace kube-system \
  --set webhookURL=https://hooks.slack.com/services/YOUR/SLACK/URL \
  eks/aws-node-termination-handler

Alternatively, pass Webhook URL as a Secret:

WEBHOOKURL_LITERAL="webhookurl=https://hooks.slack.com/services/YOUR/SLACK/URL"

kubectl create secret -n kube-system generic webhooksecret --from-literal=$WEBHOOKURL_LITERAL
helm upgrade --install aws-node-termination-handler \
  --namespace kube-system \
  --set webhookURLSecretName=webhooksecret \
  eks/aws-node-termination-handler

For a full list of configuration options see our Helm readme.

AWS Node Termination Handler - Queue Processor (requires AWS IAM Permissions)

Infrastructure Setup

The termination handler requires some infrastructure prepared before deploying the application. In a multi-cluster environment, you will need to repeat the following steps for each cluster.

You'll need the following AWS infrastructure components:

  1. Amazon Simple Queue Service (SQS) Queue
  2. AutoScaling Group Termination Lifecycle Hook
  3. Amazon EventBridge Rule
  4. IAM Role for the aws-node-termination-handler Queue Processing Pods

1. Create an SQS Queue:

Here is the AWS CLI command to create an SQS queue to hold termination events from ASG and EC2, although this should really be configured via your favorite infrastructure-as-code tool like CloudFormation (template here) or Terraform:

## Queue Policy
$ QUEUE_POLICY=$(cat <<EOF
{
    "Version": "2012-10-17",
    "Id": "MyQueuePolicy",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {
            "Service": ["events.amazonaws.com", "sqs.amazonaws.com"]
        },
        "Action": "sqs:SendMessage",
        "Resource": [
            "arn:aws:sqs:${AWS_REGION}:${ACCOUNT_ID}:${SQS_QUEUE_NAME}"
        ]
    }]
}
EOF
)

## make sure the queue policy is valid JSON
$ echo "$QUEUE_POLICY" | jq .

## Save queue attributes to a temp file
$ cat << EOF > /tmp/queue-attributes.json
{
  "MessageRetentionPeriod": "300",
  "Policy": "$(echo $QUEUE_POLICY | sed 's/\"/\\"/g' | tr -d -s '\n' " ")"
}
EOF

$ aws sqs create-queue --queue-name "${SQS_QUEUE_NAME}" --attributes file:///tmp/queue-attributes.json

If you are sending Lifecycle termination events from ASG directly to SQS, instead of through EventBridge, then you will also need to create an IAM service role to give Amazon EC2 Auto Scaling access to your SQS queue. Please follow these linked instructions to create the IAM service role: link. Note the ARNs for the SQS queue and the associated IAM role for Step 2.

There are some caveats when using server side encryption with SQS:

2. Create an ASG Termination Lifecycle Hook:

Here is the AWS CLI command to create a termination lifecycle hook on an existing ASG when using EventBridge, although this should really be configured via your favorite infrastructure-as-code tool like CloudFormation or Terraform:

$ aws autoscaling put-lifecycle-hook \
  --lifecycle-hook-name=my-k8s-term-hook \
  --auto-scaling-group-name=my-k8s-asg \
  --lifecycle-transition=autoscaling:EC2_INSTANCE_TERMINATING \
  --default-result=CONTINUE \
  --heartbeat-timeout=300

If you want to avoid using EventBridge and instead send ASG Lifecycle events directly to SQS, instead use the following command, using the ARNs from Step 1:

$ aws autoscaling put-lifecycle-hook \
  --lifecycle-hook-name=my-k8s-term-hook \
  --auto-scaling-group-name=my-k8s-asg \
  --lifecycle-transition=autoscaling:EC2_INSTANCE_TERMINATING \
  --default-result=CONTINUE \
  --heartbeat-timeout=300 \
  --notification-target-arn <your test queue ARN here> \
  --role-arn <your SQS access role ARN here>

3. Tag the ASGs:

By default the aws-node-termination-handler will only manage terminations for ASGs tagged w/ key=aws-node-termination-handler/managed

$ aws autoscaling create-or-update-tags \
  --tags ResourceId=my-auto-scaling-group,ResourceType=auto-scaling-group,Key=aws-node-termination-handler/managed,Value=,PropagateAtLaunch=true

The value of the key does not matter.

This functionality is helpful in accounts where there are ASGs that do not run kubernetes nodes or you do not want aws-node-termination-handler to manage their termination lifecycle. However, if your account is dedicated to ASGs for your kubernetes cluster, then you can turn off the ASG tag check by setting the flag --check-asg-tag-before-draining=false or environment variable CHECK_ASG_TAG_BEFORE_DRAINING=false.

You can also control what resources NTH manages by adding the resource ARNs to your Amazon EventBridge rules.

Take a look at the docs on how to create rules that only manage certain ASGs here.

See all the different events docs here.

4. Create Amazon EventBridge Rules

You may skip this step if sending events from ASG to SQS directly.

Here are AWS CLI commands to create Amazon EventBridge rules so that ASG termination events, Spot Interruptions, Instance state changes, Rebalance Recommendations, and AWS Health Scheduled Changes are sent to the SQS queue created in the previous step. This should really be configured via your favorite infrastructure-as-code tool like CloudFormation (template here) or Terraform:

$ aws events put-rule \
  --name MyK8sASGTermRule \
  --event-pattern "{\"source\":[\"aws.autoscaling\"],\"detail-type\":[\"EC2 Instance-terminate Lifecycle Action\"]}"

$ aws events put-targets --rule MyK8sASGTermRule \
  --targets "Id"="1","Arn"="arn:aws:sqs:us-east-1:123456789012:MyK8sTermQueue"

$ aws events put-rule \
  --name MyK8sSpotTermRule \
  --event-pattern "{\"source\": [\"aws.ec2\"],\"detail-type\": [\"EC2 Spot Instance Interruption Warning\"]}"

$ aws events put-targets --rule MyK8sSpotTermRule \
  --targets "Id"="1","Arn"="arn:aws:sqs:us-east-1:123456789012:MyK8sTermQueue"

$ aws events put-rule \
  --name MyK8sRebalanceRule \
  --event-pattern "{\"source\": [\"aws.ec2\"],\"detail-type\": [\"EC2 Instance Rebalance Recommendation\"]}"

$ aws events put-targets --rule MyK8sRebalanceRule \
  --targets "Id"="1","Arn"="arn:aws:sqs:us-east-1:123456789012:MyK8sTermQueue"

$ aws events put-rule \
  --name MyK8sInstanceStateChangeRule \
  --event-pattern "{\"source\": [\"aws.ec2\"],\"detail-type\": [\"EC2 Instance State-change Notification\"]}"

$ aws events put-targets --rule MyK8sInstanceStateChangeRule \
  --targets "Id"="1","Arn"="arn:aws:sqs:us-east-1:123456789012:MyK8sTermQueue"

$ aws events put-rule \
  --name MyK8sScheduledChangeRule \
  --event-pattern "{\"source\": [\"aws.health\"],\"detail-type\": [\"AWS Health Event\"],\"detail\": {\"service\": [\"EC2\"],\"eventTypeCategory\": [\"scheduledChange\"]}}"

$ aws events put-targets --rule MyK8sScheduledChangeRule \
  --targets "Id"="1","Arn"="arn:aws:sqs:us-east-1:123456789012:MyK8sTermQueue"

5. Create an IAM Role for the Pods

There are many different ways to allow the aws-node-termination-handler pods to assume a role:

  1. Amazon EKS IAM Roles for Service Accounts
  2. IAM Instance Profiles for EC2
  3. Kiam
  4. kube2iam

IAM Policy for aws-node-termination-handler Deployment:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "autoscaling:CompleteLifecycleAction",
                "autoscaling:DescribeAutoScalingInstances",
                "autoscaling:DescribeTags",
                "ec2:DescribeInstances",
                "sqs:DeleteMessage",
                "sqs:ReceiveMessage"
            ],
            "Resource": "*"
        }
    ]
}

Installation

Helm

The easiest and most commonly used method to configure the termination handler is via helm. The chart for this project is hosted in the eks-charts repository.

To get started you need to add the eks-charts repo to helm

helm repo add eks https://aws.github.io/eks-charts

Once that is complete you can install the termination handler. We've provided some sample setup options below.

Minimal Config:

helm upgrade --install aws-node-termination-handler \
  --namespace kube-system \
  --set enableSqsTerminationDraining=true \
  --set queueURL=https://sqs.us-east-1.amazonaws.com/0123456789/my-term-queue \
  eks/aws-node-termination-handler

Webhook Configuration:

helm upgrade --install aws-node-termination-handler \
  --namespace kube-system \
  --set enableSqsTerminationDraining=true \
  --set queueURL=https://sqs.us-east-1.amazonaws.com/0123456789/my-term-queue \
  --set webhookURL=https://hooks.slack.com/services/YOUR/SLACK/URL \
  eks/aws-node-termination-handler

Alternatively, pass Webhook URL as a Secret:

WEBHOOKURL_LITERAL="webhookurl=https://hooks.slack.com/services/YOUR/SLACK/URL"

kubectl create secret -n kube-system generic webhooksecret --from-literal=$WEBHOOKURL_LITERAL
helm upgrade --install aws-node-termination-handler \
  --namespace kube-system \
  --set enableSqsTerminationDraining=true \
  --set queueURL=https://sqs.us-east-1.amazonaws.com/0123456789/my-term-queue \
  --set webhookURLSecretName=webhooksecret \
  eks/aws-node-termination-handler

For a full list of configuration options see our Helm readme.

Kubectl Apply

Queue Processor needs an sqs queue url to function; therefore, manifest changes are REQUIRED before using kubectl to directly add all of the above resources into your cluster.

Minimal Config:

curl -L https://github.com/aws/aws-node-termination-handler/releases/download/v1.16.5/all-resources-queue-processor.yaml -o all-resources-queue-processor.yaml
<open all-resources-queue-processor.yaml and update QUEUE_URL value>
kubectl apply -f ./all-resources-queue-processor.yaml

For a full list of releases and associated artifacts see our releases page.

Use with Kiam

Use with Kiam

If you are using IMDS mode which defaults to hostNetworking: true, or if you are using queue-processor mode, then this section does not apply. The configuration below only needs to be used if you are explicitly changing NTH IMDS mode to hostNetworking: false .

To use the termination handler alongside Kiam requires some extra configuration on Kiam's end. By default Kiam will block all access to the metadata address, so you need to make sure it passes through the requests the termination handler relies on.

To add a whitelist configuration, use the following fields in the Kiam Helm chart values:

agent.whiteListRouteRegexp: '^\/latest\/meta-data\/(spot\/instance-action|events\/maintenance\/scheduled|instance-(id|type)|public-(hostname|ipv4)|local-(hostname|ipv4)|placement\/availability-zone)|\/latest\/dynamic\/instance-identity\/document$'

Or just pass it as an argument to the kiam agents:

kiam agent --whitelist-route-regexp='^\/latest\/meta-data\/(spot\/instance-action|events\/maintenance\/scheduled|instance-(id|type)|public-(hostname|ipv4)|local-(hostname|ipv4)|placement\/availability-zone)|\/latest\/dynamic\/instance-identity\/document$'

Metadata endpoints

The termination handler relies on the following metadata endpoints to function properly:

/latest/dynamic/instance-identity/document
/latest/meta-data/spot/instance-action
/latest/meta-data/events/recommendations/rebalance
/latest/meta-data/events/maintenance/scheduled
/latest/meta-data/instance-id
/latest/meta-data/instance-life-cycle
/latest/meta-data/instance-type
/latest/meta-data/public-hostname
/latest/meta-data/public-ipv4
/latest/meta-data/local-hostname
/latest/meta-data/local-ipv4
/latest/meta-data/placement/availability-zone

Building

For build instructions please consult BUILD.md.

Metrics

Available Prometheus metrics:

Metric name Description
actions_node Number of actions per node
events_error Number of errors in events processing

Communication

Contributing

Contributions are welcome! Please read our guidelines and our Code of Conduct

License

This project is licensed under the Apache-2.0 License.




鲜花

握手

雷人

路过

鸡蛋
该文章已有0人参与评论

请发表评论

全部评论

专题导读
热门推荐
阅读排行榜

扫描微信二维码

查看手机版网站

随时了解更新最新资讯

139-2527-9053

在线客服(服务时间 9:00~18:00)

在线QQ客服
地址:深圳市南山区西丽大学城创智工业园
电邮:jeky_zhao#qq.com
移动电话:139-2527-9053

Powered by 互联科技 X3.4© 2001-2213 极客世界.|Sitemap