More complicated EKS scenarios: EKS managed worker nodes without internet access
Are you using EKS managed node groups? If you don’t have any specific reason not to, you should. They save tons of time and boost the “managed Kubernetes” feeling. However, this pretty new offering did not cover one specific use case: a cluster with no internet access.
By default, workers need internet access so they can pull Docker images and register with the control plane. On the other hand, having no internet access is a pretty common requirement, especially in regulated businesses.
So how do we find some common ground here?
For the record, it really does not work by default: when we put everything into private subnets with no outbound internet access, the workers can’t join the cluster.
Community to the rescue
If you know me, you know what I did. Search engine! And I was very successful, because I found this GitHub issue after five minutes of searching. What’s going on there?
Mike Stefaniak from the EKS team explains there that it can work, as long as you set up the required PrivateLink (VPC) endpoints correctly.
Moreover, there’s also a link to the GitHub repository with some CloudFormation samples. Challenge accepted. I’m gonna build it 💪
Terraforming the CloudFormation stack
I’m not that much into CloudFormation. Not yet. So I had to rewrite it in Terraform first. Check the key components below; I’ll also add some explanation.
VPC settings
Note the properties enable_dns_support and enable_dns_hostnames: both are required by the private EKS endpoint, and they also make private DNS work for the interface endpoints we’ll create later. See more details in the documentation.
resource "aws_vpc" "main" {
cidr_block = "10.20.0.0/16"
enable_dns_support = true
tags = {
"kubernetes.io/cluster/cl01" = "shared"
}
}
Private subnets
Next, let’s add some basic private subnets.
resource "aws_subnet" "private-0" {
availability_zone = "eu-central-1a"
cidr_block = "10.20.1.0/24"
vpc_id = aws_vpc.main.id
tags = {
"kubernetes.io/cluster/cl01" = "shared"
}
}
resource "aws_subnet" "private-1" {
availability_zone = "eu-central-1b"
cidr_block = "10.20.2.0/24"
vpc_id = aws_vpc.main.id
tags = {
"kubernetes.io/cluster/cl01" = "shared"
}
}
resource "aws_subnet" "private-2" {
availability_zone = "eu-central-1c"
cidr_block = "10.20.3.0/24"
vpc_id = aws_vpc.main.id
tags = {
"kubernetes.io/cluster/cl01" = "shared"
}
}
Endpoint for EC2 API
The worker nodes (and the VPC CNI plugin running on them) call the EC2 API, for example to attach network interfaces, so we expose it through an interface endpoint.
resource "aws_security_group" "endpoint_ec2" {
name = "endpoint-ec2"
vpc_id = aws_vpc.main.id
}
resource "aws_security_group_rule" "endpoint_ec2_443" {
security_group_id = aws_security_group.endpoint_ec2.id
type = "ingress"
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = [
10.20.1.0/24, // private subnet 1
10.20.2.0/24, // private subnet 2
10.20.3.0/24, // private subnet 3
]
}
resource "aws_vpc_endpoint" "ec2" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.eu-central-1.ec2"
vpc_endpoint_type = "Interface"
private_dns_enabled = true
subnet_ids = [
aws_subnet.private-0.id,
aws_subnet.private-1.id,
aws_subnet.private-2.id,
]
security_group_ids = [
aws_security_group.endpoint_ec2.id,
]
}
Endpoint for ECR/Docker APIs
ECR needs two endpoints: ecr.api for the registry API calls and ecr.dkr for the actual Docker image pulls.
resource "aws_security_group" "endpoint_ecr" {
name = "endpoint-ecr"
vpc_id = aws_vpc.main.id
}
resource "aws_security_group_rule" "endpoint_ecr_443" {
security_group_id = aws_security_group.endpoint_ecr.id
type = "ingress"
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = [
10.20.1.0/24, // private subnet 1
10.20.2.0/24, // private subnet 2
10.20.3.0/24, // private subnet 3
]
}
resource "aws_vpc_endpoint" "ecr_api" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.eu-central-1.ecr.api"
vpc_endpoint_type = "Interface"
private_dns_enabled = true
subnet_ids = [
aws_subnet.private-0.id,
aws_subnet.private-1.id,
aws_subnet.private-2.id,
]
security_group_ids = [
aws_security_group.endpoint_ecr.id,
]
}
resource "aws_vpc_endpoint" "ecr_dkr" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.eu-central-1.ecr.dkr"
vpc_endpoint_type = "Interface"
private_dns_enabled = true
subnet_ids = [
aws_subnet.private-0.id,
aws_subnet.private-1.id,
aws_subnet.private-2.id,
]
security_group_ids = [
aws_security_group.endpoint_ecr.id,
]
}
Endpoint for S3 API
In this part, we’re associating the S3 gateway endpoint with the route tables of our private subnets. See more details in the documentation. We need this endpoint because ECR stores image layers in S3 buckets, so without it the nodes won’t be able to pull the images for the essential services (pods).
resource "aws_vpc_endpoint" "s3" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.eu-central-1.s3"
vpc_endpoint_type = "Gateway"
route_table_ids = [
aws_route_table.private-0.id,
aws_route_table.private-1.id,
aws_route_table.private-2.id,
]
}
resource "aws_route_table" "private-0" {
vpc_id = aws_vpc.main.id
}
resource "aws_route_table" "private-1" {
vpc_id = aws_vpc.main.id
}
resource "aws_route_table" "private-2" {
vpc_id = aws_vpc.main.id
}
resource "aws_route_table_association" "private-0" {
subnet_id = aws_subnet.private-0.id
route_table_id = aws_route_table.private-0.id
}
resource "aws_route_table_association" "private-1" {
subnet_id = aws_subnet.private-1.id
route_table_id = aws_route_table.private-1.id
}
resource "aws_route_table_association" "private-2" {
subnet_id = aws_subnet.private-2.id
route_table_id = aws_route_table.private-2.id
}
EKS
Now we are more or less ready to start the control plane. Please note that I don’t include the IAM resources and security groups here, as they would make the whole post even bigger. But don’t worry, HashiCorp prepared a beautiful tutorial, so you can take them from there.
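Still, for orientation, here is a minimal sketch of the two roles that the snippets below reference (aws_iam_role.default-master and aws_iam_role.default-worker), with the standard AWS managed policies attached. Treat the role names as placeholders; your naming and exact setup may differ.

resource "aws_iam_role" "default-master" {
  name = "eks-master-default" // placeholder name
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "eks.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "eks_master_default_policy" {
  role       = aws_iam_role.default-master.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
}

resource "aws_iam_role_policy_attachment" "eks_master_default_service_policy" {
  role       = aws_iam_role.default-master.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSServicePolicy"
}

resource "aws_iam_role" "default-worker" {
  name = "eks-worker-default" // placeholder name
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "default-AmazonEKSWorkerNodePolicy" {
  role       = aws_iam_role.default-worker.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
}

resource "aws_iam_role_policy_attachment" "default-AmazonEKS_CNI_Policy" {
  role       = aws_iam_role.default-worker.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
}

resource "aws_iam_role_policy_attachment" "default-AmazonEC2ContainerRegistryReadOnly" {
  role       = aws_iam_role.default-worker.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
}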
In the following example, note the property vpc_config.endpoint_private_access. This is the private EKS endpoint I was talking about.
resource "aws_eks_cluster" "master" {
name = var.cluster_name
role_arn = aws_iam_role.default-master.arn
vpc_config {
subnet_ids = [
aws_subnet.private-0.id,
aws_subnet.private-1.id,
aws_subnet.private-2.id,
]
endpoint_private_access = true
endpoint_public_access = true
public_access_cidrs = [
"0.0.0.0/0"
]
}
depends_on = [
aws_iam_role_policy_attachment.eks_master_default_policy,
aws_iam_role_policy_attachment.eks_master_default_service_policy,
]
}
And that’s pretty much it for the control plane. Let’s proceed to managed worker nodes!
resource "aws_eks_node_group" "aza" {
cluster_name = aws_eks_cluster.master.name
node_group_name = "AZa"
node_role_arn = aws_iam_role.default-worker.arn
instance_types = ["t3.small"]
subnet_ids = [
aws_subnet.private-0.id,
]
scaling_config {
desired_size = 1
max_size = 2
min_size = 1
}
lifecycle {
ignore_changes = [
scaling_config[0].desired_size
]
}
depends_on = [
aws_iam_role_policy_attachment.default-AmazonEKSWorkerNodePolicy,
aws_iam_role_policy_attachment.default-AmazonEKS_CNI_Policy,
aws_iam_role_policy_attachment.default-AmazonEC2ContainerRegistryReadOnly,
]
}
resource "aws_eks_node_group" "azb" {
cluster_name = aws_eks_cluster.master.name
node_group_name = "AZb"
node_role_arn = aws_iam_role.default-worker.arn
instance_types = ["t3.small"]
subnet_ids = [
aws_subnet.private-1.id,
]
scaling_config {
desired_size = 1
max_size = 2
min_size = 1
}
lifecycle {
ignore_changes = [
scaling_config[0].desired_size
]
}
depends_on = [
aws_iam_role_policy_attachment.default-AmazonEKSWorkerNodePolicy,
aws_iam_role_policy_attachment.default-AmazonEKS_CNI_Policy,
aws_iam_role_policy_attachment.default-AmazonEC2ContainerRegistryReadOnly,
]
}
resource "aws_eks_node_group" "azc" {
cluster_name = aws_eks_cluster.master.name
node_group_name = "AZc"
node_role_arn = aws_iam_role.default-worker.arn
instance_types = ["t3.small"]
subnet_ids = [
aws_subnet.private-2.id,
]
scaling_config {
desired_size = 1
max_size = 2
min_size = 1
}
lifecycle {
ignore_changes = [
scaling_config[0].desired_size
]
}
depends_on = [
aws_iam_role_policy_attachment.default-AmazonEKSWorkerNodePolicy,
aws_iam_role_policy_attachment.default-AmazonEKS_CNI_Policy,
aws_iam_role_policy_attachment.default-AmazonEC2ContainerRegistryReadOnly,
]
}
Please note that we’re using three managed node groups, one per availability zone. That’s because we’re following the official recommendation for stateful applications and the cluster autoscaler: EBS volumes are bound to a single zone, so the autoscaler needs a separate node group per zone to scale up in the right place. Either way, this setup can’t cause any harm.
Rise of the EKS cluster
We’re ready to go. Let’s give it a spin! The creation of such a setup takes almost 20 minutes. I usually use that time for emailing or Slack conversations, but there might be a better way to utilize it 😄
terraform apply
After 20 minutes, we can view our brand new cluster in the console.
It’s truly beautiful, and guess what: all the Kubernetes components are up and running! This means that the worker nodes were able to pull everything via the configured private endpoints.
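For reference, the output below comes from the usual commands, roughly along these lines (the cluster name cl01 is taken from the subnet tags above; adjust the region and name to your setup):

aws eks update-kubeconfig --region eu-central-1 --name cl01
kubectl get pods --all-namespaces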
NAMESPACE     NAME                       READY   STATUS    RESTARTS   AGE
kube-system   aws-node-687nk             1/1     Running   0          88s
kube-system   aws-node-mppfc             1/1     Running   0          78s
kube-system   aws-node-s9d4z             1/1     Running   0          82s
kube-system   coredns-5b5455fd88-mdzs5   1/1     Running   0          5m19s
kube-system   coredns-5b5455fd88-znqmz   1/1     Running   0          5m19s
kube-system   kube-proxy-d5jbs           1/1     Running   0          82s
kube-system   kube-proxy-fhsxt           1/1     Running   0          88s
kube-system   kube-proxy-w2sbp           1/1     Running   0          78s
Wrap
Let’s summarize the specification of this setup:
- 3 private subnets
- VPC endpoints for ECR
- EC2 endpoint
- S3 endpoint
- EKS control plane with private endpoint
- no internet access
We’ve just witnessed that it is absolutely possible to create totally isolated clusters, even for the most demanding customers. Moreover, the official documentation will reflect this setup soon! Don’t forget that such a setup can’t pull images from public repositories, so there’s always an extra step needed for each application deployed: pushing its images to an ECR repository.
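For completeness, mirroring an image into ECR looks roughly like this. Run it from a machine that can reach both the source registry and ECR (i.e., outside the isolated VPC); the account ID, region, repository name, and tag below are placeholders:

# log in to ECR
aws ecr get-login-password --region eu-central-1 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.eu-central-1.amazonaws.com
# create the repository, then tag and push the locally available image
aws ecr create-repository --repository-name my-app --region eu-central-1
docker tag my-app:1.0 123456789012.dkr.ecr.eu-central-1.amazonaws.com/my-app:1.0
docker push 123456789012.dkr.ecr.eu-central-1.amazonaws.com/my-app:1.0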
Seriously, I’m really glad that AWS is visible on GitHub and that they actually care about the users of their products. I’d be totally lost without the advice I got there.