Exercises in code reading (I): fixing a Kubernetes AWS cloud provider bug

I consider reading code one of the most important skills engineers use in their day-to-day work. Most of us normally work with a myriad of different projects that were not written by us: open source libraries, internal libraries and programs that the company produced over time, even code written by you a long time ago, which by now might as well be considered external.

In this series of posts I’m gonna present a few tips and tricks I have learned over the years to quickly get used to, learn and understand an unknown code base. This is useful, for example, to understand how a library works under the hood, to debug issues caused by external code, or simply to introduce yourself to a different code base for the sake of learning. This can be applied to any language, but in this series we will use Golang, a language I’m interested in.

The first post in this series is gonna be about a Kubernetes bug we stumbled upon while operating our cluster at the company I work for (Scopely). When this bug hit, the specific cluster involved (which is managed using kops) was down for about 2 hours, until we were able to find the issue and revert, so we decided to investigate further into the provider code to find out what happened.

After that we filed a bug report, available here: https://github.com/kubernetes/kubernetes/issues/84926, and forgot about it. However, I always thought it would be fun to actually dig in and try to fix this issue, as a learning experience.

This might seem daunting at the beginning, but in fact it’s not that bad, and that would be my first learning from this process: don’t assume you won’t be able to understand or contribute to big, unknown codebases; most likely you will be surprised to find that, at the end of the day, they are not that different from the internal programs and libraries you manage in your job, so don’t be afraid to try!

A first overview of the code

The first thing I do is clone the code locally and open it with my editor of choice. In terms of tooling, I would say it’s a matter of using what works best for you. Additionally, I think it’s very important to master your tools as soon as possible, so you can navigate the code and try things out faster. It’s also worth investing time in your tooling from time to time: new and improved tools are always being developed, and you never know when one of them might be just the thing you need to optimize that little piece of your day-to-day procedures.

The first thing I try to do is examine the directory structure, to get an idea of how big the code is, how many modules there are, how they are structured, how big the files generally are, how many languages are embedded in it, etc. This gives a high-level overview of the complexity of the program.

For Kubernetes specifically I go first into the parts of the project that I’m already familiar with, for example kubectl, the kubelet, the apiserver, kube-proxy, etc. From there we can look for a main() entry point and start our digging process. We can follow the main method to get an idea of how things are abstracted and what libraries are being pulled in, and finally land on the implementation methods we might be interested in.
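
For example, from the root of the kubernetes/kubernetes checkout, something as simple as the following (just illustrative commands, any equivalent in your own tooling works) gives a quick map of the binaries and their entry points:

ls cmd/
grep -rln "func main()" cmd/ | head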

Quickly navigating the code back and forth is important, so you should be able to do so with ease; entering and exiting functions and modules, looking at where things are defined, examining different objects, etc. should all be a smooth and fast process.

For this specific issue, I quickly found out that the cloud provider, a feature that allows Kubernetes to run tightly integrated with AWS, GCE, etc., used to live in the cloudprovider package but was recently extracted out of tree to legacy-cloud-providers as part of an ongoing effort to decouple k8s from cloud providers.

This meant cloning another repo (cloud-provider-aws), which is not part of the Kubernetes core repo even though most of its implementation is still located there. This new repository contains the main aws-controller-manager code and entrypoint, and it is where we are gonna focus our efforts to fix this issue.

What about documentation?

Documentation is a great resource to have when going through a new code base. This depends greatly on the project: some projects are super well documented, and their documentation has a track record of being accurate and trustworthy. However, not all projects are like that: sometimes the documentation is faulty, outdated or simply unable to explain exactly what’s going on under the hood.

In general I always try to dig into the code to see exactly what’s going on. In my experience I’ve always been able to make sense of something once I saw the code, whereas documentation has at times proven untrustworthy and outright misleading.

The bottom line is: read the documentation with care, but only trust the code!

Development Kubernetes environment

The first thing we need is the Kubernetes source code and a way of deploying it somewhere so we can test and debug. To do so we will be using kind, which is a Kubernetes distribution made specifically for this purpose and runs in a single Docker container.

Installing kind is easy enough and instructions can be found here: https://kind.sigs.k8s.io/docs/user/quick-start/. As I dug deeper into the code I realized I needed a custom Kubernetes deployment, which is also easy enough to achieve using kind and its YAML-based configuration. The kind configuration I used ended up like this:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
kubeadmConfigPatches:
- |
  kind: ClusterConfiguration
  metadata:
    name: config
  apiServer:
    extraArgs:
      "cloud-provider": "external"
  controllerManager:
    extraArgs:
      "cloud-provider": "external"
- |
  kind: InitConfiguration
  metadata:
    name: config
  nodeRegistration:
    kubeletExtraArgs:
      "cloud-provider": "external"
      "provider-id": "i-0123456"

Development cycle

Once I got a good understanding of kind and the parts of the code I was gonna have to modify, I tried to get my development cycle ramped up as soon as possible. The idea is to be able to make changes, run, debug and troubleshoot as quickly as possible, since we are gonna have to repeat that cycle over and over.

I figured out two ways of doing this:

Pushing compiled changes to kind

kind works by deploying the whole Kubernetes control plane in a single container: an Ubuntu-based Docker image with Kubernetes installed and configured, and its various processes run using systemd. With this in mind, one way of testing our code is to compile on our machine and then push the binary to the Docker container:

pwd                                                                                  
~/Development/cloud-provider-aws
GOOS=linux make                                                                      
CGO_ENABLED=0 GOOS=linux go build \
		-ldflags "-w -s -X 'main.version=452ec35f-dirty'" \
		-o aws-cloud-controller-manager \
		cmd/aws-cloud-controller-manager/main.go

Then we can just push to the docker container:

docker ps
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS              PORTS                       NAMES
1f7cd8ce40aa        kindest/node:v1.17.0   "/usr/local/bin/entr…"   6 hours ago         Up 6 hours          127.0.0.1:32771->6443/tcp   test-control-plane

docker cp aws-cloud-controller-manager 1f7cd8ce40aa:/kind/bin/

Finally, we can bash our way into the container and invoke the program:

docker exec -it 1f7cd8ce40aa bash
/kind/bin/aws-cloud-controller-manager --cloud-provider=aws --cloud-config=/kind/bin/awsconfig --kubeconfig /etc/kubernetes/controller-manager.conf --use-service-account-credentials --leader-elect=false --master="https://localhost:6443" -v 5

This makes it possible to run the program as expected. However, it also requires pushing a piece of config (“awsconfig”) that contains the region and a few other small things, as well as having properly configured AWS credentials in the environment. This setup is also difficult to debug, so I ended up doing something different.
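
For reference, the awsconfig file is a small INI-style file; a minimal sketch (the exact field names depend on the provider version, so treat this as an illustration) looks like this:

[Global]
Zone = us-east-1a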

Debugging locally

What I ended up doing is to just invoke the program locally and connect to the cluster from the outside. This required bringing the k8s credentials to my local machine (controller-manager.conf) and configuring my personal AWS environment. With that, I can invoke the program from my IDE (GoLand) and debug just fine.
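
A minimal sketch of that setup (container ID and host-mapped port taken from the docker ps output above; the flags mirror the in-container invocation):

# copy the kubeconfig out of the kind container
docker cp 1f7cd8ce40aa:/etc/kubernetes/controller-manager.conf .
# edit the server: field inside it to point at the host-mapped API server,
# i.e. https://127.0.0.1:32771, then run the controller locally
./aws-cloud-controller-manager --cloud-provider=aws \
    --kubeconfig ./controller-manager.conf --leader-elect=false -v 5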

With these two ways of running our code I have my setup ready for what’s coming next: actually making the controller work.

Challenges of debugging a cloud provider

One of the challenges of debugging this piece of code is that it’s tightly coupled with the cloud provider (in this case AWS). This means we are gonna have to hack around in order to make the program think it’s running on AWS. The first challenge is this little piece of code here:

func (p *awsSDKProvider) Metadata() (EC2Metadata, error) {
	sess, err := session.NewSession(&aws.Config{
		EndpointResolver: p.cfg.getResolver(),
	})
	if err != nil {
		return nil, fmt.Errorf("unable to initialize AWS session: %v", err)
	}
	client := ec2metadata.New(sess)
	p.addAPILoggingHandlers(&client.Handlers)
	return client, nil
}

As you can see, the cloud provider requests the instance metadata (thinking it’s running on an EC2 machine) and, of course, is unable to do so (which exits the program early). To make the program believe it’s running on AWS, we can use the handy aws_fakes module to inject custom metadata:

func (p *awsSDKProvider) Metadata() (EC2Metadata, error) {
	s := &FakeAWSServices{}
	s.region = "us-east-1"

	selfInstance := &ec2.Instance{}
	selfInstance.InstanceId = aws.String("i-0123456789")
	selfInstance.Placement = &ec2.Placement{
		AvailabilityZone: aws.String("us-east-1a"),
	}
	selfInstance.PrivateDnsName = aws.String("ip-172-20-0-100.ec2.internal")
	selfInstance.PrivateIpAddress = aws.String("192.168.0.1")
	selfInstance.PublicIpAddress = aws.String("1.2.3.4")
	s.selfInstance = selfInstance

	client := FakeMetadata{aws: s}
	return &client, nil
}

This hack allows us to continue further down the rabbit hole; remember that our objective here is to become productive and fix an issue, so it’s fine to just hack your way around anything that is not essential (and it’s also fun!).

Of course this requires a real EC2 machine in the cloud, but guess what? As long as there is a private IP assigned (we don’t care about the public one), we don’t even have to have the node running (so it can be kept stopped at practically zero cost!).
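
As an illustration, keeping it stopped with the AWS CLI (the instance ID matching the fake metadata above):

aws ec2 stop-instances --instance-ids i-0123456789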

Remember that kind.yaml I referred to before? Well, it’s exactly what is needed to initialize the Kubernetes services (kubelet, etc.) in “cloud” mode. These services additionally do some checks to ensure they are running in the specified cloud. One specific check involves this piece of code:

func ensureNodeProvidedIPExists(node *v1.Node, nodeAddresses []v1.NodeAddress) (*v1.NodeAddress, bool) {
	var nodeIP *v1.NodeAddress
	nodeIPExists := false
	if providedIP, ok := node.ObjectMeta.Annotations[kubeletapis.AnnotationProvidedIPAddr]; ok {
		nodeIPExists = true
		for i := range nodeAddresses {
			if nodeAddresses[i].Address == providedIP {
				nodeIP = &nodeAddresses[i]
				break
			}
		}
	}
	return nodeIP, nodeIPExists
}

kind sets the provided node IP to the container’s local network address, and as you can expect this check won’t pass and the node won’t be recognized as a cloud node. Although Kubernetes still works without passing this check, we need the node to think it’s part of AWS so our aws-cloud-controller-manager can acknowledge it and we can move further towards reproducing our original issue.

Doing this is easy enough: we just need to edit the node annotation using the standard kubectl edit node tool. Once I did this, the controller-manager was able to recognize the node as a cloud one and consider it for internal operations.
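
For reference, here is a sketch of what that edit boils down to, using kubectl annotate instead of an interactive edit (node name as shown by docker ps, IP matching the fake metadata we injected):

kubectl annotate node test-control-plane --overwrite \
    alpha.kubernetes.io/provided-node-ip=192.168.0.1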

For our specific issue, we also need this single node (which is the master as well) to be considered for load balancing. By default, master nodes are excluded, and there are a couple of ways of changing this behaviour. One is using feature gates (which kind supports) to disable the LegacyNodeRoleBehavior feature, making the master part of the schedulable nodes for workloads. The other way is just removing the master label node-role.kubernetes.io/master.
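
Both options are small. The feature-gate route is a couple of extra lines in the kind config (a sketch relying on kind’s featureGates support):

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  LegacyNodeRoleBehavior: false

While the label route is a one-liner using kubectl’s label-removal syntax:

kubectl label node test-control-plane node-role.kubernetes.io/master-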

You can easily deduce what your needs are, and why the code is not behaving the way you want, by examining the source; from there you can decide which path to follow:

		if utilfeature.DefaultFeatureGate.Enabled(legacyNodeRoleBehaviorFeature) {
			// As of 1.6, we will taint the master, but not necessarily mark it unschedulable.
			// Recognize nodes labeled as master, and filter them also, as we were doing previously.
			if _, hasMasterRoleLabel := node.Labels[labelNodeRoleMaster]; hasMasterRoleLabel {
				return false
			}
		}
		if utilfeature.DefaultFeatureGate.Enabled(serviceNodeExclusionFeature) {
			// Will be removed in 1.18
			if _, hasExcludeBalancerLabel := node.Labels[labelAlphaNodeRoleExcludeBalancer]; hasExcludeBalancerLabel {
				return false
			}
			if _, hasExcludeBalancerLabel := node.Labels[labelNodeRoleExcludeBalancer]; hasExcludeBalancerLabel {
				return false
			}
		}

Fortunately, Kubernetes has a very well designed mapping between flags, options, etc. and symbols in the code, so looking things up from the documentation to the code and the other way around is a piece of cake!

Fixing the issue

Once we are able to make changes and debug quickly, and as we get more and more used to the code, delivering the fix becomes feasible. In this case, the development cycle also involved going back and forth to the AWS account to set up the proper conditions to reproduce the issue.

This is the service definition we are gonna use to reproduce the bug (taken from the original bug report):

apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-additional-resource-tags: Key=Value
    service.beta.kubernetes.io/aws-load-balancer-extra-security-groups: sg-00001111
  name: test-service
  namespace: test
spec:
  loadBalancerSourceRanges:
  - 10.0.0.0/8
  ports:
  - name: port-name
    port: 10000
    protocol: TCP
    targetPort: 10000
  selector:
    label: "true"
  sessionAffinity: None
  type: LoadBalancer
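
Applying it is then just the usual (assuming the manifest is saved as test-service.yaml):

kubectl apply -f test-service.yaml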

The security group sg-00001111 already exists and is assigned to our fake node (the stopped EC2 machine). Additionally, a rule exists that allows all traffic from that security group to itself. As far as I know, this is something kops does for you when creating a cluster from scratch, so I reproduced it here manually.
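
Reproducing that rule amounts to something like the following AWS CLI call (a sketch; -1 stands for all protocols):

aws ec2 authorize-security-group-ingress --group-id sg-00001111 \
    --protocol=-1 --source-group sg-00001111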

When we apply this manifest in our kind cluster while the aws-cloud-controller-manager is running, it will start the process of creating the load balancer. This creates two resources:

  • A new security group: sg-00002222 with the source ranges and ports specified in the service spec
  • A load balancer with two security groups: sg-00001111 and sg-00002222

The bug manifests in two places:

  • When ensuring that the security group rule allowing traffic from the ELB to the nodes’ security group is in place
  • When removing the load balancer and its associated security groups

The first one, in turn, happens in two phases:

  • When we first create the service and this couple of resources is created: if we have more than one security group (like in this example), the algorithm will pick just the one returned last in the API response. This order is random as far as I could see, so sometimes it works, sometimes it doesn’t.
  • When we remove the service, the same code is used to determine which traffic rules need to be removed. If the wrong security group (sg-00001111) is returned, the wrong traffic rule will be removed, namely the rule that allows traffic between nodes.

Finally, when removing the load balancer, the controller will actually fail with an error (and time out after 10 minutes), because it tries to also remove the entire security group, assuming it has no traffic rules left (it expected the previous procedure to have removed them all).

After understanding the problem and being able to reproduce it successfully, writing a fix becomes simple; it boils down to adding a couple of new methods:

// Returns a list of all the extra security groups as annotated in the service definition
func (c *Cloud) collectExtraSecurityGroupsForService(service *v1.Service) []string {
	extraSgList := []string{}

	for _, extraSG := range strings.Split(service.ObjectMeta.Annotations[ServiceAnnotationLoadBalancerExtraSecurityGroups], ",") {
		extraSG = strings.TrimSpace(extraSG)
		if len(extraSG) > 0 {
			extraSgList = append(extraSgList, extraSG)
		}
	}

	return extraSgList
}

// Checks whether a given security group is inside the extra security group annotation of a service definition
func (c *Cloud) checkSecurityGroupMatchesServiceExtraSg(service *v1.Service, securityGroup *string) bool {
	for _, extraSg := range c.collectExtraSecurityGroupsForService(service) {
		if extraSg == *securityGroup {
			return true
		}
	}

	return false
}

And then using them in the places where they are needed:

func (c *Cloud) updateInstanceSecurityGroupsForLoadBalancer(lb *elb.LoadBalancerDescription, instances map[InstanceID]*ec2.Instance, service *v1.Service) error {
	if c.cfg.Global.DisableSecurityGroupIngress {
		return nil
	}

	// Determine the load balancer security group id
	loadBalancerSecurityGroupID := ""

	for _, securityGroup := range lb.SecurityGroups {
		if aws.StringValue(securityGroup) == "" {
			continue
		}

		if c.checkSecurityGroupMatchesServiceExtraSg(service, securityGroup) {
			continue
		}
		
...
func (c *Cloud) EnsureLoadBalancerDeleted(ctx context.Context, clusterName string, service *v1.Service) error {
	loadBalancerName := c.GetLoadBalancerName(ctx, clusterName, service)

	...

		// Collect the security groups to delete
		securityGroupIDs := map[string]struct{}{}

		for _, sg := range response {
			sgID := aws.StringValue(sg.GroupId)

			if sgID == c.cfg.Global.ElbSecurityGroup {
				// We don't want to delete a security group that was defined in the Cloud Configuration.
				continue
			}
			if sgID == "" {
				klog.Warningf("Ignoring empty security group in %s", service.Name)
				continue
			}

			if c.checkSecurityGroupMatchesServiceExtraSg(service, sg.GroupId) {
				continue
			}

...

Once again, since we already have our environment properly set up, running the code and checking that it works as expected is trivial. At this point we can keep iterating on the code in case we want to improve it further. As a next step, it would be great to have a test case written for this, to ensure there are no regressions in the future.
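
A regression test for the annotation-parsing helper could look roughly like this (a sketch, written as if it lived in the same aws package as the fix, using only the package’s existing imports plus testing and reflect):

func TestCollectExtraSecurityGroupsForService(t *testing.T) {
	c := &Cloud{}

	service := &v1.Service{}
	service.ObjectMeta.Annotations = map[string]string{
		ServiceAnnotationLoadBalancerExtraSecurityGroups: "sg-00001111, sg-00002222",
	}

	// Whitespace should be trimmed and empty entries dropped
	got := c.collectExtraSecurityGroupsForService(service)
	want := []string{"sg-00001111", "sg-00002222"}
	if !reflect.DeepEqual(got, want) {
		t.Errorf("collectExtraSecurityGroupsForService() = %v, want %v", got, want)
	}

	// And the membership check should recognize an annotated group
	if !c.checkSecurityGroupMatchesServiceExtraSg(service, aws.String("sg-00001111")) {
		t.Errorf("expected sg-00001111 to match the service's extra security groups")
	}
}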

Conclusions

We have seen in this post how to start from scratch in an unknown code base, with a few techniques to quickly understand specific parts of the code, get it running as soon as possible, and hack our way around to reach the parts we are interested in.

This is useful not only for contributing to an open source project or simply understanding how something works, but also because it gives you an idea of what the systems you operate on a daily basis are made of. If you get the chance to take a closer look at this specific problem, you will find that the code is not very well structured, or clean for that matter. We sometimes believe that just because something is published as an open source project and widely used by enterprises around the world, we are not gonna be able to understand how it works or contribute to it. At the end of the day these programs are written by people like you and me, and they have the same defects that come from humans translating ideas into code in a complex reality.