A practical introduction to Kubernetes internals

Jul 11, 2024

This is a walkthrough of Kubernetes internals.

As the Kubernetes docs already mention, the architecture of Kubernetes has three main components that matter for this walkthrough:

api-server
controller-manager
etcd

Types

Versioned types

The api-server is the only interface to etcd (the database). It is used by external clients, but also by the controllers in controller-manager.
Among other things, it defines the schema for resources, both in the API and in the database (etcd).

Resources exposed in the API are grouped under an apiVersion, which we usually write at the beginning of a YAML manifest, e.g. apps/v1 and apps/v1beta1:

Unversioned types

However, these resources can usually also be found in an "unversioned" group in the code:

Unversioned types

A significant difference between versioned and unversioned types is that unversioned types don't have any json or proto struct tags.
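To make the difference concrete, here is a minimal sketch with two hypothetical, heavily simplified structs (VersionedSpec and InternalSpec are illustration names, not the real Kubernetes definitions). The versioned one carries json tags because it crosses the API and storage boundary; the internal one is plain Go.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// VersionedSpec is a hypothetical versioned type: every field has a json tag
// because the struct is serialized for clients and for the database.
type VersionedSpec struct {
	Replicas int32 `json:"replicas,omitempty"`
	Paused   bool  `json:"paused,omitempty"`
}

// InternalSpec is a hypothetical unversioned type: same data, no tags.
type InternalSpec struct {
	Replicas int32
	Paused   bool
}

// hasJSONTags reports whether any field of the given struct value has a json tag.
func hasJSONTags(v interface{}) bool {
	t := reflect.TypeOf(v)
	for i := 0; i < t.NumField(); i++ {
		if _, ok := t.Field(i).Tag.Lookup("json"); ok {
			return true
		}
	}
	return false
}

func main() {
	out, err := json.Marshal(VersionedSpec{Replicas: 3})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out)) // {"replicas":3}
	fmt.Println(hasJSONTags(VersionedSpec{}), hasJSONTags(InternalSpec{})) // true false
}
```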

Unversioned types are exposed neither to the API nor to the database, so they don't need (un)marshalling. They are the types api-server uses in its domain layer, mainly for validation and transformation:

(Notice how the import used for the types is "k8s.io/kubernetes/pkg/apis/apps", instead of "k8s.io/api/apps/v1").

Any client of the api-server uses a specific version of a resource group for a type: for apps, it can use v1, v1beta1 or v1beta2.

Imagine running kubectl create for a Deployment. The Deployment sent to the API is converted to the common unversioned type before reaching the api-server domain layer, where it is transformed and validated. Once the api-server has performed the validation and modifications this object requires, it converts it to a version of choice (set in the api-server, in this case v1) and stores it in the database.

The same thing will happen for gets and lists, but the other way around:

The api-server reads a versioned object from the database
Converts it to the unversioned type
Performs domain logic
Converts it to the versioned type the client asked for
Responds to the client
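Both directions can be sketched in a few lines. This is a toy model under stated assumptions: the three structs and the conversion functions are hypothetical and hand-written, whereas the real conversions are generated and registered in a scheme. The point is only that domain logic (validate here) never sees a versioned type.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical, simplified versioned types (what clients and the database see).
type DeploymentV1 struct {
	APIVersion string
	Name       string
	Replicas   int32
}

type DeploymentV1beta1 struct {
	APIVersion string
	Name       string
	Replicas   int32
}

// DeploymentInternal is the unversioned type the domain layer works on.
type DeploymentInternal struct {
	Name     string
	Replicas int32
}

func v1beta1ToInternal(in DeploymentV1beta1) DeploymentInternal {
	return DeploymentInternal{Name: in.Name, Replicas: in.Replicas}
}

func v1ToInternal(in DeploymentV1) DeploymentInternal {
	return DeploymentInternal{Name: in.Name, Replicas: in.Replicas}
}

func internalToV1(in DeploymentInternal) DeploymentV1 {
	return DeploymentV1{APIVersion: "apps/v1", Name: in.Name, Replicas: in.Replicas}
}

func internalToV1beta1(in DeploymentInternal) DeploymentV1beta1 {
	return DeploymentV1beta1{APIVersion: "apps/v1beta1", Name: in.Name, Replicas: in.Replicas}
}

// validate is the domain-layer check; it only accepts the internal type.
func validate(d DeploymentInternal) error {
	if d.Name == "" {
		return errors.New("name is required")
	}
	if d.Replicas < 0 {
		return errors.New("replicas must be non-negative")
	}
	return nil
}

// create mimics the write path: client version -> internal -> storage version (v1).
func create(in DeploymentV1beta1) (DeploymentV1, error) {
	internal := v1beta1ToInternal(in)
	if err := validate(internal); err != nil {
		return DeploymentV1{}, err
	}
	return internalToV1(internal), nil
}

// get mimics the read path: storage version -> internal -> client version.
func get(stored DeploymentV1) DeploymentV1beta1 {
	internal := v1ToInternal(stored)
	// Domain logic would run here, on the internal type.
	return internalToV1beta1(internal)
}

func main() {
	stored, err := create(DeploymentV1beta1{APIVersion: "apps/v1beta1", Name: "web", Replicas: 2})
	if err != nil {
		panic(err)
	}
	fmt.Println(stored.APIVersion, get(stored).APIVersion) // apps/v1 apps/v1beta1
}
```

Going through one internal type means each version only needs conversions to and from that type, rather than to every other version.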

The base conversion code is autogenerated, alongside other helpers such as DeepCopy() functions and the OpenAPI schema, using a toolkit invoked in this script:

Code generation script
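To give an idea of what a generated helper looks like, here is a hand-written sketch of a DeepCopy method for a hypothetical Deployment struct with a pointer and a map field. The generated code in Kubernetes follows the same idea (copy the value, then re-allocate everything reachable through pointers, maps and slices), but this is not the generator's actual output.

```go
package main

import "fmt"

// Deployment is a hypothetical type with fields that a plain assignment
// would share between copies (a pointer and a map).
type Deployment struct {
	Name     string
	Replicas *int32
	Labels   map[string]string
}

// DeepCopy returns a copy that shares no memory with the receiver.
func (in *Deployment) DeepCopy() *Deployment {
	if in == nil {
		return nil
	}
	out := new(Deployment)
	*out = *in // copies value fields; pointer and map still alias the original
	if in.Replicas != nil {
		r := *in.Replicas
		out.Replicas = &r
	}
	if in.Labels != nil {
		out.Labels = make(map[string]string, len(in.Labels))
		for k, v := range in.Labels {
			out.Labels[k] = v
		}
	}
	return out
}

func main() {
	n := int32(1)
	orig := &Deployment{Name: "web", Replicas: &n, Labels: map[string]string{"app": "web"}}
	cp := orig.DeepCopy()
	*cp.Replicas = 5
	cp.Labels["app"] = "changed"
	fmt.Println(*orig.Replicas, orig.Labels["app"]) // 1 web
}
```

This matters in controllers because objects read from a shared cache must not be mutated in place; you copy first and modify the copy.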

controller-manager

This component runs the different controllers in Kubernetes.

A controller is a control loop that watches different resources exposed by the api-server.

For example, the Deployment controller has a watch on all Deployments. Every creation, modification or deletion of a Deployment triggers the syncHandler function:

Deployment controller code

The logic of the syncHandler function depends on the nature of the resource. Deployments manage ReplicaSets and the policy to transition between ReplicaSets when changes are made to a Deployment. Therefore the controller also needs to watch ReplicaSets and, incidentally, Pods (source):

	dInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			dc.addDeployment(logger, obj)
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			dc.updateDeployment(logger, oldObj, newObj)
		},
		DeleteFunc: func(obj interface{}) {
			dc.deleteDeployment(logger, obj)
		},
	})

The callbacks to the watches are set in an object of type Informer (e.g. DeploymentInformer).
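The usual shape of such callbacks is that they do almost nothing themselves: they compute a key for the object and put it on a queue, and a worker later calls syncHandler with that key. Here is a standard-library-only sketch of that pattern. All names (eventHandlers, controller, runDemo) are hypothetical, and a buffered channel stands in for the rate-limited, de-duplicating work queue that real controllers use.

```go
package main

import "fmt"

// eventHandlers mirrors the shape of the Add/Update/Delete callbacks.
type eventHandlers struct {
	AddFunc    func(obj interface{})
	UpdateFunc func(oldObj, newObj interface{})
	DeleteFunc func(obj interface{})
}

type deployment struct {
	Namespace, Name string
}

// controller keeps a queue of keys; handlers only enqueue, a worker does the work.
type controller struct {
	queue       chan string
	syncHandler func(key string) error
}

// enqueue turns an object into a "namespace/name" key and queues it.
func (c *controller) enqueue(obj interface{}) {
	d, ok := obj.(*deployment)
	if !ok {
		return // ignore objects of an unexpected type
	}
	c.queue <- d.Namespace + "/" + d.Name
}

func (c *controller) handlers() eventHandlers {
	return eventHandlers{
		AddFunc:    func(obj interface{}) { c.enqueue(obj) },
		UpdateFunc: func(_, newObj interface{}) { c.enqueue(newObj) },
		DeleteFunc: func(obj interface{}) { c.enqueue(obj) },
	}
}

// processAll closes the queue and calls syncHandler once per queued key.
func (c *controller) processAll() {
	close(c.queue)
	for key := range c.queue {
		if err := c.syncHandler(key); err != nil {
			fmt.Println("sync failed:", key, err)
		}
	}
}

// runDemo fires one event of each kind and returns the keys that were synced.
func runDemo() []string {
	var synced []string
	c := &controller{
		queue:       make(chan string, 8),
		syncHandler: func(key string) error { synced = append(synced, key); return nil },
	}
	h := c.handlers()
	d := &deployment{Namespace: "default", Name: "web"}
	h.AddFunc(d)
	h.UpdateFunc(d, d)
	h.DeleteFunc(d)
	c.processAll()
	return synced
}

func main() {
	fmt.Println(runDemo()) // [default/web default/web default/web]
}
```

Note that syncHandler receives only a key, not the event type: it is expected to look up the current state and reconcile towards it, whatever happened in between.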

An informer is an auto-generated type (produced by the script we covered before) with an in-memory cache that can serve all reads for a particular type. You use it to perform Get and List, and to register the handler callbacks for the different changes an object can undergo. Since it keeps a watch open for a resource, it needs to run in its own goroutine (source):

	run := func(ctx context.Context, controllerDescriptors map[string]*ControllerDescriptor) {
		controllerContext, err := CreateControllerContext(ctx, c, rootClientBuilder, clientBuilder)
		if err != nil {
			logger.Error(err, "Error building controller context")
			klog.FlushAndExit(klog.ExitFlushTimeout, 1)
		}

		if err := StartControllers(ctx, controllerContext, controllerDescriptors, unsecuredMux, healthzHandler); err != nil {
			logger.Error(err, "Error starting controllers")
			klog.FlushAndExit(klog.ExitFlushTimeout, 1)
		}

		controllerContext.InformerFactory.Start(stopCh)
		controllerContext.ObjectOrMetadataInformerFactory.Start(stopCh)
		close(controllerContext.InformersStarted)

		<-ctx.Done()
	}

You also need to make sure the cache is in sync with the contents of the database before doing any Get/List (source):

	dc.dListerSynced = dInformer.Informer().HasSynced
	dc.rsListerSynced = rsInformer.Informer().HasSynced
	dc.podListerSynced = podInformer.Informer().HasSynced
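Those HasSynced functions are then polled until all of them report true; client-go ships cache.WaitForCacheSync for that. The following is a standard-library-only sketch of the idea, not client-go's implementation: waitForSync, syncedFunc and simulate are hypothetical names, and the fake informer simply flips a flag after a delay.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// syncedFunc mirrors the shape of an informer's HasSynced method.
type syncedFunc func() bool

// waitForSync polls until every func reports true, or returns false when ctx is done.
func waitForSync(ctx context.Context, interval time.Duration, funcs ...syncedFunc) bool {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		all := true
		for _, f := range funcs {
			if !f() {
				all = false
				break
			}
		}
		if all {
			return true
		}
		select {
		case <-ctx.Done():
			return false
		case <-ticker.C:
		}
	}
}

// simulate starts a fake informer that reports synced after syncAfterMs
// and waits for it for at most timeoutMs.
func simulate(syncAfterMs, timeoutMs int) bool {
	var ready int32
	go func() {
		time.Sleep(time.Duration(syncAfterMs) * time.Millisecond) // pretend the initial List is running
		atomic.StoreInt32(&ready, 1)
	}()
	hasSynced := func() bool { return atomic.LoadInt32(&ready) == 1 }

	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()
	return waitForSync(ctx, 5*time.Millisecond, hasSynced)
}

func main() {
	fmt.Println(simulate(20, 1000)) // cache syncs in time: true
	fmt.Println(simulate(1000, 50)) // gives up before the cache syncs: false
}
```

A controller that gets false back should not start its workers, since any Get or List would run against an incomplete cache.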

Note that the object used to do the Get/List calls is not the informer itself, but an object returned by the informer: a Lister.

A Lister is nothing more than a read-only view over the informer's cache that allows you to list and get Kubernetes resources of a particular Kind.

There's a Factory object to create Informers:

Informer factory code

All informers for a given type created through that factory share the same cache; hence the types are called SharedInformerFactory and SharedIndexInformer. This is one of the reasons why Kubernetes runs all the controllers in the same component: all controllers share the same caches between them.
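A toy model of that sharing, under stated assumptions: the factory, informer and lister types below are hypothetical stand-ins written with the standard library only, and objects are reduced to strings. The point is that two controllers asking the same factory for the same kind get the same informer, and therefore the same cache.

```go
package main

import (
	"fmt"
	"sync"
)

// informer is a stand-in for a shared informer: one cache per resource kind.
type informer struct {
	kind  string
	store map[string]string // "namespace/name" -> object (simplified to a string)
}

// lister is a read-only view over an informer's cache.
type lister struct{ inf *informer }

func (l lister) Get(key string) (string, bool) {
	obj, ok := l.inf.store[key]
	return obj, ok
}

// factory hands out at most one informer per kind.
type factory struct {
	mu        sync.Mutex
	informers map[string]*informer
}

func newFactory() *factory {
	return &factory{informers: map[string]*informer{}}
}

// InformerFor returns the existing informer for kind, creating it on first use.
func (f *factory) InformerFor(kind string) *informer {
	f.mu.Lock()
	defer f.mu.Unlock()
	if inf, ok := f.informers[kind]; ok {
		return inf
	}
	inf := &informer{kind: kind, store: map[string]string{}}
	f.informers[kind] = inf
	return inf
}

func main() {
	f := newFactory()
	a := f.InformerFor("Pod") // requested by one controller
	b := f.InformerFor("Pod") // requested by another controller
	a.store["default/web-0"] = "pod web-0"

	obj, ok := lister{inf: b}.Get("default/web-0")
	fmt.Println(a == b, obj, ok) // true pod web-0 true
}
```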

Storage

Storage is the component that describes how resources are persisted in etcd. You usually don't define this when you write operators, but you will certainly work with it when writing a custom Kubernetes version.

Kubernetes uses different strategies to manage the persistence of resources. For a comprehensive understanding, I suggest you look at this folder, which contains the various strategies for the storage of Kubernetes resources.

Storage strategies

The strategies include methods for preparation, validation and transformation of resources:

PrepareForCreate: this method is invoked when a resource is created. It contains everything that must hold before a resource can be created, such as default values, constraint checks and so on.

func (strategy *deploymentStrategy) PrepareForCreate(ctx context.Context, obj runtime.Object) {
    deployment := obj.(*apps.Deployment)
    // Default to a single replica when the manifest does not set one.
    if deployment.Spec.Replicas == nil {
        replicas := int32(1)
        deployment.Spec.Replicas = &replicas
    }
}

PrepareForUpdate: this method defines how a resource is allowed to mutate. Like PrepareForCreate, it lets you enforce the preconditions that must hold whenever a resource is updated.
Other methods can be seen in the registry folder of the Kubernetes repository on github.com.
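As a rough illustration of the update side, here is a sketch on hypothetical simplified types. The two rules it applies (keep the stored status, bump the generation when the spec changes) are assumptions chosen for the example; check the registry folder for what each real strategy does.

```go
package main

import "fmt"

// Hypothetical, simplified types for illustration only.
type Spec struct{ Replicas int32 }
type Status struct{ ReadyReplicas int32 }

type Deployment struct {
	Generation int64
	Spec       Spec
	Status     Status
}

// prepareForUpdate adjusts the incoming object before it is validated and stored.
func prepareForUpdate(newObj, oldObj *Deployment) {
	// Assumption: status is not writable through the main resource,
	// so whatever the client sent is replaced by the stored status.
	newObj.Status = oldObj.Status

	// Assumption: the generation is owned by the server and only
	// moves forward when the spec actually changes.
	newObj.Generation = oldObj.Generation
	if newObj.Spec != oldObj.Spec {
		newObj.Generation++
	}
}

func main() {
	old := &Deployment{Generation: 1, Spec: Spec{Replicas: 1}, Status: Status{ReadyReplicas: 1}}
	updated := &Deployment{Generation: 1, Spec: Spec{Replicas: 3}, Status: Status{ReadyReplicas: 99}}
	prepareForUpdate(updated, old)
	fmt.Println(updated.Generation, updated.Spec.Replicas, updated.Status.ReadyReplicas) // 2 3 1
}
```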

Subresources

Subresources are components of the YAML manifest of a resource. They are secondary resources associated with a primary resource; for example, the Status of a Deployment is a subresource and can have its own storage methods.

type StatusREST struct {
    store *registry.Store
}

func (r *StatusREST) New() runtime.Object {
    return &apps.Deployment{}
}

func (r *StatusREST) Update(ctx context.Context, name string, objInfo rest.UpdatedObjectInfo, createValidation rest.ValidateObjectFunc, updateValidation rest.ValidateObjectUpdateFunc, forceAllowCreate bool, options *metav1.UpdateOptions) (runtime.Object, bool, error) {
    // Update logic for status subresource
    return r.store.Update(ctx, name, objInfo, createValidation, updateValidation, forceAllowCreate, options)
}

In this way we allow the system to mutate the status of the resource, without modifying other aspects of the resource.
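The effect can be sketched with an in-memory store standing in for the real registry store. Everything here (store, updateStatus, the simplified Deployment) is hypothetical; the sketch only shows that a status update copies the status from the incoming object and keeps everything else as stored.

```go
package main

import "fmt"

// Hypothetical, simplified types for illustration only.
type Spec struct{ Replicas int32 }
type Status struct{ ReadyReplicas int32 }

type Deployment struct {
	Name   string
	Spec   Spec
	Status Status
}

// store is an in-memory stand-in for the etcd-backed store.
type store struct {
	objects map[string]Deployment
}

// updateStatus mimics the status subresource: only Status is taken
// from the incoming object, every other field stays as stored.
func (s *store) updateStatus(incoming Deployment) (Deployment, error) {
	current, ok := s.objects[incoming.Name]
	if !ok {
		return Deployment{}, fmt.Errorf("deployment %q not found", incoming.Name)
	}
	current.Status = incoming.Status
	s.objects[incoming.Name] = current
	return current, nil
}

func main() {
	s := &store{objects: map[string]Deployment{
		"web": {Name: "web", Spec: Spec{Replicas: 2}},
	}}
	// The spec change in this request is ignored; only the status is applied.
	out, err := s.updateStatus(Deployment{Name: "web", Spec: Spec{Replicas: 50}, Status: Status{ReadyReplicas: 2}})
	fmt.Println(out.Spec.Replicas, out.Status.ReadyReplicas, err) // 2 2 <nil>
}
```

This split is what lets a controller report status without being able to overwrite the spec a user wrote, and the other way around.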

Useful links

Sample API Server: A simple example of an API Server extension with guiding comments.
Sample Controller: An example of a Kubernetes controller for learning and reference.
https://twitter.com/bryanl/status/1346125863419568129

El Substack de Fulvio