1. Introduction

The CaaS platform acts as the link between FirstSpirit and the customer’s end application. The REST Interface receives data and stores it in the internal persistence layer of the CaaS platform. The customer’s end application obtains up-to-date data by querying the REST Interface.

The CaaS platform consists of the following components, which are provided as Docker containers:

REST Interface (caas-rest-api)

The REST Interface is used both to transfer data into the CaaS repository and to query data from it. For this purpose it provides a REST endpoint that can be used by any service. It also supports authentication and authorization.

Between CaaS version 2.11 and 2.13 (inclusive), authentication and authorization functionality was provided by a separate Security Proxy.

CaaS repository (caas-mongo)

The CaaS repository is not accessible from the Internet and is only reachable within the platform from the REST Interface. It serves as the storage for all project data and internal configuration.

This document is intended for operators of the CaaS platform and contains information and instructions for operating and technically administering the platform.
A description of the functions and usage options of the REST Interface of the CaaS platform can be found in the separate documentation of the REST Interface.

2. Technical Requirements

The CaaS platform must be operated with Kubernetes.

If you are not in a position to operate, configure, and monitor the cluster infrastructure, and to analyze and resolve operational problems, we strongly advise against on-premises operation and refer you to our SaaS offering.

Since the CaaS platform is delivered as a Helm artifact, Helm must be available as a client.

It is important that Helm is installed securely. Further information can be found in the Helm installation guide.

For system requirements, please refer to the technical datasheet of the CaaS platform.

3. Installation and Configuration

Setting up the CaaS platform for operation with Kubernetes is done using Helm charts. These are included in the delivery and already contain all required components.

The following subchapters describe the necessary installation and configuration steps.

3.1. Importing the Images

The first step in setting up the CaaS platform is to import the images into your central Docker registry (e.g. Artifactory). The images are included in the delivery in the file caas-docker-images-20.12.4.zip.

The credentials for the cluster’s access to the registry must be known.

Please refer to the documentation of your registry for the necessary steps for the import.

3.2. Helm Chart Configuration

After importing the images, the configuration of the Helm chart is necessary. This is included in the delivery and can be found in the file caas-20.12.4.tgz. A default configuration of the chart is already provided in the values.yaml file. All parameters specified in this values.yaml can be overwritten with specific values in a manually created custom-values.yaml.

3.2.1. Authentication

All authentication settings for communication with or within the CaaS platform are defined in the credentials block of the custom-values.yaml.

This includes usernames, default passwords, and the CaaS Master API Key. It is strongly recommended to change the default passwords and the CaaS Master API Key.

All chosen passwords must be alphanumeric. Otherwise, issues may occur with CaaS.

The CaaS Master API Key is automatically created during the installation of the CaaS platform and thus enables direct use of the REST Interface.
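
A minimal sketch of such an override in a custom-values.yaml is shown below. The key names used here are illustrative assumptions only; the authoritative entries can be found in the credentials block of the values.yaml delivered with the chart.

Example credentials override in a custom-values.yaml (key names assumed)
credentials:
  # Hypothetical key names - check the credentials block of the delivered
  # values.yaml for the actual entries of your chart version.
  caasMasterApiKey: "1a2b3c4d5e6f"
  restApiPassword: "alphanumericPassword42"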

3.2.2. CaaS Repository (caas-mongo)

The configuration of the repository includes two parameters:

storageClass

The parameter from the values.yaml file that is typically overridden is mongo.persistentVolume.storageClass, which determines the storage class used for the MongoDB persistent volumes.

For performance reasons, we recommend that the underlying file system of MongoDB is provisioned with XFS.
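
For example, a custom-values.yaml might override the storage class as follows; the class name fast-xfs is only a placeholder for a storage class that actually exists in your cluster and provisions XFS-backed volumes.

Overriding the storage class in a custom-values.yaml
mongo:
  persistentVolume:
    storageClass: fast-xfs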

clusterKey

A default configuration for the Mongo cluster authentication key is provided. The key can be set in the parameter credentials.clusterKey. It is strongly recommended to generate a new key for production use with the following command:

openssl rand -base64 756

This value should only be changed during the initial installation. Changing it later may lead to permanent database unavailability, which can only be repaired manually.
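
A sketch of the corresponding override in a custom-values.yaml, assuming the generated key has been copied into the file:

Setting the cluster key in a custom-values.yaml
credentials:
  # Replace the placeholder with the key generated by openssl rand -base64 756
  clusterKey: |
    <generated key>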

3.2.3. Docker Registry

To configure the Docker registry to be used, the parameters imageRegistry and imageCredentials must be adjusted.

Example configuration in a custom-values.yaml
imageRegistry: docker.company.com/e-spirit

imageCredentials:
   username: "username"
   password: "special_password"
   registry: docker.company.com
   enabled: true

3.2.4. Ingress Configurations

Ingress definitions control incoming traffic to each component. With the default configuration, the Ingress definitions included in the chart are not created. The parameters restApi.ingress.enabled and restApi.ingressPreview.enabled control the creation of the Ingress definitions for the REST Interface.

The ingress definitions of the Helm chart require the NGINX Ingress Controller, as annotations and the class of this specific implementation are used. If you use a different implementation, you must adjust the annotations and the spec.ingressClassName attribute in your custom-values.yaml accordingly.

Ingress creation in a custom-values.yaml
restApi:
   ingress:
      enabled: true
      hosts:
         - caas.company.com
   ingressPreview:
      enabled: true
      hosts:
         - caas-preview.company.com

If the configuration options are not sufficient for your specific use case, you can create the Ingress yourself. In this case, set the corresponding parameter to enabled: false. The following code example provides guidance for the definition.

Ingress definition for the REST Interface
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
   name: caas
spec:
   ingressClassName: my-ingress-caas
   rules:
   - host: caas-rest-api.mydomain.com
     http:
        paths:
        - path: /
          pathType: Prefix
          backend:
             service:
                name: caas-rest-api
                port:
                   number: 80

3.3. Helm Chart Installation

After configuring the Helm chart, it must be installed in the Kubernetes cluster. Installation is performed using the following commands, which must be executed in the directory of the Helm chart.

Chart installation
kubectl create namespace caas
helm install RELEASE_NAME . --namespace=caas --values /path/to/custom-values.yaml

The release name can be chosen freely.

If you want to use a different namespace, you must adjust the commands accordingly.

If you want to use an existing namespace, the creation step is omitted and the desired namespace is specified in the installation command.

Since the container images must first be downloaded from the configured image registry, installation may take a few minutes. Ideally, it should not take more than five minutes before the CaaS platform is ready for use.

The status of the individual components can be retrieved with the following command:

kubectl get pods --namespace=caas

Once all components have the status Running, installation is complete.

NAME                                 READY     STATUS        RESTARTS   AGE
caas-mongo-0                         2/2       Running       0          4m
caas-mongo-1                         2/2       Running       0          3m
caas-mongo-2                         2/2       Running       0          1m
caas-rest-api-1851714254-13cvn       1/1       Running       0          5m
caas-rest-api-1851714254-k7d2p       1/1       Running       0          4m
caas-rest-api-1851714254-xs6c0       1/1       Running       0          4m

3.4. TLS

Communication from the CaaS platform to the outside is not encrypted by default. If it is to be protected by TLS, there are two configuration options:

Use of an officially signed certificate

To use an officially signed certificate, a TLS secret is required, which must first be created. It must contain the key tls.key and the certificate tls.crt.

The steps required to create the TLS secret are described in the Kubernetes ingress documentation.
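
As a sketch, such a secret can be created with kubectl; the secret name caas-tls is an arbitrary example and must match the name referenced in the TLS section of your Ingress configuration.

Creating a TLS secret (example)
kubectl create secret tls caas-tls \
  --cert=path/to/tls.crt \
  --key=path/to/tls.key \
  --namespace=caas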

Automated certificate management

Alternatively, you can automate certificate management using Cert-Manager. This must be installed in the cluster and takes care of the creation, distribution, and renewal of all required certificates. The configuration of Cert-Manager enables, for example, the use and automatic renewal of Let’s Encrypt certificates.

The necessary installation steps are described in the Cert-Manager documentation.

3.5. Scaling

To process the information transferred to the CaaS quickly, the CaaS platform must always ensure optimal load distribution. For this reason, the REST Interface and the Mongo database are scalable and, for reasons of fail-safety, are already configured so that at least three instances are deployed. This minimum number of instances is required above all for the Mongo cluster.

3.5.1. REST Interface

Scaling of the REST Interface is done using Horizontal Pod Autoscalers. This allows the REST Interface to scale up or down depending on the current CPU load.
The parameter targetCPUUtilizationPercentage specifies the percentage at which scaling should occur. The parameters minReplicas and maxReplicas define the minimum and maximum number of possible REST Interface instances.

The CPU load threshold should be chosen carefully: If the percentage is too low, the REST Interface will scale up too early in case of increasing load. If it is too high, scaling may not occur quickly enough when load increases.

Incorrect configuration can therefore jeopardize system stability.

The official Kubernetes Horizontal Pod Autoscaler documentation and the examples listed in it provide further information on using a Horizontal Pod Autoscaler.

Enabling the Horizontal Pod Autoscaler

Enabling and configuring the Horizontal Pod Autoscaler should be done in the custom-values.yaml file to overwrite the default values defined in the values.yaml file.

Default configuration of the REST Interface
restApi:
  horizontalPodAutoscaler:
    enabled: false
    minReplicas: 3
    maxReplicas: 9
    targetCPUUtilizationPercentage: 50
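
For example, to activate the autoscaler while keeping the default thresholds, a custom-values.yaml only needs to override the enabled flag:

Activating the Horizontal Pod Autoscaler in a custom-values.yaml
restApi:
  horizontalPodAutoscaler:
    enabled: true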

Enabling the Horizontal Pod Autoscaler in the custom-values.yaml removes the parameter restApi.spec.replicas from the deployment. When rolling out the Helm chart, the number of REST Interface pods is temporarily reduced to the default value of 1 before the Horizontal Pod Autoscaler takes over further scaling.

If you want to avoid this behavior, remove the replicas field from the deployment manifest in the Helm history secret before rolling out.

Background information on migrating to a Horizontal Pod Autoscaler can be found in the Kubernetes documentation.

3.5.2. Mongo Database

We distinguish between horizontal and vertical scaling. Horizontal scaling means adding additional instances to handle traffic. Vertical scaling means assigning more CPU/RAM to existing instances.

Horizontal scaling

Unlike the REST Interface, horizontal scaling of the Mongo database can only be performed manually; it cannot be automated with a Horizontal Pod Autoscaler.

Scaling of the Mongo database is done via the replicas parameter. This must be entered in the custom-values.yaml file to overwrite the default value defined in the values.yaml file.

At least three instances are required to operate the Mongo cluster; otherwise no Primary node is available and the database is not writable. If the number of available instances falls below 50% of the configured instances, no Primary node can be elected. A Primary node is essential for the functionality of the REST Interface.
The chapter Consider Fault Tolerance in the MongoDB documentation describes how many nodes may fail before it becomes impossible to elect a new Primary node. The information in the documentation should be taken into account when scaling the installation.
Further information on scaling and replicating the Mongo database can be found in the chapters Replica Set Deployment Architectures and Replica Set Elections.

Definition of the replicas parameter
mongo:
  replicas: 3

Do not scale the StatefulSet directly in Kubernetes. If you do, certain connection URLs will be incorrect and the additional instances will not be used properly. Always scale via the custom Helm values instead.

Scaling down the Mongo database is not possible without direct intervention and requires manual reduction of the replica set of the Mongo database. The MongoDB documentation describes the necessary steps.
We also recommend deleting the corresponding Persistent Volume Claims after removing the deleted instances from the replica set configuration. Otherwise, there is a risk that the instances will not be automatically added to the replica set during future scaling up.
Such intervention increases the risk of failure and is therefore not recommended.

Vertical scaling

Vertical scaling is done using Vertical Pod Autoscalers. Vertical Pod Autoscalers are Custom Resources in Kubernetes, so you must first ensure that your cluster supports them.

You can then configure the following parameters in your custom-values.yaml:

Configuration of the Vertical Pod Autoscaler
mongo:
  verticalPodAutoscaler:
    enabled: false
    apiVersion: autoscaling.k8s.io/v1beta2
    updateMode: Auto
    minAllowed:
      cpu: 100m
      memory: 500Mi
    maxAllowed:
      cpu: 1
      memory: 2000Mi

Applying the configuration

After configuration changes for the REST Interface or the Mongo database, the updated custom-values.yaml file must be applied using the following command.

Upgrade command
helm upgrade -i RELEASE_NAME path/to/caas-<VERSIONNUMBER>.tgz --values /path/to/custom-values.yaml

The release name can be determined with the command helm list --all-namespaces.

3.6. Monitoring

The CaaS platform has a microservice architecture and therefore consists of several components. To monitor their status reliably and react quickly in case of errors, integration into cluster-wide monitoring is essential when operating with Kubernetes.

The CaaS platform is already preconfigured for monitoring with Prometheus Operator, as this scenario is widespread in the Kubernetes environment. Prometheus ServiceMonitors for collecting metrics, Prometheus alerts for notification in case of problems, and predefined Grafana dashboards for visualizing metrics are included.

3.6.1. Prerequisites

It is essential to set up monitoring and log persistence for the Kubernetes cluster. Without these prerequisites, there are hardly any analysis options in case of errors, and Technical Support lacks important information.

Metrics

To install the Prometheus Operator, please use the official Helm chart so that cluster monitoring can be set up based on it. For further information, please refer to the relevant documentation.

If you do not operate a Prometheus Operator, you must disable the Prometheus ServiceMonitors and Prometheus alerts.
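
A sketch of the corresponding overrides in a custom-values.yaml, mirroring the configuration blocks shown in the sections Prometheus ServiceMonitors and Prometheus Alerts below; please verify the exact structure against the values.yaml of your chart version.

Disabling ServiceMonitors and alerts in a custom-values.yaml
monitoring:
  prometheus:
    metrics:
      mongo:
        enabled: false
      caas:
        enabled: false

caas-common:
  monitoring:
    prometheus:
      alerts:
        caas:
          enabled: false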

Logging

Kubernetes provisions containers and services automatically and scales them dynamically. To ensure that logs remain available even after an instance has been terminated in such a dynamic environment, an infrastructure that persists them must be in place.

We therefore recommend using a central logging system, such as the Elastic Stack. The Elastic or ELK Stack is a collection of open-source projects that help persist, search, and analyze log data in real time.

For installation, you can also use an existing Helm chart.

3.6.2. Prometheus ServiceMonitors

Deployment of the ServiceMonitors provided by the CaaS platform for the REST Interface and the Mongo database is controlled via the custom-values.yaml file of the Helm chart.

Access to the metrics of the REST Interface is secured by API Key, and access to the metrics of MongoDB is secured by a corresponding MongoDB user. The respective credentials are included in the credentials block of the values.yaml file of the Helm chart.

For security reasons, please adjust the credentials in your custom-values.yaml file.

Typically, Prometheus is configured to only consider ServiceMonitors with certain labels. The labels can therefore be configured in the custom-values.yaml file and apply to all ServiceMonitors of the CaaS Helm chart. In addition, the scrapeInterval parameter allows you to define how often the respective metrics are retrieved.

monitoring:
  prometheus:
    # Prometheus service monitors will be created for enabled metrics. Each Prometheus
    # instance has a configured serviceMonitorSelector property, to be able to control
    # the set of matching service monitors. To allow defining matching labels for CaaS
    # service monitors, the labels can be configured below and will be added to each
    # generated service monitor instance.
    metrics:
      serviceMonitorLabels:
        release: "prometheus-operator"
      mongo:
        enabled: true
        scrapeInterval: "30s"
      caas:
        enabled: true
        scrapeInterval: "30s"

The metrics of MongoDB are provided via a sidecar container and retrieved using a separate database user. You can configure the database user in the credentials block of the custom-values.yaml. The sidecar container is configured with the following default settings:

mongo:
  metrics:
    image: mongodb-exporter:0.11.0
    syncTimeout: 1m

3.6.3. Prometheus Alerts

Deployment of the alerts provided by the CaaS platform is controlled via the custom-values.yaml file of the Helm chart.

Typically, Prometheus is configured to only consider alerts with certain labels. The labels can therefore be configured in the custom-values.yaml file and apply to all alerts of the CaaS Helm chart.

caas-common:
  monitoring:
    prometheus:
      alerts:
        # Labels for the PrometheusRule resource
        prometheusRuleLabels:
          app: "prometheus-operator"
          release: "prometheus-operator"
        # Additional Prometheus labels to attach to alerts (or overwrite existing labels)
        additionalAlertLabels: {}
        caas:
          enabled: true
          useAlphaAlerts: false
          # Namespace(s) that should be targeted by the alerts (supports Go template and regular expressions)
          targetNamespace: "{{ .Release.Namespace }}"

3.6.4. Grafana Dashboards

Deployment of the Grafana dashboards provided by the CaaS platform is managed via the custom-values.yaml file of the Helm chart.

Typically, the Grafana sidecar container is configured to only consider ConfigMaps with specific labels and in a defined namespace. The labels of the ConfigMap and the namespace in which it is deployed can be configured in the custom-values.yaml file:

caas-common:
  monitoring:
    grafana:
      dashboards:
        enabled: true
        # Namespace that the ConfigMap resource will be created in (supports Go template and regular expressions)
        configmapNamespace: "{{ .Release.Namespace }}"
        # Additional labels to attach to the ConfigMap resource
        configMapLabels: {}
        overviewDashboardsEnabled: false

3.7. REST API Configuration

The REST Interface offers various configuration options that can be set in the custom-values.yaml file of the Helm chart.

3.7.1. Mongo Connection String with DNS Seed List

By default, the REST Interface uses a static connection string with all hostnames in the replica set (Standard Connection String) to connect to the MongoDB database. Optionally, the REST Interface can be configured to use a DNS seed list (SRV connection format).

This is done by setting restApi.mongoSrvConnectionFormat.enabled: true.
The cluster domain used for this can be overridden with the parameter restApi.mongoSrvConnectionFormat.domain.
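
A minimal sketch in a custom-values.yaml; the domain value cluster.local is only an assumption and must match the cluster domain of your Kubernetes installation.

Enabling the SRV connection format in a custom-values.yaml
restApi:
  mongoSrvConnectionFormat:
    enabled: true
    domain: cluster.local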

A seed list is only usable for the MongoDB included in the chart. For connections to an externally set up MongoDB, only the standard connection string is available.

3.7.2. Metadata in Collection Queries

By default, the REST Interface does not return collection metadata when filters are used in queries.
If metadata should be returned, set restApi.additionalConfigOverrides./noPropertiesInterceptor/enabled: false.
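
The corresponding override in a custom-values.yaml could look like this:

Enabling metadata in collection queries
restApi:
  additionalConfigOverrides:
    /noPropertiesInterceptor/enabled: false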

3.7.3. Excluding MongoDB Query Operators in Filter Queries

Certain MongoDB query operators can be excluded from use in filter queries by adding them to a blacklist.

To activate this feature, set restApi.filterOperatorBlacklist.enabled: true and specify the operators in restApi.filterOperatorBlacklist.value: [].
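
A sketch in a custom-values.yaml; the listed operators are only examples of MongoDB query operators that you might want to block.

Blacklisting filter operators in a custom-values.yaml
restApi:
  filterOperatorBlacklist:
    enabled: true
    value: ["$where", "$regex"]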

For more information, see the documentation.

3.7.4. Resolving References in Document Queries

When querying documents that contain references to other documents, these references are automatically resolved and the referenced documents are embedded in the result.

The maximum depth of reference resolution can be configured with restApi.additionalConfigOverrides./refResolvingInterceptor/max-depth.

The maximum number of references that can be resolved in a request is set with restApi.additionalConfigOverrides./refResolvingInterceptor/limit.

Reference resolution can be disabled with restApi.additionalConfigOverrides./refResolvingInterceptor/enabled: false.
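
A sketch of these settings in a custom-values.yaml; the numeric values are only illustrative and should be chosen to match your content structure.

Configuring reference resolution in a custom-values.yaml
restApi:
  additionalConfigOverrides:
    /refResolvingInterceptor/enabled: true
    /refResolvingInterceptor/max-depth: 3
    /refResolvingInterceptor/limit: 100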

3.7.5. Enabling and Configuring GraphQL

To enable GraphQL, set restApi.graphql.enabled: true.

Default and maximum page size can be configured with restApi.graphql.pageSize.default and restApi.graphql.pageSize.max.
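
A sketch in a custom-values.yaml; the page sizes are only example values.

Enabling GraphQL in a custom-values.yaml
restApi:
  graphql:
    enabled: true
    pageSize:
      default: 20
      max: 100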

4. Development Environment

Kubernetes and Helm form the basis of all installations of the CaaS platform. For development environments, we recommend installing the CaaS platform in a separate namespace on your production cluster or a similarly configured cluster. We advise against using local instances of the CaaS platform, even for development.

If you need a local environment on development machines, you must create a local Kubernetes cluster. Several projects for managing local Kubernetes clusters are available for this purpose.

We deliberately do not recommend a specific project: there are several that we know generally work, but we do not use any of them permanently ourselves, and we cannot provide support for any specific project. The CaaS platform only uses standard features of Helm and Kubernetes and is therefore independent of a specific Kubernetes distribution.

Please ensure the following features are correctly configured when using a local Kubernetes cluster:

  • Kubernetes image pull secrets to resolve Docker images from your local or company Docker registry

  • Disable monitoring in the custom-values.yaml or install the required prerequisites

  • Adjust DNS settings of the host system to work with Kubernetes ingress resources, or use local port forwarding in the local cluster

5. Metrics

Metrics are used for monitoring and troubleshooting the CaaS components during operation and are accessible via HTTP endpoints. If metrics are available in Prometheus format, corresponding ServiceMonitors are created (see also Prometheus ServiceMonitors).

5.1. REST Interface

Healthcheck

The healthcheck endpoint provides information about the functionality of the respective component in the form of a JSON document. This status is calculated from several checks. If all checks are successful, the JSON response has HTTP status 200. If at least one check fails, the response has HTTP status 500.

The endpoint is available at: http://REST-HOST:PORT/_logic/healthcheck

The functionality of the REST Interface depends on both the reachability of the MongoDB cluster and the existence of a primary node. If the cluster does not have a primary node, write operations to MongoDB are not possible.
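
For a quick manual check, the endpoint can be queried with curl, for example; depending on your configuration, an API key may additionally be required for access.

Querying the healthcheck endpoint (example)
curl -i http://REST-HOST:PORT/_logic/healthcheck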

HTTP Metrics

Performance metrics of the REST Interface can be retrieved in Prometheus format at the following URLs:

  • http://REST-HOST:PORT/_logic/metrics

  • http://REST-HOST:PORT/_logic/metrics-caas

  • http://REST-HOST:PORT/_logic/metrics-jvm

5.2. MongoDB

Metrics for MongoDB are provided via a sidecar container. This container accesses the metrics of MongoDB using a separate database user and provides them via HTTP.

Metrics can be retrieved at: http://MONGODB-HOST:METRICS-PORT/metrics

Please note that MongoDB metrics are delivered via a separate port. This port is not accessible from outside the cluster and is therefore not protected by authentication.

6. Maintenance

Data transfer to the CaaS can only function if all components are working properly. If disruptions occur or updates are necessary, all CaaS components must be considered. The following subchapters describe the necessary steps for troubleshooting in case of a disruption and how to perform backups and updates.

6.1. Troubleshooting

The CaaS is a distributed system based on the interaction of different components. Each of these components can potentially cause errors. If a disruption occurs during the use of the CaaS, various causes may be responsible. The following basic analysis steps explain how to identify the causes of disruptions.

Component status

The status of each CaaS platform component can be checked using the command kubectl get pods --namespace=<namespace>. If the status of an instance deviates from Running or ready, it is recommended to start troubleshooting there and check the associated log files.

If there are problems with the MongoDB, check whether a Primary node exists. If the number of available instances falls below 50% of the configured instances, no Primary node can be elected. This is essential for the functionality of the REST Interface. The absence of a Primary node causes the REST Interface pods to lose their ready status and become unreachable.

The chapter Consider Fault Tolerance in the MongoDB documentation describes how many nodes may fail before it becomes impossible to elect a new Primary node.

Log analysis

In case of problems, log files are a good starting point for analysis. They allow you to track all operations on the systems. This way, any errors and warnings become visible.

Current log files of the CaaS components can be viewed using kubectl --namespace=<namespace> logs <pod>, but they only include events that occurred during the lifetime of the current instance. To analyze log files after a crash or restart, we recommend setting up a central logging system.

Log files can only be viewed for the currently running container. Therefore, it is necessary to set up persistent storage to access log files from containers that have already stopped or restarted.

6.2. Backup

The architecture of the CaaS consists of various independent components that generate and process different information. If data backup is required, it must be performed depending on the respective component.

A backup of the information stored in the CaaS must be performed using the standard mechanisms of MongoDB. Either a copy of the underlying files can be created or mongodump can be used.
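
As a sketch, a dump can be created directly from one of the MongoDB pods with kubectl exec; the container name, user, and password are assumptions and must be taken from your installation and the credentials block.

Creating a dump with mongodump (example)
kubectl --namespace=caas exec caas-mongo-0 -c <mongodb-container> -- \
  mongodump --username <adminUser> --password <password> \
  --authenticationDatabase admin --archive --gzip > caas-backup.gz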

6.3. Update

Operation of the CaaS platform with Helm in Kubernetes allows updating to a new version without requiring a reinstallation.

Before updating the MongoDB database, a backup is strongly recommended.

The helm list --all-namespaces command returns a list of all installed Helm releases. This list contains both the version and the namespace of the corresponding release.

Sample list of installed releases
$ helm list --all-namespaces
NAME            NAMESPACE    REVISION  UPDATED             STATUS    CHART        APP VERSION
firstinstance   integration  1         2019-12-11 15:51..  DEPLOYED  caas-2.10.4  caas-2.10.4
secondinstance  staging      1         2019-12-12 09:31..  DEPLOYED  caas-2.10.4  caas-2.10.4

To update a release, the following steps must be carried out one after the other:

Transfer the settings

To avoid losing the previous settings, it is necessary to have the custom-values.yaml file with which the initial installation of the Helm chart was carried out.

Carrying over further adjustments

If files have been adjusted (e.g. in the config directory), these adjustments must also be carried over.

Update

After performing the previous steps, the update can be started. It replaces the existing installation with the new version without any downtime. To do this, execute the following command, which starts the process:

Helm upgrade command
helm upgrade RELEASE_NAME caas-20.12.4.tgz --values /path/to/custom-values.yaml

7. Appendix

7.1. Troubleshooting: Known Issues

7.1.1. File upload with PUT request fails

The error messages

  • E11000 duplicate key error collection: [some-file-bucket].chunks index: files_id_1_n_1 dup key or

  • error updating the file, the file bucket might have orphaned chunks

indicate that orphaned file chunks exist in the MongoDB data. This orphaned data can be deleted using the following mongo shell script:

Cleaning up a file bucket by deleting orphaned file chunks.
// Name of the file bucket to clean up (e.g., my-bucket.files)
var filesBucket = "{YOUR_FILE_BUCKET_NAME}";

var chunksCollection = filesBucket.substring(0, filesBucket.lastIndexOf(".")) + ".chunks";
db[chunksCollection].aggregate([
  // avoid accumulating binary data in memory
  { $unset: "data" },
  {
      $lookup: {
        from: filesBucket,
        localField: "files_id",
        foreignField: "_id",
        as: "fileMetadata",
      }
  },
  { $match: { fileMetadata: { $size: 0 } } }
]).forEach(function (c) {
  db[chunksCollection].deleteOne({ _id: c._id });
  print("Removed orphaned GridFS chunk with id " + c._id);
});

8. Help

The Technical Support of Crownpeak Technology GmbH provides expert technical support covering any topic related to the FirstSpirit™ product. You can find further help on relevant topics in our community.

9. Disclaimer

This document is provided for information purposes only. Crownpeak Technology GmbH may change the contents hereof without notice. This document is not warranted to be error-free, nor subject to any other warranties or conditions, whether expressed orally or implied in law, including implied warranties and conditions of merchantability or fitness for a particular purpose. Crownpeak Technology GmbH specifically disclaims any liability with respect to this document and no contractual obligations are formed either directly or indirectly by this document. The technologies, functionality, services, and processes described herein are subject to change without notice.