How to Set Up Alerts in Kasten K10 to Immediately Catch Failed Backups 

Backing up your Kubernetes applications is an essential part of the development process. When you run a backup, an import, or a restore manually, the results are available instantly: you simply check your dashboard to see whether the operation succeeded. But when a backup runs as part of a daily policy, it’s less transparent. Everything runs in the background while you work on other tasks. If anything goes wrong, you won’t know about it until you next check your dashboard. You may assume everything worked as intended when, in reality, your application is no longer protected!

Backups fail for a number of reasons. Here are just a few common ones: 

  • Credentials change 
  • Storage failure (system full or quota exceeded) 
  • Pods stuck in an error state without your knowledge
  • Target backup moved or deleted 
  • Network failure 
  • Misconfiguration changes 
  • Timeout because storage size increased in an unexpected way 

Because backups can fail for so many different reasons, we recommend implementing alerts when you move to production. While developers use different alerting systems and there’s no one generic way to set up alerts end-to-end, with a few configuration changes, you can implement alerts that go directly to your incident system.  

In this blog post, I’ll walk you through how to set up alerts that are sent directly to a Slack channel. The same steps can be used to send alerts via email or to another system such as ServiceNow, PagerDuty, or Jira.

Implementation  

Kasten features a Prometheus instance that stores Kasten metrics. We use those metrics to produce graphs on storage usage, as well as provide information for detecting whether or not a backup failed. The image below shows how alerts appear in the Slack channel: 

In the following example, we’ll configure Alert Manager (a Prometheus component) to send alerts to a Slack channel in four easy steps:

1) Create a New Prometheus Instance and Federate it to the Kasten.io Instance

When you do this, do not modify the Prometheus configuration in the kasten-io namespace, because that instance is managed by Helm and is not intended to be modified manually. Instead, use the federation URL of the Kasten Prometheus instance to spin up a Prometheus instance in another namespace:

2) Enable WebHooks in Slack

WebHooks in Slack allow you to send messages to a Slack channel, which is what you will configure Alert Manager to do.

First, create a Slack channel, then add the Incoming WebHooks app from the Apps menu. Be sure to take note of the WebHooks URL; you’ll need it later:

 

Once that’s done, you’ll see something like this in Slack:
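
Before wiring the webhook into Alert Manager, it can help to confirm that it accepts messages. The sketch below is one way to do that with curl. It assumes the URL you noted is exported as SLACK_WEBHOOK_URL (an environment variable name chosen for this example), and build_slack_payload is a hypothetical helper, not part of Slack or Kasten tooling. It handles single-line messages only.

```shell
# Build the JSON body that Slack incoming webhooks expect: {"text": "..."}.
# Backslashes and double quotes are escaped so the JSON stays valid.
build_slack_payload() {
  escaped=$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"text":"%s"}\n' "$escaped"
}

# Send a test message only if the (assumed) SLACK_WEBHOOK_URL variable is set.
if [ -n "${SLACK_WEBHOOK_URL:-}" ]; then
  curl -sS -X POST -H 'Content-Type: application/json' \
    --data "$(build_slack_payload 'Test message from the K10 alerting setup')" \
    "$SLACK_WEBHOOK_URL"
fi
```

If the message shows up in your channel, the webhook is ready to be used as the Alert Manager receiver.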

3) Configure Prometheus and Alert Manager

Create the monitoring namespace: 

kubectl create ns monitoring

Then add the Prometheus community Helm chart repository:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

Now, let’s configure a Prometheus instance with: 

  • Targets to scrape the metrics exposed by the Kasten Prometheus instance (the “federate” arrow).
  • Alert Manager with a Slack receiver (the “Receive” arrow).
  • Disablement of all the other metrics that Prometheus usually scrapes in a Kubernetes cluster.

Create the kasten_prometheus_values.yaml file, replacing <slack_api_webhook_url> with the webhook URL you just obtained:

cat <<EOF > kasten_prometheus_values.yaml
defaultRules:
  create: false
alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      # If an alert has successfully been sent, wait 'repeat_interval' to
      # resend them.
      repeat_interval: 30m
      # A default receiver
      receiver: "slack-notification"
      routes:
      - receiver: "slack-notification"
        match:
          severity: kasten
    receivers:
      - name: "slack-notification"
        slack_configs:
#configure incoming webhooks in slack(https://slack.com/intl/en-in/help/articles/115005265063-Incoming-webhooks-for-Slack)
          - api_url: '<slack_api_webhook_url>'
            channel: '#channel' #channel which the alert needs to be sent to
            text: "{{ range .Alerts }}<!channel> {{ .Annotations.summary }}\n{{ .Annotations.description }}\n{{ end }}"
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
#federation configuration for consuming metrics from K10 prometheus-server
      - job_name: k10
        scrape_interval: 15s
        honor_labels: true
        scheme: http
        metrics_path: '/k10/prometheus/federate'
        params:
          'match[]':
            - '{__name__=~"jobs.*"}'
            - '{__name__=~"catalog.*"}'
        static_configs:
          - targets:
            - 'prometheus-server.kasten-io.svc.cluster.local'
            labels:
              app: "k10"
#below values disable the components which are not required. They can be changed based on your requirements.
grafana:
  enabled: false
kubeApiServer:
  enabled: false
kubelet:
  enabled: false
kubeStateMetrics:
  enabled: false
kubeControllerManager:
  enabled: false
kubeEtcd:
  enabled: false
kubeProxy:
  enabled: false
coreDns:
  enabled: false
kubeScheduler:
  enabled: false
EOF
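
It is easy to forget to swap in the real webhook URL before installing. A minimal sketch of a pre-install check follows; has_placeholder is a hypothetical helper name, and VALUES_FILE is an assumed variable that defaults to the file created above.

```shell
# Return 0 if the values file still contains the unreplaced placeholder.
has_placeholder() {
  grep -q '<slack_api_webhook_url>' "$1"
}

# Warn if the file exists and the placeholder has not been replaced yet.
VALUES_FILE="${VALUES_FILE:-kasten_prometheus_values.yaml}"
if [ -f "$VALUES_FILE" ] && has_placeholder "$VALUES_FILE"; then
  echo "Replace <slack_api_webhook_url> in $VALUES_FILE before installing" >&2
fi
```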

Finally, install Prometheus with this configuration:

helm install prometheus prometheus-community/kube-prometheus-stack -n monitoring -f kasten_prometheus_values.yaml
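
Once the release is installed, you may want to confirm that the pods are running and that the Kasten federation endpoint actually returns metrics. The following is a sketch under a few assumptions: the function names are hypothetical, local port 9091 is arbitrary, and service port 80 is inferred from the scrape target above, which uses plain http with no explicit port.

```shell
# List the pods of the new stack; they should reach the Running state.
check_monitoring_pods() {
  kubectl get pods -n monitoring
}

# Query the K10 federation endpoint through a local port-forward.
# Assumes this is running in another terminal:
#   kubectl -n kasten-io port-forward service/prometheus-server 9091:80
fetch_k10_federation() {
  curl -sS -G 'http://localhost:9091/k10/prometheus/federate' \
    --data-urlencode 'match[]={__name__=~"jobs.*"}' \
    --data-urlencode 'match[]={__name__=~"catalog.*"}'
}

# Count the sample lines on stdin whose metric name starts with
# "catalog" or "jobs" (comment lines beginning with '#' are ignored).
count_k10_samples() {
  grep -c -E '^(catalog|jobs)'
}

# Usage:
#   check_monitoring_pods
#   fetch_k10_federation | count_k10_samples
```

A count greater than zero suggests that the federation path and match[] selectors in the values file line up with what the Kasten instance exposes.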

4) Create a Rule

The final step is to create a PrometheusRule custom resource (CR) to configure the alerts:

cat << EOF | kubectl -n monitoring create -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app: kube-prometheus-stack
    release: prometheus
  name: prometheus-kube-prometheus-kasten.rules
spec:
  groups:
  - name: kasten_alert
    rules:
    - alert: KastenJobsFailing
      expr: |-
        increase(catalog_actions_count{status="failed"}[10m]) > 0
      for: 1m
      labels:
        severity: kasten
      annotations:
        summary: "At least one failed K10 job for policy {{ \$labels.policy }} in the last 10 min"
        description: "{{ \$labels.policy }} policy run for the application {{ \$labels.app }} failed in the last 10 mins"
EOF
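
To check that Prometheus has picked up the rule, one option is to ask its HTTP API for the loaded rules. This is a minimal sketch: rule_is_loaded is a hypothetical helper, and it assumes the Prometheus service is port-forwarded to localhost:9090.

```shell
# Return 0 if the JSON on stdin (from Prometheus's /api/v1/rules endpoint)
# mentions the KastenJobsFailing alert rule.
rule_is_loaded() {
  grep -Eq '"name"[[:space:]]*:[[:space:]]*"KastenJobsFailing"'
}

# Usage, with the port-forward running in another terminal:
#   kubectl port-forward service/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring
#   curl -sS http://localhost:9090/api/v1/rules | rule_is_loaded && echo "rule loaded"
#
# The CR itself can also be listed with:
#   kubectl -n monitoring get prometheusrule prometheus-kube-prometheus-kasten.rules
```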

All done! Now, it’s time to test the configuration. 

Testing the Alert 

Testing that your alerts work takes three steps.

1) Create a Fail Condition 

First, change the configuration of the AKS cluster and set a wrong client secret. You can do the same thing with the AWS secret access key. If you are using CSI, remove the Kasten annotation on the VolumeSnapshotClass. (There are other ways to trigger a failure, but this one is really simple):

helm upgrade k10 kasten/k10 -n kasten-io -f azure_val.yaml

Once that’s done, run a backup. It should fail quickly:

2) Check Alert Manager 

Port-forward the Prometheus dashboard:

kubectl port-forward service/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring

Then, navigate to http://localhost:9090/alerts. On the Alerts page, you should see your alert in the pending or firing state:
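
The same check can be scripted against the Prometheus HTTP API while the port-forward is running. A rough sketch follows; alert_states is a hypothetical helper, and it relies on this Prometheus instance holding only the Kasten rule, since it prints the state of every active alert.

```shell
# Print the state ("pending" or "firing") of each active alert found in the
# JSON on stdin, as returned by Prometheus's /api/v1/alerts endpoint.
alert_states() {
  grep -o '"state":"[a-z]*"' | sed 's/"state":"\(.*\)"/\1/'
}

# Usage, with Prometheus port-forwarded to localhost:9090:
#   curl -sS http://localhost:9090/api/v1/alerts | alert_states
```

Empty output means no alert is active yet. Given the 10-minute window and the 1-minute "for" clause in the rule, the alert may take a short while to move from pending to firing.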

 

3) Check Slack 

Shortly after that, you should receive a notification in Slack. So, go check your Slack channel! If you see the alert, it’s working properly:
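
If you would rather test the Slack route without breaking a real backup, you can post a synthetic alert straight to Alert Manager through its v2 API. This is a sketch under stated assumptions: build_test_alert and the alert name KastenAlertRoutingTest are made up for this example, ALERTMANAGER_URL is an assumed environment variable, and the service name in the comment is inferred by analogy with the Prometheus service name used above.

```shell
# Build a JSON array with one synthetic alert for Alert Manager's v2 API.
# The severity label matches the route configured in the values file.
build_test_alert() {
  cat <<'JSON'
[{"labels":{"alertname":"KastenAlertRoutingTest","severity":"kasten"},
  "annotations":{"summary":"Synthetic test alert","description":"Checks the Slack route only"}}]
JSON
}

# Assumes Alert Manager is port-forwarded in another terminal, for example:
#   kubectl -n monitoring port-forward service/prometheus-kube-prometheus-alertmanager 9093:9093
# and that ALERTMANAGER_URL is exported, e.g. http://localhost:9093
if [ -n "${ALERTMANAGER_URL:-}" ]; then
  curl -sS -X POST -H 'Content-Type: application/json' \
    --data "$(build_test_alert)" "$ALERTMANAGER_URL/api/v2/alerts"
fi
```

This exercises only the Alert Manager to Slack path; the failed-backup test above is still what validates the rule and the federation.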

Conclusion

Now that you know how to receive alerts in Slack, you can safely run backups in the background as part of your daily policy, and be confident your data and applications are protected at all times.

Try Kasten K10 for yourself, for free today. 

This article was co-authored by Jaiganesh Karthikeyan, Senior Software Engineer, InfraCloud Technologies.
