How-To: Deploying and Managing Helm Charts in Our Kubernetes Cluster

Introduction

This guide provides step-by-step instructions for deploying and managing applications in our Kubernetes ops cluster using Helm charts and Terraform. It assumes you have read the explanation documentation and understand the basic concepts of our deployment architecture.

Important Scope Note: This guide is for deploying applications to the ops cluster only. The lower environments (dev, qa, staging) and production cluster are managed exclusively by the Platform team. If you need an application deployed to lower or prod environments, submit a request to the Platform team.

Prerequisites

  • GitLab Access: Permissions to create branches and merge requests in the cluster-manager repository
  • Basic Familiarity: Understanding of YAML syntax, Git workflow (branches, commits, merge requests), and basic Terraform syntax
  • Explanation Document: Read the "Understanding Our Kubernetes Deployment Architecture" explanation document first to understand the why behind these steps

Cluster Access Scope

Developers can deploy to:

  • ops cluster - Operational tools, monitoring, development utilities, POCs

Platform team manages:

  • lower environments (dev, qa, staging) - Application deployments for development/testing
  • production cluster - Production application workloads

If you need an application in lower or prod environments, work with the Platform team, who will handle the deployment following the same patterns described in this guide.

Work in Progress: The following deployment workflow sections are currently being updated:

  • How to Deploy a New Helm Chart Application - Step-by-step guide for deploying new applications
  • How to Modify an Existing Helm Release - Guide for updating configurations
  • How to Upgrade a Helm Chart Version - Instructions for version upgrades
  • How to Remove/Uninstall a Helm Release - Steps for cleaning up deployments

These sections contained examples specific to deprecated applications, which need to be replaced with generic guidance. They will be restored soon with updated content. For now, please use the monitoring and troubleshooting sections below, and contact the Platform team via Platform JIRA if you need assistance with deployments.


Monitoring Deployments with Grafana

After deploying an application, you'll need to verify it's running correctly and monitor its health. Our Grafana instance provides dashboards and tools to help you debug deployments and observe application behavior.

Accessing Grafana

  1. Navigate to the Grafana instance for the ops cluster: https://grafana.ops.wwnorton.net
  2. Authenticate using your Microsoft Entra ID credentials
  3. Navigate to the "Dashboards" section

Key Actions for Debugging Deployments

Check Pod Status and Health

  1. Open the Kubernetes Pods Dashboard

    • Navigate to: Kubernetes / Views / Pods
    • Select your application's namespace from the dropdown (e.g., poc-backstage)
    • Select specific pods from the pod dropdown, or leave as "All" to view all pods in the namespace

Understanding Empty Panels: If you see panels showing "No data", it may be due to:

  • Information Section: Shows data only when a single pod is selected (not "All"). Select a specific pod from the dropdown to see details like "Created by", "Running on", "Pod IP", etc.
  • Resource Section: Requires at least one pod to be selected. If showing "All" and still no data, check that pods exist in the selected namespace.
  • Kubernetes Section: Tables like "Pods with Container Issues" and "Unscheduled Pods" will be empty if there are no issues in the namespace—this is a good sign! The "Container Restarts" and "OOM Events" graphs will show data only if restarts/OOM events have occurred.
  2. Review Information Section

    The dashboard shows key pod information at the top:

    • Created by: Shows what Kubernetes resource created the pod (Deployment, StatefulSet, etc.)
    • Running on: Which node the pod is running on
    • Pod IP: The pod's IP address
    • Priority Class: Priority class if assigned
    • QOS Class: Quality of Service class (Guaranteed, Burstable, or BestEffort)
    • Last Terminated Reason: If a container in the pod was recently terminated, shows why (OOMKilled, Error, etc.). CrashLoopBackOff is a waiting state rather than a termination reason, so look for it in the "Pods with Container Issues" table instead
    • Last Terminated Exit Code: Exit code from the last termination
  3. Check Resource Usage

    Review the Resource section:

    • Gauge panels: Show overall CPU and Memory usage as percentages of Requests and Limits
    • Resources by container table: Detailed breakdown showing:
      • CPU Requests and Limits per container
      • Memory Requests and Limits per container
      • Current CPU Used per container
      • Current Memory Used per container
    • Time series graphs:
      • CPU Usage / Requests & Limits by container: Shows if containers are hitting their limits
      • Memory Usage / Requests & Limits by container: Shows memory pressure over time
      • CPU Usage by container: Absolute CPU cores used
      • Memory Usage by container: Absolute memory bytes used
  4. Identify Container Issues

    Scroll to the "Kubernetes" section to check for problems:

    • Pods with Container Issues table: Shows pods with:
      • CrashLoopBackOff: Pods repeatedly crashing (check logs for errors)
      • ErrImagePull or ImagePullBackOff: Cannot pull container image (check image name/registry/credentials)
    • Container Restarts by container: Graph showing restart frequency over time
    • Unscheduled Pods table: Pods that cannot be scheduled (usually resource constraints or node affinity issues)
    • OOM Events by container: Out-of-memory events indicating memory limits are too low

View Pod Logs

  1. Access Logs via Grafana

    • Navigate to "Explore" section in Grafana
    • Select Loki as the data source
    • Use LogQL queries to filter logs:
      {namespace="your-namespace", pod="your-pod-name"}
    • Search for error messages or specific log patterns
  2. Common Log Patterns to Check

    • Application startup messages
    • Database connection errors
    • Configuration errors
    • Health check failures
    • Port binding issues

Monitor Application Metrics

  1. Find Application-Specific Dashboards

    • Many Helm charts include Grafana dashboard definitions
    • Check if your chart provides dashboards and a ServiceMonitor (the ServiceMonitor configures metric scraping; dashboards are usually enabled through separate chart values)
    • Look for dashboards matching your application name
  2. Check Key Metrics

    • Request rates (if web application)
    • Error rates
    • Response times
    • Active connections
    • Custom application metrics

Verify Service and Ingress Health

Use kubectl to verify service and ingress configuration:

  1. Service Endpoints

    # Check if service has active endpoints
    kubectl get endpoints <service-name> -n <namespace>

    # Verify service selector matches pod labels
    kubectl describe service <service-name> -n <namespace>

    # Confirm service ports match container ports
    kubectl get service <service-name> -n <namespace> -o yaml
  2. Ingress Configuration

    # Verify ingress is configured correctly
    kubectl describe ingress <ingress-name> -n <namespace>

    # Check TLS configuration (spec.tls and certificate-related annotations)
    kubectl get ingress <ingress-name> -n <namespace> -o yaml

Track Deployment Changes Over Time

  1. Use Grafana's Time Range Selector

    • Set time range to cover your deployment window
    • Compare metrics before and after deployment
    • Identify any regressions or issues
  2. Check for Anomalies

    • Sudden drops in request rates
    • Increased error rates
    • Resource usage spikes
    • Pod restarts

When to Use kubectl vs Grafana

Use kubectl for:

  • Quick status checks (kubectl get pods -n <namespace>)
  • Executing into containers for debugging
  • Port-forwarding for local testing
  • Applying temporary fixes during incidents

Use Grafana for:

  • Historical analysis and trends
  • Aggregated metrics across multiple pods
  • Log analysis with search/filtering
  • Dashboard visualization for stakeholders
  • Alerting and monitoring

Getting Help with Monitoring

If you're unable to find the metrics or logs you need:

  1. Verify your application exposes metrics (Prometheus format)
  2. Check if ServiceMonitor is configured for your application
  3. Contact Platform team via Platform JIRA for:
    • Dashboard creation assistance
    • Metric collection configuration
    • Log retention policies
    • Alert rule setup

Common Troubleshooting Scenarios

Pods in CrashLoopBackOff

Symptoms:

  • Pods repeatedly restarting
  • Status shows CrashLoopBackOff
  • Application not accessible

Diagnosis:

  1. Check pod logs using kubectl or Grafana for error messages
  2. Look for common issues:
    • Missing environment variables
    • Cannot connect to database
    • Configuration errors
    • Missing dependencies
    • Resource limits too low

Resolution:

  1. Fix the underlying issue (config, dependencies, resources)
  2. Create MR with fix
  3. After apply, pods should start successfully
  4. If urgent, manually delete pods to force immediate restart after fix

ImagePullBackOff Errors

Symptoms:

  • Pods cannot start
  • Status shows ImagePullBackOff or ErrImagePull

Common Causes:

  • Image name or tag incorrect
  • Image repository requires authentication
  • Image doesn't exist
  • Registry unreachable

Resolution:

  1. Verify image name and tag in chart values
  2. Check chart documentation for correct image configuration
  3. For private registries, configure image pull secrets
  4. Update helm_release with corrected values
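
As a rough sketch of steps 1, 3, and 4, the corrected image settings might look like the following. The release name, namespace, chart, repository URL, image reference, and secret name are all hypothetical. The value keys image.repository, image.tag, and imagePullSecrets[0].name are common conventions but differ between charts, so confirm them against your chart's values.yaml.

```hcl
resource "helm_release" "example_app" {
  name       = "example-app"                # hypothetical release name
  namespace  = "poc-example"                # hypothetical namespace
  repository = "https://charts.example.com" # hypothetical chart repository
  chart      = "example-app"                # hypothetical chart
  version    = "1.2.3"                      # hypothetical pinned chart version

  # Assumed value keys; check the chart's values.yaml for the real ones.
  set {
    name  = "image.repository"
    value = "registry.example.com/team/example-app"
  }
  set {
    name  = "image.tag"
    value = "1.4.2"
  }

  # For private registries: reference an existing pull secret in the namespace.
  set {
    name  = "imagePullSecrets[0].name"
    value = "registry-credentials" # hypothetical secret name
  }
}
```

The pull secret itself must already exist in the target namespace; this sketch only references it.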

Terraform Apply Timeout

Symptoms:

  • Apply pipeline runs for very long time
  • Eventually times out with error
  • Pods may be stuck in Pending or slow to start

Common Causes:

  • wait = true and pods take too long to become ready
  • Insufficient cluster resources
  • Image pull is slow (large images)
  • Init containers taking long time

Resolution:

  1. Increase timeout value in helm_release (e.g., from 600 to 1200)
  2. Set wait = false if you don't need to wait for readiness (not recommended for prod)
  3. Investigate why pods are slow to start
  4. Check cluster capacity using kubectl or review metrics in Grafana
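
A minimal sketch of resolution steps 1 and 2, assuming a hypothetical release and chart. The wait and timeout arguments are standard helm_release arguments; timeout is in seconds.

```hcl
resource "helm_release" "slow_starting_app" {
  name       = "slow-starting-app"          # hypothetical release name
  namespace  = "poc-example"                # hypothetical namespace
  repository = "https://charts.example.com" # hypothetical chart repository
  chart      = "slow-starting-app"          # hypothetical chart
  version    = "1.2.3"                      # hypothetical pinned chart version

  # Keep waiting for readiness, but allow up to 20 minutes
  # for large image pulls or slow init containers.
  wait    = true
  timeout = 1200

  # Alternative: set wait = false to return as soon as the manifests are
  # submitted. The apply then no longer confirms that pods became ready.
}
```

Raising the timeout only hides slow startups, so it is still worth investigating the cause as described in steps 3 and 4.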

Configuration Values Not Taking Effect

Symptoms:

  • Changed configuration values but application behavior unchanged
  • Expected settings not applied

Diagnosis:

  1. Check that the Terraform plan showed the resource as modified
  2. Verify that the apply completed successfully
  3. Check whether pods were restarted using kubectl (kubectl get pods -n <namespace>; a recent AGE indicates new pods)
  4. Verify values syntax in template or set blocks

Resolution:

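  1. For simple config changes, Helm may not trigger a pod restart: if a change only affects a ConfigMap or Secret and leaves the pod template untouched, Kubernetes does not roll the pods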
  2. Manually delete pods to force restart with new config:
    kubectl delete pods -n <namespace> -l app.kubernetes.io/name=<app-label>
  3. Or add/change an annotation to force rollout:
    set {
      name  = "podAnnotations.configVersion"
      value = "v2" # Increment to force update
    }
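
One possible variation on the annotation approach is to derive the annotation value from the rendered values, so it changes whenever the configuration changes and nobody has to remember to increment it. This is a sketch under assumptions: the template path, the log_level variable, the configChecksum annotation name, and the release details are hypothetical, and it only works if your chart exposes a podAnnotations value.

```hcl
locals {
  # Rendered values file; the template path and its variable are hypothetical.
  app_values = templatefile("${path.module}/values/example-app.yaml.tftpl", {
    log_level = "info"
  })
}

resource "helm_release" "example_app" {
  name       = "example-app"                # hypothetical release name
  namespace  = "poc-example"                # hypothetical namespace
  repository = "https://charts.example.com" # hypothetical chart repository
  chart      = "example-app"                # hypothetical chart
  version    = "1.2.3"                      # hypothetical pinned chart version

  values = [local.app_values]

  # The checksum changes whenever the rendered values change, which alters
  # the pod template and makes Kubernetes roll the pods.
  set {
    name  = "podAnnotations.configChecksum"
    value = sha256(local.app_values)
    type  = "string" # keep the hash from being interpreted as another type
  }
}
```

This only tracks values passed through local.app_values; changes made through other set blocks would still need a manual bump or pod deletion.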

Terraform State Drift

Symptoms:

  • Terraform plan shows changes but you didn't modify code
  • "Resource has been changed outside of Terraform" messages

Common Causes:

  • Someone modified resource manually with kubectl or helm
  • Helm upgrade run directly instead of through Terraform
  • External controller modified resources

Resolution:

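  1. If changes are desired: Update the Terraform code to match the current state

    # Update Terraform code to match reality,
    # then reconcile state without changing resources
    # (terraform refresh is deprecated in favor of -refresh-only)
    terraform apply -refresh-only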
  2. If changes are unwanted: Let Terraform revert them

    • Next apply will restore Terraform's desired state
    • Resources will be updated to match code
  3. Prevention: Never modify Terraform-managed resources manually except in emergencies

Permission Denied Errors

Symptoms:

  • Cannot create resources
  • "forbidden" errors in Terraform apply
  • Pods cannot access AWS services

Common Causes:

  • Insufficient Kubernetes RBAC permissions
  • Namespace doesn't exist
  • Service account missing IAM role (IRSA)
  • IAM policy too restrictive

Resolution:

  1. Verify namespace exists before deploying
  2. Check service account annotations for IAM role ARN
  3. Review IAM policies attached to role
  4. Contact Platform team to verify RBAC permissions
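
For step 2, the IAM role is usually attached through the eks.amazonaws.com/role-arn annotation on the service account. The sketch below shows how that annotation might be set from helm_release. The serviceAccount.create and serviceAccount.annotations keys are common chart conventions rather than a guarantee, and the release details and role ARN are hypothetical.

```hcl
resource "helm_release" "example_app" {
  name       = "example-app"                # hypothetical release name
  namespace  = "poc-example"                # hypothetical namespace
  repository = "https://charts.example.com" # hypothetical chart repository
  chart      = "example-app"                # hypothetical chart
  version    = "1.2.3"                      # hypothetical pinned chart version

  # Assumed value keys; check the chart's values.yaml for the real ones.
  set {
    name  = "serviceAccount.create"
    value = "true"
  }

  # Dots inside the annotation key are escaped so Helm does not treat them
  # as path separators.
  set {
    name  = "serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
    value = "arn:aws:iam::123456789012:role/example-app-irsa" # hypothetical role ARN
  }
}
```

The role's trust policy must also allow this namespace and service account to assume it; if it does not, pods still get access-denied errors even with the annotation in place.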

Ingress Not Working

Symptoms:

  • Application deployed but URL not accessible
  • ALB not created or not routing traffic
  • 502/503 errors

Diagnosis:

  1. Check Ingress resource exists and is configured:
    kubectl describe ingress <name> -n <namespace>
  2. Verify ALB was created in AWS console
  3. Check ALB target group health
  4. Verify service exists and has endpoints

Resolution:

  1. Ensure ingress annotations are correct (especially for AWS ALB):
    set {
      name  = "ingress.annotations.kubernetes\\.io/ingress\\.class"
      value = "alb"
    }
    set {
      name  = "ingress.annotations.alb\\.ingress\\.kubernetes\\.io/scheme"
      value = "internet-facing"
    }
  2. Verify service selector matches pod labels
  3. Check service port matches container port
  4. Verify security groups allow traffic

Best Practices and Tips

Cluster and Environment Scope

Ops cluster is for:

  • Operational and platform tools (monitoring, logging, alerting)
  • Development utilities and POCs
  • Internal tools

Ops cluster is not for:

  • Customer-facing services (these go to lower/prod via Platform team)
  • Production data processing

Version Pinning

Always pin explicit chart versions:

# Good
version = "26.21.0"

# Bad - don't use ranges or latest
version = "~> 26.0"
version = "latest"

Why: Reproducibility and preventing unexpected changes.

Namespaces

Use dedicated namespaces for:

  • POCs and experiments (easy cleanup)
  • Applications with many resources
  • Security isolation requirements
  • Team-specific applications

Share namespaces for:

  • Related microservices
  • Common utilities
  • Monitoring tools that logically belong together

Resource Requests and Limits

Set appropriate resource requests/limits:

set {
  name  = "resources.requests.memory"
  value = "512Mi"
}
set {
  name  = "resources.requests.cpu"
  value = "250m"
}
set {
  name  = "resources.limits.memory"
  value = "1Gi"
}
set {
  name  = "resources.limits.cpu"
  value = "1000m"
}

Guidelines:

  • Start conservative, adjust based on actual usage
  • Requests guarantee minimum resources
  • Limits prevent runaway resource consumption
  • Monitor actual usage with Grafana dashboards
  • Different values for dev vs production

Configuration Management

For simple deployments: Use set blocks

set {
  name  = "key"
  value = "value"
}

For complex deployments: Use values with templatefile

  • More maintainable
  • Easier to review
  • Follows familiar YAML structure
  • Supports complex logic
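
A minimal sketch of the templatefile approach, assuming a hypothetical template at values/example-app.yaml.tftpl next to the Terraform code. The release details, template variables, and hostname are illustrative only.

```hcl
# values/example-app.yaml.tftpl (hypothetical template contents):
#
#   replicaCount: ${replica_count}
#   ingress:
#     enabled: true
#     hosts:
#       - host: ${ingress_host}

resource "helm_release" "example_app" {
  name       = "example-app"                # hypothetical release name
  namespace  = "poc-example"                # hypothetical namespace
  repository = "https://charts.example.com" # hypothetical chart repository
  chart      = "example-app"                # hypothetical chart
  version    = "1.2.3"                      # hypothetical pinned chart version

  # Render the YAML template with Terraform variables and pass the result
  # to Helm as a values document.
  values = [
    templatefile("${path.module}/values/example-app.yaml.tftpl", {
      replica_count = 2
      ingress_host  = "example-app.ops.example.com" # hypothetical hostname
    })
  ]
}
```

Reviewers then see ordinary YAML in the merge request, and only the interpolated pieces live in Terraform.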

Sensitive Data

Always use set_sensitive for secrets:

set_sensitive {
  name  = "password"
  value = var.password
}

For highly sensitive data: Consider AWS Secrets Manager instead of Terraform variables

  • Secrets stored securely in AWS
  • Rotation capabilities
  • Audit logging
  • Use CSI driver to mount into pods

Testing Strategy

For ops cluster deployments:

  1. Test locally if possible (minikube, kind, or Docker Desktop with Kubernetes)
  2. Deploy to ops cluster during low-usage hours
  3. Perform thorough validation before announcing availability
  4. Monitor closely for first 24 hours
  5. Gather user feedback

For applications needing multi-environment rollout: Work with the Platform team, who will handle progressive deployment:

  • Lower environments (dev → qa → staging)
  • Production after validation

Validation checklist:

  • Pods running and healthy
  • Logs show no errors
  • Application UI/API accessible
  • All features working
  • Performance acceptable
  • Resource usage reasonable
  • Monitoring and alerts configured

Documentation

Document your deployment:

  • Add comments in helm_config.tf explaining non-obvious choices
  • Document any deviations from chart defaults
  • Note any chart-specific quirks or issues encountered
  • Link to relevant chart documentation
  • Explain variable purposes in variable descriptions

Monitoring and Observability

After deployment, configure:

  • Prometheus ServiceMonitors (if applicable)
  • Grafana dashboards
  • Alert rules for critical metrics
  • Log aggregation to Loki
  • Traces to Tempo (if application supports)

Edge Cases and Advanced Scenarios

Conditional Deployment

Control whether a resource is deployed using count:

resource "helm_release" "experimental_tool" {
  # Deploy only in ops cluster (safety check)
  count = var.eks_environment == "ops" ? 1 : 0

  # ... rest of config ...
}

This pattern ensures the deployment only happens in ops cluster even if the code exists in other environment modules.

Note: Since developers only work in ops cluster, this is mainly used as a safety mechanism or when the same code is shared across environment modules.

Post-Install Jobs and Hooks

Some charts require post-install configuration:

resource "helm_release" "app" {
  # ... config ...

  # Wait for post-install jobs to complete
  wait          = true
  wait_for_jobs = true
  timeout       = 600
}

# Run additional configuration after Helm install
resource "null_resource" "post_install" {
  depends_on = [helm_release.app]

  provisioner "local-exec" {
    command = "kubectl apply -f ${path.module}/manifests/post-install-config.yaml"
  }
}

Useful Commands for Debugging

While we don't use these for deployment, they're helpful for debugging:

Helm Commands

# List all releases in a namespace
helm list -n <namespace>

# Get release values
helm get values <release-name> -n <namespace>

# Get release manifest
helm get manifest <release-name> -n <namespace>

# Check release history
helm history <release-name> -n <namespace>

# Debug - render templates without installing
helm template <release-name> <chart> -f values.yaml

Kubectl Commands

# Get all resources in namespace
kubectl get all -n <namespace>

# Describe resource for detailed info
kubectl describe pod <pod-name> -n <namespace>

# Get pod logs
kubectl logs <pod-name> -n <namespace>

# Follow logs in real-time
kubectl logs -f <pod-name> -n <namespace>

# Get logs from previous container (if crashed)
kubectl logs <pod-name> --previous -n <namespace>

# Execute command in container (use /bin/sh if the image has no bash)
kubectl exec -it <pod-name> -n <namespace> -- /bin/bash

# Port forward for local testing
kubectl port-forward svc/<service-name> 8080:80 -n <namespace>

# Get events (useful for debugging)
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

Support and Resources

Internal Resources

External Resources

Getting Help

For ops cluster deployment issues:

  1. Check this how-to guide for similar scenarios
  2. Review Terraform plan/apply output for specific errors
  3. Check pod logs using kubectl or Grafana (see monitoring section above)
  4. Search Artifact Hub for chart documentation
  5. Contact Platform team via Platform JIRA with:
    • Application name and namespace
    • Error messages or unexpected behavior
    • What you've already tried
    • GitLab MR link (if applicable)