How-To: Deploying and Managing Helm Charts in Our Kubernetes Cluster

Introduction

This guide provides step-by-step instructions for deploying and managing applications in our Kubernetes ops cluster using Helm charts and Terraform. It assumes you have read the explanation documentation and understand the basic concepts of our deployment architecture.

Important Scope Note: This guide is for deploying applications to the ops cluster only. The lower environments (dev, qa, staging) and production cluster are managed exclusively by the Platform team. If you need an application deployed to lower or prod environments, submit a request to the Platform team.

Prerequisites

GitLab Access: Permissions to create branches and merge requests in the cluster-manager repository
Basic Familiarity: Understanding of YAML syntax, Git workflow (branches, commits, merge requests), and basic Terraform syntax
Explanation Document: Read the "Understanding Our Kubernetes Deployment Architecture" explanation document first to understand the why behind these steps

Cluster Access Scope

Developers can deploy to:

ops cluster - Operational tools, monitoring, development utilities, POCs

Platform team manages:

lower environments (dev, qa, staging) - Application deployments for development/testing
production cluster - Production application workloads

If you need an application in lower or prod environments, work with the Platform team, who will handle the deployment following the same patterns described in this guide.

Work in Progress: The following deployment workflow sections are currently being updated:

How to Deploy a New Helm Chart Application - Step-by-step guide for deploying new applications
How to Modify an Existing Helm Release - Guide for updating configurations
How to Upgrade a Helm Chart Version - Instructions for version upgrades
How to Remove/Uninstall a Helm Release - Steps for cleaning up deployments

These sections contained examples specific to deprecated applications, which need to be replaced with generic guidance. They will be restored soon with updated content. For now, please use the monitoring and troubleshooting sections below, and contact the Platform team via Platform JIRA if you need assistance with deployments.

Monitoring Deployments with Grafana

After deploying an application, you'll need to verify it's running correctly and monitor its health. Our Grafana instance provides dashboards and tools to help you debug deployments and observe application behavior.

Accessing Grafana

Navigate to the Grafana instance for the ops cluster: https://grafana.ops.wwnorton.net
Authenticate using your Microsoft Entra ID credentials
Navigate to the "Dashboards" section

Key Actions for Debugging Deployments

Check Pod Status and Health

Open the Kubernetes Pods Dashboard
- Navigate to: Kubernetes / Views / Pods
- Select your application's namespace from the dropdown (e.g., poc-backstage)
- Select specific pods from the pod dropdown, or leave as "All" to view all pods in the namespace

Understanding Empty Panels: If you see panels showing "No data", it may be due to:

Information Section: Shows data only when a single pod is selected (not "All"). Select a specific pod from the dropdown to see details like "Created by", "Running on", "Pod IP", etc.
Resource Section: Requires at least one pod to be selected. If showing "All" and still no data, check that pods exist in the selected namespace.
Kubernetes Section: Tables like "Pods with Container Issues" and "Unscheduled Pods" will be empty if there are no issues in the namespace—this is a good sign! The "Container Restarts" and "OOM Events" graphs will show data only if restarts/OOM events have occurred.

Review Information Section

The dashboard shows key pod information at the top:
- Created by: Shows what Kubernetes resource created the pod (Deployment, StatefulSet, etc.)
- Running on: Which node the pod is running on
- Pod IP: The pod's IP address
- Priority Class: Priority class if assigned
- QOS Class: Quality of Service class (Guaranteed, Burstable, or BestEffort)
- Last Terminated Reason: If a container in the pod was recently terminated, shows why (OOMKilled, Error, Completed, etc.)
- Last Terminated Exit Code: Exit code from the last termination

Check Resource Usage

Review the Resource section:
- Gauge panels: Show overall CPU and Memory usage as percentages of Requests and Limits
- Resources by container table: Detailed breakdown showing:
  - CPU Requests and Limits per container
  - Memory Requests and Limits per container
  - Current CPU Used per container
  - Current Memory Used per container
- Time series graphs:
  - CPU Usage / Requests & Limits by container: Shows if containers are hitting their limits
  - Memory Usage / Requests & Limits by container: Shows memory pressure over time
  - CPU Usage by container: Absolute CPU cores used
  - Memory Usage by container: Absolute memory bytes used

Identify Container Issues

Scroll to the "Kubernetes" section to check for problems:
- Pods with Container Issues table: Shows pods with:
  - CrashLoopBackOff: Pods repeatedly crashing (check logs for errors)
  - ErrImagePull or ImagePullBackOff: Cannot pull container image (check image name/registry/credentials)
- Container Restarts by container: Graph showing restart frequency over time
- Unscheduled Pods table: Pods that cannot be scheduled (usually resource constraints or node affinity issues)
- OOM Events by container: Out-of-memory kills, which usually mean the memory limit is too low or the application is leaking memory

View Pod Logs

Access Logs via Grafana
- Navigate to "Explore" section in Grafana
- Select Loki as the data source
- Use LogQL queries to filter logs:
```
{namespace="your-namespace", pod="your-pod-name"}
```
- Search for error messages or specific log patterns

Common Log Patterns to Check
- Application startup messages
- Database connection errors
- Configuration errors
- Health check failures
- Port binding issues

Monitor Application Metrics

Find Application-Specific Dashboards
- Many Helm charts include Grafana dashboard definitions
- Check whether your chart ships dashboards (commonly as ConfigMaps that Grafana loads); metric scraping is configured separately via a ServiceMonitor
- Look for dashboards matching your application name

Check Key Metrics
- Request rates (if web application)
- Error rates
- Response times
- Active connections
- Custom application metrics

Verify Service and Ingress Health

Use kubectl to verify service and ingress configuration:

Service Endpoints

# Check if service has active endpoints
kubectl get endpoints <service-name> -n <namespace>

# Verify service selector matches pod labels
kubectl describe service <service-name> -n <namespace>

# Confirm service ports match container ports
kubectl get service <service-name> -n <namespace> -o yaml

Ingress Configuration

# Verify ingress is configured correctly
kubectl describe ingress <ingress-name> -n <namespace>

# Check TLS certificate status
kubectl get ingress <ingress-name> -n <namespace> -o yaml

Track Deployment Changes Over Time

Use Grafana's Time Range Selector
- Set time range to cover your deployment window
- Compare metrics before and after deployment
- Identify any regressions or issues

Check for Anomalies
- Sudden drops in request rates
- Increased error rates
- Resource usage spikes
- Pod restarts

When to Use kubectl vs Grafana

Use kubectl for:

Quick status checks (kubectl get pods -n <namespace>)
Executing into containers for debugging
Port-forwarding for local testing
Applying temporary fixes during incidents

Use Grafana for:

Historical analysis and trends
Aggregated metrics across multiple pods
Log analysis with search/filtering
Dashboard visualization for stakeholders
Alerting and monitoring

Getting Help with Monitoring

If you're unable to find the metrics or logs you need:

Verify your application exposes metrics (Prometheus format)
Check if ServiceMonitor is configured for your application
Contact Platform team via Platform JIRA for:
- Dashboard creation assistance
- Metric collection configuration
- Log retention policies
- Alert rule setup

Common Troubleshooting Scenarios

Pods in CrashLoopBackOff

Symptoms:

Pods repeatedly restarting
Status shows CrashLoopBackOff
Application not accessible

Diagnosis:

Check pod logs using kubectl or Grafana for error messages
Look for common issues:
- Missing environment variables
- Cannot connect to database
- Configuration errors
- Missing dependencies
- Resource limits too low

Resolution:

Fix the underlying issue (config, dependencies, resources)
Create MR with fix
After apply, pods should start successfully
If urgent, manually delete pods to force immediate restart after fix

ImagePullBackOff Errors

Symptoms:

Pods cannot start
Status shows ImagePullBackOff or ErrImagePull

Common Causes:

Image name or tag incorrect
Image repository requires authentication
Image doesn't exist
Registry unreachable

Resolution:

Verify image name and tag in chart values
Check chart documentation for correct image configuration
For private registries, configure image pull secrets
Update helm_release with corrected values
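
For the private-registry case, a minimal sketch of passing an image pull secret through helm_release. It assumes the chart exposes a top-level `imagePullSecrets` list (many charts do, but the key name varies, so check the chart's values.yaml) and that the secret already exists in the target namespace. The release name, repository URL, chart, namespace, and the secret name `registry-credentials` are all hypothetical.

```hcl
resource "helm_release" "private_image_app" {
  name       = "private-image-app"          # hypothetical release name
  repository = "https://charts.example.com" # hypothetical chart repository
  chart      = "private-image-app"
  version    = "1.2.3"
  namespace  = "example"

  # Assumes the chart reads imagePullSecrets as a list of { name = ... } objects.
  # The secret itself must already exist in the "example" namespace.
  set {
    name  = "imagePullSecrets[0].name"
    value = "registry-credentials"
  }
}
```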

Terraform Apply Timeout

Symptoms:

Apply pipeline runs for a very long time
Eventually times out with error
Pods may be stuck in Pending or slow to start

Common Causes:

wait = true and pods take too long to become ready
Insufficient cluster resources
Image pull is slow (large images)
Init containers taking a long time

Resolution:

Increase timeout value in helm_release (e.g., from 600 to 1200 seconds)
Set wait = false if you don't need to wait for readiness (not recommended for prod)
Investigate why pods are slow to start
Check cluster capacity using kubectl or review metrics in Grafana
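
The first two options map onto helm_release arguments roughly as in the sketch below. The release name, repository URL, chart, and namespace are hypothetical placeholders; `wait` and `timeout` are the helm provider arguments referred to above.

```hcl
resource "helm_release" "slow_starting_app" {
  name       = "slow-starting-app"          # hypothetical release name
  repository = "https://charts.example.com" # hypothetical chart repository
  chart      = "slow-starting-app"
  version    = "1.2.3"
  namespace  = "example"

  # Block the apply until the release's resources report ready.
  wait = true

  # Seconds to wait for each Helm operation; the provider default is 300.
  timeout = 1200
}
```

Raising the timeout only hides a slow start. If the apply still times out at 1200 seconds, the causes listed above (capacity, image size, init containers) are the more likely place to look.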

Configuration Values Not Taking Effect

Symptoms:

Changed configuration values but application behavior unchanged
Expected settings not applied

Diagnosis:

Check that the Terraform plan showed the resource as modified
Verify that the apply completed successfully
Check whether pods were restarted using kubectl (kubectl get pods -n <namespace>; the AGE column shows how long ago each pod was created)
Verify values syntax in template or set blocks

Resolution:

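Pods are only rolled when the rendered pod template changes. A change that touches only a ConfigMap or Secret leaves the pod template untouched, so running pods keep the old configuration.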

Manually delete pods to force restart with new config:

kubectl delete pods -n <namespace> -l app.kubernetes.io/name=<app-label>

Or add/change an annotation to force rollout:

set {
  name  = "podAnnotations.configVersion"
  value = "v2"  # Increment to force update
}
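
A variation on the annotation approach avoids the manual increment by deriving the annotation from the values themselves. This is a sketch under two assumptions: the chart renders `podAnnotations` into the pod template, and the release's values live in a single file (the path below is hypothetical, as are the release name, repository URL, and chart).

```hcl
locals {
  # Hypothetical values file for this release.
  example_app_values = file("${path.module}/values/example-app.yaml")
}

resource "helm_release" "checksum_app" {
  name       = "example-app"                # hypothetical release name
  repository = "https://charts.example.com" # hypothetical chart repository
  chart      = "example-app"
  version    = "1.2.3"
  namespace  = "example"

  values = [local.example_app_values]

  # The hash changes whenever the values file changes, which changes the
  # pod template and triggers a rollout.
  set {
    name  = "podAnnotations.configChecksum"
    value = sha256(local.example_app_values)
    type  = "string" # annotation values must be strings
  }
}
```

The trade-off is that every edit to the values file rolls the pods, including edits that would not otherwise need a restart.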

Terraform State Drift

Symptoms:

Terraform plan shows changes but you didn't modify code
"Resource has been changed outside of Terraform" messages

Common Causes:

Someone modified resource manually with kubectl or helm
Helm upgrade run directly instead of through Terraform
External controller modified resources

Resolution:

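If changes are desired: Update the Terraform code to match the current state, then refresh Terraform's state

# Update terraform code to match reality
# Then refresh state (on Terraform 0.15.4 and later this replaces the deprecated `terraform refresh`)
terraform apply -refresh-only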

If changes are unwanted: Let Terraform revert them
- Next apply will restore Terraform's desired state
- Resources will be updated to match code

Prevention: Never modify Terraform-managed resources manually except in emergencies

Permission Denied Errors

Symptoms:

Cannot create resources
"forbidden" errors in Terraform apply
Pods cannot access AWS services

Common Causes:

Insufficient Kubernetes RBAC permissions
Namespace doesn't exist
Service account missing IAM role (IRSA)
IAM policy too restrictive

Resolution:

Verify namespace exists before deploying
Check service account annotations for IAM role ARN
Review IAM policies attached to role
Contact Platform team to verify RBAC permissions
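
For the IRSA case, the service account annotation is usually set through chart values. The sketch below assumes the chart exposes `serviceAccount.annotations` (a common but not universal key) and that the IAM role already exists. The variable name, release name, repository URL, and chart are hypothetical; `eks.amazonaws.com/role-arn` is the annotation key EKS uses for IRSA.

```hcl
variable "example_app_role_arn" {
  description = "ARN of the IAM role the application's service account should assume"
  type        = string
}

resource "helm_release" "irsa_app" {
  name       = "example-app"                # hypothetical release name
  repository = "https://charts.example.com" # hypothetical chart repository
  chart      = "example-app"
  version    = "1.2.3"
  namespace  = "example"

  # Dots inside the annotation key are escaped so Helm does not treat
  # them as path separators.
  set {
    name  = "serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
    value = var.example_app_role_arn
  }
}
```

After the apply, kubectl describe serviceaccount <name> -n <namespace> should list the annotation. Pods started before the annotation was added need a restart to pick up the role.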

Ingress Not Working

Symptoms:

Application deployed but URL not accessible
ALB not created or not routing traffic
502/503 errors

Diagnosis:

Check Ingress resource exists and is configured:
```
kubectl describe ingress <name> -n <namespace>
```
Verify ALB was created in AWS console
Check ALB target group health
Verify service exists and has endpoints

Resolution:

Ensure ingress annotations are correct (especially for AWS ALB):

set {
  name  = "ingress.annotations.kubernetes\\.io/ingress\\.class"
  value = "alb"
}
set {
  name  = "ingress.annotations.alb\\.ingress\\.kubernetes\\.io/scheme"
  value = "internet-facing"
}

Verify service selector matches pod labels
Check service port matches container port
Verify security groups allow traffic
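
The escaped dots in the set blocks above are easy to get wrong. One alternative, sketched here, passes the same two annotations through `values` with `yamlencode`, where the keys are plain strings and need no escaping. The release name, repository URL, and chart are hypothetical, and the `ingress.annotations` key is assumed to match the chart, as in the set blocks above.

```hcl
resource "helm_release" "ingress_app" {
  name       = "example-app"                # hypothetical release name
  repository = "https://charts.example.com" # hypothetical chart repository
  chart      = "example-app"
  version    = "1.2.3"
  namespace  = "example"

  values = [
    yamlencode({
      ingress = {
        annotations = {
          # Same annotations as the set blocks, without backslash escaping.
          "kubernetes.io/ingress.class"      = "alb"
          "alb.ingress.kubernetes.io/scheme" = "internet-facing"
        }
      }
    })
  ]
}
```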

Best Practices and Tips

Cluster and Environment Scope

Ops cluster is for:

Operational and platform tools (monitoring, logging, alerting)
Development utilities and POCs
Internal tools

Ops cluster is not for:

Customer-facing services (these go to lower/prod via Platform team)
Production data processing

Version Pinning

Always pin explicit chart versions:

# Good
version = "26.21.0"

# Bad - don't use ranges or latest
version = "~> 26.0"
version = "latest"

Why: Reproducibility and preventing unexpected changes.
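
In context, the pinned version sits alongside the chart coordinates in helm_release. A minimal sketch, with a hypothetical release name, repository URL, and chart:

```hcl
resource "helm_release" "pinned_app" {
  name       = "example-app"                # hypothetical release name
  repository = "https://charts.example.com" # hypothetical chart repository
  chart      = "example-app"
  namespace  = "example"

  # Exact chart version. If this argument is omitted, the provider
  # installs whatever the latest chart version is at apply time.
  version = "26.21.0"
}
```

With an exact pin, a version upgrade becomes a one-line change that shows up in the merge request diff and in the Terraform plan.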

Namespaces

Use dedicated namespaces for:

POCs and experiments (easy cleanup)
Applications with many resources
Security isolation requirements
Team-specific applications

Share namespaces for:

Related microservices
Common utilities
Monitoring tools that logically belong together
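
A dedicated namespace can be declared next to the release so both are created and removed together. This sketch assumes the hashicorp/kubernetes provider is configured alongside the helm provider. The namespace name, label, release name, repository URL, and chart are hypothetical, and the cluster-manager repository may already have its own convention for namespaces.

```hcl
resource "kubernetes_namespace" "poc_example" {
  metadata {
    name = "poc-example" # hypothetical namespace for a POC

    labels = {
      purpose = "poc" # hypothetical label to make cleanup easier to find
    }
  }
}

resource "helm_release" "poc_example" {
  name       = "poc-example"                # hypothetical release name
  repository = "https://charts.example.com" # hypothetical chart repository
  chart      = "example-app"
  version    = "1.2.3"

  # Referencing the resource makes Terraform create the namespace first.
  namespace = kubernetes_namespace.poc_example.metadata[0].name
}
```

The helm provider's `create_namespace = true` argument is a shorter alternative when the namespace needs no labels or other settings of its own.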

Resource Requests and Limits

Set appropriate resource requests/limits:

set {
  name  = "resources.requests.memory"
  value = "512Mi"
}
set {
  name  = "resources.requests.cpu"
  value = "250m"
}
set {
  name  = "resources.limits.memory"
  value = "1Gi"
}
set {
  name  = "resources.limits.cpu"
  value = "1000m"
}

Guidelines:

Start conservative, adjust based on actual usage
Requests guarantee minimum resources
Limits prevent runaway resource consumption
Monitor actual usage with Grafana dashboards
Different values for dev vs production

Configuration Management

For simple deployments: Use set blocks

set {
  name  = "key"
  value = "value"
}

For complex deployments: Use values with templatefile

More maintainable
Easier to review
Follows familiar YAML structure
Supports complex logic
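
A minimal sketch of the `values` plus `templatefile` pattern. The template path, the variable names inside it, the hostname, the release name, the repository URL, and the chart are all hypothetical.

```hcl
resource "helm_release" "templated_app" {
  name       = "example-app"                # hypothetical release name
  repository = "https://charts.example.com" # hypothetical chart repository
  chart      = "example-app"
  version    = "1.2.3"
  namespace  = "example"

  values = [
    # The template is ordinary YAML with placeholders, for example:
    #   replicaCount: ${replica_count}
    #   ingress:
    #     host: ${hostname}
    templatefile("${path.module}/values/example-app.yaml.tpl", {
      replica_count = 2
      hostname      = "example-app.ops.example.com"
    })
  ]
}
```

Entries in `values` are merged in order, and set blocks are applied on top of them, so a small number of set overrides can still sit alongside a values template.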

Sensitive Data

Always use set_sensitive for secrets:

set_sensitive {
  name  = "password"
  value = var.password
}
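
The `var.password` referenced above should itself be declared sensitive so Terraform redacts it in plan and apply output. A sketch of the declaration, assuming Terraform 0.14 or later; how the value is supplied (for example a masked CI variable exposed as `TF_VAR_password`) is an assumption about the pipeline, not something this guide specifies.

```hcl
variable "password" {
  description = "Password passed to the chart via set_sensitive"
  type        = string
  sensitive   = true # redacts the value in plan and apply output
}
```

Marking a variable sensitive only affects what is displayed. The value is still written to the Terraform state.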

For highly sensitive data: Consider AWS Secrets Manager instead of Terraform variables

Secrets stored securely in AWS
Rotation capabilities
Audit logging
Use CSI driver to mount into pods

Testing Strategy

For ops cluster deployments:

Test locally if possible (minikube, kind, or Docker Desktop with Kubernetes)
Deploy to ops cluster during low-usage hours
Perform thorough validation before announcing availability
Monitor closely for first 24 hours
Gather user feedback

For applications needing multi-environment rollout: Work with the Platform team, who will handle progressive deployment:

Lower environments (dev → qa → staging)
Production after validation

Validation checklist:

Pods running and healthy
Logs show no errors
Application UI/API accessible
All features working
Performance acceptable
Resource usage reasonable
Monitoring and alerts configured

Documentation

Document your deployment:

Add comments in helm_config.tf explaining non-obvious choices
Document any deviations from chart defaults
Note any chart-specific quirks or issues encountered
Link to relevant chart documentation
Explain variable purposes in variable descriptions

Monitoring and Observability

After deployment, configure:

Prometheus ServiceMonitors (if applicable)
Grafana dashboards
Alert rules for critical metrics
Log aggregation to Loki
Traces to Tempo (if application supports)
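
Many charts can create their own ServiceMonitor when asked to through values. The sketch below assumes a `serviceMonitor.enabled` key, which is a hypothetical name: charts differ (some nest it under `metrics`), so confirm the key in the chart's values.yaml. The release name, repository URL, and chart are also hypothetical.

```hcl
resource "helm_release" "monitored_app" {
  name       = "example-app"                # hypothetical release name
  repository = "https://charts.example.com" # hypothetical chart repository
  chart      = "example-app"
  version    = "1.2.3"
  namespace  = "example"

  # Hypothetical values key; check the chart documentation for the real one.
  set {
    name  = "serviceMonitor.enabled"
    value = "true"
  }
}
```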

Edge Cases and Advanced Scenarios

Conditional Deployment

Control whether a resource is deployed using count:

resource "helm_release" "experimental_tool" {
  # Deploy only in ops cluster (safety check)
  count = var.eks_environment == "ops" ? 1 : 0

  # ... rest of config ...
}

This pattern ensures the deployment only happens in ops cluster even if the code exists in other environment modules.

Note: Since developers only work in ops cluster, this is mainly used as a safety mechanism or when the same code is shared across environment modules.
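
One consequence of `count` is that the resource becomes a list, so other code must refer to it by index or with a splat expression. A sketch, assuming Terraform 0.15 or later for `one()`; the variable description, release name, repository URL, chart, and the output are hypothetical additions around the `count` pattern shown above.

```hcl
variable "eks_environment" {
  description = "Name of the target cluster environment, for example ops"
  type        = string
}

resource "helm_release" "experimental_tool" {
  count = var.eks_environment == "ops" ? 1 : 0

  name       = "experimental-tool"          # hypothetical release name
  repository = "https://charts.example.com" # hypothetical chart repository
  chart      = "experimental-tool"
  version    = "1.2.3"
  namespace  = "example"
}

output "experimental_tool_status" {
  # one() returns the single element, or null when count is 0.
  value = one(helm_release.experimental_tool[*].status)
}
```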

Post-Install Jobs and Hooks

Some charts require post-install configuration:

resource "helm_release" "app" {
  # ... config ...

  # Wait for post-install jobs to complete
  wait          = true
  wait_for_jobs = true
  timeout       = 600
}

# Run additional configuration after Helm install
resource "null_resource" "post_install" {
  depends_on = [helm_release.app]

  provisioner "local-exec" {
    command = "kubectl apply -f ${path.module}/manifests/post-install-config.yaml"
  }
}

Useful Commands for Debugging

While we don't use these for deployment, they're helpful for debugging:

Helm Commands

# List all releases in a namespace
helm list -n <namespace>

# Get release values
helm get values <release-name> -n <namespace>

# Get release manifest
helm get manifest <release-name> -n <namespace>

# Check release history
helm history <release-name> -n <namespace>

# Debug - render templates without installing
helm template <release-name> <chart> -f values.yaml

Kubectl Commands

# Get all resources in namespace
kubectl get all -n <namespace>

# Describe resource for detailed info
kubectl describe pod <pod-name> -n <namespace>

# Get pod logs
kubectl logs <pod-name> -n <namespace>

# Follow logs in real-time
kubectl logs -f <pod-name> -n <namespace>

# Get logs from previous container (if crashed)
kubectl logs <pod-name> --previous -n <namespace>

# Open a shell in a container (use /bin/sh if the image has no bash)
kubectl exec -it <pod-name> -n <namespace> -- /bin/bash

# Port forward for local testing
kubectl port-forward svc/<service-name> 8080:80 -n <namespace>

# Get events (useful for debugging)
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

Support and Resources

Internal Resources

Platform JIRA: Primary support channel for deployment questions and issues
cluster-manager Repository: https://gitlab.com/wwnorton/ops/cluster-manager
Explanation Documentation: "Understanding Our Kubernetes Deployment Architecture"

External Resources

Helm Documentation: https://helm.sh/docs/
Artifact Hub: https://artifacthub.io/ (search for charts)
Terraform Helm Provider: https://registry.terraform.io/providers/hashicorp/helm/latest/docs
Kubernetes Documentation: https://kubernetes.io/docs/

Getting Help

For ops cluster deployment issues:

Check this how-to guide for similar scenarios
Review Terraform plan/apply output for specific errors
Check pod logs using kubectl or Grafana (see monitoring section above)
Search Artifact Hub for chart documentation
Contact Platform team via Platform JIRA with:
- Application name and namespace
- Error messages or unexpected behavior
- What you've already tried
- GitLab MR link (if applicable)
