How-To: Deploying and Managing Helm Charts in Our Kubernetes Cluster
Introduction
This guide provides step-by-step instructions for deploying and managing applications in our Kubernetes ops cluster using Helm charts and Terraform. It assumes you have read the explanation documentation and understand the basic concepts of our deployment architecture. For specific scenarios covered, check out the sidebar to the right ➡️
Important Scope Note: This guide is for deploying applications to the ops cluster only. The lower environments (dev, qa, staging) and production cluster are managed exclusively by the Platform team. If you need an application deployed to lower or prod environments, submit a request to the Platform team.
Prerequisites
- GitLab Access: Permissions to create branches and merge requests in the cluster-manager repository
- Basic Familiarity: Understanding of YAML syntax, Git workflow (branches, commits, merge requests), and basic Terraform syntax
- Explanation Document: Read the "Understanding Our Kubernetes Deployment Architecture" explanation document first to understand the why behind these steps
Cluster Access Scope
Developers can deploy to:
- ops cluster - Operational tools, monitoring, development utilities, POCs
Platform team manages:
- lower environments (dev, qa, staging) - Application deployments for development/testing
- production cluster - Production application workloads
If you need an application in lower or prod environments, work with the Platform team who will handle the deployment following the same patterns described in this guide.
Work in Progress: The following deployment workflow sections are currently being updated:
- How to Deploy a New Helm Chart Application - Step-by-step guide for deploying new applications
- How to Modify an Existing Helm Release - Guide for updating configurations
- How to Upgrade a Helm Chart Version - Instructions for version upgrades
- How to Remove/Uninstall a Helm Release - Steps for cleaning up deployments
These sections contained examples specific to deprecated applications, which need to be replaced with generic guidance. They will be restored soon with updated content. For now, please use the monitoring and troubleshooting sections below, and contact the Platform team via Platform JIRA if you need assistance with deployments.
Monitoring Deployments with Grafana
After deploying an application, you'll need to verify it's running correctly and monitor its health. Our Grafana instance provides dashboards and tools to help you debug deployments and observe application behavior.
Accessing Grafana
- Navigate to the Grafana instance for the ops cluster: https://grafana.ops.wwnorton.net
- Authenticate using your Microsoft Entra ID credentials
- Navigate to the "Dashboards" section
Key Actions for Debugging Deployments
Check Pod Status and Health
- Open the Kubernetes Pods Dashboard
  - Navigate to: Kubernetes / Views / Pods
  - Select your application's namespace from the dropdown (e.g., poc-backstage)
  - Select specific pods from the pod dropdown, or leave as "All" to view all pods in the namespace
Understanding Empty Panels: If you see panels showing "No data", it may be due to:
- Information Section: Shows data only when a single pod is selected (not "All"). Select a specific pod from the dropdown to see details like "Created by", "Running on", "Pod IP", etc.
- Resource Section: Requires at least one pod to be selected. If showing "All" and still no data, check that pods exist in the selected namespace.
- Kubernetes Section: Tables like "Pods with Container Issues" and "Unscheduled Pods" will be empty if there are no issues in the namespace—this is a good sign! The "Container Restarts" and "OOM Events" graphs will show data only if restarts/OOM events have occurred.
- Review Information Section
  The dashboard shows key pod information at the top:
  - Created by: Shows what Kubernetes resource created the pod (Deployment, StatefulSet, etc.)
  - Running on: Which node the pod is running on
  - Pod IP: The pod's IP address
  - Priority Class: Priority class if assigned
  - QOS Class: Quality of Service class (Guaranteed, Burstable, or BestEffort)
  - Last Terminated Reason: If a container was recently terminated, shows why (OOMKilled, Error, Completed, etc.)
  - Last Terminated Exit Code: Exit code from the last termination
- Check Resource Usage
  Review the Resource section:
  - Gauge panels: Show overall CPU and Memory usage as percentages of Requests and Limits
  - Resources by container table: Detailed breakdown showing:
    - CPU Requests and Limits per container
    - Memory Requests and Limits per container
    - Current CPU Used per container
    - Current Memory Used per container
  - Time series graphs:
    - CPU Usage / Requests & Limits by container: Shows if containers are hitting their limits
    - Memory Usage / Requests & Limits by container: Shows memory pressure over time
    - CPU Usage by container: Absolute CPU cores used
    - Memory Usage by container: Absolute memory bytes used
- Identify Container Issues
  Scroll to the "Kubernetes" section to check for problems:
  - Pods with Container Issues table: Shows pods with:
    - CrashLoopBackOff: Pods repeatedly crashing (check logs for errors)
    - ErrImagePull or ImagePullBackOff: Cannot pull container image (check image name/registry/credentials)
  - Container Restarts by container: Graph showing restart frequency over time
  - Unscheduled Pods table: Pods that cannot be scheduled (usually resource constraints or node affinity issues)
  - OOM Events by container: Out-of-memory events indicating memory limits are too low
View Pod Logs
- Access Logs via Grafana
  - Navigate to "Explore" section in Grafana
  - Select Loki as the data source
  - Use LogQL queries to filter logs:
    {namespace="your-namespace", pod="your-pod-name"}
  - Search for error messages or specific log patterns
- Common Log Patterns to Check
  - Application startup messages
  - Database connection errors
  - Configuration errors
  - Health check failures
  - Port binding issues
Monitor Application Metrics
- Find Application-Specific Dashboards
  - Many Helm charts include Grafana dashboard definitions
  - Check if your chart provides dashboards (these are separate from metric scraping, which is configured via a ServiceMonitor)
  - Look for dashboards matching your application name
- Check Key Metrics
  - Request rates (if web application)
  - Error rates
  - Response times
  - Active connections
  - Custom application metrics
Verify Service and Ingress Health
Use kubectl to verify service and ingress configuration:
- Service Endpoints
# Check if service has active endpoints
kubectl get endpoints <service-name> -n <namespace>
# Verify service selector matches pod labels
kubectl describe service <service-name> -n <namespace>
# Confirm service ports match container ports
kubectl get service <service-name> -n <namespace> -o yaml
- Ingress Configuration
# Verify ingress is configured correctly
kubectl describe ingress <ingress-name> -n <namespace>
# Check TLS certificate status
kubectl get ingress <ingress-name> -n <namespace> -o yaml
Track Deployment Changes Over Time
- Use Grafana's Time Range Selector
  - Set time range to cover your deployment window
  - Compare metrics before and after deployment
  - Identify any regressions or issues
- Check for Anomalies
  - Sudden drops in request rates
  - Increased error rates
  - Resource usage spikes
  - Pod restarts
When to Use kubectl vs Grafana
Use kubectl for:
- Quick status checks (kubectl get pods -n <namespace>)
- Executing into containers for debugging
- Port-forwarding for local testing
- Applying temporary fixes during incidents
Use Grafana for:
- Historical analysis and trends
- Aggregated metrics across multiple pods
- Log analysis with search/filtering
- Dashboard visualization for stakeholders
- Alerting and monitoring
Getting Help with Monitoring
If you're unable to find the metrics or logs you need:
- Verify your application exposes metrics (Prometheus format)
- Check if ServiceMonitor is configured for your application
- Contact Platform team via Platform JIRA for:
  - Dashboard creation assistance
  - Metric collection configuration
  - Log retention policies
  - Alert rule setup
Common Troubleshooting Scenarios
Pods in CrashLoopBackOff
Symptoms:
- Pods repeatedly restarting
- Status shows CrashLoopBackOff
- Application not accessible
Diagnosis:
- Check pod logs using kubectl or Grafana for error messages
- Look for common issues:
  - Missing environment variables
  - Cannot connect to database
  - Configuration errors
  - Missing dependencies
  - Resource limits too low
Resolution:
- Fix the underlying issue (config, dependencies, resources)
- Create MR with fix
- After apply, pods should start successfully
- If urgent, manually delete pods to force immediate restart after fix
ImagePullBackOff Errors
Symptoms:
- Pods cannot start
- Status shows ImagePullBackOff or ErrImagePull
Common Causes:
- Image name or tag incorrect
- Image repository requires authentication
- Image doesn't exist
- Registry unreachable
Resolution:
- Verify image name and tag in chart values
- Check chart documentation for correct image configuration
- For private registries, configure image pull secrets
- Update helm_release with corrected values
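The last two resolution steps might look like the following. This is a minimal sketch, assuming the chart exposes the common image.repository, image.tag, and imagePullSecrets values; key names vary by chart, so check its values.yaml first. The release name, chart repository URL, registry path, and secret name are hypothetical placeholders, and the set block syntax assumes the same Helm provider version as the other snippets in this guide.

```hcl
resource "helm_release" "example_app" {
  name       = "example-app"                # hypothetical release name
  namespace  = "poc-example"                # hypothetical namespace
  repository = "https://charts.example.com" # hypothetical chart repository
  chart      = "example-app"
  version    = "1.2.3"

  # Corrected image reference (key names are chart-specific assumptions)
  set {
    name  = "image.repository"
    value = "registry.example.com/team/example-app"
  }

  set {
    name  = "image.tag"
    value = "1.4.2"
  }

  # Reference an existing pull secret for a private registry
  set {
    name  = "imagePullSecrets[0].name"
    value = "registry-credentials"
  }
}
```

The pull secret must already exist in the target namespace; this fragment only references it.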
Terraform Apply Timeout
Symptoms:
- Apply pipeline runs for very long time
- Eventually times out with error
- Pods may be stuck in Pending or slow to start
Common Causes:
- wait = true and pods take too long to become ready
- Insufficient cluster resources
- Image pull is slow (large images)
- Init containers taking long time
Resolution:
- Increase the timeout value (in seconds) in helm_release (e.g., from 600 to 1200)
- Set wait = false if you don't need to wait for readiness (not recommended for prod)
- Investigate why pods are slow to start
- Check cluster capacity using kubectl or review metrics in Grafana
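As a rough illustration of the first resolution step, the fragment below raises the timeout while keeping readiness checks on. The release name, namespace, and chart coordinates are hypothetical; timeout and wait are existing helm_release arguments, with timeout expressed in seconds.

```hcl
resource "helm_release" "example_app" {
  name       = "example-app"                # hypothetical release name
  namespace  = "poc-example"                # hypothetical namespace
  repository = "https://charts.example.com" # hypothetical chart repository
  chart      = "example-app"
  version    = "1.2.3"

  # Keep waiting for readiness, but allow up to 20 minutes
  wait    = true
  timeout = 1200
}
```

Raising the timeout only hides slow startups, so it is worth pairing with the investigation steps above.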
Configuration Values Not Taking Effect
Symptoms:
- Changed configuration values but application behavior unchanged
- Expected settings not applied
Diagnosis:
- Check Terraform plan showed the resource as modified
- Verify apply completed successfully
- Check pods were restarted using kubectl (kubectl get pods -n <namespace>)
- Verify values syntax in template or set blocks
Resolution:
- If a change only touches a ConfigMap or Secret and leaves the pod template unchanged, Helm will not restart the pods
- Manually delete pods to force restart with new config:
kubectl delete pods -n <namespace> -l app.kubernetes.io/name=<app-label>
- Or add/change an annotation to force rollout:
set {
name = "podAnnotations.configVersion"
value = "v2" # Increment to force update
}
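Instead of incrementing the annotation by hand, one option is to derive it from a hash of the values file, so any change to the file alters the pod template and triggers a rollout. This is a sketch under assumptions: the chart must render podAnnotations into the pod template, and the file path, release name, and chart coordinates are hypothetical.

```hcl
resource "helm_release" "example_app" {
  name       = "example-app"                # hypothetical release name
  namespace  = "poc-example"                # hypothetical namespace
  repository = "https://charts.example.com" # hypothetical chart repository
  chart      = "example-app"
  version    = "1.2.3"

  # Static values file (hypothetical path)
  values = [file("${path.module}/values/example-app.yaml")]

  # The annotation changes whenever the values file changes
  set {
    name  = "podAnnotations.configChecksum"
    value = filesha256("${path.module}/values/example-app.yaml")
    type  = "string" # avoid the digest being parsed as a number
  }
}
```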
Terraform State Drift
Symptoms:
- Terraform plan shows changes but you didn't modify code
- "Resource has been changed outside of Terraform" messages
Common Causes:
- Someone modified resource manually with kubectl or helm
- Helm upgrade run directly instead of through Terraform
- External controller modified resources
Resolution:
- If changes are desired: Update the Terraform code to match the current state, then reconcile state
# Update terraform code to match reality
# Then reconcile state (terraform refresh is deprecated; use -refresh-only)
terraform apply -refresh-only
- If changes are unwanted: Let Terraform revert them
  - Next apply will restore Terraform's desired state
  - Resources will be updated to match code
- Prevention: Never modify Terraform-managed resources manually except in emergencies
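A related case is a release that was installed directly with helm and is not in Terraform state at all. It can be adopted rather than recreated. The sketch below assumes Terraform 1.5 or later (which introduced import blocks) and uses the namespace/release-name ID format documented for helm_release imports. All names and chart coordinates are hypothetical.

```hcl
# Adopt an existing release into state on the next plan/apply
import {
  to = helm_release.example_app
  id = "poc-example/example-app" # <namespace>/<release-name>
}

resource "helm_release" "example_app" {
  name       = "example-app"
  namespace  = "poc-example"
  repository = "https://charts.example.com" # hypothetical chart repository
  chart      = "example-app"
  version    = "1.2.3" # match the version currently installed
}
```

Once the import has been applied, the import block can be removed.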
Permission Denied Errors
Symptoms:
- Cannot create resources
- "forbidden" errors in Terraform apply
- Pods cannot access AWS services
Common Causes:
- Insufficient Kubernetes RBAC permissions
- Namespace doesn't exist
- Service account missing IAM role (IRSA)
- IAM policy too restrictive
Resolution:
- Verify namespace exists before deploying
- Check service account annotations for IAM role ARN
- Review IAM policies attached to role
- Contact Platform team to verify RBAC permissions
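For the service account annotation check, the configuration often looks something like the following. It is a sketch only: the serviceAccount.* value keys are a common chart convention rather than a guarantee, the variable and release names are hypothetical, and create_namespace is shown as one possible way to avoid the missing-namespace error, which may not match how namespaces are provisioned in cluster-manager.

```hcl
variable "example_app_role_arn" {
  description = "ARN of the IAM role the example-app service account assumes via IRSA (hypothetical)"
  type        = string
}

resource "helm_release" "example_app" {
  name             = "example-app"                # hypothetical release name
  namespace        = "poc-example"                # hypothetical namespace
  create_namespace = true                         # assumption: Terraform may create it
  repository       = "https://charts.example.com" # hypothetical chart repository
  chart            = "example-app"
  version          = "1.2.3"

  set {
    name  = "serviceAccount.create"
    value = "true"
  }

  # Dots inside the annotation key are escaped so Helm treats it as one key
  set {
    name  = "serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
    value = var.example_app_role_arn
  }
}
```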
Ingress Not Working
Symptoms:
- Application deployed but URL not accessible
- ALB not created or not routing traffic
- 502/503 errors
Diagnosis:
- Check Ingress resource exists and is configured:
kubectl describe ingress <name> -n <namespace>
- Verify ALB was created in AWS console
- Check ALB target group health
- Verify service exists and has endpoints
Resolution:
- Ensure ingress annotations are correct (especially for AWS ALB):
set {
name = "ingress.annotations.kubernetes\\.io/ingress\\.class"
value = "alb"
}
set {
name = "ingress.annotations.alb\\.ingress\\.kubernetes\\.io/scheme"
value = "internet-facing"
}
- Verify service selector matches pod labels
- Check service port matches container port
- Verify security groups allow traffic
Best Practices and Tips
Cluster and Environment Scope
Ops cluster is for:
- Operational and platform tools (monitoring, logging, alerting)
- Development utilities and POCs
- Internal tools
Ops cluster is not for:
- Customer-facing services (these go to lower/prod via Platform team)
- Production data processing
Version Pinning
Always pin explicit chart versions:
# Good
version = "26.21.0"
# Bad - don't use ranges or latest
version = "~> 26.0"
version = "latest"
Why: Reproducibility and preventing unexpected changes.
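If the chart version is passed in as a variable, a validation rule can reject anything that is not an exact version. This is a hedged sketch; the variable name, release name, and chart coordinates are hypothetical, and the regex only accepts plain MAJOR.MINOR.PATCH versions.

```hcl
variable "example_chart_version" {
  description = "Exact chart version to deploy (no ranges, no latest)"
  type        = string
  default     = "26.21.0"

  validation {
    # Accept only an exact MAJOR.MINOR.PATCH version
    condition     = can(regex("^[0-9]+\\.[0-9]+\\.[0-9]+$", var.example_chart_version))
    error_message = "Pin an exact chart version such as 26.21.0."
  }
}

resource "helm_release" "example_app" {
  name       = "example-app"                # hypothetical release name
  namespace  = "poc-example"                # hypothetical namespace
  repository = "https://charts.example.com" # hypothetical chart repository
  chart      = "example-app"
  version    = var.example_chart_version
}
```

Charts that publish pre-release or build-metadata versions would need a looser pattern.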
Namespaces
Use dedicated namespaces for:
- POCs and experiments (easy cleanup)
- Applications with many resources
- Security isolation requirements
- Team-specific applications
Share namespaces for:
- Related microservices
- Common utilities
- Monitoring tools that logically belong together
Resource Requests and Limits
Set appropriate resource requests/limits:
set {
name = "resources.requests.memory"
value = "512Mi"
}
set {
name = "resources.requests.cpu"
value = "250m"
}
set {
name = "resources.limits.memory"
value = "1Gi"
}
set {
name = "resources.limits.cpu"
value = "1000m"
}
Guidelines:
- Start conservative, adjust based on actual usage
- Requests guarantee minimum resources
- Limits prevent runaway resource consumption
- Monitor actual usage with Grafana dashboards
- Different values for dev vs production
Configuration Management
For simple deployments: Use set blocks
set {
name = "key"
value = "value"
}
For complex deployments: Use values with templatefile
- More maintainable
- Easier to review
- Follows familiar YAML structure
- Supports complex logic
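A minimal sketch of the templatefile approach follows. The template path, the variables passed to it, the hostname, and the release details are all hypothetical. The template contents are shown as comments for reference.

```hcl
# values/example-app.yaml.tpl (hypothetical template) might contain:
#   replicaCount: ${replica_count}
#   ingress:
#     hosts:
#       - host: ${hostname}

resource "helm_release" "example_app" {
  name       = "example-app"                # hypothetical release name
  namespace  = "poc-example"                # hypothetical namespace
  repository = "https://charts.example.com" # hypothetical chart repository
  chart      = "example-app"
  version    = "1.2.3"

  # Render the YAML template and pass it to Helm as a values document
  values = [
    templatefile("${path.module}/values/example-app.yaml.tpl", {
      replica_count = 2
      hostname      = "example-app.ops.example.com"
    })
  ]
}
```

Any set blocks on the same release are applied on top of values, so the two approaches can be mixed when a single override is needed.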
Sensitive Data
Always use set_sensitive for secrets:
set_sensitive {
name = "password"
value = var.password
}
For highly sensitive data: Consider AWS Secrets Manager instead of Terraform variables
- Secrets stored securely in AWS
- Rotation capabilities
- Audit logging
- Use CSI driver to mount into pods
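The CSI approach could be wired up roughly as below. This sketch rests on several assumptions that should be confirmed with the Platform team: that the Secrets Store CSI driver and its AWS provider are installed in the ops cluster, that the hashicorp/kubernetes provider is configured in the module, and that the pod's service account has IAM access to the secret. The resource name, namespace, and secret name are hypothetical.

```hcl
resource "kubernetes_manifest" "example_app_secrets" {
  manifest = {
    apiVersion = "secrets-store.csi.x-k8s.io/v1"
    kind       = "SecretProviderClass"
    metadata = {
      name      = "example-app-secrets" # hypothetical name
      namespace = "poc-example"         # hypothetical namespace
    }
    spec = {
      provider = "aws"
      parameters = {
        # The AWS provider expects the object list as a YAML string
        objects = yamlencode([
          {
            objectName = "ops/example-app/db-password" # hypothetical secret name
            objectType = "secretsmanager"
          }
        ])
      }
    }
  }
}
```

The application pods still need a CSI volume that references this SecretProviderClass, which depends on whether the chart supports extra volumes and volume mounts.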
Testing Strategy
For ops cluster deployments:
- Test locally if possible (minikube, kind, or Docker Desktop with Kubernetes)
- Deploy to ops cluster during low-usage hours
- Perform thorough validation before announcing availability
- Monitor closely for first 24 hours
- Gather user feedback
For applications needing multi-environment rollout: Work with Platform team who will handle progressive deployment:
- Lower environments (dev → qa → staging)
- Production after validation
Validation checklist:
- Pods running and healthy
- Logs show no errors
- Application UI/API accessible
- All features working
- Performance acceptable
- Resource usage reasonable
- Monitoring and alerts configured
Documentation
Document your deployment:
- Add comments in helm_config.tf explaining non-obvious choices
- Document any deviations from chart defaults
- Note any chart-specific quirks or issues encountered
- Link to relevant chart documentation
- Explain variable purposes in variable descriptions
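As an illustration of these points, the fragment below pairs a described variable with a comment at the place where the chart default is overridden. The variable, the stated rationale, the replicaCount key, and the release details are hypothetical.

```hcl
variable "example_app_replicas" {
  description = "Number of example-app pods. Kept at 1 because the embedded database does not support multiple replicas."
  type        = number
  default     = 1
}

resource "helm_release" "example_app" {
  name       = "example-app"                # hypothetical release name
  namespace  = "poc-example"                # hypothetical namespace
  repository = "https://charts.example.com" # hypothetical chart repository
  chart      = "example-app"
  version    = "1.2.3"

  # Deviation from the chart default: see the variable description for why.
  set {
    name  = "replicaCount"
    value = tostring(var.example_app_replicas)
  }
}
```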
Monitoring and Observability
After deployment, configure:
- Prometheus ServiceMonitors (if applicable)
- Grafana dashboards
- Alert rules for critical metrics
- Log aggregation to Loki
- Traces to Tempo (if application supports)
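Many charts can create the ServiceMonitor themselves once metrics are switched on. The value keys below follow one common chart convention and are assumptions, as are the release details; check the chart's values.yaml for the actual names and for any labels the Prometheus instance requires.

```hcl
resource "helm_release" "example_app" {
  name       = "example-app"                # hypothetical release name
  namespace  = "poc-example"                # hypothetical namespace
  repository = "https://charts.example.com" # hypothetical chart repository
  chart      = "example-app"
  version    = "1.2.3"

  # Expose Prometheus-format metrics (key name is chart-specific)
  set {
    name  = "metrics.enabled"
    value = "true"
  }

  # Ask the chart to create a ServiceMonitor for scraping
  set {
    name  = "metrics.serviceMonitor.enabled"
    value = "true"
  }
}
```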
Edge Cases and Advanced Scenarios
Conditional Deployment
Control whether a resource is deployed using count:
resource "helm_release" "experimental_tool" {
# Deploy only in ops cluster (safety check)
count = var.eks_environment == "ops" ? 1 : 0
# ... rest of config ...
}
This pattern ensures the deployment only happens in ops cluster even if the code exists in other environment modules.
Note: Since developers only work in ops cluster, this is mainly used as a safety mechanism or when the same code is shared across environment modules.
Post-Install Jobs and Hooks
Some charts require post-install configuration:
resource "helm_release" "app" {
# ... config ...
# Wait for post-install jobs to complete
wait = true
wait_for_jobs = true
timeout = 600
}
# Run additional configuration after Helm install
resource "null_resource" "post_install" {
depends_on = [helm_release.app]
provisioner "local-exec" {
command = "kubectl apply -f ${path.module}/manifests/post-install-config.yaml"
}
}
Useful Commands for Debugging
While we don't use these for deployment, they're helpful for debugging:
Helm Commands
# List all releases in a namespace
helm list -n <namespace>
# Get release values
helm get values <release-name> -n <namespace>
# Get release manifest
helm get manifest <release-name> -n <namespace>
# Check release history
helm history <release-name> -n <namespace>
# Debug - render templates without installing
helm template <release-name> <chart> -f values.yaml
Kubectl Commands
# Get all resources in namespace
kubectl get all -n <namespace>
# Describe resource for detailed info
kubectl describe pod <pod-name> -n <namespace>
# Get pod logs
kubectl logs <pod-name> -n <namespace>
# Follow logs in real-time
kubectl logs -f <pod-name> -n <namespace>
# Get logs from previous container (if crashed)
kubectl logs <pod-name> --previous -n <namespace>
# Execute command in container
kubectl exec -it <pod-name> -n <namespace> -- /bin/bash
# Port forward for local testing
kubectl port-forward svc/<service-name> 8080:80 -n <namespace>
# Get events (useful for debugging)
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
Support and Resources
Internal Resources
- Platform JIRA: Primary support channel for deployment questions and issues
- cluster-manager Repository: https://gitlab.com/wwnorton/ops/cluster-manager
- Explanation Documentation: "Understanding Our Kubernetes Deployment Architecture"
External Resources
- Helm Documentation: https://helm.sh/docs/
- Artifact Hub: https://artifacthub.io/ (search for charts)
- Terraform Helm Provider: https://registry.terraform.io/providers/hashicorp/helm/latest/docs
- Kubernetes Documentation: https://kubernetes.io/docs/
Getting Help
For ops cluster deployment issues:
- Check this how-to guide for similar scenarios
- Review Terraform plan/apply output for specific errors
- Check pod logs using kubectl or Grafana (see the monitoring section above)
- Search Artifact Hub for chart documentation
- Contact Platform team via Platform JIRA with:
  - Application name and namespace
  - Error messages or unexpected behavior
  - What you've already tried
  - GitLab MR link (if applicable)