SRE Starter Project
Intro
This is an alternate ending to the dev start project, intended for SREs. It covers similar themes and adds how we approach troubleshooting.
Prerequisites
It is assumed that you have completed the dev start project through the run section. If not, return to the Starter Project documentation.
Troubleshooting Approach
The SRE approach to problem-solving is a skillset that can be learned through consistent practice. While learning these skills, it is important to approach each problem with the proper mindset:
- Risk and failure are inevitable. Therefore, remain calm.
- Tests and data are used to distinguish between theories. Therefore, seek understanding.
- Communication and collaboration are required to solve complex problems. Therefore, ask for help.
With this approach, methodical problem-solving can occur.
Report
When a problem occurs, there should be data that describes the problem in detail. If not, the first step is to gather more data about the problem including scope and severity. Ideally, the description will include the intended behavior, the observed behavior, and steps/conditions required to reproduce the behavior. This data allows an appropriate response.
Mitigation
For many problems, the next step is to immediately reduce the impact. Root cause analysis can wait. Workarounds to prevent larger impacts should be implemented quickly.
Analysis
Investigate metrics and gather logs for details about the problem. Determine which data is and is not relevant. Ask questions about the system. Determine the relationship between system components and the observed symptoms. Look for anomalies in other services that are correlated. Generate hypotheses.
Dependency Management for Builds
Dependencies are managed in the kubernetes/dev-with-dependencies/kustomization.yaml file. Add the following dependencies, which are needed for the next section:
components:
- https://coderepo.mobilehealth.va.gov/scm/ckm/wiremock.git//kubernetes?ref=Release/3.9.1&timeout=60s
- https://coderepo.mobilehealth.va.gov/scm/iums/mobile-mvi-service.git//kubernetes/components/dev?ref=Release/1.33&timeout=60s
- https://coderepo.mobilehealth.va.gov/scm/iums/user-session-service.git//kubernetes/components/dev?ref=Release/1.22&timeout=60s
- https://coderepo.mobilehealth.va.gov/scm/vdms/redis.git//kubernetes?ref=Release/7.0.15&timeout=60s
Add Health Check For External CKM Service:
Just as the service health check must take our own service’s health into consideration, it must also take into consideration the services it relies upon.
Verify the health check test passes after adding the health checks for both mobile-mvi-service and user-session-service.
The health checks can be added to the existing ExampleComponentHealthCheckConfig by making the following changes:
- Update AppProperties.java: add String properties for the mobile-mvi-service and user-session-service URLs.
- Update application.properties: add the configuration for the new properties.
- Update the ExampleComponentHealthCheckConfig to account for our new dependencies.
- Verify the health checks by adding an actuator test class.
AppProperties.java
package gov.va.mobile.starter.v1.service;

import lombok.Getter;
import lombok.Setter;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.validation.annotation.Validated;
import jakarta.validation.constraints.NotEmpty;

/**
 * Configuration properties for starter-service.
 *
 * @since 1.0
 */
@Getter
@Setter
@Validated
@ConfigurationProperties("mobile.starter")
public class AppProperties {

    /** URL of mobile-mvi-service, bound from mobile.starter.mobile-mvi-svc-url. */
    @NotEmpty
    private String mobileMviSvcUrl;

    /** URL of user-session-service, bound from mobile.starter.user-session-svc-url. */
    @NotEmpty
    private String userSessionSvcUrl;
}
application.properties
management.server.port=8081
mobile.starter.mobile-mvi-svc-url=${MOBILE_MVI_SVC_URL:http://mobile-mvi-service-v1:8080/mvi/v1}
mobile.starter.user-session-svc-url=${USER_SVC_URL:http://user-session-service-v1:8080/session/v1}
Updated ExampleComponentHealthCheckConfig.java
package gov.va.mobile.starter.v1.service.health;

import gov.va.mobile.service.client.http.rest.RestHealthIndicatorBuilder;
import gov.va.mobile.starter.v1.service.AppProperties;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.boot.actuate.health.HealthIndicator;

/**
 * {@link RestHealthIndicatorBuilder} implementation for verifying the health of an external component.
 *
 * @see RestHealthIndicatorBuilder
 * @since 1.0
 */
@Configuration
public class ExampleComponentHealthCheckConfig {

    /**
     * Instantiates the Health Check for the User Session Service
     *
     * @param builder RestHealthIndicatorBuilder for configuring the HealthIndicator
     * @param properties AppProperties for the Application
     * @return {@link HealthIndicator}
     */
    @Bean("user-session-service")
    public HealthIndicator userSessionServiceHealthIndicator(final RestHealthIndicatorBuilder builder,
                                                             final AppProperties properties) {
        return builder.url(properties.getUserSessionSvcUrl()).build();
    }

    /**
     * Instantiates the Health Check for the Mobile MVI Service
     *
     * @param builder RestHealthIndicatorBuilder for configuring the HealthIndicator
     * @param properties AppProperties for the Application
     * @return {@link HealthIndicator}
     */
    @Bean("mobile-mvi-service")
    public HealthIndicator mobileMviServiceHealthIndicator(final RestHealthIndicatorBuilder builder,
                                                           final AppProperties properties) {
        return builder.url(properties.getMobileMviSvcUrl()).build();
    }
}
After making the above changes, create a test class called ActuatorITCase.java (in the same directory as your ServiceResourceITCase.java test class) to verify the health checks:
ActuatorITCase.java
package gov.va.mobile.starter.v1.service;

import gov.va.mobile.service.test.AbstractHealthCheckITCase;
import gov.va.mobile.tools.skaffold.annotations.ServiceUrl;
import java.util.List;

/**
 * Integration Test cases for Actuator Endpoints.
 *
 * @since 1.0
 */
class ActuatorITCase extends AbstractHealthCheckITCase {

    @ServiceUrl(name = "starter-service-v1", port = 8081)
    protected static String SERVICE_URL;

    ActuatorITCase() {
        super(SERVICE_URL);
    }

    @Override
    protected List<String> healthCheckJsonPaths() {
        return List.of("$.status", "$.components.mobile-mvi-service.status", "$.components.user-session-service.status");
    }
}
Run a build of your service (mvn clean install -Pwith-skaffold) to ensure your changes compile and your test class executes successfully.
Run and Use the SBA Service:
To gather operational data, logs, and metrics, there are several tools available. The simplest to run locally is Spring Boot Admin (SBA). It provides health status at a glance, detailed metrics, and other useful data about our standardized Java Spring services.
In the kubernetes/dev-with-dependencies/kustomization.yaml file, add the configuration for Spring Boot Admin and its dependencies, and update the other services to register with SBA.
resources:
- https://coderepo.mobilehealth.va.gov/scm/ckm/security-app-config.git//dev?ref=main
components:
- https://coderepo.mobilehealth.va.gov/scm/dhsss/spring-boot-admin-service.git//kubernetes/components/dev?ref=Release/1.13&timeout=60s
- https://coderepo.mobilehealth.va.gov/scm/ckm/admin-idp.git//kubernetes/components/dev?ref=Release/2.34&timeout=60s
- https://coderepo.mobilehealth.va.gov/scm/iums/jwt-signing-service.git//kubernetes/components/dev?ref=Release/1.22&timeout=60s
- https://coderepo.mobilehealth.va.gov/scm/ckm/openldap.git//kubernetes?ref=Release/1.5&timeout=60s
Also add the following entries to the kubernetes/base/application.env file.
SBA_PATH=http://spring-boot-admin-service-v1:8080/sba/v1
ADMIN_IDP_URL=http://admin-idp-v2:8080/admin/v2
Did you see any error messages when you tried to build the service?
If yes, what did the error say?
What steps did you take to figure out and fix the problem?
Please post a message to Greenfield dev describing the issue, if any, and the next steps you believe should fix it.
To access the dashboard, run the build and use kubectl to determine the node port for the Spring Boot Admin Kubernetes service (container port 8080).
Open a browser to http://localhost:<nodePort>/sba/v1/wallboard.
Replace <nodePort> with the port you have exposed locally on your system
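One way to look up the node port is sketched below. It assumes the Kubernetes service is named spring-boot-admin-service-v1 (inferred from the SBA_PATH host) and runs in the starter-service-test namespace (taken from the port-forward example on this page). Both names are assumptions, so adjust them if your cluster differs.

```shell
# Prints the node port mapped to port 8080 of the SBA service.
# Assumed defaults: service "spring-boot-admin-service-v1",
# namespace "starter-service-test". Pass other names as $1 and $2.
sba_node_port() {
  kubectl get service "${1:-spring-boot-admin-service-v1}" \
    --namespace "${2:-starter-service-test}" \
    -o jsonpath='{.spec.ports[?(@.port==8080)].nodePort}'
}
```

After pasting the function into your shell, running sba_node_port prints the value to substitute for <nodePort>. If nothing is printed, list the services in the namespace with kubectl get services to confirm the actual service name and port.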
Become very familiar with the data that SBA displays.
If you already have skaffold dev running, stop it with CTRL+C.
Make sure to run the service as skaffold --port-forward=true dev
If you encounter local system port conflicts, you might need to expose a different port for the Spring Boot Admin web console:
kubectl port-forward spring-boot-admin-service-v1-<> 8888:8080 --namespace starter-service-test
Scale Down Redis
Use kubectl to reduce the number of pod replicas of the Redis service deployment to zero. Observe the SBA dashboard as the status changes.
Scale Up Redis
Use kubectl to restore the Redis service pod. Check the SBA dashboard to verify that system health is restored.
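Both scaling steps can be sketched with a single helper. The deployment name redis and the namespace starter-service-test are assumptions, not values confirmed by this page. Run kubectl get deployments in your namespace to find the real name.

```shell
# Scales the Redis deployment to the replica count given as $1.
# Assumed defaults: deployment "redis", namespace "starter-service-test".
# Override them with the REDIS_DEPLOYMENT and NAMESPACE variables.
scale_redis() {
  kubectl scale deployment "${REDIS_DEPLOYMENT:-redis}" \
    --replicas="$1" \
    --namespace "${NAMESPACE:-starter-service-test}"
}
```

With the function defined, scale_redis 0 removes the Redis pod and scale_redis 1 brings a single pod back. One replica is assumed here because the text refers to a single pod.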
Incident Simulation
In this section we will run a live-fire simulation of a previously observed failure in our environment. For an incoming SRE, this simulation provides practice in end-to-end incident handling and rapid incident detection. From there, you are expected to form and test your theory and document your findings.
Ensure you are in the root directory of the starter-project repository.
Add the simulation to Kubernetes by applying its manifest:
kubectl apply -f simulation/sre_simulation.yaml
Ensure the scripts are executable:
chmod +x simulation/scripts/*
Add a Custom Metric
To gather more data that is critical for the operation of a service, we can define a custom metric which can then be observed through SBA or collected by Prometheus. Create a custom metric that counts requests sent to the /patients/<icn>/info endpoint and records success or failure. See the docs and this guide for example code.
You can use a single Metrics.counter().increment() expression. Note that if you don’t give a tag to your counter, the metric won’t be usable in SBA.
Send Requests
Send a few requests that should fail because they lack a valid JWT.
curl --location --request GET 'http://localhost:31349/starter/v1/patients/123/info' \
  --header 'x-vamf-jwt: NOTAREALJWT'
Send a few requests that should succeed.
curl --location --request GET 'http://localhost:31349/starter/v1/patients/123/info' \
  --header 'x-vamf-jwt: <<your-jwt-here>>'
Send multiples of each request and observe metrics being collected in SBA.
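Repeating a request by hand gets tedious, so a small loop can help. The sketch below reuses the URL from the curl examples in this section. The function name send_requests and the default count of 5 are illustrative choices, and the port may differ on your system.

```shell
# Sends the same request several times and prints one HTTP status code per request.
# $1 is the JWT to send, $2 is the number of requests (defaults to 5).
send_requests() {
  jwt="$1"
  count="${2:-5}"
  i=0
  while [ "$i" -lt "$count" ]; do
    curl --silent --output /dev/null --write-out '%{http_code}\n' \
      --header "x-vamf-jwt: $jwt" \
      'http://localhost:31349/starter/v1/patients/123/info'
    i=$((i + 1))
  done
}
```

For example, send_requests NOTAREALJWT 5 produces the failing requests, and calling it again with a valid JWT produces the successful ones. Compare the printed status codes with the counter values shown in SBA.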
Review / Acceptance
Reach out to the SRE team lead to schedule a review when you have completed the project and simulation.
Useful links
Search the Internet: http://www.google.com
Kubectl cheat sheet: http://kubernetes.io/docs/reference/kubectl/cheatsheet/
Google’s SRE Book: https://sre.google/sre-book/table-of-contents/