SRE Starter Project

Intro

This is an alternate ending for the dev start project, intended for SREs. It covers similar themes and adds how we approach troubleshooting.

Prerequisites

It is assumed that you have completed the dev start project through the run section. If not, return to the Starter Project documentation.

Troubleshooting Approach

The SRE approach to problem-solving is a skillset that can be learned through consistent practice. While learning these skills, it is important to approach each problem with the proper mindset:

  • Risk and failure are inevitable. Therefore, remain calm.

  • Tests and data are used to distinguish between theories. Therefore, seek understanding.

  • Communication and collaboration are required to solve complex problems. Therefore, ask for help.

With this approach, methodical problem-solving can occur.

Report

When a problem occurs, there should be data that describes the problem in detail. If not, the first step is to gather more data about the problem including scope and severity. Ideally, the description will include the intended behavior, the observed behavior, and steps/conditions required to reproduce the behavior. This data allows an appropriate response.

Mitigation

For many problems, the next step is to immediately reduce the impact. Root cause analysis can wait. Workarounds to prevent larger impacts should be implemented quickly.

Analysis

Investigate metrics and gather logs for details about the problem. Determine which data is and is not relevant. Ask questions about the system. Determine the relationship between system components and the observed symptoms. Look for anomalies in other services that are correlated. Generate hypotheses.

Test

Design tests to safely distinguish among different theories. Experiments with negative results are often more useful than experiments with positive results. Testing locally is less risky than testing in production.

Resolve

Once the correct set of root causes is identified, corrective actions are typically obvious. Communicate information about resolution widely.

Dependency Management for Builds

Dependencies are managed in the kubernetes/dev-with-dependencies/kustomization.yaml file. Add the following dependencies, which are needed for the next section:

components:
- https://coderepo.mobilehealth.va.gov/scm/ckm/wiremock.git//kubernetes?ref=Release/3.9.1&timeout=60s
- https://coderepo.mobilehealth.va.gov/scm/iums/mobile-mvi-service.git//kubernetes/components/dev?ref=Release/1.33&timeout=60s
- https://coderepo.mobilehealth.va.gov/scm/iums/user-session-service.git//kubernetes/components/dev?ref=Release/1.22&timeout=60s
- https://coderepo.mobilehealth.va.gov/scm/vdms/redis.git//kubernetes?ref=Release/7.0.15&timeout=60s

Add Health Check For External CKM Service:

Just as the service health check needs to take our own service’s health into consideration, it needs to take the services it relies upon into consideration, as well.
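The idea can be illustrated with a minimal sketch of status roll-up, assuming a simplified rule in which the service reports UP only when every component it depends on reports UP. HealthRollup and overallStatus are hypothetical names used for illustration only. They are not part of the starter project or of Spring Boot Actuator, whose own aggregation handles more statuses than the two shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the starter project. */
final class HealthRollup {

    private HealthRollup() {
    }

    /**
     * Simplified roll-up rule (an assumption for this sketch):
     * the overall status is "UP" only when every component reports "UP".
     */
    static String overallStatus(Map<String, String> componentStatuses) {
        for (String status : componentStatuses.values()) {
            if (!"UP".equals(status)) {
                return "DOWN";
            }
        }
        return "UP";
    }

    public static void main(String[] args) {
        Map<String, String> components = new LinkedHashMap<>();
        components.put("mobile-mvi-service", "UP");
        components.put("user-session-service", "DOWN");
        // One unhealthy dependency is enough to mark the whole service unhealthy.
        System.out.println(overallStatus(components));
    }
}
```

With one dependency reporting DOWN, the sketch reports the service as DOWN, which is the behavior the dependency health checks in this section are meant to surface.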

Verify the health check test passes after adding the health checks for both mobile-mvi-service and user-session-service.

The health checks can be added to the existing ExampleComponentHealthCheckConfig by making the following changes:

  • Update AppProperties.java: add String properties for the mobile-mvi-service and user-session-service urls

  • Update application.properties: add the configuration for the new properties

  • Update the ExampleComponentHealthCheckConfig to account for our new dependencies.

  • Verify the health checks by adding an actuator test class

AppProperties.java
package gov.va.mobile.starter.v1.service;

import lombok.Getter;
import lombok.Setter;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.validation.annotation.Validated;

import jakarta.validation.constraints.NotEmpty;

/**
 * Configuration properties for starter-service.
 *
 * @since 1.0
 */
@Getter
@Setter
@Validated
@ConfigurationProperties("mobile.starter")
public class AppProperties {

    @NotEmpty
    private String mobileMviSvcUrl;

    @NotEmpty
    private String userSessionSvcUrl;
}
application.properties
management.server.port=8081
mobile.starter.mobile-mvi-svc-url=${MOBILE_MVI_SVC_URL:http://mobile-mvi-service-v1:8080/mvi/v1}
mobile.starter.user-session-svc-url=${USER_SVC_URL:http://user-session-service-v1:8080/session/v1}
Updated ExampleComponentHealthCheckConfig.java
package gov.va.mobile.starter.v1.service.health;

import gov.va.mobile.service.client.http.rest.RestHealthIndicatorBuilder;
import gov.va.mobile.starter.v1.service.AppProperties;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.boot.actuate.health.HealthIndicator;

/**
 * Configuration that uses {@link RestHealthIndicatorBuilder} to verify the health of external components.
 *
 * @see RestHealthIndicatorBuilder
 * @since 1.0
 */
@Configuration
public class ExampleComponentHealthCheckConfig {

    /**
     * Instantiates the Health Check for the User Session Service
     *
     * @param builder    RestHealthIndicatorBuilder for configuring the HealthIndicator
     * @param properties AppProperties for the Application
     * @return {@link HealthIndicator}
     */
    @Bean("user-session-service")
    public HealthIndicator userSessionServiceHealthIndicator(final RestHealthIndicatorBuilder builder,
                                                             final AppProperties properties) {
        return builder.url(properties.getUserSessionSvcUrl()).build();
    }

    /**
     * Instantiates the Health Check for the Mobile MVI Service
     *
     * @param builder    RestHealthIndicatorBuilder for configuring the HealthIndicator
     * @param properties AppProperties for the Application
     * @return {@link HealthIndicator}
     */
    @Bean("mobile-mvi-service")
    public HealthIndicator mobileMviServiceHealthIndicator(final RestHealthIndicatorBuilder builder,
                                                           final AppProperties properties) {
        return builder.url(properties.getMobileMviSvcUrl()).build();
    }
}

After making the above changes, create a test class called ActuatorITCase.java (in the same directory as your ServiceResourceITCase.java test class) to verify the health checks:

ActuatorITCase.java
package gov.va.mobile.starter.v1.service;

import gov.va.mobile.service.test.AbstractHealthCheckITCase;
import gov.va.mobile.tools.skaffold.annotations.ServiceUrl;

import java.util.List;

/**
 * Integration Test cases for Actuator Endpoints.
 *
 * @since 1.0
 */
class ActuatorITCase extends AbstractHealthCheckITCase {

    @ServiceUrl(name = "starter-service-v1", port = 8081)
    protected static String SERVICE_URL;

    ActuatorITCase() {
        super(SERVICE_URL);
    }

    @Override
    protected List<String> healthCheckJsonPaths() {
        return List.of("$.status", "$.components.mobile-mvi-service.status", "$.components.user-session-service.status");
    }
}

Run a build of your service (mvn clean install -Pwith-skaffold) to ensure your changes compile and your test class executes successfully.

Run and Use the SBA Service:

To gather operational data, logs, and metrics, there are several tools available. The simplest to run locally is Spring Boot Admin (SBA). It provides health status at a glance, detailed metrics, and other useful data about our standardized Java Spring services.

In the kubernetes/dev-with-dependencies/kustomization.yaml file, add the configuration for Spring Boot Admin and its dependencies, and update the other services to register with SBA.

resources:
- https://coderepo.mobilehealth.va.gov/scm/ckm/security-app-config.git//dev?ref=main

components:
- https://coderepo.mobilehealth.va.gov/scm/dhsss/spring-boot-admin-service.git//kubernetes/components/dev?ref=Release/1.13&timeout=60s
- https://coderepo.mobilehealth.va.gov/scm/ckm/admin-idp.git//kubernetes/components/dev?ref=Release/2.34&timeout=60s
- https://coderepo.mobilehealth.va.gov/scm/iums/jwt-signing-service.git//kubernetes/components/dev?ref=Release/1.22&timeout=60s
- https://coderepo.mobilehealth.va.gov/scm/ckm/openldap.git//kubernetes?ref=Release/1.5&timeout=60s

Also add the following entries to the kubernetes/base/application.env file:

SBA_PATH=http://spring-boot-admin-service-v1:8080/sba/v1
ADMIN_IDP_URL=http://admin-idp-v2:8080/admin/v2

Did you see any error messages when you tried to build the service?

If yes, what did the error say?

What steps did you take to figure out and fix the problem?

Please post a message to Greenfield dev with the issue, if any, and next steps on what you believe should fix it.

To access the dashboard, run the build and use kubectl to determine the node port for the Spring Boot Admin k8s service (container port 8080).

Replace <nodePort> with the port you have exposed locally on your system

Become very familiar with the data that SBA displays.

If skaffold dev is already running, stop it with CTRL+C.

Make sure to run the service as skaffold --port-forward=true dev

If you encounter local port conflicts, you might need to expose a different port for the Spring Boot Admin web console:

kubectl port-forward spring-boot-admin-service-v1-<> 8888:8080 --namespace starter-service-test

Scale Down Redis

Use kubectl to reduce the number of pod replicas of the Redis service deployment to zero. Observe the SBA dashboard as the status changes.

Scale Up Redis

Use kubectl to restore the Redis service pod. Check SBA dashboard to verify that system health is restored.
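If you would rather script the recovery check than watch the dashboard, it can be sketched as a bounded polling loop. This is a minimal sketch under stated assumptions: HealthPoller and waitForStatus are hypothetical names, and the probe is injected as a Supplier so the loop can be exercised without a cluster. In practice the probe might read the status from the service's health endpoint. The example below uses a simulated sequence of statuses instead.

```java
import java.time.Duration;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical helper for illustration; not part of the starter project. */
final class HealthPoller {

    private HealthPoller() {
    }

    /**
     * Calls the probe up to maxAttempts times, sleeping between attempts,
     * and returns true as soon as the probe reports the expected status.
     */
    static boolean waitForStatus(Supplier<String> probe, String expected,
                                 int maxAttempts, Duration delay) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (expected.equals(probe.get())) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delay.toMillis());
                } catch (InterruptedException e) {
                    // Preserve the interrupt flag and give up waiting.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Simulated probe results: unhealthy twice, then healthy again.
        Iterator<String> simulated = List.of("DOWN", "DOWN", "UP").iterator();
        boolean restored = waitForStatus(simulated::next, "UP", 5, Duration.ofMillis(10));
        System.out.println("restored=" + restored);
    }
}
```

Bounding the number of attempts keeps the check from hanging indefinitely if the Redis pod never comes back.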

Incident Simulation

In this section we will run a live-fire simulation of a previously observed failure in our environment. For an incoming SRE, this simulation provides practice in rapid incident detection and end-to-end incident handling. You are expected to form and test a theory and to document your findings.

Ensure you are in the root directory of the starter-project repository.

Add the simulation manifest to the Kubernetes cluster

kubectl apply -f simulation/sre_simulation.yaml

Ensure scripts are executable.

chmod +x simulation/scripts/*

Run simulation

Once the simulation has been added to the Kubernetes cluster, run the simulation start script:

./simulation/scripts/simulation_start.sh

There is a simulation end script as well. It may help to start and stop the simulation a couple times while you are observing the K8S cluster.

Observation

What do you see happening? Any errors?

If any errors are present, what steps would you take to troubleshoot this issue?

If you need help, feel free to reach out to your SRE buddy and/or SRE team.

Add a Custom Metric

To gather more data that is critical for the operation of a service, we can define a custom metric which can then be observed through SBA or collected by Prometheus. Create a custom metric that counts requests sent to the /patients/<icn>/info endpoint and records success or failure. See the docs and this guide for example code. You can use a single Metrics.counter().increment() expression. Note that if you don’t give a tag to your counter, the metric won’t be usable in SBA.
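To show why the tag matters, here is a minimal JDK-only sketch of what a tagged counter records: one running total per combination of metric name and tag value, so successes and failures are tracked as separate series. TaggedCounterSketch, the metric name patient.info.requests, and the outcome tag are all hypothetical names chosen for illustration. In the service itself, Micrometer's Metrics.counter(name, tags...).increment() plays this role, and the sketch is not a substitute for it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical JDK-only model of a tagged counter; for illustration only. */
final class TaggedCounterSketch {

    // One adder per (name, tag) series, safe for concurrent request threads.
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    /** Adds one to the series identified by the metric name and tag. */
    void increment(String name, String tagKey, String tagValue) {
        counts.computeIfAbsent(seriesKey(name, tagKey, tagValue), k -> new LongAdder())
                .increment();
    }

    /** Returns the current total for a series, or 0 if it was never incremented. */
    long count(String name, String tagKey, String tagValue) {
        LongAdder adder = counts.get(seriesKey(name, tagKey, tagValue));
        return adder == null ? 0L : adder.sum();
    }

    private static String seriesKey(String name, String tagKey, String tagValue) {
        return name + "{" + tagKey + "=" + tagValue + "}";
    }

    public static void main(String[] args) {
        TaggedCounterSketch metrics = new TaggedCounterSketch();
        // With Micrometer, each line below would correspond to something like:
        // Metrics.counter("patient.info.requests", "outcome", "success").increment();
        metrics.increment("patient.info.requests", "outcome", "success");
        metrics.increment("patient.info.requests", "outcome", "success");
        metrics.increment("patient.info.requests", "outcome", "failure");
        System.out.println("success=" + metrics.count("patient.info.requests", "outcome", "success"));
        System.out.println("failure=" + metrics.count("patient.info.requests", "outcome", "failure"));
    }
}
```

In the endpoint handler, the success increment would sit on the normal return path and the failure increment on the error path, so each request is counted exactly once.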

Send Requests

Send a few requests that should fail because they lack a valid JWT.

curl --location --request GET 'http://localhost:31349/starter/v1/patients/123/info' \
--header 'x-vamf-jwt: NOTAREALJWT'

Send a few requests that should succeed.

curl --location --request GET 'http://localhost:31349/starter/v1/patients/123/info' \
--header 'x-vamf-jwt: <<your-jwt-here>>'

Send multiples of each request and observe metrics being collected in SBA.
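To send a batch of requests without repeating the curl command by hand, the same call can be sketched with the JDK's built-in HTTP client. This is a minimal sketch, not a project utility: PatientInfoRequests and its build method are hypothetical names, the ICN 123 is taken from the curl examples above, and the base URL and JWT are passed in as arguments so that nothing is sent unless you supply them.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical helper for illustration; not part of the starter project. */
final class PatientInfoRequests {

    private PatientInfoRequests() {
    }

    /** Builds a GET request equivalent to the curl examples above. */
    static HttpRequest build(String baseUrl, String icn, String jwt) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/patients/" + icn + "/info"))
                .header("x-vamf-jwt", jwt)
                .GET()
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: PatientInfoRequests <baseUrl> <jwt> [count]");
            return;
        }
        int count = args.length > 2 ? Integer.parseInt(args[2]) : 3;
        HttpClient client = HttpClient.newHttpClient();
        for (int i = 0; i < count; i++) {
            // The response body is discarded; only the status code is of interest here.
            HttpResponse<Void> response =
                    client.send(build(args[0], "123", args[1]), HttpResponse.BodyHandlers.discarding());
            System.out.println("status=" + response.statusCode());
        }
    }
}
```

Running it once with an invalid JWT and once with a valid one should move the failure and success counts separately in SBA, assuming the custom metric is tagged by outcome.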

Review / Acceptance

Reach out to the SRE team lead to schedule a review when you have completed the project and simulation.