Maintaining a reliable, secure, and efficient system is one of the most important responsibilities of any software team. To meet that challenge, teams must be prepared to respond to any issue that arises: remediation is required, customers and stakeholders need to be notified, and the root cause must be identified. Every team’s triage process looks a little different, but any process can benefit from better standardization and organization. In the chaos caused by a serious issue, it’s all too easy to drop the ball.
For example, imagine your group is responsible for keeping a microservice running. You have an on-call rotation, so every week a different member of your team handles incident response. You have good logging and health checks to identify when there’s a problem, but depending on who is on call during a given week, the response to a given issue varies widely. This inconsistency leads to confusion, slow responses, and forgotten steps. Sometimes low-priority issues are treated with unnecessary panic, and sometimes high-priority issues are addressed with inadequate urgency. Maybe the company status page doesn’t get updated, or compliance requirements aren’t fulfilled, or the root cause investigation is dropped altogether once the on-call engineer rotates off.
Using Catalytic for Developers, you can address these issues and improve the quality and speed of your incident response, while making the process less painful for everyone. This suite of tools lets IT teams interact with the Catalytic ecosystem from their own systems and infrastructure, which supports efficient, organized, and transparent business processes across the entire company.
The first step is creating a Workflow for your business process. For help building Workflows, see Catalytic’s help documentation.
For instance, the Workflow may assign the on-call engineer an initial task with high-level questions related to the scope and severity of the issue: Who is impacted, and to what extent? Is this a false alarm? Is immediate remediation required? Are additional investigative or mitigation resources required?
Based on the answers to the questions, a variety of actions may be performed:
- If other technology services are impacted, the managers of those resources will be immediately notified.
- If a specific customer is impacted, the appropriate account lead will be immediately notified.
- A task will be assigned to perform immediate remediation.
- If the issue is severe enough, the outage process will start: the individual responsible for updating the company status page will be informed upon both issue identification and remediation.
- A task to perform a Root Cause Analysis and identify the necessary follow-up actions will be assigned.
- Tasks for each follow-up action will be assigned to the appropriate engineer.
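Before building the Workflow, it can help to write the branching down as plain logic. The sketch below is only an illustration of the rules listed above; in Catalytic the branching is configured in the Workflow itself, and the function name, argument order, and action labels here are all hypothetical.

```shell
#!/bin/bash
# Hypothetical sketch of the triage branching described above.
# Arguments are the on-call engineer's answers, each "yes" or "no":
#   $1 other services impacted, $2 specific customer impacted, $3 severe outage
plan_actions() {
  local services_impacted="$1" customer_impacted="$2" severe="$3"

  [ "$services_impacted" = "yes" ] && echo "notify-service-managers"
  [ "$customer_impacted" = "yes" ] && echo "notify-account-lead"
  # Remediation is always assigned, whatever the answers were.
  echo "assign-remediation"
  [ "$severe" = "yes" ] && echo "start-outage-process"
  # Root Cause Analysis is always assigned as well.
  echo "assign-root-cause-analysis"
  return 0
}

# Example: other services are impacted and the issue is severe.
plan_actions yes no yes
```

Writing the rules out this way makes it easy to spot which actions are unconditional (remediation and Root Cause Analysis) and which depend on an answer.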
Codifying your process in the Catalytic platform ensures that the proper action is taken each and every time. But an engineer’s first instinct is to dive into the problem, and by the time they remember to start the Workflow, much of the value of a rapid response process may be lost. What if your application monitoring layer (e.g. Splunk) could start the triage Workflow automatically as soon as a server crashes, a particular error is logged, or a health check fails?
You can do this with a Splunk Custom Alert Action App that calls Catalytic CLI. Let's break that down: a Splunk Custom Alert Action App is an app installed on your Splunk server that can be configured as a triggered alert action and runs a custom script. For instance, you can create an alert that goes off when the number of 5XX errors in the past hour crosses a certain threshold, and configure that alert to trigger a Custom Alert Action App whose script starts a Catalytic Workflow to investigate the issue.
To accomplish this, you need to install Catalytic CLI on your Splunk server and create a Catalytic Developer Key to allow the CLI to interact with your Catalytic team. Visit the Catalytic CLI installation documentation for more information.
Once Catalytic CLI is installed, you can create your Custom Alert Action App. You can use the sample “Catalytic: Start Instance” Custom Alert Action App, which has more detailed instructions, or build your own. Splunk’s Custom Alert Action documentation may also be helpful.
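If you build your own, the rough shape of the app can be sketched as follows. This is a minimal scaffold under assumptions, not the sample app's actual layout: the name `catalytic_start_instance` is hypothetical, the script is assumed to live in `bin/` under the same name as the `alert_actions.conf` stanza, and a real app also needs an `app.conf` and a configuration form for the alert parameters. Check Splunk's documentation for the authoritative list of files.

```shell
#!/bin/bash
# Scaffold a bare-bones Custom Alert Action App in a scratch directory.
# "catalytic_start_instance" is a hypothetical app/action name.
APP_NAME="catalytic_start_instance"
APP_DIR="$(mktemp -d)/$APP_NAME"

mkdir -p "$APP_DIR/bin" "$APP_DIR/default"

# Declare the alert action; payload_format = json asks Splunk to
# hand the alert details to the script as JSON on standard input.
cat > "$APP_DIR/default/alert_actions.conf" <<EOF
[$APP_NAME]
is_custom = 1
label = Catalytic: Start Instance
description = Start a Catalytic Workflow Instance when the alert fires
payload_format = json
EOF

# Placeholder script; the real one would parse the JSON and call Catalytic CLI.
cat > "$APP_DIR/bin/$APP_NAME.sh" <<'EOF'
#!/bin/bash
MessageJSON=$(cat -)
EOF
chmod +x "$APP_DIR/bin/$APP_NAME.sh"

# Show what was created.
find "$APP_DIR" -type f | sort
```

Working in a scratch directory first keeps half-finished files out of the Splunk apps folder until the script has been tested.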
The crux of the Custom Alert Action is the script, which will utilize Catalytic CLI. Here is the script from the sample “Catalytic: Start Instance” Custom Alert Action App:
#!/bin/bash

# Capture JSON provided via StdIn
MessageJSON=$(cat -)

# Extract properties from JSON (quoted so the payload is passed to jq unmodified)
WorkflowId=$(echo "$MessageJSON" | jq -r 'getpath(["configuration", "workflow_id"])')
InstanceName=$(echo "$MessageJSON" | jq -r 'getpath(["configuration", "instance_name"])')
LogMessage=$(echo "$MessageJSON" | jq -r 'getpath(["result", "_raw"])')

# Catalytic CLI to start a new instance
CATALYTIC_CREDENTIALS=ADD_YOUR_ACCESS_TOKEN_HERE catalytic instance start "$WorkflowId" "$InstanceName" "--field=message:$LogMessage"
Splunk passes the log metadata and trigger configuration options to the script in JSON format via the standard input stream. The script first captures that input, then extracts the log message and configuration parameters using jq. Finally, it starts a new Instance via Catalytic CLI, passing the log message as a field value.
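To see the extraction step in isolation, you can feed jq a hand-written payload instead of waiting for a real alert. The sketch below assumes jq is installed; the payload is a made-up stand-in containing only the keys the script reads (a real Splunk payload carries more), and the helper names `sample_payload` and `extract` are hypothetical.

```shell
#!/bin/bash
# Hand-written stand-in for the JSON Splunk sends on standard input.
# All values here are made up; a real payload carries more keys.
sample_payload() {
  cat <<'EOF'
{
  "search_name": "5XX error spike",
  "configuration": {
    "workflow_id": "incident-triage",
    "instance_name": "Triage: 5XX spike"
  },
  "result": {
    "_raw": "2021-01-01T00:00:00Z status=503 path=/checkout"
  }
}
EOF
}

# Read JSON from standard input and print one value, given a jq path
# such as '.result._raw'.
extract() {
  jq -r "$1"
}

sample_payload | extract '.configuration.workflow_id'
sample_payload | extract '.configuration.instance_name'
sample_payload | extract '.result._raw'
```

Piping the same payload into the alert script exercises it end to end, provided Catalytic CLI and valid credentials are in place.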
This script can be modified to fulfill compliance requirements (e.g. scrubbing PII), collect further data, or perform any additional actions prior to starting the Workflow.
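As one example of such a modification, the log message could be passed through a redaction step before it is sent to Catalytic. The helper below is a hypothetical sketch that only masks strings shaped like email addresses; the pattern is deliberately simple and is not a complete PII filter, so treat it as a starting point rather than a compliance control.

```shell
#!/bin/bash
# Hypothetical helper: mask anything that looks like an email address.
# The pattern is deliberately simple and will not catch every form of PII.
scrub_pii() {
  printf '%s\n' "$1" |
    sed -E 's/[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}/[REDACTED_EMAIL]/g'
}

# Example log line (made up); in the alert script this would be the
# value extracted from the Splunk payload.
LogMessage='login failed for jane.doe@example.com from 10.0.0.7'
LogMessage=$(scrub_pii "$LogMessage")
echo "$LogMessage"
```

In the alert script, the call to `scrub_pii` would sit between the jq extraction and the `catalytic instance start` command, so that only the redacted message leaves the Splunk server.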
And that’s it! Your Workflow will now start whenever your alert goes off.
Catalytic for Developers can be used to improve operations across the entire company, not just the IT team. Imagine that, instead of kicking off a triage Workflow when an error is triggered, a custom onboarding process starts immediately on every new signup. Or, when a particular customer has been inactive for a period of time, an Instance starts that assigns a task to the appropriate Account Manager to contact the customer. The possibilities are endless.