Automate Application Monitoring with Slack

A tutorial to build an alerting Slack bot using Python, Apache Airflow and AWS Elasticsearch

Application monitoring plays a key role in maintaining a robust application and ensuring a positive, uninterrupted experience for customers. While application monitoring comes in many different shapes and sizes, some strategies have proven more effective than others. A Slack bot is a great tool for teams to monitor the health, performance and availability of their application. After all, where are the developers and other members of the team always active and logged on during the day? Slack.

This facet of workplace culture makes Slack the perfect candidate for application monitoring alerts. Slack already has the development team's attention, and that attention can be leveraged to send intelligent, meaningful alerts about the team's application that can be acted on quickly and effectively. In this blog post, I will cover the basics of how to create a Slack bot that monitors your application using Python, Apache Airflow and Elasticsearch.

My experience as a TDP working on Slack bots at Capital One

I’m a Senior Associate Software Engineer here at Capital One in the Technology Development Program. The Technology Development Program -- also known as TDP -- is a two-year rotational program for recent computer science graduates. As part of the program, associates are embedded into full-time agile development teams to work on various projects across the enterprise. A day in the life of a TDP associate might include attending the team’s daily standup meeting, implementing a new application feature, attending an executive speaker event during lunch, and then deploying that feature into production before the day is over! At the end of year one in the program, associates rotate to a new team in order to pick up new skills, expand their professional network, and work with new technologies. Many TDPs also choose to go above and beyond their day-to-day responsibilities by getting involved with recruiting efforts or mentorship initiatives such as CODERS, which is a Capital One program designed to inspire the next generation of technologists. During my two-year experience in the TDP, I took advantage of a leadership opportunity called the TDP Council. The TDP Council is an associate-elected committee of TDP leaders tasked with organizing networking and engagement events for the entire 1,000+ member class.

My current team develops and maintains a web application which is used by leadership to view metrics about the technology that powers our associates each and every day. On this web application -- built using Python, Angular, Apache Airflow and AWS -- users can find metrics about hardware compliance, vulnerability findings, and internal software application usage data. My team also uses a Slack bot to monitor our application. When data is missing from our Elasticsearch database, our Slack bot is triggered and sends a message to our team's Slack channel. This alert message indicates which datasets are missing, and from which region or environment. I also developed an enhanced version of this Slack bot which calculates the percentage change in size of our Elasticsearch indices from day-to-day. These alerts are sent if a particular index grows or shrinks in size more than a set threshold amount compared to a 30-day average. This indicates to our team that there could be a data duplication issue or a partial data deletion issue. These alerts empower our development team to investigate and fix the data issues before customers are impacted.

Understanding the different types of application monitoring

Before getting into the technical implementation steps for an alerting Slack bot, we must understand what the different types of application monitoring are, and why they are so important.

At the end of the day, our job as engineers is to deliver a great experience for our users. This includes an intuitive user interface, a holistic user experience, and an application that is fast and highly available - meaning minimal downtime and low latency. Let’s take, for example, a banking web application that allows users to view account information. A development team might be interested in monitoring various aspects of this application to ensure it’s delivering value for customers on a consistent basis. These application monitoring categories might include:

  • Uptime and Downtime - The development team might monitor the uptime and downtime of the application, which refers to how often the banking website is up and running.
  • Performance - The development team might monitor the performance of the application, which refers to how quickly users' requests about their accounts are returned.
  • Missing Data - The development team might monitor for missing data in the application which refers to customer account data that has been lost, deleted or corrupted.
  • General Logs - The development team might monitor general logs from the application which refers to the tracking of all user activity.
  • Unusual Activity - The development team might monitor unusual account activity on the application which might be a sign of a security breach.

These are some of the most common aspects of an application that a development team would want to monitor. If these aspects of an application go unchecked, the results could be catastrophic. Continuing with the example of the banking web application, here are some things that could go wrong:

  • Uptime and Downtime - If the banking application is down for too long, customers won’t have access to their accounts and won’t be able to make transfers or view balances.
  • Performance - If the banking application is too slow, customers might become frustrated and the bank's customer service line could become overwhelmed. Or worse, customers might take their business elsewhere.
  • Missing Data - If a customer notices missing data in their bank account, this could have serious implications for the bank, and the bank’s brand could be severely damaged if this is a widespread issue.
  • General Logs - If logs of user behavior on the banking application are not properly kept, this could pose legal and ethical risks. While the bank would want to keep logs for auditing purposes, it must also keep consumer privacy in mind.
  • Unusual Activity - If the banking application does not have mechanisms in-place to monitor for and detect nefarious activity, then this activity could be allowed to happen on the bank’s watch.

Clearly, application monitoring is an absolutely vital aspect of building and maintaining a robust application. In this tutorial, we will uncover how to set up a Slack bot that monitors an application and sends an alert message if missing data is detected.

Tutorial for building a Slack bot using Apache Airflow and Python

My team uses a tech stack for our web application that consists of Apache Airflow, Python, Angular and Elasticsearch. And, of course, all of this is deployed into Amazon Web Services -- also known as AWS -- in order to be highly scalable, available and resilient. Although Capital One is a bank, we operate much like a tech company: we have exited all of our data centers and gone all-in on cloud computing and AWS.

Apache Airflow is an open-source workflow management platform which we use to define batch processes that extract, transform, and load data into an Elasticsearch cluster. Elasticsearch is a NoSQL database that organizes data into indices, the largest unit of data in Elasticsearch. An index is a logical partition of documents (JSON objects) and can be compared to a database in the world of relational databases. Airflow presents a concept known as a DAG, or Directed Acyclic Graph, which is essentially a workflow, or collection of tasks that you want to run, defined in a Python script. Our team has numerous DAGs that run throughout the day, extracting data from different sources, transforming this data into the proper format, and then loading it into Elasticsearch so that our web application can consume it. Sometimes these DAGs, or workflows, fail for various reasons. In response to this, our team developed a DAG which checks for the existence of all our different data indices in Elasticsearch. If data is missing, an alert is triggered which sends a message to our team’s Slack channel using a Slack bot.
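
To make the index/database analogy concrete, here is a minimal, hypothetical sketch of indexing a single JSON document into an Elasticsearch index over its REST API. The cluster URL, index name and document fields below are placeholders, and the request is left unsigned, as it is in the tutorial code later in this post:

import json
from urllib.request import Request, urlopen

# Placeholder document and cluster URL, for illustration only.
document = {"account_id": "12345", "balance": 1000.50, "currency": "USD"}

request = Request(
    "https://my-elasticsearch-url.us-east-1.es.amazonaws.com/accounts_index/_doc",
    data=json.dumps(document).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Elasticsearch stores the JSON document in the accounts_index index,
# much like inserting a row into a table in a relational database.
urlopen(request)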

Let’s now dive into the steps for creating a Slack bot using Apache Airflow and Python, to send an alert when data is missing from an Elasticsearch database. This tutorial assumes that you have already set up and deployed an instance of Apache Airflow and are now ready to start developing new DAGs.

Step 1 - File Structure

For this example, here is what our directory structure will look like. We have one file called send_slack_alert.py which contains the code and business logic for sending our alert messages to Slack. The file slack_alert_dag.py contains the code for the pipeline definition. slack_alert_dag.py will ultimately be interpreted into a DAG on the Apache Airflow user interface and send_slack_alert.py will be the code that is executed during the workflow.

.
├── Slack Bot Project
└── src
    ├── send_slack_alert.py
    └── slack_alert_dag.py

Step 2 - Setting up the DAG

    """Airflow DAG to run the missing index Slack alert process."""
from datetime import datetime
from datetime import timedelta

from airflow import DAG
from airflow.models import Variable
from airflow.operators.bash import BashOperator

script_dir = Variable.get("script_dir")

default_args = {
    "retries": 1,
    "start_date": datetime(2020, 1, 19),
    "retry_delay": timedelta(minutes=5),
}

# 4:00 PM EST
schedule = "0 20 * * *"

dag = DAG(
    "slack_alert_example",
    default_args=default_args,
    catchup=False,
    schedule_interval=schedule,
)

slack_alert = """
    python {{ params.script_dir }}/send_slack_alert.py
"""

t1 = BashOperator(
    task_id="slack_alert",
    bash_command=slack_alert,
    dag=dag,
    params={
        "script_dir": script_dir,
    },
)
  

The slack_alert_dag.py file shown above contains the pipeline definition for our DAG. This code defines the structure and behavior of the workflow that will ultimately send out our Slack alerts.

  • First, we declare a variable script_dir that contains the directory of our script once Apache Airflow has been deployed. Airflow variables are a generic way to store and retrieve metadata within Airflow, and can be updated using the user interface, with code or with the command line interface (see the short sketch after this list).
  • Second, we declare some arguments for the DAG, such as how many times it should retry, when the DAG should initially start, and how long the DAG should wait between retries. These arguments are assigned to the default_args variable.
  • Third, we declare the schedule for the DAG and assign it to the schedule variable. We want the DAG to run at 4:00pm EST every day. The schedule is written in cron notation and is interpreted in UTC (GMT), not local time.
  • Fourth, we define some characteristics of the DAG, such as its name, schedule, and default arguments, and assign this to the dag variable.
  • Fifth, t1 refers to the first task that will be executed in this workflow. When the DAG runs, it will run the bash command defined by the slack_alert variable, which executes our Python code to send Slack alerts for any missing data in Elasticsearch.
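
Since the DAG reads script_dir from an Airflow Variable, here is a minimal sketch of setting that variable with code. The path is a placeholder; the same value can also be set through the Airflow UI (Admin > Variables) or the command line interface:

from airflow.models import Variable

# Placeholder path to wherever send_slack_alert.py is deployed; adjust for
# your environment. Run this where the Airflow metadata database is reachable.
Variable.set("script_dir", "/usr/local/airflow/scripts")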

Step 3 - Writing the code to send Slack alerts

import json
from urllib.request import urlopen

import requests


def main():
    generate_missing_index_alerts()


if __name__ == "__main__":
    main()
  

Now that we have written the code that defines our DAG in Apache Airflow, we must write the code that will actually send out the Slack alerts for application monitoring purposes. Above are the imports used throughout send_slack_alert.py and our main function, the entry point to our program. When the DAG workflow kicks off, this Python code will be executed. The main function calls generate_missing_index_alerts(), which contains the business logic for the program.

def generate_missing_index_alerts():
    """
    Generates Slack alerts for missing Elasticsearch indices.
    """
    indices_to_check = [
        "accounts_index",
        "users_index",
        "employees_index"
    ]
    elasticsearch_url = 'https://my-elasticsearch-url.us-east-1.es.amazonaws.com/_cat/indices?s=index'

    # Query Elasticsearch URL for string of all indices that exist.
    elasticsearch_indices = str(urlopen(elasticsearch_url).read())

    # Generate a list of any indices which are missing from Elasticsearch.
    missing_indices = generate_missing_indices(indices_to_check, elasticsearch_indices)

    # If there are missing indices in the Elasticsearch instance.
    if missing_indices:
        alert_message = f"Alert! The following indices are missing from Elasticsearch:\n"
        # For each missing index.
        for missing_index in missing_indices:
            # Add the index name to the alert message.
            alert_message += f"`{missing_index}`\n"

        # Send the alert message to Slack.
        send_slack_alert(alert_message)
  

The generate_missing_index_alerts() function is the core of our program to send out Slack alerts. First, the code queries our Elasticsearch instance's _cat/indices endpoint to retrieve the names of all existing indices and assigns this data to the elasticsearch_indices variable. Next, the generate_missing_indices() function is called, which returns a list of indices that are missing from our Elasticsearch instance.

We are passing in a list of three indices to check for in the generate_missing_indices() function:

  1. accounts_index
  2. users_index
  3. employees_index

These are three sample indices that should exist in our Elasticsearch instance. Once we have gathered a list of all of the missing indices from our Elasticsearch instance, we can send out a Slack alert!

def generate_missing_indices(indices_list: list, elasticsearch_indices: str) -> list:
    """
    Returns list of indices which are missing from Elasticsearch.

    :param indices_list: List of indices to check for existence.
    :param elasticsearch_indices: String of all existing Elasticsearch indices.
    :return: List of missing indices from Elasticsearch.
    """
    list_of_missing_indices = []

    # For each index that should exist.
    for index in indices_list:
        # If the index does not exist in Elasticsearch.
        if index not in elasticsearch_indices:
            # Add this index to the list of missing indices.
            list_of_missing_indices.append(index)

    return list_of_missing_indices
  

Here we can see the code for the generate_missing_indices() function that generates a list of missing indices from our Elasticsearch instance. Given a list of indices that we expect to exist, and a string listing the indices that currently exist, this function will return a list of all missing indices. Once we have this list of missing Elasticsearch indices, we can send out an alert message for them.
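
To make the check concrete, here is a small, hypothetical example of calling generate_missing_indices(). The response string is made up, and the exact columns returned by _cat/indices vary by Elasticsearch version, but only the index names matter for this substring check:

# Hypothetical (abbreviated) response from the _cat/indices endpoint.
elasticsearch_indices = (
    "green open employees_index d6f2 1 1 5000 0 1.2mb 0.6mb\n"
    "green open users_index     a1b2 1 1 9000 0 2.4mb 1.2mb\n"
)

missing = generate_missing_indices(
    ["accounts_index", "users_index", "employees_index"],
    elasticsearch_indices,
)

print(missing)  # ['accounts_index']

One trade-off of this simple substring check is that an index whose name merely contains an expected name (for example, a hypothetical users_index_backup) would also count as present; for a small, fixed list of index names this is usually acceptable.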

def send_slack_alert(message: str):
    """
    Sends alert messages to Slack for any missing Elasticsearch indices.

    :param message: Message to be sent to Slack channel.
    """

    slack_message = {
        "channel": "#my_teams_channel",
        "username": "My Slack Bot",
        "text": message
    }

    # Messages sent to this webhook will be sent to the Slack channel
    # which has been configured.
    slack_webhook = "https://hooks.slack.com/services/my-webhook"

    # Send a POST request to the Slack webhook, which will send the message
    # to the configured Slack channel.
    requests.post(
        slack_webhook, json.dumps(slack_message).encode("utf-8")
    )
  

Here we can see code that finally sends out our Slack alerts. Given a message to send out, this function uses the requests Python library to send out an HTTP POST request to the Slack webhook. The Slack webhook can be configured on your Slack dashboard. When messages are posted to this webhook URL, Slack will post the messages to the Slack channel of your choosing that you have configured.
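
The function above fires the request and ignores the response. If you would rather have the Airflow task fail (and retry) when Slack rejects the message, one option, sketched below as a drop-in replacement for the final requests.post() call in the function above, is to check the response status:

    response = requests.post(
        slack_webhook, json.dumps(slack_message).encode("utf-8")
    )
    # A non-2xx response raises requests.HTTPError; the uncaught exception makes
    # the script exit non-zero, so the BashOperator task fails and Airflow
    # retries it according to the DAG's retry settings.
    response.raise_for_status()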

Step 4 - Testing that the application monitoring alerts have been sent!

[Image: example Slack bot alert reading "Alert! The following indices are missing from Elasticsearch: accounts_index"]

Now that we have implemented all of our code, we can see that a Slack alert message has been sent by our Slack bot! It appears that in our example, the accounts_index index is missing from Elasticsearch. Thanks to our Slack bot we were able to identify and fix this issue before customers were impacted.
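
If you want to confirm the wiring without waiting for the scheduled DAG run, one simple, hypothetical smoke test is to run the script by hand (python send_slack_alert.py) with a temporary line like the one below added to main(), then remove it once the message shows up in the channel:

# Hypothetical one-off smoke test; delete after confirming the alert arrives.
send_slack_alert("Test alert from the monitoring Slack bot - please ignore.")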

Conclusion

Application monitoring plays a critical role in maintaining a robust application, as I have described. There are many aspects of an application that an agile development team might be interested in monitoring. These include monitoring downtime, performance, missing data, general logs and unusual activity. If any of these aspects of an application go unmonitored, chaos could ensue for the end user of that application. Thus, systems must be put in place by the development team to alert on these items early and often. A great strategy to accomplish this is to create a Slack bot that sends monitoring alerts about the application to the team’s Slack channel. Team members are already logged into Slack throughout the day communicating with colleagues. By interweaving smart, actionable and informative application monitoring alerts into existing Slack channels, the development team can act on them in a timely manner.

In this tutorial I explained how to set up one such application monitoring Slack bot using Python, Apache Airflow and Elasticsearch. However, the principles that I have laid out can be applied to almost any tech stack or application. The responsibilities of a development team in an enterprise setting are constantly growing, so it is advantageous for the team to automate as much as possible. By setting up a Slack bot that monitors the application 24 hours a day, the development team saves time and can focus on the truly interesting tasks. And once the alerting Slack bot detects an anomaly, the development team is notified immediately and can jump into action to fix the issue.


Carlton Marshall II, Senior Associate Software Engineer, Cyber Engineering

Carlton Marshall II is a dedicated software engineer working on a backend software engineering team that develops and maintains an internal serverless data pipeline for vulnerability data. Carlton graduated from Northeastern University in 2019 with a cross-discipline degree in Computer Science and Business Administration. Carlton is an associate in the Cyber Engineering organization, and is a former tech intern and Technology Development Program member. Carlton is very committed to living the values at Capital One and participates in the TDP Alumni Council and the Blacks in Tech Business Resource Group. You can connect with Carlton on LinkedIn: linkedin.com/in/carltonmarshall.
