A Developer Walks into AWS SageMaker...

Building Machine Learning Models Using K-Means Algorithm and Amazon SageMaker


Unless you are living under the proverbial technical rock, you have probably heard the phrase “machine learning” at some point. In recent years, machine learning has been spilling out of academia and data science into the domain of general software engineering, leading to a scramble by engineering teams trying to understand this new domain and how to integrate it into the traditional programming model. Machine learning is unnecessary for many use cases; however, there are certain scenarios where it can deliver better value than a traditional programming model.

There are tons of blogs that discuss what machine learning is and various tools to use with it. I wrote this for classically trained software engineers interested in machine learning, and wanted to focus on one tool that could be interesting to them — Amazon SageMaker — as well as one use case for using it.

 

What is Amazon SageMaker?

Amazon SageMaker is a fully managed machine learning service from AWS that provides developers and data scientists with the tools to build, train, and deploy their machine learning models. It was introduced in November 2017 at AWS re:Invent. Although it provides an end-to-end workflow for machine learning models, any of its modules can be run independently, giving teams the flexibility to use only the parts they need. It also has a robust CLI and a rich set of SDKs for developers in various languages, including Java, C#, and Python.

 

A Simple Use Case

Picture, if you will, a database with millions of records about a business domain. You have the most efficient SQL queries to return any data you want. Now add to this scenario a request from the business to forecast a trend or detect a pattern based on those millions of records. It won’t be long before you realize that this is a difficult task to accomplish with just your efficient SQL queries. This is the type of use case that led me to Amazon SageMaker.

In my current role at Capital One I’m on a six-person team architecting middleware systems for the enterprise. We build APIs and microservices using Java and Node.js, all within the AWS ecosystem. Part of our role entails researching new tools, vendors, and tactics with the potential to work for our projects.

About a year and a half ago, my team was assigned the task of architecting new microservices to replace a legacy system that processed transactions for various lines of business. One of the features required of the new system was the ability to detect certain types of transactions. At the time, we decided to go with a rules engine because it was easier to implement with a faster turnaround time. Six months ago — after we released the microservices and successfully replaced the legacy system — we started reviewing some feature enhancements with the business teams. We discovered that certain transaction patterns were harder for a rules engine to detect because the observable patterns did not follow simple arithmetic or logic.

My team started researching ways we could reinforce the rules engine, and before long we agreed that adding machine learning was the route to take. However, we are a software engineering team, not data scientists, so we really did not know where to begin. We wanted to research, and ideally find, a tool that could be used by data scientists and software engineers alike while also falling within the strict guidelines provided by our model risk office (MRO). We weren’t looking for an end-all-be-all solution, but for a new tool for use in our ecosystem of well-managed modeling systems. That was when we walked into Amazon SageMaker…

We liked SageMaker because it makes it easy to build machine learning models by doing most of the infrastructure heavy lifting. Although SageMaker accelerates your build, you still need a basic understanding of your data and modeling framework, especially when it comes to data preparation. And if you work in a large enterprise like we do, you still have to make sure your models comply with your legal and governance requirements. But SageMaker can work for a variety of use cases involving lots of data that needs pattern detection, forecasting, or labeling, including facial recognition and consumer buying behavior analysis.

Ground Rules

To build this proof of concept we laid out some ground rules to guard and guide us during the process:

  • Use Infrastructure as Code (IaC) for everything we do to keep things in line with the tenets of the AWS Well-Architected Framework. This could be CloudFormation, the SageMaker SDK, or Lambda functions.
  • Encryption! Encryption!! Encryption!!! We needed to ensure data remained encrypted at rest by using the appropriate KMS keys within each SageMaker SDK.
  • Tagging any resource that gets created to ensure ease of management.
  • Ensure that this tool could adhere to our enterprise requirements.

Although using IaC as opposed to doing tasks directly in the AWS Console seemed like a lot of overhead, this approach quickly paid off given how many times we had to create and run models.
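
To keep the tagging rule enforceable in code, one approach is to funnel every SDK call through a single helper. This is a sketch; the helper name and the extra tag keys are our own convention, not anything SageMaker prescribes:

```python
def build_tags(resource_name, extra=None):
    """Build the Tags list shape expected by SageMaker SDK calls,
    so every resource we create carries the same baseline tags."""
    tags = [
        {'Key': 'ResourceName', 'Value': resource_name},
        {'Key': 'Project', 'Value': 'KMeans-POC'},
    ]
    if extra:
        tags.extend({'Key': key, 'Value': value} for key, value in extra.items())
    return tags

# Every create_* call can then pass, for example:
#   sagemaker.create_model(..., Tags=build_tags('KMeans_Model'))
print(build_tags('KMeans_Model'))
```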

Data Preparation

This is where the data transformation (normalization, standardization, etc.) takes place, including the removal of outliers and bias. A popular tool among data scientists for these tasks is the Jupyter Notebook. AWS provides a managed version of Jupyter Notebooks that integrates very well with SageMaker. However, it should be noted that you do not have to use Jupyter Notebooks to build data models in Amazon SageMaker. Any Integrated Development Environment (IDE) that supports data manipulation languages should suffice. Julia, Python, and R (Ju-Py-R — Jupyter, get it?) are popular languages built for data manipulation and numerical analysis. I chose Python because of its extensive AWS SDK support (AWS does not currently offer SDKs for Julia or R).
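
As a minimal illustration of the transformation step, here is a sketch of z-score standardization with a simple outlier cut, using only the standard library. The values and the 1.5-sigma cutoff are illustrative (with only a handful of points, a classic three-sigma rule can never fire):

```python
import statistics

def standardize(values, sigma_cutoff=1.5):
    """Drop values more than sigma_cutoff standard deviations from the
    mean, then rescale the remainder to zero mean and unit variance."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    kept = [v for v in values if abs(v - mean) <= sigma_cutoff * std]
    mean, std = statistics.fmean(kept), statistics.stdev(kept)
    return [(v - mean) / std for v in kept]

amounts = [10.0, 12.0, 11.0, 9.0, 10.5, 5000.0]  # 5000.0 is an obvious outlier
print(standardize(amounts))
```

In practice this is the kind of cell you would run in a notebook before writing the cleaned data back to S3.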

Here is a Python code snippet that splits the data for training and testing:

import pandas as pd
from sklearn.model_selection import train_test_split

# get training and testing data
df = pd.read_csv("TrainTest_RawData.csv", sep=",")

# randomly select 20% of the data for testing and the other 80% for training
x_train, x_test, y_train, y_test = train_test_split(
    df.Amount, df.tranCode, test_size=0.2)


Model Training

Data validation and understanding are the most time-consuming parts of any modeling workflow, but configuring model training requires an understanding of the model type itself and has multiple facets, illustrated below.

IAM Role
The IAM role that will be used for building these models must meet the following criteria:

  • You must be able to assume the role.
  • The role must have the SageMaker Trust relationship attached to it.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": [
                        "sagemaker.amazonaws.com"
                    ]
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }
  
  • The role must have a SageMaker/KMS policy attached to it.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": "iam:PassRole",
                "Resource": "*",
                "Effect": "Allow",
                "Condition": {
                    "StringEquals": {
                        "iam:PassedToService": "sagemaker.amazonaws.com"
                    }
                },
                "Sid": "GrantsIamPassRole"
            },
            {
                "Action": [
                    "kms:DescribeKey",
                    "kms:ListAliases"
                ],
                "Resource": "*",
                "Effect": "Allow",
                "Sid": "KmsKeysForCreateForms"
            },
            {
                "Action": [
                    "iam:ListRoles"
                ],
                "Resource": "*",
                "Effect": "Allow",
                "Sid": "ListAndCreateExecutionRoles"
            },
            {
                "Action": [
                    "kms:Encrypt",
                    "kms:Decrypt",
                    "kms:ReEncrypt*",
                    "kms:GenerateDataKey*",
                    "kms:DescribeKey",
                    "kms:GetKeyPolicy",
                    "kms:CreateGrant",
                    "kms:ListGrants",
                    "kms:RevokeGrant"
                ],
                "Resource": "KMS_KEY_ARN",
                "Effect": "Allow",
                "Sid": "GrantKMSAccess"
            },
            {
                "Action": [
                    "sagemaker:CreateEndpoint",
                    "sagemaker:DescribeEndpoint"
                ],
                "Resource": "*",
                "Effect": "Allow",
                "Sid": "GrantSageMakerCreate"
            }
        ]
    }
  

KMS Keys
The KMS keys that will be used for encryption must have a policy attached to them that will grant access to the IAM role that will be using SageMaker.

    {
    "Version": "2012-10-17",
    "Id": "key-consolepolicy-3",
    "Statement": [
        {
            "Sid": "Enable IAM User Permissions",
            "Effect": "Allow",
            "Principal": {
                "AWS": "ARN_OF_ROOT"
            },
            "Action": "kms:*",
            "Resource": "*"
        },
        {
            "Sid": "Allow access for Key Administrators",
            "Effect": "Allow",
            "Principal": {
                "AWS": "ARN_OF_KEY_ADMINISTRATORS_ROLE"
            },
            "Action": [
                "kms:Create*",
                "kms:Describe*",
                "kms:Enable*",
                "kms:List*",
                "kms:Put*",
                "kms:Update*",
                "kms:Revoke*",
                "kms:Disable*",
                "kms:Get*",
                "kms:Delete*",
                "kms:TagResource",
                "kms:UntagResource",
                "kms:ScheduleKeyDeletion",
                "kms:CancelKeyDeletion"
            ],
            "Resource": "*"
        },
        {
            "Sid": "Allow use of the key",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "ARN_OF_SAGEMAKER_ROLE"
                ]
            },
            "Action": [
                "kms:Encrypt",
                "kms:Decrypt",
                "kms:ReEncrypt*",
                "kms:GenerateDataKey*",
                "kms:DescribeKey",
                "kms:GetKeyPolicy"
            ],
            "Resource": "*"
        },
        {
            "Sid": "Allow attachment of persistent resources",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "ARN_OF_SAGEMAKER_ROLE"
                ]
            },
            "Action": [
                "kms:CreateGrant",
                "kms:ListGrants",
                "kms:RevokeGrant"
            ],
            "Resource": "*",
            "Condition": {
                "Bool": {
                    "kms:GrantIsForAWSResource": "true"
                }
            }
        }
    ]
    }
  

Training Job
At this point you are ready to bring your data into SageMaker for training. Our use case lent itself well to clustering. Amazon SageMaker supports over 15 built-in ML algorithms covering a huge range of scenarios; K-Means is one such algorithm, designed for clustering problems.

My Capital One colleague Madison Schott wrote a blog post about the K-Means algorithm if you would like a more in-depth view.
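
For intuition, the assignment step at the heart of K-Means is simple: a point belongs to its nearest centroid. Here is a toy, SageMaker-free sketch with made-up centroids:

```python
import math

def closest_cluster(point, centroids):
    """Return (cluster_index, distance) for the nearest centroid,
    the same shape of answer the SageMaker endpoint returns later."""
    distances = [math.dist(point, c) for c in centroids]
    best = min(range(len(centroids)), key=distances.__getitem__)
    return best, distances[best]

centroids = [(0.0, 1.0), (5.0, 5.0)]
print(closest_cluster((0.0, 0.0), centroids))
```

Training is the process of finding centroids that make these assignments meaningful; SageMaker does that part at scale.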

There are two ways to build a training job in Amazon SageMaker for the K-Means algorithm (and for most of the algorithms that it supports):

  • Create Training Job API: Use this when you have a very good idea of the hyperparameters (arguments) your training job needs to build an optimal model. Your model is built faster, but it requires domain knowledge of your data to provide the exact hyperparameter values. It produces one training job at the end of the process.
  • Create HyperParameter Tuning Job API: Use this when you do not know the exact hyperparameter values that will yield the optimal model. You provide a range of hyperparameter values, and SageMaker builds a batch of training jobs for you, labeling the one that was optimal within the range provided. For the K-Means algorithm, SageMaker provides a recommended range of hyperparameter values to use.

For our POC, we used the Hyperparameter Tuning Job API, which gave us the flexibility to try different training-job scenarios rather than building one at a time.

The following Python code can be used to create a Hyperparameter tuning job:

import boto3
import os
from sagemaker.amazon.amazon_estimator import get_image_uri
os.environ['AWS_PROFILE'] = 'YOUR_AWS_STS_PROFILE'
# gets the K-Means docker image from AWS
image = get_image_uri(boto3.Session().region_name, 'kmeans')
role = 'ARN_OF_ROLE'
bucket = 'S3_BUCKET'
data_key = 'S3_TRAINING_DATA_LOCATION'
test_key = 'S3_TESTING_DATA_LOCATION'
output_key = 'S3_MODEL_OUTPUT_LOCATION'
data_location = 's3://{}/{}'.format(bucket, data_key)
output_location = 's3://{}/{}'.format(bucket,output_key)
test_location = 's3://{}/{}'.format(bucket, test_key)
sagemaker = boto3.client('sagemaker')
response = sagemaker.create_hyper_parameter_tuning_job(
    HyperParameterTuningJobName='TRAINING_JOB_NAME',
    HyperParameterTuningJobConfig={
        'Strategy': 'Bayesian',
        'HyperParameterTuningJobObjective': {
            'Type': 'Minimize',
            'MetricName': 'test:msd'
        },
        'ResourceLimits': {
            'MaxNumberOfTrainingJobs': 50,
            'MaxParallelTrainingJobs': 2
        },
        'ParameterRanges': {
            'IntegerParameterRanges': [
                {
                    'Name': 'extra_center_factor',
                    'MinValue': '4',
                    'MaxValue': '10'
                },
                {
                    'Name': 'mini_batch_size',
                    'MinValue': '3000',
                    'MaxValue': '15000'
                },
            ],
            'CategoricalParameterRanges': [
                {
                    'Name': 'init_method',
                    'Values': [
                        'kmeans++', 'random'
                    ]
                },
            ]
        },
        'TrainingJobEarlyStoppingType' : 'Auto'
    },
    TrainingJobDefinition={
        'StaticHyperParameters': {
            'k': '10',
            'feature_dim': '2',
        },
        'AlgorithmSpecification': {
            'TrainingImage': image,
            'TrainingInputMode': 'File'
        },
        'RoleArn': role,
        'InputDataConfig': [
            {
                'ChannelName': 'train',
                'DataSource': {
                    'S3DataSource': {
                        'S3DataType': 'S3Prefix',
                        'S3Uri': data_location,
                        'S3DataDistributionType': 'FullyReplicated'
                    }
                },
                'ContentType': 'text/csv;label_size=0'
            },
            
            {
                'ChannelName': 'test',
                'DataSource': {
                    'S3DataSource': {
                        'S3DataType': 'S3Prefix',
                        'S3Uri': test_location,
                        'S3DataDistributionType': 'FullyReplicated'
                    }
                },
                'ContentType': 'text/csv;label_size=0'
            },
        ],
        'OutputDataConfig': {
            'KmsKeyId': 'KMS_KEY_ID',
            'S3OutputPath': output_location
        },
        'ResourceConfig': {
            'InstanceType': 'ml.m4.16xlarge',
            'InstanceCount': 1,
            'VolumeSizeInGB': 50,
            'VolumeKmsKeyId': 'KMS_KEY_ID'
        },
        'StoppingCondition': {
            'MaxRuntimeInSeconds': 60 * 60
        }
    }
)
print(response)
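
One step the snippet above leaves implicit: once the tuning job finishes, you still need the name of the winning training job to feed into model creation. Here is a sketch using describe_hyper_parameter_tuning_job; the helper and the stubbed response are ours, but the BestTrainingJob field matches the API's response shape:

```python
def best_training_job(tuning_response):
    """Pull the winning training job's name out of a
    DescribeHyperParameterTuningJob response."""
    return tuning_response['BestTrainingJob']['TrainingJobName']

# In the real workflow this response comes from boto3 (credentials assumed):
#   sagemaker = boto3.client('sagemaker')
#   resp = sagemaker.describe_hyper_parameter_tuning_job(
#       HyperParameterTuningJobName='TRAINING_JOB_NAME')

# Stubbed response shaped like the API's, for illustration only:
resp = {'BestTrainingJob': {'TrainingJobName': 'TUNING_JOB_NAME-007-abc'}}
print(best_training_job(resp))
```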
  

Model
After creating a training job that meets your criteria, you are ready to create a model. The model takes the training job output and the algorithm image and creates a Docker-based configuration that SageMaker (or any other platform) can host for you. This is a basic proof of concept showing how models can be built; it glosses over the governance processes and partnerships involved in taking models to production.

The following Python code sample can be used to create a model:

import boto3
import os
from sagemaker.amazon.amazon_estimator import get_image_uri
os.environ['AWS_PROFILE'] = 'AWS_STS_PROFILE'
sagemaker = boto3.client('sagemaker')
role = 'ARN_IAM_ROLE'
# gets the K-Means docker image from AWS
image = get_image_uri(boto3.Session().region_name, 'kmeans')
job_name = 'TRAINING_JOB_NAME'
model_name = 'MODEL_NAME'
info = sagemaker.describe_training_job(TrainingJobName=job_name)
print(info)
model_data = info['ModelArtifacts']['S3ModelArtifacts']
primary_container = {
    'Image': image,
    'ModelDataUrl': model_data
}
tags = [
        {'Key' : 'ResourceName', 'Value': 'KMeans_Model'}
    ]
create_model_response = sagemaker.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = primary_container,
    Tags = tags)
print(create_model_response['ModelArn'])
  

Endpoint Config
The next step after creating a model is to create an endpoint config. This creates the configuration that will be used to build the API endpoint that will eventually host the model.

The following code sample can be used to create endpoint config:  

import boto3
import os
os.environ['AWS_PROFILE'] = 'AWS_STS_PROFILE'
sagemaker = boto3.client('sagemaker')
model_name = 'MODEL_NAME'
endpoint_config_name = 'ENDPOINT_CONFIG_NAME'
print(endpoint_config_name)
tags = [
        {'Key' : 'ResourceName', 'Value': 'KMeans_EndPointConfig'}
    ]
create_endpoint_config_response = sagemaker.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants=[{
        'InstanceType':'ml.m4.xlarge',
        'InitialInstanceCount':1,
        'ModelName':model_name,
        'VariantName':'AllTraffic'}],
    Tags= tags,
    KmsKeyId='KMS_KEY_ID')
print("Endpoint Config Arn: " + create_endpoint_config_response['EndpointConfigArn'])
  

Endpoint
The endpoint is the API that hosts the model and serves inferences. The SDK call for creating an endpoint has no parameter for assigning the role that will execute it; thus, you cannot execute sagemaker.create_endpoint locally.

The workaround I used involved creating a Lambda function and assigning the execution role to the IAM role used in creating the previous resources.

This is the sample code of the Lambda function:

import json
import boto3
import os


def lambda_handler(event, context):
    sagemaker = boto3.client('sagemaker')
    
    endpoint_name = event['EndPointName']
    endpoint_config_name = event['EndPointConfigName'] 
    
    tags = [
            {'Key' : 'ResourceName', 'Value': 'Kmeans_Lambda'}
        ]
    
    print(endpoint_name)
    
    create_endpoint_response = sagemaker.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=endpoint_config_name,
        Tags = tags)
    
    print(create_endpoint_response['EndpointArn'])
    
    resp = sagemaker.describe_endpoint(EndpointName=endpoint_name)
    status = resp['EndpointStatus']
    print("Status: " + status)
    
    try:
        sagemaker.get_waiter('endpoint_in_service').wait(
            EndpointName=endpoint_name)
    finally:
        resp = sagemaker.describe_endpoint(EndpointName=endpoint_name)
        status = resp['EndpointStatus']
        print("Arn: " + resp['EndpointArn'])
        print("Create endpoint ended with status: " + status)
    
        if status != 'InService':
            message = sagemaker.describe_endpoint(EndpointName=endpoint_name)['FailureReason']
            print('Create endpoint failed with the following error: {}'.format(message))
            raise Exception('Endpoint creation did not succeed')
  

Validation

After creating the endpoint, you are ready to start sending test data to it and getting results back. The simplest method of obtaining train/test data segments is to uniformly draw an 80/20 split in the data preparation phase, as shown earlier; rigorous validation of the resulting model is a process we skipped for brevity. To make calls to your endpoint, you can use the SageMaker Runtime SDK in your language of choice.

Here is the Python snippet I used for my POC:

import boto3
import os
import json
os.environ['AWS_PROFILE'] = 'AWS_STS_PROFILE'
runtime = boto3.Session().client('sagemaker-runtime',use_ssl=True)
endpoint_name = 'ENDPOINT'
payload = 'DATA_TO_SEND'
response = runtime.invoke_endpoint(EndpointName=endpoint_name, 
                                   ContentType='text/csv', 
                                   Body=payload)
result = json.loads(response['Body'].read())
print(result)
  

Results are returned in JSON format:

{"closest_cluster": 1.0, "distance_to_cluster": 7.6}
{"closest_cluster": 2.0, "distance_to_cluster": 3.2}
  

The meaning of cluster IDs and distance to cluster is tied to the mathematics of the K-Means algorithm. The model will not directly tell you whether a point belongs to a cluster; you have to provide logic to make inferences based on the distance to the cluster. In our POC, we found that our data points were tightly clustered (very short distance to cluster) for points that truly belonged together. However, for points that were outside the norm for a line of business, we noticed a much larger than average distance to cluster. We built our inference logic on this finding.
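
That inference logic boils down to a distance threshold. A simplified sketch (the threshold value and the flagging rule are illustrative, not our production logic):

```python
import json

def is_outlier(result_line, threshold):
    """Flag a record whose distance to its closest cluster exceeds a
    threshold derived from historical, well-behaved data."""
    result = json.loads(result_line)
    return result['distance_to_cluster'] > threshold

THRESHOLD = 5.0  # illustrative; in practice derived per line of business
for line in ['{"closest_cluster": 1.0, "distance_to_cluster": 7.6}',
             '{"closest_cluster": 2.0, "distance_to_cluster": 3.2}']:
    print(line, '->', 'review' if is_outlier(line, THRESHOLD) else 'ok')
```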

Implementation

Once you are satisfied with your validation results, you have a model that is ready for review by your governing body. When your model is approved and ready for production, Amazon SageMaker makes it easy to host and scale it within AWS. And given the flexibility of its modules, you can also export your model from AWS and host it on-premises or on any platform you want.

 

Conclusion

As a software engineer with little background in data science, I found that Amazon SageMaker made it easy to start building machine learning models. We are still working with our model risk office to assess the model and fine-tune the accuracy of its results. Building an accurate model takes time, collaboration, and a whole ecosystem of tools, but it pays off once you have a model in place that can detect complex patterns.

There are a variety of tools on the market to help engineers get started with building machine learning models. I encourage you to do your own research to figure out which one works best for your use case and governance process!


Anthony Okocha, Senior Software Engineer, Capital One

Software development is more than a career for me, it has been my hobby and passion in one form or another for the past 15 years. From development, testing and implementation, the ability of my team to reach our set goals in each of these phases is always paramount. Waving a dead chicken while standing on one foot (figuratively speaking of course) hoping your code compiles is one way to help your team, or you can just go the traditional route and do whatever it takes as a team member to ensure a project is completed...no poultry needed. In my roles as a team member or team leader, I have used humor as a means of effective communication and motivation. I do not believe my humor will get me a spot on Saturday Night Live, (SNL....it's IT we have to abbreviate), I do hope to write code that will one day control some of their comedy skits.


DISCLOSURE STATEMENT: © 2020 Capital One. Opinions are those of the individual author. Unless noted otherwise in this post, Capital One is not affiliated with, nor endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are property of their respective owners.
