In this section we will put the tuned model into action with data from new customers. First, we will prepare the new data loaded into MongoDB and make predictions on them. In real-world projects the real outcomes often arrive some time after the predictions are made, and can then be used to validate the performance of the model. In our case, we will simulate a situation in which the real values of the Exited feature become available some time after we have made the predictions, and we will compute a score against the real data to find out how well the model predicted the behaviour of the customers.
Let's start by preparing the new data loaded into MongoDB. We have already started the MongoDB daemon, so let's connect to the database, retrieve the relevant data and store it in a data frame.
import pymongo
import pandas as pd
import numpy as np
from IPython.display import display
import csv
import pickle
#Open a connection with mongodb
client = pymongo.MongoClient('localhost', 27017)
#get the database
print(client.list_database_names())
db = client.customers
#get the collection
print(db.list_collection_names())
colle = db.deploy_data
#count the documents of the collection
print(colle.count_documents({}))
['admin', 'config', 'customers', 'local', 'test']
['deploy_data', 'customers_data', 'legal_entity', 'natural_person']
1020
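As an aside, if the daemon were not running, the calls above would hang until a ServerSelectionTimeoutError is raised. A quick connectivity check like the following sketch (with a deliberately short timeout) fails fast instead:
#Sketch: verify the MongoDB daemon is reachable before querying;
#a short serverSelectionTimeoutMS makes a missing daemon fail fast
check = pymongo.MongoClient('localhost', 27017, serverSelectionTimeoutMS=2000)
try:
    check.admin.command('ping') #raises ServerSelectionTimeoutError if unreachable
    print("MongoDB is up")
finally:
    check.close()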
#query some documents just to have a look
list(colle.find().limit(2))
[{'_id': ObjectId('60cde74aaf6ab57dbe5cbd78'), 'CustomerId': 'E15683183', 'Name': 'E7090X', 'CreditScore': 684, 'Geography': 'Spain', 'Gender': '0', 'Age': 75, 'Tenure': 3, 'Balance': 999.0, 'NumOfProducts': 1, 'HasCrCard': 0, 'IsActiveMember': 1, 'EstimatedIncome': 1165675.74, 'person': 'entity'}, {'_id': ObjectId('60cde74aaf6ab57dbe5cbd79'), 'CustomerId': '15625083', 'Name': 'Abramov', 'CreditScore': 660, 'Geography': 'Germany', 'Gender': 'Female', 'Age': 38, 'Tenure': 5, 'Balance': 146337.88, 'NumOfProducts': 1, 'HasCrCard': 1, 'IsActiveMember': 1, 'EstimatedIncome': 7674.34, 'person': 'natural'}]
#query the whole collection and store it in a dataframe
datos=pd.DataFrame(list(colle.find()))
Now let's close the connection:
#close the connection
client.close()
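As a side note, MongoClient can also be used as a context manager, which closes the connection automatically even if an exception occurs. A minimal sketch of the same read in that style:
#Sketch: the same read in context-manager style; the connection is
#closed automatically when the block exits, even on error
with pymongo.MongoClient('localhost', 27017) as client:
    datos = pd.DataFrame(list(client.customers.deploy_data.find()))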
To start with, have a look at the data:
datos.head()
| | _id | CustomerId | Name | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedIncome | person |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60cde74aaf6ab57dbe5cbd78 | E15683183 | E7090X | 684 | Spain | 0 | 75 | 3 | 999.00 | 1 | 0 | 1 | 1165675.74 | entity |
| 1 | 60cde74aaf6ab57dbe5cbd79 | 15625083 | Abramov | 660 | Germany | Female | 38 | 5 | 146337.88 | 1 | 1 | 1 | 7674.34 | natural |
| 2 | 60cde74aaf6ab57dbe5cbd7a | 15712974 | Hogarth | 761 | France | Female | 33 | 4 | 137480.88 | 1 | 1 | 1 | 53765.71 | natural |
| 3 | 60cde74aaf6ab57dbe5cbd7b | E15607817 | E8908X | 812 | Spain | 0 | 35 | 1 | 87219.88 | 2 | 1 | 1 | 1034466.96 | entity |
| 4 | 60cde74aaf6ab57dbe5cbd7c | E15601011 | E6431X | 642 | France | 0 | 66 | 9 | 999.00 | 3 | 1 | 1 | 1161182.14 | entity |
Some documents refer to legal entities. As we know, our model is focused on natural persons, so let's filter the dataframe:
print("Unique values of 'person' before filtering: {a}".format(a=datos["person"].unique()))
datos = datos[datos["person"]=="natural"]
print("Unique values of 'person' after filtering: {a}".format(a=datos["person"].unique()))
Unique values of 'person' before filtering: ['entity' 'natural']
Unique values of 'person' after filtering: ['natural']
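Note that this filter could also have been pushed down to MongoDB itself, so that only the relevant documents travel over the wire. A sketch of the equivalent query (assuming the connection were still open):
#Equivalent server-side filter (sketch): only natural persons are returned
datos = pd.DataFrame(list(colle.find({"person": "natural"})))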
As we did earlier to train the model, we will now create "dummy" (i.e. binary) variables from the categorical features. This time, however, we will not drop any feature in this step (except for the original categorical variables the dummies were extracted from). In particular, we will keep the IDs and names of the clients, because we will use them to match our predictions against the real outcomes when they arrive.
#create binary (1, 0) dummy variables
datos["Spain"]=np.where(datos["Geography"]=="Spain",1,0)
datos["Germany"]=np.where(datos["Geography"]=="Germany",1,0)
datos["Female"]=np.where(datos["Gender"]=="Female",1,0)
#drop the original categorical features
datos=datos.drop(columns="Geography")
datos=datos.drop(columns="Gender")
datos.head()
| | _id | CustomerId | Name | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedIncome | person | Spain | Germany | Female |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 60cde74aaf6ab57dbe5cbd79 | 15625083 | Abramov | 660 | 38 | 5 | 146337.88 | 1 | 1 | 1 | 7674.34 | natural | 0 | 1 | 1 |
| 2 | 60cde74aaf6ab57dbe5cbd7a | 15712974 | Hogarth | 761 | 33 | 4 | 137480.88 | 1 | 1 | 1 | 53765.71 | natural | 0 | 0 | 1 |
| 5 | 60cde74aaf6ab57dbe5cbd7d | 15744199 | Raber | 519 | 44 | 6 | 0.00 | 2 | 1 | 0 | 161254.29 | natural | 0 | 0 | 0 |
| 6 | 60cde74aaf6ab57dbe5cbd7e | 15679508 | Stevens | 677 | 18 | 4 | 153212.86 | 1 | 1 | 1 | 22365.03 | natural | 0 | 1 | 0 |
| 7 | 60cde74aaf6ab57dbe5cbd7f | 15730496 | Kharlamov | 774 | 55 | 6 | 79447.49 | 1 | 1 | 1 | 188899.10 | natural | 0 | 0 | 1 |
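As an aside, the same dummies could have been generated automatically with pd.get_dummies before dropping the original columns. A sketch (the generated names, e.g. Geography_Spain, would need renaming to match the training features, and the redundant Geography_France and Gender_Male columns would need dropping, which is why we used np.where above):
#Alternative sketch (to be run before dropping Geography and Gender):
#automatic dummy creation; names like Geography_Spain would need renaming
dummies = pd.get_dummies(datos[["Geography", "Gender"]])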
Now store CustomerId, Name and person in a separate dataframe, and keep only the model features in X_deploy:
X_deploy_ids = datos[["CustomerId","Name","person"]].copy() #copy() avoids a SettingWithCopyWarning when we add columns later
display(X_deploy_ids.head())
X_deploy = datos.drop(columns=["CustomerId","Name","person","_id"])
X_deploy.head()
| | CustomerId | Name | person |
|---|---|---|---|
| 1 | 15625083 | Abramov | natural |
| 2 | 15712974 | Hogarth | natural |
| 5 | 15744199 | Raber | natural |
| 6 | 15679508 | Stevens | natural |
| 7 | 15730496 | Kharlamov | natural |
| | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedIncome | Spain | Germany | Female |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 660 | 38 | 5 | 146337.88 | 1 | 1 | 1 | 7674.34 | 0 | 1 | 1 |
| 2 | 761 | 33 | 4 | 137480.88 | 1 | 1 | 1 | 53765.71 | 0 | 0 | 1 |
| 5 | 519 | 44 | 6 | 0.00 | 2 | 1 | 0 | 161254.29 | 0 | 0 | 0 |
| 6 | 677 | 18 | 4 | 153212.86 | 1 | 1 | 1 | 22365.03 | 0 | 1 | 0 |
| 7 | 774 | 55 | 6 | 79447.49 | 1 | 1 | 1 | 188899.10 | 0 | 0 | 1 |
Before feeding the data to the model, we have to make sure that the features are in the same order they had in the training data set. To check this, recall the list of feature names we stored earlier:
output_path = "/home/fabio/Documents/data_science_project_1/data/"
X_train_names=list()
#the with statement closes the file automatically, so no explicit close() is needed
with open(output_path+"X_names.csv", 'r') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        X_train_names.extend(row)
print(X_train_names)
print()
print(list(X_deploy.columns))
print()
if X_train_names == list(X_deploy.columns):
    print("They are identical")
['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedIncome', 'Spain', 'Germany', 'Female']

['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedIncome', 'Spain', 'Germany', 'Female']

They are identical
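Had the names matched as a set but not in order, a one-line reindex would have fixed it. A defensive sketch:
#Defensive sketch: force the deployment features into the training order;
#this raises a KeyError if any training feature is missing
X_deploy = X_deploy[X_train_names]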
Now we will load the tuned model and make predictions on the new data.
tuned_model_path = "/home/fabio/Documents/data_science_project_1/results/"
#Load the tuned model; the with statement closes the file automatically
with open(tuned_model_path+"tuned_model.pickle", 'rb') as f:
    loaded_model = pickle.load(f)
#Make predictions
y_pred=loaded_model.predict(X_deploy)
#Add the predictions to X_deploy_ids as a new column; insert() avoids the
#chained assignment X_deploy_ids["Predictions"] = y_pred
X_deploy_ids.insert(loc=3, column="Predictions", value=y_pred, allow_duplicates=False)
X_deploy_ids.head()
| | CustomerId | Name | person | Predictions |
|---|---|---|---|---|
| 1 | 15625083 | Abramov | natural | 0 |
| 2 | 15712974 | Hogarth | natural | 0 |
| 5 | 15744199 | Raber | natural | 0 |
| 6 | 15679508 | Stevens | natural | 0 |
| 7 | 15730496 | Kharlamov | natural | 0 |
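If the tuned model is a scikit-learn classifier exposing predict_proba (an assumption; not every estimator does), the churn probability can be stored alongside the hard 0/1 prediction, which is often more useful for prioritising retention actions. A sketch:
#Sketch (assumes the model exposes predict_proba): probability of class 1
y_prob = loaded_model.predict_proba(X_deploy)[:, 1]
X_deploy_ids.insert(loc=4, column="ChurnProbability", value=y_prob.round(3))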
Now let's validate the model by comparing the predictions with the real outcomes:
project_data_path = "/home/fabio/Documents/data_science_project_1/data/"
#Load the real outcomes
real_outcomes = pd.read_csv(project_data_path+"0_deployment_data_real_target.csv")
real_outcomes = real_outcomes[real_outcomes["person"]=="natural"]
real_outcomes.head()
| | CustomerId | Name | Exited | person |
|---|---|---|---|---|
| 0 | 15625083 | Abramov | 1 | natural |
| 2 | 15712974 | Hogarth | 1 | natural |
| 5 | 15679508 | Stevens | 0 | natural |
| 6 | 15744199 | Raber | 1 | natural |
| 7 | 15730496 | Kharlamov | 1 | natural |
To validate the model's predictions, we will compute the percentage of correct ones. To this end, we first join the real outcomes with the predicted values, and then count the rows where they agree.
joined = X_deploy_ids.merge(real_outcomes, how='inner', on=["CustomerId","Name","person"])
joined.head()
| | CustomerId | Name | person | Predictions | Exited |
|---|---|---|---|---|---|
| 0 | 15625083 | Abramov | natural | 0 | 1 |
| 1 | 15712974 | Hogarth | natural | 0 | 1 |
| 2 | 15744199 | Raber | natural | 0 | 1 |
| 3 | 15679508 | Stevens | natural | 0 | 0 |
| 4 | 15730496 | Kharlamov | natural | 0 | 1 |
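Since an inner join silently drops rows that find no match, it is worth checking that no prediction was lost. A minimal sketch:
#Sanity check (sketch): every prediction should have found its real outcome
assert len(joined) == len(X_deploy_ids), "some predictions lack a real outcome"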
#compute the score: 1 where the prediction matches the real outcome, 0 elsewhere
score = (joined["Exited"] == joined["Predictions"]).astype(int)
accuracy = round( ( sum(score) / len(score) )*100, 1)
print("The accuracy of the predictions is: {a}".format(a=str(accuracy)+"%"))
The accuracy of the predictions is: 80.3%
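The same figure, together with a more informative error breakdown, can be obtained with scikit-learn's metrics. A sketch:
from sklearn.metrics import accuracy_score, confusion_matrix
#Equivalent accuracy, plus a confusion matrix showing how the errors
#split between false positives and false negatives
print(round(accuracy_score(joined["Exited"], joined["Predictions"])*100, 1))
print(confusion_matrix(joined["Exited"], joined["Predictions"]))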
Now that we have validated the model on new data, we can build a pipeline that runs the whole prediction-validation process the way it would run in a real-world project, where predictive models are deployed as "business as usual". To this end, we will first prepare scripts that perform the data preparation, the prediction step and the validation step separately. This will enable us to run the whole process from a terminal.
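As a preview, a driver like the following sketch could chain the three steps; the script names are hypothetical placeholders for the scripts we are about to write:
import subprocess

#Sketch of a driver with hypothetical script names: each step runs as a
#separate process, mirroring how the pipeline would be launched from a terminal
for script in ["prepare_data.py", "predict.py", "validate.py"]:
    subprocess.run(["python", script], check=True)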