At this point, we have already started the MongoDB daemon. First, we will open a connection to MongoDB, then get the database and the collection. Next, we will check the number of documents in the collection and have a look at some of them. Finally, we will load the collection into a DataFrame.
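A minimal sketch of these steps with pymongo and pandas, assuming the daemon listens on the default host and port; the database and collection names ("bank", "customers") are placeholders for whatever your deployment uses:

```python
import pandas as pd
from pymongo import MongoClient

# Connect to the local MongoDB daemon (default host and port).
client = MongoClient("localhost", 27017)

# The database and collection names are placeholders; adjust to your setup.
db = client["bank"]
collection = db["customers"]

# Number of documents in the collection.
print(collection.count_documents({}))

# A quick look at a few documents.
for doc in collection.find().limit(3):
    print(doc)

# Load the whole collection into a DataFrame, excluding MongoDB's _id field.
df = pd.DataFrame(list(collection.find({}, {"_id": 0})))
```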

Now let's close the connection:
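Assuming the client object from the sketch above:

```python
# Release the connection once the data is in memory.
client.close()
```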

To start, let's have a look at the data and explore the data types and the number of unique values in each feature:
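For example, with the DataFrame loaded above:

```python
# Feature types, number of unique values per feature, and a first peek.
print(df.dtypes)
print(df.nunique())
print(df.head())
```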

We know that person is an identifier indicating whether the client is a natural person or a legal entity. Since we have already selected only natural persons, this feature has a single unique value. Now let's look at the unique values of the other features that have few unique values:
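One way to do this is to select the low-cardinality columns programmatically; the threshold of 10 below is an arbitrary choice:

```python
# Features with few distinct values (threshold chosen arbitrarily).
few_values = [col for col in df.columns if df[col].nunique() < 10]
for col in few_values:
    print(col, df[col].unique())
```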

Exited is the binary (0, 1) response feature. Let's explore the relationships between the response feature and the other features listed here. We can do this with simple bar plots, using the plot method provided by pandas:
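A possible sketch, reusing the few_values list from above: for each low-cardinality feature, we plot the mean of Exited per level, i.e. the exit rate of each group:

```python
import matplotlib.pyplot as plt

# Exit rate per level of each low-cardinality feature.
for col in few_values:
    if col != "Exited":
        df.groupby(col)["Exited"].mean().plot(kind="bar", title=f"Exit rate by {col}")
        plt.show()
```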

To explore the relationships between the response Exited and the other features (those with many different values), we need another kind of plot. Boxplots are convenient for comparing continuous variables (or discrete variables with many distinct values) against categorical variables. Let's use the boxplot method provided by pandas to draw one for each feature:
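A sketch along these lines; the column names are assumptions about this data set (Age and Balance are discussed below, the others are illustrative):

```python
import matplotlib.pyplot as plt

# One boxplot per continuous feature, split by the response.
# The column names are illustrative and may differ in your data.
continuous_cols = ["CreditScore", "Age", "Balance", "EstimatedSalary"]
for col in continuous_cols:
    df.boxplot(column=col, by="Exited")
    plt.show()
```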

The bar plots suggest that the number of products customers hold influences their behaviour: those with two products are less likely to exit than those with one. Similarly, the customers least likely to leave seem to be those who have a credit card, are active clients, and are based in France. The boxplots suggest some effect of the clients' age and of their balance. In the subsequent analyses we may consider removing some of the variables that seem less important. However, the algorithms we will use might find patterns in the data that this simple visual analysis cannot reveal.

Now let's prepare the data to train the model. First, we will create "dummy" (i.e. binary) variables from the categorical features explored above. For each categorical feature we will create n-1 dummy variables, where n is the number of levels of the original feature. Using n-1 rather than n dummies is necessary in linear models to avoid perfect multicollinearity (including all n is known as "the dummy variable trap"). It is sensible in any case, so we will do the same even with non-linear algorithms (e.g. decision trees), although for them it is not a strict requirement.
Moreover, we will drop some features that are not useful for training the model.
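A sketch of both steps with pandas; the identifier columns and the categorical column name are assumptions about this data set:

```python
# Drop identifier columns that carry no predictive information
# (column names are placeholders for this data set).
df = df.drop(columns=["CustomerId", "Surname", "person"])

# Encode each categorical feature with n-1 dummies: drop_first=True
# drops one level per feature, avoiding the dummy variable trap.
df = pd.get_dummies(df, columns=["Geography"], drop_first=True)
```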

Next, let's prepare the train and test data sets; for that, we will take 25% of the original data as the test set. We will also compute the target rate (i.e. the percentage of "1"s in the response feature) to check whether the train and test subsets end up with the same target rate.
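A possible implementation with scikit-learn's train_test_split:

```python
from sklearn.model_selection import train_test_split

# Split features and response, holding out 25% of the data as the test set.
X = df.drop(columns=["Exited"])
y = df["Exited"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Target rate (share of 1s in the response) in each subset.
print("train target rate:", y_train.mean())
print("test target rate:", y_test.mean())
```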

Finally, let's export the four data sets:
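For instance, as CSV files (the file names are illustrative):

```python
# Persist the four data sets for the modelling step.
X_train.to_csv("X_train.csv", index=False)
X_test.to_csv("X_test.csv", index=False)
y_train.to_csv("y_train.csv", index=False)
y_test.to_csv("y_test.csv", index=False)
```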

Now we are ready to start training the models.