Predictive Modeling: Part 1

2nd June 2018 | Basics | Ajay Kumar N


If somebody asks you how the weather will be tomorrow, you can take a guess. This guess is based on your knowledge of past weather. Predictive modelling gives this process a formal framework: it gives you tools to extract mathematical equations or rules from past data in order to predict future results.

Geisser [Predictive Inference: An Introduction] defines predictive modeling as "the process by which a model is created or chosen to try to best predict the probability of an outcome."

Max Kuhn [Applied Predictive Modeling] defines predictive modeling as "the process of developing a mathematical tool or model that generates an accurate prediction."

Steps of Predictive Modelling

  1. After identifying the business objectives, the first step in any predictive model is to collate data from various sources. These can include historical data, demographic data, behavioral data, customer data and transaction data.
  2. In the second step, we prepare the data in the right format for analysis. Here we normally clean the data, impute missing values, and transform and append variables.
  3. Based on the business objectives, we select one modelling technique or a combination of them, such as linear regression, to predict future values.
  4. The final step is to check the performance of the model using measures such as error, accuracy and ROC.
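The four steps above can be sketched in a few lines of plain Python. This is only a minimal illustration: the data values are made up, mean imputation and a two-row hold-out are assumptions chosen for brevity, and the helper names `fit_line` and `rmse` are hypothetical.

```python
import math

# Step 1: collate data. These (x, y) pairs are made-up values for illustration;
# None marks a missing x value.
raw = [(1.0, 3.1), (2.0, 4.9), (None, 7.2), (4.0, 9.0), (5.0, 11.1), (6.0, 12.8)]

# Step 2: prepare data. Impute each missing x with the mean of the observed x values.
observed = [x for x, _ in raw if x is not None]
x_fill = sum(observed) / len(observed)
data = [(x if x is not None else x_fill, y) for x, y in raw]

# Hold out the last two rows so performance is checked on unseen data.
train, test = data[:-2], data[-2:]


def fit_line(points):
    """Ordinary least squares fit of y = a*x + b."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    a = sxy / sxx
    return a, my - a * mx


def rmse(points, a, b):
    """Root mean squared error of the line y = a*x + b on the given points."""
    return math.sqrt(sum((y - (a * x + b)) ** 2 for x, y in points) / len(points))


# Step 3: select and fit a modelling technique (here, linear regression).
a, b = fit_line(train)

# Step 4: check the performance of the model on the held-out rows.
test_rmse = rmse(test, a, b)
print(f"slope={a:.2f}, intercept={b:.2f}, held-out RMSE={test_rmse:.2f}")
```

In real projects each step is far more involved, but the order of operations stays the same.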

Before you draw any conclusions about predictive modeling being some kind of black magic, let's list its limitations:

  1. The models [equations / rules] depend on the past data that you have. If the data is bad, your predictive models are also going to be bad. Garbage in, garbage out.
  2. Every model has errors associated with its predictions. The better the model, the smaller the error, but for all practical purposes a prediction will never be an exact estimate.
  3. A model stays good only as long as the underlying factors on which it was based do not change their behavior. For example, a predictive model built to predict a particular share's performance in good economic conditions will perform rather poorly in a recession.

Before we jump right into predictive modelling and start extracting those equations, let's first figure out what really leads us there.


When we say that, given past data, we can predict future outcomes, we are essentially relying on a hunch that we can observe some other factor which affects the outcome, and that we can leverage that information. For example, when you pick up an apple and guess its weight, you are betting on the assumption that the weight of an apple depends on its size [or diameter]. In other words, you are assuming that the weight of the apple is correlated with its diameter.

As you might have observed, the weight of an apple goes up as its diameter increases. On thinking more deeply, you may find that the weight increases in roughly constant multiples of the increase in diameter. This is called linear correlation. There can be other forms of correlation as well.

What we mean by linear and other forms of correlation is that one variable [let's say y] can be written as a function of another [let's say x]:

  1. Linear Correlation: $\large y=ax+b$

  2. Exponential Correlation: $\large y=ae^x+b $
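Both forms above are linear in the coefficients $a$ and $b$: in the exponential case, $y$ is a straight-line function of $z = e^x$. The sketch below uses that observation to recover $a$ and $b$ for both forms with the same least-squares routine. The data is generated from assumed values $a = 2$, $b = 1$ without noise, and `fit_line` is a hypothetical helper, not something from the original post.

```python
import math


def fit_line(xs, ys):
    """Least-squares estimates of a and b in y = a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


xs = [0.0, 0.5, 1.0, 1.5, 2.0]

# Noise-free data generated from the two forms with a = 2, b = 1 (assumed values).
ys_linear = [2.0 * x + 1.0 for x in xs]
ys_exponential = [2.0 * math.exp(x) + 1.0 for x in xs]

# Linear form: fit y against x directly.
a_lin, b_lin = fit_line(xs, ys_linear)

# Exponential form: substitute z = e^x, then fit y against z.
zs = [math.exp(x) for x in xs]
a_exp, b_exp = fit_line(zs, ys_exponential)

print(a_lin, b_lin)
print(a_exp, b_exp)
```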

Quantifying Correlation Coefficient

The Pearson correlation coefficient is probably the most widely used measure of the linear relationship between two variables, and is thus often just called the "correlation coefficient". The formula below measures the strength of linear correlation.

$\large r = \frac{\sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})} {\sqrt{\sum_{i=1}^{n} (x_i - \overline{x})^2 \sum_{i=1}^{n} (y_i - \overline{y})^2}}$

A value of 1 represents a perfect positive linear relationship, -1 a perfect negative linear relationship, and 0 indicates the absence of a linear relationship between the variables. In practice $r$ falls somewhere in between: it is negative for negative correlation and positive for positive correlation. Note: negative correlation between $x$ and $y$ means that when $x$ increases, $y$ tends to decrease, and vice versa.
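As a rough illustration, the Pearson coefficient can be computed directly from its definition: the sum of products of deviations, divided by the square root of the product of the two sums of squared deviations. The function name `pearson_r` and the sample lists are made up for this sketch.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    if ssx == 0 or ssy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(ssx * ssy)


x = [1, 2, 3, 4, 5]
r_positive = pearson_r(x, [2, 4, 6, 8, 10])   # y rises in step with x
r_negative = pearson_r(x, [10, 8, 6, 4, 2])   # y falls in step with x
r_none = pearson_r(x, [2, 1, 3, 1, 2])        # no linear trend in y

print(r_positive, r_negative, r_none)
```

The three calls correspond to the three reference points described above: a perfect positive relationship, a perfect negative relationship, and no linear relationship.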

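It is also important to note that the magnitude of $r$ doesn’t change if you linearly transform either or both of the variables. That is, the correlation between $x$ and $y$ is the same as the correlation between $(ax+b)$ and $(cy+d)$ whenever $a$ and $c$ are non-zero and have the same sign; if they have opposite signs, only the sign of $r$ flips.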
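A quick numerical check of this property, as a sketch with made-up data and arbitrarily chosen scale factors and offsets. It also shows the one caveat: a negative scale factor applied to only one of the variables flips the sign of $r$ while leaving its magnitude unchanged.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(ssx * ssy)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(x, y)

# Both variables rescaled with positive factors and shifted: r is unchanged.
r_same_sign = pearson_r([3 * v + 7 for v in x], [0.5 * v - 2 for v in y])

# Only x rescaled with a negative factor: same magnitude, opposite sign.
r_opposite_sign = pearson_r([-3 * v + 7 for v in x], [0.5 * v - 2 for v in y])

print(r, r_same_sign, r_opposite_sign)
```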

The downside is that it measures linear correlation only: two variables can have a strong non-linear relationship and still produce a value of $r$ close to 0.
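To see this limitation concretely, consider a sketch in which $y$ is completely determined by $x$ through $y = x^2$, with $x$ values placed symmetrically around zero. The data is made up for illustration; the relationship is exact, yet the Pearson coefficient detects nothing.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(ssx * ssy)


x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]  # y depends on x exactly, but not linearly

r_quadratic = pearson_r(x, y)
print(r_quadratic)
```

The positive and negative halves of the parabola cancel each other in the numerator, so the coefficient comes out as zero even though knowing $x$ tells you $y$ exactly.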

Correlation and Causation

Causation is when a particular factor is the reason for a change in another factor. For example, the number of people buying sun-screen in a city and the city’s temperature are going to be correlated. Here there is also direct causation: rising temperatures [hot sunny weather] drive sales of sun-screen. However, if two factors are correlated, that doesn’t guarantee that there is causation.

Assume that ice-cream sales and shark attacks are correlated: both rise and fall together. That doesn’t mean that ice-cream sales cause shark attacks. In fact, rising temperatures cause more people to buy ice-cream and also cause people to go to beaches in larger numbers [and sometimes subsequently get attacked by sharks].

This clarifies two things:

  1. Correlation doesn’t necessarily mean causation.
  2. Correlated factors might have a common underlying cause, though this is not always the case.
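The ice-cream and shark example can be simulated to show how a common cause produces correlation without causation. In this sketch all numbers (temperature range, coefficients, noise levels) are invented, and shark attacks are treated as a continuous quantity for simplicity. Neither series depends on the other; both depend only on temperature. Removing the linear effect of temperature from each series and correlating the residuals (a partial correlation) makes the apparent relationship largely disappear.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(ssx * ssy)


def residuals(ys, xs):
    """What is left of ys after subtracting a least-squares line fitted on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]


random.seed(0)  # fixed seed so the run is repeatable

# Temperature is the common cause; the two outcomes never see each other.
temperature = [random.uniform(15, 35) for _ in range(500)]
ice_cream_sales = [20 * t + random.gauss(0, 40) for t in temperature]
shark_attacks = [0.5 * t + random.gauss(0, 2) for t in temperature]

# Raw correlation between the two outcomes.
r_raw = pearson_r(ice_cream_sales, shark_attacks)

# Correlation after controlling for temperature.
r_partial = pearson_r(residuals(ice_cream_sales, temperature),
                      residuals(shark_attacks, temperature))

print(f"raw: {r_raw:.2f}, controlling for temperature: {r_partial:.2f}")
```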

Finding Correlation in Python

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
In [2]:
df = pd.read_csv("mtcars.csv")
df.head()  # show the first five rows
               model   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
0          Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
1      Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
2         Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
3     Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
4  Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2
In [3]:
# Pairwise Pearson correlations of the numeric columns (the text column "model" is excluded).
df.select_dtypes(include="number").corr()
              mpg        cyl       disp         hp       drat         wt       qsec         vs         am       gear       carb
mpg      1.000000  -0.852162  -0.847551  -0.776168   0.681172  -0.867659   0.418684   0.664039   0.599832   0.480285  -0.550925
cyl     -0.852162   1.000000   0.902033   0.832447  -0.699938   0.782496  -0.591242  -0.810812  -0.522607  -0.492687   0.526988
disp    -0.847551   0.902033   1.000000   0.790949  -0.710214   0.887980  -0.433698  -0.710416  -0.591227  -0.555569   0.394977
hp      -0.776168   0.832447   0.790949   1.000000  -0.448759   0.658748  -0.708223  -0.723097  -0.243204  -0.125704   0.749812
drat     0.681172  -0.699938  -0.710214  -0.448759   1.000000  -0.712441   0.091205   0.440278   0.712711   0.699610  -0.090790
wt      -0.867659   0.782496   0.887980   0.658748  -0.712441   1.000000  -0.174716  -0.554916  -0.692495  -0.583287   0.427606
qsec     0.418684  -0.591242  -0.433698  -0.708223   0.091205  -0.174716   1.000000   0.744535  -0.229861  -0.212682  -0.656249
vs       0.664039  -0.810812  -0.710416  -0.723097   0.440278  -0.554916   0.744535   1.000000   0.168345   0.206023  -0.569607
am       0.599832  -0.522607  -0.591227  -0.243204   0.712711  -0.692495  -0.229861   0.168345   1.000000   0.794059   0.057534
gear     0.480285  -0.492687  -0.555569  -0.125704   0.699610  -0.583287  -0.212682   0.206023   0.794059   1.000000   0.274073
carb    -0.550925   0.526988   0.394977   0.749812  -0.090790   0.427606  -0.656249  -0.569607   0.057534   0.274073   1.000000
In [4]:
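plt.figure(figsize=(10, 8))
# Use only the numeric columns; the text column "model" cannot be correlated.
sns.heatmap(df.select_dtypes(include="number").corr(), annot=True, square=True, cmap="YlGnBu")
plt.show()
[Figure: annotated heatmap of the mtcars correlation matrix]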

If we look at the mpg and wt variables, the correlation is -0.87. This tells you that a strong linear correlation exists between them. The sign of the correlation coefficient is negative, which tells you that as the weight of a vehicle increases, its mileage goes down.
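To look at a single pair of variables rather than the whole matrix, the coefficient can be computed for just those two columns. The sketch below uses only the five rows shown in the table output earlier, so its value differs from the full-dataset figure of -0.87; the point is the negative sign. With the full DataFrame loaded, `df["mpg"].corr(df["wt"])` should give the full-dataset value. The helper `pearson_r` is written out here so the example runs without the CSV file.

```python
import math

# First five rows of mtcars, copied from the table output shown earlier.
mpg = [21.0, 21.0, 22.8, 21.4, 18.7]
wt = [2.620, 2.875, 2.320, 3.215, 3.440]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(ssx * ssy)


# Five rows only, so this is not the -0.87 of the full dataset.
r_head = pearson_r(wt, mpg)
print(round(r_head, 2))
```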

We will continue our discussion on Predictive Modeling in the next post.


Hi! I am Ajay. I try to contribute to society by striving to create great software products that make people's lives easier. I believe software is the most effective way to touch others' lives in our day and time. I mostly work in Python, but I do not pigeonhole myself to specific languages or frameworks. A good developer is receptive and has the ability to learn new technologies. I also often contribute to open source projects and beta test startup products. I'm passionate about making people's lives better through software. Whether it's a small piece of functionality implemented in a way that is seamless to the user, or it's a large scale effort to improve the performance and usability of software, I'm there. That's what I do. I make software. Better.
