“Correlation is not causation!” This phrase is emphatically shouted by many data analysts, because it’s become a bit of a sore point in the community. The statement is obviously true, but correlation is still a relatively misunderstood piece of data analysis that could use some further explanation and exploration.

A correlation is a mathematical relationship that exists between two variables. In a positive correlation, as one variable increases, so does the second variable. For instance, femur length increases as overall height increases.

In contrast, a negative correlation occurs when one variable increases as the other decreases. For instance, a vehicle’s miles per gallon decrease as its weight increases.

Correlation comes in different degrees, as measured by the correlation coefficient, which ranges from -1 (a perfect negative correlation) to +1 (a perfect positive correlation). You can also spot correlations visually in a scatter plot. A strong correlation means that as one variable increases or decreases, there is a better chance of the second variable increasing or decreasing as well. In a visualization with a strong correlation, the point cloud sits at a clear angle and the points cluster closely together. In a strongly correlated graph, if I tell you the value of one of the variables, you should be able to get a rough idea of the value of the second variable. For example, wind speed is strongly correlated with wave height in one month of data collected from Pacific Ocean buoys. There also appear to be some interesting outliers around a wind speed of 10-15 meters per hour, where further investigation would be required.

Rushing yards are also strongly correlated with points in college football.

A weak correlation means that as one variable increases or decreases, the second variable is less likely to follow. In a visualization with a weak correlation, the plotted point cloud is more diffuse and its angle is flatter. If the cloud is very flat or vertical, there is a weak correlation. Earthquake magnitude and the depth at which the earthquake was measured are therefore weakly correlated, as the nearly flat scatter plot shows.

In the chart below, you can see that TV show views and TV show downloads are only weakly correlated. Thus, if a show is popular on TV, we cannot deduce with much certainty that it will also be popular as an illegal download.
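To make the idea of a correlation coefficient concrete, here is a minimal sketch that computes the Pearson coefficient by hand for three synthetic datasets. The data is randomly generated for illustration only (it is not the buoy, football, earthquake, or TV data mentioned above), and the helper name `pearson_r` is an assumption of this example.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
x = [rng.uniform(0, 20) for _ in range(500)]

# Synthetic strong positive case: y tracks x closely, with a little noise.
strong_y = [0.3 * xi + rng.gauss(0, 0.5) for xi in x]
# Synthetic strong negative case: y falls as x rises.
negative_y = [-0.3 * xi + rng.gauss(0, 0.5) for xi in x]
# Synthetic weak case: y is almost entirely noise.
weak_y = [0.02 * xi + rng.gauss(0, 2.0) for xi in x]

r_strong = pearson_r(x, strong_y)
r_negative = pearson_r(x, negative_y)
r_weak = pearson_r(x, weak_y)

print(f"strong positive: r = {r_strong:.2f}")
print(f"strong negative: r = {r_negative:.2f}")
print(f"weak:            r = {r_weak:.2f}")
```

With data built this way, the strong cases should land close to +1 and -1, while the weak case should sit near 0. On Python 3.10 or later, the standard library function `statistics.correlation` computes the same quantity.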

Correlations can hint at tendencies, but there are no hard and fast conclusions to be drawn from correlations without further research. For example, offering customer loyalty perks like free miles on an airline tends to increase customer satisfaction but certainly doesn’t guarantee it. So many other factors contribute to customer satisfaction that it would be overly simplistic to assume that all customers who have mileage perks are extremely satisfied with their flight experience.

Correlations also leave room for additions that can skew these tendencies. If customers who receive free miles because of a loyalty program are usually more satisfied, customers who also receive free checked bags as a result of the loyalty program will likely be even more satisfied. In contrast, customers who are awarded free miles but who encounter haughty customer service representatives will likely have a tendency to be less satisfied.

If correlation does not mean causation, then what does it mean? A correlation between variable A and variable B can mean a few things:

- A causes B
- B causes A
- Pure coincidence
- A and B are connected by the same cause
- Both A and B contribute to each other in a loop
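The "connected by the same cause" case can be illustrated with a small simulation. This is a sketch under stated assumptions: the hidden variable `c` and the way `a` and `b` are built from it are invented for the example, and `pearson_r` is a hypothetical helper defined here. Neither `a` nor `b` influences the other, yet they come out correlated.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(1)  # fixed seed so repeated runs give the same numbers

# Hypothetical hidden common cause C.
c = [rng.gauss(0, 1) for _ in range(2000)]

# A and B each depend on C plus their own independent noise.
# By construction, A never feeds into B and B never feeds into A.
a = [ci + rng.gauss(0, 0.5) for ci in c]
b = [ci + rng.gauss(0, 0.5) for ci in c]

r_ab = pearson_r(a, b)

# Subtract C's contribution (we know it exactly because we built the data).
# What remains is the independent noise in each variable.
a_without_c = [ai - ci for ai, ci in zip(a, c)]
b_without_c = [bi - ci for bi, ci in zip(b, c)]
r_without_c = pearson_r(a_without_c, b_without_c)

print(f"A vs B:                    r = {r_ab:.2f}")
print(f"A vs B with C's share out: r = {r_without_c:.2f}")
```

The first coefficient should be clearly positive, and the second should be close to zero. In real data the common cause is usually not known in advance, which is why a correlation calls for further research rather than a causal conclusion.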

Some users may shy away from correlation analysis because it requires some backstory, and because it is considered one of the less understood forms of analysis. For instance, homelessness and crime may be high in the same areas or low in the same areas (and thus are correlated). After having read this post, you know that it would be imprudent to say that homelessness causes crime, or that crime causes homelessness, without further research. The two variables could both be caused by a third, unknown variable, such as education or drug addiction.

Now that you know what to look for when plotting for correlation, you can analyze relationships between all kinds of variables in DataHero. Plot order revenue by number of items per order, or customer satisfaction by response time. The possibilities are endless. Correlation points you in the right direction so you know where to investigate the backstory in your data further, and it can lead you to better decisions about where to focus your efforts. If sign-ups correlate positively with your Facebook ad spending but not with your Twitter ad spending, then you know to probe into why that is, and to consider reallocating more of your marketing budget to Facebook ads.
