Some mistakes in data analysis pop up more often than others. In an effort to make data analysis accessible to everyone, we want to provide a refresher course in best practices. The following examples are based on survey data that Reddit collected from its users and on an e-commerce dataset.
Top ten graphs and pie charts are a popular way to show categorical information because they let you quickly get an idea of the major contributors in a given field. In some instances, it makes sense to use an Other category to combine sources that add up to a relatively small percentage of the whole, as in the example below.
The problem arises when the Other category is larger than many of the named categories. For example, look at the two graphs below. The first graph, which hides the Other category, leads you to believe that the top ten categories account for the majority of the popular subreddits. The second graph, which includes it, tells a different story and brings to mind the long tail theory.
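One way to catch this before charting is to compute the Other bucket explicitly and compare it with the named categories. The following is a minimal sketch, not the method used to build the graphs above: the function names and the counts are hypothetical, chosen only for illustration, and it uses the top three rather than the top ten to keep the data short.

```python
from collections import Counter


def top_n_with_other(counts, n=10):
    """Return the n largest categories and the combined total of the rest."""
    ranked = Counter(counts).most_common(n)
    other_total = sum(counts.values()) - sum(value for _, value in ranked)
    return ranked, other_total


def categories_smaller_than_other(ranked, other_total):
    """List the named categories that the Other bucket outweighs."""
    return [name for name, value in ranked if value < other_total]


# Hypothetical counts for illustration only (not the Reddit survey data).
counts = {"A": 120, "B": 100, "C": 80, "D": 60, "E": 50, "F": 40, "G": 30, "H": 20}

ranked, other_total = top_n_with_other(counts, n=3)
top_share = sum(value for _, value in ranked) / sum(counts.values())
outweighed = categories_smaller_than_other(ranked, other_total)

print(ranked, other_total, round(top_share, 2), outweighed)
```

If the last list is not empty, hiding Other overstates how dominant the named categories are, so it is probably worth showing Other on the chart.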
Bar charts are used to visualize relative sizes. If one bar is twice as long as another, you expect the underlying quantity to be twice as big. This relative sizing fails, however, if the bar chart's axis does not start at 0. As a rule of thumb, if you want to illustrate a small change and need to start your y-axis anywhere other than 0, use a line chart instead. A recent example of this issue appears in Twitter’s IPO filing with the SEC.
It’s easier to illustrate this idea with examples. The chart on the left compares Redditors who prefer dogs with Redditors who prefer cats. With the y-axis starting at 9,000, it looks as though dog lovers outnumber cat lovers three to one. However, the graph on the right is a more accurate representation of the data: Redditors who prefer dogs are not quite twice as numerous as those who prefer cats.
A limitation of bar charts that start at 0 is that they do not show small percentage differences well. If you need to change the start of your axis in order to highlight small changes, switch to a line chart.
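The distortion from a truncated axis is simple arithmetic: a reader compares the drawn bar lengths, which are the values minus the axis start, not the values themselves. Here is a small sketch of that calculation. The function name and the counts are hypothetical, picked so that the numbers roughly match the pattern described above; they are not the survey's actual figures.

```python
def apparent_ratio(a, b, axis_start=0):
    """Ratio of the drawn bar lengths when the axis begins at axis_start."""
    if min(a, b) <= axis_start:
        raise ValueError("both values must be greater than the axis start")
    return (a - axis_start) / (b - axis_start)


# Hypothetical counts for illustration only.
dogs, cats = 27_000, 15_000

true_ratio = apparent_ratio(dogs, cats)                        # axis starts at 0
truncated_ratio = apparent_ratio(dogs, cats, axis_start=9_000)

print(true_ratio, truncated_ratio)
```

With these made-up numbers, the bars drawn from 9,000 differ by a factor of three, while the underlying counts differ by less than a factor of two.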
Confusing correlation and causation is perhaps one of the most tempting data pitfalls. In the Cheese and Employment Status percentage graph, it is clear that retired Redditors prefer cheddar cheese and freelance Redditors prefer brie. This does not mean that once an individual retires he or she will develop a sudden affinity for cheddar. Nor does becoming a freelancer cause one to prefer brie.
Averages can be a great way to get a quick overview of how a business is doing in some key areas, but use them wisely. For example, average order value is a useful metric. Looking at only the average order value by month is enlightening because it shows an increase over time, which indicates that the company is moving in the right direction. However, it’s more useful to look at average order value by department by month, because this shows us where the increase is coming from: the women’s shoes department. If we looked at only average order value by month, we might spread our marketing across all departments, which may not be the most efficient allocation of resources.
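To make the two views concrete, here is a rough sketch that computes average order value at both levels of detail. The `average_by` helper, the field names, the second department, and the order values are all assumptions made up for this example; they are not taken from the e-commerce dataset.

```python
from collections import defaultdict


def average_by(orders, key):
    """Average order value for each group returned by key(order)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for order in orders:
        group = key(order)
        totals[group] += order["value"]
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical orders for illustration only.
orders = [
    {"month": "Jan", "department": "Women's Shoes", "value": 60.0},
    {"month": "Jan", "department": "Accessories", "value": 40.0},
    {"month": "Feb", "department": "Women's Shoes", "value": 90.0},
    {"month": "Feb", "department": "Accessories", "value": 40.0},
]

by_month = average_by(orders, lambda o: o["month"])
by_department_month = average_by(orders, lambda o: (o["department"], o["month"]))

print(by_month)
print(by_department_month)
```

In this toy data the monthly average rises from 50 to 65, but the breakdown shows that only one department moved while the other stayed flat.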
It may be tempting in the above situation to make a pie chart of the average order amount by department, but resist that urge. Pie charts are meant only to represent slices of a whole. How can you have a slice of an average? Instead, use a sum or the number of records for pie charts.
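A quick calculation shows why. Shares computed from sums add up to a real total, while "shares" computed from averages can hand the biggest slice to a department that contributes very little. This is only a sketch under made-up assumptions: the department names, revenue figures, and order counts below are hypothetical.

```python
def shares(values):
    """Each value as a fraction of the total of all values."""
    total = sum(values.values())
    return {name: value / total for name, value in values.items()}


# Hypothetical figures for illustration only.
revenue = {"Women's Shoes": 9000.0, "Accessories": 1000.0}
order_count = {"Women's Shoes": 300, "Accessories": 10}

average_order = {name: revenue[name] / order_count[name] for name in revenue}

revenue_shares = shares(revenue)        # meaningful: slices of total revenue
average_shares = shares(average_order)  # not meaningful: averages do not form a whole

print(revenue_shares)
print(average_shares)
```

Here Accessories brings in a tenth of the revenue, yet a pie of averages would give it more than three quarters of the circle.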
What are your data analysis pet peeves? Leave a comment below and join in the conversation.