Box Plots Explained
Dylan | Dec 01, 2019
Today we'll be exploring box plots, also referred to as box-and-whisker plots. Box plots are a fantastic tool for quickly generating an overview of your data's distribution by conveniently summarizing five numbers for your viewer: the "minimum" value, first quartile, second quartile, third quartile, and the "maximum" value. Let's explore these five basic statistics below!
Quartiles
The first quartile is the value 25% of the way through our data after sorting it in ascending order. It's calculated by finding the median of our data, then taking the lower half of the data (the values below the median) and calculating the median once more. This value is the first quartile, also known as Q1.
The second quartile is simply another name for the median. It is the center value after sorting all of the data into ascending order. If your data set has an even number of values, then the median is the mean (average) of the two middle values. In addition to being the median, the second quartile is also referred to as Q2.
The third quartile is the value 75% of the way through our data after sorting it in ascending order. We can calculate the third quartile much like we calculated the first quartile. First, we split our data at the median, then calculate the median a second time on the half with the larger values. Unsurprisingly, the third quartile is known as Q3.
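The median-of-halves procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: the function name `quartiles` and the sample list are assumptions made for this example. It also assumes that the median is left out of both halves when the number of values is odd, which is one of several common conventions.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    Assumption: when the number of values is odd, the median itself
    is left out of both halves. Other conventions exist.
    Needs at least two values, since each half must be non-empty.
    """
    data = sorted(values)              # quartiles are defined on sorted data
    half = len(data) // 2
    lower = data[:half]                # values before the median position
    upper = data[len(data) - half:]    # values after the median position
    return median(lower), median(data), median(upper)


# Hypothetical sample, chosen only for illustration
print(quartiles([7, 1, 5, 3, 8, 2, 6, 4]))  # (2.5, 4.5, 6.5)
```

Statistical libraries often interpolate quartiles differently, so their results may not match this method exactly on every data set.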
Together, Q1, Q2, and Q3 form the box of our box plot!
"Minimum" and "Maximum" Values
You may be wondering why "minimum" and "maximum" have always been placed inside quotation marks. What exactly do they have to hide? The truth is, they do not always represent the true smallest and largest values in a data set. More accurately, they represent the minimum and maximum values in a data set after removing the outliers.
How do we determine which values are outliers? There's a formula for that, but don't worry, we'll break it all down!
[Q1 - 1.5 × IQR, Q3 + 1.5 × IQR]
The equation above represents an inclusive range of all inlier values. If any value in our data set falls below the first number in our range or exceeds the second number, it can be deemed an outlier. Outliers are simply data points that differ significantly from the other observations.
You may also be wondering what IQR refers to in the equation above. IQR is short for interquartile range, which is simply the difference between the third quartile (Q3) and the first quartile (Q1). Let's put all of these pieces together on a real data set.
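As a rough sketch, the inlier range and the outlier check can be written as two small Python functions. This assumes the common 1.5 × IQR rule, which matches the arithmetic used in the worked example later in this post. The function names `inlier_range` and `outliers`, the parameter `k`, and the sample numbers are all hypothetical, introduced only for illustration.

```python
def inlier_range(q1, q3, k=1.5):
    """Return the inclusive (low, high) range of inlier values.

    Assumption: k = 1.5, the multiplier used in this post.
    """
    iqr = q3 - q1                      # interquartile range
    return q1 - k * iqr, q3 + k * iqr


def outliers(values, q1, q3):
    """Return the values that fall outside the inlier range."""
    low, high = inlier_range(q1, q3)
    return [v for v in values if v < low or v > high]


# Hypothetical quartiles and values, chosen only for illustration
print(inlier_range(4, 8))                      # (-2.0, 14.0)
print(outliers([-5, 3, 6, 9, 14, 20], 4, 8))   # [-5, 20]
```

Note that 14 sits exactly on the upper bound and is kept as an inlier, because the range is inclusive.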
In Practice
Let's imagine we have the following data set: 6, 8, 9, 11, 16, 19, 23, 28, 35, 41, 43, 58, and 61.
First, we should find the median. Conveniently, our data is already in ascending order. We can easily find the median of a data set with an odd number of values by adding one to the number of values, dividing by two, and counting that many values starting from the left. In our case, (13+1)/2=7. The seventh value in our set is 23.
Next, let's find the first and third quartiles. We can accomplish this by splitting the set at the median, leaving the median itself out, and finding the median of each half. The first half is 6, 8, 9, 11, 16, and 19. Its center values are 9 and 11, so the first quartile is their average, 10. The second half is 28, 35, 41, 43, 58, and 61. Its center values are 41 and 43, resulting in a third quartile of 42.
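These hand calculations can be cross-checked with Python's standard library. The sketch below is only a sanity check under one assumption: `statistics.quantiles` (available in Python 3.8 and later) uses its default "exclusive" method, which happens to agree with the median-of-halves results for this data set. Other data sets or other methods may give slightly different quartiles.

```python
from statistics import quantiles

data = [6, 8, 9, 11, 16, 19, 23, 28, 35, 41, 43, 58, 61]

# Position of the median for an odd count: (n + 1) / 2, counted from 1
position = (len(data) + 1) // 2
median_value = data[position - 1]      # Python lists are indexed from 0

# Standard-library quartiles (default method is "exclusive")
q1, q2, q3 = quantiles(data, n=4)

print(position, median_value)  # 7 23
print(q1, q2, q3)              # 10.0 23.0 42.0
```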
We can now draw the box of our plot!
The leftmost edge of the box represents the first quartile, while the center line reflects the median, and the rightmost edge shows the third quartile. Because the left side of the divided box is much shorter than the right side, we know the data between the first quartile and the median is more densely distributed than the data between the median and the third quartile.
Next, our box needs some whiskers! Let's begin by identifying the minimum and maximum values of inliers. Returning to the formula above, we need to calculate the interquartile range (IQR). This is as simple as the third quartile minus the first; in our case, IQR = 42 - 10 = 32. Multiplying the IQR by 1.5 gives 48. Subtracting 48 from 10 (Q1) gives the minimum of our inlier range, and adding 48 to 42 (Q3) gives the maximum. The result is an inlier range of [-38, 90], and anything outside this range will be considered an outlier.
Because the minimum (6) and maximum (61) values in our data set both fall within our inlier range, they will mark the ends of our box plot whiskers. This produces the following final plot.
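The whisker arithmetic for this data set can be reproduced with a short Python sketch. The quartile values are taken from the worked example, and the variable names are assumptions made for illustration. The whisker ends are taken to be the smallest and largest values that fall inside the inlier range.

```python
data = [6, 8, 9, 11, 16, 19, 23, 28, 35, 41, 43, 58, 61]
q1, q3 = 10, 42                        # quartiles from the worked example

iqr = q3 - q1                          # interquartile range: 32
low = q1 - 1.5 * iqr                   # lower bound of the inlier range
high = q3 + 1.5 * iqr                  # upper bound of the inlier range

# Whiskers end at the most extreme values still inside the range
inliers = [v for v in data if low <= v <= high]
whisker_low, whisker_high = min(inliers), max(inliers)

print(low, high)                       # -38.0 90.0
print(whisker_low, whisker_high)       # 6 61
```

Since every value is an inlier here, the whiskers simply reach the true minimum and maximum of the data.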
Final Notes: If a data set contains outliers, they are drawn as individual dots beyond the whiskers. Box plots are also often drawn vertically so that different data sets can be compared side by side.
As always, thank you so much for reading! If you have any further questions about box plots, quartiles, outliers, or anything else mentioned in this post, please leave a comment down below. I look forward to furthering this discussion with you. Until next time, happy coding!