Visuals make it easy to review and summarize data. This is precisely why we prefer a graphical summary to reading through data in multiple rows or columns.
What is a Box Plot?
The most commonly used graphical summary is a frequency distribution plot such as a histogram (for continuous data). The same data can also be plotted as a box plot, which is just another way of looking at the histogram: in effect, a Box Plot is a top view of the histogram. I took the annual rainfall data for Andaman and Nicobar (from the GoI website), and below is the graphical summary from Minitab.
If you notice, the same data is represented in a histogram and a box plot.
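To make the link concrete, here is a minimal Python sketch of the five numbers a box plot draws: minimum, first quartile, median, third quartile and maximum. The rainfall values are hypothetical and are not the GoI figures, and `five_number_summary` is just an illustrative helper name. Quartile conventions differ between tools, so the values Minitab reports may differ slightly.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # quantiles() with n=4 returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

# Hypothetical annual rainfall values in mm (NOT the GoI data).
rainfall_mm = [2400, 2650, 2800, 2900, 3000, 3100, 3250, 3400, 3600]

low, q1, median, q3, high = five_number_summary(rainfall_mm)
print("Box spans", q1, "to", q3, "with the median line at", median)
print("Data range is", low, "to", high)
```

The box covers the middle half of the data (Q1 to Q3), which is the same region where the histogram bars are tallest for a single-peaked distribution.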
Even though both graphs represent the same data, they show it in different ways. I have tried to summarize the differences below:
In addition to the insights captured in the above table, a Box Plot can also be used in the following scenarios:
1. Compare data sets for the same metric (I have provided an example below), even outside a formal project
2. Used to identify the problem in Define phase (too much spread or process shifted to one side)
3. Used to baseline the process performance in Measure phase
4. Used to graphically compare performance of two or more sub-groups (units, departments, centers, shifts etc.) in Analyze phase
5. Used to confirm the improvement in the Improve phase (spread will reduce or process is more centered)
6. Check for presence of outliers in data to ensure process control in Control phase
In the example below, I considered the annual rainfall data for 6 regions (from the GoI website).
Observations from the box plot
1. Clearly identifies the regions which get higher rainfall as compared to the others. A&N receives the maximum annual rainfall while Rajasthan West receives the lowest
2. Rainfall in Rajasthan, Delhi, Orissa and UP West (if I ignore the slightly elongated whisker) is almost equally distributed across the range, while it is skewed in A&N (left skewed) and Nagaland (right skewed)
3. The variation in rainfall is the least in Delhi and Rajasthan West while it is the highest in A&N and Nagaland (given that the length of the box is highest for them)
4. There are no outliers in the data set
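The comparisons above come down to two numbers per region: the median (where the box sits) and the interquartile range (how long the box is). A rough sketch in Python, using made-up values for two hypothetical regions rather than the GoI data:

```python
import statistics

def box_stats(values):
    """Return the median and interquartile range (box length) of a data set."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"median": median, "iqr": q3 - q1}

# Hypothetical annual rainfall in mm for two illustrative regions (NOT the GoI data).
groups = {
    "Region A": [2400, 2650, 2800, 2900, 3000, 3100, 3250, 3400, 3600],
    "Region B": [200, 230, 250, 260, 280, 300, 310, 330, 350],
}

stats = {name: box_stats(values) for name, values in groups.items()}

# Highest box position = highest typical rainfall; longest box = most variation.
wettest = max(stats, key=lambda name: stats[name]["median"])
most_variable = max(stats, key=lambda name: stats[name]["iqr"])
print("Highest median:", wettest)
print("Longest box (largest IQR):", most_variable)
```

A side-by-side box plot lets you read off both comparisons at a glance, without calculating anything.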
Just for illustration, I added another year's data (hypothetically a drought year). Below is how the box plot changes.
Now the box plot shows an outlier (star mark) for all states except Rajasthan West (I had entered a value of 0, but it was still not flagged as an outlier). These star marks indicate a value that is very different from the other values in the data set, in other words an outlier. The Box Plot identifies it and gives us a chance to investigate and do an RCA to find out the reason (remember, I had entered data for a hypothetical drought year, where rainfall would be very low).
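One plausible explanation for the Rajasthan West result, assuming the plot uses the common 1.5 x IQR rule (Tukey's fences) to flag outliers: a point is marked only if it falls more than 1.5 box lengths beyond the box. For a region whose rainfall is already low and fairly spread out, the lower fence can fall below zero, so even a value of 0 stays inside it. The sketch below uses hypothetical numbers, not the GoI data, and `tukey_outliers` is an illustrative helper, not a Minitab function.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical high-rainfall region plus a drought year of 500 mm.
wet_region = [2400, 2650, 2800, 2900, 3000, 3100, 3250, 3400, 3600, 500]

# Hypothetical low-rainfall region plus a drought year of 0 mm.
dry_region = [50, 120, 180, 250, 300, 380, 450, 520, 600, 0]

print("Wet region outliers:", tukey_outliers(wet_region))
print("Dry region outliers:", tukey_outliers(dry_region))
```

In this made-up data the drought year is flagged for the wet region but not for the dry one, because the dry region's lower fence is negative.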
I guess the limitations of Gauss' Normal Distribution Plot and Karl Pearson's Histogram led John Tukey to identify and start using a Box Plot :)