This post is my response to the question posed last week in the first part of this story, which is here: https://benchmarksixsigma.com/blog/how-to-find-if-a-data-set-is-genuine/. Please read part 1 of the story before reading this section.
As the responses to part 1 show, some readers assumed that the process had improved or changed and that the number of complaints had come down as a result. Ideally that is what should happen in a process, but it was not the case in our story. As mentioned in part 1, complaints were consistent over time at another location of the same company, and the process at the location being audited was much the same. So let us see how the situation unfolded once I knew that there had been no process improvement and no change in customers.
The data shown to me during that audit was data set B, which is shown here again. To recall the categorization logic from our story, it shows the number of complaints grouped by the first digit of their serial number.
Me: Mr. Dhanush, I can see that the complaints show a peculiar pattern in your data set, and I want to know why this has happened. To be more specific, the number of complaints whose serial numbers start with digit 1 is higher than the number starting with 2, and so on.
Dhanush (Auditee): Yes, you are right, we have also noticed that. It is a natural trend in numbers as normally expected.
Me: I am interested in understanding why you consider it a natural pattern.
Dhanush (confidently, with a smile, as if he had expected this question and was ready for it): I have discussed this with my boss, and he is also joining us here. There is something called Benford’s Law. This law shows that the distribution of first digits follows a declining pattern. The number of values with 1 as the first digit is always the biggest. Here are two examples (shows me something on his laptop):
1. The heights of the world’s tallest structures – digit 1 is the most common leading digit – http://en.wikipedia.org/wiki/List_of_tallest_buildings_and_structures_in_the_world
2. The populations of 237 countries follow a similar pattern – http://en.wikipedia.org/wiki/List_of_countries_by_population
Meanwhile, Manish joins us. He is Dhanush’s boss.
Me: Good that you joined us, Manish. We have been discussing why the complaints show a peculiar trend, and Dhanush has been showing me Benford’s Law. It is a very useful law and does apply to numbers in many places. Are you aware of this pattern?
Manish (laughs): Yes, we came to know about this law last week. And it is interesting to see that it applies to our data as well.
Me: I am afraid I do not see a reason why this law should apply to your data. I hope you agree that after deleting irrelevant complaints (which are likely to be randomly distributed), your data should also show a somewhat random pattern in the serial numbers of the remaining complaints. Is that right?
Manish: Yes, logically that should be it.
Me: Your complaint numbers start at 1 and go up to 998. Let us try this: enter the formula =RANDBETWEEN(1,998) in Excel and copy it into 110 cells to get a set of numbers. Here is what I got, and you are likely to get a similar (but not identical) pattern.
As you can see in this example, in a random pattern the number of serial numbers starting with 1 and the number starting with 9 are likely to be similar.
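For readers who would rather check this in code than in a spreadsheet, here is a minimal Python sketch of the same experiment. It is only an illustration: the function names and the seed are my own choices, and `random.randint` stands in for Excel's RANDBETWEEN. It first counts exactly how many of the serial numbers 1 to 998 start with each digit, then draws 110 random serial numbers and counts their first digits.

```python
import random
from collections import Counter


def leading_digit(n: int) -> int:
    """Return the first (most significant) digit of a positive integer."""
    return int(str(n)[0])


def first_digit_counts(numbers) -> dict:
    """Count how many of the given numbers start with each digit 1..9."""
    counts = Counter(leading_digit(n) for n in numbers)
    return {d: counts.get(d, 0) for d in range(1, 10)}


# Exact picture: every possible serial number from 1 to 998, grouped by first digit.
# Digits 1 to 8 each lead 111 numbers; digit 9 leads 110 (999 is not in the range).
exact = first_digit_counts(range(1, 999))

# One simulated draw of 110 serial numbers, mimicking =RANDBETWEEN(1,998) in 110 cells.
rng = random.Random(42)  # arbitrary seed, used only to make the run repeatable
sample = [rng.randint(1, 998) for _ in range(110)]
simulated = first_digit_counts(sample)

print("digit  exact  simulated")
for d in range(1, 10):
    print(f"{d:>5}  {exact[d]:>5}  {simulated[d]:>9}")
```

The simulated column will wobble from run to run, as the Excel sheet does, but the exact column shows why no digit should dominate: the nine digits share the 998 serial numbers almost equally.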
Manish: Oh, I see. I do understand this. We now know that Benford’s Law does not apply to our data set. But probably our data did not know this and followed it. (Suddenly bursts into laughter and pats Dhanush on his back)
I saw no option but to join the fun, and we all had a hearty laugh. Manish and Dhanush agreed that they had deleted many complaints and, while deleting, had wanted a method that would ensure the data did not look fudged. They had come across Benford’s Law and thought this knowledge would be handy, so they carefully retained complaint serial numbers that followed Benford’s Law. The complete data had actually been downloaded earlier and was still available, so we had a look at the whole data set. As things settled down, the conversation continued.
Manish: Why does the Law not apply to our data?
Me: It does not apply because your values are sequential serial numbers spread evenly between 1 and 998; they are not naturally occurring quantities distributed across multiple orders of magnitude. The law does apply to a large number of different data sets, such as electricity bills, stock prices, tax calculations, and death rates. It tends to become more accurate as the data spans more orders of magnitude. Here your numbers are spread evenly between 1 and 998, so the chance of each digit coming first is almost the same: of the 998 serial numbers, 111 start with each of the digits 1 to 8 and 110 start with 9. If you wish to simulate this more precisely in Excel, so that repeated numbers are avoided, please search for “random numbers in excel without repeats” on Google and you will find a more suitable method than the one I demonstrated.
Dhanush (with a big smile): I understand what you said. However, I do not want to get into finding another so-called “perfect” method of fudging for our next audit. Instead, we will address the complaints so that we are happy to show you a real declining trend.
Me: Well, that is what everyone wants. Cheers!
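To put numbers on the explanation given in the conversation: Benford's Law predicts that a leading digit d occurs with probability log10(1 + 1/d), which is about 30.1% for digit 1 and about 4.6% for digit 9. The Python sketch below compares that prediction with the share each digit actually gets among the serial numbers 1 to 998. It also uses `random.sample` as one possible way to draw serial numbers without repeats. The function names and the seed are my own illustrative choices, and this is not the method that was used during the audit.

```python
import math
import random


def benford_share(d: int) -> float:
    """Share of values with leading digit d predicted by Benford's Law."""
    return math.log10(1 + 1 / d)


def range_share(d: int, lo: int = 1, hi: int = 998) -> float:
    """Share of the integers lo..hi whose first digit is d."""
    hits = sum(1 for n in range(lo, hi + 1) if str(n)[0] == str(d))
    return hits / (hi - lo + 1)


print("digit  Benford  serials 1-998")
for d in range(1, 10):
    print(f"{d:>5}  {benford_share(d):7.3f}  {range_share(d):13.3f}")

# Drawing 110 serial numbers without repeats: random.sample never returns
# the same element of the range twice.
rng = random.Random(7)  # arbitrary seed, used only to make the run repeatable
unique_sample = rng.sample(range(1, 999), 110)
print(len(unique_sample), "distinct serial numbers drawn")
```

The Benford column falls steeply from digit 1 to digit 9, while the serial-number column stays nearly flat at roughly 11% per digit. A set of retained complaints that tracks the first column rather than the second is therefore a sign of selection, not of chance.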
NOTE – I appreciate everyone who participated in the comments on part 1 of the story, especially those who shared in-depth analysis. Special appreciation goes to Saurabh, who read through the other comments, correctly pointed out that Benford’s Law does not apply here, and gave sound reasoning for it.