Benchmark Six Sigma Forum

Message added by Mayank Gupta,

Data dredging (also known as significance chasing, p-hacking or data fishing) is the malpractice of running data analyses until a statistically significant result turns up, even though no real effect exists, in order to support the intuition or instincts of the researcher. Data dredging leads one to accept false positives as genuine findings.

 

An application-oriented question on the topic along with responses can be seen below. The best answer was provided by Sanchita Roy on 04th Feb 2022.

 

Applause for all the respondents - Hirak Raval, Johanan Collins, Vijay Krishnan, Sanchita Roy.

Featured Replies

Q 442. In a Lean Six Sigma project, in order to prove their intuitions it is common for a project leader to go for Data Dredging. What is Data Dredging and its harmful effects on data analysis? Mention some of the ways in which it can be avoided.

 

Note for website visitors - Two questions are asked every week on this platform. One on Tuesday and the other on Friday.

Solved by sanchitar17

Data dredging, also known as data fishing or selective inference, is used to make the data agree with an intuition. It is a strongly discouraged practice in which project leaders try to prove a solution they have already decided on by generating results more or less at random with different tools in Excel and Minitab. It increases the risk of false and misleading conclusions about the problem and often does not help in identifying sustainable solutions. If a project leader dredges data, there is a high chance of introducing bias and predetermined decision making. Project leaders generally use it to prove correlations between two or more variables, or to show that an alternative hypothesis is the correct one for ease of execution. Below are the ways to avoid data dredging:

 

1. Look for patterns in the data and compare them with the standard distribution expected for the phenomenon, e.g. normal, triangular, log-normal or uniform.

2. Divide the data into random subgroups and check their distribution patterns; each subgroup should follow the same distribution as the parent data.
3. When running multiple hypothesis tests, adjust for the number of tests with a method such as the Bonferroni correction.
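
As a minimal sketch of the Bonferroni adjustment named in point 3: when m hypotheses are tested on the same data, each p-value is compared against alpha / m instead of alpha. The function name and the p-values below are hypothetical, chosen only for illustration.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return True/False per test after a Bonferroni adjustment."""
    m = len(p_values)
    threshold = alpha / m  # stricter cut-off shared by all m tests
    return [p <= threshold for p in p_values]

# Hypothetical p-values from four tests run on the same project data
p_values = [0.04, 0.03, 0.001, 0.20]
flags = bonferroni_significant(p_values)
print(flags)  # [False, False, True, False]
```

Three of the four results would pass the unadjusted 0.05 cut-off, but only one survives the adjusted threshold of 0.0125.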

Data Dredging

Data dredging is the abuse of data analysis to detect patterns in data that can be presented as statistically significant when no real effect exists. Searching for patterns in this way increases the number of false positives. It can be done by running many statistical tests, arbitrarily changing the model parameters, and cherry-picking the findings that come out statistically significant. It is also called data butchery, snooping, fishing, significance questing or chasing, selective inference, or p-hacking.
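
To see why running many tests and keeping only the significant ones inflates false positives, here is a small simulation sketch. It assumes the tests are independent and that no real effect exists, in which case each p-value can be drawn uniformly from [0, 1]. The function name and the choice of 20 tests per study are illustrative assumptions.

```python
import random

def share_with_false_positive(n_tests=20, alpha=0.05, n_studies=10_000, seed=1):
    """Fraction of null 'studies' in which at least one test looks significant."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Assumption: under a true null hypothesis a p-value is uniform on [0, 1]
        p_values = [rng.random() for _ in range(n_tests)]
        if min(p_values) < alpha:
            hits += 1  # a dredger would report this study as a "finding"
    return hits / n_studies

simulated = share_with_false_positive()
expected = 1 - (1 - 0.05) ** 20  # chance of at least one false positive in 20 tests
print(round(expected, 2))  # 0.64
```

With 20 tests at a 0.05 threshold, roughly two out of three such studies would show at least one "significant" result by chance alone, and the simulated share should land close to that figure.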

simplicable.com describes data dredging as similar to data mining, with the key difference that data dredging is an automatic statistical search that has no hypothesis and just looks for patterns in data, whereas data mining starts off with a hypothesis (what you are expecting to find in the data). With large data sets and more computing power, the probability of finding correlational patterns in data increases. Data dredging therefore fits a pattern to the data rather than confirming whether an expected pattern actually exists in the data. It makes it easier to write an academic paper that is essentially auto-generated but not valid.

Harmful Effects

It is unethical and misleads other researchers and has a negative effect on the Body of Knowledge.

It increases the number of false positives leading to wrong decisions being made based on the results

It could lead to the retraction of previously published articles.

It increases the bias in the study and decreases the range of probability

It could lead to sub-optimal utilization/wastage of resources such as funding and researchers.

How to prevent it?

Dispel the belief among academicians that results that are not statistically significant are unimportant.

Publish or perish. The practice of publish or perish, which puts researchers under pressure to publish research articles regularly in order to keep their positions, needs to be reviewed by the academic body.

Take strict disciplinary action against researchers who engage in data dredging and other such practices.

Institutional Review Board (IRB) approval. Make IRB approval mandatory before the researcher begins the research.

Make it mandatory for researchers to take courses such as “Responsible Conduct of Research” before engaging in research.

Peer-reviewed journals vs predatory journals. Discourage researchers from publishing in predatory journals.

References

https://simplicable.com/new/data-dredging

https://s4be.cochrane.org/blog/2021/06/25/what-is-data-dredging/

https://en.wikipedia.org/wiki/Data_dredging

What is Data Dredging?

 

Data dredging, p-hacking or data fishing is an unethical data mining practice in which large amounts of data are analysed in search of any possible relationship between variables. It can be described as cherry-picking promising findings, which leads to a spurious excess of statistically significant results in published or unpublished literature.

Unlike the scientific method, where we begin with a hypothesis and then examine the data, data dredging is often used to reach a premature conclusion that supports the intuition or ill intent of the analyst.

Data dredging sometimes results in a correlation between variables being declared significant even though the data require further study before such an association can legitimately be established.

For example, of the roughly 300,000 articles published during the coronavirus pandemic, more than 50% lack authenticity: “cut corners, many of them were very hastily done, many of them were very unreliable with hugely exaggerated results, hugely wrong results sometimes”, as put forth by Prof. John Ioannidis, Professor of Medicine at the Stanford Prevention Research Center.

 

[Image: xkcd comic “Significant”; see the image credit in the References below]

 

Over-reliance on the p-value

A p-value of less than or equal to 0.05 is conventionally treated as statistically significant and is read as evidence against the null hypothesis. To obtain a significant result, investigators often pick the data that suit their desired conclusion.

The harmful effects of data dredging are:

  • Increases the number of false positives

  • Increases the bias of the study

  • Misleads other investigators

  • Decreases the range of probability

  • Wastes resources

  • May lead to retraction of publications and stopping of funding

  • Reduces the trust of the general public in scientific studies

How to prevent data dredging?

- Following strict guidelines for conducting studies

- Relying on other methods such as confidence intervals, decision-theoretic modelling, likelihood ratios and Bayes factors rather than on the p-value alone

- Using AI and ML methods to detect the possible use of data dredging techniques

 

References

https://www.bihealth.org/en/notices/how-did-the-coronavirus-influence-research

https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124

https://s4be.cochrane.org/blog/2021/06/25/what-is-data-dredging/

https://imgs.xkcd.com/comics/significant.png under Creative Commons License 2.5 Randall Munroe. xkcd.com (Image Courtesy)

https://s4be.cochrane.org/blog/2015/07/28/data-mining-data-dredging/


As LSS professionals we often find ourselves in tight situations, completing deliverables under pressure. Because of several external factors around us, there is a chance that we end up data dredging, also referred to as "data fishing", which means analyzing data in such a manner that possible relationships in the data are somehow demonstrated. The effects are harmful because this defeats the purpose of true hypothesis testing. Other terms for data dredging include “p-hacking”, “data snooping” and “fishing trip”. For instance, suppose we want to prove a hypothesis in a pre-project versus post-project improvement analysis, but the data does not support it, so we use a “cherry-picked” sample that helps prove the improvement statistically. This is data dredging!

In the LSS world, a factor is not seen as having “value” unless it is statistically significant, and trying to prove a hypothesis quickly with a statistical test may end in data dredging. Sometimes this is unintentional; more often it is a move made to close the case with some bias. It is easy to access a large data set and run analyses that throw up various relationships at random. Sometimes data dredging may surface an accidental correlation that otherwise would not have been identified. However, in our endeavor to research and analyze, it is important to recognize a valid relationship and focus on an unbiased data set to arrive at accurate conclusions.

The end results can be harmful in many ways:

  • A hypothesis is shown to be statistically significant but may later be proved ‘not significant’

  • Solutions are framed around a “so-called” significant cause that may not help resolve the issue, making it a questionable move later

  • The time and effort spent would be wasted, which is anti-LSS (Lean says reduce waste!)

  • The credibility of the professional may go down if this is practiced frequently, and it may put the entire organization in a bad spot

 

We can avoid data dredging by adopting practices like:

  • Ensure the data set is sufficient, relevant and representative, not a mere “subset”

  • Negotiate for the time and effort the analysis requires rather than performing under pressure; if we must turn something around quickly, do so with a caution statement and do not conclude too soon

  • Make the data capture process accurate, robust and exhaustive

  • Question the extreme values

  • Balance the “data door” and “process door” approaches in the project so that all possibilities are explored and data/information is better presented for operational consumption without getting stuck in hypothesis testing

  • Keep it simple, and use business sense to justify causation once we see correlation

A scenario: Towards the end of the project, the project lead shared the following data with the mentor for review:

Pre-Project (AHT in mins): 20
Post-Project (AHT in mins): 13

A better view for the project mentor, to mitigate data dredging and assess sustained performance, would be the table below:

Pre-Project (AHT in mins):  20, 16, 23, 22, 18, 24, 17
Post-Project (AHT in mins): 12, 16, 13, 12, 14, 13, 11
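
One way to check whether the drop in AHT could just reflect how the readings happen to be grouped, without picking a convenient sample, is an exact permutation test on all the values in the table. The sketch below is one possible check under the assumption that the seven readings per period are independent; the function name is hypothetical.

```python
from itertools import combinations

pre = [20, 16, 23, 22, 18, 24, 17]    # Pre-project AHT in mins
post = [12, 16, 13, 12, 14, 13, 11]   # Post-project AHT in mins

def permutation_test(a, b):
    """Exact two-sided permutation test on the difference in means."""
    pooled = a + b
    total = sum(pooled)
    n_a, n_b = len(a), len(b)
    observed = abs(sum(a) / n_a - sum(b) / n_b)
    extreme = 0
    splits = 0
    # Try every way of relabelling n_a of the pooled readings as "pre"
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        splits += 1
        if diff >= observed - 1e-12:  # small tolerance for float comparison
            extreme += 1
    return observed, extreme, splits

observed, extreme, splits = permutation_test(pre, post)
p_value = extreme / splits
print(observed, extreme, splits, round(p_value, 4))  # 7.0 4 3432 0.0012
```

Only 4 of the 3,432 possible relabelings give a gap as large as the observed 7 minutes, so the improvement is unlikely to be an artifact of how the readings were grouped.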

This response is drafted based on how data dredging typically shows up in business process outsourcing.

Interesting answers to an interesting question. The best answer has been provided by Sanchita Roy.
