Everything posted by RANSINGH SATYAJIT RAY
-
Class Imbalance
RANSINGH SATYAJIT RAY replied to Vishwadeep Khatri's topic in We ask and you answer! The best answer wins!

Overview: Class imbalance is one of the challenges in classification modelling: the dominant class drives the model's output, so the result is little more than a generalization of the majority. Say 990 customers are not churning and 10 are churning. A classification model built to determine whether someone churns will show high accuracy (99%) even if it predicts every churning customer as not churning, because the not-churning class dominates the score. This kind of problem often deceives the model developer into the false hope that the model is great when accuracy and error rate are the only performance criteria. Before getting into the problem of class imbalance, let's understand what a classification model is, what a confusion matrix is, and what class imbalance is.

Classification Model: Classification modelling refers to developing a predictive model whose output is a categorical/discrete feature, predicted from independent features (categorical and/or continuous). The model is learned from historical records of the independent features together with their known categorical output. E.g.
1. Product pass or fail, based on independent features such as product dimensions, material properties, shift information and workforce profile.
2. Employee churn or not, based on employee education profile, manager profile and employee personal profile.
3. Loan application accepted or rejected, based on the applicant's professional profile, etc.

Performance measurement with Confusion Matrix: We can think of the confusion matrix for a classification model as analogous to the result table of a hypothesis test.
Hypothesis Testing -> As per Hypothesis
                          Ho (Not Guilty)     Ha (Guilty)
Actual   Not Guilty       Right Decision      Type 1 Error
         Guilty           Type 2 Error        Right Decision

Confusion Matrix -> As per Model
                          Not Guilty (P)                            Guilty (N)
Actual   Not Guilty (T)   TP (Truly Determined as Not Guilty)       FN (Falsely Determined as Guilty)
         Guilty (F)       FP (Falsely Determined as Not Guilty)     TN (Truly Determined as Guilty)

The confusion matrix is so called because it shows where the model confuses one class with another, that is, where the actual output doesn't match the predicted output. The confusion matrix gives us a number of model performance indicators, such as
1. Sensitivity or Recall
2. Specificity
3. F-1 Score
4. Accuracy
5. Error Rate
6. AUC
Classification model performance is judged using any of these indicators, depending on the use case we are working on.

Class Imbalance: As the name suggests, it is the disproportion between the sizes of the classes in a classification problem. It becomes a challenge when the majority class has significantly more records than the minority class, leading to the false hope that model performance is high in terms of accuracy and error rate even when the model predicts the minority class wrongly. (Say the majority class is people not churning and the minority class is people churning. Accuracy is high because the people not churning are detected correctly, even when our classification model wrongly predicts the minority class, the people churning, as not churning.) We can think of this as a sort of generalization in which the model simply memorizes the output of the majority class. This is often a challenge in
1. Fraud detection
2. Anomaly detection
3. Security threat detection
4. Medical diagnosis

How to know when we are falling into the trap of class imbalance? Firstly, there are certain kinds of complex models (algorithms) that are prone to overfitting or underfitting.
*Overfitting: The model performs well on training data (the data on which we build the model) and performs terribly on testing data (the data on which we test the model built on the training data).
*Underfitting: The model performs poorly on both training and test data.

We usually experiment with the model to check for this, for example by
1. Experimenting with various model performance measures
2. Conducting SME interviews

How to overcome this? Any one of the following, or a combination of them, can be used to mitigate the class imbalance problem:
1. Collecting more data.
2. Choosing performance measures other than accuracy and error rate, such as Precision, Sensitivity/Recall, Specificity, ROC curves, AUC and F-1.
3. Creating samples synthetically (artificially creating data based on the data distribution pattern of the minority class).
4. Applying resampling techniques, wherein we take many sub-samples from all the records and try to develop a sort of consensus between them.
5. Choosing a different mathematical model altogether.
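
To make the 990/10 churn example concrete, here is a minimal sketch in plain Python. It assumes, for illustration only, that "churning" is treated as the positive class (the minority) and that the model predicts "not churning" for everyone. The function names `classification_metrics` and `random_oversample` are hypothetical helpers, and random oversampling is just one simple resampling technique swapped in for illustration, not necessarily the one meant above.

```python
import random

def classification_metrics(tp, fn, fp, tn):
    """Compute common indicators from confusion-matrix counts."""
    total = tp + fn + fp + tn
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }

def random_oversample(records, label_of, seed=0):
    """Duplicate minority-class records at random until all classes
    are as large as the biggest class (random oversampling)."""
    rng = random.Random(seed)
    by_class = {}
    for record in records:
        by_class.setdefault(label_of(record), []).append(record)
    target = max(len(rows) for rows in by_class.values())
    balanced = []
    for rows in by_class.values():
        balanced.extend(rows)
        balanced.extend(rng.choices(rows, k=target - len(rows)))
    rng.shuffle(balanced)
    return balanced

# Assumed scenario: 990 non-churners, 10 churners, and a model that
# predicts "not churning" for every customer (positive class = churn).
baseline = classification_metrics(tp=0, fn=10, fp=0, tn=990)
for name, value in baseline.items():
    print(f"{name}: {value:.3f}")

# Hypothetical records: (customer_id, label)
records = [(i, "stay") for i in range(990)] + [(i, "churn") for i in range(990, 1000)]
balanced = random_oversample(records, label_of=lambda r: r[1])
print(sum(1 for r in balanced if r[1] == "churn"), "churn rows after oversampling")
```

Accuracy comes out high while recall and F-1 for the churn class are zero, which is exactly the trap described above. Oversampling should be applied to the training data only, so that the test data still reflects the real class proportions.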
-
Power of a Test
In statistical tests of hypothesis, we usually encounter the p value and the alpha value, but why is the power of the test, the beta value, or any term related to it not reported? Isn't the beta value equally as important as the alpha value? Why does Type 1 error have the edge over Type 2 error?
-
Process FMEA and DMAIC
RANSINGH SATYAJIT RAY replied to Vishwadeep Khatri's topic in We ask and you answer! The best answer wins!

The Process FMEA is used not only to minimise the risks associated with a process but also to define them wherever possible. It helps in identifying the known or potential failure modes and provides follow-up and corrective action before the first production run. The Process FMEA supports Lean Six Sigma DMAIC in reducing process variation and waste generation in the following ways:
1. Reduces product development time and costs.
2. Helps reduce redundancies in the process.
3. Helps identify the significant characteristics.
4. Helps identify the sequence of tasks that come into play in a process.
5. Helps identify errors.
6. Helps define the corrective action.
7. Helps in knowing the magnitude of failures and their effect.
-
Sigma Level, Z score
RANSINGH SATYAJIT RAY replied to Vishwadeep Khatri's topic in We ask and you answer! The best answer wins!

While managing quality-driven processes and parameters, the question always arises of whether the variables affecting the product or service meet the requirement. Several measures are used for this: Cp and Cpk (when the data is fairly normal), Ppk (when the data is non-normal, after a suitable transformation, on stable data with no out-of-control points), Z Scores, DPMO, etc. Many of these indices (Cp, Cpk and Ppk) need a good amount of data (in terms of number of data points per sample) to give a valid result. On the other hand, many organisations run short production batches to stay responsive to customer needs and to specialise the product for future demand, and they still need a capability measure to meet their quality requirements. The Z Score comes in handy, giving the status of process performance in both cases.

Z Score | LSL = (Individual value (or Mean) - LSL) / Process Standard Deviation
Z Score | USL = (USL - Individual value (or Mean)) / Process Standard Deviation

By comparing the Z Score with the critical value (at a given alpha level) we check the status of performance.
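
The two Z Score formulas can be sketched in Python as below. This is only an illustration: the measurements and specification limits are made-up values, the helper names are hypothetical, the sample standard deviation stands in for the process standard deviation, and the out-of-spec estimate assumes the data is roughly normal.

```python
from statistics import NormalDist, mean, stdev

def z_scores(values, lsl, usl):
    """Return (Z for LSL, Z for USL) using the sample mean and the
    sample standard deviation as the process standard deviation."""
    mu = mean(values)
    sigma = stdev(values)
    z_lsl = (mu - lsl) / sigma
    z_usl = (usl - mu) / sigma
    return z_lsl, z_usl

def expected_fraction_out_of_spec(z_lsl, z_usl):
    """Estimated share of output beyond either limit, assuming normality."""
    standard_normal = NormalDist()
    below_lsl = standard_normal.cdf(-z_lsl)
    above_usl = standard_normal.cdf(-z_usl)
    return below_lsl + above_usl

# Hypothetical short-run measurements and spec limits
measurements = [9.8, 10.0, 10.2]
z_lsl, z_usl = z_scores(measurements, lsl=9.4, usl=10.6)
fraction_out = expected_fraction_out_of_spec(z_lsl, z_usl)

print(f"Z(LSL) = {z_lsl:.2f}, Z(USL) = {z_usl:.2f}")
print(f"Estimated fraction out of spec = {fraction_out:.4%}")
```

With so few data points the standard deviation estimate is itself very uncertain, so a result like this is better read as a rough status check than as a firm capability figure.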
-
Root Cause
RANSINGH SATYAJIT RAY replied to Vishwadeep Khatri's topic in We ask and you answer! The best answer wins!

Root causes are the causes beyond which we can't go. The search for a root cause ends only when we arrive at a change in system, process, technology, training or policy. Hence the root cause refers to a cause of the following kind:
a. technology is missing,
b. policy doesn't say anything at present,
c. the training process has not been initiated for something,
d. changes/tweaks in the system are made.

A few examples where causes are root causes:
1. (Changes/tweaks in the system are made) We can treat causes as root causes when we perform controlled experiments with an experimental group and a control group, where only the experimental group is treated with the factor of interest so that its impact can be observed. If any change is seen in the experimental group, it can be directly inferred that the root cause of that change is the change in the factor of interest. Some of the fields where this is used are chemical processing, pharmacology and aeronautics.
-
Calculation of OEE based on MTTR, MTTF, MTBF
Is there any method to calculate OEE (Overall Equipment Effectiveness) from MTTR (Mean Time to Repair), MTTF (Mean Time to Failure) and MTBF (Mean Time Between Failures) data for equipment used in non-manufacturing utility industries?
-
Is reliability and maintainability linked with Six Sigma.
Thanks VK. One question: should all equipment life data generally follow the Weibull distribution, or is it confined to specific products? Is there any example of its application?
-
Is reliability and maintainability linked with Six Sigma.
Are reliability and maintainability linked with Six Sigma? Is there any significant application of the Weibull distribution?
-
CENTRAL LIMIT THEOREM
The central limit theorem (CLT) establishes that, in some situations, when independent random variables are added, their properly normalised sum tends toward a normal distribution even if the original variables themselves are not normally distributed. The question is how this can be proven, as it looks very intimidating at times. Why is 30 considered the minimum sample size in some forms of statistical analysis? Is there any rationale for this?
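
Short of a formal proof, one way to get a feel for the statement is a small simulation. The sketch below is only an illustration and not a proof: it assumes an exponential distribution with mean 1 as the non-normal source, uses n = 30 purely because that is the figure asked about, and the helper names are hypothetical. It compares the skewness of raw draws with the skewness of sample means.

```python
import random
from statistics import mean, pstdev

def skewness(values):
    """Population skewness: third central moment / standard deviation cubed."""
    mu = mean(values)
    sd = pstdev(values)
    return sum((v - mu) ** 3 for v in values) / (len(values) * sd ** 3)

def simulate_sample_means(n, repetitions, seed=0):
    """Draw `repetitions` samples of size n from an exponential
    distribution (mean 1) and return the mean of each sample."""
    rng = random.Random(seed)
    means = []
    for _ in range(repetitions):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        means.append(sum(sample) / n)
    return means

raw_draws = simulate_sample_means(n=1, repetitions=5000)      # single observations
means_of_30 = simulate_sample_means(n=30, repetitions=5000)   # means of 30 observations

print(f"skewness of raw draws:   {skewness(raw_draws):.2f}")
print(f"skewness of means, n=30: {skewness(means_of_30):.2f}")
print(f"mean of sample means:    {mean(means_of_30):.3f}")
print(f"std dev of sample means: {pstdev(means_of_30):.3f}  (theory: {1 / 30 ** 0.5:.3f})")
```

The raw draws are strongly right-skewed, while the means of 30 observations are much closer to symmetric and their spread is close to 1/sqrt(30). How large n must be in practice depends on how far the source distribution is from normal.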
-
Skewness and Kurtosis
RANSINGH SATYAJIT RAY replied to Vishwadeep Khatri's topic in We ask and you answer! The best answer wins!

Given the skewness and kurtosis, we can predict the shape of a probability distribution.

Skewness: The lack of symmetry in a probability distribution is called skewness. A distribution is positively skewed when it has a long tail to the right, and negatively skewed when it has a long tail to the left. It is also useful to know that, when we compare the mean and median of the dataset (for example alongside a box plot), a mean greater than the median usually indicates positive skew, and a mean less than the median usually indicates negative skew.

Kurtosis: The sharpness of the peak (and heaviness of the tails) of a probability distribution is referred to as kurtosis. Flatter curves are platykurtic (-ve kurtosis) and sharper curves are leptokurtic (+ve kurtosis), where the sign is relative to the normal distribution.
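
As an illustration, skewness and kurtosis can be computed from the central moments of the data. The sketch below uses the simple population-moment formulas (other estimators with small-sample corrections exist), reports kurtosis as excess kurtosis so that a normal distribution scores 0, and runs on made-up datasets; the function names are hypothetical.

```python
from statistics import mean, median, pstdev

def skewness(values):
    """Third central moment divided by the standard deviation cubed."""
    mu = mean(values)
    sd = pstdev(values)
    return sum((v - mu) ** 3 for v in values) / (len(values) * sd ** 3)

def excess_kurtosis(values):
    """Fourth central moment divided by the variance squared, minus 3
    so that a normal distribution scores 0."""
    mu = mean(values)
    sd = pstdev(values)
    return sum((v - mu) ** 4 for v in values) / (len(values) * sd ** 4) - 3

# Hypothetical data with a long right tail (one large value)
right_tailed = [1, 2, 2, 3, 3, 3, 4, 4, 20]
# Hypothetical flat (uniform-like) data
flat = list(range(1, 11))

print(f"right-tailed: mean={mean(right_tailed):.2f}, median={median(right_tailed)}, "
      f"skewness={skewness(right_tailed):.2f}")
print(f"flat: excess kurtosis={excess_kurtosis(flat):.2f}")
```

The right-tailed data has a mean above its median and a positive skewness, and the flat data has a negative excess kurtosis, which matches the platykurtic case.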