Rupinder N

Number of samples for Regression Analysis

Sample Size

 

Sample size is the number of observations (data points or objects) in a sample. An adequate sample size is a key element in hypothesis testing, because it determines whether inferences can be made about the population. In practice, the right sample size depends on the cost and time involved in data collection and on the need for statistical significance. Statistically, sample size is affected by the following parameters:
a. Significance level (α), the maximum allowed probability of committing a Type I error
b. Power of the test (1-β), where β is the maximum allowed probability of committing a Type II error
c. Minimum difference (in the test statistic) to be detected

 

Regression Analysis

 

Regression Analysis is a statistical tool that describes the relationship between a response variable and one or more predictor variables. It uses data on the relevant variables to develop a prediction equation, or model, which describes that statistical relationship and can be used to predict new observations.

 

An application-oriented question on the topic, along with responses, can be seen below. The best answer was provided by Natwar Lal on 21st June 2019.

 

Applause for all the respondents: Natwar Lal, Sandra Thomas, Chris Marince, Kevin Naya, Sridhar Narayanam.

 

Also review the answer provided by Mr Venugopal R, Benchmark Six Sigma's in-house expert.

Question

Q. 169  How can you check if you have taken enough samples for carrying out a Regression Analysis?

 

Note for website visitors - Two questions are asked every week on this platform. One on Tuesday and the other on Friday.

 

 


7 answers to this question


Sample size for Regression Analysis. 

 

What is Sample Size?

Since we cannot work with population data (due to constraints of time and money), we always prefer to work with sample data. Therefore, it becomes important to know how many data points (or sample size) are required in the sample. Usually the sample size determination is dependent on the following parameters

1. Significance Level or alpha

2. Power of the test or (1-Beta)

3. Effect or the difference to be detected

 

The smaller the alpha, the higher the power of the test, and the smaller the effect that needs to be detected, the larger the sample size required.
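To see how these three parameters drive the sample size, here is a minimal sketch using the textbook normal-approximation formula for comparing two group means. The two-sample setting, the function name `n_per_group`, and the standardized effect size (difference divided by standard deviation) are assumptions chosen for illustration; they are not taken from the answer above.

```python
import math
from statistics import NormalDist


def n_per_group(alpha, power, effect_size):
    """Approximate sample size per group for a two-sided, two-sample
    comparison of means (normal approximation).

    effect_size is the difference to detect divided by the standard
    deviation (an assumption made for this illustration).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = z.inv_cdf(power)           # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


# Each change below increases the required sample size.
baseline = n_per_group(alpha=0.05, power=0.80, effect_size=0.5)
smaller_alpha = n_per_group(alpha=0.01, power=0.80, effect_size=0.5)
higher_power = n_per_group(alpha=0.05, power=0.90, effect_size=0.5)
smaller_effect = n_per_group(alpha=0.05, power=0.80, effect_size=0.2)
print(baseline, smaller_alpha, higher_power, smaller_effect)
```

An exact calculation based on the t distribution gives slightly larger numbers, so treat these values as a lower bound.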


Sample size for Regression Analysis depends on the following (in addition to the parameters already listed for sample size selection above and hence starting with number 4 below)

 

4. Type of Regression being done (Linear, Multiple, Ordinal etc.)

5. Purpose of Regression - 

a. Determine the effectiveness of the model (looking at R-square value)

b. Determine the statistically important predictors (or determining the Beta values for each predictor)

6. Level of correlation between the predictors

 

For point 4, the simpler the regression, the smaller the sample size required. Hence, a simple linear regression (one predictor) needs a smaller sample than a multiple regression.

For point 5, if the purpose is only to check the fit of the model, a smaller sample size suffices than when determining the significant factors from all the potential ones.

For point 6, the higher the correlation between predictors, the larger the sample size required (applicable only if there are multiple predictors).

 

Now that we know the factors affecting sample size for regression, how should we check if we have the required number of samples for doing regression? The best way is to follow the theory behind sampling: the larger the sample, the better :) But this raises another question: what sample size is sufficiently large? There are a few empirical formulae that can help here. I am listing a few of them below.

 

1. One common rule of thumb and the most famous one is that sample size should be 10 times the number of predictors. So if you have 4 predictors, you should have a minimum of 40 samples for running regression

2. As suggested by Green (1991)

a. Sample size = 50+8*k, k --> number of predictors; applicable if we are doing regression for point 5a

b. Sample size = 104 + k, k--> number of predictors; applicable if we are doing regression for point 5b
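The rules of thumb listed above are simple enough to check in a few lines. This is only a sketch; the function and key names are made up for illustration.

```python
def rule_of_thumb_sizes(k):
    """Minimum sample sizes for k predictors under the rules listed above."""
    return {
        "ten_per_predictor": 10 * k,             # rule 1
        "green_overall_fit": 50 + 8 * k,         # rule 2a (purpose 5a)
        "green_individual_predictors": 104 + k,  # rule 2b (purpose 5b)
    }


def meets_rules(n, k):
    """For each rule, report whether n samples are enough for k predictors."""
    return {name: n >= needed for name, needed in rule_of_thumb_sizes(k).items()}


print(rule_of_thumb_sizes(4))
print(meets_rules(n=90, k=4))
```

With 4 predictors and 90 samples, the first two rules are satisfied but the 104 + k rule is not, so the data would be adequate for judging overall fit but thin for judging individual predictors.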

 

There are some more depending on the kind of regression (ordinal, log etc.) that you plan to run.

 

Sometimes, it is difficult to have answers to the 6 parameters before one decides the sample size. A more practical approach is to work backwards i.e. since we know the number of samples or we know how many we could collect, we could always do the Power Analysis (given the other factors are kept constant or pre-decided). 
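As a rough illustration of working backwards from a known sample size, the sketch below approximates the power of a simple linear regression (one predictor), where testing the slope is equivalent to testing the correlation r. It uses the Fisher z approximation. The function name and the example values are assumptions. For multiple regression, a power calculation based on the noncentral F distribution would be needed, which statistical software provides.

```python
import math
from statistics import NormalDist


def approx_power_simple_regression(n, r, alpha=0.05):
    """Approximate power of the two-sided test that the correlation
    (equivalently, the slope in a one-predictor regression) is zero,
    using the Fisher z transformation. r is the correlation to detect."""
    if n <= 3:
        raise ValueError("the Fisher z approximation needs n > 3")
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = math.atanh(r) * math.sqrt(n - 3)  # expected test statistic under r
    # Probability of landing in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Power available with the samples we already have, for several sizes.
for n in (20, 40, 85, 150):
    print(n, round(approx_power_simple_regression(n, r=0.3), 3))
```

If the power at the available sample size falls short of the target (0.8 is a common choice), more samples are needed or only a larger effect can be detected reliably.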

 


How can you check if you have taken enough samples for carrying out a Regression Analysis?

Using the adjusted R-sq calculator (https://www.danielsoper.com/statcalc/calculator.aspx?id=25), you can assess whether increasing the sample size brings the observed R-sq and the adjusted R-sq closer together (leaving the number of predictors unchanged).

Testing larger sample sizes to see if the values get closer is one way to see if you have taken enough samples. If the observed R-sq and the adjusted R-sq are within 20% of each other, the sample size is sufficient.
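The same check can be sketched without the online calculator, using the standard adjusted R-sq formula, 1 - (1 - R-sq)(n - 1)/(n - k - 1). Reading "within 20%" as a relative gap between the two values is an assumption, as are the function names and the example numbers.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R-sq for an observed R-sq, n samples and k predictors."""
    if n <= k + 1:
        raise ValueError("need more samples than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def relative_gap(r2, n, k):
    """Gap between observed and adjusted R-sq, as a fraction of observed R-sq."""
    return (r2 - adjusted_r_squared(r2, n, k)) / r2


# Hypothetical case: observed R-sq of 0.5 with 4 predictors.
for n in (20, 40, 100):
    gap = relative_gap(0.5, n, k=4)
    print(n, round(adjusted_r_squared(0.5, n, k=4), 3), "within 20%:", gap <= 0.20)
```

The gap shrinks as n grows with the number of predictors held fixed, which is the behaviour this answer relies on.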

 


You can perform a Power Analysis to determine if you have a suitable sample size. Knowing the number of predictors, alpha, power, and effect size, the sample size can be determined.


Whether the sample size for a regression analysis is sufficient depends on the need. For example, if a prediction is needed, then a high R-sq together with a high predicted R-sq implies a good predictive regression, assuming the model is not overfit. If the residuals are small, random, and normally distributed, this also indicates a sufficient sample size.
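To make the comparison of R-sq and predicted R-sq concrete, here is a small sketch for a one-predictor regression. Predicted R-sq is computed by leaving out each observation in turn, refitting, and predicting the held-out point (the PRESS statistic). The data values are invented purely for illustration and the function names are assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared_and_predicted(xs, ys):
    """Return (R-sq, predicted R-sq) for a one-predictor regression."""
    n = len(ys)
    mean_y = sum(ys) / n
    sst = sum((y - mean_y) ** 2 for y in ys)
    b0, b1 = fit_line(xs, ys)
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    press = 0.0
    for i in range(n):
        # Refit without observation i, then predict it.
        a0, a1 = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (a0 + a1 * xs[i])) ** 2
    return 1 - sse / sst, 1 - press / sst


# Invented data, for illustration only.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
r2, pred_r2 = r_squared_and_predicted(xs, ys)
print(round(r2, 4), round(pred_r2, 4))
```

A predicted R-sq far below R-sq suggests overfitting, which usually means too few samples for the number of predictors in the model.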


An appropriate sample size determines whether the samples collected are enough for regression.

The sample size should be chosen so that the confidence interval computed from the collected data captures the population parameter.

It can be calculated from the width of the interval divided by the standard error.


Benchmark Six Sigma Expert View by Venugopal R

 

Most of us carry out regression analysis using software applications such as Minitab. We will get a result whatever the number of samples we use for the regression exercise. Certain applications do indicate whether we have taken the minimum sample size. Many follow a rule-of-thumb sample size of 10 or 30. This number may go up if we have more independent variables. The scientific determination of the required sample size for regression analysis can drag us into a deeper statistical discussion. I will try to give my views and understanding briefly.

 

A minimum sample size has been derived statistically that takes into account the statistical power (i.e. the probability of rejecting the null hypothesis when it is false). Where there are multiple independent variables, the minimum sample size is 50 + 8K, where K is the number of independent variables. If we need to evaluate the weightage of each variable, then the minimum sample size becomes 104 + K.

 

The above derivations indicate that a sample over 100 will have statistical justification.

 

If one needs to go deeper into this topic, the criteria for deriving the sample size can be extended to take into account the correlation amongst the independent variables and the correlations between the independent and dependent variables. The sample size in such cases would be higher, and is represented as a table for various values of these correlations.

 

Sometimes, practical constraints prevent us from obtaining the scientifically derived sample sizes and we may resort to lower sample sizes. While this would certainly compromise the power of the test, we may look at the R-square value. A higher R-square value gives assurance that most of the variation in the dependent variable is explained by the independent variables considered.


The chosen best answer is that of Natwar Lal. For a practical approach, read through Sandra's answer. For Benchmark expert view, refer to Venugopal's answer.
