
Curse of Dimensionality (a phenomenon in Data Analytics) refers to the problems that arise in organizing and analysing a data set when we have too many variables, dimensions or inputs. The biggest problems are the exponential rise in the sample size requirement and the loss of accuracy in determining the relationship between the output and the many dimensions.

 

An application-oriented question on the topic along with responses can be seen below. The best answer was provided by Joyal on 29th Sep 2020.

 

Applause for all the respondents - Aritra Das Gupta, Sudheer Chauhan, Sherin Sebastian, Joyal.

 

Also review the answer provided by Mr Venugopal R, Benchmark Six Sigma's in-house expert.

Question

Q 300. Every action that we take, generates data. However, having too much data leads to a paradoxical situation known as 'Curse of Dimensionality'. Explain the curse and its effects on data analysis with suitable examples.

 

 

Note for website visitors - Two questions are asked every week on this platform. One on Tuesday and the other on Friday.


Best Answer - by Joyal

The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces and that do not occur in low-dimensional settings such as the 3D physical space of everyday life.

Dimensionally cursed phenomena occur in sampling, numerical analysis, data mining, machine learning and databases. The common theme of these problems is that when dimensionality increases, the volume of the space increases so quickly that the available data become sparse. This becomes a problem for any method that requires statistical significance.

In order to obtain statistically sound and reliable results, the amount of data needed to support the result often grows exponentially with the dimensionality. Also, organizing and searching data often relies on detecting areas where objects form groups with similar properties; in high-dimensional data, however, all objects appear sparse and dissimilar in many ways, which prevents common data-organization strategies from being efficient.

There are two things to consider regarding the curse of dimensionality. On one hand, machine learning excels at analyzing data with many dimensions; humans are not good at finding patterns that are spread across many dimensions, especially if those dimensions are interrelated in counter-intuitive ways. On the other hand, as we add more dimensions we also increase the processing power needed to analyze the data, and we increase the amount of training data required to build meaningful models.
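To get a feel for how quickly the data requirement explodes, here is a back-of-the-envelope sketch in Python (my own illustration, not part of the original answer): if each feature is split into just 10 bins, the number of cells the data must cover grows as 10 raised to the number of dimensions, so a fixed budget of 10,000 observations is spread impossibly thin.

n_samples = 10_000            # fixed budget of observations (assumed for illustration)
bins_per_feature = 10         # coarse discretisation of each feature
for d in (1, 2, 3, 5, 10):
    n_cells = bins_per_feature ** d                  # cells needed to cover the space
    print(f"dims={d:>2}  cells={n_cells:>12,}  avg points per cell={n_samples / n_cells:.6f}")

By 10 dimensions the average occupancy is about one point per million cells, which is exactly the sparsity described above.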

Hughes Phenomenon

The Hughes phenomenon shows that as the number of features increases, a classifier's performance increases as well until we reach the optimal number of features. Adding more features while keeping the training set the same size will then degrade the classifier's performance.
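The effect can be simulated with a short, hedged Python sketch (the synthetic data and parameter choices are mine; it assumes numpy and scikit-learn are installed): a fixed training set of 100 observations with 5 genuinely informative features, padded with pure-noise columns, classified with a nearest-neighbour model.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
# 5 informative columns first (shuffle=False keeps them at the front)
X_inf, y = make_classification(n_samples=200, n_features=5, n_informative=5,
                               n_redundant=0, shuffle=False, random_state=0)
X = np.hstack([X_inf, rng.normal(size=(200, 495))])          # pad with 495 pure-noise features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=100, random_state=0)
for k in (1, 2, 5, 20, 100, 500):                            # use only the first k features
    model = KNeighborsClassifier(n_neighbors=5).fit(X_tr[:, :k], y_tr)
    print(f"features={k:>3}  test accuracy={model.score(X_te[:, :k], y_te):.2f}")

Typically the test accuracy peaks near the 5 informative features and then drifts towards chance as hundreds of noise columns are added, which is the Hughes effect in miniature.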

Curse of dimensionality in distance function

An increase in the number of dimensions of a dataset means there are more entries in the vector of features that represents each observation in the corresponding Euclidean space. In other words, as the number of features grows for a given number of observations, the feature space becomes increasingly sparse, i.e. less dense or emptier. On the flip side, the lower data density requires more observations to keep the average distance between data points the same.

When the distance between observations grows, supervised machine learning becomes more difficult, because predictions for new samples are less likely to be based on learning from similar training samples.
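A quick numerical check of this effect (an illustrative numpy sketch, not from the original answer) is to draw random points in a unit hypercube and compare the nearest and farthest distances measured from one of them.

import numpy as np
rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))                          # 500 random points in d dimensions
    dists = np.linalg.norm(X[1:] - X[0], axis=1)      # distances from the first point
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dims={d:>4}  relative contrast={contrast:.3f}")

As the dimensionality grows, the relative contrast shrinks towards zero, i.e. every point starts to look roughly equally far away.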

Overfitting and Underfitting

With the curse of dimensionality, even the closest neighbour can be too far away in a high-dimensional space to give a good estimate. Regularization is one way to avoid overfitting. We can also use feature selection and dimensionality reduction techniques to help us escape the curse of dimensionality. Overfitting occurs when a model starts to memorize aspects of the training set and in turn loses the ability to generalize.

Ex:- If our training data is not good enough, we risk producing a model that is very good at predicting the target class on the training dataset but fails miserably when faced with new data; that is, the model lacks generalization power.

One way to avoid overfitting is to prefer simple methods: the hypothesis with the fewest assumptions should be selected.

Keeping the model simple helps us avoid overfitting, but if we make it too simple we risk underfitting. Underfitting arises when the model has such low representational power that it cannot model the data even if we had all the training data we wanted. A model underfits when it fails to capture the pattern in the data; it suffers from high bias.
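As a rough illustration of the regularization point (a toy example of my own, assuming scikit-learn is available), compare ordinary least squares with Ridge regression when the features outnumber the observations:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 80, 200                                       # fewer samples than features
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=n)    # only the first feature matters
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for name, model in [("OLS", LinearRegression()), ("Ridge(alpha=10)", Ridge(alpha=10.0))]:
    model.fit(X_tr, y_tr)
    print(f"{name:<15} train R^2={model.score(X_tr, y_tr):.2f}  test R^2={model.score(X_te, y_te):.2f}")

Unregularized least squares fits the training split perfectly yet generalizes poorly, while the regularized model gives up a little training fit for noticeably better test performance.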

Hence, in order to escape the curse of dimensionality, more data is needed as the number of dimensions grows.


Benchmark Six Sigma Expert View by Venugopal R

When we want to study the relationship of an outcome to one factor, as in a simple linear regression, we obtain a relationship model with a certain level of accuracy. If we enhance the model by adding another relevant factor (dimension), we can expect the accuracy of the prediction to increase. However, if we keep on increasing the number of dimensions, then from a certain threshold onwards we will see that the accuracy actually starts decreasing, unless we keep increasing the quantum of data substantially.

 

The term “Curse of Dimensionality” was coined by Richard Bellman, an American mathematician, while dealing with problems in dynamic programming. When studying models that relate certain outcomes to factors (referred to as dimensions), establishing a statistical relationship becomes very difficult as the number of dimensions increases, unless we exponentially increase the amount of data. This phenomenon is of particular interest in the field of machine learning related data analysis.

 

To illustrate this in simple terms, let us consider an example where the variation in the Quality of a certain food is studied against varying temperature. The Quality is determined by applying a score at various levels of temperature. We obtain a scatter diagram as in figure-1 below:

[Figure-1: Scatter plot of Quality score vs. Temperature, with a fitted regression line]

Now, we try to enhance the model by adding one more factor, viz. Time, while the total number of samples remains unchanged. Since we have added one more dimension, we have to use a 3D scatter plot, as in figure-2, to represent the relationship.

 

In figure-1, when it was a two-dimensional model, we could observe that the points were quite dense and a regression line could be fitted with apparently low residuals.

 

Figure-2 represents the 3D regression after the additional factor ‘Time’ is included, all other data remaining the same. The space of the scatter diagram becomes a cube, and we can observe that the data points have changed from a ‘dense’ pattern to a more ‘sparse’ pattern. If we continue to include more dimensions for the same sample size, the representation becomes more complex, and the ‘sparseness’ of the data will increase, making it difficult to obtain an accurate prediction from the model.
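This growing ‘sparseness’ can also be quantified rather than drawn. A small sketch (illustrative only, using numpy; not part of the expert view) holds the sample size fixed and measures how far apart neighbouring points become as dimensions are added:

import numpy as np
rng = np.random.default_rng(1)
n = 50                                               # same number of samples in every case
for d in (1, 2, 3, 5, 10):
    X = rng.random((n, d))                           # points spread in a unit hypercube
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)                      # ignore the distance from a point to itself
    print(f"dims={d:>2}  mean nearest-neighbour distance={D.min(axis=1).mean():.3f}")

The mean nearest-neighbour distance keeps increasing even though the number of points never changes, which is exactly the move from figure-1 to figure-2.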

 

Understanding the ‘Curse of Dimensionality’ is crucial while planning the number of dimensions and the volumes for an effective machine learning exercise.


As Black Belts, we often come across situations where an intricate analysis involving statistical concepts is presented to the management, and after all the explanation we are asked, "What is the point?"

But if we try to see it from the standpoint of the audience, we will soon understand that they have a limited amount of time and concentration, so they want specific answers to their questions in a simplified manner.

In today’s world there is so much data that there are potentially 2 risks:

1. If there is surplus data, we have an excess of observations while the features are limited. This affects out-of-sample performance.
2. The other scenario is that if we have too many features in the data, it becomes extremely difficult to cluster. This results in a situation where there are so many dimensions that every observation in the dataset appears to be equidistant from every other.

This term was coined by Richard Bellman to highlight the difficulty of optimizing a function that has a large number of variables.

 

Below is a detailed example of this concept:
 

[Image: 8 candies of different colours, 4 sweet and 4 spicy]

In the above example there are 8 candies belonging to 2 distinct categories: one set of candies is sweet while the other is spicy.
The problem is to find a solution such that if a customer asks for a sweet candy, he never gets a spicy one, and likewise, when a customer asks for a spicy candy, he should not get a sweet one.

 

[Image: the candies clustered into 2 distinct groups, sweet and spicy]

 

In the above clustering, the candies are grouped into the 2 distinct categories that fulfil the customer's demand, which is for either a sweet or a spicy candy. A human can make this distinction easily, since he is visually able to differentiate the colours and hence fulfil the customer's requirement.

However, if the same task has to be handled as part of an AI project, this logic needs to be converted into an algorithm. The machine will not be able to understand the difference, and there is a high probability that the customer will be given an incorrect candy.
[Image: 8 candies, each with its own colour, forming 8 separate clusters]

 

In the above example there are not 2 but 8 categories, and clustering becomes extremely difficult. Every candy has its own individual colour, and no relationship has been drawn between them, so it is difficult to create an algorithm to predict the taste of a candy.

Though there are 8 different colours and the clusters are equidistant from each other, there are still only 4 candies that are sweet and 4 that are spicy.

Dimensionality Reduction is the only solution to this problem. 

 

[Image: each candy's colour features expressed as exposure to the latent features]

 

In the above example, each colour feature is expressed as its exposure to the latent features. Once this is done, the data can be plotted on a graph, which makes clustering easy.

 

[Graph: candies plotted against the two latent features, Red (spicy) and Blue (sweet)]

 

In the above graph there are 2 distinct colours, Red and Blue: Red represents spicy candy whereas Blue represents sweet.

So whenever there is a new candy, its colour is recorded and then transformed into its exposure to Red and Blue. Using these latent features, it can easily be determined whether the candy is closer to Red or to Blue, and once this is done it is easy to ascertain whether it is spicy or sweet.

This is a very important concept when there are too many features, which can confuse a machine learning algorithm. Dimensionality reduction removes the dimensions that create the difficulty and helps in getting to the solution.
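A hedged sketch of this workflow (the 'colour features' below are synthetic numbers of my own choosing; it assumes numpy and scikit-learn are available): many raw features are projected onto 2 latent dimensions with PCA, after which a simple 2-cluster model separates sweet from spicy.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
sweet = rng.normal(loc=0.0, scale=0.3, size=(4, 10))      # 4 sweet candies, 10 raw colour features
spicy = rng.normal(loc=2.0, scale=0.3, size=(4, 10))      # 4 spicy candies
X = np.vstack([sweet, spicy])
latent = PCA(n_components=2).fit_transform(X)              # 10 features -> 2 latent features
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(latent)
print("cluster assigned to each candy:", labels)           # sweet and spicy land in different clusters

With the data compressed to two latent features, the two groups separate cleanly, mirroring the Red/Blue picture described above.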

 


Curse of Dimensionality: -

 

The term “Curse of Dimensionality” was used by the famous mathematician Richard Bellman in his book Dynamic Programming in 1957.

The curse of dimensionality is a problem that occurs when extra dimensions or attributes are added to a model. Its meaning is that error increases as the number of features or attributes increases.

Generally, the accuracy of an algorithm increases with an increase in the number of features or attributes, but this holds only up to a threshold; beyond that threshold the accuracy decreases as extra features or attributes are added. The reason is that algorithms are difficult to design in high dimensions and often have a running time that is exponential in the number of dimensions. A high number of dimensions or features allows more information to be stored, but in practice it does not help, as it brings more noise and redundancy.

Collecting huge amounts of data may create dimensionality issues such as high noise. Some features do not give significant information; they only create confusion in the system and inflate the data.

Below are the difficulties found while analysing high-dimensional data:

·         High-dimensional spaces have geometrical properties that are different from the properties of 2- or 3-dimensional space.

·         Data analysis tools are designed around human intuition, and these tools work best in 2- or 3-dimensional space.

Domains of curse of dimensionality

 

§  Anomaly Detection

 

In high-dimensional data, an anomaly may have many attributes that are irrelevant to its anomalous nature, and the search space is very large, which makes anomaly detection difficult.

§  Combinatorics

The curse of dimensionality occurs when complexity increases rapidly, driven by the growth in the number of combinations of the inputs.

§  Machine Learning

 

Having enough data is important for the successful development of learning algorithms. Increasing the dimensionality of the data makes the data sparser; this effect is called the COD.

Preventing the COD

 

§  Dimensionality reduction is used to transform high-dimensional variables into lower-dimensional variables without losing the essential information. It is often used to reduce the COD effect.

§  Regularization: the problem arises through the estimation of unstable parameters, so regularizing these estimates protects against the COD.

§  PCA (Principal Component Analysis) is the traditional tool used for reducing high-dimensional data to a lower dimension. It transforms the data into a more informative space so that fewer dimensions are needed, preventing the curse of dimensionality (a small sketch follows below).
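A brief sketch of the PCA idea (illustrative assumptions only; the data below is simulated by me): 40 correlated features are generated from 3 underlying factors, and the explained-variance ratio shows that almost all of the information lives in the first few components, so the remaining dimensions can be dropped.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(300, 3))                                          # 3 underlying factors
X = base @ rng.normal(size=(3, 40)) + 0.05 * rng.normal(size=(300, 40))   # 40 noisy, correlated features
pca = PCA().fit(X)
print("variance explained by the first 3 components:", pca.explained_variance_ratio_[:3].round(3))
print("variance explained by the remaining 37:", pca.explained_variance_ratio_[3:].sum().round(3))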

 

Example: -

 

Suppose 5 models M1, M5, M10, M50 and M100 are generated with 1, 5, 10, 50 and 100 attributes respectively, and the accuracy threshold lies at 10 attributes. Accuracy increases as the number of attributes rises from 1 to 10; however, once the attributes increase beyond 10, accuracy decreases, because the extra attributes add more confusion and lead to inaccuracy. This phenomenon is called the Curse of Dimensionality.

 

 

[Chart: model accuracy rising up to about 10 attributes and falling beyond that]


Joyal has clearly explained the problems arising due to high dimensions and hence it has been selected as the best answer. Congratulations!

 

Also review the answer provided by Mr Venugopal R, Benchmark Six Sigma's in-house expert.
