Insanely simple explanation of the challenges in Machine Learning in higher dimensions.
Here’s my best attempt at a simple.wikipedia.org article. In accordance with both simple.wikipedia.org rules and the oft-repeated Einstein/Feynman/Michael Scott quotes about explaining things to 5-year olds being the best sign of understanding something yourself, I’ve kept the vocabulary simple: Feel free to remix and rework the text to improve it.
“Machine learning” is often the problem of taking a bunch of data and coming up with a simple summary. For example, you might want a machine to look at pictures (which are made up of millions of data points called “pixels”) and come up with a single word definition of the subject (say “dog”, or “cat”, or “human”, or “spoon”, or “chair”). Now we will pick an even simpler problem to use as an example. We have collected data on two objects. We don’t know what the objects are, but we do know that there are two-distinct objects in our data. We want our machine to _cluster_ (this is a specific important machine learning problem called _clustering_) the given input and tell us which ones are object 1 and which ones are object 2. So, for example, we could give it lots of data on dogs and cats, and we want it to separate the dog-data from the cat-data.
However sometimes when we have a lot of data, our algorithms stop working well. This is called the “curse of dimensionality”. Consider a more abstract version of the problem than the one above. You see a group of points in 2-dimensions and you want to draw a single line through it that separates the points. For example, see this picture:
I will refer to this picture many times, so keep it open on a tab to the side. The algorithm (the one used in the picture is called “K-Means” but we can ignore the specifics. The point here is that _any_ clustering algorithm will face the same issues, so just focus on the problem itself) has separated the points into two groups (red and blue) and drawn a line between them. The idea is we now have detected that the blues are different from the reds in some pattern of the “features” (which is what we call the 2 axes —- we have 2 “features”).
Now we could try and “generalize” this algorithm —- that is, come up with some modification that works when we have lots of features. So lets consider three dimensions. Now first you will notice that a line doesn’t “split” three dimensions. You need a “plane”. A plane is a generalization of a line into 3-space. Visualize a room. A single pipe doesn’t split the room in two. But a very large piece of stiff cardboard will. Similarly, when we are in higher dimensions we will talk of “hyperplanes”. Notice that a plane sort of “exists” in a 2-d world of it’s own (like the stiff cardboard), even though the cardboard “lives” in 3-d. This is not a coincidence. We can, in principle, use a (n-1) dimensional structure to separate cleanly a n-dimensional structure into two parts. While you cannot visualize this in higher dimensions, the n-1 separation principle is important. [2]
So since we are looking to do just that —- “cluster” our data into two separate parts separated by a single “hyperplane” —- we will try and draw one in 3-dimensions. So maybe it is easy. From the picture above, imagine that the points are in a 3-dimensional room. So we have the two dimensions as in the picture, but now there is a third one, say that is just completely useless —- so the points are randomly distributed on that dimension. To visualize, imagine you laid that figure on the floor in your room and the points were all at different random heights off the floor, suspended in the room. The important thing is that if you viewed the room from directly above, (and far enough away so that you couldn’t see the heights at all) you would see the original 2-d picture in the k-means example picture). But from any other angle, you would see a bunch of extra noise clouding the clean separation that is visible in that one specific from-the-ceiling angle.
So already you can see when you have a single useless dimension it makes life difficult. You have to somehow figure out the right angle to see things so that it doesn’t cloud your judgment and you can “see” the separation. This intuition is important, because that is what will happen to the algorithm.
Now we will talk about k-means, the specific algorithm used in that example to do clustering. K-means is a powerful algorithm. How k-means works is it compares the distance between points and some “centers” that it thinks “cluster” the data. So it guesses two centers, and then draws a dividing line between the two (at the mid point). You can see this in the example picture. There is a blue center and a red center and a line that divides.
K-means then computers a “average” (it uses a little more fancy metric, but “average” will suffice for our intuition here) of the distances of all the points on it’s side of the line to the center. Basically, it is trying to see how far all the points are from the center. If the points are all far away, then it is not really the center. In fact, the center will _minimize_ the distance from all the points —- that’s why we like centers. So it repeats this a few times until it finds the right center. Importantly, you can also use this calculation to “point” you in the right direction. I found a nice picture that shows this step by step…
…as it _figures out_ where the centers should be, at each step redrawing the lines.
Now, the important thing here is that we use the distances from the points to the center to calculate how “good” the center is. So I will unfortunately have to introduce a little bit of math here to show that.
When we have two points x and y in n dimensions, we can compute (x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2 + … + (x_n - y_n)^2 as the distance between the two points.
Now suppose in our example, we have only 2 dimensions that “matter”. So n = 2. So we have our distance metric as (x_1 - y_1)^2 + (x_2 - y_2)^2 Now let’s assume that x_1 and x_2 tell us important things about the data. Maybe, for example, x_1 and x_2 are key properties of the objects we want to cluster. (weight and number of teeth). Suppose we are clustering between dogs and cats. Clearly weight and number of teeth will tell us important things that help us separate the dogs from the cats. So we call these “features” _informative_. They are valuable! Running k-means will give us a good separation.
Now suppose, in addition to that, we have some extra dimensions that are useless. Say “color”, “date of birth”, first name of owner”, “number of kids in the house where it lives”. Maybe these “features” really are just random noise. They’re just getting in the way. So we will assume that the data is “randomly distributed with respect to these features”.
Now our distance metric is: (x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2 + … + (x_n - y_n)^2 Since we assumed for k>2 that x_k is random, we have our distance as (x_1 - y_1)^2 + (x_2 - y_2)^2 + (random-random)^2 + … + (random-random)^2
Now we will avoid some math here, but the important point is that the “random - random” starts to shout out over the first two dimensions as we increase the number of randomly distributed features.[3]
Intuitively, unless you know to “see” from the exact right angle (like the top of the room in the example) you will be lost in the sea of points in the space. But how do you decide which is the right angle? That is itself a hard problem. And oftentimes, it is _the_ hard problem. Deciding that “first name of the owner” wasn’t as important as “weight” _is_ the hard problem. So k-means doesn’t work as well.
This problem is often called the “curse” of dimensionality, because in a very high-dimensional space, “all points are equally far”.
[1] A proper simple.wikipedia article should use a different word here. Suggestions?
[2] Side distraction: We can actually, with a little effort visualize a 4-d space. Say the 4 dimensions are our usual 3 + the extra time-dimension. We can have a 3-d object separate us into two separate 4-d worlds. So we’re waiting waiting waiting as time passes, and then suddenly, the entire world is made of concrete. It obliterates everything in the universe. Then, in a flash, just as soon as it arrived, the concrete disappears. That concrete barrier is the 3-d hyperplane in the 4-d space.
Now to make it convenient to visualize, we have made our 3-d object “axis-aligned”. But it may not be axis aligned. But the problem will be that the answer we want won’t always be conveniently axis-aligned. Imagine a 2-d concrete barrier that starts at one end of the universe and “moves” through the universe as we move time. It crushes everything in its path. You can’t get past it. It will eventually get to you because it is moving perpendicular to it’s surface. Eventually it will destroy everything. So you still have a 3-d (this barrier exists in 2 spacial dimensions and 1 time dimension) concrete barrier separating the 4-d space without being axis aligned.
[3] I’m trying to avoid having to say that while the expected value of (random-random) is 0, the variance is nonzero and positive. And as we have more random features, the expected variance of the sum of the random features grows. I.E. Var(N(mu, sigma) - N(mu, sigma))
— Arjun Narayanan
That was very, very helpful.


