## Sunday, May 24, 2015

### Things that are the same in high dimensions

I have a set of high-dimensional vectors. These vectors are distributed as a gaussian surrounding zero in all the dimensions. I choose two of those vectors, perform an operation on them, and search for the nearest neighbors among the set using Euclidean distance. The following operations all tend to return the same neighbors:

The sum

The average

n * the sum, where n is large enough (something like 0.25 to infinity)

The vector product (I think that's what it's called)

The element-wise maximum

The element-wise minimum

The sum of the element-wise sign function (e > 0 : +1, e < 0 : -1)

All of these basically give me something that is approximately a 50/50 mixture of the elements of the two vectors. Each of the elements of the result vector will resemble an element of one of the two input vectors. The multiplier n can be as large as you like because after a certain point, it's just saying "whatever is furthest out along that direction."

If you think of a simpler version of the same situation, it is easier to see what is happening. Let all the coefficients be +1 or 0, still distributed as a half-gaussian. Then you can think of the values that happen to be +1 as "words" which "define" the vector. The result of any of the above operations will give you a definition that includes words from the definitions of both the inputs, and no others.

I think I could prove this mathematically, but maybe someone knows of a proof already I could use as a reference?

The sum

The average

n * the sum, where n is large enough (something like 0.25 to infinity)

The vector product (I think that's what it's called)

The element-wise maximum

The element-wise minimum

The sum of the element-wise sign function (e > 0 : +1, e < 0 : -1)

All of these basically give me something that is approximately a 50/50 mixture of the elements of the two vectors. Each of the elements of the result vector will resemble an element of one of the two input vectors. The multiplier n can be as large as you like because after a certain point, it's just saying "whatever is furthest out along that direction."

If you think of a simpler version of the same situation, it is easier to see what is happening. Let all the coefficients be +1 or 0, still distributed as a half-gaussian. Then you can think of the values that happen to be +1 as "words" which "define" the vector. The result of any of the above operations will give you a definition that includes words from the definitions of both the inputs, and no others.

I think I could prove this mathematically, but maybe someone knows of a proof already I could use as a reference?

Subscribe to:
Posts (Atom)