Visual Filters

I'm working on ways to teach computers to assign labels to different parts of an image or video.  The main tool I've been using is something we've dubbed a "visual filter."
Visual filters were originally thought up by Justin Domke, who called them a "Graphical Combination of Classifiers." Either name is an apt description, but some of this terminology is specific to the image and signal processing field, so I'll give a little background before I explain how this algorithm works.

If you remember back to high school algebra, a "function" can take several variables and gives as a result the unique value of a single variable. An image processing "filter" is a function in this sense. The variables it takes in are the colors of the pixels in a small window of an image, and the output is the color of a single pixel of the output. The filter scans the image, sliding across row after row, each time considering a region of the image that mostly overlaps with the regions it has already looked at. (If you want to know how a color can be an input variable, take a look at what I wrote here.)
Traditionally, filters have used fairly simple formulas to take these input colors and transform them into the output colors. (The "box blur," for example, simply takes the average of all the colors in the window and assigns it to the output.) That's why they are a fundamental tool in the image processing toolbox.  Adobe Photoshop has a main menu entry called "Filters." If you want to blur an image, find the edges, or many other simple tasks, you would use a filter.
These filters used simple formulas because early computers had very limited memory and it took a long time to filter an image, because the same formula needed to be applied as many times as there are pixels in the image.  So for a 1000 x 1000 (megapixel) image, that's a million times the variables need to be gathered from the sliding window and the formula need to be applied.

Visual Filters:
We wanted to build a filter that would take in a patch of an image, and color the output depending on what the window was centered on.  For example, one color for "human," and another color for "anything else." No one knows what formula would be able to do this; it must be terribly complicated.  But such a function must exist out there, out in Platonic space somewhere.
When scientists want to know what a function is that comes from the natural world, they gather data samples. For example, if you measure the height of a bouncing ball at various times, you can get a pretty good idea of the function that is describing how the ball bounces.  In the same way, if we take a million sample windows and label them appropriately, it helps us to estimate the shape of that complicated hidden function. A shortcut to doing this is to take a bunch of images, create an appropriate label image, randomly take samples from the image, and assign the samples the label from that spot in the label image.
A naive solution to our problem, then, would be the following: slide the window along the image, and at each step, compare what is in the window to every sample you have that is labeled. Find the best match or matches, and assign that label to your result. 
This would work, but the trouble is that the number of patches you would need is so absurdly high that you could never get enough. People can vary in lighting, in pose, in camera position, in appearance, in clothing, in age, in facial expression, and so forth. And even worse than that, the exact same sample from an image might be part of a person's shirt, or it might be part of a blanket on a bed.  Without looking at the surrounding context, you would never know.  So the algorithm needs a way to include context.

Including Context:
This is the really clever bit, in my opinion. For the first step, the program does the naive method, sliding the window and creating that output image by comparing what's in the window at each location to all the labeled samples. But then we go back, and define a new filter, a new function.  This one takes in a window on the image, but also takes in a window from the same location on the previously generated output map. This map is far from perfect, but if even some of the pixels are correctly labeled, that context information can help each patch be better labeled the next time. (Since we have the labeled training images, we know what this map is supposed to look like for some images.) So we do this over and over on the training images, building up maybe five filters to be run in succesion, each time improving the output.
Finally, once we have these five succesive filters, we can run them on new images.  As long as the new images are sufficiently similar to the training images, we can do a pretty good job of classifying every single pixel in that new image.

Instead of just taking the raw pixel values, we instead compress the information in such a way that it preserves similarity between patches.  This can be done by turning the sample into a SIFT feature, or a multi-scale patch compressed using PCA, or creating features similar to those used by the human visual system (What Serre et. al. call "standard model features.")  This serves two purposes: it lets us get away with fewer samples, and the individual samples don't take up as much memory.

Instead of searching for similar patches directly, we can train a classifier, using a neural net, for example. 

The direct method could be considered a "non-parametric classifier." Of course we aren't just taking the best match, but an average of all the nearby matches weighted by similarity. We can interpolate between them in a more intelligent way, by using anisotropic kernels.
The nice thing about using the data directly is that it enables the program to handle as many classes as we want and to keep throwing in more and more data (until we run out of memory.) This is generally known as "the Google approach."
We could train it using motion data as well, to improve the results on video.

Two papers about this:

Some images (more to come soon-- check the papers for other examples):
Detecting head, arms, and legs.  This was trained exclusively on sports images.

Detecting salient contours.  This was trained on hand-labeled salient edge images.  The point was mostly to show we could label edge-like features as well as regions.

Detecting trees.  This was trained using labeled images from the LabelMe database.


Popular Posts