Monday, November 22, 2010

Visual Filters

I'm working on ways to teach computers to assign labels to different parts of an image or video.  The main tool I've been using is something we've dubbed a "visual filter."
Visual filters were originally thought up by Justin Domke, who called them a "Graphical Combination of Classifiers." Either name is an apt description, but some of this terminology is specific to the image and signal processing field, so I'll give a little background before I explain how this algorithm works.

If you remember back to high school algebra, a "function" can take several variables and gives as a result the unique value of a single variable. An image processing "filter" is a function in this sense. The variables it takes in are the colors of the pixels in a small window of an image, and the output is the color of a single pixel of the output. The filter scans the image, sliding across row after row, each time considering a region of the image that mostly overlaps with the regions it has already looked at. (If you want to know how a color can be an input variable, take a look at what I wrote here.)
Traditionally, filters have used fairly simple formulas to transform these input colors into the output colors. (The "box blur," for example, simply takes the average of all the colors in the window and assigns it to the output.) Because they are simple and broadly useful, filters are a fundamental tool in the image processing toolbox.  Adobe Photoshop has a main menu entry called "Filters." If you want to blur an image, find the edges, or do many other simple tasks, you would use a filter.
These filters used simple formulas because early computers had very limited memory, and filtering an image took a long time: the same formula has to be applied as many times as there are pixels in the image.  So for a 1000 x 1000 (one megapixel) image, the variables need to be gathered from the sliding window and the formula needs to be applied a million times.
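To make the sliding-window idea concrete, here is a minimal box blur in Python. It uses plain lists rather than a real image library, and the function name and edge handling (shrinking the window at the borders) are my own choices, not anything from a specific library:

```python
def box_blur(image, radius=1):
    """Slide a (2*radius+1) x (2*radius+1) window over a grayscale image
    (a list of rows) and set each output pixel to the mean of the window.
    At the borders the window is clipped to the image, so the average is
    taken over fewer pixels there."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        total += image[ny][nx]
                        count += 1
            out[y][x] = total / count
    return out

# A single bright pixel gets smeared across its neighbors.
img = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
blurred = box_blur(img)
print(blurred[1][1])  # 1.0: the mean of the full 3x3 window
```

Note how the same tiny formula (an average) is reapplied at every pixel; that repetition is exactly the cost described above.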

Visual Filters:
We wanted to build a filter that would take in a patch of an image, and color the output depending on what the window was centered on.  For example, one color for "human," and another color for "anything else." No one knows what formula would be able to do this; it must be terribly complicated.  But such a function must exist out there, out in Platonic space somewhere.
When scientists want to know what a function from the natural world looks like, they gather data samples. For example, if you measure the height of a bouncing ball at various times, you can get a pretty good idea of the function that describes how the ball bounces.  In the same way, if we take a million sample windows and label them appropriately, it helps us estimate the shape of that complicated hidden function. A shortcut to doing this is to take a bunch of images, create an appropriate label image for each one, randomly take sample windows from each image, and assign each sample the label from that spot in the label image.
A naive solution to our problem, then, would be the following: slide the window along the image, and at each step, compare what is in the window to every sample you have that is labeled. Find the best match or matches, and assign that label to your result. 
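A sketch of that naive approach, with toy one-channel patches and made-up labels (`label_image`, `patch_distance`, and the sample format are illustrative names of my own, not code from the actual system):

```python
def patch_distance(a, b):
    """Sum of squared differences between two flattened patches."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def label_image(image, samples, radius=1):
    """Naive labeling: for every pixel far enough from the border, extract
    the window around it and assign the label of the closest stored sample.
    `samples` is a list of (flattened_patch, label) pairs."""
    h, w = len(image), len(image[0])
    out = [[None] * w for _ in range(h)]
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            window = [image[y + dy][x + dx]
                      for dy in range(-radius, radius + 1)
                      for dx in range(-radius, radius + 1)]
            _, label = min((patch_distance(window, p), lbl)
                           for p, lbl in samples)
            out[y][x] = label
    return out

# Toy example: one bright "person" exemplar, one dark "background" exemplar.
samples = [([9] * 9, "person"), ([0] * 9, "background")]
img = [[0] * 5 for _ in range(5)]
for y in range(1, 4):
    for x in range(1, 4):
        img[y][x] = 9
print(label_image(img, samples)[2][2])  # "person"
```

The cost structure is visible here too: every window is compared against every sample, which is exactly why the sample count blows up so badly.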
This would work, but the trouble is that the number of patches you would need is so absurdly high that you could never get enough. People can vary in lighting, in pose, in camera position, in appearance, in clothing, in age, in facial expression, and so forth. And even worse than that, the exact same sample from an image might be part of a person's shirt, or it might be part of a blanket on a bed.  Without looking at the surrounding context, you would never know.  So the algorithm needs a way to include context.

Including Context:
This is the really clever bit, in my opinion. For the first step, the program does the naive method, sliding the window and creating that output image by comparing what's in the window at each location to all the labeled samples. But then we go back, and define a new filter, a new function.  This one takes in a window on the image, but also takes in a window from the same location on the previously generated output map. This map is far from perfect, but if even some of the pixels are correctly labeled, that context information can help each patch be better labeled the next time. (Since we have the labeled training images, we know what this map is supposed to look like for some images.) So we do this over and over on the training images, building up maybe five filters to be run in succession, each time improving the output.
Finally, once we have these five successive filters, we can run them on new images.  As long as the new images are sufficiently similar to the training images, we can do a pretty good job of classifying every single pixel in that new image.
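The cascade can be sketched as follows. Here each stage is just a function from (image window, context window) to a label; in the real system each stage would be a trained classifier, and all the names below are mine:

```python
def extract_window(grid, y, x, radius, fill=0):
    """Flattened window around (y, x), padding out-of-bounds cells with `fill`."""
    h, w = len(grid), len(grid[0])
    return [grid[y + dy][x + dx] if 0 <= y + dy < h and 0 <= x + dx < w else fill
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)]

def run_stacked_filters(image, stages, radius=1):
    """Apply a cascade of filters. The first stage sees only the image window;
    every later stage also sees a window of the previous stage's label map,
    so context can correct earlier mistakes."""
    h, w = len(image), len(image[0])
    label_map = None
    for stage in stages:
        new_map = [[0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                img_win = extract_window(image, y, x, radius)
                ctx_win = (extract_window(label_map, y, x, radius)
                           if label_map is not None else None)
                new_map[y][x] = stage(img_win, ctx_win)
        label_map = new_map
    return label_map

# Toy cascade: stage 1 thresholds each pixel on its own; stage 2 relabels
# by majority vote over the context window, cleaning up isolated errors.
stage1 = lambda img_win, ctx: 1 if img_win[len(img_win) // 2] > 5 else 0
stage2 = lambda img_win, ctx: 1 if sum(ctx) > len(ctx) // 2 else 0
img = [[9, 9, 9], [9, 0, 9], [9, 9, 9]]  # one "noisy" dark pixel
print(run_stacked_filters(img, [stage1, stage2])[1][1])  # 1: context fixed it
```

Stage 1 alone mislabels the center pixel; the second pass, looking at the surrounding label map, flips it back.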

Instead of just taking the raw pixel values, we compress the information in a way that preserves similarity between patches.  This can be done by turning the sample into a SIFT feature, or a multi-scale patch compressed using PCA, or by creating features similar to those used by the human visual system (what Serre et al. call "standard model features").  This serves two purposes: it lets us get away with fewer samples, and the individual samples don't take up as much memory.
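The PCA option can be sketched in a few lines with NumPy (assumed available). This is the generic technique, not the exact feature pipeline from the papers, and the function name is mine:

```python
import numpy as np

def pca_compress(patches, k=2):
    """Project flattened patches onto their top-k principal components.
    Similar patches stay close together in the compressed space, but each
    sample now needs only k numbers instead of one per pixel."""
    X = np.asarray(patches, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean
    # SVD of the centered data matrix yields the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]
    codes = Xc @ components.T  # shape: (n_patches, k)
    return codes, mean, components

# Two dark 3x3 patches and two bright ones, compressed to one number each.
patches = [[0] * 9, [1] * 9, [9] * 9, [8] * 9]
codes, mean, comps = pca_compress(patches, k=1)
```

The two dark patches land near each other in code space and far from the two bright ones, which is exactly the property the nearest-neighbor search needs.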

Instead of searching for similar patches directly, we can train a classifier, using a neural net, for example. 

The direct method could be considered a "non-parametric classifier." Of course we aren't just taking the best match, but an average of all the nearby matches weighted by similarity. We can interpolate between them in a more intelligent way, by using anisotropic kernels.
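That similarity-weighted averaging can be sketched with a Gaussian kernel on patch distance. I'm using an isotropic kernel for simplicity (the anisotropic version shapes the kernel to the local arrangement of the samples), and the labels are numeric so they can be averaged:

```python
import math

def weighted_label(window, samples, bandwidth=1.0):
    """Soft nearest-neighbor: average the sample labels, weighting each
    sample by a Gaussian kernel on its squared distance to the query window.
    `samples` is a list of (flattened_patch, numeric_label) pairs,
    e.g. 1.0 = person, 0.0 = background."""
    num, den = 0.0, 0.0
    for patch, label in samples:
        d2 = sum((a - b) ** 2 for a, b in zip(window, patch))
        w = math.exp(-d2 / (2 * bandwidth ** 2))
        num += w * label
        den += w
    return num / den

samples = [([9, 9, 9], 1.0), ([0, 0, 0], 0.0)]
print(weighted_label([8, 9, 9], samples, bandwidth=3.0))  # close to 1.0
```

Distant samples still contribute, but their kernel weights fall off so fast that a handful of near matches dominates the average.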
The nice thing about using the data directly is that it enables the program to handle as many classes as we want and to keep throwing in more and more data (until we run out of memory.) This is generally known as "the Google approach."
We could train it using motion data as well, to improve the results on video.

Two papers about this:

Some images (more to come soon-- check the papers for other examples):
Detecting head, arms, and legs.  This was trained exclusively on sports images.

Detecting salient contours.  This was trained on hand-labeled salient edge images.  The point was mostly to show we could label edge-like features as well as regions.

Detecting trees.  This was trained using labeled images from the LabelMe database.

Friday, November 5, 2010

Summers Stay Hotel

The Summers Stay Hotel, also known as "Booked," is a hotel that is arranged like a library. Room numbers are assigned according to the Dewey Decimal system, or by author's last name and genre in the fiction section. Every wall of the hotel, both inside the rooms and out, is covered with bookshelves from ceiling to floor.  The front desk looks like an information desk.  There is no pool, but there is a children's room with sliding ladders on the walls that can be brought up to high speed and crashed into each other (padded, of course). When you "check out" from the hotel, they allow you to borrow a book for one year, hoping that you will return it the next year for another vacation stay.
The hotel has been so successful the owners are expanding to a cruise zeppelin.  "Booked Passage" has rooms of much the same sort, but will slowly float between major university libraries around the world.

It turns out that someone else has had much the same idea, and such a hotel actually exists in NYC:

Tuesday, November 2, 2010


One thing about Dungeons and Dragons that really appealed to me was the way it imposed a game structure-- a simplification, an ordering-- on life as a whole.  It allowed me to think about philosophy, character, and story by making it simple enough to approach one piece at a time.  Of course any simplification is a distortion; compression inevitably introduces artifacts.  But the model is at worst merely wrong; without any model our ideas are "not even wrong." The D&D version of philosophy makes falsifiable predictions, so is useful as a starting point for thinking about these kinds of issues.
You see a lot of D&D type simplifying of the world in 1600-1900 European modernist philosophy. It's all binary oppositions and the imposition of an oversimplified order. Since their invention as a kind of small-scale wargame, tabletop roleplaying games have moved in a more postmodern direction, emphasizing narrative and creativity, areas where they still hold the edge over digital RPGs.

(I am going to use all male examples below, but looking at the difference between how male and female characters in fiction realize these archetypes differently would itself be interesting.)

Lawful Good versus Chaotic Good: I think this is essentially a religious question.  A rational atheist may believe that laws are useful for society as a whole, but at any point his decision to follow a particular law depends on the law's utility. He may recognize that following the law as a default behavior avoids having to make difficult moral computations that can be paralyzing, but if he thinks breaking a law will serve a higher good (even after weighing its effect on society), he will do so. A true anarchist can be chaotic good since he only wants everyone to be equal (an-archy). This is the Superman vs. Batman conflict.

Chaotic Neutral: This is Captain Jack Sparrow, or Coyote the trickster god. His goals are often selfish and he carelessly hurts others; but he is not actively seeking power.  He doesn't actively oppress others. I think a lawful good character would judge that a chaotic neutral character, by not being lawful good, is in fact evil.  Personally, I thought that the point of Pirates of the Caribbean was to make a comedy about alignment. You have lawful good characters who gradually become neutral or even chaotic good.  There are also lawful evil characters, who are fairly rare in fiction, and a lot of the interpersonal conflict is driven by these differences in alignment.

Lawful evil versus chaotic evil: The lawful evil character is an authoritarian.  He wants there to be an order, but he wants it to be his order. A lawful evil character is without mercy, but not without justice.  I think of Javert the policeman from Les Misérables. The chaotic evil character is a monster, a predator. The Joker from The Dark Knight and Grendel fall into this category.

Lawful neutral:  I think a realistically portrayed robot would necessarily be lawful neutral.  For example, the terminator in the first film (despite its glowing red eyes and mean eyebrows) is actually lawful neutral. It appears evil to the characters in the film because the law it is following (its programming) pays no attention to morality, which for a person would in itself be evil (because part of our morality is innate and must be actively rejected.) But for the robot there is nothing but the law.

Here is a pretty good website, if you like this sort of thing:

All this also brings to mind one of my favorite comics: and