Object categorization from image search

In computer vision, object categorization from image search is the problem of training a classifier to recognize categories of objects using only images retrieved automatically with an Internet search engine. Ideally, automatic image collection would allow classifiers to be trained with nothing but the category names as input. This problem is closely related to that of content-based image retrieval (CBIR), where the goal is to return better image search results rather than to train a classifier for image recognition.

Traditionally, classifiers are trained using sets of images that are labeled by hand. Collecting such a set of images is often a very time-consuming and laborious process. The use of Internet search engines to automate the process of acquiring large sets of labeled images has been described as a potential way of greatly facilitating computer vision research.^[1]

Challenges

Unrelated images

One problem with using Internet image search results as a training set for a classifier is the high percentage of unrelated images within the results. It has been estimated that, when a search engine such as Google Images is queried with the name of an object category (such as "airplane"), up to 85% of the returned images are unrelated to the category.^[1]

Intra-class variability

Another challenge posed by using Internet image search results as training sets for classifiers is the high variability within object categories compared with categories found in hand-labeled datasets such as Caltech 101 and Pascal. Images of objects can vary widely in a number of important factors, such as scale, pose, lighting, number of objects, and amount of occlusion.

pLSA approach

In a 2005 paper by Fergus et al.,^[1] pLSA (probabilistic latent semantic analysis) and extensions of this model were applied to the problem of object categorization from image search. pLSA was originally developed for document classification, but has since been applied to computer vision. It makes the assumption that images are documents that fit the bag of words model.

Model

Just as text documents are made up of words, each of which may be repeated within the document and across documents, images can be modeled as combinations of visual words. Just as the entire set of text words is defined by a dictionary, the entire set of visual words is defined in a codeword dictionary.

pLSA divides documents into topics as well. Just as knowing the topic(s) of an article allows you to make good guesses about the kinds of words that will appear in it, the distribution of words in an image is dependent on the underlying topics. The pLSA model tells us the probability of seeing each word $w$ given the document (image) $\displaystyle d$ in terms of topics $\displaystyle z$ :

$\displaystyle P(w|d)=\sum _{z=1}^{Z}P(w|z)P(z|d)$

An important assumption made in this model is that $\displaystyle w$ and $\displaystyle d$ are conditionally independent given $\displaystyle z$ . Given a topic, the probability of a certain word appearing as part of that topic is independent of the rest of the image.^[2]
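The mixture above can be evaluated directly once the two distributions are known. The following is a minimal sketch in plain Python; the function name `word_given_doc` and the toy probabilities are illustrative assumptions, not values from the cited paper.

```python
def word_given_doc(p_w_given_z, p_z_given_d):
    """Return P(w|d) = sum_z P(w|z) P(z|d) for a single document.

    p_w_given_z: list of Z rows, each a distribution over the W visual words.
    p_z_given_d: list of Z topic weights for the document.
    """
    num_words = len(p_w_given_z[0])
    return [
        sum(p_z * p_w_given_z[z][w] for z, p_z in enumerate(p_z_given_d))
        for w in range(num_words)
    ]


# Toy numbers (illustrative only): 2 topics, 3 visual words.
p_w_given_z = [
    [0.7, 0.2, 0.1],  # topic 0 favours word 0
    [0.1, 0.3, 0.6],  # topic 1 favours word 2
]
p_z_given_d = [0.25, 0.75]  # this image is mostly topic 1

p_w_given_d = word_given_doc(p_w_given_z, p_z_given_d)
print(p_w_given_d)
```

Because each row of `p_w_given_z` and the vector `p_z_given_d` sum to one, the result is again a probability distribution over visual words.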

Training this model involves finding the $\displaystyle P(w|z)$ and $\displaystyle P(z|d)$ that maximize the likelihood of the observed words in each document. To do this, the expectation maximization algorithm is used, with the following objective function:

$\displaystyle L=\prod _{d=1}^{D}\prod _{w=1}^{W}P(w|d)^{n(w|d)}$
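One way to carry out this maximization is the standard EM iteration for pLSA, sketched below in plain Python. This is a generic textbook-style sketch, not the implementation used in the cited paper; the function names, the random initialization, the fixed iteration count, and the toy count matrix are all assumptions made for illustration. The code tracks the logarithm of the objective, which EM should never decrease.

```python
import math
import random


def normalize(values):
    """Scale non-negative values so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]


def plsa_em(counts, num_topics, iterations=50, seed=0):
    """Fit pLSA by EM. counts[d][w] is the number of times word w occurs in document d.

    Returns (P(w|z), P(z|d), log-likelihood history).
    """
    rng = random.Random(seed)
    num_docs, num_words = len(counts), len(counts[0])
    # Random strictly positive initialization (an assumption of this sketch).
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(num_words)])
             for _ in range(num_topics)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(num_topics)])
             for _ in range(num_docs)]
    history = []
    for _ in range(iterations):
        new_w_z = [[0.0] * num_words for _ in range(num_topics)]
        new_z_d = [[0.0] * num_topics for _ in range(num_docs)]
        log_lik = 0.0
        for d in range(num_docs):
            for w in range(num_words):
                n = counts[d][w]
                if n == 0:
                    continue
                joint = [p_w_z[z][w] * p_z_d[d][z] for z in range(num_topics)]
                p_w_d = sum(joint)  # P(w|d) under the current parameters
                log_lik += n * math.log(p_w_d)
                for z in range(num_topics):
                    resp = joint[z] / p_w_d  # E-step: P(z|d,w)
                    new_w_z[z][w] += n * resp
                    new_z_d[d][z] += n * resp
        # M-step: renormalize the expected counts.
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
        history.append(log_lik)  # log-likelihood before this update
    return p_w_z, p_z_d, history


# Toy data (illustrative only): 4 "images", 4 visual words, 2 latent topics.
counts = [
    [8, 7, 0, 1],
    [9, 6, 1, 0],
    [0, 1, 7, 8],
    [1, 0, 9, 6],
]
p_w_z, p_z_d, history = plsa_em(counts, num_topics=2)
print(history[0], history[-1])
```

The sketch assumes every document contains at least one word; a production implementation would also guard against numerical underflow and use a convergence test instead of a fixed number of iterations.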

Application

ABS-pLSA

Absolute position pLSA (ABS-pLSA) attaches location information to each visual word by localizing it to one of X "bins" in the image. Here, $\displaystyle x$ represents which of the bins the visual word falls into. The new equation is:

$\displaystyle P(w,x|d)=\sum _{z=1}^{Z}P(w,x|z)P(z|d)$

$\displaystyle P(w,x|z)$ and $\displaystyle P(z|d)$ can be solved for in a manner similar to the original pLSA problem, using the EM algorithm.

A problem with this model is that it is not translation or scale invariant. Since the positions of the visual words are absolute, changing the size of the object in the image or moving it would have a significant impact on the spatial distribution of the visual words into different bins.
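The sensitivity to translation can be seen in a small sketch. The regular 3 × 3 grid, the function name `bin_index`, and the pixel coordinates below are assumptions chosen for illustration; the source only states that each visual word is assigned to one of X bins.

```python
def bin_index(x, y, width, height, grid=3):
    """Map an absolute pixel position to one of grid*grid bins (row-major order)."""
    col = min(int(x / width * grid), grid - 1)   # clamp the right edge into the last column
    row = min(int(y / height * grid), grid - 1)  # clamp the bottom edge into the last row
    return row * grid + col


# The same visual word before and after the object is shifted by 120 pixels
# in both directions inside a 300 x 300 image.
before = bin_index(50, 40, 300, 300)
after = bin_index(170, 160, 300, 300)
print(before, after)  # the bin changes although the feature itself is unchanged
```

Since the pair (word, bin) is what the model counts, moving the object changes the observed data even though its appearance is identical.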

TSI-pLSA

Translation and scale invariant pLSA (TSI-pLSA) extends pLSA by adding another latent variable, which describes the spatial location of the target object in an image. Now, the position $\displaystyle x$ of a visual word is given relative to this object location, rather than as an absolute position in the image. The new equation is:

$\displaystyle P(w,x|d)=\sum _{z=1}^{Z}\sum _{c=1}^{C}P(w,x|c,z)P(c)P(z|d)$

Again, the parameters $\displaystyle P(w,x|c,z)$ and $\displaystyle P(z|d)$ can be solved for using the EM algorithm. $\displaystyle P(c)$ can be assumed to be a uniform distribution.
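The effect of measuring positions relative to the object can be sketched as follows. The representation of the object location as a tuple (centre x, centre y, x scale, y scale), the 3 × 3 grid inside the bounding box, and the extra "background" bin for words outside the box are assumptions of this sketch, as is the function name `relative_bin`.

```python
def relative_bin(x, y, box, grid=3):
    """Bin a feature position relative to an object bounding box.

    box = (cx, cy, sx, sy): centroid and x/y extent of the hypothesised object.
    Returns a bin in 0..grid*grid-1 inside the box, or grid*grid for background.
    """
    cx, cy, sx, sy = box
    u = (x - cx) / sx + 0.5  # 0..1 across the box, left to right
    v = (y - cy) / sy + 0.5  # 0..1 across the box, top to bottom
    if not (0.0 <= u < 1.0 and 0.0 <= v < 1.0):
        return grid * grid  # outside the box: background bin
    return int(v * grid) * grid + int(u * grid)


# A feature on an object, then the same feature after the object (and its
# bounding box) is scaled by 2 and shifted by (200, 50).
bin_a = relative_bin(115, 85, (100, 100, 60, 60))
bin_b = relative_bin(430, 220, (400, 250, 120, 120))
print(bin_a, bin_b)  # same bin in both cases
```

Under these assumptions the bin depends only on where the word lies within the box, so translating or rescaling the object leaves the (word, bin) observations unchanged as long as the hypothesised location follows the object.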

Implementation

Selecting words

Words in an image were selected using 4 different feature detectors:^[1]

Kadir–Brady saliency detector
Multi-scale Harris detector
Difference of Gaussians
Edge based operator, described in the study

Using these 4 detectors, approximately 700 features were detected per image. These features were then encoded as scale-invariant feature transform (SIFT) descriptors, and vector-quantized to match one of 350 words contained in a codebook. The codebook was precomputed from features extracted from a large number of images spanning numerous object categories.
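Vector quantization of this kind is commonly done by assigning each descriptor to its nearest codeword. The sketch below assumes squared Euclidean distance and uses 2-dimensional toy descriptors with a 3-word codebook in place of real SIFT descriptors and the 350-word codebook; the function names are hypothetical.

```python
def nearest_codeword(descriptor, codebook):
    """Return the index of the codeword closest to the descriptor."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(descriptor, codebook[k]))


def bag_of_words(descriptors, codebook):
    """Count how often each visual word occurs among an image's descriptors."""
    histogram = [0] * len(codebook)
    for descriptor in descriptors:
        histogram[nearest_codeword(descriptor, codebook)] += 1
    return histogram


# Toy data (illustrative only).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
descriptors = [[0.1, 0.0], [0.9, 0.2], [0.8, -0.1], [0.1, 0.7]]
histogram = bag_of_words(descriptors, codebook)
print(histogram)
```

The resulting histogram is the per-image word count that a bag-of-words model such as pLSA takes as input.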

Possible object locations

One important question in the TSI-pLSA model is how to determine the values that the random variable $\displaystyle c$ can take on. It is a 4-vector, whose components describe the object's centroid as well as x and y scales that define a bounding box around the object, so the space of possible values it can take on is enormous. To limit the number of possible object locations to a reasonable number, normal pLSA is first carried out on the set of images, and for each topic a Gaussian mixture model is fit over the visual words, weighted by $\displaystyle P(w|z)$ . Up to $\displaystyle K$ Gaussians are tried (allowing for multiple instances of an object in a single image), where $\displaystyle K$ is a constant.

Performance

The authors of the Fergus et al. paper compared performance of the three pLSA algorithms (pLSA, ABS-pLSA, and TSI-pLSA) on handpicked datasets and images returned from Google searches. Performance was measured as the error rate when classifying images in a test set as either containing the object or containing only background.

As expected, training directly on Google data gives higher error rates than training on prepared data.^[1] In about half of the object categories tested, ABS-pLSA and TSI-pLSA perform significantly better than regular pLSA, and in only 2 of the 7 categories does TSI-pLSA perform better than the other two models.

OPTIMOL

OPTIMOL (automatic Online Picture collection via Incremental MOdel Learning) approaches the problem of learning object categories from online image searches by addressing model learning and searching simultaneously. OPTIMOL is an iterative model that updates its model of the target object category while concurrently retrieving more relevant images.^[3]

General framework

OPTIMOL was presented as a general iterative framework that is independent of the specific model used for category learning. The algorithm is as follows:

Download a large set of images from the Internet by searching for a keyword
Initialize the dataset with seed images
While more images are needed in the dataset:
- Learn the model with most recently added dataset images
- Classify downloaded images using the updated model
- Add accepted images to the dataset

Note that only the most recently added images are used in each round of learning. This allows the algorithm to run on an arbitrarily large number of input images.
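The loop can be sketched generically by passing the category model in as two functions. Everything concrete below is an assumption made for illustration: the function names, the processing of downloaded images in fixed-size batches, and the toy "model" (a running mean over scalar stand-ins for images) are not taken from the OPTIMOL paper.

```python
def optimol_loop(downloaded, seeds, update_model, accepts, target_size, batch_size=4):
    """Grow a dataset by alternating incremental learning and classification."""
    dataset = list(seeds)
    recent = list(seeds)  # images added in the previous round
    pool = list(downloaded)
    model = None
    while len(dataset) < target_size and pool:
        if recent:
            # Learn from the most recently added images only.
            model = update_model(model, recent)
        batch, pool = pool[:batch_size], pool[batch_size:]
        recent = [image for image in batch if accepts(model, image)]
        dataset.extend(recent)
    return dataset, model


def update_mean(model, new_items):
    """Toy incremental model: running (count, mean) of scalar 'images'."""
    count, mean = model if model is not None else (0, 0.0)
    for value in new_items:
        count += 1
        mean += (value - mean) / count
    return count, mean


def near_mean(model, value, tolerance=1.0):
    """Toy classifier: accept values close to the current mean."""
    return abs(value - model[1]) <= tolerance


seeds = [5.0, 5.2]
downloaded = [5.1, 9.0, 4.8, 0.3, 5.5, 5.3, 12.0, 4.9]
dataset, model = optimol_loop(downloaded, seeds, update_mean, near_mean, target_size=6)
print(dataset)
```

Because `update_model` receives the previous model together with only the newly accepted images, the cost of each round does not grow with the size of the dataset collected so far.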