Visual Turing Test
[[File:Sample questions.png|thumb|Selected sample questions generated by the query generator for a Visual Turing Test.]]
The Visual Turing Test is “an operator-assisted device that produces a stochastic sequence of binary questions from a given test image”.[1] The query engine produces a sequence of questions whose answers are unpredictable given the history of questions. The test is only about vision and does not require any natural language processing. The job of the human operator is to provide the correct answer to each question or to reject it as ambiguous. The query generator produces questions that follow a “natural story line”, similar to what humans do when they look at a picture.
History
Research in computer vision dates back to the 1960s, when Seymour Papert first attempted to solve the problem. This unsuccessful attempt was referred to as the Summer Vision Project. It did not succeed because computer vision is far more complicated than people had assumed. Its complexity mirrors that of the human visual system: roughly 50% of the human brain is devoted to processing vision, which indicates just how difficult the problem is.
Later there were attempts to solve the problem with models inspired by the human brain. The perceptron, introduced by Frank Rosenblatt, a form of neural network, was one of the first such approaches. These simple neural networks could not live up to expectations, and their limitations caused them to be set aside in subsequent research.
Later, with the availability of hardware and some processing power, research shifted to image processing, which involves pixel-level operations such as finding edges, de-noising images, or applying filters. There was great progress in this field, but the problem of vision, namely making machines understand images, was still not being addressed. During this time neural networks also resurfaced, as it was shown that the limitations of the perceptron could be overcome by multi-layer perceptrons. In the early 1990s convolutional neural networks were also born; they showed great results on digit recognition but did not scale up well to harder problems.
The late 1990s and early 2000s saw the birth of modern computer vision. One reason for this was the availability of key feature extraction and representation algorithms. Features, together with already existing machine learning algorithms, were used to detect, localise, and segment objects in images.
While all these advancements were being made, the community felt the need for standardised datasets and evaluation metrics so that performances could be compared. This led to the emergence of challenges such as the PASCAL VOC challenge and the ImageNet challenge. The availability of standard evaluation metrics and open challenges gave direction to the research. Better algorithms were introduced for specific tasks like object detection and classification.
The Visual Turing Test aims to give a new direction to computer vision research, one that would lead to systems that are one step closer to understanding images the way humans do.
Current Evaluation Practices
A large number of datasets have been annotated and generalised to benchmark the performances of different classes of algorithms on different vision tasks (e.g., object detection/recognition) over some image domain (e.g., scene images).
One of the most famous datasets in computer vision is ImageNet, which is used to assess the problem of object-level image classification. ImageNet is one of the largest annotated datasets available and contains over one million images. Another important vision task is object detection and localisation, which refers to detecting an object instance in an image and providing bounding box coordinates around it, or segmenting it. The most popular dataset for this task is the PASCAL VOC dataset. Similarly, there are other datasets for specific tasks, such as the H3D[2] dataset for human pose detection and the Core dataset for evaluating the quality of detected object attributes such as colour, orientation, and activity.
Having these standard datasets has helped the vision community come up with extremely well-performing algorithms for all these tasks. The next logical step is to create a larger task encompassing these smaller subtasks. Such a task would lead to systems that understand images, since understanding an image inherently involves detecting objects, localising them, and segmenting them.
Details
The Visual Turing Test (VTT), unlike the Turing Test, has a query engine system that interrogates a computer vision system in the presence of a human co-ordinator.
It is a system that generates a random sequence of binary questions specific to the test image, such that the answer to any question k is unpredictable given the true answers to the previous k − 1 questions (also known as the history of questions).
The test happens in the presence of a human operator who serves two main purposes: removing ambiguous questions and providing correct answers to the unambiguous ones. Given an image, infinitely many binary questions can be asked, and many of them are bound to be ambiguous. If the query engine generates such a question, the human operator removes it, and the query engine generates another question whose answer is unpredictable given the history of questions. A minimal sketch of such an interrogation loop is given below.
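Geman et al. define the unpredictability criterion formally; the following Python sketch only illustrates the shape of the loop under stated assumptions. The `Question` interface with an `evaluate(image)` method, the pool of annotated images used to estimate unpredictability, the `operator_answer` callback standing in for the human operator (returning None for an ambiguous question), and the tolerance `DELTA` are all hypothetical choices made for this example, not constructs from the paper.

```python
import random

DELTA = 0.15  # illustrative tolerance around 1/2; not a value from the paper


def is_unpredictable(question, history, annotated_images):
    """Call a question 'unpredictable' if, over the annotated images
    consistent with the answers given so far, the probability of a
    'yes' answer is close to 1/2."""
    consistent = [img for img in annotated_images
                  if all(q.evaluate(img) == a for q, a in history)]
    if not consistent:
        return False
    p_yes = sum(question.evaluate(img) for img in consistent) / len(consistent)
    return abs(p_yes - 0.5) <= DELTA


def visual_turing_test(test_image, candidate_questions, annotated_images,
                       operator_answer, num_questions=20):
    """One interrogation: pose only unpredictable questions, and let the
    human operator answer each one or reject it as ambiguous."""
    history = []
    pool = list(candidate_questions)
    random.shuffle(pool)  # a stochastic sequence of questions
    for question in pool:
        if len(history) == num_questions:
            break
        if not is_unpredictable(question, history, annotated_images):
            continue  # answer would be predictable given the history
        answer = operator_answer(test_image, question)  # None == ambiguous
        if answer is None:
            continue  # rejected by the operator; try another question
        history.append((question, answer))
    return history
```

In this sketch the machine under test never appears: the loop only produces the question sequence and the operator's ground-truth answers, against which a vision system's own answers would then be scored.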
The aim of the Visual Turing Test is to evaluate the image understanding of a computer system, and an important part of image understanding is the story line of the image. When humans look at an image, they do not think of a car at ‘x’ pixels from the left and ‘y’ pixels from the top; instead they see a story, for example that a car is parked on the road and a person is exiting the car and heading towards a building. The most important elements of a story line are the objects, so to extract any story line from an image the first and most important task is to instantiate the objects in it, and that is what the query engine does.
Query Engine
The query engine is the core of the Visual Turing Test, and it comprises two main parts: vocabulary and questions.
Vocabulary
The vocabulary is a set of words that represent elements of the images. This vocabulary, when used with an appropriate grammar, leads to a set of questions. The grammar is defined so that it produces a space of binary questions.
The vocabulary consists of three components:
- Types of objects
- Type-dependent attributes of objects
- Type-dependent relationships between two objects
For images of urban street scenes, the types of objects include people, vehicles, and buildings. Attributes refer to properties of these objects: for example, female, child, wearing a hat, or carrying something for people, and moving, parked, stopped, one tire visible, or two tires visible for vehicles. Relationships between each pair of object classes can be either “ordered” or “unordered”. Unordered relationships may include talking or walking together, and ordered relationships include taller, closer to the camera, occluding, being occluded, etc. A sketch of how this vocabulary might be encoded is given below.
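As an illustration, the three vocabulary components above could be encoded along the following lines. All names and the question template here are examples chosen for this sketch, not structures from the paper.

```python
# Object types for urban street scenes.
OBJECT_TYPES = {"person", "vehicle", "building"}

# Type-dependent attributes of objects.
ATTRIBUTES = {
    "person": ["female", "child", "wearing a hat", "carrying something"],
    "vehicle": ["moving", "parked", "stopped",
                "one tire visible", "two tires visible"],
}

# Type-dependent relationships between pairs of object types:
# unordered relationships are symmetric, ordered ones are not.
RELATIONSHIPS = {
    ("person", "person"): {
        "unordered": ["talking", "walking together"],
        "ordered": ["taller", "closer to the camera", "occluding"],
    },
}


def attribute_questions(obj_type):
    """Yield the binary attribute questions for one object type."""
    for attribute in ATTRIBUTES.get(obj_type, []):
        yield f"Is the {obj_type} {attribute}?"


for q in attribute_questions("vehicle"):
    print(q)  # e.g. "Is the vehicle parked?"
```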
[[File:Wregions.png|thumb|Sample regions used as context in a Visual Turing Test. The image on the left shows regions 1/8th the size of the image; the one on the right shows regions 1/4th the size of the image.]]
Dataset
The images considered in the work of Geman et al.[1] are from the ‘Urban street scenes’ dataset,[1] which contains scenes of streets from different cities across the world. This is why the types of objects are constrained to people and vehicles in this experiment.
[[File:DatasetSample.png|thumb|Images of urban street scenes from the training data. The training data is a collection of such images, with scenes from different cities across the world.]]
Malinowski and Fritz use a different[3] dataset, which has real-world images of indoor scenes. They[4] propose a different version of the Visual Turing Test that takes a holistic approach and expects the participating system to exhibit human-like common sense.
[[File:Annotated image.png|thumb|Example annotations of a training image provided by the human workers.]]
Such systems might be able to perform well on the VTT.
References
- ^ a b c Geman, Donald; Geman, Stuart; Hallonquist, Neil; Younes, Laurent (2015). "Visual Turing test for computer vision systems". Proceedings of the National Academy of Sciences. 112 (12): 3618–3623.
- ^ "H3D". www.eecs.berkeley.edu. Retrieved 2015-11-19.
- ^ Malinowski, Mateusz; Fritz, Mario (2014-10-29). "Towards a Visual Turing Challenge". arXiv:1410.8027 [cs].
- ^ Cite error: the named reference :1 was invoked but never defined.