
Visual Turing Test

From Wikipedia, the free encyclopedia

Computer vision research is driven by standard evaluation practices. Current systems are tested on their accuracy at tasks such as object detection, segmentation and localisation. Methods such as convolutional neural networks perform well on these benchmarks, but current systems remain far from the ultimate goal of understanding images the way humans do. Motivated by the human ability to understand an image and even tell a story about it, Geman et al. introduced the Visual Turing Test for computer vision systems.

As described in [1], it is “an operator-assisted device that produces a stochastic sequence of binary questions from a given test image”. The query engine produces a sequence of questions whose answers are unpredictable given the history of questions. The test concerns vision only and does not require any natural language processing. The job of the human operator is to provide the correct answer to each question or to reject it as ambiguous. The query generator produces questions that follow a “natural story line”, similar to what humans do when they look at a picture.
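The protocol can be illustrated with a minimal sketch: a query engine proposes binary questions, a human operator either supplies the ground-truth answer or rejects the question as ambiguous, and the system under test answers the same questions. The sketch below is not the authors' implementation; the question generator, operator and vision system are reduced to trivial stand-ins so that only the shape of the loop is shown.

```python
# Minimal sketch of the Visual Turing Test protocol described above.
# generate_question, operator_answer and DummyVisionSystem are hypothetical
# stand-ins, not part of the published test.

import random


def generate_question(image, history):
    """Hypothetical query engine: propose a binary question about the image."""
    templates = [
        "Is there a person in the designated region?",
        "Is there a vehicle in the designated region?",
        "Are the two people in the designated regions interacting?",
    ]
    return random.choice(templates)


def operator_answer(image, question):
    """Hypothetical human operator: True/False, or None if the question is ambiguous."""
    return random.choice([True, False, None])


class DummyVisionSystem:
    """Stand-in for the computer vision system under test."""

    def answer(self, image, question):
        return random.choice([True, False])


def visual_turing_test(image, vision_system, num_questions=20):
    """Pose a stream of binary questions and score the system's answers."""
    history = []
    for _ in range(num_questions):
        question = generate_question(image, history)
        truth = operator_answer(image, question)
        if truth is None:          # operator rejects the question as ambiguous
            continue
        prediction = vision_system.answer(image, question)
        history.append((question, truth, prediction))

    correct = sum(1 for _, t, p in history if t == p)
    return correct, len(history)


if __name__ == "__main__":
    score, asked = visual_turing_test("street_scene.jpg", DummyVisionSystem())
    print(f"{score}/{asked} unambiguous questions answered correctly")
```

In the actual test the query engine is constrained to pick questions whose answers are unpredictable given the history of earlier answers, which the random stand-in above does not attempt to model.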

History

Research in computer vision dates back to the 1960s, when Seymour Papert first attempted to solve the problem in what became known as the Summer Vision Project. The attempt was unsuccessful because computer vision is far more complicated than it first appears. The difficulty mirrors the human visual system: roughly 50% of the human brain is devoted to processing vision, which indicates how hard the problem is.

Later there were attempts to solve the problem with models inspired by the human brain. The perceptron of Frank Rosenblatt, an early form of neural network, was one of the first such approaches. These simple networks could not live up to expectations, and their limitations led to them being set aside in subsequent research.

Later, with the availability of better hardware and processing power, research shifted to image processing, which involves pixel-level operations such as finding edges, de-noising images and applying filters. Great progress was made in this field, but the core problem of vision, making machines understand images, was still not being addressed. During this time neural networks resurfaced, as it was shown that the limitations of the perceptron could be overcome by multi-layer perceptrons. In the early 1990s convolutional neural networks were also introduced; they showed great results on digit recognition but did not scale up well to harder problems.

The late 1990s and early 2000s saw the birth of modern computer vision. One of the reasons was the availability of key feature extraction and representation algorithms. Features, combined with existing machine learning algorithms, were used to detect, localise and segment objects in images.

While these advances were being made, the community felt the need for standardised datasets and evaluation metrics so that performances could be compared. This led to the emergence of challenges such as the PASCAL VOC challenge and the ImageNet challenge. The availability of standard evaluation metrics and open challenges gave direction to the research, and better algorithms were introduced for specific tasks such as object detection and classification.

The Visual Turing Test aims to give a new direction to computer vision research, leading to systems that are one step closer to understanding images the way humans do.

Current Evaluation Practices

A large number of datasets have been annotated and released to benchmark the performance of different classes of algorithms on different vision tasks (e.g., object detection and recognition) over some image domain (e.g., scene images).

One of the best-known datasets in computer vision is ImageNet, which is used to assess the problem of object-level image classification. ImageNet is one of the largest annotated datasets available and has over one million images. Another important vision task is object detection and localisation, which refers to detecting an object instance in the image and providing the coordinates of a bounding box around it, or segmenting the object. The most popular dataset for this task is the PASCAL VOC dataset. Similarly, there are datasets for more specific tasks, such as the H3D dataset for human pose detection and the CORE dataset for evaluating the quality of detected object attributes such as colour, orientation and activity.
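For localisation, benchmarks such as PASCAL VOC typically accept a predicted bounding box when its intersection-over-union (IoU) with the ground-truth box reaches 0.5. The sketch below illustrates that criterion under the assumption that boxes are given as (x_min, y_min, x_max, y_max) tuples; it is a simplified illustration rather than any benchmark's reference scoring code.

```python
# Sketch of the intersection-over-union (IoU) overlap measure used to score
# object localisation; a detection is commonly accepted when IoU >= 0.5.
# Boxes are assumed to be (x_min, y_min, x_max, y_max) tuples.

def iou(box_a, box_b):
    """Return the intersection-over-union of two axis-aligned boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


predicted = (48, 40, 160, 200)       # hypothetical detector output
ground_truth = (50, 50, 150, 190)    # hypothetical annotation
print(iou(predicted, ground_truth) >= 0.5)   # True: overlap is about 0.78
```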

Having these standard datasets has helped the vision community produce extremely well-performing algorithms for all of these tasks. The next logical step is to create a larger task encompassing these smaller subtasks. Such a task would lead to systems that understand images, since understanding an image inherently involves detecting, localising and segmenting the objects it contains.

References

  1. Geman, Donald; Geman, Stuart; Hallonquist, Neil; Younes, Laurent (2015-03-24). "Visual Turing test for computer vision systems". Proceedings of the National Academy of Sciences. 112 (12): 3618–3623. doi:10.1073/pnas.1422953112. ISSN 0027-8424. PMC 4378453. PMID 25755262.