Irregular Webcomic! #3309

ICCV 2013, Sydney.

During the past week (as this is published), I have been attending the 2013 International Conference on Computer Vision, known in the trade as ICCV. This is the major conference in the field and is held biennially in a different city around the world, and this year it happened to be held in Sydney. Since this meant my work could send me to the conference without any travel or accommodation costs, I got the chance to attend, even though my work is only tangentially related to computer vision.

What is computer vision? It is basically using computation and image processing to extract information from images, usually images collected by digital cameras. Most of us don't think about this much, but it is more common than you may realise.

A major application is surveillance. Surveillance cameras are used to keep track of what is happening at places like airports, train stations, banks, busy traffic intersections, shopping malls, or even small individual installations like shops or home security. All of these cameras produce video streams that are, for the most part, incredibly boring. Nothing of interest may happen for days or even months at a time. And a place like an airport may have hundreds of cameras. You really don't want to have to sit and watch all of the resulting video, and neither do you want to pay someone to do it, because it's so boring that when something interesting does happen, your watcher might have fallen asleep.

Instead, you can have a computer do the watching. This is computer vision.

What sort of things does a computer need to look for in a surveillance video? One important job is to keep an eye out for left luggage. If a person carries a bag into an airport, then walks off and leaves it sitting somewhere, then that's something that airport security should know about. So you want your computer to have an algorithm that detects objects moving through a scene, and detects when they stop moving. But it has to do more than that. Humans move through an airport and then often sit for a long time in one position. And they carry luggage and leave it sitting still next to them. You only want your computer vision program to call security if the person leaves the seat and leaves the luggage behind. So you need to be able to distinguish between people and luggage and make some decisions about whether a piece of luggage is being attended or has been abandoned.
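To make the decision logic a bit more concrete, here is a minimal sketch in Python. It assumes the hard parts are already done: some earlier stage has detected each object, labelled it as a person or a piece of luggage, and given it a tracking number that stays the same from frame to frame. The tuple layout, the function name find_abandoned, and the parameters near and patience are all invented for this illustration, not taken from any real surveillance system.

```python
from math import hypot

# Each detection is a tuple (track_id, kind, x, y), where kind is
# "person" or "luggage". This layout is a made-up assumption for the sketch.

def find_abandoned(frames, near=2.0, patience=3):
    """Return the track ids of luggage left with no person within `near`
    distance units for at least `patience` consecutive frames."""
    unattended = {}   # luggage track_id -> consecutive unattended frames
    alerts = set()
    for detections in frames:
        people = [d for d in detections if d[1] == "person"]
        for bag in detections:
            if bag[1] != "luggage":
                continue
            bag_id, _, bx, by = bag
            # A bag counts as attended if any person is close enough to it.
            attended = any(hypot(px - bx, py - by) <= near
                           for _, _, px, py in people)
            unattended[bag_id] = 0 if attended else unattended.get(bag_id, 0) + 1
            if unattended[bag_id] >= patience:
                alerts.add(bag_id)
    return alerts

# A person sits next to a bag for two frames, then walks away.
bag = (2, "luggage", 1.0, 0.0)
frames = [
    [(1, "person", 0.0, 0.0), bag],
    [(1, "person", 0.0, 0.0), bag],
    [(1, "person", 5.0, 0.0), bag],
    [(1, "person", 10.0, 0.0), bag],
    [bag],
]
abandoned = find_abandoned(frames)
```

In this toy example the bag is flagged only on the fifth frame, once it has been on its own for three frames in a row. A real system would have to cope with misclassified objects, people walking past unrelated bags, and objects that are temporarily hidden from view, which is where most of the difficulty lies.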

Another application: Face recognition demonstration.

All of this is not easy. In fact, it's so hard that there are conferences on computer vision every year where researchers swap details of the latest algorithms and methods for doing this sort of stuff. And this is only one aspect of computer vision. Other applications include:

Robotics. Mobile machines need to be able to sense their environment and objects of interest. This is not as esoteric as you might think. Industrial robots now assemble our cars and other machines for us, and they need to be able to recognise car parts, manipulate their arms to pick them up, orient them in the right configuration, and attach them to other parts which they also need to sense. Robots are also commonly used in warehouses to store and retrieve inventory items. We're also on the cusp of domestic robots, with things like the Roomba vacuum cleaner autonomously navigating rooms as it cleans (though Roombas do it with simple sensors rather than imaging cameras).

Manufacturing. Things produced in a factory need to be inspected for safety and quality control. Humans can do this job, but it's tedious and can be prone to error if people get bored. Instead, you can take images of your components and end products and have a computer scan them for any defects. This could be just a visible light scan, or it may also incorporate other modalities, such as x-rays, infra-red, or ultrasound. All of these methods of scanning produce images, and computer vision techniques are needed to extract the relevant information from them.

Navigation. Knowing where you are can be done in gross terms with GPS, but it can't tell you about obstacles in your path such as trees, cliffs, humans, or traffic. Some vehicles already have camera systems to collect more information than a human driver/pilot can observe on their own, and can generate warning or advice information for the vehicle operator. Autonomous vehicles are starting to become possible as well thanks to computer vision techniques used to navigate dynamic environments such as public roads.

Another application: 3D scanning of objects. Perhaps to make 3D printing plans.

Health care. Some aged or infirm people can benefit from monitoring, in the form of cameras and software that raises an alert if it detects a fall or other potentially dangerous situation.

Human interaction. This can be for fun like video games such as those enabled by Microsoft's Kinect system, or for more serious applications like controlling the home environment, computer interface, or enabling communication with disabled people.

Animal or landscape tracking. Biological research often needs to track animals. Doing it automatically with a camera and computer can save a lot of gruntwork. Similarly you can map landscape regions in three dimensions and identify any changed objects.

Medical imaging. Medical images need to be interpreted by a skilled pathologist. This is a bottleneck in diagnosing many conditions. Computers can help by performing preliminary evaluation and identification of suspicious regions of medical images for human scrutiny.

Archaeology and arts. Scanning objects in the field or in a museum can provide important data for analysis and understanding. You can build an accurate three-dimensional model from photos or video.

Visual matching. Given an image, figure out something about it, or find other images that are similar somehow. Part of this is what Google Image Search does, but there are much wider aspects, such as identifying where or when a photo was taken, identifying animal or plant species from images, or sorting your photo collection and identifying common people or places to group them into events.

Another application: Species identification.

Entertainment and sports. Motion capture for computer generated special effects is now an important part of movie making, and is a classic use of computer vision methods. Sport has gotten in on the action, with automated tracking and analysis of player movements which can assist coaches in statistical analysis, tactical planning, and individual player technique coaching. Computer vision can also provide near real time assistance for sport officials, to help them make in-game decisions.

There are many other applications as well (I can't think of them all, because more are being thought of all the time!). Getting computers to analyse image data for all of these purposes is highly non-trivial. Vision is one of those things that human beings can do much better than computers. By "vision", I mean interpreting image data to gain understanding of what is going on in a scene. A human can look at a scene and pretty much instantly determine several things about it: whether it is indoors or outdoors, if any people are present, roughly how many people there are, the positions and sizes of various objects in the scene, the identifications of those objects (a chair, a table, a dog, etc.), and so on.

Humans are incredibly good at recognising objects they can see. Infants can do it. What's more, humans are incredibly good at classifying objects they can see into meaningful semantic classes. An infant can see a dog and declare that it is a dog. They can see another dog, of a totally different breed to any dog they have ever seen before, and still recognise it as a dog. They can see a stuffed toy dog, and they still recognise that it is, in some sense, a "dog", while at the same time recognising that it is in another sense not the same thing as a living dog.

Identifying objects and classifying them in ways which allow us to understand what is happening around us is such a basic human skill that we are mostly unaware of just how amazing this ability is. People have been working on making computer algorithms capable of doing the same thing for decades and the problem is not yet solved. The difference (well, one of the differences - a computer vision researcher will be able to rattle off dozens of differences) is the amount of contextual information that humans and computers have available.

Tracking moving objects in real time, demo.

Naively, one might imagine that computers have an advantage in volume of knowledge. After all, Google can find almost anything. But that operates on a vast network of computers with enormous stores of data, and it's still pretty dumb when it comes to making the sorts of connections that a human makes between things. It is actually humans who have the overwhelming advantage in contextual knowledge. We can tell the difference between a cat and a dog at a glance, without having to think consciously about it at all. We can differentiate a table, a desk, a chair, a stool, a bedside table, a chest of drawers, and so on with ease. We can recognise a person as a specific individual, we can recognise females and males, we can estimate people's ages, and we can make very good judgements about people's emotional states just by looking at them. We can recognise a car of any model as a mode of transport, know roughly how many people it will carry, and how big it is. Just by looking at it. We can look at a busy scene in a city and instantaneously parse it into streets, cars, buses, taxis, footpaths, people, buildings, trees, dogs, rubbish bins, newspaper stands, hot dog carts, bus stops, road signs, benches, doors, windows, fire hydrants, traffic lights, street lights, advertising signs, bicycles, pigeons, motorcycles, and on and on and on. Just by looking at it.

For a computer to do even a small part of this, it has to process the data contained in the pixels of an image it captures (with a camera). One approach is to search through the image pixels for what are known as "features". These are simple things that can be detected in small regions of just a few pixels, such as edges and corners, where colours or brightness change suddenly. You can then look for groupings of nearby features that might fit the shape of a known object, such as a chair, or a person. But to do this, you need a large database of what chairs or people look like when you examine them in terms of "feature space". And in this limited feature space, it's possible, in fact very likely in many cases, that sometimes the features of a chair will match better to the database of people, or vice versa.
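As a rough illustration of the simplest kind of feature, here is a small Python sketch that marks pixels where the brightness changes suddenly. It is a deliberately crude stand-in for the edge detectors used in practice: the function name edge_pixels, the threshold value, and the tiny hand-made image are all assumptions made up for this example.

```python
def edge_pixels(image, threshold=50):
    """Return the set of (row, col) positions where brightness changes
    sharply compared with the pixel to the right or the pixel below."""
    rows, cols = len(image), len(image[0])
    edges = set()
    for r in range(rows - 1):
        for c in range(cols - 1):
            dx = image[r][c + 1] - image[r][c]   # change moving right
            dy = image[r + 1][c] - image[r][c]   # change moving down
            if abs(dx) + abs(dy) > threshold:
                edges.add((r, c))
    return edges

# A 4x4 greyscale image: dark on the left half, bright on the right half.
image = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]
edges = edge_pixels(image)
```

Here the only pixels flagged are the ones in column 1, just to the left of the jump from dark to bright. Grouping such edge pixels into shapes, and then matching those shapes against a database of known objects, is the much harder next step.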

Computer vision is a big and active research field.

I'm simplifying a lot here, because computer vision has been an active research field for many years and there are many techniques of varying complexity for matching up images to contextual information about what objects are in the image. The main problem is that a chair, for example, looks very different when viewed from different angles, in different illumination, when partly obscured by another object, and for different models of chairs. It is impractical to store all of these various possibilities in a database, so you need to take shortcuts.

You might imagine that you could make a three dimensional model of a chair and just store that, then match it up using various angles of view and potential obscuring objects. This sort of thing can help, but only for a single model of chair. Humans have a much more contextual model of what a chair is. It is defined by a function, supporting a sitting human, rather than a specific shape. So our brains have a model of a chair that is not constrained in ways that a computer model of a chair is.

Given these sorts of recognition problems, it is amazing that computer vision works as well as it does, although there is also the sense that it works far less well than human vision. Attending a conference like ICCV, one realises that computer algorithms that contribute to general artificial intelligence (as opposed to performance in a restricted field, such as searching web pages) have a very long way to go before they can approach what humans can do. And this is why things like computer vision are such an active area of research.