Irregular Webcomic!

No. 3322   2014-03-09

Comic #3322

1 {two slightly different photos of the Seattle Space Needle}
1 Caption: A or B?

This strip's permanent URL: http://www.irregularwebcomic.net/3322.html

{Marktplatz A / Marktplatz B: same photo, processed differently. Which do you prefer?}
One of the interesting trends I noticed at the Electronic Imaging conference I attended a few weeks ago was a change in the way that testing of user preferences is done in imaging research. There's a good nugget of scientific method behind this, and a cool story of how technological change can drive adoption of new research techniques.

User preference testing[1] is an age-old method of gathering quantitative data on how various factors affect human perception and response. It's a realm in which "hard" science, in the form of mathematical analysis of physical properties such as colour, contrast, and noise level in images, necessarily meets the "softer" science of psychological reactions to those properties. You can get a mathematically determined answer to a question like "what is the contrast level in this image?", but it's much less clear how to do that, or even whether you can, for a question like "how good is this image?" or "what is the image quality of this image?" This intersection of psychology and physics is known as psychophysics.

{Hofkirche / Hofkirche revised. Which do you prefer?}
As an example, there is a well known effect in imaging science research that shows up when you artificially degrade an image in various ways. In general terms, an image is degraded if the colour and brightness values of its pixels don't match those of the original image, and the greater the difference, the more degraded the image is considered to be. When the change varies randomly from pixel to pixel, it's a type of degradation known as noise, which appears as a speckly fuzz over the image. Now consider an image of a landscape, with grass and trees and blue sky. If you add a little bit of noise to the sky, it is highly noticeable and disturbing, and people will generally call that a "low quality" image. But you can add quite a lot of noise to the grass and trees, much more than you added to the sky, and people won't even notice it. They'll happily call it a "high quality" image, even though mathematically and statistically speaking it has a significantly lower "image quality" (by a simplistic definition of those words) than the original.
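
To make that concrete, here is a minimal sketch (assuming NumPy; the image and its split into "sky" and "foliage" regions are invented for illustration) showing that a naive, location-blind error measure like mean squared error can rate the barely-noticeable foliage noise as worse degradation than the glaringly obvious sky noise:

    import numpy as np

    rng = np.random.default_rng(42)

    # Hypothetical 100x200 greyscale landscape: top half smooth "sky",
    # bottom half textured "foliage".
    original = np.vstack([
        np.full((50, 200), 200.0),            # flat, bright sky
        rng.uniform(40.0, 120.0, (50, 200)),  # busy grass and trees
    ])

    def add_region_noise(image, sky_sigma, foliage_sigma):
        """Add Gaussian noise of different strengths to the two regions."""
        noisy = image.copy()
        noisy[:50] += rng.normal(0.0, sky_sigma, (50, 200))
        noisy[50:] += rng.normal(0.0, foliage_sigma, (50, 200))
        return np.clip(noisy, 0.0, 255.0)

    def mse(a, b):
        """Mean squared error: a simplistic, location-blind degradation score."""
        return float(np.mean((a - b) ** 2))

    noisy_sky     = add_region_noise(original, sky_sigma=5.0, foliage_sigma=0.0)
    noisy_foliage = add_region_noise(original, sky_sigma=0.0, foliage_sigma=15.0)

    # The foliage version scores as far more degraded, yet to a viewer the
    # speckled sky is the one that looks "low quality".
    print(mse(original, noisy_sky), mse(original, noisy_foliage))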

The complication here is that human visual perception is highly dependent on image context and features. As another example, if you geometrically distort a photo of a forest or a beach, you might not even realise anything has been done to the photo. But if you distort a photo of a person's face by the same amount, or even less, it is very noticeable, and often disturbing. So people can very easily prefer badly reproduced images over ones that are much more faithfully reproduced, depending on what they are images of.

{North Curl Curl Baths / Not so much vignetting. Which do you prefer?}
A purely statistical description of an image, then, can only give you some abstract measure of "image quality", which does not necessarily accord with how a human would judge the image. Mathematical objectivity and human subjectivity don't always agree. So how can we do better?

There are scientific methods of approaching these sorts of subjective questions. The questions involve human judgement, so the methods are experiments which also involve human judgement. The basic method of deciding which image looks "better" is known as pairwise comparison. In this method, a person is shown two versions of the same image, which have had different image processing operations performed on them (one of the operations may be "nothing"), and asked to decide which one they prefer. This can be done either by showing them both images at once, or by showing just one at a time and allowing them to flip back and forth. (Other factors, such as the size of the display or time constraints, may dictate which of these methods is used.)

Asking one person which image they like better doesn't necessarily get you the "right" answer. Individuals have different preferences. One person might prefer versions of an image with bright, vivid colours in general, while someone else might prefer more muted, realistic colours, for example.[2] But if you repeat the experiment with a large number of people, you can build up a model of the statistical distribution - the proportions of people who prefer image A over image B, or vice versa, or who have no preference.
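
At its simplest, that aggregation step is just counting votes. Here's a toy sketch (the votes are invented) of turning many observers' pairwise choices into preference proportions:

    from collections import Counter

    # Hypothetical responses from many observers for a single image pair:
    # "A", "B", or "none" for no preference.
    votes = ["A", "B", "A", "A", "none", "B", "A", "A", "B", "A"]

    counts = Counter(votes)
    total = len(votes)
    for choice in ("A", "B", "none"):
        print(f"prefer {choice}: {counts[choice] / total:.0%}")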

{Photo challenge: out of the camera / Photo challenge: final image. Which do you prefer?}
To further improve the reliability of the statistics, you repeat the pairwise comparison with each observer several times, using the same modifications applied to different pairs of images, or the same pair of images modified in different ways, or both. This lets you build up a model of the preference distribution, which in turn lets you predict, for example, what percentage of people will prefer one image over the other (or show no preference) if X amount of noise, or some other modification, is applied to an image.
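
As a sketch of what such a predictive model might look like (the data points, the logistic curve shape, and the parameters are all invented, not taken from any particular study; NumPy and SciPy are assumed), you can fit a smooth psychometric function to the measured preference proportions at several noise levels and then read off a prediction at a level you never tested:

    import numpy as np
    from scipy.optimize import curve_fit

    # Invented data: amount of noise added to image B, and the measured fraction
    # of observers who preferred the clean image A over the noisy image B.
    noise_levels = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])
    prefer_clean = np.array([0.50, 0.53, 0.61, 0.74, 0.88, 0.95])

    def psychometric(x, midpoint, slope):
        """Preference for the clean image: 0.5 (a coin flip) at zero noise,
        approaching 1.0 once the noise is obvious to everyone."""
        return 0.5 + 0.5 / (1.0 + np.exp(-slope * (x - midpoint)))

    params, _ = curve_fit(psychometric, noise_levels, prefer_clean, p0=[5.0, 1.0])

    # Predict the preference fraction at a noise level that was never tested.
    print(psychometric(7.0, *params))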

A practical application of all this is image and video compression technology. Images contain lots of data, and so take up lots of storage space in computer hardware - more than almost any other type of commonly stored file. If you can reduce the amount of data needed to store an image, you can squeeze more into a given amount of data storage. So there's a strong incentive to compress image data, that is, to recode it so that the image can be reconstructed from fewer bits of data. You can't get something for nothing, so something's gotta give. Rather than reconstruct the image perfectly, the most efficient compression algorithms actually throw away some of the data necessary for a perfect reconstruction, and concentrate on coding the data needed to reconstruct the image so that to a person it looks "good enough".[3] Deciding which bits of the image are necessary and which can be discarded is done by reference to user testing models, which tell you which features in an image humans notice least when they are slightly degraded. This is the principle behind the popular JPEG image file format (though I've left out much of the technical detail - maybe another day). Lossy compression is also used for the ubiquitous MP3 audio coding format.
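
Here's a small sketch of that trade-off using the Pillow imaging library ("photo.jpg" is just a placeholder for any photo you have handy): saving the same image at lower and lower JPEG quality settings produces smaller and smaller files, because more of the visually less important data is being thrown away.

    import io
    from PIL import Image  # the Pillow imaging library

    image = Image.open("photo.jpg")  # placeholder: any photo will do

    for quality in (95, 75, 50, 25):
        buffer = io.BytesIO()
        # The "quality" setting controls how aggressively JPEG discards detail.
        image.save(buffer, format="JPEG", quality=quality)
        print(f"quality {quality}: {len(buffer.getvalue()) / 1024:.0f} kB")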

{Bridesmaid 2a, duotone / Bridesmaid 2a, colour. Which do you prefer?}
Anyway, that's all kind of background. The interesting trend I mentioned in the opening sentence above is in how the observer testing is done. Obviously you want as large a sample of different people as you can get for a study like this. However, recruiting and testing people can be expensive and time consuming. Furthermore, image comparison testing has traditionally been done in specialised observing rooms, with precisely controlled lighting, no distractions, and the images either printed or displayed on a screen under carefully calibrated conditions. In the case of a screen, testing would be done on expensive, carefully calibrated monitors to ensure accurate colour reproduction and high contrast levels. All this equipment means you can only test one person at a time, unless you have the budget for multiple such set-ups, which then must all be precisely cross-calibrated to ensure matching viewing conditions.

The result is that these sorts of image preference tests have tended to be done with samples of just a few tens up to maybe a hundred observers. The statistics you get out of this can be useful, but they are prone to sampling errors due to the small sample size. Furthermore, many image preference scenarios are culture dependent. People of different ages, sexes, or ethnic or cultural backgrounds can respond to images quite differently. It is known, for example, that American observers have different preferences for the photographic reproduction of human skin tones than Japanese observers do. So you need to be careful how you choose your observers, how you split them into groups based on demographic data, and how you interpret your results with this in mind. This further serves to make the statistics less reliable, by reducing the sample size of each consistent population.

The new trend in observer testing is to go in very much the opposite direction: sample as many people as you can, without worrying too much about identically calibrated monitors and distraction-free environments. The way to do this is to take advantage of crowdsourcing, using the Internet to reach hundreds or thousands of people. You can either set up a website with a voluntary survey, or use a service such as Amazon's Mechanical Turk to recruit thousands of paid observers. The idea is that you can get meaningful statistics about images - even if you have no control whatsoever over the viewing conditions of the observers - by sampling a large enough number of people to average out whatever variation that causes.

{Hanging out in the Border Garden / Flowers and Kilts. Which do you prefer?}
I first saw this idea proposed by a researcher at the Electronic Imaging conference in 2011. The proposal was greeted with scepticism and almost shouted down as heresy. People in the audience said that the lack of controlled viewing conditions and monitor calibration meant that the results could never be meaningful or useful. You can't compare the image viewing preferences of a randomly selected IT professional in a comfortable office with a good monitor against those of a distracted housewife juggling two kids while surfing on a crappy 10-year-old monitor, they said, you just can't! It was an interesting conference session, with some quite raised voices and arguments, let me tell you.

Three years later, at the same conference, what did I see? I saw multiple research groups presenting different image preference studies, all of them using Mechanical Turk to recruit thousands of observers from across the Internet (and hence around the world). They had developed techniques to analyse the statistics they gathered in a meaningful way. Part of this is building control questions into the experimental samples. Unknown to the users, these control questions measure things like how well they can see subtle colour differences on their (uncontrolled) monitors, how repeatable their responses are over multiple trials of the same data, whether they are actually paying attention to the task rather than just clicking randomly to earn a few bucks, and some basic demographic information.
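
A minimal sketch of that screening step might look something like the following (the response format, the accuracy threshold, and the observer records are all invented for illustration; none of this comes from the studies presented):

    # Hypothetical crowdsourced records: each observer's results on the hidden
    # control questions (True = answered correctly) plus their real judgements.
    responses = {
        "observer_01": {"controls": [True, True, True, False], "votes": ["A", "B", "A"]},
        "observer_02": {"controls": [False, False, True, False], "votes": ["B", "B", "B"]},
        "observer_03": {"controls": [True, True, True, True], "votes": ["A", "A", "B"]},
    }

    MIN_CONTROL_ACCURACY = 0.75  # invented threshold

    def passes_screening(record):
        """Keep an observer only if they got enough control questions right."""
        controls = record["controls"]
        return sum(controls) / len(controls) >= MIN_CONTROL_ACCURACY

    kept = {name: rec for name, rec in responses.items() if passes_screening(rec)}
    discarded = 1 - len(kept) / len(responses)
    print(f"kept {len(kept)} observers, discarded {discarded:.0%} of the sample")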

{Pantheon / Pantheon, 11 years exactly. Which do you prefer? (Okay, this one is cheating slightly...)}
This filtering discards maybe 10 to 20 percent of the data, but what's left is a large enough sample to average out the remaining noise and produce useful results. In fact some researchers even argue that the results are more meaningful than results gathered under highly controlled test conditions, because what you really want to find out about is how people react to images in the real world, not in a sterile grey testing chamber.

And the reaction to this methodology in 2014? Nobody batted an eyelid or raised any objection. In just three years the state of the art in this field of science has shifted dramatically, because somebody went out on a limb and tried something new, and got it to work. This last part is important. You can't propose wacky new ideas in science and expect to get them listened to if you can't demonstrate that they work. But if you can, then others will follow. Sometimes it takes much longer than three years, but sometimes a revolution in scientific thinking can happen very fast indeed.

And it's happening all the time, across all fields of science. Cool, huh?


[1] I would normally link to something like a Wikipedia article describing user preference testing, but the closest thing I could find was this article on preference testing, which is all about testing behaviour of non-human animals. The concept for testing human preferences is actually quite similar, though.

[2] This is a well known example of people preferring images in which some property differs a little from reality. Most (but not all) people actually like photos better when the colours are slightly more saturated (vivid or vibrant) than they are in real life. This can be a conundrum for camera manufacturers. Do they make cameras and software that reproduce colours faithfully, or boost the saturation a bit so that more people like the resulting photos?

Typically, they go for the slight boost in saturation that most people prefer. High end cameras often offer user settings which let you pick between "faithful" colour reproduction and some other settings with labels such as "vibrant" or even "normal". (There may also be a "black and white" or "monochrome" setting.) Usually the default is "normal", which is actually slightly boosted in saturation. You need to switch to the slightly duller "faithful" if you want accurate colour reproduction.

This is related to a psychophysical phenomenon called memory colours. These are the colours you associate with objects in your memory. Imagine you are given a set of paint chips of different shades of blue and asked to pick the one that matches a blue sky, but you need to do this in a closed room where you can't actually see the sky. Or given a set of greens and asked to pick the one that is the colour of grass. Or a set of oranges and asked to pick the one that matches the colour of a real orange. When people are asked to do this, they almost invariably pick a colour that is more saturated and vivid than the real object. In your memory, skies are bluer than they are in reality, grasses are greener, and oranges are oranger. Researchers studying this phenomenon propose that this is the reason most people prefer images that are slightly more saturated than reality - they better match our memory of reality - though we don't yet know why our brains remember colours as being more vivid than they really are.

[3] These are so-called lossy compression algorithms, because they lose some of the original image quality. There is also lossless compression, which takes advantage of typical statistical properties of the type of file being compressed to make most typical files smaller, but at the expense of requiring more data to encode atypical files. Lossless compression gives superior reproduction quality, but lossy compression produces significantly smaller file sizes, so both methods have their uses.
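
As a tiny illustration of the lossless side of that distinction, here's a sketch using Python's zlib module (the example data is made up): lossless compression reconstructs the input exactly, shrinks repetitive "typical" data dramatically, and can actually make incompressible "atypical" data slightly larger.

    import os
    import zlib

    repetitive = b"blue sky " * 1000   # highly redundant data compresses well
    random_ish = os.urandom(9000)      # random data is essentially incompressible

    for label, data in (("repetitive", repetitive), ("random", random_ish)):
        packed = zlib.compress(data)
        assert zlib.decompress(packed) == data  # lossless: exact reconstruction
        print(f"{label}: {len(data)} bytes -> {len(packed)} bytes")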

© 2002-2024 David Morgan-Mar (dmm@irregularwebcomic.net). This work is copyright and is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 4.0 International Licence.