As the accuracy and range of image detection increases, so will its uses. How good are computers at detecting faces, facial attributes and emotions…?
Late last year, Microsoft released a new set of APIs specialising in computer vision, providing the opportunity for anyone to experiment with them. The APIs launched under ‘Project Oxford’ include:
- Computer Vision – Understand images and generate thumbnails
- Face – Face and facial attribute detection
- Emotion – Recognise emotions
Computer vision is one of the hottest areas in artificial intelligence right now, with a range of start-ups and acquisitions appearing in the news and as apps on social networks over the past six months.
It seems many of the machine learning algorithms are based on the Facial Action Coding System (FACS) developed in the 1970s by Paul Ekman and Wallace V Friesen, updated in 2002 by Ekman, Friesen and Joseph C Hager.
FACS encodes the movements of individual facial muscles, deconstructing expressions into the specific Action Units (AUs) that combine to produce them. A large manual is available to purchase if you want to delve into the details; the following information is taken from the summary available on Wikipedia. A subset of FACS is the Emotional Facial Action Coding System (EMFACS), which considers only emotion-related facial actions. Examples include:
|Emotion|Action Units (AU)|AU Names|
|---|---|---|
|Happiness|6 + 12|Cheek Raiser, Lip Corner Puller|
|Sadness|1 + 4 + 15|Inner Brow Raiser, Brow Lowerer, Lip Corner Depressor|
|Surprise|1 + 2 + 5B + 26|Inner Brow Raiser, Outer Brow Raiser, Upper Lid Raiser (Slight), Jaw Drop|
|Fear|1 + 2 + 4 + 5 + 7 + 20 + 26|Inner Brow Raiser, Outer Brow Raiser, Brow Lowerer, Upper Lid Raiser, Lid Tightener, Lip Stretcher, Jaw Drop|
|Anger|4 + 5 + 7 + 23|Brow Lowerer, Upper Lid Raiser, Lid Tightener, Lip Tightener|
|Disgust|9 + 15 + 16|Nose Wrinkler, Lip Corner Depressor, Lower Lip Depressor|
|Contempt|R12A + R14A|Lip Corner Puller (Slight), Dimpler (Slight)|
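As a rough sketch, the EMFACS examples in the table above can be expressed as a simple lookup. The mapping is copied from the table; the helper function is purely illustrative, not part of any API:

```python
# EMFACS emotion -> Action Unit combination, copied from the table above.
# String entries like "5B" and "R12A" carry intensity/laterality codes.
EMFACS = {
    "Happiness": [6, 12],
    "Sadness": [1, 4, 15],
    "Surprise": [1, 2, "5B", 26],
    "Fear": [1, 2, 4, 5, 7, 20, 26],
    "Anger": [4, 5, 7, 23],
    "Disgust": [9, 15, 16],
    "Contempt": ["R12A", "R14A"],
}

def emotions_involving(au):
    """Return, sorted, the emotions whose AU combination includes the given Action Unit."""
    return sorted(emotion for emotion, aus in EMFACS.items() if au in aus)
```

For example, `emotions_involving(4)` (Brow Lowerer) returns `['Anger', 'Fear', 'Sadness']` — a hint of why so many of the coded emotions read as negative.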
The above emotions match seven of the eight emotions detected by the Project Oxford Emotion API. The eighth is ‘Neutral’, which I would guess is simply a catch-all, given the combined scores for all eight emotions add up to 1. For example, a ‘half-smile, half-grimace’ expression might be given a score of 0.4 for Happiness, 0.3 for Surprise, 0.2 for Contempt and 0.1 for Neutral.
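Because the scores behave like a probability distribution, reading off the dominant emotion is trivial. A minimal sketch in Python, using the hypothetical ‘half-smile, half-grimace’ scores from that example (the key names are assumptions for illustration, not copied from an actual API response):

```python
# Hypothetical scores for the 'half-smile, half-grimace' example.
scores = {
    "happiness": 0.4, "surprise": 0.3, "contempt": 0.2, "neutral": 0.1,
    "anger": 0.0, "disgust": 0.0, "fear": 0.0, "sadness": 0.0,
}

def dominant_emotion(scores):
    """Return the highest-scoring emotion and its score."""
    emotion = max(scores, key=scores.get)
    return emotion, scores[emotion]

# The eight scores form a distribution: they should sum to 1.
assert abs(sum(scores.values()) - 1.0) < 1e-9
```

Here `dominant_emotion(scores)` gives `("happiness", 0.4)` — note how close the runner-up can be, which matters when interpreting results.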
What’s interesting is that, of those seven emotions (ignoring the eighth, Neutral), five have perceived negative connotations: sadness, anger, fear, disgust and contempt. Only one, happiness, would be considered positive. Surprise could go either way and needs to be read alongside the scores for happiness versus fear or disgust. This emphasis on negativity should be taken into consideration when using the emotion algorithm.
Testing Project Oxford APIs
With that all in mind, I ran three experiments using the Project Oxford APIs to explore their accuracy:
- One person in one location, pulling different faces to express emotions
- Various people in different locations, each expressing ‘no emotion’
- One person in one location, expressing no emotion but altering facial ‘decorations’
I promised the individuals involved that their images would not be used and that no identifying details would be shared publicly. Participants ranged in age from roughly 20 to 60 and were an even split between male and female. Here’s a summary of the results.
Test 1. Same person with different expressions
Twelve photos of the same person, taken from the same angle within ten minutes, with a request to pull a different facial expression for each picture, ranging from happy to sad, angry to calm, excited to bored, thinking to blank.
Interestingly, the first observation is that the age estimate varies between the two APIs that detect faces and facial attributes – Computer Vision and Face. Computer Vision is quicker and provides ages in whole years. Face is much more compute-intensive, but dares to include decimal places, e.g. 34.7. The image below shows the accuracy of the age result for each API across the twelve photos. A value of 0 means the estimate matched the actual age; a negative value means the estimate came in younger, and a positive value older, than the actual age:
Overall, Computer Vision performed worse – as would be expected given it makes a much quicker calculation. But the range was quite shocking, spanning 27 years below the actual age and 25 years above it! The Face API was closer except for one image that was out of focus.
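For anyone reproducing the comparison, the measure behind the chart is simply the signed error: estimate minus actual age. A sketch with made-up numbers (not the real test data):

```python
def age_errors(estimates, actual_ages):
    """Signed error per photo: negative = guessed younger, positive = guessed older."""
    return [est - act for est, act in zip(estimates, actual_ages)]

# Invented estimates for a 35-year-old subject, purely for illustration.
errors = age_errors([23, 60, 41], [35, 35, 35])
spread = max(errors) - min(errors)  # the year-span of the estimates
```

With these invented numbers, `errors` is `[-12, 25, 6]` and `spread` is 37 years — the real Computer Vision results spanned an even wider 52-year range (27 below to 25 above).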
The Emotion API pretty much settled on either Happiness or Neutral for all the pictures. Admittedly, some wine had been consumed, so perhaps it wasn’t the best set-up for a rigorous test of emotions. But it gives an idea of the limitations of the range of emotions available. Laughter, joking, thoughtfulness, tiredness, intensity, confusion… none of these are considered. At best, they get merged into Happiness; at worst, they split between Neutral, Surprise and Contempt.
Test 2. Different people with same expression
For the second test, the request was for people to take an ‘unposed’ picture using just the webcam on their computer. Each person gave their age to test the accuracy of the API. The rough hypothesis: pictures shared in the media are posed or staged to some degree, so would the machine learning algorithms be biased against rough ‘mug shots’ with poor or noisy backgrounds?
Well… no. There was no evidence of any bias in the results – some ages came in high, others low. But there was another angle that exposed limitations with the Computer Vision API. Out of 8 images, 2 had the wrong gender associated with them. And in both cases, no human being would have made the mistake.
Gender detection was re-run using the Face API and it correctly labelled all images.
Age detection was a mixed bag. With one exception, if you were over the age of 35 the computer added to your age; if you were under, it subtracted. But, disappointingly for my various conspiracy theories, there was no correlation between the accuracy of age detection and gender, facial decoration (glasses or beards) or image quality.
In this test, the Face API barely performed better than the Computer Vision API. It was better for some and worse for others.
Test 3. Same person with same expression
For the final test, the subject was me: same person, using a webcam to take an ‘unposed’ picture, but changing facial accessories, i.e. hair loose, hair tied, with and without glasses and sunglasses. To my utter embarrassment, the APIs were not kind to me. The six images are displayed below, having had their resolution reduced significantly to ease my humiliation 🙂
So, how did I fare? Well, the APIs added between 3 and 15 years to my age 😦 It seems I need to smile if I want a computer-based guess to come in under my actual age:
But, as in the first test, the age guesses improved using the Face API compared to the Computer Vision API. And, also as in the first test, the inaccuracies were consistent. This was not the case in test 2. It suggests that, for fixed cameras at least, some calibration could be done to improve results.
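That calibration idea amounts to estimating a constant bias from a few known-age photos taken with the fixed camera, then subtracting it from future estimates. A sketch with invented numbers:

```python
from statistics import mean

def calibrate(estimates, actual_ages):
    """Estimate a constant bias for a fixed camera set-up; return a corrector function."""
    bias = mean(est - act for est, act in zip(estimates, actual_ages))
    return lambda estimate: estimate - bias

# Illustrative only: suppose the API consistently adds ~5 years in this set-up.
correct = calibrate([40, 46, 44], [35, 41, 39])
```

With that made-up bias, `correct(50)` returns 45. This only works because the errors were consistent within one set-up; test 2, with varied backgrounds, showed no such consistency.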
On to gender… oh gender…. Well the Computer Vision API correctly guessed I was a female in five of the images. But in the 6th (displayed on the far-right above), the result came back Male. And worse still, re-running with the supposedly more accurate Face API didn’t correct the error. Even the Face API thinks I’m a man in the sixth image. You can decide for yourself if you agree or not. Please don’t feel the need to share your opinion in the comments 🙂 I know I’m no oil painting but what scant ego I had is dwindling.
Finally, emotion. What’s interesting is that the Emotion API, despite my best efforts at pulling a neutral pose, detected traces of contempt in my expression. If you look back to the table above, the AUs for contempt focus on the right lip corner and dimple. I have a damaged muscle (it’s a genetic flaw, my relatives have it too) that affects the movement of the right-side of my lip… When I smile, it always looks wonky (and yes, I’m self-conscious enough about it to hate having my picture taken). Any psychologist, or generally-aware human being, would spot it in conversation or in video footage. But it’s much harder to know of its existence just from viewing a single static image…
So, from a brief play with Microsoft’s Project Oxford APIs for computer vision, face and emotion detection, using a very limited sample, what have I discovered?
- Age detection is pretty ropey. But consistent (whether the outcome is good or bad) when using the same background. So for a controllable environment like a photo booth, it would likely be possible to calibrate to improve results.
- The Computer Vision API is very fast but clearly does the least amount of work to understand specifics about faces. If getting the gender right matters, switch to the more compute-intensive Face API. But even then, the computer is still more easily fooled than people.
- The Computer Vision API is more sensitive at detecting the presence of faces, if not the accuracy of the content. It picked up a background face in one image that the Face and Emotion APIs missed.
- The emotions being detected emphasise the negative and a static image conveys less information than a motion picture or live subject. This should be a consideration when looking to use computers to detect emotions in real-life situations
The optimal solution would be to use the Computer Vision API to detect the presence of faces, but use the Face and Emotion APIs to clarify the details. But be careful using age and gender in sensitive situations given the computer is still more easily fooled than people when handling more ambiguous images. And be careful assuming the emotion detected is enough to draw a conclusion about the subject’s current feelings…
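That suggested pipeline can be sketched as a small orchestration function. The three callables below stand in for the Computer Vision, Face and Emotion APIs (in reality each would be an HTTP call with a subscription key); the stubs and their return shapes are illustrative assumptions, not the actual API responses:

```python
def analyse_faces(image, detect_faces, face_details, emotion_scores):
    """Fast, sensitive detection first; slower detail APIs only where a face was found."""
    results = []
    for rect in detect_faces(image):                        # Computer Vision: find faces quickly
        details = dict(face_details(image, rect))           # Face: slower, more accurate attributes
        details["rectangle"] = rect
        details["emotions"] = emotion_scores(image, rect)   # Emotion: scores per detected face
        results.append(details)
    return results

# Stub 'APIs' to show the shape of the orchestration:
faces = analyse_faces(
    "photo.jpg",
    detect_faces=lambda img: [(0, 0, 64, 64)],
    face_details=lambda img, r: {"age": 34.7, "gender": "female"},
    emotion_scores=lambda img, r: {"neutral": 1.0},
)
```

Keeping the API calls injectable like this also makes it easy to swap providers or test the pipeline without network access.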
- Microsoft Project Oxford APIs
- Apple buys Artificial Intelligence Start-up Emotient – Wall Street Journal, 7 January 2016
- Facial Action Coding System (FACS) – Wikipedia, as of 17 January 2016
Closing note: for the data scientists and coders, I plan to get some sample code up on GitHub. I’ll update here and tweet when it’s online.