Carnival masks

Algorithms based on training data will incorporate any biases in that training data, including a failure to account for missing data.


Research has been published about an algorithm that can determine if a person is homo- or heterosexual based purely on a photo of his/her face, claiming up to 91% accuracy and able to outperform humans.

At first blush, I think that claim is a little flawed.

The algorithm uses machine learning, specifically a deep neural network. And there’s the problem. Machine learning requires sample data in order to learn how to classify the data. The resulting classifier is only ever as good as the sample it has been built upon and samples often contain biases, particularly when the source data is created by human beings. The argument in favour of machine learning in the era of ‘big data’ is that, given a large enough sample, it will overcome and smooth out any biases. That argument has a weakness.

For this study, the sample used images that men and women had posted publicly on a US dating site. That means images of people willing to publicly state and visually display their sexual orientation with the aim of attracting a mate of either the same or different sex.

The sample consisted of approximately 8,000 men and 7,000 women, evenly divided between gay/lesbian and heterosexual profiles for each gender, between the ages of 18 and 40, all Caucasian and living in the US.

It found that gay men and women tended to have more gender-atypical features. In short, homosexual men were more likely to have feminine facial appearances than heterosexual men, and homosexual women were more likely to have more masculine appearances than heterosexual women. So far, so not that surprising. From the paper’s results:

…gay men had narrower jaws and longer noses, while lesbians had larger jaws…. gay men had larger foreheads than heterosexual men, while lesbians had smaller foreheads than heterosexual women.

Gay men had less facial hair, suggesting differences in androgenic hair growth, grooming style or both. They also had lighter skin, suggesting potential differences in grooming, sun exposure and/or testosterone levels. Lesbians tended to use less eye makeup, had darker hair and wore less revealing clothes, indicating less gender-typical grooming and style… lesbians smiled less …

When given a single image, the resulting classifier could correctly determine sexual orientation in 81% of cases for men, and 74% of cases for women. Human judges managed a lower rate of 61% for men and 54% for women. Hence the claim that the machines are out-performing the humans. When reviewing five images per person, the algorithm’s success rate went up to 91% and 83% respectively.

There are two concerns with that success rate. First, is sexuality a simple case of ‘gay’ or ‘straight’? What about those who consider themselves to be bisexual or asexual, or who change their sexual preference over time?

Second, what about people who hide their sexual preference because they live within deeply religious or paternalistic families who do not tolerate anything other than heterosexuality? What about people who look traditionally ‘straight’ or ‘gay’ but aren’t, and get frustrated by inappropriate advances? What about people who aren’t sure about their sexuality? Do you think any of these people will have put their images up on that US dating site? What about different races with varying facial bone structures and skin colours? What about testing using passport photos or police mug shots, which typically lack the kind of grooming and expression carefully crafted for photos used on dating websites? I wonder what the accuracy rate would be for those sorts of photos. From my own research, and that of others, face-detection algorithms struggle with darker skin colours and are easily fooled.

In short, the machine classifier appears to have been trained on a constrained and well-defined binary dataset (only two possible outcomes) and then tested on similar data. The humans will have judged the images based on their real-life experiences, which will likely have included interacting with people who are less open about their sexuality, less rigid about it, or who hide behind a persona that conforms to social stereotypes. That uncertainty and variation beyond a simple binary choice in life will be reflected in their results.

This reminds me of a popular story used to explain the danger in making decisions based on biased samples. The story of Abraham Wald.

Abraham Wald was a mathematician working in the Statistical Research Group (SRG) during World War II. The military wanted ideas for how to make fighter planes more resilient and efficient so that they could fly longer with less likelihood of being shot down. Adding more armour to protect against being shot makes planes heavier, which reduces their manoeuvrability, making them easier to hit, and shortens how long they can fly before needing to refuel.

Here were the statistics provided by the military:

Section of plane     Bullet holes per square foot
Engine               1.11
Fuselage             1.73
Fuel system          1.55
Rest of the plane    1.80

Source: Ellenberg, 2015

The expectation was to get a mathematical recommendation for how better to place armour to protect the most vulnerable parts of the plane. Based on the data provided, that was assumed to be around the fuselage.

Abraham Wald pointed out that more armour was needed to protect the engine.

Why? The data only represented the planes that were coming back.

Wald’s argument was that the data showed that the planes could tolerate a number of hits to the fuselage and still remain airborne. Planes that got hit in the engine weren’t surviving. Or, as Ellenberg bluntly clarifies in his book ‘How Not to Be Wrong’, you’ll see more people in hospital beds with bullet holes in their legs than in their chest. The latter will be in the morgue.
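Wald’s point can be reproduced with a small simulation. The sketch below is a toy model in Python, not Wald’s actual method: the section areas, the chance that a single hit brings a plane down, and the names (AREA_SQFT, LOSS_PROB_PER_HIT, simulate) are all made up for illustration. Hits land uniformly across the plane, yet the holes counted on returning planes are thinnest exactly where a hit is most dangerous.

```python
import random

# Hypothetical numbers for illustration only; none come from Wald's data.
AREA_SQFT = {"engine": 50, "fuselage": 150, "fuel system": 50, "rest": 250}
# Assumed chance that one hit in this section brings the plane down.
LOSS_PROB_PER_HIT = {"engine": 0.40, "fuselage": 0.03,
                     "fuel system": 0.15, "rest": 0.02}


def simulate(n_planes=20000, hits_per_plane=8, seed=0):
    """Return (holes per sq ft on all planes, holes per sq ft on survivors)."""
    rng = random.Random(seed)
    sections = list(AREA_SQFT)
    # Hits are spread uniformly over the surface, so bigger sections
    # collect proportionally more of them.
    weights = [AREA_SQFT[s] for s in sections]

    all_hits = dict.fromkeys(sections, 0)
    survivor_hits = dict.fromkeys(sections, 0)
    survivors = 0

    for _ in range(n_planes):
        hits = rng.choices(sections, weights=weights, k=hits_per_plane)
        # The plane returns only if every single hit fails to down it.
        survived = all(rng.random() >= LOSS_PROB_PER_HIT[s] for s in hits)
        for s in hits:
            all_hits[s] += 1
            if survived:
                survivor_hits[s] += 1
        if survived:
            survivors += 1

    density_all = {s: all_hits[s] / (n_planes * AREA_SQFT[s])
                   for s in sections}
    density_survivors = {s: survivor_hits[s] / (survivors * AREA_SQFT[s])
                         for s in sections}
    return density_all, density_survivors


if __name__ == "__main__":
    everyone, returned = simulate()
    for section in AREA_SQFT:
        print(f"{section:12s} all planes: {everyone[section]:.4f}   "
              f"returned planes: {returned[section]:.4f}")
```

Across all planes the hole density is roughly the same in every section. Among the planes that make it back, the engine shows the fewest holes per square foot, which is the pattern the military misread as ‘the engine needs the least armour’.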

So back to AI* having a better ‘gay-dar’ than people. It would be interesting to see how well it copes with the missing data: those who choose not to promote their sexuality in public or deliberately hide it behind a different persona, people of a different race and colour, let alone handling a hodgepodge of everyday photos of the same person. Because the other algorithms aren’t holding up too well. Face recognition struggles with facial furniture (beards, glasses…) as well as skin colour. Emotion APIs are easily tripped up by fake smiles and insult women who don’t conform to stock images. Computers are being fed biased data and cannot see behind the masks we wear to hide our true feelings.

…and the final observation. When evaluating the performance of an algorithm, you should always test it against the population base rate: how many people identify as gay or straight? The claim is that the algorithm performs with up to 91% accuracy. A quick search online for statistics suggests that somewhere between 1 and 10 percent of the population identifies as homo- or bisexual or transgender. So the baseline would be to simply classify all the photos as ‘straight’. That would yield between 90 and 99% accuracy without requiring any intelligence at all. Still sucks for the people incorrectly classified, but no worse than the trained algorithm.
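To make the base-rate arithmetic concrete, here is a minimal Python sketch. The function names are mine, and treating the headline 91% as both the sensitivity and the specificity of the classifier is an assumption made purely for illustration. The prevalence values are the rough 1 to 10 percent range quoted above.

```python
def baseline_accuracy(prevalence):
    """Accuracy of labelling every photo as the majority class ('straight')."""
    return 1.0 - prevalence


def precision_at_prevalence(sensitivity, specificity, prevalence):
    """Share of photos flagged as 'gay' that are flagged correctly."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # ASSUMPTION: 0.91 stands in for both sensitivity and specificity.
    assumed_rate = 0.91
    for prevalence in (0.01, 0.05, 0.10):
        print(f"prevalence {prevalence:.0%}: "
              f"always-'straight' accuracy {baseline_accuracy(prevalence):.0%}, "
              f"precision of flagged photos "
              f"{precision_at_prevalence(assumed_rate, assumed_rate, prevalence):.0%}")
```

Under those assumed numbers, at a 5% prevalence only about a third of the photos the classifier flags would be flagged correctly. It is also worth remembering that the study sample was evenly divided, so on that sample the always-‘straight’ baseline would score only about 50%; the 90 to 99% figure applies to a population with the prevalence quoted above.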

* I am not even going to start on why this is not an example of AI. It is a journalistic addition not used by the authors of the research.

Featured image: iStockphoto, not licensed for reuse.

Category:
Behaviour, Blog, Data Science

Join the conversation! 4 Comments

  1. I remember that story from a class on problem solving. One of the long-term problems with machine learning is that we humans may come to tolerate or accept errors as part of life. My wife notices far more grammatical and word-choice errors in news stories today. I pointed out that they might not be written by humans. I worry about the machines that diagnose better (95% vs 84%) of the time, taking over the process. I think I’d almost rather have the human doctor who might not get it right on the first pass, but is willing to listen and revise. Otherwise, I fear being in the 5%.

    Thanks for another thought-provoking post.

  2. Loving the fact you’re citing Wald. Changed my view on many things…

  3. And thanks Dan for another great comment!

    I’m with you. The worry with ceding too much power in decisions that affect people to machine learning, automation and formal models is that, whilst they may outperform humans at scale, they encourage humanity to become less diverse and I do not see that as progress. To be at the edges in a world judged by algorithms is to be mis-labeled and denied opportunities.

  4. Thanks Seb! And ditto… I always think back to that example when working with new data and problems, and these days when reading most of the news… What’s missing? It’s up there with ‘Who benefits’ and ‘Follow the money…’ when evaluating available/curated evidence 🙂
