
The biggest risk with relying on data is not its accuracy but treating something as certain when it may be anything but…
The previous post was somewhat mis-titled. There are often accuracy issues with data. A mistyped name here, or social security number there, accidentally registering a perfectly healthy person as dead on government systems. Mistakes are made but hopefully most are (eventually) found and fixed.
The bigger issue is that data are often presented as certain facts when they can be anything but, introducing risk within automated decision systems. In human decision-making, emotions can override the data: through gut instinct that something just isn’t as it appears, through awareness of the situation, or through simple empathy for other people. It’s not just ‘measure twice, cut once’, it’s ‘measure and think about whether this measurement makes sense before taking an action that will have consequences that cannot be undone’.
As soon as you start analysing data, you introduce some form of uncertainty. You make choices about what groups data belongs to in order to make comparisons. This is true whether the ‘you’ is a person or a machine.
To give a simple example: I am currently conducting a piece of research into mobile digital interactions in urban spaces. I have a data set for London generated from updates captured via the built-in sensors of mobile devices during March this year. The image below shows a partial plot (10%) of the individual records with coordinates that lie within the Greater London boundary. The grey lines represent individual neighbourhood boundaries – lower super output areas (LSOAs) – of which there are nearly 5,000 across London.
Image 1: Mobile data points across London
In order to conduct my analysis, I need to tag each record with the LSOA that it lies within. Fortunately, I can use a computer to do this, specifically the statistics software R. But my data set is too large for R to munch through in one go. (So it told me when it ran out of memory.) No problem. The source data has already been tagged by borough, so I can retrieve the data in slices and tag by LSOA within each borough. (London has 33 boroughs.)
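For readers who want to see the idea in code, here is a minimal sketch of tagging points with the area that contains them, one slice at a time. It is written in Python rather than R and is not the code used for this research. The ray-casting test, the function names and the two toy square 'areas' are all illustrative assumptions; real LSOA boundaries are far more complex and would normally be handled by a spatial library.

```python
from itertools import islice


def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    polygon is a list of (x, y) vertices; points exactly on an edge
    are not handled specially in this sketch.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray running right from the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def tag_points(points, areas):
    """Yield ((x, y), area_code); area_code is None if no area contains the point."""
    for x, y in points:
        code = next(
            (c for c, poly in areas.items() if point_in_polygon(x, y, poly)),
            None,
        )
        yield (x, y), code


def in_chunks(iterable, size):
    """Yield successive lists of at most `size` items, so memory use stays bounded."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk


# Hypothetical toy areas: two unit squares side by side (not real LSOAs).
areas = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "B": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
points = [(0.5, 0.5), (1.5, 0.25), (3.0, 3.0)]

tagged = []
for chunk in in_chunks(points, 2):  # process the data in slices
    tagged.extend(tag_points(chunk, areas))
```

The third point falls outside both squares and comes back tagged as `None`, which is exactly the kind of record worth inspecting rather than silently dropping.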
Here’s what happened.
Image 2: Lambeth before and after tagging with LSOA
The image above on the left shows data retrieved that had already been tagged as being located within the borough of Lambeth. Black dots are the data points. Grey lines are the boundaries between LSOAs within Lambeth. If you look closely, you will see the problem. The top left and right of the Lambeth borough are empty, whilst the points go beyond the borough boundaries to the right. Whatever method was used to tag the source data was incorrectly aligned.
Using R, I have recoded all the data and the image on the right shows the end result for Lambeth. It has picked up points that were previously coded for its north and western neighbours Westminster and Wandsworth, and has handed back points it had incorrectly claimed from eastern and southern neighbours Southwark and Croydon.
If I were just doing a summary analysis by borough, I might have been tempted to trust the source, given it was already tagged and recoding is a time-consuming pain no matter how much you delegate to the computer. And my analysis would have been wrong. However, that is a simple inaccuracy issue.
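One cheap way to find out whether a source tag deserves trust is to count how often it disagrees with your own recoding. The sketch below is a Python illustration, not the R code used here; the field names `source_borough` and `recoded_borough` and the four records are invented for the example.

```python
from collections import Counter


def tag_disagreements(records):
    """Count (source_tag, recoded_tag) pairs where the two tags differ."""
    return Counter(
        (r["source_borough"], r["recoded_borough"])
        for r in records
        if r["source_borough"] != r["recoded_borough"]
    )


# Hypothetical records: the values are made up purely for illustration.
records = [
    {"source_borough": "Lambeth", "recoded_borough": "Lambeth"},
    {"source_borough": "Lambeth", "recoded_borough": "Southwark"},
    {"source_borough": "Wandsworth", "recoded_borough": "Lambeth"},
    {"source_borough": "Lambeth", "recoded_borough": "Lambeth"},
]

mismatches = tag_disagreements(records)
share_wrong = sum(mismatches.values()) / len(records)

for (source, recoded), count in mismatches.most_common():
    print(f"{source} -> {recoded}: {count}")
print(f"Share of records where the tags disagree: {share_wrong:.0%}")
```

Running this on a small random sample before committing to a full recode gives a rough sense of whether the effort is worth it.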
The real challenge is uncertainty. For my analysis, I am using the LSOA boundaries as defined by the Ordnance Survey in 2011, as opposed to the previous 2001 boundaries and the more recent 2014 boundaries. I am specifically using the 2011 boundaries because I’ll be analysing against other data sets that have been produced based on those boundaries. But if I include any data sets older than 2011 or newer than 2014, I may not know what boundaries were used to define their location. The LSOAs are designed to provide an overview of population statistics, and the size is kept to between 1,500 and 3,000 residents per LSOA. Building development and migration trends mean the boundaries are regularly redrawn. And who decides just how those boundaries get redrawn?
The mobile data itself contains uncertainty. Each data point has latitude and longitude coordinates, but it also has a precision accuracy measure. I have removed all data points with a precision accuracy wider than a 200 metre radius. That eliminated 16% of my dataset. If I narrow the precision to a 50 metre radius, my dataset shrinks to 74%. But what does that precision accuracy measure mean? Checking the small print on the data source, it means there is a 66% confidence level that the coordinates fall within that radius. I have to factor that uncertainty into my analysis.
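To make the trade-off concrete, here is a small Python sketch of filtering by precision radius and seeing how much data survives each threshold. The radii are invented and do not reproduce the percentages above. The last line also assumes the 66% confidence level applies independently to every record, which the data source does not necessarily promise.

```python
CONFIDENCE = 0.66  # from the data source's small print, as quoted in the post


def retained_share(radii_m, max_radius_m):
    """Fraction of records whose reported precision radius is within the threshold."""
    kept = [r for r in radii_m if r <= max_radius_m]
    return len(kept) / len(radii_m)


# Hypothetical precision radii in metres, one per record.
radii_m = [10, 30, 45, 80, 150, 400]

share_200 = retained_share(radii_m, 200)
share_50 = retained_share(radii_m, 50)

# Even among the records we keep, some true locations lie outside the stated radius.
kept_200 = sum(1 for r in radii_m if r <= 200)
expected_outside = (1 - CONFIDENCE) * kept_200

print(f"Kept at 200 m: {share_200:.0%}, kept at 50 m: {share_50:.0%}")
print(f"Kept records expected to fall outside their radius: about {expected_outside:.1f}")
```

A tighter threshold buys more precise points at the cost of a smaller sample, and neither threshold removes the uncertainty that is baked into the measure itself.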
This is the reality of dealing with data of all sizes. At what point do you stop measuring? What level of uncertainty is acceptable? If boundaries have been drawn, how confident are you that they are meaningful? And those boundaries aren’t just spatial. They involve any classification that is used to separate raw data into groups. What would be the impact on your results if they changed? Every taxonomy we invent has its flaws and biases.
And then there are still plain old mistakes, which is why you should always keep sampling your data to check. I thought I had successfully recoded all my data. Just to check, I plotted each borough again. 31 boroughs looked just like Lambeth: patches of density, lines suggesting roads and bus routes, some empty spaces indicating parks and buildings where mobile signals go to die. One borough just looked plain wrong. What happened to Hackney?
Image 3: Islington and Hackney after tagging with LSOA
After initially envisioning some wild scenario involving a real-world Tony Stark or some secret government agency disrupting all mobile signals in the area… I am guessing I missed an error that caused R to burp and not export the Hackney results back out to the database. Hackney has picked up points from its neighbours, most coming from Islington to the west, but has lost all of its own. Oops. Back to R we go…
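Plotting caught this one, but a quick automated check can flag the same symptom: a borough that kept none of the points it started with. The Python sketch below is an illustration only. The borough pairs are invented, and a `None` recoded tag stands in for results that never made it back to the database.

```python
def boroughs_losing_all_own_points(pairs):
    """Return source boroughs where no record kept the same tag after recoding.

    pairs is an iterable of (source_borough, recoded_borough); recoded_borough
    may be None when no result was written back.
    """
    pairs = list(pairs)
    sources = {source for source, _ in pairs}
    kept_some = {source for source, recoded in pairs if source == recoded}
    return sorted(sources - kept_some)


# Hypothetical (source, recoded) pairs, made up for illustration.
pairs = [
    ("Lambeth", "Lambeth"),
    ("Lambeth", "Southwark"),
    ("Islington", "Islington"),
    ("Islington", "Hackney"),
    ("Hackney", None),
    ("Hackney", None),
]

suspects = boroughs_losing_all_own_points(pairs)
print("Boroughs worth a second look:", suspects)
```

A flagged borough is a prompt to look, not proof of an error, since a badly misaligned source could in principle lose every point legitimately.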
Featured image: Illusion kindly shared on Flickr by Tinou Bao
People or pillars? An illusion can be two things but you can’t view them simultaneously. Your eyes switch from one to the other…