Over the years, the desire for taxonomies within organisations seems to gain and lose popularity. At the moment, it seems to be on the rise again…
Whilst I don’t doubt there are scenarios where developing and rigidly applying a detailed taxonomy to your content is a sensible approach that delivers valuable benefits for your organisation… I haven’t seen many successful implementations. I have seen plenty of failed attempts. The most common cause of failure is an overly complex taxonomy that is impossible to apply easily or consistently, or that makes content harder to find (when the opposite is usually desired).
It is not difficult to end up with a complex taxonomy. As soon as you create an initial set of categories and start classifying content, you discover clashes that require subsets of categories to accurately define content, and so the number of terms in your taxonomy begins to increase.
And once you have a complex taxonomy, you have a problem. For starters, applying it to content is a major headache – authors have to navigate nested lists of words to get to the ones that need to be applied to their content. Consistency is impossible to achieve organically (i.e. when people are involved) – words can have different meanings in different contexts and no two people will apply exactly the same classifications to an identical set of content. Who is right and who is wrong? If you’ve got a budget burning a hole in your pocket, you can try skipping this problem by automating the process instead.
But even if you’ve crossed these hurdles and have successfully applied a taxonomy to your content, you may still fail to achieve the desired results. For many organisations, at best, content is still hard to find. At worst, it becomes even harder to find.
The reason content is still hard to find is simple – the people looking for the content are using a different vocabulary to the one used to classify content. They have different requirements and different viewpoints. This issue was demonstrated in a previous blog post: When taxonomy fails…
But how can content become harder to find as a result of using a taxonomy structure? It will depend on what indexing engine you use, but most face the same challenge. Indexing engines usually apply a variety of methods to calculate the relevance rank for each and every item returned in a search results list. Different algorithms work better for different types of content. For example, Google has demonstrated the benefits of PageRank for analysing links between web pages to determine relevance. Text analysis (probabilistic ranking algorithms, often based on Bayesian inference) usually works better for documents by mapping related terms within the body of the document. But one method often overrides these algorithms – matching search query terms to metadata, i.e. property values. The reasoning is simple: if someone has taken the effort to classify a document, then the values in the properties will have significant meaning, more so than anything gained from analysing the content of the document.
To clarify this point: the more you use metadata properties to classify content, the less influence an indexing engine’s ranking algorithms will have on calculating relevancy. There’s a reason why Google pretty much ignores metatags when indexing web sites…
To demonstrate, let’s create a metadata property called ‘Document Type’. (At this point, in case anyone reading this thinks I might be referring to them, I assure you I am not… at least, not just you 🙂 ‘Document Type’ is the most common example given for a desired metadata property when designing a SharePoint deployment.)
The benefit of classifying content based on its document type is to quickly narrow search results and improve relevance for specific roles. For example, a project manager might be more interested in documents classified as ‘project’ than documents classified as ‘marketing’. If you do a basic search query (i.e. enter a phrase inside a box and click Go) for “customer X project Y”, you’ll get results listing all content containing the words: customer, X, project, and Y. Because the word ‘project’ matches a property value, all content classified with that property value should appear high in the results list. If you do an advanced search query and select property=Document Type, value=Project, you will get a reduced results set, listing only content classified as project documents containing the words: customer, X, and Y. This is not an unreasonable requirement for organisations with large volumes of content.
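To make the difference between the two query styles concrete, here is a minimal sketch in Python. It is not how SharePoint or any real indexing engine works internally; the `Doc` class, the sample documents and both search functions are made up purely for illustration, and it assumes a basic search matches query terms against either the body text or the ‘Document Type’ value.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    title: str
    doc_type: str  # value of the hypothetical 'Document Type' property
    body: str


# Invented sample content, for illustration only.
DOCS = [
    Doc("Kick-off plan", "Project", "customer X plan for Y"),
    Doc("Campaign brief", "Marketing", "customer X campaign for project Y"),
    Doc("Status report", "Project", "customer Z status"),
]


def basic_search(docs, query):
    """Return docs where every query term appears in the body or the property value."""
    terms = query.lower().split()
    results = []
    for d in docs:
        words = set(d.body.lower().split()) | {d.doc_type.lower()}
        if all(t in words for t in terms):
            results.append(d)
    return results


def advanced_search(docs, query, doc_type):
    """Filter on the property first, then match the remaining terms against the body."""
    terms = query.lower().split()
    return [
        d
        for d in docs
        if d.doc_type.lower() == doc_type.lower()
        and all(t in d.body.lower().split() for t in terms)
    ]


basic_titles = [d.title for d in basic_search(DOCS, "customer X project Y")]
advanced_titles = [d.title for d in advanced_search(DOCS, "customer X Y", "Project")]
```

With this sample data, the basic search returns both the project plan and the marketing brief, while the advanced search narrows the list to the one document classified as a project document.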
However, assuming query terms that match property values have higher significance is a double-edged sword. It improves relevance for searches using the word in the same context but reduces relevance for searches using the word in a different context. For example, another common search might be to find all documents about project Y. Doing a basic search query will return all content containing the words: project and Y. Again, the word ‘project’ will be an exact match against all content classified as project documents (Document Type=Project) meaning all project documents will be treated as more relevant than other content types. But this time, you are not interested in the type of document, you are interested in information about the actual project. The most relevant document might be an internal memo. But unless the word ‘project’ appears in one of the metadata properties for that document, it could be ranked as having lower relevance than all project documents, depending on what other terms are used in the search query.
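The effect can be sketched with a toy ranking function. This is only an illustration under stated assumptions: the scoring formula, the `boost` weight and the two sample documents are invented, and real engines use far more sophisticated ranking. The point is simply what happens when a property match is weighted more heavily than matches in the body text.

```python
# Invented sample content: name -> (body text, metadata properties).
DOCS = {
    "memo": (
        "project y is delayed because project y lost funding",
        {"Document Type": "Memo"},
    ),
    "template": (
        "blank status template",
        {"Document Type": "Project"},
    ),
}


def score(query, body, properties, boost):
    """Toy relevance score: term counts in the body plus a boost per property match."""
    terms = query.lower().split()
    body_words = body.lower().split()
    prop_values = {v.lower() for v in properties.values()}
    text_score = sum(body_words.count(t) for t in terms)
    meta_score = sum(boost for t in terms if t in prop_values)
    return text_score + meta_score


def rank(query, boost=5.0):
    """Return document names ordered from most to least relevant."""
    return sorted(
        DOCS,
        key=lambda name: score(query, *DOCS[name], boost),
        reverse=True,
    )


with_boost = rank("project Y")                # metadata matches outweigh body text
without_boost = rank("project Y", boost=0.0)  # body text only
```

With the boost applied, the empty template classified as Document Type=Project outranks the memo that actually discusses project Y. Set the boost to zero and the memo comes first.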
That’s the problem with taxonomies. Not only are they difficult to implement in the first place, they can actually make it harder to find relevant content because they are not great at handling different points of view…
Then, what is the solution?
Ah, well, I tried to address that in the post but it was in danger of turning into a book 🙂 I don't think the question can ever be 'what is the solution?'. Instead, I think we need to ask 'what are the options?'. Perhaps the challenge with taxonomy has always been the same – the attempt to define one single solution. I'm working on this issue in a post coming up…