Metacrap

I can’t help it, the word ‘metacrap’ just makes me want to chuckle.

Anyways, there’s a great blog post over on Wired, by David Weinberger – Metacrap and Flickr Tags: An Interview with Cory Doctorow. (transcript and audio available). Lots of useful information and opinion about search challenges, tagging and information rights. Like the following snippet:

No one really worries that when you hit control-r to reply to an email that the first thing it does is make a holus-bolus copy of that first email, which is, itself, a copyrighted work, and then allow you to insert your own commentary as an interleaf between the paragraphs. That’s how you know the Internet was designed by engineers, not lawyers.

The interview includes a great discussion about the differences between implicit and explicit metadata, explaining why the former is better for indexing and the latter is better for tagging. (That’s a very simplistic summary, I’d recommend reading the article to get the full perspective.)

When IT doesn’t matter

I’m currently reading ‘The Labyrinths of Information: Challenging the Wisdom of Systems’ by Claudio Ciborra. I haven’t gotten very far through the book yet; it is written in an academic tone, which always slows me down. But early on, I stumbled across a very interesting point of view.

IT architecture is a popular topic right now. You can get enterprise architects, software architects, infrastructure architects, information architects… the list goes on. One of the focus areas for architecture is the adoption of standards and consistent methods for the design, development and deployment of IT systems. All sounds very sensible and measurable.

But Claudio makes a simple observation that suggests such architecture doesn’t matter, in that it does not help an organisation become successful. Instead, architecture is a simple necessity of doing business digitally. This argument concurs with Nicholas Carr’s controversial article (and subsequent book) ‘IT Doesn’t Matter’.

A sample from the book: (note – SIS refers to strategic information systems)

“…market analysis of and the identification of SIS applications are research and consultancy services that can be purchased. They are carried out according to common frameworks, use standard data sources, and, if performed professionally, will reach similar results and recommend similar applications to similar firms.”

So what do you need to do to become an innovative company? Claudio suggests:

“…To avoid easy imitation, the quest for a strategic application must be based on such intangible, and even opaque, areas as organisational culture. The investigation and enactment of unique sources of practice, know-how, and culture at firm and industry level can be a source of sustained advantage…

See, I have been telling those techie-oriented IT folk for years, collaboration and knowledge sharing are far more important than your boring transaction-based systems 🙂

…Developing an SIS is much closer to prototyping and the deployment of end-user’s ingenuity than has so far been appreciated: most strategic applications have emerged out of plain hacking. The capacity to integrate unique ideas and practical design solutions at the end-user level turns out to be more important than the adoption of structured approaches to systems development…”

Sounds like an argument in favour of mash-ups and wikis to me. See also: Let’s make SharePoint dirty

Why taxonomy fails…

Over the years, the desire for taxonomies within organisations seems to gain and lose popularity. At the moment, it seems to be on the rise again…

Whilst I don’t doubt there are scenarios where developing and rigidly applying a detailed taxonomy structure to your content is a sensible approach that delivers valuable benefits for your organisation… I haven’t seen many successful implementations. I have seen plenty of failed attempts. The most common cause of failure is creating an overly complex taxonomy that becomes impossible to implement easily or consistently, and that often ends up making content harder to find (when the opposite is usually desired).

It is not difficult to end up with a complex taxonomy. As soon as you create an initial set of categories and start classifying content, you discover clashes that require subsets of categories to accurately define content, and so the number of terms in your taxonomy begins to increase.

And once you have a complex taxonomy, you have a problem. For starters, applying it to content is a major headache – authors have to navigate nested lists of words to get to the ones that need to be applied to their content. Consistency is impossible to achieve organically (i.e. when people are involved) – words can have different meanings in different contexts and no two people will apply exactly the same classifications to an identical set of content. Who is right and who is wrong? If you’ve got a budget burning a hole in your pocket, you can try skipping this problem by automating the process instead.

But even if you’ve crossed these hurdles and have successfully applied a taxonomy to your content, you may still fail to achieve the desired results. For many organisations, at best, content is still hard to find. At worst, it becomes even harder to find…

The reason content is still hard to find is simple – the people looking for the content are using a different vocabulary to the one used to classify content. They have different requirements and different viewpoints. This issue was demonstrated in a previous blog post: When taxonomy fails…

But how can content become harder to find as a result of using a taxonomy structure? It will depend on what indexing engine you use, but most face the same challenge. Indexing engines usually apply a variety of methods to calculate the relevance rank for each and every item returned in a search results list. Different algorithms work better for different types of content. For example, Google has demonstrated the benefits of PageRank for analysing links between web pages to determine relevance. Text analysis (probabilistic ranking algorithms, often based on Bayesian inference) usually works better for documents by mapping related terms within the body of the document. But one method often overrides these algorithms – matching search query terms to metadata, i.e. property values. The reasoning is simple: if someone has taken the effort to classify a document, then the values in the properties will have significant meaning, more so than analysing the content of the document.

To clarify this point: the more you use metadata properties to classify content, the less influence an indexing engine’s ranking algorithms will have on calculating relevancy. There’s a reason why Google pretty much ignores metatags when indexing web sites…
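To make this concrete, here is a toy sketch of a ranker where an exact metadata match swamps the content-based score. This is not how any real indexing engine is implemented, and the boost weight is an invented assumption:

```python
# A toy sketch of the behaviour described above -- not a real indexing
# engine, and the boost weight is an invented assumption.

def content_score(terms, doc):
    """Crude stand-in for text analysis: count query terms in the body."""
    words = doc["body"].lower().split()
    return sum(words.count(t) for t in terms)

def relevance(terms, doc, metadata_weight=10):
    score = content_score(terms, doc)
    # An exact match against a property value gets a large fixed boost,
    # on the reasoning that deliberate classification carries meaning.
    values = {v.lower() for v in doc["properties"].values()}
    if any(t in values for t in terms):
        score += metadata_weight
    return score

docs = [
    {"title": "Invoice template", "properties": {"Document Type": "Invoice"},
     "body": "blank template"},
    {"title": "Guide to disputing an invoice",
     "properties": {"Document Type": "Guide"},
     "body": "how to query an invoice number and fix invoice errors"},
]

# The guide discusses invoices at length, but the template wins on the
# strength of a single metadata match.
for d in sorted(docs, key=lambda d: -relevance(["invoice"], d)):
    print(d["title"])
```

The more properties a taxonomy adds, the more query terms will collide with property values, and the more often this boost fires.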

To demonstrate, let’s create a metadata property called ‘Document Type’ (at this point, in case anyone reading this thinks I might be referring to them, I assure you I am not… at least, not just you 🙂 ‘Document Type’ is the most common example given for a desired metadata property when designing a SharePoint deployment.)

The benefit of classifying content by document type is to quickly narrow search results and improve relevance for specific roles. For example, a project manager might be more interested in documents classified as ‘project’ than documents classified as ‘marketing’. If you do a basic search query (i.e. enter a phrase inside a box and click Go) for “customer X project Y”, you’ll get results listing all content containing the words: customer, X, project, and Y. Because the word ‘project’ matches a property value, all content classified with that property value should appear high in the results list. If you do an advanced search query and select property=Document Type, value=Project, you will get a reduced results set, listing only content classified as project documents containing the words: customer, X, and Y. This is not an unreasonable requirement for organisations with large volumes of content.
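The two query styles can be sketched with hypothetical documents (the matching logic here is deliberately simplistic; real engines do far more):

```python
# A hypothetical sketch of basic versus advanced search. The documents
# and matching logic are invented and far simpler than a real engine.

DOCS = [
    {"title": "Project charter", "doc_type": "Project",
     "body": "customer X project Y scope and goals"},
    {"title": "Marketing brief", "doc_type": "Marketing",
     "body": "customer X campaign mentioning project Y"},
]

def matches(phrase, doc):
    """True if every word in the phrase appears in the document body."""
    terms = phrase.lower().split()
    words = doc["body"].lower().split()
    return all(t in words for t in terms)

def basic_search(phrase, docs=DOCS):
    """Basic query: a phrase in a box -- returns everything that matches."""
    return [d["title"] for d in docs if matches(phrase, d)]

def advanced_search(phrase, doc_type, docs=DOCS):
    """Advanced query: same match, restricted to one Document Type value."""
    return [d["title"] for d in docs
            if d["doc_type"] == doc_type and matches(phrase, d)]

print(basic_search("customer x project y"))        # both documents match
print(advanced_search("customer x y", "Project"))  # only the project document
```

The advanced query moves ‘project’ out of the phrase and into a filter, which is exactly the narrowing behaviour described above.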

However, assuming query terms that match property values have higher significance is a double-edged sword. It improves relevance for searches using the word in the same context but reduces relevance for searches using the word in a different context. For example, another common search might be to find all documents about project Y. Doing a basic search query will return all content containing the words: project and Y. Again, the word ‘project’ will be an exact match against all content classified as project documents (Document Type=Project) meaning all project documents will be treated as more relevant than other content types. But this time, you are not interested in the type of document, you are interested in information about the actual project. The most relevant document might be an internal memo. But unless the word ‘project’ appears in one of the metadata properties for that document, it could be ranked as having lower relevance than all project documents, depending on what other terms are used in the search query.
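The project Y scenario can be sketched as toy code (invented documents and weights; no real engine works exactly like this):

```python
# Sketching the double-edged sword with invented documents and an
# invented boost weight (not a real engine).

def score(terms, doc, boost=10):
    words = doc["body"].lower().split()
    s = sum(words.count(t) for t in terms)
    if doc["doc_type"].lower() in terms:
        s += boost  # query term matches the Document Type property value
    return s

docs = [
    {"title": "Project plan for Y", "doc_type": "Project",
     "body": "plan for y milestones"},
    {"title": "Memo: cancelling project Y", "doc_type": "Memo",
     "body": "project y problems: project y budget and project y scope"},
]

# Someone researching project Y itself would want the memo, but the
# metadata boost ranks the plan document first.
for d in sorted(docs, key=lambda d: -score(["project", "y"], d)):
    print(d["title"])
```
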

That’s the problem with taxonomies. Not only are they difficult to implement in the first place, they can actually make it harder to find relevant content because they are not great at handling different points of view…

Technorati tags: Taxonomy, Tagging, Information Architecture

When taxonomy fails…

I’m doing a lot of work around information findability at the moment. And at some point, I will hopefully find some time to post some of the research and experiences here.

One element of information findability is content classification. There are various methods to classify content but two words crop up in the boxing ring more than any others – taxonomy and folksonomy. Do a search on taxonomy versus folksonomy and you’ll see what I mean…

There’s an interesting article in the New York Times – One Picture, 1,000 Tags – that highlights why so many people champion folksonomy over taxonomy. The article discusses the challenges faced by museums in making their online collections accessible – how do you find content when you don’t know how to describe what you are looking for? The Metropolitan Museum of Art ran a test and got non-specialist volunteers to apply tags to a selection of art. The results raised eyebrows. More than 80% of the terms used were not listed in the museum’s documentation. The experts classifying content are talking a completely different language to the one used by their content’s audience.
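You can illustrate that kind of vocabulary gap with a trivial set comparison. The term lists below are entirely made up for the sketch – they are not the museum’s data:

```python
# A toy illustration of the vocabulary gap: compare a made-up expert
# vocabulary against made-up volunteer tags (NOT the Met's actual data).

expert_terms = {"oil on canvas", "impressionism", "landscape", "19th century"}
volunteer_tags = {"painting", "trees", "river", "sunny", "landscape",
                  "pretty", "old", "france", "countryside", "green"}

# Tags the audience used that the official vocabulary has no entry for.
missing = volunteer_tags - expert_terms
print(f"{len(missing) / len(volunteer_tags):.0%} of volunteer tags "
      "do not appear in the expert vocabulary")
```
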

This problem all too often applies to corporate scenarios too. Those who are chosen to design the corporate taxonomy are often selected because they are experts. But those who will be using the taxonomy, either applying it or using it to search for content, may have a different point of view. And that’s when the taxonomy fails…


Technorati tags: Taxonomy, Folksonomy

Final note: If you are interested in the use (and pros/cons) of taxonomy, ontology and folksonomy to improve the findability of your content, there’s a great book worth reading: Ambient Findability by Peter Morville

Documents versus Records

During 2002, I was involved in a project to implement electronic records management (not SOX, that act was still in development…). The subject has cropped up again and brought back some memories…

The biggest challenge the project faced was that electronic records had to be managed in an identical way to paper-based files. That experience resulted in point #4 in ‘Why is KM so difficult‘. I remember, in frustration, muttering ‘you can’t shred electronic records’ when trying to explain why paper files and folders had fundamentally different properties to electronic ones. (Enron was headline news at the time, as was Microsoft‘s DoJ case)

The second challenge was that the project also wanted to improve document management (including collaboration) and, unsurprisingly, some thought it would be a great idea to roll out a single system to cover both. Now, in theory, it can be done but it needs to be recognised that managing the archiving of a record and managing the creation of a document are very different activities:

Records Management:

  • Management of the record is more important than the content of the record
  • The record never changes (although its properties might)
  • Records require access controls, lots of them

Document Management/Collaboration:

  • Without content there is no document
  • The document changes a lot, that’s the whole point of collaboration
  • Access controls restrict and impede collaboration, the fewer there are the better

The ability to have one store manage the entire document lifecycle process depends on what’s involved. If you require very granular records management capabilities, the database design will be different to what you need for most document management and collaboration activities. Mashing the two together could lead to performance and scalability problems. It’s no different to any other situation where you can have dedicated solutions or combine them together. A sofa-bed is not as comfortable as a sofa or a bed, but if you can’t have both, a sofa-bed may be better than only having a sofa or a bed… You have to decide if the trade-off is worth it.

If you insist on forcing an electronic system to replicate a paper-based system, you risk stretching the technology beyond reasonable design limits. Either the system will fail, or people will fail to use the system. As the LATCC learned, a change in tool sometimes requires a change to the process. Failure to do so can lead to escalating costs for a system that will never perform optimally – just because something is possible does not make it plausible.

My advice for companies who need to provide records management and who also want to improve document collaboration:

  • Collaboration and records management have different goals and objectives
  • All records are documents, not all documents are records
  • The ideal records management solution should be an extension to the collaborative environment, not the other way round
  • Collaborate to create the document, apply locking controls when the final document is declared a record
  • If the required controls are complicated, the records may need a separate database designed specifically to provide those controls
  • If you are in charge of designing the required controls, please give some thought to the economic implications of your decisions
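The collaborate-then-lock advice can be sketched in a few lines. This is a hypothetical illustration, not any real records-management API:

```python
# A minimal sketch (hypothetical classes, not a real DMS): documents stay
# editable during collaboration, then lock once declared a record.

class Document:
    def __init__(self, title, body=""):
        self.title = title
        self.body = body
        self.is_record = False  # all records are documents, not vice versa

    def edit(self, new_body):
        if self.is_record:
            # The record never changes (although its properties might).
            raise PermissionError(f"'{self.title}' is a record: read-only")
        self.body = new_body

    def declare_record(self):
        """Apply locking controls once the final document becomes a record."""
        self.is_record = True

doc = Document("Contract v3")
doc.edit("Final agreed terms")      # collaboration: edits allowed
doc.declare_record()
try:
    doc.edit("Sneaky late change")  # blocked: it's a record now
except PermissionError as e:
    print(e)
```

The point of keeping the lock as a late-applied extension is that collaboration stays frictionless right up until the moment the record is declared.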

And never forget, people can always circumvent the system. Put too many controls in place and they simply will not create the document – no document, no record…