How relevant is your content?

Last year I wrote a blog post – Search Lessons – that generated a bit of offline feedback. Or rather, it was my explanation of The Google™ Effect that sparked a debate.

“Why can’t our enterprise search just be like Google?”

People are used to Google when searching the Internet – simple (and single) user interface, fast results. Relevance varies depending on the search but is usually good enough for most queries. When it comes to searching for information within the boundaries of your organisation, the answers are not so easy to find…

There’s one simple answer: Intranet ≠ Internet.

They might look similar, thanks to web standards, but looks can deceive. A mushroom looks like a toadstool but one is edible and the other isn’t.

On the Internet, pages want to be found. The same is rarely true within the enterprise.

On the Internet, if your site doesn’t appear in search results, it doesn’t exist. That’s a huge incentive to ensure anyone publishing content uses tags and any other tricks to attract the attention of a search engine. Enough of an incentive to create a whole new industry – Search Engine Optimisation (SEO). The same incentives rarely apply within the enterprise and few organisations budget for in-house SEO. Added to that, the search vendors have to continuously optimise the index to balance any negative effects caused by SEO. In-house? You need to do both! Enterprise search is a continuous improvement process, not an isolated project.

On the Internet, there is no relationship between the search engine and the end-user. If you don’t find what you are looking for, you try a different query or look somewhere else. There is no contract stating that the Internet must provide you with answers. There are no guarantees that what you find will be accurate or useful. When an enterprise search engine fails to perform, the IT department usually hears about it.

On the Internet, all users are equal (and anonymous). You can’t log in to Google with a different ID and suddenly access previously hidden search results. In the enterprise, security is rarely considered optional.

The Internet doesn’t care what you do with what you find. An enterprise search engine should.

When an enterprise search engine fails to deliver, the technology usually gets the blame:

“Our search engine is rubbish, the results it finds are never relevant.”

But is it really the search engine that is to blame? When it comes to calculating relevance, it all depends on the content. Jenny Everett, over on the SharePoint Blogs, published the results of a survey that clearly demonstrate the real challenge facing enterprise search solutions.

Source: SharePoint Search is not Psychic by Jenny Everett

85% of search issues had nothing to do with relevance ranking. The numbers may vary by organisation, but I would wager that the top two issues are consistent. Imagine if you performed a search on the Internet and the first page of results consisted of items with the title ‘Blank’.
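
If you want to see how big that problem is in your own store, a quick audit of titles before indexing goes a long way. Here is a minimal sketch in Python – the documents, field names and placeholder list are all hypothetical:

```python
# Minimal sketch: audit a document store for the content-quality issues the
# survey highlights (missing or placeholder titles) before blaming the ranker.
# The document list and field names here are invented for illustration.
docs = [
    {"path": "/projects/plan.docx", "title": "Blank"},
    {"path": "/sales/q3.pptx", "title": "Q3 Sales Review"},
    {"path": "/hr/policy.doc", "title": ""},
]

PLACEHOLDERS = {"", "blank", "untitled", "document1", "new document"}

def suspect_titles(documents):
    """Yield documents whose title will render search results useless."""
    for doc in documents:
        if doc.get("title", "").strip().lower() in PLACEHOLDERS:
            yield doc

for doc in suspect_titles(docs):
    print(f"Fix before indexing: {doc['path']} (title={doc['title']!r})")
```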

Will we see the Semantic Web?

The Semantic Web, in theory, is a brilliant idea. For a starting point to read all about it, check out its entry on Wikipedia. In a nutshell, the Semantic Web is about bringing meaning to data and web pages. The ultimate goal appears to be to make data work without (and on behalf of) humans. At the moment, if my car starts feeling ill, the computer displays a warning light on the dashboard. I ignore the warning light until the car all but stops working. Then I phone the garage, get it towed in and they repair it. In a Semantic Web world, my car wouldn’t let me be so flippant. It might still display that warning light to humour me, but it will also send a message to the garage and book itself an appointment to repair the failing component. If fully automated driving ever takes off, it will take itself to the garage and come back home when it’s ready.

OK, so there are probably more realistic examples to demonstrate the benefits of a Semantic Web, but the principles (in simple terms) are there:

  • a universal method to describe data (ontology)
  • a standard language to store data (XML)
  • a common mechanism to move data (web services)

In my optimistic example, cars will use a universal ontology to describe their ‘bits’ and the associated information needed to trigger alerts when something goes wrong. That data will be stored using an agreed XML schema that all garages can read and will be transmitted to the garage (subject to a wireless network being available) using a published web service. Similarly, all garages will use a universal ontology, XML schema and web service to make their diaries available for cars to locate free slots and book in appointments… and so it goes on.
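
To make the moving parts concrete, here is a minimal sketch of the car’s side of that exchange in Python. Everything in it – the schema namespace, the element names, the garage endpoint – is hypothetical, since no such universal ontology or booking service exists:

```python
# Toy sketch of the optimistic example: the car reports a fault using an
# agreed XML schema and a published web service. The schema, endpoint and
# field names are all imaginary.
import xml.etree.ElementTree as ET

def build_fault_report(vin, component, severity):
    """Serialise a fault into the (hypothetical) shared schema."""
    root = ET.Element("faultReport", xmlns="urn:example:vehicle-fault:v1")
    ET.SubElement(root, "vehicle").text = vin
    ET.SubElement(root, "component").text = component  # term from the shared ontology
    ET.SubElement(root, "severity").text = severity
    return ET.tostring(root, encoding="unicode")

report = build_fault_report("WDB1234567", "alternator", "warning")
print(report)
# The car would then POST this to the garage's published booking service,
# e.g. requests.post("https://garage.example.com/bookings", data=report)
```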

Two of those principles are achievable and already being adopted in many business and consumer systems. One isn’t. There will never be a universal method to describe all data and, if there were, it would go the way of Google’s PageRank algorithm – successful for as long as it takes for the system to be gamed. Search Engine Optimisation would be replaced with a new black art: Semantic Web Optimisation.

As long as human nature is involved in designing (and using) an ontology, there will never be agreement. I have just finished reading a wonderful book – Glut: Mastering Information Through The Ages. It documents the attempts to classify and organise information literally since records began, and highlights the common cause of failure – different perspectives reach different conclusions. Many people reference the library as a shining example of how taxonomy can work, citing the Dewey Decimal system. The Dewey Decimal system was first created in 1876 by Melvil Dewey. Dewey gave higher status to Christian religions (sections 200 through 280) than to all others (Judaism and Islam are given just one section each – 296 and 297 respectively). I wonder what religion he belonged to? His design was not without context and therefore not universal…

And history appears to have taught us nothing about our lack of ability to work on anything universal. In the 18 August 2007 issue of New Scientist magazine, there were two articles covering the same piece of the planet. The first article questioned whether or not a tipping point had already been reached with regard to climate change. Concern focused on the melting Greenland ice sheet and how the consequences will affect us all. In the news section, a small article covered ‘Arctic monkeying’, the race to claim ownership of a seabed structure running from Greenland to Russia. Apparently, a Russian submarine planted a flag on the seabed beneath the North Pole. Canada is building a military base and deep-water port to govern the emerging Northwest Passage (about that melting ice sheet…). Denmark has sent researchers to assess its claim on the untapped oil and gas reserves that just happen to lie in the region under dispute… amply demonstrating why we are doomed to suffer the consequences of climate change. For goodness’ sake, we can’t agree on a single currency or spoken language, or even agree to all drive on the same side of the road. Diversity is a core part of what it is to be human. Universality requires human nature to exit through the door to the right (about that talk of climate change…)

So what will become of the Semantic Web?

XML and web services continue to spread through systems and applications. Even Microsoft, never the first to sign up to an open standard 🙂, has adopted XML formats in its latest Office applications, and web services are used to communicate between desktop applications and online services.

Ontology is the difficult member of the semantic family – unfortunate, because ontology is the ‘semantic’ in The Semantic Web. Hierarchical ‘top-down’ taxonomies continue to battle with networked ‘bottom-up’ folksonomies, and history suggests such conflict is a necessary feature (not a bug) of evolving systems. User tagging has proven unexpectedly successful, thanks to Flickr and blogging, and has demonstrated the value of folksonomies. But inconsistencies remain rife. If I want to Technorati tag a blog post about Web 2.0, do I use web 2.0, web-20, web2.0, web20… Perhaps that’s the trade-off we will have to accept – forget striving for a universal method and instead learn to work with local variants and their different contexts. Accept that the results won’t be perfect. Just like the real world 🙂
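
As a tiny illustration of living with local variants, here is a sketch in Python that folds the tag spellings above into one canonical key. The variant list and counts are invented for the example:

```python
import re

def normalise_tag(tag):
    """Collapse case, punctuation and spacing so 'web 2.0', 'web-20',
    'Web2.0' and 'web20' all land on the same key."""
    return re.sub(r"[^a-z0-9]", "", tag.lower())

# Invented tag counts as they might arrive from different posts
raw_tags = {"web 2.0": 12, "web-20": 3, "Web2.0": 7, "web20": 1}

counts = {}
for tag, n in raw_tags.items():
    key = normalise_tag(tag)
    counts[key] = counts.get(key, 0) + n

print(counts)  # {'web20': 23} – four spellings, one bucket
```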

Without doubt, the use of metadata to enhance search results and create new systems has the potential to disrupt, and is attracting huge attention. You can see some of Google’s work on multi-faceted search (pivoting results based on classification category) over at http://www.google.com/experimental/index.html. Microsoft Research has published work on ‘object-level vertical’ search algorithms, although a more exciting demo can be found at TED – Ideas worth spreading: using Photosynth to combine photos (found through their tags) into a 3D virtual world.

Nicholas Carr, yes he who said IT doesn’t matter but can’t seem to stop writing about it anyway :-), perhaps identified where the semantic web will come to life – the real Web 2.0: the datacenters competing to host our information in the ‘cloud’. Will the providers of online services become the gatekeepers to different semantic webs?

Technorati tags: semantic web, web 2.0

Just Enough Taxonomy

On Microsoft’s Channel 9 network, there is an interesting podcast called ‘Just Enough Architecture’, where the interviewee provides some good recommendations about the balance between how much architecture you need and just getting on with writing software that does something useful.

The same debate could be applied to taxonomy, specifically the use of metadata properties to classify content.

For some reason, most companies who decide they want to improve how content is classified seem to want extreme taxonomy, swinging from not-enough taxonomy to too-much. The mantra may sound somewhat familiar:

One taxonomy to rule them all, one taxonomy to find them, one taxonomy to bring them all and, in the records management store, define them

Often starting with none at all (i.e. content is organised informally and inconsistently using folders), the desire is to create a single corporate taxonomy to classify everything (using a hierarchical structure of metadata terms). An inordinate amount of time is then spent defining and agreeing the perfect taxonomy (for some reason, many seem to settle on about 10,000 terms). Several months later, heads are being scratched as people try to figure out just how they are going to implement it:

  • Do they classify existing content or only apply the taxonomy to new stuff?
  • Do they have specific roles dedicated to classifying the content, rely on the content owners to do it, or look at automated classification tools?
  • Do they put rules in place to force people to classify content and store it in specific locations that are ‘taxonomy-aware’?
  • How do they prevent people bypassing the system – those who figure they can still get their work done by switching to a wiki or a Groove workspace or a MySpace site or a Twitter conversation?
  • How do they validate the taxonomy and check that people are classifying correctly?
  • What do they do about people who aren’t classifying correctly, who don’t understand the hierarchy or attach different meanings to the terms in use?

What started out as a simple idea to improve the findability of information becomes a huge burden to maintain with questionable benefits, given there are so many opportunities for classification to go wrong.

This dilemma reveals two flaws that make implementing a taxonomy so difficult. The first is the desire to treat taxonomy as a discrete project rather than an organic one. Collaboration and knowledge management projects often share this fate. Making taxonomy a discrete project usually means tackling it all in one go from a technology perspective and then handing it over to the business to run ‘as is’ for ever more (i.e. until the next technology upgrade). Such projects end up looking like that old cliché – attempting to eat an elephant whole. The project team tries to create a perfect design that will deliver all identified requirements (and the business, knowing this could be its one chance for improved tools, delivers a loooooong list of requirements), implements a solution and then moves on to the next project. As the solution is used, the business finds flaws in its requirements or discovers new ways of working enabled by the technology, but it is too late to get the solution changed. The project is closed, the budget spent.

An alternative approach is to treat taxonomy as an organic project or, for those who prefer corporate-speak, a continuous-improvement programme. Instead of planning to create and deploy the perfect taxonomy, concentrate on ‘just enough taxonomy’. A good starting point is to find out why taxonomy is needed in the first place. If it is to make it easier for people to find information, first document the specific problems being experienced. Solve those problems as simply as possible, test the solutions and gather feedback. If successful, people will raise the bar on what they consider good findability, generating new demands for IT to solve, and so the cycle continues.

The following is a simple example using a fictitious company.

Current situation: Most information is stored in folders on file shares and shared via email. There is an intranet that is primarily static content published by a few authors. The IT department has been authorised to deploy Microsoft Office SharePoint Server 2007 (MOSS).

General problem: Nobody can find what they are looking for (resist the temptation to sing the U2 song at this moment…)

Specific problems:

  • Difficult to find information from recently completed projects that could be re-used in future projects
  • Difficult to differentiate between high-quality re-usable project information and low-quality or irrelevant project information
  • Difficult to find all available documents for a specific customer (contracts, internal notes, project files)

Possible solution: Deploy a search engine to index all file folders and the intranet. Move all project information to a central location. Within the search engine, create a scope (or collection) for the project information location. Users will then be able to perform search queries that return only project information within the results. Using ‘date modified’ as the sorting order will surface information from the most recent projects. Create a central location for storing top-rated ‘best practice’ project information. Set up a team of subject matter experts to work with project teams and promote documents as ‘best practice’. The Best Practices store can be given high visibility throughout the intranet and promoted as high relevance for search queries.
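
To show how little machinery this needs, here is a toy sketch of a scoped, date-sorted search in Python. The index, field names and documents are all invented for illustration – a real deployment would use the search engine’s own scope and sort features:

```python
# Toy model of the proposed fix: one index with a 'scope' per item, so a
# query can be restricted to project content and sorted newest-first.
from datetime import date

index = [
    {"title": "Project Alpha closure report", "scope": "projects",
     "modified": date(2007, 6, 1), "text": "lessons learned for customer X"},
    {"title": "Holiday rota", "scope": "intranet",
     "modified": date(2007, 7, 15), "text": "summer cover arrangements"},
    {"title": "Project Beta proposal", "scope": "projects",
     "modified": date(2007, 3, 9), "text": "bid for customer Y"},
]

def search(query, scope=None):
    """Plain text match, optionally restricted to one scope,
    newest first so recent project material surfaces."""
    hits = [d for d in index
            if query.lower() in d["text"].lower()
            and (scope is None or d["scope"] == scope)]
    return sorted(hits, key=lambda d: d["modified"], reverse=True)

for hit in search("customer", scope="projects"):
    print(hit["modified"], hit["title"])
```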

Now that is a very brief answer outlining one possible solution. But the solution is relatively simple to implement and should offer immediate (and measurable) improvements based on feedback regarding the problems people are experiencing. There were two red herrings in the requirements that could have resulted in a very different, more complex, solution: 1. That MOSS was going to be the technology; and 2. The need to find documents for a specific customer. When you have chosen a technology, there is always the temptation to widen the project scope. MOSS has all sorts of features that can help improve information management and the starting point is often to replace an old crusty static intranet. But the highlighted problems did not mention any concerns about the intranet. That’s not to say those concerns do not exist, but they are a different problem and not the priority for this project. The second red herring is a classic. When people want to be able to find information based on certain parameters, such as all documents connected to a specific customer, there is the temptation to implement a corporate-wide taxonomy and start classifying all content, starting with the metadata property ‘customer name’. But documents about a specific customer will likely contain the customer’s name. In this scenario, the simplest solution is to create a central index and provide the ability for users to search for documents containing a given customer’s name. If that fails to improve the situation then you may need to consider more drastic measures.

Rejecting the large-scale information management project in favour of small chunks of continuous ‘just enough’ improvement is not an easy approach to take. The idea of having a centralised, classified and managed store of content, where you can look up information based on any parameter and receive perfect results, continues to be an attractive one with lots of benefits to the business – both value-oriented (i.e. helping people discover information to do their job) and cost-oriented (i.e. managing what people do with information – compliance checks and the like). But a perfectly classified store of content is a utopia. Trying to achieve it can result in systems that are harder to use and more difficult to maintain, when the goal is supposed to be making them easier.

I mentioned that the common approach to implementing taxonomy has two flaws. The first has been discussed here – how to create just enough taxonomy. The second flaw is the desire to create a single universal taxonomy that can be applied to everything. I’ll tackle that challenge in a separate post (a.k.a. this post is already too long…)

Reference: Just Enough Architecture (MSDN Channel 9). Highly recommended. There are plenty of similarities between software architecture and information architecture (of which taxonomy is a subset). Don’t be put off by the techie speak; it debates the pros and cons of formal processes and informal uses, and includes some great non-technical examples of how to find a balance.

Technorati tags: Taxonomy, Tagging, Information Architecture

Metacrap

I can’t help it, the word ‘metacrap’ just makes me want to chuckle.

Anyways, there’s a great blog post over on Wired, by David Weinberger – Metacrap and Flickr Tags: An Interview with Cory Doctorow (transcript and audio available). Lots of useful information and opinion about search challenges, tagging and information rights. Like the following snippet:

No one really worries that when you hit control-r to reply to an email that the first thing it does is make a holus-bolus copy of that first email, which is, itself, a copyrighted work, and then allow you to insert your own commentary as an interleaf between the paragraphs. That’s how you know the Internet was designed by engineers, not lawyers.

The interview includes a great discussion about the differences between implicit and explicit metadata, explaining why the former is better for indexing and the latter is better for tagging. (That’s a very simplistic summary, I’d recommend reading the article to get the full perspective.)

When IT doesn’t matter

I’m currently reading ‘The Labyrinths of Information: Challenging the Wisdom of Systems’ by Claudio Ciborra. I haven’t gotten very far through the book yet; it is written in an academic tone, which always slows me down. But early on, I stumbled across a very interesting point of view.

IT architecture is a popular topic right now. You can get enterprise architects, software architects, infrastructure architects, information architects… the list goes on. One of the focus areas for architecture is the adoption of standards and consistent methods for the design, development and deployment of IT systems. It all sounds very sensible and measurable.

But Claudio makes a simple observation that suggests such architecture doesn’t matter, in that it does not help an organisation to become successful. Instead, architecture is a simple necessity of doing business digitally. This argument concurs with Nicholas Carr’s controversial article (and subsequent book) ‘IT Doesn’t Matter’.

A sample from the book: (note – SIS refers to strategic information systems)

“…market analysis of and the identification of SIS applications are research and consultancy services that can be purchased. They are carried out according to common frameworks, use standard data sources, and, if performed professionally, will reach similar results and recommend similar applications to similar firms.”

So what do you need to do to become an innovative company? Claudio suggests:

“…To avoid easy imitation, the quest for a strategic application must be based on such intangible, and even opaque, areas as organisational culture. The investigation and enactment of unique sources of practice, know-how, and culture at firm and industry level can be a source of sustained advantage…

See, I have been telling those techie-oriented IT folk for years, collaboration and knowledge sharing are far more important than your boring transaction-based systems 🙂

…Developing an SIS is much closer to prototyping and the deployment of end-user’s ingenuity than has so far been appreciated: most strategic applications have emerged out of plain hacking. The capacity to integrate unique ideas and practical design solutions at the end-user level turns out to be more important than the adoption of structured approaches to systems development…”

Sounds like an argument in favour of mash-ups and wikis to me. See also: Let’s make SharePoint dirty

Why taxonomy fails…

Over the years, the desire for taxonomies within organisations seems to gain and lose popularity. At the moment, it seems to be on the rise again…

Whilst I don’t doubt there are scenarios where developing and rigidly applying a detailed taxonomy structure to your content is a sensible approach that delivers valuable benefits for your organisation… I haven’t seen many successful implementations. I have seen plenty of failed attempts. The most common cause of failure is creating an overly complex taxonomy that becomes impossible to implement easily or consistently, and/or ends up making content harder to find (when the opposite is usually desired).

It is not difficult to end up with a complex taxonomy. As soon as you create an initial set of categories and start classifying content, you discover clashes that require subsets of categories to accurately define content, and so the number of terms in your taxonomy begins to increase.

And once you have a complex taxonomy, you have a problem. For starters, applying it to content is a major headache – authors have to navigate nested lists of words to get to the ones that need to be applied to their content. Consistency is impossible to achieve organically (i.e. when people are involved) – words can have different meanings in different contexts and no two people will apply exactly the same classifications to an identical set of content. Who is right and who is wrong? If you’ve got a budget burning a hole in your pocket, you can try skipping this problem by automating the process instead.

But even if you’ve crossed these hurdles and have successfully applied a taxonomy to your content, you may still fail to achieve the desired results. For many organisations, at best, content is still hard to find. At worst, it becomes even harder to find.

The reason content is still hard to find is simple – the people looking for the content are using a different vocabulary to the one used to classify it. They have different requirements and different viewpoints. This issue was demonstrated in a previous blog post: When taxonomy fails…

But how can content become harder to find as a result of using a taxonomy structure? It will depend on what indexing engine you use, but most face the same challenge. Indexing engines usually apply a variety of methods to calculate the relevance rank for each and every item returned in a search results list. Different algorithms work better for different types of content. For example, Google has demonstrated the benefits of PageRank for analysing links between web pages to determine relevance. Text analysis (probabilistic ranking algorithms, often based on Bayesian inference) usually works better for documents by mapping related terms within the body of the document. But one method often overrides these algorithms – matching search query terms to metadata, i.e. property values. The reasoning is simple: if someone has taken the effort to classify a document, then the values in the properties will have significant meaning, more so than analysing the content of the document.

To clarify this point: the more you use metadata properties to classify content, the less influence an indexing engine’s ranking algorithms will have on calculating relevance. There’s a reason why Google pretty much ignores metatags when indexing web sites…

To demonstrate, let’s create a metadata property called ‘Document Type’ (at this point, in case anyone reading this thinks I might be referring to them, I assure you I am not… at least, not just you 🙂). ‘Document Type’ is the most common example given for a desired metadata property when designing a SharePoint deployment.

The benefit of classifying content based on its document type is the ability to quickly narrow search results and improve relevance for specific roles. For example, a project manager might be more interested in documents classified as ‘project’ than documents classified as ‘marketing’. If you do a basic search query (i.e. enter a phrase inside a box and click Go) for “customer X project Y”, you’ll get results listing all content containing the words: customer, X, project, and Y. Because the word ‘project’ matches a property value, all content classified with that property value should appear high in the results list. If you do an advanced search query and select property=Document Type, value=Project, you will get a reduced results set, listing only content classified as project documents containing the words: customer, X, and Y. This is not an unreasonable requirement for organisations with large volumes of content.

However, assuming query terms that match property values have higher significance is a double-edged sword. It improves relevance for searches using the word in the same context but reduces relevance for searches using the word in a different context. For example, another common search might be to find all documents about project Y. Doing a basic search query will return all content containing the words: project and Y. Again, the word ‘project’ will be an exact match against all content classified as project documents (Document Type=Project) meaning all project documents will be treated as more relevant than other content types. But this time, you are not interested in the type of document, you are interested in information about the actual project. The most relevant document might be an internal memo. But unless the word ‘project’ appears in one of the metadata properties for that document, it could be ranked as having lower relevance than all project documents, depending on what other terms are used in the search query.
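
A toy ranker makes the trade-off visible. In this Python sketch the documents, the boost value and the crude term-frequency scoring are all invented – it is not drawn from any particular search engine – but the exact match between the query term ‘project’ and a Document Type value outweighs the text evidence, just as described above:

```python
# Toy relevance scoring: text term-frequency plus a fixed boost when a
# query term exactly matches a metadata property value. The boost size
# and documents are invented for illustration.
METADATA_BOOST = 10.0

docs = [
    {"title": "Project Y internal memo", "doc_type": "memo",
     "text": "project Y is slipping and the customer is unhappy"},
    {"title": "Timesheet template", "doc_type": "project",
     "text": "weekly hours form"},
]

def score(doc, query_terms):
    words = (doc["title"] + " " + doc["text"]).lower().split()
    text_score = sum(words.count(t) for t in query_terms)  # crude term frequency
    meta_score = sum(METADATA_BOOST for t in query_terms
                     if t == doc["doc_type"])              # exact property match
    return text_score + meta_score

query = ["project", "y"]
for doc in sorted(docs, key=lambda d: score(d, query), reverse=True):
    print(score(doc, query), doc["title"])
# The timesheet (Document Type=Project) outranks the memo that is actually
# about project Y, because 'project' matched a property value.
```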

That’s the problem with taxonomies. Not only are they difficult to implement in the first place, they can actually make it harder to find relevant content because they are not great at handling different points of view…

Technorati tags: Taxonomy, Tagging, Information Architecture

When taxonomy fails…

I’m doing a lot of work around information findability at the moment, and at some point will hopefully find time to post some of the research and experiences here.

One element of information findability is content classification. There are various methods to classify content but two words crop up in the boxing ring more than any others – taxonomy and folksonomy. Do a search on taxonomy versus folksonomy and you’ll see what I mean…

There’s an interesting article in the New York Times – One Picture, 1,000 Tags – that highlights why so many people champion folksonomy over taxonomy. The article discusses the challenges faced by museums in making their online collections accessible – how do you find content when you don’t know how to describe what you are looking for? The Metropolitan Museum of Art ran a test and got non-specialist volunteers to apply tags to a selection of art. The results raised eyebrows. More than 80% of the terms used were not listed in the museum’s documentation. The experts classifying content are talking a completely different language to the one used by their content’s audience.
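
You could run the museum’s test on your own content with nothing more than two term lists and a set difference. A minimal sketch in Python, with invented vocabularies standing in for the real taxonomy and user tags:

```python
# Measure the gap between an official vocabulary and the terms the
# audience actually uses. Both term sets are invented for illustration.
expert_terms = {"gilt bronze", "chinoiserie", "reliquary", "triptych"}
audience_tags = {"gold", "dragon", "box", "shiny", "triptych"}

missing = audience_tags - expert_terms   # audience words the taxonomy lacks
print(f"{len(missing) / len(audience_tags):.0%} of audience terms are "
      f"absent from the official vocabulary")
# Prints '80% ...' here – the same order of mismatch the Met found.
```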

This problem all too often applies to corporate scenarios too. Those who are chosen to design the corporate taxonomy are often selected because they are experts. But those who will be using the taxonomy, either applying it or using it to search for content, may have a different point of view. And that’s when the taxonomy fails…

Technorati tags: Taxonomy, Folksonomy

Final note: If you are interested in the use (and pros/cons) of taxonomy, ontology and folksonomy to improve the findability of your content, there’s a great book worth reading: Ambient Findability by Peter Morville.