Web 3.0 and the Semantic Web

I’ve got mixed feelings about the viability of the semantic web but this video is a great compilation of the challenges facing information discovery and possible options. It’s become way easier to create information than to manage it…

Tagging over time

In a report by NetworkWorld: 25 leading-edge IT research projects, there is an interesting feature that could significantly advance the world of information architecture, specifically classification of content. (If you click the link and read the article, it is item 15. There are some other gems in there too.)

Researchers at Penn State University have developed software that not only automatically tags photos when they are uploaded, but also automatically updates the tags based on how people interact with the photos over time. In its current state, the researchers claim the system can grow from 40% to 60% accuracy as it learns from user behaviour.

Whilst the research is focused on images, the technology could have major benefits for traditional information sources – documents and web pages. One of the biggest failings of traditional taxonomies and classification systems has been their inability to cope with change. Language continues to evolve: old words are given new meanings, new words are applied to old meanings. Adaptive tagging could introduce a whole new method of relevance ranking to improve search results.

Filed in library under: Information Architecture

Technorati tags: taxonomy; information architecture; metadata

Zillionics change perspective


Interesting article – ‘Zillionics’ by Kevin Kelly. Well worth a read if you are interested in long tails and social networks, and are wondering where digital information technology is leading us. Here’s a soundbite:

More is different.

When you reach the giga, peta, and exa orders of quantities, strange new powers emerge. You can do things at these scales that would have been impossible before… At the same time, the skills needed to manage zillionics are daunting…

Zillionics is a new realm, and our new home. The scale of so many moving parts require new tools, new mathematics, new mind shifts.

It’s a short and thought provoking article.

And on the same subject, a longer article from Wired: ‘The End of Theory: The Data Deluge Makes the Scientific Method Obsolete’ by Chris Anderson. Hmmm…. we’ll see about that 😉

Header image: World in dots (iStockphoto, not for re-use)

Rethinking the fileplan

Perhaps one of the loudest unspoken messages from the SharePoint conference held recently in Seattle was the need for information architects and managers to work more closely with their user interface (UI) and technology-focused counterparts. Thanks to the Internet, we are unlikely to see a downturn in the market for digital information in the foreseeable future. But the methods used to classify, manage and access information are still dominated by techniques taken from the physical world of information – paper and its storage methods: micro (books) and macro (libraries).

Let’s pick on ‘The Fileplan’

A common scenario I see in organisations, especially government ones, is the use of a fileplan to store and access content. Here’s the definition of a fileplan, courtesy of ‘Developing a Fileplan for Local Government’ (UK) (My comments in brackets):

“The fileplan will be a hierarchical structure of classes starting with a number of broad functional categories. These categories will be sub-divided and perhaps divided again until folders are created at the lowest level. These folders, confusingly called files in paper record management systems (hence the term ‘fileplan’), are the containers in which either paper records or electronic documents are stored.”

And why do we need fileplans?

“An important purpose of the fileplan is to link the documents and records to an appropriate retention schedule.”

Really? Just how many different retention schedules does an organisation need to have? One per lowest-level folder? I doubt that. Let’s create a (very) simple fileplan: Geography – Business Unit – Activity

Take 3 geographies, 3 business units and 3 activities per business unit, and these are the folders you end up with:

  • UK/finance/budget/
  • UK/finance/managementaccounts/
  • UK/finance/projects/
  • UK/IT/operations/
  • UK/IT/procedures/
  • UK/IT/projects/
  • UK/Sales/campaigns/
  • UK/Sales/products/
  • UK/Sales/projects/
  • France/finance/budget/
  • France/finance/managementaccounts/
  • France/finance/projects/
  • France/IT/operations/
  • France/IT/procedures/
  • France/IT/projects/
  • France/Sales/campaigns/
  • France/Sales/products/
  • France/Sales/projects/
  • Germany/finance/budget/
  • Germany/finance/managementaccounts/
  • Germany/finance/projects/
  • Germany/IT/operations/
  • Germany/IT/procedures/
  • Germany/IT/projects/
  • Germany/Sales/campaigns/
  • Germany/Sales/products/
  • Germany/Sales/projects/

So we have 27 different locations to cover 3 geographies with 3 departments and 3 activities. Now scale this up for your organisation. How many different folders do you end up with?
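To make the scaling question concrete, here is a minimal sketch that generates the folder list above and counts it. The function names are my own, and the larger numbers at the end are purely illustrative assumptions, not figures from any real organisation.

```python
# Sample values copied from the folder list above.
GEOGRAPHIES = ["UK", "France", "Germany"]
ACTIVITIES_BY_UNIT = {
    "finance": ["budget", "managementaccounts", "projects"],
    "IT": ["operations", "procedures", "projects"],
    "Sales": ["campaigns", "products", "projects"],
}


def build_fileplan(geographies, activities_by_unit):
    """Return every lowest-level folder path in a three-level fileplan."""
    folders = []
    for geography in geographies:
        for unit, activities in activities_by_unit.items():
            for activity in activities:
                folders.append(f"{geography}/{unit}/{activity}/")
    return folders


def folder_count(geographies, units, activities_per_unit):
    """Lowest-level folders when every unit has the same number of activities."""
    return geographies * units * activities_per_unit


folders = build_fileplan(GEOGRAPHIES, ACTIVITIES_BY_UNIT)
print(len(folders))              # 27
print(folder_count(20, 12, 15))  # 3600 for an assumed larger organisation
```

The count is a straight multiplication, so every extra level or extra option multiplies the number of places an author has to choose between.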

The ultimate killer with this scenario? There isn’t any content in the first 2 levels of the hierarchy. You’ve got to navigate through 3 levels before you can even start to find what you are looking for. This is because a librarian approach is used for storing and locating information:

Go upstairs, ‘Technology’ section is on the left, you’ll find ‘Computing’ about halfway along. Third shelf up is ‘Programming Languages’, books organised alphabetically by author…

In the physical world, we can’t do a ‘Beam me up, Scotty!’ and magically arrive at the shelf containing the book containing the page(s) we want. But in the digital world, we can. If fans of the fileplan designed Google’s navigation, it might look something like this:

And they probably wouldn’t include the search box on the first two pages. Fortunately for everyone who uses the Internet to search for information, Google took the ‘Beam me up, Scotty!’ approach.

The fileplan approach causes problems for everyone. Authors have to find ‘the right’ location to put their stuff. If they are working on anything remotely ambiguous, it is unlikely there will be one clear option. That’s why everyone ends up defaulting to the ‘projects’ folder (‘miscellaneous’ is another popular destination).

Search engines that factor URL depth into relevance struggle to identify relevant content – is the folder ‘Finance’ more important than a document called ‘Finance’ that is two levels deeper in the hierarchy buried under Projects/Miscellaneous? If someone is searching for documents about France, are documents located in the France folder hierarchy more important than documents containing references to France that have been stored in the UK hierarchy? Authors don’t know where to put their stuff, and searchers can’t find it.

What about those all-important retention schedules? They might be different for different geographies (governments don’t seem to agree or standardise on anything much, globally) but then what? Do all Finance docs have a different retention schedule to all IT docs? Within Finance, do different teams have different retention schedules? (Quite possibly – certain financial documents need storing for specific periods of time.) Current solution? Sub-divide and conquer, i.e. create yet another level of abstraction in the fileplan… I have seen solutions where users have to navigate through 6 levels before reaching a folder that contains any content.

So what’s the alternative?

Perhaps a better question would be ‘what’s an alternative?’ The desire to find one optimal solution is what trips up most information system designs. Here are some of my emerging thoughts. If you’ve got an opinion, please contribute in the comments because I certainly don’t have all the answers.

Step 1: Stop thinking physically and start thinking digitally

There are two fundamental problems with the fileplan. First, it originates from the constraints enforced by physical technologies. A paper document must exist somewhere and you don’t want to have to create 100 copies to cover all retrieval possibilities – it’s expensive and time-consuming. Instead, all roads lead to one location… and it’s upstairs, third cabinet on the right, second drawer down, filed by case title. This approach creates the second problem – because content is managed in one place, that one place – the fileplan – must cover all purposes, i.e. storage, updates, retention schedule, findability and access. Physical limits required you to think this way. But those limits are removed when you switch to digital methods. What we need are multiple file plans, each suited to a specific purpose.

Information specialists can help identify the different purposes and different ‘file plans’ required. Technologists need to help create solutions that make it as easy as possible (i.e. minimal effort required) for authors and searchers to work with information and ‘fileplans’. And user interface specialists need to remind everyone about what happens when you create mandatory metadata fields and put the search box in the wrong place on the page…

Digital storage of content should be logical to the creators, because authors ultimately decide where they save their documents. Trying to force them into a rigid navigation hierarchy designed by somebody else just means everything gets saved in ‘miscellaneous’. Don’t aim for a perfect solution. Instead, provide guidance about where ‘stuff’ should go: areas for personal ‘stuff’, team ‘stuff’, community sites, collaborative work spaces, ‘best practices’ sites. Ideally, you still want to stick to one location. Not because of any resource constraints but rather to avoid unnecessary duplication that can cause confusion. If an item of content needs to appear ‘somewhere else’ then it should be a link rather than a copy, unless a copy is required to fit a different scenario (e.g. publishing a copy of a case study up onto a public web site, but keeping the original held in a location that can only be edited by authors).

To improve relevance of search results, thesauri and controlled vocabularies can help bridge the language barriers between authors and readers. A new starter might be looking for the ‘employee manual’. What they don’t know is that what they are actually looking for is the ‘corporate handbook’ or ‘human remains guide’ that may contain the words ‘employee’ and ‘manual’ but never together in the same sentence. The majority of search frustrations come from information seekers using a different language to the one used by the authors of the information they seek. Creating relationships between different terms can dramatically improve relevance of search results. Creating tailored results pages (a mix of organic search results and manufactured links) can overcome differences in terminology and improve future search behaviour.
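A tailored results page can be as simple as a lookup that puts hand-picked links ahead of whatever the engine found on its own. The sketch below assumes a hypothetical BEST_BETS table and an invented handbook location; it is not how any particular product implements the feature.

```python
# Hypothetical 'manufactured links', keyed by the phrases searchers actually type.
BEST_BETS = {
    "employee manual": ["Corporate Handbook (intranet/hr/corporate-handbook)"],
}


def tailored_results(query, organic_results, best_bets=BEST_BETS):
    """Return manufactured links for the query followed by the organic results."""
    promoted = best_bets.get(query.strip().lower(), [])
    # Skip organic hits that are already promoted, so nothing is listed twice.
    remaining = [result for result in organic_results if result not in promoted]
    return promoted + remaining


print(tailored_results("Employee Manual", ["Canteen menu", "Car park rules"]))
```

The searcher still types their own words, but the first link they see uses the author’s term, which is how the page can nudge future search behaviour.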

And the elephant in the file system – retention schedules. First identify what retention schedules you require to comply with industry regulations and to manage legal risk. And do they apply to all content or only certain content? (I doubt many government organisations have kept junk paper mail for 30 years.) And at what point do they need to be applied? From the minute somebody opens a word processor tool and starts typing, or at the point when a document becomes finalised? This is the area that needs most coordination between information specialists and technologists. As we start to move to XML file formats, life could potentially become so much easier for everyone. For example, running scripts to automatically track documents for certain words that give a high probability that the document should be treated as a record and moved from a community discussion forum to the archive. Automatically inserting codes that enable rapid retrieval of content to comply with a legal request but that have no effect on relevance for regular searches.
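As a rough illustration of the kind of script I have in mind, the sketch below scores a piece of text against a list of trigger words and flags it as a probable record once the score passes a threshold. The words, weights and threshold are all invented for the example; a real list would have to come from your own records policy, and a simple score is only a stand-in for a proper probability.

```python
import re

# Hypothetical trigger words and weights (assumptions, not a real policy).
RECORD_TERMS = {"contract": 3, "invoice": 3, "approved": 2, "signed": 2, "final": 1}
RECORD_THRESHOLD = 4  # assumed cut-off score


def record_score(text):
    """Add up the weights of every trigger word that appears in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(RECORD_TERMS.get(word, 0) for word in words)


def looks_like_record(text, threshold=RECORD_THRESHOLD):
    """True when the text scores high enough to be reviewed as a record."""
    return record_score(text) >= threshold


print(looks_like_record("The signed contract was approved on Friday."))  # True
print(looks_like_record("Anyone fancy lunch?"))                          # False
```

A flagged discussion thread could then be queued for a move to the archive, without the author having to decide anything up front.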

On the Internet, Google introduced a ‘nofollow’ value for the rel attribute that can be applied to links to prevent the link improving the target page’s relevance rank. (PageRank works by determining relevance based on the number of incoming links to a page. If you want to link to a page so that people can look at it but you don’t want the page to benefit from the link in search results, you can insert ‘nofollow’.) Maybe Enterprise Search solutions need a similar method: different indicators for metadata that helps describe content for searches versus metadata that organises content for retention schedules versus metadata that helps authors remember where they left their stuff. And again, XML formats ought to make it possible to automatically insert the appropriate values without requiring the author to figure out what’s needed. The ultimate goal would be to automatically insert sufficient information within individual content items so that requirements are met regardless of where the content is stored or moved to. I email an image to someone and its embedded metadata includes its fileplan(s).
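One way to picture purpose-specific metadata is to give every field a ‘purpose’ marker and let each consumer read only the fields meant for it. The class names, purpose values and the retention code below are hypothetical, chosen just to show the shape of the idea.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataField:
    name: str
    value: str
    purpose: str  # assumed values: "search", "retention" or "author"


@dataclass
class ContentItem:
    title: str
    metadata: list = field(default_factory=list)

    def values_for(self, purpose):
        """Return only the metadata intended for one purpose."""
        return {m.name: m.value for m in self.metadata if m.purpose == purpose}


item = ContentItem(
    "Q3 budget forecast",
    [
        MetadataField("topic", "budget", "search"),
        MetadataField("retention_code", "FIN-7Y", "retention"),  # invented code
        MetadataField("my_folder", "stuff to finish", "author"),
    ],
)

print(item.values_for("search"))     # what the search index would see
print(item.values_for("retention"))  # what a records process would see
```

Because the metadata travels with the item, the search index never has to weigh the retention code, and the records process never has to care which folder the author chose.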

There are lots of ways that technology could be used to improve information management and findability, to meet all the different scenarios demanded by different requirements. But to achieve them requires closer interaction between people making the policies regarding how information is managed, people creating the so-called ‘technology-agnostic’ (in reality it is ‘technology-vendor-agnostic’) file plans to satisfy those policies and the technology vendors creating solutions used to create, store and access the content being created that have to cope with the fileplans and the policies.

The information industry has to move on from the library view of there being only one fileplan. Lessons can be learned from the food industry. There was a time when there was only one type of spaghetti sauce. In the TED talk below, Malcolm Gladwell explains how the food industry discovered the benefits from offering many different types of spaghetti sauce (and why you can’t rely on focus groups to tell you what they want – another dilemma when designing information systems):


There is a great quote within the above talk:

“When we pursue universal principles in food, we aren’t just making an error, we are actually doing ourselves a massive disservice”

You could replace the word ‘food’ with ‘information’. It’s not just the fileplan that needs rethinking…

Don’t count on the data

A follow up to uncontrolled vocabulary.

Many of the more advanced search engines try to improve relevance by extracting meaning out of the content of documents, i.e. without any input required from users. This is commonly known as auto-classification (or auto-categorisation). It then enables you to perform faceted search queries (also known as conceptual search). For example, if you search on ‘politics’, the results might include facets for different political parties (Democrats, Republicans), politicians (Bush, Clinton, McCain, Obama), government activities (budget, new policies) and so on. It’s an effective way of quickly narrowing down your search criteria and hence refining results.
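To show what the facets on a results page amount to, here is a minimal sketch that counts facet values across the documents matching a query. The documents and their facet values are made up, and the facets are assumed to have already been assigned by an auto-classification step that is not shown.

```python
from collections import Counter

# Hypothetical documents, already auto-classified into facets.
DOCUMENTS = [
    {"text": "politics: Obama sets out budget plans",
     "facets": {"party": "Democrats", "politician": "Obama", "activity": "budget"}},
    {"text": "politics: McCain responds on new policies",
     "facets": {"party": "Republicans", "politician": "McCain", "activity": "new policies"}},
    {"text": "politics: Clinton comments on the budget",
     "facets": {"party": "Democrats", "politician": "Clinton", "activity": "budget"}},
    {"text": "sport: cup final results", "facets": {}},
]


def facet_counts(query, documents):
    """Count facet values across the documents whose text contains the query."""
    counts = {}
    for doc in documents:
        if query.lower() in doc["text"].lower():
            for facet, value in doc["facets"].items():
                counts.setdefault(facet, Counter())[value] += 1
    return counts


counts = facet_counts("politics", DOCUMENTS)
print(counts["party"])  # Counter({'Democrats': 2, 'Republicans': 1})
```

Clicking a facet is then just a second filter on the same result set, which is why it narrows things down so quickly.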

The challenge is: can you trust the data? How accurate is the content within your documents? For example, let’s look at the number 35.

Why 35?

That’s the average number of people wrongly declared dead every day in the US as a result of data input errors by social security staff. Doh! 🙂 (Source: MSNBC. Published in NewScientist magazine, 8 March 2008)

Uncontrolled vocabulary

In case you are wondering, this is not going to be an expletive-filled post or a discussion about what happens if you suffer damage to the frontal lobe of your brain.

One of the big challenges for enterprise search solutions is deciphering meaning from words. There are five aspects to the challenge. (Possibly more, given I originally started this post with two!)

1. The same word can have multiple meanings

We only have so many words, and we have a habit of using the same words to mean very different things. For example:

  • Beetle = insect or motor vehicle?
  • Make-up = be nice to someone or apply paint to your face?
  • Sandwich = literal (food) or lateral (something squashed between something else)?
  • Resume = to continue or required for a job application?
  • SME = subject matter expert or small to medium enterprise?

Double meanings can be partially solved by studying the words in the context of the words around them. If resume means to continue, it is usually preceded by the word ‘to’. To determine if the beetle in question has legs or wheels, the phrase will hopefully include a little more information – not too many people (intentionally) put fuel in the insect version or find the motor version in their shower.

A popular method used by search engines to decipher meaning from documents (and search queries) is Bayesian Inference. For the SharePoint fans in the audience, the answer is no. SharePoint does not use Bayesian Inference in any shape or form at the moment. It used to, back in the first version (SPS 2001) but currently doesn’t. Yes, other search engines do.
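For anyone curious what Bayesian inference looks like at its very simplest, here is a toy naive Bayes sketch that guesses which ‘beetle’ a phrase means from the words around it. The four training phrases are invented, add-one smoothing is my own choice, and no claim is made that any commercial search engine works exactly like this.

```python
import math
from collections import Counter, defaultdict

# Tiny hypothetical training set: phrases labelled with the sense of 'beetle'.
TRAINING = [
    ("insect", "beetle crawling in the shower"),
    ("insect", "beetle with six legs on a leaf"),
    ("vehicle", "put fuel in the beetle"),
    ("vehicle", "beetle with new wheels and engine"),
]


def train(examples):
    """Count how often each word appears alongside each sense."""
    word_counts = defaultdict(Counter)
    sense_counts = Counter()
    for sense, phrase in examples:
        sense_counts[sense] += 1
        word_counts[sense].update(phrase.lower().split())
    vocabulary = {word for counter in word_counts.values() for word in counter}
    return word_counts, sense_counts, vocabulary


def most_likely_sense(phrase, model):
    """Pick the sense with the highest log P(sense) + sum of log P(word | sense)."""
    word_counts, sense_counts, vocabulary = model
    total = sum(sense_counts.values())
    scores = {}
    for sense in sense_counts:
        score = math.log(sense_counts[sense] / total)
        # Add-one smoothing so unseen words do not zero out the probability.
        denominator = sum(word_counts[sense].values()) + len(vocabulary)
        for word in phrase.lower().split():
            score += math.log((word_counts[sense][word] + 1) / denominator)
        scores[sense] = score
    return max(scores, key=scores.get)


model = train(TRAINING)
print(most_likely_sense("beetle needs fuel", model))    # vehicle
print(most_likely_sense("beetle in my shower", model))  # insect
```

With so little training data the result is easily swayed, which is also a hint at why dirty content can lead this kind of approach astray.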

Abbreviations can be trickier because they are often assumed/applied in context of a human perspective. For example, is a search for ‘Health care SME’ looking for an internal expert on health care to ask them a question or looking for small businesses in the health care industry to include in a targeted sales campaign?

2. The same meaning is applied to different words

This one is easiest to explain with an example. A typical trans-atlantic flight offers you the choice of up to 4 ‘classes’ – Economy, Economy Plus, Business, First. But it seems no two airlines use these precise words any more. For example:

Controlled Vocabulary | Actual: Virgin Atlantic | Actual: British Airways
Economy               | Economy                 | World Traveller
Economy Plus          | Premium Economy         | World Traveller Plus
Business              | Upper Class             | Club World
First                 | n/a                     | Club First

A common method to identify different words that share the same meaning is a thesaurus. In enterprise search solutions, a thesaurus is typically used to replace or expand a query phrase. For example, the user searches for ‘flying in economy plus’, the search engine checks the thesaurus and also searches for ‘premium economy’ and ‘world traveller plus’.
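Query expansion with a thesaurus can be sketched in a few lines. The thesaurus below is a hypothetical one built from the airline table above, and the simple substring matching is an assumption made to keep the example short.

```python
# Hypothetical thesaurus: controlled term -> the words the airlines actually use.
THESAURUS = {
    "economy plus": ["premium economy", "world traveller plus"],
    "business": ["upper class", "club world"],
}


def expand_query(query, thesaurus=THESAURUS):
    """Return the original query plus one variant per synonym of any matched term."""
    lowered = query.lower()
    variants = [lowered]
    for term, synonyms in thesaurus.items():
        if term in lowered:
            variants.extend(lowered.replace(term, synonym) for synonym in synonyms)
    return variants


print(expand_query("flying in economy plus"))
# ['flying in economy plus', 'flying in premium economy', 'flying in world traveller plus']
```

The engine would then run all the variants and merge the results. Replacing the query rather than expanding it is the same idea with the original phrase dropped from the list.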

3. Different words have the same meaning in theory but are different in practice

Back to our airlines. Having travelled in the Economy Plus cabin of both Virgin Atlantic and British Airways recently, I can tell you from first-hand experience that the experience varies. On a Virgin Atlantic flight, Economy Plus is half-way between Economy and Business (Upper Class in their terminology). On a British Airways flight, Economy Plus is one step away from Economy and a world away from Business (Club World in their terminology). Added to the challenge, experience changes over time. It’s been a few years since I last travelled Economy Plus on British Airways and I would say that it is not as good as it used to be. But then I’ve had the luxury of travelling Business Class with them in between. The Virgin Atlantic flight was on a plane that had just had its Economy Plus cabin refurbished and was all shiny and new. And I have never travelled Business Class on Virgin Atlantic.

This is a tricky issue for a search engine to solve. It crosses into the world of knowledge and experience, a pair of moving and ambiguous goal posts at the best of times.

4. Different words are spelt near-identically

A cute feature available in most search engines is the ‘Did you mean’ function. If you spell something slightly incorrectly, for example: ‘movy’, the search engine will prompt you with ‘did you mean movie?’
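A bare-bones ‘Did you mean’ can be put together with Python’s standard difflib module, which ranks candidate words by similarity. The word list and the 0.6 cutoff are assumptions for the sketch; a real engine would more likely draw its suggestions from the index or from query logs.

```python
import difflib

# Hypothetical vocabulary of known words.
KNOWN_WORDS = ["movie", "music", "theatre", "stationery", "stationary"]


def did_you_mean(word, known_words=KNOWN_WORDS):
    """Suggest the closest known word, or None if the word is known or nothing is close."""
    lowered = word.lower()
    if lowered in known_words:
        return None  # a correctly spelt word never triggers a suggestion
    matches = difflib.get_close_matches(lowered, known_words, n=1, cutoff=0.6)
    return matches[0] if matches else None


print(did_you_mean("movy"))        # movie
print(did_you_mean("stationery"))  # None
```

Note the second call: a word that exists in its own right sails straight through, whether or not it is the word the searcher meant.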

The trickier challenge for search engines is when a user types in the wrong word (often accidentally) but the word typed exists in its own right (and hence doesn’t get spotted or picked up by spell-checkers):

  • Stationary (not moving) vs Stationery (materials for writing)
  • There (somewhere) vs Their (people)
  • Here (somewhere) vs Hear (what your ears do)
  • Popular (people like you) vs Poplar (a type of tree)

As in the first challenge, advanced algorithms can be used to try and decipher meaning by also looking at words used in association. ‘Can someone please move that stationery vehicle?’ probably isn’t referring to a cardboard cut-out. But this scenario is much harder to sport. (Spot the deliberate typo.) For example: ‘information about popular trees’ is a different query to ‘information about poplar trees’ but both look just fine to the search engine. Worse still, the content being indexed may itself contain a lot of typos. An author might have been using ‘stationery’ when they should have been using ‘stationary’ for years. (I’m supposed to be good at spelling and I had to check the dictionary to be sure.) In this scenario, Bayesian Inference can start to fail because there is now a range of words in documents containing ‘stationery’ that have nothing to do with paper or writing materials but occur often enough across a set of documents to suggest a relationship. Search queries correctly using ‘stationary’ won’t find highly relevant results containing mis-spelt ‘stationery’ unless the thesaurus is configured to spot the errors.

5. Different words are used by different people to mean the same thing

This is the ultimate gotcha for enterprise search engines, and the reason why most taxonomies fail to deliver on their promises. Search engines that rely on word-breaking to match search queries with index results can fall down completely when the search query uses words that don’t actually exist in the document that the searcher is looking for. For example, a search for ‘employee manual’ could be seeking the ‘human resources handbook’ that doesn’t actually contain the words employee and manual together in the same sentence, let alone in any metadata. Because there is no relation, the relevance will be deemed low to the query.

The thesaurus and friends (such as Search Keywords/Best Bets) can again come to the rescue – for example, tying searches for ‘employee’ with ‘human remains’. The challenge is who decides what goes in the thesaurus? Referring to a previous blog post – When taxonomy fails – the Metropolitan Museum of Art in New York ran a test of their taxonomy using a non-specialist audience. More than 80% of the terms used by the test audience were not even listed in the museum’s documentation. The experts who decided on the taxonomy were talking a completely different language to the target audience. When the experts are allowed to ‘tweak’ the search engine, don’t assume that search results will improve…

Summing Up (i.e. closing comments, not a mathematical outcome)

I’ve over-simplified how search engines work and apply controlled vocabularies through tools such as thesauri. But hopefully this post helps show why so many people become frustrated with the results returned from enterprise search solutions, even the advanced ones. Quite often, the words simply don’t match.

[Update: 20 March] Using the ‘Stationary vs Stationery’ example in the most recent Enterprise Search workshop, an attendee came up with a great little snippet of advice to remember the difference. Remember that envelopes begin with the letter ‘e’ and they are stationery. And hilariously, I can now look at the Google ads served up alongside this post and spot the spelling mistakes 🙂


Technorati tags: Enterprise Search; Taxonomy; Information Architecture; Search Server

How relevant is your content?

Last year I wrote a blog post that generated a bit of offline feedback – Search Lessons. Or rather, it was explaining The Google™ Effect that sparked a debate.

“Why can’t our enterprise search just be like Google?”

People are used to Google when searching the Internet – simple (and single) user interface, fast results. Relevance varies depending on the search but is usually good enough for most queries. When it comes to searching for information within the boundaries of your organisation, the answers are not so easy to find…

There’s one simple answer: Intranet ≠ Internet.

They might look similar, thanks to web standards, but looks can deceive. A mushroom looks like a toadstool but one is edible and the other isn’t.

On the Internet, pages want to be found. The same is rarely true within the enterprise.

On the Internet, if your site doesn’t appear in search results, it doesn’t exist. That’s a huge incentive to ensure anyone publishing content uses tags and any other tricks to attract the attention of a search engine. Enough of an incentive to create a whole new industry – Search Engine Optimisation (SEO). The same incentives rarely apply within the enterprise and few organisations budget for in-house SEO. Added to that, the search vendors have to continuously optimise the index to balance any negative effects caused by SEO. In-house? You need to do both! Enterprise search is a continuous improvement process, not an isolated project.

On the Internet, there is no relationship between the search engine and the end-user. If you don’t find what you are looking for, you try a different query or look somewhere else. There is no contract stating that the Internet must provide you with answers. There are no guarantees that what you find will be accurate or useful. When an enterprise search engine fails to perform, the IT department usually hears about it.

On the Internet, all users are equal (and anonymous). You can’t log in to Google with a different ID and suddenly access previously hidden search results. In the enterprise, security is rarely considered optional.

The Internet doesn’t care what you do with what you find. An enterprise search engine should.

When an enterprise search engine fails to deliver, the technology usually gets the blame:

“Our search engine is rubbish, the results it finds are never relevant.”

But is it really the search engine that is to blame? When it comes to calculating relevance, it all depends on the content. Jenny Everett, over on the SharePoint Blogs, published the results of a survey that clearly demonstrate the real challenge facing enterprise search solutions.

Source: SharePoint Search is not Psychic by Jenny Everett

85% of search issues had nothing to do with relevance ranking. The numbers may vary by organisation, but I would wager that the top two issues are consistent. Imagine if you performed a search on the Internet and the first page of results consisted of items with the title ‘Blank’.

Will we see the Semantic Web?

The Semantic Web, in theory, is a brilliant idea. For a starting point to read all about it, check out its entry on Wikipedia. In a nutshell, the Semantic Web is about bringing meaning to data and web pages. The ultimate goal appears to be to make data work without (and on behalf of) humans. At the moment, if my car starts feeling ill, the computer displays a warning light on the dashboard. I ignore the warning light until the car all but stops working. Then I phone the garage, get it towed in and they repair it. In a Semantic Web world, my car wouldn’t let me be so flippant. It might still display that warning light to humour me, but it will also send a message to the garage and book itself an appointment to repair the failing component. If fully automated driving ever takes off, it will take itself to the garage and come back home when it’s ready.

OK, so there are probably more realistic examples to demonstrate the benefits of a Semantic Web, but the principles (in simple terms) are there:

  • a universal method to describe data (ontology)
  • a standard language to store data (XML)
  • a common mechanism to move data (web services)

In my optimistic example, cars will use a universal ontology to describe their ‘bits’ and associated information to trigger alerts when something goes wrong. That data will be stored using an agreed XML schema that all garages can read and will be transmitted to the garage (subject to a wireless network being available) using a published web service. Similarly, all garages will use a universal ontology, XML schema and web service to make their diaries available for cars to locate free slots and book in appointments… and so it goes on.
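To make the car example a little more tangible, here is a sketch of the message itself: the car writes a fault report as XML and the garage reads it back. The element names (faultReport, vehicle, component, severity) are invented for illustration and do not come from any agreed schema, which is rather the point of what follows. The web service call that would carry the message is left out.

```python
import xml.etree.ElementTree as ET


def build_fault_report(vehicle_id, component, severity):
    """Car side: describe the fault using a (hypothetical) shared vocabulary."""
    report = ET.Element("faultReport")
    ET.SubElement(report, "vehicle").text = vehicle_id
    ET.SubElement(report, "component").text = component
    ET.SubElement(report, "severity").text = severity
    return ET.tostring(report, encoding="unicode")


def read_fault_report(xml_text):
    """Garage side: turn the XML back into plain values."""
    report = ET.fromstring(xml_text)
    return {child.tag: child.text for child in report}


message = build_fault_report("CAR-001", "alternator", "high")
print(message)
print(read_fault_report(message))
```

The XML and the transport are the easy parts. The garage only understands the message if it agrees that ‘alternator’ and ‘high’ mean what the car thinks they mean.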

Two of those principles are achievable and already being adopted in many business and consumer systems. One isn’t. There will never be a universal method to describe all data and, if there were, it would go the way of Google’s PageRank algorithm – successful for as long as it takes for the system to be gamed. Search Engine Optimisation would be replaced with a new black art: Semantic Web Optimisation.

As long as human nature is involved in designing (and using) an ontology, there will never be agreement. I have just finished reading a wonderful book – Glut: Mastering Information Through The Ages. It documents the attempts to classify and organise information literally since records began, and highlights the common cause of failure – different perspectives reach different conclusions. Many people reference the library as a shining example of how taxonomy can work, citing the Dewey Decimal system. The Dewey Decimal system was first created in 1876 by Melvil Dewey. Dewey gave higher status to Christian religions (sections 200 through 280) than all others (Judaism and Islam are given just one section each – 296 and 297 respectively). I wonder what religion he belonged to? His design was not without context and therefore not universal…

And history appears to have taught us nothing about our lack of ability to work on anything universal. In the 18 August 2007 issue of the New Scientist magazine, there were two articles covering the same piece of the planet. The first article questioned whether or not a tipping point has already been reached with regard to climate change. Concern focused on the melting Greenland ice sheet and how the consequences will affect us all. In the news section, a small article covered ‘Arctic monkeying’, the race to claim ownership of a seabed structure running from Greenland to Russia. Apparently, a Russian submarine planted a flag on the seabed beneath the North Pole. Canada is building a military base and deep-water port to govern the emerging North-West passage (about that melting ice sheet…) Denmark has sent researchers to assess its claim on the untapped oil and gas reserves that just happen to lie in the region under dispute… amply demonstrating why we are doomed to suffer the consequences of climate change. For goodness sake, we can’t agree on a single currency, spoken language or even agree to all drive on the same side of the road. Diversity is a core part of what it is to be human. Universality requires human nature to exit through the door to the right (about that talk of climate change…)

So what will become of the Semantic Web?

XML and web services continue to spread through systems and applications. Even Microsoft, never the first to sign up to an open standard 🙂 has adopted XML formats in its latest Office applications, and web services are used to communicate between desktop applications and services/online services.

Ontology is the difficult member of the semantic family – unfortunate because ontology is the ‘semantic’ in The Semantic Web. Hierarchical ‘top-down’ taxonomies continue to battle with networked ‘bottom-up’ folksonomies and history suggests such conflict is a necessary feature (not a bug) of evolving systems. User tagging has proven unexpectedly successful, thanks to Flickr and blogging, and demonstrated the value of folksonomies. But inconsistencies remain rife. If I want to Technorati tag a blog post about Web 2.0, do I use web 2.0, web-20, web2.0, web20…? Perhaps that’s the trade-off we will have to accept – forget striving for a universal method and instead learn to work with local variants and their different contexts. Accept that the results won’t be perfect. Just like the real world 🙂
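Working with local variants rather than fighting them could start with something as crude as the sketch below, which groups tags that differ only in case, spacing or punctuation. The normalisation rule is my own assumption and is deliberately blunt, so it will occasionally merge tags that were meant to be different.

```python
import re
from collections import defaultdict


def normalise_tag(tag):
    """Lower-case the tag and strip everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", tag.lower())


def group_variants(tags):
    """Group tags that collapse to the same normalised form."""
    groups = defaultdict(list)
    for tag in tags:
        groups[normalise_tag(tag)].append(tag)
    return dict(groups)


print(group_variants(["web 2.0", "web-20", "web2.0", "web20", "Semantic Web"]))
# {'web20': ['web 2.0', 'web-20', 'web2.0', 'web20'], 'semanticweb': ['Semantic Web']}
```

Nobody has to agree on the ‘right’ tag; the system just learns that the four variants belong together.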

Without doubt, the use of metadata to enhance search results and create new systems has the potential to disrupt and is attracting huge attention. You can see some of Google’s work on multi-faceted search (pivoting results based on classification category) over at http://www.google.com/experimental/index.html. Microsoft Research published research on ‘object-level vertical’ search algorithms, although a more exciting demo can be found at TED – Ideas worth spreading: using Photosynth to combine photos (found through their tags) into a 3D virtual world.

Nicholas Carr, yes he who said I.T. doesn’t matter but can’t seem to stop writing about it anyway :-), perhaps identified where the semantic web will come to life – the real Web 2.0: the datacenters competing to host our information in the ‘cloud’. Will the providers of online services become the gatekeepers to different semantic webs?


Technorati tags: semantic web; web 2.0

Just Enough Taxonomy

On Microsoft’s Channel 9 network, there is an interesting podcast called ‘Just Enough Architecture’, where the interviewee provides some good recommendations about the balance between how much architecture you need versus just getting on and writing software that does something useful.

The same debate could be applied to taxonomy, specifically the use of metadata properties to classify content.

For some reason, most companies who decide they want to improve how content is classified seem to want extreme taxonomy, swinging from not-enough taxonomy to too-much. The mantra may sound somewhat familiar:

One taxonomy to rule them all, one taxonomy to find them, one taxonomy to bring them all and, in the records management store, define them

Often starting with none at all (i.e. content is organised informally and inconsistently using folders), the desire is to create a single corporate taxonomy to classify everything (using a hierarchical structure of metadata terms). An inordinate amount of time is then spent defining and agreeing the perfect taxonomy (for some reason, many seem to settle on about 10,000 terms). Several months later, heads are being scratched as people try to figure out just how they are going to implement the taxonomy. Do they classify existing content or only apply it to new stuff? Do they have specific roles dedicated to classifying the content, rely on the content owners to do it, or look at automated classification tools? Do they put rules in place to force people to classify content and store it in specific locations that are ‘taxonomy-aware’? How do they prevent people bypassing the system, those who figure they can still get their work done by switching to a wiki or a Groove workspace or a MySpace site or a Twitter conversation? How do they validate the taxonomy and check that people are classifying correctly? What do they do about people who aren’t classifying correctly, who don’t understand the hierarchy or have different meanings for the terms in use? What started out as a simple idea to improve the findability of information becomes a huge burden to maintain with questionable benefits, given there are so many opportunities for classification to go wrong.

This dilemma reveals two flaws that make implementing a taxonomy so difficult. The first is the desire to treat taxonomy as a discrete project rather than an organic one. Collaboration and knowledge management projects often share this fate. Making taxonomy a discrete project usually means tackling it all in one go from a technology perspective and then handing it over to the business to run ‘as is’ for evermore (i.e. until the next technology upgrade). Such projects end up looking like that old cliché – attempting to eat an elephant whole. The project team tries to create a perfect design that will deliver all identified requirements (and the business, knowing this could be its one chance for improved tools, delivers a loooooong list of requirements), implements a solution and then moves on to the next project. As the solution is used, the business finds flaws in its requirements or discovers new ways of working enabled by the technology, but it is too late to get the solution changed. The project is closed, the budget spent.

An alternative approach is to treat taxonomy as an organic project or, for those who prefer corporate-speak, a continuous-improvement programme. Instead of planning to create and deploy the perfect taxonomy, concentrate on ‘just enough taxonomy’. A good starting point is to find out why taxonomy is needed in the first place. If it is to make it easier for people to find information, first document the specific problems being experienced. Solve those problems as simply as possible, test the solutions and gather feedback. If successful, people will raise the bar on what they consider good findability, generating new demands waiting for IT to solve, and so the cycle continues.

The following is a simple example using a fictitious company.

Current situation: Most information is stored in folders on file shares and shared via email. There is an intranet that is primarily static content published by a few authors. The IT department has been authorised to deploy Microsoft Office SharePoint Server 2007 (MOSS).

General problem: Nobody can find what they are looking for (resist temptation to sing U2 song at this moment…)

Specific problems: Difficult to find information from recently completed projects that could be re-used in future projects; Difficult to differentiate between high quality re-usable project information versus low quality or irrelevant project information; Difficult to find all available documents for a specific customer (contracts, internal notes, project files)

Possible solution: Deploy a search engine to index all file folders and the intranet. Move all project information to a central location. Within the search engine, create a scope (or collection) for the project information location. Users will then be able to perform search queries that will return only project information within the results. Using ‘date modified’ as the sorting order will locate information from the most recent projects. Create a central location for storing top-rated ‘best practice’ project information. Set up a team of subject matter experts to work with project teams and promote documents as ‘best practice’. The Best Practices store can be given high visibility throughout the intranet and promoted as high relevance for search queries.
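For the technically minded, the possible solution can be sketched in a few lines of Python. This is a toy model under my own assumptions, not how MOSS or any real search engine is configured: the `Document` class, the `search` function, the folder paths and the sample documents are all hypothetical. It simply shows the three ideas working together: a scope that limits results to the project location, ‘date modified’ as the sort order, and best-practice documents promoted to the top.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Document:
    title: str
    location: str          # folder or site the document lives in
    modified: date
    text: str
    best_practice: bool = False  # set by the subject matter experts


def search(docs, query, scope=None):
    """Return documents whose text contains the query, optionally within a scope.

    Best-practice documents are listed first; within each group the most
    recently modified documents come first.
    """
    q = query.lower()
    hits = [
        d for d in docs
        if q in d.text.lower()
        and (scope is None or d.location.startswith(scope))
    ]
    return sorted(hits, key=lambda d: (d.best_practice, d.modified), reverse=True)


# Hypothetical content spread across the project store and the intranet.
docs = [
    Document("Old bid", "/projects/alpha", date(2006, 3, 1),
             "Bid document for Acme Ltd"),
    Document("Handover notes", "/projects/beta", date(2007, 8, 20),
             "Handover notes for Acme Ltd"),
    Document("Bid template", "/projects/best-practice", date(2007, 1, 15),
             "Approved bid template, used for Acme Ltd", best_practice=True),
    Document("Canteen menu", "/intranet/news", date(2007, 9, 1),
             "Acme Ltd sandwiches on Friday"),
]

results = search(docs, "acme", scope="/projects")
titles = [d.title for d in results]
# titles == ["Bid template", "Handover notes", "Old bid"]
```

Note that nothing here required a taxonomy. The only ‘metadata’ in play is where a document lives, when it was last changed and whether an expert has promoted it.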

Now that is a very brief answer outlining one possible solution. But the solution is relatively simple to implement and should offer immediate (and measurable) improvements based on feedback regarding the problems people are experiencing. There were two red herrings in the requirements that could have resulted in a very different, more complex, solution: 1. That MOSS was going to be the technology; and 2. The need to find documents for a specific customer. When you have chosen a technology, there is always the temptation to widen the project scope. MOSS has all sorts of features that can help improve information management and the starting point is often to replace an old crusty static intranet. But the highlighted problems did not mention any concerns about the intranet. That’s not to say those concerns do not exist, but they are a different problem and not the priority for this project. The second red herring is a classic. When people want to be able to find information based on certain parameters, such as all documents connected to a specific customer, there is the temptation to implement a corporate-wide taxonomy and start classifying all content, starting with the metadata property ‘customer name’. But documents about a specific customer will likely contain the customer’s name. In this scenario, the simplest solution is to create a central index and provide the ability for users to search for documents containing a given customer’s name. If that fails to improve the situation then you may need to consider more drastic measures.
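To illustrate why the second red herring can often be sidestepped, here is a minimal sketch of a central index that finds a customer’s documents from their content alone, with no ‘customer name’ metadata property. The function names, file paths and documents are all invented for the example, and a real search engine does far more than this.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lower-cased word to the set of document paths that contain it."""
    index = defaultdict(set)
    for path, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(path)
    return index


def find_customer(index, customer_name):
    """Return paths of documents containing every word of the customer's name."""
    words = re.findall(r"[a-z0-9]+", customer_name.lower())
    if not words:
        return set()
    matches = [index.get(word, set()) for word in words]
    return set.intersection(*matches)


# Hypothetical documents from the file shares and the intranet.
documents = {
    "//fileshare/contracts/acme-2007.doc": "Contract between us and Acme Widgets Ltd",
    "//fileshare/projects/beta/notes.doc": "Internal notes: Acme Widgets want delivery in May",
    "//intranet/news/picnic.htm": "Company picnic this Friday",
    "//fileshare/projects/gamma/plan.doc": "Project plan for Widgets Unlimited",
}

index = build_index(documents)
acme_docs = find_customer(index, "Acme Widgets")
```

The sketch matches the words of the name anywhere in a document rather than as a phrase, and it will miss documents that abbreviate or misspell the customer’s name. If those gaps turn out to matter in practice, that is the evidence for considering something heavier.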

Rejecting the large-scale information management project in favour of small chunks of continuous ‘just enough’ improvement is not an easy approach to take. The idea of having a centralised, classified and managed store of content, where you can look up information based on any parameter and receive perfect results, continues to be an attractive one with lots of benefits to the business – both value-oriented (i.e. helping people discover information to do their job) and cost-oriented (i.e. managing what people do with information – compliance checks and the like). But a perfectly classified store of content is a utopia. Trying to achieve it can result in creating systems that are harder to use and difficult to maintain when the goal is supposed to be to make them easier.

I mentioned that the common approach to implementing taxonomy has two flaws. The first has been discussed here – how to create just enough taxonomy. The second flaw is the desire to create a single universal taxonomy that can be applied to everything. I’ll tackle that challenge in a separate post (a.k.a. this post is already too long…)

Reference: Just Enough Architecture (MSDN Channel 9). Highly recommended. There are plenty of similarities between software architecture and information architecture (of which taxonomy is a subset). Don’t be put off by the techie speak: it debates the pros and cons of formal processes and informal uses, and includes some great non-technical examples for how to find a balance.

Technorati tags: Taxonomy, Tagging, Information Architecture