Web 3.0 and the Semantic Web

I’ve got mixed feelings about the viability of the semantic web but this video is a great compilation of the challenges facing information discovery and possible options. It’s become way easier to create information than to manage it…

Tagging over time

In a NetworkWorld report, ‘25 leading-edge IT research projects’, there is an interesting feature that could significantly advance the world of information architecture, specifically classification of content. (If you click the link and read the article, it is item 15. There are some other gems in there too.)

Researchers at Penn State University have developed software that not only automatically tags photos when they are uploaded, but also automatically updates the tags based on how people interact with the photos over time. In its current state, the researchers claim the system can grow from 40% to 60% accuracy as it learns from user behaviour.

Whilst the research is focused on images, the technology could have major benefits for traditional information sources – documents and web pages. One of the biggest failings of traditional taxonomies and classification systems has been their inability to cope with change. Language continues to evolve: old words are given new meanings, new words are applied to old meanings. Adaptive tagging could introduce a whole new method of relevance ranking to improve search results.

Filed in library under: Information Architecture

Technorati tags: taxonomy; information architecture; metadata

Zillionics change perspective


Interesting article – Zillionics – by Kevin Kelly. Well worth a read if you are interested in long tails, social networks and where digital information technology is leading us. Here’s a soundbite:

More is different.

When you reach the giga, peta, and exa orders of quantities, strange new powers emerge. You can do things at these scales that would have been impossible before… At the same time, the skills needed to manage zillionics are daunting…

Zillionics is a new realm, and our new home. The scale of so many moving parts require new tools, new mathematics, new mind shifts.

It’s a short and thought-provoking article.

Rethinking the fileplan

Perhaps one of the loudest unspoken messages from the SharePoint conference held recently in Seattle was the need for information architects and managers to work more closely with their user interface (UI) and technology-focused counterparts. Thanks to the Internet, we are unlikely to see a downturn in the market for digital information in the foreseeable future. But the methods used to classify, manage and access information are still dominated by techniques taken from the physical world of information – paper and its storage methods: micro (books) and macro (libraries).

Let’s pick on ‘The Fileplan’

A common scenario I see in organisations, especially government ones, is the use of a fileplan to store and access content. Here’s the definition of a fileplan, courtesy of ‘Developing a Fileplan for Local Government’ (UK) (My comments in brackets):

“The fileplan will be a hierarchical structure of classes starting with a number of broad functional categories. These categories will be sub-divided and perhaps divided again until folders are created at the lowest level. These folders, confusingly called files in paper record management systems (hence the term ‘fileplan’), are the containers in which either paper records or electronic documents are stored.”

And why do we need fileplans?

“An important purpose of the fileplan is to link the documents and records to an appropriate retention schedule.”

Really? Just how many different retention schedules does an organisation need to have? One per lowest-level folder? I doubt that. Let’s create a (very) simple fileplan: Geography – Business Unit – Activity

Take 3 geographies, 3 business units and 3 activities per business unit, and these are the folders you end up with:

  • UK/finance/budget/
  • UK/finance/managementaccounts/
  • UK/finance/projects/
  • UK/IT/operations/
  • UK/IT/procedures/
  • UK/IT/projects/
  • UK/Sales/campaigns/
  • UK/Sales/products/
  • UK/Sales/projects/
  • France/finance/budget/
  • France/finance/managementaccounts/
  • France/finance/projects/
  • France/IT/operations/
  • France/IT/procedures/
  • France/IT/projects/
  • France/Sales/campaigns/
  • France/Sales/products/
  • France/Sales/projects/
  • Germany/finance/budget/
  • Germany/finance/managementaccounts/
  • Germany/finance/projects/
  • Germany/IT/operations/
  • Germany/IT/procedures/
  • Germany/IT/projects/
  • Germany/Sales/campaigns/
  • Germany/Sales/products/
  • Germany/Sales/projects/

So we have 27 different locations to cover 3 geographies with 3 departments and 3 activities. Now scale this up for your organisation. How many different folders do you end up with?
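To put a number on the scaling question, here is a minimal Python sketch that rebuilds the list above and then multiplies it up. The function names and the larger figures (20 geographies, 12 business units, 8 activities each) are my own made-up assumptions, purely for illustration.

```python
geographies = ["UK", "France", "Germany"]

# Activities differ per business unit, exactly as in the list above.
activities_by_unit = {
    "finance": ["budget", "managementaccounts", "projects"],
    "IT": ["operations", "procedures", "projects"],
    "Sales": ["campaigns", "products", "projects"],
}


def build_fileplan(geographies, activities_by_unit):
    """Return every lowest-level folder path in the three-level fileplan."""
    return [
        f"{geo}/{unit}/{activity}/"
        for geo in geographies
        for unit, activities in activities_by_unit.items()
        for activity in activities
    ]


def folder_count(n_geographies, n_units, n_activities_per_unit):
    """Lowest-level folders when every unit has the same number of activities."""
    return n_geographies * n_units * n_activities_per_unit


folders = build_fileplan(geographies, activities_by_unit)
print(len(folders))             # 27
print(folder_count(20, 12, 8))  # 1920
```

The count grows multiplicatively with every level you add, which is why even a modest organisation ends up with thousands of folders.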

The ultimate killer with this scenario? There isn’t any content in the first 2 levels of the hierarchy. You’ve got to navigate through 3 levels before you can even start to find what you are looking for. This is because a librarian approach is used for storing and locating information:

Go upstairs, ‘Technology’ section is on the left, you’ll find ‘Computing’ about halfway along. Third shelf up is ‘Programming Languages’, books organised alphabetically by author…

In the physical world, we can’t do a ‘Beam me up, Scotty!’ and magically arrive at the shelf containing the book containing the page(s) we want. But in the digital world, we can. If fans of the fileplan designed Google’s navigation, it might look something like this:

And they probably wouldn’t include the search box on the first two pages. Fortunately for everyone who uses the Internet to search for information, Google took the ‘Beam me up, Scotty!’ approach.

The fileplan approach causes problems for everyone. Authors have to find ‘the right’ location to put their stuff. If they are working on anything remotely ambiguous, it is unlikely there will be one clear option. That’s why everyone ends up defaulting to the ‘projects’ folder (‘miscellaneous’ is another popular destination).

Search engines that factor URL depth into relevance, alongside link-based measures such as PageRank, struggle to identify relevant content: is the folder ‘Finance’ more important than a document called ‘Finance’ that is two levels deeper in the hierarchy, buried under Projects/Miscellaneous? If someone is searching for documents about France, are documents located in the France folder hierarchy more important than documents containing references to France that have been stored in the UK hierarchy? Authors don’t know where to put their stuff, and searchers can’t find it.

What about those all-important retention schedules? They might be different for different geographies (governments don’t seem to agree or standardise on anything much, globally) but then what? Do all Finance docs have a different retention schedule to all of IT? Within Finance, do different teams have different retention schedules? (Quite possibly: certain financial documents need storing for specific periods of time.) The current solution? Sub-divide and conquer, i.e. create yet another level of abstraction in the fileplan… I have seen solutions where users have to navigate through 6 levels before reaching a folder that contains any content.

So what’s the alternative?

Perhaps a better question would be ‘what’s an alternative?’ The desire to find one optimal solution is what trips up most information system designs. Here are some of my emerging thoughts. If you’ve got an opinion, please contribute in the comments because I certainly don’t have all the answers.

Step 1: Stop thinking physically and start thinking digitally

There are two fundamental problems with the fileplan. First, it originates from the constraints enforced by physical technologies. A paper document must exist somewhere and you don’t want to have to create 100 copies to cover all retrieval possibilities – it’s expensive and time-consuming. Instead, all roads lead to one location… and it’s upstairs, third cabinet on the right, second drawer down, filed by case title. This approach creates the second problem – because content is managed in one place, that one place – the fileplan – must cover all purposes, i.e. storage, updates, retention schedule, findability and access. Physical limits required you to think this way. But those limits are removed when you switch to digital methods. What we need are multiple file plans, each suited to a specific purpose.
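One way to picture ‘multiple file plans’ is as different views generated from the same set of documents, each stored once and described by metadata. The sketch below is only an illustration: the documents, field names and retention periods are invented, and `build_view` is a hypothetical helper, not a feature of any particular product.

```python
from collections import defaultdict

# Hypothetical documents: each exists once and carries its own metadata.
documents = [
    {"title": "Budget 2008", "geography": "UK", "unit": "finance", "retention": "7 years"},
    {"title": "Server procedures", "geography": "UK", "unit": "IT", "retention": "2 years"},
    {"title": "Budget 2008 (France)", "geography": "France", "unit": "finance", "retention": "7 years"},
]


def build_view(documents, *fields):
    """Group document titles by the given metadata fields, in order."""
    view = defaultdict(list)
    for doc in documents:
        key = "/".join(doc[field] for field in fields)
        view[key].append(doc["title"])
    return dict(view)


# One 'fileplan' for browsing, another for retention, from the same content.
by_location = build_view(documents, "geography", "unit")
by_retention = build_view(documents, "retention")

print(by_location)
print(by_retention)
```

Neither view requires the documents to be moved or copied, which is the point: the hierarchy becomes a query rather than a storage location.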

Information specialists can help identify the different purposes and different ‘file plans’ required. Technologists need to help create solutions that make it as easy as possible (i.e. minimal effort required) for authors and searchers to work with information and ‘fileplans’. And user interface specialists need to remind everyone about what happens when you create mandatory metadata fields and put the search box in the wrong place on the page…

Digital storage of content should be logical to the creators, because authors ultimately decide where they save their documents. Trying to force them into a rigid navigation hierarchy designed by somebody else just means everything gets saved in ‘miscellaneous’. Don’t aim for a perfect solution. Instead, provide guidance about where ‘stuff’ should go. Areas for personal ‘stuff’, team ‘stuff’, community sites, collaborative work spaces, ‘best practices’ sites. Ideally, you still want to stick to one location. Not because of any resource constraints but rather to avoid unnecessary duplication that can cause confusion. If an item of content needs to appear ‘somewhere else’ then it should be a link rather than a copy, unless a copy is required to fit a different scenario (e.g. publishing a copy of a case study up onto a public web site, but keeping the original held in a location that can only be edited by authors).

To improve relevance of search results, thesauri and controlled vocabularies can help bridge the language barriers between authors and readers. A new starter might be looking for the ‘employee manual’. What they don’t know is that what they are actually looking for is the ‘corporate handbook’ or ‘human remains guide’ that may contain the words ‘employee’ and ‘manual’ but never together in the same sentence. The majority of search frustrations come from information seekers using a different language to the one used by the authors of the information they seek. Creating relationships between different terms can dramatically improve relevance of search results. Creating tailored results pages (a mix of organic search results and manufactured links) can overcome differences in terminology and improve future search behaviour.
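Here is a minimal sketch of what ‘creating relationships between different terms’ could look like in code, assuming a hand-built thesaurus and a deliberately naive title search. The synonym group, document titles and function names are all hypothetical; a real search engine would do this inside its own query pipeline.

```python
# Hypothetical thesaurus: each group lists terms to be treated as equivalent.
SYNONYM_GROUPS = [
    {"employee manual", "corporate handbook", "human remains guide"},
]


def expand_query(query, synonym_groups=SYNONYM_GROUPS):
    """Return the query plus any related terms found in the thesaurus."""
    normalised = query.strip().lower()
    terms = {normalised}
    for group in synonym_groups:
        if normalised in group:
            terms |= group
    return sorted(terms)


def search(query, titles):
    """Naive search: a title matches if it contains any expanded term."""
    terms = expand_query(query)
    return [t for t in titles if any(term in t.lower() for term in terms)]


titles = ["Corporate Handbook 2008", "Expenses policy"]
print(search("employee manual", titles))  # ['Corporate Handbook 2008']
```

The new starter still types ‘employee manual’; the thesaurus quietly does the translation into the author’s language.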

And the elephant in the file system – retention schedules. First identify what retention schedules you require to comply with industry regulations and to manage legal risk. And do they apply to all content or only certain content? (I doubt many government organisations have kept junk paper mail for 30 years.) And at what point do they need to be applied? From the minute somebody opens a word processor tool and starts typing, or at the point when a document becomes finalised? This is the area that needs most coordination between information specialists and technologists. As we start to move to XML file formats, life could potentially become so much easier for everyone. For example, running scripts to automatically track documents for certain words that give a high probability that the document should be treated as a record and moved from a community discussion forum to the archive. Automatically inserting codes that enable rapid retrieval of content to comply with a legal request but that have no effect on relevance for regular searches.
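As a rough sketch of the ‘running scripts’ idea, the snippet below flags a piece of text as a likely record when it contains enough trigger words. The word list and threshold are invented for illustration (real ones would come from your retention policy), and counting distinct words is a crude stand-in for a proper probability estimate.

```python
import re

# Hypothetical trigger words and threshold; real values are a policy decision.
RECORD_TERMS = {"contract", "invoice", "approved", "signed"}
THRESHOLD = 2


def record_score(text, terms=RECORD_TERMS):
    """Count how many distinct trigger terms appear in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return len(words & terms)


def should_archive(text, threshold=THRESHOLD):
    """True when the text looks enough like a record to move to the archive."""
    return record_score(text) >= threshold


post = "The contract was signed and approved on Friday."
chat = "Anyone fancy lunch?"
print(should_archive(post))  # True
print(should_archive(chat))  # False
```

A script like this could run over a discussion forum on a schedule and queue candidates for review, rather than relying on authors to declare records themselves.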

On the Internet, Google introduced ‘nofollow’, a value that can be added to a link’s rel attribute to prevent the link improving a page’s relevance rank. (PageRank determines relevance largely from the number and quality of incoming links to a page. If you want to link to a page so that people can look at it but you don’t want the page to benefit from the link in search results, you can insert ‘nofollow’). Maybe Enterprise Search solutions need a similar method. Different indicators for metadata that helps describe content for searches versus metadata that organises content for retention schedules versus metadata that helps authors remember where they left their stuff. And again, XML formats ought to make it possible to automatically insert the appropriate values without requiring the author to figure out what’s needed. The ultimate goal would be to automatically insert sufficient information within individual content items so that requirements are met regardless of where the content is stored or moved to. I email an image to someone and its embedded metadata includes its fileplan(s).
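To illustrate the idea of separate indicators for separate purposes, here is a small sketch in which each content item carries three groups of metadata and only the descriptive group feeds the search text. The class, the field names (`describe`, `retain`, `locate`) and the code ‘LC-042’ are all hypothetical, not part of any real enterprise search product.

```python
from dataclasses import dataclass, field


@dataclass
class ContentItem:
    title: str
    body: str
    describe: dict = field(default_factory=dict)  # feeds search relevance
    retain: dict = field(default_factory=dict)    # retention and legal codes
    locate: dict = field(default_factory=dict)    # the author's own filing hints


def search_text(item):
    """Text the search index ranks on: title, body and descriptive metadata only."""
    return " ".join([item.title, item.body, *item.describe.values()]).lower()


def find_by_legal_code(items, code):
    """Exact lookup for a legal request, independent of relevance ranking."""
    return [i.title for i in items if i.retain.get("legal_code") == code]


item = ContentItem(
    title="Q1 case study",
    body="Customer rollout in France",
    describe={"topic": "case study"},
    retain={"legal_code": "LC-042", "schedule": "7 years"},
    locate={"folder": "my drafts"},
)

print("lc-042" in search_text(item))         # False
print(find_by_legal_code([item], "LC-042"))  # ['Q1 case study']
```

Because the metadata travels with the item, the retention and legal lookups keep working wherever the content is stored or moved to.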

There are lots of ways that technology could be used to improve information management and findability, to meet all the different scenarios demanded by different requirements. But achieving them requires closer interaction between three groups: the people making the policies regarding how information is managed, the people creating the so-called ‘technology-agnostic’ (in reality it is ‘technology-vendor-agnostic’) file plans to satisfy those policies, and the technology vendors creating the solutions used to create, store and access content, which have to cope with both the fileplans and the policies.

The information industry has to move on from the library view of there being only one fileplan. Lessons can be learned from the food industry. There was a time when there was only one type of spaghetti sauce. In the TED talk below, Malcolm Gladwell explains how the food industry discovered the benefits from offering many different types of spaghetti sauce (and why you can’t rely on focus groups to tell you what they want – another dilemma when designing information systems):

Direct link to TED talk (in case video doesn’t load here)

There is a great quote within the above talk:

“When we pursue universal principles in food, we aren’t just making an error, we are actually doing ourselves a massive disservice”

You could replace the word ‘food’ with ‘information’. It’s not just the fileplan that needs rethinking…

Don’t count on the data

A follow up to uncontrolled vocabulary.

Many of the more advanced search engines try to improve relevance by extracting meaning out of the content of documents, i.e. without any input required from users. This is commonly known as auto-classification (or auto-categorisation). It then enables you to perform faceted search queries (also known as conceptual search). For example, if you search on ‘politics’, the results might include facets for different political parties (Democrats, Republicans), politicians (Bush, Clinton, McCain, Obama), government activities (budget, new policies) and so on. It’s an effective way of quickly narrowing down your search criteria and hence refining results.
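For anyone who hasn’t seen facets in action, here is a minimal sketch assuming the search results have already been tagged by an auto-classifier. The results, the facet names (`party`, `activity`) and the helper functions are made up for illustration.

```python
from collections import Counter

# Hypothetical results for a search on 'politics', already auto-classified.
results = [
    {"title": "Budget vote delayed", "party": "Democrats", "activity": "budget"},
    {"title": "New policy announced", "party": "Republicans", "activity": "new policies"},
    {"title": "Budget debate", "party": "Republicans", "activity": "budget"},
]


def facet_counts(results, facet):
    """Count how many results fall under each value of a facet."""
    return Counter(r[facet] for r in results if facet in r)


def refine(results, facet, value):
    """Narrow the results to those matching one facet value."""
    return [r for r in results if r.get(facet) == value]


print(facet_counts(results, "party"))
print([r["title"] for r in refine(results, "activity", "budget")])
```

The counts are what you see down the side of a results page, and clicking one is just the `refine` step. All of it depends on the classifier having tagged the content correctly in the first place.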

The challenge is: can you trust the data? How accurate is the content within your documents? For example, let’s look at the number 35.

Why 35?

That’s the average number of people wrongly declared dead every day in the US as a result of data input errors by social security staff. Doh! 🙂 (Source: MSNBC. Published in NewScientist magazine, 8 March 2008)