Google vs Bing for search results

Last weekend, I wanted to find out what time the Wimbledon men’s final began. TV coverage started around 1pm but I was pretty sure that wasn’t the start time. So off I went to Internet search. And whilst I was at it, I decided to compare Google and Bing.

Search Wimbledon on Google

Google search results for Wimbledon

Search for Wimbledon on Bing

Bing search results for Wimbledon

What’s interesting is that each is taking a very different approach to displaying the results.

Google seems to be trying to save you visiting a web site if a quick answer is what you are looking for. You get to see recent match results and the date/time for the next matches to take place. Yey – found what I was looking for.  You also see a variety of different sources – news articles, location map as well as the official web site.

Bing displays no information about the current tournament in its results summaries. And it appears to assume you might not find what you’re looking for at the first attempt, offering a list of related searches in a prominent position over on the right of the results. Bing manages to display more results than Google in a smaller space, but it seems to be at the expense of helping you decide which result is most likely to be useful.

From a personal perspective, I find the ‘Related searches’ list distracting on the Bing results page. It pulls my eyes over to it instead of reading the main results area. Google puts a list of related search links at the end of the first page. This feels more logical – if you haven’t clicked anything on the first page, maybe the results need refining.

It’s a similar story when searching for other facts, such as weather:

Search weather comparison

Yes, the UK weather this summer really is that bad…

I find I still favour Google for searches. Quick facts can usually be found without needing to click further. Whether web sites like that outcome is another matter. But when it comes to applying these lessons to enterprise search designs, saving clicks can be a big productivity boost.

I haven’t found an example yet where Bing delivers demonstrably better search results, despite what Steve Ballmer says. Has anybody else? And of course the missing element to both is the conversation taking place in real-time. Google is starting to push its Google+ social network, if you’re signed in. But no Twitter, no Facebook, no chattering updates. They’re all taking place in the digital walled gardens.

Does Search Matter?

There’s been a host of news this week as the deal between Microsoft and Yahoo was finally inked. The net result: Yahoo drops its own search engine and adopts Microsoft’s Bing, increasing the market share for Bing to 28% against Google’s 65% and leaving 7% for everyone else (US figures).

Other chatter recently about search has been how real-time snippets of information, like Twitter updates, change the dynamics of search results. If you search for information on Google it’s unlikely you will see any Twitter results there. How can 140 characters rival an entire web page for relevance? But could that change?

Travel back in time to around 2003, and Bill Gates made a comment that, at the time, I thought was wrong. From memory it went along the lines of “Search is done, there isn’t much more you can do to improve on what’s out there”. Of course, this was before Google IPO’d and people realised just how much money was being made on the back of those little text ads. All of a sudden, improving search/winning market share became a much more interesting prospect.

But whilst I disagreed at the time, recently I’ve changed my mind. I think the comment was spot on for search as we know it today.

Relevance on Google is far from perfect. If you do a search involving words that have any kind of commercial value, chances are the result you really want is buried somewhere on page 5 or beyond. Hilariously, sometimes the ads help. If you are looking for a particular hotel chain to book a room and they’ve paid enough for the ad space, they’ve got a better chance of appearing on your first page of results than with just the web site alone. SEO companies continuously learn how to exploit Google’s algorithms to promote their customers’ web sites in your results, regardless of whether their customer is the result you want to find. But I still doubt there’s a better way than PageRank and friends to determine the relevance of general web content.

Social media offers the opportunity of a new form of algorithm – instead of PageRank, which evaluates a page based on the incoming links to it (a link from a high-profile site carries more value than a link from an unknown site), how about introducing SocialRank? Evaluate the links shared by people that are then shared by other people. Sites like Facebook, Twitter, FriendFeed, Technorati, MySpace, etc. contain vast social networks. I share a link, people like it and/or share it with their connections, and so the link spreads.

Can we apply a rank to a page based on how the links spread? We probably can. But it won’t work.
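
To make the idea concrete, here’s a toy sketch in Python. It is entirely made up – the decay factor, the follower counts, everything – and has nothing to do with how any real engine ranks content, but it shows the shape of a share-based score:

    # Toy illustration of the 'SocialRank' idea: score an item by how far it
    # spreads through re-shares, discounting each hop away from the original.
    # Hypothetical numbers throughout - not any real algorithm.

    def social_rank(shares, decay=0.5):
        """shares: list of (depth, follower_count) tuples.
        depth 0 = the original share, 1 = a re-share of it, and so on."""
        score = 0.0
        for depth, followers in shares:
            # a re-share by a well-followed account counts for more,
            # but each hop away from the original counts for less
            score += followers * (decay ** depth)
        return score

    # the same link shared by a high-profile account vs an unknown one
    print(social_rank([(0, 100_000), (1, 500), (1, 200)]))   # spreads widely
    print(social_rank([(0, 50)]))                            # barely noticed

And that last line is the problem described next – the score mostly measures who shared the link, not whether it answers anyone’s question.
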

Take FriendFeed for example. A high profile account like Robert Scoble’s means that everything he shares has a high probability of being noticed and re-shared. Does that mean anything? The same item shared by someone else might be ignored. A computer wouldn’t show such bias. If you want an idea of how bad search results would be if based on what people share, take a look at Twitter’s trending topics at any point in time – apparently that’s what the majority of people are twittering about. Would any of it help find what you’re looking for?

Search is useful and still a key part of the Internet. But increasingly we use other means to find information. On social sites such as Twitter and Facebook, we follow people we either trust or are interested in and discover information without ever looking for it. If I want a book, I often use Amazon and check the reviews. Random stuff to buy? Try eBay first. Need information about a topic? Off I go to Wikipedia. Looking for something I’ve read before and liked? I use the custom Google search on my web site – which includes everything on the web site and my FriendFeed account. FriendFeed aggregates everything I share through Google Reader, Delicious, Twitter, blog, web site, Slideshare… and any other service I upload stuff to.

Google is relegated to answering one-off questions (I couldn’t remember the Mac commands for a screen capture just now – easy result on Google) and desperate searches (old news, travel information) that struggle to be found anywhere.

How can Internet search be improved?

The easiest method to improve relevance across a broad range of content is to separate the results into buckets. Within enterprise search solutions, we call this federated results. Present Twitter and other real-time snippets in a separate list to standard web page results but on the same page. Use SocialRank for time-specific results such as news and travel, based on how quickly a link spreads through social networks. Nuances could be included to make it difficult to game by one individual or organisation.
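
For illustration only, here’s a minimal sketch of the bucket idea. The source names and fetch functions are placeholders (not real APIs) – the point is simply that each source keeps its own list rather than being mushed into one:

    # Minimal sketch of federated results: run the same query against several
    # sources and keep each source's results in its own bucket on the page.
    # The fetch functions below are placeholders, not real APIs.

    def fetch_web(query):     return [f"web result for '{query}'"]
    def fetch_twitter(query): return [f"tweet mentioning '{query}'"]
    def fetch_news(query):    return [f"news story about '{query}'"]

    SOURCES = {"Web": fetch_web, "Twitter": fetch_twitter, "News": fetch_news}

    def federated_search(query):
        # one bucket per source - each source keeps its own ranking
        return {name: fetch(query) for name, fetch in SOURCES.items()}

    for source, results in federated_search("Wimbledon").items():
        print(source, results)
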

To give a simple example, the image below shows federated search results from a site I host on the Internet for clients. (No prizes for guessing the software being used.) I use it to show them what Google doesn’t tell them. In this example, I just entered the name of a company – Lloyds TSB. Results have come back from Twitter, Bing, Technorati and FriendFeed. (Flickr, YouTube and others are also included but snipped from the screenshot)

The Twitter comments aren’t pleasant. But better to know what your customers are saying and deal with it than not know until they’ve all left. I did the same test for a client recently and they found out that somebody had posted a Tweet asking if anyone from said client was on Twitter. Nobody was, they are now.

In short, search does matter but no more so than yesteryear whilst the format stays the same – a single page of mushed up results served with a side and topping of ads.

Whether you use Google or Bing will mostly come down to preference. (And some of that preference is more political than technical.) The relevance is ranked slightly differently – I find Bing seems to prefer domain name matches. But the differences are incremental and barely noticeable. As demonstrated in the two images below.


Same search as before – Lloyds TSB. Both display one non-Lloyds TSB domain result and both put it in 4th place. Interesting how they both display the top result the same but with different sub links to pick from. Decide for yourself which format is the best. Looking at them side-by-side here, I actually prefer the user interface for Bing. But people don’t switch browsers or search engines to discover what’s different. They change when they hear there is an alternative that is much better and easy to use. Microsoft’s challenge is that many people don’t even know there is a difference between a browser and a search engine, and some think Google is both (and they haven’t heard of Chrome):

Final note: this is one of the most rambling posts I’ve written in a while. If you got far enough through it to be reading this, thank you! 🙂 This was one of the posts that’s been floating around my head for months and it just needed to spew out but is far from perfect… I’ll endeavour to make the next one more to the point.

Microsoft vs Google in the Search Wars

Stop the clocks, blogging has recommenced 🙂

Couple of cheat posts coming up, starting with this one, which are really reproductions of comments I’ve left on other posts but with added juice.

Henry Blodget posted the following article on Silicon Valley Insider: It’s time for Microsoft to face the reality about Search and the Internet

It’s a great article and worth a read. Here’s the comment I left there:

I think people make a good point in highlighting that competition in search is a good thing for us as consumers. Just not sure it’s a good thing for Microsoft

Comparing with the likes of SQL and Exchange is comparing apples with pairs*, I always tell people to never underestimate just how hard MS will work to develop the winning product. But past successes have always been about bringing in a lower cost product with good enough features to compete against an expensive market leader (the business intelligence and systems management markets being two of the latest focus areas gaining ground in the enterprise software market).

Competing against ‘free’, a product used by one audience and paid for by another, is a completely different challenge and one that MS has yet to succeed in. Time will tell. But I doubt it will come from competing like for like. Google didn’t knock Alta Vista off the top by copying their business model. To take over a market means to do something different that weakens the incumbent players. Adding to the challenge is that ‘free’ or ‘freemium’ models have yet themselves to stand the test of time. Somebody somewhere always has to pay, one way or another. Making money from sales of a product or service still have far more long term potential than making money from people paying for the attention you’ve managed to capture.

And that all said, I still wouldn’t underestimate MS, Google isn’t the only one who can create ‘waves’** under cover

I suspect the ol’ Google vs Microsoft debate will rumble on for a few years yet. Steve Ballmer and quite a few influential people within Microsoft would like a big slice of the advertising market, which is a fair bit bigger than the software market. But I’m still not confident that’s the right goal.

The argument goes that people are becoming used to not paying for online services. Yet Flickr has done quite well getting people on to premium accounts. Virtual worlds and multi-player online games like World of Warcraft also seem to find plenty of paying customers (I’m one of them, and a girl too – take note, Xbox team. I’m in that market that Nintendo noticed whilst considered a has-been at no.3 in the console market a few years ago). I’m also in that apparent small minority who pays for their music online. Amazon has done quite well just selling stuff that you go looking for rather than have thrown at you in a glitzy banner, eBay isn’t doing so bad either. None of them dependent on advertising revenue. The last two examples have both made money from closing the distance from customer to seller without advertising interrupting the process.

And who wants to be an advertising company anyway? Google succeeded ‘cos they managed to make the ads as unobtrusive as possible and came up with a great revenue concept (auction the search words) to get companies competing for what little ad space there is. Most people seem to dislike ads unless it is for the exact company they are looking for (I do a search for Hilton Hotels, I want the damn official web site, not a million travel web sites) or something completely original (that ends up in the ‘top 100 ads’ TV chart). Even worse are the fake web sites that get into search results only to present you with a page of even more ads than the search engine dared to show you.

Advertising is not a loved market. Microsoft is not a loved company. I don’t know if that’s the synergy they see but it’s not a great start. Without doing the research, I’m guessing the margins in advertising are not as healthy as software. And Google may be raking the cash in from ads but is pouring a fair chunk back out again, not least running the hardware to support YouTube.

What is right is the desire to create a new market. Monopolies (natural or otherwise) and market domination rarely last for very long… unless funded by government, but let’s not go there today. Microsoft needs to keep testing new waters, and it’s best to do that whilst there’s oodles of cash in the bank rather than starting when it’s running out. And call me biased because I used to work there, but I still hope all this Google chatter is smoke and mirrors whilst they work on something worth paying for.

* Oh dear, didn’t notice I’d used pair instead of pear when posting the original comment, and you can’t go back and edit them. Oops.

** Google cleverly making waves (Techcrunch)

Delicious tags: Microsoft Google Business

Microsoft Search Workshop Part 2a

Originally, this series was supposed to be 4 parts. But I started to realise that part 2 in one shot would be a very looooooong blog post. So here is part 2a. (Click here for part 1):

Key messages from the presentation:

Slide 3. Whilst most documents about indexing in SharePoint include a complicated diagram to explain the indexing process, here’s the simple version of how it works (a toy code sketch follows the list):

  • Connectors (aka Protocol Handlers) connect to a store and suck out its content. This should include security permissions and metadata managed by the store. Hence the connector needs to adhere to an agreed protocol to keep the store happy. Different stores manage files, security and metadata differently and require different connectors.
  • Once the indexing server has its hands on the content, filters are used to strip out the unnecessary gumpf (technical term) from within each item. If you open a MS Word document in Notepad, you’ll see a ton of square boxes before you get to any text. That’s because Notepad doesn’t understand MS Word formatting. Filters don’t care about formatting and chuck it all away to get down to the raw text and any metadata stored within the document. Different file formats need different filters.
  • So, for each item retrieved, the indexing server has a pile of raw text, metadata (some found within the document, some held by the store alongside the document), a link to the original item and security permissions (who can access it). All the metadata becomes ‘crawled properties’ and are dumped into a property store. All the individual words – the raw text – are dumped into the content index.
  • There is one additional element included – a static rank. We’ll cover that in part 3.
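
For anyone who prefers code to diagrams, here’s a much-simplified toy of the above in Python. It is not SharePoint’s API – just the connector → filter → property store/content index flow in miniature, with made-up item data:

    # Toy indexing pipeline: a 'connector' hands over an item, a 'filter'
    # strips the formatting, and the results land in a property store and a
    # content index (an inverted index of words to items). Not SharePoint's API.
    import re

    property_store = {}   # item id -> crawled properties (incl. permissions)
    content_index = {}    # word -> set of item ids

    def index_item(item_id, raw_content, store_metadata, acl):
        text = re.sub(r"<[^>]+>", " ", raw_content)      # 'filter': throw the formatting away
        property_store[item_id] = dict(store_metadata, acl=acl)
        for word in re.findall(r"\w+", text.lower()):     # raw words into the content index
            content_index.setdefault(word, set()).add(item_id)

    index_item("doc1", "<b>Enterprise search</b> with SharePoint",
               {"author": "Sharon", "classification": "finance"}, acl=["staff"])
    print(content_index["search"], property_store["doc1"])
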

Slide 4. Now we have an index, people can start to query it and receive search results. When you type a query in a basic search box – let’s use ‘SharePoint and enterprise search’ as the example – here is what happens (a toy code sketch follows the list):

  • The search query will be word-broken and noise words removed – language match determines which dictionary and noise list are used (our example now becomes ‘SharePoint’, ‘Enterprise’, ‘Search’ – say bye bye to ‘and’, it’s a noise word)
  • If you have stemming enabled (it is off by default), then the search will probably also include ‘searches’, ‘searching’ and ‘enterprises’. If the thesaurus has been configured, it may include additional acronyms for SharePoint, such as ‘SPS’ (not sure about ‘MOSS’ – technically, that’s a dictionary word). See related post – SharePoint and Stemming – for more information about stemming, noise words and the thesaurus
  • We now have a list of words that form the search query. A list of results is returned from the index matching any and all of the words in the query (when performing a basic search – advanced search enables you to return only docs that match all words in the query, among other options)
  • The results are security-trimmed – anything that the user doesn’t have permission to see is removed. They can also be scope-trimmed, if a scope has been selected (e.g. only return docs that are less than one year old)
  • The remaining set of results are relevance ranked – a dynamic rank (calculated based on the search query terms) is added to the static rank held in the index – and returned as an ordered list
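
And here’s the query side as an equally simplified toy – word-breaking, noise word removal, thesaurus expansion, matching, security trimming and a crude static + dynamic rank. Illustrative only; the index, ACLs and rank numbers are invented, and this is not how SharePoint actually calculates relevance:

    # Toy query path: word-break, drop noise words, expand via the thesaurus,
    # match the index, security-trim, then order by static + dynamic rank.

    NOISE = {"and", "the", "of"}
    THESAURUS = {"sharepoint": {"sps"}}
    INDEX = {   # word -> {doc: term frequency}
        "sharepoint": {"docA": 3}, "sps": {"docB": 2},
        "enterprise": {"docA": 1, "docC": 2}, "search": {"docA": 2, "docC": 1},
    }
    ACL = {"docA": {"staff"}, "docB": {"staff"}, "docC": {"managers"}}
    STATIC_RANK = {"docA": 0.5, "docB": 0.2, "docC": 0.4}

    def search(query, user_groups):
        words = [w for w in query.lower().split() if w not in NOISE]   # word-break + noise removal
        terms = set(words) | {s for w in words for s in THESAURUS.get(w, set())}
        scores = {}
        for term in terms:
            for doc, freq in INDEX.get(term, {}).items():
                scores[doc] = scores.get(doc, 0) + freq                # crude dynamic rank
        ranked = {d: s + STATIC_RANK[d] for d, s in scores.items()
                  if ACL[d] & user_groups}                             # security trimming
        return sorted(ranked, key=ranked.get, reverse=True)

    print(search("SharePoint and enterprise search", {"staff"}))   # ['docA', 'docB']
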

Slide 5. One of the most popular areas to customise in SharePoint is property management. An index can contain lots and lots of crawled properties. You can leverage those properties in search queries. For example, looking for all documents classified as ‘finance’ – see related post: Classifying content in SharePoint. To do this, you create managed properties – a managed property can be mapped to one or more crawled properties. The example in the slide: you might want people to be able to look up content classified by ‘customer name’, but ‘customer name’ may be used across multiple different content stores under different titles. Managed properties can be added to the Advanced Search page and used in scopes. We’ll revisit this in part 2b (or 2c, depending on how long 2b ends up)
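
A tiny sketch of that mapping idea. The property names below are invented – the point is that one managed property gathers up however many crawled properties the different stores happen to use:

    # One managed property ('CustomerName') mapped to several crawled
    # properties, so a single query term covers every store. Names invented.

    MANAGED_PROPERTIES = {
        "CustomerName": ["ows_Customer", "crm_account_name", "Client"],
    }

    documents = [
        {"ows_Customer": "Contoso", "title": "Proposal"},
        {"crm_account_name": "Contoso", "title": "Support case"},
        {"Client": "Fabrikam", "title": "Invoice"},
    ]

    def query_property(managed, value):
        crawled = MANAGED_PROPERTIES[managed]
        return [d["title"] for d in documents
                if any(d.get(name) == value for name in crawled)]

    print(query_property("CustomerName", "Contoso"))   # ['Proposal', 'Support case']
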

Slide 6: In a typical SharePoint deployment, you will have a single central indexing and search server. This server will index all your different content sources. If you’ve got the licences, you will probably separate search from indexing, using query servers. This means the indexing server can focus on indexing and propagate index changes up to the query servers. If the indexing server decides to take a break (literally), users can still search for content because copies of the index reside on each query server. It simply means there will be no updates to the index until the indexing server returns from vacation. The new feature introduced this year is federation. An indexing server can include federated connectors to other indexes. Great for accessing content not indexed natively by SharePoint and also great for spreading the indexing load. When a user submits a search query, results are returned from the central index and any other indexes with a federated connector. If you want to see this in action, try performing a search at http://www.infomash.co.uk/. You will see results returned from Flickr, Technorati, Twitter (Summize) and FriendFeed. If you want to have some fun, try http://www.infomash.co.uk/googmsft.aspx – you’ll see results returned from Google and Live side-by-side. Great for comparing how they determine relevance. (Note: it’s a prototype server, no guarantees regarding availability or performance)

Slide 7: Some capacity planning tips:

  • Already mentioned – the first scale issue you will hit is indexing server performance. Indexing will fight with search queries to win RAM and CPU attention. Put them in separate playgrounds (or, if you don’t have timezone problems, schedule indexing to only take place out of hours – but that means your index will always be a day old).
  • The most popular indexing question – how big will the index be? The numbers can be quite frightening – up to 50% of the size of the content you are indexing. The average is usually nearer to 20%, but it all depends on your range of vocabulary. Lots of colourful language and each different word has to go into that index. Lots of metadata and it all gets stored…
  • The required disk space for the index is important – you need to allow 2.5 x the index size. This is because you can, for a temporary period of time, have 2 copies of the index stored (to do with how changes are managed and propagated). But it is 2.5 x the index size, not 2.5 x the size of the corpus being indexed (I’ve seen the latter stated at MS conferences and in some documents on TechNet – wrong). A worked sizing example follows this list.
  • Federated connectors give you lots more flexibility in your architecture. Whilst you will mostly hear about how they let you include other indexes in your search results, there is a hidden benefit. You can split up your SharePoint indexes and then federate all the results on a single page. It won’t be one results list – each federated connector displays its results in a separate web part. Required, because each results set will have a different rank calculation. Great potential for branch office scenarios to save bandwidth and keep indexes fresher, and for dropping an indexing server onto niche applications and content stores (if they are running Windows Server 2003 or later, you can use Search Server Express and not have to pay for any extra licences)
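
The worked sizing example promised above. The 500GB corpus is just an example figure, not a recommendation – plug in your own numbers:

    # Back-of-envelope sizing from the rules of thumb above: index size as a
    # fraction of the corpus, and 2.5x that index size to allow for the
    # temporary second copy during propagation.

    corpus_gb = 500                      # size of the content being indexed (example)
    for fraction in (0.20, 0.50):        # typical vs worst-case index ratio
        index_gb = corpus_gb * fraction
        disk_gb = index_gb * 2.5
        print(f"index ~{index_gb:.0f} GB -> plan for ~{disk_gb:.0f} GB of disk")
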

Note: Federated connectors are currently only available in Search Server 2008. They are due to be released for SharePoint Server 2007, hopefully quite soon.

To download a copy of the presentation (3Mb) – MS-Search-Pt2a.pdf

Filed in Library: Microsoft SharePoint Microsoft Search

Technorati tags: SharePoint | MOSS 2007 | Search Server | Enterprise Search

Microsoft Search Workshop Part 1

Earlier this year, Joining Dots ran a series of Enterprise Search workshops for Microsoft UK. Its purpose was to help organisations explore what enterprise search means and what Microsoft technologies can do to help implement an effective search solution.

The workshop consisted of four sessions, containing a mix of presentations, hands-on demonstrations and plenty of discussion. Here is part 1:

Part 1 was all about setting the scene. First, exploring ‘what is enterprise search?’ Second, an introduction to the current products in Microsoft’s search portfolio. Note: at the time, the FAST acquisition had not completed.

Key messages from the presentation:

  • The most common question asked is ‘Why can’t our search be just like searching on Google?’ To begin answering that question, we need to define enterprise search – fundamentally different to Internet search. One of the challenges within many organisations has been that there is no dedicated focus on improving search. Instead it is often a feature of a larger project, such as an intranet replacement or new document management system. Before Google came along, that’s how the major Internet portals treated search…
  • Enterprise search technologies usually fit in one of three layers:
    • The simplest solutions help find what you know exists. Products are either free or low-cost and focus on ‘unstructured’ content, i.e. documents, email and web pages. Desktop search is available from the likes of Microsoft, Google and Yahoo. Network search tools include Microsoft Search Server, Google Mini Appliance and IBM/Yahoo OmniFind
    • The mid-tier typically provides a base platform for enterprise search, relatively inexpensive and focused on common requirements. Solutions should include security trimming (filtering results based on who you are and what you have permission to see) and indexing multiple sources of content. Some solutions start to move beyond unstructured content to also include people search (directories and social networks) and structured data (integrating business applications). This is the hunting ground for SharePoint Server 2007 and the Google Search Appliance.
    • The top-tier provides advanced indexing and search capabilities, such as automatic classification of content, concept-driven search interfaces and integration with business intelligence tools. Leaders in this space include Autonomy, Endeca and FAST.
  • Whilst advanced search is often the goal, many organisations would benefit from first identifying what content needs to be found. Is it just about documents? How accessible are those documents? And should enterprise search also include business applications and the ability to find people? We often prefer to seek answers from each other in the workplace… These are all questions that need to be answered if you want to implement an effective enterprise search solution.
  • Microsoft products and services span three areas of search: the web (Live.com), the desktop (Windows Desktop Search) and the intranet/company web sites (SharePoint Server 2007 and Search Server 2008)
    • Intranet search includes the ability to find documents, business data and people. Federated connectors enable results to be returned not only from multiple different content sources but also from multiple different indexes. The table in the presentation shows what features are available per product.
    • Desktop search enables individuals to query their own content, such as private email and locally-stored documents – content that is often difficult to access by intranet search tools.
    • Web search trends are worth following to see what’s likely to be coming down the line for enterprise search. On Live.com, concept-driven results enable you to refine your search query. If you do a search for videos, hovering over the video will start it playing inside your web browser…

To download a copy of the presentation (3.3Mb): MS-Search-Pt1.pdf

Filed in Library: Microsoft SharePoint | Microsoft Search

Technorati tags: SharePoint | MOSS 2007 | Search Server | Enterprise Search

Search Ad value

A light-hearted post to start what seems to be the true (in terms of weather) first day of Spring here in Warwickshire, UK.

First, the context. Improving information findability (aka search) has been one of my key interests for the past 8 years. I still remember my induction day at Microsoft back in March 2000. During introductions, we all had to say what our browser home page was set to. The swots answered msn.com, the graduates – yahoo.com. Then I stood up and said ‘google.com’. Tumbleweed blowing in the wind. Few had even heard of it. Imagine doing the same today? Quickest way to secure a P45… 12 months later, whilst others were nursing hangovers from the party the night before, I multi-tasked and sat enthralled as Jonathan Kauffman explained the algorithms being introduced with SharePoint Portal Server 2001. The point of all this…

I was working through year-end paperwork with my finance manager, a.k.a. my mom, at the weekend, and flourished a Google voucher before her – £30 to spend on AdWords (received in a magazine).

“Pff, that won’t buy you much.”

Whaaaaatttttt? My blinkin’ mother is a Google expert. (She does the books for my brother’s business and it appears he is spending a fair bit more than £30 on Google ads.) Then she asked me if I’d got my keywords set up correctly! 🙂 Back on my turf at least, and able to explain why keywords don’t really work anymore.

I still don’t know what to do with the Google voucher. Business mostly comes via word of mouth and reputation with my former employer…

Don’t count on the data

A follow up to uncontrolled vocabulary.

Many of the more advanced search engines try to improve relevance by extracting meaning out of the content of documents, i.e. without any input required from users. This is commonly known as auto-classification (or auto-categorisation). It then enables you to perform faceted search queries (also known as conceptual search). For example, if you search on ‘politics’, the results might include facets for different political parties (Democrats, Republicans), politicians (Bush, Clinton, McCain, Obama), government activities (budget, new policies) and so on. It’s an effective way of quickly narrowing down your search criteria and hence refining results.
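
A minimal sketch of that grouping step. The classifications attached to each document here are invented sample data – in real life they would come from the auto-classifier, which is exactly where the trouble starts (see below):

    # Group search hits by a facet that an auto-classifier has (rightly or
    # wrongly) attached to each document. The sample data is invented.
    from collections import defaultdict

    results = [
        {"title": "Budget vote delayed",  "facets": {"topic": "budget"}},
        {"title": "Obama speech",         "facets": {"topic": "politicians"}},
        {"title": "McCain interview",     "facets": {"topic": "politicians"}},
        {"title": "New policy announced", "facets": {"topic": "new policies"}},
    ]

    def facet_counts(hits, facet):
        counts = defaultdict(int)
        for hit in hits:
            counts[hit["facets"].get(facet, "unclassified")] += 1
        return dict(counts)

    print(facet_counts(results, "topic"))
    # {'budget': 1, 'politicians': 2, 'new policies': 1}
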

The challenge is: can you trust the data? How accurate is the content within your documents? For example, let’s look at the number 35.

Why 35?

That’s the average number of people wrongly declared dead every day in the US as a result of data input errors by social security staff. Doh! 🙂 (Source: MSNBC. Published in NewScientist magazine, 8 March 2008)

Uncontrolled vocabulary

In case you are wondering, this is not going to be an expletive-filled post or a discussion about what happens if you suffer damage to the frontal lobe of your brain.

One of the big challenges for enterprise search solutions is deciphering meaning from words. There are five aspects to the challenge. (Possibly more, given I originally started this post with two!)

1. The same word can have multiple meanings

We only have so many words, and we have a habit of using the same words to mean very different things. For example:

  • Beetle = insect or motor vehicle?
  • Make-up = be nice to someone or apply paint to your face
  • Sandwich = literal (food) or lateral (something squashed between something else)
  • Resume = to continue or required for a job application?
  • SME = subject matter expert or small to medium enterprise?

Double meanings can be partially solved by studying the words in context of words around them. If resume means to continue, it is usually preceded by the word ‘to’. To determine if the beetle in question has legs or wheels, the phrase will hopefully include a little more information – not too many people (intentionally) put fuel in the insect version or find the motor version in their shower.

A popular method used by search engines to decipher meaning from documents (and search queries) is Bayesian Inference. For the SharePoint fans in the audience, the answer is no. SharePoint does not use Bayesian Inference in any shape or form at the moment. It used to, back in the first version (SPS 2001) but currently doesn’t. Yes, other search engines do.
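
For the curious, here’s a toy Bayesian-style disambiguator in a few lines of Python. The ‘training’ counts are invented and tiny – a real engine would learn them from a huge corpus – but it shows how the surrounding words tip the balance between the two meanings of ‘beetle’:

    # Guess which sense of 'beetle' a sentence means from the other words in
    # it, naive-Bayes style. The counts below are invented toy data.
    import math

    SENSES = {
        "insect":  {"legs": 5, "wings": 4, "garden": 3, "drive": 0, "fuel": 0},
        "vehicle": {"legs": 0, "wings": 0, "garden": 1, "drive": 6, "fuel": 4},
    }

    def guess_sense(sentence):
        words = sentence.lower().split()
        best, best_score = None, float("-inf")
        for sense, counts in SENSES.items():
            total = sum(counts.values())
            score = 0.0
            for w in words:
                # add-one smoothing so unseen words don't zero the whole score
                score += math.log((counts.get(w, 0) + 1) / (total + len(counts)))
            if score > best_score:
                best, best_score = sense, score
        return best

    print(guess_sense("the beetle spread its wings in the garden"))   # insect
    print(guess_sense("fuel up the beetle before the long drive"))    # vehicle
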

Abbreviations can be trickier because they are often assumed/applied in context of a human perspective. For example, is a search for ‘Health care SME’ looking for an internal expert on health care to ask them a question or looking for small businesses in the health care industry to include in a targeted sales campaign?

2. The same meaning is applied to different words

This one is easiest to explain with an example. A typical transatlantic flight offers you the choice of up to 4 ‘classes’ – Economy, Economy Plus, Business, First. But it seems no two airlines use these precise words any more. For example:

Controlled Vocabulary | Actual: Virgin Atlantic | Actual: British Airways
Economy               | Economy                 | World Traveller
Economy Plus          | Premium Economy         | World Traveller Plus
Business              | Upper Class             | Club World
First                 | n/a                     | Club First

A common method to identify different words that share the same meaning is a thesaurus. In enterprise search solutions, a thesaurus is typically used to replace or expand a query phrase. For example, the user searches for ‘flying in economy plus’, the search engine checks the thesaurus and also searches for ‘premium economy’ and ‘world traveller plus’.
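
A small sketch of that expansion step, using the airline example above. The synonym sets are typed in by hand, which is exactly the maintenance burden a real thesaurus brings with it:

    # Thesaurus-driven query expansion: if the query contains a known phrase,
    # add its equivalents before hitting the index. Synonyms are hand-typed.

    THESAURUS = {
        "economy plus": {"premium economy", "world traveller plus"},
        "business": {"upper class", "club world"},
    }

    def expand(query):
        phrases = {query.lower()}
        for phrase, synonyms in THESAURUS.items():
            if phrase in query.lower():
                phrases |= synonyms
        return phrases

    print(expand("flying in Economy Plus"))
    # adds 'premium economy' and 'world traveller plus' alongside the original query
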

3. Different words have the same meaning in theory but are different in practice

Back to our airlines. Having travelled in the Economy Plus cabin of both Virgin Atlantic and British Airways recently, I can tell you from first-hand experience that the two are not the same. On a Virgin Atlantic flight, Economy Plus is half-way between Economy and Business (Upper Class in their terminology). On a British Airways flight, Economy Plus is one step away from Economy and a world away from Business (Club World in their terminology). To add to the challenge, the experience changes over time. It’s been a few years since I last travelled Economy Plus on British Airways and I would say that it is not as good as it used to be. But then I’ve had the luxury of travelling Business Class with them in between. The Virgin Atlantic flight was on a plane that had just had its Economy Plus cabin refurbished and was all shiny and new. And I have never travelled Business Class on Virgin Atlantic.

This is a tricky issue for a search engine to solve. It crosses into the world of knowledge and experience, a pair of moving and ambiguous goal posts at the best of times.

4. Different words are spelt near-identically

A cute feature available in most search engines is the ‘Did you mean’ function. If you spell something slightly incorrectly, for example: ‘movy’, the search engine will prompt you with ‘did you mean movie?’
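
Under the covers, most ‘did you mean’ features boil down to comparing the query term against a dictionary by edit distance (the real thing also weighs word frequency and query logs). A toy version, with a deliberately tiny dictionary:

    # Suggest the dictionary word closest to the query term by Levenshtein
    # (edit) distance. Tiny invented dictionary; real engines use far more signals.

    def edit_distance(a, b):
        # classic dynamic-programming edit distance
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    DICTIONARY = ["movie", "monkey", "maybe", "search"]

    def did_you_mean(term):
        return min(DICTIONARY, key=lambda w: edit_distance(term.lower(), w))

    print(did_you_mean("movy"))   # movie
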

The trickier challenge for search engines is when a user types in the wrong word (often accidentally) but the word typed exists in its own right (and hence doesn’t get spotted or picked up by spell-checkers):

  • Stationary (not moving) vs Stationery (materials for writing)
  • There (somewhere) vs Their (people)
  • Here (somewhere) vs Hear (what your ears do)
  • Popular (people like you) vs Poplar (a type of tree)

As in the first challenge, advanced algorithms can be used to try and decipher meaning by also looking at words used in association. ‘Can someone please move that stationery vehicle?’ probably isn’t referring to a cardboard cut-out. But this scenario is much harder to sport. (Spot the deliberate typo.) For example: ‘information about popular trees’ is a different query to ‘information about poplar trees’ but both look just fine to the search engine. Worse still, the content being indexed may itself contain a lot of typos. An author might have been using ‘stationery’ when they should have been using ‘stationary’ for years. (I’m supposed to be good at spelling and I had to check the dictionary to be sure.) In this scenario, Bayesian Inference can start to fail because there is now a range of words in documents containing ‘stationery’ that have nothing to do with paper or writing materials but occur often enough across a set of documents to suggest a relationship. Search queries correctly using ‘stationary’ won’t find highly relevant results containing mis-spelt ‘stationery’ unless the thesaurus is configured to spot the errors.

5. Different words are used by different people to mean the same thing

This is the ultimate gotcha for enterprise search engines, and the reason why most taxonomies fail to deliver on their promises. Search engines that rely on word-breaking to match search queries with index results can fall down completely when the search query uses words that don’t actually exist in the document that the searcher is looking for. For example, a search for ‘employee manual’ could be seeking the ‘human resources handbook’ that doesn’t actually contain the words employee and manual together in the same sentence, let alone in any metadata. Because there is no relation, the relevance to the query will be deemed low.

The thesaurus and friends (such as Search Keywords/Best Bets) can again come to the rescue – for example, tying searches for ’employee’ with ‘human remains’. The challenge is who decides what goes in the thesaurus? Referring to a previous blog post – When taxonomy fails – the Metropolitan Museum of Art in New York ran a test of their taxonomy using a non-specialist audience. More than 80% of the terms used by the test audience were not even listed in the museum’s documentation. The experts who decided on the taxonomy were talking a completely different language to the target audience. When the experts are allowed to ‘tweak’ the search engine, don’t assume that search results will improve…

Summing Up (i.e. closing comments, not a mathematical outcome)

I’ve over-simplified how search engines work and apply controlled vocabularies through tools such as thesauri. But hopefully this post helps show why so many people become frustrated with the results returned from enterprise search solutions, even the advanced ones. Quite often, the words simply don’t match.

[Update: 20 March] Using the ‘Stationary vs Stationery’ example in the most recent Enterprise Search workshop, an attendee came up with a great little snippet of advice to remember the difference. Remember that envelopes begin with the letter ‘e‘ and they are stationery. And hilariously, I can now look at the Google ads served up alongside this post and spot the spelling mistakes 🙂

Technorati tags: Enterprise Search; Taxonomy; Information Architecture; Search Server