Search Ad value

A light-hearted post to start what seems to be the true (in terms of weather) first day of Spring here in Warwickshire, UK.

First, the context. Improving information findability (aka search) has been one of my key interests for the past 8 years. I still remember my induction day at Microsoft back in March 2000. During introductions, we all had to say what our browser home page was set to. The swots answered msn.com, the graduates – yahoo.com. Then I stood up and said ‘google.com’. Tumbleweed blowing in the wind. Few had even heard of it. Imagine doing the same today? Quickest way to secure a P45… 12 months later, whilst others were nursing hangovers from the party the night before, I multi-tasked and sat enthralled as Jonathan Kauffman explained the algorithms being introduced with SharePoint Portal Server 2001. The point of all this…

I was working through year-end paperwork with my finance manager, a.k.a. my mom, at the weekend when I flourished a Google voucher before her: £30 to spend on AdWords (received in a magazine).

“Pff, that won’t buy you much”

Whaaaaatttttt? My blinkin’ mother is a Google expert. (She does the books for my brother’s business and it appears he is spending a fair bit more than £30 on Google ads.) Then she asked me if I’d got my keywords set up correctly! 🙂 Back on my turf at least, and able to explain why keywords don’t really work anymore.

I still don’t know what to do with the Google voucher. Business mostly comes via word of mouth and reputation with my former employer…

Don’t count on the data

A follow up to uncontrolled vocabulary.

Many of the more advanced search engines try to improve relevance by extracting meaning out of the content of documents, i.e. without any input required from users. This is commonly known as auto-classification (or auto-categorisation). It then enables you to perform faceted search queries (also known as conceptual search). For example, if you search on ‘politics’, the results might include facets for different political parties (Democrats, Republicans), politicians (Bush, Clinton, McCain, Obama), government activities (budget, new policies) and so on. It’s an effective way of quickly narrowing down your search criteria and hence refining results.
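As a rough illustration of faceting, here is a minimal Python sketch. The result records and the field names (`party`, `topic`) are made up for the example; in a real engine those values would come from auto-classification rather than being typed in by hand.

```python
from collections import Counter

# Hypothetical search results for the query 'politics'.
RESULTS = [
    {"title": "Budget vote delayed", "party": "Democrats", "topic": "budget"},
    {"title": "Primary results", "party": "Republicans", "topic": "elections"},
    {"title": "New health policy", "party": "Democrats", "topic": "new policies"},
    {"title": "Budget response", "party": "Republicans", "topic": "budget"},
]

def facet_counts(results, field):
    """Count how many results carry each value of the given facet field."""
    return Counter(r[field] for r in results if field in r)

def refine(results, field, value):
    """Narrow the result set to items matching one facet value."""
    return [r for r in results if r.get(field) == value]

print(facet_counts(RESULTS, "party"))
print([r["title"] for r in refine(RESULTS, "topic", "budget")])
```

The counts are what a results page would show next to each facet; clicking a facet is the `refine` step.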

The challenge is: can you trust the data? How accurate is the content within your documents? For example, let’s look at the number 35.

Why 35?

That’s the average number of people wrongly declared dead every day in the US as a result of data input errors by social security staff. Doh! 🙂 (Source: MSNBC. Published in NewScientist magazine, 8 March 2008)

Uncontrolled vocabulary

In case you are wondering, this is not going to be an expletive-filled post or a discussion about what happens if you suffer damage to the frontal lobe of your brain.

One of the big challenges for enterprise search solutions is deciphering meaning from words. There are five aspects to the challenge. (Possibly more, given I originally started this post with two!)

1. The same word can have multiple meanings

We only have so many words, and we have a habit of using the same words to mean very different things. For example:

  • Beetle = insect or motor vehicle?
  • Make-up = be nice to someone or apply paint to your face
  • Sandwich = literal (food) or lateral (something squashed between something else)
  • Resume = to continue or required for a job application?
  • SME = subject matter expert or small to medium enterprise?

Double meanings can be partially solved by studying the words in context of words around them. If resume means to continue, it is usually preceded by the word ‘to’. To determine if the beetle in question has legs or wheels, the phrase will hopefully include a little more information – not too many people (intentionally) put fuel in the insect version or find the motor version in their shower.

A popular method used by search engines to decipher meaning from documents (and search queries) is Bayesian Inference. For the SharePoint fans in the audience, the answer is no. SharePoint does not use Bayesian Inference in any shape or form at the moment. It used to, back in the first version (SPS 2001) but currently doesn’t. Yes, other search engines do.
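To give a flavour of what Bayesian inference means here, below is a toy naive Bayes classifier in Python that guesses which kind of beetle a phrase is about. It is only a sketch: the training sentences and sense labels are invented for illustration, and it is not how SharePoint or any particular search engine implements it.

```python
import math
from collections import Counter, defaultdict

# Tiny hand-made training set (an assumption for illustration only).
TRAINING = [
    ("insect", "the beetle crawled across the leaf in the garden"),
    ("insect", "a beetle has six legs and hard wings"),
    ("car", "he parked the beetle and filled it with fuel"),
    ("car", "the beetle needs new tyres and an engine service"),
]

def train(examples):
    """Count words per sense and how often each sense occurs."""
    word_counts = defaultdict(Counter)
    sense_counts = Counter()
    for sense, text in examples:
        sense_counts[sense] += 1
        word_counts[sense].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return sense_counts, word_counts, vocab

def classify(text, model):
    """Pick the sense with the highest posterior score for the text."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense in sense_counts:
        # log P(sense) + sum of log P(word | sense), with add-one smoothing
        score = math.log(sense_counts[sense] / total)
        denom = sum(word_counts[sense].values()) + len(vocab)
        for word in text.lower().split():
            score += math.log((word_counts[sense][word] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

model = train(TRAINING)
print(classify("fuel for my beetle", model))
print(classify("beetle with six legs", model))
```

The surrounding words do the work: ‘fuel’ pulls the first phrase towards the motor vehicle, ‘six legs’ pulls the second towards the insect.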

Abbreviations can be trickier because they are often assumed/applied in context of a human perspective. For example, is a search for ‘Health care SME’ looking for an internal expert on health care to ask them a question or looking for small businesses in the health care industry to include in a targeted sales campaign?

2. The same meaning is applied to different words

This one is easiest to explain with an example. A typical trans-atlantic flight offers you the choice of up to 4 ‘classes’ – Economy, Economy Plus, Business, First. But it seems no two airlines use these precise words any more. For example:

Controlled Vocabulary | Actual: Virgin Atlantic | Actual: British Airways
Economy | Economy | World Traveller
Economy Plus | Premium Economy | World Traveller Plus
Business | Upper Class | Club World
First | n/a | Club First

A common method to identify different words that share the same meaning is a thesaurus. In enterprise search solutions, a thesaurus is typically used to replace or expand a query phrase. For example, the user searches for ‘flying in economy plus’, the search engine checks the thesaurus and also searches for ‘premium economy’ and ‘world traveller plus’.
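Query expansion of this kind can be sketched in a few lines of Python. The thesaurus structure and the `expand_query` function are hypothetical, with entries taken from the airline table above; the substring matching is deliberately naive.

```python
# Hypothetical thesaurus: each controlled term maps to the airline-specific
# names from the table above.
THESAURUS = {
    "economy plus": ["premium economy", "world traveller plus"],
    "business": ["upper class", "club world"],
}

def expand_query(query, thesaurus=THESAURUS):
    """Return the original query plus one variant per synonym it contains."""
    query = query.lower()
    variants = [query]
    for term, synonyms in thesaurus.items():
        if term in query:  # naive substring match, fine for a sketch
            variants.extend(query.replace(term, s) for s in synonyms)
    return variants

print(expand_query("flying in economy plus"))
```

The search engine would then run all the variants and merge the results. A real thesaurus would match on whole words or phrases, otherwise ‘business’ would also fire on ‘businesses’.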

3. Different words have the same meaning in theory but are different in practice

Back to our airlines. Having travelled in the Economy Plus cabin of both Virgin Atlantic and British Airways recently, I can tell you from first-hand experience that the experience varies. On a Virgin Atlantic flight, Economy Plus is half-way between Economy and Business (Upper Class in their terminology). On a British Airways flight, Economy Plus is one step away from Economy and a world away from Business (Club World in their terminology). Added to the challenge, experience changes over time. It’s been a few years since I last travelled Economy Plus on British Airways and I would say that it is not as good as it used to be. But then I’ve had the luxury of travelling Business Class with them in between. The Virgin Atlantic flight was on a plane that had just had its Economy Plus cabin refurbished and was all shiny and new. And I have never travelled Business Class on Virgin Atlantic.

This is a tricky issue for a search engine to solve. It crosses into the world of knowledge and experience, a pair of moving and ambiguous goal posts at the best of times.

4. Different words are spelt near-identically

A cute feature available in most search engines is the ‘Did you mean’ function. If you spell something slightly incorrectly, for example: ‘movy’, the search engine will prompt you with ‘did you mean movie?’
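A ‘Did you mean’ prompt can be approximated with Python’s standard `difflib` module. This is a minimal sketch, assuming a tiny made-up vocabulary; real engines draw suggestions from their own index and query logs and use more sophisticated scoring.

```python
import difflib

# Hypothetical index vocabulary, invented for the example.
VOCABULARY = ["movie", "music", "cinema", "theatre"]

def did_you_mean(word, vocabulary=VOCABULARY):
    """Suggest the closest known word, or None if the word is already known
    or nothing in the vocabulary is similar enough."""
    word = word.lower()
    if word in vocabulary:
        return None
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else None

suggestion = did_you_mean("movy")
if suggestion:
    print(f"Did you mean {suggestion}?")
```

Note the first check: a word that already exists in the vocabulary never triggers a suggestion.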

The trickier challenge for search engines is when a user types in the wrong word (often accidentally) but the word typed exists in its own right (and hence doesn’t get spotted or picked up by spell-checkers):

  • Stationary (not moving) vs Stationery (materials for writing)
  • There (somewhere) vs Their (people)
  • Here (somewhere) vs Hear (what your ears do)
  • Popular (people like you) vs Poplar (a type of tree)

As in the first challenge, advanced algorithms can be used to try and decipher meaning by also looking at words used in association. ‘Can someone please move that stationery vehicle?’ probably isn’t referring to a cardboard cut-out. But this scenario is much harder to sport. (Spot the deliberate typo.) For example: ‘information about popular trees’ is a different query to ‘information about poplar trees’ but both look just fine to the search engine. Worse still, the content being indexed may itself contain a lot of typos. An author might have been using ‘stationery’ when they should have been using ‘stationary’ for years. (I’m supposed to be good at spelling and I had to check the dictionary to be sure.) In this scenario, Bayesian Inference can start to fail because there is now a range of words in documents containing ‘stationery’ that have nothing to do with paper or writing materials but occur often enough across a set of documents to suggest a relationship. Search queries correctly using ‘stationary’ won’t find highly relevant results containing mis-spelt ‘stationery’ unless the thesaurus is configured to spot the errors.
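One simple way to look at ‘words used in association’ is a confusion set: a list of easily-confused words, each with a few context words that hint the other spelling was intended. The Python sketch below is an illustration only; the `CONFUSABLES` table and cue words are made up, and it is a much cruder technique than Bayesian Inference.

```python
# Hypothetical confusion sets: word -> (alternative spelling, cue words that
# suggest the alternative was the one intended).
CONFUSABLES = {
    "stationery": ("stationary", {"vehicle", "car", "traffic", "bike"}),
    "stationary": ("stationery", {"envelopes", "paper", "pens", "cupboard"}),
    "popular": ("poplar", {"tree", "trees", "wood", "leaves"}),
}

def flag_real_word_errors(text):
    """Return (word, suggested alternative) pairs for suspicious usages."""
    words = [w.strip(".,?!'\"").lower() for w in text.split()]
    present = set(words)
    suggestions = []
    for word in words:
        if word in CONFUSABLES:
            alternative, cues = CONFUSABLES[word]
            if cues & present:
                suggestions.append((word, alternative))
    return suggestions

print(flag_real_word_errors("Can someone please move that stationery vehicle?"))
print(flag_real_word_errors("information about popular trees"))
```

The second call also gets flagged, even though ‘popular trees’ may be exactly what the searcher meant. The sketch can only prompt; it cannot know.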

5. Different words are used by different people to mean the same thing

This is the ultimate gotcha for enterprise search engines, and the reason why most taxonomies fail to deliver on their promises. Search engines that rely on word-breaking to match search queries with index results can fall down completely when the search query uses words that don’t actually exist in the document that the searcher is looking for. For example, a search for ‘employee manual’ could be seeking the ‘human resources handbook’ that doesn’t actually contain the words employee and manual together in the same sentence, let alone in any metadata. Because there is no relation, the relevance will be deemed low to the query.

The thesaurus and friends (such as Search Keywords/Best Bets) can again come to the rescue – for example, tying searches for ‘employee’ with ‘human remains’. The challenge is who decides what goes in the thesaurus? Referring to a previous blog post – When taxonomy fails – the Metropolitan Museum of Art in New York ran a test of their taxonomy using a non-specialist audience. More than 80% of the terms used by the test audience were not even listed in the museum’s documentation. The experts who decided on the taxonomy were talking a completely different language to the target audience. When the experts are allowed to ‘tweak’ the search engine, don’t assume that search results will improve…
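A test like the museum’s can be approximated by measuring how many of the terms people actually use are missing from the controlled vocabulary. The Python sketch below uses invented terms purely for illustration; it is not the museum’s data or method.

```python
def unlisted_share(user_terms, controlled_vocabulary):
    """Fraction of distinct user terms that are missing from the vocabulary."""
    users = {t.lower() for t in user_terms}
    listed = {t.lower() for t in controlled_vocabulary}
    if not users:
        return 0.0
    return len(users - listed) / len(users)

# Made-up terms: what the experts listed versus what a visitor might type.
taxonomy = ["portrait", "oil on canvas", "still life"]
visitor_terms = ["portrait", "lady", "blue dress", "sad", "painting"]

share = unlisted_share(visitor_terms, taxonomy)
print(f"{share:.0%} of the visitor terms are not in the taxonomy")
```

Running the same check against real query logs is a cheap way to find out whether the thesaurus speaks the users’ language before anyone ‘tweaks’ the engine.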

Summing Up (i.e. closing comments, not a mathematical outcome)

I’ve over-simplified how search engines work and apply controlled vocabularies through tools such as thesauri. But hopefully this post helps show why so many people become frustrated with the results returned from enterprise search solutions, even the advanced ones. Quite often, the words simply don’t match.

[Update: 20 March] Using the ‘Stationary vs Stationery’ example in the most recent Enterprise Search workshop, an attendee came up with a great little snippet of advice to remember the difference. Remember that envelopes begin with the letter ‘e’ and they are stationery. And hilariously, I can now look at the Google ads served up alongside this post and spot the spelling mistakes 🙂

Technorati tags: Enterprise Search; Taxonomy; Information Architecture; Search Server

Vertical search within results

Just spotted this up on Techmeme from SearchEngineLand – Google Tests Additional Search Box Within Search Results. Check the full post for images to explain. Here’s the quick summary.

Your search results include sites like Amazon, Wikipedia and New York Times. For those results, you can perform a follow up search that covers just that site. It’s an interesting way to keep people using Google as the start location for search.
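The same effect can be had by hand with the `site:` operator. The Python sketch below builds such a follow-up query from a result’s URL; the function name is hypothetical and this says nothing about how Google implements the extra search box.

```python
from urllib.parse import urlparse

def site_restricted_query(result_url, follow_up):
    """Build a follow-up query limited to the site of an earlier result."""
    host = urlparse(result_url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return f"site:{host} {follow_up}"

print(site_restricted_query("https://www.amazon.com/books", "enterprise search"))
```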

Sitting in on a session with FAST at the SharePoint Conference today, FAST had a great slide up showing online search as a long tail. Only 30% of searches actually start from the main gateways – Google, Yahoo, Microsoft. So that means Google actually only has a 60% market share of 30% of the market… in terms of direct searches. I daresay a fair few of the locations that account for the other 70% of searches use Google for their search… by the by, the idea of ‘search within search’ is another way of protecting and expanding that starting point for online search.
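For the record, the arithmetic behind ‘60% of 30%’ works out as follows. The two input figures are the ones quoted above; the variable names are mine.

```python
from fractions import Fraction

gateway_share = Fraction(30, 100)       # searches that start at the main gateways
google_of_gateways = Fraction(60, 100)  # Google's share of those searches

# Google's share of ALL searches, counting direct searches only.
google_direct = gateway_share * google_of_gateways
print(f"{float(google_direct):.0%} of all searches start directly at Google")
```

That is, fewer than one in five searches, before counting the sites that use Google behind the scenes.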

To click or not to click

One of the topics I spend some time on when delivering enterprise search workshops is helping organisations to identify and understand when you want people to click on a search result versus not.

When someone is seeking interactive help – for example: I want to submit a question to an internal discussion forum – then you want the search results to display that forum as the first link on the page, with enough information for the user to be confident that it is the right forum. I click on the link, enter the forum, post my question and start a conversation. Similarly, when someone is seeking detailed information – for example: I need last year’s budget – you want enough information displayed in the results page for me to be confident that the 10 MB file I am about to download over a flaky Internet connection is the right one.

When someone is seeking a snippet of information – for example: I want to know the telephone number for a customer – then you can improve productivity by displaying enough information in the search results for people to not have to click anything. I don’t need to view a page or open a document about the customer, I just want the telephone number because I need to call them.

This can be done within SharePoint Server 2007 (and its sibling Search Server 2008) using managed properties. You can modify the results pages to display additional information by tweaking the XML that determines what and how information is displayed.
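Leaving SharePoint’s own XML aside, the underlying idea is simply to render extra properties alongside each result. Here is a generic Python sketch; the record, the property names and the `render_result` function are all hypothetical and not part of any SharePoint API.

```python
def render_result(item, show=("phone",)):
    """Build the text shown for one search result, adding selected properties
    so that the user may not need to click through."""
    lines = [item["title"], item["url"]]
    for name in show:
        if item.get(name):
            lines.append(f"{name.capitalize()}: {item[name]}")
    return "\n".join(lines)

# Hypothetical customer record as it might come back from the index.
customer = {
    "title": "Contoso Ltd",
    "url": "http://intranet.example/customers/contoso",
    "phone": "01234 567890",
}
print(render_result(customer))
```

Which properties to show is the design decision: a telephone number for a customer, a file size and modified date for a budget spreadsheet.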

Interestingly, Yahoo has made an announcement to enable similar behaviour on their Internet search engine – An open approach to Search. Web site owners can submit data and Yahoo will display their results in a more informative format. Here is the example (‘before’ on the left/‘after’ on the right) given on their blog post:

From a user perspective, the ‘after’ is a big improvement. If that’s the restaurant I was looking for, I can see it has some good reviews and I’ve got the telephone number to make a reservation. But I’m not sure the web site owner will agree. That result isn’t from the restaurant itself, it is from a review/directory site. The argument behind search engine optimisation (SEO) is that many people start their Internet journey at a search engine. If you want your web site to be the next destination, then you need to be at the top of the search results. In this example, improving the visual display of search results means that there is no next destination. The search engine becomes the start and end. If you have a web site with a business model dependent on online advertising revenue (dependent on people visiting your site), the search engine just ate your lunch.

Naturally, there is a solution. If you are an intermediary web site, you need the search result to display information that will still bring a visitor to your site or keep you in the loop of the transaction. In this example, perhaps being able to offer a 10% discount on the meal if booked through the review site…

Technorati tags: Search ; Enterprise Search

All web sites great and small

Spotted a depressing article on Techmeme on Friday – Hackers turn Google into a vulnerability scanner (Infoworld). I suppose it was inevitable that this would happen.

Hacking group Cult of the Dead Cow (CDC) have kindly released a tool that uses Google to automatically scour web sites for sensitive information. Because it is automatic, it means that new and novice web sites are no longer protected by relative anonymity. If you are storing information anywhere in ‘the cloud’ and are worried about it being kept private and secure, the best approach is to run the tool for yourself and find out if your site needs fixing.

Whether Google likes it or not, they are as good as a monopoly on the Internet. There isn’t the proprietary lock-in achieved by a certain other technology company. But Google is the one location that most* people go to in search of stuff and therefore the one location most web sites aim to be discovered by. The trouble with technology monopolies is the lack of diversity. It’s what makes Microsoft software so vulnerable. Give a cold to one computer and you can pass it on to them all. Now the Internet is the focus and Google is the target to exploit. The CDC tool doesn’t care if your web site is on page 1 or page 1,000,001 of Google’s search results. It can and will find you (cue Terminator music).

The ultimate irony – the tool takes advantage of Google’s index, has been written using Microsoft .NET and is licensed as free open source… it’s not often you see those three areas come together as a single solution. Pity it had to be this one.

*According to comScore World Metrix, Google hosted 62.4% of web searches in December 2007. Next nearest rival was Yahoo with 12.8%, trailed by Microsoft with 2.9%.

How relevant is your content?

Last year I wrote a blog post that generated a bit of offline feedback – Search Lessons. Or rather, it was explaining The Google™ Effect that sparked a debate.

“Why can’t our enterprise search just be like Google?”

People are used to Google when searching the Internet – simple (and single) user interface, fast results. Relevance varies depending on the search but is usually good enough for most queries. When it comes to searching for information within the boundaries of your organisation, the answers are not so easy to find…

There’s one simple answer: Intranet ≠ Internet.

They might look similar, thanks to web standards, but looks can deceive. A mushroom looks like a toadstool but one is edible and the other isn’t.

On the Internet, pages want to be found. The same is rarely true within the enterprise.

On the Internet, if your site doesn’t appear in search results, it doesn’t exist. That’s a huge incentive to ensure anyone publishing content uses tags and any other tricks to attract the attention of a search engine. Enough of an incentive to create a whole new industry – Search Engine Optimisation (SEO). The same incentives rarely apply within the enterprise and few organisations budget for in-house SEO. Added to that, the search vendors have to continuously optimise the index to balance any negative effects caused by SEO. In-house? You need to do both! Enterprise search is a continuous improvement process, not an isolated project.

On the Internet, there is no relationship between the search engine and the end-user. If you don’t find what you are looking for, you try a different query or look somewhere else. There is no contract stating that the Internet must provide you with answers. There are no guarantees that what you find will be accurate or useful. When an enterprise search engine fails to perform, the IT department usually hears about it.

On the Internet, all users are equal (and anonymous). You can’t login to Google with a different ID and suddenly access previously hidden search results. In the enterprise, security is rarely considered optional.
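Inside the enterprise, the results a person sees depend on who they are. A minimal Python sketch of that filtering (often called security trimming) is below; the group names and the `allowed_groups` field are assumptions for illustration, not any product’s actual model.

```python
def security_trim(results, user_groups):
    """Keep only the results that at least one of the user's groups may see."""
    groups = set(user_groups)
    return [r for r in results if groups & set(r["allowed_groups"])]

# Hypothetical index hits with the groups allowed to read each item.
INDEX_HITS = [
    {"title": "Canteen menu", "allowed_groups": {"everyone"}},
    {"title": "Salary review 2008", "allowed_groups": {"hr"}},
    {"title": "Board minutes", "allowed_groups": {"directors"}},
]

print([r["title"] for r in security_trim(INDEX_HITS, ["everyone"])])
print([r["title"] for r in security_trim(INDEX_HITS, ["everyone", "hr"])])
```

Same query, same index, different results depending on the login. That is work an Internet search engine never has to do.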

The Internet doesn’t care what you do with what you find. An enterprise search engine should.

When an enterprise search engine fails to deliver, the technology usually gets the blame:

“Our search engine is rubbish, the results it finds are never relevant.”

But is it really the search engine that is to blame? When it comes to calculating relevance, it all depends on the content. Jenny Everett, over on the SharePoint Blogs, published the results of a survey that clearly demonstrate the real challenge facing enterprise search solutions.

Source: SharePoint Search is not Psychic by Jenny Everett

85% of search issues had nothing to do with relevance ranking. The numbers may vary by organisation, but I would wager that the top two issues are consistent. Imagine if you performed a search on the Internet and the first page of results consisted of items with the title ‘Blank’.