In case you are wondering, this is not going to be an expletive-filled post or a discussion about what happens if you suffer damage to the frontal lobe of your brain.
One of the big challenges for enterprise search solutions is deciphering meaning from words. There are five aspects to the challenge. (Possibly more, given I originally started this post with two!)
1. The same word can have multiple meanings
We only have so many words, and we have a habit of using the same words to mean very different things. For example:
- Beetle = insect or motor vehicle?
- Make-up = be nice to someone or apply paint to your face?
- Sandwich = literal (food) or figurative (something squashed between two other things)?
- Resume = to continue or required for a job application?
- SME = subject matter expert or small to medium enterprise?
Double meanings can be partially solved by studying each word in the context of the words around it. If resume means to continue, it is usually preceded by the word ‘to’. To determine if the beetle in question has legs or wheels, the phrase will hopefully include a little more information – not too many people (intentionally) put fuel in the insect version or find the motor version in their shower.
A popular method used by search engines to decipher meaning from documents (and search queries) is Bayesian Inference. For the SharePoint fans in the audience, the answer is no. SharePoint does not use Bayesian Inference in any shape or form at the moment. It used to, back in the first version (SPS 2001) but currently doesn’t. Yes, other search engines do.
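To make the idea of inferring meaning from surrounding words a little more concrete, here is a minimal sketch of a naive Bayes classifier, which is one simple application of Bayesian inference. It is not how any particular search engine works. The training snippets, the sense labels and the function names are all invented for illustration.

```python
import math
from collections import Counter

# Hypothetical training snippets, labelled by sense (invented for illustration).
TRAINING = {
    "insect": [
        "the beetle crawled across the leaf in the garden",
        "a beetle has six legs and hard wings",
    ],
    "car": [
        "she drove her beetle to the garage for fuel",
        "the beetle needs new tyres and an engine service",
    ],
}

def train(examples):
    """Count how often each word appears alongside each sense."""
    counts = {sense: Counter(" ".join(texts).split())
              for sense, texts in examples.items()}
    vocab = set().union(*counts.values())
    return counts, vocab

def classify(query, counts, vocab):
    """Pick the most probable sense (uniform prior, add-one smoothing)."""
    scores = {}
    for sense, words in counts.items():
        total = sum(words.values())
        scores[sense] = sum(
            math.log((words[w] + 1) / (total + len(vocab)))
            for w in query.lower().split()
        )
    return max(scores, key=scores.get)

COUNTS, VOCAB = train(TRAINING)
print(classify("beetle with six legs", COUNTS, VOCAB))
print(classify("drove the beetle to the garage", COUNTS, VOCAB))
```

With so little training text the classifier is easily fooled, which is rather the point: the quality of the guess depends entirely on how much context is available.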
Abbreviations can be trickier because their meaning is often assumed from the searcher’s own perspective. For example, is a search for ‘Health care SME’ looking for an internal expert on health care to ask them a question, or looking for small businesses in the health care industry to include in a targeted sales campaign?
2. The same meaning is applied to different words
This one is easiest to explain with an example. A typical transatlantic flight offers you the choice of up to four ‘classes’ – Economy, Economy Plus, Business, First. But it seems no two airlines use these precise words any more. For example:
- Economy Plus = Premium Economy (Virgin Atlantic) or World Traveller Plus (British Airways)
- Business = Upper Class (Virgin Atlantic) or Club World (British Airways)
A common method to identify different words that share the same meaning is a thesaurus. In enterprise search solutions, a thesaurus is typically used to replace or expand a query phrase. For example, the user searches for ‘flying in economy plus’, the search engine checks the thesaurus and also searches for ‘premium economy’ and ‘world traveller plus’.
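As a rough sketch of query expansion, assuming the thesaurus is nothing more than groups of phrases treated as equivalent: the `THESAURUS` contents and the `expand_query` function below are hypothetical, and real products configure this very differently.

```python
# Hypothetical thesaurus: each set groups phrases treated as equivalent.
THESAURUS = [
    {"economy plus", "premium economy", "world traveller plus"},
    {"business", "upper class", "club world"},
]

def expand_query(query):
    """Return the original query plus one variant per synonym found in it."""
    query = query.lower()
    variants = [query]
    for group in THESAURUS:
        for phrase in group:
            if phrase in query:
                # Swap the matched phrase for each of its synonyms in turn.
                for synonym in sorted(group - {phrase}):
                    variant = query.replace(phrase, synonym)
                    if variant not in variants:
                        variants.append(variant)
    return variants

for variant in expand_query("flying in economy plus"):
    print(variant)
```

The search engine would then run every variant and merge the results. Note that plain substring matching is crude: a word like ‘business’ would also match inside unrelated phrases.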
3. Different words have the same meaning in theory but are different in practice
Back to our airlines. Having travelled in the Economy Plus cabin of both Virgin Atlantic and British Airways recently, I can tell you first-hand that the experience varies. On a Virgin Atlantic flight, Economy Plus is halfway between Economy and Business (Upper Class in their terminology). On a British Airways flight, Economy Plus is one step away from Economy and a world away from Business (Club World in their terminology). Added to the challenge, experience changes over time. It’s been a few years since I last travelled Economy Plus on British Airways and I would say that it is not as good as it used to be. But then I’ve had the luxury of travelling Business Class with them in between. The Virgin Atlantic flight was on a plane that had just had its Economy Plus cabin refurbished and was all shiny and new. And I have never travelled Business Class on Virgin Atlantic.
This is a tricky issue for a search engine to solve. It crosses into the world of knowledge and experience, a pair of moving and ambiguous goal posts at the best of times.
4. Different words are spelt near-identically
A cute feature available in most search engines is the ‘Did you mean’ function. If you spell something slightly incorrectly, for example: ‘movy’, the search engine will prompt you with ‘did you mean movie?’
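A ‘Did you mean’ prompt can be approximated by comparing the typed word against a list of known words. The sketch below uses the string similarity in Python’s standard `difflib` module as a stand-in. The `VOCABULARY` list and the `did_you_mean` function are hypothetical, and a real engine would draw its vocabulary from the index or from query logs.

```python
import difflib

# Hypothetical vocabulary; a real engine would build this from its index.
VOCABULARY = ["movie", "music", "theatre", "cinema"]

def did_you_mean(term, vocabulary=VOCABULARY):
    """Suggest the closest known word, or None if the term is already known
    or nothing in the vocabulary is similar enough."""
    term = term.lower()
    if term in vocabulary:
        return None
    matches = difflib.get_close_matches(term, vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else None

print(did_you_mean("movy"))
print(did_you_mean("movie"))
```

Notice that a correctly spelt word returns no suggestion at all, which is exactly why the next problem is harder.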
The trickier challenge for search engines is when a user types in the wrong word (often accidentally) but the word typed exists in its own right (and hence doesn’t get spotted or picked up by spell-checkers):
- Stationary (not moving) vs Stationery (materials for writing)
- There (somewhere) vs Their (people)
- Here (somewhere) vs Hear (what your ears do)
- Popular (people like you) vs Poplar (a type of tree)
As in the first challenge, advanced algorithms can be used to try and decipher meaning by also looking at words used in association. ‘Can someone please move that stationery vehicle?’ probably isn’t referring to a cardboard cut-out. But this scenario is much harder to sport. (Spot the deliberate typo.) For example: ‘information about popular trees’ is a different query to ‘information about poplar trees’ but both look just fine to the search engine. Worse still, the content being indexed may itself contain a lot of typos. An author might have been using ‘stationery’ when they should have been using ‘stationary’ for years. (I’m supposed to be good at spelling and I had to check the dictionary to be sure.) In this scenario, Bayesian Inference can start to fail because there is now a range of words in documents containing ‘stationery’ that have nothing to do with paper or writing materials but occur often enough across a set of documents to suggest a relationship. Search queries correctly using ‘stationary’ won’t find highly relevant results containing mis-spelt ‘stationery’ unless the thesaurus is configured to spot the errors.
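One way to picture ‘looking at words used in association’ is to keep a few cue words for each member of a confusable pair and flag a word when its neighbours favour the other spelling. This is only a sketch: the cue words, the `PAIRS` table and the `flag_suspect_words` function are all invented for illustration, and it would miss plenty of real cases.

```python
# Hypothetical cue words for each spelling (invented for illustration).
CUES = {
    "stationary": {"vehicle", "car", "traffic", "parked", "move"},
    "stationery": {"paper", "envelope", "envelopes", "pens", "office"},
}
# Each word maps to the spelling it is easily confused with.
PAIRS = {"stationary": "stationery", "stationery": "stationary"}

def flag_suspect_words(text):
    """Return (typed, suggested) pairs where the context favours the other spelling."""
    words = [w.strip(".,?!'\"").lower() for w in text.split()]
    context = set(words)
    suspects = []
    for word in words:
        if word in PAIRS:
            other = PAIRS[word]
            own_hits = len(context & CUES[word])
            other_hits = len(context & CUES[other])
            if other_hits > own_hits:
                suspects.append((word, other))
    return suspects

print(flag_suspect_words("Can someone please move that stationery vehicle?"))
print(flag_suspect_words("Order stationery and envelopes for the office"))
```

If the indexed content has been misusing ‘stationery’ for years, cue words learned from that content would inherit the same mistake.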
5. Different words are used by different people to mean the same thing
This is the ultimate gotcha for enterprise search engines, and the reason why most taxonomies fail to deliver on their promises. Search engines that rely on word-breaking to match search queries with index results can fall down completely when the search query uses words that don’t actually exist in the document that the searcher is looking for. For example, a search for ‘employee manual’ could be seeking the ‘human resources handbook’ that doesn’t actually contain the words employee and manual together in the same sentence, let alone in any metadata. Because none of the words match, the document’s relevance to the query will be deemed low.
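The mismatch is easy to demonstrate with a deliberately crude relevance measure: the fraction of query words that appear in the document. Real ranking is far more sophisticated, and `term_overlap` is a hypothetical helper, but the failure mode is the same.

```python
def term_overlap(query, document_text):
    """Fraction of query terms that appear in the document (0.0 to 1.0)."""
    query_terms = set(query.lower().split())
    doc_terms = set(document_text.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & doc_terms) / len(query_terms)

document = "human resources handbook for all staff"
print(term_overlap("employee manual", document))
print(term_overlap("human resources", document))
```

The first query shares no words with the document, so however relevant the handbook is to the searcher, a word-matching engine has nothing to go on.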
The thesaurus and friends (such as Search Keywords/Best Bets) can again come to the rescue – for example, tying searches for ‘employee’ with ‘human remains’. The challenge is: who decides what goes in the thesaurus? Referring to a previous blog post – When taxonomy fails – the Metropolitan Museum of Art in New York ran a test of their taxonomy using a non-specialist audience. More than 80% of the terms used by the test audience were not even listed in the museum’s documentation. The experts who decided on the taxonomy were speaking a completely different language to the target audience. When the experts are allowed to ‘tweak’ the search engine, don’t assume that search results will improve…
Summing Up (i.e. closing comments, not a mathematical outcome)
I’ve over-simplified how search engines work and apply controlled vocabularies through tools such as thesauri. But hopefully this post helps show why so many people become frustrated with the results returned from enterprise search solutions, even the advanced ones. Quite often, the words simply don’t match.
[Update: 20 March] When I used the ‘Stationary vs Stationery’ example in the most recent Enterprise Search workshop, an attendee came up with a great little snippet of advice to remember the difference. Remember that envelopes begin with the letter ‘e’ and they are stationery. And hilariously, I can now look at the Google ads served up alongside this post and spot the spelling mistakes 🙂