Originally, this series was supposed to be four parts, but I started to realise that covering part 2 in one shot would make for a very looooooong blog post. So here is part 2a. (Click here for part 1.)
Key messages from the presentation:
Slide 3. Whilst most documents about indexing in SharePoint include a complicated diagram to explain the indexing process, here’s the simple version of how it works:
- Connectors (aka Protocol Handlers) connect to a store and suck out its content. This should include security permissions and metadata managed by the store. Hence the connector needs to adhere to an agreed protocol to keep the store happy. Different stores manage files, security and metadata differently and require different connectors.
- Once the indexing server has its hands on the content, filters are used to strip out the unnecessary gumpf (technical term) from within each item. If you open an MS Word document in Notepad, you’ll see a ton of square boxes before you get to any text. That’s because Notepad doesn’t understand MS Word formatting. Filters don’t care about formatting and chuck it all away to get down to the raw text and any metadata stored within the document. Different file formats need different filters.
- So, for each item retrieved, the indexing server has a pile of raw text, metadata (some found within the document, some held by the store alongside the document), a link to the original item and security permissions (who can access it). All the metadata becomes ‘crawled properties’ and is dumped into a property store. All the individual words – the raw text – are dumped into the content index.
- There is one additional element included – a static rank. We’ll cover that in part 3.
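The indexing steps above can be sketched in a few lines of Python. This is a toy illustration only, not how SharePoint is implemented: the `StoreItem` class, the `toy_filter` function, the `{bold}` and `<<key=value>>` markup and the example URL are all invented for the sketch, and the static rank is left out.

```python
import re
from dataclasses import dataclass


@dataclass
class StoreItem:
    """What a (hypothetical) connector pulls out of a store for one item."""
    url: str                  # link back to the original item
    raw: str                  # content still wrapped in its file format
    store_metadata: dict      # metadata the store holds alongside the item
    allowed_users: frozenset  # security permissions managed by the store


def toy_filter(raw):
    """Stand-in for a filter: throw away formatting, keep text and embedded metadata.

    Assumes an invented format where {tags} are formatting and
    <<key=value>> is metadata stored inside the document.
    """
    embedded = dict(re.findall(r"<<(\w+)=([^>]*)>>", raw))
    text = re.sub(r"<<[^>]*>>|\{[^}]*\}", " ", raw)
    return " ".join(text.split()), embedded


def index_items(items):
    content_index = {}   # word -> set of item URLs (the raw text)
    property_store = {}  # URL -> crawled properties (all the metadata)
    acl = {}             # URL -> who can access the item
    for item in items:
        text, embedded = toy_filter(item.raw)
        property_store[item.url] = {**item.store_metadata, **embedded}
        acl[item.url] = item.allowed_users
        for word in re.findall(r"\w+", text.lower()):
            content_index.setdefault(word, set()).add(item.url)
    return content_index, property_store, acl


items = [
    StoreItem(
        url="http://intranet/docs/budget.doc",
        raw="{bold}Finance budget{/bold} <<author=Ann>> for 2008",
        store_metadata={"department": "Finance"},
        allowed_users=frozenset({"ann", "bob"}),
    )
]
content_index, property_store, acl = index_items(items)
print(sorted(content_index))
print(property_store)
```

Note how the formatting tag never reaches the content index, while metadata from both the document and the store ends up side by side in the property store.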
Slide 4. Now we have an index, people can start to query it and receive search results. When you type a query in a basic search box, here is what happens: (e.g. ‘SharePoint and enterprise search’)
- The search query is word-broken and noise words are removed. The language match determines which dictionary and noise word list are used. Our example now becomes ‘SharePoint’, ‘enterprise’, ‘search’. Say bye bye to ‘and’ – it’s a noise word.
- If you have stemming enabled (it is off by default), then the search will probably also include ‘searches’, ‘searching’ and ‘enterprises’. If the thesaurus has been configured, it may include additional acronyms for SharePoint, such as ‘SPS’. (Not sure about ‘MOSS’ – technically, that’s a dictionary word.) See related post – SharePoint and Stemming – for more information about stemming, noise words and the thesaurus.
- We now have a list of words that form the search query. A list of results is returned from the index, made up of items that match any or all of the words in the query. That is how a basic search behaves; advanced search lets you return only docs that match all of the words in the query, among other options.
- The results are security-trimmed – anything that the user doesn’t have permission to see is removed. They can also be scope-trimmed, if a scope has been selected (e.g. only return docs that are less than one year old)
- The remaining set of results is relevance-ranked: a dynamic rank (calculated based on the search query terms) is added to the static rank held in the index, and the results are returned as an ordered list.
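Here is a rough Python sketch of the query steps above, again as a toy and not SharePoint’s actual algorithm. The noise word list, the thesaurus entry, the tiny index, the user names and the rank values are all assumptions made up for the example. Stemming is skipped, and the ‘dynamic rank’ is simply the number of query terms an item matches.

```python
import re

NOISE_WORDS = {"and", "the", "of", "a"}  # assumed noise word list
THESAURUS = {"sharepoint": {"sps"}}      # assumed thesaurus entry


def break_words(query):
    """Word-break the query and drop noise words."""
    return [w for w in re.findall(r"\w+", query.lower()) if w not in NOISE_WORDS]


def expand(terms):
    """Add thesaurus expansions to the query terms."""
    expanded = set(terms)
    for term in terms:
        expanded |= THESAURUS.get(term, set())
    return expanded


def search(query, user, content_index, acl, static_rank):
    terms = expand(break_words(query))
    # Basic search: an item is a hit if it matches any of the terms.
    dynamic_rank = {}
    for term in terms:
        for url in content_index.get(term, ()):
            dynamic_rank[url] = dynamic_rank.get(url, 0) + 1
    # Security trimming: drop anything the user is not allowed to see.
    visible = [url for url in dynamic_rank if user in acl.get(url, ())]
    # Relevance ranking: dynamic rank added to the static rank held in the index.
    return sorted(
        visible,
        key=lambda url: dynamic_rank[url] + static_rank.get(url, 0.0),
        reverse=True,
    )


content_index = {
    "sharepoint": {"a", "b"},
    "enterprise": {"a"},
    "search": {"a", "c"},
    "sps": {"d"},
}
acl = {"a": {"ann"}, "b": {"ann", "bob"}, "c": {"bob"}, "d": {"ann"}}
static_rank = {"a": 0.1, "b": 0.5, "d": 0.2}

print(break_words("SharePoint and enterprise search"))
print(search("SharePoint and enterprise search", "ann", content_index, acl, static_rank))
print(search("SharePoint and enterprise search", "bob", content_index, acl, static_rank))
```

The same query gives Ann and Bob different result lists, purely because of security trimming.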
Slide 5. One of the most popular areas to customise in SharePoint is property management. An index can contain lots and lots of crawled properties. You can leverage those properties in search queries. For example, looking for all documents classified as ‘finance’ – see related post: Classifying content in SharePoint. To do this, you create managed properties. A managed property can be mapped to one or more crawled properties. Take the example in the slide: you might want people to be able to look up content classified by ‘customer name’, but ‘customer name’ may be stored under different titles across multiple content stores. Mapping all of those crawled properties to a single managed property lets people search on one name. Managed properties can be added to the Advanced Search page and used in scopes. We’ll revisit this in part 2b (or 2c, depending on how long 2b ends up).
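To make the mapping idea concrete, here is a small Python sketch. The managed property name `CustomerName`, the crawled property names and the document names are all hypothetical, and the lookup is a plain dictionary rather than anything SharePoint exposes.

```python
# One managed property mapped to several crawled properties (all names invented).
MANAGED_PROPERTIES = {
    "CustomerName": ["Customer Name", "Client", "cust_name"],
}


def managed_value(managed_name, crawled_props):
    """Return the first mapped crawled property that the item actually has."""
    for crawled_name in MANAGED_PROPERTIES.get(managed_name, []):
        if crawled_name in crawled_props:
            return crawled_props[crawled_name]
    return None


def find_by_property(managed_name, value, property_store):
    """Find items whose managed property equals the given value."""
    return [
        url
        for url, props in property_store.items()
        if managed_value(managed_name, props) == value
    ]


# Three stores, three different titles for the same piece of metadata.
property_store = {
    "doc1": {"Customer Name": "Contoso"},
    "doc2": {"Client": "Contoso"},
    "doc3": {"cust_name": "Fabrikam"},
    "doc4": {"Author": "Ann"},
}

print(find_by_property("CustomerName", "Contoso", property_store))
```

The person searching only ever sees ‘CustomerName’; which store called it what is hidden behind the mapping.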
Slide 6. In a typical SharePoint deployment, you will have a single central indexing and search server. This server will index all your different content sources. If you’ve got the licences, you will probably separate search from indexing, using query servers. This means the indexing server can focus on indexing and propagate index changes up to the query servers. If the indexing server decides to take a break (literally), users can still search for content because copies of the index reside on each query server. It simply means there will be no updates to the index until the indexing server returns from vacation. The new feature introduced this year is federation. An indexing server can include federated connectors to other indexes. That is great for accessing content not indexed natively by SharePoint and also great for spreading the indexing load. When a user submits a search query, results are returned from the central index and from any other indexes with a federated connector. If you want to see this in action, try performing a search at http://www.infomash.co.uk/. You will see results returned from Flickr, Technorati, Twitter (Summize) and FriendFeed. If you want to have some fun, try http://www.infomash.co.uk/googmsft.aspx – you’ll see results returned from Google and Live side-by-side, which is great for comparing how they determine relevance. (Note: it’s a prototype server, so no guarantees regarding availability or performance.)
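The federation idea boils down to sending one query to several indexes and showing each set of results on its own. A minimal Python sketch follows, assuming each index can be represented as a function that takes a query and returns its own ordered list. The source names and the stand-in search functions are made up and do not call any real service.

```python
def federated_search(query, sources):
    """Send the same query to every source and keep each result list separate.

    `sources` maps a display name to a function that searches one index.
    The lists are not merged, because each index ranks its own results.
    """
    results = {}
    for name, search_fn in sources.items():
        results[name] = search_fn(query)
    return results


# Stand-ins for a central index and two federated indexes (all hypothetical).
def central_index(query):
    return [f"Intranet page about {query}", f"Team site mentioning {query}"]


def branch_office_index(query):
    return [f"Branch office document on {query}"]


def photo_site(query):
    return []  # a federated source may have nothing to return


sources = {
    "Central index": central_index,
    "Branch office": branch_office_index,
    "Photos": photo_site,
}

for name, hits in federated_search("enterprise search", sources).items():
    print(name, hits)
```

Each entry in the returned dictionary corresponds to one block of results on the page.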
Slide 7. Some capacity planning tips:
- Already mentioned – the first scale issue you will hit is indexing server performance. Indexing will fight with search queries for RAM and CPU attention. Put them in separate playgrounds (or, if you don’t have timezone problems, schedule indexing to take place only out of hours, though that means your index can be up to a day old).
- The most popular indexing question – how big will the index be? The numbers can be quite frightening – up to 50% of the size of content you are indexing. Average is usually nearer to 20% but it all depends on your range of vocabulary. Lots of colourful language and each different word has to go into that index. Lots of metadata and it all gets stored…
- The required disk space for the index is important – you need to allow 2.5 x the index size. This is due to the fact that you can, for a temporary period of time, have 2 copies of the index stored (to do with how changes are managed and propagated). But, it is 2.5 x the index size, not the size of the corpus being indexed (I’ve seen the latter stated at MS conferences and in some documents on TechNet – wrong.)
- Federated connectors give you lots more flexibility in your architecture. Whilst you will mostly hear about how they let you include other indexes in your search results, there is a hidden benefit. You can split up your SharePoint indexes and then federate all the results on a single page. It won’t be one results list – each federated connector displays its results in a separate web part. That is required because each result set will have a different rank calculation. There is great potential for branch office scenarios, to save bandwidth and keep indexes fresher, and for dropping an indexing server onto niche applications and content stores (if they are running Windows Server 2003 or later, you can use Search Server Express and not have to pay for any extra licences).
Note: Federated connectors are currently only available in Search Server 2008. They are due to be released for SharePoint Server 2007, hopefully quite soon.
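To put numbers on the index-size and disk-space tips from Slide 7, here is a quick worked example in Python. The 20%, 50% and 2.5 x figures are the rules of thumb quoted above; the 500 GB corpus is just an assumed example.

```python
def index_size_gb(corpus_gb, ratio=0.20):
    """Estimate index size as a fraction of the content being indexed.

    0.20 is the 'usual average' quoted above; 0.50 is the frightening upper end.
    """
    return corpus_gb * ratio


def index_disk_gb(index_gb, headroom=2.5):
    """Disk to allow for the index: 2.5 x the index size, not the corpus size."""
    return index_gb * headroom


corpus_gb = 500  # assumed size of the content being indexed

for label, ratio in [("typical", 0.20), ("worst case", 0.50)]:
    index_gb = index_size_gb(corpus_gb, ratio)
    disk_gb = index_disk_gb(index_gb)
    print(f"{label}: index ~{index_gb:.0f} GB, allow ~{disk_gb:.0f} GB of disk")
```

For comparison, applying the 2.5 multiplier to the corpus by mistake would ask for 1,250 GB of disk for the same 500 GB of content.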
To download a copy of the presentation (3MB) – MS-Search-Pt2a.pdf
Related blog posts: