I've pondered hard to find the best analogy that I could think of and I've come up with ... the phone book. Yes, a phone book.
Imagine that each page in our phone book is equivalent to a Sybase 2K data page. Every time we read a page from our phone book it is equivalent to one disk I/O.
Since we are imagining, let's also imagine that our mythical SQL Server (that runs against the phone book) has only enough data cache to buffer 200 phone pages. When our data cache gets full we have to flush an old page out so we can read in a new one.
Fasten your seat belts, because here we go...
The phone book is clustered by last name.
create clustered index on phone_book (last_name)It's fast to perform the following queries on the phone book:
create nonclustered index on phone_book (phone_number)Our non-clustered index will be built and maintained by our Mythical SQL Server as follows:
The phone numbers will be kept in ascending order.
Now when we ask the question:
Find the address of those whose phone number is 440-1300we don't look at the phone book directly but go to our new data structure and it tells us which page and row within the page the above phone number can be found. Neat eh?
Draw backs? Well, yes. Because we probably still can't answer the question:
Find the address of those whose prefix is 440This is because of the data structure being used to implement non-clustered indexes. The structure is a list of ordered values (phone numbers) which point to the actual data in the phone book. This indirectness can lead to trouble when a range or a match query is issued.
The structure may look like this:
What may happen is that we re-read the same phone page many times. This isn't a problem if the phone page is in memory. We have limited memory, however, and we may have to flush our memory to make room for other phone pages. So the re-reading may actually be a disk I/O.
The Server needs to decide when it's best to do a table scan versus using the non-clustered index to satisfy mini-range type of queries. The way it decides this is by applying a heuristic based on the information maintained when an update statistics is performed.
In summary, non-clustered indexes work really well when used for highly selective queries and they may work for short, range type of queries.
When you cluster, build the cluster to satisfy the largest percentage of range type queries. Don't put the clustered index on your primary key because typically primary keys are increasing linearly. What happens is that you end up inserting all new rows at the end of the table thus creating a hot spot on the last data page.
For detail rows, create the clustered index on the commonly accessed foreign key. This will aid joins from the master to it.
Use nonclustered index to aid queries where your selection is very selective. For example, primary keys. :-)