Clustered Indexes? Sedimentary, my dear Watson

  • Ken Lee-263418 (5/26/2010)


    ONLY the first field values are stored in the binary tree of ANY index.

    Sorry, but that is not true. I don't, from your explanation, know what was wrong with your indexed view, but for a composite index, all key columns are present at all levels if the index tree. In a clustered index, all the other columns are at the leaf level, in a nonclustered index, include columns are at the leaf and the clustering key will be present either as a key column or as an include column, depending whether or not the index is define as unique.

    Setup code

    CREATE TABLE TestingIndexStructure (

    SomeNumber INT,

    SomeArbString VARCHAR(10),

    Filler CHAR(800) -- simply here to make the rows and hence index bigger

    )

    INSERT INTO TestingIndexStructure

    SELECT TOP 5000 b.object_id, LEFT(a.name,10),''

    FROM master.sys.columns a CROSS JOIN master.sys.columns b

    -- composite clustered index. Two columns, SomeNumber and SomeArbString

    CREATE CLUSTERED INDEX idx_TestingIndexStructure_NameString

    ON TestingIndexStructure (SomeNumber, SomeArbString)

    To see where the various index structures are, a little bit of undocumented fun. Be sure to change the first param of this to match the DB name if anyone's trying this themselves

    DBCC IND ('Testing','TestingIndexStructure',1)

    This tells me that, in my case, the index root page is page 134684 in file 1, that one of the intermediate pages is at page no 134683 (there are 3 in total) and that there's a leaf page at page no 135480 (and 556 of them in total)

    First up, the leaf page:

    DBCC TRACEON (3604)

    DBCC PAGE (11,1,135480,3) -- first parameter is database_id for anyone trying this themselves

    Output (trimmed for brevity)

    Slot 0 Offset 0x60 Length 821

    Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP VARIABLE_COLUMNS

    Record Size = 821

    Memory Dump @0x000000000EACC060

    0000000000000000: 30002803 03000000 20202020 20202020 †0.(.....

    0000000000000010: 20202020 20202020 20202020 20202020 †

    [snip]

    0000000000000320: 20202020 20202020 04000002 00310335 † .....1.5

    0000000000000330: 03616464 72††††††††††††††††††††††††††.addr

    Slot 0 Column 0 Offset 0x0 Length 4 Length (physical) 0

    UNIQUIFIER = 0

    Slot 0 Column 1 Offset 0x4 Length 4 Length (physical) 4

    SomeNumber = 3

    Slot 0 Column 2 Offset 0x331 Length 4 Length (physical) 4

    SomeArbString = addr

    Slot 0 Column 3 Offset 0x8 Length 800 Length (physical) 800

    Filler =

    Well, that's obvious. All the columns have to be there, it's the leaf level and this is the clustered index. Now for the intermediate.

    DBCC PAGE (11,1,134683,3)

    First column of the index (someNumber) highlighted in red, second column of the index (SomeArbString) highlighted in green. Oh, and the uniquifier is there because the index is not unique.

    Lastly the root level of the index

    DBCC PAGE (11,1,134684,3)

    Same highlighting of columns.

    And finally, some cleanup.

    DBCC TRACEOFF (3604)

    DROP TABLE TestingIndexStructure

    I will leave the equivalent examination of a nonclustered index as an exercise to the reader, because I'm going to be late for a meeting if I don't leave now.

    These (two blog posts and an article series) may be of interest here.

    Index columns, selectivity and equality predicates[/url]

    Index columns, selectivity and inequality predicates[/url]

    Introduction to indexes[/url]

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Ken Lee-263418 (5/26/2010)


    Ken Lee says You can easily ferret using "EXEC sp_depends 'tablename'" or see more detail about the first queries with

    "select name, case when loc<51 then substring(text,1,150) else substring(text,loc-50,150) end text

    from (select OBJECT_NAME(id) name, CHARINDEX('tablename',text) loc,text

    from dbo.syscomments

    where text like '%tablename%') d"

    That's a very helpful solution. Fixing the queries that hit tha table should be all that is required.

  • Nick W* (5/25/2010)


    There'll always be a need for DBAs to slap the programmers upside the head when they do it in the most boneheaded RBAR way imaginable. 😉

    It's equally true that there will always be a need for the programmers to slap the DBAS upside the head when they do it in the most boneheaded RBAR way imaginable.

    I've known far more DBAs who wrote procedural row-at-a-time code than developers who pulled such stunts (maybe because developers usually don't learn how to do that stupid thing).

    I've also known developers who habitually designed tables not in first normal form, so I'm not biased in favor of developers any more than I am biased in favor of DBAs.

    Tom

  • It's an interesting editorial. An some interesting comments - some people seem to be assuming that the clustered index must be unique, or that the primary key io always clustered, but I guess that's just a matter of learning a bit more. The assumption that an identity will always be good as the primary key if rows frequently accessed together are close in insertion date is a pernicious one though, both on account of the insert blitz on the current page problem already mentioned and because of the simple fact that in many cases the natural formultion of the most frequent access pattern is in terms of the natural key, and converting from that to use the surrogate key is pure overhead.

    There are plenty of queue-like tables out there, where inserts vastly outweigh everything else except deletes because reads only happen on the rare occassions where diagnostics are needed because something has gone horribly wrong, but it's easy to build an alternative surrogate out of an identity and use it for clustering so as to eliminate the blitzed current page effect while causing not too many page splits (and split in such a way that the number of unfull pages is static). You can use a table definition that begins something like

    create table T (

    ck as (id%64)*33554432+id/64 Persisted primary key clustered,

    id int not null unique identity(1,1),

    .........

    so that if the average row length were 1000 bytes the insertion of the first 8 rows would hit the same page 8 times, but after 64 rows had been inserted a page would be hit next after a given insertion only after there had been insertions to 8 other pages in this table and page allocations occur at the same rate on average as they would if the identity were used for clustering; the downside is that the page splits are bunched (once you have 64 rows you will repeat a sequence of 56 inserts with no splits followed by 8 inserts each requiring a new page). Mostly diagnostic table rows are a bit shorter than that, so 64 and 33554432 might be 512 and 4194304 instead, but that exarcebates the bunching; and if there are going to be more than a couple of billion rows, the big number has to be bigger and the identity has to be a bigint instead of an int.

    If the insert rate is low enough that messing about to avoid inserts always being consectutive in space is pointless and the natural where clause selects a range of natural key values, there's little point in requiring the cluster to be on the identity even if the identity order and the natural key order are the same; the identity should still of course have unique and not null constraints on it so that it can be the target of foreign key relations and used for most joins if the natural key is enough longer than the identity (int or bigint) for using the identity to be a useful saving. It doesn't matter which is the primary key - the natural key or the identity - as long as they each have either not null and unique or primary key constraints (personally I would usually prefer the natural key to be the primary key) - but it does matter which is the cluster key (it should usually be whichever one is going to entail the fewest reads in frequent queries).

    In the case where the natural key order doesn't correlate well with sedimentary order, I find it hard to work out what to do - using an identity to cluster (with or without a hack to spread the inserts a bit) reduces the number of page splits, but doesn't help at all with queries that want a range on the natural key. Using the natural key to cluster may lead to lots of page splits, not using it can lead to lots of extra reads. It's a lot of hard work to calculate which is best, and testing with random test data isn't going to be much help unless you really do know the value distributions of the various chunks of data so that your random test data can be generated in such a way that it looks something like the real world will look when you hit it (and even then, management has to be persuaded to allocate a lot of storage and process power and man-days to performance measurement experiments - - which isn't going to happen when some ignorant jerk is boldly asserting one of the three common myths: he "knows" the decision is easy because (Myth 1) it's always right to cluster on a surrogate or because (Myth 2) it's always right to cluster on the natural key or because (Myth 3) the surrogate should be defined in a table that contaions only it and the natural key, no information about the natural key except what it and its surrogate are should be in that table. Accepting the myth (whichever of the three it is) obviates the need for expensive performance testing to decide which way to go, and given the choice between risking getting it wrong but defering spending and doing it right by spending now it's usually risk getting it wrong that wins (not always; just usually).

    Tom

  • I agree that all fields in the index are stored in the index binary tree structure, in fact, I believed that when I set up the primary key of the partitioned view. The adelpated behaviour of SQL caused me to believe the opposite was true.

    In order to set up a partitioned view (not indexed) you MUST define a value that has a check constraint in the primary key. I figured since all the fields are stored in the binary key lookup are stored in the primary key and the partitioning key is SO important to be included in the primary key, maybe that should be listed first.

    IF SQL USED both fields in its binary tree lookup, that would be a valid assumption. 7 tables, the first field must be a value between 1 and 7 because of the check constraint, it would use the secondary field to find the location in the binary tree and go directly to the physical location where it was needed like it used a laser pointer to find where it goes.

    In lab tests, inserts were taking 7 SECONDS, switching the fields, inserts took 6 MILLISECONDS. The only explanation that made any sense is that the binary tree didn't store the secondary index in the binary tree since it sure wasn't USING it to find the location.

    At least in SQL 2000, that's the behavior I saw. The only explanation I could come up with is that the binary tree only stores the first field. I have no idea if that behaviour continues in 2005 and later. Since I was burned in SQL 2000, I am reluctant to try in later versions.

  • Ken Lee-263418 (8/28/2011)


    In order to set up a partitioned view (not indexed) you MUST define a value that has a check constraint in the primary key. I figured since all the fields are stored in the binary key lookup are stored in the primary key and the partitioning key is SO important to be included in the primary key, maybe that should be listed first.

    If it is defined as the first column of the primary key, then it is the leading column of the index key. Order of columns in an index key are important, SQL can only seek on left-based subsets of the index key (think how the telephone directory works)

    IF SQL USED both fields in its binary tree lookup, that would be a valid assumption. 7 tables, the first field must be a value between 1 and 7 because of the check constraint, it would use the secondary field to find the location in the binary tree and go directly to the physical location where it was needed like it used a laser pointer to find where it goes.

    SQL doesn't do binary tree lookup, indexes aren't binary trees at all. They're balanced trees (b+ trees if you want to get really technical) and they contain all the columns of the index key at all levels. If you're doing a 2-column search on a 2-column index, SQL uses both columns to navigate down the balanced tree structure.

    See the indexing article I wrote for a little bit more detail: http://qa.sqlservercentral.com/articles/Indexing/68439/ (it's an introductory article, not a deep technical one)

    I have no idea if that behaviour continues in 2005 and later. Since I was burned in SQL 2000, I am reluctant to try in later versions.

    The general structure of indexes (b+ tree containing all key columns at all levels) hasn't changed since SQL 7, probably since way before that. The only recent changes have been the addition of include columns (non-key columns stored only at the leaf level) and filtered indexes (indexes that don't contain all the rows of the table)

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (8/28/2011...

    If it is defined as the first column of the primary key, then it is the leading column of the index key. Order of columns in an index key are important, SQL can only seek on left-based subsets of the index key (think how the telephone directory works)

    ...

    SQL doesn't do binary tree lookup, indexes aren't binary trees at all. They're balanced trees (b+ trees if you want to get really technical) and they contain all the columns of the index key at all levels. If you're doing a 2-column search on a 2-column index, SQL uses both columns to navigate down the balanced tree structure.

    See the indexing article I wrote for a little bit more detail...

    The general structure of indexes (b+ tree containing all key columns at all levels) hasn't changed since SQL 7, probably since way before that. The only recent changes have been the addition of include columns (non-key columns stored only at the leaf level) and filtered indexes (indexes that don't contain all the rows of the table)

    In a telephone directory, I can know that the second of 5 books would have "Lee" in it. When I find the 300+ pages of Lee's, I can hold my index fingers around them and use my thumbs to get a little left of center of the entire list and then thumb through the pages that are around Lee, K to find Ken. Then it's only 4 or 5 pages of Ken's I need to look through to find the one of interest. It isn't great, but it is a useful method.

    Even if all 5 books only had "Lee" in it, I'd know to pick up the 2nd of 5 and try there first. Trying to find "Ken" when you don't know the last name would be agonizing.

    Yes it is a b+ tree used by SQL. The first page should have a list of ranges to use in the next b tree lookup. This is called a modified binary tree lookup because instead of left/right, you have left, middle, middle, middle,... right to pick from. So, I dropped the "modified", shoot me. In a sensible manner, the "middle" values should be distributed across the table data, first on the first field and second on the second field.

    I don't know how SQL does it, but it "ACTS" like it distributes across the first field and then lists the second in the order found on the pages, so the first 5000 rows are accounted for in the first page of the range except the last list, which still accounts for 20 million rows. Then you do that again and again and again until you reach 15 million rows into the table and make your insert.

    You reverse the order of the fields and the range ACTS like the first selection accounts for 200K records per range, the second 2K, and the third finds the 100 row page your data is on. (3 lookups vs 15 thousand. 3 milliseconds vs 3 seconds)

    Is it any wonder I came to the incorrect conclusion that the index only stores the first field in the b tree lookup?

  • Ken Lee-263418 (8/28/2011)


    In a telephone directory, I can know that the second of 5 books would have "Lee" in it. When I find the 300+ pages of Lee's, I can hold my index fingers around them and use my thumbs to get a little left of center of the entire list and then thumb through the pages that are around Lee, K to find Ken. Then it's only 4 or 5 pages of Ken's I need to look through to find the one of interest. It isn't great, but it is a useful method.

    Even if all 5 books only had "Lee" in it, I'd know to pick up the 2nd of 5 and try there first. Trying to find "Ken" when you don't know the last name would be agonizing.

    Which is roughly how SQL uses it's indexes (I say roughly because SQL has the b-tree structure to traverse instead of the 'thumbing through' that we do with a phone book). It uses all the columns that it can to navigate down the tree to the required page. If SQL had an index on Surname, Firstname and all it was given was FirstName = Ken, it would have to scan the leaf level of the index as it has no way to search for that value alone (much the same as finding 'Ken' in a phone book when you don't know his surname)

    Yes it is a b+ tree used by SQL. The first page should have a list of ranges to use in the next b tree lookup.

    The first page, the root, has the first index key values (all of the key columns) for each page beneath it in the tree (again, see my article). So if the first page beneath the root starts with "Adams, Alan", that is what appears as the first entry in the root page. If the second page beneath the root starts with "Adams, William" then that's what appears as the second entry in the root page. Third page starts with "Attree, Michael", that appears as the third entry in the root page, carry on down the pages beneath the tree until you reach the last page beneath the root, let's say it starts with Woodridge, Tony, that appears as the last entry on the root page. So the root page looks roughly like this (which each entry having a pointer to the page that the key value appears on as well):

    Adams, Alan

    Adams, William

    Attree, Michael

    ...

    ...

    < rest of page>

    ...

    ...

    Woodridge, Tony

    If the pages beneath are not the leaf level, then they have the same structure - the key values for the first row on each page below. That repeats down to the root.

    This is called a modified binary tree lookup because instead of left/right, you have left, middle, middle, middle,... right to pick from. So, I dropped the "modified", shoot me. In a sensible manner, the "middle" values should be distributed across the table data, first on the first field and second on the second field.

    Binary tree = 0, 1 or 2 descendants only, and can be varying depths across the tree. There is nothing binary about a SQL index tree and you cannot in any way describe a tree that has "left, middle, middle, middle,... right" as binary, it's not. If there can be more than 2 descendants from a node is is not (by definition) a binary tree.

    There's no tree structure called a 'modified binary tree', nor is there a tree-related algorithm called a 'modified binary tree lookup'.

    I don't know how SQL does it, but it "ACTS" like it distributes across the first field and then lists the second in the order found on the pages, so the first 5000 rows are accounted for in the first page of the range except the last list, which still accounts for 20 million rows. Then you do that again and again and again until you reach 15 million rows into the table and make your insert.

    I honestly don't understand what you're trying to say here.

    With a b+tree, each page beneath the root has aprox the same number of entries on it (allowing for varying row-length, deletes and the possibility of the last page being only part full)

    p.s. This is all covered in more detail in the article I mentioned above.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • OK, in my case there was a tinyint field, in the entire table, and that field had only one value. I know what that value is. The second field was varchar(8) set up as case sensitive. I also know what that value is. I knew the combination was unique.

    My expectation was that it would evenly divide the records in the table based on the distribution of data in the table. I didn't expect that it would first divide on the first field (no division) and then sequentially list the first record of the next page for the second record in the list.

    I expected the table to distribute the list based on the remaining records that needed to be divided by first dividing the first list and then dividing the remaining records evenly against the table based on the value of the second field.

    I came to believe the index only stored the first field of the index because the lookup of the second value was so poor.

    In actuality, if the second index wasn't stored in the index's list, it might be MORE efficient than as it currently is stored. (The list is shorter per record because the second index's value isn't stored. Looking up the last page of the index list would tell you the second index's value and you would then know if you had to go to the next list or start searching on this one.)

    I of course think it would work better if it distributed the list based on the records contained in the second index, using it as a secondary sort instead of ignoring it.

  • Ken Lee-263418 (8/28/2011)


    OK, in my case there was a tinyint field, in the entire table, and that field had only one value. I know what that value is. The second field was varchar(8) set up as case sensitive. I also know what that value is. I knew the combination was unique.

    That's pretty bad as an index layout, not because of the index tree structure, but because of statistics which only keep the histogram of the first column. So looking at the statistics, the index would look useless. If it was marked as unique then SQL would know that both columns together are highly selective, if the index wasn't marked unique SQL wouldn't know that and would be less likely to use the index because of the very low density leading column.

    My expectation was that it would evenly divide the records in the table based on the distribution of data in the table.

    Which is exactly what it does.

    In your case (with made up examples for the varchar(8)) the root page would look something like this (with the pages below containing either more pointers or the actual rows in the ranges.

    1, AAAAAAAA

    1, AAADFGYZ

    1, AZERFGTY

    1, BTRESDFD

    1, CEDFGESS

    1, FRESFFSA

    1, GDWWSAAY

    ....

    1, WEREKSSD

    1, ZSDDWEW

    The page pointed to from the first row would contain rows where the second key column >=AAAAAAAA, <= AAADFGYZ

    The page pointed to from the second row would contain rows where the second key column >=AZERFGTY, <= BTRESDFD

    and so on down to the last page, which would contain rows >= ZSDDWEW

    (assuming we only have a 2-level tree, I don't have the time or enthusiasm to walk through a deeper tree)

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (8/28/2011)


    The general structure of indexes (b+ tree containing all key columns at all levels) hasn't changed since SQL 7, probably since way before that. The only recent changes have been the addition of include columns (non-key columns stored only at the leaf level) and filtered indexes (indexes that don't contain all the rows of the table)

    SQL Server's structures aren't actually B+ trees. SQL Server doesn't do the required shuffling on deletes and moves to limit the tree depth (for fixed length keys, the maximum index depth for a B+ tree is the (integer) ceiling of the base floor(B/2) log of N where N is the number of leaf values). For variable length keys with fixed sixze nodes (pages) there may be no number (even if you do the standard shuffling) for which neither does any node have more than B children nor does any interior node have fewer than floor((B-1)/2) children unless you artificially limit the number of children a node can take, throwing away storage efficiency, so you sometimes can't have a storage efficient B+ tree. B+ trees make sense for fixed-length keys, but for variable length keys they would need variable length nodes or low storage efficiency. Given that support for variable length keys at good strorage efficiency is required, and the node size is fixed, neither B Trees nor B+ Trees are possible. The trade-off on doing/not doing shuffling may or may not be in favour of not doing it (it costs extra CPU, extra IO, and extra locking on some deletes and moves, in exchange for which index rebuilds are needed less often) - I imagine MS have done some experimantation (whether concrete experiments or though experiments) and have good grounds for not doing it. Celling SQL_Servers index things (sructure+inser/delete/move code) B+ Trees is a misuse of the term; but the things MS is actually using are probably more appropriate to the job than B+ trees would be.

    Tom

  • Tom.Thomson (8/27/2011)


    It's an interesting editorial. An some interesting comments - some people seem to be assuming that the clustered index must be unique, or that the primary key io always clustered, but I guess that's just a matter of learning a bit more...

    I've known for a long time a clustered index didn't have to be unique or be the primary key. I was told that if you cluster an index that wasn't unique, an unaddressable field would be added to the table to make the clustered index internally unique.

    Just wondering if that was true.

  • Ken Lee-263418 (8/29/2011)


    Tom.Thomson (8/27/2011)


    It's an interesting editorial. An some interesting comments - some people seem to be assuming that the clustered index must be unique, or that the primary key io always clustered, but I guess that's just a matter of learning a bit more...

    I've known for a long time a clustered index didn't have to be unique or be the primary key. I was told that if you cluster an index that wasn't unique, an unaddressable field would be added to the table to make the clustered index internally unique.

    Just wondering if that was true.

    Yup. It's called the uniqueifier. 4-byte integer. Added to duplicate key values only.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • A big deal was made that a B+ tree isn't anything like a binary tree, so it isn't a modified binary tree.

    1. A binary tree is a method of organizing data so that information can be found quickly.

    1a. A B+ tree is a method of organizing data so that information can be found quickly.

    2. A binary tree can be completely sequential so information can't be found quickly.

    2a. A B+ tree can be completely sequential so information can't be found quickly.

    3. A binary tree can be reorganized without changing it's structure so that information can again be found quickly.

    3a. OK, you got me, they are totally different.

    I just don't understand why they are totally different. Say the B+ tree has 100 ids per page and the table has 20 million rows. Put the unique field first and every single id in the page accounts for 20 thousand rows, the next split would locate 64K bytes of the table which should easily account for 200 rows in most tables. In three lookups you found the record you are interested in. Switch it, and the first 99 should point to about 20K rows. It would take about 100 lookups to find the record of interest and that still doesn't explain to me why, in reality, it is a thousand times slower.

  • Please, please, please read that indexing article I referenced.

    It doesn't matter which way around the keys are, the index tree is the same depth (and SQL pages are always 8k in size) and it takes the same number of page traversals to find a row.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass

Viewing 15 posts - 31 through 45 (of 49 total)

You must be logged in to reply to this topic. Login to reply