B-Tree, Binary Tree and Indexes Architecture

24 ביולי 2008

תגיות: , ,
תגובה אחת


סטיב האוורד (Steve Howard ) הוא מומחה SQL. הוא אחד מאלה שכתב את הסדנה שהעברתי בישראל בשבוע שעבר בנושא ביצועי שרת SQL Server 2005.

בקורס דיברנו לא מעט על ארכיטקטורה של אינדקסים וכיצד הדבר משפיע על הביצועים.


למי שרוצה ממש לחפור בארכיטקטורה מוזמן לקרוא את הפוסט ולגלות ש B-Tree  ו   Binary Tree

הם שני דברים שונים לגמרי!!!!


B-Tree is NOT a binary tree


One of the most common mistakes I hear from students is the confusion of a binary tree with a B-tree or B+ tree. These are not the same. The B in B-Tree does NOT stand for "binary." In fact, nobody knows for sure what (if anything) the B in B-Tree stands for. I find it helpful to relate this little bit of trivia to students to help them make this distinction. Once they have it firmly in mind that a B-Tree is not a binary tree, I find it easier to address the misconceptions they have about its operations:

The B-tree was developed by a team consisting of Rudoph Bayer, and Ed McCreight in the early 1970s. Bayer was the leader on this team (that's one B). These two worked for the Boeing labs (a second B) and developed a data structure that had, as a characteristic, that it stayed balanced (that's a third B). Some other characteristics are that it is a very broad structure (a fourth B) as opposed to a  binary tree, or that it had many branches  from each node so it was bushy (a fifth B). But in the paper that Mr. Bayer et al wrote introducing this structure, they never defined for us what the ‘B' stood for, so nobody knows for certain if it has a meaning. (See http://en.wikipedia.org/wiki/B-tree and http://www.nist.gov/dads/HTML/btree.html).

By contrast, we know what a binary tree is, and the two are very different.

Binary tree

A binary tree is a tree structure where each node contains data, and contains exactly 2 pointers (thus the name "Binary." It is built top down, meaning that the first data value entered becomes the root node, and all subsequent values entered will be placed logically below this node. An evaluation is used on each record that is entered at each level until it finds its logical place. The two pointers on each node are logically called the "left" and the "right." Whether a new data value inserted goes to the left or the right of a node depends on whether it is greater than, or less than the value of the current node (equal values are resolved to either left or right – so long as they are consistent within the same tree).

To illustrate, a value of 10 is entered first, this becomes the root node. Next, a value of 9 is entered. Since 9 is less than 10, it goes to the left. No node exists on the left pointer of 10, so no evaluation is made. 9 then becomes the new node on the left of 10 (the root). Next, a value of 13 is entered. 13 is greater than 10, so this goes on the right pointer from 10. No node currently exists on the right node from 10, so 13 becomes the new node on the right of 10. Next, a value of 12 is inserted. This will start at the root. 12 is greater than 10, so it will go to the right pointer. On the right, it finds 13 which was inserted previously. 12 is less than 13 so it now goes to 13's left pointer. Since no node currently exists at 13's left pointer, 12 will become the new node on 13's left. And so the process continues. It is always built top-down in this manner.

The worst case scenario for a binary tree is sequential or reverse-sequential insertions. Consider this: The first value entered is 1, so 1 becomes the root node. Next, a 2 is inserted. 2 is greater than 1, so it goes to the right. Since there is no node there, 2 becomes the new node on the right of 1. Next a 3 is entered. It goes right from 1, then right from 2 to become the new node on the right of 2. Next a 4 is entered, so it goes to the right of 3, then 5 which goes to the right of 4, etc. In this sequential insertion scenario, the right of each node is built, but no new node ever goes left of any existing node.

For these types of sequential inserts, the binary tree effectively degrades to a simple linked list. In this case, if we wanted to find 1500, we would have to evaluate 1499 times whether to go left or right to find that value.

If people have a binary tree confused with a B or B+ tree, then this is almost certainly top of their mind on why they do not want to create clustered (or other) indexes on sequentially entered data.

There are variations of binary trees such as red-black trees (which I will not go into here) that are designed to rebalance binary trees as they are built, but these make it more expensive to build binary trees. Additionally, binary trees are rather narrow (each node on each level points to, at most, 2 nodes in the next level). Therefore, binary trees are not good choices for consistent, fast access on large datasets. Binary trees, because of their top-down construction, always have inherent challenges with staying balanced, and the farther out of balance a binary tree becomes, the less efficient it becomes for access.


Although our documentation always calls our index structures a "B-tree," if you look in most data structures textbooks, you will likely find that that our indexes don't exactly fit the B-Tree definition. In our version of a B-Tree, no data actually occurs on the branch nodes or branch pages. On our intermediate pages, there are only the index keys, and pointers to the nodes below them. In a data structures textbook, this is usually called a "B+ tree."

A B-Tree has the characteristic that data can appear in the intermediate nodes. This can reduce the "fan out" of the B-tree since not every entry in the intermediate nodes is a pointer to a lower node. This increases the cost of the "worst case" of an access in a B-Tree, and reduces the consistency of a data access. But B-Trees are built bottom-up like B+ trees, and that results in a B-Tree staying balanced.

B+ Tree (what our indexes really are)

A B+ tree is a modification of a B tree where all the data appears at the leaf level, and intermediate levels are only pointers to lower levels. This increases the "fan out" of the tree structure which minimizes the depth, and thus the number of reads to access the data in the worst case seek. All data in a B+ tree exists at the leaf level. This is the most common structure used to construct indexes in databases.

In a B+ tree, each node (which will be a page in SQL Server's application) can consist of any number of elements. In our application, this will be however many elements will fit onto an 8K page. This implementation also means that different pages can have different numbers of elements if we have variable length data types in our index. The root, and each intermediate level node contains only the keys for the index, and pointers to the next level. Our implementation also contains other meta-data, but the keys and the pointers are the minimal components of the branch nodes of a B+ tree. The leaf nodes contain the data we are after, which in our case will be the actual data in a clustered index, or the clustered index key or RID for a nonclustered index.

A B+ tree is built bottom up. So consider this sequence of events: A value of 1 is inserted. This becomes the first value in our first node in our B+ tree. Next a value of 2 is inserted. This becomes the second value in the root node. This continues until at some point, no room exists for another value to be inserted into this node. (I'll stick with the textbook definition of the operation first, then I'll show the optimization SQL Server has for this operation.) Let's say that our node can contain 69 entries. So when we attempt to insert 70, there is no more room in our root node. At this point, the root node must split.

In the textbook definition of this operation, a new node will be created. Half of the data will remain in the previously existing node, and about half of the data from our old root node will be moved to the new node. This means we now have 2 nodes existing on the level where our root node once existed alone. To reconnect these into one tree, we push up to a new level, and create a new node on this level. This node will become our new root node, and it will contain pointers to the two nodes on the level that was previously our root level. Now, on the new node that we created to make room for the new insert, we can insert our new value of 70.

If we continue to insert only to the right, then we will continue to split at the level that was once our root level, but all we will need to do at this point is add a pointer into our new root node that points to each new node as we split. But at some point, we will run out of room in this root to add new pointers to new leaf nodes.

When this happens, an insert of a new value causes a page split on the leaf level. We will then need to add a pointer to our root level. But since no room exists in the root node to add a pointer to this new node created by our page split, we must split the root page. At this point, half of the pointers will remain with the original root node, and half of the pointers will be moved to the new node on this level. This means we now have two nodes on the root level, and nothing connection them. To resolve this, we push up to another level, and create a new node on this level. This will be our new root node, and it will have pointers to the two nodes created by the page split of the previous root node.

So at this point, we have added a new level in the tree, but all the leaf nodes (which contain the data) remain on the same level – the same distance from the  new root. It really does not matter where we added the records into the leaf nodes of the tree structure (left, right, or somewhere in the middle), the bottom-up construction of the B+ tree ensures that the B+ tree stays in balance, and does not degrade into a singly linked list like a binary tree does. So a B or B+ tree never needs to be rebalanced due to insertions.

Deletions can be a different story, and can cause underfill issues that can necessitate re-balancing, but deletes are not normally as common as inserts in a database.

See http://en.wikipedia.org/wiki/B%2B_tree

Optimization for sequential updates

Consider our example in the last section. If our data is always inserted sequentially, then on every page split, the lower half of the values remain in the left node after our page split, and the higher values are transferred to the right node. Since our data is inserted sequentially, nothing will ever be inserted into the left node after the page split. This means that as we continue to add to the tree, we create, and leave half full nodes (pages) to the left. This results in a "half full" tree. Although this is much preferred to the singly linked list created by sequential inserts into a binary tree, this is still the worst case scenario for a B+ tree, and something we would like to avoid.

SQL Server has an optimization to handle this situation and make it more efficient. If you have a sequential index, then when a page split occurs, SQL will determine that the new value is to the right of the highest value in the existing page, and when a page split occurs, the only value that will be written into the new page is the new value that caused the page split (see the attached example to see this in operation for yourself). This means that the pages in the left on these page splits will be left completely full after the page split instead of half full per the textbook algorithm for a page split.

This turns the worst case scenario for a B or B+ tree into a best case scenario for the tree, but it also differs from the textbook definition of the operation is a couple of significant ways which your students may have in mind. The first way it differs is just in the page split itself – we no longer move half of the data into the new node. The second way it differs is because the textbook definition of a B or B+ tree normally specifies that no node other than the root node can be less than half full. In our optimization, the new node in a sequential page split is less than half full, but the tradeoff is that it leaves the old nodes completely full, and thus optimized to fit into buffer, and optimized for access with the fewest number of page reads.

Step through the attached demo in PageSplitOptimizationForSequentialInserts.sql to see this optimization in action.

Indexes on sequentially inserted columns does NOT create a hotspot

One more objection I hear from students is that creating a clustered (or other) index on sequentially entered data (such as on an identity column) creates a hot spot which hurts insert performance. Let's consider this.

By definition, a hot spot is created when one section of data must be locked by concurrent transactions, and one transaction must wait briefly for the other in order to proceed. This can create a chain of short lived blocks that hurt performance just by their volume. But if we have a table "taba", and it has a clustered index on "col1" which is an identity column, then when we insert into this table with concurrent transactions, the following occurs:


Transaction T1 begins

Transaction T2 begins

T1 inserts into taba

  • a. An IX lock is taken on taba

  • b. An IX lock is taken on the page where the new insert will occur

  • c. An X lock is taken on the key for the new row to be inserted

T2 inserts into taba before T1 commits

  • a. A new IX lock is taken on taba – no blocking and no waiting occurs here

  • b. A new IX lock is taken on the page where the new insert will occur – no waiting here

  • c. A new X lock is taken on the key for the new row to be inserted. This does not conflict with T1, and no blocking occurs.

Transaction T3 can begin and perform its insert into taba, and we will still have no blocking, and no hot spot. And since these operations are taking place in buffer, this is not causing extra disk hits either. We might have short waits for latches at the time when a page split occurs, but per our optimization for sequential inserts, we have minimized the number of page splits for sequential inserts, and have minimized the expense of each page split as well (no data is being copied to the new page).

If we had taken the lock on the page level, then we would have created a hot spot on the page. By taking the X lock on the key or row, we are able to perform these operations efficiently without conflict, and thus without creating a hot spot.

If we do not have a clustered index, the difference in locking is that we take a RID lock instead of a key lock. Again, we have not created a hot spot, but we have inefficiency due to the PFS scan to insert into a table with no clustered index.


So these are to address some of the misconceptions I hear from students, and where they lead to. I have also fielded a few questions along these lines recently from among us, so I thought it might be worth the time to organize this and put it together for those that would find it helpful. Hopefully, it equips someone to answer questions you will encounter in classes, or to understand it better yourself. Please see the attached SQL file for examples to give you a visual demonstration of the things explained here.



Itay Braun                       

Senior consultant – SQL Server and BI                       

E-Mail: itay@twingo.co.il

Mobile: +972-52-866-6527

Blog: http://blogs.microsoft.co.il/blogs/itaybraun/

Website: http://www.twingo.co.il/  

Veni              Vidi              Fixit



Itay Braun                       

Senior consultant – SQL Server and BI                       

E-Mail: itay@twingo.co.il

Mobile: +972-52-866-6527

Blog: http://blogs.microsoft.co.il/blogs/itaybraun/

Website: http://www.twingo.co.il/  

Veni              Vidi              Fixit

הוסף תגובה
facebook linkedin twitter email

כתיבת תגובה

האימייל לא יוצג באתר. שדות החובה מסומנים *

תגובה אחת

  1. Shaila26 בספטמבר 2008 ב 1:55

    Very good article..It helped me to understand index structure very well.