Presentation Transcript: Introducing SQL Server 2012 Columnstore Indexes

SQL11UPD02-TSCRIPT-01

This wiki is a transcript of a previously recorded video.

Slide: Title Slide

Hey everybody – this is Joe Sack and I’m a Principal Consultant with SQLskills and in this module we’re going to be going over columnstore indexing for SQL Server 2012.

If you get a chance, please check out the demo. I actually run through a scenario where I show you the before and the after, and I also show you partition switch methodologies - so how to update the data with the columnstore index.

Slide: Improved Data Warehouse Query Performance

So columnstore indexing offers significant improvements for data warehouse queries – so for example against a star schema with a typical fact table and dimension tables. If you’re performing aggregate queries against that kind of schema, you can see significant performance improvements, and this is really one of those features that bridges the engine and business intelligence together.

So for those folks who are only really interested in database engine feature improvements or those folks on the other side of the fence who are interested in business intelligence, this is definitely the joining of the two worlds. It’s not taking over the BI side – it’s very specific in its scenario of what it applies to – and we’ll talk a little bit more about that. But the key is that if you’re on one side or the other, both sides should pay attention to this one.

So you can get very good performance – in this case I’m showing 10 times to 100 times improvement in performance, but it really depends on your data distribution, the compression-friendliness of the data you’re hitting when you add the columnstore index, and then the other aspect is the construction of your queries. So we’ll talk a little bit about all of that.

Slide: What Happens When…

So stepping back for a second – SQL Server 2008, 2008 R2 – if you had a need for high performance queries against a data warehouse or a data mart relational database, and it had to be fast performance, you had a couple of options (and you still have these options today) which are to move that data to a cube, and so you could use SQL Server Analysis Services to query that data. And if you’re leveraging a lot of other features in SQL Server Analysis Services, then you might stick with that approach because like I mentioned earlier, that functionality might be more appropriate and you’re leveraging it.

But if you’re just moving over that data so you can get that high performance, this is something that you might want to consider as a good candidate for columnstore indexing.

The other scenario is a relational data warehouse. Let’s say you have the fact table and the dimension tables and you have specific query needs that have to run quickly – the other technique was to create intermediate tables, or aggregate tables, and then you would have your end users query those tables and you’d obviously have to keep those tables up-to-date. And then the problem is that if anything changed in the query – let’s say my tables had order quantity and sales quantity but didn’t have a specific variation on that column, or were missing a couple of columns from the fact table – well, you have to go back to the drawing board and basically add that to your process of generating the intermediate or aggregate result sets, instead of having another option that we’ll talk about next.

SQL Server 2012 has a much more flexible option for this: you can create a columnstore index on that fact table, and ideally you’d create it across any column that you anticipate the end user might use in their queries. And so you can create it across several columns, and we’ll talk about some of the limitations, or the rules around that, or best practices around that. But once you’ve created it, it’s recognized by the query optimizer.

So if you’re asking, let’s say, show me a total of sales by each department and you have a columnstore index on that fact table – in the scenario with the intermediate result sets and the aggregate tables, if you hadn’t accounted for that query, the answer wouldn’t be there. But if it’s part of the columnstore index covering the fact table, then the query optimizer can go ‘oh! I have this and I can use it out of the box’.

So it’s minimal refactoring – you don’t have to do a lot of work – the syntax, as you’ll see, for creating a columnstore index is very easy and it utilizes the database engine functionality. So for those of you that don’t want to be moving to different solutions but would rather be dealing natively with the database engine, this is an option to consider.

Now, the big aspect that you have to consider when you create that columnstore index is that that table becomes read-only, so you have extra considerations around that. You can use partitioning – you can still switch partitions in and out – but, for example, insert, update, delete, and merge – you can no longer do those while the columnstore index is on the table.

So we’ll talk a little bit about that. If you’re doing incremental loads, by the way, so let’s say you’re loading data every day, you know, you can drop that index, load the data, and recreate the index. So for those that have periodic data loads, this might not be an issue. The other thing is that if you are using the approach of moving things to Analysis Services or using the approach with intermediate tables, that window that you’re already using to load that data might be well above what it would just take to create the columnstore index on the fact table – so keep that in mind too that, yep, a lot of people hear the read-only aspect and go, whoa, that’s a limiting factor – keep in mind that it might not be as limiting as you thought.
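That drop/load/recreate pattern is easy to sketch in T-SQL. This is a minimal, illustrative sketch – the table, column, and index names (dbo.FactSales, staging.FactSalesDelta, NCI_FactSales_ColumnStore) are hypothetical, not from the presentation:

```sql
-- Hypothetical nightly load pattern for a fact table with a columnstore index.
-- Drop the columnstore index so the table becomes writable again.
DROP INDEX NCI_FactSales_ColumnStore ON dbo.FactSales;

-- Perform the incremental load.
INSERT INTO dbo.FactSales (DateKey, ProductKey, OrderQuantity, SalesAmount)
SELECT DateKey, ProductKey, OrderQuantity, SalesAmount
FROM staging.FactSalesDelta;

-- Recreate the columnstore index; the table returns to read-only.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCI_FactSales_ColumnStore
ON dbo.FactSales (DateKey, ProductKey, OrderQuantity, SalesAmount);
```

The rebuild step is where the load window matters – as noted above, for many periodic-load scenarios that rebuild fits comfortably inside the window you already have.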

Slide: How Are These Performance Gains Achieved?

So how do we get these performance gains? And again I recommend looking at the demonstration because you see a pronounced effect. In the demonstration I only used – and I say only, but I only used – 6 million rows, so that it would be a relatively quick demonstration. But just last weekend I did 123 million rows, and it could easily have been anything – 6 million rows, 123 million rows – the key is that this approach scales up to the billion-row scenario, depending on the data distribution.

So there are two complementary technologies that help provide this performance gain, and I would think of the base improvement as, first of all, the storage aspect. The data’s stored in a compressed column-based format. So imagine the original b-tree index structure where we have a data page – and I’ll have a visualization on this – but you have a data page, you have x number of rows on that data page, and those rows each have every column.

So let’s say you have 50 columns – those 50 columns are in that row on that page. The columnstore index, on the other hand – it’s one page, one column – and now imagine the compression-friendliness of that. So let’s say you have 1500 different rows from one column in your fact table – let’s say that’s OrderQuantity, and at any given point you might have OrderQuantity from 1 to 100 for a particular stock-keeping unit. Imagine over a billion rows how much of that could be compressed.

So one of the big benefits is that instead of storing by row, where you have every column in that row, a data page has just one column’s data. Ok, so that’s one aspect. And that data in turn is compressed, so you have improved buffer pool usage. So let’s say I run a query that basically says show me the average OrderQuantity by Department. I’m pulling, in that case, a couple of columns – I’m not pulling all the other columns with the data pages as I was before – I’m just pulling the data that I need into the cache. So you’ve got compression, and then you’re only pulling what you need.
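A query along those lines might look like the following – a minimal sketch, assuming a hypothetical dbo.FactSales fact table and dbo.DimDepartment dimension table (these names aren’t from the presentation):

```sql
-- Illustrative star-join aggregate: only the columns referenced here
-- (DepartmentKey, OrderQuantity, DepartmentName) need to be read into cache.
SELECT d.DepartmentName,
       AVG(f.OrderQuantity) AS AvgOrderQuantity
FROM dbo.FactSales AS f
JOIN dbo.DimDepartment AS d
    ON f.DepartmentKey = d.DepartmentKey
GROUP BY d.DepartmentName;
```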

The second aspect, and we’ll drill more into this part, is batch-mode execution. So on top of the benefits that you natively get from the columnstore index and the columnar storage in the data page, you now have this batch-mode execution which, instead of processing one row at a time, natively processes, for example, 1000-row batches. It’s a behind-the-scenes leveraging of modern CPU architecture, and if the query can use batch mode, it can significantly improve query performance on top of what you’re already getting through the compression of the column-based storage.

And then there’s a third way that you could potentially get a performance improvement: something called segment elimination. So let’s imagine a scenario where you have a date key, and that date key is pretty critical, but you usually only pull the last two months. Behind the scenes that data is segmented into chunks of a million or so rows, and if your predicate says where date key equals July, the storage engine can basically go, oh, July only applies to this segment – so it bypasses millions of other rows. So that’s an additional aspect.
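To make the idea concrete, here is a minimal sketch of the kind of predicate that can benefit; the table and column names (dbo.FactSales, DateKey, SalesAmount) and the integer date-key format are illustrative assumptions, not from the presentation:

```sql
-- A range predicate on the date key lets the storage engine skip whole
-- segments whose min/max metadata (visible in sys.column_store_segments)
-- falls entirely outside the requested range.
SELECT SUM(f.SalesAmount) AS TotalSales
FROM dbo.FactSales AS f
WHERE f.DateKey BETWEEN 20120601 AND 20120731;  -- e.g. the last two months
```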

One thing I definitely recommend, by the way, because this is not going to go into that much detail on how segment elimination works, is to check out the TechNet wiki/website. Eric Hanson and his colleagues have put a lot of information out there about columnstore indexing – so check it out – especially the FAQ. Be sure to look at that because there’s quite a bit of information about how to fully leverage the performance capabilities. So for example, batch mode: how do you increase the chance of batch mode being used for your query? And then also segment elimination: how can you benefit from segment elimination?

So check that out – TechNet wiki – just do ‘columnstore index technet wiki’ and you’ll find a load of content.

So this next screen’s just a little visualization… I’m showing the row store format, a heap or b-tree, and as you can see I’ve got ProductID, OrderDate, and Cost on a data page. So even though my query might not be pulling all three of those columns, those data pages get drawn into memory and I get it all whether I like it or not. Whereas with columnstore indexing, each column is broken out into its own set of pages, and on those pages it can be compressed – so you’re going to see either dictionary lookups or the actual values if you’re looking at the catalog views. But the key thing is that the compressibility of a specific column increases that much more with this columnstore storage. And then at the very end of the query, whatever columns you chose for the query get brought back together in one result set.

Now, let’s talk a little bit more about batch mode. So approximately 1000-row blocks make up a batch. There are specific operators that can run in batch mode, and again this is an additional benefit. And you can find out whether you’re running batch mode vs. row mode in the execution plan. So if I look at an aggregate – let’s say I’m aggregating against OrderQuantity and that fact table is covered by a columnstore index – I can look at the columnstore index reference in my execution plan, hover over it, look at the tooltip, and see whether the mode is row mode or batch mode.
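If you’d rather not hover in the graphical plan, you can also inspect the showplan XML directly. A minimal sketch – the table and column names are hypothetical, and the attribute to look for is the execution-mode attribute on the scan operator’s runtime information:

```sql
-- Capture the actual execution plan as XML alongside the results.
SET STATISTICS XML ON;

SELECT DepartmentKey, SUM(OrderQuantity) AS TotalQuantity
FROM dbo.FactSales
GROUP BY DepartmentKey;

SET STATISTICS XML OFF;
-- In the returned showplan XML, check the Columnstore Index Scan operator's
-- runtime information for ActualExecutionMode="Batch" (or "Row").
```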

A couple of aspects here: batch mode leverages CPU capabilities significantly. So as an example, let’s say something that used to take 400 instructions now takes 40. There are advanced algorithms that get applied in that scenario. The second thing is the segment elimination that I mentioned on the previous slide.

And just be aware though that batch mode isn’t something that will always run. So for example, when I was working on a demonstration and it had just one CPU enabled, nothing was running in batch mode so parallelism needed to be enabled. And if you’re memory constrained, that’s something that might impact the choice of whether or not you run in batch mode versus row mode.

The other thing to be aware of is that some operations aren’t enabled for batch mode. I gave a few examples here: joining on strings, for example – and if you have a string that you’re joining against from a fact table, you might want to reconsider that design choice anyhow. IN, EXISTS, scalar aggregates – and this is another plug for the TechNet wiki, because there are all kinds of use cases and scenarios described there in terms of how to make sure you’re using batch-friendly query construction.

Slide: Columnstore Format and Batch Mode Variations

In terms of benefits, breaking out the scenarios where you can see a performance increase: first, from columnstore indexing alone you’re going to get that basic benefit of having all the data for a specific column stored on its own set of pages – so that alone is a benefit from the storage side, and also from the fact that your buffer pool is better utilized by pulling in just what you need.

Second aspect is you could have columnstore indexing with the batch mode in the query processor so you can have a plan that’s using all of that. And then the third, and this is probably the most common scenario of the mixture, is that you’re going to see columnstore indexing plus a hybrid of the batch and traditional row mode.

So in a relatively complex query you might see – alright, if you go all the way to the far right of your query plan and hover over the operators – ok, this is using batch mode, batch mode, batch mode, oh, row mode, row mode. That’s typically what you’re going to see. If you don’t see any batch mode at all, try to see if there’s a way to re-factor your query so you can leverage it, because that can further improve the performance of your query.

Slide: Creating a Columnstore Index

And here’s a little bit about the syntax. I mentioned earlier that it’s very straightforward so nonclustered columnstore index – I’m doing just CREATE NONCLUSTERED COLUMNSTORE INDEX and by the way, nonclustered is the option – there isn’t a clustered option. So there’s no option to basically say, look, I’ve got a heap, replace that heap with just this one index – that heap or clustered index becomes the data source for that associated columnstore index. And that might change in the future, but that’s the choice that you have right now.

And then the list of columns – and we’ll talk a little bit about which columns you choose and how it differs from classic b-tree indexes.

The other screenshot here is just of SQL Server Management Studio – you can right-click the Indexes folder and choose a nonclustered columnstore index, so if you have several columns that you don’t want to hand-code, you can add them through the graphical interface and then script it out, for example.
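Putting the syntax together, a minimal sketch of the CREATE statement – the table, index, and column names here are illustrative assumptions, not from the presentation:

```sql
-- Create a nonclustered columnstore index covering the columns that
-- end users are likely to reference; order, uniqueness, and INCLUDE
-- don't apply the way they do for b-tree indexes.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCI_FactSales_ColumnStore
ON dbo.FactSales
(
    DateKey,
    ProductKey,
    DepartmentKey,
    OrderQuantity,
    SalesAmount
);
```

Once created, the underlying heap or clustered index remains the data source, and the table becomes read-only until the columnstore index is dropped or the data is switched in via partitioning.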

Slide: Good Candidates for Columnstore Indexing

Alright, so what are good candidates for a columnstore index? The very first thing is very large fact tables – if it’s a small table, don’t bother. It’s one of those things where you have to undo your thinking – if you’re used to smaller tables and assume they could benefit as well, no. We’re talking about billions of rows – I mentioned I did a 123 million row test and it was like I was throwing nothing at it. So it was great query performance – so a very large fact table, and even very large dimension tables, could be appropriate.

And if you have a question about what would work and what would not, keep in mind how simple it is to implement this index. It could be as easy as setting up a production-like environment, creating the indexes, and then finding out for yourself based on your standard workload – do I get a benefit from doing this, and is it worth the trade-off from a manageability perspective?

So the other thing is query candidates. In terms of queries, it’s not just which tables are good candidates but which queries in your workload would be columnstore-index friendly.

So the key thing here is it’s another reverse in your thinking – you are looking for scan-like queries. So these are queries that might scan an entire table – so show me a SUM across all rows in the fact table, for example.

Now, scan versus seek – if we’re talking about seek operations, b-trees are still the most optimal approach for that. Regular clustered and nonclustered (b-tree) indexes are more optimal than a columnstore index for seeks.

Queries that aggregate results from a large size down to a small size – so you might be going against 100 million rows but your query result set is maybe 1000 rows. That’s another good candidate to consider.

Joins to smaller dimension tables – so you have one central fact table that’s very large, you have smaller tables – great candidate.

Filtering on fact tables – so star-schema pattern – I already mentioned that.

A sub-set of columns – that’s a big one. So if you have 50 fact table columns, and you only want 10% of them – that’s a great candidate for columnstore indexing because now you’re just pulling from, let’s say it’s not already in cache, you’re pulling from disk just those columns that you need into cache.

So sub-set of columns and single-column joins. So if I’m doing a star schema and I’m doing joins – one join predicate per table and it’s not a string-related join, that’s another good candidate as well.

Slide: Defining the Columnstore Index

Alright – in terms of defining it – there are some other shifts from what you’re used to in indexing. There’s no concept of a unique columnstore index. The whole idea of key ordering – having to put the most unique column, or something you might use in a predicate, on the left edge of the index – forget about that for columnstore indexing.

The key here is that you want to list those columns from that, in this case, fact table – let’s just say it’s a very large fact table – you want to cover those columns that you think your end users might use in their queries – or the application developers might reference in their queries. If you think you might use them, go ahead and add them.

Now, there are data types that aren’t compliant, and we’ll talk a little bit about that on another slide. Other aspects like ascending or descending order, or INCLUDE columns – those don’t apply here. You do have the 1024-column limit. If you have a fact table with 1000-plus columns, again, that might be a design choice you want to think about and ask yourself – is that the appropriate decision?

Let’s see, what else? The 900-byte index key size limit doesn’t apply any more. And one other thing about ascending/descending: you don’t define that order – it’s handled behind the scenes by the compression algorithms specific to columnstore indexing.

Slide: Supported Data Types

So – data types. Basically, standard business data types are supported – character data types, decimal, numeric, integer, bigint, dates – those are all supported. As for what’s not supported, you’d probably not be surprised – XML, large object data types, hierarchyid – things you would probably not consider standard in a fact table are not supported at this time.

Slide: Maintaining Data in a Columnstore Index

In terms of maintaining this – I brought it up at the very beginning – once you add that columnstore index the table becomes read-only, but you still have options for modifying the data. The big one I would look at is partitioning – partition switch-in and switch-out. Assuming we’ve already partitioned that table, that’s a very viable and very quick approach.

You can also, if you’re doing incremental loads, drop that columnstore index, do your load, and add the columnstore index back in again.

And then there’s a UNION ALL technique – I’m downplaying this a little bit because you want to make sure that how you’re doing that UNION ALL isn’t affecting performance. So if you’re depending on, let’s say, a trickle table or a delta table, and you’re loading data into that delta table, then everybody’s querying a UNION ALL view over your columnstore-index-covered table and your trickle or delta table. Just be cognizant of the performance impact – take a look at the execution plan and see if batch mode is being used. Look at that scenario, but I would definitely recommend, if you can, looking at partitioning – switch-in/switch-out – as a viable option. Or if you’ve already been doing data loads to Analysis Services or intermediate loads to aggregate tables, take a look at that time window – it may very well be that just doing a drop/recreate of the columnstore index is well within your time boundaries for doing that modification.
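The partition switch-in pattern mentioned above can be sketched as follows. This is a hedged, illustrative sketch: it assumes dbo.FactSales is already partitioned, and that a hypothetical dbo.FactSales_Staging table matches the target partition exactly (same schema and filegroup, an appropriate check constraint, and a matching columnstore index) – the names and partition number are invented for illustration:

```sql
-- Load the staging table while it is still writable, then build its
-- columnstore index so it matches the partitioned target table.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCI_FactSales_Staging_ColumnStore
ON dbo.FactSales_Staging (DateKey, ProductKey, OrderQuantity, SalesAmount);

-- Switch the staged data in as the newest partition; this is a
-- metadata-only operation, so it is very fast.
ALTER TABLE dbo.FactSales_Staging
SWITCH TO dbo.FactSales PARTITION 12;  -- partition number is illustrative
```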

Slide: Limitations

Limitations – so, because you’re putting that table in a read-only state, you can imagine there are some limitations around that: Change Data Capture and Change Tracking aren’t supported. FILESTREAM can be on the base table but not defined in the columnstore index itself. Page, row, and vardecimal compression aren’t something you can additionally apply – columnstore indexing handles its own compression. Sparse columns, filtered statistics, filtered indexes – not applicable in this scenario. Replication of a table with a columnstore index isn’t supported either. And then I already mentioned some of the data type limitations – binary, varbinary – there’s a list of different data types that you can’t add.

And then let’s say you have that columnstore index on your table and you don’t want to use it for whatever reason – there are specific singleton queries you want to run, or a few individual rows you want to pull. There’s a hint available to you: IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX. You can add that hint and the query optimizer will not consider the columnstore index for that query.
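That hint goes in the query’s OPTION clause. A minimal sketch – the table, columns, and order number are illustrative assumptions:

```sql
-- For a singleton-style lookup where a b-tree seek would be preferable,
-- this hint tells the optimizer to skip the columnstore index.
SELECT OrderQuantity, SalesAmount
FROM dbo.FactSales
WHERE SalesOrderNumber = 'SO43659'
OPTION (IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX);
```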

Slide: Summary

So that was a summary of columnstore indexing for SQL Server 2012. I really hope you can check out the demonstration as well – it gives a brief overview of the before and the after impact. And also get some hands-on experience so you can really understand the behavior of creating it on a specific data warehouse database design.

And to recap: you can have performance improvements of 10 to 100 times the original performance. The real key is to make sure you can leverage all the benefits of columnstore indexing – so you get the benefit of compression on top of batch mode; that would be the ideal situation. And additionally, if you can get segment elimination, that’s another way to improve performance. The second aspect is that the relational data warehouse and OLAP methods you might be using might still be appropriate – so for example, if you’re using specific techniques that are working for you around SQL Server Analysis Services, Power View, or PowerPivot, then this does not replace that.

So columnstore indexing is for a specific area – maybe you’re going to those technologies only because you want the fast performance; then I would consider columnstore indexing. But if you’re using self-service BI or other functionality in Analysis Services and you’re happy with it, that’s still its own area and not something that gets replaced by columnstore indexing.

So I hope you get a chance to check out the demonstration and thanks for watching.
