December 2009 - Posts
The 2010 Microsoft BI conference will be held on June 2010, and will be part of the 2010 TechEd North America, according to this blog post. I’m not sure I’ll be attending this conference, but definitely be on the lookout for a more detailed agenda to see if there is added value in me making the cross-Atlantic trip. With so many new features in the Microsoft BI stack for 2009 and 2010 ( a series of PowerPivot blog posts is coming up on this blog in the coming weeks, for example), there’s enough new material to cover and showcase in the conference.
Happy New Year, Everybody!
-Yaniv
(This is post number 6 for week number 5 of my SQL blog challenge)
Technorati Tags:
Microsoft,
BI
When you work with large tables in SQL Server 2005 and 2008, you will inevitably start using partitioned tables in order to ease the maintenance of these tables, and improve query performance. I recently wanted to merge several partitions at the tail of an existing partition function, and I wanted to understand whether I need to merge the partitions in a sequential order from the last partition and work my way up the chain of partitions, or would it be better to start from the newest partition (that I can merge) and work my way down the chain of partitions. The following diagram probably explains this better:
I figured this has something to do with the way we have defined the partition function boundary as being a LEFT boundary or a RIGHT boundary, but I wanted to be sure, so I performed the following test. First, I created a new table, “FactInternetSales_Partitioned”, based on the “FactInternetSales” table from the AdventureWorksDW sample database. Next I created a partition function and a partition scheme which defines the boundary as a LEFT boundary:
I then created a clustered index on the “FactInternetSales_Partitioned” table, on the new partition scheme:

And then I started to merge partitions, first from 20010701, going up…
For each merge operation, I could tell by looking at the statistics information, how the logical reads increase by the size of the growing partition being moved:
Now, performing the merge operation from 20010705, going back in time, resulted in a different IO pattern, and you can see that the statistics information shows that each merge operation only moves data from a single partition:
So, there you have it: if you’re using a LEFT bounded partition function, you’d do better merging partitions from “new” to “old” and not the other way around. I also checked it with the RIGHT bounded partition function and, as expected, the results were the exact opposite: merging partitions from “old” to “new” incurred the least amount of IO operations.
-Yaniv
(This is post number 5 for week number 5 of my SQL blog challenge)
I’m using SSIS for several years now, and I want to shortly cover an SSIS feature which last week help me a lot: using checkpoints in order to recover failed packages. I’m currently developing an ETL solution which requires the import and export of data in and out of a production system on a daily basis, and it is very important to ensure the process only touches the system when required and does not impact on the system performance (or, should I say, impact the least).
What I actually need is to get the daily content of several tables out to text files and I want to be sure that if any of the data flow tasks fail for any reason, I will only re-run this particular data flow and not the other tasks, which may have completed successfully. In comes the checkpoint file.
SSIS can use a checkpoint file in order to save package runtime information (this includes variables values, where the package actually failed etc.) and when the package is re-run after failure, it will start the package from the failed object/task, thus avoiding the work already done and completed successfully. You can configure the checkpoint file location in the package properties tab. you will also need to configure the “checkpoint usage” property where you have three possible settings to choose from: Never: which means the package does not use checkpoints and will restart from the beginning of the workflow, Always: which means the package will always try to restart from the previous failed task, and if the checkpoint file does not exist, it will fail, and IfExists, which tells the SSIS engine to start the package from the previously failed task only if a checkpoint file exists. If not – the package will restart from the beginning of the workflow. You will also need to set the “Save Checkpoints” property to “True” in order to enable the usage of checkpoint files.
And now, for the gotchas…
- Checkpoints are only applicable for the control flow tasks. Within a data flow task, you can not use checkpoints, and SSIS will always start the data flow execution from the beginning.
- The atomic unit of work that can be restarted is any task host container (such as the “execute SQL” task or the “For Loop” task). You can not restart a package from a failed “For Each” task: restarting a package from this task will re-run it from scratch. So, for example, if you’re looping through files in a directory and the package failed in the middle of looping, next time the package runs, it will loop through the files from the beginning. Remember that when you’re configuring checkpoints.
- Package configurations are not re-loaded when a package restarts, It uses the previously loaded configurations, which were saved in the checkpoint files.
- And, last but not least, for each container you want to define as a “restart” point, you need to set the “FailPackageOnFailure” property to “True”. This will ensure SSIS considers this container as a “restarting” point candidate.
Using checkpoints is a slight compromise for me in the above mentioned scenario. What I was really looking for and couldn’t find a solution just yet, is how to recover from a failed data flow task. So, when I’m exporting data to a text file and the task fails, I need a way to recover from that point, so I will not have to export the entire data set again, rather only the data I still need. Some sort of data auditing capabilities, where I can grab and keep the primary key of the data that flows in the data flow would really help. I hope Microsoft is thinking about this as well… Maybe in SQL Server 2011…
-Yaniv
(This is post number 4 for week number 4 of my SQL blog challenge)
Technorati Tags:
SSIS,
Checkpoints
I’ve been trying the new Report Builder 3.0, which ships with the SQL Server 2008 R2 November CTP and among the various new features, I want to write a bit about the Shared Data Sets feature. Up until this version, report authors could create a data source, and when deploying the report could make the decision whether the data source for this report is shared among other users/authors or be linked only to the individual report they have created. If the report author decided to share the data source, new reports could be based on the shared data source, avoiding duplicate server objects and, even better, avoiding unnecessary work.
In Report Builder 3.0, the Shared Data Sets feature is introduced, where report authors can not only share data sources, but also share specific data sets, based on data sources. Sharing a data set can be used if you want to create several reports which share fairly similar data requirements, but perhaps, different design and parameters requirements.
But putting the reuse of data objects aside, one could even achieve better performing reports, by using the caching options for Shared Data Sets. So say, for example that a report is run at 09:00 in the morning, and a second, different report which uses the same shared data set is run at 10:00, the lucky user who is running the report at 10:00 will get to see the report faster, because the data set is already cached. You can also choose to schedule a cache refresh just before the day begins so every user will benefit from the caching option.
I decided to put this feature to the test, to see if and how this caching mechanism works.
To start, I’ll launch the Report Builder 3.0 from the report manager web page. As you can see, I’ve already set up a shared data source, pointing to the Adventure Works DW 2008 database.
Upon launch, we see a new welcome screen which prompts us either to create a new report or a new shared data source. I’ll choose the shared data source option.
For simplicity sake, I’ll choose the DimProduct table as the source for our data set.
Next, I’ll save the newly created data set to the report server. I’ll call it: DimProductSharedDS.
Looking back at the Report Manager webpage, we’ll be able to see our new shared data set. Clicking on the data source will bring us to the data set properties. We’re interested in the “caching” tab:
I’ll be choosing the “Cache Shared Dataset” option, with an expiry of 30 minutes. you can see that there is an additional cache related tab: “Cache Refresh Options”. On this tab you can define a scheduled cache plan, so the data set will be refreshed every morning, for example, instead of letting the first user in the morning perform the caching implicitly by running a related report. We’ll skip this option for now.
Next, I’ll create two different reports, one displaying product name and size, and the other displaying product name and weight, both use the shared data set I’ve just created and deployed. Running the first report and tracing SQL activity on Profiler, we can see the following SQL statement being issued:
The Report Server issued a SELECT statement which corresponds to the shared data set definition (i.e. all the columns defined in the data set, even though the report itself consumes only 2 of them).
Running the second report, I got to see no activity on Profiler. The Report Server used its cached data set to produce the second report, and we have just saved a trip to our operational/DWH server and hopefully saved us a bit of report execution time.
I then disabled the caching option and re-ran the two reports, and the results were as expected: on both reports I got to see the Report Server issuing the same SELECT query to the AdventureWorksDW database.
To conclude, this new feature looks like a real boon for both developers, who can save time on development of report objects, and DBAs who will get to see their operational databases being accessed less.
-Yaniv
(This is post number 3 for week number 3 of my SQL blog challenge)
Technorati Tags:
SSRS,
Report Builder I thought I’d write about this topic, even though it’s a bit of an old news item, because I keep getting questions about this when I visit clients. As part of the SQL Server 2005 Reporting Services product, many of us were really enthusiastic about the Report Builder tool. This tool enabled the organisation's power users to create their own custom-made reports, using a pre-defined Report Model. This Report Model was simply a logical “map” of specific database fields that the administrator wrapped inside this model, thus enabling the users to simply drag and drop them to their self-built reports. The users could then deploy their self-made reports onto the Reports Portal for everybody to use.
As part of the SQL Server 2008 Feature Pack, Microsoft released the Report Builder 2.0. Now, this stand-alone tool was a big enhancement to the previous version. It is basically a standalone report authoring tool, where the users can create their own reports, using an office-like interface. When I first used this tool, the first thing I noticed was that I didn’t have to use a Report Model in order to create a report. I could simply create my own data sources and my own queries and design and deploy the report, as if I was using the BIDS platform.
So, at this stage I thought: OK, this is a good option for database-oriented users, but what about the users that I want to restrict their access to the database and only provide them with a pre-defined set of database fields? I double-checked BOL, and it turns out you can either provide a Report Model or not, either way it’s fine. For the database-savvy and trustworthy users you can, as an administrator, grant access to the tables and let them create their own reports from scratch, including setting up the data sources and queries. For the users whom you want to limit access to the database, simply configure a data model for them to use. Both ways, the users will have to adjust to the new interface, which provides them with much more design capabilities than the Report Builder 1.0 tool. In my opinion, it’s as intuitive but gives so much more power to the user.
One more thing to note is that you can now launch the Report Builder tool from either the “Start-Programs” menu as a standalone application, or you can still (as in SSRS 2005) launch it from the Report Manager webpage. However, after you download and install the SQL Server 2008 feature pack and get the latest Report Builder 2.0 tool, when you try to launch the Report Builder from the Report Manager, you will still get the Report Builder 1.0 running, not the new version you have just installed. Check here, in BOL, to find out how to set the Report Builder 2.0 as the default ClickOnce option on the Report Manager page.
(This is post number 2 for week number 2 of my SQL blog challenge)
Technorati Tags:
SSIS,
Data Flow Task I came across a BOL topic discussing SSIS data flow optimization techniques. For the OLE-DB source component there is a recommendation to use a SELECT statement instead of the "Table or View" access mode. For the sake of clarity, when you use the "Table or View" access method, you get to choose a specific table or view from the database you have designated as your source database, in a drop-down list. When you use the "SQL Command" access method, you can simply type in the query editor any SELECT statement you like.
The obvious advantage of using the "SQL Command" access method is that you can choose which columns to get and which not, thus reducing unnecessary data load on the ETL process. When you choose a specific table in the "Table or View" access mode, all columns from that table will be retrieved so in some, if not most cases this is a thing you would want to avoid. Just to clarify this point, even if you choose the "Table or View" access mode it does not mean that all columns of that table should pass through the data pipeline. You can still remove specific columns, simply by un-ticking them on the "Columns" pane of the OLE DB Source Editor. However, the source still has to consider those columns when it retrieves the metadata for the table. Using the "SQL Command" access mode can remove this overhead. Another benefit of using the "SQL Command" access method is that when the table schema changes, if the columns modified are not included in the SQL statement, then no metadata changes are detected by the OLE DB Source adapter, thus eliminating the need to refresh it.
However, the BOL article states that the "SQL Command" simply "performs better", regardless of the number of columns you choose to retrieve. I decided to check that.
I created a destination table based on the "FactInternetSales" table in the "AdventureWorksDW2008" sample database and named it "FactInternetSales_Destination".
CREATE TABLE [dbo].[FactInternetSalesDestination](
[ProductKey] [int] NOT NULL,
[OrderDateKey] [int] NOT NULL,
[DueDateKey] [int] NOT NULL,
[ShipDateKey] [int] NOT NULL,
[CustomerKey] [int] NOT NULL,
[PromotionKey] [int] NOT NULL,
[CurrencyKey] [int] NOT NULL,
[SalesTerritoryKey] [int] NOT NULL,
[SalesOrderNumber] [nvarchar](20) NOT NULL,
[SalesOrderLineNumber] [tinyint] NOT NULL,
[RevisionNumber] [tinyint] NOT NULL,
[OrderQuantity] [smallint] NOT NULL,
[UnitPrice] [money] NOT NULL,
[ExtendedAmount] [money] NOT NULL,
[UnitPriceDiscountPct] [float] NOT NULL,
[DiscountAmount] [float] NOT NULL,
[ProductStandardCost] [money] NOT NULL,
[TotalProductCost] [money] NOT NULL,
[SalesAmount] [money] NOT NULL,
[TaxAmt] [money] NOT NULL,
[Freight] [money] NOT NULL,
[CarrierTrackingNumber] [nvarchar](25) NULL,
[CustomerPONumber] [nvarchar](25) NULL,
CONSTRAINT [PK_FactInternetSalesDestination_SalesOrderNumber_SalesOrderLineNumber] PRIMARY KEY CLUSTERED
(
[SalesOrderNumber] ASC,
[SalesOrderLineNumber] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
Next thing I created a new SSIS package, added an EXECUTE SQL task to clear the procedure cache and data buffers:
I then added another EXECUTE SQL task to truncate the destination table:
Next, I added a data flow task that will retrieve data from the source table and place it on the destination table. I've added no transformations at all to the data flow and I haven't modified any of the data flow default settings.
I executed the package 3 times in each access mode: the "Table or View" access mode and the "SQL Command" access mode, and the results of the test can be seen here(in seconds):
As you can see, execution time of the data flow task is much the same, regardless of the access mode chosen.
Next thing I wanted to check was the Profiler. I removed the "Destination" component in the data flow task, because I didn't want to capture the "Insert Bulk" statements, just the data retrieval statement.
Using the "Table or View" access mode, the query to get the data was:
Using the "SQL Command" access, the query to get the data was:
So, here lies the difference: when using the "Table or View" access mode, SSIS issues an ad-hoc SELECT statement on the source table. When using the "SQL Command" access method, it uses an sp_prepare and an sp_execute statements instead. Now, I know that using the sp_prepare statement can benefit performance because the execution plan fits the sent query. I'm also guessing that the fact that in my tests I didn't get to see any performance differences between the two methods is due to the fact that I'm running these tests on my small, isolated laptop and not in a busy server environment.
So, to conclude: do use the "SQL Command" access method, it's better for performance and it's better for manageability and better coding practices.
-Yaniv
(This is post number 1 for week number 1 of my SQL blog challenge)