Thursday, September 6, 2012

How to deal with slow Lookup transformation in SSIS

Lookup transformation in SSIS is one of the widely used transformation while developing and ETL package. Many of SSIS developers have been complaining about how Lookup transformation is too slow and it takes forever to run. In this post I will look into how to make best use of Lookup and a SQL alternative to it when we have no other choice.

Tuning Lookup Transformation :

  1. Set the Cache Mode:



    • Full Cache: This is the default mode when you open your SSIS Package. It pulls the entire Dataset we are looking up against into memory during pre execute phase. Advantage of caching the data is the whole lookup operation is very quick but the catch is if you have big reference table, the data flow task fails. The memory gets full and it won't spill the data to disk. Use it when you have small reference tables or better keep the reference tables small (more about it below).

    • Partial Cache: This is the most interesting mode and is missed by most developers. This mode caches the data at the run time instead of pre execute phase. As a matching row comes it caches them so that it can be used later.
      There is also a way to setup size of Cache. As Cache becomes full it drops old rows and caches the new data. Partial cache mode is usually slower than Full cache as it hits the database more often but it is very useful in some scenarios viz. when your reference table is updated at same time you are doing lookup or when your reference table is large and you can't afford full cache to fill the memory.


    • No Cache: This is the least used mode. In most cases Partial cache should be given a look before moving on to No Cache. No cache mode doesn't cache any data and repeatedly hits database for every row. This should be used when you have a few records in reference table and it is indexed properly. 

  2. Use SQL Query while setting up Connection: Usually the lookup is to be performed on maximum 4 columns but most developers make a complete table as a reference table using "Use a table or a view". Instead of that try "Use results of an SQL Query" . Notice the use of article an before SQL. :) Using SQL query will select only few columns and keep the reference table light and in turn keep more memory free and your execution quick.

Alternative to Lookup Transformation :
  1. Using the Database engine with Staging Table and SQL: Sometimes transformations refuse to speed up. In such scenarios you can always shift the load to Database Engine by using Execute SQL task.

    • Use staging table and Execute SQL Task: Insert the new data into Staging/Intermediate Table using Data Flow task and then use Execute SQL task to transform it as you like using traditional Joins and SQL queries. You have to understand that Lookup is just another way of joining the data, just that it doesn't put pressure on Database engine and is optimized on its own. 

  2. Using a custom Script Component: You can write your custom C# script which reproduces the same cache effect of Lookup. A reference can be found here - MSDN Blogs.

Lookup like many transformations has been introduced through SSIS to make our life easier, at the same time it can be big issue if not used properly. You can leave comments if you are facing some problems in using lookup in your scenario. I would be happy to help.


3 comments:

  1. Interesting Point, very useful thanks

    ReplyDelete
  2. By using Lookup transformation for 7 million of data it make performance extremely slow ,do we have any alternative for that without using staging table and Execute sql commnad if possible.

    ReplyDelete