One common practice in data warehousing is to load the source data, then hash the values so that differencing can be applied when loading to the final tables. Many times, this process is done by using the SQL Server Hashbytes function in an update statement after the data has been initially loaded. However, there are times when it is useful to have the hash calculated in the processing stream in SSIS.
In order to solve this problem, here are several different approaches that could be taken:
- Write a custom SSIS Component
- Find a third party component (SSIS Multiple Hash is a great one)
- Use a script component and custom code
Which solution is most fitting really depends on the scenario. For example, if more than one hash value needs to be generated (for instance, a partial versus a full hash), the Multiple Hash component linked above would be a great fit. However, if all that is needed is a single hash for each row that can easily be configured for any combination of columns, the script component and custom code will be the simplest solution.
Here is how to configure the script component solution. I used some of the code from this great post on SQLServerCentral.com about setting up incremental processes in SSIS to build my own. This solution is built using SSIS 2008 R2, but could be adapted back for SSIS 2005 or forward for SQL Server 2012 with some minor tweaks.
1. Add a script component to the data flow at a place in the stream where all columns that need to be hashed are present in the metadata. Choose “Transformation” in the “Select Script Component Type” window.
2. Double click the script task or right click and choose “Edit.” to get to the configuration window.
3. With “Script” selected in the left pane, choose “Microsoft Visual C# 2008” as the ScriptLanguage.
4. Select “Input Columns” on the left pane and select each column to be included in the hash. As a note, I generally do not include the key columns in the hash values.
5. Click on the “Inputs and Outputs” entry in the left pane, expand “Output 0”, click on “Output Columns” and click the “Add Column” button. Give the new column a name (preferably the same one as the hash column in the database). Change the DataType to string and set the length to 40.
6. Return to the “Script” section in the left pane and click the “Edit Script…” button.
7. Add the following to the references section at the top of the code:
using Microsoft.SqlServer.Dts.Pipeline;
using System.Text;
using System.Security.Cryptography;
8. Remove the three existing public override void methods that are in the code to start with an empty class.
9. Add the following code inside the class to create a copy of the PipelineBuffer that can be used in the hashing methods.
private PipelineBuffer inputBuffer;
public override void ProcessInput(int InputID, PipelineBuffer Buffer)
{
inputBuffer = Buffer;
base.ProcessInput(InputID, Buffer);
}
10. Next, add in the code to process each row. It will loop through each of the input columns and build one big string. That string can then be passed to the hashing method and the results sent to the output.
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
var counter = 0;
var values = new StringBuilder();
//loop through input columns
for (counter = 0; counter < inputBuffer.ColumnCount; counter++)
{
object value;
value = inputBuffer[counter];
//add each column value to one big string
values.Append(value);
}
//set output column as results of hash method
Row.HashColumn = CreateHash(values.ToString());
base.Input0_ProcessInputRow(Row);
}
11. Add the code to create the hash value from the long string created in step 10.
private string CreateHash(string data)
{
//get byte array of long data string
var dataToHash = (new UnicodeEncoding()).GetBytes(data);
//create hash provider and compute hash of byte array
var sha1 = new SHA1CryptoServiceProvider();
var hashedData = sha1.ComputeHash(dataToHash);
RNGCryptoServiceProvider.Create().GetBytes(dataToHash);
//convert results to hexadecimal string (SQL friendly format)
var result = BitConverter.ToString(hashedData).Replace(“-”, “”);
return result;
}
12. Final code for the class should look like this:
public class ScriptMain : UserComponent
{
private PipelineBuffer inputBuffer;
public override void ProcessInput(int InputID, PipelineBuffer Buffer)
{
inputBuffer = Buffer;
base.ProcessInput(InputID, Buffer);
}
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
var counter = 0;
var values = new StringBuilder();
//loop through input columns
for (counter = 0; counter < inputBuffer.ColumnCount; counter++)
{
object value;
value = inputBuffer[counter];
//add each column value to one big string
values.Append(value);
}
//set output column as results of hash method
Row.HashColumn = CreateHash(values.ToString());
base.Input0_ProcessInputRow(Row);
}
private string CreateHash(string data)
{
//get byte array of long data string
var dataToHash = (new UnicodeEncoding()).GetBytes(data);
//create hash provider and compute hash of byte array
var sha1 = new SHA1CryptoServiceProvider();
var hashedData = sha1.ComputeHash(dataToHash);
RNGCryptoServiceProvider.Create().GetBytes(dataToHash);
//convert results to hexadecimal string (SQL friendly format)
var result = BitConverter.ToString(hashedData).Replace(“-”, “”);
return result;
}
}
13. Save and close the script window
14. Connect the script component to the next downstream component or destination and use the hashed script output column as necessary
As a final note, performance using the script component will be slightly slower in comparison with using an update with the built in hashbytes function, so this method should only be used when the hash values are needed in the processing stream or when landing the data first and updating with hash values is not an optimal solution.



