I have a goal of writing a tech blog Entry a week. I am terrible at that goal. Even now with a better plan for when I will write I am still terrible at it. With the acknowledgement that I am terrible I am going to get this written even if I don’t have as much time as I would like to write it.
My day job is leading up the Platform Product line at Health Catalyst so I have a good feel for ETL, moving data, data warehousing, etc… This past weekend I was working on a side project that involved moving data to the cloud. I was working with Azure, Azure SQL specifically. I was needing to take a collection of CSV files that had days as columns and rows that reflected a specific entity and the value on that day for that entity populated. I wanted to take the values for each day and pivot the data so that it was row based instead of columnar. That meant for most files I was transforming 12 thousand records or so to 1 or 2 million. The biggest file turned into 62 million records (and a couple of Gigabytes).
I tried the following items
- PolyBase with Azure SQL
Surprise this doesn’t work/isn’t enabled. I was surprised by that as MSFT has gotten into a cadence where they generally ship new features into the Cloud first and then into their on-prem “box” products. PolyBase is a new feature that integrates HDFS distributed tables into SQL Server and enables them to be configured for querying. Since it is being included in SQL 2016 as I understand I figured it would show up in Azure SQL – no dice. Not available and no communication that I could find if it every would be. It is available to Azure SQL Data Warehouse, but that is a different product with a much higher cost.
- Azure Data Factory (ADF)
I have used SSIS for years and have a love/hate relationship with it. It does some things really well and a lot of things not well at all (version control, upgrades between SQL Server versions, etc…). It has been interesting to see that SSIS has not been ported to the cloud like other products, but that ADF which is also a platform for data movement has been built as a cloud first product seemingly in place of SSIS. I don’t have time in this post to go through my complaints, but I have thought for a long time that ADF is the answer to SSIS and that is definitely not the case. I was surprised by how ADF was built and the weird combination of scheduling integrated with ETL that it mixes. The strangest issue I found is that if I wanted to run a data pipeline in ADF just one time (not on a schedule) then I had to redefine it if I wanted to rerun it. Very strange. Would love to talk to the Program Managers for the product to see what direction they are going because it certainly didn’t match what I expected for an SSIS (but better) in the cloud – probably I am thinking about what it is trying to solve wrong.
- R with BCP
The solution I settled on that worked for me was using R to transform the data into the row centric format instead of the column centric format (a pivot like operation) and then taking the output files and loading them up through BCP. It took a little bit of time to produce the files, but seeing as how I had to learn a little R to figure out how to do it I was pleased with how quickly it went.