September 10, 2021
This post walks through deploying a Databricks cluster on Azure, together with its supporting infrastructure, using Terraform. By the end of it you will have all the components required to complete the "Tutorial: Extract, transform, and load data by using Azure Databricks" guide on the Microsoft website.
Once we have built out the infrastructure, I will lay out the steps that the tutorial goes through and the results of each.
I will build out the Terraform code file by file. Below are the files you will end up with:
The diagram below outlines the high-level components that will be deployed in this post.
A key file here is providers.tf. This is where we declare the providers we require and any configuration parameters that we need to pass to them.
As can be seen here, we are setting the azurerm provider's features block to be empty, and telling the databricks provider where to find the ID of the azurerm_databricks_workspace resource.
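For reference, a minimal sketch of a providers.tf along these lines is shown here. The resource label "this" is an assumption made for illustration; use whatever label your azurerm_databricks_workspace resource actually has.

```hcl
# providers.tf (illustrative sketch)

provider "azurerm" {
  # The azurerm provider requires a features block, even when it is empty.
  features {}
}

provider "databricks" {
  # Points the provider at the workspace created elsewhere in this
  # configuration. The label "this" is an assumption.
  azure_workspace_resource_id = azurerm_databricks_workspace.this.id
}
```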
Another important file in modern Terraform is versions.tf. This is where we tell Terraform which version of Terraform it must use, as well as any constraints we have on providers. For example, we may want to pin a provider version, or we may be hosting a version or fork of a provider somewhere other than the default Terraform registry.
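A minimal sketch of what such a versions.tf might look like follows. The version constraints are illustrative assumptions, not the ones this post was written against. The Databricks provider is currently published as databricks/databricks; releases from around the time of this post lived under databrickslabs/databricks, so adjust the source to match the version you pin.

```hcl
# versions.tf (illustrative sketch; all version constraints are assumptions)

terraform {
  required_version = ">= 1.0.0"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
    databricks = {
      # The source address is also where a fork or privately hosted
      # provider would be referenced.
      source  = "databricks/databricks"
      version = "~> 1.0"
    }
  }
}
```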
Now that we have our providers and any constraints on our Terraform environment set up, it is time to move on to the information we pass into our code. This is done in variables.tf.
Above we have declared four variables, some of which are required and some of which are optional. I find it good practice to mark a variable as optional or required in its description. Two of the variables have validation rules to ensure that what is passed into our code is what Azure expects. Doing this surfaces errors at plan time rather than partway through a deployment.
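As a sketch of the pattern, a few such declarations might look like the following. The names project and environment match the -var flags used when deploying; the location variable, its default, and the specific validation rules are assumptions made for illustration.

```hcl
# variables.tf (illustrative sketch; validation rules are assumptions)

variable "project" {
  type        = string
  description = "(Required) Short project name used when naming resources."

  validation {
    # Keep the value short and lowercase so it is safe inside resource names.
    condition     = can(regex("^[a-z0-9]{2,10}$", var.project))
    error_message = "The project value must be 2 to 10 lowercase letters or digits."
  }
}

variable "environment" {
  type        = string
  description = "(Required) Deployment environment, for example dev."

  validation {
    condition     = contains(["dev", "test", "prod"], var.environment)
    error_message = "The environment value must be one of: dev, test, prod."
  }
}

variable "location" {
  type        = string
  description = "(Optional) Azure region to deploy into."
  default     = "uksouth" # hypothetical default
}
```

A variable with no default is required, so Terraform will prompt for it or fail if it is not supplied; giving a variable a default is what makes it optional.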
main.tf is where anything foundational usually lives. On Azure the typical example is an azurerm_resource_group, as it is likely to be consumed by most of the other code. Splitting the code into different files helps developers understand more easily what is going on in the codebase.
In our main.tf, as can be seen below, we are declaring a few things.
As can be seen above, we use format() calls to name our resources consistently. This gives engineers more confidence in the naming of resources, and if a standard is kept to across the platform it lets them orient themselves more easily within any environment on it.
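A minimal sketch of this naming pattern follows. The prefixes (rg, dbw), the resource labels, the location variable, and the sku value are assumptions for illustration rather than a prescribed convention.

```hcl
# main.tf (illustrative sketch of format()-based naming)

resource "azurerm_resource_group" "this" {
  # With project=meow and environment=dev this evaluates to "rg-meow-dev".
  name     = format("rg-%s-%s", var.project, var.environment)
  location = var.location
}

resource "azurerm_databricks_workspace" "this" {
  name                = format("dbw-%s-%s", var.project, var.environment)
  resource_group_name = azurerm_resource_group.this.name
  location            = azurerm_resource_group.this.location
  sku                 = "standard" # assumption; pick the tier you need
}
```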
There are more advanced ways to deal with this, such as creating a naming provider or naming module. In my opinion, the main advantage of user-friendly and consistent resource names is for engineers who are consuming Azure from the portal or CLI.
Now that we have the code ready for our Databricks workspace, we need to create the network resources, since we reference them in our main.tf file. The types of resources that we are creating and their purposes are as follows:
The network sizes declared above are rather large and ideally should not be used in a production environment. They are perfectly fine for development or proof-of-concept work, as long as the network is never peered or connected to an ExpressRoute circuit where it might conflict with internal ranges.
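To make the shape of these network resources concrete, here is a partial sketch of a virtual network and one subnet delegated to Databricks. The address ranges, names, and resource labels are assumptions, and it assumes a resource group labelled azurerm_resource_group.this is declared elsewhere. A full VNet-injected workspace also needs a second (private) subnet and network security group associations, which are omitted here.

```hcl
# network.tf (partial, illustrative sketch; address ranges are assumptions)

resource "azurerm_virtual_network" "this" {
  name                = format("vnet-%s-%s", var.project, var.environment)
  resource_group_name = azurerm_resource_group.this.name
  location            = azurerm_resource_group.this.location
  address_space       = ["10.0.0.0/16"]
}

resource "azurerm_subnet" "public" {
  name                 = format("snet-%s-%s-public", var.project, var.environment)
  resource_group_name  = azurerm_resource_group.this.name
  virtual_network_name = azurerm_virtual_network.this.name
  address_prefixes     = ["10.0.0.0/20"]

  # Delegating the subnet lets the Databricks service manage it.
  delegation {
    name = "databricks"

    service_delegation {
      name = "Microsoft.Databricks/workspaces"
      actions = [
        "Microsoft.Network/virtualNetworks/subnets/join/action",
        "Microsoft.Network/virtualNetworks/subnets/prepareNetworkPolicies/action",
        "Microsoft.Network/virtualNetworks/subnets/unprepareNetworkPolicies/action",
      ]
    }
  }
}
```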
It is very important to note the following things:
Before we look at creating the internals of our Databricks instance or our Azure Synapse database, we must first create the Azure Active Directory (AAD) application that will be used for authentication.
Above we are creating resources with the following properties:
Now that we have our credentials ready to go, we can set up the Synapse instance, as well as any ancillary resources that we might need.
As per usual we will go through each of the resources being created and explain what they do.
Finally, from a resource creation perspective, we need to set up the internals of the Databricks instance. This mostly entails creating a single-node Databricks cluster where data engineers can create notebooks and other workspace objects.
We will actually create a notebook later and perform some operations on it.
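For reference, a single-node cluster in the Databricks provider generally looks like the sketch below. The resource labels, cluster name format, and auto-termination value are assumptions; the spark_conf and custom_tags entries are the settings the provider documentation uses to mark a cluster as single node.

```hcl
# databricks.tf (illustrative sketch of a single-node cluster)

# Look up a small node type and a long-term-support runtime rather than
# hard-coding values that change over time.
data "databricks_node_type" "smallest" {
  local_disk = true
}

data "databricks_spark_version" "latest_lts" {
  long_term_support = true
}

resource "databricks_cluster" "single_node" {
  cluster_name            = format("dbc-%s-%s", var.project, var.environment)
  spark_version           = data.databricks_spark_version.latest_lts.id
  node_type_id            = data.databricks_node_type.smallest.id
  autotermination_minutes = 20 # assumption; shuts the cluster down when idle
  num_workers             = 0  # no workers: the driver does all the work

  spark_conf = {
    "spark.databricks.cluster.profile" = "singleNode"
    "spark.master"                     = "local[*]"
  }

  custom_tags = {
    "ResourceClass" = "SingleNode"
  }
}
```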
The last thing we need to write in Terraform is our outputs.tf. This holds the information we want returned to us once the deployment of all the previous code is complete.
In this we are simply outputting information such as our Service Principal details, information about our storage account and Synapse instance and how to authenticate to them.
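A sketch of a few such outputs is shown here. The resource labels are assumptions; the point to note is that any secret should be marked sensitive so that Terraform redacts it in plan and apply output.

```hcl
# outputs.tf (illustrative sketch; resource labels are assumptions)

output "databricks_workspace_url" {
  description = "URL of the Databricks workspace."
  value       = azurerm_databricks_workspace.this.workspace_url
}

output "storage_account_name" {
  description = "Name of the storage account used by the tutorial."
  value       = azurerm_storage_account.this.name
}

output "storage_account_access_key" {
  description = "Access key for the storage account."
  value       = azurerm_storage_account.this.primary_access_key
  sensitive   = true # redacted in CLI output
}
```

A sensitive value can still be read on demand with terraform output -raw storage_account_access_key.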
Now that we have all the pieces in place, we can deploy them. This assumes that the files are all in your current working directory and that you have Terraform installed. Run terraform init first so that the required providers are downloaded, then plan and apply:
terraform init
terraform plan -var="environment=dev" -var="project=meow"
terraform apply -var="environment=dev" -var="project=meow"
Now that our environment is deployed, we can run through the ETL tutorial from Microsoft that I linked at the top of this page.
The result of the above command is below, and it shows the configuration has been written.
We can see below that the account configuration has also been written.
It can be seen below that the file has been downloaded to temporary storage on the cluster.
The result of this command can be seen below as true.
From the result of the above command we can see that the data is now in a dataframe.
Below it can be seen that the dataframe contains the data from the sample file.
Given that this is running in a Databricks notebook, a cleaner way to show the contents of the dataframe is to use the following:
You can see below that the only columns now returned in the dataframe are:
The output below shows that the second transformation we performed renames the column level to subscription_type.
This last screenshot shows that the data has been pushed over to our Azure Synapse instance.