Creating and mounting an Azure data lake in Databricks via Terraform
A few days ago Databricks announced their Terraform integration with Azure and AWS, which enables us to write infrastructure as code to manage Databricks resources like workspaces and clusters (even jobs!). A new version of their Terraform provider was released just two days ago, so let’s use it right away to see how it works. As an example, we’ll create an Azure Data Lake Storage Gen2 account and mount it to Databricks DBFS.
tl;dr: We can now provision Databricks infrastructure (workspaces, clusters, etc.) via Terraform. There is a complete example at the end of this page.
Foreword #
Databricks provider version #
Many of the provider’s resources still have an evolving API, so it’s a good idea to pin a specific version of it in our Terraform configuration.
Terraform plan will require an active cluster #
There is one important consequence of mounting a data lake using the Databricks provider to keep in mind: in this situation, refreshing the Terraform state (or running terraform plan) requires an active Databricks cluster and will create one if needed.
If we already have a cluster running 24/7, this doesn’t make any difference; if we don’t, on the other hand, this means that
- terraform plan will take longer to run whenever the cluster needs to be created on the spot (which takes 5-7 minutes for a minimal cluster with a single worker node)
- we will incur additional provisioning and usage costs
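If the plan-time refresh is the only reason a cluster would spin up, one workaround during day-to-day iteration is to skip the refresh entirely and plan against the last known state (stale, but cheap):

```
terraform plan -refresh=false
```

This trades accuracy for speed, so it’s best reserved for quick sanity checks rather than for the final apply.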
Naming convention #
Whenever I don’t feel the need to provide a more specific name for a resource (because there’s only one of that kind in the configuration, for example) I will use the name "this": in this context "this" doesn’t have any particular meaning and could be replaced with "foo", "example", or any other name.
First things first #
If you haven’t downloaded Terraform yet, you can get version 0.13.x by running
brew install terraform
If you instead need to use 0.12.x (or want the option to switch versions depending on the project), you can achieve that using tfenv:
brew install tfenv
tfenv install 0.12.26
tfenv use 0.12.26
We now need to tell Terraform that we want to use version 0.2.5 (the latest, at the moment of writing) of the Databricks provider.
On Terraform 0.13.x #
terraform {
required_version = "= 0.13.2"
required_providers {
azurerm = "~> 2.24.0"
databricks = {
source = "databrickslabs/databricks"
version = "0.2.5"
}
}
}
On Terraform 0.12.x #
The provider has to be installed manually by running
curl https://raw.githubusercontent.com/databrickslabs/databricks-terraform/master/godownloader-databricks-provider.sh | bash -s -- -b $HOME/.terraform.d/plugins v0.2.5
Initialize #
We can now finally run
terraform init
Configure Azure provider #
Not much to say, it’s a standard configuration:
variable "client_id" {
type = string
}
variable "client_secret" {
type = string
}
variable "tenant_id" {
type = string
}
variable "subscription_id" {
type = string
}
provider "azurerm" {
features {}
client_id = var.client_id
client_secret = var.client_secret
tenant_id = var.tenant_id
subscription_id = var.subscription_id
}
data "azurerm_client_config" "current" {}
resource "azurerm_resource_group" "this" {
name = "myrg"
location = "West Europe"
}
The azurerm_client_config data source will be useful later on, to provide client information to Databricks resources.
We’re also creating a resource group where we will create all other resources.
Note: alternatively, all the above variables can be sourced from their equivalent environment variables (ARM_CLIENT_ID, ARM_CLIENT_SECRET, ARM_TENANT_ID, and ARM_SUBSCRIPTION_ID), in which case the above provider block changes as follows:
provider "azurerm" {
features {}
}
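For instance, the environment could be prepared along these lines (the values below are placeholders, not real credentials):

```shell
# Placeholder values; substitute your service principal's actual credentials.
export ARM_CLIENT_ID="00000000-0000-0000-0000-000000000000"
export ARM_CLIENT_SECRET="my-client-secret"
export ARM_TENANT_ID="00000000-0000-0000-0000-000000000000"
export ARM_SUBSCRIPTION_ID="00000000-0000-0000-0000-000000000000"
```

The azurerm provider picks these up automatically, which keeps secrets out of .tfvars files and is especially convenient in CI pipelines.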
Create Azure data lake #
We’ll now create an Azure Data Lake Storage Gen2 instance:
resource "azurerm_storage_account" "this" {
name = "mystorageaccount"
resource_group_name = azurerm_resource_group.this.name
location = azurerm_resource_group.this.location
access_tier = "Hot"
account_kind = "StorageV2"
is_hns_enabled = true
account_tier = "Standard"
account_replication_type = "GRS" # or another type, according to your needs
enable_https_traffic_only = true
}
resource "azurerm_storage_container" "this" {
name = "data"
storage_account_name = azurerm_storage_account.this.name
container_access_type = "private"
}
In the above configuration, is_hns_enabled = true is required by Databricks: it enables the hierarchical namespace (providing improved filesystem performance, POSIX ACLs, and more); as a consequence, account_tier must be set to Standard.
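As a side note, if other tools need to reach the container directly, outside the DBFS mount, it can be handy to expose its abfss URI as a Terraform output. This is an optional addition (the output name data_lake_uri is just an example), not something the rest of the configuration depends on:

```
output "data_lake_uri" {
  value = "abfss://${azurerm_storage_container.this.name}@${azurerm_storage_account.this.name}.dfs.core.windows.net/"
}
```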
Create Databricks workspace #
Our Databricks resources need a home, so let’s create one for them:
resource "azurerm_databricks_workspace" "this" {
name = "myworkspace"
resource_group_name = azurerm_resource_group.this.name
location = azurerm_resource_group.this.location
sku = "standard"
}
Configure Databricks provider #
We now have all we need to initialize the Databricks provider, so here we go:
provider "databricks" {
azure_resource_group = azurerm_resource_group.this.name
azure_tenant_id = data.azurerm_client_config.current.tenant_id
azure_workspace_name = azurerm_databricks_workspace.this.name
}
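Depending on how you authenticate against Azure, the provider may also need the service principal credentials passed explicitly (in CI, for example, where no interactive session is available). A sketch, assuming the 0.2.x provider’s top-level azure_client_id, azure_client_secret, and azure_subscription_id arguments; check the provider docs for your exact version:

```
provider "databricks" {
  azure_resource_group  = azurerm_resource_group.this.name
  azure_tenant_id       = data.azurerm_client_config.current.tenant_id
  azure_workspace_name  = azurerm_databricks_workspace.this.name

  # Assumed argument names for explicit service principal auth.
  azure_client_id       = var.client_id
  azure_client_secret   = var.client_secret
  azure_subscription_id = var.subscription_id
}
```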
Create Databricks cluster #
In order to mount the data lake to Databricks (so that it can be accessed via DBFS), we need to provision a cluster, which we can later use for notebooks and jobs as well, if we want to.
resource "databricks_cluster" "this" {
cluster_name = "terraform"
idempotency_token = "terraform"
spark_version = "7.0.x-scala2.12" # any version will do here, just using the latest
driver_node_type_id = "Standard_DS3_v2" # any type will do here, just using the cheapest
node_type_id = "Standard_DS3_v2" # any type will do here, just using the cheapest
num_workers = 1
autotermination_minutes = 10
}
The (optional) idempotency_token prevents the creation of another cluster if one with the same token value already exists; in that case, this resource will
- restart the cluster (if not running), and
- return its ID
Mount data lake #
We’re almost ready to mount the data lake in DBFS (for example, to /mnt/data): the last remaining bit is that this operation requires us to provide the credentials of a service principal that has been assigned the Storage Blob Data Contributor role on the storage account; for that purpose, we’ll create a Databricks secret scope containing the service principal’s client_secret:
resource "databricks_secret_scope" "this" {
name = "terraform"
}
resource "databricks_secret" "this" {
key = "service_principal_key"
string_value = var.client_secret
scope = databricks_secret_scope.this.name
}
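The role assignment itself can also be kept in the same configuration rather than granted by hand in the portal. A minimal sketch, assuming the service principal Terraform authenticates as is the same one used for the mount (its object ID comes from the azurerm_client_config data source):

```
resource "azurerm_role_assignment" "this" {
  scope                = azurerm_storage_account.this.id
  role_definition_name = "Storage Blob Data Contributor"
  principal_id         = data.azurerm_client_config.current.object_id
}
```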
resource "databricks_azure_adls_gen2_mount" "this" {
cluster_id = databricks_cluster.this.id
storage_account_name = azurerm_storage_account.this.name
container_name = azurerm_storage_container.this.name
mount_name = "data"
tenant_id = data.azurerm_client_config.current.tenant_id
client_id = data.azurerm_client_config.current.client_id
client_secret_scope = databricks_secret_scope.this.name
client_secret_key = databricks_secret.this.key
initialize_file_system = true
}
The mount_name argument specifies a subpath of dbfs:/mnt/, so the above value translates to dbfs:/mnt/data.
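Once everything is applied, the mount can be sanity-checked from any notebook attached to the cluster, for example by listing the (still empty) container:

```
# Runs inside a Databricks notebook, where dbutils is predefined.
dbutils.fs.ls("/mnt/data")
```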
Create everything #
We can finally create everything by running
terraform apply
Conclusion #
That’s it! Here is the complete example (for Terraform 0.13.x): have fun!
terraform {
required_version = "= 0.13.2"
required_providers {
azurerm = "~> 2.24.0"
databricks = {
source = "databrickslabs/databricks"
version = "0.2.5"
}
}
}
variable "client_id" {
type = string
}
variable "client_secret" {
type = string
}
variable "tenant_id" {
type = string
}
variable "subscription_id" {
type = string
}
provider "azurerm" {
features {}
client_id = var.client_id
client_secret = var.client_secret
tenant_id = var.tenant_id
subscription_id = var.subscription_id
}
data "azurerm_client_config" "current" {}
resource "azurerm_resource_group" "this" {
name = "myrg"
location = "West Europe"
}
resource "azurerm_storage_account" "this" {
name = "mystorageaccount"
resource_group_name = azurerm_resource_group.this.name
location = azurerm_resource_group.this.location
access_tier = "Hot"
account_kind = "StorageV2"
is_hns_enabled = true
account_tier = "Standard"
account_replication_type = "GRS"
enable_https_traffic_only = true
}
resource "azurerm_storage_container" "this" {
name = "data"
storage_account_name = azurerm_storage_account.this.name
container_access_type = "private"
}
resource "azurerm_databricks_workspace" "this" {
name = "myworkspace"
resource_group_name = azurerm_resource_group.this.name
location = azurerm_resource_group.this.location
sku = "standard"
}
provider "databricks" {
azure_resource_group = azurerm_resource_group.this.name
azure_tenant_id = data.azurerm_client_config.current.tenant_id
azure_workspace_name = azurerm_databricks_workspace.this.name
}
resource "databricks_cluster" "this" {
cluster_name = "terraform"
idempotency_token = "terraform"
spark_version = "7.0.x-scala2.12"
driver_node_type_id = "Standard_DS3_v2"
node_type_id = "Standard_DS3_v2"
num_workers = 1
autotermination_minutes = 10
}
resource "databricks_secret_scope" "this" {
name = "terraform"
}
resource "databricks_secret" "this" {
key = "service_principal_key"
string_value = var.client_secret
scope = databricks_secret_scope.this.name
}
resource "databricks_azure_adls_gen2_mount" "this" {
cluster_id = databricks_cluster.this.id
storage_account_name = azurerm_storage_account.this.name
container_name = azurerm_storage_container.this.name
mount_name = "data"
tenant_id = data.azurerm_client_config.current.tenant_id
client_id = data.azurerm_client_config.current.client_id
client_secret_scope = databricks_secret_scope.this.name
client_secret_key = databricks_secret.this.key
initialize_file_system = true
}