Creating and mounting an Azure data lake in Databricks via Terraform

A few days ago Databricks announced their Terraform integration with Azure and AWS, which lets us write infrastructure as code to manage Databricks resources such as workspaces, clusters, and even jobs. A new version of their Terraform provider was released just two days ago, so let’s use it right away to see how it works. As an example, we’ll create an Azure Data Lake Storage Gen2 account and mount it to Databricks DBFS.

tl;dr: We can now provision Databricks infrastructure (workspaces, clusters, etc.) via Terraform. There is a complete example at the end of this page.

Foreword

Databricks provider version

Many of the provider’s resources still have an evolving API, so it’s a good idea to pin a specific provider version in our Terraform configuration.

Terraform plan will require an active cluster

There is one important consequence of mounting a data lake using the Databricks provider to keep in mind: in this situation, refreshing the Terraform state (or running terraform plan) requires an active Databricks cluster and will create one if needed.

If we already have a cluster running 24/7, this doesn’t make any difference; if we don’t, on the other hand, this means that

  • terraform plan will take longer to run whenever the cluster needs to be created on the spot (which takes 5-7 minutes for a minimal cluster with a single worker node)
  • we will incur additional provisioning and usage costs

Naming convention

Whenever I don’t feel the need to provide a more specific name for a resource (because there’s only one of that kind in the configuration, for example) I will use the name "this": in this context "this" doesn’t have any particular meaning and could be replaced with "foo", "example", or any other name.

First things first

If you haven’t downloaded Terraform yet, you can get version 0.13.x by running

brew install terraform

If you instead need to use 0.12.x (or want the option to switch version depending on the project), you can achieve that using tfenv:

brew install tfenv
tfenv install 0.12.26

tfenv use 0.12.26

We now need to tell Terraform that we want to use version 0.2.5 (the latest, at the moment of writing) of the Databricks provider.

On Terraform 0.13.x

terraform {
  required_version = "= 0.13.2"

  required_providers {
    azurerm    = "~> 2.24.0"
    databricks = {
      source = "databrickslabs/databricks"
      version = "0.2.5"
    }
  }
}

On Terraform 0.12.x

The provider has to be installed manually by running

curl https://raw.githubusercontent.com/databrickslabs/databricks-terraform/master/godownloader-databricks-provider.sh | bash -s -- -b $HOME/.terraform.d/plugins v0.2.5

Initialize

We can now finally run

terraform init

Configure Azure provider

Not much to say, it’s a standard configuration:

variable "client_id" {
  type = string
}

variable "client_secret" {
  type = string
}

variable "tenant_id" {
  type = string
}

variable "subscription_id" {
  type = string
}

provider "azurerm" {
  features {}

  client_id         = var.client_id
  client_secret     = var.client_secret
  tenant_id         = var.tenant_id
  subscription_id   = var.subscription_id
}

data "azurerm_client_config" "current" {}

resource "azurerm_resource_group" "this" {
  name     = "myrg"
  location = "West Europe"
}

The azurerm_client_config data source will be useful later on, to provide client information to Databricks resources. We’re also creating a resource group where we will create all other resources.

Note: Alternatively, all the above variables can be sourced from their equivalent environment variables (ARM_CLIENT_ID, ARM_CLIENT_SECRET, ARM_TENANT_ID, and ARM_SUBSCRIPTION_ID), in which case the above provider block changes as follows:

provider "azurerm" {
  features {}
}

Create Azure data lake

We’ll now create an Azure Data Lake Storage Gen2 instance:

resource "azurerm_storage_account" "this" {
  name                      = "mystorageaccount"
  resource_group_name       = azurerm_resource_group.this.name
  location                  = azurerm_resource_group.this.location
  access_tier               = "Hot"
  account_kind              = "StorageV2"
  is_hns_enabled            = true
  account_tier              = "Standard"
  account_replication_type  = "GRS" # or another type, according to your needs
  enable_https_traffic_only = true
}

resource "azurerm_storage_container" "this" {
  name                  = "data"
  storage_account_name  = azurerm_storage_account.this.name
  container_access_type = "private"
}

In the above configuration, is_hns_enabled = true is required by Databricks: it enables the hierarchical namespace (providing improved filesystem performance, POSIX ACLs, and more); as a consequence, account_tier must be set to Standard.
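
Once the hierarchical namespace is enabled, the container can be addressed through the abfss:// scheme on the account’s dfs endpoint. As a small sketch (the output name abfss_uri is made up for illustration, and the host suffix assumes the Azure public cloud), we could expose that URI as a Terraform output:

```hcl
# Hypothetical output, not part of the original configuration.
# Assumes the Azure public cloud, where the Data Lake endpoint is
# <account>.dfs.core.windows.net.
output "abfss_uri" {
  value = "abfss://${azurerm_storage_container.this.name}@${azurerm_storage_account.this.name}.dfs.core.windows.net/"
}
```

With the names used above, this should evaluate to abfss://data@mystorageaccount.dfs.core.windows.net/.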

Create Databricks workspace

Our Databricks resources need a home, so let’s create one for them:

resource "azurerm_databricks_workspace" "this" {
  name                = "myworkspace"
  resource_group_name = azurerm_resource_group.this.name
  location            = azurerm_resource_group.this.location
  sku                 = "standard"
}
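
To find the workspace after it has been created, it can be handy to print its URL. The following is only a sketch: the output name is hypothetical, and it assumes that the pinned azurerm provider version exposes the workspace_url attribute (a host name without the scheme) on the workspace resource.

```hcl
# Hypothetical output, not part of the original configuration.
# Assumes azurerm_databricks_workspace exposes workspace_url
# in the provider version pinned above.
output "databricks_workspace_url" {
  value = "https://${azurerm_databricks_workspace.this.workspace_url}"
}
```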

Configure Databricks provider

We now have all we need to initialize the Databricks provider, so here we go:

provider "databricks" {
  azure_resource_group = azurerm_resource_group.this.name
  azure_tenant_id      = data.azurerm_client_config.current.tenant_id
  azure_workspace_name = azurerm_databricks_workspace.this.name
}

Create Databricks cluster

In order to mount the data lake to Databricks (so that it can be accessed via DBFS), we need to provision a cluster (which we can later use for notebooks/jobs as well, if we want to):

resource "databricks_cluster" "this" {
  cluster_name            = "terraform"
  idempotency_token       = "terraform"
  spark_version           = "7.0.x-scala2.12" # any version will do here, just using the latest
  driver_node_type_id     = "Standard_DS3_v2" # any type will do here, just using the cheapest
  node_type_id            = "Standard_DS3_v2" # any type will do here, just using the cheapest
  num_workers             = 1
  autotermination_minutes = 10
}

The (optional) idempotency_token prevents the creation of another cluster if one with the same token value already exists. In that case, this resource will

  • restart the cluster (if not running), and
  • return its ID

Mount data lake

We’re almost ready to mount the data lake in DBFS (for example, to /mnt/data). The one remaining requirement is that this operation needs the credentials of a service principal that has been assigned the Storage Blob Data Contributor role on the storage account; for that purpose, we’ll create a Databricks secret scope containing the service principal’s client_secret:

resource "databricks_secret_scope" "this" {
  name = "terraform"
}

resource "databricks_secret" "this" {
  key          = "service_principal_key"
  string_value = var.client_secret
  scope        = databricks_secret_scope.this.name
}

resource "databricks_azure_adls_gen2_mount" "this" {
  cluster_id             = databricks_cluster.this.id
  storage_account_name   = azurerm_storage_account.this.name
  container_name         = azurerm_storage_container.this.name
  mount_name             = "data"
  tenant_id              = data.azurerm_client_config.current.tenant_id
  client_id              = data.azurerm_client_config.current.client_id
  client_secret_scope    = databricks_secret_scope.this.name
  client_secret_key      = databricks_secret.this.key
  initialize_file_system = true
}

The mount_name argument specifies a subpath of dbfs:/mnt/, so the above value translates to dbfs:/mnt/data.
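
The configuration above assumes that the service principal already holds the Storage Blob Data Contributor role on the storage account. If that is not the case, the role could be granted from the same configuration. This is a minimal sketch, not part of the original example: it assumes that Terraform authenticates as the same service principal used for the mount (so its object ID is available from azurerm_client_config) and that this principal is allowed to create role assignments on the storage account.

```hcl
# Hypothetical addition, not part of the original configuration.
# Grants the service principal Terraform runs as the data-plane role
# required for mounting the data lake.
resource "azurerm_role_assignment" "this" {
  scope                = azurerm_storage_account.this.id
  role_definition_name = "Storage Blob Data Contributor"
  principal_id         = data.azurerm_client_config.current.object_id
}
```

In that case it would probably make sense to add depends_on = [azurerm_role_assignment.this] to the mount resource, so that the mount is only attempted once the role has been assigned.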

Create everything

We can finally create everything by running

terraform apply

Conclusion

That’s it! Here is the complete example (for Terraform 0.13.x): have fun!

terraform {
  required_version = "= 0.13.2"

  required_providers {
    azurerm    = "~> 2.24.0"
    databricks = {
      source = "databrickslabs/databricks"
      version = "0.2.5"
    }
  }
}

variable "client_id" {
  type = string
}

variable "client_secret" {
  type = string
}

variable "tenant_id" {
  type = string
}

variable "subscription_id" {
  type = string
}

provider "azurerm" {
  features {}

  client_id         = var.client_id
  client_secret     = var.client_secret
  tenant_id         = var.tenant_id
  subscription_id   = var.subscription_id
}

data "azurerm_client_config" "current" {}

resource "azurerm_resource_group" "this" {
  name     = "myrg"
  location = "West Europe"
}

resource "azurerm_storage_account" "this" {
  name                      = "mystorageaccount"
  resource_group_name       = azurerm_resource_group.this.name
  location                  = azurerm_resource_group.this.location
  access_tier               = "Hot"
  account_kind              = "StorageV2"
  is_hns_enabled            = true
  account_tier              = "Standard"
  account_replication_type  = "GRS"
  enable_https_traffic_only = true
}

resource "azurerm_storage_container" "this" {
  name                  = "data"
  storage_account_name  = azurerm_storage_account.this.name
  container_access_type = "private"
}

resource "azurerm_databricks_workspace" "this" {
  name                = "myworkspace"
  resource_group_name = azurerm_resource_group.this.name
  location            = azurerm_resource_group.this.location
  sku                 = "standard"
}

provider "databricks" {
  azure_resource_group = azurerm_resource_group.this.name
  azure_tenant_id      = data.azurerm_client_config.current.tenant_id
  azure_workspace_name = azurerm_databricks_workspace.this.name
}

resource "databricks_cluster" "this" {
  cluster_name            = "terraform"
  idempotency_token       = "terraform"
  spark_version           = "7.0.x-scala2.12"
  driver_node_type_id     = "Standard_DS3_v2"
  node_type_id            = "Standard_DS3_v2"
  num_workers             = 1
  autotermination_minutes = 10
}

resource "databricks_secret_scope" "this" {
  name = "terraform"
}

resource "databricks_secret" "this" {
  key          = "service_principal_key"
  string_value = var.client_secret
  scope        = databricks_secret_scope.this.name
}

resource "databricks_azure_adls_gen2_mount" "this" {
  cluster_id             = databricks_cluster.this.id
  storage_account_name   = azurerm_storage_account.this.name
  container_name         = azurerm_storage_container.this.name
  mount_name             = "data"
  tenant_id              = data.azurerm_client_config.current.tenant_id
  client_id              = data.azurerm_client_config.current.client_id
  client_secret_scope    = databricks_secret_scope.this.name
  client_secret_key      = databricks_secret.this.key
  initialize_file_system = true
}