Automate Private Azure Databricks Unity Catalog Creation

Disclaimer: All the views and opinions expressed in the blog belong solely to the author and not necessarily to the author's employer or any other group or individual. This article is not a promotion for any cloud/data management platform. All the images and code snippets are publicly available on the Azure/Databricks website.

What Is Databricks?

Databricks is a platform that makes it easy to work with large amounts of data. It allows companies to store, process, and analyze their data in one place.

What Is Databricks Unity Catalog?

Databricks Unity Catalog is a tool that helps manage and organize data across an organization in a simple, secure way. It allows companies to keep track of all their data, making it easier to find, share, and control who can access it. Unity Catalog works across different cloud storage systems and lets teams manage permissions, governance, and data access from one place, ensuring data is used safely and efficiently. 

Why Unity Catalog Creation Should Be Automated
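
Setting up Unity Catalog by hand involves several dependent steps (a storage credential, an external location, and the catalog itself) that must be repeated for every workspace and environment. Scripting the process makes the setup repeatable, reduces the chance of misconfigured permissions, and lets the same code be promoted across development, test, and production environments.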

How to Automate Unity Catalog Creation in an Azure Subscription

The step-by-step guide below creates a Unity Catalog from scratch in an Azure subscription. The script assumes the Azure services listed under Prerequisites are configured properly; it only creates the necessary components inside the Databricks workspace.


Prerequisites
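
Based on what the script references, the following should already be in place:

- An Azure Databricks workspace on the Premium tier (required for Unity Catalog) and a personal access token for it
- An Azure Data Lake Storage Gen2 storage account with a container to hold the catalog's data
- An Azure Databricks Access Connector whose managed identity has the Storage Blob Data Contributor role on that storage account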

Automation Script

Step 1: Create a Notebook and Install Databricks SDK

Log in to the Databricks workspace and create a new notebook by clicking the "New" button in the upper-left corner of the home page. Copy and paste the Python code below into the first cell and run it. This installs the Databricks SDK and restarts the Python process so the upgraded version is picked up.

Python
 
%pip install databricks-sdk --upgrade
dbutils.library.restartPython()  # restart so the upgraded SDK is importable


Step 2: Create Storage Credential

A storage credential represents an authentication and authorization mechanism for accessing data stored on your cloud tenant. On Azure, it uses a managed identity or a service principal to access Azure Data Lake Storage Gen2 containers. Copy and paste the Python code below into the next cell and run it to create the storage credential for the catalog.

Python
 
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import catalog

databricks_url = 'https://adb-xxxx.azuredatabricks.net/'  # replace with your workspace URL
api_token = 'xxxx'  # replace with your personal access token. Best practice: store the token in Azure Key Vault and retrieve it through a Key Vault-backed secret scope.

# Connect to the workspace through the SDK
w = WorkspaceClient(host=databricks_url, token=api_token)

storage_cred_name = 'storage_cred'  # replace with your chosen credential name

access_connector_id = '/subscriptions/xxx/resourceGroups/xxx/providers/Microsoft.Databricks/accessConnectors/xxx'  # replace with your Azure Databricks Access Connector ID

# Create a storage credential backed by the Access Connector's managed identity
storage_cred = w.storage_credentials.create(
    name=storage_cred_name,
    azure_managed_identity=catalog.AzureManagedIdentityRequest(
        access_connector_id=access_connector_id))
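
As the comment above notes, hard-coding a personal access token is a bad practice. Below is a minimal sketch of retrieving it from a Key Vault-backed secret scope instead; the scope name kv-scope and secret name databricks-pat are hypothetical placeholders:

Python
 
# Hypothetical names -- replace with your Key Vault-backed secret scope
# and the secret that holds the workspace personal access token
api_token = dbutils.secrets.get(scope='kv-scope', key='databricks-pat')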


Step 3: Create an External Location

An external location is an object that combines a cloud storage path with a storage credential that authorizes access to that path. Each external location is subject to Unity Catalog access-control policies that control which users and groups can access it. Copy and paste the SQL code below into the next cell and run it.

SQL
 
%sql
CREATE EXTERNAL LOCATION IF NOT EXISTS external_loc -- Replace the external location name as needed
URL 'abfss://CONTAINER_NAME@STORAGE_ACCOUNT_NAME.dfs.core.windows.net/' -- Replace with the storage account and container you created in Azure
WITH (STORAGE CREDENTIAL storage_cred); -- Replace with the storage credential name created in the earlier step
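
If you prefer to keep the entire flow in Python, the SDK exposes the same operation. Below is a minimal sketch that reuses the w client and storage_cred_name from Step 2; the location name and URL are the same placeholders as in the SQL above:

Python
 
# Equivalent external location creation via the Databricks SDK
external_loc = w.external_locations.create(
    name='external_loc',
    credential_name=storage_cred_name,
    url='abfss://CONTAINER_NAME@STORAGE_ACCOUNT_NAME.dfs.core.windows.net/')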


Step 4: Create a New Catalog

Now you can create a new catalog using an SQL command. Copy and paste the code below into the next cell and run it. 

SQL
 
%sql
CREATE CATALOG IF NOT EXISTS my_catalog -- Replace the catalog name as necessary
MANAGED LOCATION 'abfss://CONTAINER_NAME@STORAGE_ACCOUNT_NAME.dfs.core.windows.net/'; -- Replace with the storage account and container you created in Azure
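
Again, if you want the whole setup in a single Python script, here is a minimal SDK sketch; my_catalog and the storage root URL are placeholders:

Python
 
# Equivalent catalog creation via the Databricks SDK
new_catalog = w.catalogs.create(
    name='my_catalog',
    storage_root='abfss://CONTAINER_NAME@STORAGE_ACCOUNT_NAME.dfs.core.windows.net/')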


Step 5: Validation

Click on Catalog on the workspace home page.

You should see the new catalog in the workspace when you navigate to the Catalog section. Expand the catalog by clicking the ">" icon next to it; you should see a default schema. Now select the default schema and create a new table.

You can upload a file with valid data in any of the supported file formats (.csv, .tsv, .tab, .json, .jsonl, .avro, .parquet, .txt, or .xml) to create a table in Unity Catalog.
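
The UI upload is convenient for a quick check, but this step can also be scripted. Here is a hedged sketch using the read_files table-valued function; the file path, catalog, and table names are hypothetical placeholders:

SQL
 
%sql
-- Create a table from a CSV file already landed in the container
-- (the path, catalog, and table names are placeholders)
CREATE TABLE my_catalog.default.my_table AS
SELECT * FROM read_files(
  'abfss://CONTAINER_NAME@STORAGE_ACCOUNT_NAME.dfs.core.windows.net/path/to/file.csv',
  format => 'csv',
  header => true);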


Once the table is created, you can see it in the catalog under the default schema, and you can open a SQL notebook and query it. Also, if you navigate to the container inside the Azure storage account, you should see the table metadata inside the __unitystorage folder.
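
For a quick end-to-end check from a notebook, queries like the ones below confirm that each object exists; the catalog and table names are the placeholders used earlier (run each statement in its own cell if needed):

SQL
 
%sql
-- Confirm the catalog, external location, and table are all in place
SHOW CATALOGS;
DESCRIBE EXTERNAL LOCATION external_loc;
SELECT * FROM my_catalog.default.my_table LIMIT 10;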
