How to Adopt Unity Catalog

Blueprint Technologies
5 min readDec 7, 2023

--

Unity Catalog (UC) is a unified governance solution for data and AI assets on the Lakehouse. It provides centralized access control, auditing, lineage, and data discovery capabilities across Azure Databricks workspaces. Using different features in Unity Catalog can help organizations streamline their operations and achieve greater efficiency. Efficiency improvements managing their digital assets, sharing information, collaborating on projects, and automating workflows. This can lead to significant cost savings and improved productivity over time.

If you’re considering adopting Unity Catalog in your organization, there are a few things to keep in mind. First, it’s crucial to clearly identify your key stakeholders. This will include representatives from your Governance, DevOps, Sysadmin, Security, Infrastructure, and Data engineering teams. You might also need additional representatives for teams that consume data directly from your Lakehouse. The larger this group becomes, the more effort is required for your migration.

To determine the level of complexity in your Databricks environment, consider the following cases:

  • Are you just starting to use Databricks and only have a few objects defined in your Lakehouse?
  • Have you used Databricks for a while, defined many objects, identified stakeholders for the migration project, and limited technical debt?
  • Are many objects defined with complicated rights, roles, and permissions? Do you have significant technical debt, complex notebook interaction, and custom business logic baked into your ETL processes?

The closer you get to case 1, the fewer issues you’ll face during migration. It’s not impossible to complete a migration for the second two cases, but you will have to take a more controlled and iterative approach.

Step 1: Discovery and Initial Setup

We must gather vital information to meet our use cases, requirements, and applicable regulations. We must know them. You’ll need to collect a list to apply these requirements early in your process as objects move from hive metastores into the Unity catalog.

During the discovery phase, you’ll want to identify what identity provider you will use. If you’re on Azure, that’s easy; it’s Azure Active Directory (AAD). If you’re on AWS or GCP, you must use one compatible with Databricks; your identity provider must support SAML 2.0 or OpenID Connect. Once selected, you must provision all the identities that will access your Lakehouse. Users, app registrations, and Managed Service Identities (MSI) will be these identities. This will allow secure access to the Lakehouse. Once your identities are set up, you’ll need to connect your Databricks account to your identity provider so that you can use your identities.

At this point, you can enable Unity Catalog for your account and set up your first metastore. While you can only have one Unity Catalog per account, you can set up one metastore per data center (or region) hosting one of your Databricks workspaces. So, if you have workspaces in three regions, you must set up a metastore for each region.

During the discovery process, you’ll inventory your workspace artifacts. These include users and groups, notebooks or Python files, queries, pipelines, and objects currently defined in your hive metastores. This data should be collected so you can use it to control your migration process in subsequent steps.

Once you have your inventory, you can begin to generate impact statements. You can discover what code access which hive objects. You can also see what users have access to which objects and codebases. These impact statements will be used to order your migration to minimize conflicts. You can also use them to test your migration while it’s in motion to determine when new code or Unity Catalog objects can be released for broader consumption.

Step 2: Plan the migration

Once you have your inventory, you can use the data to configure your migration. You can use the list of workspace groups and users to decide whether to provision those users at the account level. By saving these choices as a configuration, you can drive the migration through automation rather than manually provisioning each user. This user configuration can also control granting users access to a given workspace.

You can use your list of metastore objects to build a configuration that would let you define which catalog, schema, and table to define those objects in your UC Metastore. Next, take your list of users and objects and build a configuration that will migrate object permissions to Unity Catalog.

You can decide whether to buy or build a tagging solution to help you define additional data attributes, like PII, PCI, or another regulatory requirement.

You can introduce the requirements and regulations from the discovery step by defining your migration choices in a configuration. You’ll also be able to use the inventory and configuration to generate documentation and diagrams that can be used to provide a detailed overview of the migration process. These generated documents will enable you to get buy-in and approvals from stakeholders.

By adopting the configuration-driven approach, the migration process can be iterative, and it’s unnecessary to migrate everything in one go. These configurations allow you to migrate objects over time, ensuring that each step is thoroughly reviewed and approved before proceeding. This will help ensure that the migration process succeeds and that the data remains secure.

Step 3: Run the migration

During this step, you make the changes defined in your configuration. This could be manual, but we highly recommend an automated approach. This ensures that the migration follows your defined choices. This is where users, objects, and permissions are created in Unity Catalog. This is also the step where notebooks, queries, and pipelines are updated from references to the hive metastore objects and replaced with references to their UC counterparts.

This is also where any tagging you’ve selected occurs on the new objects as they are created and secures them the moment they are made.

You’ll need to verify that objects, permissions, and workloads continue to work as you move to the new objects. As the migration proceeds, you can test the new versions work before committing to use them.

You may repeat steps two and three as many times as needed to migrate all the objects to Unity Catalog.

Step 4: Deprecate hive metastore objects

Once all the objects have been migrated to Unity Catalog, you’ll want to meet with the stakeholders again. In these meetings, you’ll determine that the original hive metastore objects are no longer needed. Once you reach this decision, you can remove those objects. You may also want to prevent future objects from being written to the hive metastore. This can prevent additional objects requiring migration.

After deprecation, you have completed your migration. You can fully enjoy all the benefits Unity Catalog brings!

--

--

Blueprint Technologies
Blueprint Technologies

Written by Blueprint Technologies

We’re a nationwide technology solutions firm that connects strategy, product, and delivery.

No responses yet