Data Exploration Using Serverless SQL Pool In Azure Synapse

Azure Synapse Analytics, formerly known as Azure SQL Data Warehouse, is a cloud-based analytics service provided by Microsoft Azure. It provides capabilities like data exploration, data analysis, data integration, advanced analytics, and machine learning over data in Azure Data Lake Storage.

In this article, we will take a deep dive into ingesting data files into Azure Synapse and analyzing and transforming them using a serverless SQL pool. This article expects you to have a basic understanding of Azure fundamentals. Open your Synapse workspace and have it ready. Let's go.

Ingest/View Files in the Azure Data Lake

1. On the left side of Synapse Studio, use the >> icon to expand the menu. This reveals the different pages within Synapse Studio that we'll use to manage resources and perform data analytics tasks.

2. On the Data page, view the Linked tab and verify that your workspace includes a link to your Azure Data Lake Storage Gen2 storage account.

Data page

3. For this example, we are going to use free CSV files from NZ Govt Stats.

4. Under Azure Data Lake Storage Gen2, you should see the default storage account that was provisioned with your workspace. Create a csv folder under the files container and upload the .csv files from the link above. For this example, we have uploaded the 'Business-employment-data-september-2022-quarter-csv.csv' file.

5. Once you have uploaded a .csv file, right-click the csv folder, then in the 'New SQL script' list on the toolbar, select 'Select TOP 100 rows.' Review the Transact-SQL code that is generated.

SQL

-- This is auto-generated code
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://datalakexxxxxxx.dfs.core.windows.net/files/csv/Business-employment-data-september-2022-quarter-csv.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0'
    ) AS [result]


6. The above code uses the OPENROWSET function to read data from the CSV file in the files/csv folder and retrieve the first 100 rows. In the 'Connect to' list, ensure Built-in is selected; it represents the built-in serverless SQL pool that was created with our workspace. (In the real world, the administration team at your organization may also have created a named SQL pool in your workspace.) Then, on the toolbar, use the Run button to run the SQL code and review the results, which should look like this.

review the results

In this case, the data file includes the column names in the first row, so modify the query to add a HEADER_ROW = TRUE parameter to the OPENROWSET function, as shown here (don't forget to add a comma after the previous parameter):

SQL

SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://datalakexxxxxxx.dfs.core.windows.net/files/csv/Business-employment-data-september-2022-quarter-csv.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    ) AS [result]


Run the above query, and you should see the results below.

query results

You can save the script by using the "Publish" button on the toolbar.
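Serverless SQL pools also support an optional WITH clause on OPENROWSET that defines an explicit schema instead of relying on inferred column types. Below is a minimal sketch; the Series_reference and Period columns are used later in this article, while the Data_value column and all the data types are assumptions you should adjust to match your file:

SQL

SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://datalakexxxxxxx.dfs.core.windows.net/files/csv/Business-employment-data-september-2022-quarter-csv.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    )
    WITH (
        -- Column names must match the CSV header; the types are assumptions
        Series_reference VARCHAR(50),
        Period VARCHAR(10),
        Data_value DECIMAL(18, 2)
    ) AS [result]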

Transform Data Using CREATE EXTERNAL TABLE AS SELECT Statements

A simple way to use SQL to transform data in a file and persist the results in another file is to use a CREATE EXTERNAL TABLE AS SELECT (CETAS) statement. This statement creates a table based on the results of a query, but the data for the table is stored as files in a data lake. The transformed data can then be queried through the external table or accessed directly in the file system (for example, for inclusion in a downstream process to load the transformed data into a data warehouse).
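In outline, the statement combines an external table definition with a query. Here is a minimal, hypothetical sketch, assuming an external data source and file format like the ones we create in the next section (my_data_source and the demo/ folder are placeholder names of our own):

SQL

-- Hypothetical CETAS sketch: persist a query's results as Parquet files.
-- my_data_source and 'demo/' are placeholders for illustration.
CREATE EXTERNAL TABLE DemoTable
WITH (
    LOCATION = 'demo/',              -- output folder, relative to the data source
    DATA_SOURCE = my_data_source,    -- external data source pointing at the data lake
    FILE_FORMAT = ParquetFormat      -- external file format (defined separately)
)
AS
SELECT 1 AS id, 'example' AS label;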

Create an External Data Source and File Format

By defining an external data source in a database, you can use it to reference the data lake location where you want to store files for external tables. An external file format enables you to define the format for those files, such as Parquet or CSV. To work with external tables, you need to create these objects in a database other than the default master database.
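For comparison with the Parquet format we create below, an external file format for delimited text files can be declared with FORMAT_TYPE = DELIMITEDTEXT. A sketch (the name CsvFormat is our own choice):

SQL

-- A CSV external file format; FIRST_ROW = 2 skips the header line.
CREATE EXTERNAL FILE FORMAT CsvFormat
    WITH (
        FORMAT_TYPE = DELIMITEDTEXT,
        FORMAT_OPTIONS (
            FIELD_TERMINATOR = ',',
            STRING_DELIMITER = '"',
            FIRST_ROW = 2
        )
    );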

1. In Synapse Studio, on the 'Develop' page, in the '+' menu, select 'SQL script.'

Develop

2. In the new script pane, add the following code (replacing datalakexxxxxxx with the name of your data lake storage account) to create a new database and add an external data source to it.

SQL

-- Database for Employment data
CREATE DATABASE Employment
    COLLATE Latin1_General_100_BIN2_UTF8;
GO

USE Employment;
GO

-- External data is in the files container in the data lake.
-- Point the data source at the container root so that BULK paths
-- can be expressed relative to it.
CREATE EXTERNAL DATA SOURCE employment_data WITH (
    LOCATION = 'https://datalakexxxxxxx.dfs.core.windows.net/files/'
);
GO

-- Format for table files
CREATE EXTERNAL FILE FORMAT ParquetFormat
    WITH (
        FORMAT_TYPE = PARQUET,
        DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
    );
GO

Modify the script properties to change its name to Create Employment DB, and publish it. Ensure that the script is connected to the built-in SQL pool and the master database, and then run it. Switch back to the Data page and use the ↻ button at the top right of Synapse Studio to refresh the page. Then view the Workspace tab in the Data pane, where a SQL database list is now displayed. Expand this list to verify that the Employment database has been created.

Synapse Studio

Expand the Employment database, its External Resources folder, and the External data sources folder under that to see the employment_data external data source you created.

Expand the Employment database
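Alternatively, you can confirm that the objects exist from a script by querying the catalog views in the Employment database:

SQL

USE Employment;
GO

-- List the external data sources and file formats in this database
SELECT name, location FROM sys.external_data_sources;
SELECT name, format_type FROM sys.external_file_formats;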

In the Properties pane, you can name the script as shown below and publish it. Publishing saves the code so you can return to and edit the script later.

script

Create an External Table

In Synapse Studio, on the Develop page, in the '+' menu, select 'SQL script,' and add the following code to retrieve and aggregate data from the CSV files by using the external data source. Note that the BULK path is now relative to the folder location (the files container) on which the data source is defined.

SQL

USE Employment;
GO

SELECT
    COUNT(Series_reference) AS Series_count,
    Period AS period
FROM
    OPENROWSET(
        BULK 'csv/Business*.csv',
        DATA_SOURCE = 'employment_data',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    ) AS series_data
GROUP BY Period;


Run the script, and the results should show as below.

script result

Modify the SQL code to save the results of the query in an external table like this:

SQL

USE Employment;
GO

CREATE EXTERNAL TABLE Series_by_period
WITH (
    -- Output folder, relative to the data source location (the files container)
    LOCATION = 'SeriesbyPeriod/',
    DATA_SOURCE = employment_data,
    FILE_FORMAT = ParquetFormat
)
AS
SELECT
    COUNT(Series_reference) AS Series_count,
    Period AS period
FROM
    OPENROWSET(
        BULK 'csv/Business*.csv',
        DATA_SOURCE = 'employment_data',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    ) AS series_data
GROUP BY Period;


Run the script. This time there's no output, but the code should have created an external table based on the results of the query. Name the script Create SeriesByPeriod Table and publish it.

Create SeriesByPeriod Table

On the Data page, in the Workspace tab, open the menu for the Series_by_period table and select New SQL script > Select TOP 100 rows. Then run the resulting script and verify that it returns the aggregated series count by employment period.

Series_by_period

Employment Period data
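The auto-generated script should look something like the following (the exact text may differ slightly):

SQL

-- Query the external table like any other table; the serverless SQL pool
-- reads the underlying Parquet files behind the scenes.
SELECT TOP (100) *
FROM [dbo].[Series_by_period];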

On the files tab containing the file system for your data lake, view the contents of the files folder (refreshing the view if necessary) and verify that a new SeriesbyPeriod folder has been created.

SeriesbyPeriod

In the SeriesbyPeriod folder, observe that one or more files with GUID-like names (similar to F5C***.parquet) have been created. These files contain the aggregated series-by-period data.

SeriesbyPeriod folder

To verify this, you can select one of the files and use the New SQL script > Select TOP 100 rows menu to query it directly.

SQL

-- This is auto-generated code
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://datalakexxxxxxx.dfs.core.windows.net/files/SeriesbyPeriod/F5C90A74-1FA5-4F41-919D-BC5C09F2B2F0_24_0-1.parquet',
        FORMAT = 'PARQUET'
    ) AS [result]


query executed successfully
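If you want to clean up afterwards, you can drop the objects in dependency order. Note that dropping an external table removes only its metadata; the Parquet files it produced remain in the data lake and must be deleted separately:

SQL

USE Employment;
GO

DROP EXTERNAL TABLE Series_by_period;
DROP EXTERNAL FILE FORMAT ParquetFormat;
DROP EXTERNAL DATA SOURCE employment_data;
GO

USE master;
GO

DROP DATABASE Employment;
GO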


Conclusion

We have demonstrated how to ingest data files into Azure Data Lake Storage Gen2 and how to perform data analysis and transformation activities in Azure Synapse Studio using a serverless SQL pool. These techniques are useful for data analysts performing analysis, for developers checking their code before automation, for architects and data engineers understanding and exploring datasets and creating data pipelines, and for test engineers validating data.