Monday, August 14, 2017

Azure SQL Data Warehouse - Part II - Analyzing Created Databases

I started writing on this long time back but had no way of completing it. My first post on this was Azure SQL Data Warehouse - Part I - Creating and Connecting and here is the second part of it.

The first post discussed about Azure SQL Data Warehouse and how to create a server and database using the portal. As the next step, let's talk about the architecture of it bit and then see how Azure data warehouse maintains and processes data.

Control Node

Azure SQL Data Warehouse is a distributed database. It means that data is distributed in multiple locations. However, once the data warehouse is created, we will be connecting with one component called Control Node. It is not exactly a SQL Server database but when connecting to it, it looks and feels like connecting to a SQL Server Database. Control node handles all communication and computation. When we make a request to the data warehouse, Control node accepts it, determines the way it should be distributed based on divide and conquer approach, get it processed and finally send the result to us.

Compute Node

Control node get the data processed in parallel using Compute Nodes. They are SQL Server databases and store all our records. Based on the number of DWU configured, data warehouse is set with one or more Compute Nodes.

Distribution

Data related to the data warehouse is stored in Azure Blob Storage and distributed in multiple locations. It is independent from Compute Nodes, hence they can be operated/adjusted independently. These locations are called as Distributions. The number of distributions for an Azure SQL data warehouse is a fixed number that is sixty (60). These distributions are assigned dynamically to Compute Nodes and when a query is executed, each distribution processes data related to them. This is how the parallel execution happens.

If you need more compute power, you can increase the number of Compute Nodes by increasing DWUs. When the number of Compute Nodes are getting changed, the number of distributions per Compute Node is getting changed as well.

Architecture

This image shows the architecture when you create an Azure SQL Data Warehouse with 100 DWU.


This image shows the architecture when you create an Azure SQL Data Warehouse with 400 DWU.


Let's create two databases and clarify this configuration.

I have discussed all steps related to server creation and database creation in my first post that is  Azure SQL Data Warehouse - Part I - Creating and Connecting , hence I am not going to discuss the same here. Note the image below. That is how I have created two databases for checking the configurations.


As you see, the first data warehouse is created using 100 DWUs and second one with 400 DWUs. Let's see how nodes have been created for these two databases. We can simply use sys.dm_pdw_nodes DMV for getting this information.

SELECT * FROM sys.dm_pdw_nodes;

SELECT type, Count(*) NofNodes
FROM sys.dm_pdw_nodes
GROUP BY type;

Result with the first data warehouse that is created using 100 DWUs.


Result with the second data warehouse that is created using 400 DWUs.


Note the second data warehouse. Since we used more DWUs, it has been created with four Compute Nodes that gives better performance than the first one. Since this is a sample database and it has tables, we can check one of the tables and see how data is distributed with distributions.

The following code shows the distributions created for one data warehouse. As mentioned above, it is always 60.


Here is the code for seeing how rows of a table are distributed in distributions with the second data warehouse created. Note how each distributions are assigned to Compute Nodes.


Records are distributed based on the design of the table. Azure SQL Data Warehouse uses two types of distributions: Round Robin and Hash distributions. Let's talk about it with the next post.

No comments: