Wednesday, October 18, 2017

tempdb becomes full when updating Clustered Columnstore index

I had a table with 3.5 billion records and wanted to update one column in all of them. The table was not partitioned, though the design says it should be. On a server with 2 processors (12 cores) and 64 GB of RAM, the update could never complete because it filled tempdb, almost 800 GB, consuming all the space on the drive.

I am still not sure about the exact reason, but I am listing the possible reasons as they could help you too.
  1. The table is not partitioned. I should have applied partitioning before loading billions of records.
  2. The clustered columnstore index needs to be rebuilt. Maybe the update takes a long time because the index is fragmented (I had no way of checking as everything was urgent).
  3. Maybe it is nothing to be surprised about: an update means a delete and an insert, so the new versions of the records are added to the delta store and later moved to compressed rowgroups, which takes time and needs extra support from tempdb.
  4. There is not enough memory to complete the UPDATE, hence it spills to tempdb.
However, I managed to execute the UPDATE as a batch process. Although it took a significant amount of time, it did not use tempdb (it might have used it slightly, but I did not notice).

Here is the code I finally used;

WHILE EXISTS (SELECT * FROM dbo.Table01 WHERE Column1 IS NULL)
BEGIN

 UPDATE TOP (1000000) dbo.Table01
  SET Column1 = RIGHT(Column2, 4)
 WHERE Column1 IS NULL
END
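
The same batching idea can also be sketched with a row-count check instead of the EXISTS test, so the table is not probed an extra time before every batch. This is only a sketch under assumptions: the variable names are made up for illustration, and the extra Column2 IS NOT NULL filter is my addition, because a row whose Column2 is NULL would keep Column1 as NULL and the loop would never end.

```sql
-- Sketch only: @BatchSize and @RowsAffected are illustrative names
DECLARE @BatchSize int = 1000000;
DECLARE @RowsAffected int = 1;

WHILE @RowsAffected > 0
BEGIN

 UPDATE TOP (@BatchSize) dbo.Table01
  SET Column1 = RIGHT(Column2, 4)
 WHERE Column1 IS NULL
  AND Column2 IS NOT NULL; -- assumption: skip rows that can never get a value

 -- Stop when a batch updates nothing
 SET @RowsAffected = @@ROWCOUNT;
END
```

Each UPDATE runs as its own transaction here, which keeps every batch small; a smaller batch size can be tried if a single batch still takes too long.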

Monday, October 16, 2017

SQL Server Always Encrypted - Serving applications using Azure Key Vault and Certificate Store

Here is my second video related to Always Encrypted. This video discusses the way of configuring CMK for serving multiple applications hosted in multiple places, using Local Key Store - Certificate Store and Centralized Key Store - Azure Key Vault.


Here is the code related to the video. I created a table in my Azure SQL Database; this is the code for it;

CREATE TABLE dbo.Message
(
 MessageId int identity(1,1) primary key
 , MessageCode char(5) COLLATE Latin1_General_BIN2 not null 
 , Message varchar(4000)  COLLATE Latin1_General_BIN2 not null
);
GO

INSERT INTO dbo.Message (MessageCode, Message)
 VALUES ('AA56B', 'This is a test message');
GO

CREATE OR ALTER PROCEDURE dbo.AddMessage @MessageCode char(5)
  , @Message varchar(4000)
AS
BEGIN

 INSERT INTO dbo.Message
  (MessageCode, Message)
 VALUES
  (@MessageCode, @Message);
END
GO


CREATE OR ALTER PROCEDURE dbo.GetMessage @MessageCode char(5)
   , @Message varchar(4000) OUTPUT
AS
BEGIN

 SELECT @Message = Message 
 FROM dbo.Message
 WHERE @MessageCode = MessageCode;
END
GO
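
The script above does not show how the columns get encrypted; that part is configured in the video. As a rough sketch of what an encrypted version of the table could look like in T-SQL, assuming a column encryption key already exists in the database: the table name dbo.MessageEncrypted and the key name CEK_Auto1 are hypothetical. MessageCode uses deterministic encryption because GetMessage compares it for equality, and that is also why the BIN2 collation is needed on it.

```sql
-- Sketch only: CEK_Auto1 is a hypothetical column encryption key
-- that must already exist (created from the CMK)
CREATE TABLE dbo.MessageEncrypted
(
 MessageId int identity(1,1) primary key
 , MessageCode char(5) COLLATE Latin1_General_BIN2
    ENCRYPTED WITH (COLUMN_ENCRYPTION_KEY = CEK_Auto1
     , ENCRYPTION_TYPE = DETERMINISTIC
     , ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256') not null
 , Message varchar(4000) COLLATE Latin1_General_BIN2
    ENCRYPTED WITH (COLUMN_ENCRYPTION_KEY = CEK_Auto1
     , ENCRYPTION_TYPE = RANDOMIZED
     , ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256') not null
);
GO
```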

And then, I have a .NET application that accesses Azure Key Vault to get the CMK and then inserts and retrieves records. Here is its code.

private static ClientCredential _clientCredential;

static void InitializeAzureKeyVaultProvider()
{
    _clientCredential = new ClientCredential("9e67ee1f-50ef-49d1-8ee0-0c48eaf4457b", "HjRqkx7BKLP7Lu+UYgTa5D/zCKAdxx3YITQ6fRrsQTQ=");

    SqlColumnEncryptionAzureKeyVaultProvider azureKeyVaultProvider =
        new SqlColumnEncryptionAzureKeyVaultProvider(GetToken);

    Dictionary<string, SqlColumnEncryptionKeyStoreProvider> providers =
        new Dictionary<string, SqlColumnEncryptionKeyStoreProvider>();

    providers.Add(SqlColumnEncryptionAzureKeyVaultProvider.ProviderName, azureKeyVaultProvider);
    SqlConnection.RegisterColumnEncryptionKeyStoreProviders(providers);
}

public async static Task<string> GetToken(string authority, string resource, string scope)
{
    var authContext = new AuthenticationContext(authority);
    AuthenticationResult result = await authContext.AcquireTokenAsync(resource, _clientCredential);

    if (result == null)
        throw new InvalidOperationException("Failed to obtain the access token");
    return result.AccessToken;
}

public Form1()
{
    InitializeComponent();
    InitializeAzureKeyVaultProvider();

}

private void buttonSubmit_Click(object sender, EventArgs e)
{
    SqlConnection connection = new SqlConnection("Password=Pa$$w0rd;Persist Security Info=True;User ID=Dinesh;Initial Catalog=Marketing;Data Source=dinesqlserver.database.windows.net;Column Encryption Setting = Enabled");
    SqlCommand command = new SqlCommand("AddMessage", connection);
    command.CommandType = CommandType.StoredProcedure;

    SqlParameter parameterMessageCode = new SqlParameter("MessageCode", SqlDbType.Char, 5);
    parameterMessageCode.Value = textBoxMessageCode.Text;

    SqlParameter parameterMessage = new SqlParameter("Message", SqlDbType.VarChar, 4000);
    parameterMessage.Value = textBoxMessage.Text;

    command.Parameters.Add(parameterMessageCode);
    command.Parameters.Add(parameterMessage);

    connection.Open();
    command.ExecuteScalar();
    connection.Close();

}

private void buttonGet_Click(object sender, EventArgs e)
{
    SqlConnection connection = new SqlConnection("Password=Pa$$w0rd;Persist Security Info=True;User ID=Dinesh;Initial Catalog=Marketing;Data Source=dinesqlserver.database.windows.net; Column Encryption Setting = Enabled");
    SqlCommand command = new SqlCommand("GetMessage", connection);
    command.CommandType = CommandType.StoredProcedure;

    SqlParameter parameterMessageCode = new SqlParameter("MessageCode", SqlDbType.Char, 5);
    parameterMessageCode.Value = textBoxMessageCode.Text;

    SqlParameter parameterMessage = new SqlParameter("Message", SqlDbType.VarChar, 4000);
    parameterMessage.Direction = ParameterDirection.Output;

    command.Parameters.Add(parameterMessageCode);
    command.Parameters.Add(parameterMessage);

    connection.Open();
    command.ExecuteScalar();
    connection.Close();

    MessageBox.Show(parameterMessage.Value.ToString());
}

You can find more information and standard codes related to Azure Key Vault usage with Always Encrypted at: https://docs.microsoft.com/en-us/azure/sql-database/sql-database-always-encrypted-azure-key-vault


Thursday, October 12, 2017

Introduction to Always Encrypted

Here is my second video, which is about the Always Encrypted feature. This video shows how Always Encrypted works, how it should be configured and how we can access and update data using a .NET application.


Monday, October 2, 2017

Introduction to Azure HDInsight

I have written a few posts on HDInsight and thought of making a series of videos on it as well. Here is the first one. In this video, I discuss what Hadoop is, what HDInsight is, the configuration of HDInsight, important settings related to HDInsight and how to access the cluster using the Azure Portal and PuTTY.

Watch and see!


Monday, September 18, 2017

Self-Service Business Intelligence with Power BI


Understanding Business Intelligence

There was a time when Business Intelligence (BI) was marked as a luxury facility limited to the higher management of the organization. It was used for making strategic business decisions and did not involve or support operational-level decision making. However, along with significant improvements in data accessibility and the demand for big data, data science and predictive analytics, BI has become very much part of the business vocabulary. Since modern technology now makes previously-impossible-to-access data, ever-increasing data and previously-unknown data available, businesses can use BI in every corner of the business for reacting fast to customers’ demands and changes in the market, and for competing with competitors.

The usage of Business Intelligence may vary from one organization to another. Organizations that work with a large number of transactions need to analyze continuous, never-ending transactions in near real time to understand trends, patterns and habits for better competitiveness. Seeing what we like to buy, enticing us to buy something, and offering high-demand items at a lower price as a bundled item in a supermarket or on a commercial web site are some examples of the usage of BI. Not only that, the demand from smart business users for data mashups, dashboards and analytical reports over traditional production and formal reports is another scenario where we see the usage of BI. This indicates the increasing adoption of BI and how it helps to run operations and survive in a competitive market.

The implementation of traditional BI (or Corporate BI) is an art. It involves multiple steps and stages and requires multiple skills and expertise. Although a BI project is considered and treated as a business solution rather than a technical/software solution, it is completely an IT-driven solution that requires major implementations such as ETL, a relational data warehouse and a multi-dimensional data warehouse, along with many other tiny components. This means an enormous cost. The complexity of the solution, the time it takes and the financial support it needs make the implementation tough and lead to failures, but it is still required and the demand is high. This need pushed us to another type of implementation called Self-Service Business Intelligence.

Self-Service Business Intelligence

It is all about empowering business users with rich and fully-fledged equipment for satisfying their own data and analytical requirements. It is not something new, though the term was not used. Self-Service BI appeared when the spreadsheet was introduced and business users started manipulating data with it for various analyses. Self-Service BI supports data extraction from sources, transformations based on business rules, creating presentations and performing analytics. Old-fashioned Self-Service BI was limited and required some technical knowledge, but the availability of modern out-of-the-box solutions has enriched the facilities and functionalities, increasing the trend toward Self-Service BI.

BI needs data models. A data model can be considered a consistent view (or visual representation) of data elements along with their relationships to the business. However, data models created by developers are not exactly the models required for BI. The data model created by a developer is a relational model that breaks entities into multiple parts, but the model created for BI has real-world entities, giving meaning to the data. This is called a Semantic Model. Once it is created, users can use it easily, as it is a self-descriptive model, for performing data analytics, creating reports using multiple visualizations and creating dashboards that represent key information of the organization. Modern Self-Service BI tools support creating models, hence smart business users can create models as per their requirements without help from the IT department.

Microsoft Power BI


Microsoft offers a few products for supporting Self-Service BI. Microsoft Office Excel is the most widely-used spreadsheet software and, with the recent addition of four power tools (Power Pivot, Power Query, Power View and Power Map), it has become one of the key tools for performing Self-Service BI. In addition to that, Microsoft Reporting Services (specifically Report Builder) and SharePoint Services allow users to perform some Self-Service BI operations.

Microsoft Power BI is the latest from Microsoft; a suite of business analytical tools. This addresses almost all Self-Service BI needs, starting from gathering data, preparation for analysis and presenting and analyzing data as required.

Power BI comes in two flavors: Power BI Desktop and Power BI Service. Power BI Desktop is a stand-alone tool that allows us to connect with hundreds of data sources, internal or external, structured or unstructured, model the loaded data as per the requirements, apply transformations to adjust data and create reports with stunning visuals. Power BI Service can hold reports created by individuals as a global repository, allowing us to share reports with others, create reports using shared/uploaded datasets and create personalized dashboards.

Business users who love Excel might see this as another tool that offers the same Excel functionalities, but it offers much more than Excel in terms of BI. Of course, you might see some functionalities that are available in Excel but not in Power BI. The biggest issue we see related to this is entering data manually. Excel supports entering data, providing many facilities for data entry, but Power BI support for it is limited. Considering these facts, there can be certain scenarios where Power BI cannot be used in place of Excel, but in most cases, considering Self-Service BI, it is possible. The transition is not difficult; it is straightforward as Power BI easily cooperates with Excel in many ways. This article gives a great introduction to using Power BI within Excel: Introduction to PowerPivot and Power BI.

In order to perform data analysis or create reports, an appropriate data model with necessary data should be created. With traditional BI, relational and OLAP data warehouses fulfil this requirement but creating the same with Self-Service BI is a challenge. Power BI makes this possible by allowing us to connect with multiple data sources using queries and bring them together in the data model. Regardless of the type of data sources, Power BI supports creating relationships between sources with their original structures. Not only that, transformations on loaded data, creating calculated tables, columns and hierarchies are possible with Power BI data models. 

What about ETL? How is it handled with Self-Service BI or Power BI? This is the most complex and time-consuming task with traditional BI and its major component is transformation. Although some transformations can be done in the Power BI data model, advanced transformations have to be done in Query Editor, which comprises more sophisticated tools for transformations.

Selecting the best visual for presenting data and configuring it appropriately is the key of visualization. Power BI has plenty of visuals that can be used for creating reports. In addition to the visuals given, there are many other visuals available free, created and published by community supporters. If a specific visual is required, it can be downloaded and added easily.

Visualization is done with Power BI reports. Once they are created, they can be published to Power BI Service. Shared reports can be viewed via the Power BI Service portal and native mobile apps. Not only that, the Power BI Service portal allows users to create dashboards, pinning visuals from multiple reports to a single dashboard, extending Self-Service BI.

On some occasions, the model has already been created by the IT department, either as a relational model or as an OLAP model. When the model is available, users do not need to create a model for creating reports; they can connect directly with the model and consume data for creating reports. This reduces the space consumed, as the user does not need to import data, and it offers real-time BI. Power BI supports making direct connections to certain relational data sources using the DirectQuery method and to OLAP data sources using the Live Connection method.

Power BI is becoming the best Self-Service BI tool in the market. Microsoft is working hard on this, it can be clearly witnessed with monthly releases of Power BI Desktop and weekly upgrades of Power BI Service. If you have not started working with Power BI yet, register at http://www.powerbi.com and start using it. You will see Business Intelligence like never before.


Sunday, September 17, 2017

SQL Server 2016 - CREATE/ALTER PROCEDURE must be the first statement in a query batch

Everyone knows that SQL Server started supporting the CREATE OR ALTER statement with SQL Server 2016 and that it can be used with a few objects. However, with my new installation of SQL Server 2016, I noticed this error, which I was not supposed to see with 2016.


What could be the reason? I could not immediately realize the issue and kept wondering what went wrong. It should work, as this is SQL Server 2016.

After a few minutes, I realized the issue;


It is all about the version I used. This feature was introduced with SQL Server 2016 Service Pack 1 (SP1), and I had not installed SP1 on my new instance. You may experience the same, hence I made this post to note it :).
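
A quick way to confirm this is to check the build and service pack level before relying on CREATE OR ALTER. The sketch below uses SERVERPROPERTY for that; the procedure dbo.GetServerTime is only a made-up example for testing the statement.

```sql
-- ProductLevel shows RTM on an instance without a service pack
-- and SP1 (or later) once the service pack is installed
SELECT SERVERPROPERTY('ProductVersion') AS ProductVersion
 , SERVERPROPERTY('ProductLevel') AS ProductLevel;
GO

-- Hypothetical procedure, used only to test the statement;
-- on SQL Server 2016 without SP1 this batch fails
CREATE OR ALTER PROCEDURE dbo.GetServerTime
AS
BEGIN

 SELECT GETDATE() AS ServerTime;
END
GO
```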

Wednesday, September 13, 2017

SQL Server TRY_CONVERT returns an error instead of NULL

Can this happen? The functions TRY_CONVERT and TRY_CAST have been designed for handling errors, and they should return NULL when there is an error.


But, it is not always true.


What is the reason? Although it returns NULL when an error occurs with the conversion, that does not happen if the conversion is not a permitted one. The above code tries to convert an integer into a uniqueidentifier, which is not permitted, hence it fails with an error.
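
The two cases can be reproduced with a pair of statements like the ones below. This is a minimal sketch; the literal values are just examples.

```sql
-- The conversion varchar -> int is permitted, only the value is invalid:
-- TRY_CONVERT returns NULL
SELECT TRY_CONVERT(int, 'abc') AS Result;
GO

-- The conversion int -> uniqueidentifier is not permitted at all:
-- the statement fails with an error instead of returning NULL
SELECT TRY_CONVERT(uniqueidentifier, 1) AS Result;
GO
```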

Tuesday, September 12, 2017

Run Hive Queries using Visual Studio

Once an HDInsight cluster is configured, we generally use either the portal dashboard (powered by Ambari) or a tool like PuTTY for executing queries against the loaded data. Although they are not exactly developer tools, or in other words, not an IDE, we had to use them because we did not have many options. However, now we can use the IDE we have been using for years for connecting with HDInsight and executing various types of queries such as Hive, Pig and U-SQL. It is Visual Studio.

Let's see how we can use Visual Studio for accessing HDInsight.

Making Visual Studio ready for HDInsight

In order to work with HDInsight using Visual Studio, you need to install a few tools on Visual Studio. Here are the supported versions;
  • Visual Studio 2013 Community/Professional/Premium/Ultimate with Update 4
  • Visual Studio 2015 any edition
  • Visual Studio 2017 any edition
You need to make sure that you have installed the Azure SDK on your Visual Studio. Click here for downloading the Web Platform Installer and make sure the following are installed;


This installs Microsoft Azure Data Lake Tools for Visual Studio as well, make sure it is installed.


Now your Visual Studio is ready for accessing HDInsight. 

Connecting with HDInsight

The good thing is, you can connect with your cluster even without creating a project. However, once the SDK is installed, you can see new templates under Azure Data Lake: HIVE (HDInsight), Pig (HDInsight), Storm (HDInsight) and USQL (ADLA), and the HIVE template can be used for creating a project.


The project creates one hql file for you and you can use it for executing your Hive queries. In addition to that, you can open Server Explorer (View menu -> Server Explorer) and expand Azure (or connect to your Azure account and then expand it) to see all components related to Azure.


As you see, you can see all databases, internal and external tables, views and columns. Not only that, by right-clicking the cluster, you can open a window for writing a query or viewing jobs. Here is the screen when I use the first option, Write a Hive Query.


Did you notice IntelliSense? Yes, it supports almost all metadata, hence it is really easy to write a query.

Executing Queries

If you need to see records in tables without limiting data with predicates or constructing the query with additional functions, you can simply right-click on the table in Server Explorer and select View Top 100 Rows.


If you need to construct a query, then use the above method for opening a window and write the query. There are two ways of executing the code: Batch and Interactive. Batch mode does not give you the result immediately, but you will be able to see or download it once the submitted job is completed. If you use Interactive mode, the result is shown in a way similar to an SSMS result.


If you use Batch mode, you can see the way the job is getting executed. Once the job is completed, you can click on Job Output for seeing or downloading the output.


As you see, there is no graphical interface to see the job execution. Visual Studio will show the job execution using a graphical interface only when the job is executed by the Tez engine. Remember, HDInsight uses the Tez engine by default to execute Hive queries, but simpler queries are executed using the MapReduce engine.

See this query that has some computation;


Can we create a table with this IDE?
Yes, it is possible. You can right-click on your database in Azure Server Explorer and select the Create Table menu item.


Let's talk more about this in later posts.

Monday, September 11, 2017

SQL Server - Merging Partitions - Does it physically move data from a data file to another?

Partitioning is a common practice when maintaining a large number of records in tables, regardless of the database type: whether the database is an OLTP database or a data warehouse based on dimensional modeling. Partitioning requires a Partition Function that describes the boundary values and the number of partitions required, and a Partition Scheme that assigns file groups to partitions.

Here are few posts I have made on partitioning;

Once a table is partitioned, based on the workloads, we add new boundary values for introducing new partitions to the table and we remove boundary values for combining partitions (or removing partitions). These two operations are done with two options of the ALTER PARTITION FUNCTION statement: SPLIT and MERGE. Now the question is, when we remove a partition using MERGE, if partitions are distributed across different file groups (meaning different data files), does SQL Server move data from one file to another? Do we really have to consider it?

Let's make a database and do a simple test. Let's create a database called AdventureWorksTest and add multiple file groups and data files.

-- Creating the test database
USE master;
GO

CREATE DATABASE AdventureWorksTest
GO

-- Add multiple file group and files
-- files are distributed
-- Add filegroup 1
ALTER DATABASE AdventureWorksTest
ADD FILEGROUP FileGroup1;
GO
ALTER DATABASE AdventureWorksTest
ADD FILE 
(
    NAME = AdventureWorksTest_Data_01,
    FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\AdventureWorksTest_Data_01.ndf',
    SIZE = 5MB,
    MAXSIZE = 100MB,
    FILEGROWTH = 5MB
)
TO FILEGROUP FileGroup1;

-- Add filegroup 2
ALTER DATABASE AdventureWorksTest
ADD FILEGROUP FileGroup2;
GO
ALTER DATABASE AdventureWorksTest
ADD FILE 
(
    NAME = AdventureWorksTest_Data_02,
    FILENAME = 'C:\Databases\AdventureWorksTest_Data_02.ndf',
    SIZE = 5MB,
    MAXSIZE = 100MB,
    FILEGROWTH = 5MB
)
TO FILEGROUP FileGroup2;

-- Add filegroup 3
ALTER DATABASE AdventureWorksTest
ADD FILEGROUP FileGroup3;
GO
ALTER DATABASE AdventureWorksTest
ADD FILE 
(
    NAME = AdventureWorksTest_Data_03,
    FILENAME = 'D:\Databases\AdventureWorksTest_Data_03.ndf',
    SIZE = 5MB,
    MAXSIZE = 100MB,
    FILEGROWTH = 5MB
)
TO FILEGROUP FileGroup3;

-- Add filegroup 4
ALTER DATABASE AdventureWorksTest
ADD FILEGROUP FileGroup4;
GO
ALTER DATABASE AdventureWorksTest
ADD FILE 
(
    NAME = AdventureWorksTest_Data_04,
    FILENAME = 'E:\Databases\AdventureWorksTest_Data_04.ndf',
    SIZE = 5MB,
    MAXSIZE = 100MB,
    FILEGROWTH = 5MB
)
TO FILEGROUP FileGroup4;
GO

As you see, we have five file groups now.


Let's create a partitioned table and load it using the AdventureWorks2014 Sales.SalesOrderHeader table. Note that the code below creates a Partition Function with 2014, 2015, 2016 and 2017 as boundary values, and it creates the Partition Scheme with multiple file groups.

-- Creating partition function and scheme
USE AdventureWorksTest;
GO

CREATE PARTITION FUNCTION pf_Year (datetime) 
AS RANGE right
FOR VALUES ('2014-01-01 00:00:00.000', '2015-01-01 00:00:00.000',  '2016-01-01 00:00:00.000',  '2017-01-01 00:00:00.000');
GO

CREATE PARTITION SCHEME ps_Year 
AS PARTITION pf_Year 
TO ([PRIMARY], FileGroup1, FileGroup2, FileGroup3, FileGroup4);
GO



-- Creating a partitioned table and loading from AdventureWorks2014 database
-- Note that sales dates of AdventureWorks2014 database have been updated for year 2014 to 2017
CREATE TABLE Sales
(
 SalesOrderID int NOT NULL,
 RevisionNumber tinyint NOT NULL DEFAULT (0),
 OrderDate datetime NOT NULL DEFAULT GETDATE(),
 DueDate datetime NOT NULL,
 ShipDate datetime NULL,
 [Status] tinyint NOT NULL DEFAULT(1),
 SubTotal money NOT NULL DEFAULT(0.00),
 TaxAmt money NOT NULL DEFAULT (0.00),
 Freight money NOT NULL DEFAULT (0.00),
 TotalDue money,
 Comment nvarchar(128) NULL,
 rowguid uniqueidentifier ROWGUIDCOL NOT NULL DEFAULT NEWID(),
 ModifiedDate datetime NOT NULL DEFAULT GETDATE()
) 
ON ps_Year(OrderDate);
GO

INSERT INTO Sales
  (SalesOrderID, RevisionNumber, OrderDate, DueDate, ShipDate, [Status]
  , SubTotal, TaxAmt, Freight, TotalDue, Comment, rowguid, ModifiedDate)
SELECT SalesOrderID, RevisionNumber, OrderDate, DueDate, ShipDate, [Status]
  , SubTotal, TaxAmt, Freight, TotalDue, Comment, rowguid, ModifiedDate
FROM AdventureWorks2014.Sales.SalesOrderHeader;

-- Checking records;
SELECT  $PARTITION.pf_Year(OrderDate) PartitionNo, *
FROM Sales;

Here is the result of the query.


Here is another query to see how data is distributed with each file group.

-- Get the record count for each partition/file group
SELECT t.name AS TableName,
  p.partition_number PartitionNumber,
  COALESCE(f.name, d.name) FileGroupName, 
  SUM(p.rows) NumberOfRecords,
  SUM(a.total_pages) NumberOfDataPages
FROM sys.tables AS t
 INNER JOIN sys.indexes i 
  ON i.object_id = t.object_id
 INNER JOIN sys.partitions p 
  ON p.object_id = t.object_id 
   AND p.index_id = i.index_id
 INNER JOIN sys.allocation_units a 
  ON a.container_id = p.partition_id
 LEFT OUTER JOIN sys.filegroups f 
  ON f.data_space_id = i.data_space_id
 LEFT OUTER JOIN sys.destination_data_spaces AS dds 
  ON dds.partition_scheme_id = i.data_space_id 
   AND dds.destination_id = p.partition_number
 LEFT OUTER JOIN sys.filegroups d 
  ON d.data_space_id = dds.data_space_id
WHERE t.[type] = 'U' AND i.index_id IN (0, 1) AND t.name LIKE 'Sales'
GROUP BY t.name, p.partition_number, COALESCE(f.name, d.name) 
ORDER BY t.name, p.partition_number;


If I illustrate the same in a different manner;


This is how the data is distributed in my system now. If I try to delete a data file, for example AdventureWorksTest_Data_03, which belongs to FileGroup3, SQL Server will not allow me to do so as it holds data.


If I decide to combine 2015 and 2016, in other words, if I want to remove the 4th partition that holds 2016 data, what will happen? Let's do it and see. The code below merges the partitions;

ALTER PARTITION FUNCTION pf_Year()
MERGE RANGE ('2016-01-01 00:00:00.000');

Once merged, let me run the same query to see how the data is partitioned now.


As you see, FileGroup3 is no longer used by the table and the number of records in FileGroup2 has increased. The number of partitions, which was 5, is now 4. This means that the data that was maintained in FileGroup3 (or AdventureWorksTest_Data_03) has been moved to FileGroup2 (or AdventureWorksTest_Data_02). This is what has happened;


This clearly shows that data is moved when partitions are merged. Okay, do we really have to consider this? Yes, we have to. When the table is small, we do not see any performance issue, but if the table is large, moving data from one file to another will surely take time. Hence, think twice before merging, specifically when the partitions are distributed across multiple files.
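
One way to estimate how much data a merge would move is to count the records in each partition beforehand. The query below is a small sketch using the $PARTITION function with the pf_Year function and the Sales table created above; the column aliases are my own.

```sql
-- Number of records per partition; the partition that disappears
-- with the MERGE is the one whose rows have to move
SELECT $PARTITION.pf_Year(OrderDate) AS PartitionNo
 , COUNT(*) AS NumberOfRecords
FROM dbo.Sales
GROUP BY $PARTITION.pf_Year(OrderDate)
ORDER BY PartitionNo;
```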

** Now, if you need to, you can delete the AdventureWorksTest_Data_03 file; SQL Server will allow it as the file is empty.
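
Removing the file could look like the statement below. It is a sketch that assumes the merge above has already been done and that nothing else has been placed in FileGroup3; if the file still holds data, SQL Server rejects the statement.

```sql
USE AdventureWorksTest;
GO

-- Works only when the file is empty
ALTER DATABASE AdventureWorksTest
REMOVE FILE AdventureWorksTest_Data_03;
GO
```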

Sunday, September 10, 2017

How to refer files in HDInsight - Azure Storage using different ways

If you have started working with Big Data, you surely need to check Microsoft's support for it via the Azure platform: the HDInsight service. HDInsight allows you to create a Hadoop environment within a few minutes, and it can be scaled out or in at any time based on your requirements. I have written a few posts on this; you can have a look at them using the following links;


In order to work with data loaded to HDInsight, or Hadoop, data files have to be referred to using a supported syntax. There are multiple ways of referring to files in the storage with HDFS. Here are the ways;

Fully qualified path with wasb(s) protocol

This is the most accurate and correct way of referring to files in the storage. Here is the pattern;

wasb://container_name@storage_account_name.blob.core.windows.net/file_path

Here is an example using PuTTY, connecting with HDInsight and reading a file (processed with Hive) that exists there. My container name is dinesqlashdinsight and the storage account name is dinesqlasstorage. The file path is data/cleanlog/000000_0 (this is a Hive table in fact).


Connecting with the default container

If your files are in the default container, you can skip the container name and storage name as follows;

wasb:///file_path

Note the three slashes. They are required when you do not mention the container name.


Connecting using Hadoop/Linux/Unix native ways

Generally, when you work with Hadoop using Linux/Unix, you refer to files without the protocol. Azure HDInsight supports the same and we can refer to files using that syntax.

file_path


Do I need double quotes for my paths?

They are required when you have some odd characters like the equal (=) sign in your path. See the example below. I try to read a data file that exists in the cluster and the path has equal signs, hence the path is enclosed in double quotes.



Friday, September 8, 2017

SQL Server - Dropping Primary Key will drop the Clustered Index?

Everyone knows that SQL Server creates a Clustered Index when we add a Primary Key if no Clustered Index already exists on the table. It adds a Non-Clustered Index for the Primary Key if we have already added a Clustered Index. However, they are two different objects; one is a Constraint and the other is an Index. What if I drop one object? Will it drop the other as well?

Let's make a test and see. If I create a table Customer like below, making Customer Key as the Primary Key;


As you see, it will create both Key and the Index.


Now if I drop either one, it will drop the other one as well. For example, if I drop the Primary Key, it will drop the Index as well.

If I need to make sure that it does not happen, I can create them separately, first create the Clustered Index on CustomerKey and then make the CustomerKey as the Primary Key. However, that will add another index specifically for the Primary Key.


The reason for the above behavior is that a Primary Key needs an Index. It is always associated with an index; therefore, if one is dropped, the associated one is dropped as well.
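
The first test can be reproduced with a script like the one below. It is a sketch: the Name column and the constraint name PK_Customer are assumptions, as the original table definition is not shown here as text.

```sql
-- Hypothetical version of the Customer table
CREATE TABLE dbo.Customer
(
 CustomerKey int NOT NULL
 , Name varchar(100) NOT NULL
 , CONSTRAINT PK_Customer PRIMARY KEY (CustomerKey)
);
GO

-- Lists one clustered index that enforces the Primary Key
SELECT name, type_desc, is_primary_key
FROM sys.indexes
WHERE object_id = OBJECT_ID('dbo.Customer');
GO

-- Dropping the constraint drops the index too
ALTER TABLE dbo.Customer
DROP CONSTRAINT PK_Customer;
GO

-- Only the heap entry remains for the table
SELECT name, type_desc, is_primary_key
FROM sys.indexes
WHERE object_id = OBJECT_ID('dbo.Customer');
GO
```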


Thursday, September 7, 2017

Dimension with just one attribute - data warehouse design

Have you ever come across a situation where you have to design a dimension that has no attributes except the one that explains the dimension itself? I am sure that you have experienced such dimensions and you might have created a table in the relational data warehouse for maintaining it. But is it the best way of managing such dimensions?

Assume that you have a fact table that maintains Issues and it is linked with multiple dimensions such as Date, Warehouse and Customer. And you have identified Issue Number as another dimension as there is a requirement for analyzing claims by issue number. You may design your data warehouse as below.


As you see, there are no other identified attributes for Issue Number dimension. Do we have an alternate way of designing it?

This is where we use a dimension type called a Degenerate Dimension. Whenever we see such a dimension and we know for sure that it has only one attribute and there will always be a unique value for each transaction in the fact, it is always better to add it to the fact table itself rather than maintaining another table. A dimension that is maintained in the fact table is called a Degenerate Dimension.
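
In a relational data warehouse, that design could be sketched as below. All table and column names here are hypothetical, including the measures; the point is only that IssueNumber sits in the fact table without a dimension table behind it.

```sql
-- Sketch only: hypothetical fact table for Issues
CREATE TABLE dbo.FactIssue
(
 DateKey int NOT NULL -- refers to the Date dimension
 , WarehouseKey int NOT NULL -- refers to the Warehouse dimension
 , CustomerKey int NOT NULL -- refers to the Customer dimension
 , IssueNumber varchar(20) NOT NULL -- degenerate dimension, no separate table
 , Quantity int NOT NULL
 , Amount money NOT NULL
);
GO

-- Analysis by issue number needs no join
SELECT IssueNumber, SUM(Amount) AS TotalAmount
FROM dbo.FactIssue
GROUP BY IssueNumber;
```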


Wednesday, September 6, 2017

How to Connect HDInsight using Putty

I have written many articles on the HDInsight service, which is an Azure service that allows us to provision an Apache Hadoop cluster and use it for various types of big data related implementations. If you need to see all my articles written on HDInsight, please use the link: http://dinesql.blogspot.com/search?q=hdinsight.

Although I have mainly used the Azure portal to demonstrate HDInsight (Ambari dashboard), many use a tool called PuTTY, which is a free implementation of SSH and Telnet for Windows and Unix platforms. It is a good tool to work with, specifically if you need to connect to the cluster remotely for navigating nodes or doing some data work against the cluster with sub projects like Hive. Remember, if you have configured your HDInsight cluster as a Linux-based cluster, you need an SSH client to connect to it remotely. Let's see the way of using it.

First of all, you need to download PuTTY, as Windows machines do not have it as a built-in tool (Linux and Mac OS computers have an SSH client built in). Here is the link for downloading PuTTY.


There is no installation with it. You can just use the putty.exe. However, you need a host name to connect. If you go to the portal and open your cluster blade, you can get the Host Name for connecting with a SSH Client by clicking Secure Shell (SSH) button.


Note that there are two host names, one for Windows and another for Linux, Unix and OS X users. Make sure that you copy the Windows one.


When you open PuTTY, you should see an input box called Host Name for entering the copied host name. Enter the copied value and make sure that the Connection Type is selected as SSH.

All you have to do now is click Open. This opens the standard window with the standard font and font size. If you need a larger font, you can go to Window - Appearance and change the settings as you want.

When you open this for the first time, you will get a security warning as below;


This says that the key related to the host is not cached locally, hence PuTTY cannot be sure whether we are connecting to the right host. Since we know that we are connecting to the right host, we can simply click Yes and continue.

Once connected, you will be prompted for the user id and password. Note that you need to use the ones you configured as SSHUser and Password when the cluster was created. It does not accept the HTTP User.


Enter the SSHUser and the password, and you will be connected to the default node.


Everything is ready now. As you see, first part of the last line is the user connected and hn0 is one of the head nodes. Last part of it shows the cluster name but we see it partially as it is lengthy. Now I can navigate, make folders, move files or connect with Hive without any issue.





Monday, August 14, 2017

Azure SQL Data Warehouse - Part II - Analyzing Created Databases

I started writing on this a long time back but had no way of completing it. My first post on this was Azure SQL Data Warehouse - Part I - Creating and Connecting and here is the second part of it.

The first post discussed Azure SQL Data Warehouse and how to create a server and database using the portal. As the next step, let's talk a bit about its architecture and then see how Azure SQL Data Warehouse maintains and processes data.

Control Node

Azure SQL Data Warehouse is a distributed database, which means that data is distributed across multiple locations. However, once the data warehouse is created, we connect to one component called the Control Node. It is not exactly a SQL Server database, but connecting to it looks and feels like connecting to a SQL Server database. The Control Node handles all communication and coordinates the computation. When we make a request to the data warehouse, the Control Node accepts it, determines how it should be distributed based on a divide-and-conquer approach, gets it processed and finally sends the result to us.

Compute Node

The Control Node gets the data processed in parallel using Compute Nodes. They are SQL Server databases and store all our records. Based on the number of DWUs configured, the data warehouse is set up with one or more Compute Nodes.

Distribution

Data related to the data warehouse is stored in Azure Blob Storage and distributed across multiple locations. Storage is independent of the Compute Nodes, hence the two can be operated and adjusted independently. These locations are called Distributions. The number of distributions for an Azure SQL Data Warehouse is fixed at sixty (60). These distributions are assigned dynamically to Compute Nodes, and when a query is executed, each distribution processes the data related to it. This is how the parallel execution happens.

If you need more compute power, you can increase the number of Compute Nodes by increasing DWUs. When the number of Compute Nodes changes, the number of distributions per Compute Node changes as well. For example, with four Compute Nodes, each one is assigned 15 of the 60 distributions.

Architecture

This image shows the architecture when you create an Azure SQL Data Warehouse with 100 DWU.


This image shows the architecture when you create an Azure SQL Data Warehouse with 400 DWU.


Let's create two databases and clarify this configuration.

I have discussed all steps related to server creation and database creation in my first post, Azure SQL Data Warehouse - Part I - Creating and Connecting, hence I am not going to discuss the same here. Note the image below. That is how I have created two databases for checking the configurations.


As you see, the first data warehouse is created using 100 DWUs and second one with 400 DWUs. Let's see how nodes have been created for these two databases. We can simply use sys.dm_pdw_nodes DMV for getting this information.

SELECT * FROM sys.dm_pdw_nodes;

SELECT type, Count(*) NofNodes
FROM sys.dm_pdw_nodes
GROUP BY type;

Result with the first data warehouse that is created using 100 DWUs.


Result with the second data warehouse that is created using 400 DWUs.


Note the second data warehouse. Since we used more DWUs, it has been created with four Compute Nodes, which gives better performance than the first one. Since this is a sample database and it has tables, we can check one of the tables and see how its data is spread across the distributions.

The following code shows the distributions created for one data warehouse. As mentioned above, it is always 60.
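
The original code was shown as a screenshot that is not reproduced here. A minimal sketch of such a check, assuming the sys.pdw_distributions catalog view and its pdw_node_id column are available in your data warehouse, could be:

```sql
-- Total number of distributions; expected to be 60 regardless of DWUs.
SELECT COUNT(*) AS NofDistributions
FROM sys.pdw_distributions;

-- Number of distributions assigned to each node.
SELECT pdw_node_id, COUNT(*) AS NofDistributions
FROM sys.pdw_distributions
GROUP BY pdw_node_id
ORDER BY pdw_node_id;
```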


Here is the code for seeing how rows of a table are spread across distributions in the second data warehouse. Note how the distributions are assigned to Compute Nodes.
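
The original code was shown as a screenshot that is not reproduced here. One way to get a similar result is DBCC PDW_SHOWSPACEUSED, which returns one row per distribution together with the node it belongs to. The table name dbo.FactInternetSales is an assumption based on the sample database; replace it with a table that exists in yours.

```sql
-- One row per distribution: row count, space used, node id and distribution id.
-- dbo.FactInternetSales is an assumed sample table name.
DBCC PDW_SHOWSPACEUSED("dbo.FactInternetSales");
```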


Records are distributed based on the design of the table. Azure SQL Data Warehouse uses two types of distributions: Round Robin and Hash. Let's talk about them in the next post.

Friday, August 4, 2017

bcp error - Invalid column type from bcp client for colid 1 - Solution

You might be experiencing the following error with bcp when loading text data into a SQL Server table;


It is possible to see this error when there is a data type mismatch between a column of the destination table and a field coming from the file. However, there can also be a situation where you have used a view for the bcp and the underlying table has been changed. If so, you may see the same error.

In my case, my destination table had an identity column, hence I had created a view for loading data via bcp. I started seeing this error after making some modifications to the table. Once the view was dropped and recreated, it worked without any issue.
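
The fix described above can be sketched as follows. All object names here (dbo.ImportTarget, dbo.vw_ImportTarget and the columns) are hypothetical and used only for illustration.

```sql
-- Hypothetical destination table with an identity column.
CREATE TABLE dbo.ImportTarget
(
 Id int identity(1,1) primary key
 , Code varchar(10) not null
 , Amount decimal(10,2) not null
);
GO

-- View used as the bcp target; it exposes everything except the identity column.
CREATE VIEW dbo.vw_ImportTarget
AS
SELECT Code, Amount FROM dbo.ImportTarget;
GO

-- After the table is modified, drop and recreate the view so that
-- its metadata matches the table again.
DROP VIEW dbo.vw_ImportTarget;
GO
CREATE VIEW dbo.vw_ImportTarget
AS
SELECT Code, Amount FROM dbo.ImportTarget;
GO
```

Refreshing the view metadata with sp_refreshview may be worth trying as well, but I have only confirmed the drop and recreate approach.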

Thursday, August 3, 2017

Script for creating DimTime Dimension Table and populating data

The posts I made on the date dimension table (Script for populating Date Dimension with Financial Year and Loading Data Script for Snowflake type Date Dimension) had a good response, and a few readers asked for the same for a Time related table. This is one of the old scripts I used for populating a Time Dimension. It maintains time down to the second, but if you need more, the script can easily be changed.

Here is the code for creating the table and populating data.

CREATE TABLE dbo.Time
(
 TimeKey int identity(1,1) 
 , Constraint pk_Time primary key (TimeKey)
 , [Hour] smallint not null
 , [24Hour] smallint not null
 , [Minute] smallint not null
 , [Second] smallint not null
 , [Time] char(10) not null
 , [24Time] char(8) not null
 , HourBucket char(9)
);
GO

DECLARE @Hour smallint = 0
DECLARE @Minute smallint = 0
DECLARE @Second smallint = 0

WHILE (@Hour < 24)
BEGIN

 WHILE (@Minute < 60)
 BEGIN

  WHILE (@Second < 60)
  BEGIN

   INSERT INTO dbo.[Time]
    ([Hour], [24Hour], [Minute], [Second], [Time], [24Time], HourBucket)
   VALUES
    (CASE WHEN @Hour > 12 THEN @Hour - 12 ELSE @Hour END
    , @Hour, @Minute, @Second
    -- 12-hour time: subtract 12 (not 1) after noon, and mark hour 12 as PM
    , REPLICATE('0', 2 - LEN(CASE WHEN @Hour > 12 THEN @Hour - 12 ELSE @Hour END)) + CONVERT(varchar(2), CASE WHEN @Hour > 12 THEN @Hour - 12 ELSE @Hour END)
     + ':' + REPLICATE('0', 2 - LEN(@Minute)) + CONVERT(varchar(2), @Minute)
     + ':' + REPLICATE('0', 2 - LEN(@Second)) + CONVERT(varchar(2), @Second)
     + CASE WHEN @Hour >= 12 THEN 'PM' ELSE 'AM' END
    , REPLICATE('0', 2 - LEN(@Hour)) + CONVERT(varchar(2), @Hour) 
     + ':' + REPLICATE('0', 2 - LEN(@Minute)) + CONVERT(varchar(2), @Minute) 
     + ':' + REPLICATE('0', 2 - LEN(@Second)) + CONVERT(varchar(2), @Second) 
    , '[' + REPLICATE('0', 2 - LEN(@Hour)) + CONVERT(varchar(2), @Hour) 
     + ' - ' 
     + REPLICATE('0', 2 - LEN(@Hour+1)) + CONVERT(varchar(2), @Hour+1) + ']')
   SET @Second += 1;
  END
  SET @Second = 0;
  SET @Minute += 1;
 END
 SET @Minute = 0;
 SET @Hour += 1;
END
GO

SELECT * FROM dbo.[Time];
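
Once the table is populated, a typical use is finding the TimeKey for a given point in time, for example when loading a fact table. The following is a minimal sketch; the variable @EventTime and its value are made up for illustration.

```sql
-- Hypothetical event time; in practice this comes from the source record.
DECLARE @EventTime datetime = '2017-08-03 14:05:09';

-- Style 108 formats the time part as hh:mi:ss (24-hour),
-- which matches the format stored in the [24Time] column.
SELECT t.TimeKey, t.[Time], t.[24Time], t.HourBucket
FROM dbo.[Time] AS t
WHERE t.[24Time] = CONVERT(char(8), @EventTime, 108);
```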


Wednesday, August 2, 2017

How to reset IDENTITY in Memory-Optimized Tables

Even though the SEQUENCE object is available with much more flexibility, we still use the IDENTITY property for adding sequence values to tables, specifically when we need to introduce a surrogate key. However, if you need the same with Memory-Optimized tables, you need to know certain things.

Can I add the IDENTITY property to Memory-Optimized tables?
Yes, it is possible, but it should always be IDENTITY(1,1). You cannot use different values for seed and increment; both must be set to 1.

CREATE TABLE dbo.MemoryOptimizedTable
(
 Id int IDENTITY(1,1) NOT NULL primary key nonclustered,
 CurrentDate datetime NULL default (getdate())
)WITH ( MEMORY_OPTIMIZED = ON , DURABILITY = SCHEMA_AND_DATA )
GO

INSERT INTO dbo.MemoryOptimizedTable (CurrentDate) VALUES (DEFAULT);
GO 5

Can I reset the IDENTITY seed using DBCC CHECKIDENT?
No, this is not supported. If you need to reset the IDENTITY seed, the only way is to insert a new value explicitly after turning SET IDENTITY_INSERT on. As shown in the example, once the record with value 100 is inserted, the next value is set to 100+1.

SET IDENTITY_INSERT dbo.MemoryOptimizedTable ON
GO
INSERT INTO dbo.MemoryOptimizedTable (Id, CurrentDate) VALUES (100, DEFAULT);
GO
SET IDENTITY_INSERT dbo.MemoryOptimizedTable OFF
GO

INSERT INTO dbo.MemoryOptimizedTable (CurrentDate) VALUES (DEFAULT);
GO

SELECT * FROM dbo.MemoryOptimizedTable;


What if I insert a lower value explicitly?
It is possible as long as it does not violate any rules. Look at the code below. It inserts a record with value 50 explicitly, but that does not mean that the seed is reset to 50+1. This does not happen because the last generated value is greater than 50+1. Therefore, the value of the next record is 102, not 51.

SET IDENTITY_INSERT dbo.MemoryOptimizedTable ON
GO
INSERT INTO dbo.MemoryOptimizedTable (Id, CurrentDate) VALUES (50, DEFAULT);
GO
SET IDENTITY_INSERT dbo.MemoryOptimizedTable OFF
GO

INSERT INTO dbo.MemoryOptimizedTable (CurrentDate) VALUES (DEFAULT);
GO

SELECT * FROM dbo.MemoryOptimizedTable;


Tuesday, August 1, 2017

Script for populating Date Dimension with Financial Year

Some time ago I published the code I used for creating date dimension tables and populating data, but it did not handle financial year related date elements. Generally we handle financial year elements with a Date table in a Star Schema, not in a Snowflake Schema. In most cases (or in the old days), we used an OLAP Data Warehouse as the source for reporting, and Analysis Services can easily handle financial dates with a Star Schema implemented in the Relational Data Warehouse. However, in the modern world, we try to build the solution using just the Relational Data Warehouse without Multi-dimensional models (or using Tabular Models), so I thought I would make another script for handling financial dates with a Snowflake Schema.

With this script, four tables are created: Year, Quarter, Month and Date (DimYear, DimQuarter, DimMonth and DimDate). If you need to see elements related to the financial year, the link between Year and Quarter should be set with Year.YearKey and Quarter.FinancialYearKey. For calendar date elements, the link should be between Year.YearKey and Quarter.YearKey.

This is how you make a search for financial dates;
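
The original query was shown as a screenshot that is not reproduced here. A minimal sketch using the tables created by the script in this post, with an assumed filter of financial year 2017 (the year in which the financial year starts), could be:

```sql
-- Financial view: DimYear is joined to DimQuarter through FinancialYearKey.
SELECT y.[Year] AS FinancialYear
 , q.FinancialQuarter, q.FinancialYearQuarter
 , m.FinancialMonthNumber, m.MonthName
 , d.[Date]
FROM dbo.DimYear AS y
 INNER JOIN dbo.DimQuarter AS q
  ON q.FinancialYearKey = y.YearKey
 INNER JOIN dbo.DimMonth AS m
  ON m.QuarterKey = q.QuarterKey
 INNER JOIN dbo.DimDate AS d
  ON d.MonthKey = m.MonthKey
WHERE y.[Year] = 2017 -- assumed example value
ORDER BY d.[Date];
```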


This is how you make a search for Calendar dates;
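
The original query was shown as a screenshot that is not reproduced here. A minimal sketch using the tables created by the script in this post, with an assumed filter of calendar year 2017, could be:

```sql
-- Calendar view: DimYear is joined to DimQuarter through YearKey.
SELECT y.[Year] AS CalendarYear
 , q.[Quarter], q.QuarterName
 , m.MonthNumber, m.MonthName
 , d.[Date]
FROM dbo.DimYear AS y
 INNER JOIN dbo.DimQuarter AS q
  ON q.YearKey = y.YearKey
 INNER JOIN dbo.DimMonth AS m
  ON m.QuarterKey = q.QuarterKey
 INNER JOIN dbo.DimDate AS d
  ON d.MonthKey = m.MonthKey
WHERE y.[Year] = 2017 -- assumed example value
ORDER BY d.[Date];
```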


Here is the script for creating tables;

CREATE TABLE dbo.DimYear
(
 YearKey smallint 
 , Year smallint not null
 , Constraint pk_DimYear Primary Key (YearKey)
);
GO

CREATE TABLE dbo.DimQuarter
(
 QuarterKey smallint 
 , YearKey smallint not null
 , FinancialYearKey smallint not null
 , Quarter smallint not null
 , YearQuater varchar(20) not null
 , QuarterName varchar(20) not null
 , FinancialQuarter smallint not null
 , FinancialYearQuarter varchar(20) not null
 , Constraint pk_DimQuarter Primary Key (QuarterKey)
 , Constraint fk_DimQuarter_DimYear Foreign Key (YearKey)
  References dbo.DimYear (YearKey)
 , Constraint fk_DimQuarter_DimYear_Financial Foreign Key (FinancialYearKey)
  References dbo.DimYear (YearKey)
);
GO

CREATE TABLE dbo.DimMonth
(
 MonthKey int 
 , QuarterKey smallint not null
 , MonthNumber smallint not null
 , MonthName varchar(20) not null
 , YearMonth varchar(20) not null
 , MonthShortName char(3) not null
 , FinancialMonthNumber smallint not null
 , Constraint pk_DimMonth Primary Key (MonthKey)
 , Constraint fk_DimMonth_DimQuarter Foreign Key (QuarterKey)
  References dbo.DimQuarter (QuarterKey)
);
GO

CREATE TABLE dbo.DimDate
(
 DateKey int 
 , MonthKey int not null
 , Date date not null
 , WeekDayNumber smallint not null
 , WeekDayName varchar(20) not null
 , DayOfMonth smallint not null
 , DayOfYear smallint not null
 , IsWeekend bit not null
 , IsHoliday bit not null
 , WeekNumberOfYear smallint not null
 , Constraint pk_DimDate Primary Key (DateKey)
 , Constraint fk_DimDate_DimMonth Foreign Key (MonthKey)
  References dbo.DimMonth (MonthKey)
);
GO

Here is the script for populating dates;

SET DATEFIRST 1;

DECLARE @StartDate date = '1990-01-01';
DECLARE @EndDate date = '2019-12-31';
DECLARE @FinancialYearStartingQuarter smallint = 3; -- Starting from July, If it starts from April then 2

DECLARE @YearKey smallint;
DECLARE @Quarter smallint, @QuarterKey smallint;
DECLARE @Month int, @MonthKey int;
DECLARE @Date int, @DateKey int;

DECLARE @FinancialStartingMonth smallint;
SET @FinancialStartingMonth = CASE @FinancialYearStartingQuarter WHEN 1 THEN 1 WHEN 2 THEN 4 WHEN 3 THEN 7 WHEN 4 THEN 10 END

INSERT INTO dbo.DimYear 
 (YearKey, [Year]) 
VALUES 
 (YEAR(@StartDate) - 1, YEAR(@StartDate) - 1);


WHILE (@StartDate <= @EndDate)
BEGIN

 -- Inserting years
 SET @YearKey = YEAR(@StartDate);
 IF NOT EXISTS (SELECT * FROM dbo.DimYear WHERE YearKey = @YearKey)
  INSERT INTO dbo.DimYear (YearKey, [Year]) 
  VALUES 
  (@YearKey, @YearKey);


 -- Inserting quarters
 SET @QuarterKey = Convert(smallint, Convert(varchar(4), YEAR(@StartDate)) + Convert(varchar(1), DATEPART(q, @StartDate)))
 SET @Quarter = DATEPART(q, @StartDate);

 IF NOT EXISTS (SELECT * FROM dbo.DimQuarter WHERE QuarterKey = @QuarterKey)
 INSERT INTO dbo.DimQuarter 
  (QuarterKey, YearKey
  , FinancialYearKey
  , [Quarter], YearQuater, QuarterName
  , FinancialQuarter, FinancialYearQuarter) 
 VALUES (@QuarterKey, @YearKey
  , CASE WHEN @Quarter < @FinancialYearStartingQuarter THEN @YearKey -1 ELSE @YearKey END
  , @Quarter, Convert(varchar(4), YEAR(@StartDate)) + 'Q' + Convert(varchar(4), @Quarter)
  , CASE @Quarter WHEN 1 THEN 'First Quarter' WHEN 2 THEN 'Second Quarter' WHEN 3 THEN 'Third Quarter' ELSE 'Fourth Quarter' END
  -- Financial quarter and label derived from @FinancialYearStartingQuarter
  -- rather than being hard-coded for a July start
  , ((@Quarter - @FinancialYearStartingQuarter + 4) % 4) + 1
  , Convert(varchar(4), CASE WHEN @Quarter < @FinancialYearStartingQuarter THEN @YearKey - 1 ELSE @YearKey END)
   + 'Q' + Convert(varchar(1), ((@Quarter - @FinancialYearStartingQuarter + 4) % 4) + 1)
  );
  

 ---- Inserting months
 SET @MonthKey = Convert(int, Convert(varchar(4), YEAR(@StartDate)) + (CASE LEN(MONTH(@StartDate)) WHEN 1 THEN '0' ELSE '' END) + Convert(varchar(2), MONTH(@StartDate)));
 SET @Month = MONTH(@StartDate)
 IF NOT EXISTS (SELECT * FROM dbo.DimMonth WHERE MonthKey = @MonthKey)
  INSERT INTO dbo.DimMonth 
   (MonthKey, QuarterKey, MonthNumber, MonthName, YearMonth, MonthShortName
   , FinancialMonthNumber) 
  VALUES 
   (@MonthKey, @QuarterKey, @Month, DATENAME(MONTH, @StartDate)
   , Convert(varchar(4), YEAR(@StartDate)) + ' ' + LEFT(DATENAME(MONTH, @StartDate), 3)
   , LEFT(DATENAME(MONTH, @StartDate), 3)
   , CASE 
     WHEN @FinancialStartingMonth = 1 THEN @Month
     WHEN @FinancialStartingMonth = 4 AND @Month  < @FinancialStartingMonth THEN @Month + 9
     WHEN @FinancialStartingMonth = 4 AND @Month  >= @FinancialStartingMonth THEN @Month - 3
     WHEN @FinancialStartingMonth = 7 AND @Month  < @FinancialStartingMonth THEN @Month + 6
     WHEN @FinancialStartingMonth = 7 AND @Month  >= @FinancialStartingMonth THEN @Month - 6
     WHEN @FinancialStartingMonth = 10 AND @Month  < @FinancialStartingMonth THEN @Month + 3
     WHEN @FinancialStartingMonth = 10 AND @Month  >= @FinancialStartingMonth THEN @Month - 9
    END
   );
  
  ---- Inserting dates
  SET @DateKey = Convert(int, Convert(varchar(8), @StartDate, 112))
  IF NOT EXISTS (SELECT * FROM dbo.DimDate WHERE DateKey = @DateKey)
   INSERT INTO dbo.DimDate (DateKey, MonthKey, Date, WeekDayNumber, WeekDayName, DayOfMonth
    , DayOfYear, IsWeekend, IsHoliday, WeekNumberOfYear) 
   VALUES (@DateKey, @MonthKey, @StartDate, DatePart(WEEKDAY, @StartDate), DateName(WEEKDAY, @StartDate), DatePart(DAY, @StartDate)
    , DatePart(DAYOFYEAR, @StartDate), CASE WHEN DatePart(WEEKDAY, @StartDate) IN (6, 7) THEN 1 ELSE 0 END, 0
    , DatePart(WEEK, @StartDate));

  SET @StartDate = DATEADD(dd, 1, @StartDate);
END;
GO