Stairway to U-SQL Level 1: Introduction to U-SQL and Azure Data Lakes

Question

Stairway to U-SQL Level 1: Introduction to U-SQL and Azure Data Lakes

Viewing 15 posts - 16 through 30 (of 37 total)

You must be logged in to reply to this topic. Login to reply

Eric M Russell SSC Guru Points: 125255 More actions · Answer 1

So this basically just T-SQL with additional extensions for sourcing from or outputting to flat files and also perhaps analytical functions?

"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

Mike McQuillan SSCertifiable Points: 5997 More actions · Answer 2

Hi there Gerald

Unfortunately, at the moment there isn't an on-prem version of Data Lakes.

Your best bet is to ask Michael Rys (he's in charge of all this stuff) whether there are any plans for this. You can find him on Twitter at @mikedoesbigdata.

Regards,

Mike.

Mike McQuillan SSCertifiable Points: 5997 More actions · Answer 3

Hi there Eric

U-SQL looks like T-SQL on the surface, but it's a blend of both T-SQL and C#. The WHERE clauses are very different - you can't write WHERE x = 1, for instance, you write WHERE x == 1. This is a very simple example, here's another lifted straight from the article published today (22/06/16):

WHERE postcode.Substring(0, 3).ToLower() == "m12";

This is effectively the same as the T-SQL statement:

WHERE LOWER(LEFT(postcode, 3)) = 'm12'

The stairway series will continue to highlight the differences between U-SQL and T-SQL as it progresses.

Regards,

Mike.

John N Hick Ten Centuries Points: 1320 More actions · Answer 4

Great article, Mike. Much better than the Microsoft documentation - all examples are TBD. Can't wait for the rest of it! Five celestials from me.

Mike McQuillan SSCertifiable Points: 5997 More actions · Answer 5

Hi John,

Thanks for the kind words (and the vote of course)! Hope you continue to enjoy the series.

Regards,

Mike.

aphillips-912560 SSC Journeyman Points: 76 More actions · Answer 6

Just saw this in my inbox today. Very interesting! Keep up the good work.

Meagan Longoria SSC Enthusiast Points: 193 More actions · Answer 7

Great article. I used it to get started with my first data lake. But could you fix the typo in the title? "Sairway" is constantly distracting me now that I have noticed it. Thanks again for the helpful content!

Mike McQuillan SSCertifiable Points: 5997 More actions · Answer 8

Thanks to you both for the kind words, glad you are enjoying the series!

I too have noticed the mis-spelling of "Stairway", I will ask Steve if he can fix it.

Regards,

Mike.

David Schwantner-383412 SSC Enthusiast Points: 183 More actions · Answer 9

The link to the initial data file “Headcounts and household estimates for postcodes" didn't work, but I found it here: http://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates/datasets/2011censusheadcountsandhouseholdestimatesforpostcodesinenglandandwales

DS

Mike McQuillan SSCertifiable Points: 5997 More actions · Answer 10

Hi David

You're quite right - that will teach to use direct file URIs!

Thanks for digging out the updated link. I'll have a word to see if the article can be updated.

Regards,

Mike.

ashley-385902 Grasshopper Points: 23 More actions · Answer 11

I have a requirement where-in i need to process "CSV" files every week and store them to Data Lake / Warehouse. The only glitch is that the structure may differ every week. Has anyone worked with dynamism in U-SQL

Mike McQuillan SSCertifiable Points: 5997 More actions · Answer 12

Hi Ashley

Your scripts need to have an idea of the structure in the files. How many different types of format do you deal with? Is there something in each file which you can use to uniquely identify the format? If this is viable, one possible solution is:

1. Create separate directories in your Data Lake, one per file format.

2. Write a piece of code which inspects the file when it arrives, and deduces which format it contains.

3. The code can dump the file into the appropriate Data Lake directory.

4. Write a U-SQL script per directory to handle each directory's unique file format.

There are many ways of achieving this, this is just one possibility. Another option is to write something which transforms your various file formats into a common format, and then dump that up into the Data Lake.

Good luck!

Regards,

Mike.

xsevensinzx One Orange Chip Points: 25560 More actions · Answer 13

I have been reading your articles for awhile now. I saw you had some new ones up and wanted to place my question at the beginning of the series because it relates to the performance comparison to Azure Data Warehouse.

There is a lot of data lake, data store, data warehouse, and even data mart products out there. All of them are starting to mesh together to a point in understanding what is the difference between them all. Therefore, what is the benefit of using Azure Data Lake with Azure Data Warehouse or even Data Lake OVER using Data Warehouse?

Reading the articles, Azure Data Lake can load flat files and drop a schema on top of it. I can do this within a database, drop a schema on it, define a table, define a view, define a stored procedure, and even join table, views, and other data sources together. The same can be done in Azure Data Warehouse using T-SQL elements with the help of Polybase. Except in Data Warehouse, there is more of a SQL DB feel where you can fully manage it in management studio as well use T-SQL.

The thing that twists me up is that Azure Data Warehouse can break some common DW rules because it's a MPP system. I can load raw untouched data in Azure Data Warehouse fine. It's designed to handle large globs of data at ease without having to transform and aggregate it. As you mentioned in your article, that's clearly the difference between a data lake and a traditional data warehouse in a traditional SMP SQL DB.

But, why use Data Lake over Data Warehouse when they are built on similar concepts? What is data lake giving me that data warehouse can't give me? A different pricing model based on transactions versus hours? Does one out perform the other in certain use cases? That's what I struggle with the most here. I use data warehouse like a data lake because it feels like a data lake to a point where I can load large tables, create a columnstore, then aggregate it on read super easy in a very flat form to populate a star schema in a traditional SMP system.

(Note: when I say data warehouse, I'm referring to the specific Azure product Data Warehouse, not a traditional data warehouse.)

What's your thoughts?

Mike McQuillan SSCertifiable Points: 5997 More actions · Answer 14

xsevensinzx - Wednesday, August 30, 2017 7:10 AM
I have been reading your articles for awhile now. I saw you had some new ones up and wanted to place my question at the beginning of the series because it relates to the performance comparison to Azure Data Warehouse.
There is a lot of data lake, data store, data warehouse, and even data mart products out there. All of them are starting to mesh together to a point in understanding what is the difference between them all. Therefore, what is the benefit of using Azure Data Lake with Azure Data Warehouse or even Data Lake OVER using Data Warehouse?
Reading the articles, Azure Data Lake can load flat files and drop a schema on top of it. I can do this within a database, drop a schema on it, define a table, define a view, define a stored procedure, and even join table, views, and other data sources together. The same can be done in Azure Data Warehouse using T-SQL elements with the help of Polybase. Except in Data Warehouse, there is more of a SQL DB feel where you can fully manage it in management studio as well use T-SQL.
The thing that twists me up is that Azure Data Warehouse can break some common DW rules because it's a MPP system. I can load raw untouched data in Azure Data Warehouse fine. It's designed to handle large globs of data at ease without having to transform and aggregate it. As you mentioned in your article, that's clearly the difference between a data lake and a traditional data warehouse in a traditional SMP SQL DB.
But, why use Data Lake over Data Warehouse when they are built on similar concepts? What is data lake giving me that data warehouse can't give me? A different pricing model based on transactions versus hours? Does one out perform the other in certain use cases? That's what I struggle with the most here. I use data warehouse like a data lake because it feels like a data lake to a point where I can load large tables, create a columnstore, then aggregate it on read super easy in a very flat form to populate a star schema in a traditional SMP system.
(Note: when I say data warehouse, I'm referring to the specific Azure product Data Warehouse, not a traditional data warehouse.)
What's your thoughts?

Hi there

Thanks for your very thoughtful question. I completely agree that the continually growing number of warehousing solutions are difficult to sift through. Microsoft seems to be moving all of their solutions towards a common ground.

I am not convinced that the data lake is necessarily giving you something the Azure Data Warehouse cannot provide (certainly since the functionality to access the Data Lake Store data was implemented); rather, it seems like an alternative approach to the same problem. I suspect we wouldnâ€™t find major performance differences with the different approaches (if I find some time I might investigate this), unless we threw more compute power at say the data warehouse.

For me, it comes down to three things, one of which you mentioned:

1.     Cost â€“ storage, compute power, transactions, time and so on.
2.     Ease of use â€“ traditional SSAS needs structure to be defined, you can just write a quick script in U-SQL for a data lake (as you mentioned, this issue is eliminated in Azure Data Warehouse).
3.     The preferred programming language.

I think the language might be the key. Using Azure Data Warehouses with T-SQL wouldnâ€™t cause any problems for a SQL Server developer, but it would for the majority of application developers Iâ€™ve met over the years. I suspect thatâ€™s one of the reasons Microsoft has introduced U-SQL, to cater for app developers who have been asked to handle data in Azure. They can leverage their C# knowledge and it would be a fairly gentle learning curve for them.

Iâ€™ve really tried to avoid the usual answer of â€œit dependsâ€ here, but as with most things in computing that seems to be the correct answer â€“ it depends not only on the technology you prefer to use, but the language associated with that technology too. There isnâ€™t a wrong way of doing things, and if the Data Warehouse approach is working for you then thatâ€™s great (Iâ€™m sure itâ€™s working for a lot of other people too).

I have the next few articles in the series planned out but Iâ€™ll see if I can incorporate how Data Lakes compares with the Azure Data Warehouse in a future article.

Thanks again for the question!

Regards,
Mike.

xsevensinzx One Orange Chip Points: 25560 More actions · Answer 15

Yeah, that's what I was thinking - different ways of skinning the same cat.

I don't feel the traditional data warehouse has no use. From an enterprise standpoint, the traditional data warehouse is going to be a good option. In my case, I want more of a schema-on-read approach versus a schema-on-write. Azure Data Warehouse allows me to do that with still having a defined model, but a much wider and denomalized model without any solid relationships (i.e.: data -> table, data->table) because it's column based than row based. This is why I look at Azure Data Warehouse as more of a structured data lake / data store than traditional warehouse.

But I do wonder if it's just worth wild gong Hadoop with Hive tables or MongoDB or something like Azure Data Lake. Thus far, outside of the pricing model of Azure Data Lake, I think Azure Data Warehouse is a good option for me and a wonderful option for the business that is already heavily invested in the Microsoft stack. Maybe in time Azure Data Lake or something else will show different or maybe present itself in a way needing both to solve even more complex problems. 🙂

Thanks for the response.