Stairway to U-SQL Level 1: Introduction to U-SQL and Azure Data Lakes

  • So this basically just T-SQL with additional extensions for sourcing from or outputting to flat files and also perhaps analytical functions?

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Hi there Gerald

    Unfortunately, at the moment there isn't an on-prem version of Data Lakes.

    Your best bet is to ask Michael Rys (he's in charge of all this stuff) whether there are any plans for this. You can find him on Twitter at @mikedoesbigdata.

    Regards,

    Mike.

  • Hi there Eric

    U-SQL looks like T-SQL on the surface, but it's a blend of both T-SQL and C#. The WHERE clauses are very different - you can't write WHERE x = 1, for instance, you write WHERE x == 1. This is a very simple example, here's another lifted straight from the article published today (22/06/16):

    WHERE postcode.Substring(0, 3).ToLower() == "m12";

    This is effectively the same as the T-SQL statement:

    WHERE LOWER(LEFT(postcode, 3)) = 'm12'

    The stairway series will continue to highlight the differences between U-SQL and T-SQL as it progresses.

    Regards,

    Mike.

  • Great article, Mike. Much better than the Microsoft documentation - all examples are TBD. Can't wait for the rest of it! Five celestials from me.

  • Hi John,

    Thanks for the kind words (and the vote of course)! Hope you continue to enjoy the series.

    Regards,

    Mike.

  • Just saw this in my inbox today. Very interesting! Keep up the good work.

  • Great article. I used it to get started with my first data lake. But could you fix the typo in the title? "Sairway" is constantly distracting me now that I have noticed it. Thanks again for the helpful content!

  • Thanks to you both for the kind words, glad you are enjoying the series!

    I too have noticed the mis-spelling of "Stairway", I will ask Steve if he can fix it.

    Regards,

    Mike.

  • The link to the initial data file “Headcounts and household estimates for postcodes" didn't work, but I found it here: http://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates/datasets/2011censusheadcountsandhouseholdestimatesforpostcodesinenglandandwales

    DS

  • Hi David

    You're quite right - that will teach to use direct file URIs!

    Thanks for digging out the updated link. I'll have a word to see if the article can be updated.

    Regards,

    Mike.

  • I have a requirement where-in i need to process "CSV" files every week and store them to Data Lake / Warehouse. The only glitch is that the structure may differ every week. Has anyone worked with dynamism in U-SQL

  • Hi Ashley

    Your scripts need to have an idea of the structure in the files. How many different types of format do you deal with? Is there something in each file which you can use to uniquely identify the format? If this is viable, one possible solution is:

    1. Create separate directories in your Data Lake, one per file format.

    2. Write a piece of code which inspects the file when it arrives, and deduces which format it contains.

    3. The code can dump the file into the appropriate Data Lake directory.

    4. Write a U-SQL script per directory to handle each directory's unique file format.

    There are many ways of achieving this, this is just one possibility. Another option is to write something which transforms your various file formats into a common format, and then dump that up into the Data Lake.

    Good luck!

    Regards,

    Mike.

  • I have been reading your articles for awhile now. I saw you had some new ones up and wanted to place my question at the beginning of the series because it relates to the performance comparison to Azure Data Warehouse.

    There is a lot of data lake, data store, data warehouse, and even data mart products out there. All of them are starting to mesh together to a point in understanding what is the difference between them all. Therefore, what is the benefit of using Azure Data Lake with Azure Data Warehouse or even Data Lake OVER using Data Warehouse?

    Reading the articles, Azure Data Lake can load flat files and drop a schema on top of it. I can do this within a database, drop a schema on it, define a table, define a view, define a stored procedure, and even join table, views, and other data sources together. The same can be done in Azure Data Warehouse using T-SQL elements with the help of Polybase. Except in Data Warehouse, there is more of a SQL DB feel where you can fully manage it in management studio as well use T-SQL.

    The thing that twists me up is that Azure Data Warehouse can break some common DW rules because it's a MPP system. I can load raw untouched data in Azure Data Warehouse fine. It's designed to handle large globs of data at ease without having to transform and aggregate it. As you mentioned in your article, that's clearly the difference between a data lake and a traditional data warehouse in a traditional SMP SQL DB.

    But, why use Data Lake over Data Warehouse when they are built on similar concepts? What is data lake giving me that data warehouse can't give me? A different pricing model based on transactions versus hours? Does one out perform the other in certain use cases? That's what I struggle with the most here. I use data warehouse like a data lake because it feels like a data lake to a point where I can load large tables, create a columnstore, then aggregate it on read super easy in a very flat form to populate a star schema in a traditional SMP system.

    (Note: when I say data warehouse, I'm referring to the specific Azure product Data Warehouse, not a traditional data warehouse.)

    What's your thoughts?

  • xsevensinzx - Wednesday, August 30, 2017 7:10 AM

    I have been reading your articles for awhile now. I saw you had some new ones up and wanted to place my question at the beginning of the series because it relates to the performance comparison to Azure Data Warehouse.

    There is a lot of data lake, data store, data warehouse, and even data mart products out there. All of them are starting to mesh together to a point in understanding what is the difference between them all. Therefore, what is the benefit of using Azure Data Lake with Azure Data Warehouse or even Data Lake OVER using Data Warehouse?

    Reading the articles, Azure Data Lake can load flat files and drop a schema on top of it. I can do this within a database, drop a schema on it, define a table, define a view, define a stored procedure, and even join table, views, and other data sources together. The same can be done in Azure Data Warehouse using T-SQL elements with the help of Polybase. Except in Data Warehouse, there is more of a SQL DB feel where you can fully manage it in management studio as well use T-SQL.

    The thing that twists me up is that Azure Data Warehouse can break some common DW rules because it's a MPP system. I can load raw untouched data in Azure Data Warehouse fine. It's designed to handle large globs of data at ease without having to transform and aggregate it. As you mentioned in your article, that's clearly the difference between a data lake and a traditional data warehouse in a traditional SMP SQL DB.

    But, why use Data Lake over Data Warehouse when they are built on similar concepts? What is data lake giving me that data warehouse can't give me? A different pricing model based on transactions versus hours? Does one out perform the other in certain use cases? That's what I struggle with the most here. I use data warehouse like a data lake because it feels like a data lake to a point where I can load large tables, create a columnstore, then aggregate it on read super easy in a very flat form to populate a star schema in a traditional SMP system.

    (Note: when I say data warehouse, I'm referring to the specific Azure product Data Warehouse, not a traditional data warehouse.)

    What's your thoughts?

    Hi there

    Thanks for your very thoughtful question. I completely agree that the continually growing number of warehousing solutions are difficult to sift through. Microsoft seems to be moving all of their solutions towards a common ground.

    I am not convinced that the data lake is necessarily giving you something the Azure Data Warehouse cannot provide (certainly since the functionality to access the Data Lake Store data was implemented); rather, it seems like an alternative approach to the same problem. I suspect we wouldn’t find major performance differences with the different approaches (if I find some time I might investigate this), unless we threw more compute power at say the data warehouse.

    For me, it comes down to three things, one of which you mentioned:

    1.     Cost – storage, compute power, transactions, time and so on.
    2.     Ease of use – traditional SSAS needs structure to be defined, you can just write a quick script in U-SQL for a data lake (as you mentioned, this issue is eliminated in Azure Data Warehouse).
    3.     The preferred programming language.

    I think the language might be the key. Using Azure Data Warehouses with T-SQL wouldn’t cause any problems for a SQL Server developer, but it would for the majority of application developers I’ve met over the years. I suspect that’s one of the reasons Microsoft has introduced U-SQL, to cater for app developers who have been asked to handle data in Azure. They can leverage their C# knowledge and it would be a fairly gentle learning curve for them.

    I’ve really tried to avoid the usual answer of “it depends†here, but as with most things in computing that seems to be the correct answer – it depends not only on the technology you prefer to use, but the language associated with that technology too. There isn’t a wrong way of doing things, and if the Data Warehouse approach is working for you then that’s great (I’m sure it’s working for a lot of other people too).

    I have the next few articles in the series planned out but I’ll see if I can incorporate how Data Lakes compares with the Azure Data Warehouse in a future article.

    Thanks again for the question!

    Regards,
    Mike.

  • Yeah, that's what I was thinking - different ways of skinning the same cat.

    I don't feel the traditional data warehouse has no use. From an enterprise standpoint, the traditional data warehouse is going to be a good option. In my case, I want more of a schema-on-read approach versus a schema-on-write. Azure Data Warehouse allows me to do that with still having a defined model, but a much wider and denomalized model without any solid relationships (i.e.: data -> table, data->table) because it's column based than row based. This is why I look at Azure Data Warehouse as more of a structured data lake / data store than traditional warehouse.

    But I do wonder if it's just worth wild gong Hadoop with Hive tables or MongoDB or something like Azure Data Lake. Thus far, outside of the pricing model of Azure Data Lake, I think Azure Data Warehouse is a good option for me and a wonderful option for the business that is already heavily invested in the Microsoft stack. Maybe in time Azure Data Lake or something else will show different or maybe present itself in a way needing both to solve even more complex problems. 🙂

    Thanks for the response.

Viewing 15 posts - 16 through 30 (of 37 total)

You must be logged in to reply to this topic. Login to reply