Stairway to U-SQL Level 5: Databases and Tables in Azure Data Lakes with U-SQL

  • This is essentially Hive for HDFS. You define a table object over a collection of documents in the data store so you have a consistent structure you can interact with for that collection of documents. The same concept applies to Polybase with external tables: you define a table object that points to a data source outside the database so you can pull it into the database. However, I believe that in this particular instance, with ADLA tables, you cannot do that with Polybase for Azure Data Warehouse. Last I saw, it was still in development.

    If you want connectivity to this from SSMS, the best approach is likely SQL Server or Azure Data Warehouse with Polybase enabled. You can then set up External Tables that point to the collection of documents in your data store. This lets you log into the database from SSMS and query that External Table directly against the documents themselves in your Azure Data Lake Store (or Blob storage if you don't use Data Lake Store). This does not require you to have Azure Data Lake Analytics enabled.
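    As a rough illustration, that Polybase setup looks something like the following T-SQL sketch. Every name here (the data source, credential, file format, table, and columns) is a hypothetical placeholder, and it assumes a database-scoped credential with rights to the Data Lake Store account already exists:

```sql
-- Sketch only: AdlsCredential is assumed to be an existing
-- database-scoped credential for the Data Lake Store account.
CREATE EXTERNAL DATA SOURCE AdlsSource
WITH (
    TYPE = HADOOP,
    LOCATION = 'adl://myaccount.azuredatalakestore.net',
    CREDENTIAL = AdlsCredential
);

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2)
);

-- The external table points at the folder of documents; querying it
-- from SSMS reads the files directly, with no ADLA involved.
CREATE EXTERNAL TABLE dbo.ExternalOrders
(
    OrderId     INT,
    CustomerId  INT,
    Amount      DECIMAL(18, 2)
)
WITH (
    LOCATION = '/orders/',
    DATA_SOURCE = AdlsSource,
    FILE_FORMAT = CsvFormat
);
```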

    I have yet to use tables in ADLA, for the simple fact that you can just use U-SQL to output the structure you want to a new document and then feed that new document into a database. But that doesn't mean it isn't useful to set up that table object and just read the table instead. It's essentially like a SQL View over the data store, or over your database if you're using external tables with Polybase.
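    For anyone curious what that table object looks like on the ADLA side, here is a minimal U-SQL sketch (the database, table, and file names are invented for illustration):

```sql
// Sketch only: hypothetical database, table, and input path.
CREATE DATABASE IF NOT EXISTS SalesDb;
USE DATABASE SalesDb;

// U-SQL managed tables require a clustered index
// and a distribution scheme.
CREATE TABLE IF NOT EXISTS dbo.Orders
(
    OrderId int,
    CustomerId int,
    OrderDate DateTime,
    INDEX idx_Orders CLUSTERED (OrderId ASC) DISTRIBUTED BY HASH (OrderId)
);

// Read the raw document once, then persist it into the table.
@raw =
    EXTRACT OrderId int,
            CustomerId int,
            OrderDate DateTime
    FROM "/input/orders.csv"
    USING Extractors.Csv(skipFirstNRows : 1);

INSERT INTO dbo.Orders
SELECT OrderId, CustomerId, OrderDate FROM @raw;
```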

  • Hey Mike, tried to use the files hosted here http://mcqtech.com/articles/usql/files/usql_files.zip - they look like html inside the csv's in the zip.  Are these files broken?

    NVM - saw the post about github.  Thanks!

  • Hi MP

    Glad you managed to grab the files. Sounds like I need to update my Web site! I might just update the article to point at github. Thanks for taking the time to post the message.

    Regards,
    Mike.

  • Thanks, yeah a link back to github couldn't hurt.  Anyway, great articles - they're helping me catch up.  Keep them coming.

  • naunihal - Tuesday, March 13, 2018 11:38 PM

    How does a database in ADLA compare to ADW?
    ADW is limited to 240TB; does an ADLA DB have any limitations?

    It's been a while since I last commented on this article, and since then I've been using ADLA more and more. I'll toss out some feedback if you are still around.

    I use an ADW that has data sourced from both Blob Storage and Azure Data Lake Store. The data is processed using Azure Data Lake Analytics (U-SQL).

    The speed of ADLA compared to ADW is pretty amazing. I can take files with 3 billion rows and JOIN them to related dimensional files extremely fast in ADLA, without having to worry about indexes, statistics or anything of that nature. I'm pretty amazed by the speed compared to doing the same in ADW, where I have to scale up to, say, 20+ USD an hour just to run the query.
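    In sketch form, a job like that is just a couple of EXTRACT statements, a join and an OUTPUT (file names and columns here are invented for illustration; note that U-SQL uses the C# `==` operator in join conditions):

```sql
// Sketch only: hypothetical paths and columns.
@facts =
    EXTRACT CustomerId int, Amount decimal
    FROM "/input/facts.csv"
    USING Extractors.Csv();

@dims =
    EXTRACT CustomerId int, Region string
    FROM "/input/customers.csv"
    USING Extractors.Csv();

@joined =
    SELECT d.Region, SUM(f.Amount) AS TotalAmount
    FROM @facts AS f
         INNER JOIN @dims AS d
         ON f.CustomerId == d.CustomerId
    GROUP BY d.Region;

OUTPUT @joined
TO "/output/sales_by_region.csv"
USING Outputters.Csv(outputHeader : true);
```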

    However, for me at least, this is not a replacement for ADW. The benefit here is super-fast computation. But when you need to secure, maintain and expose the final output, that's where ADW comes into play the most. This is why I feed all my final data to ADW: I aggregate it down using ADLA and then expose it through ADW, where it can be fully managed and has scaling power if needed. It works out great when you then need to start sourcing, say, Power BI, cubes, or data marts located in Elastic SQL instances.

    The other thing to mention here is that ADLA, much like Spark and much like MapReduce, relies on object-oriented programming to really shine: Java for MapReduce, Python for Spark, and .NET (C#) for U-SQL. It becomes a real pain when you have to convert a simple date string or handle records with double quotes (although ADLA has extractors that handle quoting, it still sometimes fails once it splits the data up). Then there are things as simple as a JOIN on a >= condition that aren't supported, among other basic approaches you might take in T-SQL. You will often spend more of your time figuring out the .NET approach or writing custom .NET functions to handle what you need than anything else, which is a pain for most DBA types.
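    To make those two pain points concrete, here is roughly what the workarounds look like in U-SQL (the rowsets and columns are hypothetical, and @raw and @periods are assumed to have been EXTRACTed earlier): a date string goes through .NET's DateTime.ParseExact, and a range join has to be rewritten as a CROSS JOIN with the inequality pushed into the WHERE clause, since JOIN ... ON only accepts equality comparisons.

```sql
// Sketch only: hypothetical rowsets (@raw, @periods) and columns.

// 1) Converting a date string means dropping into C#:
@typed =
    SELECT DateTime.ParseExact(OrderDateString, "yyyyMMdd",
               System.Globalization.CultureInfo.InvariantCulture) AS OrderDate,
           Amount
    FROM @raw;

// 2) A >= join is not allowed in ON, so use CROSS JOIN + WHERE:
@inRange =
    SELECT p.PeriodName, t.Amount
    FROM @typed AS t
         CROSS JOIN @periods AS p
    WHERE t.OrderDate >= p.StartDate AND t.OrderDate < p.EndDate;
```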

    But hey, you really can't beat not having to define a schema, normalize/denormalize, index, add statistics or any of that jive to get to the data. You upload, you compute, you're done. You can do this on insanely large datasets with ease. It's freaking amazing. I'm really surprised when I talk to DBAs who bash data lakes and their usage. It's insanely fast and can save you a lot of time IF USED RIGHT. It doesn't replace the data warehouse, but it surely helps get that data into its final resting place pretty damn fast, and if you take a simplistic approach to U-SQL like I have, pretty damn easy too.

  • mpsmith 63921 - Monday, June 11, 2018 1:53 PM


    I was able to get the files by choosing the raw option then saving the files down a few months back.

    Cheers,

    Sean :-D

  • xsevensinzx - Tuesday, June 12, 2018 7:24 AM


    Hi xsevensinzx

    Good to hear from you again! I actually agree with you - ADLA isn't a replacement for a warehouse. But it's a brilliant assistant! You're using it in the same way I've seen it used everywhere else so far - as a supplement to a relational or data warehouse system.

    I think it's a great bit of technology, and I agree with you about people who diss the Data Lake. It's just another tool in the data arsenal, and when a quick conversion of a large data set is needed, there isn't much out there to touch it.

    Hope you carry on enjoying U-SQL and the Data Lake! Thanks for taking the time to report back.

    Regards,
    Mike.

