Flying high on the Big Data hot-air

  • Hi Phil,

    I look forward to the replay of this editorial in 5 years and the reaction others have to it then.

    As to the comment now, you say that "For years, we in the database industry have struggled..."

    and this may well continue as long as we try to limit data strictly to a database. Data has been, and will continue to be, a combination of databases and other data. As you state, we have known that for a considerable amount of time.

    Also you indirectly state that we have had everything we need to deal with Big Data for some time. Note that with R and SQL Server we lack only two things. If we can get it all using R and SQL Server we must be able to use them to search all data generated by all users on all platforms, all email, memos, white papers, websites, and generated reports. And we have had that capability for some time?

    I know that there is hype, and I agree that we should point and advise, and we really need to get on board and get all we can out of this current wave. There we have complete agreement. But we also need to look very closely at what we have and determine if the tools we used yesterday are able to manage horizontally scaled data correctly and with a large enough sample to do basic analysis as well as meet the demands of investigations where every element of certain criteria is required to be presented. If the tools of yesterday cannot do it and those being developed today will also fall short, it might be good to get involved as you have in trying to scope and define the tools of the future.

    Thanks for getting this on the table, good to see and hear.

    Not all gray hairs are Dinosaurs!

  • PHYData DBA (7/30/2013)


    Phil Factor (7/30/2013)


    @wim.bekkens

    Thanks for that.

    For a simple example, take a look at this series that is now coming out on Simple-talk. It walks you through an example application that involves using R to report KPIs in a SQL Server database.

    Creating a Business Intelligence Dashboard with R and ASP.NET MVC: Part 1

    Creating a Business Intelligence Dashboard with R and ASP.NET MVC: Part 2

    Phill,

    R was cool but unless you are working at the NIS isn't it dated?

    When compared to some of the newer and highly maintained Graphical Stat Display tools such as the free Sigma Plot MySystat http://www.systat.com/MystatProducts.aspx R feels like an amber screen from the 80's.

    Don’t get me wrong. It was awesome and I used it.

    Now there are IMHO better tools that require less heavy lifting.

    If anyone is interested in a free large scale Database solution used by Netflix, Twitter, eBay, reddit, Cisco, etc... http://cassandra.apache.org/ 😎

    <sermon>

    For the Point and Click crowd, I suppose SAS/SPSS/MiniTab/Excel and the like feel "modern". In terms of implementing (cutting-edge) stats, R has no peer. And, it's taking market share from all of the closed-source alternatives. As I mentioned earlier, Oracle has followed in the footsteps of Postgres with its integration strategy. RStudio does a good job of integrating the necessary bits of R. No Point&Click as yet, however. And not likely, either. The momentum in R is toward R as programming language, rather than R as stat command language. Julia is the current front runner. As such, integrating into the sql/database engine, rather than variations on ODBC from RStudio/etc. is the way forward.

    </sermon>

  • RobertYoung (7/30/2013)


    <sermon>

    For the Point and Click crowd, I suppose SAS/SPSS/MiniTab/Excel and the like feel "modern". In terms of implementing (cutting-edge) stats, R has no peer. And, it's taking market share from all of the closed-source alternatives. As I mentioned earlier, Oracle has followed in the footsteps of Postgres with its integration strategy. RStudio does a good job of integrating the necessary bits of R. No Point&Click as yet, however. And not likely, either. The momentum in R is toward R as programming language, rather than R as stat command language. Julia is the current front runner. As such, integrating into the sql/database engine, rather than variations on ODBC from RStudio/etc. is the way forward.

    </sermon>

    I agree with a lot of what you say...

    edit -- I have to admit that R has come a long way. I have done some reading about its latest advances as its own programming language. Currently the creator of S, John Chambers, is working on the R team.

    http://en.wikipedia.org/wiki/R_(programming_language)

    I like SciLab for a GNU compatible library. It is based on Matlab syntax, so code reuse or porting from MatLab is easier. It is also very current and highly maintained.

    Julia does have great promise. The fact that it is a true AST makes it even better. Hopefully it soon becomes as embraced and adopted as S, R, Matlab, C, and Fortran. http://stats.stackexchange.com/questions/25672/does-julia-have-any-hope-of-sticking-in-the-statistical-community

  • Thank you, @Phil Factor, for the explanation of the R Language.

    Kindest Regards, Rod

  • In terms of implementing (cutting-edge) stats, R has no peer.

    The Python combo of pandas and numpy begs to disagree...

  • chrisn-585491 (7/30/2013)


    In terms of implementing (cutting-edge) stats, R has no peer.

    The Python combo of pandas and numpy begs to disagree...

    I expect they would. Dueling pistols at dawn?

  • Well put, Phil. I've been getting 'Big Data' emails in my inbox for years and not one of these emails actually had any real practical content in them.

    But we also need to look very closely at what we have and determine if the tools we used yesterday are able to manage horizontally scaled data correctly and with a large enough sample to do basic analysis as well as meet the demands of investigations where every element of certain criteria is required to be presented. If the tools of yesterday cannot do it and those being developed today will also fall short, it might be good to get involved as you have in trying to scope and define the tools of the future.

    With 'Big Data', it isn't the quantity of data, it is the way you deal with it. After all, Nate Silver's spectacular predictions of the result of the US election were done on a spreadsheet. The first 'big data' applications I came across were in analyzing the test data for automobiles, in the days of Sybase and DECs. The trick is that, once you've extracted the 'juice' from the raw data, you archive it if you can/need, or else throw it away. You usually don't let it anywhere near the database doing the analysis. Think hierarchically. Nowadays we have StreamInsight and Hadoop to do the low-level drudgery for us. Sure it is easier, but these techniques were developed in the eighties when engineering industries were awash with test data and had to develop ways of dealing with it.
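
    The "extract the juice, then discard the raw data" idea can be sketched roughly as below. This is only an illustration under assumptions: the `Summary` class, the `summarize` function, the sensor names, and the readings are all hypothetical and are not taken from any of the systems mentioned. The point is that only a small per-sensor summary survives the pass over the raw feed, and that summary is what would be loaded into the analysis database.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Tuple


@dataclass
class Summary:
    """Running aggregate for one sensor; raw values are never stored."""
    count: int = 0
    total: float = 0.0
    peak: float = float("-inf")

    def add(self, value: float) -> None:
        self.count += 1
        self.total += value
        self.peak = max(self.peak, value)

    @property
    def mean(self) -> float:
        return self.total / self.count


def summarize(readings: Iterable[Tuple[str, float]]) -> Dict[str, Summary]:
    """Reduce a stream of (sensor, value) pairs to one Summary per sensor.

    The stream is consumed once; each reading is dropped after it has
    been folded into its sensor's running aggregate.
    """
    summaries: Dict[str, Summary] = {}
    for sensor, value in readings:
        summaries.setdefault(sensor, Summary()).add(value)
    return summaries


def raw_feed() -> Iterator[Tuple[str, float]]:
    """Hypothetical stand-in for a large file or stream of test-rig data."""
    yield from [
        ("engine_temp", 90.0),
        ("engine_temp", 110.0),
        ("rpm", 3000.0),
        ("engine_temp", 100.0),
        ("rpm", 5000.0),
    ]


# Only these few summary rows would reach the analysis database.
for sensor, s in sorted(summarize(raw_feed()).items()):
    print(sensor, s.count, s.mean, s.peak)
```

    Because `summarize` accepts any iterable, the same reduction could be fed from a generator that reads a file line by line, so memory use depends on the number of sensors rather than the number of readings.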

    Best wishes,
    Phil Factor

  • Nice read! There does seem to be an overwhelming amount of buzz about big data, and I'm skeptical: how much of it is pure hype, just the newest thing for sales-driven organisations to make a quick buck on?

    But we shouldn't let the marketers have all the fun 🙂

  • Phil Factor (7/31/2013)


    But we also need to look very closely at what we have and determine if the tools we used yesterday are able to manage horizontally scaled data correctly and with a large enough sample to do basic analysis as well as meet the demands of investigations where every element of certain criteria is required to be presented. If the tools of yesterday cannot do it and those being developed today will also fall short, it might be good to get involved as you have in trying to scope and define the tools of the future.

    With 'Big Data', it isn't the quantity of data, it is the way you deal with it. After all, Nate Silver's spectacular predictions of the result of the US election were done on a spreadsheet. The first 'big data' applications I came across were in analyzing the test data for automobiles, in the days of Sybase and DECs. The trick is that, once you've extracted the 'juice' from the raw data, you archive it if you can/need, or else throw it away. You usually don't let it anywhere near the database doing the analysis. Think hierarchically. Nowadays we have StreamInsight and Hadoop to do the low-level drudgery for us. Sure it is easier, but these techniques were developed in the eighties when engineering industries were awash with test data and had to develop ways of dealing with it.

    What Silver did is technically a meta-analysis, which in English means an analysis of other analyses. Drug companies have been pushing such "analysis" for some time, as a way to get approval when a bunch of trials miss their endpoints. In some cases, the aggregated data can be massaged into success.

    In Silver's case, he took existing *sample* survey data, added his own weights (based upon his political acuity), and spit out a new number. There was nothing big about his data.
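
    The kind of reweighting described above can be sketched in a few lines. This is a minimal illustration, not Silver's actual model: the `Poll` class, the `weighted_share` function, the analyst weights, and every figure below are hypothetical. It assumes each poll contributes in proportion to its sample size multiplied by a subjective weight, which is one simple way to combine existing sample surveys into a single number.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Poll:
    share: float       # candidate's reported share, as a fraction (0.52 = 52%)
    sample_size: int   # number of respondents in the survey
    weight: float      # analyst's subjective trust in the pollster (assumed)


def weighted_share(polls: Sequence[Poll]) -> float:
    """Combine polls into one estimate, weighting by sample size times trust."""
    total_weight = sum(p.weight * p.sample_size for p in polls)
    if total_weight <= 0:
        raise ValueError("at least one poll must carry positive weight")
    weighted_sum = sum(p.share * p.weight * p.sample_size for p in polls)
    return weighted_sum / total_weight


# Three made-up polls: the data set is tiny, which is the point being argued.
polls = [
    Poll(share=0.50, sample_size=1000, weight=1.0),
    Poll(share=0.52, sample_size=500, weight=0.5),
    Poll(share=0.48, sample_size=2000, weight=0.25),
]

estimate = weighted_share(polls)
print(f"combined estimate: {estimate:.4f}")
```

    Note that the largest poll here ends up with less influence than the first one because of its low trust weight, which shows how much the output depends on the analyst's chosen weights rather than on the volume of data.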

    Big data, whether commercial, political, or military, is an effort to find some needle in some haystack. The NSA vacuuming of communications, which it's been doing from its inception, is just the latest public airing. The big data practitioners argue that the speed and volume of data make separating the wheat from the chaff with legacy RDBMS impractical. They also argue that transactional guarantees aren't meaningful, so let's all use some cobbled together file spec. And so it was.

    On the stats side, similar conflict. Big data is census data, and thus the source of merely descriptive statistics (which aren't technically statistics in the first place). All that inferential machinery (frequentist or Bayes) doesn't matter much if you're not sampling, or endlessly re-sampling for the Bayesians.

    So, spend gobs of money looking for a needle or two. Sometimes the expected value of the needle is worth the cost. Mostly, not so much. But lemmings will be lemmings.
