SELECT DISTINCT on Long Table

  • I am going to implement this as a separate table. I don't want the additional overhead of the index on the table.

    But I do have a question: Is this a common way to solve this problem? I ask because I really need to solve this for other fields, getting a distinct list from a 10 million record table where there are way more than 8 distinct values. Indexing on these fields would help by reducing the width of the records in the index scan, but indexing on these fields would still really be an unacceptably slow solution for speeding up these SELECT DISTINCTs, especially since the data only changes once per week.

  • Well - SELECT DISTINCT tends to be a costly operation, so doing it over and over and over again is something to be avoided, IF it can be avoided. Having a "lookup table" is a good way to do that, especially if the values don't change at all, or like in your scenarion, don't change for a reasonable timeframe.

    In that light - yes - it's a common practice. Treat it like a long-term table - if there are going to be lots of distinct values in the list - index it, etc...

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • With data that doesn't update constantly (like several dozen/hundred/thousand/million changes per hour), this is a pretty common solution.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared (4/21/2008)


    Since an indexed view is, fundamentally, a separate table, that would work. I'd go with a separate table, because, to me, that's easier to document. But really, they're the same solution and either is just as good as the other.

    But there is a fundamental difference too. Data in indexed view is updated automatically, while a separate table must be updated either explicitly (inaccurate) or from triggers on source table (costly). In this case at least.

  • Hello,

    I don´t know if you are familiar, but you can try to create a partition function to store the data in partitions. I not every percent shore but i think it could be a way.

    RGDS

    Rui

  • 1) partitioning the table won't help, since the query at hand will still have to read all the data. Well, it could help a tiny bit if the partitions were on separate spindles and there were sufficient CPU cores to handle the async IO.

    2) 10M row scan of a non-clustered index on a date column is still going to take a finite amount of time. Best you can do here is have sufficient RAM and sufficient IO so the query is hopefully answered from cache and pulled up from fast IO if not.

    3) Caching the data either via an off-line-'manually'-maintained table is another solution, as is indexed view. Either could be an 'instantaneous' retrieval but using them would depend on the data modification situation on the base table.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Regarding partitioning, if the partitions are by date, would not the Q.O. have the smarts

    to return the 8 dates 1 from each partition very quickly? If it knows that there

    is one unique date per partition, it should only need to perform 8 reads - at least

    that's what I would do if I were the optimizer 🙂

    Still, then there's partition maintenance routines that would be required.

    Larry

    *** Not yet a DBA - not only a developer ***

  • Larry, you are correct. the constraint used for the date partitioning would allow the optimizer to know that the only possible date for each of the 8 partitions would be the single date allowed by the optimizer. Hmm, actually, it may not. Unless you did a constraint that ensured exactly 00:00:00 for the time for example it would still have to read all the rows because the time is included with the datetime and smalldatetime datatypes. Perhaps a char representation would work too.

    I misread the initial post and thought there were 8 different date FIELDS to choose from, not 8 day values for a single date field. :Whistling:

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

Viewing 8 posts - 16 through 22 (of 22 total)

You must be logged in to reply to this topic. Login to reply