Joining on view with a group by or distinct causes slow query

  • Hi,

    I've been designing a query that links a phone_calls table to our asset system so that employees can see their mobile bills every month but it's running into some performance issues that are really slowing down the query and I'm hoping someone can explain the behavior for me as I'm still a relative newbie.

    Basically we have something like the following tables/views:

    VIEW - "asset_assignment" - Lists a start and end date that an employee had a phone_number

    e.g. (asset_id (int), start_date (date), end_date (date), contact_id (int), handset_number (varchar)) (~1K rows)

    TABLE - "phone_calls" - Lists every phone call made on a bill (~80K rows per bill)

    e.g. (call_id (PK clustered int), bill_id (FK int), call_date (date), handset_number (varchar), number_dialled (varchar))

    VIEW - "phone_numbers" - lists all phone numbers for entities on our system i.e. people, projects, organisations etc so we can join the number_dialled and see if the call was business (not null) or personal (null after join on this view) (~1K rows)

    e.g. (phone_number (varchar), entity_id (int), entity_identifier (int))

    So i have created a view "VW_Phone_Usage" to join all this up so i can have my pages produce different reports but one part of the joining is causing the execution times of a minimal query on the view from 1s to 8s which grows exponentially meaning a report can easily timeout.

    The view is as follows (simplified version):

    select

    pu.*, -- I dont really do this but didn't want to type all the columns in

    apn.*, -- same as above

    ecm.* -- same as above

    from phone_calls pu

    join asset_assignment apn on pu.handset_number = apn.handset_number and pu.call_date between apn.start_date and apn.end_date

    left outer join ( select phone_number

    from phone_numbers

    group by phone_number

    ) ecm on pu.number_dialled= ecm.phone_number

    Now when i run a query for one bill (~80K rows/calls) and then for one employee (cuts down to ~220 rows/calls) the query is <1s which is fine. Its the last part thats causing the issue, i.e. joining on the "phone_numbers" view to see if the number dialled is in our system.

    If I do the left outer join on the "phone_numbers" view without the group by the query joins fast and is still ~1s but the issue is the "phone_numbers" view is built from many tables and so could possibly contain duplicates e.g. a contact has the same phone number as the organisation he works for.

    Anyway I need to basically either join on the "phone_numbers" view using group by or a select distinct so that the ~220 rows that have been matched so far still only have one match on the number_dialled column.

    Now for some reason I don't understand, using the distinct or group by in the join causes the query time to jump from 1 to 8 secs. However this only happens if I return a column from the joined view, i.e. in the above view if I didn't have the ecm.* in the select statement then group by / distinct causes no performance drops but selecting any columns from the ecm joined view means it's slow again.

    All I can see from the query plan is that the majority of the extra time (77%) is being taken up by a clustered scan on the "phone_calls" table. I'm not sure though why that is as the select distinct / group by query on it's own runs in ~1s for ~1K rows but surely it should only be matching that against the ~220 rows/calls being returned. So I'm basically wondering if someone can explain to me what's going on with the distinct/group by as opposed to the straight join that causes the query to run 8 times slower and if there's a better way round this?

    Any help/comments appreciated.

    Thanks

  • Divide'n'Conquer... it seems like you're trying to do everything in a single query and that can really make a mess of things. Write a proc that will get the small subset of "key" info, store it in a temp table, and join your phone view to that.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Hi Jeff,

    Yeah I am wanting everything in one view I'm afraid. The reason being is that the application I have to use (3rd Party) has the ability to create custom searches by pointing the search framework at a view or table. It then creates search fields based on the columns available and allows you to configure the search fields and control the output.

    All of this is highly customisable and let's users create their own searches but although it's a generic configurable framework the data source is quite specific. You can only point the search at a view or table as it's data source so I'm limited to what I can do.

    I have custom pages that also reference this view and are not tied to the search framework in the application so I could write a proc but I would've preferred to be able to have all the phone call data in one view so I don't have to join on the phone_numbers view/table each time.

    I have sort of found a faster way that makes the view query run in 3/4 secs which is halving the time I suppose but still not giving me the performance I'd hope for.

    It's basically using a table valued function and joining on that as opposed to joining on the grouped phone_numbers view i.e.

    CREATE FUNCTION Phone_Numbers_Grouped ()

    RETURNS @png TABLE

    ( phone_number varchar(255) )

    AS

    BEGIN

    INSERT @png

    select x.phone_number

    from phone_numbers x

    where isnull(x.phone_number, '') ''

    group by x.phone_number

    RETURN

    END

    and then

    select

    pu.*, -- I dont really do this but didn't want to type all the columns in

    apn.*, -- same as above

    ecm.* -- same as above

    from phone_calls pu

    join asset_assignment apn on pu.handset_number = apn.handset_number and pu.call_date between apn.start_date and apn.end_date

    left outer join dbo.phone_numbers_grouped() ecm on pu.number_dialled= ecm.phone_number

    Running a search on the view now for one user on one bill runs in 3/4 secs.

    Incidentally rather than using the table valued function if I just used a table variable built from the same code used in the phone_numbers view (and adding a group by clause) so that no views were bring used to build the table variable and joined on that it would take 3 secs.

    The only issue with using the table variable is that I'd have to do that in every page and I couldn't then use it with the search framework of the application we use as you can't use a table variable in a view as far as I'm aware.

    I suppose I should be happy I've halved the execution time but I would've liked it down a bit more but I'm just not sure what else I can do.

    I'm using the phone_calls table (so can't do much else with it - not sure if adding any other indexes would help?), the asset_assignment and phone_numbers views are all built from multiple tables so I don't think there is a better way to create them and reference them from a view other that putting them in table valued functions however only the phone_numbers view joining is slowing the query. The asset_assignment join hardly slows the query at all so I'm guessing there's no real value in moving it to a function.

    Although 4 secs is an improvement

    1) if I run select * on phone_calls table it returns the 700K rows in 15 secs.

    2) if I run select * on phone_calls joined with asset_assignment it returns the 700K rows in 45 secs. (3x slower but it's double the no of columns returning as well)

    3) if I run select * on phone_calls joined with asset_assignment joined with phone_numbers function/temp table it seems to be returning ~7K rows per minute (a lot slower with only 1-3 extra cols being returned)

    So I'm convinced that it's the sql that builds the phone_numbers that's the issue and needs changed. As it's building from multiple tables It uses a few union alls so I'm unsure of how to change it to make it more efficient.

    The phone_Calls code is:

    select replace(isnull(cm.address_or_number, ''), ' ', '') as phone_number,

    contact_id as identifier,

    1 as class_id -- contact

    from contact_method_type cmt

    join contact_methods cm on cmt.contact_method_type_id = cm.contact_method_type_id

    where cmt.isEmail = 0

    union all

    select replace(isnull(phone_number, ''), ' ', '') as phone_number,

    organisation_id as identifier,

    2 as class_id -- organisation

    from org_summary_info

    union all

    select replace(isnull(fax_number, ''), ' ', '') as phone_number,

    organisation_id as identifier,

    2 as class_id -- organisation

    from org_summary_info

    union all

    -- more code to grab numbers from projects, mailing address and special exempt number tables

    Thanks for the suggestion though. If you have any other ideas, advice or pointers on where I'm going wrong I'd greatly appreciate it.

    Thanks

  • Did you try making an indexed view for the phone number part ?

    I usually don't use indexed views, but if you insist on having a single query doing it all ...

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • I had thought about an indexed view for the phone number part but I'm a bit hazy on whether I can do this or not. I've read a bit about it but tbh I'm not entirely sure If I should. Basically I know to do that you have to add the use schemabinding option but when I do that I get an error saying the tables the view is referencing are not schema bound which is correct. The issue I have is that the tables are part of the 3rd party DB and they regularly make updates so I'm not allowed to change any of their core tables as their updates can undo changes I make.

    I can mess with any custom tables I create but as the tables storing the phone numbers are core I'm wary about doing this or what it will mean. I can't contact the 3rd party vendor until tomorrow and I wasn't sure if it would help so I've not explored this yet.

    I'm still a relative newbie so I don't tend to mess unless i'm sure what I'm doing.

    Am I correct in my assumptions or have I got it wrong?

  • It doesn't necessarily help your current problem, I know you probably can't change it, and you may not even be concerned, but judging from the fact that you have more than one type of phone number on any given row in the org_summary_info table indicates a bit of a denormalization sin thyat will likely hurt other things in the future. For example, what if an organization has more than 1 fax #?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Hi Jeff,

    Yeah I hear where you're coming from and for contacts they have done this properly but not for other objects.

    Orgs can have a main phone and/or fax on the org table. Projects and orgs can also have one or more addresses but each address can only have one phone / fax number as well. And while you could use the addresses table as the system would allow this this isn't generally done and a separate branch org/project is created with a single address.

    It basically means I have to do the unions. Tbh I just didn't think it would slow things like this as the phone_numbers union is quite quick to run. It returns 18K rows total but runs in less than 1s. Though joining on the contact_methods table only which stores nos and email addresses (all of the number dialled matches in my test query are from this table anyway) runs in 2 secs but then doesn't have the isnull/replaces etc running so maybe getting it to 4 secs is going to be the best I can do.

    I just thought sql server would've been able to do this faster but maybe I'm being naive. Being a newbie I always assume I'm writing inefficient code anyway. :unsure:

  • If you need all the columns from all 3 of the tables, I'm not sure you'll ever be able to make it as snappy as you'd like for a view... it makes it just about impossible to get a seek without also having a bookmark lookup.

    It's a shame that you need to use a table or view. One way to speed this up a lot would be to get just the parts you need into a Temp Table or two and then do the joins on that. I believe that your table valued function will actually work in a view (I rarely use either because of performance problems ;-)) and you might be able to make it a bit more snappy if it were constructed as an "inline" TVF (a bit of a misnomer, for sure).

    One of the sure fire ways to get an index or table scan is like you posted...

    where isnull(x.phone_number, '') ''

    ... anytime you wrap a column in a function, it's likely going to be a problem. A good way to get around that particular one is like this...

    where x.phone_number > ' '

    Since a NULL can not be compared using such relationships, it will naturally be excluded. The code above will insure/ensure that neither NULLs nor blanks are included.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Hi Jeff,

    Sorry for not replying sooner. Hectic isn't the word for it these days. :crazy:

    Anyway I finally got the missing piece of the puzzle. I'd forgotten to put an index on the TVF. If I make the phone_number column a PK then the query execution time goes down to 1-1.5 secs which is now giving acceptable run times for all pages running. Plus made the change to the isnull you mentioned as every little helps. 😉

    So it may not be as fast as most queries should be it's an acceptable compromise for my limitations.

    Cheers for your help though. 😀

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply