Distinct query with all columns

  • Add a clustering index to pre-sort the data on at least a few of the key columns.  Just based on the very limited data you gave, I'd suggest something like this:

    CREATE CLUSTERED INDEX TempAutomatorApiPerformance__CL
        ON db.TempAutomatorApiPerformance ( StartTimeId, EndTimeId, ExecutionId, ServerId )
        WITH ( FILLFACTOR = 98, SORT_IN_TEMPDB = ON )
        ON [PRIMARY]; /* change the filegroup name to whatever filegroup you want the table to go to */

    If your SQL Server edition has the capability, you could also try page-compressing the data to reduce the I/O required to scan the table (see the sketch below).

    SQL DBA, SQL Server MVP (07, 08, 09)
    "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!
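
    A minimal sketch of the page-compression idea, assuming the table lives in the dbo schema and the edition supports compression (Enterprise, or Standard from SQL Server 2016 SP1 onward):

    -- Rebuild the table (heap or clustered index) with page compression to cut the pages scanned
    ALTER TABLE dbo.TempAutomatorApiPerformance
        REBUILD WITH ( DATA_COMPRESSION = PAGE );

    sp_estimate_data_compression_savings can give a rough idea of the space saved before committing to the rebuild.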

  • ss-457805 - Wednesday, July 19, 2017 12:14 PM

    Thanks, guys, for the advice and suggestions. I have raised it with the dev and app teams.

    bmg - without the DISTINCT, it takes 1 min and 32 secs.

    Ed - It would be good to know the performant way to do it.
    I normally just follow this method:
    https://support.microsoft.com/en-us/help/139444/how-to-remove-duplicate-rows-from-a-table-in-sql-server

    While I do agree you should get rid of duplicates, I am a little confused as to why it takes 3 times as long without the distinct?
    Another option that sometimes helps is using "GROUP BY" instead of DISTINCT (see the sketch below).  I don't expect it to help in your case, but it is something you could try.  I've seen stranger things happen.

    But to me, 1.5 minutes to get 3 million rows seems like an eternity, especially when there are no joins involved... As a thought, is [TempAutomatorApiPerformance] a table or a view?

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.
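
    A minimal sketch of the GROUP BY alternative mentioned above; the column list here is an assumption pieced together from this thread, not the actual query:

    SELECT StartTimeId, EndTimeId, ExecutionId, ServerId,
           UPPER(MethodName) AS MethodName
    FROM TempAutomatorApiPerformance
    GROUP BY StartTimeId, EndTimeId, ExecutionId, ServerId, UPPER(MethodName);  -- returns the same result set as SELECT DISTINCT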

  • bmg002 - Wednesday, July 19, 2017 2:36 PM

    While I do agree you should get rid of duplicates, I am a little confused as to why it takes 3 times as long without the distinct?

    Time to return and display the data, especially if the test is from SSMS, which is rather inefficient at displaying data.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • One more thing that I don't understand: if I remove the column UPPER([MethodName]) AS MethodName, the query (with DISTINCT) returns in 4 secs. Why could that be?

    On another note - I took a copy of this database and tested it in our test environment. I removed the duplicates (using Gail's method) and ran the query; it returns instantly.

    Ed - I also tested the de-dupe using various combinations and it looks good. Thanks for this.

    I have emailed the developers and am waiting to hear from them on the next plan of action.

  • Yeah, UPPER has some unfortunate side-effects. It's non-SARGable, so it limits how supporting indexes can be used (see the sketch below).

    Similarly, it also messes with the use of statistics, which means the optimizer might pick a less efficient way of implementing the DISTINCT due to poor estimates (for example, with a low number of distinct rows relative to total rows, using a hash match to implement the DISTINCT can perform much better than a distinct sort).

    If you post the query plan for the query without UPPER, I would bet it's switched to using a different DISTINCT implementation, and has more accurate estimates (not hard, since the original query plan estimated that nearly every row was unique).

    I would definitely avoid using things like UPPER and LOWER unless your database is case-sensitive and you need a case-insensitive search. Of course, as others have pointed out, the elephant in the room is the number of duplicates.

    Deduplicating the table would make these other concerns less important, although you still will want to avoid these problems I mentioned if possible.

    Cheers!
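
    A minimal sketch of the SARGability point, using a hypothetical literal value; in a case-insensitive database the second form can seek an index on MethodName, while the first cannot:

    -- Non-SARGable: the function wrapped around the column blocks an index seek
    SELECT COUNT(*) FROM TempAutomatorApiPerformance WHERE UPPER(MethodName) = 'SOMEMETHOD';

    -- SARGable: the bare column can be matched against an index (case-insensitive collation assumed)
    SELECT COUNT(*) FROM TempAutomatorApiPerformance WHERE MethodName = 'SomeMethod';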

  • ss-457805 - Thursday, July 20, 2017 2:42 AM

    As a thought about why removing the UPPER([MethodName]) speeds things up, what is the length of the values in that column?  For example, how many rows are at the maximum length:
    SELECT COUNT(*)
    FROM [TempAutomatorApiPerformance]
    WHERE LEN([MethodName]) = (SELECT MAX(LEN([MethodName]))
                               FROM [TempAutomatorApiPerformance]);

    NOTE - I would run this against test as it likely will take a little bit to run.

    I am just curious how many long values are being converted to upper case.  Converting a character or two to upper case is quick; the longer the string, the slower the operation.  And since you are doing it on 3 million rows, I imagine that most of them would be long.
    Do you notice a similar performance increase by simply removing the UPPER function from your query?  Or do you need to remove that entire column to get any performance increase?

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • bmg002 - Thursday, July 20, 2017 10:02 AM

    Please find attached the execution plan without the MethodName column; instead of a Sort, I get a Hash Match. It returns in 6 secs. If I remove the UPPER function but leave the column, the result is the same 6 secs and the execution plan is the same as attached. When I run the query you mentioned above, it returns a count of 541992. I am even more confused now... I wouldn't have expected the execution plan to change from the one I posted yesterday.

  • So 1/6 of your data has the longest MethodNames.  This is likely part of what is causing the slowness.
    What is the max length of that column?

    Also, what is the result of:
    SELECT COUNT(DISTINCT MethodName)
    FROM TempAutomatorApiPerformance

    You may actually be able to normalize that column, which would improve performance (see the sketch below).  Well, presuming the developers are OK with it.
    If I were you, though, I'd take ownership of the database metadata.  Change the table(s) however you decide will offer the best performance and give the developers views or stored procedures to get their data.  This will allow you to normalize the data and de-duplicate it.  The only time this doesn't work well is with 3rd party software.
    But if your company is like mine, you'd likely need managerial approval for this.  It would give you a lot more control over the database at the object level and generally make things faster.  The above query will likely return fewer than 1500, and storing an int with a lookup table and a FK constraint on it will make things faster overall.
    An even better normalizing solution (in my opinion), since there are under 1500 distinct rows, would be to store the distinct rows in a lookup table so that the table the data lands in is just 2 columns of ints (one as a PK, one referencing the distinct row that was entered).
    BUT I do understand that you aren't allowed to make certain changes.  If you have a test system, though, you could build a proof of concept.  Get the developer to compile 2 versions of the code - one pointing to live, one pointing to the POC.  Then show them the performance increase with the only developer code change being the database it points to.

    But, back to your latest question - the other problem with the first query is that you don't have enough memory for it.  You can see in the properties of the execution plan that it asked for 6031080 and was granted 1527384.  Looking at the second one, it requested 5384968 and got the whole thing.  That will make a difference in performance, I believe.  But looking at what it requested compared to what it required, I wonder if your statistics are out of date.

    And changing the query will change the execution plan.  That is not surprising.  SQL Server made a best guess, based on the statistics, as to how it should match rows up.  Based on the data per row in the first query, it thought sorting and then removing duplicates would be more efficient.  Based on the data per row in the second query, it thought that a hash match would be more efficient.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.
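
    A minimal sketch of normalizing the MethodName column into a lookup table, as described above; the object names and data types are assumptions, not the real schema:

    -- Lookup table holding one row per distinct MethodName value
    CREATE TABLE MethodNameLookup
    (
        MethodNameId int IDENTITY(1, 1) NOT NULL CONSTRAINT PK_MethodNameLookup PRIMARY KEY,
        MethodName   varchar(500)       NOT NULL CONSTRAINT UQ_MethodNameLookup_MethodName UNIQUE
    );

    -- The main table then stores only the int key instead of the full string
    ALTER TABLE TempAutomatorApiPerformance
        ADD MethodNameId int NULL
            CONSTRAINT FK_TempAutomatorApiPerformance_MethodNameLookup
            REFERENCES MethodNameLookup ( MethodNameId );

    Once MethodNameId is populated, the wide MethodName column could be dropped, which also shrinks the rows the DISTINCT has to sort or hash.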

  • It looks like you're just getting lucky, in this case. A substantial bit of the performance improvement is going to be from implementing the DISTINCT with a hash match instead of a distinct sort (remember, for sets with a high number of total rows relative to number of distinct rows this tends to be much more efficient).

    However, the estimates are just as bad with this version of the query, so you're not benefiting from improved estimates that lead to the better choice.

    An easy way to test where the performance difference is coming from is to run the original query (with UPPER(MethodName)), but use the query hint OPTION (HASH GROUP) to force that behavior (see the sketch below).

    I'd wager most of the performance improvement comes from that; again, though, the real issue to be addressed is deduplicating the table.

    Cheers!
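
    A minimal sketch of the test described above; the column list is assumed from this thread rather than taken from the actual query:

    SELECT DISTINCT StartTimeId, EndTimeId, ExecutionId, ServerId,
           UPPER(MethodName) AS MethodName
    FROM TempAutomatorApiPerformance
    OPTION ( HASH GROUP );  -- forces the DISTINCT to be implemented as a hash match rather than a sort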

  • bmg002 - Thursday, July 20, 2017 12:17 PM

    You've got it spot on. It is third-party software with various other vendors in the mix, so making a change needs approval. That can take anything between 15 days and a month. I had a call with the developers this morning and showed them, in our test environment, the various ways we can resolve this (thanks to you and Gail for the suggestions). Developers with egos don't help either :-). Will see how this goes. Your query, by the way, returns 5.

  • Jacob Wilkins - Thursday, July 20, 2017 12:27 PM

    Thanks Jacob. We are looking to tackle the real issue - de-duplicating.

  • ss-457805 - Thursday, July 20, 2017 12:57 PM

    Now that's really good to hear.  I'd go the route that Gail described (create new, drop and rename) if at all possible; a sketch of that pattern follows below.

    The next step will be to prevent the duplicate rows from getting into the table in the first place.
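
    A minimal sketch of the create new, drop and rename pattern referred to above; the column list and new table name are assumptions, and the swap should only be done once the copy has been verified:

    -- Copy only the distinct rows into a new table
    SELECT DISTINCT StartTimeId, EndTimeId, ExecutionId, ServerId, MethodName
    INTO TempAutomatorApiPerformance_Deduped
    FROM TempAutomatorApiPerformance;

    -- Swap the tables once the new copy has been checked
    DROP TABLE TempAutomatorApiPerformance;
    EXEC sp_rename 'TempAutomatorApiPerformance_Deduped', 'TempAutomatorApiPerformance';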

  • Ed Wagner - Thursday, July 20, 2017 1:21 PM

    Thanks, Ed, for your help. The de-duplicating method you suggested definitely helped and we are looking to plan this. However, it just needs to go through the process (Change Control). The immediate fix has been relayed as well (Gail's method). Thanks a lot.
