Partitioning large tables for performance.

  • I am trying to test whether using partitioning on our Users table will provide performance benefits.

    We are a notification company. When we sign up a client, the client provides us with a list of users, which we then enter into our database. Some clients have 20-30 users, and some have literally tens of thousands.

    Typically, one of the client's administrators logs into our website to send out an alert/broadcast. When the admin signs in, we present a list of all of the client's users, which they can "deselect" individually to indicate whom to send this alert to.

    We want to speed up the retrieval of the client's users to minimize the wait time of fetching this data, and so we are considering partitioning the "Users" table since our database server platform has four Intel quad-core processors. I think we can achieve significant performance benefits by taking advantage of the parallelism that partitioning the "Users" table would make possible.

    Each user row contains a User_ID (INT) field, which seems like the best field to partition on. I'd like the data distributed by the least significant digit of User_ID so that an organization's users are spread roughly evenly across the partitions.

    The query to examine distribution is:

    SELECT (ABS(User_ID) % 10) AS PartNo, COUNT(*) AS UserCount
    FROM Users
    GROUP BY ABS(User_ID) % 10
    ORDER BY ABS(User_ID) % 10

    The results are:

    PartNo   UserCount
    0        206665
    1        206586
    2        206548
    3        206496
    4        206463
    5        206367
    6        206660
    7        206383
    8        206507
    9        206529

    The problem I am encountering is that there doesn't seem to be a way to implement this logic in the partition function. I tried the following, but SQL doesn't like the syntax:

    CREATE PARTITION FUNCTION [pf_Users] ( ABS(INT) % 10 )
    AS RANGE LEFT FOR VALUES (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)

    Does anyone have an idea about how I might be able to accomplish this?

    Thanks in advance for your response and ideas.

  • Larry Kruse (6/23/2010)


    CREATE PARTITION FUNCTION [pf_Users] ( ABS(INT) % 10 )
    AS RANGE LEFT FOR VALUES (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)

    Does anyone have an idea about how I might be able to accomplish this?

    There are two ways to approach this:

    1. If the USER_ID is known prior to the insert, then (ABS(USER_ID) % 10) can be computed into a variable and used in the INSERT statement.

    2. If the USER_ID is an identity column, so the ID is known only at insert time, then a computed column can hold the partition key; the computed column must be persisted. A sketch of this approach follows.
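
    For illustration, a minimal sketch of the persisted computed-column approach; every name here except User_ID is an assumption, and the scheme maps all partitions to PRIMARY just for testing:

    -- Sketch only: partition on a persisted computed column that holds
    -- the least significant digit of User_ID.
    CREATE PARTITION FUNCTION pf_Users (TINYINT)
    AS RANGE LEFT FOR VALUES (0, 1, 2, 3, 4, 5, 6, 7, 8);  -- 9 boundaries = 10 partitions

    CREATE PARTITION SCHEME ps_Users
    AS PARTITION pf_Users ALL TO ([PRIMARY]);  -- single filegroup while testing

    CREATE TABLE dbo.Users
    (
        User_ID  INT IDENTITY(1,1) NOT NULL,
        UserName NVARCHAR(100) NOT NULL,  -- assumed column
        PartKey  AS CAST(ABS(User_ID) % 10 AS TINYINT) PERSISTED NOT NULL,
        CONSTRAINT PK_Users PRIMARY KEY CLUSTERED (User_ID, PartKey)  -- a unique index on a partitioned table must include the partition key
    ) ON ps_Users (PartKey);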

  • Larry Kruse (6/23/2010)


    Typically, one of the client's administrators logs into our website to send out an alert/broadcast. When the admin signs in, we present a list of all of the client's users, which they can "deselect" individually to indicate whom to send this alert to.

    Do you do this even for the clients that have "thousands" of users? The manual de-select thing, I mean.

    Larry Kruse (6/23/2010)


    We want to speed up the retrieval of the client's users to minimize the wait time of fetching this data, and so we are considering partitioning the "Users" table since our database server platform has four Intel quad-core processors. I think we can achieve significant performance benefits by taking advantage of the parallelism that partitioning the "Users" table would make possible.

    If the idea is to take advantage of parallel processing, I would fine-tune the degree of parallelism instead of partitioning. I do not see partitioning as the right tool for this scenario.

    Please check degree of parallelism here... http://qa.sqlservercentral.com/articles/Configuring/managingmaxdegreeofparallelism/1029/
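
    For reference, the server-wide setting the article covers is adjusted like this (the value 8 is only an example; test on a non-production server first):

    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'max degree of parallelism', 8;  -- example value only
    RECONFIGURE;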

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge, but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions in a non-production test environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
  • I built a partitioned table using a persisted computed column that holds the least significant digit of the user's User_ID identity column.

    Alas! We didn't get the performance improvements we had hoped to see.

    Of the scenarios I've tested out, so far, the biggest improvement came from building indexes on a separate drive spindle.

    The hardware platform is composed of two internal RAID-1 drives and a NAS storage unit. One is the C: drive, used for the OS and paging only; the second internal drive holds the log file; and the NAS is the primary data store. I tested a scenario where I built my indexes in a separate filegroup on the log-file drive (L:), and that resulted in a noticeable improvement, roughly as sketched below.
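
    A sketch of that test; the database name, file path, and Client_ID column are all assumptions:

    -- Add a filegroup on the L: drive and build an index there.
    ALTER DATABASE NotifyDB ADD FILEGROUP IndexFG;

    ALTER DATABASE NotifyDB
    ADD FILE (NAME = N'NotifyDB_ix',
              FILENAME = N'L:\SQLData\NotifyDB_ix.ndf',
              SIZE = 1GB)
    TO FILEGROUP IndexFG;

    CREATE NONCLUSTERED INDEX IX_Users_Client
    ON dbo.Users (Client_ID)
    ON IndexFG;  -- index pages live on the L: spindle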

    The general feeling around here is that I still need to partition the Users table in order to minimize maintenance windows for rebuilding indexes and what-not.

    Thanks for everyone's advice and suggestions. I'll keep plugging away at this and post something up if I stumble on a way to partition this table.

  • Larry Kruse (6/29/2010)


    Alas! We didn't get the performance improvements we had hoped to see.

    Not a surprise - check my previous post 🙂

    _____________________________________
    Pablo (Paul) Berzukov

  • Paul - Thanks for your feedback. I reviewed the article which mostly discusses the MAXDOP settings. While it certainly provides something interesting to consider regarding parallelism, I doubt seriously that it is particularly relevant to the partitioning of the table. While I'm sure that there is overhead associated with managing the parallelism of the execution threads, I'm guessing that it is relatively small when compared against the advantages gained by having multiple parallel I/Os running at the same time in order to fetch a large amount of data.

    I guess I'm not understanding why you aren't surprised that splitting the table into partitions didn't make much difference.

    After reading your post and the link embedded in it, I ran a single query against both my partitioned table and an unpartitioned table and received nearly identical results. The query against the unpartitioned data was consistently about 5 seconds faster than the same query running against the partitioned setup. (The un-partitioned query retrieved about 281,000 rows in 12 seconds; the same query run against the partitioned table retrieved the same number of rows in 17 seconds.) Both queries specified (NOLOCK) and both had MAXDOP 8.
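
    The shape of the comparison query was roughly the following; column and parameter names are placeholders:

    DECLARE @ClientID INT = 42;  -- placeholder value

    SELECT u.User_ID, u.UserName
    FROM dbo.Users AS u WITH (NOLOCK)
    WHERE u.Client_ID = @ClientID
    OPTION (MAXDOP 8);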

  • Larry Kruse (6/29/2010)


    Paul - Thanks for your feedback. I reviewed the article which mostly discusses the MAXDOP settings. While it certainly provides something interesting to consider regarding parallelism, I doubt seriously that it is particularly relevant to the partitioning of the table.

    Exactly.

    This is the message I'm trying to convey. I do not see how partitioning is going to help in this specific case.

    I would go with a non-partitioned table and a good indexing strategy. How many columns are you retrieving in that particular query? Have you considered a covering index?
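
    To illustrate the covering-index idea, assuming the query filters on Client_ID and returns only a handful of columns (all names hypothetical):

    -- The INCLUDE list covers the SELECT list, so no key lookups are needed.
    CREATE NONCLUSTERED INDEX IX_Users_Client_Covering
    ON dbo.Users (Client_ID)
    INCLUDE (User_ID, UserName, Email);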

    _____________________________________
    Pablo (Paul) Berzukov

  • I think I agree with Paul. I'm not sure partitioning helps.

    Partitioning helps with administration, and with archiving and adding data, but not with queries themselves. To get a big improvement, you'd want to have the partitions on separate physical drives AND be running queries in parallel, as mentioned by Paul.

  • If you want to partition, consider doing it by client. Try creating individual partitions for the large clients on Client ID and a catch-all for the smaller ones. Then a query for users for a single client would have to access only a single partition.

    However, it would probably be simpler and just as fast to have a table clustered on Client ID and User ID, so that a query scans only a range of the clustered index.
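
    A sketch of that layout, assuming a Client_ID column and no existing clustered index on the table:

    -- Cluster on (Client_ID, User_ID) so one client's users are contiguous;
    -- a per-client query then becomes a single clustered index range scan.
    CREATE CLUSTERED INDEX CIX_Users_Client_User
    ON dbo.Users (Client_ID, User_ID);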

  • stewartc-708166 (6/30/2010)


    Partitioning should not be used on moderate-sized transactional tables.

    It works best for the archiving of huge datasets (hundred million plus records) that are not interrogated on a regular basis.

    I agree with everything except the "...that are not interrogated on a regular basis" part of it.

    In our shop we take advantage of partitioning in heavily used, multi-hundred-million and multi-billion-row FACT and DIM tables which are interrogated all day long.

    The trick is to partition those tables so that most queries hit only the specific partitions they need to serve the particular query.

    _____________________________________
    Pablo (Paul) Berzukov

  • Paul,

    Do you have partitions separated on different drives? Have you still seen a large improvement from indexing? Or does this just act as another type of index?

  • Steve Jones - Editor (6/30/2010)


    Do you have partitions separated on different drives? Have you still seen a large improvement from indexing? Or does this just act as another type of index?

    Steve --

    We do have partitions scattered across different drives.

    Also, since those are core FACT and DIM tables storing granular data in a large data warehouse environment, we deal with queries that have to process huge numbers of rows. Having partitions by date - most queries ask for a date range - allows the optimizer to hit only the affected partitions.

    Bottom line: in an extreme case it is cheaper to do a full scan on a 25-million-row partition than to do an index range scan plus table access to retrieve 25 million rows out of a one-billion-row table.
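
    An illustrative sketch of that pattern; every name and date here is invented:

    -- Monthly range partitioning on a date column (RANGE RIGHT puts each
    -- boundary date into the partition to its right).
    CREATE PARTITION FUNCTION pf_FactByMonth (DATE)
    AS RANGE RIGHT FOR VALUES ('2010-01-01', '2010-02-01', '2010-03-01');

    CREATE PARTITION SCHEME ps_FactByMonth
    AS PARTITION pf_FactByMonth ALL TO ([PRIMARY]);

    CREATE TABLE dbo.FactSales
    (
        SaleDate DATE  NOT NULL,
        Amount   MONEY NOT NULL
    ) ON ps_FactByMonth (SaleDate);

    -- A date-range predicate lets the optimizer eliminate partitions that
    -- cannot contain qualifying rows:
    SELECT SUM(Amount)
    FROM dbo.FactSales
    WHERE SaleDate >= '2010-02-01' AND SaleDate < '2010-03-01';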

    _____________________________________
    Pablo (Paul) Berzukov

  • I agree with Paul, Steve and the rest. Partitioning is not the right tool for this job unless your users table has hundreds of millions of rows in it.

    Even if you do partition the table, you wouldn't want to do it by using the computed column and the ABS-function mumbo-jumbo. If you want to use partitioning to speed up SELECTs, you'll want to do what Michael suggests and partition by client. Don't try to outsmart the partitioning functions by evenly distributing your data. Check your query plan on the computed-column method: does it result in a table scan? Partitioning by client would alleviate this mess and allow the query optimizer to hit only those partitions necessary to satisfy the query.

    With that said, I revert to my first statement: you don't need to partition this table unless you have 100 million+ rows. A well-defined indexing strategy and properly structured T-SQL are what you need here.

    How big is your table? Can you give us an example of the distribution by client? What is the response time you are trying to achieve?

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
