95th percentile

  • Does anyone have any way of calculating the percentile of a column of of data.

    Ideally I need to show the 95th percentile.

    I need to use the result in a NHS dataset report via Reporting Services.

    As per my previuos topic i have used the excel function to see what result i should get. from the example data below i should see that the 95th percentile would be 754.25.

    I have seen some examples elsewhere on the web but they seem to use UNION SELECT but I am unsure of how to apply it.

    Any help wouuld be appreciated

    Many Thanks

    CREATE TABLE PS_TestForOnline

    (

    number NVARCHAR (20),

    );

    INSERT INTO PS_TestForOnline

    VALUES('1' );

    INSERT INTO PS_TestForOnline

    VALUES('5' );

    INSERT INTO PS_TestForOnline

    VALUES('10' );

    INSERT INTO PS_TestForOnline

    VALUES('15' );

    INSERT INTO PS_TestForOnline

    VALUES('20' );

    INSERT INTO PS_TestForOnline

    VALUES('999' );

    SELECT * FROM PS_TestForOnline

    DROP TABLE PS_TestForOnline

  • I'm not sure of how your calculation would work. Can you help explain what the '95th' percentile is? I can't seem to easily grok the math that came up with the number you provided.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Evil Kraig F (9/12/2011)


    I'm not sure of how your calculation would work. Can you help explain what the '95th' percentile is? I can't seem to easily grok the math that came up with the number you provided.

    My thoughts ed zachary.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • cribbed this from another web page but feel that it explains it fairly well.

    "Basically the 95th percentile says that 95% of the time, the usage is below this amount. Conversely of course, 5% of the time, usage is above that amount. The 95th percentile is a good number to use for planning so you can ensure you have the needed bandwidth at least 95% of the time."

    The data that I will eventually apply this to is the time patients wait in our A&E dept. Of the minutes that people wait to be treated and seen I need to know that 95 % of the time patients are treated in X amount of time.

    Thanks

  • OK so how did you come up with 754.25 from your sample data? It would seem that with your data 95% of them are no greater than 20. If you can explain how you came up with 754.25 it should be pretty simple.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • OK, perhaps my sample data wasnt the best example I could have used. I found this which i think may help.

    "The algorithm is as follows: collect all the data samples for a period of time (commonly a day, a week, or a month), then sort the data set by value from highest to lowest and discard the highest 5% of the sorted samples. The next highest sample is the 95th percentile value for the data set. "

    From what I've been told at work and from what i understand, you sort your data set and remove the top 5% of rows. If your dataset consisted of 100 rows sorted in order, your 95th percentile would be the value on the 96th row.

    Sorry to be so vague. I will confirm the description at work tomorrow.

    Thanks

  • Well then....

    select MAX(number) from

    (

    select top 95 percent number from PS_TestForOnline order by number

    ) MyAlias

    You probably want a bit more sample data to prove this but it should be what you are looking for.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Here is a quick and dirty proof.

    CREATE TABLE PS_TestForOnline

    (

    number int identity,

    );

    INSERT INTO PS_TestForOnline default values

    go 100

    select * from PS_TestForOnline

    select MAX(number) from

    (

    select top 95 percent number from PS_TestForOnline order by number

    ) MyAlias

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • I don't claim to be an expert in either SQL or Excel but I tested the method shown above with the original data values given and that method does not produce the correct answer. It would only produce the correct answer with the values 1,2,3,...99,100.

    We need to understand what the term "percentile" means in Excel. I found the following on YouTube which provides a good explanation.

    http://www.youtube.com/watch?v=XllEMSjVGGk

    Using the method described on YouTube, I came up with the following SQL that uses the values given. I'm sure that I am not using the best SQL methods, but at least it gives the correct answer of 754.25. I hope someone else here can improve the code.

    CREATE TABLE PS_TestForOnline

    (

    rowId int identity,

    minutes int,

    );

    INSERT INTO PS_TestForOnline

    VALUES(1);

    INSERT INTO PS_TestForOnline

    VALUES(5);

    INSERT INTO PS_TestForOnline

    VALUES(10);

    INSERT INTO PS_TestForOnline

    VALUES(15);

    INSERT INTO PS_TestForOnline

    VALUES(20);

    INSERT INTO PS_TestForOnline

    VALUES(999);

    SELECT * FROM PS_TestForOnline

    DECLARE @PERCENT REAL

    SET @PERCENT = .95

    PRINT @PERCENT

    DECLARE @COUNT INT

    SELECT @COUNT=COUNT(*) FROM PS_TestForOnline

    PRINT @COUNT

    DECLARE @POSITION REAL

    DECLARE @BETWEEN REAL

    DECLARE @ROW1 INT

    DECLARE @ROW2 INT

    SET @POSITION = @PERCENT * @COUNT + (1 - @PERCENT)

    PRINT @POSITION

    SET @ROW1 = @POSITION

    SET @ROW2 = @ROW1 + 1

    PRINT @ROW1

    PRINT @ROW2

    SET @BETWEEN = @POSITION - @ROW1

    PRINT @BETWEEN

    DECLARE @MINUTES1 INT

    DECLARE @MINUTES2 INT

    SELECT @MINUTES1 = minutes FROM PS_TestForOnline WHERE rowId = @ROW1

    PRINT @MINUTES1

    SELECT @MINUTES2 = minutes FROM PS_TestForOnline WHERE rowId = @ROW2

    PRINT @MINUTES2

    DECLARE @DISTANCE INT

    SET @DISTANCE = @MINUTES2 - @MINUTES1

    PRINT @DISTANCE

    DECLARE @add REAL

    SET @add = @BETWEEN * @DISTANCE

    PRINT @add

    DECLARE @PERCENTILE REAL

    SET @PERCENTILE = @MINUTES1 + @add

    PRINT @PERCENTILE

    DROP TABLE PS_TestForOnline

    Selected values from table:

    rowId minutes

    11

    25

    310

    415

    520

    6999

    Printed values:

    0.95 (@PERCENT)

    6 (@COUNT)

    5.75 (@POSITION)

    5 (@ROW1)

    6 (@ROW2)

    0.75 (@BETWEEN)

    20 (@MINUTES1)

    999 (@MINUTES2)

    979 (@DISTANCE)

    734.25 (@ADD)

    754.25 (@PERCENTILE) This is the desired correct answer.

    gmrose

  • gmrose (9/12/2011)


    I don't claim to be an expert in either SQL or Excel but I tested the method shown above with the original data values given and that method does not produce the correct answer. It would only produce the correct answer with the values 1,2,3,...99,100.

    That really depends on what the "correct" answer is. It is accurate if as the OP stated they wanted to remove the top 5% of the values. I guess we will find out tomorrow what the definition is. 😉

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • He's doing a statistical analysis based on a normal distribution. he's looking for the point in a normal distibution with an average and standard deviation commensurate with his set of numbers where 95% of the distribution is less than Z.

    In basic stats (I'm sure tom will be by and whip this sucker out a lot better), this approximates to:

    x < Avg(n) + 1.641 * stdev(n)

    In this case avg(n) =175 and stdev(n) = 403.733, which leads t something like:

    x < 838 and change

    Of course - this is fairly arbitrary, since the example he shows doesn't look like a normal distribution at all (at least not with the numbers he gave). a 5-number distribution with that kind of wild spread is nothing except undefined.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt Miller (#4) (9/12/2011)


    He's doing a statistical analysis based on a normal distribution. he's looking for the point in a normal distibution with an average and standard deviation commensurate with his set of numbers where 95% of the distribution is less than Z.

    In basic stats (I'm sure tom will be by and whip this sucker out a lot better), this approximates to:

    x < Avg(n) + 1.41 * stdev(n)

    In this case avg(n) =175 and stdev(n) = 403.733, which leads t something like:

    x < 744 and change

    Of course - this is fairly arbitrary, since the example he shows doesn't look like a normal distribution at all (at least not with the numbers he gave). a 5-number distribution with that kind of wild spread is nothing except undefined.

    Well, thank you for clarifying. I think I'll leave this to the folks who got past Basic Statistics (or remember more then I do). If a math formula swings past here I can grok, I'll help with the code though!


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Thank you all for your replies.

    I now have the definition from which i am expected to work to.

    "The 95th pecentile is taken as the value at position "0.95 x(number of rows -1) +1"

    My interpretation is that when the data is ordered I am looking for the value that is next following the 95th % row.

    so if we have a random selection of 100 rows sorted acending then the value in row 96 is what I would expect to be the 95th percentile.

    Hope this helps and sorry for the original confusion.

  • Try this:

    Based on the definition you gave, you can enter the percentile in the declare bit, or hard code it instead of the @percentile instead if you are always going to use 0.95 - let me know if it works!

    IF OBJECT_ID('tempdb..#DATA') IS NOT NULL DROP TABLE #data

    CREATE TABLE #data (number INT)

    INSERT INTO #data SELECT 15 as number union all

    SELECT 20 as number union all

    SELECT 35 as number union all

    SELECT 40 as number union all

    SELECT 50 as number

    DECLARE @percentile DECIMAL(18,3)

    SET @Percentile = 0.40

    ;with percentile as

    (

    select

    CAST(number AS DECIMAL(18,3)) as number

    ,row_number() OVER (PARTITION BY (SELECT NULL) order by number asc) as row

    from #data

    )

    ,percentile2 AS

    (

    SELECT

    number

    ,row

    ,MAX(row) OVER (PARTITION BY (SELECT NULL)) as max_row

    ,(@percentile * (MAX(row) OVER (PARTITION BY (SELECT NULL)) - 1)) + 1 as percentile_row

    FROM percentile

    )

    select

    AVG(number) as percentile_value

    from percentile2

    WHERE ABS(percentile_row - row) < 1

  • Thanks to Davin21 (and everyone elses comments) That seems to work fine.

    I need now to add group the result by a month and/or field.

    I am a bit unsure as to how to incorporate your script into a group script.

    Could anyone please advise?

    thanks in advance

Viewing 15 posts - 1 through 15 (of 22 total)

You must be logged in to reply to this topic. Login to reply