Selecting the record with the “nearest” date

  • I recall OP was saying some rows do have not null IndexStartDate.

  • I recall OP was saying some rows do have not null IndexStartDate.

    Yes, that's absolutely right. I need to add some cases to the test data.

    Sorry for not replying earlier, but I am trying to multitask and still studying the suggested code snippets in this thread. I have tried Dave Morrison's suggestion with an AsAtDate of 2015-06-01. It is about 60% correct on the test data, so not quite there yet. I will post again as soon as I feel I can comment intelligently on the possible reasons.

    Really appreciate the input so far, people - very instructive.

    MarkD

  • Hi experts

    Here is the complete test table definition with primary key and enhanced data. I have replicated the existing patients twice but with new IDs (starting with a Q or an R rather than a P) and more complete details. For comparison purposes, I have left the numeric part of the IDs alone. To illustrate using one patient:

    Their deficient "P" records from the existing data...

    SELECT 'P5608863213',NULL,'Jul 13 2015 12:00AM','Historic' UNION ALL

    SELECT 'P5608863213',NULL,'Jul 22 2015 12:00AM','Historic' UNION ALL

    SELECT 'P5608863213',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

    The enhanced "Q" records for an imaginary patient with the same details, only more completely expressed. All the dates, except the first which is unknown, have been filled in with the likeliest values:

    SELECT 'Q5608863213',NULL,'Jul 13 2015 12:00AM','Historic' UNION ALL

    SELECT 'Q5608863213','Jul 14 2015 12:00AM','Jul 22 2015 12:00AM','Historic' UNION ALL

    SELECT 'Q5608863213','Jul 23 2015 12:00AM','Dec 31 2500 12:00AM','Current' UNION ALL

    The enhanced "R" records for an imaginary patient with the same details. These are the same as the "Q" records except that the null date on the first record is replaced by 1753-01-01 as per Kevin Boles' suggestion:

    SELECT 'R5608863213','Jan 1 1753 12:00AM','Jul 13 2015 12:00AM','Historic' UNION ALL

    SELECT 'R5608863213','Jul 14 2015 12:00AM','Jul 22 2015 12:00AM','Historic' UNION ALL

    SELECT 'R5608863213','Jul 23 2015 12:00AM','Dec 31 2500 12:00AM','Current' UNION ALL

    The point with doing it this way is that whatever query we concoct, it should select the "same" records from the P, Q and R sets of patients.

    Oh - the TSQL:

    CREATE TABLE [dbo].[MDTest2](

    [PatientID] [varchar](11) NOT NULL,

    [IndexStartDate] [datetime] NULL,

    [IndexEndDate] [datetime] NOT NULL,

    [RecSource] [varchar](8) NOT NULL

    CONSTRAINT PrimaryKey PRIMARY KEY (PatientID,IndexEndDate)

    SELECT 'P5711063012',NULL,'Mar 24 2015 12:00AM','Historic' UNION ALL

    SELECT 'P5711063012',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'P1063766833',NULL,'Sep 21 2015 12:00AM','Historic' UNION ALL

    SELECT 'P1063766833',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'P0519084455',NULL,'Sep 21 2015 12:00AM','Historic' UNION ALL

    SELECT 'P0519084455',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'P8362712077',NULL,'Sep 29 2015 12:00AM','Historic' UNION ALL

    SELECT 'P8362712077',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'P6421082878',NULL,'Feb 22 2015 12:00AM','Historic' UNION ALL

    SELECT 'P6421082878',NULL,'May 22 2015 12:00AM','Deducted' UNION ALL

    SELECT 'P9814006870',NULL,'Jan 14 2015 12:00AM','Historic' UNION ALL

    SELECT 'P9814006870',NULL,'Jan 19 2015 12:00AM','Historic' UNION ALL

    SELECT 'P8293058689',NULL,'Sep 29 2015 12:00AM','Historic' UNION ALL

    SELECT 'P8293058689',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'P8026056091',NULL,'Jan 19 2015 12:00AM','Historic' UNION ALL

    SELECT 'P8026056091',NULL,'Feb 18 2015 12:00AM','Historic' UNION ALL

    SELECT 'P8026056091',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'P9037628490',NULL,'Oct 7 2015 12:00AM','Historic' UNION ALL

    SELECT 'P9037628490',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'P1276762217',NULL,'Aug 25 2015 12:00AM','Historic' UNION ALL

    SELECT 'P1276762217',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'P5608863213',NULL,'Jul 13 2015 12:00AM','Historic' UNION ALL

    SELECT 'P5608863213',NULL,'Jul 22 2015 12:00AM','Historic' UNION ALL

    SELECT 'P5608863213',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'P4004258144',NULL,'Jun 29 2015 12:00AM','Historic' UNION ALL

    SELECT 'P4004258144',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'P2695719501',NULL,'May 18 2015 12:00AM','Historic' UNION ALL

    SELECT 'P2695719501',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'P9460813602',NULL,'Mar 4 2015 12:00AM','Historic' UNION ALL

    SELECT 'P9460813602',NULL,'Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'Q5711063012',NULL,'Mar 24 2015 12:00AM','Historic' UNION ALL

    SELECT 'Q5711063012','Mar 25 2015 12:00AM','Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'Q1063766833',NULL,'Sep 21 2015 12:00AM','Historic' UNION ALL

    SELECT 'Q1063766833','Sep 22 2015 12:00AM','Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'Q0519084455',NULL,'Sep 21 2015 12:00AM','Historic' UNION ALL

    SELECT 'Q0519084455','Sep 22 2015 12:00AM','Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'Q8362712077',NULL,'Sep 29 2015 12:00AM','Historic' UNION ALL

    SELECT 'Q8362712077','Sep 30 2015 12:00AM','Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'Q6421082878',NULL,'Feb 22 2015 12:00AM','Historic' UNION ALL

    SELECT 'Q6421082878','Feb 23 2015 12:00AM','May 22 2015 12:00AM','Deducted' UNION ALL

    SELECT 'Q9814006870',NULL,'Jan 14 2015 12:00AM','Historic' UNION ALL

    SELECT 'Q9814006870','Jan 15 2015 12:00AM','Jan 19 2015 12:00AM','Historic' UNION ALL

    SELECT 'Q8293058689',NULL,'Sep 29 2015 12:00AM','Historic' UNION ALL

    SELECT 'Q8293058689','Sep 30 2015 12:00AM','Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'Q8026056091',NULL,'Jan 19 2015 12:00AM','Historic' UNION ALL

    SELECT 'Q8026056091','Jan 20 2015 12:00AM','Feb 18 2015 12:00AM','Historic' UNION ALL

    SELECT 'Q8026056091','Feb 19 2015 12:00AM','Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'Q9037628490',NULL,'Oct 7 2015 12:00AM','Historic' UNION ALL

    SELECT 'Q9037628490','Oct 8 2015 12:00AM','Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'Q1276762217',NULL,'Aug 25 2015 12:00AM','Historic' UNION ALL

    SELECT 'Q1276762217','Aug 26 2015 12:00AM','Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'Q5608863213',NULL,'Jul 13 2015 12:00AM','Historic' UNION ALL

    SELECT 'Q5608863213','Jul 14 2015 12:00AM','Jul 22 2015 12:00AM','Historic' UNION ALL

    SELECT 'Q5608863213','Jul 23 2015 12:00AM','Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'Q4004258144',NULL,'Jun 29 2015 12:00AM','Historic' UNION ALL

    SELECT 'Q4004258144','Jun 30 2015 12:00AM','Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'Q2695719501',NULL,'May 18 2015 12:00AM','Historic' UNION ALL

    SELECT 'Q2695719501','May 19 2015 12:00AM','Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'Q9460813602',NULL,'Mar 4 2015 12:00AM','Historic' UNION ALL

    SELECT 'Q9460813602','Mar 5 2015 12:00AM','Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'R5711063012','Jan 1 1753 12:00AM','Mar 24 2015 12:00AM','Historic' UNION ALL

    SELECT 'R5711063012','Mar 25 2015 12:00AM','Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'R1063766833','Jan 1 1753 12:00AM','Sep 21 2015 12:00AM','Historic' UNION ALL

    SELECT 'R1063766833','Sep 22 2015 12:00AM','Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'R0519084455','Jan 1 1753 12:00AM','Sep 21 2015 12:00AM','Historic' UNION ALL

    SELECT 'R0519084455','Sep 22 2015 12:00AM','Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'R8362712077','Jan 1 1753 12:00AM','Sep 29 2015 12:00AM','Historic' UNION ALL

    SELECT 'R8362712077','Sep 30 2015 12:00AM','Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'R6421082878','Jan 1 1753 12:00AM','Feb 22 2015 12:00AM','Historic' UNION ALL

    SELECT 'R6421082878','Feb 23 2015 12:00AM','May 22 2015 12:00AM','Deducted' UNION ALL

    SELECT 'R9814006870','Jan 1 1753 12:00AM','Jan 14 2015 12:00AM','Historic' UNION ALL

    SELECT 'R9814006870','Jan 15 2015 12:00AM','Jan 19 2015 12:00AM','Historic' UNION ALL

    SELECT 'R8293058689','Jan 1 1753 12:00AM','Sep 29 2015 12:00AM','Historic' UNION ALL

    SELECT 'R8293058689','Sep 30 2015 12:00AM','Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'R8026056091','Jan 1 1753 12:00AM','Jan 19 2015 12:00AM','Historic' UNION ALL

    SELECT 'R8026056091','Jan 20 2015 12:00AM','Feb 18 2015 12:00AM','Historic' UNION ALL

    SELECT 'R8026056091','Feb 19 2015 12:00AM','Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'R9037628490','Jan 1 1753 12:00AM','Oct 7 2015 12:00AM','Historic' UNION ALL

    SELECT 'R9037628490','Oct 8 2015 12:00AM','Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'R1276762217','Jan 1 1753 12:00AM','Aug 25 2015 12:00AM','Historic' UNION ALL

    SELECT 'R1276762217','Aug 26 2015 12:00AM','Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'R5608863213','Jan 1 1753 12:00AM','Jul 13 2015 12:00AM','Historic' UNION ALL

    SELECT 'R5608863213','Jul 14 2015 12:00AM','Jul 22 2015 12:00AM','Historic' UNION ALL

    SELECT 'R5608863213','Jul 23 2015 12:00AM','Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'R4004258144','Jan 1 1753 12:00AM','Jun 29 2015 12:00AM','Historic' UNION ALL

    SELECT 'R4004258144','Jun 30 2015 12:00AM','Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'R2695719501','Jan 1 1753 12:00AM','May 18 2015 12:00AM','Historic' UNION ALL

    SELECT 'R2695719501','May 19 2015 12:00AM','Dec 31 2500 12:00AM','Current' UNION ALL

    SELECT 'R9460813602','Jan 1 1753 12:00AM','Mar 4 2015 12:00AM','Historic' UNION ALL

    SELECT 'R9460813602','Mar 5 2015 12:00AM','Dec 31 2500 12:00AM','Current' ;

    Please note that I will be offsite tomorrow and unlikely to be able to respond before Thursday morning UK time. Thanks for all the help so far and for bearing with me. Sorry this is such a long post.

    MarkD

  • John Mitchell-245523 (12/9/2015)


    Kim Crosser (12/8/2015)


    In similar situations, I found the following to work pretty well:

    SELECT PatientID, ...

    from dbo.MDTest tbl1

    where tbl1.IndexEndDate > @dParamDate

    and not exists (select 1

    from dbo.MDTest tbl2

    where tbl2.PatientID = tbl1.PatientID

    and tbl2.IndexEndDate > @dParamDate

    and tbl2.IndexEndDate < tbl1.IndexEndDate);

    This will return the latest record where the end date is greater than the specified date, but where there are no other records for the patient where the other record's end date is also greater than the specified date, but earlier than the "candidate" record's end date.

    This avoids all functions, data type conversions, etc. This would work especially well with a clustered index on (PatientID, IndexEndDate).

    Yes, fine on small data sets. But check out the execution plan - it does two scans of the table and therefore as the table gets larger and larger you are likely to see performance deteriorate.

    John

    Did you try this on a large data set? On a small data set, it will definitely do two entire passes, but with a larger data set (and a clustered index on (PatientID, IndexEndDate)), the second pass should not occur - it should use at most an Index Scan on that index.

    There is no way to avoid the initial full scan of the table, as the query for IndexEndDate only has a lower bound - without an upper bound, a full table scan is the only possible outcome.

    IFF there was a way to be sure that all records had some end date bounds, the query could be more efficient. For example, if you could safely *assume* that all patients of interest have at least one recorded encounter within a 3- or 4- year period, then you could bound the initial query as:

    declare @dUpperBound datetime = dateadd(year,3,@dParamDate);

    SELECT PatientID, ...

    from dbo.MDTest tbl1

    where tbl1.IndexEndDate between @dParamDate and @dUpperBound

    and not exists (select 1

    from dbo.MDTest tbl2

    where tbl2.PatientID = tbl1.PatientID

    and tbl2.IndexEndDate > @dParamDate

    and tbl2.IndexEndDate < tbl1.IndexEndDate);

  • DOH!

    There IS an upper bound - today!

    Therefore, try the following:

    create unique NONCLUSTERED index uixEndDate on <table> (IndexEndDate, PatientID);

    create unique CLUSTERED index ucixPatientEndDate on <table> (PatientID, IndexEndDate);

    Then, run the query using "today" (i.e., getdate()) as the upper bound, as follows:

    declare @dUpperBound datetime = getdate();

    SELECT PatientID, ...

    from dbo.MDTest tbl1

    where tbl1.IndexEndDate between @dParamDate and @dUpperBound

    and not exists (select 1

    from dbo.MDTest tbl2

    where tbl2.PatientID = tbl1.PatientID

    and tbl2.IndexEndDate > @dParamDate

    and tbl2.IndexEndDate < tbl1.IndexEndDate);

    Instead of a full table scan on the first pass, the query optimizer should only have to do an Index scan on the nonclustered index to get the "possible" records, followed by an Index Seek (I hope) on the second index to verify the correct records.

    This may not be that efficient when searching for records from a several years ago, but *assuming* that most queries are going to be for the fairly recent past, this should return a reasonable result set, and since the PatientID is included in the index as well as the IndexEndDate, in the second query the resulting records can be obtained without referencing the main table at all.

    Depending on the distribution of the data, it *may* be better to switch the clustering, so you may want to try both ways, but I think this should be the optimum configuration.

    Why two indexes? The first nonclustered index provides a searchable index for the date range when the PatientID is not yet known, but including the PatientID in the index allows it to be a unique index (helping the optimizer decide to use it). You could instead use "... on <table> (IndexEndDate) INCLUDES (PatientID)", but then the index would not necessarily be a valid unique index - there could be multiple records with exactly the same date/time values.

    For the subquery, we now know the PatientID and one or more IndexEndDate values, so now the query optimizer can use the second index to quickly verify whether each candidate record is the smallest date and discard the others. Since these are clustered on PatientID and date, index seeks (or an index scan at worst) should occur.

Viewing 5 posts - 16 through 19 (of 19 total)

You must be logged in to reply to this topic. Login to reply