A real challenge and I need suggestions on efficient ways to solve it

  • I have a need to determine the site of cancer given some complexities. Each cancer patient has up to 4 diagnosis codes with the first being the primary. I currently have a mapping table that is capable of mapping up to two values to a result value. Currently I only check the primary diagnosis to see if it is on the mapping to determine the cancer site.

    We are now being asked to do more complex mapping for certain primary diagnosis codes. If the primary diagnosis code is one of 6 particular codes then I have to interrogate the second, third and fourth diagnosis lines ( they are separate lines in a patient's diagnosis list, not separate columns on one record)  to see if there is a diagnosis in any one of those that is also on the cancer diagnosis list mapping.  If the primary diagnosis code is NOT one of the 6 special codes, then you simply check if there is a cancer code on any of the lines and use the first one you find.

    To complicate this, the table being used that has the diagnosis codes for each patient admission is many billions of records. Yes with a 'B'. I can split the load by year as it is partitioned that way. So adding more complicated logic is going to have serious performance implications.

    The table structures are in the attached WORD document

    So the pseudocode logic is this:

    • Read primary diagnosis record (OHADiagnosisOrder = 1)  for an admission.
    • Determine if the DiagnosisCode is one of the special codes (these will be flagged as special in dim_OHADiagnosis) and is in the dim_Mapping table as a Cancer Site diagnosis (about 1300 records)
    • If it is one of them, then determine whether the second, third or fourth diagnosis record is for a value in the dim_Mapping table as a Cancer Site diagnosis. If any of them are on the list, take the lowest sequenced one and get the MapToValueString value in dim_Mapping (it will be a code).
    • If are no other diagnoses that are on the cancer site list, use the mapping value for the special diagnosis code.
    • If the primary diagnosis is NOT one of the special ones, then use the MapToValueString value in dim_Mapping for the first diagnosis that is on the dim_Mapping (does not need to be the primary diagnosis record). So out of up to 4 diagnosis codes being interrogated, if any of the 4 match up to dim_Mapping, table the first one by sequence and use that to get the mapped value.

    Given that the table is so large, I don't want to have to read every record.

    I have thought of the following possibilities, none of which are great:

    • PIVOT the first 4 diagnosis code records into 4 columns per admission and join 4 times to the dim_Mapping, once for each column. Then use CASE logic to determine which mapping value to use;
    • Use a window function with LEAD to read all of the diagnoses into 4 columns and then feed that to a higher level query that joins to dim_Mapping 4 times
    • Do something I cannot currently imagine working! LOL.

    This is where I need your help, folks. Please let me know if you need any additional information from me. I appreciate any and all suggestions.

    Diana Bodell

     

     

    Attachments:
    You must be logged in to view attached files.
  • do you have any sample data and expected results?

    For better, quicker answers, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • Is it going to be impossible to discuss without sample data? If so, let me know and I will create a spreadsheet. If you have a test data generator I guess you could use that. I can't send real data for obvious reasons... let me know.

  • diana.bodell 56517 wrote:

    Is it going to be impossible to discuss without sample data? If so, let me know and I will create a spreadsheet. If you have a test data generator I guess you could use that. I can't send real data for obvious reasons... let me know.

    I'm not going to generate test data since I don't know what your data (or sample data) looks like. My data would only go down the 'Happy Path'

    For better, quicker answers, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • diana.bodell 56517 wrote:

    Is it going to be impossible to discuss without sample data? If so, let me know and I will create a spreadsheet. If you have a test data generator I guess you could use that. I can't send real data for obvious reasons... let me know.

    We don't need real data and we don't need real table definitions either. You can make it all up, and then take us through a worked example, using what you have created.

    We can help you with an optimal solution for your example, and then you can use the techniques provided to optimise your own solution.

    And please, not in spreadsheet or Word form. Your sample DDL and DML should be included in a text file – such that we can copy and paste into SSMS. The link in my signature will help with this.

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • If you want better performance, I think we do need actual table definitions, esp. including the indexes.  It appears the diagnosis id is what's critical.  You should likely be partitioning and clustering the big diagnosis table on that rather than year / identity.

    The fact table also has bloated columns.  Since it's billions of rows, that's not good.   More on that in a separate post.

    Your pseudo-logic would be very helpful, except I can't follow it because it doesn't match the table definitions you provided.  Keep in mind, we know nothing about your data.   I'll try to show what I mean:

    "Read primary diagnosis record (OHADiagnosisOrder = 1)  for an admission."

    This is clear.  A separate index should be able to help this (specifically a filtered index with "WHERE OHADiagnosisOrder = 1" and the minimum number of other columns needed to do the other necessary lookups).

    "Determine if the DiagnosisCode is one of the special codes (these will be flagged as special in dim_OHADiagnosis) and is in the dim_Mapping table as a Cancer Site diagnosis (about 1300 records)"

    Nope, don't really understand.  Don't see a "dim_OHADiagnosis", don't see anything in the tables about "Cancer Site".   Presumably specific values in the dim_mapping table mean that, but I don't really know.

    The good news is, I think once that's clarified, the rest of it should make a lot more sense.

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!

  • Regarding the fact_OHADiagnosis table:

    1) should likely be clustered / partitioned by ( OHADiag_id, OHADiagnosisOrder ). You can have a table that can quickly translate FileYear to beginning and ending OHADiag_id. (I'm assuming a diag id is a point-in-time event, not multiple events across time. That may not be correct.)

    2) don't use nvarchar unless absolutely necessary, since it takes twice as many bytes. Use char instead of varchar when appropriate.

    3) check to see if page compression would significantly reduce the table size (typically it will, but you should definitely check it first):

    EXEC sys.sp_estimate_data_compression_savings 'OHA', 'fact_OHADiagnosis', 1, 1, 'PAGE' /*check partitions one by one*/

    --Here are some suggestions for table improvements (again, just based on general knowledge, 
    --since I don''t know your data specifically):

    CREATE TABLE [OHA].[fact_OHADiagnosis](
    [IdentityRow] [bigint] IDENTITY(1,1) NOT NULL,
    [RowID] [int] NOT NULL,
    [FileName] [nvarchar](100) NULL,
    [FileYear] char(4) /*[nvarchar](4)*/ NOT NULL,
    [FileType] char(3) /*[nvarchar](3)*/ NOT NULL,
    [OHAICDType] [smallint] NOT NULL,
    [OHADiag_id] [int] NOT NULL, ---- JOIN TO DIM_ICD10DIAG
    [OHADiagPOA_ID] [int] NOT NULL,
    [OHADiagnosisOrder] [int] NOT NULL, --- SEQUENCE OF DIAGNOSIS IN ADMISSION
    [Package Name] varchar(64) /*[nvarchar](64)*/ NOT NULL, /*even better, encode this to a smallint id you assign to represent the name*/
    [Execution Start Time] [datetime] NOT NULL,
    [Machine Name] [nvarchar](15) NOT NULL, /*better to encode this to a tinyint or smallint*/
    [Task Name] varchar(64) /*[nvarchar](64)*/ NOT NULL, /*if task names repeat across diags, better to encode this*/
    [User Name] [nvarchar](64) NOT NULL, /*better to encode this*/
    [OHAEncounterKey] varchar(20) /*[nvarchar](20)*/ NOT NULL
    ) ON [PRIMARY]
    GO

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!

  • How is a  DiagnosisCode flagged as special in dim_OHADiagnosis?

    How is a row in  dim_Mapping identified as a Cancer Site diagnosis?

    Which columns are used to join the tables together?

     

     

  • Good morning. Just to let you know I am in the process of putting together the DDL and DML and sample data you requested. It will take me a few days but I will post it out here as instructed. I really appreciate the help.

  • Hi, Scott. Thanks for your input. My question isn't so much about how to improve performance of the database but what SQL logic would be most efficient to use given that the table is very big. I am in the process of creating sample data and a better pseudo-logic diagram to allow you all to assist me more efficiently.

  • In response to your comments about better design, I do have reasons for using ANSI standard data types in this case. The rest of my data marts and warehouse are strictly Microsoft T-SQL data types because there is nothing coming in from non-Microsoft sources. In this case, however, I would like to avoid any conversions.

    My DBA and I did look at compression and it did not pan out.

    A diagnosis is not a moment in time. It is a code that is used to define the patient's issue and there is a one to many relationship between a diagnosis code and a patient and/or admission. A patient can have many visits and each visit may have many diagnosis codes associated with it. Sometimes we look at things for patients who have had a diagnosis and sometimes we look at admissions where the diagnosis code is present.

  • So after trying to get sample data for you guys to use, it turns out that the data is not organized from the source in a way that even makes what I am trying to do possible. I didn't realize that the source does not actually specify which diagnosis is the primary, secondary or tertiary in a way that is determinate. The customer has agreed that this is not possible and has withdrawn the request.

    I want to thank you though for having spent whatever time you did thinking about this issue.

    Diana

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply