Table Theory - In Table Row Versioning - Questions - HELP!?!?

  • At my current workplace, there is a group of programmers working on our next-generation app for in-house work. Their work includes the design of the enterprise DB, which is where I have a question for people.

    In the current design, they are implementing in-table row versioning. In this implementation, when a row is modified, instead of actually modifying the existing row, the original row is flagged as 'deleted', and a new row is added. A pair of Date fields indicate the date/times that the row was valid from - to (labeled as Date added and date modified), with no IsDeleted flag indicating the current row.

    Each table will have an identity field, and a second surrogate 'Identity' field for identifying the rows related to one entity. The table's PK (and a clustered index) will be based on the second 'Identity' field.

    Personally, I believe that tables like this will become increasingly problematic as time goes on, growing ridiculously large and slow to query, and difficult to run applications against. However, I need some actual information on this:

    1) Has anyone ever actually considered an implementation like this in SQL Server?

    2) If so, what were your thoughts on this?

    3) If you went to Implementation, what were the results?

    4) Are there any existing theoretial works that support this methodology? If so, what are they?

    This is one of the most intelligent groups I know on SQL Server questions, and I would really like an assist on answering this.

    THANKS!

  • Sounds like a good plan... except for the part where the history of the table is kept in the master table. I can't imagine having to go through millions of useless rows just to fetch some simple data.

    I would think that it's best to put the history of the data into it's own history table... Then when you need to check the history, you go in the history and when you need the data, you go get the data .

    As for theory... I would think that I wouldn't want to keep the apples that have passed their due date with the good ones in the same basket... but that's just me .

  • I know of no theory that would support this implementation, but I can name one that says this is a bad implementation. How about the relational model of database management? You should not store different things in one table. A flag that makes a row change 'mode' indicates that there are actually more than one kind of entity stored in a table.

    And what do you mean with two surrogate keys using identity? Since every new version of a row will receive a new pk it means rows in other tables that reference the old row will need to be updated with the new key.

  • Hehe.. I missed that one (referencial integrity). But I think it's the best reason to stop (or not start) this non-sens. Not to re-mention speed, ease of use, correct design and a few others I'm surely missing again.

  • this is another classic example of programmers getting their hands on the data, brainstorming, and then coming up with a solution that probably fits their needs and they can rationalise until the cows come home.

    i'd say your concerns are more than justified. put your foot down and stand your ground, you have the relational model to fall back on. fight the power.

    cheers

    dbgeezer

  • >>And what do you mean with two surrogate keys using identity?

    Here is what a sample table layout would look like, with generic identifiers:

    VersionID int identity

    RecordID int <<--- Core of PK

    RecordData varchar(12)

    IsDeleted bit

    Created datetime

    CreatedBy Varchar(15)

    LastModified datetime

    LastModifiedBy varchar(15)

    So, let's enter data item 1, and edit it twice:

    1 1 ThisisTests 1 06/01/2005 12:30:00 Brendt 06/02/2005 10:15:00 Mike

    2 1 ThisisTest 1 06/02/2005 10:15:00 Mike 06/04/2005 08:44:10 Brendt

    3 1 ThisIsATest 0 06/04/2005 08:44:10 Brendt

    As near as I can tell, this means RI through triggers, and no actual PK/FK relationships. Piles of data, long queries or inclusion of LastModified Is Null in all queries. Lots of additional overhead, bad indexing, and bad performance is what I see coming.

  • Amen.

    May the force be with you .

  • Can you live with this type of query???

    Select VersionID, RecordID, RecordData, isDeleted, Created, CreatedBy, LastModified, LasmodifiedBy from dbo.YourTable Main inner join

    (Select RecordID, max(LastModified) as LastModified

    from dbo.YourTable

    group by RecordID) dtCurrentDate on Main.RecordID = dtCurrentDate.RecordID and Main.LastModiFied = dtCurrentDate.LastModified

    EDIT

    Sorry I forgot about the IsDeleted Column. You'd still have to scan the whole table, flush 90% of the data, to present the whole table in a report.... is this something you can live with???

  • Nope - that's why I am looking for outside reinforcing opinions on this table structure. Thanks!

  • If you need real punch, you should ask Joe Celko to comment on this design. They'll start running for their mommies once he's done with 'em .

  • You don't even need to look at how complex the queries for this table will be. It is absolutely not an acceptable design! There is no PK (RecordID cannot be PK since there will be many rows with equal values for it). Like I said before the table contains multiple types of entities. There is unnecessary data stored with the Created/LastModified design. There is no real key (at least not in this example). Etc.

    Say no, get a copy of Fabian Pascal's Practical Issues in Database Management and have the designers read that if they are going to have anything to say about the database design.

  • Indeed.

     

    To the dark side poor design leads.

     

    cheers

    dbgeezer

  • Chris already mentioned all there is to say. You should suggest, more or less gently, that those developers take some course in database design. Maybe you can get your hands on http://www.oreilly.com/catalog/databaseid/

    I've started yesterday to review it and it looks very promising. Relational theory mere mortals outside the ivory tower can understand.

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/[/url]

  • As it has been said: History should be managed with different table or you could consider using different dbms product that has a native support for multiversion data.

  • I have successfully implimented this type of historical schema but with a slightly different approach such as:

    -- The CORE table

    create table Partys

    ( PartyId integer identity(1,1)

    , EffectiveDt datetime not null

    , ExpirationDt datetime not null

    , constraint Partys_P primary key (PartyId)

    )

    -- The Version table

    -- Note that PartyVersions should not have its own identity column.

    Create table PartyVersions

    (PartyId integer identity(1,1) not null

    ,AsOfDt datetime not null

    ,CreatedDt datetime not null

    ,CreatedBy Varchar(15) not null

    ,LastModified datetime not null

    ,LastModifiedBy varchar(15) not null

    -- Data Columns go here

    , constraint PartyVersions_P primary key (PartyId , AsOfDt )

    , constraint Partys_F_PartyVersions foreign key

    (PartyId ) references Partys

    )

    -- A version table with a uniqueness constraint

    Create table PartyTaxIdVersions

    (PartyId integer identity(1,1) not null

    ,AsOfDt datetime not null

    ,CreatedDt datetime not null

    ,CreatedBy Varchar(15) not null

    ,LastModified datetime not null

    ,LastModifiedBy varchar(15) not null

    ,TaxIdentifier integer not null

    , constraint PartyTaxIdVersions_P primary key (PartyId , AsOfDt )

    , constraint PartyTaxIdVersions_P unique (TaxIdentifier, AsOfDt )

    , constraint Partys_F_PartyTaxIdVersions foreign key

    (PartyId ) references Partys

    )

    All Queries for this class of structure will be similiar to

    select *

    from Partys

    join PartyVersions

    on PartyVersions.PartyId = Partys.PartyId

    join (select PartyVersions.PartyId

    , MIN(AsOfDt) as AsOfDt

    from PartyVersions

    where AsOfDt >= @AsOfDt

    ) as PartyVersionLatest (PartyId, AsOfDt )

    on PartyVersionLatest.PartyId = PartyVersions.PartyId

    and PartyVersionLatest.AsOfDt= PartyVersions.AsOfDt

    join PartyTaxIdVersions

    on PartyTaxIdVersions.PartyId = Partys.PartyId

    join (select PartyVersions.PartyId

    , MIN(AsOfDt) as AsOfDt

    from PartyTaxIdVersions

    where AsOfDt >= @AsOfDt

    ) as PartyTaxIdVersionLatest (PartyId, AsOfDt )

    on PartyTaxIdVersionLatest.PartyId = PartyVersions.PartyId

    and PartyTaxIdVersionLatest.AsOfDt= PartyVersions.AsOfDt

    Please be aware of these situations:

    The typical SQL will contain a variable ( As Of Date ) within a subquery, views cannot contain variables and therefore, views cannot be used to aid re-usability.

    Since stored procedure use parameter sniffing to determine the optimal query plan, the AsOfDt value must always be passed and must not be defaulted to the current date within the stored procedure. i.e. do not have this logic in any stored procedures:

    if @AsOfDt is null set @AsOfDt = current_timestamp

    Columns that have uniqueness constraints should be in a seperate table or updates to the other columns will cause many rows having the same value which increases the index size resulting in performance problems.

    The Core table and the version table should be on different disks to increase thruput.

    SQL = Scarcely Qualifies as a Language

Viewing 15 posts - 1 through 15 (of 30 total)

You must be logged in to reply to this topic. Login to reply