Do We Need a PK?

  • Will Summers (12/11/2007)


    Maybe I'm just asking what is the advantage of a PK over just a simple unique index? Is there any potential increase in speed? Cause if there is, I'm all over it!!!

    The purpose of the PK is that you then have a definitive way to address every record. Mechanically - the only difference between a PK and a UKI is that a PK doesn't allow for nulls in the key, whereas a Unique index does. So - it doesn't really help for speed (over using a unique index).

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • RowID, StateCode, StateName, Deleted, LastUpdated

    Yes indeed. We typically have 5 custom fields on every table (Custom01 through Custom05) but not on the fact tables.

    RowID, OrderTypeID, OrderTypeName, Deleted, LastUpdated

    What do you think you are gaining by having a second Unique column in these tables?

    In the state tables I would understand if you had a CountryCode, but not a RowID that just duplicates the functionality of the StateCode field. And In the OrderTypes why do you have a RowID and an OrderTypeID, what is the benefit?

    --

    JimFive

  • I guess I am still unclear about the definition of PK here, not the fact of how PKs are implemented (or may be wrongly implemented) in real database systems.

    My understanding is that a PK is one or more attributes of an entity that uniquely defines a tuple (I don't like that esoteric term but I am trying to stay theoretical) in that entity. If you have a one-entity model - theoretically possible even if it is almost never done in practice - then there is no other entity that will have FKs referencing the PK. Yet the need to uniquely define a tuple will still exist. Otherwise the entity is not a valid entity in the relational model, from what I have read - although in the real world a real table can be created without a PK.

    I'm with you. Isn't a PK by any other name still a PK? If a table has a single valid unique set of columns that identify a single row in the table, why wouldn't that be the PK of the table? If it has more than one valid unique set of columns that identify a single row in the table, each of them would be a key for the row; but of all the available keys, wouldn't one of them still be the "primary" key? (I'm just waiting to see Mr. Celko pop on here and state yet again that "If you don't have a primary key, you don't have a table." 🙂 )

    I don't see the benefit of adding a single unique constraint (index) to a table as opposed to just explicitly declaring it to be the PK. Both approaches will still impose an index to enforce uniqueness in SQL Server, and neither need be clustered.

    Andrew

    --Andrew

  • James Goodwin (12/11/2007)


    What do you think you are gaining by having a second Unique column in these tables?

    Excellent question, James. Our policy is to never use auto number columns for anything other than identifying a row for update or delete. That column has no meaning to the relationships whatever. They are strictly to identify that specific row.

    It's not a policy that will work for everone but it has saved us so much time during repairs that it has been worth it. You milage may vary.

    ATBCharles Kincaid

  • Excellent question, James. Our policy is to never use auto number columns for anything other than identifying a row for update or delete.

    I guess that doesn't really answer my question. If the row is identifiable using the e.g. StateCode then what benefit is it to identify the row by an identity? You can already find that row by a much more natural method. I would agree that using a synthetic id makes sense when the PK is unwieldly or difficult to ascertain, but to use one in what amounts to a simple code table seems like a "foolish consistency"*

    Now, having said that, I can come up with a reason to do this for the state code, but not for the OrderType code. For the state code, if you decide later that you need to add the country you would also then need to add the country to any table that referenced the state code. However, that is a result of the real-world fact that states are within countries. With the ordertype I don't see that situation arising as you basically are defining two codes, an integer(identity) code and a code of a different type, that have the same usability.

    --

    JimFive

    *Ralph Waldo Emerson

  • James Goodwin (12/11/2007)


    Excellent question, James. Our policy is to never use auto number columns for anything other than identifying a row for update or delete.

    I guess that doesn't really answer my question. If the row is identifiable using the e.g. StateCode then what benefit is it to identify the row by an identity? You can already find that row by a much more natural method. I would agree that using a synthetic id makes sense when the PK is unwieldly or difficult to ascertain, but to use one in what amounts to a simple code table seems like a "foolish consistency"*

    Now, having said that, I can come up with a reason to do this for the state code, but not for the OrderType code. For the state code, if you decide later that you need to add the country you would also then need to add the country to any table that referenced the state code. However, that is a result of the real-world fact that states are within countries. With the ordertype I don't see that situation arising as you basically are defining two codes, an integer(identity) code and a code of a different type, that have the same usability.

    The value I see is to avoid the 3rd normal form update anomalies. If your "natural PK" is based on data that has meaning to the end-users/outside world, you'll eventually run into the scenario where someone wants to change one of the values, or change the format of said identifier, then you have a whole bunch of tables to start updating, so as to not lose relational integrity.

    For example - I helped manage/set up a physician billing system for a while for a large Healthcare organization. Against my advice - the unique identifier for a physician was tied to a national Unique identifier generated by Medicare, called the UPIN. All was fine and great, until CMS (the organization behind the Medicare program) decided that the UPIN wasn't "unique enough", and started an entirely NEW numbering system, called NPI, which is to replace UPIN. As a result - some 50 people and a team of consultants have been tasked this entire last year to go look for and replace all of these identifiers with the NEW identifiers (through a dozen systems feeding or being fed by this billing system), losing thousands and thousands of manhours that didn't need to happen.

    By the way - they replaced the OLD "externally significant" number with the NEW externally significant number...They just don't listen very well.....

    If you can avoid it - never enforce DRI based on data that is in any way significant to the end user.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • webrunner (12/11/2007)


    I guess I am still unclear about the definition of PK here, not the fact of how PKs are implemented (or may be wrongly implemented) in real database systems.

    My understanding is that a PK is one or more attributes of an entity that uniquely defines a tuple (I don't like that esoteric term but I am trying to stay theoretical) in that entity. If you have a one-entity model - theoretically possible even if it is almost never done in practice - then there is no other entity that will have FKs referencing the PK. Yet the need to uniquely define a tuple will still exist. Otherwise the entity is not a valid entity in the relational model, from what I have read - although in the real world a real table can be created without a PK.

    Part of the problem is the oversimplified definition of PK used so often in textbooks. A PK is a unique identifier used to provide referential integrity. The WHY you have a PK is just as important as the what it is. The need to uniquely identify a tuple can be accomplished without the use of a PK by simply using a unique constraint. Don't want nulls, add a non-null constraint. Need look-up speed, add an index. Need to find a particular row, specify the set conditions that define that row. No natural way to do that, add a surrogate key. Does a surrogate key have to be a PK? No more than any other candidate key.

    The original question was does every table need a PK. No. The many-to-many resolution table is a good example. Other than preventing duplicates, there is no reason to ever uniquely identify a single row. The table will never be accessed by the full key (since it would require both FKs at the same time, and there would be no need for the intersect table if they're both known). If the PK would never be used as an FK somewhere else, no reason for enforcing RI. Ergo, why would you need a PK?

    Suppose, however, that for whatever reason, you have just an employees table. Would you not need a PK to uniquely identify each employee?

    I know this sounds like splitting hairs, but I just want to understand what the real meaning of a PK is. You seem to be saying that it has nothing to do with uniquely identifying a row, which is different from what I learned.

    Thanks,

    webrunner

    Short answer, no. You might have a business need to uniquely identify that row and that can be handled through either constraints or a surrogate key if there is no other way. There may be any number of candidate or alternate keys, but that is just a way of saying they could serve as a PK if needed. One reason PK has become a synonym for uniqueness is that most DBMS's give you a unique constraint, non-null constraint, and an index when you specify a PK so it is just simpler than specifying them yourself. It also provides the meta-data the DBMS needs to provide RI. The overhead is usually low so, again, it is just easier to think of a PK as a unique id regardless of RI.

    I'm not saying PK has nothing to do with uniquely identifying a row, its just not a sufficient reason. There are plenty of mechanisms that can provide uniqueness and identity simultaneously on the same row. The PK designation means that it carries the additional burden of providing RI. You must be unique to be a PK, but you don't need to be a PK to be unique.

  • We have a data warehouse where the transactional areas are most often updated incrementally. We use two main sets of tables for the transactional "fact" tables, the main end fact tables which hold all the historical records, and the staging tables wwhich hold just that loads "daily" data for the incremental.

    The staging tables are not indexed and have no primary key as they are deleted each day for the run and indexes slow down deletion and force DELETE statements. The end fact tables are indexed aand have primary keys etc of course but no reason for them on the staging tables that we see.

  • The only time I have ever seen a regular database structure that is logically defined but not physically enforced with keys (PK,FK) is in CRM/ERP applications. The number one reason I see sited is performance based. Every time a record is inserted the keys have to be checked. Since they load all the data verification through the front end of the application they don't do it in the back end database. My argument was even if you put a non-unique clustered index you are know bumping up against the engine inserting data into the clustered index to make it unique.

    I haven't personally measured the hit you take for using a PK but I think it may be time to test several scenarios.

    PK,FK normal physical constraints.

    PK no FK

    no PK no FK non-unique clustered index (but unique data)

    no PK no FK non-unique clustered index (non-unique data, basically to simulate bad data entry through the app)

    no PK no FK non-clustered only indexes on heap.

    I'll be using the TPC-C database structure since I already have a load generator for it.

    if anyone can think of any others let me know.

    Wes

  • I’m still a lot wet behind the ears in the DBA position; so maybe this post won’t make sense or is completely irrelevant. If so, then please be kind if replying to it.

    When I first read the editorial I noticed that Steve Jones stated that he had no reason not to use PKs based on recommendation and the advice of MS. I first started pondering this; if MS recommends it then there must be a GOOD reason, right?

    On my wall I have a huge poster of the “SQL Server 2005 System Views”, and from my understanding this is views of the system and lists the relationships (and possibly constraints?) between the entities/objects. So, I started to notice there are a whole lot of objects that don’t have anything listed for a PK or Unique ID. Granted that the majority of times is for a surrogate object; but there are some that are related to other objects and still don’t have a PK or ID of some sort.

    An example is the sys.computed_columns is a derived object of the sys.colums.

    Another example is the sys.server_triggers has a 1..many relationship with sys.server_events.

    Can this be right, could MS itself deem that there are times that you don’t need a PK or Unique ID?

    Again, I could completely be misunderstanding this map and still have too little knowledge to actually be providing an insight in this posting; if this is the case then I apologize for wasting anyone’s time.

    Thanks,

    James

    ~ Without obstacles, you cannot progress ~
    http://sqln.blogspot.com/

  • I cannot speak for Microsoft, but my guess would be they eliminated superfluous constructs from the physical model in the interest of performance.

    I do the same thing. I will not create a database, table, column, index, constraint,... without justification. If it's needed, great. If not, it's just overhead.

    On the flip side, I can understand Microsoft's recommendation on defining a PK for every table. There is one indisputable benefit to towing the Microsoft line; clarity. If I design, develop, and then implement all of my databases using their recommendations, then any other DBA (who also conforms to Microsoft methodology/recommendations) who sees my work will more easily understand it. But then, the same could be said of any methodology.

    Take care,

    Bert

    "Speculations? I know nothing about speculations. I'm resting on certainties. I know that my Redeemer lives, and because He lives, I shall live also." - Michael Faraday

  • My guess is either the person building the poster didn't do a great job or the developers did what many people do when they're building a database:

    They forget to add a PK

    😉

  • Based on the rest of Microsoft's documentation, I'd choose the former. :crying:

    Take care,

    Bert

    "Speculations? I know nothing about speculations. I'm resting on certainties. I know that my Redeemer lives, and because He lives, I shall live also." - Michael Faraday

Viewing 13 posts - 46 through 57 (of 57 total)

You must be logged in to reply to this topic. Login to reply