Do You Really Need Surrogate Keys? Part 1 – Concepts and Methods

  • JJ B (10/25/2010)


    Hugo wrote:

    I can think of a giant advantage--at least to my programs.

    .....

    I used to listen to the DBAs who advocated against surrogate keys for "link" tables. In practice, it created a problem. I had to go back and add surrogate keys to several tables in order to create proper user-friendly applications in reasonable time frames. Now, I just add those surrogate primary keys up-front and avoid the headaches down the line.

    I took a simpler approach if someone wanted to modify a relationship in one of your "junction" tables.

    delete the original entry

    perform an IF NOT EXISTS.....INSERT.

    No error. User gets what they wanted even though the entry already existed. Win-win all round.

    Of course the UI shouldn't present them with the option to double add this sort of thing in the first place.

    I absolutely would not add a surrogate key to a "junction" table. I've worked with a developer who insisted on doing things like this. In splendid isolation his design worked fine but when it came to migrate code and data between environments we came unstuck because his data model was so out of whack.

    This was a guy who created primary keys everywhere called "NoUseWhatsoever", "asds", "fgfgfg". For the latter two look at your keyboard.

  • Hugo Kornelis (10/25/2010)


    The whole story about finding duplicates makes no sense to me...You only submit an UPDATE query when the values have changed. If there are key values among the changed values, and you don't want to use (for whatever reason) TRY CATCH error handling, you can simply check for existence of the new key value with EXISTS...

    But if the row has changed, but the key values have not, the EXISTS query will simply detect the preexisting row. So this approach requires you to also save the original key values, test them for change, and run the duplicate-test query only in that particular case. A lot of code (and a lot of failure points) for absolutely no good reason whatsoever. Catch the error, and translate it into something user friendly. Any good developer is going to be doing this anyway, so what's the problem?

    And, as you say, this approach also has concurrency issues. But then so does the OP's "fix" -- ANY type of pre-test has to either be wrapped in an expensive transaction, or else it will fail if a duplicate is created between the test and update.

    As I said in my first post, its just simply a bad idea to try to short-circuit out the constraint checking in the database and move it into the application layer.

  • This was a guy who created primary keys everywhere called "NoUseWhatsoever", "asds", "fgfgfg". For the latter two look at your keyboard.

    Yikes! I definitely feel your pain there.

  • I found the article was well done visually and well written with a good background on the pages.

    But it seems the article is more about varchar indexes vs integer indexes rather than the debate that was being promoted.

    I support Surrogate keys for the following: 1) Easy to change business information i.e. like changing the PartNumber of a Part is extremely difficult if it is the PK. And 2) easier to join tables with 1 column rather than several - although many Natural-Keyed tables have a single column as a PK, when you get to several levels of Parent childs - you get several columns in the PK. But some tables are better with Natural like we use a Branch table and the PK is a 10-char. The users know it is assigned for life. It's easier to use because the Key closely describes what the record is rather than a meaningless int.

    But my question for you, Wagner, is this: did you re-build your indexes and update your stats before you started? If you did - unless I missed it - you should have posted that. If you didn't your results might be quite different had you!

  • A nice article that misses the point.

    The purpose of a surrogate key is to abstract the relationships between rows so they are independent of changes to business identifiers (aka natural keys). Use of surrogate keys is critical in a data warehouse to ensure stable relationships over time.

    It has nothing to do with performance and performace should never be a determinant in using or not using surrogate keys.

  • Surrogate keys on pure junction tables are bad, but it seems like alot of people are putting clustered covering keys over both columns, which can be almost as bad. Junction tables are often subject to out of order inserts and delete activity that will badly fragment the index. My default (without knowing the use cases) is always going to be a heap.

  • cdesmarais 49673 (10/25/2010)


    Surrogate keys on pure junction tables are bad, but it seems like alot of people are putting clustered covering keys over both columns, which can be almost as bad. Junction tables are often subject to out of order inserts and delete activity that will badly fragment the index. My default (without knowing the use cases) is always going to be a heap.

    Is that bad or just pointless? Would a Clustered index on the 2 FKs be worse than a clustered index on a surrogate key on the junstion table?

  • Who cares about performance on small tables. Performance only becomes an issue on large tables. I'd like to see a comparison with tables of 1 million and 100 million rows.

  • denis.hosdil (10/25/2010)


    cdesmarais 49673 (10/25/2010)


    Surrogate keys on pure junction tables are bad, but it seems like alot of people are putting clustered covering keys over both columns, which can be almost as bad. Junction tables are often subject to out of order inserts and delete activity that will badly fragment the index. My default (without knowing the use cases) is always going to be a heap.

    Is that bad or just pointless? Would a Clustered index on the 2 FKs be worse than a clustered index on a surrogate key on the junstion table?

    A clustered index on the surrogate key is (performance wise) the worst option. But a heap may outperform a clustered index on the two FK's depending on use patterns.

  • A clustered index on the surrogate key is (performance wise) the worst option. But a heap may outperform a clustered index on the two FK's depending on use patterns.

    I agree in terms of inserts, updates, deletes - the heap would be fastest.

    But wouldn't a clustered index on the 2 FKs be worse than a clustered index on a Surrogate Key since the Surrogate Key is sequential.

  • cdesmarais 49673 (10/25/2010)


    Surrogate keys on pure junction tables are bad, but it seems like alot of people are putting clustered covering keys over both columns, which can be almost as bad.

    Eh? In the *general* case, it's neither good nor bad. For an OLAP application, it generally improves performance, for OLTP, it generally hurts it. But usage patterns will be the final determination.

    Junction tables are often subject to out of order inserts and delete activity that will badly fragment the index. My default (without knowing the use cases) is always going to be a heap.

    That was true, circa 1999. Currently, SQL Server almost always performs worse with a heap than a clustered index. Google any of a large number of MS/MSDN articles to verify this.

  • denis.hosdil (10/25/2010)


    I agree in terms of inserts, updates, deletes - the heap would be fastest.

    This isn't correct. In general, a clustered index outperforms a heap -- even for inserts and updates. It may seem counterintuitive, but its true.

  • I just did a whole bunch of benchmarking with 2008 r2. On pure junction tables with any sort of reasonable FK sizes, it's still true. The extra int column adds pages to every operation.

  • Denis

    As I said on this forum, I'm also a fan of surrogate keys.

    Actually, my goal in this first article is to show the concepts and methods I'm using to compare a surrogate key against any other candidate key.

    I pick the product table (surrogate versus char natural key) just to illustrate the use of those methods.

    And, by the way, I built 4 new tables from scratch, so I would not have to consider the impact of extra indexes. Ok, this is not a good implementation on the real world.

    There are so many issues impacting performance that I tried to isolate things and make the comparison a bit easier to understand.

  • Huh? What extra int column are you talking about? We're talking about a junction table with two FKs, period. The question is whether you have a clustered index or not on that table....and whether or not you do, you still have the same column count.

    The question of whether or not you add a surrogate to that table is a different story entirely. That WILL hurt performance...and you don't need any profiling to understand that.

Viewing 15 posts - 61 through 75 (of 178 total)

You must be logged in to reply to this topic. Login to reply