Time Bomb Design - A Longer Fuse

  • Stefan Krzywicki (5/26/2010)


    eric_russell 13013 (5/26/2010)


    Perhaps my biggest data modeling pet peve is when date/time values are contained in a VarChar column. Not only does this consume 20+ bytes, compared to only 4 bytes for SmallDateTime, but often times the values are sporatically miscoded with invalid dates or multiple applications code the dates in multiple formats, resulting in data type mismatch errors and convoluted case expressions in the SQL statement to decode the data. The reason typically provided for this is that different applications need to display the date in different formats. Of course, this can be better achieved by storing dates as Date in the database using a formatting mask on the application form controls.

    There is a valid reason to use varchar to store dates, though it doesn't come up often, especially in businesses in America. Before SQL Server 2008, the earliest date you could store in a date field was January 1, 1753. If you're creating a database that includes dates older than that, your only recourse was a varchar field. I've created databases for historical events and fictional events that couldn't use a date field for this very reason. And there are businesses in other parts of the world that have accounts that date back before 1753, so it could theoretically come up then too.

    With 2008, you can enter any date in AD through December 31, 9999. Even most SF won't give you a problem there. I suppose for BC you can add a flag to avoid needing to store it in Varchar.

    Several years ago, I developed a table with over 500,000,000 rows and which received 100,000s of inserts per day. We were only concerned about the date the rows were inserted, so we stored the insert date as an integer offset from a specific date in the past. Let's assume that Jan 1, 2010 is the offset date, so a query that returns a count of rows inserted in the past 30 days would be something like this:

    select @report_date_offset = datediff(day,'2010-01-01',getutcdate())-30;

    select count(*) row_count

    from tran_master

    where insert_date_offset >= @report_date_offset;

    I recall this specific case it was a 2 byte integer (SmallInt), allowing something like 90+ forward years of dates, and it only referred to forward dates. However, a 4 byte integer could point to practically any date forward or backward in time. To refer to dates prior to the offset date, you would use integers < 0.

    Of course, now we have the Date datatype in SQL Server 2008 which is 3 bytes and supports range from jan 1, 0001 - year 9999.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Any day I can open my email and see a new David Poole article, I know it's going to be a good day!

    Your comment made me think:

    Jeff Modem made the comment that people write time bomb code "because there's some sort of guarantee that there will never be more than some small number of rows in the table".

    A senior colleague opined that designing a system that isn't expected to grow is effectively designing a system that the business already regards as a failure before the first line of code is written!

    I was flipping through the news the other day and I saw a Congressman grilling a Wall Street trader. The Congressman asked the trader, "Do you know what IWBH/YWBH means?"

    After hedging and stalling, and a few reiterations of the question, the trader finally admitted it was shorthand for "I Won't Be Here/You Won't Be Here," a slogan a lot of traders said to each other when they were making instantly profitable deals with very bad repercussions for everyone else in the known universe.

    I think a lot of time bomb code is ordered by managers, spec'd by business analysts, designed by architects and built by developers who suffer from this same outlook on the future.

    Thanks for the great article David!

    Michael

  • Everything I used to write that had to do with the database was in long procedures due to requirements from the client, but I suppose that since jumping on the "database is just a persistant storage" (please take that with a huge grain of salt) bandwagon I've started reducing the amount of SQL written and moving more into the DAL, so the "large blocks of SQL" have become much less of an issue (I feel that this is the way things are moving...). On the other hand though, aliases have always been very clear in the large procedures which I've inherited; It's actually the poor naming of variables and parameters used throughout that lead to tylenol abuse, hehe.

    I'm sure that my point of view simply stems from my programming background 🙂 It's always nice to hear what others are doing and why. IMO I don't think that there's a right or wrong way regarding this, or an easier or a more difficult way, because I've used both; As long as the entire database follows the same standard then everything works out (unfortunately, the databases I'm currently working with have hundreds of tables, and there's no standard whatsoever!... ugh!).

    Cheers,

    Cyle

    @ The "customer_id" example:

    I still don't see how the fowing scenario could be confusing...

    PK Table: Customer

    PK: ID

    FK Table: Order

    PK: ID

    FK: CustomerID

    select

    ord.ID,

    cust.ID

    from

    dbo.[Order] ord

    inner join dbo.Customer cust on cust.ID = ord.CustomerID

    For me, if i were to write "cust.CustomerID" it feels to me like cust.CustomerID is referencing another table, whereas "cust.ID" feels more like i'm referencing the ID of the customer. If you wanted to be super-clear you could skip the aliases and simply:

    select [Order].ID, Customer.ID from dbo.[Order] inner join dbo.Customer on Customer.ID = [Order].CustomerID

    Of course you'd still have to use aliases if you were referencing the same table twice...

    I think both approaches have their pro's and cons. Using TableNameID gives you consistency with that PK throughout the DB, while using ID gives a sense of origination when you're referencing it (at a glance you know if the referenced table has the PK or if it's an FK).

  • sipe_16 (5/27/2010)


    If you wanted to be super-clear you could skip the aliases and simply:

    select [Order].ID, Customer.ID from dbo.[Order]

    inner join dbo.Customer on Customer.ID = [Order].ID

    Nope, doesn't make it easier for me, especially since the [Order].CustomerID was accidentally changed into [Order].ID

    I prefer <Tablename>.<Tablename>_ID to avoid errors like these. Clarity above brevity. But that's just what works best for me.


    Dutch Anti-RBAR League

  • Hehe sorry, I was going back-and-forth between work and writing so I wasn't focused. In all fairness, I would have made the same mistake regardless of which method I used because the only difference would have been "[Order].CustomerID" vs. "[Order].Customer_ID". Having written "Order.ID", I had "order ID" on my mind so I would have written [Order].Order_ID using your naming.

    I do see your point, however, with regards to being able to accidentally join the Customer table with the Order.ID instead of Order.CustomerID.

    Cheers,

    Cyle

  • David, assuming you meant Jeff Moden, not Jeff Modem? Although now I'll be hearing that old-school crackle and squeal of the dialup modem every post he makes...:-D

    ---------------------------------------------------------
    How best to post your question[/url]
    How to post performance problems[/url]
    Tally Table:What it is and how it replaces a loop[/url]

    "stewsterl 80804 (10/16/2009)I guess when you stop and try to understand the solution provided you not only learn, but save yourself some headaches when you need to make any slight changes."

  • Cinderella data users!

    Oh how I laughed so long I cried.

    The actual crying is pretty much done with now. That was almost the entire company when I joined. Every department with their own collection of access front ends and spreadsheets, though admittedly the access had a sql server back end.

    Unfortunately access was the, cough, design, cough tool to create the DBs and the people doing it wouldn't know a naming convention or foreign key if it bit them on the arse.

    I still have the list of distinct column names for membership number across the tables in one of the DBs

    memno

    Membership Number:

    mem no:

    txtMemNo

    cboMemno

    cboMemNo

    mem #:

    [font="Arial"]_______________________________________________________
    Change is inevitable... Except from a vending machine.[/font]

  • My thanks to Paul for his help in resolving the SQL2005 conundrum.

    It was my bad. SQL2005 does reclaim space when the clustered index is rebuilt.

    SQL2000 doesn't and this I HAVE definitely checked (on multiple boxes).

    I still maintain that it is better to get the data types and sizes right very early on.

    On the Table.ID issue, what do you do if you have one:one relationships in your DB? How can you tell if two tables can relate together?

  • I agree with most of the article except for this statement:

    "The instant you start writing queries that join tables together ID, name and description are taken outside of their context and are therefore ambiguous and not descriptive of their purpose."

    Completely, totaly disagree. It is *not possible* to take these out of context as they must *always* be in the context of the entity. Suppose we have the following:

    Select Id, Name, Description

    From Table1

    Join Table2

    ....

    This query will not compile in any DBMS of which I'm aware. You must supply the table name/alias and therefore the context to understand their meaning. "CustomerName" on a table called "Customer" is redundant as we would reference the column as "Customer.CustomerName". It is far more logical to have "Customer.Name" The problem comes when developers choose poor aliases:

    Select X.Id, Y.Name, Z.Description

    From Order As X

    Join Customer As Y

    On ...

    Join Product As Z

    On...

    If instead, they used the table name the context would be obvious:

    Select Order.Id, Customer.Name, Product.Description

    From Order

    Join Customer

    On ...

    Join Product

    On...

    Furthermore, what happens when a given foreign key column has multiple meanings? For an employee we might have Id, ManagerId, HumanResourceRepId, UnionRepId. Is the OP really suggesting EmployeeId, ManagerEmployeeId, HumanResourceRepEmployeeId and UnionRepEmployeeId?! Again, this redundant. The columns should be named for what they represent to the entiy in which they are contained with the (IMO very safe) assumption that you know the context or entity to which they belong in order to know what they mean.

    Look at the counter to this, the presumption with having "ProductId", is that you might have an OrderId but no context. Does that represent a ProductId on an order? A ProductId in an assembly? The product itself? Without the context to which the column belongs, having the column by itself is useless.

  • To some extent I agree.

    There is no need to have CustomerName in the Customer table but with columns such as ID I would never rely just on that. I would call it CustomerID if it was the PK for that table.

    Then, when it comes to FKs

    Orders.CustomerID makes sense and all columns which reference CustomerID have the same name everywhere rather than being ID in the parent table and <parentTableName>ID in child tables.

    [font="Arial"]_______________________________________________________
    Change is inevitable... Except from a vending machine.[/font]

  • Thomas-282729 (6/7/2010)


    You must supply the table name/alias and therefore the context to understand their meaning. "CustomerName" on a table called "Customer" is redundant as we would reference the column as "Customer.CustomerName". It is far more logical to have "Customer.Name" The problem comes when developers choose poor aliases:

    Select X.Id, Y.Name, Z.Description

    From Order As X

    Join Customer As Y

    On ...

    Join Product As Z

    On...

    While it may be "more logical", it isn't really within your power to ensure the developers code things your way. One of the main reasons for aliases is to keep from having to type full table names over and over. And the only place you're going to want to preface the field with the table name in most cases is the Primary Key of the column so it will make sense in joins and when used as a foreign key.

    If instead, they used the table name the context would be obvious:

    Select Order.Id, Customer.Name, Product.Description

    From Order As X

    Join Customer As Y

    On ...

    Join Product As Z

    On...

    Except that in your example there, you alias the tables, but then don't use them and have apparently named one of your tables a reserved word "Order". This makes the statement far less clear than your previous statement.

    Furthermore, what happens when a given foreign key column has multiple meanings? For an employee we might have Id, ManagerId, HumanResourceRepId, UnionRepId. Is the OP really suggesting EmployeeId, ManagerEmployeeId, HumanResourceRepEmployeeId and UnionRepEmployeeId?! Again, this redundant. The columns should be named for what they represent to the entiy in which they are contained with the (IMO very safe) assumption that you know the context or entity to which they belong in order to know what they mean.

    The foreign key should have the same name in the referring table as it does in the data's parent table. EmployeeID should be EmployeeID everywhere in the database for clarity. All the other instances you mention should be another field descriptor or evident from context of the table in which it is a foreign key. If you start renaming fields like this, other people are going to be very confused when working with this database, especially if you haven't left good documentation and someone new inherits this.

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • While it may be "more logical", it isn't really within your power to ensure the developers code things your way. One of the main reasons for aliases is to keep from having to type full table names over and over. And the only place you're going to want to preface the field with the table name in most cases is the Primary Key of the column so it will make sense in joins and when used as a foreign key.

    Again, that is the problem of the developer. There is no end of havoc that can be wrought by a bad developer. I'm not against using table aliases. I just do not want to bend over backwards with redundant column names to defend against the practice of using **poor** aliases.

    Except that in your example there, you alias the tables, but then don't use them and have apparently named one of your tables a reserved word "Order". This makes the statement far less clear than your previous statement.

    Having aliases and not using them was a typo. I've since fixed it. There is nothing wrong with having a table named Order and you are not required to enclose it in square brackets.

    My whole point is that people are complaining about bad use of aliases when simply not using an alias would make the query far clearer. Again, there seems to be this notion that we should use clunky column names to counter sloppy code. There are many tools that can be used to refactor a query so that it is far more readable including a replacement of aliases with better names so that going down the route of bad column names is not necessary.

    The foreign key should have the same name in the referring table as it does in the data's parent table. EmployeeID should be EmployeeID everywhere in the database for clarity. All the other instances you mention should be another field descriptor or evident from context of the table in which it is a foreign key. If you start renaming fields like this, other people are going to be very confused when working with this database, especially if you haven't left good documentation and someone new inherits this.

    By far the best documentation is the DRI. In most cases, it will be the case that the foreign key happens to have the entity to which it references (e.g. CustomerId in the Order table). If there is any question about what column in the parent table is referenced, a developer can simply inspect a database diagram. I've walked into plenty of database WTFs and I never rely on the column names representing the relationship to which they appear. I've built and developed against both naming schemes and frankly it is simpler to code against a system that does not prefix the table name in the Id column of the primary key(e.g. Id instead of PersonId, FooId, BarId).

    The name of the column should represent what it is, not the context in which it belongs. Why stop at the primary key column? Why not PersonName, PersonPhone, PersonAddress and so on? What do you do in a self-join? EmployeeId and ManagerEmployeeId in the same table? How is it that a PK of PersonId adds clarity and is good but a column named PersonName does not and is bad?

    IMO, for the same reason that PersonName is a poor choice for a column name in the Person table, so is PersonId in the same table.

  • Thomas-282729 (6/8/2010)


    Stefan Krzywicki (6/8/2010)


    While it may be "more logical", it isn't really within your power to ensure the developers code things your way. One of the main reasons for aliases is to keep from having to type full table names over and over. And the only place you're going to want to preface the field with the table name in most cases is the Primary Key of the column so it will make sense in joins and when used as a foreign key.

    Again, that is the problem of the developer. There is no end of havoc that can be wrought by a bad developer. I'm not against using table aliases. I just do not want to bend over backwards with redundant column names to defend against the practice of using **poor** aliases.

    To simply state "That is the problem of the developer" sets up an adversarial relationship. If you use decent practices that make the information easier to scan, you'll have less trouble in dealing with developers. They're not the enemy, they're the consumers of your data or those who allow others to consume it. Without developers, your database is useless. There's also no end of havoc that can be wrought by a bad DBA or database designer.

    There's nothing "poor" about an alias of E for a table named Employee, especially if the columns are well named. I hardly think naming a column EmployeeID or EmpID instead of ID is "bending over backward". It makes it clearer to read the database and easier to use.

    Thomas-282729 (6/8/2010)


    Stefan Krzywicki (6/8/2010)


    Except that in your example there, you alias the tables, but then don't use them and have apparently named one of your tables a reserved word "Order". This makes the statement far less clear than your previous statement.

    Having aliases and not using them was a typo. I've since fixed it. There is nothing wrong with having a table named Order and you are not required to enclose it in square brackets.

    My whole point is that people are complaining about bad use of aliases when simply not using an alias would make the query far clearer. Again, there seems to be this notion that we should use clunky column names to counter sloppy code. There are many tools that can be used to refactor a query so that it is far more readable including a replacement of aliases with better names so that going down the route of bad column names is not necessary.

    We're talking about clarity and your example was anything but. By not using aliases, the query can quickly become difficult to read, especially if you have long table names, many columns and some table or column names that are reserved words. You might not HAVE to have a table named Order in brackets, but if you're going for clarity then leaving them off is a step in the wrong direction. Naming the table something other than Order is a step toward clarity.

    Thomas-282729 (6/8/2010)


    Stefan Krzywicki (6/8/2010)


    The foreign key should have the same name in the referring table as it does in the data's parent table. EmployeeID should be EmployeeID everywhere in the database for clarity. All the other instances you mention should be another field descriptor or evident from context of the table in which it is a foreign key. If you start renaming fields like this, other people are going to be very confused when working with this database, especially if you haven't left good documentation and someone new inherits this.

    By far the best documentation is the DRI. In most cases, it will be the case that the foreign key happens to have the entity to which it references (e.g. CustomerId in the Order table). If there is any question about what column in the parent table is referenced, a developer can simply inspect a database diagram. I've walked into plenty of database WTFs and I never rely on the column names representing the relationship to which they appear. I've built and developed against both naming schemes and frankly it is simpler to code against a system that does not prefix the table name in the Id column of the primary key(e.g. Id instead of PersonId, FooId, BarId).

    I disagree. DRI is nice when it exists, but nothing beats a database diagram and notes from the designer explaining what the fields are and what the relationships are. If you've never inherited a database where the relationships are just implied instead of explicitly declared, you're very lucky. When you come upon a database where the creator didn't know how to declare references and decided to name the Employee ID - ID and then referenced it as EmployeeID, ManagerID, etc... with no documentation left behind, you're well on your way to a complete redesign with lots of business user meetings.

    It may be simpler for you to code against a system that doesn't prefix the table name in the Id column of the primary key, but your experiences are not universal ones. Simply naming the field ID can lead to confusion in a complex query where you have 5 tables each with a column named ID. Developers are going to use aliases whether you like it or not and they're going to end up with bad queries if you've created a confusing structure.

    The name of the column should represent what it is, not the context in which it belongs. Why stop at the primary key column? Why not PersonName, PersonPhone, PersonAddress and so on? What do you do in a self-join? EmployeeId and ManagerEmployeeId in the same table? How is it that a PK of PersonId adds clarity and is good but a column named PersonName does not and is bad?

    IMO, for the same reason that PersonName is a poor choice for a column name in the Person table, so is PersonId in the same table.

    Why stop at the primary key column? Because ID is likely the only column that's going to be in every table. You're very unlikely to have lots of "Phone" columns or "MaritalStatus" columns. If you do, you're likely designing badly.

    Self-Joins are a simple exception, there's rarely a rule that fits 100% of the time, but EmployeeID and EmpParentID are clear.

    PersonName might be a good column name if you have lots of tables with "Name" columns, though I'd more likely have FirstName, LastName columns. It isn't always about the compiler, it is about the humans that read the table schemas to decide what to do and who use them in interacting with the database. You could name everything random strings and the compiler wouldn't have any problems with it. Heck, you can actually name everything with unicode pipe symbols and the compiler doesn't have any problem with it. You have to take the users into account.

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • Non-database aware developers have a different perspective to DB aware developers on a particular problem.

    I would not advocate having "ManagerEmployeeID", ManagerID is clearly more appropriate. I used to use id, name and description when I was mainly a developer but I now avoid it like plague because experience has taught me that it becomes confusing in larger schemas, particularly where 1:1 relationships exist.

    The argument for using ID,name etc is that adding aliases into the query gets around that, but why have to put in aliases in every query you write rather that in the table once?

    Self-documenting code is another thing that REALLY annoys me. When there are things like nDoc, JavaDoc, SQLDoc etc the idea that I have to check everything out of source control, fire up an IDE and start trawling through code to find out how things work is anathema. Same with management studio. Why would I start trawling through the schema looking for DRI constraints when a web based internal WIKI can outline what I need in a fraction of the time?

    First question from the users of a data warehouse is usually "Do you have a schema diagram for this", closely followed by some variant of "Do you have a data dictionary of what is available"? If I said, "no, why don't you inspect the self-documenting code" I'd soon find I wouldn't be their first choice to work on their prestige projects.

  • David.Poole (6/8/2010)


    Self-documenting code is another thing that REALLY annoys me. When there are things like nDoc, JavaDoc, SQLDoc etc the idea that I have to check everything out of source control, fire up an IDE and start trawling through code to find out how things work is anathema. Same with management studio. Why would I start trawling through the schema looking for DRI constraints when a web based internal WIKI can outline what I need in a fraction of the time?

    First question from the users of a data warehouse is usually "Do you have a schema diagram for this", closely followed by some variant of "Do you have a data dictionary of what is available"? If I said, "no, why don't you inspect the self-documenting code" I'd soon find I wouldn't be their first choice to work on their prestige projects.

    Absolutely! Another problem with trying to figure out what's going on from the code and the schema, aside from the huge time sink that it is, is that this tells you how it is, not how it should be.

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

Viewing 15 posts - 31 through 45 (of 54 total)

You must be logged in to reply to this topic. Login to reply