Hidden Tricks To SQL Server Table Cleanup

  • You could also select all the targeted EventIDs into a temp table (with a clustered index if you want to delete in order). That's important if you don't want to delete all types of rows: maybe you have to respect referential integrity and therefore can't delete some rows, or you need to filter the rows by some weird criteria first... 😉

    Then, on every loop pass, you can use a second temp table or a CTE to select the TOP(x) EventIDs from the first temp table and put them into the IN clause of your DELETE statement (see the sketch at the end of this post).

    This is more flexible (and the way we have done it for years), as we have a highly normalized data model with big transaction tables covering transactions of various types, and other tables referencing them. So we decided to archive and delete only the most frequent transaction types, and only the specific detail tables that depend on them.

    PS: Don't forget to put indexes on the foreign key columns of the tables referencing your "Event" table, as SQL Server will check those tables for referencing rows to maintain integrity during the DELETE.
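    A minimal sketch of that pattern, assuming the article's EventTracking.dbo.Event table (the extra filter, the #EventsToDelete/#Batch names, and the batch size of 2000 are placeholders):

    -- work list: everything we intend to delete, filtered however we like
    SELECT EventID
    INTO #EventsToDelete
    FROM EventTracking.dbo.Event
    WHERE EventCloseDate < GETDATE() - 6;   -- plus your extra criteria here

    CREATE CLUSTERED INDEX IX_Work ON #EventsToDelete (EventID);

    CREATE TABLE #Batch (EventID bigint PRIMARY KEY);

    WHILE 1 = 1
    BEGIN
        -- pull the next batch of IDs off the work list, lowest first
        ;WITH NextBatch AS
        (
            SELECT TOP (2000) EventID
            FROM #EventsToDelete
            ORDER BY EventID
        )
        DELETE FROM NextBatch
        OUTPUT deleted.EventID INTO #Batch;

        IF @@ROWCOUNT = 0 BREAK;

        -- the IN clause of the actual DELETE, as described above
        DELETE FROM EventTracking.dbo.Event
        WHERE EventID IN (SELECT EventID FROM #Batch);

        TRUNCATE TABLE #Batch;   -- clear for the next pass
    END

    DROP TABLE #Batch;
    DROP TABLE #EventsToDelete;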

  • Another way might be to partition the table - or index - then switch out the desired partition(s) to a staging table and drop the staging table (a sketch follows at the end of this post).

    I'm doing this on tables with 1BN+ rows, and it's instantaneous.

    Although you do have to plan for it when creating your table or index.
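    Roughly, the switch-and-drop looks like this (names and partition number are illustrative; the staging table must be empty, structurally identical to the source, and on the same filegroup as the partition being switched out):

    -- metadata-only operation, which is why it's effectively instantaneous
    ALTER TABLE dbo.Event SWITCH PARTITION 3 TO dbo.Event_Staging;

    DROP TABLE dbo.Event_Staging;   -- or TRUNCATE it and reuse it next time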

  • Peter,

    how do you deal with referential integrity, or don't you have any FK references to that table? That's the reason why we couldn't apply partitioning and chose to copy/delete old data instead.

  • That's the benefit of ignorance: It never even crossed my mind there might be a problem, I just went ahead and did it 🙂

  • This would also work:

    SET ROWCOUNT 2000 -- this will limit the number of rows in the delete

    DECLARE @Rowcount int
    SET @Rowcount = -1 -- Initialise

    WHILE @Rowcount <> 0
    BEGIN
        DELETE EventTracking.dbo.Event
        WHERE EventCloseDate < (GETDATE() - 6) -- remove rows older than 6 days

        SET @Rowcount = @@ROWCOUNT
    END

    SET ROWCOUNT 0 -- set it back
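    One caveat worth noting: SET ROWCOUNT is deprecated for limiting INSERT, UPDATE, and DELETE statements in a future SQL Server release; Microsoft recommends TOP instead. An equivalent loop (a sketch against the same table) would be:

    DECLARE @Rowcount int
    SET @Rowcount = -1 -- initialise so the loop runs at least once

    WHILE @Rowcount <> 0
    BEGIN
        DELETE TOP (2000)
        FROM EventTracking.dbo.Event
        WHERE EventCloseDate < (GETDATE() - 6) -- remove rows older than 6 days

        SET @Rowcount = @@ROWCOUNT
    END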

  • There is a SQL syntax error in the article:

    CREATE VIEW [dbo].[VEventCleanup]
    AS
    (SELECT TOP(2000)
    FROM EventTracking.dbo.Event
    ORDER BY Eventid
    )

    The above code should be corrected as follows:

    CREATE VIEW [dbo].[VEventCleanup]
    AS
    (SELECT TOP 2000 *
    FROM EventTracking.dbo.Event
    ORDER BY Eventid
    )

  • This seems like a good discussion to clear up my understanding of using "SELECT TOP xxxx ..."

    I've always read the description to be that when an ORDER BY is not used, the data returned may or may not be consistent from run to run [in that case, to say that the oldest rows are always returned would not be correct]. Does this depend on the existence of a clustered index, or is it determined by the primary key [clustered or not]?

    And when an ORDER BY is used, SQL Server has to look at all the data in the table to determine the TOP xxx; it seems like that could take extra time and use a lot of memory on a large table.

    We have had multiple discussions about using "SELECT TOP ...". I contend that defining the criteria for the data you really want is better than depending on getting the correct top 100, especially in cases where a delete or update is involved, or the results have to be repeatable.
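    For concreteness, the two forms we keep debating look like this (table and column names borrowed from the article); my understanding of the documented behavior is in the comments:

    -- no ORDER BY: the 100 rows returned are not guaranteed to be any
    -- particular 100, regardless of clustered index or primary key
    SELECT TOP (100) EventId
    FROM EventTracking.dbo.Event

    -- with ORDER BY: the result is well-defined (given a unique EventId),
    -- and when the ORDER BY column is the clustered index key the engine
    -- can stop after reading 100 rows in key order instead of sorting
    -- the whole table
    SELECT TOP (100) EventId
    FROM EventTracking.dbo.Event
    ORDER BY EventId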

    What is correct? What do you think?

  • I did something very similar which is working great. The commit after each small batch is the real secret. I also put a WAITFOR DELAY '00:00:02' into the loop, which seems to help keep the purge stored procedure from encroaching on overall server performance.
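    A rough sketch of that shape, assuming the article's table (batch size, age cutoff, and delay are whatever your system tolerates):

    DECLARE @Rowcount int
    SET @Rowcount = -1

    WHILE @Rowcount <> 0
    BEGIN
        BEGIN TRANSACTION

        DELETE TOP (2000)
        FROM EventTracking.dbo.Event
        WHERE EventCloseDate < GETDATE() - 6

        SET @Rowcount = @@ROWCOUNT

        COMMIT TRANSACTION -- commit each small batch so locks and log space are released

        WAITFOR DELAY '00:00:02' -- give other workloads room to breathe
    END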

  • This job is to be run manually, right? Because a schedule for it was not created in the article - but this is a minor thing. Nice idea to use views to do the delete!

  • Like some others here, I'm not in favor of the "count" method of doing these DELETEs. In particular, if the process has to be stopped and then resumed later, you have to scan the entire table again to get the count to use in the processing.

    Aside from that, there's definitely a potential for locking issues with this statement from the original code:

    SELECT @cnt = COUNT(*)
    FROM [EventTracking].[DBO].[Event]
    WHERE EventCloseDate < GETDATE() - 6 -- remove rows older than 6 days

    since the entire table will have to be scanned. OK, a table scan may be unavoidable, since the only index is on id, and we need to check the date column as well, but at least we can use NOLOCK. Since we're only concerned with obsolete rows, that shouldn't be an issue.

    Since the id is the clustered key, I suggest just getting the max id that needs to be deleted. Also, personally I would strip the time from the date, so that multiple runs in the same day would all delete the same rows, rather than rows selected based on the time the command happened to run.

    Finally, I would use a variable for the days to keep, so that the code is self-documenting rather than relying on textual comments.

    So, putting it all together, something like this perhaps:

    DECLARE @days_of_history_to_keep int
    SET @days_of_history_to_keep = 6

    DECLARE @EventId bigint

    IF OBJECT_ID('EventTracking.dbo.Event_Delete_Control') IS NOT NULL
        DROP TABLE EventTracking.dbo.Event_Delete_Control

    SELECT @EventId = MAX(EventId)
    FROM [EventTracking].[DBO].[Event] WITH (NOLOCK)
    WHERE EventCloseDate < DATEADD(DAY, -@days_of_history_to_keep, DATEADD(DAY, DATEDIFF(DAY, 0, GETDATE()), 0))

    -- store the controlling values in case this process has to be restarted later
    SELECT @EventId AS EventId
    INTO EventTracking.dbo.Event_Delete_Control

    ...

    --the DELETE statement inside the DELETE loop (naturally other code will enclose this)

    ...

    DELETE TOP (nnnn)
    FROM [EventTracking].[DBO].[Event]
    WHERE EventId <= @EventId

    ...
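    Purely as an illustration of how the "..." pieces might be filled in (the loop scaffolding is the poster's elision, and the batch size of 10000 is a placeholder for "nnnn"), the enclosing loop could re-read the control table so a restart picks up where it left off:

    DECLARE @EventId bigint
    SELECT @EventId = EventId
    FROM EventTracking.dbo.Event_Delete_Control

    DECLARE @Rowcount int
    SET @Rowcount = -1

    WHILE @Rowcount <> 0
    BEGIN
        DELETE TOP (10000) -- pick a batch size your log can absorb
        FROM [EventTracking].[DBO].[Event]
        WHERE EventId <= @EventId

        SET @Rowcount = @@ROWCOUNT
    END

    DROP TABLE EventTracking.dbo.Event_Delete_Control -- finished; clean up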


  • Thanks Lana - somehow missed that in editing. :-)

  • This technique has been running on our production systems for over a year - it reliably returns the oldest rows as long as the predicate is against the clustered primary key. That's part of the trick here; give it a try as described in the article and let me know your results.

  • Thanks for the comment. I also played with putting in a WAITFOR DELAY, and that is another approach. What I do now is run these deletes as jobs that run every x minutes, each job deleting a set number of rows.

  • Thanks for the comment. Sorry, these jobs do run on a schedule: initially, against huge tables, I run them every 20 minutes; once the table is cleaned up, I change the schedule to once a day. Prior to this technique I could not run table cleanup during the business day due to lock contention, but now I run these at any time of day with very minimal lock contention.

  • Agreed, the count method consumes resources, so initially, against huge tables, I hard-code the count. Say the table has 50 million rows to remove - there's no point using COUNT, just SET @cnt = 1000000. However, this is production, so by the time we get down to a few million rows to remove, the table is small enough that the count penalty is small. Thanks for the comment.
