SQL rollback issues keep occurring due to lack of drive space

  • Hello,

    Long time lurker first time poster.

    I have a large database in a data warehouse I have been tasked with tuning. Performance is approaching critical with certain data load processes exceeding 4 hours to complete. One of the fact tables has over 300,000,000 records in it and growing every day. The business refuses to archive old data. Refuses to even consider an archiving scheme no matter how much sense it makes or by what factor this alone would increase performance. They also have SERIOUS hard drive space and server provisioning issues. They will spend thousands to hire consultants to tell them what's wrong, but not hundreds to upgrade drive space or to provision new servers. As a result there are literally almost a hundred dbs hosted on a single server. Some have inter-operations with others, some are stand alone. This is my operating environment. Also I have only been working in this environment for about a month, so my knowledge about the data is limited to what I need to know to accomplish my tasks.

    One of the dbs in question is one that has dependencies on others and on which others depend. I have 4 top queries I have identified as priority to be tuned to start with. I have scoured all over the forums looking for good advice on tuning techniques and found Gail's write-ups on Simple-Talk to be the most helpful.

    In order to tune for these queries I have started out just looking at ways indexing can help. I may then go on and consider partitioning. Currently I have copied over the existing db to a test area and dropped all the existing indices on the tables in question. I have decided to split up the main fact table (the one with over 300m records) into two tables linked by a common INT which is the PK on the first table. The second table contains rarely used and even duplicated columns (which is another story altogether).

    In my efforts to populate the 2 now empty fact tables from the source (which due to space considerations is on the same Dev server) I am running into rollback issues. What happens when I try to simply INSERT all 300m records into the new fact tables at once is that the database's log file chews up all the space on the log file drive (N:\ which is only 200 GB) and SQL initiates a Rollback and I get a lovely error (which I neglected to copy for this post). So after I waited roughly 7 hours for it to get to the point of "timing out", for lack of a better term, and about 3 more hours for the Rollback, I decided to break the Inserts up into ~30,000,000 record chunks.

    BEGIN TRAN
        INSERT INTO FactTable1 {namedcolumns}
        OUTPUT INSERTED.{namedcolumns} INTO FactTable2
        SELECT {namedcolumns}
        FROM BASourceFactTableOnSameServer
        WHERE Code IN ('x', 'y', 'z')
    COMMIT TRAN

    Wash, rinse and repeat for as many times as it takes to fill the table using all the distinct codes. This was working fine until I forgot to check and shrink the log file prior to the last INSERT. It filled up the drive and initiated a Rollback of not just the records in that set, but all records that had previously been committed to the table. Why? I would have expected it to only Rollback the last 30m records in the last INSERT statement that caused the log to exceed the drive space.

    In addition to checking the table record count several times in a row to verify it is dropping I also checked sys.dm_exec_requests to validate that it was in fact performing a Rollback.

    (Output from sys.dm_exec_requests with the columns run together in the original post; the recoverable bits show the session's command as rollback, with an IO_COMPLETION wait, captured 2014-01-03 13:47:25.)
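    For reference, a check along these lines (the filter on command is just illustrative) is what shows the rollback and how far along it is:

    SELECT session_id, command, status, percent_complete, estimated_completion_time
    FROM sys.dm_exec_requests
    WHERE command LIKE '%ROLLBACK%';   -- percent_complete is populated for rollbacks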

    Any help would be appreciated. I have Quest Spotlight on SQL Server installed if that can be of any benefit. Also if there are metrics I need to be tracking now (other than query performance based ones that I'm doing and those covered by Gail's Simple Talk articles) please let me know so I may begin monitoring these for historical purposes.

    https://www.simple-talk.com/sql/performance/finding-the-causes-of-poor-performance-in-sql-server,-part-1

    https://www.simple-talk.com/sql/performance/finding-the-causes-of-poor-performance-in-sql-server,-part-2

    thanks,

    Michael

  • mbrady (1/3/2014)


    Wash, rinse and repeat for as many times as it takes to fill the table using all the distinct codes. This was working fine until I forgot to check and shrink the log file prior to the last INSERT. It filled up the drive and initiated a Rollback of not just the records in that set, but all records that had previously been committed to the table. Why? I would have expected it to only Rollback the last 30m records in the last INSERT statement that caused the log to exceed the drive space.

    Hi Michael,

    The reason is that a ROLLBACK affects ALL currently open transactions. I realize the code you posted has a COMMIT in it but the only way a rollback could trash the whole table is if there's a parent transaction that's open over all the individual transactions. You don't actually need a transaction in your code for this.
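    Here's a minimal sketch of that behavior (the temp table and values are made up purely for illustration): when an outer transaction is open, the inner COMMITs only decrement @@TRANCOUNT, and a single ROLLBACK undoes everything back to the outermost BEGIN TRAN.

    -- Hypothetical demo table, just to show the nesting behavior
    CREATE TABLE #Demo (i int);

    BEGIN TRAN;                 -- parent transaction (maybe opened by a wrapper script or an SSMS setting)
        BEGIN TRAN;
            INSERT INTO #Demo VALUES (1);
        COMMIT TRAN;            -- @@TRANCOUNT goes 2 -> 1; nothing is actually hardened yet

        BEGIN TRAN;
            INSERT INTO #Demo VALUES (2);
        COMMIT TRAN;            -- @@TRANCOUNT goes 2 -> 1 again

    ROLLBACK TRAN;              -- rolls back to the outermost BEGIN TRAN: both rows are gone

    SELECT COUNT(*) AS RowsLeft FROM #Demo;  -- returns 0
    DROP TABLE #Demo;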

    Also, if you're working in a separate database and it's not being backed up in a "Point-in-Time" fashion, I'd strongly recommend that you set the database to "SIMPLE" recovery for this testing so that you don't actually have to worry so much about log file explosions.
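    If you need to check or flip it, something along these lines does it (the database name is a placeholder):

    SELECT name, recovery_model_desc
    FROM sys.databases
    WHERE name = N'YourTestDb';                       -- placeholder name

    ALTER DATABASE YourTestDb SET RECOVERY SIMPLE;    -- only if it isn't already SIMPLE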

    Last but not least, a 30-million-row insert may be beyond what I call the "tipping point" of the server (especially given the limited environment you're working in). I'd recommend that you set up a WHILE loop to keep track of and copy just 5 million rows at a time. The "tipping point" is a point that all servers have where it might take X seconds to insert N rows, 2X seconds to insert 2N rows, and several minutes to insert just 3N rows (as an example... the points will vary).

    Between the SIMPLE recovery mode, taking out the explicit transaction, and using a WHILE loop to step through much smaller blocks of data, you should be able to "set it and forget it" until it's done instead of having to babysit the process. I'd also recommend that your loop include something like the following to give you an on-the-fly status of where the code is in the process...

    RAISERROR('Working on loop %u...',0,1,@LoopCounter);
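    A bare-bones sketch of that kind of loop might look something like this (table and column names are placeholders, and it assumes the source has a numeric key that can be walked in ranges):

    -- Sketch only: dbo.SourceFactTable, dbo.NewFactTable, KeyCol, Col1, Col2 are all assumed names
    DECLARE @BatchSize int
    DECLARE @StartKey  bigint
    DECLARE @MaxKey    bigint
    DECLARE @Loop      int

    SET @BatchSize = 5000000
    SET @StartKey  = 1
    SET @Loop      = 0
    SET @MaxKey    = (SELECT MAX(KeyCol) FROM dbo.SourceFactTable)

    WHILE @StartKey <= @MaxKey
    BEGIN
        SET @Loop = @Loop + 1
        RAISERROR('Working on loop %u...',0,1,@Loop) WITH NOWAIT  -- WITH NOWAIT flushes the message right away

        INSERT INTO dbo.NewFactTable (KeyCol, Col1, Col2)
        SELECT KeyCol, Col1, Col2
        FROM dbo.SourceFactTable
        WHERE KeyCol >= @StartKey
          AND KeyCol <  @StartKey + @BatchSize

        SET @StartKey = @StartKey + @BatchSize
    END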

    So far as partitioning goes, don't do it for performance because it usually does nothing to help. Performance is in the code and proper indexing. Also, if you've never done partitioning before, let me tell you that there are some huge caveats to be had, especially when it comes to things like PKs, FKs, UNIQUE indexes, and "packing" of static partitions. Further, there's a B-Tree for each partition instead of just 1 for the table, and that means that partitioning will frequently take more disk space AND can also slow down queries that don't rely on the partitioning column.

    I will say that with 300 million rows, that table is certainly a candidate for partitioning, but not for code-performance reasons. It WILL, for the static/historical partitions and if you use a different FileGroup for each partition, allow backup times to be greatly reduced (you won't back up the static stuff on a nightly basis) as well as allowing for "Piecemeal" restores to occur. Just remember... it's going to take more space than it currently does unless your current stuff has massive numbers of page splits.
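    For reference, the filegroup-per-partition setup being described boils down to something like this (the function/scheme names, boundary values, and filegroups are all made up, and the filegroups have to exist first):

    -- Assumed: FG2011..FG2014 already exist, and the table is keyed on an int date id like PostingDateId
    CREATE PARTITION FUNCTION pf_ChargeDetailByYear (int)
        AS RANGE RIGHT FOR VALUES (20120101, 20130101, 20140101);

    CREATE PARTITION SCHEME ps_ChargeDetailByYear
        AS PARTITION pf_ChargeDetailByYear
        TO (FG2011, FG2012, FG2013, FG2014);

    -- The clustered index (and therefore the table) would then be built ON ps_ChargeDetailByYear(PostingDateId)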

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

    Ahh okay. I must have missed a COMMIT somewhere. The db is already in SIMPLE mode. I just got used to wrapping everything in a transaction.

    I will work on the looper on Monday. Thanks for taking the time to read my verbose post and replying.

  • Just as an FYI I found that error in case anyone else comes searching for a similar issue.

    Msg 9002, Level 17, State 4, Line 3

    The transaction log for database '[databasename]' is full. To find out why space in the log cannot be reused, see the log_reuse_wait_desc column in sys.databases

    Msg 0, Level 20, State 0, Line 0

    A severe error occurred on the current command. The results, if any, should be discarded.
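    The check the message points at is just this (the database name is a placeholder, per the error text):

    SELECT name, log_reuse_wait_desc
    FROM sys.databases
    WHERE name = N'databasename';   -- placeholder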

  • mbrady (1/3/2014)


    Just as an FYI I found that error in case anyone else comes searching for a similar issue.

    Msg 9002, Level 17, State 4, Line 3

    The transaction log for database '[databasename]' is full. To find out why space in the log cannot be reused, see the log_reuse_wait_desc column in sys.databases

    Msg 0, Level 20, State 0, Line 0

    A severe error occurred on the current command. The results, if any, should be discarded.

    To be sure, that would only cause a rollback of the current transaction. It wouldn't have caused the loss of the whole table if you were doing 30M rows at a time unless, as you say, you missed a COMMIT somewhere along the line.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • So here is what I set up:

    USE [myDB];

    DECLARE @LoopCounter int
    DECLARE @LoopCounterLoop int
    DECLARE @LoopCounterMAX int

    SET @LoopCounter = 1
    SET @LoopCounterLoop = @LoopCounter + 5000000
    SET @LoopCounterMAX = (SELECT COUNT(1) FROM OgDB.Fact.ChargeDetail)

    WHILE @LoopCounter <= @LoopCounterMAX
    BEGIN
        RAISERROR('Working on loop %u...',0,1,@LoopCounterLoop);

        INSERT INTO [Fact].[ChargeDetail] ([FacilityCode], [PatientControlNumber], [Sequence], [PostingDateId], [ChargeCode]
            , [ServiceDateId], [OrganizationId], [EncounterId], [PatientId], [PatientEncounterTypeId], [CDMId], [CPTId], [CPTModifier1Id]
            , [CPTModifier2Id], [CPTModifier3Id], [CPTModifier4Id], [CPTModifier5Id], [ALTCPTId], [RevenueCodeId], [ProviderId], [HCPCSCode]
            , [CPTModifier1], [CPTModifier2], [CPTModifier3], [CPTModifier4], [CPTModifier5], [CPTCode], [OrderingProviderId], [ChargeQuantity]
            , [ChargeAmount], [RevenueCode], [TotalCharges], [AccountBalance], [RecordStatus], [LoadControlId], [AdmissionDateId], [DischargeDateId]
            , [BillDateId], [PatSatOrigDateId], [SIPGId], [OriginalLoadControlID], [RevenueDepartment], [ChargeDescription], [EPSIFeederKey]
            , [RxNDC], [CorporationCode], [IncompleteReason])
        OUTPUT INSERTED.[ChgDtlID], INSERTED.[RevenueDepartment], INSERTED.[ChargeDescription], INSERTED.[EPSIFeederKey]
            , INSERTED.[RxNDC], INSERTED.[CorporationCode], INSERTED.[IncompleteReason]
        INTO [Fact].[ChargeDetail_notes]
        SELECT a.[FacilityCode], a.[PatientControlNumber], a.[Sequence], a.[PostingDateId], a.[ChargeCode]
            , a.[ServiceDateId], a.[OrganizationId], a.[EncounterId], a.[PatientId], a.[PatientEncounterTypeId], a.[CDMId], a.[CPTId], a.[CPTModifier1Id]
            , a.[CPTModifier2Id], a.[CPTModifier3Id], a.[CPTModifier4Id], a.[CPTModifier5Id], a.[ALTCPTId], a.[RevenueCodeId], a.[ProviderId], a.[HCPCSCode]
            , a.[CPTModifier1], a.[CPTModifier2], a.[CPTModifier3], a.[CPTModifier4], a.[CPTModifier5], a.[CPTCode], a.[OrderingProviderId], a.[ChargeQuantity]
            , a.[ChargeAmount], a.[RevenueCode], a.[TotalCharges], a.[AccountBalance], a.[RecordStatus], a.[LoadControlId], a.[AdmissionDateId], a.[DischargeDateId]
            , a.[BillDateId], a.[PatSatOrigDateId], a.[SIPGId], a.[OriginalLoadControlID], a.[RevenueDepartment], a.[ChargeDescription], a.[EPSIFeederKey]
            , a.[RxNDC], a.[CorporationCode], a.[IncompleteReason]
        FROM (
            SELECT ROW_NUMBER() OVER (ORDER BY EncounterId ASC) as RowId
                , [FacilityCode], [PatientControlNumber], [Sequence], [PostingDateId], [ChargeCode]
                , [ServiceDateId], [OrganizationId], [EncounterId], [PatientId], [PatientEncounterTypeId], [CDMId], [CPTId], [CPTModifier1Id]
                , [CPTModifier2Id], [CPTModifier3Id], [CPTModifier4Id], [CPTModifier5Id], [ALTCPTId], [RevenueCodeId], [ProviderId], [HCPCSCode]
                , [CPTModifier1], [CPTModifier2], [CPTModifier3], [CPTModifier4], [CPTModifier5], [CPTCode], [OrderingProviderId], [ChargeQuantity]
                , [ChargeAmount], [RevenueCode], [TotalCharges], [AccountBalance], [RecordStatus], [LoadControlId], [AdmissionDateId], [DischargeDateId]
                , [BillDateId], [PatSatOrigDateId], [SIPGId], [OriginalLoadControlID], [RevenueDepartment], [ChargeDescription], [EPSIFeederKey]
                , [RxNDC], [CorporationCode], [IncompleteReason]
            FROM OgDB.[Fact].[ChargeDetail]) a
        WHERE a.RowID BETWEEN @LoopCounter AND @LoopCounterLoop

        SET @LoopCounter = @LoopCounterLoop + 1
        SET @LoopCounterLoop = @LoopCounter + 5000000
    END

    But I got the following error all three times I tried to run it:

    Working on loop 5000001...

    Msg 1204, Level 19, State 4, Line 16

    The instance of the SQL Server Database Engine cannot obtain a LOCK resource at this time. Rerun your statement when there are fewer active users. Ask the database administrator to check the lock and memory configuration for this instance, or to check for long-running transactions.

    So then I added WITH (TABLOCK) on the INSERT and got the same error. To my knowledge and research I am the only one using this database and table and, at times, the only one using this entire server. Is there something wrong with my looping logic?

    In researching the error I ran this after the statement terminates:

    SELECT request_session_id AS 'SID', COUNT(*) AS num_of_locks
    FROM sys.dm_tran_locks
    GROUP BY request_session_id
    ORDER BY COUNT(*) DESC

    And the highest num_of_locks is 1.

    Max Memory is set to the default of 2147483647 MB.

    I'm using

    ROW_NUMBER() OVER (ORDER BY EncounterId ASC) as RowId

    because whoever designed this table neglected to put a PK of any sort on the source table.

    I think the key on the error is Level 19; that's a really high severity. I'd bet the lock is waiting on the database growing and getting more space than it had to accommodate more rows of data, and that's taking a little while.

    My first search in BOL was not that helpful:

    http://technet.microsoft.com/en-us/library/aa937483(v=sql.80).aspx


    Severity Level 19: SQL Server Error in Resource

    These messages indicate that some nonconfigurable internal limit has been exceeded and the current batch process is terminated. Severity level 19 errors occur rarely; however, they must be corrected by the system administrator or your primary support provider. The administrator should be informed every time a severity level 19 message occurs.

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • MSDN's answer? "Take the highest session id, and terminate it using the KILL command." 😉

    More info

    EXEC sp_configure:

    name                    minimum  maximum      config_value  run_value
    locks                   5000     2147483647   0             0
    max server memory (MB)  16       2147483647   2147483647    2147483647
    min server memory (MB)  0        2147483647   2048          2048

  • Hi MBrady...

    You have both a SQL issue and an office politics issue. 😀

    As to the SQL issue, in the thread I see you trying to loop to take smaller chunks of data, which is an acceptable means to prevent log growth -- unless the entire loop is inside a transaction. My advice is to make the chunks much smaller. Instead of five million rows, try 10,000 or even 5,000.

    Also, what is the autogrowth of the file group housing the table? If it is a ridiculously low number such as 1MB, you are asking for tons of delay.
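    If it helps, the per-file growth settings are visible with something along these lines (run it in the target database):

    SELECT name,
           type_desc,
           size * 8 / 1024 AS size_mb,
           CASE WHEN is_percent_growth = 1
                THEN CAST(growth AS varchar(12)) + ' %'
                ELSE CAST(growth * 8 / 1024 AS varchar(12)) + ' MB'
           END AS growth_setting
    FROM sys.database_files;   -- size and growth are stored in 8 KB pages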

    You are already in SIMPLE recovery mode, so you don't have that issue.

    Another concern: Is this table replicated? If so, then the log reader has to read through your updates before the log space is freed, militating for a very low row count in the loop, say 5000.

    As to the office politics issue (i.e., hosting ~100 databases on one SQL Server, no HD expansion space), these are the real issues that must be worked. You will need to hone your diplomacy skills and explain to management why you need more servers and disk space. You will need a cost/benefits analysis. The tempdb usage alone is going to be problematic if there is substantial query traffic. More servers, each with huge RAM, will also produce an immediate performance benefit. Interoperability between databases is an issue that can be solved by application refactoring if split between servers, or by judicious choice of which databases are assigned to what server.

    As to your 300 million row table, increased disk space could permit you to open up separate file groups and partition the large table, which will have the benefits to you of faster query response to the most recent partition as well as keeping older data in a more-or-less static form.

    Thanks

    John.

    Do you have an IDENTITY column on this table with no PK? If so, and NULL is not permitted, then create a unique clustered index on it. Your speed benefits will be enormous and immediately noticeable. For one, you won't be table-scanning for each iteration of the loop. The EncounterID appears to be such an identity. If it is unique, cluster on it! The clustered index create may be a dog, but the initial pain would be worth the result...
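    In other words, something like this (index name is made up, and it is only valid if EncounterId really is unique and never NULL):

    USE OgDB;
    CREATE UNIQUE CLUSTERED INDEX CIX_ChargeDetail_EncounterId
        ON Fact.ChargeDetail (EncounterId);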

    Thanks

    John.

  • Thank you all for your help.

    The PRIMARY filegroup is set to Autogrow by 10 GB unrestricted on a drive with 770 GB free space remaining. The source DB PRIMARY file is currently at ~50 GB. There are 3 other files named INDEX, DATA, and of course LOG. The DB is around 100 GB total. The table is not replicated.

    There is no IDENTITY column on the source table; there is on the destination table that I am designing. I did try to add it to the source table, but ran into log growth issues again, this time on the source table. I will try again post reboot.

    I had a scheduled reboot today and had the hope that would help with this lock issue. It did not.

    Politically I am currently a contractor with 3 months tenure. I have very little capital to spend. The recommendations I would make, have already been made by others many doors above me. It is a young IT department and they have recently made the financial decision to spend less on contracting, however that doesn't appear to be translating into more infrastructure spend. They are still trying to capitalize existing spend.

    I think I am going to try to bring the DBs over to my local machine, make the changes and SSIS them back. That's all I can think of at this point. Unless anyone has any other ideas?

  • EncounterId is not unique as there can be several charges against a single encounter.

  • mbrady (1/7/2014)


    EncounterId is not unique as there can be several charges against a single encounter.

    Is there any identity that is unique? You could cluster on a non-unique ID but the non-clustered indexes on the table would be larger.
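    For completeness, clustering on a non-unique column is the same statement without UNIQUE; SQL Server adds a hidden 4-byte uniquifier to duplicate key values, which is part of why the non-clustered indexes get larger (names below are just illustrative):

    CREATE CLUSTERED INDEX CIX_ChargeDetail_EncounterId
        ON Fact.ChargeDetail (EncounterId);   -- uniquifier added behind the scenes for duplicates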

    I sympathize with your situation.

    Thanks

    John.

  • You could also look at using SSIS instead - which would give you the options on the OLEDB Destination to set a commit size and batch size.

    Jeffrey Williams
    Problems are opportunities brilliantly disguised as insurmountable obstacles.

    How to post questions to get better answers faster
    Managing Transaction Logs

    I was able to check QA and PROD just to ensure some trickery hadn't occurred in DEV that I wasn't aware of. There is nothing identified that uniquely identifies each row. There is a clustered index on LoadControlId and EncounterId (along with several NCIs); however, it isn't unique, and they were all in place prior to the start of this project. It is a very difficult balance to find here because there is a very real ETL consideration in the mix with HEAVY querying.

    As an overall solution I am pressing for an approach that will disable the NCIs prior to ETL operations and rebuild them afterwards, but with a table this large on equipment this stressed it's difficult to gauge the cost/benefit. It may take just as long to rebuild the 7 NCIs as it would to just leave them enabled and let it do its thing. I am trying to take bite-size chunks, but like Alice I continue to find that around every corner the rabbit hole just goes deeper and deeper. Which explains why no one has taken up the challenge and seen it through to the end.
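    The disable/rebuild approach I'm weighing looks roughly like this (the index name is a placeholder; the clustered index has to stay enabled or the table becomes unreadable):

    -- Before the ETL load: disable just the non-clustered indexes
    ALTER INDEX IX_ChargeDetail_SomeNCI ON Fact.ChargeDetail DISABLE;   -- repeat per NCI (placeholder name)

    -- ... run the load ...

    -- Afterwards: REBUILD re-enables and rebuilds everything, including the disabled NCIs
    ALTER INDEX ALL ON Fact.ChargeDetail REBUILD;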

    I have been able to get 2 batches in since the reboot. So I'm just going to monitor the log situation and pray for the best. After the table gets populated the real work starts. 🙂

    I will investigate SSIS if this latest attempt fails. It wouldn't be too much trouble to build a simple package. I guess I had hopes this would only need to be a one time thing. HA!
