Storing files in VLDBs, filegroups and maintenance

  • Hello,

We are thinking about storing approx. 300 GB of files (possibly growing to 1-2 TB over time) in our DB, to bypass all the issues with the classic DB/file-system association (access rights management, desynchronized backups, etc.).

    I have a few questions about the architecture:

    - would you recommend putting the tables with varchar(max) fields containing files in a separate filegroup?

- how would you go about archiving this data: horizontal partitioning, or stored procedures that copy old data to another DB?

- are there any major drawbacks/issues I should be aware of with such an implementation?

- any use cases?

    thanks a lot!

  • I'm not sure 300GB qualifies as a VLDB anymore. I think you need to be 1TB and above.

I wouldn't separate varchar(max) content out to separate filegroups; in general I'm not aware of any benefits. If the data is heavily queried, then it might help.

For archiving, you can use a separate filegroup or you can move the data to a new database; I'm not sure one is much better than the other. Table partitioning, with SWITCHing partitions out, might make the process a little simpler.

  • I say it depends.

If the data fits in-row (less than 8K), I might put it in the same file.

But if the data is large, I would create two files: one for the rows of data and one for the data that will be over 8K (the varchar(max) values).

Why? Table scans. If I have to scan, the more records per page the better. I'm making the assumption that once the storage engine has a list of qualifying records, it will then fetch the varchar(max) data stored in the other file. That should lower IO.
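
For what it's worth, SQL Server steers this at the filegroup level rather than per file, via TEXTIMAGE_ON. A minimal sketch, with hypothetical names throughout:

-- Add a filegroup and file for the over-8K (LOB) data.
ALTER DATABASE MyDocs ADD FILEGROUP LOB_FG;
ALTER DATABASE MyDocs ADD FILE
    (NAME = MyDocs_LOB, FILENAME = 'D:\Data\MyDocs_LOB.ndf', SIZE = 10GB)
    TO FILEGROUP LOB_FG;
GO
-- Row pages stay on PRIMARY; the varchar(max) pages land in LOB_FG.
CREATE TABLE dbo.Documents
(
    DocId   int IDENTITY(1,1) PRIMARY KEY,
    DocName nvarchar(260) NOT NULL,
    DocBody varchar(max)  NULL
) ON [PRIMARY] TEXTIMAGE_ON LOB_FG;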

As for archiving: if it is an OLTP system, I find smaller is better, but it depends on your system design. After a while the data from 4 years ago is useless and needs to be stored in an operational data store or DSS. That means copy, then delete.

    For example, table POSTransaction and Table POSTransaction_Archive .

There is a rule that after 90 days sales are final, so I would keep 4 months of transactions online. I would use table partitioning and, every month, move the oldest month to POSTransaction_Archive. Still, you do not want people running reports in your OLTP system, so you still have to move the old data to a DSS. That is: copy the data, then drop the partition.
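
A minimal sketch of that monthly rotation, assuming POSTransaction is partitioned by month on a partition function I'll call pfMonthly (hypothetical names; the archive table must be empty, match the source table's schema and indexes, and sit on the same filegroup as the partition being switched):

-- Switch the oldest month out (a metadata-only operation).
ALTER TABLE dbo.POSTransaction
    SWITCH PARTITION 1 TO dbo.POSTransaction_Archive;

-- Remove the now-empty boundary (boundary value is hypothetical).
ALTER PARTITION FUNCTION pfMonthly() MERGE RANGE ('20070101');

-- Copy the archive rows to the DSS database, then reclaim the space.
INSERT INTO DSS.dbo.POSTransaction
SELECT * FROM dbo.POSTransaction_Archive;
TRUNCATE TABLE dbo.POSTransaction_Archive;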

  • Personally I see the file system as the best place to store files and the database the best place to store other stuff. I can rattle off numerous benefits for that arrangement.

    SQL 2008 will provide the FILESTREAM mechanism to allow better handling of this situation.
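
As a preview, the announced syntax is expected to look roughly like this (a sketch with hypothetical names; FILESTREAM must be enabled on the instance, and the database needs a FILESTREAM filegroup):

-- FILESTREAM tables require a uniqueidentifier ROWGUIDCOL with a
-- UNIQUE constraint; the varbinary(max) bytes live in the file system
-- under SQL Server's transactional control.
CREATE TABLE dbo.DocumentFiles
(
    DocGuid uniqueidentifier ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
    DocName nvarchar(260)  NOT NULL,
    DocBody varbinary(max) FILESTREAM NULL
);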

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • ok thanks a lot.

I was thinking that as long as we don't perform any searches based on the varchar(max) column, there shouldn't be a problem, because we only read the index/PK of the table. I was wondering, though, how the general spread of the data can impact performance. Since SQL Server spreads varchar(max) data across pages, will that make search/access slower? I suppose the answer is yes for access, but what about the indexes? In other words, if the 300 GB of data forces SQL Server to spread the table across, say, 2000 pages (random number), will the index also be spread across 2000 pages? If I remember correctly, the index contains pointers to the data, so the index itself should be fast to read, right?

So far the major drawback I see is the backup/restore maintenance cost due to large DBs; I don't see any performance issues other than lengthy read/write tasks, because the files WILL be accessed frequently.

I was seeing ms/IO (from fn_virtualfilestats) of over 80 in my 170 Gig database, which is about 70 Gig of blobs and 100 Gig of other data. I decided to put the blobs into a separate filegroup and file on the same data drive (a RAID 5 array), so that the folks accessing the normal data might enjoy better access. I expected read and write times on the blob file to stay higher; the files are often 20K to 300K, averaging 180K.

    I was pleasantly surprised to find that ms/IO for both files has gone down. The normal data is now under 20, and the blob file is at 47.

I definitely recommend putting blobs in their own data file now. I may experiment with a separate file for indexes next. I don't claim to be smart enough to know exactly why it's working; I suspect it's easier for SQL Server to use multiple CPUs or workers or something.
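
For anyone who wants to reproduce the ms/IO number, this is the sort of query I mean (SQL 2005 syntax; passing NULL, NULL returns every file of every database):

SELECT DB_NAME(DbId) AS database_name,
       FileId,
       -- total stall time divided by total I/Os, guarded against
       -- division by zero on idle files
       1.0 * IoStallMS / NULLIF(NumberReads + NumberWrites, 0) AS ms_per_io
FROM ::fn_virtualfilestats(NULL, NULL)
ORDER BY ms_per_io DESC;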


    Student of SQL and Golf, Master of Neither

As long as you aren't I/O- or CPU-bound, moving indexes to their own filegroup (preferably on their own set of physical disks/controller) should improve performance - perhaps significantly, if you have plenty of CPUs and RAM and a good disk I/O system. This is a very common best practice.
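
A sketch of the move itself, with hypothetical names (note that rebuilding a clustered index this way would move the base table too, so this is for nonclustered indexes):

-- Rebuild an existing nonclustered index onto its own filegroup;
-- DROP_EXISTING avoids a separate drop/create.
CREATE NONCLUSTERED INDEX IX_Documents_DocName
    ON dbo.Documents (DocName)
    WITH (DROP_EXISTING = ON)
    ON INDEX_FG;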

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • There is some merit to storing the following in separate files on separate drives if possible...

    Static Tables and monthly archives

    Audit Logs

    Common utilities and functions, etc

    Application stored procedures, functions, etc.

    Non-static application tables

    Etc... depending on your needs and the size of data

The merits are speed, size, and separation of backups, with the idea of making restoration as quick as possible. If there are a lot of non-static application tables, multiple groups that get critical tables online as soon as possible also have merit. Partitioning of tables onto separate files/groups may also have merit, depending on how the tables are partitioned. A sketch of the backup side follows below.
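
Something like this, with hypothetical names (note that online piecemeal restore of the less critical filegroups requires Enterprise Edition in 2005):

-- Back up the volatile filegroup on a tight schedule...
BACKUP DATABASE MyDocs FILEGROUP = 'ACTIVE_FG'
    TO DISK = 'E:\Backup\MyDocs_Active.bak';

-- ...and the static/archive filegroup only when it changes.
BACKUP DATABASE MyDocs FILEGROUP = 'ARCHIVE_FG'
    TO DISK = 'E:\Backup\MyDocs_Archive.bak';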

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

Thank you all for your interesting comments. I think I will follow the advice and put the BLOBs on a separate filegroup. Regarding the indexes bit, I read somewhere (I think it was Kimberly's blog on SQLSkills) that it isn't always a best practice, depending on how the DB is used, so I'll perform some tests first.

    Anyway thanks again!

The performance merits of storing different parts of the database on separate disks (actually disk arrays) are obvious - you get more spindles working for you, and perhaps even more disk controllers. What surprised me was that my milliseconds per I/O went down for all groups with separate files on the same disk array. I have read that the "one file per actual CPU" guidance is no longer much help in 2005, but I'm starting to wonder. Any specific guidance on the performance aspects of multiple files on the same array?

    I have:

    logs on one raid 1 array

    tempdb on 4 files on 1 single disk

    normal data on 1 file on 1 raid 5 array (5 disks)

    blob data on 1 file on the same raid 5 array

    dual cpus with dual cores with hyperthreading turned off

    180 Gig of data, about 75 Gig of it is blobs


    Student of SQL and Golf, Master of Neither

  • I thought that the big thing with multiple files is that you get a thread per file, so the thread can be assigned and the CPU can move off, waiting for the interrupt to signal a return.

It's hard to say if separating off tempdb helps; it takes testing on your workload. Same for the BLOB data.

  • This may help

    http://blogs.msdn.com/psssql/archive/2007/02/21/sql-server-urban-legends-discussed.aspx

    SQL Server Urban Legends Discussed

    I have been answering questions on two of the most misunderstood ‘SQL Server Urban Legends’ frequently again so I decided to make a post about them.

    · SQL Server Uses One Thread Per Data File

    · A Disk Queue Length greater than 2 indicates an I/O bottleneck

    SQL Server Uses One Thread Per Data File

The Legend grew from the following SQL Server behavior: "When SQL Server is creating a database file (.mdf, .ndf, .ldf) it writes zeros to all bytes in the file and stamps the appropriate allocation structures. SQL Server can use a thread per file when creating a database." This is a true statement but leaves room for interpretation, so allow me to clarify with the following examples.

The true behavior is that for each UNIQUE disk drive specified, a thread is used to initialize the database files. The number of threads is capped by a reasonable pool limit of around one half of the total worker thread limit.

create database …. name = c:\MyFile1.mdf, name=c:\MyFile2.ndf

SQL Server would use a single worker to create this database, initializing MyFile1.mdf and then MyFile2.ndf.

create database …. name = c:\MyFile1.mdf, name=D:\MyFile2.ndf

SQL Server would use two workers to create this database, one thread initializing MyFile1.mdf and the other MyFile2.ndf, because different drives have been specified. SQL Server makes the assumption that different drives have the possibility of unique I/O paths.

This information has been taken out of context and propagated incorrectly as "SQL Server has a thread per database file, so adding more files to a database can increase I/O performance." Adding more files and properly aligning them with storage can increase I/O performance for a variety of reasons, but achieving new I/O threads per file is NOT one of them. SQL Server I/O is all about I/O response time: you can have a variety of I/O configurations, as long as the response times are acceptable.

The SQL Server 2005 "Instant File Initialization" feature significantly reduces the impact of database creation, because zeros don't have to be stamped in all bytes of a database file, only the log files. This reduces the gain from using multiple threads during database creation.

    The Truth

SQL Server uses asynchronous I/O, allowing any worker to issue I/O requests regardless of the number and size of the database files or which scheduler is involved. In fact, some may remember that SQL Server 6.5 and previous versions had the sp_configure option 'max async I/O', which was commonly set to 255 and was removed when SQL Server 7.0 released.

    SQL Server controls the async I/O depth with a set of FCBIoRequest structures. These structures hold the I/O request information, including the OVERLAPPED structure. There is a single pool of these structures for the entire instance of SQL Server. For SQL Server 6.5 the maximum number of these structures was controlled with the configuration option.

Starting with SQL Server 7.0, the number of these structures is limited only by available memory. It is actually a bit more complicated than just whether a new FCBIoRequest can be allocated: SQL Server has basic throttles for lazy writer, checkpoint, read-ahead and other activities. This means that the practical limit is ((# of workers) * (action practical limit)). For example, the practical limit for checkpoint is generically no more than 100 outstanding I/O requests at any point in time, and less if the target response time is not being honored.

Dunno if it's still true in 2k5, but when there are a large number of large image files as blobs, most folks recommend just storing the filename in the DB and keeping the files as separate files in the OS.
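
On the database side, something like this is all that approach takes (hypothetical names; the bytes live on the file system and the application resolves the path):

-- The table holds metadata and a pointer; the file itself lives on disk.
CREATE TABLE dbo.DocumentPointers
(
    DocId    int IDENTITY(1,1) PRIMARY KEY,
    DocName  nvarchar(260) NOT NULL,
    FilePath nvarchar(400) NOT NULL   -- UNC path to the file on disk
);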

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Good reference for the one thread per file. I swear I'd seen that in BOL (7, 2000), but perhaps not. Or I misread it 🙁

I've heard on the MVP side that 250 KB to 1 MB is the sweet spot: below that, store in the DB; above that, the file system.

  • Please, let's not start the religious war on whether blobs even belong in the database at all - that's not what this thread is about. For several reasons, in our app, we've already made that choice and it's working very well. That was also the case with the original poster.

I'm just looking to enhance my knowledge (after a lot of reading and only a few years' experience) with other folks' opinions and experience with multiple files on a single array.

    Thanks,


    Student of SQL and Golf, Master of Neither
