What is Better -- PDF as Image in SQL or IFilter & Indexing Service

  • I have almost a million PDF files that are searchable over the Internet. The problem is that the information about the companies related to those PDFs is stored in SQL. A search result needs to combine a search of the PDF contents with a SQL query on the related companies. I see two possible solutions:

    1. Store the PDFs as images in SQL (not very sure about this one)

    2. Store the PDFs physically on the server and search them through IFilter & Indexing Service.

    I would like to have your opinions on any possible solutions. Thanks

  • Save them in the file system. Extract the textual content from them (use FiltDump -b to do this) and store that text in the database. Then index the text in the database with SQL FTS (full-text search).

    Hilary
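
    As a rough illustration of the approach above (files stay on disk, the extracted text goes into the database, and a full-text index sits on that text), here is a minimal Python sketch. It uses SQLite's FTS5 module purely as a stand-in for SQL Server full-text search and assumes the text has already been extracted by a tool such as FiltDump. The table names, company names, and file paths are made up for the example, and the local sqlite3 build is assumed to include FTS5.

```python
import sqlite3

# Hypothetical sample data: the text is assumed to have been extracted already.
COMPANIES = [(1, "Acme Ltd"), (2, "Globex Inc")]
PDF_TEXT = [
    # (company_id, path of the PDF on disk, extracted text)
    (1, r"D:\pdfs\acme-2004.pdf", "annual report with revenue figures"),
    (2, r"D:\pdfs\globex-2004.pdf", "product catalogue and price list"),
]


def build_database():
    """Create an in-memory database holding company rows and extracted PDF text."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT)")
    # Only `body` is full-text indexed; the other columns are stored as-is.
    conn.execute(
        "CREATE VIRTUAL TABLE pdf_text "
        "USING fts5(company_id UNINDEXED, path UNINDEXED, body)"
    )
    conn.executemany("INSERT INTO company (id, name) VALUES (?, ?)", COMPANIES)
    conn.executemany(
        "INSERT INTO pdf_text (company_id, path, body) VALUES (?, ?, ?)", PDF_TEXT
    )
    return conn


def search(conn, term):
    """Return (company name, PDF path) pairs whose extracted text matches term."""
    sql = """
        SELECT c.name, pdf_text.path
        FROM pdf_text
        JOIN company AS c ON c.id = pdf_text.company_id
        WHERE pdf_text MATCH ?
        ORDER BY c.name
    """
    return conn.execute(sql, (term,)).fetchall()


if __name__ == "__main__":
    conn = build_database()
    for name, path in search(conn, "revenue"):
        print(name, path)
```

    The point of the sketch is that the content search and the company lookup become a single query in one database, while the PDFs themselves never leave the file system.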

  • "Filtdump"?

     

  • I've used Indexing Service to do this before, and it worked fine with the PDF filter in place. With that many files, though, you will want to keep an eye on your catalogs, as you may run into corruption problems.

    The previous post about dumping the text content is a more robust and customizable solution, though, IMO.

  • Thanks for all your replies; they are all valuable to me.

    Currently the PDFs are on the server and we are using IFilter and Indexing Service for the content search. I have tried using a linked server to connect Indexing Service to SQL. This lets me combine the results of querying both Indexing Service and SQL, and the speed is very impressive too. Given this, would it be redundant to extract the PDF contents and store them in SQL? The corruption problem would still be a concern, however.

    Are there any other concerns or things I need to watch out for if I keep the current setup?
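
    For readers unfamiliar with the linked-server setup described above, the following Python sketch builds the kind of T-SQL statement such a setup typically runs. It is only an illustration under assumptions: the linked server name `PDF_CATALOG` and the tables `dbo.Company` and `dbo.CompanyPdf` are hypothetical, and the exact query used in this thread is not shown in the posts. Because `OPENQUERY` takes a literal string rather than a variable, the search term has to be quoted twice, once for the Indexing Service query and once more for the outer T-SQL literal.

```python
def tsql_literal(text: str) -> str:
    """Wrap text in single quotes, doubling any embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def build_combined_query(term: str, linked_server: str = "PDF_CATALOG") -> str:
    """Build a T-SQL statement joining Indexing Service hits to company rows.

    linked_server is a hypothetical name and is assumed to be a trusted
    constant, not user input. The dbo.* tables are also hypothetical.
    """
    # Query passed through to Indexing Service; the term is quoted once here.
    inner = (
        "SELECT Path, FileName, Rank FROM SCOPE() "
        "WHERE CONTAINS(Contents, " + tsql_literal(term) + ")"
    )
    # The whole inner query becomes a T-SQL string literal, so every quote
    # inside it is doubled a second time.
    return (
        "SELECT c.CompanyName, q.FileName, q.Path, q.Rank\n"
        f"FROM OPENQUERY({linked_server}, {tsql_literal(inner)}) AS q\n"
        "JOIN dbo.CompanyPdf AS cp ON cp.PdfPath = q.Path\n"
        "JOIN dbo.Company AS c ON c.CompanyId = cp.CompanyId\n"
        "ORDER BY q.Rank DESC;"
    )


if __name__ == "__main__":
    print(build_combined_query("O'Brien"))
```

    Quote doubling only keeps the literal well formed. The term is still interpreted by the Indexing Service query language, so user input should be validated separately before it reaches a query like this.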

  • I've personally developed an in-house ASP.NET application that imports PDFs into SQL Server. It's a departmental purchasing system with only a couple of users, all within our company, so it was a good test case for saving PDFs to SQL. One thing I noticed is the space consumption in SQL: the PDFs simply ate up a lot of space. We have about 500-600 PDFs, each averaging about 12-15 pages, and the database ballooned by about 500MB. We found application performance to be good, but with just a few users it's not a meaningful test of performance. With as many PDFs as you are mentioning, I would be hesitant to recommend importing them into SQL.
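
    To put the space concern in numbers, here is a back-of-the-envelope projection in Python. It assumes the 500MB growth came entirely from the 500-600 PDFs, that the million files in the original question have a similar average size, and that "almost a million" can be rounded to 1,000,000. None of these assumptions is confirmed in the thread, so treat the result as an order-of-magnitude estimate only.

```python
def projected_size_gb(sample_mb: float, sample_count: int, target_count: int) -> float:
    """Scale the observed database growth linearly to a larger number of files."""
    per_file_mb = sample_mb / sample_count
    return per_file_mb * target_count / 1024  # MB -> GB


# Assumed round figure for "almost a million" PDFs.
TARGET_FILES = 1_000_000

low = projected_size_gb(500, 600, TARGET_FILES)   # if the sample held 600 PDFs
high = projected_size_gb(500, 500, TARGET_FILES)  # if the sample held 500 PDFs

if __name__ == "__main__":
    print(f"Projected size: about {low:.0f} GB to {high:.0f} GB")
```

    Under those assumptions the database would land somewhere around 0.8 to 1 TB for the PDF data alone.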
