Search & pattern matching within text/html values

  • Good day all;

    How can I support the application development team in providing search functionality on a column that is nvarchar(max) and can store text/html values? I have been thinking of employing either full text indexing or substring searches. I am not too confident about the substring approach: in one of my previous applications we had a similar situation, where a column containing html markup would take forever to execute a simple select (top 10/100). Hence I am leaning towards full text indexing, unless there is some other solution to it?

    And yes, this table will further have inner joins with at least 4 other tables; those won't have any text/html columns and would not have more than a few thousand rows.

    Regards,

    Kazim Raza

  • Folks, no one after 46 views :)? I guess I did not convey the problem correctly, or posted it in the wrong section. I still look forward to a solution.

    Thanks,

    Kazim

    One reason for the lack of replies might be that we don't have enough information.

    What type of "search criteria" are you looking for? Does full-text search meet your requirements? If so, what exactly is your question? And if not, what exactly do you expect but are not getting as a result?

    I recommend enabling the table in question for full-text indexing and seeing whether the requirements can be met; a minimal sketch is below.
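
    For example, this is roughly what that could look like. The table, unique key index and catalog names are placeholders only, since we don't know your actual schema:

    -- Sketch only: assumes dbo.ResearchSection has a unique key index
    -- PK_ResearchSection and nvarchar(max) columns Title and Description.
    CREATE FULLTEXT CATALOG ftcResearch AS DEFAULT;

    CREATE FULLTEXT INDEX ON dbo.ResearchSection
    (
        Title         LANGUAGE 1033,   -- English word breaker
        [Description] LANGUAGE 1033
    )
    KEY INDEX PK_ResearchSection
    ON ftcResearch
    WITH CHANGE_TRACKING AUTO;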



    Lutz
    A pessimist is an optimist with experience.

  • Thanks for the reply, Lutz. I'll try to sketch out the scenario below:

    The application we are currently working on is aimed at research automation, where a research paper/study is segmented into different categories. Nine times out of ten, the data would come as a pair of title and text/html content, so I have kept the attributes (title and description, both nvarchar(max)) for these. On the application side (UI), we have to enable users to search through these two fields based on some keyword(s) or term(s).

    So for instance, a researcher uploads a database research study on SQL Server and it has different sections like security, replication, mirroring etc., each having a title and rich text/html formatted data. When the application supplies a keyword for searching on these columns, I have to perform a search on these columns for the supplied keyword/term. A simple LIKE (title LIKE '%replication%' OR description LIKE '%replication%') might do the job here; however, I feel full text indexing might speed things up for me. A rough comparison of the two options is sketched below.
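
    Just to illustrate what I mean, here are the two options side by side; dbo.StudySection, SectionId, Title and Description are placeholder names, not our actual schema:

    -- Option 1: plain wildcard search; scans the whole column, indexes give little help
    SELECT SectionId, Title
    FROM dbo.StudySection
    WHERE Title LIKE N'%replication%'
       OR [Description] LIKE N'%replication%';

    -- Option 2: full-text search; requires a full-text index on both columns
    SELECT SectionId, Title
    FROM dbo.StudySection
    WHERE CONTAINS((Title, [Description]), N'replication');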

    I hope I have made the scenario clear now :). For my exact requirements, I'll have to wait until the dev team starts development and the client has their say on it. Right now it's more in the planning and design stage.

    Regards,

    Kazim Raza

  • Could you attach a text document containing a couple of lines of the "rich text/html formatted data", please? We also need to know the maximum number of characters that column could hold, if that is possible to determine.

    As a sidebar, and if possible, it would be much more efficient if the data were "normalized" by splitting the data out of the tags into a real table, instead of using the highly denormalized and difficult-to-use format it's currently in; something along the lines of the sketch below.
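
    This is just a sketch and all of the names are made up, since we haven't seen your actual schema:

    -- Sketch only: each study split into sections, with the searchable text
    -- kept separate from the markup used for display (placeholder names).
    CREATE TABLE dbo.Study
    (
        StudyId INT IDENTITY(1,1) PRIMARY KEY,
        Title   NVARCHAR(500) NOT NULL
    );

    CREATE TABLE dbo.StudySection
    (
        SectionId   INT IDENTITY(1,1) PRIMARY KEY,
        StudyId     INT NOT NULL REFERENCES dbo.Study (StudyId),
        SectionName NVARCHAR(200) NOT NULL,   -- e.g. security, replication, mirroring
        BodyText    NVARCHAR(MAX) NOT NULL,   -- plain text with the tags stripped
        BodyHtml    NVARCHAR(MAX) NULL        -- original markup kept only for display
    );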

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

  • Hi Jeff, thanks for your reply.

    I am attaching a sample text file here; I have prepared it along the same lines as the original document. As far as the length of the data goes, it could be endless. The client has a 600-page PDF document that will be stored in the database, with most of the formatting stored in the database as well.

    Regards

    Kazim Raza

  • No one, still?

  • With the huge amount of variance in the size of the fields, you are facing a slow search process no matter how you slice and dice this. At first I thought this sounded like a site search for a site where all the pages are stored in the db, but then you said something about a 600-page PDF. Are you trying to search varchar or varbinary columns? Keep in mind that we don't know your project and have only the fairly vague details you have provided.

  • Hi Kazim

    If your requirement is what you described in your post (Posted 8/6/2011 10:16:45 PM), then yes, you can use Full Text Search.

    Regarding the 600-page PDF post:

    When you say the client has 'a' 600-page PDF, do you mean that this PDF is the only one you will be storing in your database (broken up across different rows),

    OR is it just an example of one of the values that would go into the nvarchar column?

    Also, are the files you have to store in PDF format, or text/html? I just wanted to know, as in your first post you mentioned text/html files but later you said PDF.

    Thanks - Janki

  • I guess my second post implied storing a 600-page PDF as a binary object, which is not the case. The system will compile and format the data at the end of the data entry process, and the result will be available to the users in the form of PDF/Doc files.

    Consider this link which is a PDF file on SQL Server data backup compression (https://docs.google.com/viewer?a=v&q=cache:Lpf17ipK32IJ:www9.unisys.com/eprise/main/admin/corporate/doc/41371394.pdf+sql+server+2008+white+paper+filetype:pdf&hl=en&gl=pk&pid=bl&srcid=ADGEESi9Cui0ZDEJJ-naHwPllbjf3Qk9Ml6VepgXvDNsNNJOGJ9Bmqyk4EVVyHGcqtfOwtxxwpjKC4dm6VYk4rfG9__YxZG_s0s-glXsJEBNIAHbiOCB05D-32tKPjnEBNOTiItm4zfw&sig=AHIEtbQKx0MBOs1ferW0FvkS5In-9bn6bg)

    So all the data (text/html/tables/images) in this PDF is in our database; the PDF is just the output of the data stored in the database.

    Regards

    Kazim Raza

  • Yeah, this won't be pretty, if I've understood this right.

    Your NVARCHAR(MAX) column is storing plain text, but also html tags and the like. Here's the problem with that:

    <a>ABC</a> and ABC are usually two different 'words' to full text indexing. Full text indexing CAN help you here, but it won't be spectacular: anything searched would have to be a partial word search, which means cat and category end up being picked up together, unless you want to ignore headers/tagged items.

    If your html formatting is separated from the text by spaces (which means the spacing gets strange on display), i.e. <a> ABC </a>, full text indexing can help you out here, because ABC is then a word to the index.

    If you're storing multiple documents of this kind, you're probably going to need to double-store the information (cringe), once with and once without the 'extraneous' tagging information. That will get full text indexing where you need it to go; a rough sketch of the idea is below.
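
    As a rough illustration of that double-store idea (all of the names are placeholders and the tag stripping below is deliberately naive, not production-grade HTML parsing):

    -- Keep a tag-stripped copy next to the original markup and point the
    -- full-text index at the stripped copy (placeholder names throughout).
    ALTER TABLE dbo.StudySection
        ADD DescriptionPlain NVARCHAR(MAX) NULL;
    GO
    -- Very naive stripper: blanks out every <...> pair it finds.
    CREATE FUNCTION dbo.StripTags (@html NVARCHAR(MAX))
    RETURNS NVARCHAR(MAX)
    AS
    BEGIN
        DECLARE @openPos  INT = CHARINDEX('<', @html),
                @closePos INT;
        WHILE @openPos > 0
        BEGIN
            SET @closePos = CHARINDEX('>', @html, @openPos);
            IF @closePos = 0 BREAK;
            SET @html = STUFF(@html, @openPos, @closePos - @openPos + 1, ' ');
            SET @openPos = CHARINDEX('<', @html);
        END;
        RETURN @html;
    END;
    GO
    UPDATE dbo.StudySection
    SET DescriptionPlain = dbo.StripTags([Description]);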

    A brief look into the topic didn't indicate a way to change what you can use as word breakers, but there may be a way I'm unfamiliar with that someone else can offer.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    Twitter: @AnyWayDBA

  • Thank you for the reply. Double storing might be the last resort, an ultimate dead end.

    Anything searched would have to be a partial word search, which means cat and category end up being picked up together, unless you want to ignore headers/tagged items.

    I did not quite get this; can you please elaborate?

    Regards,

    Kazim

  • kazim.raza (8/17/2011)


    Thank you for the reply. Double storing might be the last resort, an ultimate dead end.

    Anything searched would have to be a partial word search, which means cat and category end up being picked up together, unless you want to ignore headers/tagged items.

    I did not quite get this; can you please elaborate?

    Regards,

    Kazim

    Sure. Let's take a look at what you're doing. Let's say you've got <a>cat in the hat</a> and category in a document. You're going to have to look for *cat* to find cat because it's anchored in the center of the 'tagged' word, and you don't know if it's anchored in front, back, or not at all. Because of the wildcarding, you're going to pick up all the following words (as a small sample):

    cat

    catamaran

    catalog

    subcategory

    Ducat

    ....

    Your indexing is also mostly useless, because this is the equivalent of doing a LIKE '%cat%' against the main data. It may be faster because the components (words) are smaller and they've been collapsed in the index to a single entry each, but it's still nowhere near optimal. The reason you have to do this is that the full text index actually indexed "<a>cat" as a word, instead of the two words "<a>" and "cat".


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    Twitter: @AnyWayDBA

  • Thanks for the prompt elaboration, appreciated.

    We are on the same page here; I was just confused where you mentioned partial word search. Um... I have a gut feeling that we might have to bank on double storing, eventually...

    I wonder how these search engines function? I mean, they are also doing something similar, though on a far larger data set...

    Regards,

    Kazim Raza

  • Let's take a look at what you're doing. Let's say you've got <a>cat in the hat</a> and category in a document. You're going to have to look for *cat* to find cat because it's anchored in the center of the 'tagged' word, and you don't know if it's anchored in front, back, or not at all. Because of the wildcarding, you're going to pick up all the following words (as a small sample):

    cat

    catamaran

    catalog

    subcategory

    Ducat

    Your indexing is also mostly useless, because this is the equivalent of doing a LIKE '%cat%' against the main data. It may be faster because the components (words) are smaller and they've been collapsed in the index to a single entry each, but it's still nowhere near optimal. The reason you have to do this is that the full text index actually indexed "<a>cat" as a word, instead of the two words "<a>" and "cat".

    Hi Craig F,

    I have the binary images of PDFs stored in a varbinary(max) column in a table, and I am using Full Text Search. When I search for 44 f(2)(E), it returns all the documents containing:

    44 OR 44 F OR 44 OR 44F(2)(A) OR 44 E....

    What do I need to do to make it return only the document having the exact phrase

    44 f(2)(E)?
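
    For reference, the query I'm running is roughly like this (table and column names are placeholders, not the real ones):

    -- The PDFs sit in a varbinary(max) column covered by a full-text index.
    SELECT DocId, DocName
    FROM dbo.LegalDocument
    WHERE CONTAINS(DocContent, N'"44 f(2)(E)"');
    -- As described above, this still brings back documents that only contain
    -- 44, 44 F, 44F(2)(A), 44 E and so on.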

    Sorry for deviating from the original (Kazim's) question.
