Question re. wildcards in CONTAINS clause - full-text search

  • I'm after some guidance on how the addition of a wildcard character to a word in a CONTAINS clause affects the results being returned by a client's application.

    Firstly, some technicals:

    SQL Server 2005 Standard Edition v9.00.2153.00 SP running on

    Windows Server 2003 Standard edition SP1

    Secondly, the search is using the noiseENU.txt noise word file which contains the word "No" which I assume has a major bearing on the examples below. Note that all single numeric and alphabet characters have been removed from the file.

    When the client application runs the following T-SQL:

    SELECT

    dmename

    FROM DOCUMENT

    WHERE CONTAINS(dmename,'"gordon" & "machine" & "no.3"')

    .. the result set returns 800 records with results such as:

    O-47832 SYSTEM - STEELWORK DETAILS, GORDON PS, NO. 3 MACHINE

    O-47831 SUPPORTS AND HANGERS, GORDON PS, NO.3 MACHINE EXCITATION BUSDUCTS

    O-47678 GORDON POWER STATION - NO.3 MACHINE PROCEDURE FOR DISMANTLING TURBINE

    However, when a wildcard character is added to "No.3", ie:

    SELECT

    dmename

    FROM DOCUMENT

    WHERE CONTAINS(dmename,'"gordon" & "machine" & "no.3*"')

    .. only 11 records are returned, such as:

    F-01039 Gordon Power Station, No. 3 Machine Generator, Drawing Index

    A3-03116 DRAFT TUBE AIR ADMISSION LEAKAGE, GORDON POWER STATION, No. 3 Machine

    A1-06405 GORDON POWER STATION, MACHINE NO. 3 DRAFT TUBE CROWN DOOR SEAL DETAILS

    In the latter query all the results return document names where there is a space between "No." and "3" but no results where the document name contains the substring "No.3". As I mentioned at the start, I have a strong hunch that the inclusion of the "No" word in the noise file is related to why this occurs but I'm struggling to get my head around what exactly is happening.

    Any illumination will be greatly appreciated.

    🙂

  • A few things:

    1. Punctuation characters, e.g. the period (.) are removed from indexing and/or search.

    1.1 So this results in a search of "no 3"

    2. Single digits, e.g., the numeral 3 is considered a "noise" word.

    2.1 So now the search term is just "no".

    Therefore, the resulting search term ends up being just "no".

    Comment and suggestion: The "noise" word list is going to cause all sorts of grief for your end users. You'll get lots of questions of "why didn't it find " ". You'll almost need to have the list of terms available at the UI. Try replacing the "noise" word list with an empty list. Microsoft has documentation on how to do this is. Note that you'll have to reindex the data.


    [font="Arial Narrow"](PHB) I think we should build an SQL database. (Dilbert) What color do you want that database? (PHB) I think mauve has the most RAM.[/font]

  • Thanks JohnG

    As I mentioned towards the top of my post all the single numeric and alphabet characters have been removed from the noise file. Would "3" still be treated as a noise word in this case?

  • Sorry, my apology - I didn't read your posting thoroughly.

    The term "no" is also considered a "noise word". It is in the noiseENU.txt file. Therefore, it is also removed from the query.

    I think the difference in behavior is due to the way SQL Server functions as it pertains to the wildcard character (along with punctuation and "noise" word handling). Key note: When the wildcard character (*) is applied to a phrase ALL terms in the phrase are wildcarded, not just the terms with the asterisk. From the BOL:

    "When is a phrase, each word contained in the phrase is considered to be a separate prefix. Therefore, a query specifying a prefix term of "local wine*" matches any rows with the text of "local winery", "locally wined and dined", and so on."

    Our application also uses the full-text feature. We've had to go through a lot of experimentation and testing in order to pin down the exact behavior. Followed up with a companion "searching best practices" document for our end users.


    [font="Arial Narrow"](PHB) I think we should build an SQL database. (Dilbert) What color do you want that database? (PHB) I think mauve has the most RAM.[/font]

  • Looks like we may have to do the same.

    Thanks for your advice

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply