Search part of string within part of string

  • I have a table (tblOrganisations) with organisations in it. One column is called OrganisationName.

    I have recently imported a load of organisations into a new table - tblOrganisationsImport. This has a column called NewOrganisationName.

    Many of the organisations in tblOrganisationsImport are already listed in tblOrganisations. I need to find out which ones.

    Typical Organisations might be:

    The Society for Back Pain Research

    The Back Pain Society

    The Association of Back Pain Specialists

    The Association of Back Pain Specialists (ABPS)

    I have started by created stripped version of the name fields - so they contain no spaces, punctuation, brackets etc.

    Now I need to run some sort of routine to find matches. I can't compare the whole strings because there are thousands of organisations that are the same but which have been entered in subtly different ways.

    So I want to find matches on something like (in pseudo code)

    ... where a substring 15 characters long in the stripped name field of one table is in the stripped name field of the other table (and, say, the postcodes match) (I can do a couple of joins to link to at an address so I can compare stripped postcodes)

    ... where a substring 10 characters long in the stripped name field of one table is in the stripped name field of the other table (and, say, the postcodes match) (I can do a couple of joins to link to at an address so I can compare stripped postcodes)

    I can't just compare postcodes because they are all medical organisations and some buildings contain dozens of different organisations / associations / societies / institutes etc.

    Any pointers much appreciated.

  • Don't all these groups have some sort of registered group number that would be mostly immutable? If not, then I suggest you make one because there's no way to figure out that one set of words might mean one group without it meaning another group. You either have to get good data or make the data better by creating a group number and adding to the group number table everytime you come across something that's not in the table.

    If you do it right, it'll also make your life a lot easier when one of the groups or annexes to a group changes their address.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Thanks for your answer.

    Yes, the existing organisation table - has a primary key. The other table is an import from an excel spreadsheet that just contains the names and addresses of a few thousand organisations - most of which will already be in the organisation table.

    The only data I can sensibly compare is the organisation name and/or the postcode.

    What I'm really after is something that says ...

    If any ten characters in the organisation name field in table 1 match any 10 characters in the organistion name field in table 2

    Is this a regular expression thing? Can you use regular expressions with SQL Server?

  • You can't really use regular expressions in SQL Server unless you do an SQLCLR.

    The real problem here is that even if you can get it to make human-like decisions, even a human could get it wrong here.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • I would suggest that you try the fuzzy lookup component in SSIS. It's not perfect, but having done a data import from a dirty source recently, I can say from experience that it is very good at this type of matching.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Personally, I'd abandon the idea of using pure T-SQL for this.

    If you've got SQL 2008 Enterprise Edition, consider using the fuzzy lookup task in SSIS. Play around with match confidence levels and consider what you'd want to use as noise words (besides the usual "and","the" etc. are there any common acronyms in the industry you'd want to ignore e.g. "Ltd", "Associates", "Partners" etc.).

    For example, you could do a straight lookup on the postcode field, then do a fuzzy lookup to one match confidence level if the postcode matches and another (higher) confidence level if it doesn't.

    It'll be important to actually have a person validate the results after you've finished as this isn't an exact science.

    Edit: Bah, Gail beat me to it 😀

  • I have SQL Server Management Studio 10.0.2531.0

    What is SSIS? Do I have it? How can I find it?

    Thanks for your replies.

  • SQL Server Integration Services. One of the services you can install when you install SQL. Developed using Business Intelligence Development Studio.

    Some time with google will probably be beneficial here.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Gosh... I just wouldn't do this, folks. Too many possible problems. For example, there's just no guarantee that the following orgs are the same org...

    The Society for Back Pain Research

    The Back Pain Society

    The wrong problem is being addressed and your trying to shut the door after the horse got out. The REAL problem is the cruddy input data and the spreadsheet that allowed this mess to happen in the first place. If it HAS to be a spreadsheet, then give the users a spreadsheet with a restricted list pulldown on it.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff Moden (8/23/2010)


    The wront problem is being addressed and your trying to shut the door after the horse got out. The REAL problem is the cruddy input data and the spreadsheet that allowed this mess to happen in the first place. If it HAS to be a spreadsheet, then give the users a spreadsheet with a restricted list pulldown on it.

    I definitely agree with you Jeff - Garbage in Garbage out. However, having worked in health care on more than one occasion you are really lucky to get the data at all and to ask for it in some consistent format, well, that is not acceptable. Most times they are pulling this off some old VMS type mainframe or something where there is only one guy that runs it and knows anything about it and he just won't budge to do any data scrubbing for your output. So, you get what you get.

    I am wondering if there are other data points that you could request from the source, i.e. address, billing number, etc where you could match on that. Trying to match on names never works and will always require manual scrubbing so, if you can find some other identifier you will be far better off working on that end.

    Hope this helps.

    David

    @SQLTentmaker

    “He is no fool who gives what he cannot keep to gain that which he cannot lose” - Jim Elliot

  • Thanks again for your replies. There is no chance of getting better data. The organisation in question is about to start using a new system - the only list of organisations they have at the moment is kept in a spreadsheet. There are no account numbers or anything else that could be used to pin things down - just name and half a dozen address fields.

    I'll have to try and get the SSIS thing working that was suggested.

  • just name and half a dozen address fields

    And you can't group by these to get a distinct set of organizations? You will still have to scrub the name but that would be expected in any case but if you can match the addresses back to your previously existing data then you have the brunt of that work done I would expect.

    I know I have seen all sorts of this data and it is never fun. I'll be curious to hear how things turn out for you.

    David

    @SQLTentmaker

    “He is no fool who gives what he cannot keep to gain that which he cannot lose” - Jim Elliot

  • sku370870 (8/22/2010)


    I have a table (tblOrganisations) with organisations in it. One column is called OrganisationName.

    I have recently imported a load of organisations into a new table - tblOrganisationsImport. This has a column called NewOrganisationName.

    Many of the organisations in tblOrganisationsImport are already listed in tblOrganisations. I need to find out which ones.

    Typical Organisations might be:

    The Society for Back Pain Research

    The Back Pain Society

    The Association of Back Pain Specialists

    The Association of Back Pain Specialists (ABPS)

    ....................

    Any pointers much appreciated.

    SSIS is very good tools for matching if you do 'full' match, I mean:

    match "The Association of Back Pain Specialists" against the same string with some misspelling like this "The AssoTiation of Bak Pain Specialits" . But if you trying to compare "The Association of Back Pain Specialists" with "The Association Specialists of Pain Back " or just "The Back Pain Society" it will be no successfully. If you need to match some sub-string you need to do some process like 'search'. SSIS is not created for searching.

    You can consider Walnutil - based on Lucene API - integrated to SQL Server and Oracle DBs. Walnutil works with tables and fields. Using Walnutil you need to have Primary Key in table, so after creating indexes you have possibility to create relations between tables and so on. For simple search you can use some methods from walnutilsoft, for more complicated search cases you can use Lucene API. Lucene is very good tolls for searching with very power API for different type of searching.

    See Demo and some code written on Java and C# which you can use it for creating different type of search...:-)

    Igor.

    http://www.walnutilsoft.com/

  • Who makes the decision on which organizations are the same? Is there a set of rules anywhere, or are you just basing it on a similar name set? Basically, how would YOU know which organizations are the same?

    i.e. Association for Back Pain of Illinois vs. Association for Back Pain and Illness

    Both are similar and I imagine even fuzzy logic in SSIS would consider both of those the same. Without knowing the complete data set, the rules used to separate/consolidate them, or if there are any other fields we can compare on also (street, city, zip code, etc)..this is not something you can really automate.

    How many records are you talking about here, out of curiosity?

  • Derrick Smith (9/10/2010)


    Who makes the decision on which organizations are the same? Is there a set of rules anywhere, or are you just basing it on a similar name set? Basically, how would YOU know which organizations are the same?

    i.e. Association for Back Pain of Illinois vs. Association for Back Pain and Illness

    YOU/WE make decision which organization is the same. We can use similarity in the fuzzy lookup(rang from 0.0 to 1.0). Yes you are right, here is set of rules. So you can manipulate similarity and may be some thing else. Depend of your goal you can set similarity = 1 and <Association for Back Pain of Illinois vs. Association for Back Pain and Illness> will be not the same or set similarity = 0.6(or like this) you can count that is the same in your terminology.

    Derrick Smith (9/10/2010)


    How many records are you talking about here, out of curiosity?

    For WalnUtil I used Score(0.0 - 1.0) and Similarity(0.0 - 1.0) for making decision for searching. Up to 10 millions records was tested . On Web Demo on WalnUtilSoft.com you can see example for more then 2 millions

    Igor .

    http://www.walnutilsoft.com

Viewing 15 posts - 1 through 15 (of 18 total)

You must be logged in to reply to this topic. Login to reply