Logic to determine Gender based on FirstName

  • Does anybody know of any logic or algorithm that will determine the gender of a person based on the FirstName column?

    Any tools out there that will let us determine that?

    Any lead will be appreciated.

    Thanks

  • Easy, it can't be done.

    Boy or girl? Kim, Robin, Stac(e)y, Hilary... I'm sure there are lots more gender-neutral names out there...

    /Kenneth

  • ... and Leslie, and Jamie, and Evelyn, and even my own Lee.

  • Aww, come on guys; don't let the 5% of exceptions prevent you from doing the other 95% of the work.

    It'd be some work gathering the data, but a simple Google search for "boys names" and "girls names" could get you some raw data. The first link I found had 11,000 boys' names, but I don't know how easy it'll be to extract them. It's a pain to spider a bunch of pages and rip out the data you want, but it's certainly doable.

    Stick them in a table with a gender flag tied to the source where you found each one.

    CREATE TABLE GOOGLENAMES(name varchar(20), assumedGender char(1))

    insert into GOOGLENAMES(name,assumedGender) values ('Bill','M')
    insert into GOOGLENAMES(name,assumedGender) values ('Jamie','M')
    insert into GOOGLENAMES(name,assumedGender) values ('Jamie','F')

    --eliminate non-deterministic names as an example:
    select * from GOOGLENAMES where name in (select name from GOOGLENAMES group by name having count(name)=1)

    --do an update, using only the names that appear with a single gender:
    update sometable set gender = GOOGLENAMES.assumedGender
    from GOOGLENAMES
    where sometable.firstname = GOOGLENAMES.name
    and GOOGLENAMES.name in (select name from GOOGLENAMES group by name having count(name)=1)

    After that, it is up to you to do something with the exceptions: assume all exceptions are male or female, leave them blank, or review and edit them by probability. Kim is probably 95% female, for example, and you could make assumptions like that.

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!
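The "edited for probabilities" idea above could be sketched roughly as follows. This is only an illustration, not something posted in the thread: the NameGenderCounts table, the counts in it, and the 0.95 cutoff are all hypothetical.

```sql
-- Hypothetical table: one row per (name, gender) with an observed count.
-- Every count below is made up for illustration; it is not real name data.
CREATE TABLE NameGenderCounts (
    name          varchar(20) NOT NULL,
    assumedGender char(1)     NOT NULL,
    occurrences   int         NOT NULL CHECK (occurrences > 0),
    PRIMARY KEY (name, assumedGender)
);

INSERT INTO NameGenderCounts (name, assumedGender, occurrences) VALUES ('Bill',  'M', 1000);
INSERT INTO NameGenderCounts (name, assumedGender, occurrences) VALUES ('Kim',   'F', 950);
INSERT INTO NameGenderCounts (name, assumedGender, occurrences) VALUES ('Kim',   'M', 50);
INSERT INTO NameGenderCounts (name, assumedGender, occurrences) VALUES ('Jamie', 'F', 500);
INSERT INTO NameGenderCounts (name, assumedGender, occurrences) VALUES ('Jamie', 'M', 500);

-- Share of each gender within a name; keep only the rows at or above the cutoff.
-- The 0.95 cutoff is an arbitrary assumption; pick whatever your users will sign off on.
SELECT c.name,
       c.assumedGender,
       CAST(c.occurrences AS decimal(12,4)) / t.total AS share
FROM NameGenderCounts AS c
JOIN (SELECT name, SUM(occurrences) AS total
      FROM NameGenderCounts
      GROUP BY name) AS t
  ON t.name = c.name
WHERE CAST(c.occurrences AS decimal(12,4)) / t.total >= 0.95;
```

With this sample data, Bill and Kim pass the cutoff and Jamie is left for manual review. Storing the share rather than a bare flag also lets you report how confident each guess was.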

  • Well, gee, sure you could do that, Lowell, if you wanted full employment to be your career goal.  🙂

    In my personal experience, however, you can disclaim up front all you like that your results are only going to be x% reliable, and the users will nod their heads in agreement.  But then, once you sign on to the project, you spend the next 800 years in purgatory making constant adjustments because somebody further down the data stream, but up the food chain, doesn't like the counts and doesn't like your assumptions.

    Trust me on this.  Occasionally, I do work for statisticians -- Ph.Ds, mind you -- who get bent out of shape when rolled-up counts deviate by about .5 per 10,000 parts.  And this is *after* we've *already* explained that the numbers will be approximate.

    The real problem is even worse than what you've already laid out.  Take age brackets, for example.  The name 'Lindsay' may be 80% female and 20% male -- but if the subject is forty years or older, it probably flips.

    Some worms are best left in the can.

    I should add, however, I like your approach, Lowell!  🙂
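The age-bracket problem means the lookup key is no longer just the name. One way to picture that, purely as a sketch: key the lookup on name plus a birth-year range. The table names, the year boundaries, the shares, and the birthYear column are all hypothetical here; the thread never says the source data carries a birth year.

```sql
-- Hypothetical lookup keyed by name AND birth-year range.
-- The year ranges and shares are invented to mirror the 'Lindsay' example; they are not real statistics.
CREATE TABLE NameGenderByCohort (
    name          varchar(20)  NOT NULL,
    birthYearFrom int          NOT NULL,
    birthYearTo   int          NOT NULL,
    assumedGender char(1)      NOT NULL,
    share         decimal(5,4) NOT NULL,  -- fraction of people with this name in this cohort
    PRIMARY KEY (name, birthYearFrom, assumedGender)
);

INSERT INTO NameGenderByCohort VALUES ('Lindsay', 1900, 1965, 'M', 0.8000);
INSERT INTO NameGenderByCohort VALUES ('Lindsay', 1900, 1965, 'F', 0.2000);
INSERT INTO NameGenderByCohort VALUES ('Lindsay', 1966, 2100, 'F', 0.8000);
INSERT INTO NameGenderByCohort VALUES ('Lindsay', 1966, 2100, 'M', 0.2000);

-- Hypothetical stand-in for the table being genderized.
CREATE TABLE CohortDemoPeople (firstname varchar(20) NOT NULL, birthYear int NOT NULL);
INSERT INTO CohortDemoPeople VALUES ('Lindsay', 1950);
INSERT INTO CohortDemoPeople VALUES ('Lindsay', 1985);

-- Same first name, different guess depending on the cohort the person falls into.
SELECT p.firstname, p.birthYear, g.assumedGender, g.share
FROM CohortDemoPeople AS p
JOIN NameGenderByCohort AS g
  ON g.name = p.firstname
 AND p.birthYear BETWEEN g.birthYearFrom AND g.birthYearTo
WHERE g.share >= 0.80;
```

Every extra dimension like this multiplies the reference data you have to source and maintain, which is the maintenance burden being warned about.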

  • I agree with Lee.  This will cause you a great deal of difficulty down the line. 

    I also agree that I liked Lowell's approach... 

    I wasn't born stupid - I had to study.

  • I don't know why I'd do that... almost all the data would have to be manually rechecked for validity... why not go down that road instead??

  • Or what happens if a male named KIM gets an official document that lists him as a Female? Lawsuit maybe? It is possible.

    A lot depends on what this data will be used for.

    Remember the Johnny Cash song "A Boy Named Sue?"

    -SQLBill

  • > Remember the Johnny Cash song "A Boy Named Sue?"

    How do you do?

  • There are lots of products/services on the market that offer genderization tools. Companies offering NCOA list services will typically genderize your tables for a certain $$$ amount per thousand with a minimum order of ### records. Typically these companies don't like to do anything for less than $500, but they will probably charge around $3/1000 records with that $500 figure as a price floor.

    Moral of the story - go google "NCOA list services"

  • How do you like that?  I guess my perspective was too parochial to even consider looking for something like that.

    One good thing that might do is shift the liability for making mistakes.  Sounds expensive, though.

     

  • You're right; I could see how making any assumption like this could come back and haunt you forever.

    I've been similarly nailed for assuming that a blank city could be looked up based on zip code. As different sections of cities incorporate themselves, you need a really up-to-date source to know that the zip code 33331, for example, is no longer Fort Lauderdale, but is actually Weston, a newly incorporated city.

    Thanks!

    Lowell
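The zip code problem is the same kind of trap: the lookup is only right as of some date. A minimal sketch of a date-ranged lookup follows, assuming a hypothetical ZipCity table. The validFrom/validTo dates are placeholders, not the actual date Weston incorporated.

```sql
-- Hypothetical zip-to-city lookup with validity dates.
-- The dates are placeholders for illustration, not the real incorporation date.
CREATE TABLE ZipCity (
    zip       char(5)     NOT NULL,
    city      varchar(50) NOT NULL,
    validFrom datetime    NOT NULL,
    validTo   datetime    NULL,      -- NULL means the row is still current
    PRIMARY KEY (zip, validFrom)
);

INSERT INTO ZipCity (zip, city, validFrom, validTo) VALUES ('33331', 'Fort Lauderdale', '19000101', '20000101');
INSERT INTO ZipCity (zip, city, validFrom, validTo) VALUES ('33331', 'Weston',          '20000101', NULL);

-- City for a zip code as of a given date (validFrom <= date < validTo).
DECLARE @asOf datetime;
SET @asOf = '20050615';

SELECT city
FROM ZipCity
WHERE zip = '33331'
  AND validFrom <= @asOf
  AND (validTo IS NULL OR @asOf < validTo);
```

This only helps if someone keeps the table current, which is exactly the "really up-to-date source" requirement described above.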



  • There are products by a company called DataFlux that can provide what you are looking for. DF Power Studio being the main contender. It is basically a data analysis tool, but works great for this kind of thing.

  • I have no experience with either of those tools, but I wonder what kind of crystal ball they have built in, if they do any better at guessing than anyone else when given just the name 'Kim': is it a boy or a girl?

    The thing is, at least in my mind, that we deal with data - data is facts - just a first name is not enough fact alone to determine gender.

    Having said that, of course there may be certain applications that don't purport to present facts, just 'good guesses', and sure, it's anyone's choice to do that. However, if that's not clearly stated (i.e. how do I make a best guess), I'll always presume that we want to deal with facts and not guesswork.

    /Kenneth

  • I use a tool called ActiveGender from The Software Company that would work for you.  I got the ActiveX version a while back ($250, no recurring fees) but they now have both COM and .NET versions.

    I'm glad that all these people saying it can't be done work with nice clean, deterministic data.  You haven't a clue about the problems with names & addresses.  Yes, you're not going to be 100% correct guessing the gender from the first name, but you have to do it anyway.

    We have to deal with phony names & addresses, cleverly disguised obscene names, entries from illiterates who can't spell their address, transposed numbers in zipcodes.  If the worst we do is refer to someone by the wrong gender, it's a good day.
