Bug or Feature?

  • I was working on getting our systems added to the Google news site to help increase our traffic, grow, all the usual things.

    So I submit the site and I get a message back a week or two later that Database Daily is an aggregator and the Google people only want to show original news to their readers.

    OK, I can deal with that, but I inquired about SQLServerCentral.com, which does have original content every day. I think you'll be interested in the reply, as I was, though perhaps stunned is a better description.

    I first heard back that they couldn't find original content on the front page. I thought that was strange, so I surfed over to the front page and checked. Sure enough, on Friday, there were 2 original articles, of the 6 articles listed there. The source was clearly identified as SQLServerCentral.com. So I wrote back and included the URLs of the articles so that the Google people were clear that they were original content.

    I then got this:

    Thank you for your reply. After some investigation, we've found that our system cannot crawl some of your articles because of the format of their URLs. In order to have your articles crawled by Google News, their URLs must contain a number consisting of at least three digits.

    For example, our news crawler would not crawl articles with the following URLs:

    http://www.google.com/news/article23.html

    http://www.google.com/lemurs_in_the_mist.html

    It would crawl these pages:

    http://www.google.com/news/08112003/article.html

    http://www.google.com/news/lemurs_in_the_mist/23467.html

    How's that for some cool system design? And aren't the Google people the ones with that amazing aptitude test to weed out the guys that aren't as smart as they are?

    Maybe they should spend a few less minutes in the theoretical lab and a few more in the practical system design class.

    Steve Jones

  • Are you really surprised? Now that they are public, all the good "old" coders are probably all history, and only the young, inexperienced new IT grads are being hired, who all seem to have a nasty habit of beating their code and design into submission instead of taking the time to think about what all the ramifications of their actions might be. I have a friend who puts it this way: "If you're about to build a house, who do you want to frame it up for you? The kid that just graduated from tech school and knows all the ins-and-outs of the latest computerized job site saw, or a good old take-his-time carpenter in a 3 year old F150 who has done a hundred of these jobs, each one better than the last?" ... Oh, and the old guy doesn't juggle his hacky-sack balls in his cubicle either ...

  • Actually I was surprised since it's a completely (insert inappropriate language here) design. And they're supposed to be "smarter" than everyone else. I may be heading back to Yahoo now.

    And on the framing, not sure. I get what your analogy is, but in construction more so than software, I tend to go with who I "trust" a bit from meeting them. I've met too many good old carpenters who'd do a great job, but get halfway done and leave.

  • I don't see what the digits get them based on the example they sent back. If they're trying to weed out non-news items, it doesn't do 'em any good. Egads. They probably built their crawler after analyzing how the major media sources publish links to articles. But what if FoxNews.com or CNN.com suddenly decided to change their format so it wasn't a three-digit number? What if they went to a unique-identifier format? Betcha Google would revamp the crawler in a hurry.

    K. Brian Kelley
    @kbriankelley

  • Well, I haven't seen any suggestions yet as to how they're supposed to date the contents on SQLServerCentral.com or any other web site EASILY without a numeric reference. I looked at just a couple of articles on CNN.com, and saw they use a combination year/majorsubject/minorsubject/month/date format for their article URLs. A few at Yahoo use the combination majorsubject/minorsubject/source/date/articleID. These both seem like a reasonable way to differentiate items, and even group them for archiving/removal after a certain period of time.

    I don't know...how particularly would YOU go about grouping and dating your articles by URL only?

  • You all need to try http://www.metacrawler.com.

    It searches Google, Yahoo, Ask Jeeves, and other sites. I get a better return there on more arcane searches than I ever have with Google. How do you think I found this place the first time?

    Just my $0.02.



    ----------------
    Jim P.

    A little bit of this and a little byte of that can cause bloatware.

  • I would agree with what you're saying, but one of the examples they gave has NO indication of dating:

    http://www.google.com/news/lemurs_in_the_mist/23467.html

    If they'll crawl this, why won't they crawl other URLs without three digits?

    K. Brian Kelley
    @kbriankelley

  • Maybe the numbers guy did some statistics on a bunch of sites, then determined what features were shared by most of the news sites.  That could lead to a rule like this.  Or maybe some "smart" person in management came up with it.  How many of you have worked on projects that have been crippled by dumb "business rules" and feature requests?  Probably more of you than not.

    Let's be fair to Google dev, instead of jealously poking fun at them (this holds for M$, too).  I mean, they aren't perfect, but they really do know what the hell they are doing...

    http://www.adaptivepath.com/publications/essays/archives/000385.php

    Signature is NULL

  • Ok, so you are attacking the intelligence of the people at Google?

    I know some people (republicans) who observe an action that they don't agree with, then immediately launch a personal attack, completely changing the nature of the argument.

    This article struck me as a cheap shot - I certainly did not get anything from reading it, other than to find out that you were a bit angry. It was not informative.

  • Not that Steve needs to be defended, but he's reacting with honest frustration at a limitation that doesn't seem to make any sense. We've all done that. Except in Steve's case it does affect the business that is SQLServerCentral.com. Either SSC doesn't get listed or SSC has to rework its link structure.

    Looking at the limitation, it seems needless. As one poster put it, if it's date related, I could buy that. In other words, if the path or URL gave some indication as to the date of the article and Google's crawler keyed on that, understandable. But that's not the case. The second link example they give is just a numeric entry, probably indicative of a sequential numbering system for articles. However, if Google isn't keying on date, is there a need to key on a sequential system? Probably not. It may sound great theoretically, but in practice it is limited and unnecessarily so.

    K. Brian Kelley
    @kbriankelley

  • Search Engines strive to be sensible.  I always thought that was part of the reason Google did well.   This is not sensible, in my view.

    Maybe the aspersions cast on their staffing have a point.

    As to the speculation going on about dating of web sites: there are other ways of getting dates than looking at the URL!!  (If only the form of the URL is looked at, then some over-enthusiastic SEO doofus is going to mess that up pretty quick, with some fancy tricks.)

  • Someone earlier made the point that they're probably just trying to keep out non-news items from their aggregator. I think that's probably the case.

    Sure, it's not a perfect system; Steve demonstrated that well. But can you come up with a better way for a computer to decide with relatively good accuracy whether a CNN page is news or not?
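
For anyone who wants to test their own links against the rule quoted in Google's reply at the top of the thread ("their URLs must contain a number consisting of at least three digits"), here is a minimal sketch in Python. It is only a guess at the rule, not Google's actual crawler logic. It assumes the rule means a run of three or more consecutive digits, and that only the path part of the URL is checked. The function name looks_crawlable is made up for illustration.

```python
import re
from urllib.parse import urlparse

# Assumption: "a number consisting of at least three digits" is read as
# three or more consecutive digits. The real crawler rule is not public.
_THREE_OR_MORE_DIGITS = re.compile(r"\d{3,}")


def looks_crawlable(url: str) -> bool:
    """Hypothetical check: does the URL path contain a 3+ digit number?"""
    # Assumption: only the path is inspected, not the host or query string.
    path = urlparse(url).path
    return _THREE_OR_MORE_DIGITS.search(path) is not None


if __name__ == "__main__":
    # The four example URLs from the reply quoted in the first post.
    examples = [
        "http://www.google.com/news/article23.html",
        "http://www.google.com/lemurs_in_the_mist.html",
        "http://www.google.com/news/08112003/article.html",
        "http://www.google.com/news/lemurs_in_the_mist/23467.html",
    ]
    for url in examples:
        print(looks_crawlable(url), url)
```

Under this reading, the two URLs the reply says would be skipped contain at most a two-digit run, while the two it says would be crawled each contain a longer one. Note that the check says nothing about dates, which is the point raised earlier in the thread: 23467 passes just as well as 08112003.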
