SQLServerCentral Editorial

Data Quality on the Open Web


I used to hear that one of the strengths of Linux was the thousands of volunteers who would help you get a patch or a fix in record time when you reported an issue. That worked well, but not well enough for many companies that really wanted a vendor to stand behind patches. A few companies, like Red Hat, sold support agreements for the "free as in beer" OS that ended up costing companies almost as much as a regular license for another OS. While Linux is a great system, it hasn't taken over the world like many people thought it would.

Lately there's been a different flavor of openness on the Internet. It seems that so much of what we read, and what is pushed out to us as news or information, is based on crowd-sourcing of what's popular. Facebook shows a "most active" view, Twitter has trending topics and re-tweets, and many news sites like Reddit use a crowd-voting mechanism to help determine what you see first on their front page.

However, there's a downside to using these open systems: the potential for abuse when a group of people get together. Google started using the open model on its map service to allow people to add businesses to maps. It's a very handy feature, but the addition of a "mark this as closed" button allowed people to abuse the privilege. Whether it was competitors, vandals, or some criminal element isn't known, but apparently the quality of the data Google provides on its maps isn't necessarily accurate. With many people using maps on iPhones and Android devices, this could damage businesses that add themselves to the mapping service. I think Google is playing a little fast and loose with their crowd voting on data points, but with so many companies looking to capitalize on the social networking phenomenon, I'm not surprised it's being abused.

Whenever we build systems that take input from users, we have a maxim: garbage in, garbage out. Essentially, we aren't responsible for bad data, but many companies won't see it that way. They will still expect us to do a better job of policing data quality and to keep bad data out of reports and downstream systems. As more and more companies look to incorporate data from customers into their systems, it becomes more important that data professionals add automated scans and manual workflow checks before data moves out of staging areas, so that incorrect data never reaches our production systems.
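As a minimal sketch of what one of those automated scans might look like, assuming a hypothetical Customer_Staging table and a ReviewReason column for flagging rows, something along these lines could gate the move into production: flag anything suspicious for manual review and promote only the rows that pass every check.

-- Minimal sketch; table and column names (Customer_Staging, ReviewReason, etc.) are hypothetical.
-- Flag rows that fail basic checks so a person can review them before they go anywhere.
UPDATE s
SET    s.ReviewReason = CASE
           WHEN s.Email NOT LIKE '%_@_%._%' THEN 'Suspicious email address'
           WHEN s.PostalCode IS NULL        THEN 'Missing postal code'
           WHEN EXISTS (SELECT 1 FROM dbo.Customer AS c WHERE c.Email = s.Email)
                                            THEN 'Possible duplicate of existing customer'
       END
FROM   dbo.Customer_Staging AS s
WHERE  s.Email NOT LIKE '%_@_%._%'
   OR  s.PostalCode IS NULL
   OR  EXISTS (SELECT 1 FROM dbo.Customer AS c WHERE c.Email = s.Email);

-- Promote only the rows that passed every automated check.
INSERT INTO dbo.Customer (Email, PostalCode, BusinessName)
SELECT s.Email, s.PostalCode, s.BusinessName
FROM   dbo.Customer_Staging AS s
WHERE  s.ReviewReason IS NULL;

The details will vary with the system, but the pattern is the point: nothing crowd-sourced or customer-supplied lands in a production table until it has survived both the automated scan and, for the flagged rows, a human look.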

Steve Jones


The Voice of the DBA Podcasts

Everyday Jones

The podcast feeds are available at sqlservercentral.mevio.com, where you can get the overall RSS feed, or you can subscribe on iTunes. Comments are definitely appreciated and wanted.

Today's podcast features music by Everyday Jones. No relation, but I stumbled on to them and really like the music. Support this great duo at www.everydayjones.com.

You can also follow Steve Jones on Twitter.
