SQLServerCentral Editorial

Data Quality on the Open Web


I used to hear that one of the strengths of Linux was the thousands of volunteers who would help you get a patch or a fix in record time when you reported an issue. That worked well, but not well enough for many companies that really wanted a vendor to stand behind patches. A few companies, like Red Hat, sold support agreements for the "free as in beer" OS that ended up costing companies almost as much as a regular license for another OS. While Linux is a great system, it hasn't taken over the world like many people thought it would.

Lately there's been a different flavor of openness on the Internet. It seems that so much of what we read, and what is pushed out to us as news or information, is based on crowd-sourcing of what's popular. Facebook shows a "most active" view, Twitter has trending topics and re-tweets, and many news sites like Reddit use a crowd-voting mechanism to help determine what you see first on their front page.

However, there's a downside to using these open systems: the potential for abuse when a group of people get together. Google started using the open model on its map service to allow people to add businesses to maps. It's a very handy feature, but the addition of a "mark this as closed" button allowed people to abuse the privilege. Whether it was competitors, vandals, or some criminal element isn't known, but apparently the quality of the data Google provides on its maps isn't necessarily accurate. With many people using maps on iPhones and Android devices, this could damage businesses that add themselves to the mapping service. I think Google is playing a little fast and loose with their crowd voting on data points, but with so many companies looking to capitalize on the social networking phenomenon, I'm not surprised it's being abused.

Whenever we build systems that take input from users, we have a maxim: garbage in, garbage out. Essentially, we aren't responsible for bad data, but many companies won't see it that way. They will still expect us to do a better job of policing data quality and to keep bad data out of reports and downstream systems. As more and more companies look to incorporate data from customers into their systems, it becomes more important that data professionals add automated scans and manual workflow checks before data moves out of staging areas, so that incorrect data never reaches our production systems.
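As a minimal sketch of what one of those automated scans might look like, assuming a hypothetical Customer_Staging table and a ReviewReason column for flagging rows, something along these lines could gate the move into production: flag anything suspicious for manual review and promote only the rows that pass every check.

-- Minimal sketch; table and column names (Customer_Staging, ReviewReason, etc.) are hypothetical.
-- Flag rows that fail basic checks so a person can review them before they go anywhere.
UPDATE s
SET    s.ReviewReason = CASE
           WHEN s.Email NOT LIKE '%_@_%._%' THEN 'Suspicious email address'
           WHEN s.PostalCode IS NULL        THEN 'Missing postal code'
           WHEN EXISTS (SELECT 1 FROM dbo.Customer AS c WHERE c.Email = s.Email)
                                            THEN 'Possible duplicate of existing customer'
       END
FROM   dbo.Customer_Staging AS s
WHERE  s.Email NOT LIKE '%_@_%._%'
   OR  s.PostalCode IS NULL
   OR  EXISTS (SELECT 1 FROM dbo.Customer AS c WHERE c.Email = s.Email);

-- Promote only the rows that passed every automated check.
INSERT INTO dbo.Customer (Email, PostalCode, BusinessName)
SELECT s.Email, s.PostalCode, s.BusinessName
FROM   dbo.Customer_Staging AS s
WHERE  s.ReviewReason IS NULL;

The details will vary with the system, but the pattern is the point: nothing crowd-sourced or customer-supplied lands in a production table until it has survived both the automated scan and, for the flagged rows, a human look.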

Steve Jones


The Voice of the DBA Podcasts

Everyday Jones

The podcast feeds are available at sqlservercentral.mevio.com, where you can get the overall RSS feed, or you can subscribe on iTunes. Comments are definitely appreciated and wanted.

Today's podcast features music by Everyday Jones. No relation, but I stumbled on to them and really like the music. Support this great duo at www.everydayjones.com.

You can also follow Steve Jones on Twitter.
