Dirty Data: Avoid the Pitfalls
Dirty data is a prevalent problem for all organizations. Unfortunately, the data-entry personnel who keyed information into transactional systems over the last 20 years never expected that information to be used to analyze customer behavior or internal processes. Dirty data can lead to poor analysis in your business intelligence, so here are a few tips for recognizing dirty data and avoiding it.
There are three major types of dirty data: errant codes, business-rule breakers and misspellings.
Errant codes include information that does not fit into a defined set of codes. For example, there is a finite set of zip codes in the United States. If a data record has a zip code that is outside this set, it is considered an errant code.
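As a rough illustration, an errant-code check amounts to a set-membership test. The sketch below is in Python and rests on assumptions: the three-entry `VALID_ZIP_CODES` set, the `zip` field name and the `find_errant_codes` helper are all hypothetical, and a real check would load the complete list of valid codes from an authoritative source.

```python
# Hypothetical reference set for illustration only; a real check would
# load the complete list of valid zip codes.
VALID_ZIP_CODES = {"10001", "60601", "94105"}


def find_errant_codes(records, valid_codes=VALID_ZIP_CODES):
    """Return the records whose 'zip' value is missing or not in valid_codes."""
    return [r for r in records if r.get("zip") not in valid_codes]


# Made-up sample records: the second one has a zip code outside the set.
records = [
    {"name": "Customer A", "zip": "10001"},
    {"name": "Customer B", "zip": "1000Z"},
]
errant = find_errant_codes(records)
```

Here `errant` holds only the second record, because `1000Z` is not in the defined set.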
Business-rule breakers are those pieces of data that break existing business rules. For example, if an individual's birth date is listed after the date on which that individual received a driver's license, that is a business-rule breaker. An individual simply cannot be born after receiving a driver's license!
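The driver's-license rule can be expressed as a simple date comparison. This is a minimal Python sketch under assumptions: the field names `birth_date` and `license_date` and the helper `breaks_license_rule` are hypothetical, and the sample records are invented for illustration.

```python
from datetime import date


def breaks_license_rule(record):
    """Return True when the birth date falls after the license date."""
    return record["birth_date"] > record["license_date"]


# Made-up sample records: the second one violates the rule.
records = [
    {"id": 1, "birth_date": date(1980, 5, 1), "license_date": date(1998, 6, 15)},
    {"id": 2, "birth_date": date(1999, 3, 2), "license_date": date(1997, 1, 10)},
]
rule_breakers = [r["id"] for r in records if breaks_license_rule(r)]
```

Each business rule would typically get its own small predicate like this one, so that a violation can be traced back to the specific rule it breaks.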
Misspellings are simply typos. For example, the data-entry analyst could have entered Mr. Viajay Sankaran instead of Mr. Vijay Sankaran. These pieces of information do not violate any business rules or fall outside any code ranges; they have simply been typed incorrectly.
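Because a misspelling breaks no rule and no code range, one possible way to surface it is fuzzy matching against a list of known-good values. The Python sketch below uses the standard library's `difflib.get_close_matches`; the list of known names, the `suggest_correction` helper and the 0.8 similarity cutoff are assumptions chosen for illustration, not a recommended setting.

```python
import difflib

# Hypothetical list of known-good first names for illustration only.
KNOWN_NAMES = ["Vijay", "Sanjay", "Maria"]


def suggest_correction(name, known_names=KNOWN_NAMES, cutoff=0.8):
    """Return the closest known name for a likely typo, or None.

    An exact match needs no correction, so it also returns None.
    """
    if name in known_names:
        return None
    matches = difflib.get_close_matches(name, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


suggestion = suggest_correction("Viajay")
```

With this list, `Viajay` is matched to `Vijay`. A suggestion of this kind is only a candidate for human review, since a close match can also be a legitimate, different name.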
Now that the basic types of dirty