Chengwei LEI, Ph.D.    Associate Professor

Department of Computer and Electrical Engineering and Computer Science
California State University, Bakersfield

 

Data Cleaning



Data cleaning, also known as data cleansing or data scrubbing, is the process of fixing or removing incorrect or irrelevant data from a dataset. It's a critical step in data processing that improves the quality and reliability of data.

 

Data in the Real World Is Dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission errors




Incomplete Data

 

lacking attribute values, lacking certain attributes of interest, or containing only aggregate data (missing data)

  

  • Data is not always available:
    many tuples have no recorded value for several attributes, such as customer income in sales data

  • Missing data may be due to:
    equipment malfunction;
    values deleted because they were inconsistent with other recorded data;
    data not entered due to misunderstanding;
    certain data not considered important at the time of entry;
    history or changes of the data not registered.
    Typical remedies are to drop the affected tuples or to impute the missing values, as sketched below.
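
A minimal sketch of those two remedies using pandas; the DataFrame and its customer-income column are hypothetical examples, not from the lecture:

```python
import pandas as pd
import numpy as np

# Hypothetical sales data where some tuples lack the income attribute.
sales = pd.DataFrame({
    "customer": ["A", "B", "C", "D"],
    "income":   [52000, np.nan, 61000, np.nan],
})

# Remedy 1: drop tuples with no recorded value for the attribute.
dropped = sales.dropna(subset=["income"])

# Remedy 2: impute the missing value, e.g., with the column mean.
imputed = sales.fillna({"income": sales["income"].mean()})

print(dropped)
print(imputed)
```

Dropping is safe only when few tuples are affected; imputation keeps the tuples but introduces estimated values.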



 

Noisy Data

 

Data may contain noise, errors, or outliers.
Noise is random error or variance in a measured variable.

   

  •  Incorrect values may be due to:
    faulty data collection instruments;
    data entry problems;
    data transmission problems;
    technology limitations;
    inconsistencies in naming conventions.
    Outliers introduced by such errors can often be flagged statistically, as sketched below.
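
One common statistical flag is the interquartile-range (IQR) rule; a minimal sketch in pandas, with hypothetical sensor readings:

```python
import pandas as pd

# Hypothetical readings; 42.0 looks like a faulty-instrument value.
readings = pd.Series([9.8, 10.1, 10.0, 9.9, 10.2, 42.0])

q1, q3 = readings.quantile(0.25), readings.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are flagged for inspection rather than
# silently deleted, since an extreme value may still be genuine.
outliers = readings[(readings < lower) | (readings > upper)]
print(outliers)
```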


Duplicate Records

 

"Duplicate records" refer to rows or entries within a dataset that are exactly the same, meaning they contain identical information across all their columns, essentially representing the same data point multiple times, which can be caused by data entry errors, system glitches, or merging datasets with overlapping information; these duplicates can skew analysis results if not properly identified and removed.

 



 

Inconsistent Data

 

 "Inconsistent data" refers to a situation where information within a dataset is not standardized or uniform, meaning different parts of the data may contradict each other, have different formats, or be incomplete, leading to potential inaccuracies in analysis if not properly addressed; essentially, it's a lack of consistency across various data points within a dataset.