Glossary

Data Cleansing

The process of detecting and correcting (or removing) corrupt or inaccurate records from a dataset, ensuring high-quality data for training.

Definition

Data Cleansing, also known as data cleaning or data scrubbing, is a critical preprocessing step in the data preparation phase of machine learning and AI projects. It involves identifying and rectifying errors, inconsistencies, and anomalies in data to improve its quality and reliability. This process may include handling missing values, correcting typos or spelling errors, standardizing data formats, and removing duplicates or irrelevant entries.
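The steps listed above can be sketched in a few lines of code. The following is a minimal illustration, not a production pipeline; the record fields, the typo map, and the mean-based imputation strategy are all assumptions made for this example.

```python
# Minimal sketch of common data-cleansing steps on a list of dict records:
# fixing typos, standardizing format, imputing missing values, deduplicating.
# All field names and the typo map are illustrative assumptions.

raw_records = [
    {"name": "Alice", "city": "new york", "age": "34"},
    {"name": "Bob",   "city": "New Yrok", "age": None},   # typo + missing value
    {"name": "Alice", "city": "new york", "age": "34"},   # exact duplicate
]

TYPO_FIXES = {"New Yrok": "New York"}  # known misspellings -> corrections

def cleanse(records):
    # Mean of the known ages, used as a simple imputation value.
    ages = [int(r["age"]) for r in records if r["age"] is not None]
    mean_age = round(sum(ages) / len(ages))

    seen, cleaned = set(), []
    for r in records:
        city = TYPO_FIXES.get(r["city"], r["city"]).title()  # fix typo, standardize case
        age = int(r["age"]) if r["age"] is not None else mean_age  # impute missing
        row = (r["name"], city, age)
        if row in seen:  # drop exact duplicates
            continue
        seen.add(row)
        cleaned.append({"name": r["name"], "city": city, "age": age})
    return cleaned
```

Real pipelines typically replace each step with a more robust strategy (fuzzy matching for typos, model-based imputation, near-duplicate detection), but the overall shape is the same.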

Data cleansing is essential because the accuracy and effectiveness of machine learning models are directly influenced by the quality of the training data. Clean, consistent data enables more efficient training, better model performance, and more reliable predictions. Given the diversity and complexity of data sources, especially in large-scale projects, data cleansing can be both challenging and time-consuming, often requiring automated tools as well as manual inspection to ensure thoroughness.

Examples / Use Cases

In a real estate pricing prediction model, data cleansing might involve standardizing the format of addresses, correcting outliers in property size or price that may result from data entry errors, and filling in missing values for the number of bedrooms or bathrooms with appropriate estimates or averages.
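The outlier correction and imputation described above could be sketched as follows. The price threshold (a multiple of the median) and the median-based bedroom imputation are assumptions chosen for illustration, not a prescribed method.

```python
from statistics import median

# Illustrative sketch: drop implausible prices (likely data-entry errors)
# and impute missing bedroom counts in a toy real-estate dataset.
# Field names, values, and the 10x-median threshold are assumptions.

listings = [
    {"price": 450_000,    "bedrooms": 3},
    {"price": 520_000,    "bedrooms": None},  # missing value
    {"price": 52_000_000, "bedrooms": 4},     # likely an extra digit
    {"price": 480_000,    "bedrooms": 2},
]

def clean_listings(rows, outlier_factor=10):
    med_price = median(r["price"] for r in rows)
    med_beds = median(r["bedrooms"] for r in rows if r["bedrooms"] is not None)
    cleaned = []
    for r in rows:
        if r["price"] > outlier_factor * med_price:  # drop price outliers
            continue
        beds = r["bedrooms"] if r["bedrooms"] is not None else int(med_beds)
        cleaned.append({"price": r["price"], "bedrooms": beds})
    return cleaned
```

Whether an outlier should be dropped, capped, or corrected depends on the domain; here the record is simply removed for brevity.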

For a dataset collected from multiple sources, cleansing would also ensure that all dates follow a consistent format and that categorical values such as property type (e.g., house, apartment, townhouse) use a consistent set of labels across the dataset. This reduces noise and inconsistencies in the data, leading to more accurate and reliable predictions of property prices.
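Standardizing dates and categorical labels from mixed sources might look like this. The accepted date formats and the label map are assumptions for the example; real datasets usually need a longer list of both.

```python
from datetime import datetime

# Illustrative sketch: normalize date strings from multiple sources to ISO
# 8601 and map property-type variants onto canonical labels.

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%b %d, %Y")  # assumed source formats
TYPE_MAP = {
    "apt": "apartment", "apartment": "apartment",
    "hse": "house", "house": "house",
    "th": "townhouse", "townhouse": "townhouse",
}

def to_iso(date_str):
    """Try each known format; return the date as YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(date_str, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {date_str!r}")

def canonical_type(label):
    """Map a raw property-type string to its canonical label."""
    return TYPE_MAP[label.strip().lower()]
```

Failing loudly on an unrecognized format (rather than guessing) is deliberate: silent misparses, such as swapping day and month, are exactly the kind of error cleansing is meant to catch.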