πŸ•·οΈ Crawler Inspector

URL Lookup

Direct Parameter Lookup

Raw Queries and Responses

1. Shard Calculation

Query:
Response:
Calculated Shard: 37 (from laksa184)

2. Crawled Status Check

Query:
Response:

3. Robots.txt Check

Query:
Response:

4. Spam/Ban Check

Query:
Response:

5. Seen Status Check

ℹ️ Skipped - page is already crawled
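The five lookup steps above can be sketched end to end. This is a hypothetical reconstruction: none of the function or field names below come from the inspector (the page does not show its backend API), and the shard count of 64 and the in-memory record store are placeholders for illustration.

```python
# Hypothetical sketch of the inspector's five checks; all names are
# illustrative, not the tool's real API.
NUM_SHARDS = 64  # assumed shard count, not from the source


def shard_for(root_hash: int, num_shards: int = NUM_SHARDS) -> int:
    """1. Shard calculation: route the URL's root hash to a shard."""
    return root_hash % num_shards


def inspect(root_hash: int, store: dict) -> dict:
    rec = store.get(root_hash, {})
    status = {"shard": shard_for(root_hash)}
    status["crawled"] = rec.get("crawled", False)               # 2. crawled status
    status["robots_allowed"] = rec.get("robots_allowed", True)  # 3. robots.txt
    status["banned"] = rec.get("spam_score", 0) > 0             # 4. spam/ban
    if not status["crawled"]:                                   # 5. seen status,
        status["seen"] = rec.get("seen", False)                 #    skipped if crawled
    return status


store = {42: {"crawled": True, "robots_allowed": True, "spam_score": 0}}
print(inspect(42, store))
```

Running the seen-status check only for uncrawled pages mirrors the "Skipped - page is already crawled" behavior shown above.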

πŸ“„ INDEXABLE Β· βœ… CRAWLED (4 days ago) Β· πŸ€– ROBOTS ALLOWED

Page Info Filters

| Filter | Status | Condition | Details |
| --- | --- | --- | --- |
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 0.1 months ago |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |
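The filter conditions in the table read like SQL/ClickHouse expressions. A minimal Python re-implementation is sketched below, assuming the field names shown in the table; the sample record is hand-built from the page details on this page, and the six-month cutoff is approximated as 180 days.

```python
from datetime import datetime, timedelta

# Re-implementation of the five indexability filters from the table above.
# Field names mirror the conditions shown; the record itself is illustrative.
def indexable(rec: dict, now: datetime) -> dict:
    return {
        "HTTP status": rec["download_http_code"] == 200,
        # "6 MONTH" approximated as 180 days
        "Age cutoff": rec["download_stamp"] > now - timedelta(days=180),
        "History drop": rec.get("history_drop_reason") is None,
        "Spam/ban": rec.get("fh_dont_index") != 1 and rec.get("ml_spam_score", 0) == 0,
        "Canonical": rec.get("meta_canonical") in (None, "", rec["src_unparsed"]),
    }


now = datetime(2026, 4, 9)
rec = {
    "download_http_code": 200,
    "download_stamp": datetime(2026, 4, 5, 0, 30, 48),
    "history_drop_reason": None,
    "fh_dont_index": 0,
    "ml_spam_score": 0,
    "meta_canonical": None,
    "src_unparsed": "com,apmonitor!/pds/index.php/Main/CleanseData s443",
}
print(all(indexable(rec, now).values()))  # every filter passes for this page
```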

Page Details

| Property | Value |
| --- | --- |
| URL | https://apmonitor.com/pds/index.php/Main/CleanseData |
| Last Crawled | 2026-04-05 00:30:48 (4 days ago) |
| First Indexed | 2021-02-12 00:58:33 (5 years ago) |
| HTTP Status Code | 200 |
| Meta Title | Data Cleansing |
| Meta Description | Measurements from sensors or from human input can contain bad data that negatively affects machine learning. This tutorial demonstrates how to identify and remove bad data. |
| Meta Canonical | null |
Boilerpipe Text
Measurements from sensors or from human input can contain bad data that negatively affects machine learning. This tutorial demonstrates how to identify and remove bad data. Data cleansing is the process of removing bad data that may include outliers, missing entries, failed sensors, or other types of missing or corrupted information. Bad data can be detected with summary statistics and data visualization. An effective way to remove bad data is with filters that segregate based on conditions that remove outliers, replace bad values such as NaN (Not a Number), or remove the data row that contains a NaN value.

Remove Bad Data with Numpy

NaN values are removed with numpy by identifying rows iz that contain NaN. Next, the rows are removed with z=z[~iz] where ~ is a bitwise not operator.

import numpy as np
z = np.array([[1, 2],
              [np.nan, 3],
              [4, np.nan],
              [5, 6]])
iz = np.any(np.isnan(z), axis=1)
print(~iz)
z = z[~iz]
print(z)

The result of the selection iz is True if NaN is found anywhere in the row or False if NaN is not found in that row. The ~ is the not operator to reverse True and False.

print(~iz)
> [ True False False True]

Only the rows with True are kept in the final z result.

print(z)
> [[1. 2.]
>  [5. 6.]]

Remove Bad Data with Pandas

Pandas manipulates data tables called DataFrames. There are many functions to efficiently manipulate data and remove bad data.

import numpy as np
import pandas as pd
z = pd.DataFrame({'x':[1,np.nan,4,5],'y':[2,3,np.nan,6]})
print(z)

This produces the same values as shown with the Numpy example.

   x    y
0  1.0  2.0
1  NaN  3.0
2  4.0  NaN
3  5.0  6.0

There are two common ways to deal with NaN values: drop the rows or fill in values. The first is to remove the rows with dropna. Use inplace=True to modify z directly and avoid the extra assignment result=z.dropna().

result = z.dropna()

   x    y
0  1.0  2.0
3  5.0  6.0

Replace Bad Data with Pandas

If data is very limited then it may be better to keep the row and fill in values with the methods interpolate for time-series data or fillna to replace NaN with a default value.

result = z.fillna(z.mean())

A common replacement is the mean value for each column with z.mean(). In this case, the mean (average) of column x is 3.333 and column y is 3.667.

          x         y
0  1.000000  2.000000
1  3.333333  3.000000
2  4.000000  3.666667
3  5.000000  6.000000

Filters with Pandas

Data visualization can help identify outliers, especially with box plots. Statistical information such as the standard deviation can also help to identify outliers, for example by eliminating data points that are more than 5 standard deviations away from the mean. A conditional filter can be created to eliminate bad values. The conditional statement z['y']<5.5 creates a logical array that is True where z['y']<5.5 and False where z['y']>=5.5.

result = z[z['y']<5.5]

This filter eliminates the last row of the DataFrame and the NaN in that column.

   x    y
0  1.0  2.0
1  NaN  3.0

Filters can be combined with the bitwise and operator & or the bitwise or operator |.

result = z[(z['y']<5.5) & (z['y']>=1.0)]

Another way to combine filters is to operate on the object with successive methods, such as z['x'].notnull() to eliminate NaN values in the x column, .fillna(0) to replace NaN with zero in the y column, and .reset_index(drop=True) to reset the DataFrame index.

result = z[z['x'].notnull()].fillna(0).reset_index(drop=True)

   x    y
0  1.0  2.0
1  4.0  0.0
2  5.0  6.0

Further Reading

Brownlee, J., How to Remove Outliers for Machine Learning, April 2018.

βœ… Knowledge Check

1. What is the primary purpose of data cleansing?

A. To add bad data into machine learning models. Incorrect. Data cleansing aims to remove bad data, not add it.

B. To identify and remove bad data. Correct. Data cleansing is about removing bad data such as outliers, missing entries, and corrupted information.

C. To improve the efficiency of data storage. Incorrect. While data cleansing can reduce storage needs by removing unnecessary data, its primary purpose is to improve data quality for analysis.

D. To make data visualization more attractive. Incorrect. While clean data can lead to clearer visualizations, the main goal of data cleansing is to improve data quality for analysis.

2. How can NaN values be removed from a Numpy array?

A. By using the dropna() method. Incorrect. The dropna() method is associated with Pandas, not Numpy.

B. By replacing NaN values with the mean of the array. Incorrect. While this method is used to replace NaN values, it doesn't remove them.

C. By using the command z=z[~iz] where iz is the selection of NaN rows. Correct. This method uses the bitwise not operator (~) and a selection to remove rows containing NaN values from a Numpy array.

D. By using the fillna() method with the parameter 0. Incorrect. The fillna() method is associated with Pandas, and it replaces NaN values rather than removing them.
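The tutorial's snippets above, assembled into one self-contained script. Variable names follow the page, except that the pandas DataFrame is renamed df here so it does not shadow the NumPy array z.

```python
import numpy as np
import pandas as pd

# NumPy: identify rows containing any NaN, then keep only the NaN-free rows
z = np.array([[1, 2], [np.nan, 3], [4, np.nan], [5, 6]])
iz = np.any(np.isnan(z), axis=1)  # True for rows with a NaN
z = z[~iz]                        # rows [1, 2] and [5, 6] remain

# pandas: the same table as a DataFrame (renamed df to keep z intact)
df = pd.DataFrame({'x': [1, np.nan, 4, 5], 'y': [2, 3, np.nan, 6]})
dropped = df.dropna()                     # drop rows containing NaN
filled = df.fillna(df.mean())             # replace NaN with column means
filtered = df[df['y'] < 5.5]              # conditional filter on column y
combined = df[(df['y'] < 5.5) & (df['y'] >= 1.0)]  # filters joined with &
chained = df[df['x'].notnull()].fillna(0).reset_index(drop=True)
print(chained)
```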
Markdown
# [Machine Learning for Engineers](https://apmonitor.com/pds/index.php)

# [Data Cleansing](https://apmonitor.com/pds/index.php/Main/CleanseData)

Measurements from sensors or from human input can contain bad data that negatively affects machine learning. This tutorial demonstrates how to identify and remove bad data.

![](https://apmonitor.com/pds/uploads/Main/analyze.png)

Data cleansing is the process of removing bad data that may include outliers, missing entries, failed sensors, or other types of missing or corrupted information.

![](https://apmonitor.com/pds/uploads/Main/python.png) [Data Cleansing Python Jupyter Notebook](https://github.com/APMonitor/pds/blob/main/Cleanse_Data.ipynb)

![](https://apmonitor.com/pds/uploads/Main/colab.png) [Jupyter Notebook in Google Colab](https://colab.research.google.com/github/APMonitor/pds/blob/main/Cleanse_Data.ipynb)

![](https://apmonitor.com/pds/uploads/Main/matlab.png) [Data Cleansing MATLAB Live Script](https://github.com/APMonitor/mds/blob/main/Cleanse.mlx)

Bad data can be detected with [summary statistics](https://apmonitor.com/pds/index.php/Main/StatisticsMath) and [data visualization](https://apmonitor.com/pds/index.php/Main/VisualizeData). An effective way to remove bad data is with filters that segregate based on conditions that remove outliers, replace bad values such as *NaN* (Not a Number), or remove the data row that contains a *NaN* value.
***

#### Remove Bad Data with Numpy

![](https://apmonitor.com/pds/uploads/Main/python_numpy.png)

NaN values are removed with numpy by identifying rows *iz* that contain NaN. Next, the rows are removed with *z=z[~iz]* where ~ is a bitwise not operator.

```
import numpy as np
z = np.array([[1, 2],
              [np.nan, 3],
              [4, np.nan],
              [5, 6]])
iz = np.any(np.isnan(z), axis=1)
print(~iz)
z = z[~iz]
print(z)
```

The result of the selection *iz* is *True* if *NaN* is found anywhere in the row or *False* if *NaN* is not found in that row. The ~ is the not operator to reverse *True* and *False*.

```
print(~iz)
> [ True False False True]
```

Only the rows with *True* are kept in the final *z* result.

```
print(z)
> [[1. 2.]
>  [5. 6.]]
```

***

#### Remove Bad Data with Pandas

![](https://apmonitor.com/pds/uploads/Main/python_pandas.png)

[Pandas](https://pandas.pydata.org/) manipulates data tables called DataFrames. There are many functions to efficiently manipulate data and remove bad data.

```
import numpy as np
import pandas as pd
z = pd.DataFrame({'x':[1,np.nan,4,5],'y':[2,3,np.nan,6]})
print(z)
```

This produces the same values as shown with the Numpy example.

```
   x    y
0  1.0  2.0
1  NaN  3.0
2  4.0  NaN
3  5.0  6.0
```

There are two common ways to deal with *NaN* values: drop the rows or fill in values. The first is to remove the rows with *dropna*. Use *inplace=True* to modify *z* directly and avoid the extra assignment *result=z.dropna()*.

```
result = z.dropna()
```

```
   x    y
0  1.0  2.0
3  5.0  6.0
```

#### Replace Bad Data with Pandas

If data is very limited then it may be better to keep the row and fill in values with the methods *interpolate* for time-series data or *fillna* to replace *NaN* with a default value.

```
result = z.fillna(z.mean())
```

A common replacement is the mean value for each column with *z.mean()*. In this case, the mean (average) of column *x* is 3.333 and column *y* is 3.667.

```
          x         y
0  1.000000  2.000000
1  3.333333  3.000000
2  4.000000  3.666667
3  5.000000  6.000000
```

#### Filters with Pandas

[Data visualization](https://apmonitor.com/pds/index.php/Main/VisualizeData) can help identify outliers, especially with box plots. Statistical information such as the standard deviation can also help to identify outliers, for example by eliminating data points that are more than 5 standard deviations away from the mean. A conditional filter can be created to eliminate bad values. The conditional statement *z['y']<5.5* creates a logical array that is *True* where *z['y']<5.5* and *False* where *z['y']>=5.5*.

```
result = z[z['y']<5.5]
```

This filter eliminates the last row of the DataFrame and the *NaN* in that column.

```
   x    y
0  1.0  2.0
1  NaN  3.0
```

Filters can be combined with the bitwise *and* operator *&* or the bitwise *or* operator *|*.

```
result = z[(z['y']<5.5) & (z['y']>=1.0)]
```

Another way to combine filters is to operate on the object with successive methods, such as *z['x'].notnull()* to eliminate *NaN* values in the *x* column, *.fillna(0)* to replace *NaN* with zero in the *y* column, and *.reset_index(drop=True)* to reset the DataFrame index.

```
result = z[z['x'].notnull()].fillna(0).reset_index(drop=True)
```

```
   x    y
0  1.0  2.0
1  4.0  0.0
2  5.0  6.0
```

#### Further Reading

- Brownlee, J., [How to Remove Outliers for Machine Learning](https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/), April 2018.

![](https://apmonitor.com/pds/uploads/Main/data_cleansing.png)

***

#### βœ… Knowledge Check

**1.** What is the primary purpose of data cleansing?

**A.** To add bad data into machine learning models. Incorrect. Data cleansing aims to remove bad data, not add it.

**B.** To identify and remove bad data. Correct. Data cleansing is about removing bad data such as outliers, missing entries, and corrupted information.

**C.** To improve the efficiency of data storage. Incorrect. While data cleansing can reduce storage needs by removing unnecessary data, its primary purpose is to improve data quality for analysis.

**D.** To make data visualization more attractive. Incorrect. While clean data can lead to clearer visualizations, the main goal of data cleansing is to improve data quality for analysis.

**2.** How can NaN values be removed from a Numpy array?

**A.** By using the dropna() method. Incorrect. The dropna() method is associated with Pandas, not Numpy.

**B.** By replacing NaN values with the mean of the array. Incorrect. While this method is used to replace NaN values, it doesn't remove them.
**C.** By using the command z=z[~iz] where iz is the selection of NaN rows. Correct. This method uses the bitwise not operator (~) and a selection to remove rows containing NaN values from a Numpy array.

**D.** By using the fillna() method with the parameter 0. Incorrect. The fillna() method is associated with Pandas, and it replaces NaN values rather than removing them.

Page last modified on November 22, 2023, at 12:20 am
| Property | Value |
| --- | --- |
| Shard | 37 (laksa) |
| Root Hash | 13934051142679362237 |
| Unparsed URL | com,apmonitor!/pds/index.php/Main/CleanseData s443 |
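The unparsed URL looks like a canonicalized sort key: reversed host labels joined by commas, "!" before the path, and a scheme/port suffix. A speculative decoder is sketched below; reading "s443" as HTTPS on port 443 is an assumption, since the tool's exact format is not documented on this page.

```python
from urllib.parse import urlsplit

# Guess at the inspector's "unparsed URL" key format (assumption, not
# documented): reversed host labels, '!' + path, then scheme/port suffix
# where a leading 's' marks HTTPS.
def unparsed_key(url: str) -> str:
    parts = urlsplit(url)
    host = ",".join(reversed(parts.hostname.split(".")))
    port = parts.port or (443 if parts.scheme == "https" else 80)
    prefix = "s" if parts.scheme == "https" else ""
    path = parts.path or "/"
    return f"{host}!{path} {prefix}{port}"


print(unparsed_key("https://apmonitor.com/pds/index.php/Main/CleanseData"))
# com,apmonitor!/pds/index.php/Main/CleanseData s443
```

For this page's URL the sketch reproduces the key shown above exactly, which lends some support to the guessed reading.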