ℹ️ Skipped - page is already crawled
| Filter | Status | Condition | Details |
|---|---|---|---|
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 0.1 months ago |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |
| Property | Value |
|---|---|
| URL | https://apmonitor.com/pds/index.php/Main/CleanseData |
| Last Crawled | 2026-04-05 00:30:48 (4 days ago) |
| First Indexed | 2021-02-12 00:58:33 (5 years ago) |
| HTTP Status Code | 200 |
| Meta Title | Data Cleansing |
| Meta Description | Measurements from sensors or from human input can contain bad data that negatively affects machine learning. This tutorial demonstrates how to identify and remove bad data. |
| Meta Canonical | null |
| Boilerpipe Text | Measurements from sensors or from human input can contain bad data that negatively affects machine learning. This tutorial demonstrates how to identify and remove bad data.
Data cleansing is the process of removing bad data that may include outliers, missing entries, failed sensors, or other types of missing or corrupted information.
Bad data can be detected with summary statistics and data visualization. An effective way to remove bad data is with filters that segregate based on conditions that remove outliers, replace bad values such as NaN (Not a Number), or remove the data row that contains an NaN value.
Remove Bad Data with Numpy
NaN values are removed with numpy by identifying rows iz that contain NaN. Next, the rows are removed with z=z[~iz], where ~ is the bitwise not operator.
import numpy as np
z = np.array([[1, 2],
              [np.nan, 3],
              [4, np.nan],
              [5, 6]])
iz = np.any(np.isnan(z), axis=1)
print(~iz)
z = z[~iz]
print(z)
The result of the selection iz is True if NaN is found anywhere in the row or False if NaN is not found in that row. The ~ is the not operator that reverses True and False.
print(~iz)
> [ True False False True]
Only the rows with True are kept in the final z result.
print(z)
> [[1. 2.]
> [5. 6.]]
Remove Bad Data with Pandas
Pandas manipulates data tables called DataFrames. There are many functions to efficiently manipulate data and remove bad data.
import numpy as np
import pandas as pd
z = pd.DataFrame({'x': [1, np.nan, 4, 5], 'y': [2, 3, np.nan, 6]})
print(z)
This produces the same values as shown with the Numpy example.
     x    y
0  1.0  2.0
1  NaN  3.0
2  4.0  NaN
3  5.0  6.0
There are two common ways to deal with NaN values: drop the rows or fill in values. The first is to remove the rows with dropna. Use z.dropna(inplace=True) to modify z directly and avoid the extra assignment result=z.dropna().
result = z.dropna()
     x    y
0  1.0  2.0
3  5.0  6.0
Replace Bad Data with Pandas
If data is very limited then it may be better to keep the row and fill in values with methods such as interpolate for time-series data or fillna to replace NaN with a default value.
result = z.fillna(z.mean())
A common replacement is the mean value for each column with z.mean(). In this case, the mean (average) of column x is 3.333 and column y is 3.667.
          x         y
0  1.000000  2.000000
1  3.333333  3.000000
2  4.000000  3.666667
3  5.000000  6.000000
Filters with Pandas
Data visualization can help identify outliers, especially with box plots. Statistical information such as standard deviation can also help to identify outliers, such as eliminating data points that are more than 5 standard deviations away from the mean. A conditional filter can be created to eliminate bad values. The conditional statement z['y']<5.5 creates a logical array of True for z['y']<5.5 and False for z['y']>=5.5.
result = z[z['y']<5.5]
This filter eliminates the last row of the DataFrame and the row with NaN in that column, because comparisons with NaN evaluate to False.
     x    y
0  1.0  2.0
1  NaN  3.0
Filters can be combined with the bitwise and operator & or the bitwise or operator |.
result = z[(z['y']<5.5) & (z['y']>=1.0)]
Another way to combine filters is to operate on the object with successive methods such as z['x'].notnull() to eliminate NaN values in the x column, .fillna(0) to replace the remaining NaN with zero in the y column, and .reset_index(drop=True) to reset the DataFrame index.
result = z[z['x'].notnull()].fillna(0).reset_index(drop=True)
     x    y
0  1.0  2.0
1  4.0  0.0
2  5.0  6.0
Further Reading
Brownlee, J., How to Remove Outliers for Machine Learning, April 2018.
✅ Knowledge Check
1. What is the primary purpose of data cleansing?
A. To add bad data into machine learning models.
Incorrect. Data cleansing aims to remove bad data, not add it.
B. To identify and remove bad data.
Correct. Data cleansing is about removing bad data such as outliers, missing entries, and corrupted information.
C. To improve the efficiency of data storage.
Incorrect. While data cleansing can reduce storage needs by removing unnecessary data, its primary purpose is to improve data quality for analysis.
D. To make data visualization more attractive.
Incorrect. While clean data can lead to clearer visualizations, the main goal of data cleansing is to improve data quality for analysis.
2. How can NaN values be removed from a Numpy array?
A. By using the dropna() method.
Incorrect. The dropna() method is associated with Pandas, not Numpy.
B. By replacing NaN values with the mean of the array.
Incorrect. While this method is used to replace NaN values, it doesn't remove them.
C. By using the command: z=z[~iz] where iz is the selection of NaN rows.
Correct. This method uses the bitwise not operator (~) and a selection to remove rows containing NaN values from a Numpy array.
D. By using the fillna() method with the parameter 0.
Incorrect. The fillna() method is associated with Pandas, and it replaces NaN values rather than removing them. |
| Markdown | # [Machine Learning for Engineers](https://apmonitor.com/pds/index.php)
# [Data Cleansing](https://apmonitor.com/pds/index.php/Main/CleanseData)
Measurements from sensors or from human input can contain bad data that negatively affects machine learning. This tutorial demonstrates how to identify and remove bad data.

Data cleansing is the process of removing bad data that may include outliers, missing entries, failed sensors, or other types of missing or corrupted information.
 [Data Cleansing Python Jupyter Notebook](https://github.com/APMonitor/pds/blob/main/Cleanse_Data.ipynb)
 [Jupyter Notebook in Google Colab](https://colab.research.google.com/github/APMonitor/pds/blob/main/Cleanse_Data.ipynb)
 [Data Cleansing MATLAB Live Script](https://github.com/APMonitor/mds/blob/main/Cleanse.mlx)
Bad data can be detected with [summary statistics](https://apmonitor.com/pds/index.php/Main/StatisticsMath) and [data visualization](https://apmonitor.com/pds/index.php/Main/VisualizeData). An effective way to remove bad data is with filters that segregate based on conditions that remove outliers, replace bad values such as *NaN* (Not a Number), or remove the data row that contains an *NaN* value.
***
#### Remove Bad Data with Numpy

NaN values are removed with numpy by identifying rows *iz* that contain NaN. Next, the rows are removed with *z=z\[~iz\]*, where ~ is the bitwise not operator.
```
import numpy as np
z = np.array([[1, 2],
              [np.nan, 3],
              [4, np.nan],
              [5, 6]])
iz = np.any(np.isnan(z), axis=1)
print(~iz)
z = z[~iz]
print(z)
```
The result of the selection *iz* is *True* if *NaN* is found anywhere in the row or *False* if *NaN* is not found in that row. The ~ is the not operator that reverses *True* and *False*.
```
print(~iz)
> [ True False False True]
```
Only the rows with *True* are kept in the final *z* result.
```
print(z)
> [[1. 2.]
> [5. 6.]]
```
***
#### Remove Bad Data with Pandas

[Pandas](https://pandas.pydata.org/) manipulates data tables called DataFrames. There are many functions to efficiently manipulate data and remove bad data.
```
import numpy as np
import pandas as pd
z = pd.DataFrame({'x': [1, np.nan, 4, 5], 'y': [2, 3, np.nan, 6]})
print(z)
```
This produces the same values as shown with the Numpy example.
```
x y
0 1.0 2.0
1 NaN 3.0
2 4.0 NaN
3 5.0 6.0
```
There are two common ways to deal with *NaN* values: drop the rows or fill in values. The first is to remove the rows with *dropna*. Use *z.dropna(inplace=True)* to modify *z* directly and avoid the extra assignment *result=z.dropna()*.
```
result = z.dropna()
```
```
x y
0 1.0 2.0
3 5.0 6.0
```
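The in-place variant mentioned above can be sketched as follows (an illustrative example reusing the same small DataFrame):

```python
import numpy as np
import pandas as pd

z = pd.DataFrame({'x': [1, np.nan, 4, 5], 'y': [2, 3, np.nan, 6]})
z.dropna(inplace=True)  # modifies z directly and returns None
print(z.shape)  # (2, 2) after the two NaN rows are dropped
```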
#### Replace Bad Data with Pandas
If data is very limited then it may be better to keep the row and fill in values with methods such as *interpolate* for time-series data or *fillna* to replace *NaN* with a default value.
```
result = z.fillna(z.mean())
```
A common replacement is the mean value for each column with *z.mean()*. In this case, the mean (average) of column *x* is 3.333 and column *y* is 3.667.
```
x y
0 1.000000 2.000000
1 3.333333 3.000000
2 4.000000 3.666667
3 5.000000 6.000000
```
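For the time-series case, *interpolate* fills each *NaN* from neighboring values in the column instead of the column mean. A minimal sketch with the same DataFrame (linear interpolation is the pandas default):

```python
import numpy as np
import pandas as pd

z = pd.DataFrame({'x': [1, np.nan, 4, 5], 'y': [2, 3, np.nan, 6]})
# each NaN is replaced by the value halfway between its neighbors
result = z.interpolate()
print(result)  # x[1] becomes 2.5, y[2] becomes 4.5
```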
#### Filters with Pandas
[Data visualization](https://apmonitor.com/pds/index.php/Main/VisualizeData) can help identify outliers, especially with box plots. Statistical information such as standard deviation can also help to identify outliers, such as eliminating data points that are more than 5 standard deviations away from the mean. A conditional filter can be created to eliminate bad values. The conditional statement *z\['y'\]\<5.5* creates a logical array of *True* for *z\['y'\]\<5.5* and *False* for *z\['y'\]\>=5.5*.
```
result = z[z['y']<5.5]
```
This filter eliminates the last row of the DataFrame and the row with *NaN* in that column, because comparisons with *NaN* evaluate to *False*.
```
x y
0 1.0 2.0
1 NaN 3.0
```
Filters can be combined with the bitwise *and* operator *&* or the bitwise *or* operator *\|*.
```
result = z[(z['y']<5.5) & (z['y']>=1.0)]
```
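The *\|* operator works the same way. An illustrative sketch (the threshold values here are arbitrary, chosen only to demonstrate the combination):

```python
import numpy as np
import pandas as pd

z = pd.DataFrame({'x': [1, np.nan, 4, 5], 'y': [2, 3, np.nan, 6]})
# keep rows where y is below 2.5 OR above 5.5
result = z[(z['y'] < 2.5) | (z['y'] > 5.5)]
print(result)  # rows 0 and 3
```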
Another way to combine filters is to operate on the object with successive methods such as *z\['x'\].notnull()* to eliminate *NaN* values in the *x* column, *.fillna(0)* to replace the remaining *NaN* with zero in the *y* column, and *.reset\_index(drop=True)* to reset the DataFrame index.
```
result = z[z['x'].notnull()].fillna(0).reset_index(drop=True)
```
```
x y
0 1.0 2.0
1 4.0 0.0
2 5.0 6.0
```
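The standard-deviation rule mentioned earlier can also be written as a filter. A sketch with hypothetical data; a 3-standard-deviation cutoff is used here (rather than the 5 mentioned above) so the single outlier in this small sample is actually flagged:

```python
import pandas as pd

# hypothetical sensor readings: 19 typical values and one outlier
data = pd.DataFrame({'y': [2.0] * 19 + [50.0]})

# keep rows within 3 standard deviations of the column mean
dev = (data['y'] - data['y'].mean()).abs()
result = data[dev <= 3 * data['y'].std()]
print(len(result))  # 19: the outlier row is removed
```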
#### Further Reading
- Brownlee, J., [How to Remove Outliers for Machine Learning](https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/), April 2018.

***
#### ✅ Knowledge Check
**1\.** What is the primary purpose of data cleansing?
**A.** To add bad data into machine learning models.
Incorrect. Data cleansing aims to remove bad data, not add it.
**B.** To identify and remove bad data.
Correct. Data cleansing is about removing bad data such as outliers, missing entries, and corrupted information.
**C.** To improve the efficiency of data storage.
Incorrect. While data cleansing can reduce storage needs by removing unnecessary data, its primary purpose is to improve data quality for analysis.
**D.** To make data visualization more attractive.
Incorrect. While clean data can lead to clearer visualizations, the main goal of data cleansing is to improve data quality for analysis.
**2\.** How can NaN values be removed from a Numpy array?
**A.** By using the dropna() method.
Incorrect. The dropna() method is associated with Pandas, not Numpy.
**B.** By replacing NaN values with the mean of the array.
Incorrect. While this method is used to replace NaN values, it doesn't remove them.
**C.** By using the command: z=z\[~iz\] where iz is the selection of NaN rows.
Correct. This method uses bitwise not operator (~) and a selection to remove rows containing NaN values from a Numpy array.
**D.** By using the fillna() method with the parameter 0.
Incorrect. The fillna() method is associated with Pandas, and it replaces NaN values rather than removing them.
Page last modified on November 22, 2023, at 12:20 am |
| Shard | 37 (laksa) |
| Root Hash | 13934051142679362237 |
| Unparsed URL | com,apmonitor!/pds/index.php/Main/CleanseData s443 |