How to Identify the Outliers in Your Data?

Gautam Kumar
3 min read · Nov 30, 2020

Outliers: In statistics, outliers are data points that do not belong to a certain population. An outlier is an abnormal observation that lies far away from the other data values.

Detection techniques are as follows:
1. Using standard deviation
2. Using box plots
3. Using violin plots
4. Using scatter plots

Standard Deviation: Any data point that lies more than three standard deviations away from the mean is very likely to be an outlier. In general, if the data distribution is approximately normal, about 68% of the data values lie within one standard deviation of the mean, about 95% lie within two standard deviations, and about 99.7% lie within three standard deviations. A point beyond three standard deviations is therefore expected only about 0.3% of the time.
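The three-standard-deviation rule can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the original article: the function name `zscore_outliers`, the `threshold` parameter, and the synthetic sample are all assumptions made for the example.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return a boolean mask that is True where a value lies more than
    `threshold` standard deviations away from the mean."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        # All values are identical, so nothing can be flagged.
        return np.zeros(values.shape, dtype=bool)
    z_scores = (values - values.mean()) / std
    return np.abs(z_scores) > threshold

# Synthetic data (an assumption for illustration): 200 roughly normal
# values centered on 50, plus one extreme value appended at the end.
rng = np.random.default_rng(0)
sample = np.append(rng.normal(loc=50, scale=5, size=200), [120.0])

mask = zscore_outliers(sample)
print(sample[mask])  # the flagged values
```

Keep in mind that the mean and standard deviation are themselves pulled by extreme values, so on small samples a large outlier can inflate the standard deviation enough to hide itself.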

Boxplots: A box plot is a graphical depiction of numerical data through its quartiles, with the lower and upper whiskers marking the boundaries of the data distribution. Any data points that fall above the upper whisker or below the lower whisker can be considered outliers.
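The whisker check can also be done numerically. The sketch below assumes the common convention of placing the whiskers 1.5 times the interquartile range (IQR) beyond the first and third quartiles; the article does not state which whisker rule it uses, so the factor `k=1.5`, the function name `iqr_outliers`, and the sample data are assumptions.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return a boolean mask that is True where a value falls outside
    [Q1 - k * IQR, Q3 + k * IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Small hand-made sample (an assumption for illustration).
data = [12, 13, 14, 15, 15, 16, 17, 18, 19, 45]
mask = iqr_outliers(data)
print(np.asarray(data)[mask])  # only 45 lies beyond the upper fence of 23.0
```

Because quartiles are barely affected by a few extreme values, this rule tends to be more robust than the standard deviation rule on skewed or small samples.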

Violin Plots: Violin plots are similar to box plots, except that they also show the probability density of the data at different values, usually smoothed by a kernel density estimator. Typically, a violin plot includes everything that is in a box plot: a marker for the median of the data, a box or marker indicating the interquartile range, and possibly all sample points if the number of samples is not too high.
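To compare the two views, a box plot and a violin plot of the same data can be drawn side by side with Matplotlib. This is a sketch under stated assumptions: the helper name `plot_box_and_violin`, the synthetic data, and the output filename in the comment are hypothetical, and the `Agg` backend is chosen only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt
import numpy as np

def plot_box_and_violin(values):
    """Draw a box plot and a violin plot of the same values side by side."""
    fig, (ax_box, ax_violin) = plt.subplots(1, 2, figsize=(8, 4), sharey=True)
    ax_box.boxplot(values)
    ax_box.set_title("Box plot")
    # showmedians=True adds the median marker to the density shape.
    parts = ax_violin.violinplot(values, showmedians=True)
    ax_violin.set_title("Violin plot")
    return fig, parts

# Synthetic data (an assumption): a normal bulk plus two extreme values.
rng = np.random.default_rng(1)
values = np.append(rng.normal(loc=50, scale=5, size=200), [95.0, 100.0])

fig, parts = plot_box_and_violin(values)
# fig.savefig("box_vs_violin.png")  # hypothetical output path
```

In the resulting figure, the extreme values show up as isolated points beyond the whiskers in the box plot and as a long, thin tail in the violin plot.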

Scatter Plots: A scatter plot is a type of plot or mathematical diagram that uses Cartesian coordinates to display values for, typically, two variables in a data set. The data are displayed as a collection of points, where the value of one variable determines the position on the horizontal axis and the value of the other variable determines the position on the vertical axis. Points that are very far from the general spread of the data and have very few neighbors are considered outliers.
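The "very few neighbors" idea is usually judged by eye, but it can be approximated numerically by counting how many other points fall within a fixed distance of each point. The sketch below is one possible interpretation, not the article's method: the function name `sparse_neighbor_outliers` and the `radius` and `min_neighbors` parameters are hypothetical and would need tuning for real data.

```python
import numpy as np

def sparse_neighbor_outliers(points, radius, min_neighbors=2):
    """Return a boolean mask that is True for points with fewer than
    `min_neighbors` other points within Euclidean distance `radius`."""
    points = np.asarray(points, dtype=float)
    # Pairwise differences via broadcasting: shape (n, n, n_dims).
    diffs = points[:, None, :] - points[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=-1))
    # Subtract 1 so a point does not count itself as a neighbor.
    neighbor_counts = (distances <= radius).sum(axis=1) - 1
    return neighbor_counts < min_neighbors

# Four clustered (x, y) points and one isolated point (illustrative data).
points = [(1.0, 1.1), (1.2, 0.9), (0.9, 1.0), (1.1, 1.2), (8.0, 9.0)]
mask = sparse_neighbor_outliers(points, radius=1.0)
print(mask)  # only the isolated point (8.0, 9.0) is flagged
```

The pairwise distance matrix grows with the square of the number of points, so this brute-force version only suits small data sets.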

The following are approaches to handling outliers:
1. Drop the outlier records.

2. Assign a new value: If an outlier seems to be due to a mistake in your data, you can try imputing a replacement value.

3. Analyze them separately: If the outliers make up only a small percentage of the data but are still numerous in absolute terms, dropping them might cause a loss of insight. In that case, group them and run the analysis on them separately.
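The three approaches can be sketched as follows. This example assumes an outlier mask has already been computed by one of the detection techniques; the mask is hard-coded here, and using the median of the remaining values as the imputed value is an assumption, since the article does not say which replacement value to use.

```python
import numpy as np

values = np.array([12, 13, 14, 15, 15, 16, 17, 18, 19, 45], dtype=float)
# Hypothetical mask from a detection step: only the last value is flagged.
is_outlier = np.array([False] * 9 + [True])

# 1. Drop the outlier records.
kept = values[~is_outlier]

# 2. Assign a new value: replace each outlier with the median of the
#    non-outlier values (one possible imputation choice among many).
imputed = values.copy()
imputed[is_outlier] = np.median(kept)

# 3. Group the outliers and analyze them separately from the rest.
typical_group = values[~is_outlier]
outlier_group = values[is_outlier]

print(kept.size, imputed[-1], outlier_group)
```

Which option fits depends on why the outliers exist: data-entry mistakes are candidates for dropping or imputing, while genuine extreme observations are often worth a separate analysis.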
