Feature Scaling in Machine Learning using Python

      In this tutorial we are going to learn how to scale the independent variables (features) of a data set using Python. In data processing, this step is also called data normalization.

What is the use of feature scaling in Machine Learning? 

    The range of values in raw data can vary widely, and in some machine learning algorithms the objective function will not work properly without normalization. For example, many classifiers, such as k-nearest neighbors, measure the distance between two points with the Euclidean distance. If one feature has a much broader range of values than the others, that feature will dominate the distance. The features should therefore be normalized so that each one contributes approximately proportionately to the final distance.

Another reason why feature scaling is applied is that gradient descent converges much faster with feature scaling than without it.
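
To make the distance argument concrete, here is a minimal sketch. The three rows (age in years, income in dollars) are made-up values chosen only for illustration; they are not from any real data set.

```python
import numpy as np

# Hypothetical data: column 0 is age in years, column 1 is income in dollars.
X = np.array([
    [25.0, 50_000.0],
    [60.0, 51_000.0],
    [26.0, 58_000.0],
])

def euclidean(p, q):
    """Euclidean distance between two points."""
    return float(np.sqrt(np.sum((p - q) ** 2)))

# Raw units: the income column dominates, so the 35-year age gap
# between rows 0 and 1 barely affects the result.
raw_d01 = euclidean(X[0], X[1])
raw_d02 = euclidean(X[0], X[2])
print("raw:   ", raw_d01, raw_d02)

# Min-max scale each column to [0, 1], then measure again.
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
scaled_d01 = euclidean(X_scaled[0], X_scaled[1])
scaled_d02 = euclidean(X_scaled[0], X_scaled[2])
print("scaled:", scaled_d01, scaled_d02)
```

Before scaling, row 1 looks much closer to row 0 than row 2 does, purely because of income. After scaling, both features contribute on a comparable scale and the two distances are nearly equal.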

Feature scaling can be done in many ways. In this tutorial we look at Min-Max Normalization and Mean Normalization.

Min-Max Normalization:

    This is the simplest method: it rescales each feature so that its values fall in the range [0, 1] or [−1, 1]. The choice of target range depends on the nature of the data. The general formula for the range [0, 1] is,

    x' = (x - min(x)) / (max(x) - min(x))
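
As a sketch of how the target range can be chosen, the helper below rescales a feature to any interval. The function name min_max_scale and its parameters new_min and new_max are names made up for this example; they do not come from a library.

```python
import numpy as np

def min_max_scale(x, new_min=0.0, new_max=1.0):
    """Rescale a 1-D array linearly so its values span [new_min, new_max]."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        # A constant feature would cause a division by zero.
        raise ValueError("cannot min-max scale a constant feature")
    unit = (x - x.min()) / span          # values now lie in [0, 1]
    return new_min + unit * (new_max - new_min)

values = [10, 20, 30, 50]
print(min_max_scale(values))             # scaled to [0, 1]
print(min_max_scale(values, -1, 1))      # scaled to [-1, 1]
```

The smallest input always maps to new_min and the largest to new_max, so one extreme outlier can squeeze all the other values into a narrow band.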

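Mean Normalization:

   This method rescales each feature using its mean. The form used in this tutorial subtracts the mean and divides by the standard deviation, which is also known as standardization or the z-score. The general formula is,

   x' = (x - mean(x)) / std(x)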
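
The name "mean normalization" is used for more than one formula, so here is a small sketch comparing two common variants on a made-up array. Which variant a given text means is an assumption you should check; the code in this tutorial divides by the standard deviation.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

# Variant 1: standardization (z-score), the form used in this tutorial.
# ddof=1 gives the sample standard deviation, the pandas default.
z = (x - x.mean()) / x.std(ddof=1)

# Variant 2: subtract the mean, then divide by the range (max - min).
m = (x - x.mean()) / (x.max() - x.min())

print(z)
print(m)
```

Both results are centered on zero. The first has unit standard deviation, while the second always stays within [-1, 1].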


Now let us apply these methods in practice. If you already have a data set, skip the next section on preparing sample data.

Preparing the data set:

    Here we create a sample data set with four columns, each holding random numeric values. Normalization cannot be applied to text data types. We can generate the random numbers with numpy; np.random.randn draws them from a standard normal distribution. With numpy imported as np and pandas as pd, the line that builds the data frame is given below,

df = pd.DataFrame(np.random.randn(1000, 4), columns = ['a', 'b', 'c', 'd'])

Calling df.head() shows the first few rows of the sample data.

Now we have the data in hand. What remains is to apply feature scaling to it. We will cover two scenarios:
  1. Feature scaling / normalizing a single column
  2. Feature scaling / normalizing the entire data frame

Normalizing a single column:

Here we pick one particular column of the data set and normalize it. Our example has four features: a, b, c and d. These column names are only placeholders; replace them with the column names from your own data set.

Min-Max Normalization for a single column:

The sample code is given below,

df["a_min"]=((df["a"]-df["a"].min())/(df["a"].max()-df["a"].min()))

Mean Normalization for a single column:

The sample code is given below,

df['a_mean_method']=(df['a']-df['a'].mean())/df['a'].std()

Normalizing the entire data frame:

  This is pretty straightforward, but be careful with text columns and None values: the arithmetic works only on numeric columns. If a column contains None values, fill them with 0 first.
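
A minimal sketch of that preparation step follows. The frame raw and its columns are made up for illustration; select_dtypes and fillna are standard pandas methods.

```python
import pandas as pd

# Hypothetical frame with a text column and a missing value.
raw = pd.DataFrame({
    "a": [1.0, 2.0, None, 4.0],
    "b": [10.0, 20.0, 30.0, 40.0],
    "label": ["w", "x", "y", "z"],
})

# Keep only the numeric columns, then fill missing values with 0.
numeric = raw.select_dtypes(include="number").fillna(0)

# Min-max normalization now works on every remaining column.
normalized = (numeric - numeric.min()) / (numeric.max() - numeric.min())
print(normalized)
```

Note that filling with 0 can change a column's minimum or maximum, as it does for column a here. Filling with the column mean or dropping those rows are alternatives worth considering for your own data.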

Min-Max Normalization:

The sample code is given below,

normalized_df=(df-df.min())/(df.max()-df.min())

We can also do this with the sklearn library, using its MinMaxScaler class.
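
A short sketch of the sklearn approach, assuming scikit-learn is installed. The data frame is rebuilt here so the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame(np.random.randn(1000, 4), columns=["a", "b", "c", "d"])

scaler = MinMaxScaler()                    # default feature_range is (0, 1)
scaled_array = scaler.fit_transform(df)    # returns a NumPy array

# Rebuild a DataFrame, keeping the original column names and index.
df_normalized = pd.DataFrame(scaled_array, columns=df.columns, index=df.index)
print(df_normalized.head())
```

Keeping the fitted scaler object is useful because its transform method can later apply the same minimum and maximum to new rows.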

Mean Normalization:

The sample code is given below,

normalized_df=(df-df.mean())/df.std()
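
As a quick sanity check, the sketch below confirms that every column ends up with mean 0 and standard deviation 1. It also shows the ddof argument: pandas computes the sample standard deviation (ddof=1) by default, while sklearn's StandardScaler uses the population standard deviation (ddof=0), so the two give slightly different numbers.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 4), columns=["a", "b", "c", "d"])

# pandas default: sample standard deviation (ddof=1)
normalized_df = (df - df.mean()) / df.std()

# Population standard deviation (ddof=0)
population_df = (df - df.mean()) / df.std(ddof=0)

print(normalized_df.mean().round(6))   # approximately 0 for every column
print(normalized_df.std().round(6))    # 1 for every column
```

With 1000 rows the difference between the two variants is tiny, but it becomes noticeable on small data sets.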

The full sample source code is given below,

# coding: utf-8

import numpy as np
import pandas as pd
from sklearn import preprocessing

# Create the sample data set: 1000 rows, 4 columns of standard normal values
df = pd.DataFrame(np.random.randn(1000, 4), columns=['a', 'b', 'c', 'd'])
print(df.head(5))

# Min-Max normalization for a single column using our own code
df["a_min"] = (df["a"] - df["a"].min()) / (df["a"].max() - df["a"].min())
print(df.head())

# Mean normalization for a single column
df['a_mean_method'] = (df['a'] - df['a'].mean()) / df['a'].std()
print(df.head())

# Min-Max normalization for the entire data frame
normalized_df = (df - df.min()) / (df.max() - df.min())

# Min-Max normalization for the entire data frame using the sklearn library
min_max_scaler = preprocessing.MinMaxScaler()
scaled_array = min_max_scaler.fit_transform(df)
# fit_transform returns a NumPy array, so restore the column names and index
df_normalized = pd.DataFrame(scaled_array, columns=df.columns, index=df.index)
print(df_normalized.head(4))

Hope you guys understood it better. Drop your comments and suggestions below.

                                    Thanks for reading. :) 
