Top 10 Pandas techniques used in Data Science
The ten cool techniques that will always come in handy!
With so much data📚 gathered in the corporate sector, we need methods of painting🎨🖌 a picture of that information to comprehend it. Data analysis📊 enables us to understand the data and identify trends and patterns📈. We can achieve all of this with the techniques used in Data Science, thanks to Python's Pandas library! 🌟
Pandas is a data manipulation and analysis library written in Python. It has a plethora of functions and methods to help speed up⚡ the data analysis process. Pandas' popularity stems from its usefulness, adaptability, and straightforward syntax🤩.
Over the years, Pandas has been the industry standard🏭 for data reading and processing. This blog includes the top ten Pandas techniques that will be more than useful in your data science projects.💯
1. Conditional Selection of Rows
Exploratory Data Analysis is done to summarise the key🗝 characteristics of a data set📋 and to understand it better. It also helps us rapidly evaluate the data and experiment with different factors to see how they affect the results. One of the most common operations is the conditional selection of rows, i.e., data filtering. There are two methods to do this.
📌 The examples shown below utilize a data set that specifies different properties of cars🚘. Take a look at the data set:
import pandas as pd
path = "C:/Users/aakua/OneDrive/Desktop/Toyota.csv"
df = pd.read_csv(path,na_values=["??","###","????"], index_col=0)
df.head()
Output:
a. loc[ ]
The Pandas DataFrame.loc[ ] property allows you to access a set of rows and columns in the supplied data set by label(s) or a boolean array.
📌 For this example, we will filter the rows with Age <= 25 and FuelType as Petrol.
data = df.loc[(df['Age'] <= 25) & (df['FuelType'] == 'Petrol')]
data[:8]
Output:
b. query( )
The query( ) method is used to query the columns of a DataFrame with a boolean expression.🌟The only significant difference between query( ) and a standard conditional mask is that query( ) takes its condition as a string, whereas a conditional mask filters the data with a boolean array (an equivalent mask is sketched after the example below).
📌 Here, we have filtered the rows that have KM values between 15000 and 30000.
data_q = df.query('15000.0 < KM < 30000.0')
data_q[:8]
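For comparison, here is a minimal sketch of the same filter written as a conditional mask; it should return exactly the rows shown above:
# Equivalent conditional mask: combine two boolean Series with &
data_m = df[(df['KM'] > 15000.0) & (df['KM'] < 30000.0)]
data_m[:8]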
2. Sorting
Pandas provides two methods for sorting a data frame: sort_values( ) and sort_index( ). 🚩The sorting order can be controlled with the ascending parameter (a short sketch follows the first example below).
a. sort_values( ):
This function is used to sort one or more columns of a Pandas DataFrame. Here, we have sorted the data frame in ascending order according to the Price column.
df.sort_values(by = 'Price')[:8]
Output:
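As promised above, controlling the order is just a matter of the ascending parameter; a minimal sketch of a descending sort:
# Sort by Price in descending order instead (highest price first)
df.sort_values(by='Price', ascending=False)[:8]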
b. sort_index( ):
This is used for sorting a Pandas DataFrame by its index: the default axis=0 sorts by the row index, while axis=1 sorts the columns by their labels.
df.sort_index(axis=1, ascending = True)[:8]
Output:
3. GroupBy
A GroupBy operation involves some combination of splitting the object, applying a function, and combining the results. It may be used to group large amounts of data and compute operations on the groups.
📌 A. Let us say we want to look at the average price of the cars for each FuelType, such as CNG, Petrol, and Diesel. Take a moment and think about the problem🤔.
The GroupBy function can quickly implement this! We will first split the data according to FuelType, and then apply the mean( ) function to the Price column.
df.groupby(['FuelType'])[['Price']].mean()
Output:
📌 B. Now, we will group the data as before, but we will also compute the mean of the Age of the cars.
df.groupby(['FuelType'])[['Price', 'Age']].mean()
Output:
📌 C. What if you want to group the data according to more than one column?🧐 Here’s the solution:
df.groupby(['FuelType','Automatic'])[['Price', 'Age']].mean()
Output:
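If you want more than one statistic per group, agg( ) can compute several at once; a minimal sketch on the same grouping:
# Mean and maximum Price for each FuelType in one call
df.groupby(['FuelType'])[['Price']].agg(['mean', 'max'])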
See, that was easy!😉 Let us explore the next one now.
4. Mapping
Another essential data manipulation technique is data mapping. As a mathematical concept, mapping is the act of producing a new collection of values from an existing set of values, generally one element at a time. 🌟The biggest advantage of this technique is that it can be applied to an entire column at once.📚
The Pandas library offers the map( ) method to handle Series data. Pandas map( ) is used to map each value in a Series to another value based on an input correspondence, which may be a Series, a dictionary, or a function.
📌 Let us map the FuelType variable to a Vehicle type. For instance, CNG will be mapped to Hatchback, Petrol to Sedan, and Diesel to Van.
# dictionary to map FuelType with Vehicle
map_fueltype_to_vehicle = {'CNG': 'Hatchback',
                           'Petrol': 'Sedan',
                           'Diesel': 'Van'}
df['Vehicle'] = df['FuelType'].map(map_fueltype_to_vehicle)
df[45:55]
Output:
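Besides a dictionary, map( ) also accepts a function; a minimal sketch (na_action='ignore' leaves the NaN entries untouched):
# Apply a function to every FuelType value, skipping NaNs
df['FuelType'].map(lambda x: x.lower(), na_action='ignore')[:8]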
5. nsmallest( ) and nlargest( )
I think that, after reading their names, there is no doubt about the goals🎯 of these two methods; nonetheless, here are the definitions:
a. nsmallest( ): The Pandas nsmallest( ) method is used to obtain the n smallest👇 values in a data set or series.
b. nlargest( ): The Pandas nlargest( ) function obtains the n largest👆 values in a data set or series.
📌 Let us examine how the five observations with the smallest and the largest values in the Price column can be found:
df.nsmallest(5, "Price")
Output:
df.nlargest(5, "Price")
Output:
6. idxmax( ) and idxmin( )
a. idxmax( ): Pandas DataFrame.idxmax( ) gives the index of the first occurrence of the maximum value along the specified axis. All NA/null values are omitted❌ when determining the index of the greatest value.
b. idxmin( ): Pandas DataFrame.idxmin( ) gives the index of the first occurrence of the minimum value along the specified axis. All NA/null values are omitted❌ when determining the index of the least value.
📌 Let us say we want to access the indices of the maximum and minimum values in the Price column for each FuelType. This can be done in the following manner:
df.groupby(['FuelType'])[['Price']].idxmin()
Output:
df.groupby(['FuelType'])[['Price']].idxmax()
Output:
7. Exporting Data
While performing different functions on the data set, as explained above, grouping data becomes helpful. However, what if we want to save the grouped data as a new file in our system? 🤔🧐
Pandas offers functions that come in handy in such situations🌟: a DataFrame can be saved as a file with various extensions like .xlsx, .csv, etc.
data_q = df.query('15000.0 < KM < 30000.0')
data_q.to_csv('Query_Data.csv')
Output: File Saved
📌 To save the data frame in Excel use: to_excel( )
📌 To save the data frame in JSON use: to_json( )
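A minimal sketch of both (note that to_excel( ) needs an Excel engine such as openpyxl installed):
# Save the queried subset in two other formats
data_q.to_excel('Query_Data.xlsx')
data_q.to_json('Query_Data.json')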
8. dropna( ) and fillna( )
A data set sometimes includes null values✖, which then show up in the DataFrame as NaN.
a. dropna( ): The dropna( ) method of a Pandas DataFrame is used to eliminate❌ rows or columns containing Null/NaN values.
df.shape
Output: (1436, 9)
data1 = df.dropna()
data1.shape
Output: (1097, 10)
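If you only want to drop rows that are missing a particular column, a short sketch using the subset parameter:
# Drop only the rows where KM is NaN; NaNs in other columns survive
data2 = df.dropna(subset=['KM'])
data2.shape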
b. fillna( ): fillna( ), unlike the dropna( ) method, does not remove❌ rows with Null values; instead, it lets the user substitute the NaN values with values of their own.
📌 Before substituting the null values, the data set looked like this:
df.head(10)
Output:
📌 I have substituted the null values with 0. After using the fillna( ) method:
df = df.fillna(0)
df.head(10)
Output:
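Instead of a blanket 0, per-column fills are common; a minimal sketch (run before the blanket fill above, using the Age and FuelType columns seen earlier):
# Replace missing Age values with the column mean,
# and missing FuelType values with the most frequent fuel type
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['FuelType'] = df['FuelType'].fillna(df['FuelType'].mode()[0])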
9. Correlation
Pandas DataFrame.corr( ) returns the pairwise correlation of all columns in a DataFrame. Any NA values are automatically excluded, and non-numeric columns are ignored.
df.corr()
Output:
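Note that in pandas 2.0 and later, non-numeric columns are no longer dropped silently and will raise an error; a minimal sketch of the explicit form:
# Restrict the correlation matrix to numeric columns
df.corr(numeric_only=True)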
🚩Some key points to note about correlation values:
📍 Correlation values range from -1 to 1.
📍 1 implies a perfect positive correlation: every time a value in the first column increases, the corresponding value in the second column increases as well.
📍 0.9 indicates a strong connection: if one value increases, the other is very likely to increase too.
📍 -0.9 is just as strong a relationship as 0.9, but in the opposite direction: if one value increases, the other decreases.
📍 0.2 does NOT imply a good correlation; one value going up does not mean the other goes up too.
📍 I believe it is reasonable to claim that two columns have a decent correlation if the value lies below -0.6 or above 0.6.
10. Apply a function to the DataFrame
The apply( ) function is used to run arithmetic or logical code over an entire data set or Series using a Python function. The function to be applied can be a built-in function or a user-defined function.
a. Applying an inbuilt function:
import numpy as np
df['Price'].apply(np.sqrt)[:8]
Output:
b. Applying a user-defined function:
def fun(num):
    if num < 10000:
        return "Low"
    elif num >= 10000 and num < 25000:
        return "Normal"
    else:
        return "High"
new = df['Price'].apply(fun)
new[105:115]
Output:
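For short conditions like this, the same labelling can be sketched as a one-line lambda:
# One-liner equivalent of fun( ) using a lambda
df['Price'].apply(lambda num: "Low" if num < 10000 else ("Normal" if num < 25000 else "High"))[105:115]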
Knowing these Pandas techniques will surely give you an advantage in the field of data science🤩. I hope you liked this blog and found it useful!🌟