Pandas is a powerful and widely used Python library. It is used heavily in many fields, and data analysis is one of them: Pandas plays a crucial role in manipulating, cleaning, analyzing, updating and editing data sets. The Pandas source code is available in its Git repository.
Install Pandas in PyCharm
Follow the steps below in the PyCharm IDE to install the module:
- Go to File -> Settings
- Select the project in which you want to install the Pandas library
- Select Project Interpreter
- Click the '+' symbol at the far right
- Search for the library
- Click 'Install Package'
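Once the installation finishes, a quick check like the one below can confirm that the interpreter selected in PyCharm can see the package. This is only a minimal sketch; the version string printed depends on what was installed on your machine.

```python
import pandas

# If this import succeeds, the package is visible to the current interpreter.
# The exact version string depends on your installation.
print(f"pandas version: {pandas.__version__}")
```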
Python Pandas Tutorial
Import Pandas
The Pandas library is imported with the 'import' keyword. We can import it in two ways:
import pandas
or
import pandas as pd  # pd is an alias for pandas
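Both forms load the same module; the alias only changes the name you type afterwards. A small sketch to illustrate this:

```python
import pandas
import pandas as pd

# Both names refer to the same module object.
print(pd is pandas)  # True
```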
Few Pandas concepts
1. Series
A Pandas Series is a one-dimensional labelled array that can hold any kind of data. By default it indexes (labels) the data starting from 0. We can access the data in a Series using these index labels.
Example
import pandas
data = [1, 'A', '*']
print(pandas.Series(data))
Output
0 1
1 A
2 *
dtype: object
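The "dtype: object" line appears because the list mixes an integer with strings. As a rough sketch (the `numbers` Series is made up for illustration), a list that holds only integers gets an integer dtype instead, which also makes element-wise arithmetic and aggregations meaningful:

```python
import pandas as pd

mixed = pd.Series([1, 'A', '*'])     # mixed types fall back to the generic object dtype
numbers = pd.Series([10, 20, 30])    # all integers give an integer dtype (typically int64)

print(mixed.dtype)
print(numbers.dtype)

# Numeric Series support element-wise arithmetic and aggregations.
print((numbers * 2).tolist())  # [20, 40, 60]
print(numbers.sum())           # 60
```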
We can also define our own index labels.
Example
import pandas
data = [1, 'A', '*']
print(pandas.Series(data, index=['a', 'b', 'c']))
Output
a 1
b A
c *
dtype: object
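Another way to get custom labels, sketched here with the same values, is to build the Series from a dictionary. In that case the dictionary keys become the index labels:

```python
import pandas as pd

# Dictionary keys become the index labels, values become the data.
labelled = pd.Series({'a': 1, 'b': 'A', 'c': '*'})

print(labelled)
print(list(labelled.index))  # ['a', 'b', 'c']
```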
Accessing elements by their index labels.
Example
import pandas
data = [1, 'A', '*']
indexed_data = pandas.Series(data, index=['a', 'b', 'c'])
print(f"Element at 'a': {indexed_data['a']}")
print(f"Element at 'b': {indexed_data['b']}")
print(f"Element at 'c': {indexed_data['c']}")
Output
Element at 'a': 1
Element at 'b': A
Element at 'c': *
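Besides plain square brackets, a Series can be accessed explicitly by label with `loc` or by integer position with `iloc`. The sketch below reuses the same data to show the difference, including how the two kinds of slices treat the end point:

```python
import pandas as pd

indexed_data = pd.Series([1, 'A', '*'], index=['a', 'b', 'c'])

print(indexed_data.loc['b'])               # access by label -> A
print(indexed_data.iloc[1])                # access by integer position -> A
print(indexed_data.loc['a':'b'].tolist())  # label slice includes both ends -> [1, 'A']
print(indexed_data.iloc[0:1].tolist())     # position slice excludes the end -> [1]
print('c' in indexed_data)                 # membership checks the index labels -> True
```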
2. DataFrame
A DataFrame in pandas is a two-dimensional labelled structure that can hold heterogeneous types of data. It is created with labelled axes (i.e. rows and columns).
In the example below, we create a data frame and look at some of its important operations, which are quite helpful when dealing with tabular data.
Example
import pandas
# Create the data frame
dframe = pandas.DataFrame({
    'NickNames': ['Green City', 'Golden City', 'Yoga City'],
    'States': ['Gandhi Nagar', 'Amritsar', 'Rishikesh'],
    'Delicacies': ['Pizza', 'Kulcha', 'Samosa'],
    'Rating': ['4.5', '4', '4.6']
})
print(f"Data Frame:\n {dframe}\n")
dframe.index = ['i', 'ii', 'iii']  # set custom row labels; the default index starts from 0
print(f"Index of Data Frame:\n {dframe}\n")
print(f"NickNames are:\n {dframe['NickNames']}\n")  # single brackets return the column as a Series
print(f"NickNames along with Data Frame:\n {dframe[['NickNames']]}\n")  # double brackets return a DataFrame
print(f"NickNames and States are:\n {dframe[['NickNames', 'States']]}\n")  # select several columns
print(f"Second and Third row:\n {dframe[1:3]}\n")  # slice rows by position (the end is excluded)
print(f"First Data set:\n {dframe.loc[['i']]}\n")  # loc selects rows by label (here the 1st row)
print(f"Second Data set:\n {dframe.iloc[1]}\n")  # iloc selects rows by integer position (here the 2nd row)
Output
Data Frame:
NickNames States Delicacies Rating
0 Green City Gandhi Nagar Pizza 4.5
1 Golden City Amritsar Kulcha 4
2 Yoga City Rishikesh Samosa 4.6
Index of Data Frame:
NickNames States Delicacies Rating
i Green City Gandhi Nagar Pizza 4.5
ii Golden City Amritsar Kulcha 4
iii Yoga City Rishikesh Samosa 4.6
NickNames are:
i Green City
ii Golden City
iii Yoga City
Name: NickNames, dtype: object
NickNames along with Data Frame:
NickNames
i Green City
ii Golden City
iii Yoga City
NickNames and States are:
NickNames States
i Green City Gandhi Nagar
ii Golden City Amritsar
iii Yoga City Rishikesh
Second and Third row:
NickNames States Delicacies Rating
ii Golden City Amritsar Kulcha 4
iii Yoga City Rishikesh Samosa 4.6
First Data set:
NickNames States Delicacies Rating
i Green City Gandhi Nagar Pizza 4.5
Second Data set:
NickNames Golden City
States Amritsar
Delicacies Kulcha
Rating 4
Name: ii, dtype: object
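Note that the Rating values in this example are stored as text, so they cannot be compared or averaged directly. The sketch below is one possible way to handle that, using `pandas.to_numeric` on a trimmed copy of the same data; the 4.2 threshold is an arbitrary value chosen for illustration.

```python
import pandas as pd

dframe = pd.DataFrame({
    'NickNames': ['Green City', 'Golden City', 'Yoga City'],
    'Rating': ['4.5', '4', '4.6'],   # stored as text, as in the example above
})

# Convert the text ratings to numbers so they can be compared and averaged.
dframe['Rating'] = pd.to_numeric(dframe['Rating'])

# Boolean filtering: keep only rows whose rating exceeds the (arbitrary) threshold.
high_rated = dframe[dframe['Rating'] > 4.2]

print(high_rated['NickNames'].tolist())   # ['Green City', 'Yoga City']
print(round(dframe['Rating'].mean(), 2))  # 4.37
```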
3. Read CSV
When we deal with a large amount of data (say, the employees of an MNC), it is impractical to type the data into a data frame by hand before working on it.
Hence, we store such data in a file format such as CSV or JSON. We can then simply load these files and perform any operation we need.
For the example below, I have downloaded the data set Future50.csv from kaggle.com.
Example
import pandas as pd
restaurant = pd.read_csv('Future50.csv')
print(f"Data Set: {restaurant}") #print data set
print(f"Columns in Data set: {restaurant.columns}") #print all columns in data set
print(f"First two columns in Data set{restaurant.columns[0:2]}") #print first 2 columns in data set
print(f"Data types of Columns: {restaurant.dtypes}", "\n") #print data type of each column
print(f"Data Type of Sales: {restaurant['Sales'].dtypes}") #print data type of a specific column
print(f"Shape of Data Set{restaurant.shape}") #print shape of data set (50 rows, 9 columns)
print(f"Number of rows: {restaurant.shape[0]}") #print number of rows
print(f"Number of Columns: {restaurant.shape[1]}") #print number of columns
print("\n")
print(f"First 5 rows of Data Set{restaurant.head()}") #by default prints first 5 rows of data set
print(f"First 3 rows of Data set{restaurant.head(3)}") #prints first 3 rows of data set
print(f"Last five rows of Data set{restaurant.tail()}") #by default prints last 5 rows of data set
print(f"Last two rows of Data set{restaurant.tail(2)}") #prints last 2 rows of data set
print("\n")
print(f"Unique values: {restaurant['Sales'].unique()}") #prints all unique values in a specific column
print(f"Number of unique values: {restaurant['Sales'].nunique()}") #prints total number of unique values in a specific column
Output
Columns in Data set: Index(['Rank', 'Restaurant', 'Location', 'Sales', 'YOY_Sales', 'Units',
'YOY_Units', 'Unit_Volume', 'Franchising'],
dtype='object')
First two columns in Data setIndex(['Rank', 'Restaurant'], dtype='object')
Data types of Columns: Rank int64
Restaurant object
Location object
Sales int64
YOY_Sales object
Units int64
YOY_Units object
Unit_Volume int64
Franchising object
dtype: object
Data Type of Sales: int64
Shape of Data Set(50, 9)
Number of rows: 50
Number of Columns: 9
First 5 rows of Data Set Rank Restaurant ... Unit_Volume Franchising
0 1 Evergreens ... 1150 No
1 2 Clean Juice ... 560 Yes
2 3 Slapfish ... 1370 Yes
3 4 Clean Eatz ... 685 Yes
4 5 Pokeworks ... 1210 Yes
[5 rows x 9 columns]
First 3 rows of Data set Rank Restaurant ... Unit_Volume Franchising
0 1 Evergreens ... 1150 No
1 2 Clean Juice ... 560 Yes
2 3 Slapfish ... 1370 Yes
[3 rows x 9 columns]
Last five rows of Data set Rank Restaurant ... Unit_Volume Franchising
45 46 LA Crawfish ... 2050 Yes
46 47 &pizza ... 1350 No
47 48 Super Duper Burgers ... 2630 No
48 49 StoneFire Grill ... 2550 No
49 50 Gus's World Famous Fried Chicken ... 1600 Yes
[5 rows x 9 columns]
Last two rows of Data set Rank Restaurant ... Unit_Volume Franchising
48 49 StoneFire Grill ... 2550 No
49 50 Gus's World Famous Fried Chicken ... 1600 Yes
[2 rows x 9 columns]
Unique values: [24 44 21 25 49 39 20 29 30 41 48 37 22 32 23 47 28 27 42 40 38 45 31]
Number of unique values: 23
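If you do not have Future50.csv at hand, the same calls can be tried on a small in-memory CSV. The sketch below uses `io.StringIO` in place of a file path; the rows are made up for illustration and are not taken from the Kaggle data set.

```python
import io
import pandas as pd

# Made-up sample rows for illustration only; they are not from Future50.csv.
csv_text = """Rank,Restaurant,Sales,Franchising
1,Alpha Cafe,24,No
2,Beta Bowls,44,Yes
3,Gamma Grill,24,Yes
"""

# read_csv accepts a file path or any file-like object, such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)                       # (3, 4)
print(list(sample.columns))               # ['Rank', 'Restaurant', 'Sales', 'Franchising']
print(sample['Sales'].unique().tolist())  # [24, 44]
print(sample['Sales'].nunique())          # 2
print(sample.head(2))                     # first two rows
```

Because the text is parsed exactly as a file would be, this is a convenient way to experiment with `head`, `tail`, `unique` and `nunique` before loading a real data set.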