Datasets: pandas dataframe (kaggle.com)

Creating DataFrame

Using Lists

import numpy as np
import pandas as pd

student_data = [
    [100, 80, 10],
    [90, 70, 7],
    [120, 100, 14],
    [80, 50, 2]
]

pd.DataFrame(student_data, columns=['iq', 'marks', 'package'])

Using Dictionaries

student_dict = {
    'name': ['nabin', 'naksh', 'shisir', 'manisha', 'evanka', 'mansavi'],
    'iq': [100, 90, 120, 80, 0, 0],
    'marks': [80, 70, 100, 50, 0, 0],
    'package': [10, 7, 14, 2, 0, 0]
}

students = pd.DataFrame(student_dict)
students.set_index('name', inplace=True)

Using read_csv

movies = pd.read_csv('movies.csv')
ipl = pd.read_csv('ipl-matches.csv')

DataFrame Attributes and Methods

Shape

movies.shape
ipl.shape

DataTypes

movies.dtypes
ipl.dtypes

Index

movies.index
ipl.index

Columns

movies.columns
ipl.columns
students.columns

Values

students.values
ipl.values

Head, Tail, Sample

movies.head(2)
ipl.tail(2)
ipl.sample(5)

Info

movies.info()
ipl.info()

Describe

movies.describe()
ipl.describe()

isnull, duplicated

movies.isnull().sum()
movies.duplicated().sum()
students.duplicated().sum()
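To make the two counts concrete, here is a minimal sketch on a cut-down `students` frame. The `NaN` in `marks` is a hypothetical value added only so that `isnull` has something to find. It is not part of the data above.

```python
import numpy as np
import pandas as pd

# Cut-down students frame; the NaN is hypothetical, added for illustration
students = pd.DataFrame(
    {'iq': [100, 90, 0, 0], 'marks': [80, np.nan, 0, 0]},
    index=['nabin', 'naksh', 'evanka', 'mansavi'],
)

# isnull() gives a True/False frame; summing it counts missing values per column
missing_per_column = students.isnull().sum()

# duplicated() flags a row when an earlier row has the same values in every column
# (the index labels are not part of the comparison)
duplicate_rows = students.duplicated().sum()

print(missing_per_column)
print(duplicate_rows)
```

The rows for `evanka` and `mansavi` hold identical values, so the second one is counted as a duplicate even though the names differ.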

Rename Columns

students.rename(columns={'marks': 'percent', 'package': 'lpa'}, inplace=True)

Math Methods

students.sum(axis=0)
students.mean(axis=1)
students.var()
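The `axis` argument decides the direction of the aggregation. A minimal sketch using the first four rows of the `students` data, with the renamed columns (shortened here only to keep the output small):

```python
import pandas as pd

# First four rows of the students table, indexed by name
students = pd.DataFrame(
    {
        'iq': [100, 90, 120, 80],
        'percent': [80, 70, 100, 50],
        'lpa': [10, 7, 14, 2],
    },
    index=['nabin', 'naksh', 'shisir', 'manisha'],
)

col_totals = students.sum(axis=0)  # one value per column (collapses the rows)
row_means = students.mean(axis=1)  # one value per row (collapses the columns)
col_var = students.var()           # per-column sample variance (ddof=1 by default)

print(col_totals)
print(row_means)
print(col_var)
```

`axis=0` is the default for these methods, so `students.var()` and `students.var(axis=0)` give the same result.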

Selecting Columns from a DataFrame

Single Columns

movies['title_x']
ipl['Venue']

Multiple Columns

movies[['year_of_release', 'actors', 'title_x']]
ipl[['Team1', 'Team2', 'WinningTeam']]

Selecting Rows from a DataFrame

  • iloc – searches using index positions
  • loc – searches using index labels
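One practical difference between the two is how slices treat the stop value. A small sketch on a cut-down `students` frame:

```python
import pandas as pd

students = pd.DataFrame(
    {'iq': [100, 90, 120, 80], 'marks': [80, 70, 100, 50]},
    index=['nabin', 'naksh', 'shisir', 'manisha'],
)

# iloc slices by position; the stop position is excluded, like a Python list
by_position = students.iloc[0:2]

# loc slices by label; the stop label is included
by_label = students.loc['nabin':'shisir']

print(by_position.index.tolist())
print(by_label.index.tolist())
```

When the index is the default 0, 1, 2, ... range, the labels look like positions, but `loc[0:2]` still returns three rows while `iloc[0:2]` returns two.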

Single Row

movies.iloc[5]

Multiple Rows

movies.iloc[:5]

Fancy Indexing

movies.iloc[[0, 4, 5]]

loc

students.loc['nabin']
students.loc['nabin':'evanka':2]
students.loc[['nabin', 'manisha', 'evanka']]
students.iloc[[0, 3, 4]]

Selecting both Rows and Columns

iloc and loc

movies.iloc[0:3, 0:3]
movies.loc[0:2, 'title_x':'poster_path']

Filtering a DataFrame

# Example: find all the final winners
mask = ipl['MatchNumber'] == 'Final'
new_df = ipl[mask]
new_df[['Season', 'WinningTeam']]

# Other Filtering Examples
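A few more filtering patterns, sketched on a tiny stand-in frame. Only the column names come from the `ipl` examples above. The rows, team names and the 'Qualifier 1' label are hypothetical.

```python
import pandas as pd

# Hypothetical rows; only the column names match the ipl examples
ipl = pd.DataFrame({
    'Season': ['2021', '2021', '2022', '2022'],
    'MatchNumber': ['Qualifier 1', 'Final', '12', 'Final'],
    'Team1': ['A', 'A', 'B', 'C'],
    'Team2': ['B', 'C', 'C', 'A'],
    'WinningTeam': ['A', 'C', 'B', 'A'],
})

# AND: wrap each condition in parentheses and join them with &
finals_won_by_a = ipl[(ipl['MatchNumber'] == 'Final') & (ipl['WinningTeam'] == 'A')]

# Membership: isin keeps rows whose value is any of the listed ones
playoffs = ipl[ipl['MatchNumber'].isin(['Final', 'Qualifier 1'])]

# NOT: ~ inverts a boolean mask
league_games = ipl[~ipl['MatchNumber'].isin(['Final', 'Qualifier 1'])]

# Counting: True counts as 1, so summing a mask counts the matching rows
team1_wins = (ipl['Team1'] == ipl['WinningTeam']).sum()

print(finals_won_by_a[['Season', 'WinningTeam']])
print(len(playoffs), len(league_games), team1_wins)
```

The parentheses around each condition are required because `&` and `|` bind more tightly than `==` in Python.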

Adding new Columns

Completely New

movies['Country'] = 'India'

From Existing Ones

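# dropna() with no arguments removes every row that has a missing value in any column
movies.dropna(inplace=True)
# Take the first name from the pipe-separated 'actors' string
movies['lead actor'] = movies['actors'].str.split('|').apply(lambda x: x[0])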
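The `dropna` call is needed because the lambda fails on missing values. If losing those rows is not desired, one possible alternative is the `.str[0]` accessor, which passes `NaN` through untouched. A sketch with hypothetical rows (only the column names `title_x` and `actors` come from the examples above):

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the second film has no actors listed
movies = pd.DataFrame({
    'title_x': ['Film One', 'Film Two', 'Film Three'],
    'actors': ['Actor A|Actor B', np.nan, 'Actor C'],
})

# .str[0] takes the first element of each split list and keeps NaN as NaN
movies['lead actor'] = movies['actors'].str.split('|').str[0]

print(movies)
```

All three rows survive, and the film without an actor list simply gets a missing lead actor.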

Important DataFrame Functions

astype

ipl.info()
ipl['ID'] = ipl['ID'].astype('int32')
ipl.info()
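The point of comparing `info()` before and after is the change in dtype and memory use. A minimal sketch that measures the same thing directly, with hypothetical ID values:

```python
import pandas as pd

# Hypothetical IDs; the starting dtype is set explicitly to int64
ipl = pd.DataFrame({'ID': [1001, 1002, 1003, 1004]})
ipl['ID'] = ipl['ID'].astype('int64')
bytes_before = ipl['ID'].nbytes  # 8 bytes per value

# astype returns a converted copy, so assign it back to the column
ipl['ID'] = ipl['ID'].astype('int32')
bytes_after = ipl['ID'].nbytes   # 4 bytes per value

print(ipl['ID'].dtype, bytes_before, bytes_after)
```

Downcasting is only safe when every value fits in the smaller type. `int32` holds values up to roughly 2.1 billion.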

value_counts

# Example: find which player has won most Player of the Match in finals and qualifiers
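One way this could be done is to filter the playoff matches first and then call `value_counts` on the award column. The sketch below uses hypothetical rows. The column name `Player_of_Match` and the 'Qualifier ...' labels are assumptions about the dataset, so adjust them to the real file.

```python
import pandas as pd

# Hypothetical rows; 'Player_of_Match' and the 'Qualifier ...' labels are assumed
ipl = pd.DataFrame({
    'MatchNumber': ['Final', 'Qualifier 1', 'Qualifier 2', '14', 'Final'],
    'Player_of_Match': ['P1', 'P2', 'P1', 'P2', 'P1'],
})

# Keep only finals and qualifiers
is_playoff = (ipl['MatchNumber'] == 'Final') | ipl['MatchNumber'].str.startswith('Qualifier')
playoffs = ipl[is_playoff]

# value_counts returns counts sorted from most to least frequent
awards = playoffs['Player_of_Match'].value_counts()
top_player = awards.index[0]

print(awards)
print(top_player)
```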

Other Examples

# Toss decision plot
# Number of matches each team has played
# sort_values, ascending, na_position, inplace, multiple cols
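The three items above can be sketched on small stand-in frames. All rows are hypothetical, and `TossDecision` is an assumed column name. `Team1` and `Team2` are the columns used earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; 'TossDecision' is an assumed column name
ipl = pd.DataFrame({
    'Team1': ['A', 'B', 'A', 'C'],
    'Team2': ['B', 'C', 'C', 'A'],
    'TossDecision': ['bat', 'field', 'field', 'field'],
})

# Toss decisions: these counts are what a pie or bar chart would display
toss_counts = ipl['TossDecision'].value_counts()
# toss_counts.plot(kind='pie')  # uncomment if matplotlib is installed

# Matches per team: a team can appear in either column, so add both counts
matches_per_team = (
    ipl['Team1'].value_counts()
    .add(ipl['Team2'].value_counts(), fill_value=0)
    .astype(int)
    .sort_values(ascending=False)
)

# sort_values with several keys, one direction per key, and NaN placement
scores = pd.DataFrame({
    'name': ['x', 'y', 'z', 'w'],
    'iq': [100, 120, 100, np.nan],
    'marks': [80, 70, 90, 60],
})
ranked = scores.sort_values(['iq', 'marks'], ascending=[False, True], na_position='first')

print(toss_counts)
print(matches_per_team)
print(ranked)
```

`sort_values` returns a new object by default. Passing `inplace=True` sorts the existing frame instead and returns `None`.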
