Pandas¶

Pandas is a Python package for tabular data.

You may need to install pandas first, as well as seaborn

(pycourse) $ conda install pandas seaborn

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pd.__version__

'1.1.3'

Overview¶

Pandas provides a DataFrame object, which is used to hold tables of data (the name DataFrame comes from a similar object in R). The primary difference compared to a NumPy ndarray is that you can easily handle different data types in each column.

Difference between a DataFrame and NumPy Array¶

Pandas DataFrames and NumPy arrays both have similarities to Python lists.

Numpy arrays are designed to contain data of one type (e.g. Int, Float, …)
DataFrames can contain different types of data (Int, Float, String, …)
- Usually each column has the same type

Both arrays and DataFrames are optimized for storage/performance beyond Python lists

Pandas is also powerful for working with missing data, working with time series data, for reading and writing your data, for reshaping, grouping, merging your data, …

Key Features¶

File I/O - integrations with multiple file formats
Working with missing data (.dropna(), pd.isnull())
Normal table operations: merging and joining, groupby functionality, reshaping via stack, and pivot_tables,
Time series-specific functionality:
- date range generation and frequency conversion, moving window statistics/regressions, date shifting and lagging, etc.
Built in Matplotlib integration

Other Strengths¶

Strong community, support, and documentation
Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects Intelligent label-based slicing, fancy indexing, and subsetting of large data sets

Python/Pandas vs. R¶

R is a language dedicated to statistics. Python is a general-purpose language with statistics modules.
R has more statistical analysis features than Python, and specialized syntaxes.

However, when it comes to building complex analysis pipelines that mix statistics with e.g. image analysis, text mining, or control of a physical experiment, the richness of Python is an invaluable asset.

Objects and Basic Creation¶

Name	Dimensions	Description
`pd.Series`	1	1D labeled homogeneously-typed array
`pd.DataFrame`	2	General 2D labeled, size-mutable tabular structure
`pd.Panel`	3	General 3D labeled, also size-mutable array

Get Started¶

We’ll load the Titanic data set, which is part of seaborn’s example data. It contains information about 891 passengers on the infamous Titanic

See the Pandas tutorial on reading/writing data for more information.

titanic = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv")
df = titanic
df

	survived	pclass	sex	age	sibsp	parch	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	0	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	38.0	1	0	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	35.0	1	0	53.1000	S	First	woman	False	C	Southampton	yes	False
4	0	3	male	35.0	0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
886	0	2	male	27.0	0	0	13.0000	S	Second	man	True	NaN	Southampton	no	True
887	1	1	female	19.0	0	0	30.0000	S	First	woman	False	B	Southampton	yes	True
888	0	3	female	NaN	1	2	23.4500	S	Third	woman	False	NaN	Southampton	no	False
889	1	1	male	26.0	0	0	30.0000	C	First	man	True	C	Cherbourg	yes	True
890	0	3	male	32.0	0	0	7.7500	Q	Third	man	True	NaN	Queenstown	no	True

891 rows × 15 columns

we see that there are a variety of columns. Some contain numeric values, some contain strings, some contain booleans, etc.

If you want to access a particular column, you can create a Series using the column title.

ages = df["age"] # __getitem__
ages

    22.0
    38.0
    26.0
    35.0
    35.0
       ... 
  27.0
  19.0
   NaN
  26.0
  32.0
Name: age, Length: 891, dtype: float64

column titles are also treated as object attributes:

ages = df.age
ages

    22.0
    38.0
    26.0
    35.0
    35.0
       ... 
  27.0
  19.0
   NaN
  26.0
  32.0
Name: age, Length: 891, dtype: float64

Pandas DataFrames and Series have a variety of built-in methods to examine data

print(ages.max(), ages.min(), ages.mean(), ages.median())

80.0 0.42 29.69911764705882 28.0

You can display a variety of statistics using describe(). Note that NaN values are simply ignored.

ages.describe()

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: age, dtype: float64

depending on the type of data held in the series, describe has different functionality.

df['sex'].describe()

count      891
unique       2
top       male
freq       577
Name: sex, dtype: object

You can find some more discussion in the Pandas tutorial here

Selecting Subsets of DataFrames¶

You can select multiple columns

df2 = df[["age", "survived"]]
df2

	age	survived
0	22.0	0
1	38.0	1
2	26.0	1
3	35.0	1
4	35.0	0
...	...	...
886	27.0	0
887	19.0	1
888	NaN	0
889	26.0	1
890	32.0	0

891 rows × 2 columns

If you want to select certain rows, using some sort of criteria, you can create a series of booleans

mask = df2['age'] > 30
df2[mask]

	age	survived
1	38.0	1
3	35.0	1
4	35.0	0
6	54.0	0
11	58.0	1
...	...	...
873	47.0	0
879	56.0	1
881	33.0	0
885	39.0	0
890	32.0	0

305 rows × 2 columns

We see that we only have 305 rows now

If you just want to access specific rows, you can use iloc

df.iloc[:4]

	survived	pclass	sex	age	sibsp	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	38.0	1	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	35.0	1	53.1000	S	First	woman	False	C	Southampton	yes	False

Aside: Attributes¶

Recall that you can access columns of a DataFrame using attributes:

df.age

    22.0
    38.0
    26.0
    35.0
    35.0
       ... 
  27.0
  19.0
   NaN
  26.0
  32.0
Name: age, Length: 891, dtype: float64

You have typically defined attributes using syntax like

class MyDF():
    def __init__(self, **kwargs):
        self.is_df = True
        
df2 = MyDF(a=1, b=np.ones(3))
df2.is_df

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-12-ca27e129039d> in <module>
      3         self.is_df = True
      4 
----> 5 df2 = MyDF(a=1, b=np.ones(3))
      6 df2.is_df

NameError: name 'np' is not defined

df2.a

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-19-f30d3ef5cdfe> in <module>
----> 1 df2.a

AttributeError: 'MyDF' object has no attribute 'a'

However, there are other options for defining attibutes.

Object attributes are accessed using a Python dictionary. If you want to store attributes manually, you can put these attributes into the object dictionary using __dict__:

import numpy as np
class MyDF():
    def __init__(self, **kwargs):
        """
        Store all keyword arguments as attributes
        """
        for k, v in kwargs.items():
            self.__dict__[k] = v
            
        self.is_df = True
        
    
df2 = MyDF(a=1, b=np.ones(3))

print(df2.a)
print(df2.b)

1
[1. 1. 1.]

You can also define methods so they behave like attributes as well using the @property decorator

class MyDF():
    def __init__(self, **kwargs):
        """
        Store all keyword arguments as attributes
        """
        for k, v in kwargs.items():
            self.__dict__[k] = v
            
        self.is_df = True
        
           
    @property
    def nitems(self):
        return len(self.__dict__)
    
    
    @property
    def get_number(self):
        return np.random.rand()


    
df2 = MyDF(a=1, b=np.ones(3))
df2.nitems

df2.get_number

0.21378456767590537

Missing data¶

In data analysis, knowing how to properly fill in missing data is very important, sometimes we don’t want to just ignore them, especially when the observational numbers are small. There are various ways to do it such as filling with the mean, K-Nearest Neighbors (KNN) methods and so on.

Exercise¶

Here, as an exercise, suppose we are trying to replace missing values in age column as its mean from titanic dataset.

titanic = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv")

MissDataPra = titanic.copy()
MissDataPra

	survived	pclass	sex	age	sibsp	parch	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	0	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	38.0	1	0	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	35.0	1	0	53.1000	S	First	woman	False	C	Southampton	yes	False
4	0	3	male	35.0	0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
886	0	2	male	27.0	0	0	13.0000	S	Second	man	True	NaN	Southampton	no	True
887	1	1	female	19.0	0	0	30.0000	S	First	woman	False	B	Southampton	yes	True
888	0	3	female	NaN	1	2	23.4500	S	Third	woman	False	NaN	Southampton	no	False
889	1	1	male	26.0	0	0	30.0000	C	First	man	True	C	Cherbourg	yes	True
890	0	3	male	32.0	0	0	7.7500	Q	Third	man	True	NaN	Queenstown	no	True

891 rows × 15 columns

## Your code here

ages = MissDataPra["age"]
ages

    22.0
    38.0
    26.0
    35.0
    35.0
       ... 
  27.0
  19.0
   NaN
  26.0
  32.0
Name: age, Length: 891, dtype: float64

# Solution 
ageMean = ages.mean()
MissDataPra["age"] = ages.fillna(ageMean)
MissDataPra["age"]

    22.000000
    38.000000
    26.000000
    35.000000
    35.000000
         ...    
  27.000000
  19.000000
  29.699118
  26.000000
  32.000000
Name: age, Length: 891, dtype: float64

Series¶

a Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.
Basic method to create a series: s = pd.Series(data, index = index)
Data can be many things:
- A Python Dictionary
- An ndarray (or reg. list)
- A scalar
The passed index is a list of axis labels (which varies on what data is)

Think “Series = Vector + labels”

first_series = pd.Series([2**i for i in range(7)])
print(type(first_series))
print(first_series)

<class 'pandas.core.series.Series'>
   1
   2
   4
   8
  16
  32
  64
dtype: int64

s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)
print('-'*50)
print(s.index)

a    0.451096
b    1.335011
c    0.433332
d    1.729655
e    1.092280
dtype: float64
--------------------------------------------------
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

s['a']

0.45109622266084853

If Data is a dictionary, if index is passed the values in data corresponding to the labels in the index will be pulled out, otherwise an index will be constructed from the sorted keys of the dict

d = {'a': [0., 0], 'b': {'1':1.}, 'c': 2.}
pd.Series(d)

a      [0.0, 0]
b    {'1': 1.0}
c             2
dtype: object

You can create a series from a scalar, but need to specify indices

pd.Series(5, index = ['a', 'b', 'c'])

a    5
b    5
c    5
dtype: int64

You can index and slice series like you would numpy arrays/python lists

s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)

a    0.264398
b    0.701476
c    1.106238
d    0.455066
e    1.202556
dtype: float64

print(s['b':'d'])

b    0.701476
c    1.106238
d    0.455066
dtype: float64

s[s > s.mean()]

You can iterate over a Series as well

for idx, val in s.iteritems():
    print(idx,val)

a 0.2643980344953476
b 0.7014760363611338
c 1.106238389833158
d 0.4550662058745087
e 1.2025560049383854

or sort by index or value

print(s.sort_index())
print(s.sort_values())

a    0.264398
b    0.701476
c    1.106238
d    0.455066
e    1.202556
dtype: float64
a    0.264398
d    0.455066
b    0.701476
c    1.106238
e    1.202556
dtype: float64

You can also count unique values:

s = pd.Series([0,0,0,1,1,1,2,2,2,2])
s.value_counts()

  4
  3
  3
dtype: int64

Exercise¶

Consider the series s of letters in a sentence.
What is count of each letter in the sentence, output a series which is sorted by the count
Create a list with only the top 5 common letters (not including space)

## Your code here

sentence = 'Consider the series s of letters in a sentence.'
s = pd.Series([c for c in sentence])
s

   C
   o
   n
   s
   i
   d
   e
   r
    
   t
  h
  e
   
  s
  e
  r
  i
  e
  s
   
  s
   
  o
  f
   
  l
  e
  t
  t
  e
  r
  s
   
  i
  n
   
  a
   
  s
  e
  n
  t
  e
  n
  c
  e
  .
dtype: object

s.value_counts()

e    9
     8
s    6
n    4
t    4
i    3
r    3
o    2
.    1
l    1
d    1
C    1
h    1
a    1
c    1
f    1
dtype: int64

sc = s[s != ' '].value_counts()
sc[:5]

e    9
s    6
n    4
t    4
i    3
dtype: int64

Data Frames¶

a DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.
You can create a DataFrame from:
- Dict of 1D ndarrays, lists, dicts, or Series
- 2-D numpy array
- A list of dictionaries
- A Series
- Another Dataframe

df = pd.DataFrame(data, index = index, columns = columns)

index/ columns is a list of the row/ column labels. If you pass an index and/ or columns, you are guarenteeing the index and /or column of the df.
If you do not pass anything in, the input will be constructed by “common sense” rules

Documentation: pandas.DataFrame

DataFrame Creation¶

from dictionaries:

end_string = "\n -------------------- \n"

# Create a dictionary of series
d = {'one': pd.Series([1,2,3], index  = ['a', 'b', 'c']), 
     'two': pd.Series(list(range(4)), index = ['a','b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df, end = end_string)

d= {'one': {'a': 1, 'b': 2, 'c':3}, 
     'two': pd.Series(list(range(4)), index = ['a','b', 'c', 'd'])}
# Columns are dictionary keys, indices and values obtained from series
df = pd.DataFrame(d)
# Notice how it fills the column one with NaN for d
print(df, end = end_string)

   one  two
a  1.0    0
b  2.0    1
c  3.0    2
d  NaN    3
 -------------------- 
   one  two
a  1.0    0
b  2.0    1
c  3.0    2
d  NaN    3
 -------------------- 

From dictionaries of ndarrays or lists:

d = {'one' : [1., 2., 3., 4], 'two' : [4., 3., 2., 1.]}
pd.DataFrame(d)

	one	two
0	1.0	4.0
1	2.0	3.0
2	3.0	2.0
3	4.0	1.0

from a list of dicts:

data = []
for i in range(100):
    data += [ {'Column' + str(j):np.random.randint(100) for j in range(5)} ]
    # dictionary comprehension!
    
data[:5]

[{'Column0': 87, 'Column1': 30, 'Column2': 20, 'Column3': 90, 'Column4': 49},
 {'Column0': 36, 'Column1': 17, 'Column2': 22, 'Column3': 9, 'Column4': 80},
 {'Column0': 9, 'Column1': 43, 'Column2': 93, 'Column3': 68, 'Column4': 31},
 {'Column0': 82, 'Column1': 28, 'Column2': 68, 'Column3': 28, 'Column4': 37},
 {'Column0': 6, 'Column1': 60, 'Column2': 1, 'Column3': 30, 'Column4': 74}]

# Creation from a list of dicts
df = pd.DataFrame(data)
df.head()

	Column0	Column1	Column2	Column3	Column4
0	87	30	20	90	49
1	36	17	22	9	80
2	9	43	93	68	31
3	82	28	68	28	37
4	6	60	1	30	74

Adding and Deleting Columns¶

# Adding and accessing columns 
d = {'one': pd.Series([1,2,3], index  = ['a', 'b', 'c']), 
     'two': pd.Series(range(4), index = ['a','b', 'c', 'd'])}
df = pd.DataFrame(d)
# multiply 
df['three'] =  df['one']*df['two']
# Create a boolean flag
df['flag'] = df['one'] > 2
df

	one	two	three	flag
a	1.0	0	0.0	False
b	2.0	1	2.0	False
c	3.0	2	6.0	True
d	NaN	3	NaN	False

# inserting column in specified location, with values
df.insert(1, 'bar', pd.Series(3, index=['a', 'b', 'd']))
df

	one	bar	two	three	flag
a	1.0	3.0	0	0.0	False
b	2.0	3.0	1	2.0	False
c	3.0	NaN	2	6.0	True
d	NaN	3.0	3	NaN	False

# Deleting Columns  
three = df.pop('three')
df

	one	bar	two	flag
a	1.0	3.0	0	False
b	2.0	3.0	1	False
c	3.0	NaN	2	True
d	NaN	3.0	3	False

Indexing and Selection¶

4 methods [], ix, iloc, loc

Operation	Syntax	Result
Select Column	df[col]	Series
Select Row by Label	df.loc[label]	Series
Select Row by Integer Location	df.iloc[idx]	Series
Slice rows	df[5:10]	DataFrame
Select rows by boolean	df[mask]	DataFrame

Note all the operations below are valid on series as well restricted to one dimension

Indexing using []

Series: selecting a label: s[label]
DataFrame: selection single or multiple columns:

df['col'] or df[['col1', 'col2']]

DataFrame: slicing the rows:

df['rowlabel1': 'rowlabel2']

or

df[boolean_mask]

# Lets create a data frame
pd.options.display.max_rows = 4
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C','D'])
df

	A	B	C	D
2000-01-01	2.481613	-0.577964	0.222924	-1.098381
2000-01-02	0.342951	-0.696970	1.165167	0.839895
...	...	...	...	...
2000-01-07	-1.135999	0.948088	-0.060405	1.720806
2000-01-08	0.683102	-0.914922	0.613417	0.461496

8 rows × 4 columns

# column  'A
df['A']

2000-01-01    2.481613
2000-01-02    0.342951
                ...   
2000-01-07   -1.135999
2000-01-08    0.683102
Freq: D, Name: A, Length: 8, dtype: float64

# all rows, columns A, B
df.loc[:,"A":"B"]

	A	B
2000-01-01	2.481613	-0.577964
2000-01-02	0.342951	-0.696970
...	...	...
2000-01-07	-1.135999	0.948088
2000-01-08	0.683102	-0.914922

8 rows × 2 columns

# multiple columns
df[['A', 'B']]

	A	B
2000-01-01	2.481613	-0.577964
2000-01-02	0.342951	-0.696970
...	...	...
2000-01-07	-1.135999	0.948088
2000-01-08	0.683102	-0.914922

8 rows × 2 columns

# slice by rows
df['2000-01-01': '2000-01-04']

	A	B	C	D
2000-01-01	2.481613	-0.577964	0.222924	-1.098381
2000-01-02	0.342951	-0.696970	1.165167	0.839895
2000-01-03	-2.449466	-0.704328	-1.625973	0.233635
2000-01-04	-0.936724	-0.764115	0.049249	-0.893543

# boolean mask 
df[df['A'] > df['B']]

	A	B	C	D
2000-01-01	2.481613	-0.577964	0.222924	-1.098381
2000-01-02	0.342951	-0.696970	1.165167	0.839895
2000-01-08	0.683102	-0.914922	0.613417	0.461496

# Assign via []
df['A'] = df['B'].values
df

	A	B	C	D
2000-01-01	-0.577964	-0.577964	0.222924	-1.098381
2000-01-02	-0.696970	-0.696970	1.165167	0.839895
...	...	...	...	...
2000-01-07	0.948088	0.948088	-0.060405	1.720806
2000-01-08	-0.914922	-0.914922	0.613417	0.461496

8 rows × 4 columns

Selecting by label: .loc

is primarily label based, but may also be used with a boolean array.
.loc will raise KeyError when the items are not found
Allowed inputs:
1. A single label
2. A list of labels
3. A boolean array

## Selection by label .loc
df.loc['2000-01-01']

A   -0.577964
B   -0.577964
C    0.222924
D   -1.098381
Name: 2000-01-01 00:00:00, dtype: float64

df.loc[:, 'A':'C']

	A	B	C
2000-01-01	-0.577964	-0.577964	0.222924
2000-01-02	-0.696970	-0.696970	1.165167
...	...	...	...
2000-01-07	0.948088	0.948088	-0.060405
2000-01-08	-0.914922	-0.914922	0.613417

8 rows × 3 columns

# Get columns for which value is greater than 0 on certain day, get all rows
df.loc[:, df.loc['2000-01-01'] > 0]

	C
2000-01-01	0.222924
2000-01-02	1.165167
...	...
2000-01-07	-0.060405
2000-01-08	0.613417

8 rows × 1 columns

Selecting by position: iloc

The .iloc attribute is the primary access method. The following are valid input:
- An integer
- A list of integers
- A slice
- A boolean array

df1 = pd.DataFrame(np.random.randn(6,4), 
                   index=list(range(0,12,2)), columns=list(range(0,12,3)))

df1

	0	3	6	9
0	0.475385	-0.631032	-1.535313	0.662338
2	0.289818	-0.659518	0.877284	-1.113570
...	...	...	...	...
8	-0.635411	0.512394	-0.638799	1.760938
10	-0.690578	-0.656628	0.463184	-1.155488

6 rows × 4 columns

# rows 0-2
df1.iloc[:3]

	0	3	6	9
0	0.475385	-0.631032	-1.535313	0.662338
2	0.289818	-0.659518	0.877284	-1.113570
4	-0.130941	1.969738	1.103449	-1.368122

#  rows 1:5  and columns 2 : 4
df1.iloc[1:5, 2:4]

	6	9
2	0.877284	-1.113570
4	1.103449	-1.368122
6	1.145936	-0.435587
8	-0.638799	1.760938

# select via integer list
df1.iloc[[1,3,5], [1,3]]

	3	9
2	-0.659518	-1.113570
6	0.017810	-0.435587
10	-0.656628	-1.155488

# selecting via integer mask 
boolean_mask = df1.iloc[:, 1] > 0.0
df1.iloc[boolean_mask.values,1]

  1.969738
  0.017810
  0.512394
Name: 3, dtype: float64

Exercise¶

Given the dataframe df below, find the following

Last two rows of columns A and D
Last three rows such which statisfy that column A > Column B

dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C','D'])
df

## Your code here

# Solution: Last two rows of columns A and D
df.iloc[[6,7], [0,3]]

# Solution: Last three rows such which statisfy that column A > Column B
last3rows = df.iloc[5:8,]  # last three rows 
print(last3rows['A'] > last3rows['B']) # The boolean statement return true if column A > Column B. 

Plotting¶

DataFrames and Series have a variety of built-in plotting methods:

Let’s take a look at the Iris data set

sepal length in cm
sepal width in cm
petal length in cm
petal width in cm
class: – Iris Setosa – Iris Versicolour – Iris Virginica

iris = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
    names=["sepal length", "sepal width", "petal length", "petal width", "class"]
)
iris

	sepal length	sepal width	petal length	petal width	class
0	5.1	3.5	1.4	0.2	Iris-setosa
1	4.9	3.0	1.4	0.2	Iris-setosa
...	...	...	...	...	...
148	6.2	3.4	5.4	2.3	Iris-virginica
149	5.9	3.0	5.1	1.8	Iris-virginica

150 rows × 5 columns

iris.plot()
plt.show()

# Titanic DataSet
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv")

df.age.plot()
plt.show()

plot just plots the value by index, and doesn’t make a lot of sense unless the index means something (like time). In this case, a histogram makes more sense:

df.age.hist()
plt.show()

we can also create scatter plots of two columns

df.plot.scatter(x='age', y='fare', alpha=0.5)
plt.show()

You can also create box plots

df.age.plot.box()
plt.show()

Summary Statistics¶

Basic summary statistics are built into Pandas. These are easy to compute on columns/series

print(df.age.mean())
print(df.age.median())

29.69911764705882
28.0

You can also compute statistics on multiple (or all columns)

df.mean()

survived      0.383838
pclass        2.308642
                ...   
adult_male    0.602694
alone         0.602694
Length: 8, dtype: float64

df[['age', 'fare']].mean()

age     29.699118
fare    32.204208
dtype: float64

You can also compute statistics grouping by category

df[['age', 'sex']].groupby('sex').mean()

	age
sex
female	27.915709
male	30.726645

df.groupby('sex').mean()

	survived	pclass	age	sibsp	parch	fare	adult_male	alone
sex
female	0.742038	2.159236	27.915709	0.694268	0.649682	44.479818	0.000000	0.401274
male	0.188908	2.389948	30.726645	0.429809	0.235702	25.523893	0.930676	0.712305

You can count how many records are in each category for categorical variables

df['pclass'].value_counts()

  491
  216
  184
Name: pclass, dtype: int64

df.groupby('sex')['pclass'].value_counts()

sex     pclass
female  3         144
        1          94
                 ... 
male    1         122
        2         108
Name: pclass, Length: 6, dtype: int64

Table Manipulation¶

You can sort tables by a column value:

df.sort_values(by='age')[:10]

	survived	pclass	sex	age	sibsp	parch	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
803	1	3	male	0.42	0	1	8.5167	C	Third	child	False	NaN	Cherbourg	yes	False
755	1	2	male	0.67	1	1	14.5000	S	Second	child	False	NaN	Southampton	yes	False
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
381	1	3	female	1.00	0	2	15.7417	C	Third	child	False	NaN	Cherbourg	yes	False
164	0	3	male	1.00	4	1	39.6875	S	Third	child	False	NaN	Southampton	no	False

10 rows × 15 columns

You can also sort by a primary key and secondary key

df.sort_values(by=['pclass', 'age'])[:600]

	survived	pclass	sex	age	sibsp	parch	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
305	1	1	male	0.92	1	2	151.5500	S	First	child	False	C	Southampton	yes	False
297	0	1	female	2.00	1	2	151.5500	S	First	child	False	C	Southampton	no	False
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
162	0	3	male	26.00	0	0	7.7750	S	Third	man	True	NaN	Southampton	no	True
207	1	3	male	26.00	0	0	18.7875	C	Third	man	True	NaN	Cherbourg	yes	True

600 rows × 15 columns

Groups¶

pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names:

Syntax:

groups = df.groupby(key)
groups = df.groupby(key, axis = 1)
groups = df.groupby([key1, key2], axis = 1)

The group by concept is that we want to apply the same function on subsets of the dataframe, based on some key we use to split the DataFrame into subsets

This idea is referred to as the “split-apply-combine” operation:

Split the data into groups based on some criteria
Apply a function to each group independently
Combine the results

df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})
df

	key	data
0	A	0
1	B	5
...	...	...
7	B	15
8	C	20

9 rows × 2 columns

df.groupby('key')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f60022988e0>

df.groupby('key').sum()

	data
key
A	15
B	30
C	45

sums = df.groupby('key').agg(np.sum)
sums

	data
key
A	15
B	30
C	45

ax = iris.groupby('class') \
         .get_group('Iris-setosa') \
         .boxplot(column=["sepal length","sepal width"], return_type='axes')

Pivot Tables¶

Say you want the mean age for each sex grouped by class. We can create a pivot table to display the data:

titanic.pivot_table(values="age", index="sex",
                columns="pclass", aggfunc="mean")

pclass	1	2	3
sex
female	34.611765	28.722973	21.750000
male	41.281386	30.740707	26.507589

you can change the aggregation function to compute other statistics

titanic.pivot_table(values="age", index="sex",
                columns="pclass", aggfunc="median")

pclass	1	2	3
sex
female	35.0	28.0	21.5
male	40.0	30.0	25.0

Merging DataFrames¶

Pandas has full-featured, very high performance, in memory join operations that are very similar to SQL and R
The documentation is https://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging
Pandas provides a single function, merge, as the entry point for all standard database join operations between DataFrame objects:

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=True)

# Example of merge
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [4, 2]})
right = pd.DataFrame({'key': ['bar', 'zoo'], 'rval': [4, 5]})

print("left: ",left,"right: ",right, sep=end_string)

left: 
 -------------------- 
   key  lval
0  foo     4
1  bar     2
 -------------------- 
right: 
 -------------------- 
   key  rval
0  bar     4
1  zoo     5

merged = pd.merge(left, right, how="inner")
print(merged)

   key  lval  rval
0  bar     2     4

merged = pd.merge(left, right, how="outer")
print(merged)

   key  lval  rval
foo   4.0   NaN
bar   2.0   4.0
zoo   NaN   5.0

merged = pd.merge(left, right, how="left")
print(merged)

   key  lval  rval
0  foo     4   NaN
1  bar     2   4.0

merged = pd.merge(left, right, how="right")
print(merged)

   key  lval  rval
0  bar   2.0     4
1  zoo   NaN     5

Functions¶

Row or Column-wise Function Application: Applies function along input axis of DataFrame

df.apply(func, axis = 0)

Elementwise: apply the function to every element in the df

df.applymap(func)

Note, applymap is equivalent to the map function on lists.
Note, Series objects support .map instead of applymap

df1 = pd.DataFrame(np.random.randn(6,4), index=list(range(0,12,2)), columns=list('abcd'))
df1

	a	b	c	d
0	-0.086992	-0.084940	0.732539	0.055577
2	0.414442	-0.085030	0.108881	0.392840
...	...	...	...	...
8	-1.401044	-0.814559	-0.227711	-1.525791
10	-1.349603	-0.058741	1.319168	0.215847

6 rows × 4 columns

A = np.random.randn(6,4)
print(np.mean(A, axis=0))
print(A[0])

[-0.30215707  0.32244258 -0.19654789 -0.17821756]
[-0.88573991  1.51802081 -0.79924036 -0.34815962]

print(np.mean(A, axis=1))
print(A[:,0])

[-0.12877977 -0.25276604  0.02348098  0.63743703 -0.20947629 -0.60161581]
[-0.88573991  0.03663791  0.69401035 -1.54572936 -0.68917893  0.57705751]

# Apply to each column
df1.apply(np.mean, axis=0)

a   -0.554604
b   -0.243100
c    0.496747
d    0.087104
dtype: float64

# Apply to each row
df1.apply(np.mean, axis=1)

0     0.154046
2     0.207783
        ...   
8    -0.992276
10    0.031668
Length: 6, dtype: float64

# # Use lambda functions  to normalize columns
df1.apply(lambda x: (x - x.mean())/ x.std(), axis = 0)

	a	b	c	d
0	0.567312	0.279936	0.440002	-0.037146
2	1.175658	0.279778	-0.723779	0.360238
...	...	...	...	...
8	-1.026911	-1.011459	-1.351878	-1.900419
10	-0.964503	0.326306	1.534685	0.151694

6 rows × 4 columns

df1.applymap(np.exp)

	a	b	c	d
0	0.916684	0.918567	2.080356	1.057151
2	1.513526	0.918485	1.115030	1.481181
...	...	...	...	...
8	0.246340	0.442835	0.796355	0.217449
10	0.259343	0.942951	3.740308	1.240912

6 rows × 4 columns

Bibliography - this notebook used content from some of the following sources:

Scientific Computing with Python

Pandas

Contents

Pandas¶

Overview¶

Difference between a DataFrame and NumPy Array¶

Key Features¶

Other Strengths¶

Python/Pandas vs. R¶

Objects and Basic Creation¶

Get Started¶

Selecting Subsets of DataFrames¶

Aside: Attributes¶

Missing data¶

Exercise¶

Series¶

Exercise¶

Data Frames¶

DataFrame Creation¶

Adding and Deleting Columns¶

Indexing and Selection¶

Exercise¶

Plotting¶

Summary Statistics¶

Table Manipulation¶

Groups¶

Pivot Tables¶

Merging DataFrames¶

Functions¶