Curiosity

Musings on observations.

Pandas and Python: Top 10

I recently discovered the high-performance Pandas library written in Python while performing data munging in a machine learning project. Using simple examples, I want to highlight my favorite (and sometimes hard to find) features.

Apart from serving as a quick reference, I hope this post will help new users quickly start extracting value from Pandas. For a good overview of Pandas and its advanced features, I highly recommend Wes McKinney’s Python for Data Analysis book and the documentation on the website.

Here is my top 10 list:

  1. Indexing
  2. Renaming
  3. Handling missing values
  4. map(), apply(), applymap()
  5. groupby()
  6. New Columns = f(Existing Columns)
  7. Basic stats
  8. Merge, join
  9. Plots
  10. Scikit-learn conversion

Example DataFrame

I will use a simple data frame for explanation.

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: from pandas import DataFrame, Series

In [4]: df = DataFrame({'int_col' : [1,2,6,8,-1], 'float_col' : [0.1, 0.2,0.2,10.1,None], 'str_col' : ['a','b',None,'c','a']})

In [5]: df
Out[5]:
   float_col  int_col str_col
0        0.1        1       a
1        0.2        2       b
2        0.2        6    None
3       10.1        8       c
4        NaN       -1       a

Indexing

Selecting a subset of columns

It is one of the simplest features but was surprisingly difficult to find. The ix method works elegantly for this purpose. Suppose you want to select only the float_col and int_col columns; you would use the advanced indexing ix method as shown below.

In [6]: df.ix[:,['float_col','int_col']]
Out[6]:
   float_col  int_col
0        0.1        1
1        0.2        2
2        0.2        6
3       10.1        8
4        NaN       -1

EDIT Suggestion by Dan in the comments below. Another (probably more elegant) syntax for indexing multiple columns is given below.

In [9]: df[['float_col','int_col']]
Out[9]:
   float_col  int_col
0        0.1        1
1        0.2        2
2        0.2        6
3       10.1        8
4        NaN       -1

Conditional indexing

One can also select rows using boolean indexing.

In [7]: df[df['float_col'] > 0.15]
Out[7]:
   float_col  int_col str_col
1        0.2        2       b
2        0.2        6    None
3       10.1        8       c
In [8]: df[df['float_col'] == 0.1]
Out[8]:
   float_col  int_col str_col
0        0.1        1       a

EDIT Suggestion by Roby Levy in the comments below. One can combine multiple boolean conditions with the operators | (or), & (and), and ~ (not), grouping them with parentheses.

In [10]: df[(df['float_col'] > 0.1) & (df['int_col']>2)]
Out[10]:
   float_col  int_col str_col
2        0.2        6    None
3       10.1        8       c

In [11]: df[(df['float_col'] > 0.1) | (df['int_col']>2)]
Out[11]:
   float_col  int_col str_col
1        0.2        2       b
2        0.2        6    None
3       10.1        8       c

In [12]: df[~(df['float_col'] > 0.1)]
Out[12]:
   float_col  int_col str_col
0        0.1        1       a
4        NaN       -1       a
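Beyond the comparison operators, boolean masks can also come from helper methods such as isin and notnull. A small sketch, recreating the example DataFrame from above:

```python
import pandas as pd
from pandas import DataFrame

df = DataFrame({'int_col': [1, 2, 6, 8, -1],
                'float_col': [0.1, 0.2, 0.2, 10.1, None],
                'str_col': ['a', 'b', None, 'c', 'a']})

# rows whose str_col matches any value in the given list
subset = df[df['str_col'].isin(['a', 'b'])]

# rows where float_col is not missing
present = df[df['float_col'].notnull()]
```

These masks combine with |, & and ~ exactly like the comparisons above.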

Renaming columns

Use the rename method to rename columns. By default it returns a new DataFrame with the renamed copy of the data, leaving the original untouched.

In [9]: df2 = df.rename(columns={'int_col' : 'some_other_name'})

In [10]: df2
Out[10]:
   float_col  some_other_name str_col
0        0.1                1       a
1        0.2                2       b
2        0.2                6    None
3       10.1                8       c
4        NaN               -1       a

Set the inplace = True flag in case you want to modify the existing DataFrame.

In [11]: df2.rename(columns={'some_other_name' : 'int_col'}, inplace = True)

In [12]: df2
Out[12]:
   float_col  int_col str_col
0        0.1        1       a
1        0.2        2       b
2        0.2        6    None
3       10.1        8       c
4        NaN       -1       a
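As a side note, rename also accepts a function that is applied to every column label, which is handy for bulk renames. A small sketch:

```python
import pandas as pd
from pandas import DataFrame

df = DataFrame({'int_col': [1, 2], 'float_col': [0.1, 0.2]})

# upper-case every column label; a new DataFrame is returned
df2 = df.rename(columns=str.upper)
```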

Handling missing values

Handling of missing values can be performed beautifully using pandas.

Drop missing values

The dropna method can be used to drop rows or columns with missing data (NaN). By default, it drops all rows containing any missing entry.

In [12]: df2
Out[12]:
   float_col  int_col str_col
0        0.1        1       a
1        0.2        2       b
2        0.2        6    None
3       10.1        8       c
4        NaN       -1       a

In [13]: df2.dropna()
Out[13]:
   float_col  int_col str_col
0        0.1        1       a
1        0.2        2       b
3       10.1        8       c
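dropna can also drop columns instead of rows, or restrict the check to particular columns via the subset argument. A small sketch on the example DataFrame:

```python
import pandas as pd
from pandas import DataFrame

df = DataFrame({'int_col': [1, 2, 6, 8, -1],
                'float_col': [0.1, 0.2, 0.2, 10.1, None],
                'str_col': ['a', 'b', None, 'c', 'a']})

# drop columns (axis=1) that contain any missing value
no_na_cols = df.dropna(axis=1)

# only look at float_col when deciding which rows to drop
kept = df.dropna(subset=['float_col'])
```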

Fill missing values

The fillna method on the other hand can be used to fill missing data (NaN). The example below shows a simple replacement using the mean of the available values.

In [14]: df3 = df.copy()

In [15]: mean = df3['float_col'].mean()

In [16]: df3
Out[16]:
   float_col  int_col str_col
0        0.1        1       a
1        0.2        2       b
2        0.2        6    None
3       10.1        8       c
4        NaN       -1       a

In [17]: df3['float_col'].fillna(mean)
Out[17]:
0     0.10
1     0.20
2     0.20
3    10.10
4     2.65
Name: float_col
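fillna also accepts a dict mapping column names to fill values, so each column can get its own replacement. A small sketch; the 'missing' placeholder string is just an illustrative choice:

```python
import pandas as pd
from pandas import DataFrame

df3 = DataFrame({'int_col': [1, 2, 6, 8, -1],
                 'float_col': [0.1, 0.2, 0.2, 10.1, None],
                 'str_col': ['a', 'b', None, 'c', 'a']})

# per-column fill values: the mean for the floats,
# a placeholder string for str_col
filled = df3.fillna({'float_col': df3['float_col'].mean(),
                     'str_col': 'missing'})
```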

Map, Apply

Forget writing for loops while using pandas. One can do beautiful vectorized computation by applying functions over rows and columns using the map, apply and applymap methods.

map

The map operation operates over each element of a Series.

In [18]: df['str_col'].dropna().map(lambda x : 'map_' + x)
Out[18]:
0    map_a
1    map_b
3    map_c
4    map_a
Name: str_col
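map also accepts a dict, performing a lookup for each element (elements missing from the dict become NaN). A small sketch:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a'])

# dict lookup per element
mapped = s.map({'a': 1, 'b': 2})
```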

apply

The apply method is pretty flexible: as the name suggests, it applies a function along any axis of the DataFrame. The example shows the application of the sum function over the columns. (Thanks to Mindey in the comments below for suggesting np.sum instead of np.sqrt in the example.)

In [19]: df.ix[:,['int_col','float_col']].apply(np.sum)
Out[19]:
int_col      16.0
float_col    10.6
dtype: float64

applymap

The applymap operation can be used to apply the function to each element of the DataFrame.

In [39]: def some_fn(x):
   ....:   if type(x) is str:
   ....:     return 'applymap_' + x
   ....:   elif x:
   ....:     return 100 * x
   ....:   else:
   ....:     return
   ....:

In [40]: df.applymap(some_fn)
Out[40]:
   float_col  int_col     str_col
0         10      100  applymap_a
1         20      200  applymap_b
2         20      600        None
3       1010      800  applymap_c
4        NaN     -100  applymap_a

Vectorized mathematical and string operations

HT: @janschulz

One can perform vectorized calculations using simple operators and numpy functions.

In [4]: df = pd.DataFrame(data={"A":[1,2], "B":[1.2,1.3]})

In [5]: df["C"] = df["A"]+df["B"]

In [6]: df
Out[6]:
   A    B    C
0  1  1.2  2.2
1  2  1.3  3.3

In [7]: df["D"] = df["A"]*3

In [8]: df
Out[8]:
   A    B    C  D
0  1  1.2  2.2  3
1  2  1.3  3.3  6

In [9]: df["E"] = np.sqrt(df["A"])

In [10]: df
Out[10]:
   A    B    C  D         E
0  1  1.2  2.2  3  1.000000
1  2  1.3  3.3  6  1.414214

Also, vectorized string operations are easy to use.

In [11]: df = pd.DataFrame(data={"A":[1,2], "B":[1.2,1.3], "Z":["a","b"]})

In [12]: df
Out[12]:
   A    B  Z
0  1  1.2  a
1  2  1.3  b

In [13]: df["F"] = df.Z.str.upper()

In [14]: df
Out[14]:
   A    B  Z  F
0  1  1.2  a  A
1  2  1.3  b  B
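A couple more vectorized string methods that I find handy, sketched on a throwaway Series:

```python
import pandas as pd

s = pd.Series(['alpha', 'beta', 'gamma'])

lengths = s.str.len()        # length of each string
has_b = s.str.contains('b')  # substring test per element
```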

GroupBy

The groupby method lets you perform SQL-like grouping operations. The example below groups on the str_col column’s entries as keys and calculates the mean of float_col for each key. For more details, please refer to the split-apply-combine description on the pandas website.

In [41]: grouped = df['float_col'].groupby(df['str_col'])

In [42]: grouped.mean()
Out[42]:
str_col
a           0.1
b           0.2
c          10.1
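The agg method computes several aggregations per group in one call. A small sketch on the example DataFrame:

```python
import pandas as pd
from pandas import DataFrame

df = DataFrame({'int_col': [1, 2, 6, 8, -1],
                'float_col': [0.1, 0.2, 0.2, 10.1, None],
                'str_col': ['a', 'b', None, 'c', 'a']})

# mean and non-NaN count of float_col per str_col key
stats = df.groupby('str_col')['float_col'].agg(['mean', 'count'])
```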

New Columns = f(Existing Columns)

Generating new columns from existing columns in a data frame is an integral part of my workflow. This was one of the hardest parts for me to figure out. I hope these examples will save time and effort for other people.

I will try to illustrate it in a piecemeal manner: multiple columns as a function of a single column, a single column as a function of multiple columns, and finally multiple columns as a function of multiple columns.

multiple columns as a function of a single column

I often have to generate multiple columns of a DataFrame as a function of a single column. This comes in handy especially when a method returns a tuple per element. Related Stack Overflow question

In [43]: df4 = df.copy()

In [44]: def two_three_strings(x):
   ....:   return x*2, x*3
   ....:

In [45]: df4['twice'],df4['thrice'] = zip(*df4['int_col'].map(two_three_strings))

In [46]: df4
Out[46]:
   float_col  int_col str_col  twice  thrice
0        0.1        1       a      2       3
1        0.2        2       b      4       6
2        0.2        6    None     12      18
3       10.1        8       c     16      24
4        NaN       -1       a     -2      -3

single column as a function of multiple columns

It’s sometimes useful to generate a single DataFrame column as a function of multiple columns, for example by combining values row by row. Related Stack Overflow question

In [47]: df5 = df.copy()

In [48]: def sum_two_cols(series):
   ....:   return series['int_col'] + series['float_col']
   ....:

In [49]: df5['sum_col'] = df5.apply(sum_two_cols,axis=1)

In [50]: df5
Out[50]:
   float_col  int_col str_col  sum_col
0        0.1        1       a      1.1
1        0.2        2       b      2.2
2        0.2        6    None      6.2
3       10.1        8       c     18.1
4        NaN       -1       a      NaN

multiple columns as a function of multiple columns

Finally, a way to generate a new DataFrame with multiple columns based on multiple columns in an existing DataFrame. Related Stack Overflow question

In [51]: import math

In [52]: def int_float_squares(series):
   ....:   return pd.Series({'int_sq' : series['int_col']**2, 'flt_sq' : series['float_col']**2})
   ....:

In [53]: df.apply(int_float_squares, axis = 1)
Out[53]:
   flt_sq  int_sq
0    0.01       1
1    0.04       4
2    0.04      36
3  102.01      64
4     NaN       1

Stats

Pandas provides nifty methods to understand your data. I am highlighting the describe, cov, and corr methods that I use to quickly make sense of my data.

describe

The describe method provides quick stats on all suitable columns.

In [54]: df.describe()
Out[54]:
       float_col   int_col
count    4.00000  5.000000
mean     2.65000  3.200000
std      4.96689  3.701351
min      0.10000 -1.000000
25%      0.17500  1.000000
50%      0.20000  2.000000
75%      2.67500  6.000000
max     10.10000  8.000000

covariance

The cov method provides the covariance between suitable columns.

In [55]: df.cov()
Out[55]:
           float_col    int_col
float_col  24.670000  12.483333
int_col    12.483333  13.700000

correlation

The corr method provides the correlation between suitable columns.

In [56]: df.corr()
Out[56]:
           float_col   int_col
float_col   1.000000  0.760678
int_col     0.760678  1.000000

Merge and Join

Pandas supports database-like joins which makes it easy to link data frames.

I will use the simple example to highlight the joins using the merge command.

In [57]: df
Out[57]:
   float_col  int_col str_col
0        0.1        1       a
1        0.2        2       b
2        0.2        6    None
3       10.1        8       c
4        NaN       -1       a

In [58]: other = DataFrame({'str_col' : ['a','b'], 'some_val' : [1, 2]})

In [59]: other
Out[59]:
   some_val str_col
0         1       a
1         2       b

The inner, outer, left and right joins are shown below. The data frames are joined using the str_col keys.

In [60]: pd.merge(df,other,on='str_col',how='inner')
Out[60]:
   float_col  int_col str_col  some_val
0        0.1        1       a         1
1        NaN       -1       a         1
2        0.2        2       b         2

In [61]: pd.merge(df,other,on='str_col',how='outer')
Out[61]:
   float_col  int_col str_col  some_val
0        0.1        1       a         1
1        NaN       -1       a         1
2        0.2        2       b         2
3        0.2        6    None       NaN
4       10.1        8       c       NaN

In [62]: pd.merge(df,other,on='str_col',how='left')
Out[62]:
   float_col  int_col str_col  some_val
0        0.1        1       a         1
1        NaN       -1       a         1
2        0.2        2       b         2
3        0.2        6    None       NaN
4       10.1        8       c       NaN

In [63]: pd.merge(df,other,on='str_col',how='right')
Out[63]:
   float_col  int_col str_col  some_val
0        0.1        1       a         1
1        NaN       -1       a         1
2        0.2        2       b         2

Plots

I was thoroughly surprised by the plotting capabilities of the pandas library. There are several plotting methods available. I am highlighting a couple of simple plots that I use the most.

Let’s start with a simple data frame to plot.

In [3]: plot_df = DataFrame(np.random.randn(1000,2),columns=['x','y'])

In [4]: plot_df['y'] = plot_df['y'].map(lambda x : x + 1)

Plot

A simple plot command goes a long way.

In [5]: plot_df.plot()
Out[5]: <matplotlib.axes.AxesSubplot at 0x10e6cad10>

(figure: line plot of the x and y columns)

Histograms

I really enjoy histograms to get a quick idea about the distribution of the data.

In [6]: plot_df.hist()
Out[6]: array([[Axes(0.125,0.1;0.336957x0.8), Axes(0.563043,0.1;0.336957x0.8)]], dtype=object)

(figure: histograms of the x and y columns)

Scikit-learn conversion

This took me a non-trivial amount of time to figure out, and I hope others can avoid this mistake. According to the pandas documentation, the ndarray obtained via values has object dtype if the values contain more than just float and integer dtypes. Even if you slice the str columns away, the resulting array will still have object dtype and might not play well with libraries such as scikit-learn, which expect a float dtype. Explicitly converting the type works well in this scenario.

In [18]: df
Out[18]:
   float_col  int_col str_col
0        0.1        1       a
1        0.2        2       b
2        0.2        6     NaN
3       10.1        8       c
4        NaN       -1       a

In [25]: df.values[:,:-1]
Out[25]:
array([[0.1, 1],
       [0.2, 2],
       [0.2, 6],
       [10.1, 8],
       [nan, -1]], dtype=object)

In [26]: df.values[:,:-1].astype(np.float32)
Out[26]:
array([[  0.1       ,   1.        ],
       [  0.2       ,   2.        ],
       [  0.2       ,   6.        ],
       [ 10.10000038,   8.        ],
       [         nan,  -1.        ]], dtype=float32)
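Putting the pieces together, one practical recipe is to select the numeric columns explicitly, fill the missing values, and only then convert. The fill value of 0.0 here is just a placeholder; pick whatever imputation suits your problem. A small sketch:

```python
import numpy as np
import pandas as pd
from pandas import DataFrame

df = DataFrame({'int_col': [1, 2, 6, 8, -1],
                'float_col': [0.1, 0.2, 0.2, 10.1, None],
                'str_col': ['a', 'b', None, 'c', 'a']})

# select numeric columns, impute missing values, convert to float array
X = df[['float_col', 'int_col']].fillna(0.0).values.astype(np.float64)
```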

EDIT HT: Wes Turner via comments. The sklearn-pandas library looks great for bridging pandas and scikit-learn.

Summary

I hope these examples will help new users quickly extract a lot of value out of pandas and serve as a useful quick reference for the pandas pros.

Happy munging!
