Summary of grouping / aggregation / transformation in Python pandas

This is the pandas version of the following post.

<a href="http://sinhrks.hatenablog.com/entry/2014/10/13/003717">Summary of grouping / aggregation / transformation in R dplyr and tidyr - StatsFragments</a>

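Setup

The sample data is iris.

Note (added 11/26): if you have rpy2 set up, load the data through rpy2. Otherwise, download it as a .csv file from the URL given in the code below and read it in (passing the URL directly to read_csv as the file path also works).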

import pandas as pd
import numpy as np
# Set the maximum number of rows to display
pd.options.display.max_rows=5

# Load iris in one of the following two ways

# Load iris from R via rpy2 (pandas.rpy exists only in old pandas releases)
# import pandas.rpy.common as com
# iris = com.load_data('iris')

# Read from csv
# http://aima.cs.berkeley.edu/data/iris.csv
names = ['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species']
iris = pd.read_csv('iris.csv', header=None, names=names)

iris 
#      Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
# 1             5.1          3.5           1.4          0.2     setosa
# 2             4.9          3.0           1.4          0.2     setosa
# ..            ...          ...           ...          ...        ...
# 149           6.2          3.4           5.4          2.3  virginica
# 150           5.9          3.0           5.1          1.8  virginica
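As a minimal sketch of what the read_csv call above does, the snippet below reads two rows copied from the output above out of an in-memory string instead of a file. The names csv_text and sample are made up for this illustration. The point is header=None, which tells pandas that the first line is data, with the column labels supplied by names.

```python
import io

import pandas as pd

# Two rows in the same headerless layout as iris.csv
# (copied from the output above; the real file has 150 rows).
csv_text = "5.1,3.5,1.4,0.2,setosa\n4.9,3.0,1.4,0.2,setosa\n"

names = ['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species']

# header=None: treat the first line as data, not as column names.
sample = pd.read_csv(io.StringIO(csv_text), header=None, names=names)
print(sample.shape)
print(list(sample.columns))
```

Without header=None, the first data row would be consumed as the header line.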

Grouping / aggregation

Aggregating by the values of a column

To compute the sum of the Sepal.Length column for each value of the Species column:

iris.groupby('Species')['Sepal.Length'].sum()
# Species
# setosa        250.3
# versicolor    296.8
# virginica     329.4
# Name: Sepal.Length, dtype: float64

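To get the totals for all columns, call the aggregation function directly on the DataFrame.groupby result. In the pandas version used here, columns that cannot be aggregated are dropped automatically; pandas 2.0 and later need numeric_only=True to get that behavior.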

iris.groupby('Species').sum()
#             Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
# Species                                                         
# setosa             250.3        171.4          73.1         12.3
# versicolor         296.8        138.5         213.0         66.3
# virginica          329.4        148.7         277.6        101.3
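iris has no non-numeric column other than the grouping key, so the filtering does not show up in the output above. The sketch below uses a small made-up frame (the column names group, x, and label are hypothetical) with a string column, and passes numeric_only=True explicitly so that the result should not depend on the version-specific default.

```python
import pandas as pd

# Hypothetical toy data: 'label' is a string column that cannot be summed numerically.
df = pd.DataFrame({
    'group': ['a', 'a', 'b'],
    'x': [1.0, 2.0, 3.0],
    'label': ['p', 'q', 'r'],
})

# numeric_only=True keeps only the numeric columns in the result.
totals = df.groupby('group').sum(numeric_only=True)
print(totals)
```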

Selecting the columns to aggregate works the same way as column selection on a DataFrame. Nice and clean.

iris.groupby('Species')[['Petal.Width', 'Petal.Length']].sum()
#             Petal.Width  Petal.Length
# Species                              
# setosa             12.3          73.1
# versicolor         66.3         213.0
# virginica         101.3         277.6

To pass a separately defined aggregation function instead of calling a method, use .apply. If the function is given as a string, eval it when passing it.

iris.groupby('Species')[['Petal.Width', 'Petal.Length']].apply(np.sum)
#             Petal.Width  Petal.Length
# Species                              
# setosa             12.3          73.1
# versicolor         66.3         213.0
# virginica         101.3         277.6

iris.groupby('Species')[['Petal.Width', 'Petal.Length']].apply(eval('np.sum'))
#             Petal.Width  Petal.Length
# Species                              
# setosa             12.3          73.1
# versicolor         66.3         213.0
# virginica         101.3         277.6
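When the function name arrives as a string, one alternative to eval is to hand the string to .agg, which accepts the names of the built-in aggregations such as 'sum' and 'mean'. This is a sketch on made-up data (df, func_name, and the column names are hypothetical), not a claim about how the original examples must be written.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    'group': ['a', 'a', 'b'],
    'x': [1.0, 2.0, 3.0],
    'y': [10.0, 20.0, 30.0],
})

func_name = 'sum'  # e.g. read from a config file or user input

# .agg resolves the string to the matching groupby aggregation.
by_name = df.groupby('group')[['x', 'y']].agg(func_name)

# Same result as calling the method directly.
by_method = df.groupby('group')[['x', 'y']].sum()
print(by_name.equals(by_method))
```

This avoids evaluating arbitrary text as Python code, which matters if the string comes from outside the program.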

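To pass several aggregation functions, use .agg. It takes a dictionary of column name : aggregation function(s), so a different set of functions can be used for each column.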

iris.groupby('Species').agg({'Petal.Length': [np.sum, np.mean], 'Petal.Width': [np.sum, np.mean]})
#            Petal.Length        Petal.Width       
#                     sum   mean         sum   mean
# Species                                          
# setosa             73.1  1.462        12.3  0.246
# versicolor        213.0  4.260        66.3  1.326
# virginica         277.6  5.552       101.3  2.026
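The dictionary form produces two-level column labels. If flat, hand-picked output names are preferred, newer pandas (0.25 and later, as far as I know) also supports keyword-style "named aggregation". A minimal sketch on made-up data follows; the output names x_total, x_mean, and y_max are arbitrary choices for this illustration.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    'group': ['a', 'a', 'b'],
    'x': [1.0, 2.0, 3.0],
    'y': [10.0, 20.0, 30.0],
})

# Each keyword is an output column name; the tuple is (input column, aggregation).
result = df.groupby('group').agg(
    x_total=('x', 'sum'),
    x_mean=('x', 'mean'),
    y_max=('y', 'max'),
)
print(result)
```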

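Long / wide conversion

Expanding values held in multiple columns into rows (unpivot / melt)

To expand values held in multiple columns into rows, use pd.melt. Note that in the pandas version used here it is a function only, not a DataFrame method (DataFrame.melt was added in later releases, from v0.20).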

melted = pd.melt(iris, id_vars=['Species'], var_name='variable', value_name='value')
melted
#        Species      variable  value
# 0       setosa  Sepal.Length    5.1
# 1       setosa  Sepal.Length    4.9
# ..         ...           ...    ...
# 598  virginica   Petal.Width    2.3
# 599  virginica   Petal.Width    1.8
# 
# [600 rows x 3 columns]
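For readers on a newer pandas, the sketch below compares the function form with the method form DataFrame.melt, assuming pandas 0.20 or later. The two-row frame wide is made up for the illustration.

```python
import pandas as pd

# Hypothetical two-row wide table.
wide = pd.DataFrame({
    'Species': ['setosa', 'virginica'],
    'Sepal.Length': [5.1, 5.9],
    'Petal.Length': [1.4, 5.1],
})

# Function form, as used above.
long_fn = pd.melt(wide, id_vars=['Species'], var_name='variable', value_name='value')

# Method form (assumes pandas >= 0.20).
long_method = wide.melt(id_vars=['Species'], var_name='variable', value_name='value')

print(long_fn.equals(long_method))
print(long_fn.shape)
```

Each of the two value columns contributes one row per original row, so two rows become four.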

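Converting values held in multiple rows into columns (pivot)

Use DataFrame.pivot. There is also a separate function, pd.pivot_table, that aggregates while pivoting.

# Prepare the data to pivot. Each combination of Species (which becomes the columns) and variable (which becomes the rows) must be unique.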
unpivot = melted.groupby(['Species', 'variable']).sum()
unpivot = unpivot.reset_index()
unpivot
#       Species      variable  value
# 0      setosa  Petal.Length   73.1
# 1      setosa   Petal.Width   12.3
# ..        ...           ...    ...
# 10  virginica  Sepal.Length  329.4
# 11  virginica   Sepal.Width  148.7
# 
# [12 rows x 3 columns]

unpivot.pivot(index='variable', columns='Species', values='value')
# Species       setosa  versicolor  virginica
# variable                                   
# Petal.Length    73.1       213.0      277.6
# Petal.Width     12.3        66.3      101.3
# Sepal.Length   250.3       296.8      329.4
# Sepal.Width    171.4       138.5      148.7
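The groupby step above exists only to make the (Species, variable) pairs unique. As a sketch of the pd.pivot_table alternative mentioned earlier, the example below pivots a small made-up long table that still contains a duplicate pair, letting aggfunc do the aggregation. The frame long_df and its values are hypothetical.

```python
import pandas as pd

# Hypothetical long table; (setosa, Petal.Length) appears twice.
long_df = pd.DataFrame({
    'Species': ['setosa', 'setosa', 'virginica', 'virginica'],
    'variable': ['Petal.Length', 'Petal.Length', 'Petal.Length', 'Petal.Width'],
    'value': [1.4, 1.3, 5.1, 1.8],
})

# DataFrame.pivot refuses duplicate index/column pairs.
try:
    long_df.pivot(index='variable', columns='Species', values='value')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table aggregates the duplicates instead (here: sum).
table = pd.pivot_table(long_df, index='variable', columns='Species',
                       values='value', aggfunc='sum')
print(pivot_failed)
print(table)
```

Combinations that never occur in the input, such as setosa with Petal.Width here, come out as NaN.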

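Splitting / joining columns

Splitting the values of one column into multiple columns

pandas has no direct counterpart to tidyr::separate. With .str.split, the split strings end up as a list stored in a single column, so the resulting lists have to be stored back into individual columns.

Revised 2014/11/17: from v0.15.1 on this can be done easily with the return_type='frame' option of str.split, so the example was updated. Note that with the default (return_type='series'), the split lists are stored in a single column of object dtype.

Added 2015/01/16: from v0.16.1 on the return_type option is deprecated and replaced by the expand option. Specifying expand=True does the same thing.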

melted2 = melted.copy()
melted2
#        Species      variable  value
# 0       setosa  Sepal.Length    5.1
# 1       setosa  Sepal.Length    4.9
# ..         ...           ...    ...
# 598  virginica   Petal.Width    2.3
# 599  virginica   Petal.Width    1.8
# 
# [600 rows x 3 columns]

# expand=True is the replacement for the deprecated return_type='frame'
melted2[['Parts', 'Scale']] = melted2['variable'].str.split('.', expand=True)
melted2
#        Species      variable  value  Parts   Scale
# 0       setosa  Sepal.Length    5.1  Sepal  Length
# 1       setosa  Sepal.Length    4.9  Sepal  Length
# ..         ...           ...    ...    ...     ...
# 598  virginica   Petal.Width    2.3  Petal   Width
# 599  virginica   Petal.Width    1.8  Petal   Width
# 
# [600 rows x 5 columns]

# Drop the column that is no longer needed
melted2.drop('variable', axis=1)
#        Species  value  Parts   Scale
# 0       setosa    5.1  Sepal  Length
# 1       setosa    4.9  Sepal  Length
# ..         ...    ...    ...     ...
# 598  virginica    2.3  Petal   Width
# 599  virginica    1.8  Petal   Width
# 
# [600 rows x 4 columns]
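The difference between the list-valued result and the expanded result is easier to see on a tiny Series. The sketch below assumes a pandas version that has the expand option; the Series s is made up for the illustration.

```python
import pandas as pd

s = pd.Series(['Sepal.Length', 'Petal.Width'])

# Default: one object column whose elements are Python lists.
as_lists = s.str.split('.')

# expand=True: one column per split piece, returned as a DataFrame.
as_frame = s.str.split('.', expand=True)

print(as_lists.tolist())
print(as_frame.shape)
```

The expanded frame has integer column labels (0, 1), which is why the example above assigns it to a list of new column names.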

The same can be done with a regular expression via .str.extract.

melted3 = melted.copy()
# Raw string so that \. reaches the regex engine as an escaped dot
melted3[['Parts', 'Scale']] = melted3['variable'].str.extract(r'(.+)\.(.+)', expand=True)
melted3 = melted3.drop('variable', axis=1)
melted3
#        Species  value  Parts   Scale
# 0       setosa    5.1  Sepal  Length
# 1       setosa    4.9  Sepal  Length
# ..         ...    ...    ...     ...
# 598  virginica    2.3  Petal   Width
# 599  virginica    1.8  Petal   Width
# 
# [600 rows x 4 columns]
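As a small variation, named groups in the pattern let .str.extract label the output columns itself. This is a sketch on a made-up Series; the group names Parts and Scale simply mirror the column names used above.

```python
import pandas as pd

s = pd.Series(['Sepal.Length', 'Petal.Width'])

# (?P<name>...) groups become the column names of the result.
parts = s.str.extract(r'(?P<Parts>.+)\.(?P<Scale>.+)', expand=True)
print(parts)
```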

Joining the values of multiple columns into one column

Plain string concatenation does the job.

melted3['variable'] = melted3['Parts'] + '.' + melted3['Scale']
melted3
#        Species  value  Parts   Scale      variable
# 0       setosa    5.1  Sepal  Length  Sepal.Length
# 1       setosa    4.9  Sepal  Length  Sepal.Length
# ..         ...    ...    ...     ...           ...
# 598  virginica    2.3  Petal   Width   Petal.Width
# 599  virginica    1.8  Petal   Width   Petal.Width
# 
# [600 rows x 5 columns]
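An equivalent way to write the join is Series.str.cat with a separator. The sketch below checks on a made-up two-row frame that both spellings give the same result; which one reads better is a matter of taste.

```python
import pandas as pd

# Hypothetical toy data with the same column names as above.
df = pd.DataFrame({
    'Parts': ['Sepal', 'Petal'],
    'Scale': ['Length', 'Width'],
})

# Operator form, as used above.
joined_plus = df['Parts'] + '.' + df['Scale']

# Method form with an explicit separator.
joined_cat = df['Parts'].str.cat(df['Scale'], sep='.')

print(joined_plus.tolist())
print(joined_plus.equals(joined_cat))
```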