【3.3.4】Pandas--DataFrame的行索引和列索引的过渡(stack/unstack/pivot)

  • stack()方法是针对索引或者标签的,即将列索引转成最内层的行索引
  • pivot()针对列的值,即指定某列的值作为行索引,指定某列的值作为列索引,然后再指定哪些列作为索引对应的值
  • unstack()是stack()的逆操作

一、原理说明

unstack(self, level=-1, fill_value=None)、pivot(self, index=None, columns=None, values=None,对比这两个方法的参数,这里要注意的是,对于pivot(),如果参数values指定了不止一列作为值的话,那么生成的DataFrame的列索引就会出现层次索引,最外层的索引为原来的列标签;unstack()没有指定值的参数,会把剩下的列都作为值,即把剩下的列标签都作为最外层的索引,每个索引对应一个子表。

pivot()方法其实比较容易理解,就是指定相应的列分别作为行、列索引以及值。下面我们通过几张原理图详细说明stack()和unstack(),最后再通过一个具体的例子来对比stack()、unstack()和pivot()这三种方法。

先看stack(),如图。stack()是将原来的列索引转成了最内层的行索引,这里是多层次索引,其中AB索引对应第三层,即最内层索引。

再看unstack(),如图。显然,unstack()是stack()的逆操作,这里把最内层的行索引还原成了列索引。但是unstack()中有一个参数可以指定旋转第几层索引,比如unstack(0)就是把第一层行索引转成列索引,但默认的是把最内层索引转层列索引。

二、具体例子

3.1 stack

DataFrame.stack(self, level=-1, dropna=True)[source]

说明:

  • level : int, string, or list of these, default -1 (last level) Level(s) of index to unstack, can pass level name (从外到里 0,1,3,… n 或者是 -n,… -2,-1 。 默认是-1,即最里层)
  • fill_value : replace NaN with this value if the stack produces missing values

例子

>>> df_single_level_cols = pd.DataFrame([[0, 1], [2, 3]],
...                                     index=['cat', 'dog'],
...                                     columns=['weight', 'height'])

Stacking a dataframe with a single level column axis returns a Series:

>>> df_single_level_cols
     weight height
cat       0      1
dog       2      3

>>> df_single_level_cols.stack()
cat  weight    0
     height    1
dog  weight    2
     height    3
dtype: int64

Multi level columns: simple case

>>> multicol1 = pd.MultiIndex.from_tuples([('weight', 'kg'),
...                                        ('weight', 'pounds')])
>>> df_multi_level_cols1 = pd.DataFrame([[1, 2], [2, 4]],
...                                     index=['cat', 'dog'],
...                                     columns=multicol1)

Stacking a dataframe with a multi-level column axis:

>>> df_multi_level_cols1
     weight
         kg    pounds
cat       1        2
dog       2        4
>>> df_multi_level_cols1.stack()
            weight
cat kg           1
    pounds       2
dog kg           2
    pounds       4

Missing values

>>> multicol2 = pd.MultiIndex.from_tuples([('weight', 'kg'),
...                                        ('height', 'm')])
>>> df_multi_level_cols2 = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]],
...                                     index=['cat', 'dog'],
...                                     columns=multicol2)

It is common to have missing values when stacking a dataframe with multi-level columns, as the stacked dataframe typically has more values than the original dataframe. Missing values are filled with NaNs:

>>> df_multi_level_cols2
    weight height
        kg      m
cat    1.0    2.0
dog    3.0    4.0
>>> df_multi_level_cols2.stack()
        height  weight
cat kg     NaN     1.0
    m      2.0     NaN
dog kg     NaN     3.0
    m      4.0     NaN

Prescribing the level(s) to be stacked

The first parameter controls which level or levels are stacked:

>>> df_multi_level_cols2.stack(0)
             kg    m
cat height  NaN  2.0
    weight  1.0  NaN
dog height  NaN  4.0
    weight  3.0  NaN
>>> df_multi_level_cols2.stack([0, 1])
cat  height  m     2.0
     weight  kg    1.0
dog  height  m     4.0
     weight  kg    3.0
dtype: float64

Dropping missing values

>>> df_multi_level_cols3 = pd.DataFrame([[None, 1.0], [2.0, 3.0]],
...                                     index=['cat', 'dog'],
...                                     columns=multicol2)

Note that rows where all values are missing are dropped by default but this behaviour can be controlled via the dropna keyword parameter:

>>> df_multi_level_cols3
    weight height
        kg      m
cat    NaN    1.0
dog    2.0    3.0
>>> df_multi_level_cols3.stack(dropna=False)
        height  weight
cat kg     NaN     NaN
    m      1.0     NaN
dog kg     NaN     2.0
    m      3.0     NaN
>>> df_multi_level_cols3.stack(dropna=True)
        height  weight
cat m      1.0     NaN
dog kg     NaN     2.0
    m      3.0     NaN

3.2 unstack

DataFrame.unstack(self, level=-1, fill_value=None)[source]

说明:

  • level : int, string, or list of these, default -1 (last level) Level(s) of index to unstack, can pass level name (从外到里 0,1,3,… n 或者是 -n,… -2,-1 。 默认是-1,即最里层)
  • fill_value : replace NaN with this value if the unstack produces missing values

例子:

>>> index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'),
...                                    ('two', 'a'), ('two', 'b')])
>>> s = pd.Series(np.arange(1.0, 5.0), index=index)
>>> s
one  a   1.0
     b   2.0
two  a   3.0
     b   4.0
dtype: float64

>>> s.unstack(level=-1)
     a   b
one  1.0  2.0
two  3.0  4.0

>>> s.unstack(level=0)
   one  two
a  1.0   3.0
b  2.0   4.0

>>> df = s.unstack(level=0)
>>> df.unstack()
one  a  1.0
     b  2.0
two  a  3.0
     b  4.0
dtype: float64

3.3 pivot

DataFrame.pivot(self, index=None, columns=None, values=None)[source]

例子:

>>> df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two',
...                            'two'],
...                    'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
...                    'baz': [1, 2, 3, 4, 5, 6],
...                    'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
>>> df
    foo   bar  baz  zoo
0   one   A    1    x
1   one   B    2    y
2   one   C    3    z
3   two   A    4    q
4   two   B    5    w
5   two   C    6    t
>>> df.pivot(index='foo', columns='bar', values='baz')
bar  A   B   C
foo
one  1   2   3
two  4   5   6
>>> df.pivot(index='foo', columns='bar')['baz']
bar  A   B   C
foo
one  1   2   3
two  4   5   6
>>> df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])
      baz       zoo
bar   A  B  C   A  B  C
foo
one   1  2  3   x  y  z
two   4  5  6   q  w  t

A ValueError is raised if there are any duplicates.

>>> df = pd.DataFrame({"foo": ['one', 'one', 'two', 'two'],
...                    "bar": ['A', 'A', 'B', 'C'],
...                    "baz": [1, 2, 3, 4]})
>>> df
   foo bar  baz
0  one   A    1
1  one   A    2
2  two   B    3
3  two   C    4

Notice that the first two rows are the same for our index and columns arguments.

>>> df.pivot(index='foo', columns='bar', values='baz')
Traceback (most recent call last):
   ...
ValueError: Index contains duplicate entries, cannot reshape

参考资料

药企,独角兽,苏州。团队长期招人,感兴趣的都可以发邮件聊聊:tiehan@sina.cn
个人公众号,比较懒,很少更新,可以在上面提问题,如果回复不及时,可发邮件给我: tiehan@sina.cn