Cookbook

原文:http://pandas.pydata.org/pandas-docs/stable/cookbook.html

译者:飞龙 UsyiyiCN

校对:(虚位以待)

这里有大量的例程和链接,就像食谱中 简单而可口 示例一样,希望能对您学习pandas库有用。我们鼓励用户为此文档添加内容。

如果您向此部分中添加链接或者例子,我们会把它当做 First Pull Request 处理。

我们贴在此处的示例大多都是经过简化、凝练的,对新用户友好。这些例子可能已经在Stack-Overflow进行了讨论或者已经出现在了GitHub链接中。那些链接包含有更多的信息,这里的例程并没有扩展讲解。

为了照顾初学者,在导入的模块中只有Pandas(pd)和Numpy(np)进行了缩写,其余的模块导入时都保留了原来的名称。

这些例子是为python 3.4编写的。较早的python版本或许需要修改一下。

Idioms

这里有一些整齐优雅的pandas惯用写法

使用if-then / if-then-else选定某一列中某些值,但是却修改这些值所在行的其他一列或若干列中的值:

In [1]: df = pd.DataFrame(
   ...:      {'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]}); df
   ...: 
Out[1]: 
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

if-then...

if-then 选中修改一列的值

In [2]: df.ix[df.AAA >= 5,'BBB'] = -1; df
Out[2]: 
   AAA  BBB  CCC
0    4   10  100
1    5   -1   50
2    6   -1  -30
3    7   -1  -50

if-then 选中修改2列的值

In [3]: df.ix[df.AAA >= 5,['BBB','CCC']] = 555; df
Out[3]: 
   AAA  BBB  CCC
0    4   10  100
1    5  555  555
2    6  555  555
3    7  555  555

此例与上一例逻辑相反,用来实现 else

In [4]: df.ix[df.AAA < 5,['BBB','CCC']] = 2000; df
Out[4]: 
   AAA   BBB   CCC
0    4  2000  2000
1    5   555   555
2    6   555   555
3    7   555   555

或者你也可以使用mask

In [5]: df_mask = pd.DataFrame({'AAA' : [True] * 4, 'BBB' : [False] * 4,'CCC' : [True,False] * 2})

In [6]: df.where(df_mask,-1000)
Out[6]: 
   AAA   BBB   CCC
0    4 -1000  2000
1    5 -1000 -1000
2    6 -1000   555
3    7 -1000 -1000

通过numpy's where(),实现if-then-else的逻辑

In [7]: df = pd.DataFrame(
   ...:      {'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]}); df
   ...: 
Out[7]: 
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

In [8]: df['logic'] = np.where(df['AAA'] > 5,'high','low'); df
Out[8]: 
   AAA  BBB  CCC logic
0    4   10  100   low
1    5   20   50   low
2    6   30  -30  high
3    7   40  -50  high

Splitting

使用布尔值标准选择数据

In [9]: df = pd.DataFrame(
   ...:      {'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]}); df
   ...: 
Out[9]: 
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

In [10]: dflow = df[df.AAA <= 5]

In [11]: dfhigh = df[df.AAA > 5]

In [12]: dflow; dfhigh
Out[12]: 
   AAA  BBB  CCC
2    6   30  -30
3    7   40  -50

Building Criteria

根据多列的值选择

In [13]: df = pd.DataFrame(
   ....:      {'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]}); df
   ....: 
Out[13]: 
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

...实现‘与’操作(没有更改原内容而是返回了一个新的Series)

In [14]: newseries = df.loc[(df['BBB'] < 25) & (df['CCC'] >= -40), 'AAA']; newseries
Out[14]: 
0    4
1    5
Name: AAA, dtype: int64

...实现“或”操作(没有更改原内容而是返回了一个新的Series)

In [15]: newseries = df.loc[(df['BBB'] > 25) | (df['CCC'] >= -40), 'AAA']; newseries;

...实现“或”操作(直接修改了DataFrame的内容)

In [16]: df.loc[(df['BBB'] > 25) | (df['CCC'] >= 75), 'AAA'] = 0.1; df
Out[16]: 
   AAA  BBB  CCC
0  0.1   10  100
1  5.0   20   50
2  0.1   30  -30
3  0.1   40  -50

使用argsort选择数据最接近某个值的行

In [17]: df = pd.DataFrame(
   ....:      {'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]}); df
   ....: 
Out[17]: 
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

In [18]: aValue = 43.0

In [19]: df.ix[(df.CCC-aValue).abs().argsort()]
Out[19]: 
   AAA  BBB  CCC
1    5   20   50
0    4   10  100
2    6   30  -30
3    7   40  -50

使用二元运算符动态地减少条件列表

In [20]: df = pd.DataFrame(
   ....:      {'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]}); df
   ....: 
Out[20]: 
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

In [21]: Crit1 = df.AAA <= 5.5

In [22]: Crit2 = df.BBB == 10.0

In [23]: Crit3 = df.CCC > -40.0

可以硬编码:

In [24]: AllCrit = Crit1 & Crit2 & Crit3

...或者可以使用动态构建的标准列表

In [25]: CritList = [Crit1,Crit2,Crit3]

In [26]: AllCrit = functools.reduce(lambda x,y: x & y, CritList)

In [27]: df[AllCrit]
Out[27]: 
   AAA  BBB  CCC
0    4   10  100

Selection

DataFrames

indexing文档。

使用行标签和值条件

In [28]: df = pd.DataFrame(
   ....:      {'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]}); df
   ....: 
Out[28]: 
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

In [29]: df[(df.AAA <= 6) & (df.index.isin([0,2,4]))]
Out[29]: 
   AAA  BBB  CCC
0    4   10  100
2    6   30  -30

使用loc进行面向标签切片和iloc位置切片

In [30]: data = {'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]}

In [31]: df = pd.DataFrame(data=data,index=['foo','bar','boo','kar']); df
Out[31]: 
     AAA  BBB  CCC
foo    4   10  100
bar    5   20   50
boo    6   30  -30
kar    7   40  -50

有2个显式切片方法,第三个一般情况

  1. 位置导向(Python切片风格:不包括end)
  2. 面向标签(非Python切片样式:包括​​end)
  3. 一般(切片样式:取决于切片是否包含标签或位置)
In [32]: df.loc['bar':'kar'] #Label
Out[32]: 
     AAA  BBB  CCC
bar    5   20   50
boo    6   30  -30
kar    7   40  -50

#Generic
In [33]: df.ix[0:3] #Same as .iloc[0:3]
Out[33]: 
     AAA  BBB  CCC
foo    4   10  100
bar    5   20   50
boo    6   30  -30

In [34]: df.ix['bar':'kar'] #Same as .loc['bar':'kar']
Out[34]: 
     AAA  BBB  CCC
bar    5   20   50
boo    6   30  -30
kar    7   40  -50

当索引由具有非零开始或非单位增量的整数组成时,会出现模糊性。

In [35]: df2 = pd.DataFrame(data=data,index=[1,2,3,4]); #Note index starts at 1.

In [36]: df2.iloc[1:3] #Position-oriented
Out[36]: 
   AAA  BBB  CCC
2    5   20   50
3    6   30  -30

In [37]: df2.loc[1:3] #Label-oriented
Out[37]: 
   AAA  BBB  CCC
1    4   10  100
2    5   20   50
3    6   30  -30

In [38]: df2.ix[1:3] #General, will mimic loc (label-oriented)
Out[38]: 
   AAA  BBB  CCC
1    4   10  100
2    5   20   50
3    6   30  -30

In [39]: df2.ix[0:3] #General, will mimic iloc (position-oriented), as loc[0:3] would raise a KeyError
Out[39]: 
   AAA  BBB  CCC
1    4   10  100
2    5   20   50
3    6   30  -30

使用反向运算符(〜)来取掩码的补码

In [40]: df = pd.DataFrame(
   ....:      {'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40], 'CCC' : [100,50,-30,-50]}); df
   ....: 
Out[40]: 
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

In [41]: df[~((df.AAA <= 6) & (df.index.isin([0,2,4])))]
Out[41]: 
   AAA  BBB  CCC
1    5   20   50
3    7   40  -50

Panels

通过移调,添加新尺寸并转换回原始尺寸来扩展面板框架

In [42]: rng = pd.date_range('1/1/2013',periods=100,freq='D')

In [43]: data = np.random.randn(100, 4)

In [44]: cols = ['A','B','C','D']

In [45]: df1, df2, df3 = pd.DataFrame(data, rng, cols), pd.DataFrame(data, rng, cols), pd.DataFrame(data, rng, cols)

In [46]: pf = pd.Panel({'df1':df1,'df2':df2,'df3':df3});pf
Out[46]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 100 (major_axis) x 4 (minor_axis)
Items axis: df1 to df3
Major_axis axis: 2013-01-01 00:00:00 to 2013-04-10 00:00:00
Minor_axis axis: A to D

#Assignment using Transpose  (pandas < 0.15)
In [47]: pf = pf.transpose(2,0,1)

In [48]: pf['E'] = pd.DataFrame(data, rng, cols)

In [49]: pf = pf.transpose(1,2,0);pf
Out[49]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 100 (major_axis) x 5 (minor_axis)
Items axis: df1 to df3
Major_axis axis: 2013-01-01 00:00:00 to 2013-04-10 00:00:00
Minor_axis axis: A to E

#Direct assignment (pandas > 0.15)
In [50]: pf.loc[:,:,'F'] = pd.DataFrame(data, rng, cols);pf
Out[50]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 100 (major_axis) x 6 (minor_axis)
Items axis: df1 to df3
Major_axis axis: 2013-01-01 00:00:00 to 2013-04-10 00:00:00
Minor_axis axis: A to F

使用np.where屏蔽面板,然后使用新的屏蔽值重建面板

New Columns

使用applymap有效且动态地创建新列

In [51]: df = pd.DataFrame(
   ....:      {'AAA' : [1,2,1,3], 'BBB' : [1,1,2,2], 'CCC' : [2,1,3,1]}); df
   ....: 
Out[51]: 
   AAA  BBB  CCC
0    1    1    2
1    2    1    1
2    1    2    3
3    3    2    1

In [52]: source_cols = df.columns # or some subset would work too.

In [53]: new_cols = [str(x) + "_cat" for x in source_cols]

In [54]: categories = {1 : 'Alpha', 2 : 'Beta', 3 : 'Charlie' }

In [55]: df[new_cols] = df[source_cols].applymap(categories.get);df
Out[55]: 
   AAA  BBB  CCC  AAA_cat BBB_cat  CCC_cat
0    1    1    2    Alpha   Alpha     Beta
1    2    1    1     Beta   Alpha    Alpha
2    1    2    3    Alpha    Beta  Charlie
3    3    2    1  Charlie    Beta    Alpha

在groupby中使用min()时保留其他列

In [56]: df = pd.DataFrame(
   ....:      {'AAA' : [1,1,1,2,2,2,3,3], 'BBB' : [2,1,3,4,5,1,2,3]}); df
   ....: 
Out[56]: 
   AAA  BBB
0    1    2
1    1    1
2    1    3
3    2    4
4    2    5
5    2    1
6    3    2
7    3    3

方法1:idxmin()获取最小的索引的完整数据原格式

In [57]: df.loc[df.groupby("AAA")["BBB"].idxmin()]
Out[57]: 
   AAA  BBB
1    1    1
5    2    1
6    3    2

方法2:排序,然后取第一

In [58]: df.sort_values(by="BBB").groupby("AAA", as_index=False).first()
Out[58]: 
   AAA  BBB
0    1    1
1    2    1
2    3    2

注意相同的结果,除了索引。

MultiIndexing

multindexing

从带标签的DataFrame创建多索引

In [59]: df = pd.DataFrame({'row' : [0,1,2],
   ....:                    'One_X' : [1.1,1.1,1.1],
   ....:                    'One_Y' : [1.2,1.2,1.2],
   ....:                    'Two_X' : [1.11,1.11,1.11],
   ....:                    'Two_Y' : [1.22,1.22,1.22]}); df
   ....: 
Out[59]: 
   One_X  One_Y  Two_X  Two_Y  row
0    1.1    1.2   1.11   1.22    0
1    1.1    1.2   1.11   1.22    1
2    1.1    1.2   1.11   1.22    2

# As Labelled Index
In [60]: df = df.set_index('row');df
Out[60]: 
     One_X  One_Y  Two_X  Two_Y
row                            
0      1.1    1.2   1.11   1.22
1      1.1    1.2   1.11   1.22
2      1.1    1.2   1.11   1.22

# With Hierarchical Columns
In [61]: df.columns = pd.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns]);df
Out[61]: 
     One        Two      
       X    Y     X     Y
row                      
0    1.1  1.2  1.11  1.22
1    1.1  1.2  1.11  1.22
2    1.1  1.2  1.11  1.22

# Now stack & Reset
In [62]: df = df.stack(0).reset_index(1);df
Out[62]: 
    level_1     X     Y
row                    
0       One  1.10  1.20
0       Two  1.11  1.22
1       One  1.10  1.20
1       Two  1.11  1.22
2       One  1.10  1.20
2       Two  1.11  1.22

# And fix the labels (Notice the label 'level_1' got added automatically)
In [63]: df.columns = ['Sample','All_X','All_Y'];df
Out[63]: 
    Sample  All_X  All_Y
row                     
0      One   1.10   1.20
0      Two   1.11   1.22
1      One   1.10   1.20
1      Two   1.11   1.22
2      One   1.10   1.20
2      Two   1.11   1.22

Arithmetic

使用需要广播的多索引进行算术

In [64]: cols = pd.MultiIndex.from_tuples([ (x,y) for x in ['A','B','C'] for y in ['O','I']])

In [65]: df = pd.DataFrame(np.random.randn(2,6),index=['n','m'],columns=cols); df
Out[65]: 
          A                   B                   C          
          O         I         O         I         O         I
n  1.920906 -0.388231 -2.314394  0.665508  0.402562  0.399555
m -1.765956  0.850423  0.388054  0.992312  0.744086 -0.739776

In [66]: df = df.div(df['C'],level=1); df
Out[66]: 
          A                   B              C     
          O         I         O         I    O    I
n  4.771702 -0.971660 -5.749162  1.665625  1.0  1.0
m -2.373321 -1.149568  0.521518 -1.341367  1.0  1.0

Slicing

使用xs切片多索引

In [67]: coords = [('AA','one'),('AA','six'),('BB','one'),('BB','two'),('BB','six')]

In [68]: index = pd.MultiIndex.from_tuples(coords)

In [69]: df = pd.DataFrame([11,22,33,44,55],index,['MyData']); df
Out[69]: 
        MyData
AA one      11
   six      22
BB one      33
   two      44
   six      55

要取第1级和第1轴的横截面索引:

In [70]: df.xs('BB',level=0,axis=0)  #Note : level and axis are optional, and default to zero
Out[70]: 
     MyData
one      33
two      44
six      55

...现在是第1轴的第2级。

In [71]: df.xs('six',level=1,axis=0)
Out[71]: 
    MyData
AA      22
BB      55

使用xs,方法#2来切割多索引

In [72]: index = list(itertools.product(['Ada','Quinn','Violet'],['Comp','Math','Sci']))

In [73]: headr = list(itertools.product(['Exams','Labs'],['I','II']))

In [74]: indx = pd.MultiIndex.from_tuples(index,names=['Student','Course'])

In [75]: cols = pd.MultiIndex.from_tuples(headr) #Notice these are un-named

In [76]: data = [[70+x+y+(x*y)%3 for x in range(4)] for y in range(9)]

In [77]: df = pd.DataFrame(data,indx,cols); df
Out[77]: 
               Exams     Labs    
                   I  II    I  II
Student Course                   
Ada     Comp      70  71   72  73
        Math      71  73   75  74
        Sci       72  75   75  75
Quinn   Comp      73  74   75  76
        Math      74  76   78  77
        Sci       75  78   78  78
Violet  Comp      76  77   78  79
        Math      77  79   81  80
        Sci       78  81   81  81

In [78]: All = slice(None)

In [79]: df.loc['Violet']
Out[79]: 
       Exams     Labs    
           I  II    I  II
Course                   
Comp      76  77   78  79
Math      77  79   81  80
Sci       78  81   81  81

In [80]: df.loc[(All,'Math'),All]
Out[80]: 
               Exams     Labs    
                   I  II    I  II
Student Course                   
Ada     Math      71  73   75  74
Quinn   Math      74  76   78  77
Violet  Math      77  79   81  80

In [81]: df.loc[(slice('Ada','Quinn'),'Math'),All]
Out[81]: 
               Exams     Labs    
                   I  II    I  II
Student Course                   
Ada     Math      71  73   75  74
Quinn   Math      74  76   78  77

In [82]: df.loc[(All,'Math'),('Exams')]
Out[82]: 
                 I  II
Student Course        
Ada     Math    71  73
Quinn   Math    74  76
Violet  Math    77  79

In [83]: df.loc[(All,'Math'),(All,'II')]
Out[83]: 
               Exams Labs
                  II   II
Student Course           
Ada     Math      73   74
Quinn   Math      76   77
Violet  Math      79   80

使用xs设置多索引的部分

Sorting

按特定列或使用多索引的列的有序列表排序

In [84]: df.sort_values(by=('Labs', 'II'), ascending=False)
Out[84]: 
               Exams     Labs    
                   I  II    I  II
Student Course                   
Violet  Sci       78  81   81  81
        Math      77  79   81  80
        Comp      76  77   78  79
Quinn   Sci       75  78   78  78
        Math      74  76   78  77
        Comp      73  74   75  76
Ada     Sci       72  75   75  75
        Math      71  73   75  74
        Comp      70  71   72  73

部分选择,需要排序;

Levels

将关卡预置到多索引

展平分层列

panelnd

panelnd文档。

构建5D面板

Missing Data

missing data文档。

填充反向时间序列

In [85]: df = pd.DataFrame(np.random.randn(6,1), index=pd.date_range('2013-08-01', periods=6, freq='B'), columns=list('A'))

In [86]: df.ix[3,'A'] = np.nan

In [87]: df
Out[87]: 
                   A
2013-08-01 -1.054874
2013-08-02 -0.179642
2013-08-05  0.639589
2013-08-06       NaN
2013-08-07  1.906684
2013-08-08  0.104050

In [88]: df.reindex(df.index[::-1]).ffill()
Out[88]: 
                   A
2013-08-08  0.104050
2013-08-07  1.906684
2013-08-06  1.906684
2013-08-05  0.639589
2013-08-02 -0.179642
2013-08-01 -1.054874

在NaN值的累计复位

Replace

使用替换为backrefs

Grouping

grouping文档。

应用的基本分组

与agg不同,apply的callable是通过一个子DataFrame传递给你访问的所有的列

In [89]: df = pd.DataFrame({'animal': 'cat dog cat fish dog cat cat'.split(),
   ....:                    'size': list('SSMMMLL'),
   ....:                    'weight': [8, 10, 11, 1, 20, 12, 12],
   ....:                    'adult' : [False] * 5 + [True] * 2}); df
   ....: 
Out[89]: 
   adult animal size  weight
0  False    cat    S       8
1  False    dog    S      10
2  False    cat    M      11
3  False   fish    M       1
4  False    dog    M      20
5   True    cat    L      12
6   True    cat    L      12

#List the size of the animals with the highest weight.
In [90]: df.groupby('animal').apply(lambda subf: subf['size'][subf['weight'].idxmax()])
Out[90]: 
animal
cat     L
dog     M
fish    M
dtype: object

使用get_group

In [91]: gb = df.groupby(['animal'])

In [92]: gb.get_group('cat')
Out[92]: 
   adult animal size  weight
0  False    cat    S       8
2  False    cat    M      11
5   True    cat    L      12
6   True    cat    L      12

套用至群组中的其他项目

In [93]: def GrowUp(x):
   ....:    avg_weight =  sum(x[x['size'] == 'S'].weight * 1.5)
   ....:    avg_weight += sum(x[x['size'] == 'M'].weight * 1.25)
   ....:    avg_weight += sum(x[x['size'] == 'L'].weight)
   ....:    avg_weight /= len(x)
   ....:    return pd.Series(['L',avg_weight,True], index=['size', 'weight', 'adult'])
   ....: 

In [94]: expected_df = gb.apply(GrowUp)

In [95]: expected_df
Out[95]: 
       size   weight adult
animal                    
cat       L  12.4375  True
dog       L  20.0000  True
fish      L   1.2500  True

扩展应用

In [96]: S = pd.Series([i / 100.0 for i in range(1,11)])

In [97]: def CumRet(x,y):
   ....:    return x * (1 + y)
   ....: 

In [98]: def Red(x):
   ....:    return functools.reduce(CumRet,x,1.0)
   ....: 

In [99]: S.expanding().apply(Red)
Out[99]: 
0    1.010000
1    1.030200
2    1.061106
3    1.103550
4    1.158728
5    1.228251
6    1.314229
7    1.419367
8    1.547110
9    1.701821
dtype: float64

用组的其余部分的平均值替换某些值

In [100]: df = pd.DataFrame({'A' : [1, 1, 2, 2], 'B' : [1, -1, 1, 2]})

In [101]: gb = df.groupby('A')

In [102]: def replace(g):
   .....:    mask = g < 0
   .....:    g.loc[mask] = g[~mask].mean()
   .....:    return g
   .....: 

In [103]: gb.transform(replace)
Out[103]: 
     B
0  1.0
1  1.0
2  1.0
3  2.0

按聚合数据对组排序

In [104]: df = pd.DataFrame({'code': ['foo', 'bar', 'baz'] * 2,
   .....:                    'data': [0.16, -0.21, 0.33, 0.45, -0.59, 0.62],
   .....:                    'flag': [False, True] * 3})
   .....: 

In [105]: code_groups = df.groupby('code')

In [106]: agg_n_sort_order = code_groups[['data']].transform(sum).sort_values(by='data')

In [107]: sorted_df = df.ix[agg_n_sort_order.index]

In [108]: sorted_df
Out[108]: 
  code  data   flag
1  bar -0.21   True
4  bar -0.59  False
0  foo  0.16  False
3  foo  0.45   True
2  baz  0.33  False
5  baz  0.62   True

创建多个聚合列

In [109]: rng = pd.date_range(start="2014-10-07",periods=10,freq='2min')

In [110]: ts = pd.Series(data = list(range(10)), index = rng)

In [111]: def MyCust(x):
   .....:    if len(x) > 2:
   .....:       return x[1] * 1.234
   .....:    return pd.NaT
   .....: 

In [112]: mhc = {'Mean' : np.mean, 'Max' : np.max, 'Custom' : MyCust}

In [113]: ts.resample("5min").apply(mhc)
Out[113]: 
                     Max Custom  Mean
2014-10-07 00:00:00    2  1.234   1.0
2014-10-07 00:05:00    4    NaT   3.5
2014-10-07 00:10:00    7  7.404   6.0
2014-10-07 00:15:00    9    NaT   8.5

In [114]: ts
Out[114]: 
2014-10-07 00:00:00    0
2014-10-07 00:02:00    1
2014-10-07 00:04:00    2
2014-10-07 00:06:00    3
2014-10-07 00:08:00    4
2014-10-07 00:10:00    5
2014-10-07 00:12:00    6
2014-10-07 00:14:00    7
2014-10-07 00:16:00    8
2014-10-07 00:18:00    9
Freq: 2T, dtype: int64

创建值计数列,并重新分配回DataFrame

In [115]: df = pd.DataFrame({'Color': 'Red Red Red Blue'.split(),
   .....:                    'Value': [100, 150, 50, 50]}); df
   .....: 
Out[115]: 
  Color  Value
0   Red    100
1   Red    150
2   Red     50
3  Blue     50

In [116]: df['Counts'] = df.groupby(['Color']).transform(len)

In [117]: df
Out[117]: 
  Color  Value  Counts
0   Red    100       3
1   Red    150       3
2   Red     50       3
3  Blue     50       1

根据索引对列中的值进行移位

In [118]: df = pd.DataFrame(
   .....:    {u'line_race': [10, 10, 8, 10, 10, 8],
   .....:     u'beyer': [99, 102, 103, 103, 88, 100]},
   .....:     index=[u'Last Gunfighter', u'Last Gunfighter', u'Last Gunfighter',
   .....:            u'Paynter', u'Paynter', u'Paynter']); df
   .....: 
Out[118]: 
                 beyer  line_race
Last Gunfighter     99         10
Last Gunfighter    102         10
Last Gunfighter    103          8
Paynter            103         10
Paynter             88         10
Paynter            100          8

In [119]: df['beyer_shifted'] = df.groupby(level=0)['beyer'].shift(1)

In [120]: df
Out[120]: 
                 beyer  line_race  beyer_shifted
Last Gunfighter     99         10            NaN
Last Gunfighter    102         10           99.0
Last Gunfighter    103          8          102.0
Paynter            103         10            NaN
Paynter             88         10          103.0
Paynter            100          8           88.0

从每个组中选择具有最大值的行

In [121]: df = pd.DataFrame({'host':['other','other','that','this','this'],
   .....:                    'service':['mail','web','mail','mail','web'],
   .....:                    'no':[1, 2, 1, 2, 1]}).set_index(['host', 'service'])
   .....: 

In [122]: mask = df.groupby(level=0).agg('idxmax')

In [123]: df_count = df.loc[mask['no']].reset_index()

In [124]: df_count
Out[124]: 
    host service  no
0  other     web   2
1   that    mail   1
2   this    mail   2

像Python的itertools.groupby分组

In [125]: df = pd.DataFrame([0, 1, 0, 1, 1, 1, 0, 1, 1], columns=['A'])

In [126]: df.A.groupby((df.A != df.A.shift()).cumsum()).groups
Out[126]: 
{1: Int64Index([0], dtype='int64'),
 2: Int64Index([1], dtype='int64'),
 3: Int64Index([2], dtype='int64'),
 4: Int64Index([3, 4, 5], dtype='int64'),
 5: Int64Index([6], dtype='int64'),
 6: Int64Index([7, 8], dtype='int64')}

In [127]: df.A.groupby((df.A != df.A.shift()).cumsum()).cumsum()
Out[127]: 
0    0
1    1
2    0
3    1
4    2
5    3
6    0
7    1
8    2
Name: A, dtype: int64

Expanding Data

对齐和最新

基于值而不是计数滚动计算窗口

按时间间隔滚动平均值

Splitting

拆分帧

创建数据框列表,使用基于行中包含的逻辑的划分进行拆分。

In [128]: df = pd.DataFrame(data={'Case' : ['A','A','A','B','A','A','B','A','A'],
   .....:                         'Data' : np.random.randn(9)})
   .....: 

In [129]: dfs = list(zip(*df.groupby((1*(df['Case']=='B')).cumsum().rolling(window=3,min_periods=1).median())))[-1]

In [130]: dfs[0]
Out[130]: 
  Case      Data
0    A  0.174068
1    A -0.439461
2    A -0.741343
3    B -0.079673

In [131]: dfs[1]
Out[131]: 
  Case      Data
4    A -0.922875
5    A  0.303638
6    B -0.917368

In [132]: dfs[2]
Out[132]: 
  Case      Data
7    A -1.624062
8    A -0.758514

Pivot

Pivot文档。

部分和和小计

In [133]: df = pd.DataFrame(data={'Province' : ['ON','QC','BC','AL','AL','MN','ON'],
   .....:                          'City' : ['Toronto','Montreal','Vancouver','Calgary','Edmonton','Winnipeg','Windsor'],
   .....:                          'Sales' : [13,6,16,8,4,3,1]})
   .....: 

In [134]: table = pd.pivot_table(df,values=['Sales'],index=['Province'],columns=['City'],aggfunc=np.sum,margins=True)

In [135]: table.stack('City')
Out[135]: 
                    Sales
Province City            
AL       All         12.0
         Calgary      8.0
         Edmonton     4.0
BC       All         16.0
         Vancouver   16.0
MN       All          3.0
         Winnipeg     3.0
...                   ...
All      Calgary      8.0
         Edmonton     4.0
         Montreal     6.0
         Toronto     13.0
         Vancouver   16.0
         Windsor      1.0
         Winnipeg     3.0

[20 rows x 1 columns]

频率表,例如R中的plyr

In [136]: grades = [48,99,75,80,42,80,72,68,36,78]

In [137]: df = pd.DataFrame( {'ID': ["x%d" % r for r in range(10)],
   .....:                     'Gender' : ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
   .....:                     'ExamYear': ['2007','2007','2007','2008','2008','2008','2008','2009','2009','2009'],
   .....:                     'Class': ['algebra', 'stats', 'bio', 'algebra', 'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'],
   .....:                     'Participated': ['yes','yes','yes','yes','no','yes','yes','yes','yes','yes'],
   .....:                     'Passed': ['yes' if x > 50 else 'no' for x in grades],
   .....:                     'Employed': [True,True,True,False,False,False,False,True,True,False],
   .....:                     'Grade': grades})
   .....: 

In [138]: df.groupby('ExamYear').agg({'Participated': lambda x: x.value_counts()['yes'],
   .....:                     'Passed': lambda x: sum(x == 'yes'),
   .....:                     'Employed' : lambda x : sum(x),
   .....:                     'Grade' : lambda x : sum(x) / len(x)})
   .....: 
Out[138]: 
          Grade  Employed  Participated  Passed
ExamYear                                       
2007         74         3             3       2
2008         68         0             3       3
2009         60         2             3       2

绘制pandas DataFrame与年度数据

要创建年份和月份交叉表:

In [139]: df = pd.DataFrame({'value': np.random.randn(36)},
   .....:                   index=pd.date_range('2011-01-01', freq='M', periods=36))
   .....: 

In [140]: pd.pivot_table(df, index=df.index.month, columns=df.index.year,
   .....:                values='value', aggfunc='sum')
   .....: 
Out[140]: 
        2011      2012      2013
1  -0.560859  0.120930  0.516870
2  -0.589005 -0.210518  0.343125
3  -1.070678 -0.931184  2.137827
4  -1.681101  0.240647  0.452429
5   0.403776 -0.027462  0.483103
6   0.609862  0.033113  0.061495
7   0.387936 -0.658418  0.240767
8   1.815066  0.324102  0.782413
9   0.705200 -1.403048  0.628462
10 -0.668049 -0.581967 -0.880627
11  0.242501 -1.233862  0.777575
12  0.313421 -3.520876 -0.779367

Apply

滚动应用于整理 - 将嵌入列表转换为多索引框架

In [141]: df = pd.DataFrame(data={'A' : [[2,4,8,16],[100,200],[10,20,30]], 'B' : [['a','b','c'],['jj','kk'],['ccc']]},index=['I','II','III'])

In [142]: def SeriesFromSubList(aList):
   .....:    return pd.Series(aList)
   .....: 

In [143]: df_orgz = pd.concat(dict([ (ind,row.apply(SeriesFromSubList)) for ind,row in df.iterrows() ]))

使用返回系列的DataFrame的滚动应用

滚动应用于多个列,其中函数在返回系列的标量之前计算系列

In [144]: df = pd.DataFrame(data=np.random.randn(2000,2)/10000,
   .....:                   index=pd.date_range('2001-01-01',periods=2000),
   .....:                   columns=['A','B']); df
   .....: 
Out[144]: 
                   A         B
2001-01-01  0.000032 -0.000004
2001-01-02 -0.000001  0.000207
2001-01-03  0.000120 -0.000220
2001-01-04 -0.000083 -0.000165
2001-01-05 -0.000047  0.000156
2001-01-06  0.000027  0.000104
2001-01-07  0.000041 -0.000101
...              ...       ...
2006-06-17 -0.000034  0.000034
2006-06-18  0.000002  0.000166
2006-06-19  0.000023 -0.000081
2006-06-20 -0.000061  0.000012
2006-06-21 -0.000111  0.000027
2006-06-22 -0.000061 -0.000009
2006-06-23  0.000074 -0.000138

[2000 rows x 2 columns]

In [145]: def gm(aDF,Const):
   .....:    v = ((((aDF.A+aDF.B)+1).cumprod())-1)*Const
   .....:    return (aDF.index[0],v.iloc[-1])
   .....: 

In [146]: S = pd.Series(dict([ gm(df.iloc[i:min(i+51,len(df)-1)],5) for i in range(len(df)-50) ])); S
Out[146]: 
2001-01-01   -0.001373
2001-01-02   -0.001705
2001-01-03   -0.002885
2001-01-04   -0.002987
2001-01-05   -0.002384
2001-01-06   -0.004700
2001-01-07   -0.005500
                ...   
2006-04-28   -0.002682
2006-04-29   -0.002436
2006-04-30   -0.002602
2006-05-01   -0.001785
2006-05-02   -0.001799
2006-05-03   -0.000605
2006-05-04   -0.000541
dtype: float64

使用返回标量的DataFrame进行滚动应用

滚动应用于多个列,其中函数返回标量(体积加权平均价格)

In [147]: rng = pd.date_range(start = '2014-01-01',periods = 100)

In [148]: df = pd.DataFrame({'Open' : np.random.randn(len(rng)),
   .....:                    'Close' : np.random.randn(len(rng)),
   .....:                    'Volume' : np.random.randint(100,2000,len(rng))}, index=rng); df
   .....: 
Out[148]: 
               Close      Open  Volume
2014-01-01 -0.653039  0.011174    1581
2014-01-02  1.314205  0.214258    1707
2014-01-03 -0.341915 -1.046922    1768
2014-01-04 -1.303586 -0.752902     836
2014-01-05  0.396288 -0.410793     694
2014-01-06 -0.548006  0.648401     796
2014-01-07  0.481380  0.737320     265
...              ...       ...     ...
2014-04-04 -2.548128  0.120378     564
2014-04-05  0.223346  0.231661    1908
2014-04-06  1.228841  0.952664    1090
2014-04-07  0.552784 -0.176090    1813
2014-04-08 -0.795389  1.781318    1103
2014-04-09 -0.018815 -0.753493    1456
2014-04-10  1.138197 -1.047997    1193

[100 rows x 3 columns]

In [149]: def vwap(bars): return ((bars.Close*bars.Volume).sum()/bars.Volume.sum())

In [150]: window = 5

In [151]: s = pd.concat([ (pd.Series(vwap(df.iloc[i:i+window]), index=[df.index[i+window]])) for i in range(len(df)-window) ]);

In [152]: s.round(2)
Out[152]: 
2014-01-06   -0.03
2014-01-07    0.07
2014-01-08   -0.40
2014-01-09   -0.81
2014-01-10   -0.63
2014-01-11   -0.86
2014-01-12   -0.36
              ... 
2014-04-04   -1.27
2014-04-05   -1.36
2014-04-06   -0.73
2014-04-07    0.04
2014-04-08    0.21
2014-04-09    0.07
2014-04-10    0.25
dtype: float64

Timeseries

时间

在时间之间使用索引器

构造排除周末且仅包含特定时间的日期时间范围

矢量化查找

汇总和绘制时间序列

将以列和天为单位的小时数的行转换为以时间序列形式的连续行序列。如何重新排列一个python pandas DataFrame?

在将时间序列重新编入指定频率时处理重复项

计算DatetimeIndex中每个条目的月份第一天

In [153]: dates = pd.date_range('2000-01-01', periods=5)

In [154]: dates.to_period(freq='M').to_timestamp()
Out[154]: 
DatetimeIndex(['2000-01-01', '2000-01-01', '2000-01-01', '2000-01-01',
               '2000-01-01'],
              dtype='datetime64[ns]', freq=None)

Resampling

Resample文件。

按时间分组的值的时间分组

TimeGrouping#2

使用TimeGrouper和另一个分组来创建子组,然后应用自定义函数

使用自定义周期重新采样

重新取样日内栏框而不添加新日

重新采样分钟数据

使用groupby重新取样

Merge

Concat文档。Join文档。

附加具有重叠索引的两个数据帧(模拟R rbind)

In [155]: rng = pd.date_range('2000-01-01', periods=6)

In [156]: df1 = pd.DataFrame(np.random.randn(6, 3), index=rng, columns=['A', 'B', 'C'])

In [157]: df2 = df1.copy()

在pandas中需要ignore_index

In [158]: df = df1.append(df2,ignore_index=True); df
Out[158]: 
           A         B         C
0  -0.480676 -1.305282 -0.212846
1   1.979901  0.363112 -0.275732
2  -1.433852  0.580237 -0.013672
3   1.776623 -0.803467  0.521517
4  -0.302508 -0.442948 -0.395768
5  -0.249024 -0.031510  2.413751
6  -0.480676 -1.305282 -0.212846
7   1.979901  0.363112 -0.275732
8  -1.433852  0.580237 -0.013672
9   1.776623 -0.803467  0.521517
10 -0.302508 -0.442948 -0.395768
11 -0.249024 -0.031510  2.413751

自加入DataFrame

In [159]: df = pd.DataFrame(data={'Area' : ['A'] * 5 + ['C'] * 2,
   .....:                         'Bins' : [110] * 2 + [160] * 3 + [40] * 2,
   .....:                         'Test_0' : [0, 1, 0, 1, 2, 0, 1],
   .....:                         'Data' : np.random.randn(7)});df
   .....: 
Out[159]: 
  Area  Bins      Data  Test_0
0    A   110 -0.378914       0
1    A   110 -1.032527       1
2    A   160 -1.402816       0
3    A   160  0.715333       1
4    A   160 -0.091438       2
5    C    40  1.608418       0
6    C    40  0.753207       1

In [160]: df['Test_1'] = df['Test_0'] - 1

In [161]: pd.merge(df, df, left_on=['Bins', 'Area','Test_0'], right_on=['Bins', 'Area','Test_1'],suffixes=('_L','_R'))
Out[161]: 
  Area  Bins    Data_L  Test_0_L  Test_1_L    Data_R  Test_0_R  Test_1_R
0    A   110 -0.378914         0        -1 -1.032527         1         0
1    A   160 -1.402816         0        -1  0.715333         1         0
2    A   160  0.715333         1         0 -0.091438         2         1
3    C    40  1.608418         0        -1  0.753207         1         0

如何设置索引并加入

KDB like asof join

根据值加入条件

根据范围内的值使用searchsorted合并。

Plotting

Plotting文档。

使Matplotlib看起来像R

设置x轴主标签和次标签

在ipython notebook中绘制多个图表

创建多线图

绘制热图

注释时间序列图

注释时间序列图#2

使用Pandas,Vincent和xlsxwriter在excel文件中生成嵌入图。

分层变量的四分位数的箱线图

In [162]: df = pd.DataFrame(
   .....:      {u'stratifying_var': np.random.uniform(0, 100, 20),
   .....:       u'price': np.random.normal(100, 5, 20)})
   .....: 

In [163]: df[u'quartiles'] = pd.qcut(
   .....:     df[u'stratifying_var'],
   .....:     4,
   .....:     labels=[u'0-25%', u'25-50%', u'50-75%', u'75-100%'])
   .....: 

In [164]: df.boxplot(column=u'price', by=u'quartiles')
Out[164]: <matplotlib.axes._subplots.AxesSubplot at 0x7ff27ea62b90>
http://pandas.pydata.org/pandas-docs/version/0.19.2/_images/quartile_boxplot.png

Data In/Out

SQL与HDF5的性能比较

CSV

CSV文件

read_csv in action

写入csv

如何读取多个文件,附加创建单个结构化数据

读取csv逐块读取

只读取csv逐块的某些行

读取结构数据的前几行

读取被压缩但不是由gzip/bz2read_csv理解的原生压缩格式)压缩的文件。此示例显示一个WinZipped文件,但是在上下文管理器中打开该文件并使用该句柄进行读取的一般应用程序。见这里

从文件中引用dtypes

处理数据损坏行

处理数据损坏行II

使用Unix时间戳读取CSV并转换为本地时区

写入多行索引CSV,但不写入重复项

使用格式解析多列中的日期组件更快

In [30]: i = pd.date_range('20000101',periods=10000)

In [31]: df = pd.DataFrame(dict(year = i.year, month = i.month, day = i.day))

In [32]: df.head()
Out[32]:
   day  month  year
0    1      1  2000
1    2      1  2000
2    3      1  2000
3    4      1  2000
4    5      1  2000

In [33]: %timeit pd.to_datetime(df.year*10000+df.month*100+df.day,format='%Y%m%d')
100 loops, best of 3: 7.08 ms per loop

# simulate combinging into a string, then parsing
In [34]: ds = df.apply(lambda x: "%04d%02d%02d" % (x['year'],x['month'],x['day']),axis=1)

In [35]: ds.head()
Out[35]:
0    20000101
1    20000102
2    20000103
3    20000104
4    20000105
dtype: object

In [36]: %timeit pd.to_datetime(ds)
1 loops, best of 3: 488 ms per loop

Skip row between header and data

In [165]: from io import StringIO

In [166]: import pandas as pd

In [167]: data = """;;;;
   .....:  ;;;;
   .....:  ;;;;
   .....:  ;;;;
   .....:  ;;;;
   .....:  ;;;;
   .....: ;;;;
   .....:  ;;;;
   .....:  ;;;;
   .....: ;;;;
   .....: date;Param1;Param2;Param4;Param5
   .....:     ;m²;°C;m²;m
   .....: ;;;;
   .....: 01.01.1990 00:00;1;1;2;3
   .....: 01.01.1990 01:00;5;3;4;5
   .....: 01.01.1990 02:00;9;5;6;7
   .....: 01.01.1990 03:00;13;7;8;9
   .....: 01.01.1990 04:00;17;9;10;11
   .....: 01.01.1990 05:00;21;11;12;13
   .....: """
   .....: 
Option 1: pass rows explicitly to skiprows
In [168]: pd.read_csv(StringIO(data.decode('UTF-8')), sep=';', skiprows=[11,12],
   .....:         index_col=0, parse_dates=True, header=10)
   .....: 
Out[168]: 
                     Param1  Param2  Param4  Param5
date                                               
1990-01-01 00:00:00       1       1       2       3
1990-01-01 01:00:00       5       3       4       5
1990-01-01 02:00:00       9       5       6       7
1990-01-01 03:00:00      13       7       8       9
1990-01-01 04:00:00      17       9      10      11
1990-01-01 05:00:00      21      11      12      13
Option 2: read column names and then data
In [169]: pd.read_csv(StringIO(data.decode('UTF-8')), sep=';',
   .....:         header=10, parse_dates=True, nrows=10).columns
   .....: 
Out[169]: Index([u'date', u'Param1', u'Param2', u'Param4', u'Param5'], dtype='object')

In [170]: columns = pd.read_csv(StringIO(data.decode('UTF-8')), sep=';',
   .....:                   header=10, parse_dates=True, nrows=10).columns
   .....: 

In [171]: pd.read_csv(StringIO(data.decode('UTF-8')), sep=';',
   .....:             header=12, parse_dates=True, names=columns)
   .....: 
Out[171]: 
               date  Param1  Param2  Param4  Param5
0  01.01.1990 00:00       1       1       2       3
1  01.01.1990 01:00       5       3       4       5
2  01.01.1990 02:00       9       5       6       7
3  01.01.1990 03:00      13       7       8       9
4  01.01.1990 04:00      17       9      10      11
5  01.01.1990 05:00      21      11      12      13

SQL

SQL文档

使用SQL从数据库读取

Excel

Excel文档

从文件状句柄读取

在XlsxWriter输出中修改格式设置

HTML

从无法处理默认请求标头的服务器读取HTML表格

HDFStore

HDFStores文档

具有时间戳索引的简单查询

使用链接的多表层次结构管理异构数据

合并数百万行的磁盘表

在从多个进程/线程写入商店时避免不一致

通过块去重复大型存储,本质上是递归归约运算。显示从csv文件接收数据并通过块创建存储的功能,以及日期解析。见这里

从csv文件创建存储块chunk-by-chunk

附加到商店,同时创建唯一索引

大数据工作流

读取文件序列,然后在附加时为商店提供全局唯一索引

Groupby在具有低组密度的HDFStore上

Groupby在具有高组密度的HDFStore上

HDFStore上的分层查询

使用HDFStore计数

对HDFStore异常进行故障排除

用字符串设置min_itemsize

使用ptrepack在商店上创建完全排序索引

将属性存储到组节点

In [172]: df = pd.DataFrame(np.random.randn(8,3))

In [173]: store = pd.HDFStore('test.h5')

In [174]: store.put('df',df)

# you can store an arbitrary python object via pickle
In [175]: store.get_storer('df').attrs.my_attribute = dict(A = 10)

In [176]: store.get_storer('df').attrs.my_attribute
Out[176]: {'A': 10}

Binary Files

pandas很容易接受numpy记录数组,如果你需要读入一个由C结构数组组成的二进制文件。例如,给定这个C程序在用gcc main.c 编译的main.c文件中-std = gnu99在64位机器上,

#include <stdio.h>
#include <stdint.h>

typedef struct _Data
{
    int32_t count;
    double avg;
    float scale;
} Data;

int main(int argc, const char *argv[])
{
    size_t n = 10;
    Data d[n];

    for (int i = 0; i < n; ++i)
    {
        d[i].count = i;
        d[i].avg = i + 1.0;
        d[i].scale = (float) i + 2.0f;
    }

    FILE *file = fopen("binary.dat", "wb");
    fwrite(&d, sizeof(Data), n, file);
    fclose(file);

    return 0;
}

以下Python代码将二进制文件'binary.dat'读入pandas DataFrame,其中结构的每个元素对应于框架中的一列:

names = 'count', 'avg', 'scale'

# note that the offsets are larger than the size of the type because of
# struct padding
offsets = 0, 8, 16
formats = 'i4', 'f8', 'f4'
dt = np.dtype({'names': names, 'offsets': offsets, 'formats': formats},
              align=True)
df = pd.DataFrame(np.fromfile('binary.dat', dt))

注意

结构元素的偏移量可以根据其上创建文件的机器的体系结构而不同。不推荐使用这样的原始二进制文件格式进行一般数据存储,因为它不是跨平台。我们建议使用HDF5或msgpack,这两个都由pandas的IO设备支持。

Computation

时间序列的数值积分(基于样本)

Timedeltas

Timedeltas文件。

使用timedeltas

In [177]: s  = pd.Series(pd.date_range('2012-1-1', periods=3, freq='D'))

In [178]: s - s.max()
Out[178]: 
0   -2 days
1   -1 days
2    0 days
dtype: timedelta64[ns]

In [179]: s.max() - s
Out[179]: 
0   2 days
1   1 days
2   0 days
dtype: timedelta64[ns]

In [180]: s - datetime.datetime(2011,1,1,3,5)
Out[180]: 
0   364 days 20:55:00
1   365 days 20:55:00
2   366 days 20:55:00
dtype: timedelta64[ns]

In [181]: s + datetime.timedelta(minutes=5)
Out[181]: 
0   2012-01-01 00:05:00
1   2012-01-02 00:05:00
2   2012-01-03 00:05:00
dtype: datetime64[ns]

In [182]: datetime.datetime(2011,1,1,3,5) - s
Out[182]: 
0   -365 days +03:05:00
1   -366 days +03:05:00
2   -367 days +03:05:00
dtype: timedelta64[ns]

In [183]: datetime.timedelta(minutes=5) + s
Out[183]: 
0   2012-01-01 00:05:00
1   2012-01-02 00:05:00
2   2012-01-03 00:05:00
dtype: datetime64[ns]

添加和减去三角形和日期

In [184]: deltas = pd.Series([ datetime.timedelta(days=i) for i in range(3) ])

In [185]: df = pd.DataFrame(dict(A = s, B = deltas)); df
Out[185]: 
           A      B
0 2012-01-01 0 days
1 2012-01-02 1 days
2 2012-01-03 2 days

In [186]: df['New Dates'] = df['A'] + df['B'];

In [187]: df['Delta'] = df['A'] - df['New Dates']; df
Out[187]: 
           A      B  New Dates   Delta
0 2012-01-01 0 days 2012-01-01  0 days
1 2012-01-02 1 days 2012-01-03 -1 days
2 2012-01-03 2 days 2012-01-05 -2 days

In [188]: df.dtypes
Out[188]: 
A             datetime64[ns]
B            timedelta64[ns]
New Dates     datetime64[ns]
Delta        timedelta64[ns]
dtype: object

另一个例子

值可以使用np.nan设置为NaT,类似于datetime

In [189]: y = s - s.shift(); y
Out[189]: 
0      NaT
1   1 days
2   1 days
dtype: timedelta64[ns]

In [190]: y[1] = np.nan; y
Out[190]: 
0      NaT
1      NaT
2   1 days
dtype: timedelta64[ns]

Aliasing Axis Names

要全局提供轴名称的别名,可以定义这两个函数:

In [191]: def set_axis_alias(cls, axis, alias):
   .....:    if axis not in cls._AXIS_NUMBERS:
   .....:       raise Exception("invalid axis [%s] for alias [%s]" % (axis, alias))
   .....:    cls._AXIS_ALIASES[alias] = axis
   .....: 
In [192]: def clear_axis_alias(cls, axis, alias):
   .....:    if axis not in cls._AXIS_NUMBERS:
   .....:       raise Exception("invalid axis [%s] for alias [%s]" % (axis, alias))
   .....:    cls._AXIS_ALIASES.pop(alias,None)
   .....: 
In [193]: set_axis_alias(pd.DataFrame,'columns', 'myaxis2')

In [194]: df2 = pd.DataFrame(np.random.randn(3,2),columns=['c1','c2'],index=['i1','i2','i3'])

In [195]: df2.sum(axis='myaxis2')
Out[195]: 
i1   -0.573143
i2   -0.161663
i3    0.264035
dtype: float64

In [196]: clear_axis_alias(pd.DataFrame,'columns', 'myaxis2')

Creating Example Data

要从某些给定值的每个组合(如R的expand.grid()函数)创建数据帧,我们可以创建一个dict,其中键是列名,值是数据值的列表:

In [197]: def expand_grid(data_dict):
   .....:    rows = itertools.product(*data_dict.values())
   .....:    return pd.DataFrame.from_records(rows, columns=data_dict.keys())
   .....: 

In [198]: df = expand_grid(
   .....:    {'height': [60, 70],
   .....:     'weight': [100, 140, 180],
   .....:     'sex': ['Male', 'Female']})
   .....: 

In [199]: df
Out[199]: 
       sex  weight  height
0     Male     100      60
1     Male     100      70
2     Male     140      60
3     Male     140      70
4     Male     180      60
5     Male     180      70
6   Female     100      60
7   Female     100      70
8   Female     140      60
9   Female     140      70
10  Female     180      60
11  Female     180      70