Sparse data structures

Original: http://pandas.pydata.org/pandas-docs/stable/sparse.html

Translators: 飞龙, UsyiyiCN

Proofreader: (position open)

Note

SparsePanel was removed in 0.19.0.

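We have implemented "sparse" versions of Series and DataFrame. These are not sparse in the typical "mostly 0" sense. Rather, you can view these objects as being "compressed": any data matching a specific value (NaN / missing value by default, though any value can be chosen) is omitted. A special SparseIndex object tracks where data has been "sparsified". This will make much more sense with an example. All of the standard pandas data structures have a to_sparse method: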

In [1]: ts = pd.Series(randn(10))

In [2]: ts[2:-2] = np.nan

In [3]: sts = ts.to_sparse()

In [4]: sts
Out[4]: 
0    0.469112
1   -0.282863
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8   -0.861849
9   -2.104569
dtype: float64
BlockIndex
Block locations: array([0, 8], dtype=int32)
Block lengths: array([2, 2], dtype=int32)

The to_sparse method takes a kind argument (for the sparse index, see below) and a fill_value. So if we had a mostly zero Series, we could convert it to sparse with fill_value=0:

In [5]: ts.fillna(0).to_sparse(fill_value=0)
Out[5]: 
0    0.469112
1   -0.282863
2    0.000000
3    0.000000
4    0.000000
5    0.000000
6    0.000000
7    0.000000
8   -0.861849
9   -2.104569
dtype: float64
BlockIndex
Block locations: array([0, 8], dtype=int32)
Block lengths: array([2, 2], dtype=int32)
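
The "compression" described above can be illustrated without pandas. What follows is a minimal plain-Python sketch of the idea, not pandas' actual implementation; the helper names sparsify and densify are hypothetical and exist only for this illustration. It keeps the values that differ from the fill value together with their positions, and puts the fill value back when rebuilding the dense form.

```python
import math


def _matches_fill(value, fill_value):
    """True if value equals the fill value, treating NaN as equal to NaN."""
    if isinstance(fill_value, float) and math.isnan(fill_value):
        return isinstance(value, float) and math.isnan(value)
    return value == fill_value


def sparsify(values, fill_value=math.nan):
    """Keep only entries that differ from fill_value, plus their positions."""
    positions = [i for i, v in enumerate(values)
                 if not _matches_fill(v, fill_value)]
    stored = [values[i] for i in positions]
    return stored, positions, len(values)


def densify(stored, positions, length, fill_value=math.nan):
    """Rebuild the full list by putting fill_value back into the gaps."""
    dense = [fill_value] * length
    for pos, val in zip(positions, stored):
        dense[pos] = val
    return dense


nan = math.nan
ts = [0.469112, -0.282863, nan, nan, nan, nan, nan, nan, -0.861849, -2.104569]

# Default fill value (NaN): only four values are stored.
stored, positions, length = sparsify(ts)
print(positions)  # [0, 1, 8, 9]

# Same data with NaN replaced by 0 and fill_value=0.
zeros = [0.0 if math.isnan(v) else v for v in ts]
stored0, positions0, _ = sparsify(zeros, fill_value=0)
print(positions0)  # [0, 1, 8, 9]
print(densify(stored0, positions0, length, fill_value=0) == zeros)  # True
```

In both cases the same four positions are stored; only the value that is treated as "omitted" changes.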

The sparse objects exist for memory efficiency reasons. Suppose you had a large, mostly NA DataFrame:

In [6]: df = pd.DataFrame(randn(10000, 4))

In [7]: df.ix[:9998] = np.nan

In [8]: sdf = df.to_sparse()

In [9]: sdf
Out[9]: 
             0         1         2         3
0          NaN       NaN       NaN       NaN
1          NaN       NaN       NaN       NaN
2          NaN       NaN       NaN       NaN
3          NaN       NaN       NaN       NaN
4          NaN       NaN       NaN       NaN
5          NaN       NaN       NaN       NaN
6          NaN       NaN       NaN       NaN
...        ...       ...       ...       ...
9993       NaN       NaN       NaN       NaN
9994       NaN       NaN       NaN       NaN
9995       NaN       NaN       NaN       NaN
9996       NaN       NaN       NaN       NaN
9997       NaN       NaN       NaN       NaN
9998       NaN       NaN       NaN       NaN
9999  0.280249 -1.648493  1.490865 -0.890819

[10000 rows x 4 columns]

In [10]: sdf.density
Out[10]: 0.0001

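As you can see, the density (the fraction of values that have not been "compressed") is extremely low. This sparse object takes up much less memory both on disk (pickled) and in the Python interpreter. Functionally, sparse objects should behave nearly identically to their dense counterparts.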
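
As a quick check of the density figure, here is a plain-Python sketch that counts the cells differing from the fill value and divides by the total number of cells. The density function here is a hypothetical helper written for this illustration, not the pandas attribute itself.

```python
import math


def density(rows, fill_value=math.nan):
    """Fraction of cells that differ from fill_value (NaN-aware)."""
    def is_fill(v):
        if isinstance(fill_value, float) and math.isnan(fill_value):
            return isinstance(v, float) and math.isnan(v)
        return v == fill_value

    total = sum(len(row) for row in rows)
    stored = sum(1 for row in rows for v in row if not is_fill(v))
    return stored / total


nan = math.nan
# 10000 rows x 4 columns, with only the last row holding real values.
frame = [[nan] * 4 for _ in range(9999)]
frame.append([0.280249, -1.648493, 1.490865, -0.890819])

print(density(frame))  # 0.0001
```

Four stored values out of 40000 cells gives the 0.0001 reported above.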

Any sparse object can be converted back to the standard dense form by calling to_dense:

In [11]: sts.to_dense()
Out[11]: 
0    0.469112
1   -0.282863
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8   -0.861849
9   -2.104569
dtype: float64

SparseArray

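SparseArray is the base layer for all of the sparse indexed data structures. It is a 1-dimensional ndarray-like object that stores only the values distinct from the fill_value: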

In [12]: arr = np.random.randn(10)

In [13]: arr[2:5] = np.nan; arr[7:8] = np.nan

In [14]: sparr = pd.SparseArray(arr)

In [15]: sparr
Out[15]: 
[-1.95566352972, -1.6588664276, nan, nan, nan, 1.15893288864, 0.145297113733, nan, 0.606027190513, 1.33421134013]
Fill: nan
IntIndex
Indices: array([0, 1, 5, 6, 8, 9], dtype=int32)

Like the indexed objects (SparseSeries, SparseDataFrame), a SparseArray can be converted back to a regular ndarray by calling to_dense:

In [16]: sparr.to_dense()
Out[16]: 
array([-1.9557, -1.6589,     nan,     nan,     nan,  1.1589,  0.1453,
           nan,  0.606 ,  1.3342])

SparseList

The SparseList class has been deprecated and will be removed in a future version. For documentation on SparseList, see the docs of a previous version.

SparseIndex objects

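Two kinds of SparseIndex are implemented: block and integer. We recommend using block, as it is more memory efficient. The integer format keeps an array of all of the locations where the data are not equal to the fill value. The block format tracks only the locations and sizes of blocks of data.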
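
The difference between the two formats can be sketched in plain Python. This is an illustration of the idea only, and the function names integer_index and block_index are hypothetical. It assumes the stored positions are already sorted in ascending order.

```python
def integer_index(positions):
    """'integer' kind: one entry per stored position."""
    return list(positions)


def block_index(positions):
    """'block' kind: start and length of each run of consecutive positions."""
    locations, lengths = [], []
    for pos in positions:
        if locations and pos == locations[-1] + lengths[-1]:
            lengths[-1] += 1        # extends the current run
        else:
            locations.append(pos)   # starts a new run
            lengths.append(1)
    return locations, lengths


# Stored positions of the Series example: values at 0, 1, 8, 9.
print(integer_index([0, 1, 8, 9]))      # [0, 1, 8, 9]
print(block_index([0, 1, 8, 9]))        # ([0, 8], [2, 2])

# Stored positions of the SparseArray example.
print(block_index([0, 1, 5, 6, 8, 9]))  # ([0, 5, 8], [2, 2, 2])
```

When the stored values form long runs, the block format needs two integers per run instead of one integer per value, which is where the memory saving comes from.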

Sparse Dtypes

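Sparse data should have the same dtype as its dense representation. Currently, the float64, int64 and bool dtypes are supported. The default fill_value depends on the original dtype (np.nan for float64, 0 for int64, False for bool):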

In [17]: s = pd.Series([1, np.nan, np.nan])

In [18]: s
Out[18]: 
0    1.0
1    NaN
2    NaN
dtype: float64

In [19]: s.to_sparse()
Out[19]: 
0    1.0
1    NaN
2    NaN
dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([1], dtype=int32)

In [20]: s = pd.Series([1, 0, 0])

In [21]: s
Out[21]: 
0    1
1    0
2    0
dtype: int64

In [22]: s.to_sparse()
Out[22]: 
0    1
1    0
2    0
dtype: int64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([1], dtype=int32)

In [23]: s = pd.Series([True, False, True])

In [24]: s
Out[24]: 
0     True
1    False
2     True
dtype: bool

In [25]: s.to_sparse()
Out[25]: 
0     True
1    False
2     True
dtype: bool
BlockIndex
Block locations: array([0, 2], dtype=int32)
Block lengths: array([1, 1], dtype=int32)

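You can change the dtype using .astype(); the result is also sparse. Note that .astype() also converts the fill_value, so that the dense representation stays consistent.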

In [26]: s = pd.Series([1, 0, 0, 0, 0])

In [27]: s
Out[27]: 
0    1
1    0
2    0
3    0
4    0
dtype: int64

In [28]: ss = s.to_sparse()

In [29]: ss
Out[29]: 
0    1
1    0
2    0
3    0
4    0
dtype: int64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([1], dtype=int32)

In [30]: ss.astype(np.float64)
Out[30]: 
0    1.0
1    0.0
2    0.0
3    0.0
4    0.0
dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([1], dtype=int32)

It raises if any value cannot be coerced to the specified dtype.

In [1]: ss = pd.Series([1, np.nan, np.nan]).to_sparse()
0    1.0
1    NaN
2    NaN
dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([1], dtype=int32)

In [2]: ss.astype(np.int64)
ValueError: unable to coerce current fill_value nan to int64 dtype
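
The reason the cast can fail even when every stored value is convertible is that the fill value has to be converted too. Here is a plain-Python sketch of that rule, using built-in int and float as stand-ins for the NumPy dtypes. The function sparse_astype is hypothetical and is not the pandas implementation.

```python
import math


def sparse_astype(stored, fill_value, caster):
    """Cast the stored values and the fill value with the same caster.

    Raises ValueError if the fill value cannot be represented in the
    target type (for example NaN as an integer).
    """
    try:
        new_fill = caster(fill_value)
    except (ValueError, OverflowError) as exc:
        raise ValueError(
            "unable to coerce current fill_value {!r} to {}".format(
                fill_value, caster.__name__)
        ) from exc
    return [caster(v) for v in stored], new_fill


# int -> float: both the stored value and the fill value are converted.
print(sparse_astype([1], 0, float))  # ([1.0], 0.0)

# float with NaN fill -> int: the stored 1.0 is fine, the NaN fill is not.
try:
    sparse_astype([1.0], math.nan, int)
except ValueError as err:
    print(err)
```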

Sparse Calculation

You can apply NumPy ufuncs to a SparseArray and get a SparseArray as a result.

In [31]: arr = pd.SparseArray([1., np.nan, np.nan, -2., np.nan])

In [32]: np.abs(arr)
Out[32]: 
[1.0, nan, nan, 2.0, nan]
Fill: nan
IntIndex
Indices: array([0, 3], dtype=int32)

The ufunc is also applied to fill_value. This is needed to get the correct dense result.

In [33]: arr = pd.SparseArray([1., -1, -1, -2., -1], fill_value=-1)

In [34]: np.abs(arr)
Out[34]: 
[1.0, 1, 1, 2.0, 1]
Fill: 1
IntIndex
Indices: array([0, 3], dtype=int32)

In [35]: np.abs(arr).to_dense()
Out[35]: array([ 1.,  1.,  1.,  2.,  1.])
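
To see why the function has to be applied to the fill value as well, consider this plain-Python sketch. The helpers apply_elementwise and to_dense are hypothetical names for this illustration, and the built-in abs stands in for np.abs.

```python
def apply_elementwise(func, stored, positions, length, fill_value):
    """Apply func to the stored values and to the fill value alike."""
    return [func(v) for v in stored], positions, length, func(fill_value)


def to_dense(stored, positions, length, fill_value):
    """Rebuild the dense list from stored values, positions and fill value."""
    dense = [fill_value] * length
    for pos, val in zip(positions, stored):
        dense[pos] = val
    return dense


# [1., -1, -1, -2., -1] with fill_value=-1 stores only positions 0 and 3.
stored, positions, length, fill = [1.0, -2.0], [0, 3], 5, -1

result = apply_elementwise(abs, stored, positions, length, fill)
print(result[3])          # 1
print(to_dense(*result))  # [1.0, 1, 1, 2.0, 1]
```

If the fill value were left at -1, the dense result would contain -1 in the gaps, which is not the absolute value of the original data.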

Interaction with scipy.sparse

An experimental API is provided for converting between sparse pandas and scipy.sparse structures.

A SparseSeries.to_coo() method is implemented for transforming a SparseSeries indexed by a MultiIndex to a scipy.sparse.coo_matrix.

The method requires a MultiIndex with two or more levels.

In [36]: s = pd.Series([3.0, np.nan, 1.0, 3.0, np.nan, np.nan])

In [37]: s.index = pd.MultiIndex.from_tuples([(1, 2, 'a', 0),
   ....:                                      (1, 2, 'a', 1),
   ....:                                      (1, 1, 'b', 0),
   ....:                                      (1, 1, 'b', 1),
   ....:                                      (2, 1, 'b', 0),
   ....:                                      (2, 1, 'b', 1)],
   ....:                                      names=['A', 'B', 'C', 'D'])
   ....: 

In [38]: s
Out[38]: 
A  B  C  D
1  2  a  0    3.0
         1    NaN
   1  b  0    1.0
         1    3.0
2  1  b  0    NaN
         1    NaN
dtype: float64

# SparseSeries
In [39]: ss = s.to_sparse()

In [40]: ss
Out[40]: 
A  B  C  D
1  2  a  0    3.0
         1    NaN
   1  b  0    1.0
         1    3.0
2  1  b  0    NaN
         1    NaN
dtype: float64
BlockIndex
Block locations: array([0, 2], dtype=int32)
Block lengths: array([1, 2], dtype=int32)

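In the example below, we transform the SparseSeries to a sparse representation of a 2-d array by specifying that the first and second MultiIndex levels define the labels for the rows and the third and fourth levels define the labels for the columns. We also specify that the column and row labels should be sorted in the final sparse representation.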

In [41]: A, rows, columns = ss.to_coo(row_levels=['A', 'B'],
   ....:                              column_levels=['C', 'D'],
   ....:                              sort_labels=True)
   ....: 

In [42]: A
Out[42]: 
<3x4 sparse matrix of type '<type 'numpy.float64'>'
	with 3 stored elements in COOrdinate format>

In [43]: A.todense()
Out[43]: 
matrix([[ 0.,  0.,  1.,  3.],
        [ 3.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.]])

In [44]: rows
Out[44]: [(1, 1), (1, 2), (2, 1)]

In [45]: columns
Out[45]: [('a', 0), ('a', 1), ('b', 0), ('b', 1)]

Specifying different row and column labels (and not sorting them) yields a different sparse matrix:

In [46]: A, rows, columns = ss.to_coo(row_levels=['A', 'B', 'C'],
   ....:                              column_levels=['D'],
   ....:                              sort_labels=False)
   ....: 

In [47]: A
Out[47]: 
<3x2 sparse matrix of type '<type 'numpy.float64'>'
	with 3 stored elements in COOrdinate format>

In [48]: A.todense()
Out[48]: 
matrix([[ 3.,  0.],
        [ 1.,  3.],
        [ 0.,  0.]])

In [49]: rows
Out[49]: [(1, 2, 'a'), (1, 1, 'b'), (2, 1, 'b')]

In [50]: columns
Out[50]: [0, 1]
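
The mapping from index levels to matrix coordinates in the two examples above can be sketched in plain Python. This is an illustration only, not the pandas or SciPy implementation. The function to_coo_sketch is hypothetical, and it selects levels by position (0 to 3) rather than by name, which is an assumption made to keep the sketch short.

```python
import math


def to_coo_sketch(index, values, row_levels, column_levels, sort_labels=False):
    """Return (data, (i, j)), rows, columns for a tuple-indexed series."""
    def pick(key, levels):
        label = tuple(key[lvl] for lvl in levels)
        return label[0] if len(label) == 1 else label

    def unique(labels):
        seen = []
        for label in labels:
            if label not in seen:
                seen.append(label)
        return sorted(seen) if sort_labels else seen

    # Labels come from the whole index, including entries whose value is NaN.
    rows = unique(pick(key, row_levels) for key in index)
    columns = unique(pick(key, column_levels) for key in index)

    data, i, j = [], [], []
    for key, value in zip(index, values):
        if isinstance(value, float) and math.isnan(value):
            continue  # only stored (non-NaN) values become matrix entries
        data.append(value)
        i.append(rows.index(pick(key, row_levels)))
        j.append(columns.index(pick(key, column_levels)))
    return (data, (i, j)), rows, columns


nan = math.nan
index = [(1, 2, 'a', 0), (1, 2, 'a', 1), (1, 1, 'b', 0),
         (1, 1, 'b', 1), (2, 1, 'b', 0), (2, 1, 'b', 1)]
values = [3.0, nan, 1.0, 3.0, nan, nan]

coo, rows, columns = to_coo_sketch(index, values, [0, 1], [2, 3],
                                   sort_labels=True)
print(rows)     # [(1, 1), (1, 2), (2, 1)]
print(columns)  # [('a', 0), ('a', 1), ('b', 0), ('b', 1)]
print(coo)      # ([3.0, 1.0, 3.0], ([1, 0, 0], [0, 2, 3]))

coo2, rows2, columns2 = to_coo_sketch(index, values, [0, 1, 2], [3])
print(rows2)    # [(1, 2, 'a'), (1, 1, 'b'), (2, 1, 'b')]
print(columns2) # [0, 1]
print(coo2)     # ([3.0, 1.0, 3.0], ([0, 1, 1], [0, 0, 1]))
```

The (data, (i, j)) triple has the same shape as the argument accepted by scipy.sparse.coo_matrix. Note that the row label (2, 1) appears even though all of its values are missing, which is why the matrices above contain an all-zero row.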

A convenience method SparseSeries.from_coo() is implemented for creating a SparseSeries from a scipy.sparse.coo_matrix.

In [51]: from scipy import sparse

In [52]: A = sparse.coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])),
   ....:                       shape=(3, 4))
   ....: 

In [53]: A
Out[53]: 
<3x4 sparse matrix of type '<type 'numpy.float64'>'
	with 3 stored elements in COOrdinate format>

In [54]: A.todense()
Out[54]: 
matrix([[ 0.,  0.,  1.,  2.],
        [ 3.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.]])

The default behaviour (with dense_index=False) simply returns a SparseSeries containing only the non-null entries.

In [55]: ss = pd.SparseSeries.from_coo(A)

In [56]: ss
Out[56]: 
0  2    1.0
   3    2.0
1  0    3.0
dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([3], dtype=int32)

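Specifying dense_index=True will result in an index that is the Cartesian product of the row and column coordinates of the matrix. Note that this will consume a significant amount of memory (relative to dense_index=False) if the sparse matrix is large (and sparse) enough.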

In [57]: ss_dense = pd.SparseSeries.from_coo(A, dense_index=True)

In [58]: ss_dense
Out[58]: 
0  0    NaN
   1    NaN
   2    1.0
   3    2.0
1  0    3.0
   1    NaN
   2    NaN
   3    NaN
2  0    NaN
   1    NaN
   2    NaN
   3    NaN
dtype: float64
BlockIndex
Block locations: array([2], dtype=int32)
Block lengths: array([3], dtype=int32)
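
The effect of dense_index on the size of the resulting index can be sketched in plain Python. The function from_coo_sketch is hypothetical; it represents the result as a dict keyed by (row, column) and is not the pandas implementation.

```python
import math
from itertools import product


def from_coo_sketch(data, row, col, shape, dense_index=False):
    """Map (row, col) -> value; optionally expand to every coordinate pair."""
    entries = dict(zip(zip(row, col), data))
    if not dense_index:
        # Only the coordinates that actually hold a value, in sorted order.
        return dict(sorted(entries.items()))
    n_rows, n_cols = shape
    # Cartesian product of all row and column coordinates, NaN where empty.
    return {key: entries.get(key, math.nan)
            for key in product(range(n_rows), range(n_cols))}


data, row, col, shape = [3.0, 1.0, 2.0], [1, 0, 0], [0, 2, 3], (3, 4)

sparse_only = from_coo_sketch(data, row, col, shape)
print(sparse_only)  # {(0, 2): 1.0, (0, 3): 2.0, (1, 0): 3.0}

full = from_coo_sketch(data, row, col, shape, dense_index=True)
print(len(full))    # 12
```

With dense_index=True the index grows with rows times columns (12 entries here for 3 stored values), which is the memory cost mentioned above.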