GH-43352: [Docs][Python] Add all tensor classes documentation #45160

Open: wants to merge 8 commits into main
12 changes: 12 additions & 0 deletions docs/source/python/api.rst
@@ -39,3 +39,15 @@ API Reference
    api/dataset
    api/cuda
    api/misc
+   api/tensors
+
+*************
+Tensors
+*************
+
+.. _toc.tensors:
+
+.. toctree::
+   :maxdepth: 2
+
+   <python/api/tensors.rst>
11 changes: 3 additions & 8 deletions docs/source/python/api/tables.rst
@@ -55,12 +55,7 @@ Dataframe Interchange Protocol

    interchange.from_dataframe

-.. _api.tensor:
+See Also
+--------

-Tensors
--------
-
-.. autosummary::
-   :toctree: ../generated/
-
-   Tensor
+For information about tensors, refer to :doc:`tensors`
74 changes: 74 additions & 0 deletions docs/source/python/api/tensors.rst
@@ -0,0 +1,74 @@
.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at

.. http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. currentmodule:: pyarrow

.. _api.tensor:

Tensors
=======

PyArrow supports both dense and sparse tensors. Dense tensors store every value explicitly, while sparse tensors store only the non-zero elements and their locations, which saves memory and computation when most elements are zero.
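
To make the storage argument concrete, the sketch below counts how many numbers each representation has to hold. It is a back-of-the-envelope illustration in plain Python, not PyArrow code: the helper names ``dense_cells`` and ``coo_cells`` are hypothetical, and the sparse count assumes a COO-style layout with one value plus one index per dimension for every non-zero element.

```python
from math import prod


def dense_cells(shape):
    """Number of values a dense tensor of this shape stores."""
    return prod(shape)


def coo_cells(nnz, ndim):
    """Numbers a COO-style layout stores: one value plus `ndim` indices per non-zero."""
    return nnz * (1 + ndim)


shape = (1000, 1000)
nnz = 5000  # assumed number of non-zero elements (0.5% of all cells)

print(dense_cells(shape))          # 1000000
print(coo_cells(nnz, len(shape)))  # 15000
```

The advantage shrinks as the tensor fills up: once more than roughly one cell in ``1 + ndim`` is non-zero, this simple count favors the dense layout.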

Dense Tensors
-------------

.. autosummary::
   :toctree: ../generated/

   Tensor

Sparse Tensors
--------------

PyArrow supports the following sparse tensor formats:

.. autosummary::
   :toctree: ../generated/

   SparseCOOTensor
   SparseCSRMatrix
   SparseCSCMatrix
   SparseCSFTensor

SparseCOOTensor
^^^^^^^^^^^^^^^

The ``SparseCOOTensor`` represents a sparse tensor in Coordinate (COO) format, where each non-zero element is stored together with its coordinates, one index per dimension (a row index and a column index in the two-dimensional case).

For detailed examples, see :ref:`data/SparseCOOTensor`.

SparseCSRMatrix
^^^^^^^^^^^^^^^

The ``SparseCSRMatrix`` represents a sparse matrix in Compressed Sparse Row (CSR) format. Because the non-zero values of each row are stored contiguously, this format is well suited to row slicing and matrix-vector multiplication.

For detailed examples, see :ref:`data/SparseCSRMatrix`.

SparseCSCMatrix
^^^^^^^^^^^^^^^

The ``SparseCSCMatrix`` represents a sparse matrix in Compressed Sparse Column (CSC) format, where data is stored by columns.

For detailed examples, see :ref:`data/SparseCSCMatrix`.

SparseCSFTensor
^^^^^^^^^^^^^^^

The ``SparseCSFTensor`` represents a sparse tensor in Compressed Sparse Fiber (CSF) format, which is a generalization of the CSR format for higher dimensions.

For detailed examples, see :ref:`data/SparseCSFTensor`.
129 changes: 129 additions & 0 deletions docs/source/python/data.rst
@@ -561,6 +561,135 @@ schema without having to get any of the batches.::

It can also be sent between languages using the :ref:`C stream interface <c-stream-interface>`.

Sparse Tensor Classes
=====================

SparseCOOTensor
---------------

The ``SparseCOOTensor`` represents a sparse tensor in Coordinate (COO) format, where each non-zero element is stored together with its coordinates, one index per dimension (a row index and a column index in the two-dimensional case).

Example Usage:
^^^^^^^^^^^^^^

.. code-block:: python

   >>> import numpy as np
   >>> import pyarrow as pa
   >>> # Non-zero values as a column vector, one row per value
   >>> data = np.array([[1], [2]])
   >>> # One (row, column) coordinate pair per non-zero value
   >>> coords = np.array([[0, 1], [1, 0]])
   >>> shape = (2, 3)

   >>> tensor = pa.SparseCOOTensor.from_numpy(data, coords, shape)
   >>> print(tensor)
   <pyarrow.SparseCOOTensor>
   type: int64
   shape: (2, 3)
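
To see what the coordinate layout encodes, the following plain-Python sketch expands a two-dimensional COO description into a dense nested list. It does not use PyArrow; ``coo_to_dense`` is a hypothetical helper written only for illustration, and it assumes one ``(row, column)`` pair per non-zero value.

```python
def coo_to_dense(data, coords, shape):
    """Expand 2-D COO entries into a dense nested list (zeros elsewhere)."""
    rows, cols = shape
    dense = [[0] * cols for _ in range(rows)]
    for value, (r, c) in zip(data, coords):
        dense[r][c] = value
    return dense


data = [1, 2]
coords = [(0, 1), (1, 0)]  # one (row, column) pair per non-zero value
shape = (2, 3)

for row in coo_to_dense(data, coords, shape):
    print(row)
# [0, 1, 0]
# [2, 0, 0]
```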


SparseCSRMatrix
---------------

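``SparseCSRMatrix`` represents a sparse matrix in Compressed Sparse Row (CSR) format, which uses three arrays: ``data`` holds the non-zero values row by row, ``indices`` holds the column index of each value, and ``indptr`` marks where each row starts and ends within ``data`` and ``indices``.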

Example Usage:
^^^^^^^^^^^^^^

.. code-block:: python

   >>> import numpy as np
   >>> import pyarrow as pa
   >>> data = np.array([[1], [2], [3]])  # non-zero values as a column vector
   >>> indptr = np.array([0, 2, 3])
   >>> indices = np.array([0, 2, 1])
   >>> shape = (2, 3)
   >>> sparse_matrix = pa.SparseCSRMatrix.from_numpy(data, indptr, indices, shape)
   >>> print(sparse_matrix)
   <pyarrow.SparseCSRMatrix>
   type: int64
   shape: (2, 3)
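
The roles of ``data``, ``indptr`` and ``indices`` are easiest to see by decoding them by hand. The sketch below is plain Python rather than PyArrow code; ``csr_to_dense`` and ``csr_matvec`` are hypothetical helpers, included only to illustrate the CSR layout and why row-wise operations on it are cheap.

```python
def csr_to_dense(data, indptr, indices, shape):
    """Expand CSR arrays into a dense nested list (zeros elsewhere)."""
    rows, cols = shape
    dense = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        # Row r owns the slice indptr[r]:indptr[r + 1] of data and indices.
        for k in range(indptr[r], indptr[r + 1]):
            dense[r][indices[k]] = data[k]
    return dense


def csr_matvec(data, indptr, indices, x):
    """Multiply a CSR matrix by a dense vector, touching only non-zeros."""
    return [
        sum(data[k] * x[indices[k]] for k in range(indptr[r], indptr[r + 1]))
        for r in range(len(indptr) - 1)
    ]


data = [1, 2, 3]
indptr = [0, 2, 3]
indices = [0, 2, 1]

print(csr_to_dense(data, indptr, indices, (2, 3)))   # [[1, 0, 2], [0, 3, 0]]
print(csr_matvec(data, indptr, indices, [1, 1, 1]))  # [3, 3]
```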


SparseCSCMatrix
---------------

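``SparseCSCMatrix`` represents a sparse matrix in Compressed Sparse Column (CSC) format, which uses three arrays: ``data`` holds the non-zero values column by column, ``indices`` holds the row index of each value, and ``indptr`` marks where each column starts and ends within ``data`` and ``indices``.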

Example Usage:
^^^^^^^^^^^^^^

.. code-block:: python

   >>> import numpy as np
   >>> import pyarrow as pa
   >>> data = np.array([[4], [5], [6]])  # non-zero values as a column vector
   >>> indptr = np.array([0, 1, 3])
   >>> indices = np.array([0, 2, 1])
   >>> shape = (3, 2)

   >>> sparse_matrix = pa.SparseCSCMatrix.from_numpy(data, indptr, indices, shape)
   >>> print(sparse_matrix)
   <pyarrow.SparseCSCMatrix>
   type: int64
   shape: (3, 2)
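
CSC mirrors CSR with the roles of rows and columns swapped. As a rough illustration in plain Python (``csc_to_dense`` is a hypothetical helper, not part of PyArrow), the arrays can be decoded like this:

```python
def csc_to_dense(data, indptr, indices, shape):
    """Expand CSC arrays into a dense nested list (zeros elsewhere)."""
    rows, cols = shape
    dense = [[0] * cols for _ in range(rows)]
    for c in range(cols):
        # Column c owns the slice indptr[c]:indptr[c + 1] of data and indices.
        for k in range(indptr[c], indptr[c + 1]):
            dense[indices[k]][c] = data[k]
    return dense


data = [4, 5, 6]
indptr = [0, 1, 3]
indices = [0, 2, 1]

for row in csc_to_dense(data, indptr, indices, (3, 2)):
    print(row)
# [4, 0]
# [0, 6]
# [0, 5]
```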


SparseCSFTensor
---------------

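``SparseCSFTensor`` represents a sparse tensor in Compressed Sparse Fiber (CSF) format, a generalization of the CSR format to tensors with more than two dimensions in which index values shared along the leading dimensions are stored only once.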

Example Usage:
^^^^^^^^^^^^^^

.. code-block:: python

   >>> import numpy as np
   >>> import pyarrow as pa
   >>> # Non-zero entries at (0, 0), (0, 1) and (1, 2)
   >>> data = np.array([[1], [2], [3]])
   >>> indptr = [np.array([0, 2, 3])]
   >>> indices = [
   ...     np.array([0, 1]),
   ...     np.array([0, 1, 2]),
   ... ]
   >>> shape = (2, 3)

   >>> sparse_tensor = pa.SparseCSFTensor.from_numpy(data, indptr, indices, shape)
   >>> print(sparse_tensor)
   <pyarrow.SparseCSFTensor>
   type: int64
   shape: (2, 3)
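
CSF stores the coordinates as a tree with one level per dimension. The plain-Python sketch below walks such a tree and lists the full coordinates of every stored value. It is an illustration only: ``csf_to_coo`` is a hypothetical helper, and it assumes the usual CSF convention that the children of node ``n`` at one level occupy positions ``indptr[level][n]`` up to ``indptr[level][n + 1]`` at the next level.

```python
def csf_to_coo(indptr, indices, data):
    """Expand CSF levels into a list of (coordinates, value) pairs."""
    entries = []
    last_level = len(indices) - 1

    def walk(level, node, prefix):
        coord = prefix + (indices[level][node],)
        if level == last_level:
            # Leaves line up one-to-one with the stored values.
            entries.append((coord, data[node]))
            return
        for child in range(indptr[level][node], indptr[level][node + 1]):
            walk(level + 1, child, coord)

    for root in range(len(indices[0])):
        walk(0, root, ())
    return entries


# A 2 x 3 tensor with non-zeros at (0, 0), (0, 1) and (1, 2).
indptr = [[0, 2, 3]]
indices = [[0, 1], [0, 1, 2]]
data = [1, 2, 3]

print(csf_to_coo(indptr, indices, data))
# [((0, 0), 1), ((0, 1), 2), ((1, 2), 3)]
```

Row index ``0`` appears once in ``indices[0]`` even though two values share it; that sharing is where the compression comes from.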


Conversion of RecordBatch to Tensor
-----------------------------------

Each array of the ``RecordBatch`` has its own contiguous memory that is not necessarily
adjacent to other arrays. A different memory structure that is used in machine learning
libraries is a two-dimensional array (also called a 2-dim tensor or a matrix) which takes
only one contiguous block of memory.

For this reason, there is a function ``pyarrow.RecordBatch.to_tensor()`` available
to efficiently convert tabular columnar data into a tensor.

Data types supported in this conversion are unsigned integer, signed integer, and
floating-point types. Currently, only column-major conversion is supported.

Example Usage:
^^^^^^^^^^^^^^

.. code-block:: python

   >>> import pyarrow as pa
   >>> arr1 = [1, 2, 3, 4, 5]
   >>> arr2 = [10, 20, 30, 40, 50]
   >>> batch = pa.RecordBatch.from_arrays(
   ...     [
   ...         pa.array(arr1, type=pa.uint16()),
   ...         pa.array(arr2, type=pa.int16()),
   ...     ], ["a", "b"]
   ... )
   >>> batch.to_tensor()
   <pyarrow.Tensor>
   type: int32
   shape: (5, 2)
   strides: (4, 20)
   >>> batch.to_tensor().to_numpy()
   array([[ 1, 10],
          [ 2, 20],
          [ 3, 30],
          [ 4, 40],
          [ 5, 50]], dtype=int32)
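
The strides of a tensor follow directly from its shape, item size and memory order. As a small worked example in plain Python (the two helpers are hypothetical and not part of PyArrow), the byte strides of a 5-row, 2-column tensor of 4-byte integers can be computed as follows:

```python
def column_major_strides(shape, itemsize):
    """Byte strides when the first dimension varies fastest in memory."""
    strides = []
    step = itemsize
    for extent in shape:
        strides.append(step)
        step *= extent
    return tuple(strides)


def row_major_strides(shape, itemsize):
    """Byte strides when the last dimension varies fastest in memory."""
    strides = []
    step = itemsize
    for extent in reversed(shape):
        strides.append(step)
        step *= extent
    return tuple(reversed(strides))


print(column_major_strides((5, 2), 4))  # (4, 20)
print(row_major_strides((5, 2), 4))     # (8, 4)
```

In the column-major case, moving down one row advances 4 bytes and moving to the next column skips a whole column of 5 values, that is 20 bytes.
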
Conversion of RecordBatch to Tensor
-----------------------------------

Empty file added filtered_rat.txt
Empty file.
7 changes: 4 additions & 3 deletions python/benchmarks/parquet.py
@@ -18,6 +18,7 @@
 import numpy as np

 import pyarrow as pa
+
 try:
     import pyarrow.parquet as pq
 except ImportError:
@@ -34,7 +35,7 @@ def setup(self):
         num_cols = 10

         unique_values = np.array([rands(value_size) for
-                                  i in range(nuniques)], dtype='O')
+                                  _ in range(nuniques)], dtype='O')
         values = unique_values[np.random.randint(0, nuniques, size=length)]
         self.table = pa.table([pa.array(values) for i in range(num_cols)],
                               names=['f{}'.format(i) for i in range(num_cols)])
@@ -58,7 +59,7 @@ def time_convert_pandas_and_write_binary_table(self):


 def generate_dict_strings(string_size, nunique, length, random_order=True):
-    uniques = np.array([rands(string_size) for i in range(nunique)], dtype='O')
+    uniques = np.array([rands(string_size) for _ in range(nunique)], dtype='O')
     if random_order:
         indices = np.random.randint(0, nunique, size=length).astype('i4')
     else:
@@ -71,7 +72,7 @@ def generate_dict_table(num_cols, string_size, nunique, length,
     data = generate_dict_strings(string_size, nunique, length,
                                  random_order=random_order)
     return pa.table([
-        data for i in range(num_cols)
+        data for _ in range(num_cols)
     ], names=['f{}'.format(i) for i in range(num_cols)])

15 changes: 14 additions & 1 deletion python/pyarrow/tensor.pxi
@@ -610,7 +610,20 @@ shape: {0.shape}""".format(self)

 cdef class SparseCSRMatrix(_Weakrefable):
     """
-    A sparse CSR matrix.
+    SparseCSRMatrix represents a sparse matrix in Compressed Sparse Row (CSR) format.
+
+    Example:
+    >>> import pyarrow as pa
+    >>> import numpy as np
+    >>> data = np.array([1, 2, 3])
+    >>> indptr = np.array([0, 2, 3])
+    >>> indices = np.array([0, 2, 1])
+    >>> shape = (2, 3)
+    >>> tensor = pa.SparseCSRMatrix.from_numpy(data, indptr, indices, shape)
+    >>> print(tensor)
+    <pyarrow.SparseCSRMatrix>
+    type: int64
+    shape: (2, 3)
     """

     def __init__(self):
Empty file added rat.txt
Empty file.