rain.nodes.pandas package#

Submodules#

rain.nodes.pandas.model_io module#

Copyright (C) 2023 Università degli Studi di Camerino and Sigma S.p.A. Authors: Alessandro Antinori, Rosario Capparuccia, Riccardo Coltrinari, Flavio Corradini, Marco Piangerelli, Barbara Re, Marco Scarpetta

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.

class rain.nodes.pandas.model_io.PickleModelLoader(node_id: str, path: str)[source]#

Bases: InputNode

Node that loads a given object, for instance a trained model, stored in pickle format.

Output:

model (pickle) – The loaded object in pickle format.

Parameters:

node_id (str) – Id of the node.
path (str) – The path of the stored object/model.

execute()[source]#: Expose the main functionality: depending on the node, the computation is done using a specific Python library and its function/s.

model = None#

class rain.nodes.pandas.model_io.PickleModelWriter(node_id: str, path: str)[source]#

Bases: OutputNode

Node that stores a given object, for instance a trained model, in pickle format.

Input:

model (pickle) – The object/model to store.

Parameters:

node_id (str) – Id of the node.
path (str) – The path/filename where to store the object/model.

execute()[source]#: Expose the main functionality: depending on the node, the computation is done using a specific Python library and its function/s.

model = None#
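
As a rough illustration of what the two pickle nodes above amount to, here is a minimal sketch that uses only the standard `pickle` module. It does not call the rain API; the `model` dictionary and the temporary file name are hypothetical stand-ins chosen for the example.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for a trained model: any picklable object works.
model = {"mean": 2.0, "std": 0.5}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.pkl"

    # Roughly what a writer node would do with its 'model' input.
    with open(path, "wb") as fh:
        pickle.dump(model, fh)

    # Roughly what a loader node would do to fill its 'model' output.
    with open(path, "rb") as fh:
        loaded = pickle.load(fh)
```

Unpickling can execute arbitrary code, so a path given to a loader should only point at files from a trusted source.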

rain.nodes.pandas.node_structure module#


class rain.nodes.pandas.node_structure.PandasNode(node_id: str)[source]#

Bases: ComputationalNode

Node that performs a transformation using the pandas library without input/output constraints.

Parameters:

node_id (str) – Unique identifier of the node in the DataFlow.

abstract execute()[source]#: Expose the main functionality: depending on the node, the computation is done using a specific Python library and its function/s.

class rain.nodes.pandas.node_structure.PandasTransformer(node_id: str)[source]#

Bases: ComputationalNode

Parent class for all the nodes that take a dataset as input, apply a transformation and expose the transformed dataset as output.

Parameters:

node_id (str) – Unique identifier of the node in the DataFlow.

dataset = None#

abstract execute()[source]#: Expose the main functionality: depending on the node, the computation is done using a specific Python library and its function/s.

rain.nodes.pandas.pandas_io module#


class rain.nodes.pandas.pandas_io.PandasCSVLoader(node_id: str, path: str, delim: str = ',', index_col: Optional[Union[int, str]] = None)[source]#

Bases: PandasInputNode

Loads a pandas DataFrame from a CSV file.

Output:

dataset (pandas.DataFrame) – The loaded csv file as a pandas DataFrame.

Parameters:

path (str) – Path of the CSV file.
delim (str, default ',') – Delimiter symbol of the CSV file.
index_col (int or str, default None) – Column to use as the row labels of the DataFrame, given as a column name or a column index.

Notes

Visit https://pandas.pydata.org/pandas-docs/version/1.3/reference/api/pandas.read_csv.html for Pandas read_csv documentation.

dataset = None#

execute()[source]#: Expose the main functionality: depending on the node, the computation is done using a specific Python library and its function/s.
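
The parameters above map naturally onto `pandas.read_csv`. The following sketch shows the equivalent direct call, assuming `delim` corresponds to `sep` and `index_col` is passed through unchanged. The in-memory CSV text is made up for the example and replaces a real file path.

```python
import io

import pandas as pd

# Hypothetical CSV content using ';' as the delimiter.
csv_text = "id;name;score\n1;ann;0.5\n2;bob;0.7\n"

# sep plays the role of 'delim'; index_col selects the row-label column.
dataset = pd.read_csv(io.StringIO(csv_text), sep=";", index_col="id")
```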

class rain.nodes.pandas.pandas_io.PandasCSVWriter(node_id: str, path: str, delim: str = ',', include_rows: bool = True, rows_column_label: Optional[str] = None, include_columns: bool = True, columns: Optional[list] = None)[source]#

Bases: PandasOutputNode

Writes a pandas DataFrame into a CSV file.

Input:

dataset (pandas.DataFrame) – The pandas DataFrame to write in a CSV file.

Parameters:

path (str) – Path of the CSV file.
delim (str, default ',') – Delimiter symbol of the CSV file.
include_rows (bool, default True) – Whether to include row indexes.
rows_column_label (str, default None) – If row indexes are included, the name to give to their column.
include_columns (bool, default True) – Whether to include column names.
columns (list[str], default None) – If column names are included, the names to give to them. The order is relevant.

Notes

Visit https://pandas.pydata.org/pandas-docs/version/1.3/reference/api/pandas.DataFrame.to_csv.html for Pandas to_csv documentation.

dataset = None#

execute()[source]#: Expose the main functionality: depending on the node, the computation is done using a specific Python library and its function/s.
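
A plausible correspondence with `DataFrame.to_csv` is sketched below. The mapping is an assumption, not taken from the node's source: `delim` to `sep`, `include_rows` to `index`, `rows_column_label` to `index_label`, and `include_columns`/`columns` to `header`. The data is made up, and a string buffer stands in for the output path.

```python
import io

import pandas as pd

dataset = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

buffer = io.StringIO()
dataset.to_csv(
    buffer,
    sep=";",               # assumed counterpart of 'delim'
    index=True,            # assumed counterpart of 'include_rows'
    index_label="row",     # assumed counterpart of 'rows_column_label'
    header=["x", "y"],     # assumed counterpart of 'columns' (order matters)
)
csv_text = buffer.getvalue()
```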

class rain.nodes.pandas.pandas_io.PandasInputNode(node_id: str)[source]#

Bases: InputNode

Parent class for all the nodes that load a pandas DataFrame from some kind of source.

dataset = None#

abstract execute()[source]#: Expose the main functionality: depending on the node, the computation is done using a specific Python library and its function/s.

class rain.nodes.pandas.pandas_io.PandasOutputNode(node_id: str)[source]#

Bases: OutputNode

Parent class for all the nodes that write a pandas DataFrame to some kind of destination.

dataset = None#

abstract execute()[source]#: Expose the main functionality: depending on the node, the computation is done using a specific Python library and its function/s.

rain.nodes.pandas.transform_nodes module#


class rain.nodes.pandas.transform_nodes.PandasAddColumn(node_id: str, loc: int, col: str)[source]#

Bases: PandasTransformer

Node used to add a column to a pandas DataFrame starting from a given pandas Series.

Input:

dataset (pd.DataFrame) – A pandas DataFrame.
column (pd.Series) – A pandas Series to add to the dataset.

Output:

dataset (pd.DataFrame) – A pandas DataFrame.

Parameters:

node_id (str) – The unique id of the node.
loc (int) – Insertion index. Must satisfy 0 <= loc <= len(columns).
col (str) – Label of the inserted column.

column = None#

dataset = None#

execute()[source]#: Expose the main functionality: depending on the node, the computation is done using a specific Python library and its function/s.
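
The `loc` and `col` parameters match the arguments of `DataFrame.insert`, so the node's effect can be sketched as follows. That the node uses `insert` internally is an assumption; the data is made up.

```python
import pandas as pd

dataset = pd.DataFrame({"a": [1, 2], "c": [5, 6]})
column = pd.Series([3, 4])

# Insert the Series as column 'b' at position 1 (between 'a' and 'c').
# insert() modifies the DataFrame in place.
dataset.insert(loc=1, column="b", value=column)
```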

class rain.nodes.pandas.transform_nodes.PandasColumnsFiltering(node_id: str, column_indexes: Optional[List[int]] = None, column_names: Optional[List[str]] = None, columns_like: Optional[str] = None, columns_regex: Optional[str] = None, columns_range: Optional[Tuple[int, int]] = None, columns_type=None)[source]#

Bases: PandasTransformer

PandasColumnsFiltering manages filtering of columns. This node gives access to several functionalities:
- select columns by their indexes;
- select columns by their names (labels);
- select columns containing a substring in their names;
- select columns that match a regex;
- select columns in a range of indexes;
- assign a type to a column.
All parameters except ‘columns_type’ are mutually exclusive: only one of them can be used at a time.

Input:

dataset (pd.DataFrame) – A pandas DataFrame.

Output:

dataset (pd.DataFrame) – A pandas DataFrame.

Parameters:

node_id (str) – Id of the node.
column_indexes (List[int]) – Filters the dataset selecting the given indexes. Uses the pandas iloc function.
column_names (List[str]) – Filters the dataset selecting the given column labels. Uses the pandas filter function.
columns_like (str) – Keep columns for which the given string is a substring of the column label.
columns_regex (str) – Keep columns for which column labels match a given pattern.
columns_range (Tuple[int, int]) – Keep columns whose index falls within the given range (from, to), with ‘to’ excluded.
columns_type (str or List[str]) – Type to assign to columns. It can be either a string, meaning that it will try to apply the chosen type to all the columns, or a list of strings, one for each column, meaning that it will try to assign a chosen type to each column in order.

dataset = None#

execute()[source]#: Expose the main functionality: depending on the node, the computation is done using a specific Python library and its function/s.
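
Each selection mode has a direct pandas counterpart. The sketch below shows one call per mode on a made-up DataFrame. The parameter descriptions mention `iloc` and `filter`; the use of `astype` for ‘columns_type’ is an assumption.

```python
import pandas as pd

dataset = pd.DataFrame(
    {
        "id": [1, 2],
        "temp_in": [20.5, 21.0],
        "temp_out": [10.0, 11.5],
        "label": ["x", "y"],
    }
)

by_index = dataset.iloc[:, [0, 3]]                # column_indexes=[0, 3]
by_name = dataset.filter(items=["id", "label"])   # column_names=[...]
by_like = dataset.filter(like="temp")             # columns_like="temp"
by_regex = dataset.filter(regex=r"_in$")          # columns_regex=r"_in$"
by_range = dataset.iloc[:, 1:3]                   # columns_range=(1, 3)

# Assumed equivalent of columns_type given as a single string.
typed = by_like.astype("float32")
```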

class rain.nodes.pandas.transform_nodes.PandasDropNan(node_id: str, axis='rows', how='any')[source]#

Bases: PandasTransformer

Drops rows or columns that either contain at least one NaN value or contain only NaN values.

Input:

dataset (pd.DataFrame) – A pandas DataFrame.

Output:

dataset (pd.DataFrame) – A pandas DataFrame.

Parameters:

node_id (str) – Id of the node.
axis ({rows, columns}, default rows) – The axis from which to remove NaN values.
how ({any, all}, default any) – Whether to remove a row or column when it contains any NaN value, or only when all of its values are NaN.

dataset = None#

execute()[source]#: Expose the main functionality: depending on the node, the computation is done using a specific Python library and its function/s.
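
The `axis` and `how` options mirror those of `DataFrame.dropna`. The sketch below shows how the combinations behave on a small made-up DataFrame, assuming the node's ‘rows’ value corresponds to the pandas "index" axis.

```python
import pandas as pd

nan = float("nan")
dataset = pd.DataFrame({"a": [1.0, nan, 3.0], "b": [nan, nan, 6.0]})

# how="any": drop every row that has at least one NaN.
rows_any = dataset.dropna(axis="index", how="any")

# how="all": drop only rows in which every value is NaN.
rows_all = dataset.dropna(axis="index", how="all")

# Same rule applied column-wise: both columns contain a NaN here.
cols_any = dataset.dropna(axis="columns", how="any")
```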

class rain.nodes.pandas.transform_nodes.PandasFilterRows(node_id: str)[source]#

Bases: PandasTransformer

PandasFilterRows manages filtering of rows that have been previously selected.

Input:

dataset (pd.DataFrame) – A pandas DataFrame to filter.
selected_rows (pd.Series) – A pandas Series containing True on the rows to keep.

Output:

dataset (pd.DataFrame) – A pandas DataFrame containing only the selected rows.

Parameters:

node_id (str) – Id of the node.

dataset = None#

execute()[source]#: Expose the main functionality: depending on the node, the computation is done using a specific Python library and its function/s.

selected_rows = None#
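
Keeping the rows marked True in a boolean Series is plain boolean-mask indexing in pandas. A minimal sketch on made-up data, without using the rain API:

```python
import pandas as pd

dataset = pd.DataFrame({"a": [1, 2, 3]})
selected_rows = pd.Series([True, False, True])

# Boolean indexing keeps only the rows where the mask is True.
filtered = dataset[selected_rows]
```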

class rain.nodes.pandas.transform_nodes.PandasGroupBy(node_id: str, key: Optional[str] = None, freq: Optional[str] = None, axis: int = 0, sort: bool = False, dropna: bool = True, aggregates: Optional[str] = None)[source]#

Bases: PandasTransformer

PandasGroupBy groups the rows of a DataFrame by a given key and aggregates the resulting groups.

Input:

dataset (pd.DataFrame) – A pandas DataFrame to group.

Output:

dataset (pd.DataFrame) – A pandas DataFrame resulting from the GroupBy.

Parameters:

node_id (str) – Id of the node.
key (str) – Groupby key, which selects the grouping column of the target.
freq (str) – Groups by the specified frequency if the target selection (via key) is a datetime-like object. For the full specification of available frequencies, see https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases.
axis (int, default=0) – Number of the axis.
sort (bool, default=False) – Whether to sort the resulting labels.
dropna (bool, default=True) – If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups.
aggregates (str or List[str]) – The function used to aggregate the different columns during the GroupBy. It can be either a string, meaning that the chosen aggregation function is applied to all the columns, or a list of strings, one for each column, meaning that each aggregation function is applied to the corresponding column in order.

dataset = None#

execute()[source]#: Expose the main functionality: depending on the node, the computation is done using a specific Python library and its function/s.
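
The `key`, `freq`, `sort` and `dropna` parameters resemble those of `DataFrame.groupby` and `pandas.Grouper`. The sketch below shows a grouping by key and a grouping by daily frequency on made-up data. Whether the node builds a `Grouper` internally is an assumption.

```python
import pandas as pd

dataset = pd.DataFrame(
    {
        "ts": pd.to_datetime(
            ["2023-01-01 08:00", "2023-01-01 20:00", "2023-01-02 09:00"]
        ),
        "sensor": ["a", "b", "a"],
        "value": [1.0, 2.0, 3.0],
    }
)

# Group by a plain key column and aggregate with a named function.
by_key = dataset.groupby("sensor", sort=False, dropna=True)["value"].agg("sum")

# Group a datetime column by frequency: "D" is the daily offset alias.
by_day = dataset.groupby(pd.Grouper(key="ts", freq="D"))["value"].agg("mean")
```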

class rain.nodes.pandas.transform_nodes.PandasPivot(node_id: str, rows: str, columns: str, values: str, aggfunc: str = 'mean', fill_value: int = 0, dropna: bool = True, sort: bool = True)[source]#

Bases: PandasTransformer

Transforms a DataFrame into a Pivot table from the given rows, columns and values.

Input:

dataset (pd.DataFrame) – A pandas DataFrame.

Output:

dataset (pd.DataFrame) – A pandas DataFrame containing a Pivot table.

Parameters:

rows (str) – Name of the column whose values will be the rows of the pivot.
columns (str) – Name of the column whose values will be the columns of the pivot.
values (str) – Name of the column whose values will be the values of the pivot.
aggfunc (str, default 'mean') – Function to use for the aggregation.
fill_value (int, default 0) – Value to replace missing values with.
dropna (bool, default True) – Do not include columns whose entries are all NaN.
sort (bool, default True) – Specifies if the result should be sorted.

dataset = None#

execute()[source]#: Expose the main functionality: depending on the node, the computation is done using a specific Python library and its function/s.
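
These parameters line up with `pandas.pivot_table`. The following sketch assumes `rows` corresponds to `index` and that the remaining arguments are passed through under the same names. The data is made up.

```python
import pandas as pd

dataset = pd.DataFrame(
    {
        "city": ["Rome", "Rome", "Milan"],
        "year": [2022, 2023, 2022],
        "sales": [10, 20, 30],
    }
)

pivot = pd.pivot_table(
    dataset,
    index="city",      # assumed counterpart of 'rows'
    columns="year",
    values="sales",
    aggfunc="mean",
    fill_value=0,      # Milan has no 2023 entry, so that cell becomes 0
    dropna=True,
    sort=True,
)
```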

class rain.nodes.pandas.transform_nodes.PandasRenameColumn(node_id: str, columns: list)[source]#

Bases: PandasTransformer

Sets column names for a pandas DataFrame.

Input:

dataset (pd.DataFrame) – A pandas DataFrame.

Output:

dataset (pd.DataFrame) – A pandas DataFrame.

Parameters:

columns (list[str]) – Column names to assign to the DataFrame. The order is relevant.

dataset = None#

execute()[source]#: Expose the main functionality: depending on the node, the computation is done using a specific Python library and its function/s.
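
Assigning a full list of names in order is what setting `DataFrame.columns` does. A minimal sketch on made-up data, assuming the node replaces all column labels at once:

```python
import pandas as pd

# A DataFrame without meaningful labels: columns default to 0 and 1.
dataset = pd.DataFrame([[1, 2], [3, 4]])

# The list must have one name per existing column, in positional order.
dataset.columns = ["first", "second"]
```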

class rain.nodes.pandas.transform_nodes.PandasReplaceColumn(node_id: str, first_value: Any, second_value: Any)[source]#

Bases: PandasNode

Node used to replace the boolean values of a Pandas Series with other values given by the user.

Input:

column (pd.Series) – A pandas Series containing all True or False values.

Output:

column (pd.Series) – A pandas Series containing the substituted values.

Parameters:

node_id (str) – The unique id of the node.
first_value (Any) – Value used when the condition is True.
second_value (Any) – Value used when the condition is False.

column = None#

execute()[source]#: Expose the main functionality: depending on the node, the computation is done using a specific Python library and its function/s.
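
One way to obtain this substitution in plain pandas is `Series.map` with a two-entry dictionary. This is a sketch of the effect, not necessarily the node's implementation. The replacement values "anomaly" and "normal" are hypothetical.

```python
import pandas as pd

column = pd.Series([True, False, True])

first_value = "anomaly"   # hypothetical value used where the Series is True
second_value = "normal"   # hypothetical value used where the Series is False

replaced = column.map({True: first_value, False: second_value})
```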

class rain.nodes.pandas.transform_nodes.PandasSelectRows(node_id: str, select_nan: bool = False, conditions: Optional[List[str]] = None)[source]#

Bases: PandasNode

PandasSelectRows manages selection of rows, which can later be filtered or deleted.

Input:

dataset (pd.DataFrame) – A pandas DataFrame.

Output:

selection (pd.Series) – A pandas Series containing True on the selected rows and False on the others.
dataset (pd.DataFrame) – The filtered pandas DataFrame.

Parameters:

node_id (str) – Id of the node.
select_nan (bool, default False) – Whether to select rows with at least one NaN value.
conditions (List[str]) – List of conditions to select rows.

dataset = None#

execute()[source]#: Expose the main functionality: depending on the node, the computation is done using a specific Python library and its function/s.

selection = None#
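
The syntax expected for `conditions` is not specified here. Purely as an illustration, the sketch below assumes each condition is an expression string that `DataFrame.eval` can evaluate, and shows how a NaN-based selection could be computed. Both are assumptions about behaviour, and the data is made up.

```python
import pandas as pd

dataset = pd.DataFrame({"a": [1, 5, 9], "b": [2.0, float("nan"), 4.0]})

# Assumed condition format: an expression over column names.
selection = dataset.eval("a > 3")

# Rows with at least one NaN value, as 'select_nan' describes.
nan_rows = dataset.isna().any(axis=1)

# The boolean Series can then be used to filter the DataFrame.
filtered = dataset[selection]
```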

class rain.nodes.pandas.transform_nodes.PandasSequence(node_id: str, stages: List[PandasTransformer])[source]#

Bases: PandasTransformer

PandasSequence wraps a list of nodes that must be executed in sequence into a single node. Intermediate values are passed along the chain using the ‘dataset’ variable, hence only PandasNodes can be used within a sequence.

Input:

dataset (pd.DataFrame) – A pandas DataFrame.

Output:

dataset (pd.DataFrame) – A pandas DataFrame.

Parameters:

node_id (str) – The unique id of the node.
stages (list of PandasTransformer) – Nodes to execute, listed in execution order. They must all be PandasNodes, hence have a ‘dataset’ variable used for input and output.

dataset = None#

execute()[source]#: Expose the main functionality: depending on the node, the computation is done using a specific Python library and its function/s.
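
The idea of passing one ‘dataset’ value along a chain of stages can be sketched with plain functions and `functools.reduce`. The two stage functions below are hypothetical stand-ins for nodes; the real class operates on node objects rather than functions.

```python
from functools import reduce

import pandas as pd


def drop_nan(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stage: remove rows containing NaN."""
    return df.dropna()


def keep_a(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stage: keep only column 'a'."""
    return df.filter(items=["a"])


stages = [drop_nan, keep_a]
dataset = pd.DataFrame({"a": [1.0, None], "b": [2.0, 3.0]})

# Each stage receives the output of the previous one, in list order.
result = reduce(lambda df, stage: stage(df), stages, dataset)
```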

class rain.nodes.pandas.transform_nodes.SplitFeaturesAndLabels(node_id: str, target: str)[source]#

Bases: PandasTransformer

Node used to split a DataFrame into Features and Labels.

Input:

dataset (pd.DataFrame) – A pandas DataFrame.

Output:

dataset (pd.DataFrame) – A pandas DataFrame representing the Features.
labels (pd.Series) – A pandas Series containing the labels.

Parameters:

node_id (str) – The unique id of the node.
target (str) – The name of the column containing the labels.

dataset = None#

execute()[source]#: Expose the main functionality: depending on the node, the computation is done using a specific Python library and its function/s.

labels = None#
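
In plain pandas, this split amounts to selecting the target column and dropping it from the rest. A minimal sketch on made-up data, where the column name "y" is hypothetical:

```python
import pandas as pd

dataset = pd.DataFrame({"f1": [1, 2], "f2": [3, 4], "y": [0, 1]})
target = "y"

# The target column becomes the labels Series.
labels = dataset[target]

# Every other column stays in the features DataFrame.
features = dataset.drop(columns=[target])
```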

rain.nodes.pandas.zscore module#


class rain.nodes.pandas.zscore.ZScorePredictor(node_id: str, columns: List[str] = [], threshold: float = 1.3)[source]#

Bases: PandasNode

Node that returns the predictions performed with a ZScore model on the columns of a dataset.

Input:

dataset (pandas.DataFrame) – The pandas DataFrame.
model (pickle) – The ZScore model in pickle format.

Output:

predictions (pandas.DataFrame) – The DataFrame containing the predictions.

Parameters:

columns (List[str]) – Column names to apply ZScore to. Empty to use all columns.
threshold (float, default=1.3) – The threshold of the ZScore to distinguish anomalies.

dataset = None#

execute()[source]#: Expose the main functionality: depending on the node, the computation is done using a specific Python library and its function/s.

model = None#

predictions = None#

class rain.nodes.pandas.zscore.ZScoreTrainer(node_id: str, columns: List[str] = [])[source]#

Bases: PandasNode

Node that returns the model trained with the ZScore algorithm by analyzing the columns of the dataset.

Input:

dataset (pandas.DataFrame) – The pandas DataFrame.

Output:

model (pickle) – The ZScore model in pickle format.

Parameters:

columns (List[str]) – Column names to apply ZScore to. Empty to use all columns.

dataset = None#

execute()[source]#: Expose the main functionality: depending on the node, the computation is done using a specific Python library and its function/s.

model = None#
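
The internal format of the ZScore model is not documented here. As a sketch of the general technique only, the code below assumes that training stores the per-column mean and standard deviation, and that prediction flags values whose absolute z-score exceeds the threshold. The dictionary layout and the data are hypothetical.

```python
import pandas as pd

# "Training": record per-column mean and sample standard deviation.
train = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
model = {"mean": train.mean(), "std": train.std()}

# "Prediction": compute z-scores of new data against the stored statistics.
new = pd.DataFrame({"x": [3.0, 10.0]})
z = (new - model["mean"]) / model["std"]

threshold = 1.3  # default threshold from the ZScorePredictor signature
predictions = z.abs() > threshold
```

Here 3.0 equals the training mean, so its z-score is 0, while 10.0 lies more than four standard deviations away and is flagged.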

Module contents#
