rain.nodes.pandas package#
Submodules#
rain.nodes.pandas.model_io module#
Copyright (C) 2023 Università degli Studi di Camerino and Sigma S.p.A. Authors: Alessandro Antinori, Rosario Capparuccia, Riccardo Coltrinari, Flavio Corradini, Marco Piangerelli, Barbara Re, Marco Scarpetta
This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.
- class rain.nodes.pandas.model_io.PickleModelLoader(node_id: str, path: str)[source]#
Bases: InputNode
Node that loads a given object, for instance a trained model, stored in pickle format.
- Output:
model (pickle) – The loaded object in pickle format.
- Parameters:
node_id (str) – Id of the node.
path (str) – The path of the stored object/model.
- execute()[source]#
Expose the main functionality: depending on the node, the computation is done using a specific Python library and its function/s.
- model = None#
- class rain.nodes.pandas.model_io.PickleModelWriter(node_id: str, path: str)[source]#
Bases: OutputNode
Node that stores a given object, for instance a trained model, in pickle format.
- Input:
model (pickle) – The object/model to store.
- Parameters:
node_id (str) – Id of the node.
path (str) – The path/filename where to store the object/model.
- execute()[source]#
Expose the main functionality: depending on the node, the computation is done using a specific Python library and its function/s.
- model = None#
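The round trip described by PickleModelWriter and PickleModelLoader can be sketched with the standard pickle module alone. This is a minimal illustration, not the nodes' actual implementation: the dictionary standing in for a trained model and the file name model.pkl are assumptions made for the example.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a trained model (assumption); any picklable object works.
model = {"mean": 2.5, "std": 0.5}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.pkl"  # hypothetical file name

    # Writer side: serialize the object to `path`.
    with open(path, "wb") as fh:
        pickle.dump(model, fh)

    # Loader side: read the object back from `path`.
    with open(path, "rb") as fh:
        loaded = pickle.load(fh)
```

Only unpickle files that come from a trusted source, since pickle.load can execute arbitrary code while deserializing.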
rain.nodes.pandas.node_structure module#
- class rain.nodes.pandas.node_structure.PandasNode(node_id: str)[source]#
Bases: ComputationalNode
Node that performs some transformation using the Pandas library without input/output constraints.
- Parameters:
node_id (str) – Unique identifier of the node in the DataFlow.
- class rain.nodes.pandas.node_structure.PandasTransformer(node_id: str)[source]#
Bases: ComputationalNode
Parent class for all the nodes that take a dataset as input, apply a transformation and expose the transformed dataset as output.
- Parameters:
node_id (str) – Unique identifier of the node in the DataFlow.
- dataset = None#
rain.nodes.pandas.pandas_io module#
- class rain.nodes.pandas.pandas_io.PandasCSVLoader(node_id: str, path: str, delim: str = ',', index_col: Optional[Union[int, str]] = None)[source]#
Bases: PandasInputNode
Loads a pandas DataFrame from a CSV file.
- Output:
dataset (pandas.DataFrame) – The loaded csv file as a pandas DataFrame.
- Parameters:
path (str) – Path of the CSV file.
delim (str, default ',') – Delimiter symbol of the CSV file.
index_col (str, default None) – Column to use as the row labels of the DataFrame, given as its string name.
Notes
Visit https://pandas.pydata.org/pandas-docs/version/1.3/reference/api/pandas.read_csv.html for Pandas read_csv documentation.
- dataset = None#
- class rain.nodes.pandas.pandas_io.PandasCSVWriter(node_id: str, path: str, delim: str = ',', include_rows: bool = True, rows_column_label: Optional[str] = None, include_columns: bool = True, columns: Optional[list] = None)[source]#
Bases: PandasOutputNode
Writes a pandas DataFrame into a CSV file.
- Input:
dataset (pandas.DataFrame) – The pandas DataFrame to write in a CSV file.
- Parameters:
path (str) – Path of the CSV file.
delim (str, default ',') – Delimiter symbol of the CSV file.
include_rows (bool, default True) – Whether to include row indexes.
rows_column_label (str, default None) – If row indexes are included, the name to give to their column.
include_columns (bool, default True) – Whether to include column names.
columns (list[str], default None) – If column names are included, the names to give to them. The order is relevant.
Notes
Visit https://pandas.pydata.org/pandas-docs/version/1.3/reference/api/pandas.DataFrame.to_csv.html for Pandas to_csv documentation.
- dataset = None#
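The parameters of PandasCSVWriter and PandasCSVLoader read like thin wrappers over DataFrame.to_csv and pandas.read_csv. The sketch below calls pandas directly; the mapping from node parameters to pandas keywords shown in the comments is an assumption based on the descriptions above, not something confirmed by the node source, and the sample data is invented for the example.

```python
import io

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])

# Writing (assumed mapping): delim -> sep, include_rows -> index,
# rows_column_label -> index_label, include_columns -> header.
buffer = io.StringIO()
df.to_csv(buffer, sep=";", index=True, index_label="id", header=True)
csv_text = buffer.getvalue()

# Reading (assumed mapping): delim -> sep, index_col -> index_col.
loaded = pd.read_csv(io.StringIO(csv_text), sep=";", index_col="id")
```

An in-memory buffer is used here only to keep the example self-contained; with the nodes, path would point to a file on disk.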
- class rain.nodes.pandas.pandas_io.PandasInputNode(node_id: str)[source]#
Bases: InputNode
Parent class for all the nodes that load a pandas DataFrame from some kind of source.
- dataset = None#
- class rain.nodes.pandas.pandas_io.PandasOutputNode(node_id: str)[source]#
Bases: OutputNode
Parent class for all the nodes that return a pandas DataFrame toward some kind of destination.
- dataset = None#
rain.nodes.pandas.transform_nodes module#
- class rain.nodes.pandas.transform_nodes.PandasAddColumn(node_id: str, loc: int, col: str)[source]#
Bases: PandasTransformer
Node used to add a column to a pandas DataFrame starting from a given pandas Series.
- Input:
dataset (pd.DataFrame) – A pandas DataFrame.
column (pd.Series) – A pandas Series to add to the dataset.
- Output:
dataset (pd.DataFrame) – A pandas DataFrame.
- Parameters:
node_id (str) – The unique id of the node.
loc (int) – Insertion index. Must satisfy 0 <= loc <= len(columns).
col (str) – Label of the inserted column.
- column = None#
- dataset = None#
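The loc and col parameters match the arguments of pandas DataFrame.insert, so the operation can be sketched as follows. Whether PandasAddColumn calls insert internally is an assumption; the sample data is invented.

```python
import pandas as pd

dataset = pd.DataFrame({"a": [1, 2], "c": [5, 6]})
column = pd.Series([3, 4])

# Insert `column` at position 1 under the label "b".
# DataFrame.insert modifies `dataset` in place and returns None.
dataset.insert(loc=1, column="b", value=column)
```

With loc=0 the new column becomes the first one; with loc=len(dataset.columns) it is appended at the end.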
- class rain.nodes.pandas.transform_nodes.PandasColumnsFiltering(node_id: str, column_indexes: Optional[List[int]] = None, column_names: Optional[List[str]] = None, columns_like: Optional[str] = None, columns_regex: Optional[str] = None, columns_range: Optional[Tuple[int, int]] = None, columns_type=None)[source]#
Bases: PandasTransformer
PandasColumnsFiltering manages filtering of columns. This node gives access to several functionalities such as:
- select columns by their indexes;
- select columns by their names (labels);
- select columns containing a substring in their names;
- select columns that match a regex;
- select columns in a range of indexes;
- assign a type to a column.
Every parameter but ‘columns_type’ is mutually exclusive, meaning that only one can be used.
- Input:
dataset (pd.DataFrame) – A pandas DataFrame.
- Output:
dataset (pd.DataFrame) – A pandas DataFrame.
- Parameters:
node_id (str) – Id of the node.
column_indexes (List[int]) – Filters the dataset selecting the given indexes. Uses the pandas iloc function.
column_names (List[str]) – Filters the dataset selecting the given column labels. Uses the pandas filter function.
columns_like (str) – Keep columns for which the given string is a substring of the column label.
columns_regex (str) – Keep columns for which column labels match a given pattern.
columns_range (Tuple[int, int]) – Keep columns whose index falls within the given range (from, to (excluded)).
columns_type (str or List[str]) – Type to assign to columns. It can be either a string, meaning that it will try to apply the chosen type to all the columns, or a list of strings, one for each column, meaning that it will try to assign a chosen type to each column in order.
- dataset = None#
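Each selection mode listed above has a direct pandas counterpart. The sketch below shows one plausible equivalent per parameter; the column names and values are invented, and the use of astype for columns_type is an assumption (the documentation only names iloc and filter).

```python
import pandas as pd

df = pd.DataFrame(
    {"temp_in": [20.5], "temp_out": [11.0], "humidity": [40], "id": [7]}
)

by_index = df.iloc[:, [0, 3]]                   # column_indexes=[0, 3]
by_name = df.filter(items=["humidity", "id"])   # column_names=["humidity", "id"]
by_like = df.filter(like="temp")                # columns_like="temp"
by_regex = df.filter(regex=r"_out$")            # columns_regex=r"_out$"
by_range = df.iloc[:, 1:3]                      # columns_range=(1, 3), end excluded

# columns_type (assumed to behave like astype): a single type for every
# column, or one type per column in order.
as_float = by_like.astype("float64")
per_column = by_range.astype(dict(zip(by_range.columns, ["float64", "int64"])))
```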
- class rain.nodes.pandas.transform_nodes.PandasDropNan(node_id: str, axis='rows', how='any')[source]#
Bases: PandasTransformer
Drops rows or columns that either have at least one NaN value or have all NaN values.
- Input:
dataset (pd.DataFrame) – A pandas DataFrame.
- Output:
dataset (pd.DataFrame) – A pandas DataFrame.
- Parameters:
node_id (str) – Id of the node.
axis ({rows, columns}, default rows) – The axis from which to remove NaN values.
how ({any, all}, default any) – Whether to remove a row or column when it contains any NaN value or only when all of its values are NaN.
- dataset = None#
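The axis and how options correspond to the parameters of the same name in pandas DataFrame.dropna. A small sketch of the three most common combinations, on invented data and assuming the node delegates to dropna:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

rows_any = df.dropna(axis="index", how="any")    # axis='rows', how='any'
rows_all = df.dropna(axis="index", how="all")    # axis='rows', how='all'
cols_any = df.dropna(axis="columns", how="any")  # axis='columns', how='any'
```

Here rows_any keeps only the last row, rows_all drops only the middle row (the one that is entirely NaN), and cols_any drops both columns because each contains at least one NaN.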
- class rain.nodes.pandas.transform_nodes.PandasFilterRows(node_id: str)[source]#
Bases: PandasTransformer
PandasFilterRows manages filtering of rows that have been previously selected.
- Input:
dataset (pd.DataFrame) – A pandas DataFrame to filter.
selected_rows (pd.Series) – A pandas Series containing True on the rows to keep.
- Output:
dataset (pd.DataFrame) – A pandas DataFrame containing only the selected rows.
- Parameters:
node_id (str) – Id of the node.
- dataset = None#
- execute()[source]#
Expose the main functionality: depending on the node, the computation is done using a specific Python library and its function/s.
- selected_rows = None#
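Filtering with a boolean Series is ordinary pandas boolean indexing. The sketch below builds such a Series by hand, in the shape of the selection output of PandasSelectRows, and applies it; the data and the condition are invented for the example.

```python
import pandas as pd

dataset = pd.DataFrame({"value": [5, 12, 7, 20]})

# Boolean Series aligned with the dataset index: True marks the rows to keep.
selected_rows = dataset["value"] > 10

# Keep only the rows where the mask is True.
filtered = dataset[selected_rows]
```

The original index labels are preserved in the result, so the surviving rows can still be traced back to the input dataset.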
- class rain.nodes.pandas.transform_nodes.PandasGroupBy(node_id: str, key: Optional[str] = None, freq: Optional[str] = None, axis: int = 0, sort: bool = False, dropna: bool = True, aggregates: Optional[str] = None)[source]#
Bases: PandasTransformer
PandasGroupBy groups the rows of a DataFrame by a key column, optionally at a given frequency, and aggregates each group.
- Input:
dataset (pd.DataFrame) – A pandas DataFrame to group.
- Output:
dataset (pd.DataFrame) – A pandas DataFrame resulting from the GroupBy.
- Parameters:
node_id (str) – Id of the node.
key (str) – Groupby key, which selects the grouping column of the target.
freq (str) – Groups by the specified frequency if the target selection (via key) is a datetime-like object. For the full specification of available frequencies, see https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases.
axis (int, default=0) – Number of the axis.
sort (bool, default=False) – Whether to sort the resulting labels.
dropna (bool, default=True) – If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups.
aggregates (str or List[str]) – The function used to aggregate the different columns during the GroupBy. It can be either a string, meaning that it will try to apply the chosen aggregation function to all the columns, or a list of strings, one for each column, meaning that it will try to apply the corresponding aggregation function to each column in order.
- dataset = None#
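The key, freq, sort and dropna parameters mirror those of pandas groupby and pandas.Grouper. The following sketch shows grouping by a plain key and by a daily frequency on a datetime column; it calls pandas directly, and the column names, data and the "sum" aggregate are assumptions chosen for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ts": pd.to_datetime(
            ["2023-01-01 08:00", "2023-01-01 17:00", "2023-01-02 09:00"]
        ),
        "sensor": ["a", "b", "a"],
        "value": [1, 2, 10],
    }
)

# key="sensor": one group per distinct value, kept in order of appearance.
by_key = df.groupby("sensor", sort=False, dropna=True)["value"].agg("sum")

# key="ts", freq="D": one group per calendar day of the datetime column.
daily = df.groupby(pd.Grouper(key="ts", freq="D"))["value"].agg("sum")
```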
- class rain.nodes.pandas.transform_nodes.PandasPivot(node_id: str, rows: str, columns: str, values: str, aggfunc: str = 'mean', fill_value: int = 0, dropna: bool = True, sort: bool = True)[source]#
Bases: PandasTransformer
Transforms a DataFrame into a Pivot table from the given rows, columns and values.
- Input:
dataset (pd.DataFrame) – A pandas DataFrame.
- Output:
dataset (pd.DataFrame) – A pandas DataFrame containing a Pivot table.
- Parameters:
rows (str) – Name of the column whose values will be the rows of the pivot.
columns (str) – Name of the column whose values will be the columns of the pivot.
values (str) – Name of the column whose values will be the values of the pivot.
aggfunc (str, default 'mean') – Function to use for the aggregation.
fill_value (int, default 0) – Value to replace missing values with.
dropna (bool, default True) – Do not include columns whose entries are all NaN.
sort (bool, default True) – Specifies if the result should be sorted.
- dataset = None#
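These parameters line up with pandas.pivot_table, with rows playing the role of the index argument. A minimal sketch under that assumption, on invented data (the sort keyword requires pandas 1.3 or later):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "city": ["Rome", "Rome", "Milan", "Rome"],
        "year": [2021, 2022, 2021, 2021],
        "sales": [10.0, 30.0, 5.0, 20.0],
    }
)

pivot = pd.pivot_table(
    df,
    index="city",      # rows
    columns="year",    # columns
    values="sales",    # values
    aggfunc="mean",
    fill_value=0,      # Milan has no 2022 entry, so that cell becomes 0
    dropna=True,
    sort=True,
)
```

The two Rome entries for 2021 are averaged into a single cell, which is the effect of aggfunc when several rows share the same (rows, columns) pair.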
- class rain.nodes.pandas.transform_nodes.PandasRenameColumn(node_id: str, columns: list)[source]#
Bases: PandasTransformer
Sets column names for a pandas DataFrame.
- Input:
dataset (pd.DataFrame) – A pandas DataFrame.
- Output:
dataset (pd.DataFrame) – A pandas DataFrame.
- Parameters:
columns (list[str]) – Column names to assign to the DataFrame. The order is relevant.
- dataset = None#
- class rain.nodes.pandas.transform_nodes.PandasReplaceColumn(node_id: str, first_value: Any, second_value: Any)[source]#
Bases: PandasNode
Node used to replace the boolean values of a pandas Series with other values given by the user.
- Input:
column (pd.Series) – A pandas Series containing all True or False values.
- Output:
column (pd.Series) – A pandas Series containing the substituted values.
- Parameters:
node_id (str) – The unique id of the node.
first_value (Any) – Value used when the condition is True.
second_value (Any) – Value used when the condition is False.
- column = None#
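One way to express this substitution in plain pandas is Series.map with a two-entry dictionary. The sketch below is an illustration of the behaviour described above, not the node's implementation; the replacement values "anomaly" and "normal" are invented.

```python
import pandas as pd

column = pd.Series([True, False, True])

first_value = "anomaly"   # used where the Series is True
second_value = "normal"   # used where the Series is False

replaced = column.map({True: first_value, False: second_value})
```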
- class rain.nodes.pandas.transform_nodes.PandasSelectRows(node_id: str, select_nan: bool = False, conditions: Optional[List[str]] = None)[source]#
Bases: PandasNode
PandasSelectRows manages selection of rows, which can later be filtered or deleted.
- Input:
dataset (pd.DataFrame) – A pandas DataFrame.
- Output:
selection (pd.Series) – A pandas Series containing True on the selected rows and False on the others.
dataset (pd.DataFrame) – The filtered pandas DataFrame.
- Parameters:
node_id (str) – Id of the node.
select_nan (bool, default False) – Whether to select rows with at least one NaN value.
conditions (List[str]) – List of conditions to select rows.
- dataset = None#
- execute()[source]#
Expose the main functionality: depending on the node, the computation is done using a specific Python library and its function/s.
- selection = None#
- class rain.nodes.pandas.transform_nodes.PandasSequence(node_id: str, stages: List[PandasTransformer])[source]#
Bases: PandasTransformer
PandasSequence wraps a list of nodes that must be executed in sequence into a single node. Intermediate values are passed along the chain using the ‘dataset’ variable, hence only PandasNodes can be used within a sequence.
- Input:
dataset (pd.DataFrame) – A pandas DataFrame.
- Output:
dataset (pd.DataFrame) – A pandas DataFrame.
- Parameters:
node_id (str) – The unique id of the node.
stages (list of PandasTransformer) – Nodes ordered in an execution sequence. They must all be PandasNodes, hence have a ‘dataset’ variable used for input and output.
- dataset = None#
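The chaining idea, where each stage receives the dataset produced by the previous one, can be illustrated without the library by folding a list of functions over a DataFrame. The two stage functions below are hypothetical stand-ins for transformer nodes, not part of the rain API.

```python
from functools import reduce

import pandas as pd


def drop_missing(dataset: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stage: remove rows containing NaN."""
    return dataset.dropna()


def keep_positive(dataset: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stage: keep rows with a positive 'value'."""
    return dataset[dataset["value"] > 0]


stages = [drop_missing, keep_positive]
dataset = pd.DataFrame({"value": [1.0, None, -2.0, 4.0]})

# Feed the output of each stage into the next one, in order.
result = reduce(lambda data, stage: stage(data), stages, dataset)
```

Because every stage takes and returns a DataFrame, stages can be reordered or removed without changing the surrounding code, which is the property the ‘dataset’ convention provides.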
- class rain.nodes.pandas.transform_nodes.SplitFeaturesAndLabels(node_id: str, target: str)[source]#
Bases: PandasTransformer
Node used to split a DataFrame into Features and Labels.
- Input:
dataset (pd.DataFrame) – A pandas DataFrame.
- Output:
dataset (pd.DataFrame) – A pandas DataFrame representing the Features.
labels (pd.Series) – A pandas Series containing the labels.
- Parameters:
node_id (str) – The unique id of the node.
target (str) – The name of the column containing the labels.
- dataset = None#
- execute()[source]#
Expose the main functionality: depending on the node, the computation is done using a specific Python library and its function/s.
- labels = None#
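In plain pandas, the split amounts to selecting the target column as a Series and dropping it from the DataFrame. A short sketch on invented data, offered as an illustration rather than the node's actual code:

```python
import pandas as pd

dataset = pd.DataFrame(
    {"f1": [0.1, 0.2], "f2": [1, 0], "label": ["ok", "ko"]}
)
target = "label"

labels = dataset[target]                   # pd.Series with the labels
features = dataset.drop(columns=[target])  # DataFrame without the target column
```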
rain.nodes.pandas.zscore module#
- class rain.nodes.pandas.zscore.ZScorePredictor(node_id: str, columns: List[str] = [], threshold: float = 1.3)[source]#
Bases: PandasNode
Node that returns the predictions performed with a ZScore model on the columns of a dataset.
- Input:
dataset (pandas.DataFrame) – The pandas DataFrame.
model (pickle) – The ZScore model in pickle format.
- Output:
predictions (pandas.DataFrame) – The DataFrame containing the predictions.
- Parameters:
columns (List[str]) – Column names to apply ZScore to. Empty to use all columns.
threshold (float, default=1.3) – The threshold of the ZScore to distinguish anomalies.
- dataset = None#
- execute()[source]#
Expose the main functionality: depending on the node, the computation is done using a specific Python library and its function/s.
- model = None#
- predictions = None#
- class rain.nodes.pandas.zscore.ZScoreTrainer(node_id: str, columns: List[str] = [])[source]#
Bases: PandasNode
Node that returns the model trained with the ZScore algorithm by analyzing the columns of the dataset.
- Input:
dataset (pandas.DataFrame) – The pandas DataFrame.
- Output:
model (pickle) – The ZScore model in pickle format.
- Parameters:
columns (List[str]) – Column names to apply ZScore to. Empty to use all columns.
- dataset = None#
- execute()[source]#
Expose the main functionality: depending on the node, the computation is done using a specific Python library and its function/s.
- model = None#
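As a rough illustration of z-score based anomaly detection, the sketch below "trains" by storing the per-column mean and standard deviation, then flags values whose absolute z-score exceeds the threshold. The content of the model object, the use of the sample standard deviation and the boolean output format are all assumptions; the actual ZScoreTrainer and ZScorePredictor may differ.

```python
import pandas as pd

columns = ["x"]
train = pd.DataFrame({"x": [10.0, 11.0, 9.0, 10.0, 10.0]})

# "Training" (assumed): keep the statistics needed to compute z-scores later.
model = {"mean": train[columns].mean(), "std": train[columns].std()}

# "Prediction" (assumed): z = (value - mean) / std, anomaly if |z| > threshold.
new_data = pd.DataFrame({"x": [10.5, 14.0]})
threshold = 1.3
z = (new_data[columns] - model["mean"]) / model["std"]
predictions = z.abs() > threshold
```

With this training data the mean is 10.0 and the sample standard deviation is about 0.71, so 10.5 stays below the threshold while 14.0 is flagged.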
Module contents#