rain.nodes.spark.pipeline package#
Submodules#
rain.nodes.spark.pipeline.spark_pipeline module#
Copyright (C) 2023 Università degli Studi di Camerino and Sigma S.p.A. Authors: Alessandro Antinori, Rosario Capparuccia, Riccardo Coltrinari, Flavio Corradini, Marco Piangerelli, Barbara Re, Marco Scarpetta
This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.
- class rain.nodes.spark.pipeline.spark_pipeline.SparkPipelineNode(node_id: str, stages: List[SparkNode])[source]#
Bases: Estimator
Represents a Spark Pipeline consisting of SparkNode stages. It should contain some Spark Transformers and a final Spark Estimator that returns the trained model.
- Input:
dataset (DataFrame) – A Spark DataFrame.
- Output:
model (PipelineModel) – A Spark PipelineModel.
- Parameters:
node_id (str) – Id of the node.
stages (List[SparkNode]) – List of SparkNode instances that can be executed as stages of a Spark Pipeline.
Notes
Visit https://spark.apache.org/docs/latest/ml-pipeline.html#pipeline for Spark Pipeline documentation.
- dataset = None#
- execute()[source]#
Exposes the node's main functionality: depending on the node, the computation is performed using a specific Python library and its functions.
- model = None#
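The pipeline pattern described for SparkPipelineNode (some transformers, then a final estimator whose fit step yields the trained model) can be sketched in plain Python. This is only an illustration of the pattern, not the rain or Spark implementation: `SplitWords`, `MajorityLabel`, `MajorityLabelModel`, `FittedPipeline`, and `fit_pipeline` are hypothetical names, and rows are plain dictionaries rather than a Spark DataFrame.

```python
from collections import Counter
from typing import Dict, List

Row = Dict[str, object]


class SplitWords:
    """Hypothetical transformer: adds a column of lowercase tokens."""

    def __init__(self, in_col: str, out_col: str) -> None:
        self.in_col = in_col
        self.out_col = out_col

    def transform(self, rows: List[Row]) -> List[Row]:
        return [
            {**row, self.out_col: str(row[self.in_col]).lower().split()}
            for row in rows
        ]


class MajorityLabelModel:
    """Fitted model: predicts the label seen most often during fit()."""

    def __init__(self, label: object) -> None:
        self.label = label

    def transform(self, rows: List[Row]) -> List[Row]:
        return [{**row, "prediction": self.label} for row in rows]


class MajorityLabel:
    """Hypothetical estimator: fit() returns a MajorityLabelModel."""

    def fit(self, rows: List[Row]) -> MajorityLabelModel:
        label, _ = Counter(row["label"] for row in rows).most_common(1)[0]
        return MajorityLabelModel(label)


class FittedPipeline:
    """Result of fitting: the transformers plus the fitted final model."""

    def __init__(self, stages: list) -> None:
        self.stages = stages

    def transform(self, rows: List[Row]) -> List[Row]:
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


def fit_pipeline(stages: list, rows: List[Row]) -> FittedPipeline:
    """Run every transformer in order, then fit the final estimator."""
    *transformers, estimator = stages
    for transformer in transformers:
        rows = transformer.transform(rows)
    model = estimator.fit(rows)
    return FittedPipeline([*transformers, model])
```

Calling `fit_pipeline([SplitWords("text", "words"), MajorityLabel()], rows)` returns an object whose `transform` replays the same stages on new rows, which is the role the PipelineModel output plays here.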
rain.nodes.spark.pipeline.stages module#
- class rain.nodes.spark.pipeline.stages.HashingTF(node_id: str, in_col: str, out_col: str)[source]#
Bases: Transformer
Represents a Spark HashingTF, which maps a sequence of terms to their term frequencies using the hashing trick.
- Input:
dataset (DataFrame) – A Spark DataFrame.
- Output:
dataset (DataFrame) – The modified Spark DataFrame.
- Parameters:
node_id (str) – Id of the node.
in_col (str) – The name of the input column.
out_col (str) – The name of the output column.
- dataset = None#
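The hashing trick mentioned above can be illustrated with a few lines of plain Python. This is a minimal sketch, not Spark's implementation: the function name `hashing_tf` and the parameter `num_features` are assumptions made for illustration, and `zlib.crc32` stands in for the hash function Spark actually uses.

```python
import zlib
from collections import Counter
from typing import Dict, List


def hashing_tf(terms: List[str], num_features: int = 16) -> Dict[int, int]:
    """Map each term to a bucket index and count occurrences per bucket."""
    counts: Counter = Counter()
    for term in terms:
        # The hash value modulo the vector size gives the bucket index.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(counts)
```

Because the index comes from a hash rather than a vocabulary lookup, no dictionary of terms has to be built first. The trade-off is that two different terms can land in the same bucket, and a larger `num_features` makes such collisions less likely.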
- class rain.nodes.spark.pipeline.stages.LogisticRegression(node_id: str, max_iter: int, reg_param: float)[source]#
Bases: Estimator
Represents a SparkNode that supports fitting a traditional logistic regression model.
- Input:
dataset (DataFrame) – A Spark DataFrame.
- Output:
model (PipelineModel) – A Spark PipelineModel.
- Parameters:
node_id (str) – Id of the node.
max_iter (int) – Max number of iterations.
reg_param (float) – Regularization parameter.
- dataset = None#
- execute()[source]#
Exposes the node's main functionality: depending on the node, the computation is performed using a specific Python library and its functions.
- model = None#
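To show what the two parameters control, here is a small pure-Python sketch of logistic regression. It deliberately swaps in plain batch gradient descent with an L2 penalty, which is not how Spark fits the model. The function name `fit_logistic` and the learning rate `lr` are assumptions; only `max_iter` and `reg_param` correspond to the parameters documented above.

```python
import math
from typing import List, Tuple


def fit_logistic(
    xs: List[List[float]],
    ys: List[int],
    max_iter: int = 100,
    reg_param: float = 0.0,
    lr: float = 0.5,  # assumed step size, not a parameter of the node
) -> Tuple[List[float], float]:
    """Fit weights and bias with batch gradient descent and an L2 penalty."""
    n_samples, n_features = len(xs), len(xs[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(max_iter):  # max_iter bounds the number of update steps
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(xs, ys):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = 1.0 / (1.0 + math.exp(-z)) - y
            for j in range(n_features):
                grad_w[j] += error * x[j]
            grad_b += error
        for j in range(n_features):
            # reg_param pulls each weight toward zero on every step
            weights[j] -= lr * (grad_w[j] / n_samples + reg_param * weights[j])
        bias -= lr * grad_b / n_samples
    return weights, bias
```

A larger `reg_param` yields smaller weights, and `max_iter` caps how long the optimizer runs regardless of whether it has converged.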
- class rain.nodes.spark.pipeline.stages.Tokenizer(node_id: str, in_col: str, out_col: str)[source]#
Bases: Transformer
Represents a Spark Tokenizer, used to split text into individual terms.
- Input:
dataset (DataFrame) – A Spark DataFrame.
- Output:
dataset (DataFrame) – The modified Spark DataFrame.
- Parameters:
node_id (str) – Id of the node.
in_col (str) – The name of the input column.
out_col (str) – The name of the output column.
- dataset = None#
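As a rough approximation of what tokenization does to one row, the sketch below lowercases a string and splits it on whitespace. The function name `tokenize` and the dictionary-based row are assumptions for illustration. Spark's own Tokenizer works on DataFrame columns and may treat repeated whitespace differently from Python's `str.split`.

```python
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into whitespace-separated terms."""
    return text.lower().split()


def add_tokens(row: Dict[str, object], in_col: str, out_col: str) -> Dict[str, object]:
    """Return a copy of the row with the tokens stored under out_col."""
    return {**row, out_col: tokenize(str(row[in_col]))}
```

Here `in_col` and `out_col` play the same role as the parameters documented above: the input column is read, and the resulting list of terms is written to the output column without altering the original text.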
Module contents#