rain.nodes.spark.pipeline package

Submodules

rain.nodes.spark.pipeline.spark_pipeline module

Copyright (C) 2023 Università degli Studi di Camerino and Sigma S.p.A. Authors: Alessandro Antinori, Rosario Capparuccia, Riccardo Coltrinari, Flavio Corradini, Marco Piangerelli, Barbara Re, Marco Scarpetta

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.

class rain.nodes.spark.pipeline.spark_pipeline.SparkPipelineNode(node_id: str, stages: List[SparkNode])

Bases: Estimator

Represent a Spark Pipeline consisting of SparkNode stages. It should contain some Spark Transformers and a final Spark Estimator that returns the trained model.

Input:

dataset (DataFrame) – A Spark DataFrame.

Output:

model (PipelineModel) – A Spark PipelineModel.

Parameters:
  • node_id (str) – Id of the node.

  • stages (List[SparkNode]) – List of SparkNode objects that can be executed in a Spark Pipeline.

Notes

Visit https://spark.apache.org/docs/latest/ml-pipeline.html#pipeline for Spark Pipeline documentation.
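
The stage ordering described above (some transformers, then a final estimator that returns the trained model) can be sketched in plain Python. This is only an illustration of the pattern, not the rain or Spark implementation: the stage functions, the toy "model" (mean word count per label), and the `fit_pipeline` helper are all hypothetical names introduced here. In PySpark itself the corresponding call is `pyspark.ml.Pipeline(stages=[...]).fit(dataset)`, which returns a `PipelineModel`.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Stage = Callable[[List[Row]], List[Row]]


def add_words(rows: List[Row]) -> List[Row]:
    """Transformer-like stage: derive a 'words' column from 'text'."""
    return [{**row, "words": row["text"].lower().split()} for row in rows]


def add_length(rows: List[Row]) -> List[Row]:
    """Transformer-like stage: derive a 'length' column from 'words'."""
    return [{**row, "length": len(row["words"])} for row in rows]


def fit_mean_length(rows: List[Row]) -> Dict[int, float]:
    """Estimator-like final stage: the toy 'model' is the mean length per label."""
    lengths: Dict[int, List[int]] = {}
    for row in rows:
        lengths.setdefault(row["label"], []).append(row["length"])
    return {label: sum(v) / len(v) for label, v in lengths.items()}


def fit_pipeline(
    rows: List[Row],
    transformers: List[Stage],
    estimator: Callable[[List[Row]], Dict[int, float]],
) -> Dict[int, float]:
    """Run every transformer in order, then fit the final estimator."""
    for transform in transformers:
        rows = transform(rows)
    return estimator(rows)


data = [
    {"text": "Hello Spark", "label": 1},
    {"text": "a b c d", "label": 1},
    {"text": "hi", "label": 0},
]
model = fit_pipeline(data, [add_words, add_length], fit_mean_length)
print(model)  # {1: 3.0, 0: 1.0}
```

Each transformer only adds a column and passes the rows on, so the estimator at the end sees the output of every earlier stage.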

dataset = None

execute()

Entry point for the node's main functionality: depending on the node, the computation is performed with a specific Python library and its functions.

model = None

rain.nodes.spark.pipeline.stages module

class rain.nodes.spark.pipeline.stages.HashingTF(node_id: str, in_col: str, out_col: str)

Bases: Transformer

Represent a Spark HashingTF that maps a sequence of terms to their term frequencies using the hashing trick.

Input:

dataset (DataFrame) – A Spark DataFrame.

Output:

dataset (DataFrame) – The modified Spark DataFrame.

Parameters:
  • node_id (str) – Id of the node.

  • in_col (str) – The name of the input column.

  • out_col (str) – The name of the output column.
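
As a rough illustration of the hashing trick mentioned above, the sketch below maps each term to a bucket index with a hash function and counts how many terms land in each bucket. It is a plain-Python approximation under stated assumptions: it uses `zlib.crc32` for a deterministic hash, whereas Spark's HashingTF uses MurmurHash3, and the function name `hashing_tf` and the `num_features` default are chosen here for illustration only.

```python
import zlib
from typing import Dict, List


def hashing_tf(terms: List[str], num_features: int = 16) -> Dict[int, float]:
    """Map terms to a sparse {bucket index: term frequency} dictionary."""
    frequencies: Dict[int, float] = {}
    for term in terms:
        # Hash the term into one of num_features buckets (assumption: CRC32).
        index = zlib.crc32(term.encode("utf-8")) % num_features
        frequencies[index] = frequencies.get(index, 0.0) + 1.0
    return frequencies


tf = hashing_tf(["spark", "hashing", "spark"])
print(sum(tf.values()))  # 3.0
```

Because different terms can hash to the same bucket, a larger `num_features` lowers the chance of collisions at the cost of a wider feature vector.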

dataset = None

execute()

Entry point for the node's main functionality: depending on the node, the computation is performed with a specific Python library and its functions.

class rain.nodes.spark.pipeline.stages.LogisticRegression(node_id: str, max_iter: int, reg_param: float)

Bases: Estimator

Represent a SparkNode that fits a traditional logistic regression model.

Input:

dataset (DataFrame) – A Spark DataFrame.

Output:

model (PipelineModel) – A Spark PipelineModel.

Parameters:
  • node_id (str) – Id of the node.

  • max_iter (int) – Maximum number of iterations.

  • reg_param (float) – Regularization parameter.
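
To make the roles of max_iter and reg_param concrete, here is a minimal plain-Python sketch of logistic regression with an L2 penalty. It is not how Spark fits the model: it swaps in batch gradient descent for simplicity, and the function names, the learning rate `lr`, and the toy data are assumptions made for this example.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(
    X: List[List[float]],
    y: List[int],
    max_iter: int = 100,
    reg_param: float = 0.01,
    lr: float = 0.5,
) -> Tuple[List[float], float]:
    """Batch gradient descent; reg_param scales an L2 penalty on the weights."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(max_iter):  # max_iter caps the number of update steps
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Averaged gradient plus the L2 term; the bias is not regularized.
        w = [wj - lr * (gj / n + reg_param * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w: List[float], b: float, x: List[float]) -> float:
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
w, b = fit_logistic(X, y, max_iter=200, reg_param=0.01)
w_strong, _ = fit_logistic(X, y, max_iter=200, reg_param=1.0)
```

A larger reg_param pulls the weights toward zero, which is why `w_strong` ends up smaller in magnitude than `w` on this toy data.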

dataset = None

execute()

Entry point for the node's main functionality: depending on the node, the computation is performed with a specific Python library and its functions.

model = None

class rain.nodes.spark.pipeline.stages.Tokenizer(node_id: str, in_col: str, out_col: str)

Bases: Transformer

Represent a Spark Tokenizer used to split text into individual terms.

Input:

dataset (DataFrame) – A Spark DataFrame.

Output:

dataset (DataFrame) – The modified Spark DataFrame.

Parameters:
  • node_id (str) – Id of the node.

  • in_col (str) – The name of the input column.

  • out_col (str) – The name of the output column.
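
The effect of the in_col and out_col parameters can be sketched without Spark. The example below lowercases the text in the input column and splits it on whitespace, writing the terms to the output column, which approximates what Spark's Tokenizer does. Rows are modeled as plain dictionaries and the `tokenize` helper is a hypothetical name used only for this illustration.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def tokenize(rows: List[Row], in_col: str, out_col: str) -> List[Row]:
    """Return new rows with out_col holding the lowercased terms of in_col."""
    return [{**row, out_col: row[in_col].lower().split()} for row in rows]


rows = [{"text": "Hi I heard about Spark"}]
tokenized = tokenize(rows, in_col="text", out_col="words")
print(tokenized[0]["words"])  # ['hi', 'i', 'heard', 'about', 'spark']
```

The input column is left in place, so later stages can still read the original text.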

dataset = None

execute()

Entry point for the node's main functionality: depending on the node, the computation is performed with a specific Python library and its functions.

Module contents
