
Machine Learning in Spark

  •  

I am using Apache Spark to perform sentiment analysis, and I am using the Naive Bayes algorithm to classify the text. I don't know how to find the probability of the labels. I would be grateful for a snippet in Python that finds the probability of the labels.

      June 11, 2019 4:59 PM IST
    0
    • Biswajeet Dasmajumdar
      Probabilities are calculated separately for each class. This means that we first calculate the probability that a new piece of data belongs to the first class, then calculate the probability that it belongs to the second class, and so on for all the... (see the sketch after this comment)
      October 7, 2020
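
      To make the idea in the comment above concrete, here is a minimal pure-Python toy sketch (the numbers are made up; this is not Spark code): for a new document we compute P(class) times the product of P(feature | class) for each class, then normalize the scores to get per-label probabilities.

      # Toy Naive Bayes scoring with made-up numbers (illustration only)
      priors = {"pos": 0.6, "neg": 0.4}               # P(class)
      likelihoods = {                                 # P(word | class)
          "pos": {"good": 0.30, "bad": 0.05},
          "neg": {"good": 0.08, "bad": 0.25},
      }

      doc = ["good", "bad"]
      scores = {}
      for c in priors:
          score = priors[c]                           # start from the class prior
          for word in doc:
              score *= likelihoods[c][word]           # multiply in each word likelihood
          scores[c] = score

      total = sum(scores.values())
      probs = {c: s / total for c, s in scores.items()}   # normalized label probabilities
      print(probs)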
  • Apache Spark is known as a fast, easy-to-use and general engine for big data processing, with built-in modules for streaming, SQL, Machine Learning (ML) and graph processing. This technology is an in-demand skill for data engineers, but data scientists can also benefit from learning Spark when doing Exploratory Data Analysis (EDA), feature extraction and, of course, ML.

    In this tutorial, you’ll interface Spark with Python through PySpark, the Spark Python API that exposes the Spark programming model to Python. More concretely, you’ll focus on…

      September 11, 2021 1:31 PM IST
    0
  • The probability can be found for the test dataset once you have trained the model and transformed the test dataset. For example, if your trained Naive Bayes model is model, then model.transform(test) contains a probability column. The code below demonstrates this on the Iris dataset, showing the probability column along with other useful columns.
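
    Note: the snippets below assume an irisdf DataFrame containing the four measurement columns and a Species string column, plus a labelIndexer that converts Species into a numeric label column. A minimal setup sketch (the file path and column names are assumptions):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import StringIndexer

    spark = SparkSession.builder.appName("IrisNaiveBayes").getOrCreate()

    # Load the Iris data (path and schema are assumptions for this sketch)
    irisdf = spark.read.csv("iris.csv", header=True, inferSchema=True)

    # Map the string Species column to a numeric label column for the classifier
    labelIndexer = StringIndexer(inputCol="Species", outputCol="label")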

    Partition the dataset randomly into training and test sets, setting a seed for reproducibility:

    (trainingData, testData) = irisdf.randomSplit([0.7, 0.3], seed=100)

    trainingData.cache()
    testData.cache()

    print(trainingData.count())
    print(testData.count())

    Output:

    103
    47

    Next, we will use the VectorAssembler() to merge our feature columns into a single vector column, which we will be passing into our Naive Bayes model. Again, we will not transform the dataset just yet as we will be passing the VectorAssembler into our ML Pipeline.

    from pyspark.ml.feature import VectorAssembler
    vecAssembler = VectorAssembler(inputCols=["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"], outputCol="features")

    The Iris dataset has three classes, namely setosa, versicolor and virginica, so let's create a multiclass Naive Bayes classifier using the pyspark.ml library.

    from pyspark.ml.classification import NaiveBayes
    from pyspark.ml import Pipeline

    # Train a NaiveBayes model
    nb = NaiveBayes(smoothing=1.0, modelType="multinomial")

    # Chain labelIndexer, vecAssembler and NBmodel in a pipeline
    pipeline = Pipeline(stages=[labelIndexer, vecAssembler, nb])

    # Run stages in pipeline and train model
    model = pipeline.fit(trainingData)

    Analyse the created model, from which we can make predictions.

    predictions = model.transform(testData)
    # Display what results we can view
    predictions.printSchema()

    Output:

    root
    |-- SepalLength: double (nullable = true)
    |-- SepalWidth: double (nullable = true)
    |-- PetalLength: double (nullable = true)
    |-- PetalWidth: double (nullable = true)
    |-- Species: string (nullable = true)
    |-- label: double (nullable = true)
    |-- features: vector (nullable = true)
    |-- rawPrediction: vector (nullable = true)
    |-- probability: vector (nullable = true)
    |-- prediction: double (nullable = true)

    You can also select particular columns to view:

    # Display selected columns only
    display(predictions.select("label", "prediction", "probability"))

    The above will show the results in tabular format. Note that display() is a Databricks notebook helper; in plain PySpark, use .show() on the selected DataFrame instead.
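
    If you need the probability of the predicted label as a plain double rather than the whole vector, one option is a small UDF. A minimal sketch; the helper name prob_of_prediction and the output column predProbability are made up for illustration:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    # Extract the probability assigned to the predicted class from the vector
    prob_of_prediction = udf(lambda probs, pred: float(probs[int(pred)]), DoubleType())

    (predictions
        .withColumn("predProbability", prob_of_prediction("probability", "prediction"))
        .select("label", "prediction", "predProbability")
        .show(5, truncate=False))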

    Reference:

    spark
    Models using pipeline
    https://mike.seddon.ca/natural-language-processing-with-apache-spark-ml-and-amazon-reviews-part-1/
    https://stackoverflow.com/questions/31028806/how-to-create-correct-data-frame-for-classification-in-spark-ml


      June 11, 2019 5:01 PM IST
    0
  • Machine Learning Library (MLlib) Guide

    MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides tools such as:

    • ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
    • Featurization: feature extraction, transformation, dimensionality reduction, and selection
    • Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
    • Persistence: saving and loading algorithms, models, and Pipelines (see the sketch after this list)
    • Utilities: linear algebra, statistics, data handling, etc.
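
    As an example of the Persistence tools, a fitted pipeline model (such as the model from the answer above) can be saved to disk and reloaded later. A minimal sketch, assuming the earlier model and testData and a made-up path:

    # Save the fitted PipelineModel to disk (the path is an assumption)
    model.write().overwrite().save("/tmp/iris-nb-model")

    # Reload it later and use it to score data
    from pyspark.ml import PipelineModel
    reloaded = PipelineModel.load("/tmp/iris-nb-model")
    reloaded.transform(testData).select("prediction").show(5)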

    As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package.

    What are the implications?

    • MLlib will still support the RDD-based API in spark.mllib with bug fixes.
    • MLlib will not add new features to the RDD-based API.
    • In the Spark 2.x releases, MLlib will add features to the DataFrame-based API to reach feature parity with the RDD-based API.

    Why is MLlib switching to the DataFrame-based API?

    • DataFrames provide a more user-friendly API than RDDs. The many benefits of DataFrames include Spark Datasources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and uniform APIs across languages.
    • The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages.
    • DataFrames facilitate practical ML Pipelines, particularly feature transformations. See the Pipelines guide for details.

    What is “Spark ML”?

    “Spark ML” is not an official name; it is occasionally used to refer to the MLlib DataFrame-based API. This is mostly due to the org.apache.spark.ml Scala package name used by the DataFrame-based API, and to the “Spark ML Pipelines” term initially used to emphasize the pipeline concept.

      August 28, 2021 1:43 PM IST
    0