
Machine Learning in Spark

  •  

I am using Apache Spark to perform sentiment analysis, and I am using the Naive Bayes algorithm to classify the text. I don't know how to find the probability of the labels. I would be grateful for a snippet in Python that finds the probability of the labels.

      June 11, 2019 4:59 PM IST
    0
    • Biswajeet Dasmajumdar
      Probabilities are calculated separately for each class. This means that we first calculate the probability that a new piece of data belongs to the first class, then calculate the probability that it belongs to the second class, and so on for all the... (see the sketch after this comment)
      October 7, 2020
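
      To make the idea in the comment above concrete, here is a minimal pure-Python toy sketch (the numbers are made up; this is not Spark code): for a new document we compute P(class) times the product of P(feature | class) for each class, then normalize the scores to get per-label probabilities.

      # Toy Naive Bayes scoring with made-up numbers (illustration only)
      priors = {"pos": 0.6, "neg": 0.4}               # P(class)
      likelihoods = {                                 # P(word | class)
          "pos": {"good": 0.30, "bad": 0.05},
          "neg": {"good": 0.08, "bad": 0.25},
      }

      doc = ["good", "bad"]
      scores = {}
      for c in priors:
          score = priors[c]                           # start from the class prior
          for word in doc:
              score *= likelihoods[c][word]           # multiply in each word likelihood
          scores[c] = score

      total = sum(scores.values())
      probs = {c: s / total for c, s in scores.items()}   # normalized label probabilities
      print(probs)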
  • Apache Spark is known as a fast, easy-to-use and general engine for big data processing, with built-in modules for streaming, SQL, Machine Learning (ML) and graph processing. This technology is an in-demand skill for data engineers, but data scientists can also benefit from learning Spark when doing Exploratory Data Analysis (EDA), feature extraction and, of course, ML.

    In this tutorial, you’ll interface Spark with Python through PySpark, the Spark Python API that exposes the Spark programming model to Python. More concretely, you’ll focus on…

      September 11, 2021 1:31 PM IST
    0
  • The probability can be found for the test dataset once you have trained the model and transformed the test dataset. For example, if your trained Naive Bayes model is model, then model.transform(test) contains a probability column. The code below demonstrates this on the Iris dataset, showing the probability column along with other useful columns.
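
    Note: the snippets below assume an irisdf DataFrame containing the four measurement columns and a Species string column, plus a labelIndexer that converts Species into a numeric label column. A minimal setup sketch (the file path and column names are assumptions):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import StringIndexer

    spark = SparkSession.builder.appName("IrisNaiveBayes").getOrCreate()

    # Load the Iris data (path and schema are assumptions for this sketch)
    irisdf = spark.read.csv("iris.csv", header=True, inferSchema=True)

    # Map the string Species column to a numeric label column for the classifier
    labelIndexer = StringIndexer(inputCol="Species", outputCol="label")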

    Partition the dataset randomly into training and test sets, setting a seed for reproducibility:

    (trainingData, testData) = irisdf.randomSplit([0.7, 0.3], seed=100)

    trainingData.cache()
    testData.cache()

    print(trainingData.count())
    print(testData.count())

    Output:

    103
    47

    Next, we will use the VectorAssembler() to merge our feature columns into a single vector column, which we will be passing into our Naive Bayes model. Again, we will not transform the dataset just yet as we will be passing the VectorAssembler into our ML Pipeline.

    from pyspark.ml.feature import VectorAssembler
    vecAssembler = VectorAssembler(inputCols=["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"], outputCol="features")

    The Iris dataset has three classes, namely setosa, versicolor and virginica, so let's create a multiclass Naive Bayes classifier using the pyspark.ml library.

    from pyspark.ml.classification import NaiveBayes
    from pyspark.ml import Pipeline

    # Train a NaiveBayes model
    nb = NaiveBayes(smoothing=1.0, modelType="multinomial")

    # Chain labelIndexer, vecAssembler and NBmodel in a pipeline
    pipeline = Pipeline(stages=[labelIndexer, vecAssembler, nb])

    # Run stages in pipeline and train model
    model = pipeline.fit(trainingData)

    Analyse the created model, from which we can make predictions.

    predictions = model.transform(testData)
    # Display what results we can view
    predictions.printSchema()

    Output:

    root
    |-- SepalLength: double (nullable = true)
    |-- SepalWidth: double (nullable = true)
    |-- PetalLength: double (nullable = true)
    |-- PetalWidth: double (nullable = true)
    |-- Species: string (nullable = true)
    |-- label: double (nullable = true)
    |-- features: vector (nullable = true)
    |-- rawPrediction: vector (nullable = true)
    |-- probability: vector (nullable = true)
    |-- prediction: double (nullable = true)

    You can also select particular columns to view:

    # Display selected columns only
    display(predictions.select("label", "prediction", "probability"))

    The above will show the results in tabular format. Note that display() is a Databricks notebook helper; in plain PySpark, use .show() on the selected DataFrame instead.
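
    If you need the probability of the predicted label as a plain double rather than the whole vector, one option is a small UDF. A minimal sketch; the helper name prob_of_prediction and the output column predProbability are made up for illustration:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    # Extract the probability assigned to the predicted class from the vector
    prob_of_prediction = udf(lambda probs, pred: float(probs[int(pred)]), DoubleType())

    (predictions
        .withColumn("predProbability", prob_of_prediction("probability", "prediction"))
        .select("label", "prediction", "predProbability")
        .show(5, truncate=False))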

    Reference:

    spark
    Models using pipeline
    https://mike.seddon.ca/natural-language-processing-with-apache-spark-ml-and-amazon-reviews-part-1/
    https://stackoverflow.com/questions/31028806/how-to-create-correct-data-frame-for-classification-in-spark-ml


      June 11, 2019 5:01 PM IST
    0
  • Machine Learning Library (MLlib) Guide

    MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides tools such as:

    • ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
    • Featurization: feature extraction, transformation, dimensionality reduction, and selection
    • Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
    • Persistence: saving and loading algorithms, models, and Pipelines (see the sketch after this list)
    • Utilities: linear algebra, statistics, data handling, etc.
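
    As an example of the Persistence tools, a fitted pipeline model (such as the model from the answer above) can be saved to disk and reloaded later. A minimal sketch, assuming the earlier model and testData and a made-up path:

    # Save the fitted PipelineModel to disk (the path is an assumption)
    model.write().overwrite().save("/tmp/iris-nb-model")

    # Reload it later and use it to score data
    from pyspark.ml import PipelineModel
    reloaded = PipelineModel.load("/tmp/iris-nb-model")
    reloaded.transform(testData).select("prediction").show(5)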

    As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package.

    What are the implications?

    • MLlib will still support the RDD-based API in spark.mllib with bug fixes.
    • MLlib will not add new features to the RDD-based API.
    • In the Spark 2.x releases, MLlib will add features to the DataFrame-based API to reach feature parity with the RDD-based API.

    Why is MLlib switching to the DataFrame-based API?

    • DataFrames provide a more user-friendly API than RDDs. The many benefits of DataFrames include Spark Datasources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and uniform APIs across languages.
    • The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages.
    • DataFrames facilitate practical ML Pipelines, particularly feature transformations. See the Pipelines guide for details.

    What is “Spark ML”?

    “Spark ML” is not an official name; it is occasionally used to refer to the MLlib DataFrame-based API. This is mostly due to the org.apache.spark.ml Scala package name used by the DataFrame-based API, and to the “Spark ML Pipelines” term initially used to emphasize the pipeline concept.

      August 28, 2021 1:43 PM IST
    0