
Why Import pandas in PySpark?

  • Hi. At university, in the data science area, we learned that if we want to work with small data we should use pandas, and if we work with Big Data we should use Spark; for Python programmers, that means PySpark.

    Recently, in a cloud hackathon (Azure Synapse, which runs on Spark), I saw pandas being imported in the notebook (I assume the code is good, since it was written by Microsoft people):

    import pandas 
    from azureml.core import Dataset
    training_pd = training_data.toPandas().to_csv('training_pd.csv', index=False)

     

    Why do they do that?

     
      December 15, 2021 12:45 PM IST
    0
  • Basically, it seems that the person who did that work simply feels more comfortable in Pandas. Of course, Pandas doesn't scale: if your data set grows, you need more RAM and probably a faster CPU (faster in terms of single-core performance). While this may be limiting for some scenarios, it seems that in this example the CSV would not be big enough to need Spark. I can't see any other reason.
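
    To illustrate (this is my own minimal sketch, not from the original notebook; `training_data` is assumed to be a Spark DataFrame as in the question, and the output paths are made up):

    # What the quoted snippet does: pull ALL rows onto the driver as a pandas
    # DataFrame, then write a single local CSV. Only safe for small data sets.
    training_data.toPandas().to_csv('training_pd.csv', index=False)

    # Staying in Spark instead: write the CSV in a distributed fashion
    # (one file per partition), with no driver-memory bottleneck.
    training_data.write.mode('overwrite').option('header', True).csv('/tmp/training_csv')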

     
      December 22, 2021 1:30 PM IST
    0
  • Pandas dataframes do not support parallelization. On the other hand, with Pandas you need no cluster, you have more libraries and easy-to-extend examples. And let's be real, its performance is better for every task that doesn't require scaling.

    So, if you start your data engineering life learning Pandas, you're stuck with two things:

    • Externalized knowledge: ready-made code, snippets, and projects;
    • Internalized knowledge: an API that you know well and prefer, patterns, guarantees, and a gut feeling for how to write this kind of code in general.

    To a man with a hammer, everything looks like a nail. And that's not always a bad thing. If you have strict deadlines, done is better than perfect! Better to use Pandas now than to spend years learning properly scalable solutions.

    Imagine you want to use an Apache Zeppelin notebook in PySpark mode, with all its cool visualizations. But it doesn't quite meet your requirements, and you're thinking about how to quick-fix that. At the same time, you can instantly google a ready-made solution for Pandas. That's the way to go when you have no other option if you want to meet your deadlines.

    Another guess: if you write your code in Python, you can debug it easily in any good IDE like PyCharm, using the interactive debugger. That generally isn't true for online notebooks, especially in Spark mode. Do you know of any good debugger for Spark? I don't (the people behind the Big Data Tools plugin for IDEA are trying to fix this for Scala, but not for Python, as far as I know). So you have to write code in the IDE and then copy-paste it into the notebook.

    And last but not least, it may just be a mistake. People do not always know exactly what they're doing, especially in a field as large as Big Data. You're fortunate to have this university course; the average Joe on the internet had no such option.

    I should stop here because only speculations lie ahead.

      December 24, 2021 1:20 PM IST
    0
  • The main difference between working with PySpark and Pandas is the syntax. To show this difference, I provide a simple example of reading in a parquet file and doing some transformations on the data. As you can see, the syntax is completely different between PySpark and Pandas, which means that your Pandas knowledge is not directly transferable to PySpark.

    # Pandas
    import pandas as pd

    pandasDF = pd.read_parquet(path_to_data)
    pandasDF['SumOfTwoColumns'] = pandasDF['Column1'] + pandasDF['Column2']
    pandasDF.rename({'Column1': 'Col1', 'Column2': 'Col2'}, axis=1, inplace=True)
    
    # PySpark (assumes an active SparkSession named `spark`)
    from pyspark.sql.functions import col

    sparkDF = spark.read.parquet(path_to_data)
    sparkDF = sparkDF.withColumn('SumOfTwoColumns', col('Column1') + col('Column2'))
    sparkDF = sparkDF.withColumnRenamed('Column1', 'Col1').withColumnRenamed('Column2', 'Col2')

     

    These differences in usage, but also in syntax, mean that there is a learning curve when moving from pure Pandas code to pure PySpark code. It also means that your legacy Pandas code cannot be used directly on Spark with PySpark. Luckily, there are solutions that allow you to use your Pandas code and knowledge on Spark.

    Solutions to leverage the power of Spark with Pandas

    There are mainly two options for using Pandas code on Spark: Koalas and Pandas UDFs.
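
    Neither was shown in the original example, but here is a rough sketch of both (my own illustration, reusing `path_to_data`, `spark`, and the column names from the snippet above; note that Koalas has since been merged into Spark 3.2+ as the pandas API on Spark, `pyspark.pandas`):

    # Option 1: pandas API on Spark (formerly Koalas): pandas-like syntax, executed by Spark
    import pyspark.pandas as ps

    psDF = ps.read_parquet(path_to_data)
    psDF['SumOfTwoColumns'] = psDF['Column1'] + psDF['Column2']
    psDF = psDF.rename(columns={'Column1': 'Col1', 'Column2': 'Col2'})

    # Option 2: a Pandas UDF, i.e. vectorized pandas code applied batch by batch to a Spark DataFrame
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf('double')  # return type assumed to be double for this illustration
    def add_columns(a: pd.Series, b: pd.Series) -> pd.Series:
        return a + b

    sparkDF = spark.read.parquet(path_to_data)
    sparkDF = sparkDF.withColumn('SumOfTwoColumns', add_columns('Column1', 'Column2'))

    The pandas-on-Spark route keeps the pandas syntax end to end, while the Pandas UDF route keeps the Spark DataFrame API but lets individual column-level transformations run as pandas code.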

    Although it's not generally recommended to use Pandas while working with PySpark, I have sometimes seen people do it anyway.

      December 28, 2021 12:18 PM IST
    0