Uploaded image for project: 'HPCC'
  1. HPCC
  2. HPCC-21936

spark upgrade from 2.3 to 2.4

    XMLWordPrintable

Details

    Description

      There is a request to update the spark from 2.3 to 2.4. From spark 2.4, the pyspark will support UDF aggregation through pandas. Below code( from https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html) can run through from spark 2.4, but 2.3. Such capability is pretty important to analytics as the default aggregation functions in spark are limited. In many cases, we want to customize the aggregation function by case, and it will be ideal to integrate the pandas rich data manipulation ability into the aggregation function. So, it will be a great deal for us to upgrade the spark to 2.4

      Thanks!

       

      from pyspark.sql.functions import pandas_udf, PandasUDFType

      from pyspark.sql import Window

       

      df = spark.createDataFrame(

          [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],

          ("id", "v"))

       

      @pandas_udf("double", PandasUDFType.GROUPED_AGG)

      def mean_udf(v):

          return v.mean()

       

      df.groupby("id").agg(mean_udf(df['v'])).show()

      # ------------+

      # | id|mean_udf(v)|

      # ------------+

      # |  1|        1.5|

      # |  2|        6.0|

      # ------------+

       

      w = Window \

          .partitionBy('id') \

          .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

      df.withColumn('mean_v', mean_udf(df['v']).over(w)).show()

      # ---------

      # | id|   v|mean_v|

      # ---------

      # |  1| 1.0|   1.5|

      # |  1| 2.0|   1.5|

      # |  2| 3.0|   6.0|

      # |  2| 5.0|   6.0|

      # |  2|10.0|   6.0|

      # ---------

      Attachments

        Activity

          People

            Michael-Gardner Michael Gardner
            Wang Huaizhou(Joe)
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: