HPCC-21936

Spark upgrade from 2.3 to 2.4

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.4.0
    • Component/s: Spark-HPCC
    • Labels:
      None

      Description

      This is a request to upgrade Spark from 2.3 to 2.4. Starting with Spark 2.4, PySpark supports user-defined aggregation functions through pandas UDFs. The code below (from https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html) runs on Spark 2.4 but not on 2.3. This capability is important for analytics because Spark's built-in aggregation functions are limited. In many cases we want to customize the aggregation function for a specific use case, and it would be ideal to use pandas' rich data-manipulation capabilities inside it. Upgrading Spark to 2.4 would therefore be of great value to us.

      Thanks!

       

      from pyspark.sql import SparkSession, Window
      from pyspark.sql.functions import pandas_udf, PandasUDFType

      # Reuse the active session (already defined as `spark` in the pyspark shell).
      spark = SparkSession.builder.getOrCreate()

      df = spark.createDataFrame(
          [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
          ("id", "v"))

      # Grouped aggregate pandas UDF: receives a pandas.Series, returns a scalar.
      @pandas_udf("double", PandasUDFType.GROUPED_AGG)
      def mean_udf(v):
          return v.mean()

      # Use it as a group-by aggregation.
      df.groupby("id").agg(mean_udf(df['v'])).show()
      # +---+-----------+
      # | id|mean_udf(v)|
      # +---+-----------+
      # |  1|        1.5|
      # |  2|        6.0|
      # +---+-----------+

      # Use it as a window function over an unbounded per-id window.
      w = Window \
          .partitionBy('id') \
          .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
      df.withColumn('mean_v', mean_udf(df['v']).over(w)).show()
      # +---+----+------+
      # | id|   v|mean_v|
      # +---+----+------+
      # |  1| 1.0|   1.5|
      # |  1| 2.0|   1.5|
      # |  2| 3.0|   6.0|
      # |  2| 5.0|   6.0|
      # |  2|10.0|   6.0|
      # +---+----+------+
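
      To illustrate the kind of custom aggregation this request is about, here is a minimal sketch that is not part of the original report. It assumes a cluster where the Spark version may still be 2.3, so it first checks the version string before building a grouped aggregate pandas UDF. The helper names `supports_grouped_agg` and `make_range_udf` are hypothetical and chosen only for this example; the value range (max minus min) stands in for whatever pandas-based aggregation an analyst might need.

```python
def supports_grouped_agg(spark_version: str) -> bool:
    """Return True if the given Spark version string is 2.4 or newer.

    Hypothetical helper: per this ticket, grouped aggregate pandas UDFs
    work from Spark 2.4 onward but not on 2.3.
    """
    # Only major and minor are compared, so "2.4.0" and "2.4.0-SNAPSHOT" both work.
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    return (major, minor) >= (2, 4)


def make_range_udf():
    """Build a grouped aggregate pandas UDF returning max - min per group.

    pyspark is imported lazily so the version check above can be used
    (and tested) on machines without Spark installed.
    """
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    @pandas_udf("double", PandasUDFType.GROUPED_AGG)
    def range_udf(v):
        # v is a pandas.Series holding one group's values.
        return v.max() - v.min()

    return range_udf
```

      With the `spark` session and `df` from the example above, this could be used as `if supports_grouped_agg(spark.version): df.groupby("id").agg(make_range_udf()(df["v"])).show()`. For that sample data the range would be 1.0 for id 1 and 7.0 for id 2.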

            People

            • Assignee: Michael Gardner
            • Reporter: Wang Huaizhou (Joe)