JAPI-382: Pushdown filters in Spark Connector


Details

    • Type: Question
    • Status: Resolved
    • Priority: Not specified
    • Resolution: Fixed
    • Affects Version/s: 7.12.32
    • Fix Version/s: 7.12.x
    • Component/s: Spark
    • Labels: None

    Description

      I'm getting questions about "push down" filtering with the Spark HPCC Connector.

      I'm trying to get clarification from the engineer here at the client.

       

      https://spark.apache.org/docs/latest/sql-data-sources-parquet.html

      This page shows:

      spark.sql.parquet.filterPushdown (default: true): Enables Parquet filter push-down optimization when set to true.
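      (For context: per that page, this flag only governs the built-in Parquet source, and it is set like any other SQL conf.)

      scala> spark.conf.set("spark.sql.parquet.filterPushdown", "true")  // Parquet-specific; defaults to true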

      Is this relevant to the Spark HPCC Connector?

      Does the Connector support "filterPushdown"? If so, I'd love to get some discussion/notes on that.
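      For context, my understanding (an assumption on my part, not a statement about the Spark-HPCC internals) is that a source opts in to pushdown by implementing Spark's filter-pushdown hook; in the DataSource V1 API that is PrunedFilteredScan, where Spark hands the source the required columns and whatever predicates it could convert. A rough sketch of the shape, with a made-up class name:

      import org.apache.spark.rdd.RDD
      import org.apache.spark.sql.{Row, SQLContext}
      import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
      import org.apache.spark.sql.types.StructType

      // Illustrative only: "HpccLikeRelation" is hypothetical, not the connector's class.
      class HpccLikeRelation(val sqlContext: SQLContext, val schema: StructType)
          extends BaseRelation with PrunedFilteredScan {
        // Spark calls buildScan with the columns the query needs and the filters
        // it was able to translate; a pushdown-capable source turns `filters`
        // into a remote read filter so only matching rows come back.
        override def buildScan(requiredColumns: Array[String],
                               filters: Array[Filter]): RDD[Row] = {
          // ...translate `filters` into an HPCC-side read filter here...
          sqlContext.sparkContext.emptyRDD[Row]
        }
      }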

       

       

      Here are the examples from the engineer:

       

      Push Down model

       

      scala> val df = spark.read.option("host", "http://XX.XX.XX.XX:8010").option("cluster", "hthor").option("path", "myThor::data_index").format("hpcc").load()
      scala> df.createOrReplaceTempView("SomeInfo")
      scala> spark.sql("SELECT f1, f2, f3 from SomeInfo where f1 = 'ABC123'").show(false)
      scala> spark.sql("SELECT f1, f2, f3 from SomeInfo where f1 = 'DEF456'").show(false)
      scala> spark.sql("SELECT f1, f2, f3 from SomeInfo where f1 = 'IJK789'").show(false)
      

       

       

      File Filter model

       

      scala> val df1 = spark.read.option("host", "http://XX.XX.XX.XX:8010").option("cluster", "hthor").option("path", "myThor::data_index").option("filter", "f1=['ABC123']").format("hpcc").load()
      scala> df1.createOrReplaceTempView("SomeInfo1")
      scala> spark.sql("SELECT f1, f2, f3 from SomeInfo1").show(false)

      scala> val df2 = spark.read.option("host", "http://XX.XX.XX.XX:8010").option("cluster", "hthor").option("path", "myThor::data_index").option("filter", "f1=['DEF456']").format("hpcc").load()
      scala> df2.createOrReplaceTempView("SomeInfo2")
      scala> spark.sql("SELECT f1, f2, f3 from SomeInfo2").show(false)

      scala> val df3 = spark.read.option("host", "http://XX.XX.XX.XX:8010").option("cluster", "hthor").option("path", "myThor::data_index").option("filter", "f1=['IJK789']").format("hpcc").load()
      scala> df3.createOrReplaceTempView("SomeInfo3")
      scala> spark.sql("SELECT f1, f2, f3 from SomeInfo3").show(false)
      

       

       

      In which of these examples would the filter (f1 = "...") be passed down to HPCC (so that the entire file is not passed up to Spark)?
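      One way to check empirically (my suggestion; the exact plan output depends on how the connector's relation is implemented) is to print the physical plan for the push-down case and look for the predicate in the scan node:

      scala> spark.sql("SELECT f1, f2, f3 from SomeInfo where f1 = 'ABC123'").explain(true)
      // If the hpcc relation accepts the predicate, the scan node reports it,
      // e.g. "PushedFilters: [IsNotNull(f1), EqualTo(f1,ABC123)]"; an empty
      // PushedFilters list would mean Spark reads everything and filters afterwards.
      // In the File Filter model there is nothing left to push: the predicate is
      // already baked into the relation created by load().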

       

       

      More on "push down":

      https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-Optimizer-PushDownPredicate.html

      https://databricks.com/session/the-pushdown-of-everything

      https://databricks.com/session/apache-spark-data-source-v2

People

    Assignee: Rodrigo Pastrana (rpastrana)
    Reporter: James Wiltshire (jwilt)
