Details
- Type: Question
- Status: Resolved
- Priority: Not specified
- Resolution: Fixed
- Fix Version: 7.12.32
- Components: None
Description
I'm getting questions about "push down" filtering with the Spark HPCC Connector.
I'm trying to get clarification from the engineer here at the client.
https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
This page shows:
Property Name | Default | Meaning
spark.sql.parquet.filterPushdown | true | Enables Parquet filter push-down optimization when set to true.
Is this relevant to the Spark HPCC Connector?
Does the Connector support "filterPushdown"? If so, I'd love to get some discussion/notes on that.
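One way to check this empirically, whatever the documentation says, is to inspect the physical plan: a source that supports predicate push-down reports the pushed predicates in its scan node. A minimal sketch in spark-shell (the host/path options are copied from the examples below; whether a "PushedFilters" entry appears depends on how the connector's relation is implemented):

import org.apache.spark.sql.functions.col

// Read through the connector exactly as in the examples below.
val df = spark.read
  .option("host", "http://XX.XX.XX.XX:8010")
  .option("cluster", "hthor")
  .option("path", "myThor::data_index")
  .format("hpcc")
  .load()

// If the source supports push-down, the scan node of the physical plan
// lists the predicate under "PushedFilters"; an empty list means Spark
// reads everything and filters afterwards.
df.filter(col("f1") === "ABC123").explain(true)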
Here are the examples from the engineer:
Push Down model
scala> val df = spark.read.option("host", "http://XX.XX.XX.XX:8010").option("cluster", "hthor").option("path", "myThor::data_index").format("hpcc").load()
scala> df.createOrReplaceTempView("SomeInfo")
scala> spark.sql("SELECT f1, f2, f3 from SomeInfo where f1 = 'ABC123'").show(false)
scala> spark.sql("SELECT f1, f2, f3 from SomeInfo where f1 = 'DEF456'").show(false)
scala> spark.sql("SELECT f1, f2, f3 from SomeInfo where f1 = 'IJK789'").show(false)
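For context on what "push down" means mechanically in this model: Spark parses the WHERE clause and, if the source's relation implements a push-down interface such as DataSource V1's PrunedFilteredScan, hands the predicate to the source so it can filter remotely instead of shipping the whole file to Spark. A sketch of that contract (the class below is illustrative, not the connector's actual code):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.StructType

// Illustrative relation: Catalyst passes WHERE predicates it can express
// (e.g. EqualTo("f1", "ABC123")) into buildScan, where a push-down-capable
// source translates them into its native read filter.
class ExampleRelation(ctx: SQLContext, override val schema: StructType)
    extends BaseRelation with PrunedFilteredScan {

  override def sqlContext: SQLContext = ctx

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    // Collect the values of pushed f1 equality predicates.
    val pushedValues = filters.collect { case EqualTo("f1", v) => v }
    // ...issue the remote read using pushedValues; placeholder here:
    sqlContext.sparkContext.emptyRDD[Row]
  }
}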
File Filter model
scala> val df1 = spark.read.option("host", "http://XX.XX.XX.XX:8010").option("cluster", "hthor").option("path", "myThor::data_index").option("filter", "f1=['ABC123']").format("hpcc").load()
scala> df1.createOrReplaceTempView("SomeInfo1")
scala> spark.sql("SELECT f1, f2, f3 from SomeInfo1")
scala> val df2 = spark.read.option("host", "http://XX.XX.XX.XX:8010").option("cluster", "hthor").option("path", "myThor::data_index").option("filter", "f1=['DEF456']").format("hpcc").load()
scala> df2.createOrReplaceTempView("SomeInfo2")
scala> spark.sql("SELECT f1, f2, f3 from SomeInfo2")
scala> val df3 = spark.read.option("host", "http://XX.XX.XX.XX:8010").option("cluster", "hthor").option("path", "myThor::data_index").option("filter", "f1=['IJK789']").format("hpcc").load()
scala> df3.createOrReplaceTempView("SomeInfo3")
scala> spark.sql("SELECT f1, f2, f3 from SomeInfo3")
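As an aside on the file-filter model: each value needs its own reader, so a small helper cuts the repetition when run in spark-shell (the helper is mine, not part of the connector's API):

// Illustrative helper: builds a reader with the connector's "filter"
// option applied for a single f1 value.
def readWithFileFilter(value: String) =
  spark.read
    .option("host", "http://XX.XX.XX.XX:8010")
    .option("cluster", "hthor")
    .option("path", "myThor::data_index")
    .option("filter", s"f1=['$value']")
    .format("hpcc")
    .load()

val df1 = readWithFileFilter("ABC123")
df1.createOrReplaceTempView("SomeInfo1")
spark.sql("SELECT f1, f2, f3 from SomeInfo1").show(false)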
In which of these examples would the filter (f1 = "...") be passed down to HPCC (so that the entire file is not passed up to Spark)?
More on "push down":