I want to be able to read/write Thor files natively from Spark. E.g., I can do a word count from any HDFS data structure like this:
Whether it's sc.textFile("hpcc://localfilename") (i.e., we get the PR accepted by the Spark foundation) or RichardGaveMeThis.open("localfilename"), I don't really care.
The key part is that I want HPCC and Spark to be able to use the same data on the same disks interoperably.
The extra-credit version would import the data directly into a DataFrame (i.e., the field/metadata comes too), which would let us use the ML libraries directly.