
      Description

      I want to be able to read/write Thor files natively from Spark. For example, I can do a wordcount against any HDFS data source like this:

      import java.util.Arrays;
      import java.util.Iterator;

      import org.apache.spark.api.java.JavaPairRDD;
      import org.apache.spark.api.java.JavaRDD;
      import org.apache.spark.api.java.function.FlatMapFunction;
      import org.apache.spark.api.java.function.Function2;
      import org.apache.spark.api.java.function.PairFunction;

      import scala.Tuple2;

      // sc is an existing JavaSparkContext; read the input as an RDD of lines.
      JavaRDD<String> textFile = sc.textFile("hdfs://...");

      // Split each line into individual words.
      JavaRDD<String> words = textFile.flatMap(new FlatMapFunction<String, String>() {
        public Iterator<String> call(String s) { return Arrays.asList(s.split(" ")).iterator(); }
      });

      // Pair each word with an initial count of 1.
      JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
        public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
      });

      // Sum the counts for each word.
      JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer a, Integer b) { return a + b; }
      });

      counts.saveAsTextFile("hdfs://...");
      

      Whether it is sc.textFile("hpcc://localfilename") (i.e. we get the PR accepted by the Spark foundation) or RichardGaveMeThis.open("localfilename"), I don't really care.

      The key part is that I want HPCC and Spark to use the same data on the same disks, interoperably.
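      One plausible route to the hpcc:// flavour is Hadoop's standard filesystem plug-in mechanism: Hadoop maps a URI scheme to a FileSystem class via the fs.<scheme>.impl configuration key, and Spark forwards any spark.hadoop.* property into that configuration. A minimal sketch, assuming a hypothetical Thor-backed org.apache.hadoop.fs.FileSystem subclass (the class name com.hpccsystems.fs.HpccFileSystem and the hpcc:// path below are invented for illustration):

      import org.apache.spark.SparkConf;
      import org.apache.spark.api.java.JavaRDD;
      import org.apache.spark.api.java.JavaSparkContext;

      public class ThorWordCount {
        public static void main(String[] args) {
          SparkConf conf = new SparkConf()
              .setAppName("ThorWordCount")
              // fs.<scheme>.impl is Hadoop's standard hook for mapping a URI scheme
              // to a FileSystem implementation; the class named here is hypothetical.
              .set("spark.hadoop.fs.hpcc.impl", "com.hpccsystems.fs.HpccFileSystem");
          JavaSparkContext sc = new JavaSparkContext(conf);

          // With such a filesystem registered, the wordcount above would run
          // unchanged against Thor data in place on the same disks.
          JavaRDD<String> lines = sc.textFile("hpcc://mythorcluster/somefile");
          System.out.println("lines: " + lines.count());

          sc.stop();
        }
      }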

      The extra-credit version would import the data directly into a DataFrame (i.e. the field metadata comes too), which would allow us to use the ML libraries directly.
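      For that DataFrame version, Spark's pluggable data source hook (spark.read().format(...)) is the natural fit: a Thor source could translate the record layout into the DataFrame schema, at which point the DataFrame-based MLlib APIs work on the data as-is. A minimal sketch, assuming a hypothetical connector registered under the short name "hpcc" (the format name and path are invented for illustration):

      import org.apache.spark.sql.Dataset;
      import org.apache.spark.sql.Row;
      import org.apache.spark.sql.SparkSession;

      public class ThorDataFrame {
        public static void main(String[] args) {
          SparkSession spark = SparkSession.builder().appName("ThorDataFrame").getOrCreate();

          // "hpcc" is an invented format name; a real connector would register a
          // data source implementation under some such short name.
          Dataset<Row> df = spark.read()
              .format("hpcc")
              .load("hpcc://mythorcluster/somefile");

          // Field names and types would come from the Thor record metadata,
          // so MLlib's DataFrame-based APIs could consume the data directly.
          df.printSchema();

          spark.stop();
        }
      }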

      David Bayliss

      People

        Assignee: John Holt (johnholt)
        Reporter: Richard Chapman (richardkchapman)
        Votes: 0
        Watchers: 6
