Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Timed Out
Affects Version/s: 5.0, 4.2.2
Fix Version/s: None
Component/s: None
Description
Hello,
I am in the process of comparing Big Data systems and solutions on a cluster where each node has multiple directly attached disks. Because the cluster is also used for Hadoop/HDFS, the disks are mounted as JBOD; that is, each disk is a separate Linux volume, with no abstraction between the OS and the disk beyond the filesystem itself (no RAID or LVM). For many of the systems I have encountered, this is an acceptable hardware configuration; it is geared towards newer systems such as Hadoop/HDFS, which handle replication and failover in software.

For HPCC, however, this does not appear to be a configuration in which I can fully utilize the hardware: it seems that HPCC requires one (and only one?!) location for my data, configured homogeneously across the entire cluster. RAID is not an option in my situation, because the cluster's hardware and OS are shared with other (Hadoop/HDFS) users and are not mine to reconfigure. I would expect similar situations to arise in the Big Data clusters of many enterprises today.

I am trying to understand whether there is a simple way to better accommodate this type of hardware configuration. For example, something as simple as a node-process startup parameter that points at a configuration file might work: multiple processes, one per disk volume, could then coexist on the same machine. This is how systems like MongoDB handle multiple volumes in non-RAIDed configurations, for example. (A rough sketch of this idea appears after this message.)
Thank you,
Keren Ouaknine
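For illustration only, here is a minimal sketch of the kind of launcher the request describes: one process per JBOD volume, each started with a per-volume configuration file. Everything in it is an assumption, not an existing HPCC feature: the `hpcc-node` binary name, the `--config` flag, the `/dataN` mount layout, and the config format are hypothetical stand-ins for whatever startup parameter the platform might add.

```python
#!/usr/bin/env python3
"""Hypothetical launcher: one data-node process per JBOD volume.

Illustrative only. The `hpcc-node` binary, the `--config` flag, the
mount-point layout, and the config template are all assumptions
standing in for a per-process startup parameter such as the one
proposed in this report.
"""
import subprocess
from pathlib import Path

# JBOD mount points, one per physical disk (typical Hadoop-style layout).
MOUNTS = [Path(f"/data{i}") for i in range(1, 5)]  # /data1 .. /data4

BASE_PORT = 7100  # give each process its own port to avoid clashes

CONFIG_TEMPLATE = """\
# Auto-generated per-volume config (illustrative format)
dataDir = {data_dir}
port    = {port}
"""

def launch_one_per_volume():
    """Write a config file onto each volume and start one process per disk."""
    procs = []
    for i, mount in enumerate(MOUNTS):
        cfg = mount / "node.conf"
        cfg.write_text(CONFIG_TEMPLATE.format(data_dir=mount, port=BASE_PORT + i))
        # Hypothetical startup parameter pointing at a per-volume config
        # file, as proposed above; one process per disk, as MongoDB allows
        # in non-RAIDed deployments.
        procs.append(subprocess.Popen(["hpcc-node", "--config", str(cfg)]))
    return procs

if __name__ == "__main__":
    for p in launch_one_per_volume():
        p.wait()
```

The design point the sketch tries to capture is that each process owns exactly one volume, so a single disk failure takes down only one process and replication/failover remain the platform's job in software, matching the JBOD model the cluster already uses for HDFS.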