HPCC-13090

Security enhancements for Column Level Security



    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Won't Fix
    • Core Libraries


      The following thread was copied from Flowdock so further discussion can be archived in Jira.

      Feb 17, 2015 10:15
      Stuart, Flavio, Trisha and I are meeting this Friday with a team at Georgia Tech to discuss their HPCC initiative. There are two main things they are looking for in HPCC:

      1) Column level access privileges. Instead of replicating a dataset for every audience and only including columns of interest, there should be a mechanism for a single comprehensive dataset with various column level access granted per user/group. Example could be a patient database, where doctors can see medical information, billing clerks can only see insurance information, etc.

      2) Embeddable logic to be dynamically applied to column data, based on the user/group membership. Example would be when you should not display an exact DOB/age, instead you could apply an algorithm to display an age category.
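A minimal sketch of the second requirement, written in Python purely for illustration (it is not ECL and not an existing HPCC feature). The group names `doctors` and `billing`, the bucket boundaries, and the function names are all hypothetical; the point is only that the value returned for a column depends on the caller's group membership.

```python
from datetime import date

# Hypothetical bucket boundaries: (exclusive upper age, label).
AGE_BUCKETS = [(18, "0-17"), (40, "18-39"), (65, "40-64")]


def age_on(dob, today):
    """Whole years elapsed between dob and today."""
    years = today.year - dob.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (dob.month, dob.day):
        years -= 1
    return years


def age_category(age):
    """Map an exact age onto a coarse category label."""
    for upper, label in AGE_BUCKETS:
        if age < upper:
            return label
    return "65+"


def view_dob(dob, groups, today):
    """Return the representation of DOB that the caller's groups allow."""
    if "doctors" in groups:      # hypothetical group: sees the exact date
        return dob.isoformat()
    if "billing" in groups:      # hypothetical group: sees only a category
        return age_category(age_on(dob, today))
    return ""                    # no entitlement: blank the field
```

Passing `today` explicitly keeps the function deterministic, which makes the rule easy to test and audit.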

      Regarding the first requirement, I was thinking that the helper DLL (code generator) could expose a method that enumerates all columns/tables referenced by the workunit, both directly and indirectly. Before the engine executes the first graph of the workunit, it could call that method, check the access privileges (via LDAP) for all columns referenced, and return an error if access is denied. No further runtime checking would be required and the performance impact would be minimal. Alternatively, each activity that accesses data could check based on actual column usage, but even with caching the performance overhead is likely unacceptable.
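The up-front check proposed here can be sketched as follows. This is an illustrative Python model, not the helper DLL or any real HPCC or LDAP API: the `GRANTS` table stands in for whatever an LDAP lookup would return, and the function names, file names, and column names are hypothetical.

```python
# Hypothetical grants: group name -> set of (file, column) pairs it may read.
GRANTS = {
    "doctors": {("patients", "diagnosis"), ("patients", "dob")},
    "billing": {("patients", "insurer")},
}


def denied_columns(referenced, grants, user_groups):
    """Return the referenced (file, column) pairs that none of the user's groups may read."""
    allowed = set()
    for group in user_groups:
        allowed |= grants.get(group, set())
    return sorted(set(referenced) - allowed)


def check_before_first_graph(referenced, grants, user_groups):
    """Run once before execution starts; raise if any referenced column is denied."""
    denied = denied_columns(referenced, grants, user_groups)
    if denied:
        names = ", ".join(f"{f}.{c}" for f, c in denied)
        raise PermissionError(f"access denied to columns: {names}")
```

Because the whole set of referenced columns is checked once, no per-activity lookup is needed afterwards, which is the performance argument made above.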

      The second requirement seems more complicated, and I am hoping you guys can share some ideas. I presume the logic would call a macro or other embedded ECL, and perhaps we already have this level of support?

      Feb 17, 2015 10:19
      Unfortunately Gavin is away this week, and he’s better placed to reply, but I’ll do my best.

      Feb 17, 2015 10:23
      1. I think Gavin has capabilities to output the set of fields used by a query, which might be a step towards this. Also, clearly, if you have capability 2 you can use it to implement the equivalent of capability 1 (by making the “special processing” be “FAIL” or “return blank” as appropriate).

      Feb 17, 2015 10:25
      2. If this is needed for roxie-type scenarios, it’s fairly easy to roll your own logic in ECL to achieve this, I would think. If you have to enforce it for a thor-type scenario, where users can run arbitrary ECL (including defining their own record layouts that would circumvent these controls), it’s harder to see how it would work.

      Feb 17, 2015 10:28
      Really, the main requirement to have any chance of coding either of these securely, if people can run arbitrary ECL, is to prevent people from accessing a datafile AT ALL except via specified attributes. We can encrypt the files, but I don’t know how we would ensure that only selected attributes know how to decrypt them.

      Feb 17, 2015 10:29
      Nice research problem for someone at Georgia Tech to code for us

      Feb 17, 2015 10:32
      1) For performance reasons I think Gavin exposing the set of fields being used by a query initially is more optimal than checking at run time (if that's what you mean by special processing).

      Feb 17, 2015 10:36
      The special processing I was referring to was the “embeddable logic" requested as item 2 in your original comment

      Feb 17, 2015 10:58
      1) and 2) are complicated by the fact that the user is likely denied (direct) access to the DOB field, yet some code must be allowed to manipulate that field on the user's behalf. Have to think that one through.

      Feb 17, 2015 10:59
      I think if you solve 2) you can forget about 1)

      Feb 17, 2015 11:00
      (since you can achieve the same thing). If you still want (1) for efficiency then any field that you need to access for special processing has to be one that (as far as LDAP is concerned) you DO have access to.

      Feb 17, 2015 11:00
      Except 2) is determined at runtime, and I was hoping to check as much as possible at init time. Having each activity check access rights, even if it's looking at a cache, introduces additional latency.

      Feb 17, 2015 11:01
      Nothing comes for free. For Roxie cases you would check at query load time. For thor you don’t care about the latency

      Feb 17, 2015 11:01
      (and the special processing doesn't need to introduce any great latency - you just need to make sure that it can't be bypassed)

      @Russ, can we create a JIRA for this, or is there one already?

      Also, defining the #option reportFieldUsage to true will generate information about which fields are used from which files.

      I will open a ticket for discussion





            russwhitehead Russ Whitehead