How Apache Ranger and Chuck Norris help secure Hadoop
When Hadoop started, it was a set of loosely coupled parts primarily used in the back end of the big Internet companies like Yahoo. These parts were wrapped into distributions and marketed as Hadoop by the likes of MapR, Cloudera, and Hortonworks.
Such piecemeal architecture isn't unusual in the world of open source, or even in the wider world of commercial software. It does, however, create security challenges. Some will read this as "it's insecure," but that isn't necessarily the case -- though it can be. The real problem is twofold: how do you authenticate users to every part of this system of parts, and once you've authenticated them, how do you authorize them to do only what you mean to allow them to do?
Each part of Hadoop handles its own LDAP and Kerberos authentication and has its own means and rules of authorization (in most cases, entirely separate implementations of the same ideas). That means you get to configure Kerberos or LDAP for each individual part, then define authorization rules in each separate configuration. Apache Ranger instead provides a plug-in for each of these parts of Hadoop and a common authentication repository, and it lets you define policies in one centralized location.
Ranger is clearly a Hortonworks-sponsored project (as opposed to one from Cloudera, MapR, or now Databricks). You can tell this in part by the way it's skinned (green) and in part by what it supports. At present, Ranger supports HDFS, YARN, Hive, HBase, Storm, Knox, Solr, and Kafka.
Except for HDFS and HBase, which are part of the core of Hadoop, and Solr, these are some of the more "Hortonworksy" projects. In a modern deployment, you'll likely also see components Ranger doesn't cover, such as Spark or possibly Impala (from Cloudera). Nonetheless, Ranger is a great thing.
In Ranger, you work with a repository for each component. Each repository is backed by an underlying plug-in or agent that operates with that component.
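To make that concrete, here's a minimal sketch of listing the configured repositories (called "services" in the API) through Ranger Admin's public REST endpoint. The host name and credentials are placeholders you'd swap for your own:

```python
# List Ranger repositories via the public REST API (a sketch; host and
# credentials below are hypothetical).
import requests

RANGER_URL = "http://ranger-admin.example.com:6080"  # placeholder host
AUTH = ("admin", "admin-password")                   # placeholder credentials

resp = requests.get(
    f"{RANGER_URL}/service/public/v2/api/service",
    auth=AUTH,
    headers={"Accept": "application/json"},
)
resp.raise_for_status()

# Each entry is one repository: its name and the component type it guards
# (hdfs, hive, hbase, and so on).
for svc in resp.json():
    print(svc["name"], "->", svc["type"])
```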
Associated with each repository is a set of policies. A policy ties together the resource you're protecting (a table, folder, or column), a group (such as administrators), and what that group is allowed to do with the resource (read, write, and so on). You give each policy a name -- say, "Only grp_nixon can read the apac_china table."
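You can define such a policy in the GUI, but it can also be created programmatically through the same public REST API. Here's a sketch of the grp_nixon example, assuming a Hive repository named "hivedev" and a database named "apac" -- both invented for illustration:

```python
# Create the "Only grp_nixon can read the apac_china table" policy.
# Service, database, host, and credentials are all placeholders.
import requests

RANGER_URL = "http://ranger-admin.example.com:6080"
AUTH = ("admin", "admin-password")

policy = {
    "service": "hivedev",  # hypothetical Hive repository in Ranger
    "name": "Only grp_nixon can read the apac_china table",
    "isEnabled": True,
    "resources": {
        "database": {"values": ["apac"]},  # hypothetical database
        "table": {"values": ["apac_china"]},
        "column": {"values": ["*"]},
    },
    "policyItems": [
        {
            "groups": ["grp_nixon"],
            "accesses": [{"type": "select", "isAllowed": True}],  # read only
            "delegateAdmin": False,
        }
    ],
}

resp = requests.post(
    f"{RANGER_URL}/service/public/v2/api/policy",
    auth=AUTH,
    json=policy,
)
resp.raise_for_status()
print("Created policy id:", resp.json()["id"])
```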
A GUI with a central view of who is allowed to do what brings much-needed simplicity to the Hadoop ecosystem, but that's not all Ranger offers. It also provides audit logging. It can't supplant every bit of application-level audit logging you might want, but if you simply need to know who accessed what on HDFS, or which policies were enforced where, it's probably exactly what you need.
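Those audit events can be pulled out over REST as well. The sketch below uses the endpoint that backs the Ranger UI's audit screen; treat the exact path, parameters, and field names as assumptions to verify against your Ranger version:

```python
# Fetch recent access-audit events from Ranger Admin (a sketch; the
# endpoint and field names are assumptions to check against your version).
import requests

RANGER_URL = "http://ranger-admin.example.com:6080"
AUTH = ("admin", "admin-password")

resp = requests.get(
    f"{RANGER_URL}/service/assets/accessAudit",
    auth=AUTH,
    params={"pageSize": 25, "sortBy": "eventTime"},
    headers={"Accept": "application/json"},
)
resp.raise_for_status()

# Each event records who touched which resource and whether policy
# allowed or denied the access.
for event in resp.json().get("vXAccessAudits", []):
    print(event["requestUser"], event["resourcePath"], event["accessResult"])
```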
In addition, Ranger provides key management services (Ranger KMS) that work with HDFS's new transparent data encryption (TDE). So if you need end-to-end encryption and a clean way to manage the keys associated with it, Ranger is not a bad place to start.
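Wiring the two together looks roughly like this: create a key (stored in Ranger KMS when HDFS is configured to use it as its KeyProvider), then mark a directory as an encryption zone backed by that key. These are the stock `hadoop key` and `hdfs crypto` commands, driven here from Python; the key and path names are made up:

```python
# Sketch: create a TDE key in Ranger KMS and an HDFS encryption zone.
import subprocess

KEY_NAME = "finance_key"     # hypothetical key name
ZONE_PATH = "/data/finance"  # hypothetical directory to encrypt

# 1. Create the encryption key (lands in Ranger KMS when HDFS uses it
#    as the KeyProvider).
subprocess.run(["hadoop", "key", "create", KEY_NAME], check=True)

# 2. The zone directory must exist (and be empty) before conversion.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", ZONE_PATH], check=True)

# 3. Turn the directory into an encryption zone backed by that key.
subprocess.run(
    ["hdfs", "crypto", "-createZone", "-keyName", KEY_NAME, "-path", ZONE_PATH],
    check=True,
)

# 4. Confirm the zone is registered.
subprocess.run(["hdfs", "crypto", "-listZones"], check=True)
```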
I think the biggest hope for Ranger comes from its extensibility. You can create your own plug-ins for areas that are not covered.
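The extension point starts with registering a custom service definition, which tells Ranger what resources and access types your component has; your own plug-in then pulls and enforces the resulting policies. Here's a hedged sketch -- the "myapp" names are invented, and the exact schema should be checked against your Ranger version:

```python
# Sketch: register a custom service definition so Ranger can hold
# policies for a component it doesn't ship a plug-in for. All "myapp"
# names are hypothetical.
import requests

RANGER_URL = "http://ranger-admin.example.com:6080"
AUTH = ("admin", "admin-password")

service_def = {
    "name": "myapp",
    "label": "My Custom App",
    # The resource hierarchy policies will be written against.
    "resources": [
        {
            "itemId": 1,
            "name": "endpoint",
            "type": "string",
            "level": 10,
            "mandatory": True,
            "label": "Endpoint",
            "recursiveSupported": False,
        }
    ],
    # The verbs users can be allowed or denied.
    "accessTypes": [
        {"itemId": 1, "name": "read", "label": "Read"},
        {"itemId": 2, "name": "write", "label": "Write"},
    ],
    "configs": [],
}

resp = requests.post(
    f"{RANGER_URL}/service/public/v2/api/servicedef",
    auth=AUTH,
    json=service_def,
)
resp.raise_for_status()
print("Registered service definition id:", resp.json()["id"])
```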
If you were hoping this was the end of the story on Hadoop security, unfortunately it isn't: Cloudera has its own Apache project, Sentry (which MapR appears to support as well), that covers much the same ground. To be fair, Sentry came first; Hortonworks then acquired XA Secure, which became Ranger. That said, Sentry's documentation is virtually nonexistent, its coverage is narrower, and its project website is in disrepair (although activity on GitHub has recently picked up).
Hadoop security has come a long way. Ranger offers a fairly comprehensive, if still somewhat incomplete, way to manage security across the ecosystem. The holes that persist are mainly due to vendor competition throughout the big data world. They can be filled via the project's extensibility, but it would be nice to see more collaboration and community in the Apache world.