Indata description
The srcdir (-s or --srcdir option) is expected to include three files:
- table_tags.csv
- column_tags.csv
- ranger_policies.json
The CSV files are separated with semicolons (;). The first one contains tags on the table level, the second on the column level. The JSON file contains the Ranger policies. Examples of all files can be found in the example directory.
table_tags.csv
table_tags.csv has the fields:
- schema - schema of the table to process. The schema shall not include the environment, e.g. it must be hadoop_out and not hadoop_out_prod
- table - table to process
- tags - tags to be set on the table. If several tags are provided they shall be separated with commas (,). No spaces are allowed.
It is recommended to include all your tables, even those without tags. This makes it easy to see if any important table is missing. Tags removed from a table in the file will also be removed in Atlas. Tables not listed will not be touched by the tool.
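A minimal sketch of what table_tags.csv could look like (hypothetical schema, table, and tag names, assuming a header row as in the files in the example directory). Note the last row: a table without tags is still listed, per the recommendation above:

schema;table;tags
hadoop_out;customer_d;PII_table
hadoop_out;order_f;PII_table,finance
hadoop_out;audit_log;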
column_tags.csv
column_tags.csv has the fields:
- schema - schema of the column to process. The schema shall not include the environment, e.g. it must be hadoop_out and not hadoop_out_prod
- table - table of the column to process
- attribute - Name of the column
- tags - tags to be set on the column. If several tags are provided they shall be separated with commas (,). No spaces are allowed.
It is recommended to include all your columns, even those without tags. This makes it easy to see if any column is missing. Tags removed from a column in the file will also be removed in Atlas. Columns not listed will not be touched by the tool.
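A corresponding sketch for column_tags.csv (hypothetical names, assuming a header row):

schema;table;attribute;tags
hadoop_out;customer_d;customer_id;PII
hadoop_out;customer_d;birth_date;PII,sensitive
hadoop_out;customer_d;created_at;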
ranger_policies.json
Unfortunately, policies are more complex to handle than tags, hence they are stored as JSON. It is the policy file that gives meaning to our tags, though it can also describe policies not controlled by tags.
As you can see in the example file, the top level is an array of JSON objects:
[
{
"command": "apply_tag_row_rule",
"filters": [...],
"policy": {...}
},
...
]
The first field is the command. There are two commands, apply_rule and apply_tag_row_rule. The apply_rule command applies normal Ranger rules; these have only one more field, policy, on the top level. The apply_tag_row_rule command takes a number of tags and filters, and each table with a matching tag will get a row-based policy according to the filter. apply_tag_row_rule has two fields beyond command: filters and policy.
The Ranger policy file can also include variables that are expanded by policytool. They are written like ${table}. Variables provided by cobra-policytool are:
- project_name - project name as provided on the command line.
- environment - environment as provided on the command line.
- table - expands to the current table name. Only valid in apply_tag_row_rule filters.
- schema - expands to the current schema name. Only valid in apply_tag_row_rule filters.
- end_data_column - only valid in apply_tag_row_rule. See its section for details.
You can also define your own variables in the config file and refer to them in your policy file. Variables can be referenced anywhere in ranger_policies.json as ${variable}. You can find examples of usage in the example file. Policytool expands the variables referenced in the policy file before it is sent to Ranger.
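For example, assuming the project name myproject and the environment prod were given on the command line (hypothetical values), a policy name expands as follows before being sent to Ranger:

"name": "${project_name}_${environment}_vanilla"

becomes

"name": "myproject_prod_vanilla"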
Policytool uses the policy name to find older versions of the policy in Ranger. If you change the name you may have to delete the old policies by hand in the Ranger user interface.
Command apply_rule
The apply_rule command can be used in different ways depending on the policy type.
Currently the following policy types are supported:
- Access (0)
- Masking (1)
- Row filtering (2)
The number in parentheses is the value used in the field policyType to control the type of the policy.
The following code snippet shows the common fields for all policy types. If you are familiar with the Ranger API you may recognise the fields. This is what is sent to Ranger, after the variables have been expanded. This means that combinations not described in this document can still be valid and accepted by Ranger. This makes policytool very generic but also raw, since it does not do any validation itself.
{
"command": "apply_rule",
"policy": {
"service": "${installation}_hive",
"name": "${project_name}_${environment}_vanilla",
"policyType": 0,
"description": "Access to data in schemas for vanilla etl",
"isAuditEnabled": true,
"resources":( ...)
"isEnabled": true
}
- service - The service name for the resource the policy will be applied to.
- name - Name of the policy, must be unique. It should start with either load_etl_ or ${project_name}_${environment}.
- policyType - Type of policy, see above.
- description - Free text description.
- isAuditEnabled - Whether audit logging is enabled or not.
- resources - Definition of the resources impacted by the rule. Note! How these are described differs depending on the service used. See further down for tag resources and Hive resources.
- isEnabled - Whether the policy is enabled or not.
Resource definitions
Each kind of resource has its own resource definition. For Hive resources we
define database, table and column. The following example matches all tables and
columns in our two databases my_database_1 and my_database_2.
"resources": {
"database": {
"values": ["my_database_1", "my_database_2"],
"isExcludes": false,
"isRecursive": false
},
"column": {
"values": ["*"],
"isExcludes": false,
"isRecursive": false
},
"table": {
"values": ["*"],
"isExcludes": false,
"isRecursive": false
}
}
Note that you can only have one policy per resource type and resource.
Tags are defined in a similar way:
"tag": {
"values": ["PII"],
"isExcludes": false,
"isRecursive": false
}
Even though we only mention Hive and tag resources, one may also use the other resources supported by Ranger. See the Ranger documentation for details.
Access policy
Access to resources is controlled by access policies, which have policy type number 0.
The following example gives access to my_database for the system user etl and
all members of the group ETL_USERS. Note that in the accesses field we cannot
set isAllowed to false, so access types the users shall not have must be left out.
{
"command": "apply_rule",
"options": {
"expandHiveResourceToHdfs": true,
"hdfsService": "svs${installation}_hadoop"
},
"policy": {
"service": "${installation}_hive",
"name": "${project_name}_${environment}_vanilla",
"policyType": 0,
"description": "Access to data in schemas for vanilla etl",
"isAuditEnabled": true,
"resources": {
"database": {
"values": ["my_database_${environment}"],
"isExcludes": false,
"isRecursive": false
},
"column": {
"values": ["*"],
"isExcludes": false,
"isRecursive": false
},
"table": {
"values": ["*"],
"isExcludes": false,
"isRecursive": false
}
},
"policyItems": [{
"accesses": [{
"type": "select",
"isAllowed": true
}, {
"type": "update",
"isAllowed": true
}, {
"type": "create",
"isAllowed": true
}, {
"type": "drop",
"isAllowed": true
}, {
"type": "alter",
"isAllowed": true
}, {
"type": "read",
"isAllowed": true
}, {
"type": "write",
"isAllowed": true
}],
"users": ["etl${user_suffix}"],
"groups": ["ETL_USERS"],
"conditions": [],
"delegateAdmin": false
}],
"isEnabled": true
}
}
Which access types exist varies between resources. For Hive the following access types exist:
- select
- update
- create
- drop
- alter
- read
- write
- index
- lock
- all
- replAdmin
See the documentation for Ranger and for the respective resource for details.
The field delegateAdmin, if set to true, gives the users covered by this policy the right to delegate their rights to other users.
You may have many access rules for one resource, but they must all be in the same policy object, listed as different policy items.
The options part in the example tells cobra-policytool to also create a corresponding
rule for HDFS. The option hdfsService is used to point out the name of the HDFS service
in Ranger. Note that this can be used both when you point out tables explicitly, as in the example,
and when using tags.
Masking policy
Masking of fields is done with policy type number 1. The policy has a
sub-object dataMaskPolicyItems containing information on how to do the masking
and what it applies to. Note that masking can currently only be
done in Hive. The following example calls our own anonymize function on
all fields tagged PII for the user pii_etl.
{
"command": "apply_rule",
"policy": {
"service": "${installation}_tag",
"name": "load_etl_mask_pii_data",
"policyType": 1,
"description": "Mask all column data tagged PII",
"isAuditEnabled": true,
"resources": {
"tag": {
"values": ["PII"],
"isExcludes": false,
"isRecursive": false
}
},
"dataMaskPolicyItems": [{
"dataMaskInfo": {
"dataMaskType": "hive:CUSTOM",
"valueExpr": "udf.anonymize({col})"
},
"accesses": [
{
"type": "hive:select",
"isAllowed": true
}
],
"users": ["pii_etl${user_suffix}"],
"groups": [],
"conditions": [],
"delegateAdmin": false
}],
"isEnabled": true
}
}
Take an extra look at the value of the valueExpr field, udf.anonymize({col}).
Here you can write any SQL function you want, in any combination. Where a
function shall receive the value of the field marked with the tag, you write
{col}, the same way you do in the Ranger user interface.
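As a sketch of other expressions one could write (hypothetical combinations; udf.anonymize is from the example above, while mask_hash, concat and substr are standard Hive functions), note that {col} may also appear more than once:

"valueExpr": "mask_hash({col})"
"valueExpr": "concat(substr({col}, 1, 2), '***')"
"valueExpr": "case when {col} is null then null else udf.anonymize({col}) end"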
Row filtering policy
Row filtering policies have policy type number 2. They are very similar to a
masking policy, but instead of dataMaskPolicyItems we have
rowFilterPolicyItems. In the following example we have a filterExpr
that keeps only customers that exist in our whitelist.
{
"command": "apply_rule",
"policy": {
"service":"${installation}_hive"
"policyType": "2",
"name": "customer_d whitelist filtering",
"isEnabled": true,
"isAuditEnabled": true,
"description": "Filter out customers in whitelist.",
"resources": {
"database": {
"values": ["hadoop_out_utv"],
"isRecursive": false,
"isExcludes": false
},
"table": {
"values": ["customer_d"],
"isRecursive": false,
"isExcludes": false
}
},
"rowFilterPolicyItems":[{
"groups": ["CRM_USERS"],
"users": [],
"accesses":[{
"type": "select",
"isAllowed": true
}],
"rowFilterInfo": {
"filterExpr": "exists (select 1 from whitelist where whitelist.customer_id=customer_d.customer_id)"
}
}]
}
}
Command apply_tag_row_rule
The apply_rule command has a one-to-one mapping to the functionality of Ranger.
The same is not true for the command apply_tag_row_rule. It introduces a more
general tag usage than Ranger provides. In Ranger you can have one row
filtering rule per table, and the rule is only valid for that one table. In a large
environment it is desirable to have one rule for all tables with a specific tag.
This can be achieved if the naming of columns is very systematic. For
instance, suppose we have a table customer and a table order, both with a field
customer_id. We want to only show customers that are in a whitelist. In Ranger
we must have one row filtering policy per table, one for customer and one for
order. Both tables have the tag PII_table. With apply_tag_row_rule we only need
one rule.
{
"command": "apply_tag_row_rule",
"filters": [{
"groups": [],
"users": ["pii_etl"],
"tagFilterExprs": [{
"tags": ["PII_table"],
"filterExpr": "exists (select 1 from whitelist where whitelist.customer_id=${table}.customer_id)"
}]
}],
"policy": {
"service": "${installation}_hive",
"name": "${project_name}_${environment}_${schema}_${table}",
"description": "Whitelist rowfiltering policy for Service: ${schema}.${table} in ${environment}.",
"policyType": 2,
"isEnabled": true,
"resources": {
"database": {
"isExcludes": false,
"values": ["${schema}_${environment}"],
"isRecursive": false
},
"table": {
"isExcludes": false,
"values": ["${table}"],
"isRecursive": false
}
},
"isAuditEnabled": true
}
}
The example above shows the rule just described. We have a policy section like
the one for the row filtering policy described under the apply_rule command. We also
have a filters section. From the example, the filter should be self-explanatory.
Note, though, the reference to ${table} in filterExpr. You can
see the same variable within the policy. The tag PII_table will expand
to several tables, and one policy per table will be created. The filter expression
is inserted into each policy, with ${table} expanded to the current table.
If multiple tags are provided in one tag filter expression, a table must have all of the tags to match.
If several filters match one table, their filter expressions are combined with and.
Note that policytool uses the tag definitions from the CSV files, not from Atlas.
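As a sketch of the combination rule, assume the table customer_d is tagged with both PII_table and internal_only (the second tag and its expression are hypothetical), and matched by these two tag filter expressions:

"tagFilterExprs": [{
"tags": ["PII_table"],
"filterExpr": "exists (select 1 from whitelist where whitelist.customer_id=${table}.customer_id)"
}, {
"tags": ["internal_only"],
"filterExpr": "${table}.visibility = 'internal'"
}]

The generated policy for customer_d would then get both expressions joined with and, roughly:

exists (select 1 from whitelist where whitelist.customer_id=customer_d.customer_id) and customer_d.visibility = 'internal'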