6.5 MySQL Enterprise Data Masking and De-Identification

Note:

MySQL Enterprise Data Masking and De-Identification is an extension included in MySQL Enterprise Edition, a commercial product. To learn more about commercial products, https://www.mysql.com/products/.

As of MySQL 5.7.24, MySQL Enterprise Edition provides data masking and de-identification capabilities:

The way that applications use these capabilities depends on the purpose for which the data is used and who accesses it:

Example 1:

Medical research facilities can hold patient data that comprises a mix of personal and medical data. This may include genetic sequences (long strings), test results stored in JSON format, and other data types. Although the data may be used mostly by automated analysis software, access to genome data or test results of particular patients is still possible. In such cases, data masking should be used to render this information not personally identifiable.

Example 2:

A credit card processor company provides a set of services using sensitive data, such as:

A typical transaction may include many types of sensitive information, including:

Those types of information may then be joined within a bank or other card-issuing financial institution with client personal data, such as:

Various employee roles within both the card processing company and the financial institution require access to that data. Some of these roles may require access only to masked data. Other roles may require access to the original data on a case-to-case basis, which is recorded in audit logs.

Masking and de-identification are core to regulatory compliance, so MySQL Enterprise Data Masking and De-Identification can help application developers satisfy privacy requirements:

The following sections describe the elements of MySQL Enterprise Data Masking and De-Identification, discuss how to install and use it, and provide reference information for its elements.

 File: manual.info.tmp, Node: data-masking-elements, Next: data-masking-installation, Prev: data-masking, Up: data-masking

6.5.1 MySQL Enterprise Data Masking and De-Identification Elements

MySQL Enterprise Data Masking and De-Identification is based on a plugin library that implements these elements:

 File: manual.info.tmp, Node: data-masking-installation, Next: data-masking-usage, Prev: data-masking-elements, Up: data-masking

6.5.2 Installing or Uninstalling MySQL Enterprise Data Masking and De-Identification

This section describes how to install or uninstall MySQL Enterprise Data Masking and De-Identification, which is implemented as a plugin library file containing a plugin and several loadable functions. For general information about installing or uninstalling plugins and loadable functions, see note plugin-loading::, and note function-loading::.

To be usable by the server, the plugin library file must be located in the MySQL plugin directory (the directory named by the 'plugin_dir' system variable). If necessary, configure the plugin directory location by setting the value of 'plugin_dir' at server startup.

The plugin library file base name is 'data_masking'. The file name suffix differs per platform (for example, '.so' for Unix and Unix-like systems, '.dll' for Windows).

To install the MySQL Enterprise Data Masking and De-Identification plugin and functions, use the note 'INSTALL PLUGIN': install-plugin. and note 'CREATE FUNCTION': create-function. statements, adjusting the '.so' suffix for your platform as necessary:

 INSTALL PLUGIN data_masking SONAME 'data_masking.so';
 CREATE FUNCTION gen_blacklist RETURNS STRING
   SONAME 'data_masking.so';
 CREATE FUNCTION gen_dictionary RETURNS STRING
   SONAME 'data_masking.so';
 CREATE FUNCTION gen_dictionary_drop RETURNS STRING
   SONAME 'data_masking.so';
 CREATE FUNCTION gen_dictionary_load RETURNS STRING
   SONAME 'data_masking.so';
 CREATE FUNCTION gen_range RETURNS INTEGER
   SONAME 'data_masking.so';
 CREATE FUNCTION gen_rnd_email RETURNS STRING
   SONAME 'data_masking.so';
 CREATE FUNCTION gen_rnd_pan RETURNS STRING
   SONAME 'data_masking.so';
 CREATE FUNCTION gen_rnd_ssn RETURNS STRING
   SONAME 'data_masking.so';
 CREATE FUNCTION gen_rnd_us_phone RETURNS STRING
   SONAME 'data_masking.so';
 CREATE FUNCTION mask_inner RETURNS STRING
   SONAME 'data_masking.so';
 CREATE FUNCTION mask_outer RETURNS STRING
   SONAME 'data_masking.so';
 CREATE FUNCTION mask_pan RETURNS STRING
   SONAME 'data_masking.so';
 CREATE FUNCTION mask_pan_relaxed RETURNS STRING
   SONAME 'data_masking.so';
 CREATE FUNCTION mask_ssn RETURNS STRING
   SONAME 'data_masking.so';

If the plugin and functions are used on a replication source server, install them on all replica servers as well to avoid replication issues.

Once installed as just described, the plugin and functions remain installed until uninstalled. To remove them, use the note 'UNINSTALL PLUGIN': uninstall-plugin. and note 'DROP FUNCTION': drop-function. statements:

 UNINSTALL PLUGIN data_masking;
 DROP FUNCTION gen_blacklist;
 DROP FUNCTION gen_dictionary;
 DROP FUNCTION gen_dictionary_drop;
 DROP FUNCTION gen_dictionary_load;
 DROP FUNCTION gen_range;
 DROP FUNCTION gen_rnd_email;
 DROP FUNCTION gen_rnd_pan;
 DROP FUNCTION gen_rnd_ssn;
 DROP FUNCTION gen_rnd_us_phone;
 DROP FUNCTION mask_inner;
 DROP FUNCTION mask_outer;
 DROP FUNCTION mask_pan;
 DROP FUNCTION mask_pan_relaxed;
 DROP FUNCTION mask_ssn;

 File: manual.info.tmp, Node: data-masking-usage, Next: data-masking-function-reference, Prev: data-masking-installation, Up: data-masking

6.5.3 Using MySQL Enterprise Data Masking and De-Identification

Before using MySQL Enterprise Data Masking and De-Identification, install it according to the instructions provided at *note data-masking-installation::.

To use MySQL Enterprise Data Masking and De-Identification in applications, invoke the functions that are appropriate for the operations you wish to perform. For detailed function descriptions, see *note data-masking-functions::. This section demonstrates how to use the functions to carry out some representative tasks. It first presents an overview of the available functions, followed by some examples of how the functions might be used in real-world context:

Masking Data to Remove Identifying Characteristics

MySQL provides general-purpose masking functions that mask arbitrary strings, and special-purpose masking functions that mask specific types of values.

General-Purpose Masking Functions

'mask_inner()' and 'mask_outer()' are general-purpose functions that mask parts of arbitrary strings based on position within the string:

By default, 'mask_inner()' and 'mask_outer()' use ''X'' as the masking character, but permit an optional masking-character argument:

 mysql> SELECT mask_inner('This is a string', 5, 1, '*');
 +-------------------------------------------+
 | mask_inner('This is a string', 5, 1, '*') |
 +-------------------------------------------+
 | This **********g                          |
 +-------------------------------------------+
 mysql> SELECT mask_outer('This is a string', 5, 1, '#');
 +-------------------------------------------+
 | mask_outer('This is a string', 5, 1, '#') |
 +-------------------------------------------+
 | #####is a strin#                          |
 +-------------------------------------------+

Special-Purpose Masking Functions

Other masking functions expect a string argument representing a specific type of value and mask it to remove identifying characteristics.

Note:

The examples here supply function arguments using the random value generation functions that return the appropriate type of value. For more information about generation functions, see *note data-masking-usage-generation-functions::.

Payment card Primary Account Number masking

Masking functions provide strict and relaxed masking of Primary Account Numbers.

US Social Security number masking

'mask_ssn()' masks all but the last four digits of the number:

 mysql> SELECT mask_ssn(gen_rnd_ssn());
 +-------------------------+
 | mask_ssn(gen_rnd_ssn()) |
 +-------------------------+
 | XXX-XX-1723             |
 +-------------------------+

Generating Random Data with Specific Characteristics

Several functions generate random values. These values can be used for testing, simulation, and so forth.

'gen_range()' returns a random integer selected from a given range:

 mysql> SELECT gen_range(1, 10);
 +------------------+
 | gen_range(1, 10) |
 +------------------+
 |                6 |
 +------------------+

'gen_rnd_email()' returns a random email address in the 'example.com' domain:

 mysql> SELECT gen_rnd_email();
 +---------------------------+
 | gen_rnd_email()           |
 +---------------------------+
 | ayxnq.xmkpvvy@example.com |
 +---------------------------+

'gen_rnd_pan()' returns a random payment card Primary Account Number:

 mysql> SELECT gen_rnd_pan();

(The 'gen_rnd_pan()' function result is not shown because its return values should be used only for testing purposes, and not for publication. It cannot be guaranteed the number is not assigned to a legitimate payment account.)

'gen_rnd_ssn()' returns a random US Social Security number with the first and second parts each chosen from a range not used for legitimate numbers:

 mysql> SELECT gen_rnd_ssn();
 +---------------+
 | gen_rnd_ssn() |
 +---------------+
 | 912-45-1615   |
 +---------------+

'gen_rnd_us_phone()' returns a random US phone number in the 555 area code not used for legitimate numbers:

 mysql> SELECT gen_rnd_us_phone();
 +--------------------+
 | gen_rnd_us_phone() |
 +--------------------+
 | 1-555-747-5627     |
 +--------------------+

Generating Random Data Using Dictionaries

MySQL Enterprise Data Masking and De-Identification enables dictionaries to be used as sources of random values. To use a dictionary, it must first be loaded from a file and given a name. Each loaded dictionary becomes part of the dictionary registry. Items then can be selected from registered dictionaries and used as random values or as replacements for other values.

A valid dictionary file has these characteristics:

Suppose that a file named 'de_cities.txt' contains these city names in Germany:

 Berlin
 Munich
 Bremen

Also suppose that a file named 'us_cities.txt' contains these city names in the United States:

 Chicago
 Houston
 Phoenix
 El Paso
 Detroit

Assume that the 'secure_file_priv' system variable is set to '/usr/local/mysql/mysql-files'. In that case, copy the dictionary files to that directory so that the MySQL server can access them. Then use 'gen_dictionary_load()' to load the dictionaries into the dictionary registry and assign them names:

 mysql> SELECT gen_dictionary_load('/usr/local/mysql/mysql-files/de_cities.txt', 'DE_Cities');
 +--------------------------------------------------------------------------------+
 | gen_dictionary_load('/usr/local/mysql/mysql-files/de_cities.txt', 'DE_Cities') |
 +--------------------------------------------------------------------------------+
 | Dictionary load success                                                        |
 +--------------------------------------------------------------------------------+
 mysql> SELECT gen_dictionary_load('/usr/local/mysql/mysql-files/us_cities.txt', 'US_Cities');
 +--------------------------------------------------------------------------------+
 | gen_dictionary_load('/usr/local/mysql/mysql-files/us_cities.txt', 'US_Cities') |
 +--------------------------------------------------------------------------------+
 | Dictionary load success                                                        |
 +--------------------------------------------------------------------------------+

To select a random term from a dictionary, use 'gen_dictionary()':

 mysql> SELECT gen_dictionary('DE_Cities');
 +-----------------------------+
 | gen_dictionary('DE_Cities') |
 +-----------------------------+
 | Berlin                      |
 +-----------------------------+
 mysql> SELECT gen_dictionary('US_Cities');
 +-----------------------------+
 | gen_dictionary('US_Cities') |
 +-----------------------------+
 | Phoenix                     |
 +-----------------------------+

To select a random term from multiple dictionaries, randomly select one of the dictionaries, then select a term from it:

 mysql> SELECT gen_dictionary(ELT(gen_range(1,2), 'DE_Cities', 'US_Cities'));
 +---------------------------------------------------------------+
 | gen_dictionary(ELT(gen_range(1,2), 'DE_Cities', 'US_Cities')) |
 +---------------------------------------------------------------+
 | Detroit                                                       |
 +---------------------------------------------------------------+
 mysql> SELECT gen_dictionary(ELT(gen_range(1,2), 'DE_Cities', 'US_Cities'));
 +---------------------------------------------------------------+
 | gen_dictionary(ELT(gen_range(1,2), 'DE_Cities', 'US_Cities')) |
 +---------------------------------------------------------------+
 | Bremen                                                        |
 +---------------------------------------------------------------+

The 'gen_blacklist()' function enables a term from one dictionary to be replaced by a term from another dictionary, which effects masking by substitution. Its arguments are the term to replace, the dictionary in which the term appears, and the dictionary from which to choose a replacement. For example, to substitute a US city for a German city, or vice versa, use 'gen_blacklist()' like this:

 mysql> SELECT gen_blacklist('Munich', 'DE_Cities', 'US_Cities');
 +---------------------------------------------------+
 | gen_blacklist('Munich', 'DE_Cities', 'US_Cities') |
 +---------------------------------------------------+
 | Houston                                           |
 +---------------------------------------------------+
 mysql> SELECT gen_blacklist('El Paso', 'US_Cities', 'DE_Cities');
 +----------------------------------------------------+
 | gen_blacklist('El Paso', 'US_Cities', 'DE_Cities') |
 +----------------------------------------------------+
 | Bremen                                             |
 +----------------------------------------------------+

If the term to replace is not in the first dictionary, 'gen_blacklist()' returns it unchanged:

 mysql> SELECT gen_blacklist('Moscow', 'DE_Cities', 'US_Cities');
 +---------------------------------------------------+
 | gen_blacklist('Moscow', 'DE_Cities', 'US_Cities') |
 +---------------------------------------------------+
 | Moscow                                            |
 +---------------------------------------------------+

Using Masked Data for Customer Identification

At customer-service call centers, one common identity verification technique is to ask customers to provide their last four Social Security number (SSN) digits. For example, a customer might say her name is Joanna Bond and that her last four SSN digits are '0007'.

Suppose that a 'customer' table containing customer records has these columns:

For example, the table might be defined as follows:

 CREATE TABLE customer
 (
   id         BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
   first_name VARCHAR(40),
   last_name  VARCHAR(40),
   ssn        VARCHAR(11)
 );

The application used by customer-service representatives to check the customer SSN might execute a query like this:

 mysql> SELECT id, ssn
 mysql> FROM customer
 mysql> WHERE first_name = 'Joanna' AND last_name = 'Bond';
 +-----+-------------+
 | id  | ssn         |
 +-----+-------------+
 | 786 | 906-39-0007 |
 +-----+-------------+

However, that exposes the SSN to the customer-service representative, who has no need to see anything but the last four digits. Instead, the application can use this query to display only the masked SSN:

 mysql> SELECT id, mask_ssn(CONVERT(ssn USING binary)) AS masked_ssn
 mysql> FROM customer
 mysql> WHERE first_name = 'Joanna' AND last_name = 'Bond';
 +-----+-------------+
 | id  | masked_ssn  |
 +-----+-------------+
 | 786 | XXX-XX-0007 |
 +-----+-------------+

Now the representative sees only what is necessary, and customer privacy is preserved.

Why was the 'CONVERT()' function used for the argument to 'mask_ssn()'? Because 'mask_ssn()' requires an argument of length 11. Thus, even though 'ssn' is defined as 'VARCHAR(11)', if the 'ssn' column has a multibyte character set, it may appear to be longer than 11 bytes when passed to a loadable function, and an error occurs. Converting the value to a binary string ensures that the function sees an argument of length 11.

A similar technique may be needed for other data masking functions when string arguments do not have a single-byte character set.

Creating Views that Display Masked Data

If masked data from a table is used for multiple queries, it may be convenient to define a view that produces masked data. That way, applications can select from the view without performing masking in individual queries.

For example, a masking view on the 'customer' table from the previous section can be defined like this:

 CREATE VIEW masked_customer AS
 SELECT id, first_name, last_name,
 mask_ssn(CONVERT(ssn USING binary)) AS masked_ssn
 FROM customer;

Then the query to look up a customer becomes simpler but still returns masked data:

 mysql> SELECT id, masked_ssn
 mysql> FROM masked_customer
 mysql> WHERE first_name = 'Joanna' AND last_name = 'Bond';
 +-----+-------------+
 | id  | masked_ssn  |
 +-----+-------------+
 | 786 | XXX-XX-0007 |
 +-----+-------------+

 File: manual.info.tmp, Node: data-masking-function-reference, Next: data-masking-functions, Prev: data-masking-usage, Up: data-masking

6.5.4 MySQL Enterprise Data Masking and De-Identification Function Reference

MySQL Enterprise Data Masking and De-Identification Functions

Name Description

'gen_blacklist()' Perform dictionary term replacement

'gen_dictionary_drop()'Remove dictionary from registry

'gen_dictionary_load()'Load dictionary into registry

'gen_dictionary()' Return random term from dictionary

'gen_range()' Generate random number within range

'gen_rnd_email()' Generate random email address

'gen_rnd_pan()' Generate random payment card Primary Account Number

'gen_rnd_ssn()' Generate random US Social Security Number

'gen_rnd_us_phone()' Generate random US phone number

'mask_inner()' Mask interior part of string

'mask_outer()' Mask left and right parts of string

'mask_pan()' Mask payment card Primary Account Number part of string

'mask_pan_relaxed()' Mask payment card Primary Account Number part of string

'mask_ssn()' Mask US Social Security Number

 File: manual.info.tmp, Node: data-masking-functions, Prev: data-masking-function-reference, Up: data-masking

6.5.5 MySQL Enterprise Data Masking and De-Identification Function Descriptions

The MySQL Enterprise Data Masking and De-Identification plugin library includes several functions, which may be grouped into these categories:

These functions treat string arguments as binary strings (which means they do not distinguish lettercase), and string return values are binary strings. If a string return value should be in a different character set, convert it. The following example shows how to convert the result of 'gen_rnd_email()' to the 'utf8mb4' character set:

 SET @email = CONVERT(gen_rnd_email() USING utf8mb4);

It may also be necessary to convert string arguments, as illustrated in *note data-masking-usage-customer-identification::.

If a MySQL Enterprise Data Masking and De-Identification function is invoked from within the note 'mysql': mysql. client, binary string results display using hexadecimal notation, depending on the value of the '--binary-as-hex'. For more information about that option, see note mysql::.

Data Masking Functions

Each function in this section performs a masking operation on its string argument and returns the masked result.

Random Data Generation Functions

The functions in this section generate random values for different types of data. When possible, generated values have characteristics reserved for demonstration or test values, to avoid having them mistaken for legitimate data. For example, 'gen_rnd_us_phone()' returns a US phone number that uses the 555 area code, which is not assigned to phone numbers in actual use. Individual function descriptions describe any exceptions to this principle.

Random Data Dictionary-Based Functions

The functions in this section manipulate dictionaries of terms and perform generation and masking operations based on them. Some of these functions require the 'SUPER' privilege.

When a dictionary is loaded, it becomes part of the dictionary registry and is assigned a name to be used by other dictionary functions. Dictionaries are loaded from plain text files containing one term per line. Empty lines are ignored. To be valid, a dictionary file must contain at least one nonempty line.

 File: manual.info.tmp, Node: enterprise-encryption, Next: selinux, Prev: data-masking, Up: security