Let’s see some ID: Why data identification is a crucial first step for compliance

When two Twitter employees were recently criminally charged with allegedly accessing user data on behalf of the Saudi government, it spurred a larger conversation about data access controls (or the lack thereof) for employees. Why, many experts wondered, did these workers have such apparently unfettered access to sensitive information not at all needed to do their jobs – a clear violation of the well-known security standard of least privilege?

While debate around this incident will likely persist, it almost goes without saying that for many organizations least privilege is more of an ideal than a reality. In many cases, that’s because companies have no idea how to cost-effectively identify and track their own sensitive data, and that of their customers, across multiple systems.

The information explosion and data identification

There’s simply no getting around it: data is everywhere in 2019.

Not only is it everywhere, it’s growing at a rate that’s nearly impossible to comprehend. Virtually every organization – from retailers to sports leagues, to luxury fashion houses, to natural resources companies, to government and military entities, to NGOs – now generates what just a few years ago would have seemed a ludicrous amount of data.

A recent International Data Corp. (IDC) report says the world’s collective data will grow at an annual clip of 61 percent over the next few years, to reach 175 zettabytes by 2025. (For those not familiar with a zettabyte, it’s the equivalent of one sextillion bytes. For those not familiar with a sextillion, that’s a 1 followed by 21 zeros – let’s just say it’s a heck of a lot.) The easy scalability and flexibility of cloud data platforms and the advent of hundreds of millions of IoT and edge devices (forecast to reach 1.5 billion by 2022) have only driven those numbers higher.

But as companies generate, store, and share more and more data, and more types of data from multiple disparate sources, they’re also confronted with the challenge of protecting all that information from both unauthorized outsiders and internal bad actors.

You’ve got to know your data…intimately

The Ancient Greeks knew existence was futile without knowing thyself, and the same principle goes for your organization’s ever-growing mountain of data. Because to adequately protect your organization’s information, you first need to identify what needs protecting. And you can’t do that without knowing your data at an intimate level: what it is, what it contains, where it lives, and so on.

To make things even more challenging, your data can now reside on more systems than ever – cloud, mobile devices, local machines, or company networks. And while some of that data may need no protection at all in terms of compliance or privacy issues, several other data types absolutely must be protected:

Controlled unclassified information (CUI)
Payment card information (PCI)
Protected health information (PHI)
Personally identifiable information (PII)

That’s not even counting company financial data, HR data, trade secrets, or even (for those who deal with government and military) classified information. Leaving this kind of data exposed invites both hackers who would love to steal it and the ire of regulators willing and able to mete out swift and harsh punishments. Failure to protect any of the above likely means a serious (and costly) breach of regulations like the EU’s General Data Protection Regulation (GDPR), International Traffic in Arms Regulations (ITAR), or NATO STANAGs, depending on what data is exposed – with punishments ranging from fines to jail time.

The problem with big data (and how to solve it with software)

Having lots and lots of data is inherently a good thing. If you’ve got a well-designed data platform backed by strong data governance, integration, and management rules, it can also be a massive competitive advantage. But this can be a double-edged sword: thanks to the velocity and scale of data mentioned earlier, it’s now harder than ever to know it intimately using manual tools and processes. The more data your organization creates, the harder this gets.

Certain software tools, however, can be a huge help in identifying and classifying data both in flight and at rest across various systems like OneDrive, Google Drive, SharePoint and others. These tools can also identify and dispose of redundant or obsolete data, reducing storage costs and lowering your risk on multiple fronts.

Machine learning vs. deep learning

Both machine learning and deep learning are forms of artificial intelligence, but they differ considerably in complexity:

Machine learning uses algorithms that improve on their own, without being explicitly reprogrammed by a human, after being fed reams and reams of (typically structured) data from which to learn.

Deep learning is a subset of machine learning that uses similar algorithms but stacks many of them in layers, with each layer taking a slightly different perspective on the data it is fed. These layers of algorithms are called artificial neural networks since they’re partly inspired by the workings of biological neural networks.
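To make the "layers" idea concrete, here is a minimal sketch of a two-layer neural network scoring a single input. Everything in it is an assumption for illustration: the weights are hand-picked rather than learned, the function names are hypothetical, and a real deep learning model would have many more layers and parameters.

```python
import math


def dense(inputs, weights, biases):
    """One layer: each output is a weighted sum of all inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Non-linearity applied between layers: negative values become zero."""
    return [max(0.0, v) for v in values]


def sigmoid(value):
    """Squash a raw score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


# Hand-picked, hypothetical weights; a real model learns these from data.
LAYER1_W = [[1.0, -1.0], [0.5, 0.5]]
LAYER1_B = [0.0, -0.5]
LAYER2_W = [[1.0, 2.0]]
LAYER2_B = [-1.0]


def predict(features):
    """Pass the features through layer 1, then layer 2, and return a score."""
    hidden = relu(dense(features, LAYER1_W, LAYER1_B))
    (logit,) = dense(hidden, LAYER2_W, LAYER2_B)
    return sigmoid(logit)
```

Each layer sees only the output of the layer before it, which is what lets deeper networks build up progressively more abstract views of the same input.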

Titus’ Accelerator for Privacy uses the latest deep learning technologies to accurately predict the existence of sensitive data in files and emails at the point of creation.

Using such software tools allows organizations to not only identify their data and where it resides but to then proactively tag each email or file with metadata to ensure it stays identified and protected across various systems or when handled by partners. Effective identification and classification tools using deep learning typically allow organizations to:

Find and identify sensitive data in emails, documents, and systems based on various categories you can create and train your system to recognize, such as financial information, proprietary information, or personal information.
Apply the right levels of protection through metadata embedded in emails or files, set up automated rules to protect it, and even remind employees to take extra care when handling this kind of data.
Combine this rich metadata with encryption technology, digital rights management (DRM) software, enterprise rights management (ERM) software, cloud access security brokers (CASB), and next-gen firewalls for a cohesive, integrated approach to ensure strong data protection across global operations. These third-party solutions can be configured to automatically read and understand classification metadata and apply the appropriate controls.
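As a rough illustration of the find-then-tag workflow above, the sketch below scans text for a few sensitive patterns and attaches classification metadata. It deliberately uses simple regular expressions and a Luhn checksum rather than deep learning, and the patterns, category names, and labels (`PII`, `PCI`, `RESTRICTED`, `PUBLIC`) are assumptions for the example, not the behavior of any particular product.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns for illustration; real tools use far richer detection.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0


@dataclass
class Classification:
    categories: set = field(default_factory=set)
    label: str = "PUBLIC"


def classify(text: str) -> Classification:
    """Detect sensitive categories in text and choose a protection label."""
    result = Classification()
    if EMAIL_RE.search(text) or SSN_RE.search(text):
        result.categories.add("PII")
    if any(luhn_valid(m.group()) for m in CARD_RE.finditer(text)):
        result.categories.add("PCI")
    if result.categories:
        result.label = "RESTRICTED"
    return result


def tag_document(name: str, text: str) -> dict:
    """Build the metadata record that would travel with the file or email."""
    c = classify(text)
    return {"name": name, "classification": c.label,
            "categories": sorted(c.categories)}
```

Downstream tools such as DRM or a CASB would then only need to read the `classification` field rather than re-inspect the content themselves.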

Data identification done right

There are several ways companies can automate the data identification process; which approach to take depends on the information in question.

Some solutions are designed to detect sensitive data in motion, like when your employee accidentally attaches this quarter’s internal financial statements to an external email and presses ‘send’. They help people and systems understand how to deal with certain types of sensitive data through visual markings and metadata – and lock things down with automated controls for the times when people forget.
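A data-in-motion control of this kind might look something like the sketch below, which blocks an outgoing message when a classified attachment is headed to an external recipient. The company domain, the label names, and the `Attachment` structure are all hypothetical; a real mail gateway would read the label from the file's embedded metadata.

```python
from dataclasses import dataclass

INTERNAL_DOMAIN = "example.com"  # assumption: hypothetical company domain
BLOCKED_EXTERNALLY = {"INTERNAL", "RESTRICTED"}  # hypothetical labels


@dataclass
class Attachment:
    filename: str
    classification: str  # label read from the file's metadata


def is_external(address: str) -> bool:
    """Treat any address outside the company domain as external."""
    return address.rsplit("@", 1)[-1].lower() != INTERNAL_DOMAIN


def check_outbound(recipients, attachments):
    """Return (allowed, reasons) for an outgoing message."""
    external = [r for r in recipients if is_external(r)]
    reasons = []
    if external:
        for att in attachments:
            if att.classification in BLOCKED_EXTERNALLY:
                reasons.append(
                    f"{att.filename} is {att.classification}; "
                    f"external recipient(s): {', '.join(external)}"
                )
    return (not reasons, reasons)
```

The returned reasons can double as the reminder shown to the employee, so the control explains itself instead of failing silently.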

Other solutions focus on identifying data at rest. This is important because organizations and employees typically save far more data and information than they need: according to the Association for Intelligent Information Management (AIIM), an average of one-third of the data on every unmanaged server – and up to 70 percent in many cases – is redundant, obsolete or trivial (ROT). This data can sit on unmanaged servers indefinitely and presents an under-appreciated privacy risk to organizations and employees.
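For a sense of how a data-at-rest scan can surface ROT, here is a minimal sketch that walks a directory, flags byte-identical files as redundant, and flags files not modified for a long time as obsolete. The three-year default threshold and the function names are assumptions for the example; real tools apply retention policies and much richer content analysis.

```python
import hashlib
import os
import time
from collections import defaultdict


def file_digest(path, chunk_size=65536):
    """Hash a file's contents in chunks so large files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def scan_for_rot(root, max_age_days=365 * 3, now=None):
    """Report duplicate files and files older than the age threshold."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    by_digest = defaultdict(list)
    obsolete = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            by_digest[file_digest(path)].append(path)
            if os.path.getmtime(path) < cutoff:
                obsolete.append(path)
    # Any digest shared by more than one path marks a set of identical copies.
    redundant = [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
    return {"redundant": redundant, "obsolete": sorted(obsolete)}
```

A report like this is only a starting point: someone still has to decide which copy is authoritative and whether an old file is subject to a retention requirement before anything is deleted.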

A third category of data protection tools, meanwhile, expedites all the above by using pre-packaged deep learning functionality to identify sensitive data (and apply appropriate protections) at the point of creation. Deep learning-based tools provide faster time-to-value by speeding up sensitive data detection, require minimal end-user training, and work with data both in flight and at rest – locating and identifying data as soon as it’s created, wherever it exists during its lifecycle.

As your organization becomes more data-driven to keep pace with partners and competitors, your systems will be forced to generate and ingest more and more structured and unstructured data. Greater demands – by customers, employees, and regulators – will also be placed on your data. That’s why having a software-based data identification and classification solution in place is quickly becoming a must-have to make sense of it all – while also helping to (hopefully) cut down on international incidents.