Expanding our data discovery leadership with machine learning classification tools

Identification capabilities bring speed and scale to applying governance policies

Sam Curcuruto
Sr. Product Marketing Manager, Data Discovery, OneTrust
May 4, 2023

Sensitive data lives everywhere in the organization, including databases, systems, documents, and apps. However, not all data stores are the same, which creates classification challenges for some automated solutions. OneTrust Data Discovery uses advanced machine learning (ML) and artificial intelligence (AI) to identify documents that cannot be classified using traditional pattern matching approaches. Once a document's type has been determined from its content and context, organizations can automatically apply the right governance policies to ensure data is used responsibly.

Eliminate manual effort and classify data using content and context

OneTrust Data Discovery goes beyond traditional pattern matching to intelligently scan a document and identify its type, such as a resume, passport, financial statement, or medical record. Machine learning saves time by classifying data at scale, which minimizes manual intervention and increases accuracy.

Automatically apply retention, deletion, and data protection policies

Once data is classified, security teams can ensure it is protected and handled according to its classification and the applicable regulatory requirements. Using our improved classification and document identification, we can apply policies at the data level, such as ‘files containing PII,’ and at the document level, such as ‘resumés’ or ‘financial reports.’

Using these improved classifications enables the application and enforcement of policies like retention, deletion, or quarantine. We can also apply access policies to different data or document types, like ensuring that sensitive files or data are not shared with open access.
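To make the idea concrete, here is a minimal Python sketch of mapping classification labels to governance policies. It is an illustration only, not OneTrust's implementation: the label names, the `Policy` fields, the retention values, and the rule that the strictest setting wins (shortest retention, quarantine if any label asks for it, no open access unless every label allows it) are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Policy:
    retention_days: Optional[int]  # None means no retention limit is set
    quarantine: bool
    allow_open_access: bool


# Hypothetical labels and values, chosen only for illustration.
POLICIES = {
    "resume": Policy(retention_days=365, quarantine=False, allow_open_access=False),
    "financial_report": Policy(retention_days=2555, quarantine=False, allow_open_access=False),
    "contains_pii": Policy(retention_days=None, quarantine=True, allow_open_access=False),
}

# Assumed fallback for files that carry no recognized label.
DEFAULT_POLICY = Policy(retention_days=None, quarantine=False, allow_open_access=True)


def resolve_policy(labels):
    """Combine the policies of every label on a file, keeping the strictest setting."""
    matched = [POLICIES[label] for label in labels if label in POLICIES]
    if not matched:
        return DEFAULT_POLICY
    limits = [p.retention_days for p in matched if p.retention_days is not None]
    return Policy(
        retention_days=min(limits) if limits else None,  # shortest limit wins
        quarantine=any(p.quarantine for p in matched),
        allow_open_access=all(p.allow_open_access for p in matched),
    )


if __name__ == "__main__":
    # A resume that also contains PII gets the document-level retention
    # limit plus the data-level quarantine flag.
    print(resolve_policy(["resume", "contains_pii"]))
```

Because a file can carry both a data-level label and a document-level label, the sketch resolves them into a single policy rather than picking one label over the other.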

Applications of ML models

OneTrust Data Discovery employs a number of intelligent technologies and new techniques to help our customers better discover, control, and activate their data at scale.

We use AI, natural language processing (NLP), and ML technology to automate document classification and categorize documents based on content, because industries like legal, healthcare, and finance have large volumes of documents to process. The algorithms learn from labeled data sets to recognize patterns and characteristics in text to classify documents accurately and efficiently.
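As a rough illustration of what learning from labeled data means, the sketch below trains a tiny multinomial Naive Bayes text classifier using only the Python standard library. Naive Bayes is a stand-in chosen for brevity; the source does not say which algorithms OneTrust uses. The training snippets, label names, and class name are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, labeled_docs):
        for text, label in labeled_docs:
            self.doc_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        words = tokenize(text)
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labeled examples; a real training set would be far larger.
TRAINING = [
    ("work experience education skills references", "resume"),
    ("skills summary employment history education", "resume"),
    ("balance sheet revenue quarterly earnings", "financial_statement"),
    ("net income revenue expenses cash flow", "financial_statement"),
    ("patient diagnosis prescription treatment", "medical_record"),
    ("patient history allergies medication dosage", "medical_record"),
]

clf = NaiveBayesClassifier()
clf.train(TRAINING)

if __name__ == "__main__":
    print(clf.classify("education and employment skills"))
```

The point of the example is that no pattern such as a fixed number format is involved: the label comes from which words tend to appear together in each document type.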

Named entities are a classic area where many solutions struggle. Consider the word “Savannah,” which could be a person’s name or the city in the U.S. state of Georgia. To help classify data appropriately, we have tuned spaCy's Named Entity Recognition (NER) model, a machine learning model that identifies and extracts named entities (people, organizations, locations) from unstructured text data. It can identify named entities in different languages, making it valuable for global customers.
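The toy example below shows why context matters for a word like “Savannah.” It is not spaCy's NER model and not OneTrust's tuned version of it: it is a deliberately simple standard-library heuristic that counts hand-picked cue words near the target word, whereas a trained NER model learns such signals from data. The cue lists, the window size, and the function name are assumptions made for this illustration.

```python
import re

# Hypothetical cue words, hand-picked for this example only.
PERSON_CUES = {"ms", "mr", "mrs", "dr", "said", "hired", "she", "he", "her", "his"}
LOCATION_CUES = {"in", "to", "from", "near", "city", "georgia", "visited", "downtown"}


def disambiguate(text, target="Savannah", window=3):
    """Label each occurrence of `target` using cue words within `window` tokens."""
    tokens = re.findall(r"[A-Za-z]+", text)
    lowered = [t.lower() for t in tokens]
    results = []
    for i, token in enumerate(tokens):
        if token != target:
            continue
        # Collect up to `window` tokens on each side of the target.
        context = lowered[max(0, i - window):i] + lowered[i + 1:i + 1 + window]
        person = sum(word in PERSON_CUES for word in context)
        location = sum(word in LOCATION_CUES for word in context)
        if person > location:
            results.append("PERSON")
        elif location > person:
            results.append("LOCATION")
        else:
            results.append("AMBIGUOUS")
    return results


if __name__ == "__main__":
    print(disambiguate("We hired Savannah as an analyst."))
    print(disambiguate("The office moved to Savannah, Georgia last year."))
```

A pattern matcher sees the same string in both sentences; only the surrounding words separate the person from the place.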

We have also developed new ways to use Optical Character Recognition (OCR) machine learning models to extract characters from images, including printed or handwritten text, and convert them into machine-readable text. Thanks to the speed of our scanning technology, PDFs and JPGs can be classified at scale.

Privacy by design is built into our AI and ML strategy

OneTrust has been using machine learning and AI for more than a year, and our models have been trained and used by privacy professionals. Our strategy has always been to use these and other new technologies to better uncover, classify, and protect data, and to encourage its responsible use across all enterprises.

We have built and deployed our technology with privacy by design, so that each customer’s model is their own, tailored to and trained on their own unique data and environment. Those models are never shared with anyone else.

Let us show you how it works — request a demo today. 
