PII and other sensitive data are often hidden in enterprise unstructured data silos, in places it shouldn’t be, and storage teams need to know.
Protecting personally identifiable information (PII) and other sensitive data has been a concern of enterprise IT organizations throughout the digital age, and the regulations it spurred, such as GDPR and HIPAA, are now several years or even decades old. With digitization accelerating since the pandemic, the problem is getting worse. What has been described as the biggest breach of PII on record was reported last August: nearly three billion records containing the PII of an unknown number of “U.S., Canadian, and British citizens” (including Social Security numbers and criminal records) were stolen in a hack of the computer systems of National Public Data.
AI is an additional, newer factor. Attempts to input PII into GenAI platforms represent over half (55%) of data loss prevention (DLP) events, followed by confidential documents (40%), according to 2024 research by Menlo Security. Not only are these incidents damaging for customer relationships, regulatory compliance and marketplace reputation, they are getting more expensive all the time. The global average cost of a data breach reached $4.88 million in 2024, according to IBM.
It has primarily been the responsibility of cybersecurity teams to monitor and protect sensitive data, using policies, education, and a mix of tools to detect and prevent attacks. IT infrastructure and storage teams have been involved vis-à-vis backups and recovery, adhering to regulations on data storage and implementing data access control mechanisms.
These days, security is increasingly built into data storage technologies, making data protection more front and center for storage managers. Meanwhile, storage administrators are becoming data managers more than storage managers, as unstructured data lives across many silos from the data center to the cloud to the edge. Data storage teams must pay closer attention to data governance and work more closely with departmental and line-of-business teams, since they are managing data access and performance as well as AI data workflows and cloud data migrations on behalf of many diverse stakeholders.
As part of these efforts, data storage teams should be able to detect PII, IP and other sensitive data types and mitigate the risks of this data being stored or shared against industry regulations and internal policies. Increasingly they will also be tasked with ensuring that only the right unstructured data sets they manage are ingested by AI services and data pipelines.
The problem is, they typically lack unified, granular visibility into unstructured data across disparate hybrid silos—including whether PII is in places where it shouldn’t be.
The challenge of finding, controlling and managing sensitive information in unstructured data assets
Unstructured data in the enterprise is large, diverse and everywhere; it’s generated by users, machines, mobile devices, apps, social media sites, chatbots, email, sensors and more. File data is the most accessible, used by employees across the organization and shared and copied readily. It’s easy for PII and other sensitive data types to end up in the wrong place—usually by mistake.
Finding PII, for example, often requires hunting and pecking through file shares and directories manually. Even if you have AI tools that can crack open files and detect PII, you still need to feed the data to the AI, and moving and processing all or most of your data is prohibitively slow and expensive.
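To make the detection step concrete, here is a minimal Python sketch of scanning text-like files in place for two common PII shapes. It is an illustration under stated assumptions, not how any particular product works: the pattern set, the file suffixes, and the function names `find_pii` and `scan_tree` are all hypothetical, and simple regular expressions like these will produce both false positives and misses.

```python
import re
from pathlib import Path

# Illustrative patterns only (assumption): real detectors add validation,
# context checks, and many more data types.
PII_PATTERNS = {
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def find_pii(text):
    """Return {pattern_name: match_count} for every pattern that matches."""
    hits = {}
    for name, pattern in PII_PATTERNS.items():
        count = len(pattern.findall(text))
        if count:
            hits[name] = count
    return hits


def scan_tree(root, suffixes=(".txt", ".csv", ".log")):
    """Scan text-like files under root; return {path: hits} for flagged files."""
    flagged = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in suffixes:
            try:
                text = path.read_text(encoding="utf-8", errors="ignore")
            except OSError:
                continue  # unreadable file: skip rather than abort the scan
            hits = find_pii(text)
            if hits:
                flagged[str(path)] = hits
    return flagged
```

Calling `scan_tree` on a share's mount point returns only the flagged paths and their match counts, so the files themselves stay where they are. At petabyte scale the same idea would need sampling, parallelism, and parsers for binary formats, none of which this sketch attempts.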
Additionally, IT infrastructure teams that are responsible for data management need to ensure sensitive data is moved out of places where it shouldn’t be, but they lack the tools to find sensitive data across their storage and cloud environments and to move it once it is identified. Some organizations may have sensitive data detection tools for their security teams, but these tools cannot move the data and are not available to storage teams.
Cybersecurity tools that include PII scanners will not be able to scale to meet the needs of filtering, tagging and mobilizing only the right data across petabytes of scattered unstructured data assets.
The benefits of better sensitive data discovery and management across unstructured data
Unstructured data is the unmined gold of the enterprise: highly abundant, but neither well understood nor well analyzed. It’s becoming vital for IT teams to free this data, make it easily accessible and mineable, and integrate it into different workflows for IT and the business, including BI, AI, compliance management, cost optimization, data placement and more. The risk of sensitive data leakage is high for many of these use cases. Storage and infrastructure administrators need to ensure that sensitive data is stored properly to protect it and that data workflows can exclude sensitive data as needed.
Here are a few considerations for sensitive data detection and mitigation:
- Whether using a standalone tool or capabilities within a broader unstructured data management platform, it’s ideal if the solution can work across storage and backup tools, data centers and clouds. This way, you have one view and one way to search and manage sensitive data versus trying to reconcile across different tools, which can create gaps and complexities.
- Can you act on the findings? Once sensitive data is discovered and tagged, storage managers need a foolproof, easy way to automatically confine it or delete it, move it to compliant locations, and/or set workflows to exclude sensitive data from business processes such as AI ingestion, where it could be leaked. The ability to audit and report on these processes is another feature to look for as you develop plans.
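As a rough illustration of acting on findings, the Python sketch below splits tagged files into those allowed into an AI ingestion workflow and those to be confined, then moves the confined files and returns an audit trail. Everything here is an assumption made for the example: the `"pii"` tag, the shape of the `tagged_files` mapping, and the function names `split_for_ingestion` and `quarantine` are hypothetical, not taken from any specific tool.

```python
import shutil
from pathlib import Path


def split_for_ingestion(tagged_files):
    """Split files into (allowed, blocked) lists for an AI ingestion workflow.

    tagged_files maps a file path to a set of tags. The "pii" tag is a
    hypothetical convention used only for this example.
    """
    allowed, blocked = [], []
    for path, tags in sorted(tagged_files.items()):
        (blocked if "pii" in tags else allowed).append(path)
    return allowed, blocked


def quarantine(paths, quarantine_dir):
    """Move blocked files to a restricted location and return audit records.

    Limitation of this sketch: files are placed by base name, so two files
    with the same name would collide; a real workflow needs unique targets.
    """
    dest_root = Path(quarantine_dir)
    dest_root.mkdir(parents=True, exist_ok=True)
    audit = []
    for src in paths:
        dest = dest_root / Path(src).name
        shutil.move(str(src), str(dest))
        audit.append(
            {"source": str(src), "destination": str(dest), "action": "quarantined"}
        )
    return audit
```

Only the `allowed` list would be handed to the ingestion pipeline, while the returned audit records could feed the reporting mentioned above. Whether to move, delete, or simply exclude flagged files is a policy decision; the sketch shows the move case only.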
With ransomware not slowing down, regulatory requirements for privacy and security continuing to expand, and the need for secure AI data workflows on the near horizon, it’s time to take a closer look at your sensitive data strategy and whether you have the right practices and tools to keep that data safe.
By Paul Chen