“Data is more valuable than oil”, yet are we using it to its full potential?
The answer is simple: no. Much of what we collect goes unused and simply becomes dark data.
Gartner coined the term ‘dark data’ and defines it as “the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).”
Dark data is generated by an organization’s systems, devices, and interactions. Most of the time it is collected by CRM, ERP, SCADA, IoT, web (HTTP) and even Wi-Fi systems.
It can be stored in physical form, on storage devices, or in the cloud. Most of it is unstructured. Examples of dark data include, but are not limited to:
Application logs
Customer records
Geolocation
Survey data
Financial statements
Customer addresses
Contact details
CCTV footage
Emails
Chat messages
Medical records
Zip files
Archived web content
Code snippets
The biggest challenges with dark data are:
Security risks (breaches)
Compliance issues
Data authenticity
High storage cost
Damage to brand reputation
Opportunity cost
The risk associated with dark data can be mitigated by adhering to the audit and retention policies defined by the organization. In addition, a few best practices can have a high impact on managing that risk.
The model below shows how data is collected, stored, retained and deleted, following an analyze, categorize and classify approach.
Model Explained:
Start with data classification (Public, Internal, Restricted).
While classifying, it is vital to bucket the data based on a few critical factors, namely:
Critical data?
Permanent document?
Proprietary Intellectual Property?
Serves the current needs of operations?
Legal and regulatory requirement? (For instance, HIPAA requires covered entities to retain required compliance documentation for at least six years. By contrast, GDPR requires that personal data be kept no longer than necessary, and allows longer storage only when the data is processed solely for archiving in the public interest, scientific or historical research, or statistical purposes.)
Hot data or cold data? (Hot data is accessed frequently and used for quick decisions, whereas cold data is older and rarely accessed.)
Based on the classification, decide whether to store or delete the data.
If the data is to be stored, decide the retention period and how the data will be useful.
When we follow this approach, along with regular data audits and internal Data Life Cycle Management (DLCM), we can make the most of the data in the data pool.
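The classify-then-decide steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference implementation: the Record fields, the 90-day hot-data window, the one-year default retention and the decide function are all hypothetical names and values chosen for the example, and a real policy would come from the organization's own audit and retention rules.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional, Tuple

CLASSIFICATIONS = {"Public", "Internal", "Restricted"}
HOT_WINDOW_DAYS = 90          # assumption: accessed within 90 days counts as hot
DEFAULT_RETENTION_YEARS = 1   # assumption: placeholder default retention


@dataclass
class Record:
    """Hypothetical description of one data set, mirroring the factors listed above."""
    name: str
    classification: str               # Public, Internal or Restricted
    critical: bool                    # critical data?
    permanent: bool                   # permanent document?
    proprietary_ip: bool              # proprietary intellectual property?
    serves_operations: bool           # serves current operational needs?
    legal_min_years: Optional[int]    # regulatory minimum retention, if any
    last_accessed: date


def temperature(record: Record, today: date) -> str:
    """Label a record hot or cold based on how recently it was accessed."""
    age = today - record.last_accessed
    return "hot" if age <= timedelta(days=HOT_WINDOW_DAYS) else "cold"


def decide(record: Record, today: date) -> Tuple[str, Optional[int]]:
    """Return (action, retention_years); None means no fixed retention period."""
    if record.classification not in CLASSIFICATIONS:
        raise ValueError(f"unknown classification: {record.classification}")
    if record.permanent or record.proprietary_ip:
        return ("store", None)                      # keep with no fixed end date
    if record.legal_min_years is not None:
        return ("store", record.legal_min_years)    # regulation sets the minimum
    if record.critical or record.serves_operations or temperature(record, today) == "hot":
        return ("store", DEFAULT_RETENTION_YEARS)
    return ("delete", None)                         # cold data with no reason to keep


today = date(2024, 1, 1)
audit_docs = Record("audit-docs", "Restricted", False, False, False, False, 6, date(2020, 1, 1))
old_survey = Record("old-survey", "Internal", False, False, False, False, None, date(2020, 1, 1))
print(decide(audit_docs, today))   # ('store', 6)
print(decide(old_survey, today))   # ('delete', None)
```

The order of the checks is a design choice: permanent and proprietary material is kept first, regulatory minimums come next, and only then do operational value and access frequency decide the outcome.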
Ways to leverage dark data:
Text Mining / Word mining
Data mining methods
Voice to Text analytics
Data analytics
Prescriptive analytics
Behavior analysis, which can be used to train AI models for prediction
Big data analytics and visualization (for example, SAP HANA)
Data Forecasting
Trend Analysis
Investigate past complaints
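As a small illustration of the first item, text mining, the sketch below counts the most frequent terms in a handful of application log lines using only the Python standard library. The sample log lines, the stopword list and the top_terms function are made up for this example; real text mining would typically use a dedicated library and far more data.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked stopword list, just for this example.
STOPWORDS = {"the", "a", "to", "for", "of", "and", "in", "on", "is"}


def top_terms(lines, n=2):
    """Return the n most frequent words across the given lines of text."""
    counts = Counter()
    for line in lines:
        # Lowercase the line and keep alphabetic tokens only (drops order numbers).
        for word in re.findall(r"[a-z]+", line.lower()):
            if word not in STOPWORDS and len(word) > 2:
                counts[word] += 1
    return counts.most_common(n)


# Hypothetical log lines standing in for archived application logs.
sample_logs = [
    "ERROR payment timeout for order 1041",
    "WARN retrying payment for order 1041",
    "ERROR payment timeout for order 1077",
    "INFO order 1077 cancelled by customer",
]

print(top_terms(sample_logs))  # [('order', 4), ('payment', 3)]
```

Even this crude count hints that payment problems dominate the sample logs, which is the kind of signal that stays hidden while the logs sit unused.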
Google’s approach to data management:
“Some data you can delete whenever you like, some data is deleted automatically, and some data we retain for longer periods of time when necessary. When you delete data, we follow a deletion policy to make sure that your data is safely and completely removed from our servers or retained only in anonymized form.”
Apple’s approach to data storage:
“Apple uses personal data to power our services, to process your transactions, to communicate with you, for security and fraud prevention, and to comply with law. We may also use personal data for other purposes with your consent.”
Final say:
Data breaches have attracted a lot of attention in recent years as businesses become more dependent on digital data, cloud computing, and remote working. As a result, compliance with regulations has become a requirement for ensuring information security.
Data analytics application suites can help manage unstructured data in a unified way and intelligently identify data sets across the organization, in line with the industry’s legal and regulatory requirements.