
Data discovery is a hard problem.
In the past decade, many organizations, big and small, have gotten good at collecting lots of data. According to British AI researcher Mustafa Suleyman, “Eighteen million gigabytes of data are added to the global sum every single minute of every day.” Collecting data is now easy, thanks to increasingly cheap technology like cloud storage and big data tools.
When it comes to making sense of this data, however, many organizations are lost.
Progress toward data excellence in the business world has been slow, as evidenced by NewVantage Partners’ (now Wavestone) annual surveys. For years, many companies weren’t improving at using data, and some were even getting worse.
That changed in 2024: the latest survey showed a big leap, with the share of companies calling themselves data-driven doubling from 24% to 48% in a single year.
What we see is a newfound enthusiasm for data-driven strategies, which can be partly attributed to the rise of generative AI.
If your organization shares this data excitement, this article is for you. We’ll cover the basics: finding and using your data effectively. We’ll explore data discovery, a key first step for any data need. Expect insights on modern tools and lots of practical tips.
If you’re looking to take your data discovery journey further with expert guidance, consider exploring the data services offered by JustSoftLab, a leading data consultant.
What is data discovery?
Data discovery is like searching through an old, dusty library, long forgotten, where you need to find and examine every book and document, including special archives, personal letters, and ancient contracts hidden in unseen corners.
In the context of your organization, the goal is to identify and understand the data assets it holds. This involves finding where the data resides, its format, and its relevance. By using tools for data cataloging and managing metadata (data describing your data), you create a map of what data you have to make it accessible.
Ideally, data discovery empowers leaders and professionals across various roles to easily visualize, engage with, and leverage crucial data.
Why your organization might need data discovery
There’s a dual perspective.
Data discovery is the bedrock of data governance strategies, which are led by dedicated data teams and involve getting people, processes, and technologies in sync so the organization can make the most of its data—using it smartly, ethically, and within the law. In this process, data discovery helps teams:
Determine data sensitivity levels to apply appropriate security protocols
Set access controls based on the data’s attributes and user roles
Identify data that may reside in unapproved cloud and on-premises sources, often due to the use of IT resources without official oversight (shadow IT)
Improve data incident response and recovery
Identify redundant, obsolete, or trivial data to declutter storage
Minimize data collection to what is strictly necessary
When you zoom out to the business side of things, there are mainly two motivations driving data discovery:
Compliance: It’s simple—rules and regulations like GDPR, CCPA, and HIPAA are out there, and they’re not playing around. They want businesses to know exactly what kind of data they’re holding, especially if it’s sensitive. Fines for non-compliance reach millions of dollars.
Analytics: Whether your organization wants to empower business users with self-service BI or dive into advanced analytics, be it for making decisions or building personalized products, data discovery is also the launchpad. You can’t make your data work for you if you don’t even know what you have or where it is.
So, while the data team might be spearheading the effort, data discovery isn’t just a technical task. It’s a crucial step toward protecting and pushing your business forward.
How data is discovered
Before embarking on a data discovery initiative, it’s helpful to understand the scope. Here are five key steps involved:
1. Exploring and accessing data sources
The first step in data discovery—identifying where data is stored or generated—presents its own set of challenges. Data often sits in isolated silos: file and object storage, software-defined storage, and the cloud. It’s generated by ERP and CRM systems, CMSs, cloud applications, mobile devices, data lakes, and beyond. We face dark data, duplicates, and unstructured data from social media, emails, and IoT sensors. Furthermore, accessing this data requires configuring connections, obtaining permissions, or using APIs.
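To make this concrete, here is a minimal sketch of a first-pass source inventory: a hypothetical list of systems and a basic reachability probe, run before deeper access (permissions, APIs) is negotiated. All source names, hostnames, and ports are placeholders.

```python
# A minimal sketch: probing a hypothetical inventory of data sources
# for basic reachability before deeper discovery begins.
import socket

DATA_SOURCES = [
    {"name": "crm-postgres", "host": "crm-db.internal", "port": 5432},
    {"name": "erp-mysql",    "host": "erp-db.internal", "port": 3306},
    {"name": "object-store", "host": "s3.amazonaws.com", "port": 443},
]

def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for source in DATA_SOURCES:
    status = "reachable" if is_reachable(source["host"], source["port"]) else "unreachable"
    print(f"{source['name']:<14} {status}")
```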
2. Organizing your data
After pinpointing data sources, the next challenge is to organize this data. This involves cataloging and classifying the data within a data catalog that needs to be integrated with existing systems. This central repository doesn’t store the data itself but indexes metadata about each data asset, including its storage location, format, primary content, and classifications based on type, sensitivity, and relevance to business goals.
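As a rough illustration of what a catalog indexes, here is a minimal sketch of a metadata entry (location, format, content, and classification). The field names are illustrative; real catalogs carry far richer metadata.

```python
# A minimal sketch of a catalog entry: the catalog indexes metadata
# about each asset, not the data itself. Field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str              # logical asset name
    location: str          # where the data resides (URI, table, path)
    fmt: str               # storage format
    description: str       # primary content
    sensitivity: str       # e.g. "public", "internal", "pii"
    tags: list[str] = field(default_factory=list)

catalog: dict[str, CatalogEntry] = {}

def register(entry: CatalogEntry) -> None:
    """Index an asset's metadata under its logical name."""
    catalog[entry.name] = entry

register(CatalogEntry(
    name="customer_orders",
    location="s3://warehouse/sales/orders/",
    fmt="parquet",
    description="Order-level sales transactions",
    sensitivity="internal",
    tags=["sales", "transactions"],
))
print(catalog["customer_orders"].location)
```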
3. Cleaning, enriching, and mapping data
This step is about fixing any errors in the data, enriching it by adding layers of context, mapping relationships between data points, and understanding lineage, including where the data comes from, how it’s processed, and how it’s used. For instance, a retailer analyzing customer purchases might need to correct transaction record inaccuracies, add demographic information to purchases for deeper insights, and trace customer interactions from first contact to sale.
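A small pandas sketch of the retailer example might look like this; the data and column names are invented for illustration.

```python
# A minimal pandas sketch: fix obvious errors in transaction records,
# then enrich purchases with customer demographics.
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "amount": [49.99, 49.99, -20.00],  # a duplicate row and a negative amount
})
demographics = pd.DataFrame({
    "customer_id": [1, 2],
    "age_band": ["25-34", "45-54"],
})

# Clean: drop exact duplicates, remove impossible negative amounts
cleaned = transactions.drop_duplicates()
cleaned = cleaned[cleaned["amount"] > 0]

# Enrich: join demographic context onto each purchase
enriched = cleaned.merge(demographics, on="customer_id", how="left")
print(enriched)
```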
4. Keeping data safe
Safeguarding data involves encryption for both data at rest and in transit, access controls based on roles and the principle of least privilege, and masking and anonymization of data used in less secure or public environments (e.g., for analytics or development). Regular audits, data retention policies, and employee training sessions ensure ongoing security and compliance.
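For illustration, here is a minimal sketch of two common safeguards, pseudonymization and masking, using only Python’s standard library. The salt and formats are placeholders; production systems should rely on vetted key management and anonymization tooling.

```python
# A minimal sketch of masking and anonymization for data leaving a
# secure environment: hash identifiers, keep only card-number suffixes.
import hashlib
import re

def pseudonymize(value: str, salt: str = "rotate-me") -> str:
    """Replace an identifier with a salted SHA-256 digest (truncated)."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]

def mask_card(number: str) -> str:
    """Keep only the last four digits of a card number."""
    digits = re.sub(r"\D", "", number)
    return "*" * (len(digits) - 4) + digits[-4:]

print(pseudonymize("jane.doe@example.com"))
print(mask_card("4111 1111 1111 1111"))  # -> ************1111
```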
5. Monitoring and continuous refinement
The journey of discovery is never static, and data observability is a key concept here. You need to monitor data health in your systems. This requires tracking data sources for new additions, changes, or deprecated information, updating your data catalog, refining classifications and metadata as business or regulatory needs shift, and establishing feedback mechanisms from data users to improve data utility and access.
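A toy sketch of one such observability check, comparing what a source exposes today against what the catalog recorded, could look like this (both schema snapshots are invented):

```python
# A minimal observability sketch: flag schema drift between the
# cataloged view of a table and what the source exposes today.
cataloged = {"orders": {"order_id", "customer_id", "amount"}}
observed = {"orders": {"order_id", "customer_id", "amount", "channel"}}

for table, expected in cataloged.items():
    actual = observed.get(table, set())
    added, removed = actual - expected, expected - actual
    if added:
        print(f"{table}: new columns to catalog -> {sorted(added)}")
    if removed:
        print(f"{table}: columns gone, review lineage -> {sorted(removed)}")
```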
It’s important to understand that data discovery is an ongoing process, not a finite task. As your organization continuously generates, collects, and updates data, it will need to repeat these five steps over and over again.
And your practical use case is king
Before embarking on a data discovery initiative, it’s also helpful to understand how your business case might influence it.
This understanding guides the project, whether it’s enhancing data analytics, meeting regulatory needs, or integrating after a merger. A clear purpose also helps ensure that resources are allocated to value-driving activities.
Here are key takeaways:
Identifying the objective is key to selecting relevant data sources. For example, if you aim to build a sales forecast model, your focus would be on historical sales data, customer demographics, transaction records, and marketing campaign outcomes. Conversely, aiming for GDPR compliance involves focusing on data that contains personal information, such as customer and employee records or transaction logs.
The initiative’s goal also dictates the approach to data classification and metadata management. For analytics, the priority lies in the data’s quality and relevance—classifying data by temporal significance, geographic details, and product categories to ensure accurate predictions. In contrast, a compliance-focused discovery must manage metadata with details on consent, processing purposes, and retention periods (a sketch after this list illustrates the contrast).
The goal might significantly influence the selection of tools and technology, with certain solutions more apt for specific challenges.
Your data discovery objective will also influence subsequent strategies for using the data. In the sales forecasting scenario, you need to ensure that you create datasets that are easy to integrate into predictive modeling tools. For GDPR compliance, evaluating security measures, establishing access controls, and maintaining thorough documentation and audit trails are imperative.
Finally, with a clearly defined goal, it’s a lot easier to secure buy-in from stakeholders.
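To make the contrast above concrete, here is a toy sketch of a single asset carrying both analytics-oriented and compliance-oriented metadata facets; every field name and value is illustrative.

```python
# A minimal sketch: the same asset described through two metadata
# facets, one serving analytics, one serving GDPR-style compliance.
asset = {
    "name": "customer_records",
    # Analytics-oriented metadata: quality and relevance
    "analytics": {
        "temporal_coverage": "2019-2024",
        "geography": "EU",
        "quality_score": 0.92,
    },
    # Compliance-oriented metadata: consent, purpose, retention
    "compliance": {
        "contains_pii": True,
        "lawful_basis": "consent",
        "retention_days": 730,
    },
}

view = "compliance"  # a GDPR initiative reads this facet first
print(asset["name"], asset[view])
```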
How you do data discovery technically
There are two approaches to discovering your data: manual and automated.
Manual data discovery
To cut a long story short, the traditional method of manual data discovery is now rare. The sheer scale of data managed by organizations today makes manually searching for and cataloging data assets impractical, except for a few scenarios:
Highly sensitive or confidential data: Manual review might be preferred for legal documents related to ongoing litigation, sensitive corporate agreements, or intellectual property, and for ambiguous cases where human judgment is required about what constitutes, for example, personal health information.
Complex or unstructured data: Situations involving intricate specifications or designs, particularly in aerospace, manufacturing, and construction, often require human expertise to interpret. Automated tools may fall short.
Data in inaccessible or legacy systems: Automated discovery tools might not always have access to or be compatible with legacy systems, proprietary formats, or data stored in isolated networks.
Initial data mapping: Before deploying automated tools, many organizations conduct a preliminary manual discovery to create an initial inventory of data assets.
The next section, devoted to automated data discovery, will be longer, because it’s probably the reason you’re reading this article in the first place (though the insights above are also highly beneficial).
Automated data discovery
There are plenty of data discovery tools on the market, and the choice can be overwhelming. Many of the data discovery requests that clients bring to us ultimately center on the choice of suitable tools. We’ll try to guide you through this decision-making process.
There are tools for performing specific tasks in the data discovery process. For example, Apache NiFi, Fivetran, and Stitch Data help integrate data. Apache Atlas manages and governs metadata. Tamr cleans, sorts, and enriches data, as well as facilitates master data management. For creating visuals, there are Qlik Sense and Looker. IBM Guardium provides data protection, discovers sensitive data, classifies it, and monitors it in real time. For data security, you have Imperva, Thales, and Varonis.
There are plenty of integrated data discovery solutions, too, whose functionality spans from data ingestion and cataloging to analysis, visualization, and security. Our top ten includes:
Talend
Enables robust data integration across an array of sources and systems
Provides tools for managing data quality and governance
Its data catalog automatically scans, analyzes, categorizes, connects, and enhances metadata, ensuring that about 80% of metadata associated with the data is autonomously documented and regularly updated using ML
Talend Data Fabric offers a low-code environment, making it accessible for users with varying technical skills to work with data, from integration to insight generation
Informatica
Its data catalog uses an ML-based data discovery engine to gather data assets across data silos
Provides tools for profiling data
Supports tracking of data dependencies, crucial for managing data lineage, impact analysis, and ensuring data integrity
Alation
Its data catalog relies on an AI/ML-driven behavioral analysis engine for enhanced data finding, governance, and stewardship
Can connect to a variety of sources, including relational databases, file systems, and BI tools
Automates data governance processes based on predefined rules
Uses popularity-driven relevancy to bring frequently used information to the forefront, aiding in data discovery
Its Open Data Quality Initiative lets third-party data quality tools integrate smoothly with the catalog
Atlan
Offers Google-like search functionality with advanced filtering options for accurately retrieving data assets despite typos or keyword inaccuracies
Its “Archie Bots” use generative AI to add natural language descriptions to data, simplifying discovery and understanding
Features data profiling, lifecycle tracking, visual query building, and quality impact analysis
Offers a no-code interface for creating custom metadata, allowing easy sharing and collaboration
Collibra
Its data dictionary offers comprehensive documentation of technical metadata, detailing data structure, relationships, origins, formats, and usage, representing a searchable repository for users
Offers data profiling and automatic data classification
Enables users to document roles, responsibilities, and data processes, facilitating clear data governance pathways
Select Star
Automates data discovery by analyzing and documenting data programmatically
Connects directly to data warehouses and BI tools to collect metadata, query history, and activity logs, allowing users to set up an automated data catalog in just 15 minutes
Automatically detects and displays column-level data lineage, aiding users in understanding the impact of column changes and ensuring data trustworthiness
Microsoft Purview (formerly Azure Purview)
Provides a comprehensive and up-to-date visualization of data across cloud, on-premises, and SaaS environments, facilitating easy navigation of the data landscape
Automates the identification and categorization of data
Offers a business glossary of terms to streamline data discovery
Offers data lineage tracking, classification, and integration with various Azure services
AWS Glue Data Catalog
Offers scripting capabilities to crawl repositories automatically, capturing schema and data type information
Incorporates a persistent metadata store, allowing data management teams to store, annotate, and share metadata to support ETL integration jobs for creating data warehouses or lakes on AWS
Supports functionality similar to the Apache Hive metastore and can integrate as an external metastore for Hive data
Works with various AWS services like AWS Lake Formation, Amazon Athena, Amazon Redshift, and Amazon EMR, supporting data processes across the AWS ecosystem
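As a quick illustration of programmatic access, here is a hedged boto3 sketch that walks the databases and tables in a Glue Data Catalog. It assumes AWS credentials and a region are already configured; the catalog contents are your own.

```python
# A sketch of enumerating Glue Data Catalog metadata with boto3,
# assuming AWS credentials/region are configured in the environment.
import boto3

glue = boto3.client("glue")

# Walk every database in the catalog, then every table it contains
for db_page in glue.get_paginator("get_databases").paginate():
    for db in db_page["DatabaseList"]:
        pages = glue.get_paginator("get_tables").paginate(DatabaseName=db["Name"])
        for tbl_page in pages:
            for table in tbl_page["TableList"]:
                cols = [
                    c["Name"]
                    for c in table.get("StorageDescriptor", {}).get("Columns", [])
                ]
                print(f"{db['Name']}.{table['Name']}: {cols}")
```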
Databricks Unity Catalog
Utilizes AI to provide summaries, insights, and enhanced search functionalities across data assets
Enables users to discover data through keyword searches and intuitive UI navigation within the catalog
Offers tools for listing and exploring metadata programmatically, catering to more technical data discovery needs
Incorporates Catalog Explorer and navigators within notebooks and SQL query editors for seamless exploration of database objects without leaving the code editor environment
Through the Insights tab and AI-generated comments, users can gain valuable understanding of how data is utilized within the workspace, including query frequencies and user interactions
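For the programmatic route mentioned above, here is a hedged sketch from a Databricks notebook (where `spark` is predefined) that queries Unity Catalog’s built-in information schema; the table name in the last line is illustrative.

```python
# A sketch of programmatic discovery in Unity Catalog, run in a
# Databricks notebook where `spark` is predefined.
tables = spark.sql("""
    SELECT table_catalog, table_schema, table_name, comment
    FROM system.information_schema.tables
    WHERE table_schema <> 'information_schema'
    ORDER BY table_catalog, table_schema, table_name
""")
tables.show(truncate=False)

# Column-level metadata for a specific (illustrative) table
spark.sql("DESCRIBE TABLE main.sales.orders").show()
```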
Secoda
Enables easy discovery of data, including end-to-end column lineage, column-level statistics, usage, and documentation in a unified platform
Centralizes tools of the modern data stack with no-code integrations, allowing for quick consolidation of data knowledge
Manages data requests within the same platform, eliminating the need to use external tools like Jira, Slack, or Google Forms
Allows for the creation of knowledge documents that include executable queries and charts
Provides a Google-like search experience for exploring and understanding data across all sources
Offers commenting and tagging functionalities, enhancing team collaboration on data assets
How should you choose? This will, above all, depend on your source systems and use case.
Just remember three key points here:
Tools like Alation and Collibra can be expensive, and SaaS product pricing in this sector is often not straightforward. Many providers don’t list their prices online, making it challenging to understand costs without direct inquiry.
While open-source tools offer a cost-effective alternative, they may be less mature than their paid counterparts. Features such as data quality, profiling, and governance need thorough evaluation to ensure they meet your requirements.
The ideal data discovery tool for your organization might not require all the bells and whistles, such as big data processing capabilities or the recognition of every data type. Focus on the features that are most relevant to your specific needs.
At the same time, whatever your use case or source systems, there are critical features that you should consider when selecting a data discovery tool. These are:
Comprehensive data scanning: Essential for modern enterprises, this feature is about ensuring complete data visibility across all systems, including on-premises, cloud, and third-party services. Also, your data discovery tool must autonomously scan the entirety of your distributed data landscape without requiring manual inputs like login credentials or specific directions. The ability to perform continuous scans to adapt to rapid changes in cloud environments might also be helpful.
Customizable classification: Organizations vary greatly in their data structure, usage, and governance needs. By being able to tailor classifiers, you can achieve greater precision in identifying, categorizing, and managing your data. This is especially important with the growing complexity of data privacy laws.
Comprehensive metadata management: Simply scanning metadata isn’t enough for full data discovery due to potential errors in labeling and the complexity of unstructured data. Your tool should also examine the actual data content, using techniques like pattern recognition, NLP, or ML to find important or sensitive information regardless of its labeled metadata (see the sketch after this list).
Contextual understanding: Understanding the full context of data, including related content, file names, specific data fields, and even the location or access patterns, allows for more nuanced management of data assets, because the context in which data resides can significantly impact the level of risk associated with that data set. For instance, the presence of personally identifiable information (PII) alongside financial data in the same file could elevate the risk level, necessitating stricter protection measures.
AI training: When selecting an AI-powered data discovery tool, opt for solutions that train their technology on the most up-to-date regulatory requirements, frameworks, and data definitions, while allowing for customization to your specific context and supporting continuous learning from your data and feedback. Without the right data, your AI tool will be useless.
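As promised above, here is a minimal sketch of content-level scanning with a contextual risk rule: simple regex classifiers over raw values, with risk elevated when PII and financial data co-occur. The patterns are illustrative, not production-grade detectors.

```python
# A minimal sketch of content-level scanning (rather than trusting
# labels alone), plus a context rule that raises risk when PII and
# financial data appear together in the same record.
import re

CLASSIFIERS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "card_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def scan(text: str) -> set[str]:
    """Return the labels whose patterns match the raw content."""
    return {label for label, rx in CLASSIFIERS.items() if rx.search(text)}

record = "Contact jane.doe@example.com, card 4111 1111 1111 1111"
found = scan(record)
risk = "high" if {"email", "card_number"} <= found else "normal"
print(found, "-> risk:", risk)
```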
How JustSoftLab can help
If you still feel confused or uncertain about your capabilities, JustSoftLab can guide your organization through the entire data discovery journey with a structured approach tailored to your unique needs and objectives. Here’s how we can assist:
Identify Your Data Goals: We help you define clear objectives for data discovery, such as improving data quality, enhancing compliance, or building a data analytics platform.
Understand Your Data: We help you get a full grasp of the type, volume, sources, and complexity of your data to select the right tool.
Tool Selection Guidance: Our experts evaluate available tools based on how well they integrate with your systems, their scalability to accommodate data growth, and specific features like automated classification, metadata management, data lineage, and analytics that match your needs.
Ease of Use and Support: We focus on selecting tools with intuitive interfaces suitable for all skill levels and ensure they come with comprehensive training resources and customer support to facilitate a smooth learning curve.
Security and Compliance: Our approach includes choosing tools with robust security features and compliance capabilities to protect sensitive information and meet regulatory standards.
Cost Efficiency: We conduct a thorough cost-benefit analysis, considering all expenses and potential returns. We also recommend taking advantage of trials to assess tool effectiveness in your environment.
PoC Development: Before full-scale implementation, we can create a PoC to demonstrate the viability of the chosen solution in your specific environment. This can help in securing buy-in from stakeholders and ensuring the solution meets your needs.
Custom Integration: Beyond tool selection, we develop and implement custom data integrations for sources that aren’t natively supported.
Training and Workshops: While ensuring tools come with good support and resources is crucial, we also provide tailored training sessions and workshops for your team. This can range from basic tool usage to advanced data analysis techniques.
Data Governance Strategy: We help formulate and implement a robust data governance strategy. This includes setting up data access policies, compliance checks, and ensuring data quality standards are met across the organization.
Data Analytics and Insights Generation: Beyond data discovery, JustSoftLab can assist in analyzing the discovered data to generate actionable insights. This can involve advanced analytics, data visualization, reporting, and even AI tools for predictive modeling to help inform business decisions.
By offering these expanded services, we make sure that our clients not only select the right data discovery tools but also maximize their investment.