Augmenting data models with data profiles: an extension to the BA toolkit

When an organization narrates their challenges and problems about a specific business domain, business analysts think about how to represent those problems.  Data models, in some circumstances, can help.

When data models are created, or even before a model is contemplated, an additional tool that a business analyst can use is data profiles.  Data profiles reveal key, factual information about data.


This article won’t go deep into the technical details of data profiling, but I will show you, using a small example, how data profiles can add value to the requirements gathering process when the topic turns to data and what to model.

In my experience, stakeholders in requirements meetings often talk about needing to capture specific pieces of information.  For example, a stakeholder may identify a set of data from a legacy system as being needed in a new, modernized system.  I write the requirement down as a stated, yet-to-be-validated requirement.

Afterward, when I am reviewing those requirements, I ask myself the following question: “how does the user know what specific data is, in fact, required in the new system?”

My tool to answer this question is the data profile.

Data profiles yield information like the minimum and maximum lengths of data values, uniqueness of data and whether columns that store data – such as those found in a relational database – store values or are mostly empty.  There are other attributes that can be culled from a data profiling tool.  For now, I will focus on how to interpret the statistics that will validate, or be used to challenge, a stated user requirement.

Consider the small table below, which is designed to store information about users: first name, last name, city, state and zip code.

First Name   Last Name   City        State            Zip Code
Jane         Doe         Raleigh     North Carolina   (empty)
John         Doe         (empty)     Ohio             (empty)
Stewart      Little      Cleveland   Ohio             (empty)

Let’s assume that our stakeholder earlier stated that the zip code data field is required for the new system.

The data above can be profiled column by column without using a tool.  For example, the minimum length of the First Name field is 4, the length of both Jane and John.  The maximum length of the First Name field is 7, as Stewart is the longest text value in that column.  The same can be done for each of the remaining columns.  The purpose of minimum and maximum lengths is to inform how wide a database designer will make each field in the new database once its physical model is implemented.
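The same length check can be scripted once the table grows beyond a few rows. What follows is only a minimal sketch in Python, not a real profiling tool: the `ROWS` list and the `length_profile` function are hypothetical names introduced for illustration, blank cells are represented as empty strings, and it assumes that the missing value in John's row is the City. It also assumes that empty values should be ignored when measuring lengths.

```python
# Sample rows from the example table. An empty string stands for a blank cell.
# Assumption: the blank cell in John's row is the City column.
ROWS = [
    {"First Name": "Jane", "Last Name": "Doe", "City": "Raleigh",
     "State": "North Carolina", "Zip Code": ""},
    {"First Name": "John", "Last Name": "Doe", "City": "",
     "State": "Ohio", "Zip Code": ""},
    {"First Name": "Stewart", "Last Name": "Little", "City": "Cleveland",
     "State": "Ohio", "Zip Code": ""},
]


def length_profile(rows, column):
    """Return (min_length, max_length) over the populated values of a column.

    Returns (None, None) when the column has no populated values at all.
    """
    lengths = [len(row[column]) for row in rows if row[column]]
    if not lengths:
        return (None, None)
    return (min(lengths), max(lengths))


# Print the length profile of every column in the sample.
for column in ROWS[0]:
    print(column, length_profile(ROWS, column))
```

For First Name this reports a minimum of 4 and a maximum of 7, matching the manual count, and for Zip Code it reports no lengths at all because nothing is populated.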

What about uniqueness?  I can see from the State column that Ohio appears twice.  A repeated value like this suggests that the column could be backed by a lookup list in which state values are stored in a standardized form, which improves the consistency of the data that gets passed to the database system’s physical model.
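Counting repeated values is also easy to sketch in code. The example below is a hedged illustration using Python's standard `collections.Counter`; the `STATES` list and the helper names `value_counts` and `repeated_values` are assumptions made for this example, not part of any profiling product.

```python
from collections import Counter

# The State column from the example table.
STATES = ["North Carolina", "Ohio", "Ohio"]


def value_counts(values):
    """Count how often each populated value occurs."""
    return Counter(v for v in values if v)


def repeated_values(values):
    """Return the values that occur more than once, sorted alphabetically."""
    return sorted(v for v, n in value_counts(values).items() if n > 1)


counts = value_counts(STATES)
print(len(counts), "distinct values out of", len(STATES))
print("Repeated:", repeated_values(STATES))
```

A column with few distinct values relative to its row count, as with State here, is a candidate for a lookup list. That is a judgment to confirm with the stakeholder, not something the count proves on its own.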

What about the user’s request to bring over data, like the zip code field?  This is where data profiles can inform the discussion.  First though, the point of using statistics about data is not to argue in favour of including or excluding fields as a BA, but to have a discussion about the facts and hopefully influence decisions about migrating specific data from an old to a new system.

In this case, the Zip Code field is never used.  That is a fact.  I would inform the stakeholder that, historically, this data has never been captured.  Instead of telling the stakeholder that the value won’t go into a logical model, I would keep the discussion objective by asking questions like “In what capacity does your business need a zip code?” or “What would the data be used for in the future, given that this piece of data has never been populated before?”  Sometimes, users do not know that such facts exist about their data.  They could be surprised to learn that a piece of data that they want has never been used, and may agree to exclude it from any future application.  Data that is not needed can be dropped or archived for historical purposes, which reduces the system resources, such as disk space, required to support it.
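The "never used" observation can be expressed as a fill rate, the share of rows in which a column holds a value. The sketch below is illustrative only: `fill_rate`, `ZIP_CODES` and `CITIES` are hypothetical names, blank cells are modeled as empty strings, and the `CITIES` list assumes that the blank cell in John's row is the City.

```python
def fill_rate(values):
    """Return the fraction of values that are populated.

    A value counts as populated when it is non-empty after stripping
    whitespace. An empty list yields 0.0.
    """
    if not values:
        return 0.0
    populated = sum(1 for v in values if v and v.strip())
    return populated / len(values)


# Zip Code is blank in every row of the example table.
ZIP_CODES = ["", "", ""]
# Assumption: John's row has no City.
CITIES = ["Raleigh", "", "Cleveland"]

print("Zip Code fill rate:", fill_rate(ZIP_CODES))
print("City fill rate:", round(fill_rate(CITIES), 2))
```

A fill rate of zero for Zip Code is the kind of fact to bring to the stakeholder. It opens the conversation about whether the field is needed and does not settle it.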

Data profiles are a powerful tool.  To reiterate, they are designed to inform the requirements elicitation process and validate (or, in this example, invalidate) stated business needs.

In my logical model, I would add the First Name, Last Name, City and State, and not the Zip Code, assuming the stakeholder has agreed, with fresh information upon which to make that decision, to exclude it from use.  There would be additional opportunities to talk about the data that will be modelled, such as formatting, but that is beyond the purpose of this small example.

A business analyst can use data models to illustrate not only an understanding about a business concept that needs data but also to help model related data entities, such as users, addresses, and locations. Data profiles augment the data analysis by providing facts about the nature of existing data and its historical use, or in this case, non-use.

In the end, data models coupled with data profiles add value to the information and data structures that empower a business to capture what it needs, assess business performance using good data, and gauge how well it is moving toward its strategic objectives.

To continue the discussion further, please use the comment form below.




