What’s your dataset worth? [Part 1/2]

This is part one of a two-part series on dataset valuation. If you'd like to be notified when part two is published, you can subscribe to my blog or follow me on Twitter.


Aswath Damodaran is a Professor of Finance at the Stern School of Business at NYU. He lectures on corporate finance and equity valuation, and maintains a blog called Musings on Markets, where he continues to share a wealth of knowledge on the methods he uses to value well-known companies such as Tesla, Apple and Uber. I don't come from a finance background, and a misconception I held before watching his lectures was that company valuations followed a very strict formula: a series of equations that numbers would pass through to output the final worth of the company. I was surprised to learn that this wasn't necessarily the case. Professor Damodaran regards equity valuation as both an art and a science; stories and narratives are also powerful value drivers. Of course, the fundamentals are always in consideration: existing cash flows, risk/reward profiles of future growth strategies, and the length of time needed to execute on those strategies.

My very brief experience dipping my toes into corporate finance recently inspired new thinking around how we could apply equity valuation practices to data. I've held the opinion that larger companies tend to undervalue their datasets (while smaller firms overvalue theirs), but I never had a framework for communicating this variance to my peers. Hopefully, with what is laid out here and in my next post, our future selves can justify hunches more tactfully and with some level of rigour.


Product-focused organizations collect a tremendous amount of information every day. The storage infrastructure managed by these companies maintains a mosaic of statistics, numbers and stories that are diverse in their complexity, availability, and arrival frequency. Sometimes data arrives in event form, (m/b)illions of times a day, through product and behavioural analytics. Sometimes it arrives slowly, in drops with little structure, for example through recorded interview responses in a recruiting screen. Other times, the information is highly structured and used as input for a machine learning model, such as the clothing preferences of a customer. All of this data is usually stored in the cloud, and without a good mental model for navigation it's easy for a data scientist, engineer or product manager to get lost in the lakes and rivers of information. Knowing how to estimate the value of the company’s data assets can help data-centric teams prioritize which projects to show more love to and which to leave alone, especially if they do not spark joy.

Does your data spark joy?

Let's return to the corporate finance inspiration. How does one begin to understand what the cash flows and future risks of a dataset are? A financial balance sheet splits company assets and liabilities into two columns. Assets are qualities of the company that contribute to its value, such as existing investments; liabilities, such as debt, take away from it. In a similar spirit, I'd like to separate dataset valuation into its assets, for example its capacity to answer questions, and its liabilities, such as privacy burdens.

I've come up with six criteria. Next to each heading is a + or - (or both) in brackets to mark whether the criterion is an asset, a liability, or a hybrid, respectively. I'll touch on the first three criteria in this post, and expand on the remaining three in part two.

1. Capacity to answer questions (+)

The purpose of data is to help answer questions. The complexity of questions can be low, such as the first name of a visiting customer; medium, such as the average length of a user's name for optimizing a user interface; or high, such as the number of daily active users for growth measurement. The last question is the most complex because it requires the organization to clearly understand what it means to be an active user of its unique product, and to develop an understanding of churn. The first is low complexity because it is usually a simple lookup in a database.

Two variables contribute to this criterion:

  • the business or product impact of the answer to the question, and
  • the amount of work required to retrieve the answer.

Here’s a simple equation to assess the dataset’s question-answering capacity:

V1 = ImpactOfAverageAnswer / WorkRequired

The more high-impact questions a dataset is able to answer with less work, the more valuable it is. Decrease the numerator if other datasets answer the same questions but give different answers, and increase the denominator if getting the answer requires help from more than one person.
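
As a rough illustration of the equation and the two adjustments described above, here is a minimal Python sketch. The function name, the units, and especially the sizes of the adjustments (dividing the impact by one plus the number of conflicting datasets, and multiplying the work by the number of people involved) are my own assumptions for the sake of the example; the post does not prescribe how large those adjustments should be.

```python
def question_answering_value(impact_of_average_answer: float,
                             work_required: float,
                             conflicting_datasets: int = 0,
                             people_needed: int = 1) -> float:
    """Return V1 = ImpactOfAverageAnswer / WorkRequired with two assumed adjustments."""
    if work_required <= 0:
        raise ValueError("work_required must be positive")
    if conflicting_datasets < 0 or people_needed < 1:
        raise ValueError("conflicting_datasets must be >= 0 and people_needed >= 1")
    # Assumption: each conflicting dataset dilutes the impact of an answer.
    adjusted_impact = impact_of_average_answer / (1 + conflicting_datasets)
    # Assumption: work scales with the number of people who must be involved.
    adjusted_work = work_required * people_needed
    return adjusted_impact / adjusted_work


# Hypothetical numbers: impact in arbitrary "value points", work in hours.
baseline = question_answering_value(10.0, 2.0)
with_conflict = question_answering_value(10.0, 2.0, conflicting_datasets=1)
with_handoff = question_answering_value(10.0, 2.0, people_needed=2)
print(baseline, with_conflict, with_handoff)
```

The absolute numbers mean little on their own; the sketch is only useful for comparing datasets scored with the same units.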

While measuring the WorkRequired is straightforward, the jury is still out on how to go about measuring the ImpactOfAverageAnswer - let's reserve this for another post.

Once teams know what the ImpactOfAverageAnswer is (and if it’s high enough), they should strive towards making the WorkRequired as low as possible to increase the value of the dataset in question.

2. Frequency of access (+)

A simple corollary to the first. This criterion acts as a multiplier on V1: multiply V1 by the number of times an answer is requested from the dataset. As you can see, decreasing the WorkRequired early on can pay dividends.

V2 = FrequencyOfAccess
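
To make the "pay dividends" point concrete, here is a small Python sketch that treats the frequency of access as a multiplier on V1. The function name and all of the numbers are hypothetical and chosen only for illustration.

```python
def access_weighted_value(v1: float, frequency_of_access: int) -> float:
    """Multiply the question-answering value V1 by how often answers are requested."""
    if frequency_of_access < 0:
        raise ValueError("frequency_of_access must be non-negative")
    return v1 * frequency_of_access


# Hypothetical example: the same impact and request count, before and after
# halving the work required to retrieve an answer.
impact, requests = 10.0, 200
before = access_weighted_value(impact / 4.0, requests)  # WorkRequired = 4
after = access_weighted_value(impact / 2.0, requests)   # WorkRequired = 2
print(before, after)
```

Under these assumed numbers, halving the work doubles the access-weighted value, and the gain grows with every additional request.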

3. Capacity to contribute to predictions (+/-)

Predictions are outputs of machine learning models that have compressed the datasets they've learned from into a new representation. So a model's ability to compress dataset A better than dataset B is an indicator that dataset A is "easier" to learn from. We can define compression performance as the number of bits required to recreate the original dataset from scratch, holding the data reconstruction loss constant. If the dataset were completely random, it would have no intrinsic structure and learning would be impossible. This all assumes that the dataset is appropriate to use for prediction. All bets are off if this isn't the case.
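
One crude way to get a feel for this idea is to use a general-purpose lossless compressor as a stand-in for a learned model. This is only a proxy and an assumption on my part, not a measure of what a real model would learn. Because zlib is lossless, the reconstruction loss is held constant at zero, and the compressed size tells us how much structure the compressor found. The two datasets below are made up for the example.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower means more structure found)."""
    if not data:
        raise ValueError("data must be non-empty")
    return len(zlib.compress(data, 9)) / len(data)


# Hypothetical "dataset A": highly repetitive clothing-preference rows.
structured = b"customer,shirt,blue,M\n" * 1000
# Hypothetical "dataset B": random bytes of the same length, with no structure.
random_bytes = os.urandom(len(structured))

print(f"structured: {compression_ratio(structured):.3f}")
print(f"random:     {compression_ratio(random_bytes):.3f}")
```

The repetitive rows shrink to a small fraction of their original size, while the random bytes barely compress at all, which mirrors the claim that a completely random dataset offers nothing to learn from.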

This criterion raises more questions than it answers, such as how one should go about assigning value to a prediction and how one should quantify the risk of a bad guess. It's highly context specific. Put very briefly, the value of a prediction is the value that would have gone uncaptured had the prediction not been made. What makes this difficult to measure is that it relies on disentangling the prediction's contribution to the increase in value from all other possible causes. The same problem exists in reinforcement learning under a different name: the credit assignment problem, which describes the difficulty of correctly determining which parts of the entire prediction system, and in what ratios, should be credited with a successful outcome. But I digress.

Measuring the risk of a bad prediction is slightly easier, as it can be handled by a monitoring system that oversees the quality of predictions from a particular model over time. For example, a fraud detection model would use such infrastructure to measure the lost revenue from mispredictions, and use labels that did not exist at training time to improve future predictions. So although a dataset may yield bad predictions initially, they should get better over time with more data.
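
Here is a minimal sketch of the bookkeeping such a monitoring system might do for the fraud example. The `Outcome` record, its field names, and the cost model are all assumptions: I treat a misprediction in either direction (fraud let through, or a legitimate transaction blocked) as costing the full transaction amount, which a real system would almost certainly refine.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    predicted_fraud: bool
    actual_fraud: bool  # label that only became known after the prediction
    amount: float       # transaction amount


def lost_revenue(outcomes) -> float:
    """Sum the amounts of transactions where the prediction and the eventual label disagree."""
    return sum(o.amount for o in outcomes if o.predicted_fraud != o.actual_fraud)


# Hypothetical week of outcomes, labelled after the fact.
week = [
    Outcome(predicted_fraud=True, actual_fraud=True, amount=100.0),
    Outcome(predicted_fraud=False, actual_fraud=True, amount=40.0),   # missed fraud
    Outcome(predicted_fraud=True, actual_fraud=False, amount=25.0),   # blocked a good customer
    Outcome(predicted_fraud=False, actual_fraud=False, amount=10.0),
]
print(lost_revenue(week))
```

Tracking this number per week or per model version gives a simple trend line for whether predictions are actually improving as more labelled data arrives.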

The equation below is just a rough sketch of the ideas behind this valuation criterion.

V3 = (AveragePredictionValue * NumberOfPredictions) - BiasRisk

If the dataset never gets used in predicting anything, then V3 = 0.
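
The same rough sketch, written as a small Python function. The function name and the numbers are hypothetical, and how to measure AveragePredictionValue and BiasRisk in the first place remains an open question; the only point here is to show how the terms combine, including the case where the risk outweighs the value.

```python
def prediction_value(average_prediction_value: float,
                     number_of_predictions: int,
                     bias_risk: float) -> float:
    """Return V3 = (AveragePredictionValue * NumberOfPredictions) - BiasRisk."""
    if number_of_predictions < 0:
        raise ValueError("number_of_predictions must be non-negative")
    # A dataset that is never used for prediction carries neither value nor risk.
    if number_of_predictions == 0:
        return 0.0
    return average_prediction_value * number_of_predictions - bias_risk


# Hypothetical numbers in the same arbitrary units of value.
print(prediction_value(2.0, 100, 50.0))  # value outweighs the risk
print(prediction_value(0.5, 100, 80.0))  # risk outweighs the value
print(prediction_value(2.0, 0, 50.0))    # never used for prediction
```

A negative result is what makes this criterion a hybrid: the dataset's predictive use becomes a net liability.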

Before we continue, I want to briefly touch on why the predictive power of a dataset can become a liability. It's extremely difficult for every dataset used for prediction to contain diverse observations about all groups of people. Models that have learned from these biased datasets and make automated decisions based on them therefore run the risk of amplifying inequality that already exists across racial, ethnic, socio-economic, religious and gender boundaries. Using fraud detection as an example, using a user's IP address as a feature for classification may unfairly categorize people of colour who live in poorer locations as being more likely to commit fraud, because geographic location information is encoded in IP addresses.

There is already a lot of great work being done today to mitigate negative outcomes of unfair predictions, but I don't think companies have fully absorbed the responsibility of ensuring that the gates that they may control are held open equally for people of different backgrounds, whether it's a funding opportunity for an online business or a mortgage approval process.

So to recap, although a dataset's capacity to contribute to predictions can increase the value of the business, those predictions run the risk of further perpetuating societal biases. Though it may not happen anytime soon, smart governments may decide to intervene with regulation, which creates a liability for companies that make bias-sensitive predictions.


In this post, I introduced the idea of applying basic concepts from corporate finance to the valuation of an organization's datasets. In the equity world, businesses are valued based on their current ability to generate cash flows, the risk/reward profiles of future growth strategies, and the amount of time the firm is predicted to take to mature. I argued that a dataset's "cash flows" can be valued based on its current ability to answer impactful questions (multiplied by the number of times the data is accessed), in addition to the consistent business value of the dataset's context-specific predictions, with all of the risks of bad guesses and representational biases factored in. In the next post, I'll explore three more criteria that put the future value growth of the dataset at risk: entropy, privacy burdens, and cost of collection. Stay tuned!


If you enjoyed this post and would like to be updated when I post a new one, you can subscribe to my blog or follow me on Twitter.
