Technological advances over the past two decades have led to an explosion in the amount of data that is produced and collected all over the world. ‘Big data’ (data that is too large or too complex to be analyzed using traditional tools) is now commonplace, owing to the widespread use of electronic devices that automatically generate thousands of data points a day, as well as advances in digital technology that have made manual data collection easier.
Alongside the surge in the quantity of data, there has been an increase in the number of tools and techniques that enable people to process, analyze, and model huge and complex datasets. Organizations across the globe are rapidly leveraging these tools to extract insights from their data, creating value in many areas of business and society.
For example, in the health sector, patient data is increasingly used to develop Artificial Intelligence (AI) models that can predict a range of patient outcomes. One recently published example is a model that predicts whether a patient with acute kidney injury will deteriorate into a critical condition. Like many AI models, this model was ‘trained’ on a large dataset comprising hundreds of thousands of data points.
However, the dataset is not publicly available, and the researchers who developed the model did not belong to the organization that collected the data. This means that the researchers and the data owners would have had to set up a specific collaboration to facilitate the study. This situation, in which technical specialists with advanced data analytics skills lack ready access to the real-world datasets they need to produce outputs that benefit public health, is extremely common.
Making our data available to researchers and data specialists
At D-tree, we collect large volumes of data through our digital health programs. We know that these datasets contain valuable information that can inform public health decisions and advance areas of public health research. We also acknowledge that we do not have all the necessary resources within our organization to enable us to extract the maximum value from our data. We therefore form collaborations with research groups, such as those at Harvard Medical School and N/Lab at the University of Nottingham, that have expertise complementary to ours. We then work together to produce analytic insights and AI models that can improve health outcomes.
However, we also know that there are many organizations around the world with expertise in domains such as maternal health, child health, and health service utilization, or in advanced data science techniques, that could make use of our datasets but with whom we do not have partnerships. Because our datasets are rich and detailed, many research questions could be answered by combining the knowledge of these experts with our data, such as, “How do errors in estimated gestational ages affect the likelihood of health facility deliveries?” Questions like this one can only be answered if the relevant practitioners have access to our data.
Although our datasets are valuable by themselves, their value can be increased if they are combined with complementary datasets that enable a broader range of questions to be answered. A researcher with expertise in geospatial analysis who already has access to a dataset containing information about the building density and road conditions in each shehia of Zanzibar, for example, would be able to combine their dataset with ours and examine the relationship between levels of urban infrastructure and the health-seeking behaviors and health outcomes of clients in Zanzibar. This could then help us improve the efficiency and effectiveness of our programs by showing us how best to allocate our resources based on the risk factors in each client’s environment.
Another example might be researchers who are performing cross-country comparison studies in order to identify high-level global trends. They could combine our datasets with relevant datasets from other countries and extract big-picture observations that may not directly impact our programs, but that could significantly advance areas of health-related research and indirectly feed back into our work.
In all cases, our data would be used to inform the improvement of our programs, either directly or indirectly, and would therefore benefit the clients from whom the data originated. If we were to make our data openly available to everyone, rather than just to our immediate circle of partners, we would increase the benefits that our clients receive and enable many researchers to advance their respective fields of public health research.
What type of data would be made open?
When we talk about making data openly available, it is important to clarify the type of data to which we are referring. The data that we collect at D-tree comprises the personal details and detailed health history of our clients. This level of detail is crucial for some types of analysis, such as a study of how demographic and socioeconomic characteristics affect an individual’s response to an intervention. However, it would not be ethically appropriate (nor, in many jurisdictions, legal) to make this type of data publicly available, because of the sensitive personal information it contains.
One option is instead to aggregate the data and make the aggregated dataset publicly available. ‘Aggregating’ means that clients are grouped together (e.g. by geographical district) and the data from the clients within each group is combined to produce a single data point that summarizes the characteristics of that entire group. For example, the ages of the individual clients in each district could be used to calculate the average age for that district. Then, whilst the age of each individual client would appear in the original dataset, only the average age for each district would appear in the aggregated dataset. This has the advantage that no individual’s personal information appears directly in the released data, but the disadvantage that the aggregated dataset contains less information than the original dataset. This restricts the types of analyses and models that the data can be used for.
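The district-level averaging described above can be sketched in a few lines of Python. This is only an illustration: the records, field names, and district names are invented for the example and do not reflect the structure of D-tree's actual datasets.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical individual-level records (all values are made up).
clients = [
    {"client_id": 1, "district": "District A", "age": 24},
    {"client_id": 2, "district": "District A", "age": 30},
    {"client_id": 3, "district": "District A", "age": 27},
    {"client_id": 4, "district": "District B", "age": 31},
    {"client_id": 5, "district": "District B", "age": 39},
]


def aggregate_mean_age(records):
    """Group records by district and return the mean age for each district."""
    ages_by_district = defaultdict(list)
    for record in records:
        ages_by_district[record["district"]].append(record["age"])
    # Only one summary value per district is kept; individual ages
    # and client identifiers are not carried into the result.
    return {district: mean(ages) for district, ages in ages_by_district.items()}


aggregated = aggregate_mean_age(clients)
print(aggregated)
```

The aggregated result has one entry per district rather than one per client, which is exactly why it supports fewer kinds of analysis than the original records.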
In between the extremes of individual-level data at one end and a fully aggregated dataset at the other end, there is a spectrum of options. The aim is to choose the right point on this spectrum so that we guarantee the privacy of all our clients, whilst retaining sufficient detail in the dataset to enable analysts and researchers to extract insights that can be practically used in public health settings.
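As a rough illustration of one intermediate point on this spectrum, the sketch below keeps one row per client but removes direct identifiers, replaces exact ages with ten-year bands, and withholds any district and age-band combination shared by fewer than a minimum number of clients. The records, field names, band width, and threshold are all assumptions made for the example; this is not D-tree's method, and a step like this does not by itself guarantee privacy.

```python
from collections import Counter

# Hypothetical individual-level records (all values are made up).
clients = [
    {"client_id": 1, "district": "District A", "age": 24},
    {"client_id": 2, "district": "District A", "age": 30},
    {"client_id": 3, "district": "District A", "age": 27},
    {"client_id": 4, "district": "District B", "age": 31},
    {"client_id": 5, "district": "District B", "age": 39},
]


def age_band(age, width=10):
    """Replace an exact age with a coarse band, e.g. 24 -> '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def generalize(records, min_group_size=2):
    """Drop identifiers, coarsen ages, and withhold small groups."""
    coarse = [
        {"district": r["district"], "age_band": age_band(r["age"])}
        for r in records
    ]
    # Count how many clients share each (district, age band) combination.
    counts = Counter((r["district"], r["age_band"]) for r in coarse)
    # Keep only rows whose combination is shared by enough clients.
    return [
        r for r in coarse
        if counts[(r["district"], r["age_band"])] >= min_group_size
    ]


released = generalize(clients)
for row in released:
    print(row)
```

Raising the band width or the minimum group size moves the output closer to a fully aggregated dataset, while lowering them retains more detail at the cost of a higher re-identification risk.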
Benefits and risks
The ‘open data’ approach has the following potential benefits:
Enabling more people to access a dataset is likely to increase the value that is extracted from that dataset.
Open datasets make it easier for parties to collaborate on a project, by reducing the barriers associated with data sharing. More collaboration generally leads to faster progress.
Practitioners can combine open datasets with one another, or with private datasets that they have access to. As discussed earlier, combining multiple datasets can reveal trends and insights that are not visible in a single dataset. This enables deeper analysis and better-quality models.
The more organizations that make their data openly available, the more likely others are to follow. The more datasets that become available, the greater the value.
The approach also comes with a number of risks:
Sensitive information about individuals may inadvertently be disclosed, especially if multiple datasets are combined. This risk can be mitigated and reduced to a ‘safe’ level by following best practices for data de-identification and aggregation, as discussed above.
Data can be unintentionally misinterpreted, which can result in poor decisions being made.
Data can be used for unethical purposes.
The second and third risks are hard to mitigate if the data is made completely open to everyone (e.g. via an open platform such as the Humanitarian Data Exchange), as there is no practical way of keeping track of everyone who has accessed the data and what they are using it for. An alternative to the completely open approach is to define a set of due diligence criteria that data users must meet; for example, requiring that they are affiliated with a reputable organization or that they have previously performed relevant research. The data is then provided only to the parties that meet these criteria, which reduces the risk that it will be used inappropriately. However, this approach requires work to review each application and to provision data access to approved applicants. It also raises the question of who is accountable if data is misused or an unapproved party gains access to it.
Why is this important to us?
Our mission at D-tree is to strengthen health systems and make high-quality health services available to everyone. We do this directly, through the implementation of community health programs. But we also know that the information contained in the data that we collect as part of these programs can be used to improve our programs, and is also valuable to many practitioners in our field. We are therefore thinking about the best approach for making our data available to as many public health practitioners as possible, in order to ensure that the value of our data translates into better health outcomes for as many people as possible.