Data Integrity - Part One: Understanding the Process Behind Data Collection
That’s right, 19x more profitable! Yet when 2,190 global senior executives were surveyed on Data Integrity, only 35% said they have a high level of trust in how their organization uses data and analytics.
Maintaining accurate data leads to confident business decisions based on sound evidence, but it goes further than that: it plays a critical role in the trust that end users place in a company’s products and services. With our new Data Integrity Series, WinnowPro sets out to uncover how businesses prioritize data integrity within their organizations and how customers are impacted when organizations fail to meet today’s standards of data integrity.
This is part one in our Data Integrity Series, in which we sat down with two of our leading engineers at WinnowPro to discuss the technical side of maintaining Data Integrity in business.
Interviewer: Do you see Data Integrity as a growing issue in the technology realm?
Erdal Guner: Absolutely. We are experiencing the growing pains of collating first-, second-, and third-party data, as most companies do not own all the data about their businesses firsthand. There is always a need to enrich this data, hence going to second- and third-party data sources. Privacy concerns, stale data, and inaccurate, heavily processed data being sold in data exchanges all impact data integrity negatively. Intersecting these three different data sources never works perfectly, which creates further data integrity issues.
Your data integrity decreases exponentially as you integrate second- and third-party data sources. With simple math: let A be your own data, B be second-party data (someone else’s first-party data), and C be third-party data (complete aggregates bought from third-party data exchanges), where each value ranges from 0.0 to 1.0 and 1.0 is the most reliable.
Complete data integrity is A * B * C (in simple math, it is a product, not an average).
For example, 1.0 * 0.8 * 0.6 = 0.48, even if you consider your own data to be the most accurate.
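To make the arithmetic concrete, here is a minimal sketch of that multiplicative model; the reliability scores are the illustrative values from the example, not measured figures.

```python
# Minimal sketch of the multiplicative integrity model described above.
# The per-source reliability scores are the illustrative values from the
# example, not measured figures.

def combined_integrity(*source_scores: float) -> float:
    """Multiply per-source reliability scores, each between 0.0 and 1.0."""
    result = 1.0
    for score in source_scores:
        result *= score
    return result

first_party = 1.0   # A: your own data, assumed fully reliable
second_party = 0.8  # B: someone else's first-party data
third_party = 0.6   # C: aggregates bought from a data exchange

print(round(combined_integrity(first_party, second_party, third_party), 2))  # 0.48
```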
Interviewer: Why might companies collect their own data instead of sourcing it from third-parties?
EG: Four reasons...accuracy, trust, owning the data, and elimination of privacy concerns.
First, accuracy. Our focus areas are directly connected to the data collection processes, like collecting clickstream and content from our own virtual establishments (web, chat, and customer communications made directly by phone, email, and our own social media assets), so we are the first to see the raw data. Raw data is accurate before it gets interpreted and skewed.
Second would be trust. The data is accurate and transformed the way we deem proper, via processes we establish without third-party involvement. Third, owning the data allows us to use it to its full legal extent without tying ourselves to vendors. Finally, eliminating privacy concerns: we control how we inform our users and how we collect the data, legally, in their own territories, without being affected by the problems of third-party vendors mishandling data.
Interviewer: So, what are the different methods for sourcing primary or, as Erdal said, first-party data?
Cagatay Capar: Methods vary by where the data is sourced from. If the data source is, say, a social platform — Facebook, Twitter — data can be sourced using the methods these platforms have devised for this specific purpose, such as their APIs.
If the source is a business software tool, for example software that keeps track of customer information, a connection to an FTP server would be one way of sourcing this data.
Another, less structured method could be direct scraping of web pages, for example to see what type of text business websites use on their homepage. These are just three examples of a wide range of methods that could be used for sourcing data.
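As a rough illustration of those three paths, the sketch below shows what sourcing from a platform API, an FTP export, and a scraped homepage might look like; the endpoints, hosts, credentials, and paths are placeholders, not actual integrations.

```python
# Illustrative sketch of the three sourcing methods mentioned above.
# Endpoints, hosts, credentials, and paths are placeholders only.
import ftplib

import requests
from bs4 import BeautifulSoup


# 1. Platform API: request structured data over HTTP (token is hypothetical).
def fetch_from_api(endpoint: str, token: str) -> dict:
    response = requests.get(
        endpoint, headers={"Authorization": f"Bearer {token}"}, timeout=30
    )
    response.raise_for_status()
    return response.json()


# 2. FTP server: pull a customer-data export file from a business tool.
def fetch_from_ftp(host: str, user: str, password: str,
                   remote_path: str, local_path: str) -> None:
    with ftplib.FTP(host, user, password) as ftp, open(local_path, "wb") as fh:
        ftp.retrbinary(f"RETR {remote_path}", fh.write)


# 3. Web scraping: grab the visible text of a business website's homepage.
def scrape_homepage_text(url: str) -> str:
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text(separator=" ", strip=True)
```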
Interviewer: Thinking about the end user again, is there any way someone can determine the integrity of a company’s data, or know if it’s been changed?
CC: Although the answer heavily depends on the data sharing practices of the company in question, there may still be ways to determine if something about the data is not right.
Here’s an example with the Twitter API. Twitter provides statistics about users, tweets, etcetera over their API. Suppose you retrieve how many times a certain tweet has been viewed using the Twitter API. If you notice that, as time progresses and you gather more data, the number of views sometimes decreases, you can already tell there is a data quality issue. In this example, the problem is most likely due to some technical glitch or imperfect gathering of data on Twitter’s side, not necessarily a deliberate changing of data.
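A check along those lines can be very simple; the sketch below flags any point where a supposedly cumulative metric, such as a view count, decreases between observations. The numbers are hypothetical.

```python
# Sketch of a simple integrity check: a cumulative metric such as a view
# count should never decrease between successive observations.
from typing import Iterable, List


def find_decreases(view_counts: Iterable[int]) -> List[int]:
    """Return the indices where a supposedly cumulative count went down."""
    counts = list(view_counts)
    return [i for i in range(1, len(counts)) if counts[i] < counts[i - 1]]


# Hypothetical series gathered over successive API calls.
series = [120, 150, 149, 210]
print(find_decreases(series))  # [2] -> something is off at the third observation
```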
It’s not hard to imagine, though, scenarios where users have no way of telling whether data has been deliberately changed. This could be due to the way the company shares data and/or the limited privileges the user in question has in terms of data access.
Interviewer: From a technical standpoint, can you describe the process of creating a new data-collecting tool and the considerations of how to maintain ‘Data Integrity’ at every point? What questions do you have to ask yourself as the creator?
CC: This really depends on the source of the data. If the source is a structured data source where you simply make calls to an API or move data from an FTP server to your own database, I would call the process of creating a new data-collecting tool rather straightforward. If the source is unstructured data, such as scraping the web, then the process is an iterative one. Through the iterations, checks need to be done to see whether we’re collecting what we intend to collect. In my personal experience, there can be a lot of noise that you need to get rid of before you reach the desired outcome.
When maintaining data integrity, an automated, periodic data quality check is crucial. Sanity checks on collected data include making sure the number of data points collected does not vary too far outside a reasonable range (a sudden drop may indicate that data collection failed at some point) and that data fields stay within a “reasonable” range, such as a review score being within 1-5, a date of birth being a positive integer, or daily ad spend in US dollars not reaching six digits. These are the kinds of questions creators have to consider. Once a data quality check is implemented, unexpected situations can raise a flag and alert those responsible.
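A minimal sketch of such an automated check might look like the following; the field names and thresholds are hypothetical stand-ins for the examples above.

```python
# Sketch of an automated, periodic data quality check along the lines
# described above. Field names and thresholds are hypothetical.
from typing import Dict, List


def check_batch(records: List[Dict], expected_min_rows: int) -> List[str]:
    issues = []
    # Volume check: a sudden drop may mean collection failed at some point.
    if len(records) < expected_min_rows:
        issues.append(
            f"only {len(records)} rows collected, expected at least {expected_min_rows}"
        )
    for i, rec in enumerate(records):
        # Field-level range checks.
        if not 1 <= rec.get("review_score", 0) <= 5:
            issues.append(f"row {i}: review_score outside the 1-5 range")
        if rec.get("birth_year", 0) <= 0:
            issues.append(f"row {i}: birth_year is not a positive integer")
        if rec.get("daily_ad_spend_usd", 0) >= 100_000:
            issues.append(f"row {i}: daily ad spend unexpectedly reached six digits")
    return issues  # a non-empty result would raise a flag and alert the owners
```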
Interviewer: Does data scrubbing fit into that process?
CC: Data scrubbing can be implemented and improved regularly by observing the outcomes of the data quality checks mentioned above. For example, if you observe that the date of birth sometimes comes in with two digits instead of four, you can implement a data scrubbing step that modifies this particular field before feeding it into your process.
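Assuming the two digits refer to the year, a scrubbing step for that example might look like the sketch below; the pivot value used to decide the century is a hypothetical business rule.

```python
# Sketch of a scrubbing step for the two-digit example above, assuming the
# two digits refer to the year. The pivot value (25) is a hypothetical
# business rule for deciding the century, not a standard.
def normalize_birth_year(raw: str) -> int:
    year = int(raw)
    if year >= 100:  # already four digits
        return year
    return 1900 + year if year > 25 else 2000 + year


print(normalize_birth_year("87"))    # 1987
print(normalize_birth_year("1987"))  # 1987
```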
Interviewer: Does that change at all when you go from collecting data online to directly asking users for data?
CC: When you’re asking users for data, asking multiple users for the same data and cross-checking can be of great help. A classic example is asking users to label cars in an image when collecting images to feed into an AI tool. By showing the same image to multiple users and cross-checking, you can greatly improve the data quality.
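A cross-check of that kind often reduces to a majority vote across annotators, along the lines of this sketch; the labels and agreement threshold are illustrative.

```python
# Minimal sketch of cross-checking labels from multiple users with a
# majority vote. Labels and the agreement threshold are illustrative.
from collections import Counter
from typing import List, Optional


def majority_label(labels: List[str], min_agreement: float = 0.6) -> Optional[str]:
    """Return the winning label if enough annotators agree, else None."""
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes / len(labels) >= min_agreement else None


print(majority_label(["car", "car", "truck"]))  # "car" (2 of 3 agree)
print(majority_label(["car", "truck", "bus"]))  # None (no consensus)
```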
"The pillars of any software dealing with data collection and processing are understanding of the data collected...creating maintainable code that can scale and be fault tolerant...storing data securely in its raw form...and transferring the data by allowing multiple data transference paradigms in several mediums." Erdal Guner
Interviewer: How do you determine what data is collected, what’s kept, and what’s discarded?
EG: We always use a two-pronged approach, business and technical, both of which are ultimately business driven.
Once the business provides the data requirements, the first step is to understand the data from all angles. The legalities of capturing, transforming, and processing the data have to be properly met; this is the business side. All the formatting, supporting data, lat/long capture, transiency of the data, and which method to use are determined by the data provider (human or non-human) and the locality of the data source; this is the technical side.
Collected data is scrubbed according to the business and technical requirements. During scrubbing, we do not keep data that is deemed unnecessary for the business and technical side.
Since future product development may require data points that seem unrelated today, raw data is kept regardless, in packaged format to save space. Additional data points, such as the process execution environment and timings, are always tracked.
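One way to keep raw data in packaged form alongside that execution metadata is sketched below; the file layout and metadata fields are assumptions for illustration, not a description of WinnowPro's actual pipeline.

```python
# Hedged sketch of keeping raw data in a packaged (compressed) form next to
# a small metadata record with execution environment and timings. The file
# layout and metadata fields are assumptions for illustration only.
import gzip
import json
import platform
import time
from pathlib import Path


def archive_raw(payload: bytes, out_dir: Path, batch_id: str) -> None:
    start = time.time()
    out_dir.mkdir(parents=True, exist_ok=True)
    # Store the untouched raw payload, compressed to save space.
    with gzip.open(out_dir / f"{batch_id}.raw.gz", "wb") as fh:
        fh.write(payload)
    # Track the execution environment and timings alongside the data.
    metadata = {
        "batch_id": batch_id,
        "collected_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "environment": platform.platform(),
        "python_version": platform.python_version(),
        "write_seconds": round(time.time() - start, 3),
    }
    (out_dir / f"{batch_id}.meta.json").write_text(json.dumps(metadata, indent=2))
```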
Interviewer: In addition to collecting the data in a way that maintains its authenticity, is there more that goes into storing and protecting the data, such as keeping it safe from cross-site scripting or SQL injection?
EG: Software that collects any data, whether from end users or from non-human systems, must have penetration testing as part of its testing and deployment strategy. Penetration testing is tailor-made for the system being exploited. From network to storage, from input methods to output methods (such as calling or emailing to see if you can exploit customer data via automated or human-driven systems), all systems involved in the solution are tested. Many strategies are employed for these tests, covering not only cross-site scripting and SQL injection but many other exploitations: Targeted, Internal, External, Blind, Double Blind, Black box, White box, and more.
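For one of the exploitation classes mentioned, SQL injection, the usual application-level defense that a penetration test would probe is the parameterized query. The sketch below uses Python's built-in sqlite3 and a hypothetical customers table purely for illustration.

```python
# Minimal sketch of the usual application-level defense against SQL
# injection: a parameterized query. The sqlite3 backend and the customers
# table are assumptions for illustration; a penetration test would still
# probe the full system around it.
import sqlite3


def find_customer(conn: sqlite3.Connection, email: str):
    # Never interpolate user input into the SQL string; bind it as a parameter.
    cursor = conn.execute("SELECT id, name FROM customers WHERE email = ?", (email,))
    return cursor.fetchone()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("INSERT INTO customers (name, email) VALUES (?, ?)", ("Alice", "alice@example.com"))
print(find_customer(conn, "alice@example.com"))  # (1, 'Alice')
print(find_customer(conn, "' OR '1'='1"))        # None: the payload is treated as plain data
```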
For more information on WinnowPro’s Data Integrity Series and to learn how WinnowPro is achieving its mission for the highest data integrity standards, sign up to receive email alerts when new announcements are released.