Two more V’s in Big Data: Veracity and Value

In a previous post, we looked at the three V’s in Big Data, namely:

  • Volume: the number of data points in a data set
  • Variety: the number of features and parameters at each data point; more variety often requires more complex data structures
  • Velocity: a continuously updated data feed, required to keep analytics current in an ever-changing world

The whole ecosystem of Big Data tools rarely shines without those three ingredients. Without the three V’s, you are probably better off skipping Big Data solutions altogether and simply running a more traditional back-end.

Though the three V’s are the most widely accepted core set of attributes, there are several extensions to consider. The five V’s of Big Data extend the three already covered with two more characteristics: veracity and value.

Veracity

In general, data veracity is defined as the accuracy or truthfulness of a data set. In many cases, the veracity of a data set can be traced back to the provenance of its source, which is why we often speak of trustworthy data sources, types or processes. However, when multiple data sources are combined, e.g. to increase variety, the interactions between data sets and the resulting non-homogeneous landscape of data quality can be difficult to track.
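As a minimal sketch of how that landscape can be kept trackable (all source names and reliability scores here are hypothetical), one approach is to tag every record with its provenance and an assumed per-source veracity score at merge time:

```python
# Hypothetical per-source reliability scores in [0, 1]; in practice these
# would come from calibration, audits or source-reputation models.
SOURCE_VERACITY = {
    "sensor_feed": 0.95,   # calibrated instrument
    "user_reports": 0.60,  # crowd-sourced, unverified
}

def merge_with_provenance(datasets):
    """Combine records from several sources, tagging each record with its
    origin and that source's veracity score."""
    merged = []
    for source, records in datasets.items():
        score = SOURCE_VERACITY.get(source, 0.5)  # default for unknown sources
        for record in records:
            merged.append({**record, "source": source, "veracity": score})
    return merged

merged = merge_with_provenance({
    "sensor_feed": [{"temp": 21.3}],
    "user_reports": [{"temp": 25.0}],
})
# The combined set is no longer homogeneous in quality: each record now
# carries its own veracity, which downstream analytics can weight by.
```

The point of the sketch is simply that once sources are mixed, quality must travel with each record rather than with the data set as a whole.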

As the Big Data Value SRIA points out in its latest report, veracity remains an open research challenge in data analytics:

Content validation: Implementation of veracity (source reliability/information credibility) models for validating content and exploiting content recommendations from unknown users;

It is important not to confuse veracity with interpretability. Even with accurate data, misinterpretation during analysis can lead to the wrong conclusions. In principle, however, this is not a property of the data set, but of the analytic methods and the problem statement.
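To make that distinction concrete, here is a small Python sketch of Simpson’s paradox, using the figures from the classic kidney-stone treatment study often cited to illustrate it: the data are perfectly accurate, yet naive aggregation reverses the per-group conclusion.

```python
# Each entry is (successes, trials) for a treatment arm within a severity group.
data = {
    "small_stones": {"A": (81, 87),   "B": (234, 270)},
    "large_stones": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, trials):
    """Success rate of a treatment arm."""
    return successes / trials

# Within every severity group, treatment A has the higher success rate.
a_wins_each_group = all(
    rate(*arms["A"]) > rate(*arms["B"]) for arms in data.values()
)

# Yet pooling the groups reverses the conclusion: B looks better overall,
# because B was applied mostly to the easier (small-stone) cases.
totals = {"A": [0, 0], "B": [0, 0]}
for arms in data.values():
    for treatment, (s, n) in arms.items():
        totals[treatment][0] += s
        totals[treatment][1] += n

overall_a = rate(*totals["A"])  # 273/350 = 0.78
overall_b = rate(*totals["B"])  # 289/350 ≈ 0.83
```

Nothing here is a veracity problem; the data are correct in every cell. The wrong conclusion comes entirely from how the analysis aggregates them.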

Value

Data value is a subtler concept. It is often quantified as the potential social or economic value that the data might create. However, the concept is weakly defined: without a proper intention or application, highly valuable data might sit in your warehouse creating no value at all. This is often the case when the actors producing the data are not the ones capable of extracting value from it.

However, recent efforts in Cloud Computing are closing this gap between available data and its possible applications. Amazon Web Services, Google Cloud and Microsoft Azure offer an ever-growing range of services that democratize data analytics. In aviation, unfortunately, a gap still remains between data engineering and aviation stakeholders, although some platforms are lowering the entry barrier and making data accessible to those stakeholders.

Conclusion

The problem with the two additional V’s in Big Data is how to quantify them. Veracity can be interpreted in several ways, none of them fully objective; meanwhile, value is not intrinsic to a data set. Moreover, both veracity and value can often only be determined a posteriori, once your system or MVP has already been built. This may explain some of the community’s hesitance in adopting the two additional V’s.

In any case, these two additional characteristics are still worth keeping in mind, as they may help you evaluate the suitability of your next Big Data project.

About Author

Samuel Cristobal

One minute Samuel can be talking about Forcing theory and how to prove that the Axiom of Choice is independent of Set Theory, and the next about how to integrate Serverless architectures for Machine Learning applications in a containerized environment.
