You have probably heard the rapidly-turning-into-cliché expression that data is the new oil. As overused as it might be, it turns out to be an excellent basis for an analogy that explains the chain of transformations data goes through in a typical enterprise, and how value, defined as any usage benefit accruing from the new form of the data, is added successively at each stage. That is what I refer to as the data value chain. The term is inspired by the oil industry term “oil value chain”, which refers to the chain of economic transformations performed by an integrated oil company (a set of oil industry companies operating in different segments). In upstream operations, oil is extracted from a reservoir, resulting in crude oil. In midstream operations, that crude oil is stored, transported and delivered wholesale. In downstream operations, the transported crude oil is refined into a set of oil products such as gasoline, jet fuel and fuel oil, and those refined products are marketed and distributed to end users. This staged transformation through successive upstream, midstream and downstream operations is what makes the data-as-new-oil analogy so useful and redeems the fatigue of using a cliché.
Dividing the data transformation operations that add value to data into three conceptually separate segments, and hence creating the data value chain concept, makes it possible to manage the unique constraints and requirements of each segment in such a way as to maximize the benefit that can be extracted through the operations in that particular segment. This is akin to oil companies choosing to specialize in a specific segment of the oil value chain (be it upstream production operations or downstream distribution operations), whereby they build operational excellence and efficiencies in managing operations in their segment and consequently gain competitive advantage. Similar benefits can be realized if data is handled in accordance with the nature of the operations to be performed on it and the segment those operations belong to in the data value chain. While the nature of data operations determines which segment they fall under, the segment in turn influences the systems and methodologies used to perform the actual transformations that create the added value. This enables segment-specific specialization and the use of fit-for-purpose transformation technologies that maximize the value generated through the operations in each segment.
Having argued the rationale behind segmentation and the adoption of segment-specific transformation methods and technologies, let’s look at each of these segments in more detail.
Upstream is all about the origination of data. This is where data is born and begins its journey through a long series of transformations in the data value chain. We are concerned here with data coming into being from the point of view of a specific enterprise. This data might in fact come from an external vendor as an end product of that vendor’s own data value chain operations (in its downstream segment). This is similar to dependency relations in supply-chain networks, where one company’s (the supplier’s) end product becomes another company’s (the customer’s) raw material.
As far as a specific enterprise is concerned, there are three main ways in which data might originate:
Midstream is primarily concerned with classification and storage of data that comes from upstream.
Let’s focus on the storage part first. The amount of data coming in (i.e. size), the rate or frequency at which it flows in (i.e. speed), the acceptable delay before the stored data is consumed by downstream operations (i.e. latency), the importance of the data for downstream operations (i.e. value), and so on will all influence how data is stored and which storage technology is used. Infrastructure cost is a cross-cutting theme that affects all of these factors and leads to various options along the cost/benefit spectrum.
Related to storage is the important concept of the form in which data is going to be stored:
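To make the interplay of these factors a bit more concrete, here is a minimal Python sketch of how size, speed, latency and value might feed into a storage decision. Everything in it is an assumption made for illustration: the `DataProfile` fields, the tier names ("hot", "warm", "cold") and especially the numeric thresholds are placeholders, not recommendations, and a real decision would also weigh infrastructure cost explicitly.

```python
from dataclasses import dataclass


@dataclass
class DataProfile:
    """Hypothetical summary of a data feed arriving from upstream."""
    size_gb: float            # amount of data coming in (size)
    events_per_second: float  # rate at which it flows in (speed)
    max_latency_s: float      # delay downstream consumers can tolerate (latency)
    business_value: int       # 1 (low) to 5 (high) importance downstream (value)


def choose_storage_tier(profile: DataProfile) -> str:
    """Pick an illustrative storage tier; all thresholds are arbitrary placeholders."""
    needs_fast_reads = profile.max_latency_s < 1.0
    high_throughput = profile.events_per_second > 1_000

    # Small, valuable, latency-sensitive data can justify the most expensive tier.
    if needs_fast_reads and profile.business_value >= 4 and profile.size_gb <= 100:
        return "hot"
    # Anything fast-moving, latency-sensitive or moderately valuable goes mid-range.
    if needs_fast_reads or high_throughput or profile.business_value >= 3:
        return "warm"
    # Everything else goes to the cheapest storage.
    return "cold"


tick_feed = DataProfile(size_gb=10, events_per_second=5_000, max_latency_s=0.5, business_value=5)
old_logs = DataProfile(size_gb=5_000, events_per_second=10, max_latency_s=3_600, business_value=1)
print(choose_storage_tier(tick_feed), choose_storage_tier(old_logs))
```

The point of the sketch is only that the decision is a function of several factors at once, so changing any one of them (for example, a much larger volume at the same latency requirement) can move the data to a different point on the cost/benefit spectrum.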
The second concern of midstream is data classification. This is usually known as ontology in the data analytics space and basically refers to a hierarchical classification scheme based on multiple inheritance, where data entities can have multiple paths of ancestry. The full details of ontology are beyond the scope of this article (I will cover it in another article), but you can think of it as a taxonomy (like in biology) where entities or concepts can have multiple inheritance relationships that are not necessarily in the same hierarchical path. For example, a bat could be classified as both a bird (because it flies like birds) and a mammal (because it feeds its offspring with its milk), even though the Mammal and Bird classifications are not part of the same hierarchy. As this simple example shows, whilst ontology is very flexible in assigning attributes (e.g. flying, milk feeding, etc.) to entities, this flexibility can lead to logical contradictions (e.g. a bat is not a bird even though it flies), especially in complex, deep classification schemes.
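The bat example can be sketched in a few lines of Python. This is only an illustration of multiple paths of ancestry, not a real ontology language: the `PARENTS` table, the root concept `Animal` and the helper names `ancestors` and `is_a` are all assumptions introduced here.

```python
# Hypothetical ontology: each concept lists its direct parents.
# A concept may have more than one parent (multiple inheritance).
PARENTS = {
    "Animal": [],
    "Bird": ["Animal"],
    "Mammal": ["Animal"],
    "Bat": ["Bird", "Mammal"],  # two separate paths of ancestry
}


def ancestors(concept: str) -> set:
    """Collect every concept reachable by following parent links upward."""
    seen = set()
    stack = list(PARENTS[concept])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def is_a(concept: str, candidate: str) -> bool:
    """True if `candidate` is the concept itself or one of its ancestors."""
    return candidate == concept or candidate in ancestors(concept)


print(sorted(ancestors("Bat")))
print(is_a("Bat", "Bird"))
```

Note that `is_a("Bat", "Bird")` holds under this scheme, which is exactly the kind of logical contradiction described above: the structure happily records the classification, and nothing in it knows that a bat is not really a bird.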
The main benefit of data classification is to provide a level of abstraction over the data entities captured in upstream, so that details unrelated to the fundamental nature and conceptual identity of the data do not propagate to downstream systems, where such variations would add unnecessary complexity to handling the data in a consistent, coherent and cost-effective way. To see this, consider a company that refers to its suppliers in three different ways (perhaps for historical reasons such as mergers and acquisitions): Partner, Provider and Supplier. If this is not recognized and corrected by mapping all three concepts to a single one, say Supplier, through an ontology-based transformation, every downstream analytical system that deals with suppliers has to know that a supplier can go by any of these terms, and has to be designed and built to handle that variation. The problem is that such variation-handling functionality then has to be repeated across all similar downstream systems, resulting in ballooning costs at best and consistency and coherence issues at worst.
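As a rough illustration of doing this mapping once in midstream, consider the following Python sketch. The alias table, the record layout (`name`, `entity_type`) and the company names are made up for the example; only the three terms Partner, Provider and Supplier come from the scenario above.

```python
# Hypothetical alias table: the three legacy terms all map to one canonical concept.
CANONICAL_CONCEPTS = {
    "partner": "Supplier",
    "provider": "Supplier",
    "supplier": "Supplier",
}


def canonicalize(record: dict) -> dict:
    """Return a copy of the record with its entity type mapped to the canonical concept."""
    raw_type = record["entity_type"]
    canonical = CANONICAL_CONCEPTS.get(raw_type.strip().lower())
    if canonical is None:
        # Fail loudly so unmapped terms are fixed here, not discovered downstream.
        raise KeyError(f"unmapped entity type: {raw_type!r}")
    return {**record, "entity_type": canonical}


# Made-up upstream records using the three different terms.
upstream_records = [
    {"name": "Acme Ltd", "entity_type": "Partner"},
    {"name": "Globex", "entity_type": "Provider"},
    {"name": "Initech", "entity_type": "Supplier"},
]

midstream_records = [canonicalize(r) for r in upstream_records]
print({r["entity_type"] for r in midstream_records})
```

With the mapping applied in one place, downstream systems only ever see `Supplier` and need no variation-handling logic of their own.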
Read Part 2 of this article.
***
Cetin Karakus will be speaking at Chief Analytics Officer, Spring, taking place on May 2-4, 2017 in Scottsdale, Arizona. For more information, visit https://coriniumintelligence.com/chiefanalyticsofficerspring
Disclaimer: All the content provided here is for informational purposes only and belongs solely and completely to Cetin Karakus, not BP and BP is not responsible for any damage caused by any use of the content provided in this article.
By Cetin Karakus
Cetin Karakus is the Global Head of Analytics Core Strategies & Quantitative Development, Group Technology Advisor, BP IST IT&S
Cetin Karakus has almost two decades of experience in designing and building large scale software systems. Over the last decade, he has worked on the design and development of complex derivatives pricing and risk management systems in leading global investment banks and commodity trading houses. Prior to that, he worked on various large scale systems ranging from VOIP stacks to ERP systems.
In his current role, he has had the opportunity to build an investment bank grade quantitative derivatives pricing and risk infrastructure from scratch. Most recently, he has been working on designing a proprietary state-of-the-art BigData analytics platform for global energy markets while leading a global team of talented software engineers, analysts and data scientists.
Cetin has a degree in Electrical & Electronics Engineering and enjoys thinking and reading on various fields of humanities in his free time.