You have probably heard the rapidly-turning-into-cliché expression that data is the new oil. As overused as this expression might be, it turns out to be an excellent basis for an analogy that explains the chain of transformations data goes through in a typical enterprise and how value, defined as any usage benefit accruing from the new form of the data, is added successively at each stage. That is what I refer to as the data value chain. The term itself is inspired by the oil industry term “oil value chain”, which refers to the chain of economic transformations performed by an integrated oil company (a set of oil industry companies operating in different segments). In upstream operations, oil is extracted from a reservoir, resulting in crude oil. In midstream operations, that crude oil is stored, transported and delivered wholesale. In downstream operations, the transported crude oil is refined into a set of oil products such as gasoline, jet fuel and fuel oil, and those refined products are marketed and distributed to end users. This staged transformation through successive upstream, midstream and downstream operations is what makes the data-as-new-oil analogy so useful and redeems the fatigue of using a cliché.
Segmentation of Data Value Chain
Dividing the data transformation operations that add value to data into three conceptually separate segments, and hence creating the data value chain concept, makes it possible to manage the unique constraints and requirements of each segment in such a way as to maximize the benefit that can be extracted through the operations in that particular segment. This is akin to oil companies choosing to specialize in a specific segment of the oil value chain (be it upstream production operations or downstream distribution operations), whereby they build operational excellence and efficiencies in managing operations in their segment and consequently gain competitive advantage. Similar benefits can be realized if data is handled in accordance with the nature of the operations to be performed on it and the segment those operations belong to in the data value chain. While the nature of data operations determines which segment they fall under, the segment in turn influences the systems and methodologies used to perform the actual transformations that create the added value. This enables segment-specific specialization and the use of fit-for-purpose transformation technologies that maximize the value generated through the operations in each segment.
Having argued the rationale behind segmentation and the adoption of segment-specific transformation methods and technologies, let’s look at each of these segments in more detail.
Upstream Data Operations
Upstream is all about the origination of data. This is where data is born and begins its journey through a long series of transformations in the data value chain. We are concerned here with data coming into being from the point of view of a specific enterprise. This data might in fact come from an external vendor as the end product of that vendor’s own data value chain operations (in its downstream segment). This is similar to dependency relations in supply-chain networks, where one company’s (the supplier’s) end product becomes another company’s (the customer’s) raw material.
As far as a specific enterprise is concerned, there are three main ways in which data might originate:
Sourced: Data is sourced from another organization where it has been created. This organization could be external or internal, leading to the terms external sourcing and internal sourcing respectively. Sourced data would itself have been sourced, captured or generated at its point of origination, and so on recursively back through each preceding point of origination.
Captured: Data is captured as part of usual business operations. This forms the bulk of data in an enterprise and covers a wide range of data, including transaction data (e.g. sales data, purchasing data, trading data, treasury transactions data, etc.), reference/master data (e.g. customer data, product data, credit terms data, etc.), operational data (e.g. shipments data, order fulfilment data, etc.), sensor data, etc.
Generated: This is data manufactured by an enterprise through its business and/or technology operations, using sourced and/or captured data as feedstock. An example would be generating a custom, hybrid credit score metric, through a company-specific analytical process, from customer data captured through business operations and credit score data sourced from credit agencies. Increasingly, more data is being generated (as opposed to being sourced or captured) with the adoption of machine learning and other automated analytics technologies.
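To make the three forms of origination concrete, here is a minimal Python sketch of the hybrid credit score example. Everything in it is an illustrative assumption rather than a real scoring method: the `Origin` and `DataPoint` names, the 0-1000 agency score scale, the on-time payment ratio as the captured input, and the 0.7 weighting are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    """The three forms of data origination described above."""
    SOURCED = "sourced"
    CAPTURED = "captured"
    GENERATED = "generated"


@dataclass(frozen=True)
class DataPoint:
    name: str
    value: float
    origin: Origin


def hybrid_credit_score(agency_score, on_time_ratio, agency_weight=0.7):
    """Blend a sourced agency score with a captured payment ratio.

    Assumptions (hypothetical): the agency score is on a 0-1000 scale,
    the payment ratio is between 0 and 1, and a simple weighted
    average is a good enough blend for illustration.
    """
    if agency_score.origin is not Origin.SOURCED:
        raise ValueError("agency_score must be sourced data")
    if on_time_ratio.origin is not Origin.CAPTURED:
        raise ValueError("on_time_ratio must be captured data")
    blended = (agency_weight * agency_score.value
               + (1.0 - agency_weight) * on_time_ratio.value * 1000.0)
    # The result is new data manufactured from the two feedstocks.
    return DataPoint("hybrid_credit_score", round(blended, 1), Origin.GENERATED)


agency = DataPoint("agency_score", 700.0, Origin.SOURCED)
payments = DataPoint("on_time_payment_ratio", 0.9, Origin.CAPTURED)
score = hybrid_credit_score(agency, payments)
print(score)
```

Tagging each data point with its origin is a design choice of this sketch; it simply keeps the sourced, captured and generated distinction visible as data moves on to midstream.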
Midstream Data Operations
Midstream is primarily concerned with the classification and storage of data that comes from upstream. Let’s focus on the storage part first. The amount of incoming data (i.e. size), the rate or frequency at which it flows (i.e. speed), the acceptable delay with which the stored data can later be consumed by downstream operations (i.e. latency), the importance of the data for downstream operations (i.e. value), and so on will all influence how data is stored and which storage technology is used. Infrastructure cost is a cross-cutting theme that affects all of these factors and leads to various options along the cost/benefit spectrum.
Related to the storage is the important concept of the form in which data is going to be stored:
Raw: Data is stored as is (i.e. in its upstream form). The primary benefit of this approach is that it allows the results of downstream analytical operations to be easily traced back to the raw data that came from upstream operations, which can be quite useful in detailed analysis and troubleshooting.
Derived: Data is first passed through a set of transformations, and the results of those transformations, rather than the raw data, are stored. This approach stores the data in the form most likely to be used by downstream operations. It does, however, discard the original raw data, which could reduce the data tracing capabilities (i.e. linking analytical results to original input data) of downstream analytic operations. On the flip side, discarding raw data reduces storage requirements and lowers the associated infrastructure costs.
Both: With this approach, you store both raw and derived data and hence avoid losing any capabilities due to storing one but not the other. The downside is the increased data storage requirement and the associated increase in costs. Cost challenges could be mitigated by storing raw data, which is likely to be consumed less frequently than derived data, in a storage medium with a low-cost but high-latency profile (e.g. Amazon S3 storage).
Mixed: This is an approach where either raw or derived data (but not both) is stored. The decision whether to store raw or derived data is made on a case-by-case basis for each dataset, based on the relevant downstream data use cases. Where the benefits of storing a dataset in raw form outweigh the benefits of storing it in derived form, it is stored as raw; otherwise it is stored as derived. This is the most complicated of the four approaches listed and is better avoided in favour of the ‘both’ approach, so long as the cost constraints of the latter do not rule it out as a viable option.
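The ‘both’ approach can be sketched in a few lines of Python. This is only an in-memory illustration under stated assumptions: the `DualStore` class, its method names and the content-hash identifier are hypothetical, and the two dictionaries stand in for what would in practice be a cheap, slower tier for raw data and a faster tier for derived data.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class DualStore:
    """Keeps raw and derived data side by side, linked by a raw record id."""
    raw: dict = field(default_factory=dict)      # stand-in for a low-cost tier
    derived: dict = field(default_factory=dict)  # stand-in for a fast tier

    def ingest(self, record, transform):
        """Store the raw record and its derived form; return the raw id."""
        payload = json.dumps(record, sort_keys=True).encode("utf-8")
        raw_id = hashlib.sha256(payload).hexdigest()
        self.raw[raw_id] = record
        # The derived record carries a pointer back to its raw input.
        self.derived[raw_id] = {"raw_id": raw_id, **transform(record)}
        return raw_id

    def trace(self, derived_record):
        """Trace a derived record back to the raw data it came from."""
        return self.raw[derived_record["raw_id"]]


store = DualStore()
rid = store.ingest(
    {"amount_cents": 1250, "currency": "USD"},
    lambda r: {"amount": r["amount_cents"] / 100, "currency": r["currency"]},
)
derived_record = store.derived[rid]
print(derived_record["amount"], store.trace(derived_record))
```

Dropping the `raw` dictionary from this sketch would give the ‘derived’ approach, at which point `trace` could no longer work. That is the tracing capability the text describes as being lost.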
The second concern of midstream is data classification. This is usually known as ontology in the data analytics space and basically refers to a hierarchical classification scheme based on multiple inheritance, where data entities can have multiple paths of ancestry. The full details of ontology are beyond the scope of this article (I will cover them in another article), but you can think of it as a taxonomy (as in biology) where entities and concepts can have multiple inheritance relationships that are not necessarily in the same hierarchical path. For example, a Bat could be classified as both a Bird (because it flies like birds) and a Mammal (because it feeds its offspring with its milk), even though the Mammal and Bird classifications are not part of the same hierarchy. As this simple example shows, whilst ontology is very flexible in assigning attributes (e.g. flying, milk feeding, etc.) to entities, this flexibility can lead to logical contradictions (e.g. a Bat is not a bird even though it flies), especially in complex, deep classification schemes.
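A classification scheme with multiple paths of ancestry can be represented as a directed graph in which an entity may have more than one parent. The Python sketch below mirrors the Bat example above, including its deliberately debatable Bird link; the `PARENTS` table, the `Animal` root and the `ancestors` helper are hypothetical names chosen for illustration.

```python
# Each entity maps to its direct parents; more than one parent is allowed.
PARENTS = {
    "Bat": ["Bird", "Mammal"],  # two separate paths of ancestry
    "Bird": ["Animal"],
    "Mammal": ["Animal"],
    "Animal": [],
}


def ancestors(entity, parents):
    """Return every classification reachable by following parent links."""
    seen = set()
    stack = list(parents.get(entity, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


print(sorted(ancestors("Bat", PARENTS)))
```

Nothing in this structure prevents the contradiction noted above: the graph accepts Bat as a Bird just as readily as it accepts Bat as a Mammal, so any consistency rules have to be enforced separately.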
The main benefit of data classification is that it provides a level of abstraction over the data entities captured in upstream, so that details unrelated to the fundamental nature and conceptual identity of the data do not propagate to downstream systems, where such variations could cause unnecessary complexity in handling them in a consistent, coherent and cost-effective way. To see this, consider the case where a given company refers to its suppliers in three different ways (perhaps for historical reasons, e.g. mergers, acquisitions, etc.): Partner, Provider and Supplier. If this is not recognized and corrected by mapping all three concepts to a single one, say Supplier, through an ontology-based transformation, then every downstream analytical system that deals with suppliers has to know that a supplier can go by any of these terms and has to be designed and built to deal with this variation. The problem is that such variation-handling functionality has to be repeated across all similar downstream systems, resulting in ballooning costs at best and consistency and coherence issues at worst.
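The supplier example can be sketched as a single midstream mapping step in Python. This is a simplified stand-in for an ontology-based transformation, assuming a flat synonym table; the `CANONICAL_TERMS` table, the `entity_type` field and the `canonicalize` function are hypothetical names.

```python
# Hypothetical synonym table: every legacy term maps to one canonical concept.
CANONICAL_TERMS = {
    "partner": "Supplier",
    "provider": "Supplier",
    "supplier": "Supplier",
}


def canonicalize(record, mapping=CANONICAL_TERMS):
    """Return a copy of the record with its entity type mapped to the canonical term.

    Unknown entity types are passed through unchanged.
    """
    label = record["entity_type"].strip().lower()
    return {**record, "entity_type": mapping.get(label, record["entity_type"])}


upstream_records = [
    {"id": 1, "entity_type": "Partner"},
    {"id": 2, "entity_type": "provider"},
    {"id": 3, "entity_type": "Supplier"},
    {"id": 4, "entity_type": "Customer"},
]
midstream_records = [canonicalize(r) for r in upstream_records]
print([r["entity_type"] for r in midstream_records])
```

Because the mapping is applied once in midstream, downstream systems only ever see Supplier and do not each need their own copy of the synonym handling.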
Disclaimer: All the content provided here is for informational purposes only and belongs solely and completely to Cetin Karakus, not BP and BP is not responsible for any damage caused by any use of the content provided in this article.
By Cetin Karakus
Cetin Karakus is the Global Head of Analytics Core Strategies & Quantitative Development, Group Technology Advisor, BP IST IT&S
Cetin Karakus has almost two decades of experience in designing and building large scale software systems. Over the last decade, he has worked on the design and development of complex derivatives pricing and risk management systems in leading global investment banks and commodity trading houses. Prior to that, he worked on various large scale systems ranging from VOIP stacks to ERP systems.
In his current role, he has had the opportunity to build an investment bank grade quantitative derivatives pricing and risk infrastructure from scratch. Most recently, he has been working on designing a proprietary state-of-the-art BigData analytics platform for global energy markets while leading a global team of talented software engineers, analysts and data scientists.
Cetin has a degree in Electrical & Electronics Engineering and enjoys thinking and reading on various fields of humanities in his free time.