Big Data Pipelines – Efficient, Systematic Data Processing

Viewed in detail, a data pipeline is a whole series of data processing steps. In general terms, a data pipeline is a method of transferring data from a source to a target system. But what is a Big Data pipeline? From a purely conceptual point of view, there is no fundamental difference from the classic data pipeline, but the amount of data is decisive and leads us into definitional difficulties: when is it "just" data, and when is it Big Data? How is such data collected, how is it used, and who benefits from it?

These and other questions are answered in the article below, which the Mobility Rockstars editorial team developed, as usual, in close cooperation with the relevant specialist departments.


Definition question: What are Big Data pipelines?

To answer the question of what a Big Data pipeline is, we have to start with the definition of Big Data. In general, "Big Data" is only a broadly defined collective term and usually stands for data volumes that are too large to be analyzed using traditional data processing methods. We use the qualifier "usually" deliberately, because it is not only sheer volume that produces "Big Data": very complex data structures or weakly structured data can also become a case for Big Data methods. So on the one hand, Big Data refers to the data itself. On the other hand, the collective term also covers an entire field of work concerned with collecting and analyzing such data, as well as the numerous technologies used for this purpose.

Big Data Pipelines: What data is collected, and why?

In the automotive sector in particular, there is a huge amount of data that needs to be collected and processed in order to evaluate it and derive real insights. One method of data collection is the "Mobile Data Recorder" from Cognizant Mobility, referred to below by its internally used acronym "MDR". This is a device that is installed not in customer vehicles, of course, but in numerous test vehicles, and which can be used to collect a wide range of data. An example would be the creation of a pothole map: test vehicles drive all over Germany on practically all mapped roads, the MDR records every pothole, and in addition the test driver can send a signal to the recorder via a steering-wheel button combination to request a higher resolution of the data – more on this in a moment.
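The internals of the MDR are not public, but the idea of a driver trigger raising the recording resolution can be sketched in a few lines. Everything here – the class name, the intervals, the trigger method – is illustrative only:

```python
from dataclasses import dataclass, field

@dataclass
class Recorder:
    """Toy model of a data recorder with a switchable sampling rate.

    The real Mobile Data Recorder's internals are proprietary; this only
    illustrates the idea of a driver trigger boosting the resolution.
    """
    base_interval_ms: int = 100    # normal resolution: one sample per 100 ms
    boost_interval_ms: int = 1     # high resolution after a driver trigger
    boosted: bool = False
    samples: list = field(default_factory=list)

    def trigger(self):
        """Driver presses the steering-wheel button combination."""
        self.boosted = True

    def record(self, duration_ms: int):
        """Append one (toy) sample per interval over the given duration."""
        interval = self.boost_interval_ms if self.boosted else self.base_interval_ms
        t = 0
        while t < duration_ms:
            self.samples.append(t)   # stand-in for a real sensor reading
            t += interval

rec = Recorder()
rec.record(1000)        # 1 s at base resolution  -> 10 samples
rec.trigger()
rec.record(1000)        # 1 s at boosted resolution -> 1000 more samples
```

The same second of driving thus produces a hundred times more data points once the trigger has been pressed – which is exactly why the volumes discussed below grow so quickly.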

The data from a large number of test vehicles must be collected and evaluated centrally. Due to the large number of vehicles, the many different types of data, and of course the sheer volume, sensibly structured Big Data pipelines are essential.

As mentioned, we are not talking about just one vehicle or just one road: some drivers complete well over 100,000 kilometers per month in test drives. And it is not just potholes that are covered. What happens with which signal sequence in the vehicle? Which error messages arise in which systems, at what point, at what time, and how often? All these signals from the most important vehicle buses are now stored in a basic data set by default; in addition, customer-specific evaluation requirements can be met through the selection and resolution of the signals.

In the automotive industry, "the customer" usually means the big players in the sector, whose engine departments, chassis departments, et cetera need various data to drive and optimize their own development.

So the "Big" in "Big Data" is already emerging: assuming dozens, hundreds or more test vehicles, we are talking about millions of kilometers driven per month and veritable myriads of different data points. This data must not only be recorded but also sent on – the next step in the Big Data pipeline.

How is Big Data sent, received and analyzed?

The collected data is stored by the Mobile Data Recorder, which packages it and also contains a SIM card by means of which the recorder can send the data over the air to the cloud via the LTE network. In the cloud – in this specific project, a Microsoft Azure cloud – a computing cluster sits behind an ingress service such as Azure Event Hubs, which receives the data and stores it in a database, from where the data can be segmented, validated and analyzed. The database itself is connected to appropriate evaluation tools (such as Azure Data Explorer) so that the data can be filtered and made available as a data set.
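The stages just described – recorder, ingress buffer, computing cluster, database, evaluation tool – can be sketched as a minimal in-memory chain. This is not the project's actual code (that runs on Azure Event Hubs and Azure Data Explorer); all names and the validation rule are illustrative:

```python
from collections import deque

class Pipeline:
    """Minimal stand-in for the described flow:
    recorder -> event hub (buffer) -> compute cluster -> database -> query.
    """

    def __init__(self):
        self.event_hub = deque()   # ingress buffer, like an Event Hub partition
        self.database = []         # stand-in for the analytical store

    def ingest(self, record: dict):
        """The recorder sends a record over the air into the hub."""
        self.event_hub.append(record)

    def process(self):
        """The compute cluster drains the hub, validates, and stores."""
        while self.event_hub:
            record = self.event_hub.popleft()
            if "vehicle_id" in record and "signal" in record:  # basic validation
                self.database.append(record)

    def query(self, signal: str):
        """The evaluation tool filters the store into a customer data set."""
        return [r for r in self.database if r["signal"] == signal]

pipe = Pipeline()
pipe.ingest({"vehicle_id": "TV-042", "signal": "pothole", "lat": 48.1, "lon": 11.6})
pipe.ingest({"vehicle_id": "TV-042", "signal": "engine_temp", "value": 92})
pipe.ingest({"malformed": True})   # dropped during validation
pipe.process()
potholes = pipe.query("pothole")
```

The design point the sketch makes is the same one the article makes: ingestion and processing are decoupled by a buffer, so thousands of recorders can send data independently of how fast the cluster happens to be evaluating it.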

A popular analogy in this context is the supermarket: someone places a grocery order (the customer), which is fulfilled by a shopper (the MDR). The purchases (data) are placed on the conveyor belt (Event Hub) and rung up by the cashier (computing cluster). From the shopping cart (database), the original customer can now pick out the goods that particularly interest him and take a closer look at them. Sounds pragmatic – but it is not for nothing that this method of data transfer bears the industrial name "pipeline".

After collection and transmission, the most important step in the Big Data pipeline is certainly decoding the enormous amounts of data without errors. The data is unpacked, decoded and, as far as possible, validated and enriched with its own metadata (for example, if a serial number does not match an error message, a documenting note can be added to the data record). The decoding is performed by an algorithm – a core part of Cognizant Mobility's work in this project, as the team developed and wrote large parts of the algorithm independently. The data, now collected, transferred, unpacked, decoded and segmented, can then be presented to the customer.
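The actual decoding algorithm and the MDR frame format are proprietary, but the unpack-decode-validate-annotate sequence can be illustrated with an invented binary layout. The field layout, the serial registry, and the note text below are all assumptions for the sake of the example:

```python
import struct

# Invented frame layout for illustration only (the real MDR format is
# proprietary): little-endian serial number (I), signal id (H), value (d).
FRAME = "<IHd"
KNOWN_SERIALS = {1001, 1002}   # hypothetical fleet registry

def decode_frame(raw: bytes) -> dict:
    """Unpack one raw frame, then validate and annotate it with metadata."""
    serial, signal_id, value = struct.unpack(FRAME, raw)
    record = {"serial": serial, "signal_id": signal_id,
              "value": value, "meta": []}
    # Validation step: annotate rather than discard, as the article describes.
    if serial not in KNOWN_SERIALS:
        record["meta"].append("serial number not in fleet registry")
    return record

raw = struct.pack(FRAME, 9999, 7, 3.14)   # unknown serial on purpose
rec = decode_frame(raw)
```

Keeping suspect records and flagging them in metadata, instead of silently dropping them, is what lets the later evaluation stage decide for itself which data to trust.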

Large data sets – A question of resolution

So far, so good – the customer has his data set, but how accurate is it? After all, a tremendous amount of data was collected. If only a snapshot is to be examined, for example an error message that needs to be fixed in further development, we have to talk about resolution. The term may be more familiar from video technology, where it describes the number of pixels on a given surface. In data processing, however, resolution means something quite different: a temporal rather than a spatial resolution. So let's literally look at this more closely: a time series is basically one-dimensional, and we can examine certain sections of it. The question is how far apart the observations are. A period of two seconds could be sampled once per second, which yields two data points (also called sample points). The same period could instead be sampled every millisecond, which yields two thousand data points instead of two – a considerably higher resolution of the time stream. So if a fault occurs as part of a signal sequence in the vehicle, the customer can view exactly the time period in which the fault occurred at a higher resolution and has a significantly larger basis of data available for the evaluation.

The volumes of data collected, validated and analyzed via Big Data pipelines run into petabytes and therefore require special Big Data procedures.

Big Data Pipelines at Cognizant Mobility – Why Experience is Critical

We love simple comparisons: anyone with a green thumb can grow an apple tree from a seed. But a gardener does it better, and if we are talking about a whole plantation, it makes sense to hire gardeners with large-scale project experience. What makes sense for – pardon the awkward pun – the Big Apple makes even more sense for Big Data. For years, Cognizant Mobility has demonstrated a high level of expertise in the processing of data as well as in its precise evaluation. After all, what good is data if it does not yield insights from which actions can be derived? In addition, it must always be kept in mind that Big Data is one thing above all: big. Apart from the fact that there are few players operating at the scale required in the automotive sector, Cognizant Mobility is adept at project business and manages to operate in a structured and efficient manner, especially from an IT perspective.

And it is not only the sheer size but also the potential of this data that is overwhelming. Of all the possible uses, we have barely named the tip of the iceberg. Which seat-heating level is preferred? How far is the steering wheel turned? When does the driver shift gears? With this multitude of data, the door is open for predictive analytics and machine learning – fields of activity that overlap heavily in the automotive industry.

Especially for OEMs, which often do not have their own engineering departments for these fields, a technologically adept provider of structured data collection and evaluation is an important partner in the value chain.

If you are interested in learning more about Big Data, want to equip your vehicles with the MDR, have large amounts of data you cannot do anything with, or need Big Data management and validation, feel free to use our contact form and write to us.