How trains are becoming data driven

Trains and public transport are, for many of us, a vital part of our daily lives. Large cities are particularly dependent on an efficient public transport system, and when a disruption occurs, it usually affects many passengers and spreads across the transport network. But our requirements as passengers are growing and maturing. Safety is paramount, but we also care about timeliness, comfort, Internet access, and other amenities. With strong competition for regional and long-distance trains, providing an attractive service has become critical for many rail operators today.

The railway industry is an old one. For the last 150 years, it has been built around mechanical systems with a lifetime of 30 years, maintained mostly through reactive or preventive maintenance. But this is no longer enough to deliver the type of service we all want and expect.

Deriving insight from the data of trains

Over the last few years, the rail industry has been transforming itself, embracing IT, digitalization, big data, and the related changes in business models. This change is driven by the railway operating companies, which demand higher vehicle and infrastructure availability and, increasingly, want to transfer their operational risk to suppliers. In parallel, the thought leaders among maintenance providers have embraced the technology opportunities to radically improve their offerings and help their customers deliver better value.

At the core of all these changes is the ability to derive insights and value from the data of trains, rail infrastructure, and operations. In essence, this means automatically gathering and transmitting data from rail vehicles and rail infrastructure, providing the rail operator with an up-to-date view of the fleet, and using data to improve maintenance processes and predict upcoming failures. When data is used to its full extent, the availability of rail assets can be substantially improved, while the costs of maintenance are reduced significantly. This can allow rail operators to create new offerings for customers.

A good example is the high-speed train from Madrid to Barcelona, Spain. The rail operating company, Renfe, is successfully competing with airlines on this route. The train service brings passengers from city center to city center in 2.5 hours, compared to a pure flight time of 1 hour 20 minutes. Part of what makes this service so competitive is the reliability of the trains: Renfe promises passengers a full refund for a delay of more than 15 minutes. This performance is ensured by a highly professional service organization between Siemens and Renfe (a joint venture called Nertus), which uses sophisticated data analytics to detect upcoming failures and prevent disruptions to the scheduled service (full disclosure: I’m the director of mobility data services at Siemens).

Requirements of industrial data

In my role on the mobility data services team, I focus on creating the elements of a viable data-enabled business. Over the last 12 months, we have built a dedicated team for data-driven services, a functioning remote diagnostic platform, and a set of analytical models. The team consists of 10 data scientists, supported by platform architects, software developers, and implementation managers.

The architecture of the remote diagnostic platform is derived from the popular Lambda architecture, but adapted to the requirements of industrial data, which needs to be stored for long periods of time. This platform connects to vehicles and infrastructure in the field, displays the current status to customers, and stores the lifetime data in a data lake built on a Teradata system and a large Hadoop installation. On top of the data lake, a variety of analytics workbenches help our data scientists identify patterns and derive predictive models for operational deployment. The data volumes might not be as large as the click streams from popular websites, for example, but they still require a sophisticated platform for analysis. A typical fleet of regional trains would generate a few terabytes of data and around 100 billion data points per year. Now imagine that such data needs to be stored and accessed for 10 to 15 years, and you see the challenge.
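To get a feel for that scale, the figures above can be turned into a rough back-of-the-envelope calculation. The sketch below is purely illustrative: the value of 3 TB per year is an assumption standing in for "a few terabytes," and the function name is made up for this example.

```python
def retention_estimate(points_per_year, terabytes_per_year, years):
    """Return (total data points, total terabytes) accumulated over `years`."""
    return points_per_year * years, terabytes_per_year * years


POINTS_PER_YEAR = 100_000_000_000  # "around 100 billion data points per year"
TB_PER_YEAR = 3                    # assumption standing in for "a few terabytes"

for years in (10, 15):
    points, terabytes = retention_estimate(POINTS_PER_YEAR, TB_PER_YEAR, years)
    print(f"{years} years: {points:.1e} data points, about {terabytes} TB per fleet")
```

Under these assumptions, a single fleet accumulates on the order of a trillion data points over its retention period, and every additional fleet multiplies that again.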

The target of a large set of the analytical models is the prediction of upcoming component failures. However, such a prediction is not an easy task if you rely only on classical data mining approaches. First of all, rail assets are usually very reliable and do not fail very often. Preventive maintenance strategies make sure that failures are avoided and that the safety of passengers is secured. This all leads to a heavily skewed distribution in the data set, with far more examples of normal operation than of failure. Furthermore, the data usually contains significantly more possible features than failures, making it a challenge to avoid overfitting. Also, vehicle fleets are often rather small, with anything from 20 to 100 vehicles per fleet. Yet usable prediction models require high prediction accuracy and, especially, a very low rate of false failure predictions.
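A small example shows why a skewed data set makes plain accuracy a misleading measure and why the number of false failure predictions deserves separate attention. The data here is entirely hypothetical (1,000 observations with 5 failures), and the two "models" are hand-written prediction lists rather than anything we use in practice.

```python
def confusion_counts(actual, predicted):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    return tp, fp, fn, tn


def summarize(actual, predicted):
    """Return accuracy, precision, recall, and the number of false alarms."""
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    return {
        "accuracy": (tp + tn) / len(actual),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_alarms": fp,
    }


# Hypothetical data: 1,000 observations, only 5 of them real failures.
actual = [True] * 5 + [False] * 995

# A "model" that never predicts a failure at all.
never_alarm = [False] * 1000

# A "model" that catches 4 of the 5 failures but raises 20 false alarms.
some_model = [True] * 4 + [False] * 1 + [True] * 20 + [False] * 975

print("never_alarm:", summarize(actual, never_alarm))
print("some_model: ", summarize(actual, some_model))
```

The model that never raises an alarm scores 99.5% accuracy while being useless, and the second model finds most failures but sends maintenance crews out on 20 unnecessary inspections for every 4 justified ones. Both effects stay hidden if accuracy is the only number reported.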

We have worked extensively on these topics in my team and were able to create a first set of robust predictive models that are now being introduced to the market. Understanding the data provided by rail vehicles lies at the core of these prediction models. Such data can be in the form of error messages, log files, sensor data snapshots, or true time series sensor data. The semantics of this data needs to be well understood, and the interpretation of the data often depends on the situation of the vehicle at that time. Elements that need to be considered include whether multiple vehicles were coupled together, whether the vehicle was loaded, the topography of the track, and the direction in which the vehicle was heading. All of these aspects may visibly influence the interpretation of the data and help to separate the signal from the noise.

Many of the models we have developed also take into account physical processes in the assets we are examining. When it comes to predicting the failure of an air conditioning system, for example, some type of failure mode analysis is required, together with an analysis of how the physical system should behave (for example, in terms of air pressure, air flow, and temperatures).
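One way to picture this is to compare measured values against what a simple model of the physical system says they should be, and to watch the difference. The sketch below is a toy illustration and not one of our models: the fixed 10 °C cooling delta, the 3 °C tolerance, the function names, and the sample readings are all assumptions chosen for the example.

```python
def expected_supply_temp(return_temp_c, cooling_delta_c=10.0):
    """Toy engineering model: a healthy unit cools return air by a fixed delta."""
    return return_temp_c - cooling_delta_c


def residuals(return_temps, supply_temps):
    """Measured supply temperature minus the temperature the toy model expects."""
    return [m - expected_supply_temp(r) for r, m in zip(return_temps, supply_temps)]


def sustained_deviation(values, tolerance_c=3.0, min_run=3):
    """True if at least `min_run` consecutive residuals exceed the tolerance."""
    run = 0
    for value in values:
        run = run + 1 if value > tolerance_c else 0
        if run >= min_run:
            return True
    return False


# Hypothetical readings in degrees Celsius.
return_temps = [26.0, 26.0, 27.0, 27.0, 27.0, 28.0]
healthy_supply = [16.0, 16.0, 17.0, 17.0, 17.0, 18.0]
degrading_supply = [16.0, 16.5, 18.0, 21.0, 22.0, 23.0]

print(sustained_deviation(residuals(return_temps, healthy_supply)))
print(sustained_deviation(residuals(return_temps, degrading_supply)))
```

Requiring several consecutive deviations, rather than reacting to a single reading, is one simple way to keep a momentary disturbance (a door opening, for instance) from being read as a developing fault.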

How machine learning fits in

All of the approaches mentioned above rely heavily on deep interactions with rail engineering departments. Engineering models are often assessed in order to better separate normal from noteworthy behavior of the system under observation. These insights are used to define the structure of the machine learning system that is trying to predict an upcoming failure. This can be done by supplementing the defined features that describe the system or by providing additional boundary conditions for a support vector machine, for example. Very often, all of this cannot be mapped into a single model, but results in an ensemble of models working together to predict a failure and to avoid false failure predictions.
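As a rough illustration of the ensemble idea, the sketch below combines several simple models and raises an alarm only when enough of them agree, which is one straightforward way to suppress false failure predictions. It uses plain threshold rules and a vote count in place of trained models such as a support vector machine, and every feature name and threshold in it is hypothetical.

```python
from typing import Callable, Dict, List

Features = Dict[str, float]
Model = Callable[[Features], bool]


def make_threshold_model(feature: str, threshold: float) -> Model:
    """Build a toy model that predicts failure when a feature exceeds a threshold."""
    def model(features: Features) -> bool:
        return features[feature] > threshold
    return model


def ensemble_alarm(models: List[Model], features: Features, min_votes: int) -> bool:
    """Raise an alarm only if at least `min_votes` models predict a failure."""
    votes = sum(1 for model in models if model(features))
    return votes >= min_votes


# Hypothetical models, each looking at a different engineered feature.
models = [
    make_threshold_model("temp_residual_c", 3.0),
    make_threshold_model("error_messages_per_day", 5.0),
    make_threshold_model("compressor_current_a", 12.0),
]

# Hypothetical feature values for two vehicles.
vehicle_a = {"temp_residual_c": 4.5, "error_messages_per_day": 7.0, "compressor_current_a": 11.0}
vehicle_b = {"temp_residual_c": 1.0, "error_messages_per_day": 8.0, "compressor_current_a": 10.0}

print(ensemble_alarm(models, vehicle_a, min_votes=2))  # two models agree
print(ensemble_alarm(models, vehicle_b, min_votes=2))  # only one model fires
```

Raising `min_votes` trades missed failures for fewer false alarms, so the right setting depends on how costly an unnecessary inspection is compared with an unplanned breakdown.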

Our mobility data services team is still a young one, but we have been able to create and successfully apply some models already. We see strong interest from our customers, who are pushing to get ever more insights from the data coming from their vehicles or infrastructure. The value this data can provide is, of course, not limited to failure prediction. Many customers are now trying to identify how they can use this data to improve their own processes and operations, or how they can change their business models to improve their relations with passengers or freight customers. These developments are only the beginning, and there is much more to come: using data to improve mobility services is at the forefront of mobility technology.


For more, check out the Complete Video Compilation from Strata + Hadoop World London 2015, which includes Gerhard Kress’ session “The Internet of trains.”
