Data Analytics



One of our biggest problems with analytics is knowing how to start training a system that could, for example, flag heightened railway risk. Structured data sources are relatively easy to analyse; unstructured data sources (e.g. Word files, Excel sheets, images) are much more difficult, and combining multiple data sources is harder still, hence the need for a machine learning approach to the analysis. It became apparent that before attempting predictions we first had to understand past performance and represent accident causation thoroughly. We started by labelling the accident reports available on the RAIB website [6] using established NLP tools. RAIB have subsequently kindly shared their corporate memory tool with us to enable further analysis.
 
Pizza diagram (basis for a PhD we are sponsoring with MMU)
The figure above, “Proposed analytics and training approach”, illustrates the approach we are exploring for railway data analytics. First, railway technology and safety experts tag objects of interest in the accident and incident reports using the railway accident causation taxonomy we are developing. NLP and statistical analysis are then applied to the complete accident records to establish clusters of, and correlations between, accident causation factors. The object that looks like a pizza in Figure 5 is an initial cluster diagram from the analysis of RAIB online accident reports.
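
As a rough illustration of the correlation step, the sketch below counts how often pairs of causation tags appear together in tagged reports and scores each pair with a Jaccard overlap. The tag names, the sample reports and the functions `cooccurrence` and `jaccard` are hypothetical, invented for this example; they are not taken from the taxonomy under development or from RAIB data, and the Jaccard measure is only one of several possible association scores.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tagged reports: each set holds the causation tags that
# experts assigned to one accident report (illustrative values only).
reports = [
    {"signal_passed_at_danger", "driver_fatigue", "poor_visibility"},
    {"driver_fatigue", "poor_visibility"},
    {"track_defect", "inspection_missed"},
    {"track_defect", "inspection_missed", "poor_visibility"},
]


def cooccurrence(tagged_reports):
    """Count how often each unordered pair of tags appears in the same report."""
    counts = Counter()
    for tags in tagged_reports:
        # Sorting gives every pair a single canonical (a, b) ordering.
        for pair in combinations(sorted(tags), 2):
            counts[pair] += 1
    return counts


def jaccard(tagged_reports, a, b):
    """Share of the reports mentioning a or b that mention both."""
    both = sum(1 for tags in tagged_reports if a in tags and b in tags)
    either = sum(1 for tags in tagged_reports if a in tags or b in tags)
    return both / either if either else 0.0


pairs = cooccurrence(reports)
for (a, b), n in pairs.most_common(3):
    print(a, b, n, round(jaccard(reports, a, b), 2))
```

Pairs with a high overlap are candidates for the same cluster; a real analysis would need far more reports and a significance test before any relationship is treated as meaningful.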
 
This causation clustering and correlation gives a statistical relationship between the important causes in situations that led to accidents. The theory is that when these complex conditions are in place there is a heightened chance that a serious accident could happen, so the aim is to identify when they occur.
 
Having trained our system to look for heightened risk, we then stream railway data into the analytics engine: real-time and historical, structured and unstructured. Let us call it operational data. Our previous work [1][2] has established that data is available to flag heightened risk and that it can be linked to accident causation. Because of that link, the operational data serves as a proxy for the accident causes in the causation analysis.
 
The analysed operational data is compared with the data derived from the accident and incident records. If there is a similarity, a flag is raised. The system is then interrogated to determine whether we have met a false positive or a false negative, or whether we have indeed averted a potential accident. In simple terms, if the two pizzas look similar, then there is likely to be a heightened risk.
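
One way to picture the "two pizzas" comparison is as a similarity score between two weighted sets of causation factors. The sketch below uses cosine similarity and a fixed threshold. Both are assumptions made for illustration, as are the factor names, the weights, the 0.8 threshold and the function names; the text does not specify which similarity measure the system uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse factor-weight dictionaries."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as similar to nothing
    return dot / (norm_a * norm_b)


def flag_heightened_risk(operational, accident_profiles, threshold=0.8):
    """Return the names of accident profiles the operational data resembles."""
    return [
        name
        for name, profile in accident_profiles.items()
        if cosine_similarity(operational, profile) >= threshold
    ]


# Hypothetical profiles derived from accident records (illustrative only).
accident_profiles = {
    "fatigue_cluster": {"driver_fatigue": 1.0, "poor_visibility": 1.0},
    "track_cluster": {"track_defect": 1.0, "inspection_missed": 1.0},
}

# Hypothetical profile derived from current operational data.
operational = {"driver_fatigue": 0.9, "poor_visibility": 0.8, "track_defect": 0.1}

print(flag_heightened_risk(operational, accident_profiles))
```

In this toy case the operational profile closely matches the fatigue-related profile and barely overlaps with the track-related one, so only the former would raise a flag. The threshold controls the trade-off between false positives and false negatives.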
 
The system will learn from accidents and incidents on an ongoing basis and become more accurate as it receives feedback and more data. Clearly, to begin with, this type of system would have to be run in parallel with the existing safety management system until sufficient confidence is built up.
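
During such a parallel run, confidence could be tracked by tallying how reviewed flags turned out. The sketch below is a minimal example under assumed inputs: each outcome is a pair (flag raised, risk confirmed by expert review), and the function name `review_metrics` and the sample outcomes are hypothetical.

```python
def review_metrics(outcomes):
    """Summarise reviewed cases given as (flag_raised, risk_confirmed) pairs."""
    tp = sum(1 for flagged, confirmed in outcomes if flagged and confirmed)
    fp = sum(1 for flagged, confirmed in outcomes if flagged and not confirmed)
    fn = sum(1 for flagged, confirmed in outcomes if not flagged and confirmed)
    # Precision: share of raised flags that were genuine.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: share of genuine risks that were flagged.
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


# Hypothetical review outcomes from a parallel-running period.
outcomes = [(True, True), (True, False), (False, True), (True, True), (False, False)]
metrics = review_metrics(outcomes)
print(metrics)
```

Watching precision and recall over successive review periods is one simple way to judge whether the system is becoming more accurate as feedback accumulates.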