Wil van der Aalst, the foremost expert in workflow and process mining, spoke this morning on the overlap between Data Science and Business Process Management, and showed how process mining is the super glue between them. What follows are the notes I took at the event.
Data science is a rapidly growing field. As evidence he mentioned that Philips currently has 80 openings for data scientists, and plans to hire 50 more every year over the next few years. That is probably a lot more than for computer scientists. Four main questions for data science:
- what happened?
- why did it happen?
- what will happen in the future?
- what is the best that could happen?
These are the fundamental questions of data science, and they are incredibly important. A good data scientist is not just a computer scientist, not just a statistician, not just a database expert, but a combination of nine or ten different disciplines.
People talk about Big Data, and usually move on to MapReduce, Hadoop, etc. But this is not the key: he calls that “Big Blah Blah”. Process is the important subject. The reason for mining data is to improve the organization or the service it provides. For example, improve the functioning of a hospital by examining its data, or improve the use of X-ray machines. (Yes, that is him, at the right, hard at work solving the problems of X-ray machines.)
Process mining breaks out into four fields. The first is process model analysis. Then there is the data mining world, which focuses on the data without consideration of the process. The third area is performance: questions about how well the process is running. The last area is compliance: how often is the process being done correctly or incorrectly?
He showed an example of a mined process. It seems that ProM can output SVG animations that can be played back later, showing the flow of tokens through the process. He talked about the slider in ProM that increases or decreases the complexity of the displayed diagram by selecting or deselecting differing amounts of the unusual traces. They also show particular instances of a process using red dashed lines placed on top of the normal process in blue solid lines. He reminded everyone that the diagrams were not modeled, but mined directly from the data without human input.
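The core idea behind mining a process diagram from data can be sketched simply: count how often one activity directly follows another across all the recorded traces. This is only a minimal illustration of a directly-follows graph (the starting point of many discovery algorithms), not ProM's actual implementation, and the toy log below is invented for the example.

```python
from collections import Counter

def directly_follows(traces):
    """Count how often activity a is directly followed by b across all traces."""
    dfg = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            dfg[(a, b)] += 1
    return dfg

# Toy event log: each trace is the ordered activity sequence of one case.
log = [
    ["register", "check", "approve", "notify"],
    ["register", "check", "reject", "notify"],
    ["register", "check", "approve", "notify"],
]

print(directly_follows(log))
```

The counts on the arcs are what the complexity slider works against: dropping low-frequency arcs simplifies the picture.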
Process mining is quite a bit more appealing to business people than pure process modeling because it has real performance measures in it. IT people are also interested because the analytic output relates to real-world situations. Process mining works at design time, but it also works at run time. You can mine processes from event streams as they are being created.
There will be more and more data to mine in the future. Internet of Things: your shaving device will be connected to the internet. Even the baby's teething ring will be connected, so that parents will know when the baby is getting teeth.
He showed an ER diagram of key process mining concepts. Mentioned specifically the XES event format.
Can you mine SAP? Yes, but a typical SAP installation has tens of thousands of tables. You need to understand the data model. You need to scope and select the data for mining. This is a challenge. You need to flatten the event data into a nice log table, with case id (instance id), event id, timestamp, activity name, and other attributes. That produces a flat model without complicated relationships. People very seldom look at more complicated models with many-to-many relationships, and this remains one of the key challenges.
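The flat log table he described can be shown concretely: one row per event with case id, event id, timestamp, and activity, which then groups into one trace per case. A minimal sketch with invented booking data (the column names and rows are assumptions for illustration):

```python
import csv
import io

# A flat event log in the shape he described: one row per event, with
# case id, event id, timestamp, and activity name. Rows are already in
# timestamp order here; a real extract would need sorting.
raw = """case_id,event_id,timestamp,activity
B1,e1,2014-06-01T09:00,create booking
B1,e2,2014-06-01T09:05,select seats
B1,e3,2014-06-01T09:07,pay
B2,e4,2014-06-01T10:00,create booking
B2,e5,2014-06-01T10:02,cancel
"""

def traces_by_case(flat_log):
    """Group events into one ordered trace per case (process instance)."""
    traces = {}
    for row in csv.DictReader(io.StringIO(flat_log)):
        traces.setdefault(row["case_id"], []).append(row["activity"])
    return traces

print(traces_by_case(raw))
```

Everything relational (bookings, payments, seats) has been collapsed into the single `case_id` column, which is exactly where the many-to-many information gets lost.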
Gave an example of booking tickets for a concert venue. It is easy to extract the events that occurred. The hard part is to understand what questions you want to ask about the events. The first choice is to decide what the process instance id is from all the things going on. If the process is the lifecycle of a ticket, that would be one process model. If it is the lifecycle of the seat, you get a different process model. Or the lifecycle of a booking: yet another process is generated. If we focus on the lifecycle of a ticket, then process mining is complicated by the fact that multiple tickets may share the same booking, and the same set of payments. What if a band cancels a concert? That would affect many tickets and many bookings.
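The point that choosing a different instance id yields a different process from the exact same events can be shown in a few lines. The concert-booking events below are hypothetical, invented only to illustrate the idea:

```python
# Each event carries several candidate instance ids; choosing a different
# one as the "case id" yields a different process from the same data.
events = [
    {"booking": "B1", "ticket": "T1", "activity": "book"},
    {"booking": "B1", "ticket": "T2", "activity": "book"},
    {"booking": "B1", "ticket": "T1", "activity": "pay"},
    {"booking": "B1", "ticket": "T2", "activity": "pay"},
    {"booking": "B1", "ticket": "T1", "activity": "print ticket"},
]

def traces(events, case_key):
    """Group the same events into traces, using the chosen key as case id."""
    out = {}
    for e in events:
        out.setdefault(e[case_key], []).append(e["activity"])
    return out

print(traces(events, "booking"))  # one long trace for booking B1
print(traces(events, "ticket"))   # two shorter ticket lifecycles, T1 and T2
```

Swap `case_key` and the mined model changes, even though not a single event changed.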
Another classical example is Amazon, where you might look at order lines, orders, and/or deliveries. I can order 2 books today and 3 more tomorrow, and they may come in 4 different shipments spread over the next few weeks. Try to draw a process model of this using BPMN? Very difficult. You need to think clearly about this before you start drawing pictures.
Data quality problems. There may be missing data, incorrect data, imprecise data, and additional irrelevant data. He gave examples of these for process instances (cases), events, and other attributes. So in summary, three main challenges: finding the data, flattening the data, and data quality problems.
He gave 12 guidelines for logging (G4L) so that systems are designed to capture high-quality information in the first place, so that big data might be able to make use of it later.
Process mining and conformance checking try to say something about the real process, but all you can see are “examples” of the existing process. There is a difference between the examples and the real process. We cannot know what the real process is when we have not seen all possible examples. If you look at hospital data, there may be one patient who was 80 years old, drunk, and had a problem. This example may or may not say something about how other people are handled.
- True Positives: traces possible in the model, and also possible in the real process
- True Negatives: not possible in the model, and not found in real life
- False Positives: traces that are possible in the model, but cannot (or did not) happen in reality
- False Negatives: traces not possible in the model, but which happen in real life.
He showed a Venn diagram of this. You can try to apply precision metrics to process mining, but you can't do much. Your process log only contains a fraction of what is really possible. From this sample you can look at what matches the model or not, and that gives you some measure of the log file, but not necessarily of reality. An event log will never say “this cannot happen.” You only see positive examples. If you look at a sample of university students, MOST students will follow a unique path. If you look at hospital patients, most will follow a unique path. It is hard then to talk about the fraction that fits a particular process. Consider a silicon wafer test machine: you have one trace with 50,000 events. No two traces will match exactly with this number of events.
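A crude fitness-style measure over a log can still be computed: the fraction of observed traces the model can replay. Real conformance checking uses alignments and replay, so the membership test below is only a sketch of the idea that a log supplies positive examples to compare against a model; the model and log are invented.

```python
def trace_fitness(log, model_language):
    """Fraction of log traces that the model allows.

    model_language is a set of allowed traces; a naive stand-in for a
    real process model, which would be checked by replay or alignments.
    """
    fitting = sum(1 for trace in log if tuple(trace) in model_language)
    return fitting / len(log)

model = {("a", "b", "c"), ("a", "c", "b")}            # traces the model allows
log = [["a", "b", "c"], ["a", "c", "b"], ["a", "b"]]  # observed traces

print(trace_fitness(log, model))  # 2 of 3 traces fit
```

Note what this cannot measure: the log says nothing about the model's false positives, the traces it allows that never happen, which is exactly the point about only seeing positive examples.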
You are never interested in making a model that fits 100% of the event log. If you had a model that contained all possible traces, it would not be very useful. He used an analogy of the four forces on an airplane: lift, drag, gravity, thrust. Lift = fitness (ability to explain the observed behavior), gravity = simplicity (Occam's Razor), thrust = generalization (avoid over-fitting), and drag = precision (avoid under-fitting). Different situations require differing amounts of each.
Everything so far has been about one process. How then do you go to multiple processes? Suppose we look at the study behavior of Dutch students and international students. We might find that the behavior of Dutch students is usually different from that of international students. Comparative process mining allows you to mine parts of the process and show the two processes side by side. You are interested in differences in performance and differences in conformance. This leads to the notion of a process cube, with dimensions of time, department, location, amount, gender, level, priority, etc. You can go to the database, extract with a particular filter, and generate the process, but this is tedious. The solution is to put everything in a process cube, and then apply process mining to slices of the cube. For example, a car rental agency looking at three different offices, three different time periods, and three different types of customers. He gave a real example of building permits in different Dutch municipalities.
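The process cube idea reduces to a simple operation: each event carries dimension attributes, a slice filters on some of them, and any miner then runs on the remaining sub-log. A minimal sketch with invented car-rental events (the dimensions and values are assumptions for illustration):

```python
# Events annotated with cube dimensions (office, year) alongside the
# usual case id and activity.
events = [
    {"office": "Utrecht", "year": 2013, "case": "c1", "activity": "pick up"},
    {"office": "Utrecht", "year": 2013, "case": "c1", "activity": "return"},
    {"office": "Delft",   "year": 2014, "case": "c2", "activity": "pick up"},
    {"office": "Delft",   "year": 2014, "case": "c2", "activity": "damage claim"},
]

def slice_cube(events, **dims):
    """Keep only events matching every given dimension value."""
    return [e for e in events if all(e[k] == v for k, v in dims.items())]

utrecht_2013 = slice_cube(events, office="Utrecht", year=2013)
print([e["activity"] for e in utrecht_2013])
```

Mining each slice separately and placing the resulting models side by side is what makes the comparison (Dutch vs. international students, office vs. office) cheap once the cube exists.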
He records all his lectures, and the students can watch the lectures off-line. There is a lot of interesting data, because they know which parts of the lectures students watch multiple times. Students can control the speed of playback, so they can look at which parts students typically play faster. They are correlating this with grades at the end of the course. They can compare students from different origins and see how they differ. Standard OLAP techniques do not generally work here because we are dealing with events. He showed a model of students who passed versus students who failed. For students who passed, the most likely first event is “watch lecture 1”. For students who failed, the most likely first event is “take an exam”. (Only after failing do they go back and watch the lectures.)
In conclusion: many of these things are mature enough to use in an industrial situation. But there are many challenges mentioned. There is a MOOC on Coursera on Process Mining this fall. There are 3000 registered students, and it will start in October.
In many years at SAP I have not seen a lot of reflection on past decisions. Is this really going to be used? SAP is not designed well to capture events. If you go to a hospital, things are much easier to mine, even if the systems are built ad hoc. Also, there is a lack of maturity in process mining. You really need to be trained, and you need to see it work.
Philosophically, does the nature of the process really matter? It is crucial that you isolate your notion of a process instance. Once you have identified the process you have in mind, the process mining will work well. But there is a broad spectrum of process types. There are spaghetti processes and lasagna processes. A lasagna process is fairly structured, and process mining of the overall process is not interesting, because people already know it; instead you want to look at bottlenecks. For spaghetti processes every trace is unique, and the value comes from an aggregate overview of the process and the exceptions.
Is the case management metaphor more valuable than the process management metaphor? This is an illustration that the classical workflow metaphor is too narrow. The problem is that in reality there are many-to-many relationships, but when we go to the model we have to simplify. It is quite important for this community to bridge this gap. This is probably the main reason that process modeling formats have not become standard: they are too simple. For example, using the course data, there is a model of the student, and a completely different model of the course, coming from the exact same data.
About real-time event detection: how do you construct a sliding window of events to mine? How does mining relate to complex event processing? Event correlation: how do you translate lower-level things into higher-level things? Generating a model is extremely fast, so this can be done in near real time. Map-reduce could be used to distribute some of the processing. On the other hand, conformance checking is extremely expensive; the complexity of that problem remains an issue. We are developing online variants of process mining which no longer require storing the entire event log.
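The sliding-window idea mentioned in the answer can be sketched in a few lines: keep only the most recent N events of the stream and recompute directly-follows counts per case on demand, so the full event log is never stored. This is a toy illustration of the principle, not the actual online algorithms his group is developing; the class and its names are invented.

```python
from collections import Counter, deque

class StreamingMiner:
    """Toy online miner: a bounded window over a live event stream."""

    def __init__(self, window_size=1000):
        self.window = deque(maxlen=window_size)  # oldest events fall out

    def observe(self, case_id, activity):
        self.window.append((case_id, activity))

    def directly_follows(self):
        """Directly-follows counts per case, using only windowed events."""
        dfg, last_seen = Counter(), {}
        for case_id, activity in self.window:
            if case_id in last_seen:
                dfg[(last_seen[case_id], activity)] += 1
            last_seen[case_id] = activity
        return dfg

miner = StreamingMiner(window_size=4)
stream = [("c1", "a"), ("c1", "b"), ("c2", "a"), ("c1", "c"), ("c2", "b")]
for case, act in stream:
    miner.observe(case, act)
# The oldest event ("c1", "a") has fallen out of the 4-event window,
# so the a->b arc of case c1 is forgotten while recent arcs remain.
print(miner.directly_follows())
```

Memory is bounded by the window size regardless of how long the stream runs, which is the property that makes near-real-time mining feasible.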
What about end users? In model-driven engineering it is possible to incorporate end users into the engineering. How far are we from involving end users in process mining? There will probably be different types of end users. The first type will be data scientists, who do the analysis of the data and get the competitive advantage. Once educated, data scientists will have no problem leveraging process mining. There are other kinds of users that can be involved to varying degrees. For example, use a map of Germany as a metaphor. Some people are very interested in a map, but most people casually look and don't worry about it. But if you project data onto the map, then a lot more people are interested. The same with process maps: put information that is relevant to people on them, and people will become more interested and more involved.