Understanding the Cause of Faults in the Lean Factory

Understanding the causes of faults and defects, and then improving the system or process so that they do not recur, is central to lean manufacturing. This article looks at some of the methods used to identify the root cause of any issues so that you can prevent downtime and move toward zero-defect manufacturing. I’ll cover methods such as the five whys, failure modes and effects analysis (FMEA), and fault tree analysis (FTA).

The emphasis on understanding the root cause of failures began early in the Toyota Production System. Toyota instructed workers to stop the line whenever they identified a defect or problem. They would then ask the five whys to dive deeper into why the defect had occurred. The idea was that within the five whys you should get to the root cause. This was driven by a desire to eliminate waste from the production system. If a defective part is allowed to continue down the production line, the opportunity to understand why the defect occurred is lost. The mistake will probably be repeated, resulting in more waste. To avoid this situation, an Andon light is used to notify workers that a problem has occurred. In the original Toyota Production System, workers had a pull cord they could use to stop the line and activate the light. Automated machines may automatically trigger an Andon light if a fault is detected. This is a critical first step in understanding the cause of defects. It allows them to be dealt with immediately.

Once the root cause has been identified, the process or system should be improved so that the problem cannot occur again. This type of error proofing is known as poka-yoke within lean. The identification and elimination of root causes for defects and failures should be part of a process of continuous improvement carried out by the whole team, which is known as kaizen in lean.

The Five Whys

The five whys is the original method used to identify the root cause of failures within the Toyota Production System. This is a simple method where you ask the question, “Why did this fault occur?” Once you have identified the direct cause, you then ask “Why?” again. This time, the focus is on the reason for the cause just identified. This process can be repeated to dive deeper into the underlying, or root, cause. It was named the five whys because it was observed that this is how many iterations were required to get to the root cause for most problems. However, it isn’t intended to be prescriptive. The process should stop whenever the root cause is identified, and the root cause should be actionable.

The process can be understood with a simple example. Imagine that you find a defect in a machined part:

  1. Why did the defect occur? The cutting tool broke.
  2. Why did it break? There wasn’t any coolant.
  3. Why wasn’t there any coolant? The operator had not checked the coolant level.
  4. Why hadn’t they checked it? There is no written process saying they should check it.
  5. Why isn’t there a written process for this? The shop floor processes have not been fully documented.
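
The chain above can also be recorded as data, which makes it easy to store alongside a defect report. The following is a minimal Python sketch rather than a standard tool: the `Why` class and the `root_cause` helper are hypothetical names chosen for illustration, and the `actionable` flag is an assumption about how a team might mark the causes it can act on.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Why:
    """One step in a five whys chain: a question and the answer found."""
    question: str
    answer: str
    actionable: bool = False  # assumption: the team marks causes it can act on


def root_cause(chain: List[Why]) -> Optional[Why]:
    """Return the deepest actionable step in the chain, or None if there is none."""
    for step in reversed(chain):
        if step.actionable:
            return step
    return None


chain = [
    Why("Why did the defect occur?", "The cutting tool broke."),
    Why("Why did it break?", "There wasn't any coolant."),
    Why("Why wasn't there any coolant?",
        "The operator had not checked the coolant level."),
    Why("Why hadn't they checked it?",
        "There is no written process saying they should check it.",
        actionable=True),
    Why("Why isn't there a written process for this?",
        "The shop floor processes have not been fully documented.",
        actionable=True),
]

cause = root_cause(chain)
print(cause.answer if cause else "No actionable root cause recorded yet")
```

Stopping the chain after the fourth question would return the missing written process instead, which mirrors the point that the stopping place is a judgment call.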

As you can see, deciding when you have reached the root cause is subjective. The process could easily have stopped after the fourth why, identifying the lack of a written process as the root cause. The value of this process is that it should lead you to a corrective action that will make your processes much more robust, eliminating future defects and failures.

The aim of asking the five whys is to identify how the process can be made more robust. Therefore, it is important to direct the questioning toward controllable causes and avoid concluding that the root cause is something outside your control. This can sometimes be achieved by rephrasing the question as, “Why did the process fail?”

It should be noted that issues often have multiple root causes. These can be discovered by repeating the process but asking the questions in different ways. Ishikawa or fishbone diagrams are often used to identify multiple root causes and may be used together with a five whys approach.

Ishikawa or Fishbone Diagrams

An Ishikawa diagram, also known as a fishbone or cause-and-effect diagram, looks a bit like a fish, with the problem or effect at the head, a spine running horizontally, and the main categories of causes radiating out on both sides like fish bones. A number of standard headings are often used to provide an initial stimulus for the generation of ideas. Traditionally, the five M’s have been used: machine, method, material, manpower and measurement. These are often adapted to fit different organizations.

Figure 1: An Ishikawa diagram, also known as a fishbone or cause-and-effect diagram.

An Ishikawa diagram is simply a hierarchical diagram. It contains the same information structure as a tree or mind map. There is no reason you couldn’t instead use one of these diagrams to record the causes of an effect. In fact, other types of hierarchical diagram can be more convenient for representing the type of cascade of causes that a five whys analysis will uncover. It is, however, traditional to use the Ishikawa diagram to identify the causes of a problem.
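
Because the information is hierarchical, it can be held in any nested structure. As a rough illustration, the sketch below stores a fishbone as a nested Python dictionary and prints it as an indented tree. The effect, the causes listed under each of the five M’s, and the `tree_lines` helper are all made up for this example.

```python
# Hypothetical fishbone for a defective machined part.
# Top-level keys are the five M's; nested keys are illustrative causes.
# A cause with no recorded sub-causes maps to an empty dictionary.
fishbone = {
    "Machine": {"Cutting tool broke": {"No coolant": {}}},
    "Method": {"No written coolant check": {}},
    "Material": {"Hard spots in bar stock": {}},
    "Manpower": {"Operator not trained on coolant checks": {}},
    "Measurement": {"Gauge out of calibration": {}},
}


def tree_lines(node, depth=0):
    """Flatten a nested cause dictionary into indented text lines."""
    lines = []
    for cause, sub_causes in node.items():
        lines.append("  " * depth + "- " + cause)
        lines.extend(tree_lines(sub_causes, depth + 1))
    return lines


print("Effect: defective machined part")
print("\n".join(tree_lines(fishbone, depth=1)))
```

The same dictionary could be drawn as a fishbone, a tree or a mind map; only the presentation changes.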

Root Cause Analysis

The five whys and Ishikawa diagrams are both techniques that can be used for root cause analysis (RCA). Variation breakdown, or a thought map, is another similar method of identifying root causes. In some ways, it is preferable as it explicitly directs you to identify multiple causes at the top level, like an Ishikawa diagram, and directs you to drill all the way down to the root cause for each of these, like the five whys.

Figure 2: A simple thought map diagram can be a more useful way of identifying root causes than an Ishikawa diagram.

Process mapping may also be carried out as part of a root cause analysis to gain a deeper understanding of the process and all of its inputs. A cause-and-effect matrix and FMEA may also be used.

Failure Modes and Effects Analysis

Failure modes and effects analysis (FMEA) is an important method of understanding the potential causes of problems. It is often used proactively when designing a process. It evaluates the subjective likelihood and severity of different events using a table, much like a risk analysis. A number of different names are sometimes used, such as failure modes, effects and criticality analysis (FMECA) or process failure modes and effects analysis (PFMEA).

An FMEA should start with a system definition, which may involve the creation of a system block diagram. However, most of the work is often done by filling out a table. Many manufacturing companies have their own formats for FMEA tables, often created as an Excel spreadsheet. The first column should list the system components, or process steps for a PFMEA. For each of these, multiple failure modes are listed. Each failure mode may in turn have multiple effects. Finally, each effect may have multiple causes. There is, therefore, a hierarchy consisting of, from top to bottom:

  • Process step or system component
  • Multiple potential failure modes for each step or component
  • Multiple potential effects for each failure mode
  • Multiple potential causes of each failure mode.

Each cause has its own row with the process steps or system components, failure modes, effects and causes spanning a number of cause rows. Related to each cause of failure, there are additional columns used to input estimates for likelihood and severity; methods of mitigation, such as controlling the process through prevention or detection; other bespoke requirements; and columns that calculate combined values.

Figure 3: An FMEA is often completed using a spreadsheet with a standard company format. Note the hierarchy of rows: each process step has multiple failure modes, each failure mode may have multiple effects, and each effect may have multiple causes.
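
To make the row structure concrete, here is a minimal Python sketch that flattens the hierarchy into one row per cause and computes a combined value. It assumes the combined value is the commonly used risk priority number (RPN), the product of severity, occurrence (likelihood) and detection ratings on scales of 1 to 10. Company formats differ, and the process step, the ratings and the `FmeaRow` class are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class FmeaRow:
    """One FMEA row: a single cause with its ratings (1 = best, 10 = worst)."""
    step: str
    failure_mode: str
    effect: str
    cause: str
    severity: int    # how bad the effect is
    occurrence: int  # how likely the cause is
    detection: int   # how hard the cause is to detect in time

    @property
    def rpn(self) -> int:
        # Assumption: the combined value is the risk priority number.
        return self.severity * self.occurrence * self.detection


# Hypothetical hierarchy: step -> failure mode -> effect -> causes with ratings.
hierarchy = {
    "Milling": {
        "Cutting tool breaks": {
            "Defective part": [
                ("No coolant", 7, 4, 3),
                ("Worn tool", 7, 5, 2),
            ],
            "Machine downtime": [
                ("No coolant", 5, 4, 3),
            ],
        },
    },
}

# Flatten the hierarchy so that each cause gets its own row.
rows = [
    FmeaRow(step, mode, effect, cause, sev, occ, det)
    for step, modes in hierarchy.items()
    for mode, effects in modes.items()
    for effect, causes in effects.items()
    for cause, sev, occ, det in causes
]

# Highest RPN first, so the largest risks are reviewed first.
for row in sorted(rows, key=lambda r: r.rpn, reverse=True):
    print(f"{row.rpn:4d}  {row.step} | {row.failure_mode} | {row.effect} | {row.cause}")
```

In a spreadsheet, the step, failure mode and effect cells would be merged across the cause rows they span; the flat list here repeats them on every row instead.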

The structured nature of an FMEA can really help draw out ideas and identify methods of preventing problems from occurring in the future. This method has become widely used within the manufacturing industry. It can, however, be overly time consuming, forcing the user to spend time considering insignificant possibilities. It also does not allow

Failure modes, effects and criticality analysis (FMECA) is an excellent hazard analysis and risk assessment tool, but it suffers from other limitations. It does not consider combined failures, and it typically does not include software and human interaction considerations. It also usually provides an optimistic estimate of reliability. Therefore, FMECA should be used in conjunction with other analytical tools when developing reliability estimates.

Fault Tree Analysis

Fault tree analysis (FTA) is a rigorous way of quantitatively assessing the causes of faults. It is typically used in safety and reliability engineering, especially within aerospace, nuclear power and chemical processing. Whereas FMEA is a qualitative assessment, FTA uses probabilities of individual events, combined with Boolean logic, to give an overall probability of system failure. This is a top-down process in which you start with the possible system failure and work down through the causes that could lead to it. As you move down through the lower levels, they are connected back to the system failure at the top through a network of Boolean logic. This provides a quantitative understanding of how a system could fail, leading to the identification of optimal methods of reducing this risk.

For each possible system failure condition, the severity is first determined to establish the extent of analysis required. The most severe failure conditions should be evaluated using a full FTA. For each of these, the system failure condition is written at the top of the chart, and a fault tree is drawn below it. The fault tree shows different types of events that might contribute to the failure condition. The Boolean logic shows how these would combine or cascade to result in the failure.

The following types of event are used in fault tree analysis:

  • Basic events are the lowest level of events, which cannot be developed any further. They may be considered root causes: asking “Why?” won’t generate any useful underlying reasons why this event happened.
  • Undeveloped events are events that have not been developed any further but may have the potential to be.
  • Intermediate events are events that come in between the failure condition and the root cause.
  • Transfer events are used to continue a tree on another diagram when the tree is too large to view as a single diagram.

Figure 4: Event symbols used in fault tree analysis.

Events are connected using two main types of Boolean logic gates: AND gates and OR gates. An AND gate is used when the output event occurs only if all the input events occur. An OR gate is used when the output event will occur if any one of the input events occurs. The simple example used for the previous types of analysis will clearly show this principle.

Figure 5: A simple fault tree analysis.

More complex systems may also include exclusive OR, priority AND and inhibit gates. An exclusive OR causes the output event if exactly one input event occurs. A priority AND causes the output event if both input events occur in a specific sequence. An inhibit gate results in the output event if the input event occurs while a specified conditioning event is satisfied.
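
The gate definitions above can be written as small Boolean functions. This is only a sketch in Python: the function names are hypothetical, the priority AND is restricted to two inputs described by the times at which they occurred, and real FTA tools may define these gates in more detail.

```python
from typing import Optional, Sequence


def and_gate(inputs: Sequence[bool]) -> bool:
    """Output occurs only if all input events occur."""
    return all(inputs)


def or_gate(inputs: Sequence[bool]) -> bool:
    """Output occurs if any input event occurs."""
    return any(inputs)


def xor_gate(inputs: Sequence[bool]) -> bool:
    """Output occurs if exactly one input event occurs."""
    return sum(bool(x) for x in inputs) == 1


def priority_and_gate(time_a: Optional[float], time_b: Optional[float]) -> bool:
    """Output occurs if both events occurred and A happened before B.

    Each argument is the time at which the event occurred, or None if it did not.
    """
    return time_a is not None and time_b is not None and time_a < time_b


def inhibit_gate(input_event: bool, condition: bool) -> bool:
    """Output occurs if the input occurs while the conditioning event holds."""
    return input_event and condition


print(and_gate([True, False]), or_gate([True, False]), xor_gate([True, True]))
```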

Using FTA, it is possible to model complex chains of events leading to failure. When probabilities are assigned to the basic events and the undeveloped events, it is possible to calculate the probability of the system failure condition.
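
As a minimal sketch of that calculation, the Python below evaluates a small tree built from AND and OR gates. It assumes that all basic events are statistically independent, so an AND gate multiplies its input probabilities and an OR gate gives one minus the product of the complements. The tree structure and the probabilities are invented for illustration and are not taken from the figure.

```python
import math
from typing import Tuple, Union

# A node is either a basic-event probability or a pair (gate, [child nodes]).
Node = Union[float, Tuple[str, list]]


def probability(node: Node) -> float:
    """Probability of a fault tree node, assuming independent basic events."""
    if isinstance(node, (int, float)):
        return float(node)
    gate, children = node
    probs = [probability(child) for child in children]
    if gate == "AND":
        # All inputs must occur: multiply the probabilities.
        return math.prod(probs)
    if gate == "OR":
        # At least one input occurs: one minus the chance that none occur.
        return 1.0 - math.prod(1.0 - p for p in probs)
    raise ValueError(f"Unknown gate: {gate}")


# Hypothetical tree for the top event "cutting tool breaks".
no_coolant = ("AND", [0.05, 0.20])        # coolant runs low AND level not checked
tool_breaks = ("OR", [no_coolant, 0.02])  # no coolant OR tool worn out

print(f"P(no coolant)  = {probability(no_coolant):.4f}")
print(f"P(tool breaks) = {probability(tool_breaks):.4f}")
```

The independence assumption matters: if the same basic event appears in more than one branch, these simple gate formulas no longer give the correct result and a more careful method is needed.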

Conclusions

Different methods can be used to understand the cause of faults and defects. One approach is to do this work reactively after a problem is detected. The five whys and RCA are both normally used in this way. Proactive, or preventive, analysis may also be carried out to identify the possible causes of faults before they happen. FMEA is often used in this way. Ideally, FMEA should be carried out for new processes to proactively eliminate potential causes of failure. Reactive analysis should also be carried out if any problems are encountered. For safety-critical processes, more rigorous methods, such as FTA, may be required.