Preventing Equipment Failure

“Overcoming human error and variability”

Why equipment fails

Equipment fails when one of the components fails. For example, a pump fails when a bearing or impeller fails. To understand how we can prevent failure, we must understand how the failure of these components occurs.

There are four fundamental causes of component failure:

Design error
Overstress (Overload, fatigue)
Wear
Human error

Design Error

Once a piece of equipment has been misdesigned, rectifying it is not easy. An example is an incorrectly designed chute or bin, which will wear at an accelerated rate or block up constantly. Such equipment or systems will be maintenance intensive and will have a high operating cost for life. The only real solution is to replace them or rectify the design, provided the cost can be justified.

Overstress

Overstress failure occurs when the stress applied to the equipment or component exceeds its strength. For example, a shaft fractures when the applied torque exceeds its strength, or a bolt fails when cyclic loading (stress) exceeds its endurance limit (strength).

The ratio of strength to stress is known as the safety factor. However, neither the strength nor the load is a discrete value; both are distributed statistically (see Figure 1). Where the distributions overlap, failure can occur, and the larger the overlap, the higher the probability of failure. Failure would be improbable if equipment were overengineered, as the curves would be far apart. In reality, the trend is to narrow this gap to reduce manufacturing costs.

Figure 1 Load and Strength Distributions a) no overlap b) overlapping distributions (failure) [1, p. 5]

We cannot control the variability of the strength, since it was fixed during design and manufacture. We can, however, control the variability of the load to ensure it does not exceed the strength of the equipment or component.
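The overlap in Figure 1 can be put into numbers. The sketch below assumes that load and strength are independent and normally distributed, which is an assumption made here for illustration only (real distributions are often skewed). Under that assumption, the probability that load exceeds strength follows from the distribution of their difference. The function name and all values are hypothetical.

```python
import math

def failure_probability(mean_strength, sd_strength, mean_load, sd_load):
    """Probability that load exceeds strength for independent normal variables."""
    # Strength minus load is itself normal with this mean and standard deviation.
    mean_diff = mean_strength - mean_load
    sd_diff = math.sqrt(sd_strength**2 + sd_load**2)
    # Safety margin: number of combined standard deviations separating the means.
    safety_margin = mean_diff / sd_diff
    # P(strength - load < 0) = Phi(-safety_margin), written with the error function.
    return 0.5 * (1.0 + math.erf(-safety_margin / math.sqrt(2.0)))

# Hypothetical shaft: identical mean strength and mean load,
# but two different levels of load variability.
wide = failure_probability(500.0, 30.0, 350.0, 60.0)
narrow = failure_probability(500.0, 30.0, 350.0, 20.0)
print(f"wide load spread:   {wide:.4%}")
print(f"narrow load spread: {narrow:.6%}")
```

With the means unchanged, reducing the spread of the load alone lowers the computed failure probability by several orders of magnitude, which is the reason for focusing on load variability.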

Variability

If we ask 100 people to tighten a bolt, we will probably get 100 different results. If we plot the distribution, it will look something like the green curve in Figure 2. But if we give them the same calibrated torque wrench and a standard procedure for using it, the distribution will look more like the blue curve in Figure 2. If we want to narrow the curve further, we could use hydraulic tensioning or strain gauges: the more accurate and standardised the method, the less variability in the outcome.

Another good example is oil sampling. For effective equipment health monitoring, the sample needs to be consistent, i.e. taken by a trained technician every time, from the same location, using the same method. That is why it is recommended to have a dedicated lube technician. Having different technicians with different levels of training and different methods leads to a larger variation in the oil analysis results, which makes it harder to take the right action.

Figure 2 Two distributions with the same mean but different variability (Wikipedia)
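To illustrate the effect of narrowing the curve, the sketch below estimates what share of bolts would end up outside an acceptable torque band for three tightening methods. It assumes a normal spread around the target, and the target, tolerance and standard deviations are hypothetical values chosen for illustration, not measured data.

```python
from statistics import NormalDist

def fraction_out_of_tolerance(target, sd, tolerance):
    """Share of results outside target +/- tolerance, assuming a normal spread."""
    dist = NormalDist(mu=target, sigma=sd)
    inside = dist.cdf(target + tolerance) - dist.cdf(target - tolerance)
    return 1.0 - inside

TARGET_NM = 200.0     # hypothetical specified torque
TOLERANCE_NM = 20.0   # hypothetical acceptable band (+/- 10%)

# Hypothetical standard deviations for three tightening methods.
methods = {
    "by feel": 40.0,
    "calibrated torque wrench": 10.0,
    "hydraulic tensioning": 4.0,
}
for name, sd in methods.items():
    share = fraction_out_of_tolerance(TARGET_NM, sd, TOLERANCE_NM)
    print(f"{name}: {share:.2%} outside tolerance")
```

The mean is the same in every case; only the spread changes, and with it the share of results that fall outside the band.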

Wear

Wear is a well-known mechanism: the degradation of the equipment or component over time, which results in a reduction of strength. Examples are abrasive wear, corrosion and fatigue.

In maintenance, we have three options for managing wear:

Replace the component before it fails (preventive maintenance)
Re-engineer the component with a more wear-resistant material to increase the mean time between failures (MTBF)
Understand the wear mechanism and the factors that increase the wear rate, then re-design the component to reduce wear, for example by changing the geometry of a chute to reduce the impact velocity or by adding a rock box (re-design)

Another type of wear is normal aging. Equipment has a finite life, and we can expect more component failures as it ages. To prepare for this, it is essential to do a lifecycle analysis at the start of the project and budget for replacement.
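As a rough illustration of the first option (replacing a component before it fails), the sketch below picks a replacement age from a two-parameter Weibull life model. The Weibull model is an assumption introduced here, not something prescribed above, and the characteristic life, shape parameter and target reliability are hypothetical; in practice they would come from the site's own failure history.

```python
import math

def weibull_reliability(t, eta, beta):
    """Probability that a component survives to age t (two-parameter Weibull)."""
    return math.exp(-((t / eta) ** beta))

def replacement_age(target_reliability, eta, beta):
    """Age at which the survival probability has dropped to the target value."""
    return eta * (-math.log(target_reliability)) ** (1.0 / beta)

ETA_HOURS = 8000.0   # hypothetical characteristic life
BETA = 3.0           # hypothetical shape parameter; beta > 1 indicates wear-out

age = replacement_age(0.90, ETA_HOURS, BETA)
print(f"replace at about {age:.0f} h for a 90% chance of surviving to replacement")
```

A more wear-resistant material or a better design would show up in this model as a larger characteristic life, which pushes the replacement age out for the same target reliability.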

Human Error

“Failures are caused primarily by people (designers, suppliers, assemblers, users, maintainers). Therefore, the achievement of reliability is essentially a management task to ensure that the right people, skills, teams and other resources are applied to prevent the creation of failures.” [1, p. 17]

Nobody goes through life without making mistakes; making errors is an intrinsic part of being human. A maintenance technician, being human, will make mistakes at work, and most of them will go unnoticed. But unless processes and systems are in place to identify and mitigate those errors, sooner or later one of them will cause a catastrophic equipment failure.

Some of the factors that contribute to human error are:

Time pressure
Poor quality documentation
Poor housekeeping
Ineffective communication
Fatigue
Inadequate tools or equipment
Inexperience

According to Boeing, in the early days of aviation, 80% of accidents were caused by the machine and 20% by human error. Today, 80% of airplane accidents are due to human error and only 20% to equipment failure [2]. Furthermore, Boeing found that most errors are associated with reassembly and installation. The leading causes of in-flight engine shutdown (IFESD) included: [3]

Incomplete installation: 30%
Damage on installation: 5%
Improper installation: 11%
Equipment not installed or missing: 5%
Improper fault isolation, inspection, test: 6%
Equipment not activated or deactivated: 4%

Eliminating human error is impossible, but we can create processes and systems to prevent or at least identify them before they cause a catastrophic failure. Below are two tools that can help reduce errors.

Improve Procedures

Most organisations have procedures, but they are not always used. The main reason is the quality of the documents: wrong or missing information, and unclear, vague or wordy task descriptions.

In a survey of 400 operators and managers in the petrochemical industry [4], only 10% said they used procedures for routine maintenance. The main reasons given for not using procedures were:

If followed to the letter, the job wouldn’t get done
People are not aware that a procedure exists
People prefer to rely on their skills and experience
People assume they know what is in the procedure

When asked about strategies for improving the use of procedures, the answers were:

Involving users in the design of procedures (highest rated)
Writing procedures in plain English
Updating procedures when plant and working practices change
Ensuring that procedures always reflect current working practices

Improving every procedure is a massive task, but you can start with the critical tasks, especially the more complex ones, and you must involve the users.

Quality Independent Inspections

The best process you can implement to prevent human error is Independent Inspection. This is the most underutilised process by maintenance departments in mining.

Most errors happen during installation, towards the end of the task or around handover. Some of these errors or omissions are:

Fasteners not tightened
Missing gasket
Missing O-ring
Missing caps or lids
Rags, tools or parts not removed from the equipment before boxing up

Another reason to do a final inspection is to pick up incorrectly installed components, such as:

Bearings
Lip seals
O-rings
Unidirectional Valves
Instrumentation such as flowmeters, etc.

For an independent inspection to be effective, it needs to meet the criteria below:

It has to happen at the right time. Checks are sometimes left until the end of the job, when inspection is no longer possible. An O-ring, for example, is no longer visible once reassembly has moved on to the next step.
It has to be independent. The check cannot be done by the person who did the job, and if a tool is required, it cannot be the tool the technician used for the task. If a bolt torque is being checked, for instance, the torque wrench used by the first technician could be out of calibration.
It has to be carried out by a competent person. It does not have to be an external inspector; a team member can do it. If a test needs to be verified, an engineer can do the check. If the torque on a 2” bolt needs to be inspected, a supervisor or a fitter can do it.
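The three criteria above lend themselves to a simple automated check on an inspection record, for example in a work-order system. The sketch below is only an illustration: the record fields, the function name and the example values are all hypothetical and would need to be mapped to whatever the maintenance system actually stores.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class InspectionRecord:
    """Hypothetical record of one independent inspection hold point."""
    task_technician: str
    inspector: str
    task_tool_id: Optional[str]        # tool used to do the job, if any
    inspection_tool_id: Optional[str]  # tool used to check the job, if any
    inspector_competent: bool          # signed off as competent for this check
    item_still_accessible: bool        # could the item be seen or measured?

def inspection_problems(record: InspectionRecord) -> List[str]:
    """Return the criteria this inspection fails; an empty list means it passes."""
    problems = []
    if not record.item_still_accessible:
        problems.append("wrong time: item no longer accessible")
    if record.inspector == record.task_technician:
        problems.append("not independent: inspector did the job")
    if (record.inspection_tool_id is not None
            and record.inspection_tool_id == record.task_tool_id):
        problems.append("not independent: same tool used for task and check")
    if not record.inspector_competent:
        problems.append("inspector not competent for this check")
    return problems

# Hypothetical example: the fitter re-checks their own bolts with the same wrench.
record = InspectionRecord("fitter A", "fitter A", "TW-017", "TW-017", True, True)
for problem in inspection_problems(record):
    print(problem)
```

A check like this does not replace the inspection itself; it only flags sign-offs that do not meet the criteria so that they can be repeated properly.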

Final Thoughts

The job of the maintenance execution team is to carry out maintenance and to bring failed equipment back online as quickly as possible, and they are usually very good at this. When equipment fails, it is generally because there are not enough systems and processes to prevent the failure or to identify it early enough, before it becomes catastrophic. It is management’s responsibility to create the conditions and put those processes in place, driving plant reliability in the same way it drives safety culture. Effective failure prevention must be a top-down approach.

References

[1]	P. D. O’Connor and A. Kleyner, Practical Reliability Engineering, Wiley, 2012.
[2]	W. Rankin, MEDA Investigation Process, Boeing, 2007.
[3]	J. Reason and A. Hobbs, Managing Maintenance Error, CRC Press, 2003.
[4]	D. Embrey, “Creating a Procedure culture to minimise risks using CARMAN,” in The 12th Symposium on Human Factors in Aviation Maintenance, Gatwick, 1998.
