NHS surgeon data may hide poor performance

"Relying upon the death rates of individual surgeons…may lead to 'false complacency'," The Daily Telegraph warns. It reports on an article in The Lancet which argues that recently published NHS data on surgical outcomes is too limited in scope to be useful.

The data, published in June 2013 on the NHS Choices website, currently consists of mortality rates for seven types of surgery.

The Lancet article highlights the fact that most surgeons do not perform enough of the individual procedures each year for patient death rates to be a reliable indication of poor performance. A far greater number of procedures per year would be needed to give enough “statistical power” to show which surgeons were truly performing worse than average.

With only a small number of procedures performed, the number of patient deaths per surgeon in any given year may be the result of chance. As a result, some surgeons may be wrongly identified as underperforming.

The Lancet article also highlights the fact that focusing solely on mortality rates is not particularly helpful for patients. For example, orthopaedic surgeries such as hip replacements have a very low risk of death, but complications from hip surgery are relatively common, such as loosening of the replacement joint, which may require further surgery to correct. These types of post-surgical outcomes should also have been included in the NHS data, they argue.

The authors of the Lancet article offer several other suggestions for how to give a more reliable indication of surgeon performance.

How could reporting of surgeons’ performance be improved?

The authors of the Lancet paper suggest ways to increase the number of procedures analysed to give a better indication of performance.

They suggest:

pooling data per surgeon over a longer time frame than a year
pooling surgical procedures within specialties (such as all adult cardiac surgery), rather than looking at single procedures
pooling data by hospital rather than by individual surgeon
measuring outcomes that are more common than death, such as rates of surgical complications or emergency readmission rates

Overall, this article is useful for both members of the public and professionals in highlighting the possible limitations of analysing patient death rates alone following surgical procedures. This, the authors argue, is a very crude indication of what constitutes a ‘good’ or a ‘bad’ surgeon.

Where did the story come from?

This was a report published in the peer-reviewed medical journal The Lancet. The report received no specific funding. The article was reported fairly by both The Daily Telegraph and BBC News.

What kind of research was this?

The researchers report that, from June 2013 onwards, the patient death rates from certain surgical procedures are being reported for individual surgeons as part of the English NHS Commissioning Board’s new policy. Several US states already report similar data, and UK heart (cardiac) surgery mortality data has already been reported for a number of years. The intended aim of this is to allow patients to be better informed when choosing their surgeon.

However, as the authors of this article highlight, when the overall number of certain procedures performed is low, death rates are not necessarily a good indicator of the surgeon’s overall performance. They say that there is a danger “that low numbers mask poor performance and lead to false complacency”.

The aim of this article was to examine this issue by looking at patient death rates for individual surgeons for adult heart surgery, and also for three specific procedures in three other specialties:

oesophagectomy or gastrectomy for oesophagogastric cancer (removal of all, or part of, the oesophagus or stomach for cancer of the oesophagus or stomach)
bowel cancer resection (removal of part of the bowel to treat bowel cancer)
hip fracture surgery

The researchers wanted to answer the following questions:

What number of procedures does a surgeon need to do to give a reliable indication of whether their performance is poor?
How many surgeons in each specialty perform this number of procedures over periods of one, three or five years?
What is the probability that a surgeon identified as having a high mortality rate truly has poor performance?

The researchers then gave suggestions on how surgeon performance could be addressed meaningfully. They used figures on numbers of surgeries and deaths from national sources such as Hospital Episode Statistics and the National Institute for Cardiovascular Outcomes Research. As such, these are likely to represent the best national figures available.

The researchers’ calculations involved some assumptions about what would constitute poor performance. For example, they defined a surgeon whose surgical mortality rate was double the national average as performing poorly. Had they defined this differently, the results of the calculations would have been different.

How many procedures are needed to give a good indication of performance?

The median (average) number of heart procedures each heart surgeon performs per year is 128. For the other specific procedures examined, the median number performed per surgeon per year is far lower:

11 oesophagectomies or gastrectomies
nine bowel resections for cancer
31 hip fracture surgeries

Next, the researchers related this to how many procedures per surgeon would be needed to give the best statistical power to identify accurately the poorly performing surgeons.

That is, the probability that a surgeon with truly poor performance would be detected as having significantly poorer performance than average.

The higher the statistical power, the higher the probability of identifying the poorly performing surgeons. A power value of 80% would mean that, on average, eight out of every 10 poorly performing surgeons would be identified, while 60% power would mean that six out of every 10 would be identified, and so on.

Of all the patients who undergo heart surgery across the UK, national mortality data shows that 2.7% die following the procedure. While the average number of heart surgeries per surgeon seems high at 128 per year, in fact:

192 surgeries per surgeon per year would need to be performed to have 60% power to detect poorly performing surgeons
256 procedures would be needed to have 70% power, and
352 surgeries would be needed to have 80% power to detect the poorly performing surgeons – almost three times as many procedures per year as heart surgeons currently perform on average.
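A rough way to see why 128 operations falls short is to compute the power directly. The sketch below assumes a one-sided exact binomial test at a 2.5% significance level, which is an assumption made for illustration and not necessarily the method used in the Lancet article. It uses the article's definition of poor performance (double the national mortality rate of 2.7%). The function names are hypothetical.

```python
from math import comb


def binom_tail(n: int, p: float, k: int) -> float:
    """Probability of k or more deaths in n operations with death risk p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def exact_power(n: int, p0: float, ratio: float = 2.0, alpha: float = 0.025) -> float:
    """Chance that a surgeon whose true death rate is ratio * p0 gets flagged.

    Assumption: a surgeon is flagged when their death count is so high that an
    average surgeon (death rate p0) would reach it with probability <= alpha.
    """
    # Smallest death count that an average surgeon rarely reaches by chance.
    critical = next(k for k in range(n + 2) if binom_tail(n, p0, k) <= alpha)
    return binom_tail(n, ratio * p0, critical)


NATIONAL_HEART_MORTALITY = 0.027  # 2.7%, from the article

for procedures in (128, 192, 256, 352):
    power = exact_power(procedures, NATIONAL_HEART_MORTALITY)
    print(f"{procedures} procedures: power about {power:.0%}")
```

Because death counts are whole numbers, an exact test of this kind is somewhat conservative, so the power it gives at 352 procedures need not equal the 80% quoted above.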

For the other surgeries the figures are as follows:

Oesophagectomies or gastrectomies: 6.1% of people die following this procedure. Rather than the current average 11 per year per surgeon, 79 procedures would be needed for 60% power, 109 for 70% power and 148 for 80% power.
Bowel resections for cancer: 5.1% of people die following this procedure. Rather than the current average of nine per year per surgeon, 95 procedures would be needed for 60% power, 132 for 70% power and 179 for 80% power.
Hip fracture surgery: 8.4% of people die following this procedure. Rather than the current average of 31 per year per surgeon, 56 procedures would be needed for 60% power, 75 for 70% power and 102 for 80% power.
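The required numbers above can be approximated with a standard sample-size formula for comparing an observed proportion against a known one. The sketch below is an approximation under stated assumptions: a normal approximation, a two-sided 5% significance level, and poor performance defined as double the national mortality rate. The article does not spell out the exact calculation, so the results should land close to the quoted figures without matching them exactly. The function and dictionary names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def procedures_needed(p0: float, power: float, ratio: float = 2.0,
                      alpha_two_sided: float = 0.05) -> int:
    """Approximate number of procedures needed to detect a death rate of ratio * p0."""
    p1 = ratio * p0
    z_alpha = NormalDist().inv_cdf(1 - alpha_two_sided / 2)
    z_power = NormalDist().inv_cdf(power)
    numerator = z_alpha * sqrt(p0 * (1 - p0)) + z_power * sqrt(p1 * (1 - p1))
    return ceil((numerator / (p1 - p0)) ** 2)


# National mortality rates quoted in the article.
NATIONAL_MORTALITY = {
    "adult heart surgery": 0.027,
    "oesophagectomy or gastrectomy": 0.061,
    "bowel cancer resection": 0.051,
    "hip fracture surgery": 0.084,
}

# Numbers of procedures the article reports for 60%, 70% and 80% power.
REPORTED_NEEDED = {
    "adult heart surgery": (192, 256, 352),
    "oesophagectomy or gastrectomy": (79, 109, 148),
    "bowel cancer resection": (95, 132, 179),
    "hip fracture surgery": (56, 75, 102),
}

for name, rate in NATIONAL_MORTALITY.items():
    estimates = [procedures_needed(rate, target) for target in (0.6, 0.7, 0.8)]
    print(f"{name}: approximated {estimates}, reported {list(REPORTED_NEEDED[name])}")
```

The formula also shows why hip fracture surgery needs the fewest procedures: the higher the baseline death rate, the larger the absolute gap between an average and a poorly performing surgeon, and the fewer operations it takes to see it.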

Overall, the findings show that, given the small number of procedures performed per surgeon per year, using annual deaths as a measure of performance would miss many underperforming surgeons. If each surgeon were able to perform the large number of procedures required to give adequate statistical power, then death rates would be better at identifying the surgeons who are performing worse than average.

What proportion of surgeons do the required number of procedures?

Based on the numbers of surgeries performed over three years, 75% of UK heart surgeons perform sufficient procedures to give 60% power to use death rates to identify the poorly performing surgeons. Just over half (56%) perform enough procedures to give the more reliable 80% power.

For hip fracture surgery the numbers are similar, but for the other procedures, the proportion of surgeons achieving high enough numbers of surgeries is much lower. Over a three-year period:

for hip fracture surgeries: a similar 73% of surgeons perform enough of these procedures to give 60% power to use death rates to indicate the poorly performing surgeons, 62% perform enough for 70% power and fewer than half (42%) perform enough for 80% power
for bowel resections for cancer: 17% of surgeons perform enough of these procedures to give 60% power to use death rates to indicate the poorly performing surgeons, 4% perform enough to give 70% power and no surgeons perform enough surgeries to give 80% power
for oesophagectomies or gastrectomies: only 9% of surgeons perform enough of these procedures to give 60% power to use death rates to indicate the poorly performing surgeons, and no surgeons perform enough surgeries to give 70% or 80% power

However, the researchers demonstrate that extending the time over which a surgeon’s figures are examined (to measure more procedures) gives better power.

The figures detailed above relate to data collected over three years. Increasing the observation period to five years would increase the proportion of surgeons who perform sufficient procedures to give the same levels of power. However, increasing the observation period would mean it would take longer to identify underperforming surgeons.

Conversely, if the time frame were decreased to one year rather than three, very few surgeons would have performed enough procedures to give adequate power. Only 16% of heart surgeons and 4% of surgeons performing hip fracture surgery perform enough procedures in a year to achieve 60% power, and no surgeons do so for the other two procedures.
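The trade-off between power and timeliness can be made concrete with simple arithmetic. The sketch below uses only figures quoted in this article and treats the required numbers as totals that can be built up over several years, as the pooling discussion does. It estimates how many years of data a surgeon with the median annual volume would need. The names are hypothetical, and real surgeons' volumes vary widely around the median.

```python
from math import ceil

# Median procedures per surgeon per year, from the article.
MEDIAN_PER_YEAR = {
    "adult heart surgery": 128,
    "oesophagectomy or gastrectomy": 11,
    "bowel cancer resection": 9,
    "hip fracture surgery": 31,
}

# Procedures needed for 60%, 70% and 80% power, from the article.
PROCEDURES_NEEDED = {
    "adult heart surgery": {60: 192, 70: 256, 80: 352},
    "oesophagectomy or gastrectomy": {60: 79, 70: 109, 80: 148},
    "bowel cancer resection": {60: 95, 70: 132, 80: 179},
    "hip fracture surgery": {60: 56, 70: 75, 80: 102},
}


def years_to_reach(procedure: str, power_percent: int) -> int:
    """Whole years a median-volume surgeon needs to reach the required total."""
    needed = PROCEDURES_NEEDED[procedure][power_percent]
    return ceil(needed / MEDIAN_PER_YEAR[procedure])


for procedure in MEDIAN_PER_YEAR:
    years = {power: years_to_reach(procedure, power) for power in (60, 70, 80)}
    print(f"{procedure}: years of data needed {years}")
```

For a median-volume surgeon, heart and hip fracture surgery reach 80% power within three or four years, whereas the two cancer operations would need well over a decade of pooled data.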

Will all surgeons identified as having poor performance really be poor performers?

The researchers also highlight that even if a surgeon is identified as a poor performer using death rates, they may not truly have poor performance.

The proportion correctly identified will vary depending on how many procedures each surgeon performs, how common poor performance is and the threshold set for considering a difference in performance to be statistically significant.

The authors estimated that if only one in 20 cardiac surgeons truly had poor performance, 63% of the surgeons identified as poor performers would be correctly identified, on the basis of the average number of procedures in three years. For the other procedures the corresponding figures would be:

62% for hip fracture surgery
57% for oesophagectomy or gastrectomy
38% for bowel cancer resection

The remainder of surgeons identified as having poor performance would only fall into this category due to chance.
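The share of flagged surgeons who truly perform poorly depends on three quantities: how common poor performance is, the power of the test, and the false-alarm rate among adequately performing surgeons. The sketch below shows the standard calculation. The one-in-20 prevalence comes from the article, while the 2.5% false-alarm rate and the power values are assumptions chosen for illustration. The function name is hypothetical.

```python
def share_of_flags_that_are_correct(prevalence: float, power: float, alpha: float) -> float:
    """Proportion of flagged surgeons who truly perform poorly.

    prevalence: fraction of all surgeons who truly perform poorly
    power:      chance that a truly poor performer is flagged
    alpha:      chance that an adequate performer is flagged by chance (assumed)
    """
    correctly_flagged = prevalence * power
    wrongly_flagged = (1 - prevalence) * alpha
    return correctly_flagged / (correctly_flagged + wrongly_flagged)


PREVALENCE = 1 / 20   # "one in 20", from the article
ASSUMED_ALPHA = 0.025  # assumption: one-sided 2.5% significance threshold

for power in (0.3, 0.6, 0.8):
    share = share_of_flags_that_are_correct(PREVALENCE, power, ASSUMED_ALPHA)
    print(f"power {power:.0%}: about {share:.0%} of flagged surgeons are truly poor performers")
```

Even at 80% power, a sizeable minority of flagged surgeons would be adequate performers, because poor performers are rare and adequate performers are many.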

There is also the possibility that experienced surgeons would be identified as having poor performance. A consultant with many years of experience may be more likely to operate in very high-risk cases where patients have multiple complex health problems, and these types of surgery have a much higher risk of mortality through no fault of the surgeon.

What other ways do the authors suggest to better indicate poor performance?

As these findings show, when using patient death rates, not all surgeons identified as having higher death rates will necessarily have poorer performance, and vice versa.

The researchers suggest a number of options for improving the power to detect poor performance:

pooling death data over a longer time frame, although this would mean a delay in identification of poor performance
pooling death rates for different surgical procedures within specialties (for example all adult heart surgeries) rather than looking at single procedures – although this could mask differences between procedures
reporting death rates per surgical team or per hospital rather than per individual surgeon
altering the threshold at which a difference is considered statistically significant
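The last option involves a trade-off that a small calculation can illustrate. The sketch below uses a normal approximation and a one-sided test, both of which are assumptions, to compare a strict and a relaxed significance threshold for hip fracture surgery pooled over three years (31 procedures a year, 8.4% national mortality, poor performance defined as double that rate, one in 20 surgeons performing poorly). The threshold values of 2.5% and 10% and the function names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def approx_power(n: int, p0: float, alpha: float, ratio: float = 2.0) -> float:
    """Normal-approximation power of a one-sided test to detect a rate of ratio * p0."""
    p1 = ratio * p0
    z_alpha = STANDARD_NORMAL.inv_cdf(1 - alpha)
    shift = (p1 - p0) * sqrt(n) - z_alpha * sqrt(p0 * (1 - p0))
    return STANDARD_NORMAL.cdf(shift / sqrt(p1 * (1 - p1)))


def share_correct(prevalence: float, power: float, alpha: float) -> float:
    """Proportion of flagged surgeons who truly perform poorly."""
    true_flags = prevalence * power
    false_flags = (1 - prevalence) * alpha
    return true_flags / (true_flags + false_flags)


PROCEDURES = 31 * 3   # median hip fracture volume pooled over three years
MORTALITY = 0.084     # national hip fracture surgery mortality, from the article
PREVALENCE = 1 / 20   # one in 20 surgeons performing poorly, from the article

for alpha in (0.025, 0.10):  # assumed strict and relaxed thresholds
    power = approx_power(PROCEDURES, MORTALITY, alpha)
    correct = share_correct(PREVALENCE, power, alpha)
    print(f"threshold {alpha:.1%}: power about {power:.0%}, "
          f"about {correct:.0%} of flagged surgeons truly poor performers")
```

Relaxing the threshold catches more of the poorly performing surgeons, but a larger share of those flagged will have been flagged by chance.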

The researchers also make the point that mortality rates for types of surgery with a low risk of death may not be particularly useful when it comes to informed patient choice. Other post-operative outcomes, such as post-operative bleeding, infection or persistent pain, or emergency readmission rates, could provide a better assessment of surgical performance.

What do the authors conclude?

The authors conclude by making the following recommendations for better public reporting of surgeon outcomes:

when the annual number of procedures is low, pool data over time, but also consider the timeliness of data reporting (how quickly underperformance can be identified)
select outcome measures for which the outcome event is fairly frequent
for specialties in which most surgeons do not achieve 60% power, the unit of reporting should be the team, hospital or trust
present results using appropriate statistical techniques
avoid making the interpretation that no evidence of poor performance equals acceptable performance
report surgeon outcomes with appropriate health warnings, such as highlighting low numbers and data quality issues
report surgeon outcomes alongside unit or hospital outcomes to guide interpretation

Overall, this article is useful for both members of the public and professionals in highlighting some important limitations of using patient death rates following surgical procedures as the sole indication of ‘good’ or ‘bad’ surgeons.