
SWAG: Long-term Surgical Workflow Prediction with Generative-based Anticipation

King's College London

Figure: Cholec80 dataset results.

Surgical AI Assistant with LLM API

This application demonstrates how recognised phases and future predictions can be translated into textual context. The LLM can then be prompted to answer questions based on its current understanding of the video and the predicted workflow.
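
As a minimal sketch of this idea, the snippet below serialises a recognised phase and a sequence of per-minute anticipated phases into a textual prompt. The helper name and prompt format are illustrative assumptions, not the application's actual implementation; the phase names are the seven Cholec80 phases.

# Hypothetical helper: turn SWAG outputs into textual context for an LLM.
CHOLEC80_PHASES = [
    "Preparation", "CalotTriangleDissection", "ClippingCutting",
    "GallbladderDissection", "GallbladderPackaging",
    "CleaningCoagulation", "GallbladderRetraction",
]

def build_prompt(current_phase: int, future_phases: list[int]) -> str:
    """Serialise the recognised phase and the anticipated per-minute
    phase sequence into a context block for the LLM."""
    lines = [f"Current surgical phase: {CHOLEC80_PHASES[current_phase]}."]
    for minute, phase in enumerate(future_phases, start=1):
        lines.append(f"Predicted phase in {minute} min: {CHOLEC80_PHASES[phase]}.")
    lines.append("Answer the surgical team's question using this context.")
    return "\n".join(lines)

# Example: currently dissecting the gallbladder, packaging expected in 3 min.
print(build_prompt(3, [3, 3, 4]))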

Abstract

Purpose

While existing approaches excel at recognising current surgical phases, they provide limited foresight into future procedural steps and limited intraoperative guidance. Similarly, current anticipation methods are constrained to predicting short-term, single events, neglecting the dense, repetitive, and long sequential nature of surgical workflows. To address these limitations, we propose SWAG (Surgical Workflow Anticipative Generation), a framework that combines phase recognition and anticipation using a generative approach.

Methods

This paper investigates two distinct decoding methods, single-pass (SP) and auto-regressive (AR), to generate sequences of future surgical phases at minute intervals over long horizons. We propose a novel embedding approach that uses class transition probabilities to improve the accuracy of phase anticipation. Additionally, we introduce a generative framework that reformulates remaining-time regression as classification (R2C). SWAG was evaluated on two publicly available datasets, Cholec80 and AutoLaparo21.
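
To make the two decoding modes concrete, the sketch below contrasts them for a transformer-style decoder: SP emits all future tokens in one forward pass from learned per-minute queries, while AR decodes one token at a time and feeds each output back into the context. Module names and sizes are illustrative assumptions and do not reproduce the released SWAG code.

import torch
import torch.nn as nn

class FutureDecoder(nn.Module):
    def __init__(self, d_model=256, n_classes=7, horizon=20):
        super().__init__()
        self.horizon = horizon
        # One learned query per future minute to decode.
        self.queries = nn.Parameter(torch.randn(horizon, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.head = nn.Linear(d_model, n_classes)

    def single_pass(self, memory):
        # memory: (B, T, d_model) temporally aggregated video features.
        q = self.queries.unsqueeze(0).expand(memory.size(0), -1, -1)
        out, _ = self.attn(q, memory, memory)   # all H tokens in one pass
        return self.head(out)                   # (B, H, n_classes)

    def auto_regressive(self, memory):
        logits, ctx = [], memory
        for t in range(self.horizon):
            q = self.queries[t].expand(memory.size(0), 1, -1)
            out, _ = self.attn(q, ctx, ctx)      # decode one future token
            logits.append(self.head(out))
            ctx = torch.cat([ctx, out], dim=1)   # feed it back as context
        return torch.cat(logits, dim=1)          # (B, H, n_classes)

dec = FutureDecoder()
mem = torch.randn(2, 120, 256)               # 2 videos, 120 one-minute features
print(dec.single_pass(mem).shape)            # torch.Size([2, 20, 7])
print(dec.auto_regressive(mem).shape)        # torch.Size([2, 20, 7])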

Results

Our single-pass model with class transition probability embeddings (SP*) achieves F1 scores of 32.1% and 41.3% over 20- and 30-minute horizons on Cholec80 and AutoLaparo21, respectively. Moreover, our approach is competitive with existing methods on phase remaining-time regression, achieving weighted mean absolute errors of 0.32 and 0.48 minutes for 2- and 3-minute horizons, respectively.
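
For reference, weighted MAE in the surgical anticipation literature is commonly computed by averaging the error over in-horizon and out-of-horizon frames, so that long background stretches do not dominate. The sketch below assumes that convention, which may differ in detail from the paper's exact protocol.

import torch

# Assumed wMAE convention: average the MAE over frames where the event is
# inside the horizon and frames where it is not (targets clipped at the
# horizon), so each regime contributes equally.
def weighted_mae(pred: torch.Tensor, target: torch.Tensor, horizon: float) -> torch.Tensor:
    inside = target < horizon
    err = (pred - target).abs()
    terms = [err[m].mean() for m in (inside, ~inside) if m.any()]
    return torch.stack(terms).mean()

pred = torch.tensor([1.2, 0.5, 2.0, 2.0])     # predicted minutes remaining
target = torch.tensor([1.0, 0.0, 2.0, 2.0])   # ground truth, clipped at horizon=2
print(weighted_mae(pred, target, horizon=2.0))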

Conclusion

SWAG demonstrates versatility across generative decoding frameworks and across classification and regression tasks, creating temporal continuity between surgical workflow recognition and anticipation. Our method offers a step towards intraoperative surgical workflow generation for anticipation. Project page: https://maxboels.com/research/swag.

Methods

Model Architecture

Our proposed model, illustrated in Figure 2, comprises a vision encoder, a temporal aggregation module based on self-attention, compression and pooling mechanisms, and a future prediction module that can operate with either a single-pass (SP) or an auto-regressive (AR) decoder. While our model uses classification for the recognition task, it can use either classification or regression for anticipation. This architecture also includes a novel token embedding initialization strategy based on the recognised class and the temporal position of the decoded target token.
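
A minimal sketch of this initialisation strategy is given below, assuming the prior is a class transition matrix estimated from training-set phase sequences: each future query token starts from the expected class embedding under the transitions from the recognised phase, plus a temporal position embedding. All module and parameter names are illustrative assumptions, not the paper's exact implementation.

import torch
import torch.nn as nn

class PriorInitialisedQueries(nn.Module):
    def __init__(self, transition_probs: torch.Tensor, d_model: int = 256):
        super().__init__()
        n_classes = transition_probs.size(0)
        self.register_buffer("P", transition_probs)   # (C, C), rows sum to 1
        self.class_emb = nn.Embedding(n_classes, d_model)
        self.pos_emb = nn.Embedding(64, d_model)       # one slot per future minute

    def forward(self, recognised_class: torch.Tensor, horizon: int) -> torch.Tensor:
        # Expected class embedding under the transition prior from the
        # recognised phase, plus the temporal position of each target token.
        prior = self.P[recognised_class] @ self.class_emb.weight        # (B, d_model)
        pos = self.pos_emb(torch.arange(horizon, device=prior.device))  # (H, d_model)
        return prior.unsqueeze(1) + pos.unsqueeze(0)                    # (B, H, d_model)

P = torch.full((7, 7), 1.0 / 7)                   # uniform prior, for the example only
init = PriorInitialisedQueries(P)
queries = init(torch.tensor([2, 5]), horizon=20)  # batch of two recognised phases
print(queries.shape)                              # torch.Size([2, 20, 256])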

Results

Accuracy Results

Accuracy of surgical phase recognition and anticipation on Cholec80 and AutoLaparo21 at horizons of up to 18 minutes, with mean values computed over the 18-minute horizon.

Table Results

Detailed performance metrics comparing different model configurations across various evaluation horizons.

Conclusion

We introduced SWAG, a pioneering framework that unifies surgical phase recognition and long-term anticipation through generative modelling. Our framework demonstrates:

  • Superior performance over naive baselines in classification tasks
  • Competitive results in regression tasks, particularly for remaining surgery duration predictions
  • Successful integration of prior knowledge into generated future tokens
  • Real-world applicability through implementation in a practical software application

Clinically, SWAG's real-time phase anticipation capabilities offer valuable support for intraoperative decision-making, with potential to enhance both patient safety and procedural efficiency. Our evaluation on Cholec80 and AutoLaparo21 datasets validates SWAG as a robust solution for surgical workflow anticipation.

Future Works

  1. Enhance the generative process with dynamic confidence-based adjustments for more reliable and robust predictions in clinical scenarios.
  2. Develop a vision-language-action framework by integrating language inputs to enable direct responses to prompts from the surgical team.
  3. Incorporate multiple input modalities including language, audio, and proprioceptive sensory data from surgical robots.
  4. Expand training labels to cover both surgical phases and specific action triplets for finer-grained workflow prediction.
  5. Address challenges related to anatomical differences, surgeon skill, and patient-specific factors for improved real-time adaptability.

BibTeX

@article{boels2024swag,
    title   = {SWAG: Long-term Surgical Workflow Prediction with Generative-based Anticipation},
    author  = {Maxence Boels and Yang Liu and Prokar Dasgupta and Alejandro Granados and Sebastien Ourselin},
    year    = {2024},
    journal = {Submitted to IJCARS}
}