Introduction

Patient-reported outcome measures (PROMs) are now considered standard assessments in the care of patients with spinal diseases, as they help to quantify pain and disability. PROMs, such as the Core Outcome Measures Index (COMI) or the Oswestry Disability Index (ODI), are subjective measures that help to estimate how a patient perceives his or her current condition. The benefits of additional objective assessments, with measures such as the 6-min walking test (6WT) or Timed Up and Go (TUG) test, have recently been highlighted. [2, 5, 8, 9, 12, 13, 15, 17, 18, 20, 21] These measures can be used to precisely measure symptom severity and are hence valuable tools to detect progression or improvement of symptoms with both conservative and surgical treatment. Previous studies have shown that the TUG test is quick, easy to use and highly reliable. [7] Although this type of objective evaluation does not replace PROMs, it provides additional information and is less subject to bias. [14] With normal population reference values available, a free smartphone app simplifies the measurement and interpretation of TUG raw values, transforming them into normalized, age- and sex-adjusted objective functional impairment (OFI) T-scores.

To date, TUG tests have been conducted by healthcare professionals, which is resource-demanding and does not allow a patient’s condition to be determined outside a fixed appointment. If patient self-measurement by means of the TUG test were to prove sufficiently reliable, there would be many useful applications. For example, patients with severe spinal stenosis undergoing conservative treatment could monitor themselves closely and report to their spine surgeon in case of progressive functional deterioration. A similar but potentially even more clinically relevant scenario applies to patients with mild degenerative cervical myelopathy (DCM; modified Japanese Orthopaedic Association (mJOA) score of 15–17). As close observation and timely surgical care in case of functional deterioration are recommended in this patient population, [3, 4] patient self-measurement with the TUG test could prevent undetected and potentially irreversible functional decline between fixed appointments. Similarly, the post-operative healing process could be monitored more closely, and worsening, possibly an early sign of adverse events (AEs), could trigger communication with the spine surgeon, possibly allowing diagnostic measures and treatment to be initiated at an earlier time point.

The prerequisite for this, however, is that the TUG test can be reliably applied by patients themselves, for which there are currently no data. Hence, the aim of this study was to analyse the reliability of TUG test self-assessments by patients.

Methods

Study design

The study was designed as a single-centre clinical validation study, conducted at the Spine Centre of Eastern Switzerland (OSWZ) of the Kantonsspital St.Gallen. The study population consisted of adult patients who underwent inpatient surgical or non-operative treatment for a spinal disease or pathology between 2022 and 2023. These patients were screened for inclusion and exclusion criteria. Sufficient mobility to perform the TUG test and the signing of a general informed consent sheet were required for inclusion. Patients were excluded if they were under 18 years of age, did not sign the informed consent form, or refused to participate. The study was approved by the Ethics Committee of Eastern Switzerland (EKOS 23/179).

TUG test and data collection

The TUG test is the most commonly used objective functional test for the evaluation of a patient with degenerative disease of the lumbar spine. [19] It assesses simple but important functions such as getting up, walking, changing direction, walking again, and sitting down. These basic functions are essential for performing activities of daily living (ADLs) and regaining quality of life (QoL). [1, 6, 8, 13] It is a clinically validated test that is regularly used in clinical practice. [8, 20, 21]

The TUG test requires only a chair and a walking distance of 3 m. To perform the test, patients were asked to sit in a chair with their arms resting on the back of the chair. At the request of the examiner, or at their own starting signal if they were measuring themselves, patients stood up and walked as quickly as possible (without running) behind a line marked on the floor at three metres. When they reached the line, they turned 180 degrees and returned to the chair as quickly as possible to sit down again. Timing was started when patients stood up and stopped when they sat down again. Patients were allowed to wear their normal shoes and, if necessary, use a walking aid such as crutches, walkers or a trolley. The walking aid used was then recorded. All TUG tests were conducted using standardized chairs (47 cm seat height) and on the same non-slip hospital floor surface to ensure consistency across measurements and to allow for direct comparison of results. A self-designed, freely available smartphone app for both Apple and Android smartphones (TUG app) was used to standardize the measurements. This app offers the function of a stopwatch and simultaneously calculates additional parameters such as the normalized T-score (adjusted for age and sex) and OFI (Fig. 1).

Fig. 1

Image from the free smartphone application for measuring the Timed Up and Go Test (TUG). After entering the sex and age of the patient, the time of the TUG can be measured with an integrated stopwatch. Simultaneously, the app calculates additional parameters such as the normalised T-score (adjusted for age and sex) and the degree of objective functional impairment (OFI). This example measurement results in a stopped time of 20.21 s and a T-score of 140.3 resulting in a moderate OFI for a female patient over the age of 60

Only patients who completed the TUG test twice were included in the analysis. The first measurement was performed by a healthcare professional, and the second measurement was performed by the patients themselves. The order of the tests was kept constant throughout the study to avoid potential differences in the time measured by the patients due to insufficient understanding of the TUG test and how the time is measured. The interval between the two measurements was kept between one and two hours to avoid patients measuring a higher time due to fatigue, while also minimizing the risk of a difference due to fluctuations in their clinical condition. Other required data (age, sex, walking aid, American Society of Anesthesiology (ASA) risk scale, smoking status, Charlson Comorbidity Index (CCI), Body Mass Index (BMI), underlying disease type, affected spinal region) were part of the standard patient care and documentation and were stored in a pseudo-anonymised manner. Between the two measurements, patients were asked to fill out PROM questionnaires for a self-assessment of their current condition, including the COMI back/neck as well as the ODI/NDI. Depending on the Canadian Frailty Index (CFI), patients were grouped into "Very Fit", "Well", "Managing Well", "Vulnerable", "Mildly Frail", "Moderately Frail", "Severely Frail", "Very Severely Frail" and "Terminally Ill". [16] The disease types were subdivided into "degenerative", "trauma", "tumour/neoplastic", "infectious" and "deformity".
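The frailty grouping described above can be written down as a small lookup. The following is a minimal sketch only: the category labels are taken from the text, whereas the numeric levels 1 to 9, the function name and the cut-off between the two subgroups are assumptions made for illustration (the cut-off is read from the "very fit" to "managing well" wording used in the Results).

```python
# Labels as listed in the Methods; the numeric levels 1-9 are an assumption.
CFI_LABELS = {
    1: "Very Fit",
    2: "Well",
    3: "Managing Well",
    4: "Vulnerable",
    5: "Mildly Frail",
    6: "Moderately Frail",
    7: "Severely Frail",
    8: "Very Severely Frail",
    9: "Terminally Ill",
}


def frailty_subgroup(level):
    """Return a two-level subgroup for a frailty level (hypothetical helper).

    Levels 1-3 ("Very Fit" to "Managing Well") form one subgroup and all
    higher levels the other; this cut-off is an assumption.
    """
    if level not in CFI_LABELS:
        raise ValueError(f"unknown frailty level: {level!r}")
    return "fit" if level <= 3 else "vulnerable or frail"


if __name__ == "__main__":
    for level, label in CFI_LABELS.items():
        print(level, label, "->", frailty_subgroup(level))
```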

Statistical analysis

Our null hypothesis was that there would be no significant difference between the times (in seconds) measured by patients and by healthcare professionals, meaning that patients are able to measure themselves reliably. For this evaluation, test–retest reliability was calculated using intraclass correlation coefficients (ICCs). A value close to 1.00 indicates near-perfect correlation, while a value close to 0.00 indicates poor or no correlation between the two groups. Subgroup analyses were conducted, based on the demographic data and the functional status, to evaluate whether self-assessments with the TUG test are reliable for all patients in general or only for certain subgroups. Furthermore, paired, two-sided t-tests were performed for the TUG test raw value to determine whether patients over- or underrate themselves, in case any difference existed.
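To illustrate how an ICC is obtained from paired measurements, the following minimal sketch computes a two-way random-effects, absolute-agreement, single-measure ICC, often written ICC(2,1). The ICC model is not specified in the text, so this choice is an assumption, as are the function name and the example values; the analyses in this study were run in Stata, not with this code.

```python
from statistics import mean


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    `ratings` holds one row per patient; each row contains that patient's
    TUG times from every rater (here: healthcare professional, patient).
    """
    n = len(ratings)  # number of patients
    if n < 2:
        raise ValueError("need at least two patients")
    k = len(ratings[0])  # number of raters
    if k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("each patient needs the same number (>= 2) of ratings")

    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Sums of squares of the two-way ANOVA without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


if __name__ == "__main__":
    # Illustrative TUG times in seconds (professional, self); not study data.
    pairs = [[12.1, 11.8], [25.4, 24.9], [18.0, 18.6], [31.2, 29.7], [9.8, 10.1]]
    print(round(icc_2_1(pairs), 3))
```

Because this form penalises systematic offsets between raters, it would also register a consistent tendency of patients to stop the clock earlier or later than the professional. A consistency-type ICC would not, which is one reason the model choice should be reported.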

All statistical analyses and generation of graphs were performed using StataSE 18.0 (StataCorp. 2023. Stata Statistical Software: Release 18. College Station, TX: StataCorp LLC). Descriptive statistics were employed, describing the sample as count (percent) and mean (standard deviation (SD)). Graphical illustrations of results were used to explore relationships. A p-value below 0.05 was considered statistically significant.
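The paired comparison described above can be sketched in the same way. The snippet below computes the mean difference, the t statistic and the degrees of freedom of a paired t-test using only the Python standard library; the function name and the example values are assumptions for illustration, and this is not the Stata code used for the analysis. The two-sided p-value is not computed here, as it requires the Student t distribution (available, for example, through `scipy.stats.ttest_rel`).

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first, second):
    """Return (mean difference, t statistic, degrees of freedom).

    `first` and `second` are the two TUG times of the same patients,
    in the same order (e.g. professional-measured, then self-measured).
    """
    if len(first) != len(second):
        raise ValueError("both measurement lists must have the same length")
    if len(first) < 2:
        raise ValueError("need at least two pairs")

    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_diff = mean(diffs)
    sd_diff = stdev(diffs)  # sample SD of the within-patient differences
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")

    t_stat = mean_diff / (sd_diff / sqrt(n))
    return mean_diff, t_stat, n - 1


if __name__ == "__main__":
    # Illustrative TUG times in seconds; not study data.
    professional = [12.1, 25.4, 18.0, 31.2, 9.8]
    self_measured = [11.8, 24.9, 18.6, 29.7, 10.1]
    mean_diff, t_stat, df = paired_t(professional, self_measured)
    print(f"mean difference {mean_diff:.2f} s, t({df}) = {t_stat:.2f}")
```

With 74 paired measurements the test has 73 degrees of freedom, which corresponds to the t(73) notation used in the Results.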

Results

Study sample and demographics

During the data collection period, 83 patients were included, of whom nine were excluded because they did not have a second measurement and were therefore treated as “lost to follow-up” (Table 1).

Table 1 Baseline demographic data of n = 74 patients evaluated with the Timed-Up and Go (TUG) test. Data is presented as mean (standard deviation) or count (percent). ASA = American Society of Anesthesiology

The mean age of the study population was 62.9 (SD 17.8) years, and 29 patients (39.2%) were female. Most patients had an ASA surgical risk scale of II (n = 40, 54.1%), followed by III (n = 19, 25.7%), I (n = 12, 16.2%) and IV (n = 3, 4.1%). The Charlson comorbidity index was most commonly > 1 (n = 33, 44.6%), followed by 0 in 29 patients (39.2%) and 1 in 12 patients (16.2%). Just under two thirds of patients had a “very fit” to “managing well” Canadian frailty index (n = 48, 64.9%). The disease type was divided into degenerative disc disease (n = 48, 64.9%), trauma (n = 13, 17.6%), infection (n = 12, 16.2%) and deformity (n = 1, 1.4%). The spinal region affected by the underlying disease was lumbosacral in most patients (n = 53, 71.6%), followed by cervical (n = 12, 16.2%) and thoracic (n = 9, 12.2%). Of the patients included, 22 (29.7%) required a walking aid to perform the TUG test. The aids used remained unchanged from the first to the second measurement.

Patient Reported Outcome Measures

The mean ODI of the cohort with thoracolumbar pathologies was 46.2%, whereas the mean NDI of the cohort with cervical pathologies was 38.7%. The Visual analog scale (VAS), measured for the intensity of pain in the back and the extremities, ranged on average from 4.4 to 5.5 (Table 2).

Table 2 Patient-reported outcome measures of n = 74 patients evaluated with the Timed-Up and Go (TUG) test. Data is presented as mean (standard deviation). COMI = core outcome measures index; NDI = neck disability index; ODI = Oswestry disability index; VAS = visual analog scale

Test–retest reliability

The ICC over the total cohort was 0.8740 with a p-value of < 0.001. Further subgroup analyses using the ICC for comparisons based on demographic parameters such as age, sex, ASA surgical risk scale, BMI, frailty index, ODI, NDI and use of walking aids all yielded high coefficients above 0.7065, with p-values < 0.05 (Table 3, Fig. 3).

Table 3 Interrater reliability and difference in the repeated measurement of n = 74 patients evaluated twice with the Timed-Up and Go (TUG) test. Data is presented as intra-class correlation coefficient (ICC) or as mean (standard deviation)
Fig. 2

Box plots (median with 25th–75th percentile, upper and lower percentiles (whiskers) and outliers (dots)) comparing the absolute Timed Up and Go (TUG) test times in seconds measured by healthcare professionals (in blue; 19.3 s; SD: 9.4) and by the patients themselves (in pink; 18.4 s; SD: 9.7). The difference was assessed with a paired, two-sided t-test (t(73) = 1.59, p = 0.116). Lines over boxes show the interquartile range

The mean TUG test time was 19.3 s (SD 9.4) when measured by the healthcare professional and 18.4 s (SD 9.7) when measured by patients themselves (p = 0.116; Fig. 2).

Fig. 3

Horizontal box plot of interrater reliability, expressed as the intraclass correlation coefficient (ICC), by subgroups for the Timed Up and Go (TUG) test. The red dashed line at 0.75 marks the threshold for excellent reliability. The subgroups are ordered by decreasing median ICC values

Discussion

For certain spinal pathologies, close clinical follow-ups play an important role in deciding whether to treat conservatively or surgically to prevent neurological deterioration. This is particularly the case for cervical spinal stenosis with mild cervical myelopathy. For other spinal pathologies, serial clinical follow-ups are important in order to better monitor the healing process. A more detailed description of the clinical course with high granularity likely allows for a better estimation of the prognosis and expected healing process. The TUG test provides information regarding OFI and can be carried out quickly without major effort or specialized equipment. [7] This test would be suitable for self-administration by the patient in the home environment. With this test as a tool, it would be possible to monitor the patient's objective functional capacity much more closely and to intervene in a timely manner in the event of a relevant functional deterioration. However, for this application to be admissible, the reliability of the test needs to be examined when it is administered by the patients themselves. The present study examined the reliability of the TUG test when measured by patients themselves.

Our study cohort of 74 evaluated patients reflects a representative, broad collective of patients with spinal diseases managed surgically or non-operatively in terms of age and comorbidity. [10] We included spinal pathologies of the cervical spine, the thoracic spine and the lumbar spine that induced mobility restrictions either by mechanical or radicular pain, or by neurological deficits. The prerequisite for mobility, albeit limited, was given in part without (68.9%) and in part with aids (29.7%).

Essentially, our study found that the test–retest reliability of measurements performed by healthcare professionals or patients themselves was excellent. A direct comparison of both measurements showed a mean difference of 0.9 s, which was not statistically significant (p = 0.116). Considering that the minimum clinically important difference (MCID) of the TUG test for spinal pathologies ranges between 2.1 and 3.4 s, the difference between both measurements can also be considered clinically irrelevant, as it is smaller than the smallest detectable difference in functional impairment. The ICC of 0.8740 indicates excellent test–retest reliability for the entire cohort, according to the recommendations by Koo et al. [11]. Influencing factors that limit the reliability of the self-measured TUG test are conceivable, however. To rule out lower reliability in certain settings and under specific conditions, variables including age and sex as demographic factors as well as ASA, BMI, frailty, the use of walking support and PROMs (NDI/ODI) were analysed as functional factors in subgroup analyses (Table 3). The highest ICCs were found in younger patients (under 65 years; ICC 0.9047, p < 0.001), regardless of sex, and in patients with a lower ASA score, lower frailty and lower NDI or ODI. Interrater reliability was slightly inferior in patients > 65 years (ICC 0.8584, p < 0.001), patients with ASA grades 3 and 4 (ICC 0.7066, p < 0.001), patients considered vulnerable or frail (ICC 0.8799, p < 0.001), and in patients not using any type of walking aid (ICC 0.8070, p < 0.001). The higher ICC observed in younger patients (≤ 65 years) compared to those over 65 years of age may be partially attributed to greater familiarity with smartphone technology among younger individuals. This technological proficiency could lead to more accurate self-measurements in this demographic.
Symptom severity, determined by an ODI of > 40 points for patients with thoracolumbar disease, did not influence interrater reliability, but patients with cervical diseases and an NDI of > 40 points scored slightly worse (ICC 0.8607, p = 0.011). These results are in line with expectations, as certain comorbidities and symptoms may have a negative impact on the understanding of the exact performance of the test and correct timekeeping. The ICC remained acceptably high even under these conditions, however, indicating that overall, self-measured TUG test results can be considered sufficiently reliable in patients with spinal pathologies.
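The comparison between the observed measurement difference and the MCID can be made explicit with a few lines of code. This is a minimal sketch: the MCID range of 2.1 to 3.4 s and the two mean times are taken from the text, whereas the function name and the three category labels are assumptions for illustration.

```python
# MCID range for the TUG test in spinal pathologies, as quoted in the text.
MCID_RANGE_S = (2.1, 3.4)


def classify_difference(diff_s, mcid_range=MCID_RANGE_S):
    """Place an absolute TUG time difference (seconds) relative to the MCID range."""
    low, high = mcid_range
    magnitude = abs(diff_s)
    if magnitude < low:
        return "below MCID range"
    if magnitude <= high:
        return "within MCID range"
    return "above MCID range"


if __name__ == "__main__":
    # Mean times reported in the Results: 19.3 s (professional), 18.4 s (self).
    print(classify_difference(19.3 - 18.4))
```

The same check could, in principle, be applied to serial self-measurements of one patient, although the thresholds appropriate for such monitoring are not addressed by the present data.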

Strengths and Weaknesses

The prospective study design incorporating a defined set of clinically relevant variables and PROM scores can be considered a strength of this study. Moreover, robust statistical methods were applied in a reasonably large cohort without missing data for the main outcome variable.

The period between the two measurements was intentionally kept short in order to reduce possible bias from fluctuations in the patient's underlying clinical condition. At the same time, this can potentially lead to a poorer result in the second measurement due to fatigue in severely deconditioned or impaired patients. Moreover, the study setting differs somewhat from a self-measurement in the home environment, where there is naturally a longer time interval between the last instruction on how to perform the test and the self-measurement. Self-measurements at home may therefore be influenced by incorrect performance, although this risk can be considered low for this simple test. It should be noted that our study used standardized chairs and surfaces for all measurements, which may not be entirely replicable in home environments. Consequently, future studies should assess the impact of potential variability in home furniture and flooring on the reliability of the self-administered TUG test. Compliance with correct test performance might be improved by image-based, video-based and/or written instructions. No statement can be made about this based on our data, however. Lastly, randomizing the sequence of testing (either healthcare personnel or patient self-measurement first) would have been ideal to rule out systematic differences in test performance resulting from repetitive testing. The lower test results by 0.9 s on average in the second TUG test may correspond to a minor “learning effect”, which could have been eliminated by randomizing the sequence. Overall, however, we felt that instructing the patient during the first TUG test by healthcare personnel would ensure correct test performance and outweigh the disadvantages.

Implications for clinical practice

Considering the excellent test–retest reliability of the TUG test, determination of OFI may be “outsourced” from healthcare personnel to patients, which helps to save time and resources in daily patient care. Even though our findings were made exclusively in an inpatient setting, extrapolating the results to an outpatient setting can be considered. Here, patient self-examination would open the door towards a more thorough serial patient assessment with higher granularity. This is particularly helpful in spine conditions where functional decline is relevant and timely (surgical) treatment may be required even between planned follow-up visits, e.g., mild DCM, spinal cord cavernomas, syringomyelia, thoracic disc herniation and lumbar spinal stenosis, among others. The application of the self-measured TUG test extends even beyond spinal applications, considering conditions such as normal pressure hydrocephalus that frequently present with gait difficulties and mobility restrictions. Even if our results do not yet permit the unfiltered application of the self-measured TUG test in the outpatient setting and repeating a similar study with patient home-measurement would be ideal, it is questionable whether such a study will ever be conducted. Although our findings indicate a high degree of reliability in self-administered TUG tests in an inpatient setting, it is advisable to exercise caution when extrapolating these results to home environments. The controlled hospital setting differs from home conditions in terms of the availability of standardized equipment, the surrounding environment, and the level of immediate professional oversight. While our results are promising, further research is needed to specifically evaluate the reliability of self-administered TUG tests in home settings. Variables such as furniture, flooring, and potential distractions may influence test performance, and thus require further investigation.
Future studies should focus on validating the reliability of home-based, self-administered TUG tests to fully assess their potential for widespread clinical application in outpatient monitoring and care.

Conclusion

This study provides evidence for a high reliability of self-testing by means of the TUG test. Our findings imply that patients can perform the TUG test without supervision by trained healthcare personnel, which helps to save time and resources. Although further research would be ideal for evaluating the reliability of TUG test self-measurement in the outpatient setting, it seems reasonable to expand its use for serial self-examinations in the home environment.