How item response theory can help you take patient insights even further
- raresstoica4
- Feb 18
Updated: Apr 16

For many, item response theory (IRT) is a tool for selecting items for a new patient-reported outcome (PRO) or producing a short-form version of an existing tool. Whilst that view is not inaccurate, there is a whole world of other applications of IRT that may be less familiar. So, let’s look at the wider benefits of an IRT approach.
What is IRT?
IRT is a broad group of psychometric methods that can be used in a variety of ways to analyse PRO data. It is distinguished from the other major group of psychometric methods, classical test theory (CTT), by its focus on modelling item responding – specifically, how probable it is that a respondent will answer in a particular way, given their level of the construct the PRO measures.
IRT models differ in their complexity but can include parameter estimates representing how severe an item is (difficulty), how well it tells responders apart (discrimination), guessing (a lower bound on the response probability) and an upper limit to that probability. IRT models can be applied to dichotomous or polytomous response scales. When using IRT models, a first task is to understand which model best characterises patients’ responses to the measure. Once a model is selected, the parameter estimates and predicted values from the model can be used for further insights.
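For readers who like to see the idea concretely, here is a minimal Python sketch of the logistic item response function that underlies the common dichotomous IRT models. The function name, defaults and example values are purely illustrative and not taken from any particular software package.

```python
import numpy as np

def item_response_probability(theta, difficulty, discrimination=1.0,
                              guessing=0.0, upper=1.0):
    """Probability of endorsing a dichotomous item, given the respondent's
    construct level (theta). With guessing=0 and upper=1 this reduces to
    the familiar one- and two-parameter logistic models."""
    logistic = 1.0 / (1.0 + np.exp(-discrimination * (theta - difficulty)))
    return guessing + (upper - guessing) * logistic

# A respondent of average severity (theta = 0) facing a moderately
# difficult, well-discriminating item:
print(item_response_probability(theta=0.0, difficulty=0.5, discrimination=1.5))
```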

Back to shortening a scale
A typical application of IRT is to create short forms of longer scales. With this goal in mind, you might run an IRT model, select items based on the item parameters – for example, selecting a range of items with varying difficulty – and then use this shortened measure in future work. PROMIS is an excellent example of a systematic application of IRT to produce shortened measures.
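As a toy sketch of that selection step (the items and parameter values below are invented for illustration, not drawn from any real PRO), one might spread the chosen items across the difficulty range like this:

```python
# Hypothetical item parameters, as might be estimated from a 2PL model
items = {
    "walk a short distance": {"difficulty": -1.8, "discrimination": 1.4},
    "climb a flight of stairs": {"difficulty": -0.6, "discrimination": 1.7},
    "do the laundry": {"difficulty": 0.3, "discrimination": 1.2},
    "run for a bus": {"difficulty": 1.6, "discrimination": 1.9},
    "carry heavy shopping": {"difficulty": 1.4, "discrimination": 1.1},
}

# One simple strategy: within each difficulty band, keep the most
# discriminating item, so the short form covers the full severity range.
bands = [(-3.0, -1.0), (-1.0, 0.0), (0.0, 1.0), (1.0, 3.0)]
short_form = []
for low, high in bands:
    in_band = [(name, p) for name, p in items.items()
               if low <= p["difficulty"] < high]
    if in_band:
        best = max(in_band, key=lambda pair: pair[1]["discrimination"])
        short_form.append(best[0])

print(short_form)
```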
Item parameters may change
Parameters may differ by group. For example, the item “I can do laundry” from an activities of daily living measure may perform differently across age groups. This is an example of test bias. Similarly, parameters may change over time – for example, an item might become systematically easier. This would be an example of response shift.
You can study both bias and response shift in the IRT framework using a group of methods referred to as differential item functioning (DIF) analysis. There are many insights to be gained from studying DIF. In our example above, if item parameters differ by age group, the PRO is not performing in the same way for younger and older patients at the same level of the construct. When this is true, it is questionable whether scores on the PRO can be compared across those groups.
DIF analysis is a powerful tool that can be used to assess bias across groups or time, response shift, differential effects of treatment (by studying items across trial arms), and much more.
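As a deliberately simplified sketch of the DIF idea (real DIF analyses use formal statistical tests rather than an arbitrary cut-off, and the parameter values here are invented), comparing item difficulties estimated separately in two groups might look like this:

```python
# Hypothetical item difficulties estimated separately in two age groups
difficulty_young = {"do the laundry": 0.3, "climb a flight of stairs": -0.6}
difficulty_old = {"do the laundry": 1.1, "climb a flight of stairs": -0.5}

# Flag items whose difficulty estimates diverge noticeably between groups.
# In practice DIF is assessed with formal tests (e.g. likelihood-ratio or
# Wald tests), but the underlying question is the same: do the item
# parameters differ for respondents at the same construct level?
margin = 0.5
for item, b_young in difficulty_young.items():
    gap = abs(b_young - difficulty_old[item])
    if gap > margin:
        print(f"Possible DIF on '{item}': difficulty differs by {gap:.1f} logits")
```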
Reliability is not constant in IRT models
In CTT, reliability is a single constant value for all responders. In IRT, reliability (or information, as it is referred to) varies depending on the level of the construct being measured. In short, this means a given PRO may be more or less reliable across different ranges of scores. IRT lets you estimate the ranges of scores where a PRO is most reliable. In turn, this allows more nuanced decisions about which PROs to include in a study, as selection can be tailored to the expected severity of the sample.
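As a minimal sketch of that idea (reusing the invented item parameters from the earlier examples), the information a 2PL item provides, and hence the scale’s standard error, can be computed at different construct levels:

```python
import numpy as np

def item_information(theta, difficulty, discrimination):
    """Fisher information of a 2PL item at construct level theta."""
    p = 1.0 / (1.0 + np.exp(-discrimination * (theta - difficulty)))
    return discrimination ** 2 * p * (1.0 - p)

# Hypothetical short-form item parameters (difficulty, discrimination)
item_params = [(-1.8, 1.4), (-0.6, 1.7), (0.3, 1.2), (1.6, 1.9)]

# Test information is the sum of item information; the standard error of
# measurement at each theta is 1 / sqrt(information), so precision varies
# across the range of the construct.
for theta in np.linspace(-3.0, 3.0, 7):
    info = sum(item_information(theta, b, a) for b, a in item_params)
    print(f"theta = {theta:+.1f}  SE = {1.0 / np.sqrt(info):.2f}")
```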
How can this benefit your understanding of individual-level change?
One way that this varying reliability can be useful is in detecting individual-level change. Here, you are typically trying to identify when a change in score is greater than the change you might expect due to measurement error alone. The implication of varying reliability is that the magnitude of change required to be deemed meaningful will also vary. For a responder at a level where the PRO is highly reliable, a smaller change over time is needed to exceed what could be attributed to error than for a responder at a level where reliability is low.
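To make that concrete, here is a rough sketch building on the previous one (again with invented item parameters, and using a conventional 95% reliable-change style threshold) showing how the change needed to exceed measurement error differs across the scale:

```python
import numpy as np

def standard_error(theta, item_params):
    """Theta-dependent standard error from summed 2PL item information."""
    info = 0.0
    for difficulty, discrimination in item_params:
        p = 1.0 / (1.0 + np.exp(-discrimination * (theta - difficulty)))
        info += discrimination ** 2 * p * (1.0 - p)
    return 1.0 / np.sqrt(info)

# Hypothetical short-form item parameters (difficulty, discrimination)
item_params = [(-1.8, 1.4), (-0.6, 1.7), (0.3, 1.2), (1.6, 1.9)]

# A change is only flagged as beyond measurement error if it exceeds
# 1.96 * sqrt(SE_baseline^2 + SE_follow_up^2), so the threshold itself
# depends on where the respondent sits on the scale.
for baseline, follow_up in [(-0.5, 0.0), (2.5, 3.0)]:
    se1 = standard_error(baseline, item_params)
    se2 = standard_error(follow_up, item_params)
    threshold = 1.96 * np.sqrt(se1 ** 2 + se2 ** 2)
    print(f"From theta {baseline:+.1f} to {follow_up:+.1f}: "
          f"change must exceed {threshold:.2f} to outrun error")
```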
In summary
There are many more valuable applications of IRT, including Rasch models, score interpretations, score linkage and harmonisation, response scale analysis and adaptive testing, to name just a few.
If you find this broad topic of interest, or specific areas in particular, feel free to comment and let us know — we’d be happy to discuss and share more insights with you.