Intra-observer agreement in single and joint double readings of contrast-enhanced breast MRI screening for women with high genetic breast cancer risks

Objectives: To examine intra-observer reliability (IR) for lesion detection on contrast-enhanced breast magnetic resonance images (MRI) for screening women at high risk of breast cancer in single and joint double readings, without case selection. Methods: Contrast-enhanced breast MRIs were interpreted twice by the same independent reader and twice in joint readings. IR was assessed for lesion detection, normal MRI identification, mass, non-mass-like enhancement (NMLE) and focus characterisation, and BI-RADS assessment. Results: MRI examinations of 124 breasts in 65 women (mean age 43.4 years) were retrospectively reviewed, with 110 lesions identified. Abnormal BI-RADS (3-5) classifications were found for 52.3% in single readings and 58.5% in joint readings. Seven biopsies were performed, yielding 4 histologically confirmed cancers. IR for BI-RADS classifications was good for single (0.63, 95% CI: 0.49-0.77) and joint readings (0.77, 95% CI: 0.61-0.93). IR for background parenchymal enhancement (BPE) was moderate across single (0.53, 95% CI: 0.40-0.65) and joint readings (0.44, 95% CI: 0.33-0.56). IR for BI-RADS category according to each enhancement was poor for single readings (0.27, 95% CI: 0.10-0.44) and higher for joint readings (0.58, 95% CI: 0.43-0.72). Conclusions: IR in BI-RADS breast assessments or BI-RADS lesion assessments is better with joint reading in screening for women with high genetic risks, in particular for abnormal MRI (BI-RADS 3, 4 and 5).


Introduction
Around 5-10% of all breast cancers are hereditary, and approximately half of these hereditary breast cancers are characterised by known mutations in the BRCA1 and BRCA2 genes. In this population, the risk of developing breast cancer is between 65% and 80% [1,2], and cancers are particularly aggressive and occur at young ages. In response to this high risk, current practice guidelines propose two options to women in addition to medical prophylaxis or ovariectomy: either surgery with bilateral prophylactic mastectomy, which is rarely accepted in France, or an intensive screening programme to detect cancer at an early stage and to reduce breast cancer mortality. Initially, this screening programme included clinical breast examinations and annual mammography and started at 30 years of age [3][4][5]. In this group, the sensitivity of mammography is under 50% [6], not only due to high breast density but also due to the benign radiological presentation and rapid growth of these tumours [7]. In addition, in mammography-based screening programmes, up to 55% of detected cancers are interval cancers [8][9][10][11][12][13]. Annual contrast-enhanced magnetic resonance imaging (MRI) of the breast has been proposed as an alternative screening method because of its high sensitivity (71-100%) [14][15][16][17][18][19]. This method has recently been demonstrated to be superior to mammography for cancer detection in women at high familial risk [20].
To be reliable, a screening examination must be reproducible, yet until now reproducibility has mostly been studied only for lesion detection and characterisation [21][22][23][24][25][26]. In almost all of these reports, the MRIs were suspicious and lesions were annotated and confirmed with histopathology. Furthermore, mostly large masses were included across studies, and only a few studies have examined non-mass-like enhancements (NMLE) or foci, although the aim of screening is to detect small cancers at early stages.
Thus, to increase intra-observer reproducibility of contrast-enhanced breast MRI interpretation in this hereditary high-risk breast cancer population, we propose a joint-reading method, as has been proposed for mammographic screening for these patients [20]. We aim to compare intra-observer agreement between an independent reader's double reading and joint double readings. We evaluate intra-observer reproducibility for lesion detection, for normal MRI identification, for mass, NMLE and focus characterisation, and for Breast Imaging Reporting and Data System (BI-RADS) assessment, without case selection.

Patient selection
All asymptomatic women at high risk of breast cancer screened by dynamic contrast-enhanced breast MRI at our institution between 2007 and 2009 were considered for inclusion. High-risk women were defined as women known to be BRCA1, BRCA2 or TP53 gene mutation carriers or at high risk (above 20%) of carrying a gene mutation, without excluding women who had a personal cancer history. The patient's first (or only) MRI was considered in this study. Institutional review board approval was obtained prior to the commencement of this retrospective study. Written informed consent of patients was not required.

MRI protocol
Breast MRI was performed using a 1.5 Tesla scanner (INTERA, Philips). The patient lay prone with the breast in a dedicated breast coil. The screening MRI examination comprised axial T2-weighted fast spin-echo images (TR: 2871 ms, TE: 110 ms, slice thickness 3 mm) and dynamic three-dimensional fast spoiled gradient-echo axial or sagittal T1-weighted sequences (TR: 9.3 ms, TE: 4.6 ms, slice thickness 1.2 mm) for both breasts without fat saturation. The T1 images were collected before and a few seconds after the beginning of the 14 cc intravenous bolus injection of gadolinium (Dotarem®, 0.5 mmol/mL, Guerbet). For each dynamic study, 5 to 7 sequences were acquired (one sequence per 60-90 s), and a subtraction image was then created for each sequence. After the dynamic study, an axial T1 WATS fat-suppressed sequence was obtained.

Study design and image interpretation
To assess intra-observer variability in this study, two radiologists (experienced reader A and junior reader B) reviewed the MRI across five readings in total. Reader A interpreted the examinations twice independently (readings A1, A2), with a minimum one-month interval between the readings. Reader B interpreted the examinations once in preparation for the joint readings (B1); the two readers then reviewed each MRI in a joint reading (C1), which was repeated one month later (C2). Intra-observer variability in this study is assessed by comparing the two independent readings by an experienced radiologist (A1-A2) and by comparing the two joint readings (C1-C2).
Data from the workstation (View forum, Philips) were used with multiplanar reconstruction, maximum-intensity projection and a kinetics analysis. Morphological analysis was performed on the native sequences where the lesion intensity was maximal. All breast lesions were interpreted and categorised according to the BI-RADS classification of the American College of Radiology (ACR) [21]. The quality of the exam and any difficulties encountered with the reading were reported in the radiologists' reports.
Background parenchymal enhancement (BPE) of the breast was classified into 4 categories reflecting a progressive level of difficulty for MRI interpretation: (1) absent or late non-intense homogenous enhancement; (2) early non-intense homogenous, late intense homogenous or late non-intense heterogeneous enhancement; (3) early intense homogenous, early non-intense heterogeneous or late intense heterogeneous enhancement; (4) early intense heterogeneous enhancement. 'Early' was defined as visible on the first sequence after injection at 60-90 seconds, 'intense' was defined as more intense than fatty tissue, and 'homogenous' was employed according to the BI-RADS definition as confluent or uniform.
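The four-level BPE scale above can be written as a lookup table. The following is a minimal sketch, not the procedure used by the readers: the function name `bpe_category` and the string labels for timing, intensity and pattern are illustrative assumptions.

```python
# Illustrative encoding of the four-level BPE scale described in the text.
# The function name and the string labels are assumptions made for this sketch.
BPE_TABLE = {
    ("late", "non-intense", "homogeneous"): 1,
    ("early", "non-intense", "homogeneous"): 2,
    ("late", "intense", "homogeneous"): 2,
    ("late", "non-intense", "heterogeneous"): 2,
    ("early", "intense", "homogeneous"): 3,
    ("early", "non-intense", "heterogeneous"): 3,
    ("late", "intense", "heterogeneous"): 3,
    ("early", "intense", "heterogeneous"): 4,
}


def bpe_category(timing, intensity=None, pattern=None):
    """Return the BPE category (1 to 4) for one breast.

    timing: "absent", "early" or "late"
    intensity: "non-intense" or "intense" (ignored when timing is "absent")
    pattern: "homogeneous" or "heterogeneous" (ignored when timing is "absent")
    """
    if timing == "absent":
        return 1  # absent enhancement belongs to category 1
    try:
        return BPE_TABLE[(timing, intensity, pattern)]
    except KeyError:
        raise ValueError(f"unknown description: {(timing, intensity, pattern)}")
```

Read this way, the category equals one plus the number of features among early, intense and heterogeneous that are present, which matches the stated intent of a progressive level of interpretive difficulty.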

Previous MRI interpretation
Before inclusion in this study, all images had been interpreted by one of the radiologists at the institution, including a BI-RADS assessment. The BI-RADS assessment distribution per breast was 68.5% BI-RADS 1-2, 27.4% BI-RADS 3 and 4% BI-RADS 4-5. Per patient, the BI-RADS assessment was 52.3% BI-RADS 1-2 (34 women), 40% BI-RADS 3 (26 women) and 7.7% BI-RADS 4-5 (5 women), giving an overall recall rate of 47.7%. These data were used as baseline comparison data. Histological confirmation via biopsy was proposed for eight patients (12.3%) and seven agreed. Malignant tumours were found in 4 patients (6% of all patients included, or a 57% positive predictive value (PPV) for malignant tumours found across all lesions biopsied), and an atypical lesion (atypical ductal hyperplasia and flat epithelial atypia) was found in one patient.
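The per-patient rates quoted above follow directly from the reported counts. The short sketch below reproduces the arithmetic; the helper `pct` and the variable names are illustrative, and the recall rate is taken here as the share of patients assessed BI-RADS 3 to 5, as in the text.

```python
def pct(numerator, denominator, digits=1):
    """Percentage rounded to the given number of decimal places."""
    return round(100 * numerator / denominator, digits)


n_women = 65
per_patient = {"BI-RADS 1-2": 34, "BI-RADS 3": 26, "BI-RADS 4-5": 5}

# Share of patients in each BI-RADS group.
distribution = {group: pct(count, n_women) for group, count in per_patient.items()}

# Recall rate: patients assessed BI-RADS 3, 4 or 5.
recall_rate = pct(per_patient["BI-RADS 3"] + per_patient["BI-RADS 4-5"], n_women)

biopsy_proposed_rate = pct(8, n_women)    # biopsy proposed for 8 of 65 patients
cancer_detection_rate = pct(4, n_women, 0)  # 4 malignant tumours in 65 patients
ppv = pct(4, 7, 0)                        # 4 malignant tumours among 7 biopsies

print(distribution)           # {'BI-RADS 1-2': 52.3, 'BI-RADS 3': 40.0, 'BI-RADS 4-5': 7.7}
print(recall_rate)            # 47.7
print(biopsy_proposed_rate)   # 12.3
print(cancer_detection_rate)  # 6.0
print(ppv)                    # 57.0
```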

Diagnostic interpretation
The two radiologists read the MRI examinations without access to the clinical, mammography and ultrasound data, or to previous MRI interpretations. However, they were not blinded to patient age, hormonal status, treatment and breast history (surgery, radiotherapy). Each lesion was assigned one of the five BI-RADS scores and grouped into BI-RADS 1-2, BI-RADS 3 or BI-RADS 4-5. Breasts were categorised according to their BI-RADS lesions, with the highest BI-RADS category retained for breasts with multiple lesions. Very small foci (≤3 mm) were not reported, but foci of 4 and 5 mm were listed.
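The per-breast categorisation rule above can be sketched as follows. This is an illustration only: the function names are hypothetical, and treating a breast with no reported lesion as BI-RADS 1 is an assumption of this sketch rather than a rule stated in the text.

```python
def birads_group(score):
    """Collapse a BI-RADS score (1 to 5) into the three analysis groups."""
    if score in (1, 2):
        return "1-2"
    if score == 3:
        return "3"
    if score in (4, 5):
        return "4-5"
    raise ValueError(f"BI-RADS score must be 1 to 5, got {score}")


def is_reportable_focus(size_mm):
    """Foci of 3 mm or less were not reported; larger foci were listed."""
    return size_mm > 3


def breast_birads(lesion_scores):
    """Per-breast score: the highest BI-RADS score among its lesions.

    Assumption of this sketch: a breast with no reported lesion is scored 1.
    """
    return max(lesion_scores, default=1)


# Hypothetical breast with three lesions scored 2, 3 and 2.
print(birads_group(breast_birads([2, 3, 2])))  # 3
```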

Statistical methods
To measure intra-observer reproducibility for lesion detection, normal MRI identification, judgement of categorical enhancement, mass, NMLE and focus characterisation, and BI-RADS assessment, the kappa statistic was calculated for each pair of readings for the single reader (A1, A2) and for the joint readings (C1, C2); values are reported with 95% confidence intervals (CI). Due to the small series size, it was not possible to calculate the coefficient for other items such as type of lesion (morphological and dynamic), or to calculate sensitivity and specificity.
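As an illustration of the agreement measure, the sketch below computes an unweighted Cohen's kappa for two readings of the same cases, with an approximate 95% CI. The text does not state whether a weighted kappa was used or how the intervals were derived, so both the unweighted form and the simple large-sample standard error are assumptions of this sketch; the function name and the example readings are hypothetical.

```python
from collections import Counter
from math import sqrt


def cohen_kappa(first, second, z=1.96):
    """Unweighted Cohen's kappa with an approximate confidence interval.

    first, second: category labels from two readings of the same cases.
    Returns (kappa, lower, upper). The interval uses the simple
    large-sample standard error sqrt(po * (1 - po) / (n * (1 - pe) ** 2)).
    """
    if len(first) != len(second) or not first:
        raise ValueError("readings must be non-empty and of equal length")
    n = len(first)
    # Observed agreement: share of cases given the same category twice.
    po = sum(a == b for a, b in zip(first, second)) / n
    # Chance agreement from the marginal category frequencies.
    c1, c2 = Counter(first), Counter(second)
    pe = sum(c1[k] * c2[k] for k in c1.keys() | c2.keys()) / n ** 2
    if pe == 1:
        raise ValueError("kappa is undefined when only one category is used")
    kappa = (po - pe) / (1 - pe)
    se = sqrt(po * (1 - po) / (n * (1 - pe) ** 2))
    return kappa, kappa - z * se, kappa + z * se


# Hypothetical readings of four breasts (not study data).
reading_1 = ["1-2", "1-2", "3", "3"]
reading_2 = ["1-2", "1-2", "3", "1-2"]
kappa, lower, upper = cohen_kappa(reading_1, reading_2)
print(round(kappa, 2))  # 0.5
```

In the study design, `first` and `second` would correspond to readings A1 and A2 for the single reader, or C1 and C2 for the joint readings.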

Patient characteristics
MRI examinations for 65 women and 124 breasts (unilateral mastectomy in 6 patients) were retrospectively reviewed. It was the first MRI screening for 30 patients (46%). The mean patient age at the time of MR imaging was 43.4 years (min-max: 27-75). Of the 65 patients, 25 had BRCA1 gene mutations, 27 had BRCA2 gene mutations and 13 had non-proven mutations. Twenty-one women had a personal breast cancer history of either invasive carcinoma (17), ductal carcinoma in situ (DCIS) (1), or an association of both (3), and this was bilateral for two women. Seventeen had previously received breast radiotherapy on the imaged breast.
The image quality was considered poor for 6 images (9.2%) because of motion artefacts, and 12 images (18.5%) were judged as difficult to interpret because of a motion artefact (9), cardiac artefact (1), diffuse impeding BPE (3) or a technical issue (1). These images were nevertheless retained for analysis, as the image quality did not rule out a reliable clinical interpretation. Based on literature indicating that BPE is less frequent in menopausal women and after radiotherapy treatment, BPE was examined according to whether the patient had received radiotherapy and whether or not she was menopausal [22] (Table 1). One hundred and ten lesions were identified on the MR images for the 65 patients. The distribution of BI-RADS category across readings is given per breast (Table 2) and per lesion (Table 3). BI-RADS 4/5 lesions represented between 9.7% and 11.3% depending on the reading (comparable to "biopsy rates"). When considered per patient for comparison with "recall rates" in the literature, between 40% and 50.8% of patients in the single readings were categorised as having BI-RADS 3-5 lesions (and thus requiring further analyses), and in the joint readings this rate varied from 52.3% to 53.8%. If we consider the PPV as the number of malignant tumours found across all lesions biopsied, we find a PPV of between 28% and 33% according to the reading.
We also analysed reliability according to the overall BI-RADS category in subgroups of patients based on BPE reported in the second independent reading (categories 1 and 2 or categories 3 and 4). Intra-observer reliability of BI-RADS category across single readings was good when BPE was limited (categories 1 and 2), kappa = 0.69, 95% CI: 0.55-0.83, with 87% concordant cases (94/108 breasts).
Intra-observer reliability was calculated for BI-RADS category according to each enhancement. Lesions that were not described on one of the readings were classified as BI-RADS 1 or 2 (as not classified as a suspect or undetermined lesion). Reliability was poor for single readings, kappa = 0.27, 95% CI: 0.10-0.44, and higher for joint readings, kappa = 0.58, 95% CI: 0.43-0.72, but concordance rates for cases with no lesions were good: 54% for joint and 62% for single readings.
Reliability was also calculated according to the type of enhancement (mass, focus, NMLE) with moderate reliability for single readings, kappa = 0.34, 95% CI: 0.21-0.46. When the lesion type was concordant, the BI-RADS category was the same in 81% of cases (36/37) and when it was discordant, the BI-RADS category was the same in only 50% of cases (4/8). Joint reading reliability was moderate, kappa = 0.44, 95% CI: 0.32-0.56. When the lesion type was concordant, the BI-RADS category was the same in 100% of cases (56/56) and when it was discordant, the BI-RADS category was the same in only 60% of cases (6/10).

Discussion
Our study population is comparable with other high genetic risk populations in the literature in terms of cancer detection and biopsy rates. Our cancer detection rate of 6% (4 of 65 patients) is comparable to 1.7% [23] reported in a review of five prospective studies, 4% in the recent EVA trial [29] and 8-9% in other series [6,24]. Our rate of BI-RADS 4/5 lesions, which is approximately equivalent to 'biopsy rates' in the previous interpretation and in the literature, was between 9.7% and 11.3%, compared to 12.3% in the previous interpretation, and 8.3% [25] and 13% [31] in the literature. The PPV is between 28% and 33%, compared to 17-57% [26] in the literature and 57% in the previous MRI interpretation. The recall rate is 40-50.8% of patients in the single readings, 52.3-53.8% in joint readings and 47.7% in the previous MRI interpretation, higher than the 8-17% previously reported in other annual screening studies [23,24,27,28,29], probably due to an overuse of the BI-RADS 3 category, as discussed further below.

Intra-observer reliability
Reproducibility has mostly been assessed only for description and characterisation of lesions. Only one study, MARIBS, has looked at the reproducibility of detection of lesions [28]. This study was a performance test on 8 selected cases where the BI-RADS assessment was made on the whole breast. Other screening studies either date from before the BI-RADS classification [30,31] or assess only pathological MRIs, and focussed mostly on inter-observer rather than intra-observer reliability.

Background parenchymal enhancement (BPE)
In our study, 70 to 88% of exams were categories 1-2 and 12-25% were categories 3-4, so we can consider that this categorisation distinguishes enhancement types. Intra-observer agreement observed in our study for BPE was moderate (0.44-0.53) in both independent and joint readings. The agreement was not affected by enhancement intensity, and was similar to that reported for breast density on mammography (0.46-0.59) [32]. Most of our mismatches occurred between categories 1 and 2, which has limited clinical impact. However, one of the major difficulties was distinguishing BPE from NMLE (discussed further below). In clinical practice, when parenchymal glandular enhancement is suspected, a control MRI performed 15 days later, in a different part of the menstrual cycle, could be indicated. It is notable that a high number of post-menopausal women and patients having received radiotherapy had BPE, in contrast to what could be expected according to the literature [22].
Breast BI-RADS assessment
Concordance rates reach 84% for all breasts in single reading and 91% in joint reading, and the kappa values are good at 0.63 and 0.77, respectively. These results are in accordance with the previous agreement rate of 87% [33]. The BI-RADS 3 assessment rate (21-34% of women) is higher than in the literature (6.6-25%) [26,30,32,35,36,37]. Among our BI-RADS 3 assessments, we recorded 18% masses, as reported in Eby et al. [35], but respectively 68.5% and 36% NMLE. Our high rate may be explained by an overuse of the NMLE description and BI-RADS 3 classification for BPE. Other articles have reported the same difficulties, especially in the early years of clinical practice [35,36]. Having more objective criteria to recognise BPE for BI-RADS 3 enhancements should improve intra-observer agreement.

Limitations
The main limitation of our study is that it is a retrospective study at a single centre where the junior reader was trained by the senior reader. Further, the MRI interpretations in our study differ from those performed in daily clinical practice because the clinical, mammographic and ultrasound data and prior MRI were not available to the readers. A further limiting factor is the small number of cases (65 women and 124 breasts), which does not allow the calculation of kappa values within each lesion type or the calculation of sensitivity and specificity.

Conclusions
Overall, the parenchymal enhancement classification seems to categorise the population well and to be reproducible, but it is essential to study the performance of MRI according to the category of BPE. Intra-observer agreement in BI-RADS breast assessments or BI-RADS lesion assessments is improved with joint reading in screening for women with high genetic risks, in particular for abnormal MRI (BI-RADS 3, 4 and 5). For normal MRI, the intra-observer agreement is good for single readings and is not improved with joint reading. The joint reading of all MRI (normal and abnormal) led to increased monitoring of some 'in-between' patients (BI-RADS 3) without increasing the biopsy rate. Given the importance of establishing a reliable screening programme for these women at high risk, this study evaluating intra-observer reliability without case selection, including normal MRI, small lesions and NMLE, provides useful data. Future research on joint reading for MRI screening of women with high genetic risk of breast cancer could help improve MRI performance, with a specific focus on determining what kind of joint reading is optimal, in which population, and whether the results warrant the costs involved.

Table 1
Background parenchymal enhancement (BPE) classifications based on independent or joint readings across all women, post-menopausal women and women having received radiotherapy (ranges across two readings)

Table 4
Intra-observer reliability across single and joint readings of breast MRI according to type of diagnostic information

Table 2
BI-RADS category per breast (n=124) across all readings of MR images for women at high genetic risk of breast cancer

Intra-observer reliability for BPE was moderate across single readings, kappa = 0.53, 95% CI: 0.40-0.65, and across joint readings, kappa = 0.44, 95% CI: 0.33-0.56 (Table 4).

Table 3
BI-RADS category per lesion across all readings of MR images for women at high genetic risk of breast cancer

Intra-observer reliability could not be calculated for BPE categories 3 and 4 due to the small number of breasts, but 62.5% of the cases were concordant (10/16 breasts).
The previous agreement rate of 87% noted above for breast BI-RADS assessment was reported by Warren et al. [34] in a screening study; our results indicate that joint readings are more reliable than independent readings. This highlights the need to improve the BI-RADS definition and the MRI spatial definition, particularly for lesions surrounded by fat or by glandular tissue. Concerning the BI-RADS assessment of lesions, only inter-observer agreement has been reported in the literature, with kappa values of 0.57 and 0.45, but those studies included only large masses (12-23 mm) and, as Stoutjesdijk et al. [34] highlight, masses are the easiest type of lesion to describe and variability increases with NMLE.