Methods xxx (xxxx) xxx–xxx
classification accuracy of breast cancer pathological images, the dataset sizes of 249 and 400 images are still too small compared with the open datasets of natural images.
To further promote and complement research on breast cancer pathological image classification, we cooperated with Peking University International Hospital to release a new pathological image dataset of breast cancer. The format of our extended dataset is fully consistent with that of the dataset published by Bioimaging2015. The institutional review board approved the study, and all released data are anonymized. All images were acquired from March 2015 to March 2018 using a Leica Aperio AT2 slide scanner, and all patients were from China.
Our image dataset consists of 3771 high-resolution (2048 × 1536 pixels), annotated, hematoxylin and eosin (H&E) stained breast pathological images. Hematoxylin highlights nuclei by staining DNA, and eosin highlights other structures by staining proteins. All images were acquired under the same conditions, at 100× or 200× magnification. The pathological sections used in this work were prepared with the standard paraffin process, which is widely used in routine pathology practice. According to the cancer type it shows, each image is labeled as normal, benign, in situ carcinoma or invasive carcinoma. The annotation was performed by two medical experts, and images on which they disagreed were referred to the chief of pathology for final confirmation. Table 1 summarizes the image distribution. There, the initial images are those of the Bioimaging2015 dataset, and the extended images are those of our dataset, which can be regarded as an extension of the Bioimaging2015 dataset. Table 2 describes the image format in our dataset. Overall, starting from the initial 249 images, we increased the number of images in the dataset to 4020.
In particular, the structure and function of the breast vary with age and life stage, for example across puberty, sexual maturity, pregnancy, lactation and old age. To ensure the diversity of the data and thereby improve the learning ability of the machine learning algorithm, our dataset covers as many subsets spanning different age groups as possible, so that it can fully reflect the morphology of breast tissue.
Table 1. Summary of our dataset.
Dataset Normal Benign In situ carcinoma Invasive carcinoma Total
Table 2. Description of pathological images in our dataset.
Color model R(ed)G(reen)B(lue)
Fig. 1. Examples of breast cancer pathological images in our dataset.
Given a high-resolution (2048 × 1536 pixels) pathological image as input, our goal is to accurately classify it into one of four categories: normal, benign, in situ carcinoma or invasive carcinoma, as shown in Fig. 1. To this end, we propose a new hybrid convolutional and recurrent deep learning method, whose general workflow is shown in Fig. 2. In the training stage, the pathological images are preprocessed and augmented to improve quantitative analysis. After preprocessing, we first fine-tune the pretrained Inception-V3 model. For each image, the trained patch-wise model is used to extract the feature representation vectors of 12 patches. Then, these 12 vectors are used as input to train an image-wise long short-term memory (LSTM) network. In the testing stage, one pathological image is divided into an average of 12 small patches. Then, the fine-tuned Inception-V3 is used to extract the patch-wise image features, yielding a feature vector of 1 × 5376 dimensions for each patch. That is, 12 feature vectors can be extracted from one pathological image. Finally, the 12 feature vectors (12 × 1 × 5376) are input into a bidirectional LSTM, which fuses the features of the 12 small patches to make the final image-wise classification. Because our method integrates the advantages of CNNs and RNNs, both the short-term and long-term spatial correlations between patches can be preserved. We cover the method in more detail in the following sections.
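The image-wise fusion step can be illustrated with a minimal NumPy sketch of a bidirectional LSTM forward pass over the 12 patch feature vectors. This is not the trained model: the hidden size of 64, the random untrained weights, the use of each direction's final hidden state, and the softmax output layer are all assumptions made for illustration, and the random `patch_features` array is only a placeholder for the features that the fine-tuned Inception-V3 would produce.

```python
import numpy as np

# 12 patches and 5376-d features come from the text; HIDDEN = 64 is an assumed size.
NUM_PATCHES, FEAT_DIM, HIDDEN, NUM_CLASSES = 12, 5376, 64, 4
CLASSES = ["normal", "benign", "in situ carcinoma", "invasive carcinoma"]


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_lstm(rng, in_dim, hidden):
    """Random (untrained) weights for the input, forget, cell and output gates."""
    scale = 1.0 / np.sqrt(hidden)
    return {
        "W": rng.uniform(-scale, scale, (in_dim, 4 * hidden)),
        "U": rng.uniform(-scale, scale, (hidden, 4 * hidden)),
        "b": np.zeros(4 * hidden),
    }


def lstm_last_state(seq, params, hidden):
    """Run one LSTM direction over seq (T, in_dim); return the final hidden state."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in seq:
        z = x @ params["W"] + h @ params["U"] + params["b"]
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h


def classify_image(patch_features, fwd, bwd, w_out, b_out):
    """Fuse per-patch features in both directions and return class probabilities."""
    h_fwd = lstm_last_state(patch_features, fwd, HIDDEN)
    h_bwd = lstm_last_state(patch_features[::-1], bwd, HIDDEN)
    logits = np.concatenate([h_fwd, h_bwd]) @ w_out + b_out
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()


rng = np.random.default_rng(0)
fwd = init_lstm(rng, FEAT_DIM, HIDDEN)
bwd = init_lstm(rng, FEAT_DIM, HIDDEN)
w_out = rng.normal(0.0, 0.1, (2 * HIDDEN, NUM_CLASSES))
b_out = np.zeros(NUM_CLASSES)

# Placeholder for the 12 feature vectors extracted from one image.
patch_features = rng.normal(size=(NUM_PATCHES, FEAT_DIM))
probs = classify_image(patch_features, fwd, bwd, w_out, b_out)
print(probs.shape, CLASSES[int(probs.argmax())])
```

Reading the patch sequence in both directions lets each direction's final state summarize all 12 patches in a different order, which is one simple way to combine context from the whole image before classification.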
4.1. Image preprocessing and augmentation
The staining process is subject to many known inconsistencies. To alleviate them, and thereby bring sections processed under different conditions into a normalized space that enables better analysis, we normalized the H&E-stained pathological images using a previously published method. In our study, we perform 50 random color augmentations for each image.
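As a rough illustration of generating 50 random color variants of one image, the sketch below applies a random per-channel scale and shift in RGB space. The text does not specify the form of the color augmentation, so the jitter model, the ranges of ±5%, and the function name `random_color_augment` are assumptions. Stain normalization is not shown because the method is not identified here. A small random array stands in for a real image to keep memory use low.

```python
import numpy as np


def random_color_augment(image, rng, scale_range=0.05, shift_range=0.05):
    """Randomly scale and shift each RGB channel of an (H, W, 3) float image in [0, 1].

    The jitter model and its ranges are illustrative assumptions.
    """
    scale = rng.uniform(1.0 - scale_range, 1.0 + scale_range, size=3)
    shift = rng.uniform(-shift_range, shift_range, size=3)
    return np.clip(image * scale + shift, 0.0, 1.0)


rng = np.random.default_rng(0)
# Small placeholder with the same 4:3 aspect ratio as a 2048 x 1536 image.
image = rng.random((96, 128, 3))

# 50 random color augmentations per image, as stated in the text.
augmented = [random_color_augment(image, rng) for _ in range(50)]
print(len(augmented), augmented[0].shape)
```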
It is well known that deep learning methods depend heavily on the size of the training dataset: the more complex the network structure, the more data it needs to avoid overfitting and to generalize well. Meanwhile, the breast pathological images provided are very large, at 2048 × 1536 pixels. To address the problems of large image size and insufficient data, we extract patches from each image and augment the data by rotating the extracted patches by varying degrees and flipping them. This mode of data augmentation is consistent with real-world practice, as pathologists adopt no fixed orientation when observing and analyzing pathological images under a microscope. The label of each patch is inherited from the class assigned to the original image.
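Patch extraction with rotation and flip augmentation can be sketched as follows. The patch size and the set of rotation angles are not stated in this section, so the sketch assumes non-overlapping 512 × 512 patches (which happens to yield 12 patches for a 2048 × 1536 image) and rotations in multiples of 90°. The helper names are hypothetical.

```python
import numpy as np

PATCH = 512  # assumed patch side length; not stated in the text


def extract_patches(image, patch=PATCH):
    """Split an (H, W, C) image into non-overlapping patch x patch tiles, row by row."""
    height, width = image.shape[:2]
    return [
        image[r:r + patch, c:c + patch]
        for r in range(0, height - patch + 1, patch)
        for c in range(0, width - patch + 1, patch)
    ]


def rotations_and_flips(patch):
    """Return 8 variants: rotations by 0/90/180/270 degrees, each with and without a flip."""
    variants = []
    for k in range(4):
        rotated = np.rot90(patch, k)
        variants.append(rotated)
        variants.append(np.fliplr(rotated))
    return variants


def build_training_set(image, label):
    """Every augmented patch inherits the label of the image it was cut from."""
    return [
        (variant, label)
        for patch in extract_patches(image)
        for variant in rotations_and_flips(patch)
    ]


# Blank placeholder image: height 1536, width 2048, RGB.
image = np.zeros((1536, 2048, 3), dtype=np.uint8)
samples = build_training_set(image, "benign")
print(len(extract_patches(image)), len(samples))  # prints "12 96"
```

Under these assumptions, one labeled image yields 12 patches and 96 labeled training samples.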