CNN classification model for lung and colon cancer images

Abstract

Lung and colon cancers are the main causes of cancer death in Australia, especially in the population aged over 50. This has a substantial social and economic impact. Early diagnosis increases a patient's chances of appropriate treatment and survival. Currently, histopathological diagnosis is a manual process that is tedious and prone to disagreement between pathologists.

Advances in Machine Learning for image analysis have produced algorithms that are effective in classification tasks such as tumour diagnosis. Computer-aided diagnosis shows potential for improving diagnostic accuracy.

In this study I develop a five-class convolutional neural network to classify hematoxylin and eosin (H&E) stained histopathology images of colon and lung tissues to determine the presence of cancer cells.

The H&E image dataset used in this study is a subset of the dataset provided by Borkowski et al. (2019), created in Tampa, Florida, USA, using resources and facilities at the James A. Haley Veterans’ Hospital. Due to resource constraints, the subset consisted of 5,000 images of 256 x 256 pixels (1,000 images each of lung squamous cell carcinomas, lung adenocarcinomas, benign lung tissue, colon adenocarcinomas, and benign colon tissue).

The resulting model consists of three convolutional layers and three dense layers, with batch normalisation between layers for regularisation. The overall accuracy, precision, and sensitivity of the model are 90.290%, 90.6%, and 90.286% respectively. The results are promising, but there is still a long way to go before the model could be incorporated into patient care. The most surprising outcome was that only one lung image was misclassified as a colon image.

Research proposal

Cancer is a disease of the body’s cells. It occurs when abnormal cells grow in an uncontrolled manner. Cancer cells can damage surrounding cells or spread to other parts of the body (Cancer Australia, n.d.). In Australia in 2020, it is estimated that around 50,000 people will die from cancer and 150,000 new cases will be diagnosed. This has a substantial social and economic impact, not just on individuals and their families but on the wider community as well. Lung and colon cancers are the leading causes of cancer mortality in people aged over 50 (AIHW, 2020).

In 2018 lung cancer was the most common cause of cancer death in Australia, and it is estimated to remain so. For 2020, it is estimated that there will be 13,258 new cases diagnosed (7,238 males and 6,020 females), with 8,641 deaths (4,991 males and 3,650 females) (Cancer Australia, 2020).

In 2016 colon cancer was the third most commonly diagnosed cancer in Australia, and it is estimated to be the fourth most common in 2020. It is estimated that there will be 15,494 new cases of colon cancer diagnosed in 2020 (8,340 males and 7,154 females), with 5,322 estimated deaths (2,828 males and 2,494 females) (Cancer Australia, 2020).

Histopathology began in the 17th century. It is the process of distinguishing between normal tissue, non-malignant (benign) tissue, and malignant lesions (carcinomas). Even today it remains a mainly manual process in which a pathologist examines glass slides using conventional brightfield microscopy. With the advance of digitisation, the glass slides can now be digitised and viewed by a pathologist on a computer or analysed with image analysis techniques (Abels et al., 2019).

Vision is a task that is easy for humans but difficult for computers. Computer vision is one of the most active research topics for deep learning. Most of the research has been based on replicating human abilities. (Goodfellow, Bengio, & Courville, 2016)

Medical imaging is not new; X-rays have been around since 1895, when they were discovered by Wilhelm C. Röntgen. The success of medical imaging in patient care and the advances in the field have led to a large number of images needing to be interpreted by physicians and technicians (e.g. pathologists, radiologists). Computer-aided detection (CADe) and computer-aided diagnosis (CADx) are active areas of medical imaging research. Whilst CADe focuses on the detection and location of lesions in medical images, CADx focuses on diagnosis, i.e. the distinction between malignant and benign lesions (Suzuki, 2012).

Advances in Machine Learning in the field of image analysis have led to algorithms that are effective in the detection of specific objects (image segmentation) and in classification tasks such as tumour diagnosis (Abels et al., 2019). One subset of Machine Learning is Deep Learning, which uses multiple processing layers to find patterns in data with many levels of abstraction. This has led to dramatic improvements in speech recognition, object detection, and visual object recognition (LeCun, Bengio, & Hinton, 2015).

“Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.” (Goodfellow et al., 2016, p. 326) They are used for processing image data, which can be considered a two-dimensional grid of pixels.

In this study I am going to build, train, tune, and deploy a five-class convolutional neural network in Amazon Web Services (AWS) SageMaker to classify H&E stained histopathology images of colon and lung tissues to determine the presence of cancer cells. The programming language used is Python 3 with the TensorFlow 2 machine learning platform. The source code can be viewed in Appendices B, C, and D.

A computer-aided diagnosis system for detecting lung and colon cancer cells in histopathological images requires high accuracy with low false positive and low false negative rates. Both types of error could have severe consequences for patients: a false positive leads to overdiagnosis and possibly unnecessary treatment, whereas a false negative leaves a patient with cancer untreated.

Data Understanding

The images were captured from pathology slides using resources and facilities at the James A. Haley Veterans’ Hospital in Tampa, Florida. The images are de-identified, Health Insurance Portability and Accountability Act (HIPAA) compliant to ensure patient privacy, and have been validated. The original dataset consisted of 750 images of lung tissue (250 lung squamous cell carcinomas, 250 lung adenocarcinomas, and 250 benign lung tissue) and 500 images of colon tissue (250 each of colon adenocarcinomas and benign colon tissue).

The image preparation described by Borkowski et al. (2019) was as follows: the original 1024 x 768 pixel images were cropped to 768 x 768 pixels, and then, using Augmentor (an image augmentation library in Python), the dataset was expanded to 25,000 images with left and right rotations (up to 25 degrees, 1.0 probability) and horizontal and vertical flips (0.5 probability).
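
For illustration, a minimal sketch of such an augmentation pipeline using Augmentor is shown below. The directory path and per-class sample count are assumptions for illustration, not the authors’ actual script.

    import Augmentor

    # Hypothetical path: one pipeline is built per image class directory.
    pipeline = Augmentor.Pipeline("lung_colon_image_set/lung_aca")

    # Left and right rotations of up to 25 degrees, applied to every image.
    pipeline.rotate(probability=1.0, max_left_rotation=25, max_right_rotation=25)

    # Horizontal and vertical flips, each with probability 0.5.
    pipeline.flip_left_right(probability=0.5)
    pipeline.flip_top_bottom(probability=0.5)

    # Draw 5,000 augmented samples per class (25,000 across the five classes).
    pipeline.sample(5000)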

Dos Santos, Sabourin, and Maupin (2009) state that overfitting is a key problem in supervised machine learning tasks and that small datasets are more prone to it than large datasets. Warden’s (2017) rule of thumb for the size of a dataset to train a classifier is 1,000 images per class.

As stated above, the images in this dataset are all the same size (768 x 768 pixels). Often in image classification problems the images are of different sizes, yet convolutional networks require fixed-size inputs. The larger the input size, the less shrinking occurs and therefore the less deformation of features and patterns inside the image. Non-square images can be cropped or scaled down using interpolation. Both methods carry risks: cropping could remove features or patterns that occur near the edges, while scaling could deform features or patterns. Deforming is seen as less risky than cropping. (Hashemi, 2019)

Hashemi goes on to discuss resizing smaller images and prefers zero-padding over scaling up (zooming in). Zero-padding does not deform the image and results in better computational efficiency, because the zero-valued input units do not activate their convolutional units in the next layer.
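
The two options can be illustrated with TensorFlow image operations. This is a sketch of the general idea rather than a step used in this study, since the LC25000 images are larger than the network input; the sizes are illustrative.

    import tensorflow as tf

    img = tf.random.uniform((200, 200, 3))  # stand-in for a smaller input image

    # Interpolation scales the content up to the target size, deforming features.
    scaled = tf.image.resize(img, (256, 256))

    # Zero-padding centres the original pixels in a 256 x 256 canvas of zeros,
    # leaving the content undeformed; the zero units stay inactive downstream.
    padded = tf.image.pad_to_bounding_box(img, 28, 28, 256, 256)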

In preparing the slides, the pathologist stains the cells to bring out features. Hematoxylin, normally purple, has an affinity for the nucleic acids in the nuclei. Eosin, normally pink, binds to the cytoplasm of the cells. (Pontalba, Gwynne-Timothy, David, Jakate, Androutsos, & Khademi, 2019)

As the quality of hematoxylin and eosin (H&E) staining varies due to “dye concentration, staining time, formalin fixation time, freezing, cuttings skill, type of glass slide, and fading after staining”, colour normalisation is used to standardise the images. The choice of slide scanner and its settings also affects image quality, and differences in staining and imaging protocols may lead to poor model performance. (Nam et al., 2020, p. 129)

Colour normalisation is applied as a pre-processing step. It attempts to reduce colour variability by transforming the input data to a common space. Reduced variability in the colours of tissues has led to improvements in algorithms. (Pontalba, et al., 2019) Borkowski et al. do not state whether colour normalisation was performed on the dataset. Colour normalisation is outside the scope of this study due to its complexity and the timeframe required to undertake it.

Due to the limited resource allocation in AWS SageMaker (and the time it takes to increase it), image selection, exploration, and dataset splitting were done on a local machine, as the images were too big to process in the SageMaker notebook instance. The first 1,000 images from each image class were selected, read in, and reduced from 768 x 768 to 256 x 256 pixels. An image size of 512 x 512 pixels was attempted initially, but those images were too big to be processed in SageMaker. A selection of images is shown in Figure 1 Images of colon cells and Figure 2 Images of lung cells.

Figure 1 Images of colon cells
Images a) to c) contain non-cancerous cells. Images d) to f) contain Colon ACA cells
Figure 2 Images of lung cells
Images a) and b) are non-cancerous cells. Images c) and d) are Lung ACA cells. Images e) and f) are Lung SCC cells
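
A minimal sketch of this local preprocessing step is shown below, assuming the LC25000 directory layout; the paths, file pattern, and function name are illustrative assumptions.

    import numpy as np
    from pathlib import Path
    from PIL import Image

    def load_class_images(class_dir, n_images=1000, size=(256, 256)):
        """Read the first n_images from a class folder, downsized to 256 x 256."""
        files = sorted(Path(class_dir).glob("*.jpeg"))[:n_images]
        return np.stack([np.asarray(Image.open(f).resize(size, Image.LANCZOS))
                         for f in files])

    # Hypothetical class directory; this is repeated for each of the five classes.
    lung_aca = load_class_images("lung_colon_image_set/lung_image_sets/lung_aca")
    print(lung_aca.shape)  # (1000, 256, 256, 3)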

As can be seen in Figure 3 Class distribution of input dataset, each image class contains 1,000 images.

Figure 3 Class distribution of input dataset

The dataset was split into 80% Training and 20% Test (Figure 4 Data split visualisation), using the stratify parameter to ensure class populations were consistent with the input dataset (Figure 5 Class distribution of Training dataset) and a random_state of 88 to ensure that the split is reproducible. (Note: the Training dataset was split 80/20 again when fitting the model later, so the second 20% is used for validation.)

Figure 4 Data split visualisation
Figure 5 Class distribution of Training dataset
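
The split can be reproduced with scikit-learn’s train_test_split, as sketched below; X and y stand for the image array and integer labels assumed to have been assembled in the previous step.

    from sklearn.model_selection import train_test_split

    # X: (5000, 256, 256, 3) image array; y: (5000,) integer labels 0-4,
    # assumed to have been built from the five class folders above.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=0.2,     # 20% held out as the Test set
        stratify=y,        # keep class proportions equal to the input dataset
        random_state=88,   # make the split reproducible
    )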

The 80/20 principle comes from Vilfredo Pareto, an Italian engineer and economist, who came up with the idea that 80% of consequences come from 20% of the causes.

The shape of the training dataset can be seen in Figure 6 Training dataset shape. There are 4,000 images of 256 x 256 pixels with 3 channels (Red, Green, Blue), as the images are in colour.

Figure 6 Training dataset shape

The first nine images from the training and test datasets are shown in Figure 7 Sample from Training dataset and Figure 8 Sample from Test dataset, respectively. The training and test datasets were then saved as npz files (NumPy’s compressed array format) and uploaded to a SageMaker notebook instance.

Figure 7 Sample from Training dataset
Figure 8 Sample from Test dataset
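
Saving and reloading the arrays in compressed .npz format might look as follows; the file and array names are assumptions.

    import numpy as np

    # Save the split datasets before uploading to the SageMaker notebook instance.
    np.savez_compressed("train.npz", images=X_train, labels=y_train)
    np.savez_compressed("test.npz", images=X_test, labels=y_test)

    # Reading them back inside SageMaker:
    with np.load("train.npz") as data:
        X_train, y_train = data["images"], data["labels"]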

Modelling

A convolutional neural network layer’s output is determined not only by the input shape, but also by the kernel shape, padding, and strides (Dumoulin & Visin, 2018). A diagram of the best tuned convolutional neural network model used in this study is shown in Figure 9 Convolutional Neural Network. Three convolutional layers, three max pooling layers, and three dense layers were selected.

The image size is 256 x 256 pixels because it is a power of 2. This was chosen so that downsizing with MaxPooling multiple times would not require rounding to the nearest integer (i.e. 256, 128, 64, etc.).

The number of filters for the first layer was set to 64. This can be any power of 2 up to 1024; the higher the number of filters, the more powerful the model, but at the risk of overfitting. Each successive layer increases the number of filters (Dertat, 2017). In this study the number of filters was doubled with each layer.

Figure 9 Convolutional Neural Network

Kernel size is always an odd number, because “all previous layer pixels would be symmetrical around the output pixel” (Figure 10 Kernel size comparison); an even-numbered kernel would lead to distortions (Sahoo, 2018). Kernel size tends to be either 3×3 or 5×5, and for this study 3×3 was chosen. Smaller kernels look at a few pixels at a time, which helps to capture smaller, complex features in an image. They also share weights better, and the lower number of weights makes them computationally efficient. The larger number of layers this size allows lets the network learn more complex, more non-linear features, but at the cost of needing more memory. (Icecreamlabs, 2018)

Figure 10 Kernel size comparison
Source: https://towardsdatascience.com/deciding-optimal-filter-size-for-cnns-d6f7b56f9363

Stochastic gradient descent (SGD) was chosen as the optimizer. It has an advantage over batch gradient descent in that it doesn’t perform redundant computations, making it faster. (Ruder, 2016)

The ReLU (Rectified Linear Unit) activation function was chosen for the convolutional layers. It is the most widely used activation function, as its linear behaviour makes networks easier to train and often achieves better performance. Its formula is ReLU(x) = max(0, x): a piecewise linear function whose output is the input when positive, and 0 otherwise. (Patel, 2019)

Padding was set to ‘same’. Without padding, the image would get smaller and smaller with every convolution. Padding ensures that the outer pixels at the corners and edges are used as much as those in the middle; with ‘same’, the output is the same size as the input. (Savyakhosla, 2019)

Max pooling is the most common kind of pooling. With max pooling, the input is split into non-overlapping patches and the maximum value of each patch is output (Dumoulin & Visin, 2018). This produces a summarised version of the features in the input, and reducing the number of parameters gives higher computational speed. “In all cases, pooling helps to make the representation become approximately invariant to small translations of the input. Invariance to translation means that if we translate the input by a small amount, the values of most of the pooled outputs do not change.” (Goodfellow, Bengio & Courville, 2016, p. 336) It is almost always 2×2 with a stride of 2.
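
A small demonstration of 2×2 max pooling with stride 2:

    import tensorflow as tf

    # A 4 x 4 single-channel input, shaped (batch, height, width, channels).
    x = tf.reshape(tf.range(16, dtype=tf.float32), (1, 4, 4, 1))

    # 2 x 2 pooling with stride 2 halves each spatial dimension; every output
    # value is the maximum of one non-overlapping 2 x 2 patch.
    pooled = tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=2)(x)
    print(pooled.shape)  # (1, 2, 2, 1)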

Dropout is a simple way to prevent neural networks from overfitting. It is based on a theory of the role of sex in evolution: offspring receive half their genes from one parent and half from the other, plus a small amount of random mutation. In sexual reproduction, the ability of a set of genes to work with another random set of genes makes them more robust than genes produced asexually. Because genes cannot rely on a large set of partners being present, they must learn to do something useful on their own or in collaboration with a small set of other genes. This theory carries over to neural networks trained with dropout, where each hidden unit must learn to work with a randomly chosen sample of other units. Instead of relying on other hidden units to correct its mistakes, the hidden unit becomes more robust and creates useful features on its own. (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014)

Srivastava et al. also found that random dropout worked in a wide variety of domains, such as digit recognition, object classification, document classification, speech recognition, and the analysis of computational biology data; that is, it is a general technique not specific to any domain. They also found that using dropout on every layer improved the error rate. However, it introduced noise, which required a learning rate 10-100 times that of a standard neural network, as well as a higher momentum (0.95 to 0.99). As a higher momentum and learning rate increase the network weights, they suggest max-norm regularisation of 3 to 4.

In this study it was found that dropout degraded the performance of the model (accuracy dropped to 34%); batch normalisation was found to work better. Ioffe and Szegedy (2015) found that batch normalisation allowed the use of higher learning rates, made training less sensitive to initialisation, and eliminated the need for dropout. Batch normalisation fixes the means and variances of the layer inputs, and the authors also claim that it substantially speeds up training.
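
A minimal Keras sketch of the architecture described above follows: three 3×3 ‘same’-padded convolutional layers with doubling filter counts, 2×2 max pooling, batch normalisation, and three dense layers ending in a five-class softmax. The exact placement of batch normalisation and the widths of the later dense layers are assumptions; the first dense size (512) and the learning rate come from the tuning results in the next section.

    import tensorflow as tf
    from tensorflow.keras import layers, models

    def build_model(input_shape=(256, 256, 3), n_classes=5,
                    dense_units=512, learning_rate=0.068336):
        model = models.Sequential([
            # Block 1: 64 filters, 3x3 kernels, 'same' padding, ReLU.
            layers.Conv2D(64, (3, 3), padding="same", activation="relu",
                          input_shape=input_shape),
            layers.BatchNormalization(),
            layers.MaxPooling2D((2, 2)),

            # Block 2: filters double to 128.
            layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
            layers.BatchNormalization(),
            layers.MaxPooling2D((2, 2)),

            # Block 3: filters double again to 256.
            layers.Conv2D(256, (3, 3), padding="same", activation="relu"),
            layers.BatchNormalization(),
            layers.MaxPooling2D((2, 2)),

            # Three dense layers; the later widths are assumptions.
            layers.Flatten(),
            layers.Dense(dense_units, activation="relu"),
            layers.BatchNormalization(),
            layers.Dense(dense_units // 2, activation="relu"),
            layers.Dense(n_classes, activation="softmax"),
        ])
        model.compile(
            optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate),
            loss="sparse_categorical_crossentropy",  # integer class labels
            metrics=["accuracy"],
        )
        return model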

The modelling assumptions are:

  1. The features in the images are not spatially dependent (Albawi & Mohammed, 2017). That is, in this study, we do not need to consider where in the slide the cells occur.
  2. The images have been correctly labelled.
  3. The scanning of the images has been consistent.
  4. The images have been colour normalised.
  5. A training set of one thousand images per class is sufficient to train the model to a high level of accuracy.

Model Evaluation and Deployment

The hyperparameter tuning job took five hours (Figure 12 Hyperparameter Tuning Jobs), and the best tuning job is shown in Figure 13 Best hyperparameter tuning job. The best parameters are batch size 63, dense layer 512, epochs 19, and a learning rate of 0.068336. The best job took 2,998 seconds (about 50 minutes) to run. Fortunately, spot instances were used to minimise costs, with a saving of 78.8%. Increasing the number of filters from 32 to 64 dramatically increased the runtime of the hyperparameter tuning jobs, and consequently not all tuning jobs were able to complete within the maximum training time of 3,600 seconds.

Figure 12 Hyperparameter Tuning Jobs
Figure 13 Best hyperparameter tuning job
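
A hedged sketch of how such a tuning job can be configured with the SageMaker Python SDK follows; the entry-point script name, instance type, metric regex, and search ranges are assumptions inferred from the best-job values reported above, not the exact configuration used.

    import sagemaker
    from sagemaker.tensorflow import TensorFlow
    from sagemaker.tuner import (ContinuousParameter, IntegerParameter,
                                 HyperparameterTuner)

    estimator = TensorFlow(
        entry_point="train.py",            # assumed training script name
        role=sagemaker.get_execution_role(),
        instance_count=1,
        instance_type="ml.p2.xlarge",      # assumed GPU instance type
        framework_version="2.1",
        py_version="py3",
        use_spot_instances=True,           # spot instances to minimise cost
        max_run=3600,                      # maximum training time (seconds)
        max_wait=7200,
    )

    tuner = HyperparameterTuner(
        estimator,
        objective_metric_name="val_accuracy",
        metric_definitions=[{"Name": "val_accuracy",
                             "Regex": "val_accuracy: ([0-9\\.]+)"}],
        hyperparameter_ranges={            # assumed search ranges
            "batch-size": IntegerParameter(32, 128),
            "dense-layer": IntegerParameter(128, 1024),
            "epochs": IntegerParameter(10, 50),
            "learning-rate": ContinuousParameter(0.001, 0.1),
        },
        max_jobs=10,
        max_parallel_jobs=2,
    )
    tuner.fit({"training": "s3://my-bucket/train"})  # assumed S3 input path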

Output from all 10 jobs can be seen in Appendix A – Output from Hyperparameter Tuning.

As can be seen in Figure 14 Best Model Summary, the training validation accuracy is 90.3%. The previous model, which used only 32 filters, achieved 72.5% accuracy.

The model was successfully deployed to endpoint ‘tensorflow-training-201010-0628-004-1c3e7b4e-ep’, as shown in Figure 15 Model deployment to endpoint on AWS SageMaker.

Figure 14 Best Model Summary
Figure 15 Model deployment to endpoint on AWS SageMaker
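
A sketch of the deployment call is below; HyperparameterTuner.deploy deploys the best training job’s model, and the instance type shown is an assumption.

    # Deploy the best model from the tuning job to a real-time endpoint.
    predictor = tuner.deploy(initial_instance_count=1,
                             instance_type="ml.m5.xlarge")

    # The TensorFlow Serving endpoint returns class probabilities per image.
    result = predictor.predict(X_test[:1].tolist())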

The confusion matrix in Figure 16 Confusion matrix shows that the model has performed quite well, especially for classes 0 (Colon ACA) and 3 (Lung No). In Table 1 Model performance statistics we can see that the model’s precision is 87.21% for colon adenocarcinomas, 95.53% for benign colon tissue, 83.98% for lung adenocarcinomas, 100% for benign lung tissue, and 86.27% for lung squamous cell carcinomas. The overall accuracy, precision, and sensitivity of the model are 90.290%, 90.6%, and 90.286% respectively. The model is a good fit, as this accuracy on data unseen by the model matches the accuracy of the best trained model (refer to Figure 14 Best Model Summary above).

Figure 16 Confusion matrix
0 = Colon – ACA, 1 = Colon – No
2 = Lung – ACA, 3 = Lung – No, 4 = Lung – SCC
Table 1 Model performance statistics
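
These statistics can be reproduced with scikit-learn, as sketched below, assuming y_test and the predicted labels y_pred (e.g. the argmax of the endpoint’s class probabilities).

    from sklearn.metrics import (accuracy_score, confusion_matrix,
                                 precision_score, recall_score)

    # y_pred: predicted class labels, e.g. np.argmax(probabilities, axis=1).
    cm = confusion_matrix(y_test, y_pred)      # rows: true class, columns: predicted
    accuracy = accuracy_score(y_test, y_pred)  # overall accuracy (~0.9029 here)
    precision = precision_score(y_test, y_pred, average="macro")
    sensitivity = recall_score(y_test, y_pred, average="macro")

    print(cm)
    print(f"accuracy={accuracy:.3%} precision={precision:.1%} "
          f"sensitivity={sensitivity:.3%}")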

Whilst the model performed reasonably well in classifying the images, I believe there is still a lot of room for improvement. One way to improve the classification would be to use larger images; I would have liked to use images of 512 x 512 pixels, but I had technical issues with resource allocation that AWS had not rectified at the time of writing. Increasing the dataset size to give the model more images to train on could also improve performance. Perhaps the most surprising result is that only one lung image was misclassified as a colon image. As expected, the other misclassifications occurred within the same body tissue type.

Conclusion

In this study I set out to build, train, tune, and deploy a five-class convolutional neural network to classify hematoxylin and eosin (H&E) stained histopathology images of lung and colon tissue, using AWS SageMaker. The topic turned out to be more interesting than first imagined.

It was also a challenging problem. Time differences between Queensland and AWS in the United States caused many frustrating delays in getting resource increases and technical issues resolved, especially with their service level agreement being two business days. Unfortunately for me, this kept occurring over weekends, and consequently four elapsed days were lost each time. In future, time spent at the start of a project calculating how much processing power is required would be beneficial.

I found that batch normalisation, instead of dropout, produced a better performing model. As expected, increasing the number of filters dramatically increased the run time of the model, but it did produce a much better performing model (90.3% vs 72.5%).

Given the level of accuracy in the model I am inclined to believe the images have been colour normalised. This was stated in assumption 4 above.

Whilst the best performing model achieved an accuracy of 90.29%, the levels of false positives and false negatives show there is still a lot of work to be done if computer-aided diagnosis for detecting lung and colon cancer cells in histopathological images is to be incorporated into the patient care process.

I would be interested to see how a model using all 25,000 full-size (768 x 768 pixel) images performed. Unfortunately, that would require immense processing power.

References

AIHW. (2020, June 02). Cancer data in Australia.
https://www.aihw.gov.au/reports/cancer/cancer-data-in-australia/contents/summar

Albawi, S., & Mohammed, T. A. (2017). Understanding of a convolutional neural network.
doi: 10.1109/ICEngTechnol.2017.8308186

Borkowski, A. A., Bui, M. M., Thomas, L. B., Wilson, C. P., DeLand, L. A., & Mastorides, S. M. (2019). Lung and colon cancer histopathological image dataset (LC25000). arXiv:1912.12142v1 [eess.IV].
https://arxiv.org/pdf/1912.12142.pdf

Dataset: https://academictorrents.com/details/7a638ed187a6180fd6e464b3666a6ea0499af4af

Cancer Australia. (2020, August 24). Bowel cancer.
https://www.canceraustralia.gov.au/affected-cancer/cancer-types/bowel-cancer/statistics

Cancer Australia. (2020, August 24). Lung cancer.
https://www.canceraustralia.gov.au/affected-cancer/cancer-types/lung-cancer/statistics

Cancer Australia. (n.d.). What is cancer?
https://www.canceraustralia.gov.au/affected-cancer/what-cancer

Dertat, A. (2017, November 9). Applied deep learning – part 4: Convolutional neural networks. Towards Data Science.
https://towardsdatascience.com/applied-deep-learning-part-4-convolutional-neural-networks-584bc134c1e2

Dos Santos, E. M., Sabourin, R., & Maupin, P. (2009). Overfitting cautious selection of classifier ensembles with genetic algorithms. Information Fusion, 10(2), 150-162. https://doi.org/10.1016/j.inffus.2008.11.003

Dumoulin, V. & Visin, F. (2018, January 12). A guide to convolution arithmetic for deep learning.
https://arxiv.org/pdf/1603.07285.pdf

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
http://www.deeplearningbook.org/

Hashemi, M. (2019). Enlarging smaller images before inputting into convolutional neural network: Zero-padding vs. interpolation. Journal of Big Data, 6, 98. https://doi.org/10.1186/s40537-019-0263-7

Icecreamlabs. (2018, August 19). 3×3 convolution filters: a popular choice.
https://icecreamlabs.com/2018/08/19/3×3-convolution-filters%E2%80%8A-%E2%80%8Aa-popular-choice/

Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167.

LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436–444.
https://doi.org/10.1038/nature14539

Nam, S., Chong, Y., Jung, C. K., Kwak, T. Y., Lee, J. Y., Park, J., Rho, M. J., & Go, H. (2020). Introduction to digital pathology and computer-aided pathology. Journal of Pathology and Translational Medicine, 54(2), 125–134. https://doi.org/10.4132/jptm.2019.12.31

Patel, K. (2019, September 8). Convolutional neural networks – a beginner’s guide. Towards Data Science.
https://towardsdatascience.com/convolution-neural-networks-a-beginners-guide-implementing-a-mnist-hand-written-digit-8aa60330d022

Abels, E., Pantanowitz, L., Aeffner, F., et al. (2019). Computational pathology definitions, best practices, and recommendations for regulatory guidance: a white paper from the Digital Pathology Association. The Journal of Pathology, 249(3), 286-294. doi: 10.1002/path.5331
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6852275/

Pontalba, J. T., Gwynne-Timothy, T., David, E., Jakate, K., Androutsos, D., & Khademi, A. (2019). Assessing the impact of color normalization in convolutional neural network-based nuclei segmentation frameworks. Frontiers in Bioengineering and Biotechnology, 7, 300. https://doi.org/10.3389/fbioe.2019.00300

Ruder, S. (2016, January 19). An overview of gradient descent optimization algorithms.
https://ruder.io/optimizing-gradient-descent/

Sahoo, S. (2018, August 20). Deciding optimal kernel size for CNN. Towards Data Science.
https://towardsdatascience.com/deciding-optimal-filter-size-for-cnns-d6f7b56f9363

Savyakhosla. (2019, July 26). CNN | Introduction to padding. Geeks for Geeks.
https://www.geeksforgeeks.org/cnn-introduction-to-padding/

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014, January). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929-1958.
https://dl.acm.org/doi/abs/10.5555/2627435.2670313

Suzuki, K. (2012). A review of computer-aided diagnosis in thoracic and colonic imaging. Quantitative Imaging in Medicine and Surgery, 2(3), 163–176. https://doi.org/10.3978/j.issn.2223-4292.2012.09.02

Warden, P. (2017, December 14). How many images do you need to train a neural network?
https://petewarden.com/2017/12/14/how-many-images-do-you-need-to-train-a-neural-network/

Appendices

Appendix A – Output from Hyperparameter Tuning