Fusion of Five Deep Learning Models for Skin
Lesion Classification in ISIC Challenge 2019
Hamed Aghapanah1, Ehsan Mohammadi1, Shakib Yazdani2, Saeed Kermani1, Alireza Vard1
1 Department of Bioelectrics and Biomedical Engineering, School of Advanced Technologies, Isfahan University of Medical Sciences, Isfahan, Iran
2 Department of Electrical and Computer Engineering, Isfahan University of Technology, Iran
Abstract— Melanoma is one of the most dangerous forms of skin cancer, but it can be cured if it is diagnosed in the early stages. In this project, we propose a weighted voting method over five deep learning models, namely Densenet201, ResNet50, InceptionResNetV2, Xception, and VGG16, to classify the dermoscopic images of the ISIC challenge into nine diagnostic categories of skin lesions. We split the data into 70% training, 15% validation, and 15% test sets. The test classification accuracy of the fusion of models was 99%. In the second task, metadata was used in the last layer of the proposed deep learning model, and the accuracy improved to 99.5%.
Keywords—Fusion of deep learning models, ISIC challenge, classification, dermoscopic images.
I. Introduction
The skin can change over time for many reasons, such as aging, genetic predisposition, exposure to sunlight, and allergies. Some skin changes may also be a sign of disease in the body. Skin cancer can grow slowly and easily go unnoticed, making it difficult to keep track of changes. However, earlier detection of skin cancer can greatly increase the success of treatment.
Melanoma is one of the most dangerous forms of skin cancer and can appear anywhere on the body, even on areas that are not exposed to the sun. The most frequent locations for melanoma are the face, scalp, trunk or torso (chest, abdomen, back), legs, and arms. However, melanoma can also develop under the fingernails or toenails; on the palms, soles, or tips of fingers and toes; or on mucous membranes, such as the skin that lines the mouth, nose, vagina, and anus. Early detection and staging of the lesion are very important: when melanoma is found and treated early, the chances for long-term survival are increased, whereas as melanoma progresses, it becomes increasingly harder to treat, often with tragic results. Five-year survival rates for patients with early-stage detection are greater than 92%, and 86% of all diagnosed patients enjoy long-term survival after a simple surgery. Cutaneous melanoma (CM) is potentially the most dangerous form of skin tumor and causes 90% of skin cancer mortality.
In this regard, we present a weighted voting algorithm over five deep learning models, namely Densenet201, ResNet50, InceptionResNetV2, Xception, and VGG16, in order to categorize the dermoscopic images of the ISIC challenge into nine skin lesion classes.
The remainder of this report is organized as follows: the dataset used in this project is introduced in Section II, the proposed method is explained in Section III, and the experimental results are presented in Section IV.
II. Dataset
The ISIC 2019 challenge dataset consists of eight diagnostic classes and one unknown class. The number of images in each class is not equal, and the last class has no members. We therefore used the entire original dataset together with an external dataset for the unknown class.
The original dataset consists of 25,331 JPEG images of skin lesions. The "ISIC 2019: Training" data includes content from several copyright holders, among them the BCN20000 dataset, composed of 19,424 dermoscopic images of skin lesions captured from 2010 to 2016 at the facilities of the Hospital Clinic in Barcelona. The ISIC training dataset is shown in Figure 1.
Figure 1 ISIC dataset used for training
Unknown Class dataset
The last class, "none of the others," in the training dataset was empty, so we filled it with ordinary skin images from a face recognition dataset; some of these images are shown in Figure 2.
Figure 2 External dataset used for training the unknown class
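Since the classes in this dataset are heavily imbalanced, a common mitigation is inverse-frequency class weighting during training. The paper does not state which balancing strategy (if any) was used, so the following NumPy sketch is only an illustrative assumption:

```python
import numpy as np

def class_weights(labels, n_classes=9):
    """Inverse-frequency weights: rare classes get larger weights."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    counts[counts == 0] = 1.0  # guard against an empty class
    return counts.sum() / (n_classes * counts)
```

Such weights could, for example, be passed to Keras `fit(..., class_weight=...)` as a dictionary.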
In this section, we first briefly review the five popular deep learning models utilized in our work (Subsections A to E); in Subsection F, the proposed fusion method is explained.
A. DenseNet201
In DenseNet networks, there are dense blocks in which each layer takes as input the feature maps of all layers before it. Between two dense blocks there are transition layers, where convolution and pooling operations reduce and normalize the feature maps. The most important feature of DenseNet is this reuse of the outputs of previous layers by subsequent layers, so the architecture loses very little information.
Figure 3 The transformations within a layer in DenseNets
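The dense connectivity described above can be sketched in Keras. This is a minimal, hypothetical dense block (the layer counts and growth rate are illustrative, not the exact DenseNet201 configuration):

```python
import tensorflow as tf
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth_rate=32):
    """Each layer receives the concatenation of all preceding feature maps."""
    for _ in range(num_layers):
        y = layers.BatchNormalization()(x)
        y = layers.ReLU()(y)
        y = layers.Conv2D(growth_rate, 3, padding="same")(y)
        # Feature reuse: channel count grows by growth_rate per layer.
        x = layers.Concatenate()([x, y])
    return x
```

After this block, an input with 64 channels would carry 64 + 4 × 32 = 192 channels.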
B. ResNet50
Since ResNet blew people's minds in 2015, many in the research community have dived into the secrets of its success, and many refinements have been made to the architecture. The residual block, the main component of the ResNet model, is shown in Figure 4.
Taking advantage of its powerful representational ability, ResNet has boosted the performance of many computer vision applications beyond image classification, such as object detection and face recognition.
Figure 4 Residual learning: a building block
Figure 4 presents the most important feature of the ResNet model: the identity mapping, which has no parameters and simply adds the output from the previous layer to the layer ahead. Figure 5 presents three types of residual units, where the Conv and ReLU blocks denote convolution and activation function, respectively. The first is the residual unit of ResNet, the second is the type-B Inception residual unit of Inception-ResNet-v2, and the last is the abstract residual unit structure, where the residual block is denoted by F.
Figure 5 Left: residual unit of ResNet. Middle: type-B Inception residual unit of Inception-ResNet-v2. Right: abstract residual unit structure where the residual block is denoted by F
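The identity shortcut of Figure 4 can be sketched as a minimal Keras residual block. This is an illustration only; ResNet50 itself uses bottleneck units with 1×1 convolutions, and the filter count here is assumed to match the input channels:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    """Two 3x3 convolutions plus an identity shortcut (F(x) + x).
    Assumes `filters` equals the number of input channels."""
    shortcut = x  # identity mapping: no extra parameters
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([shortcut, y])  # the parameter-free addition
    return layers.ReLU()(y)
```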
C. InceptionResNetV2
Residual connections have also enabled significant simplification of the Inception blocks (Figure 6) with a dimension reduction approach; compare the model architectures in Figure 7.
Figure 6 Inception module with dimension reduction
Figure 7 On the left is the overall schema for the pure Inception-v4 network. On the right is the detailed composition of the stem. Note that this stem configuration was also used for the Inception-ResNet-v2 network outlined in Figures 5 and 6. V denotes the use of ‘Valid’ padding, otherwise ‘Same’ padding was used. Sizes to the side of each layer summarize the shape of the output for that layer
Figure 8 A canonical Inception module (Inception V3)
Figure 7 shows the full Inception-ResNet-v2 network expanded; notice that this network is considerably deeper than the previous Inception V3. Figure 8 is an easier-to-read view of the same family of networks in which the repeated residual blocks have been compressed. Here, notice that the Inception blocks have been simplified, containing fewer parallel towers than the previous Inception V3.
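The canonical Inception module with 1×1 dimension reduction (Figure 6) can be sketched as parallel towers concatenated along the channel axis. The filter counts below are illustrative defaults, not those of Inception-ResNet-v2:

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1=64, f3r=96, f3=128, f5r=16, f5=32, fp=32):
    """Parallel 1x1 / 3x3 / 5x5 / pooling towers; 1x1 convs reduce dimensions."""
    t1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)
    t2 = layers.Conv2D(f3r, 1, padding="same", activation="relu")(x)   # reduction
    t2 = layers.Conv2D(f3, 3, padding="same", activation="relu")(t2)
    t3 = layers.Conv2D(f5r, 1, padding="same", activation="relu")(x)   # reduction
    t3 = layers.Conv2D(f5, 5, padding="same", activation="relu")(t3)
    t4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    t4 = layers.Conv2D(fp, 1, padding="same", activation="relu")(t4)
    return layers.Concatenate()([t1, t2, t3, t4])  # channel-wise concatenation
```

With the defaults above, the output carries 64 + 128 + 32 + 32 = 256 channels regardless of the input depth.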
D. VGG16
The VGG network was introduced by researchers at the Visual Geometry Group in Oxford. This network is best known for its pyramid-like shape, in which the layers closer to the image are wider and the deeper layers are narrower. The model achieves 92.7% top-5 test accuracy on ImageNet, a dataset of over 14 million images belonging to 1000 classes. The VGG16 architecture is shown in Figure 9.
Figure 9 VGG16 architecture
The input to the conv1 layer is an RGB image of fixed size l1×l2. The image is passed through a stack of convolutional (conv.) layers, where the filters have a very small receptive field: 3×3 (the smallest size that captures the notion of left/right, up/down, and center). One of the configurations also utilizes 1×1 convolution filters, which can be seen as a linear transformation of the input channels. The convolution stride is fixed to 1 pixel, and the spatial padding of the conv. layer input is chosen so that the spatial resolution is preserved after convolution, i.e., the padding is 1 pixel for 3×3 conv. layers. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers (not all conv. layers are followed by max-pooling). Max-pooling is performed over a 2×2 pixel window with stride 2.
Three Fully-Connected (FC) layers follow a stack of convolutional layers (which has a different depth in different architectures): the first two have 4096 channels each, the third performs 1000-way ILSVRC classification and thus contains 1000 channels (one for each class). The final layer is the soft-max layer. The configuration of the fully connected layers is the same in all networks.
All hidden layers are equipped with the rectification (ReLU) non-linearity. It is also noted that none of the networks (except for one) contain Local Response Normalization (LRN); such normalization does not improve performance on the ILSVRC dataset but leads to increased memory consumption and computation time.
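The design described above (3×3 'same'-padded convolutions with stride 1, 2×2 max-pooling with stride 2, and 4096-unit FC layers) can be sketched as a shortened VGG-style model. The number of blocks is reduced here for brevity, so this is not the full VGG16:

```python
import tensorflow as tf
from tensorflow.keras import layers

def vgg_style(input_shape=(224, 224, 3), n_classes=1000):
    inp = layers.Input(input_shape)
    x = inp
    # Stacks of 3x3 'same'-padded convs followed by 2x2 max-pooling.
    for filters in (64, 128, 256):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2, strides=2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(4096, activation="relu")(x)
    x = layers.Dense(4096, activation="relu")(x)
    out = layers.Dense(n_classes, activation="softmax")(x)  # final soft-max layer
    return tf.keras.Model(inp, out)
```

For skin lesion classification, `n_classes` would be set to 9 instead of the 1000 ILSVRC classes.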
E. Xception
Xception, by Google, stands for Extreme version of Inception. With a modified depthwise separable convolution, it is even better than Inception-v3 on both the ImageNet ILSVRC and JFT datasets. The Xception architecture is shown in Figure 10.
Figure 10 The Xception architecture: the data first goes through the entry flow, then through the middle flow which is repeated eight times, and finally through the exit flow. Note that all Convolution and SeparableConvolution layers are followed by batch normalization  (not included in the diagram). All SeparableConvolution layers use a depth multiplier of 1 (no depth expansion). 
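The SeparableConvolution-plus-batch-normalization unit that Figure 10 repeats can be sketched as follows (a minimal illustration of a single Xception building block, not the full entry/middle/exit flows):

```python
import tensorflow as tf
from tensorflow.keras import layers

def separable_unit(x, filters):
    """Depthwise separable convolution with depth multiplier 1,
    followed by batch normalization, as in the Xception blocks."""
    y = layers.SeparableConv2D(filters, 3, padding="same", depth_multiplier=1)(x)
    y = layers.BatchNormalization()(y)
    return layers.ReLU()(y)
```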
F. Fusion Method
At first we used the single model shown in Figure 10, but its performance was not good enough, so we propose a fusion method that combines the five models above, as illustrated in Figure 11.
Figure 11 Classification method based on a single CNN
In the training phase, we first train five different models on the training images: Densenet as Model I, Resnet as Model II, InceptionResNetV2 as Model III, VGG16 as Model IV, and Xception as Model V. Then, in the test phase, we use voting between the outputs of the models.
Figure 12 Classification method based on fusion of models
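The weighted vote over the five models' softmax outputs can be sketched in NumPy. Using each model's validation accuracy as its weight is an assumption here, since the text does not specify how the weights are chosen:

```python
import numpy as np

def weighted_vote(probs, weights):
    """probs: list of (n_samples, n_classes) softmax outputs, one per model.
    weights: one weight per model (e.g. its validation accuracy)."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()            # normalize to sum to 1
    fused = sum(w * p for w, p in zip(weights, probs))
    return fused.argmax(axis=1)                  # final class per sample
```

For this project, `probs` would hold the five matrices produced by Models I through V.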
Here, the results of the five mentioned models are combined in a dense layer. In Task 1, the metadata weights are not included. In Task 2, the metadata weights are added to the dense layer, improving the accuracy of classifying the dermoscopic images into nine classes (Figure 13).
Figure 13 Classification method based on dual fusion of models and metadata
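Feeding the metadata into the final dense layer can be sketched as a two-input Keras model. The feature and metadata dimensions below are hypothetical, since the text does not give them:

```python
import tensorflow as tf
from tensorflow.keras import layers

img_feat = layers.Input((2048,), name="fused_cnn_features")  # hypothetical size
meta = layers.Input((10,), name="metadata")  # e.g. encoded age, sex, site (assumed)
x = layers.Concatenate()([img_feat, meta])   # metadata joins the dense-layer input
x = layers.Dense(256, activation="relu")(x)
out = layers.Dense(9, activation="softmax")(x)  # nine lesion classes
model = tf.keras.Model([img_feat, meta], out)
```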
IV. Experimental Results
To evaluate the performance of the proposed algorithm, we calculated four assessment parameters.
We split the data into 70% training, 15% validation, and 15% test sets. Figure 14 displays the accuracy on the training and validation sets. The diagram illustrates that classification accuracy, initially around 95%, increases further as the number of iterations grows. The test prediction results are reported in Table 1. It is worth noting that, since the number of samples in the classes is not the same, the accuracy criterion alone is not appropriate for expressing the performance of the proposed model, and therefore other criteria should also be used.
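The additional criteria suited to imbalanced classes (per-class precision, recall, and F1 alongside overall accuracy) can all be derived from a confusion matrix. A minimal NumPy sketch:

```python
import numpy as np

def per_class_metrics(y_true, y_pred, n_classes=9):
    """Accuracy plus per-class precision, recall, and F1 from a confusion matrix."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    tp = np.diag(cm).astype(float)
    precision = tp / np.maximum(cm.sum(axis=0), 1)   # TP / predicted positives
    recall = tp / np.maximum(cm.sum(axis=1), 1)      # TP / actual positives
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    accuracy = tp.sum() / cm.sum()
    return accuracy, precision, recall, f1
```

Per-class recall (sensitivity) is particularly informative here, since a rare class can be missed entirely while overall accuracy stays high.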
Figure 14 Accuracy of the five models
Table 1 Comparison of evaluation parameters between the individual models and the dual fusion model with metadata
We are thankful to the organizers of the MICCAI 2019 ISIC challenge for providing the skin images.
 C. Garbe et al., “Diagnosis and treatment of melanoma. European consensus-based interdisciplinary guideline–Update 2016,” Eur. J. Cancer, vol. 63, pp. 201–217, 2016.
 M. Combalia et al., “BCN20000: Dermoscopic Lesions in the Wild,” arXiv Prepr. arXiv1908.02288, 2019.
 G. Huang, S. Liu, L. Van der Maaten, and K. Q. Weinberger, “Condensenet: An efficient densenet using learned group convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2752–2761.
 Z. Wu, C. Shen, and A. Van Den Hengel, “Wider or deeper: Revisiting the resnet model for visual recognition,” Pattern Recognit., vol. 90, pp. 119–133, 2019.
 S. Targ, D. Almeida, and K. Lyman, “Resnet in resnet: Generalizing residual architectures,” arXiv Prepr. arXiv1603.08029, 2016.
 X. Zhang, Z. Li, C. Change Loy, and D. Lin, “Polynet: A pursuit of structural diversity in very deep networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 718–726.
 K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
 R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts.,” Neural Comput., vol. 3, no. 1, pp. 79–87, 1991.
 C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
 F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1251–1258.
 R. M. Kamble et al., “Automated diabetic macular edema (DME) analysis using fine tuning with Inception-Resnet-v2 on OCT images,” in 2018 IEEE-EMBS Conference on Biomedical Engineering and Sciences (IECBES), 2018, pp. 442–446.
 M. Habibzadeh, M. Jannesari, Z. Rezaei, H. Baharvand, and M. Totonchi, “Automatic white blood cell classification using pre-trained deep learning models: ResNet and Inception,” in Tenth International Conference on Machine Vision (ICMV 2017), 2018, vol. 10696, p. 1069612.
 K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv Prepr. arXiv1409.1556, 2014.
 H. Qassim, A. Verma, and D. Feinzimer, “Compressed residual-VGG16 CNN model for big data places image recognition,” in 2018 IEEE 8th Annual Computing and Communication Workshop and Conference (CCWC), 2018, pp. 169–175.
 E. Rezende, G. Ruppert, T. Carvalho, A. Theophilo, F. Ramos, and P. de Geus, “Malicious software classification using VGG16 deep neural network’s bottleneck features,” in Information Technology-New Generations, Springer, 2018, pp. 51–59.
 N. Aloysius and M. Geetha, “A review on deep convolutional neural networks,” in 2017 International Conference on Communication and Signal Processing (ICCSP), 2017, pp. 588–592.