Cassava Leaf Disease Classification

Authors: Bangxi Xiao, Zuxuan Huai, Daxin Niu

Cassava, one of the largest carbohydrate providers in Africa, is essential to African society. Because it can survive harsh conditions, the crop is widely grown by smallholder farmers. One of the major causes of poor yield is cassava leaf disease. Farmers currently deal with this by hiring experts to inspect the plants, which is costly and inefficient. To address this problem, we aim to use machine learning methods to make the identification process more efficient.

This project is hosted on Kaggle, and the data is provided by the Makerere University AI Lab. The link to this Kaggle challenge is attached below:

In the first part of the project, we performed exploratory data analysis on the given 21,367 labeled images. We built a basic VGG16 model with a few fully-connected layers and three dropout regularization layers as our baseline model and reached an accuracy of 0.6312. For the details of part 1 of our project, please refer to the link below.

Although this model performs slightly better than labeling every image as the majority class, the accuracy is not high enough to provide an informative and reliable diagnosis for farmers who need easy-to-access and timely results. This week, we worked on developing various convolutional neural networks and adjusting data storage formats to optimize our pipeline and achieve better results.

More Efficient Data Storage — TFRecord

After loading our data, we noticed that it took too much memory and slowed down training, so we decided to convert the data into the TFRecord format, which works well with TensorFlow.

The TFRecord format is a simple format that stores data as a sequence of binary records, which makes it efficient for large datasets, especially image datasets. TensorFlow uses Protocol Buffers internally to serialize and deserialize the data and stores it as bytes, which keeps storage compact when holding a large amount of data. We used the code below to first decode the images, then parse the files, and finally transform them into the TFRecord format.

TFRecord Sample Code
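As a rough illustration of the steps described above (not the exact code we ran), the sketch below writes JPEG bytes and integer labels into a TFRecord file and parses them back. The feature keys "image" and "label" and the helper names are assumptions made for this example.

```python
import tensorflow as tf


def _bytes_feature(value):
    # Wrap raw bytes (e.g. an encoded JPEG) in a tf.train.Feature.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


def _int64_feature(value):
    # Wrap an integer class label in a tf.train.Feature.
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


def serialize_example(image_bytes, label):
    # Build one tf.train.Example and serialize it with Protocol Buffers.
    feature = {
        "image": _bytes_feature(image_bytes),  # hypothetical key name
        "label": _int64_feature(label),        # hypothetical key name
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()


def write_tfrecord(path, samples):
    # samples: iterable of (encoded_jpeg_bytes, int_label) pairs.
    with tf.io.TFRecordWriter(path) as writer:
        for image_bytes, label in samples:
            writer.write(serialize_example(image_bytes, label))


# Schema used when reading the records back.
FEATURES = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}


def parse_example(serialized):
    # Parse one serialized record and decode the JPEG into a uint8 tensor.
    parsed = tf.io.parse_single_example(serialized, FEATURES)
    image = tf.io.decode_jpeg(parsed["image"], channels=3)
    return image, parsed["label"]
```

A file written this way can be streamed during training with `tf.data.TFRecordDataset(path).map(parse_example)`, so the decoded images never have to sit in memory all at once.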

Image Augmentation

Image Augmentation Concept Illustration from Analytics Vidhya

Image augmentation is another important technique for enriching the training dataset. In this step, we create more images through a combination of flips, rotations, and zooms. This is useful because, in our original dataset, some images are photos of the entire plant whereas others are photos of individual leaves. We use the following code to perform image augmentation.

Image Augmentation Code
Image Augmentation Code 2
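For readers who want something concrete, here is a minimal sketch of a flip, rotation, and zoom pipeline built from `tf.image` operations. It is an illustration under assumptions rather than our exact augmentation code: the function name `augment`, the 90-degree rotation steps, and the `zoom_fraction` value are all chosen for this example.

```python
import tensorflow as tf


def augment(image, out_size=400, zoom_fraction=0.8):
    # Random horizontal and vertical flips.
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)

    # Random rotation by 0, 90, 180, or 270 degrees.
    k = tf.random.uniform([], minval=0, maxval=4, dtype=tf.int32)
    image = tf.image.rot90(image, k=k)

    # "Zoom in" by taking a random crop that covers zoom_fraction of each
    # side, then resizing the crop to the model's input size.
    shape = tf.shape(image)
    crop_h = tf.cast(tf.cast(shape[0], tf.float32) * zoom_fraction, tf.int32)
    crop_w = tf.cast(tf.cast(shape[1], tf.float32) * zoom_fraction, tf.int32)
    image = tf.image.random_crop(image, size=tf.stack([crop_h, crop_w, 3]))
    return tf.image.resize(image, [out_size, out_size])
```

Applied inside a `tf.data` pipeline, for example `dataset.map(lambda img, label: (augment(img), label))`, each epoch sees a slightly different version of every image.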

Transfer Learning & Fine-tuning

Instead of building a CNN model from scratch, a widely used technique in deep learning is to take a model pre-trained on ImageNet data and customize it for our own dataset. Reusing a well-trained model in this way is called transfer learning, and tailoring it to our own dataset is called fine-tuning. The table below lists some of the models available in the Keras library.

Models available in keras.applications

In this project, we have so far trained two ResNet152V2 models, one Inception V3 model, one Xception model, and one EfficientNet-B3 model. We added various layers and image augmentation to each model, tuned hyperparameters such as learning rate and batch size, and tried various input image sizes for the different models.
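The general transfer learning and fine-tuning recipe can be sketched as follows. This is a simplified example, not one of the models reported in this post: the helper names, the small head (global average pooling, dropout, softmax), the dropout rate, and `num_classes=5` are assumptions made for illustration.

```python
import tensorflow as tf


def build_transfer_model(num_classes=5, image_size=400, weights="imagenet"):
    # Pre-trained convolutional base without its ImageNet classifier.
    base = tf.keras.applications.InceptionV3(
        include_top=False,
        weights=weights,
        input_shape=(image_size, image_size, 3),
    )
    base.trainable = False  # transfer learning: keep the base frozen

    inputs = tf.keras.Input(shape=(image_size, image_size, 3))
    # Inception V3 expects pixel values scaled from [0, 255] to [-1, 1].
    x = tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1.0)(inputs)
    x = base(x, training=False)
    # Hypothetical head: pool the feature map, regularize, classify.
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    x = tf.keras.layers.Dropout(0.3)(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-3),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model, base


def unfreeze_for_fine_tuning(model, base, learning_rate=1e-4):
    # Fine-tuning: unfreeze the base and recompile with a smaller
    # learning rate so the pre-trained weights change slowly.
    base.trainable = True
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

A typical usage would be to call `build_transfer_model()`, train the head for a few epochs, and then call `unfreeze_for_fine_tuning(model, base)` before continuing training.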


ResNet152V2

ResNet is well known for its strong performance in image recognition, so we decided to use ResNet152V2. We picked ResNet152V2 specifically because we wanted to try the deepest available ResNet and see how it performs on this task.

We resized the images to 400x400 and trained two ResNet152V2 models on them. Our first ResNet152V2 model looks like the following:

ResNet152V2 Implementation

The model was trained for 80 epochs with a batch size of 64 and a learning rate of 0.001. It performed quite well, reaching around 77 percent accuracy by the end of training. The following screenshot shows the last few training epochs.

ResNet152V2 Training Results

We also plotted the training and validation loss and accuracy over the epochs:

ResNet152V2 Model Performance Plot

We can see that the model reaches a fairly stable state at around 40 epochs, and the accuracy does not increase much after that.

Because a deeper model does not always guarantee better accuracy, we decided to train another ResNet152V2 model with fewer layers. The second version of the model looks like the following:

ResNet152V2 Implementation

We trained this model with the same batch size and learning rate, and increased the number of epochs in the hope of improving accuracy. Unfortunately, our assumption did not hold: fewer layers did not give us better accuracy. The overall accuracy was about 3 percent lower than that of the original model. The last few training epochs look like the following:

ResNet152V2 Training Results

The loss and accuracy plot for this second model looks like the following:

ResNet152V2 Model Performance Plot

From the plot, we can see that the loss and accuracy do not change much after the first 20 epochs, so training for more epochs gave us little improvement. Cutting layers from the previous model, on the other hand, did reduce our accuracy. We might therefore need more layers for this problem if we want to achieve higher accuracy.

The two ResNet models we trained have some drawbacks. Although ResNet gave us fairly high accuracy, the network is very large and has a massive number of parameters. Training a model like this requires a lot of RAM and computational power, and it becomes very time-consuming if we want to make the network deeper. Therefore, this might not be the most suitable model for this project.

Another problem with this model is that it seems to be difficult for the model to reach a higher accuracy than 77%. We still need to do more fine-tuning on the model but it’s been quite difficult to get the validation accuracy up.

Inception V3

From the table above, Inception V3 seems to have decent accuracy and a relatively small size, so we decided to build a model with it. The input images are 400 by 400 because we want to keep as much information as possible from the original image, and the base model is very deep and has a great capacity to process information. On top of the base model, we added one average pooling layer and two max-pooling layers to extract useful information. We also added one dropout layer and one batch normalization layer to reduce variance and prevent overfitting.

Because the dataset is very large and takes a long time to train, we initially used a batch size of 64 and 0.001 as the learning rate. We trained the model for 20 epochs. The screenshots below show the performance of the model in the first 20 epochs.

InceptionV3 Training Result Part 1
InceptionV3 Training Result Part 2

Although the training loss drops quickly after a few epochs, the training and validation accuracy stagnated at around 0.72–0.74. The graph below shows the accuracy in the first 20 epochs.

InceptionV3 Model Performance Plot

Since we have limited knowledge about how to properly construct the added layers and how much modifying a single layer would improve the model’s performance, we decided to change the learning rate instead. We trained the same model again for 20 epochs, this time with a learning rate of 0.0001.

The screenshots below show the training results for the Inception model with a smaller learning rate in the first 20 epochs. Although the performance was not very good at the beginning, it quickly improved and even had a validation accuracy as high as 78%.

InceptionV3 Training Result Part 1
InceptionV3 Training Result Part 2

The result is very promising, so we trained this model for another 20 epochs. The model stabilized at around 0.77–0.79 validation accuracy. The model's performance is shown below.

InceptionV3 Training Result Part 1


Xception

In addition to Inception, we also used the Xception model. Xception is an extension of the Inception architecture: it replaces the standard Inception modules with depthwise separable convolutions. We started out with a similar architecture.
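To see why depthwise separable convolutions are attractive, it helps to compare weight counts. The small sketch below counts the weights (ignoring biases) of a standard convolution and of its depthwise separable counterpart; the kernel size and channel numbers are arbitrary values chosen for illustration, not taken from Xception itself.

```python
def standard_conv_params(kernel_size, in_channels, out_channels):
    # One k x k filter per (input channel, output channel) pair.
    return kernel_size * kernel_size * in_channels * out_channels


def separable_conv_params(kernel_size, in_channels, out_channels):
    # Depthwise step: one k x k filter per input channel.
    depthwise = kernel_size * kernel_size * in_channels
    # Pointwise step: a 1 x 1 convolution that mixes the channels.
    pointwise = in_channels * out_channels
    return depthwise + pointwise


# Illustrative layer: 3 x 3 kernel, 128 input channels, 256 output channels.
standard = standard_conv_params(3, 128, 256)
separable = separable_conv_params(3, 128, 256)
print(standard, separable)
```

For this illustrative layer the separable version needs roughly one ninth of the weights, which is the kind of saving that lets Xception stay relatively compact.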

After investigating the effect of the learning rate on the Inception model, we investigated the effect of changing batch sizes in the Xception model. We removed one max pooling layer and added a dropout layer. We started with batch size 64 and a learning rate of 0.0002 for 20 epochs. The size of the image input is 400 by 400.

Xception Model Performance Plot

We found the training accuracy was stable around 0.72 and validation accuracy was around 0.74. We decided to lower the batch size and run 20 epochs again. The accuracy plot below shows the model’s performance.

Xception Model Performance Plot

We can see an upward trend in the training accuracy; the validation accuracy, however, is a little volatile.

We adjusted the input image size to 224 by 224 and added more variations in image augmentation including width shift, height shift, feature-wise standard normalization, and zoom. We trained the model for 20 epochs with the same batch size 64 and learning rate 0.0001. The performance is shown below.

Xception Model Performance Plot

To our surprise, after these adjustments, both the training and validation accuracy dropped compared to the previous models. The lesson we learned is that convolutional neural networks are so flexible that it is hard to pinpoint which change will improve performance: we can modify the model in many ways, but the result is hard to predict. Another challenge is that the dataset is very large, and it normally takes hours to run enough epochs to achieve a good score, so tuning the model is both computationally heavy and time-consuming.

EfficientNet B-3

EfficientNet was first introduced by Mingxing Tan and Quoc V. Le to address the problem of scaling up ConvNets. They investigated the question “is there a principled method to scale up ConvNets that can achieve better accuracy and efficiency?” Their empirical study shows that it is critical to balance all dimensions of the network (width, depth, and resolution) and that, surprisingly, such balance can be achieved by simply scaling each of them with a constant ratio.

Scaling in CNNs

Based on the above, the authors proposed a simple but effective approach called “the compound scaling method”. This method uniformly scales network width, depth, and resolution with a set of fixed scaling coefficients, unlike the traditional practice of scaling these factors arbitrarily. The model performance is highly satisfying:

As we can see in the above figure, when the number of parameters is below 100 million, EfficientNet outperforms the other models.
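The compound scaling rule can be written down in a few lines. In the sketch below, a single coefficient phi controls all three dimensions: depth grows as alpha^phi, width as beta^phi, and resolution as gamma^phi. The constants 1.2, 1.1, and 1.15 are the values reported in the EfficientNet paper for the B0 baseline; the function names are ours, and the released B1 to B7 models do not necessarily follow these multipliers exactly.

```python
# Base coefficients reported in the EfficientNet paper (found by grid search
# on the B0 baseline under the constraint alpha * beta^2 * gamma^2 ~ 2).
ALPHA = 1.2   # depth
BETA = 1.1    # width
GAMMA = 1.15  # resolution


def compound_scale(phi):
    # Multipliers applied to the baseline network for a given phi.
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }


def flops_multiplier(phi):
    # FLOPS grow roughly with depth * width^2 * resolution^2,
    # so each unit increase in phi costs about twice the compute.
    return (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi


for phi in range(4):
    print(phi, compound_scale(phi), round(flops_multiplier(phi), 2))
```

Because all three multipliers come from the same phi, a larger model is deeper, wider, and sees higher-resolution inputs at the same time, instead of growing along only one dimension.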

We chose an input resolution of 380 x 380 for EfficientNet, so during model construction each input of shape (380 x 380 x 3) is a random crop of an image of shape (600 x 800 x 3). As for the head of the model, the core model already has a fairly complete structure, so we did not add much to it in order to avoid overfitting.

Efficient Net B-3: Model Head Structure

After the activation layer in the base model, we added a global pooling layer that flattens the 4-D feature tensor into a 2-D one. The result then passes through a dense layer, a batch normalization layer, a dropout layer, and a softmax output layer, which gives us the class probabilities.
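A head with this layer order can be sketched on its own, taking the base model's feature map as input. Everything numeric here is an assumption for illustration: the feature map shape (12, 12, 1536), the use of average (rather than max) global pooling, the 256 hidden units, the dropout rate, and the five output classes.

```python
import tensorflow as tf


def build_head(feature_shape=(12, 12, 1536), num_classes=5,
               hidden_units=256, dropout_rate=0.3):
    # Input: the 4-D feature map (batch, height, width, channels)
    # produced by the convolutional base.
    features = tf.keras.Input(shape=feature_shape)
    # Global pooling collapses height and width: (batch, channels).
    x = tf.keras.layers.GlobalAveragePooling2D()(features)
    x = tf.keras.layers.Dense(hidden_units, activation="relu")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Dropout(dropout_rate)(x)
    # Softmax turns the final scores into class probabilities.
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(features, outputs)
```

Keeping the head this small means most of the model's capacity stays in the pre-trained base, which matches the goal of not adding much on top of it.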

Training the model was hard on both the computer and us. We incorporated the “Early Stopping” and “Reduce Learning Rate on Validation Loss” mechanisms to shorten the training process and avoid overfitting.

Training Schedules
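In Keras, these two mechanisms are available as built-in callbacks. The sketch below shows one plausible configuration; the patience values, the reduction factor, and the minimum learning rate are assumptions for illustration, not the settings we actually used.

```python
import tensorflow as tf


def make_callbacks():
    # Stop training when the validation loss has not improved for
    # 5 epochs, and restore the best weights seen so far.
    early_stopping = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss",
        patience=5,
        restore_best_weights=True,
    )
    # Halve the learning rate when the validation loss has not
    # improved for 2 epochs, down to a floor of 1e-6.
    reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
        monitor="val_loss",
        factor=0.5,
        patience=2,
        min_lr=1e-6,
    )
    return [early_stopping, reduce_lr]
```

The list is passed to training as `model.fit(..., callbacks=make_callbacks())`. The learning-rate patience is set lower than the early-stopping patience so that the model gets a chance to recover at a smaller learning rate before training stops.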

Also, to limit the computational cost, we applied transfer learning and froze the deep layers during the first 20 training epochs. The validation accuracy in the first 20 epochs rises to 0.8:

X: Epoch, Y: Accuracy

The loss also shows a satisfying trend, decreasing steadily as training progresses:

X: Epoch, Y: Loss (Cross-entropy loss with label smoothing)
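Label smoothing, mentioned in the caption above, softens the one-hot targets before the cross-entropy is computed. The pure-Python sketch below shows the idea; the smoothing factor 0.1 and the example probabilities are assumptions for illustration. In Keras the same effect is available through the `label_smoothing` argument of `tf.keras.losses.CategoricalCrossentropy`.

```python
import math


def smooth_labels(one_hot, epsilon=0.1):
    # Move epsilon of the probability mass from the true class
    # and spread it evenly over all classes.
    num_classes = len(one_hot)
    return [y * (1.0 - epsilon) + epsilon / num_classes for y in one_hot]


def cross_entropy(target, probs):
    # Standard cross-entropy between a target distribution and predictions.
    return -sum(t * math.log(p) for t, p in zip(target, probs))


one_hot = [0.0, 0.0, 1.0, 0.0, 0.0]            # true class is index 2
predicted = [0.05, 0.05, 0.80, 0.05, 0.05]     # hypothetical model output

smoothed = smooth_labels(one_hot)
print(smoothed)
print(cross_entropy(one_hot, predicted), cross_entropy(smoothed, predicted))
```

With smoothed targets the model is no longer rewarded for pushing the true-class probability all the way to 1, which tends to reduce overconfident predictions.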

Kaggle Submission

Based on the models we trained, we submitted the EfficientNet model to Kaggle, as it achieved an accuracy of 0.79 on our validation set. Our score on Kaggle is 0.76. Although this is a large improvement over the baseline model, we still have a long way to go compared to the top models on the leaderboard.

Kaggle Submission

Please note that we trained our models in Google Colab notebooks; the Kaggle submission notebook loads a pre-trained model. Although our accuracy has improved by more than 10 percentage points over the baseline model, we still have much room to improve.

Challenges & Next Steps

At the current stage, the biggest challenge we are facing is accuracy. We have been tuning everything from hyperparameters to model structures, yet the increase in accuracy has not been significant. Our accuracy usually stays between 0.75 and 0.80, and it often reaches this range within the first 20 epochs. One possible reason is the strong performance of the base model: the layers we add after the pre-trained model might not be as effective as we thought, and the “learning” in these layers might not work as well as expected.

For the next step, we will do more fine-tuning on our existing models. We will also explore other models from different sources. Some other model may be better suited to our task, and we are looking forward to finding the “best” one for this project.