Cassava Leaf Disease Classification Project Overview

Authors: Bangxi Xiao, Zuxuan Huai, Daxin Niu


Cassava, one of the largest sources of carbohydrates in Africa, is essential to African society. Because it can survive harsh conditions, the crop is widely grown on smallholder farms. One of the major causes of poor yield is cassava leaf disease. Farmers currently deal with it by hiring experts to inspect the plants, which is costly and inefficient. To address this problem, we aim to use machine learning to make the identification process more efficient.

This project is hosted on Kaggle, and the data is provided by the Makerere University AI Lab. The link to this Kaggle challenge is attached below:

Tasks and Goals

We are supplied with over 21,000 cassava leaf images, separated into five categories. Our first task is to train a baseline model to classify the given images. We will observe how the model performs and make adjustments along the way. Our final goal is a model that identifies the images accurately enough to potentially help farmers in Africa. Achieving this requires building a convolutional neural network that beats our baseline model with a significant increase in accuracy on the testing set. We will try different hyperparameters and different models, aiming for a high test accuracy.

Exploratory Data Analysis

Before constructing the model, we decided to perform exploratory data analysis on our dataset so that we can get a better understanding of our inputs.

We imported the files and created a data frame holding all the training data. All training samples are labeled from 0 to 4, each label indicating one category for our classification task: 0 is Cassava Bacterial Blight (CBB), 1 is Cassava Brown Streak Disease (CBSD), 2 is Cassava Green Mottle (CGM), 3 is Cassava Mosaic Disease (CMD), and 4 means the cassava leaf is healthy.

After checking the categories, we built a pie chart to see how the images are distributed over the five categories. This gives us a better idea of what potential bias the training set might have, so that we can try to reduce it when constructing the model.

Target Variable Breakdown

From the above pie chart generated using the training data, we noticed that more than half of the training samples have Cassava Mosaic Disease (CMD). If this imbalance is ignored, the model might become more likely to predict CMD for any given image. The other four categories are fairly evenly distributed among themselves, so they need less attention in this respect.
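
The class shares behind such a chart can be computed directly from the label column. The following is a minimal sketch, not the code from our notebook: the `labels` list is a small made-up sample, and the inverse-frequency class weights are one common way to counteract imbalance, shown here as an assumption rather than as the method used in this project.

```python
from collections import Counter

# Label ids as described above; the short names come from the dataset description.
CLASS_NAMES = {0: "CBB", 1: "CBSD", 2: "CGM", 3: "CMD", 4: "Healthy"}

def class_shares(labels):
    """Return the fraction of samples belonging to each class id."""
    counts = Counter(labels)
    total = len(labels)
    return {k: counts.get(k, 0) / total for k in CLASS_NAMES}

def inverse_frequency_weights(labels):
    """Weight each class by total / (n_classes * count), so rare classes weigh more."""
    counts = Counter(labels)
    total = len(labels)
    return {k: total / (len(counts) * c) for k, c in counts.items()}

# Hypothetical toy labels (not the real dataset): class 3 dominates.
labels = [3] * 6 + [0, 1, 2, 4]
shares = class_shares(labels)
weights = inverse_frequency_weights(labels)
for k in sorted(shares):
    print(f"{CLASS_NAMES[k]:8s} share={shares[k]:.2f} weight={weights[k]:.2f}")
```

With weights like these, a mistake on a rare class costs more during training than a mistake on the dominant class.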

With the category information sorted out, we took a closer look at each category and at the input image size. We displayed some images from each category; samples from one category are shown below.

CMD Sample Images Visualization

We also checked the size of the input images. This is important because many existing convolutional models require all inputs to have the same dimensions. If the input shapes all differed, our choice of models could be limited. Luckily, the given training samples all have the same dimensions of 800 by 600, which opens up a wide range of model choices.

The last thing we did for the exploratory data analysis was to analyze the image inputs’ basic stats. We plotted the RGB channels of the images to help understand the density of each channel.

RGB Density Plot

Other than the basic color density analysis, we also included some extra image analysis on luminance, kurtosis, median, and contrast. The following graph illustrates our analysis on that part.

Luminance, Median, Contrast, and Kurtosis Plots
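
As a rough illustration of how such statistics can be derived for a single image, consider the sketch below. The definitions are assumptions made for the example, not necessarily the ones behind our plots: luminance is a weighted sum of the RGB channels (Rec. 601 weights), contrast is the standard deviation of the luminance, and kurtosis is the excess kurtosis of the luminance values. The image is random noise standing in for a real leaf photo.

```python
import numpy as np

def image_stats(img):
    """Compute simple statistics for an RGB array of shape (H, W, 3) in [0, 255]."""
    img = img.astype(np.float64)
    # Luminance as a weighted sum of R, G, B (Rec. 601 weights; an assumption).
    lum = 0.299 * img[..., 0] + 0.587 * img[..., 1] + 0.114 * img[..., 2]
    mean, std = lum.mean(), lum.std()
    # Excess kurtosis of the luminance values (0 for a normal distribution).
    kurt = ((lum - mean) ** 4).mean() / std ** 4 - 3.0 if std > 0 else 0.0
    return {
        "channel_means": img.reshape(-1, 3).mean(axis=0),
        "luminance": mean,
        "median": float(np.median(lum)),
        "contrast": std,  # RMS contrast: standard deviation of the luminance
        "kurtosis": kurt,
    }

# Synthetic stand-in for a leaf photo (random pixels, not real data).
rng = np.random.default_rng(0)
fake_image = rng.integers(0, 256, size=(600, 800, 3), dtype=np.uint8)
stats = image_stats(fake_image)
print({k: np.round(v, 2) for k, v in stats.items()})
```

Collecting these values over a sample of images per class gives the distributions that the density plots summarize.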

With the above exploratory data analysis, we now have a better understanding of the given dataset. In the next section, we will discuss our choice of the Baseline model.

Baseline Model — VGG16

We have considered many potential baseline models and eventually decided to use VGG16 as our baseline model.

VGG16 is a deep convolutional neural network built from five “blocks”. The first two blocks each consist of two convolutional layers and a pooling layer. The remaining three blocks each contain three convolutional layers and a pooling layer. These five blocks form the convolutional part of VGG16; together with three fully connected layers they give the model its 16 weight layers. This architecture performs very well on image recognition tasks, achieving 92.7% top-5 test accuracy on ImageNet. We therefore believe it is a good starting point for our project.
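
The layer arithmetic behind the name can be checked in a few lines. This is only a worked count based on the block description above, plus the three fully connected layers of the original ImageNet classifier; the 224 by 224 input size is the standard ImageNet setting, used here purely as an example.

```python
# Number of 3x3 convolutional layers in each of the five VGG16 blocks,
# as described above; every block ends with one max-pooling layer.
CONV_LAYERS_PER_BLOCK = [2, 2, 3, 3, 3]
FULLY_CONNECTED_LAYERS = 3  # the original ImageNet classifier head

conv_layers = sum(CONV_LAYERS_PER_BLOCK)
weight_layers = conv_layers + FULLY_CONNECTED_LAYERS
pooling_layers = len(CONV_LAYERS_PER_BLOCK)

def feature_map_side(input_side):
    """Each 2x2 pooling halves height and width, so five blocks divide a side by 2**5."""
    return input_side // 2 ** pooling_layers

print(conv_layers, weight_layers, feature_map_side(224))  # 13 16 7
```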

For our model, we imported VGG16 as the baseline model and added three fully connected layers at the end. Below is the code cell for our baseline model.

We tested our baseline model locally by splitting the dataset into a training and a testing set. We also used Tensorboard to visualize our accuracy in training and testing. Our epoch accuracy graph is shown below.

VGG16 Accuracy Plot

The orange line indicates the training accuracy, and the blue line represents the testing accuracy. The training accuracy increased throughout training, but the testing accuracy decreased slightly at the very end. Overall, our baseline model achieved a testing accuracy of around 70%.

As a baseline model, the accuracy seems to be decent. Nonetheless, we will still work on improving our model so that it can achieve better performance.

A link to our Tensorboard is attached below. Please feel free to check it out and find out more about the model training process.

We have also uploaded our baseline model to Kaggle, where it received an accuracy of 0.6312. The link to our submission is attached below. We will keep working on it and hopefully increase the accuracy on the Kaggle testing set.

Model Fitting — ResNet50

To further improve the model performance, we tried various models including ResNet, Inception, MobileNet, and the VGG family. We found that ResNet performed best among them. ResNet was introduced by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun in 2015 to address the difficulty of training very deep networks. Its layers fit a residual mapping and add the input back through a skip connection, which reduces information loss during learning.
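
The idea of a residual mapping can be shown with a toy block. The sketch below is a simplification under stated assumptions: dense layers stand in for the convolutions of a real ResNet block, batch normalization is left out, and all weights are made up. It illustrates that when the learned part F(x) is zero, the block simply passes its input through.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Compute y = relu(F(x) + x), where F is a small two-layer transform."""
    fx = relu(x @ w1) @ w2   # the residual mapping F(x)
    return relu(fx + x)      # the skip connection adds the input back

rng = np.random.default_rng(0)
x = rng.random((4, 8))                    # batch of 4 non-negative feature vectors
w1 = rng.normal(scale=0.1, size=(8, 8))
w2 = np.zeros((8, 8))                     # F(x) = 0: the block reduces to the identity

y = residual_block(x, w1, w2)
print(np.allclose(y, x))  # True
```

Because the identity is the easy default, stacking more such blocks should not make the network worse by construction, which is what lets ResNets grow deep.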

The training details are as follows:
Input image: 336 x 336 (3 channels)
Train Images: 18721
Validation Images: 2676
Augment Process: (Random Flip, Random Rotation, Random Crop, Random Height, Random Width, Random Contrast, Random Zoom, Rescaling)
Weights: Initialized by “ImageNet”
Loss: Categorical Cross Entropy with label smoothing 0.05
Optimizer: Adam
Learning Rate: 1e-5
Learning Epochs: 30
Fine Tune Epochs: 20
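
To make the loss setting concrete, the sketch below shows what a smoothing factor of 0.05 does to a one-hot target with five classes. It follows the usual definition, in which the smoothing mass is spread uniformly over all classes; the predicted probabilities are made up for the example.

```python
import numpy as np

def smooth_labels(one_hot, eps):
    """Blend one-hot targets with a uniform distribution over the classes."""
    n_classes = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / n_classes

def categorical_cross_entropy(targets, probs):
    """Mean cross entropy between target distributions and predicted probabilities."""
    return float(-(targets * np.log(probs)).sum(axis=-1).mean())

y = np.eye(5)[[3]]                       # one sample of class 3 (CMD)
y_smooth = smooth_labels(y, eps=0.05)    # 0.96 on the true class, 0.01 elsewhere
p = np.array([[0.02, 0.02, 0.02, 0.92, 0.02]])  # hypothetical model output
print(y_smooth)
print(categorical_cross_entropy(y, p), categorical_cross_entropy(y_smooth, p))
```

The smoothed target penalizes extremely confident predictions, which tends to help when some training labels are noisy.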

Both the training and validation accuracy reach 0.85 or higher:

ResNet50: Training Accuracy

Also, we include the training loss:

ResNet50: Training Loss

The final score on Kaggle reaches 0.858, outperforming the 0.6312 of our VGG16 baseline.

Model Fitting — VGG19

The VGG19 model we trained reached a validation accuracy of 0.87. The VGG architecture makes deeper networks practical by using very small convolution filters (3 by 3), which allows the number of weight layers to increase to 16–19. During training, we set the batch size to 32 and the input image size after augmentation to 336 by 336 by 3. Training details are listed as follows:

Input image: 336 x 336 (3 channels)
Train Images: 18721
Validation Images: 2676
Augment Process: (Random Flip, Random Rotation, Random Crop, Random Height, Random Width, Random Contrast, Random Zoom, Rescaling)
Weights: Initialized by “ImageNet”
Loss: Categorical Cross Entropy with label smoothing 0.02
Optimizer: Adam
Learning Rate: 1e-5
Learning Epochs: 20
Fine Tune Epochs: 10
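
The benefit of the small 3 by 3 filters mentioned above can be verified with simple arithmetic: stacked 3 by 3 layers cover the same receptive field as one larger filter while using fewer weights. The channel width of 256 in this sketch is a hypothetical value chosen only for illustration, and biases are ignored.

```python
def stacked_receptive_field(n_layers, kernel=3):
    """Receptive field side of n stacked stride-1 convolutions with the same kernel."""
    return 1 + n_layers * (kernel - 1)

def conv_weights(kernel, channels, n_layers=1):
    """Weight count (biases ignored) for n conv layers with `channels` in and out."""
    return n_layers * kernel * kernel * channels * channels

C = 256  # hypothetical channel width, chosen only for illustration
for n in (2, 3):
    rf = stacked_receptive_field(n)
    print(f"{n} stacked 3x3 layers see {rf}x{rf}: "
          f"{conv_weights(3, C, n):,} weights vs "
          f"{conv_weights(rf, C):,} for one {rf}x{rf} layer")
```

The stacked version also applies a non-linearity after every layer, which a single large filter cannot do.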

The resulting learning curves:

Validation Accuracy of VGG19 model
Validation Loss of VGG19 model

However, it is worth mentioning that VGG19 has the longest per-epoch training time among all the models we developed. It is time-consuming to train, but its performance is satisfying.

Model Fitting — MobileNet

Lastly, we trained the MobileNetV3Large model, whose architecture was found through automated network architecture search (AutoML) rather than designed entirely by hand. It also incorporates the “squeeze-and-excitation” block into its core architecture, which makes it more robust. The training details are as follows:

Input image: 224 x 224 (3 channels)
Train Images: 18721
Validation Images: 2676
Augment Process: (Random Flip, Random Rotation, Random Crop, Random Height, Random Width, Random Contrast, Random Zoom, Rescaling)
Weights: Initialized by “ImageNet”
Loss: Categorical Cross Entropy with label smoothing 0.05
Optimizer: Adam
Learning Rate: 1e-5
Learning Epochs: 20
Fine Tune Epochs: 60
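
For readers unfamiliar with the “squeeze-and-excitation” block mentioned above, the following sketch shows its core computation on a single feature map. It is a simplified illustration under stated assumptions: biases are omitted, the weights are random, and the channel count and reduction ratio are hypothetical values. The hard-sigmoid gate follows the MobileNetV3 design.

```python
import numpy as np

def hard_sigmoid(x):
    """Piecewise-linear sigmoid approximation, relu6(x + 3) / 6."""
    return np.clip(x + 3.0, 0.0, 6.0) / 6.0

def squeeze_excite(features, w_reduce, w_expand):
    """Reweight the channels of a (H, W, C) feature map.

    Squeeze: global average pooling gives one value per channel.
    Excite: two small dense layers turn those values into gates in [0, 1].
    """
    squeezed = features.mean(axis=(0, 1))              # shape (C,)
    hidden = np.maximum(squeezed @ w_reduce, 0.0)      # shape (C // r,)
    gates = hard_sigmoid(hidden @ w_expand)            # shape (C,)
    return features * gates                            # broadcast over H and W

rng = np.random.default_rng(0)
C, r = 16, 4                                           # hypothetical channel count and reduction ratio
fmap = rng.random((7, 7, C))
out = squeeze_excite(fmap, rng.normal(size=(C, C // r)), rng.normal(size=(C // r, C)))
print(out.shape)
```

The gates let the network amplify informative channels and suppress less useful ones at very little extra cost.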

The validation loss as well as accuracy curves:

Validation Accuracy of MobileNetV3Large
Validation Loss of MobileNetV3Large

We found that even after 80 training epochs, the training and validation accuracy were still gradually climbing, suggesting that further training could yield better results. We stopped training at around 80 epochs, when the validation accuracy had reached 0.84.

Combined Models

To make the model more robust to various kinds of images, we developed an ensemble model consisting of 3 distinct deep learning architectures, each of which achieves a validation accuracy of 0.84 or higher.

Ensemble Learning with Attention Layer

The 3 chosen models are MobileNetV3Large (validation accuracy of 0.84), VGG19 (validation accuracy of 0.87) and ResNet50 (validation accuracy of 0.88). Given an image from the test set, we feed it to the 3 trained classifiers, which produce 3 predicted probability vectors. These vectors are then passed through a normalized attention layer whose weights are calculated by:

The “acc_k” in the above formula is the validation accuracy of the k-th sub-model (one of MobileNet, VGG, and ResNet). By incorporating this “voting” mechanism, the model is further strengthened.
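
One plausible reading of this weighting is that each model's weight is its validation accuracy divided by the sum of the three accuracies. The sketch below implements that reading as an assumption, since the exact formula is not reproduced in the text; the probability vectors are made up for a single test image.

```python
import numpy as np

def accuracy_weights(val_accuracies):
    """Normalize validation accuracies so that the weights sum to 1 (assumed form)."""
    acc = np.asarray(val_accuracies, dtype=np.float64)
    return acc / acc.sum()

def ensemble_predict(prob_vectors, val_accuracies):
    """Weighted average of per-model class probabilities, then arg-max."""
    weights = accuracy_weights(val_accuracies)
    combined = np.tensordot(weights, np.asarray(prob_vectors), axes=1)
    return combined, int(np.argmax(combined))

# Validation accuracies quoted above: MobileNetV3Large, VGG19, ResNet50.
accs = [0.84, 0.87, 0.88]
# Made-up softmax outputs for one image over the five classes.
probs = [
    [0.10, 0.05, 0.05, 0.70, 0.10],
    [0.05, 0.10, 0.40, 0.35, 0.10],
    [0.05, 0.05, 0.10, 0.75, 0.05],
]
combined, label = ensemble_predict(probs, accs)
print(np.round(combined, 3), label)
```

Because the three accuracies are close, these weights are nearly uniform, so under this assumption the ensemble behaves much like a plain average of the three models.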

However, when we looked more closely at the validation accuracy of each class, we noticed gaps between the categories. The majority class, CMD, which makes up about 60% of the training set, reaches a validation accuracy of 0.91, while the other, smaller classes only reach about 0.82 to 0.83.
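
A per-class breakdown of this kind can be computed from the true and predicted labels of the validation set. The sketch below uses made-up labels and is only meant to show the computation, which amounts to the recall of each class.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, n_classes=5):
    """Fraction of correctly classified samples within each true class (recall)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    result = {}
    for k in range(n_classes):
        mask = y_true == k
        result[k] = float((y_pred[mask] == k).mean()) if mask.any() else float("nan")
    return result

# Made-up labels and predictions, only to show the computation.
y_true = [3, 3, 3, 3, 0, 0, 1, 2, 4, 4]
y_pred = [3, 3, 3, 0, 0, 3, 1, 2, 4, 3]
print(per_class_accuracy(y_true, y_pred))
# {0: 0.5, 1: 1.0, 2: 1.0, 3: 0.75, 4: 0.5}
```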

The outcome of the Kaggle score is:

Compared with the last submission, the score increased by 2%, indicating that the ensemble method works.

Outlook & Reflection

The cassava leaf classification challenge gave us a better understanding of how deep learning can be applied to a real-world problem. We also learned about different methods that can help us achieve higher accuracy.

Data Processing:

Data processing is a very important part of the whole training process. Since we were only supplied with around 21 thousand images, we wanted to enlarge the training set so that our model could “learn” more. We therefore applied image augmentation, which effectively gives us a larger and more varied training set. This eventually increased the model accuracy by 5–7 percent.
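
To give a feel for what augmentation does, here is a minimal sketch of two of the transformations, random flip and random crop, written with plain NumPy. It is an illustration under assumptions, not our training pipeline: the image is synthetic and the crop size simply matches the 336 by 336 input mentioned earlier.

```python
import numpy as np

def random_flip_and_crop(image, crop_size, rng):
    """Return a randomly mirrored, randomly cropped copy of an (H, W, C) image."""
    if rng.random() < 0.5:
        image = image[:, ::-1]                       # horizontal flip
    h, w = image.shape[:2]
    ch, cw = crop_size
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    return image[top:top + ch, left:left + cw].copy()

rng = np.random.default_rng(42)
image = rng.integers(0, 256, size=(600, 800, 3), dtype=np.uint8)  # synthetic stand-in
variants = [random_flip_and_crop(image, (336, 336), rng) for _ in range(4)]
print([v.shape for v in variants])
```

Applied on the fly, such transformations show the model a slightly different version of every image in each epoch, without storing any extra files.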

Model selection:

This is another important decision we had to make for this Kaggle challenge. To get the best results, we tried out many different models to find the one that performs best. Overall, we tested ResNet152V2, ResNet50, Inception, Xception, VGG16, VGG19, and MobileNet. The three best models were ResNet50, VGG19 and MobileNet. The validation accuracy we got from these models varies from 76% to 87%. Testing all of the models allowed us to find the ones best suited to the task.


Fine-tuning:

This is one of the most important aspects of our training process. Fine-tuning puts the model on the right track and eventually leads to great results. The learning rate is one of the hyperparameters we tuned. Our original learning rate was so large that the algorithm failed to find the “minimum” for the dataset. After we lowered the learning rate, our model was able to converge and its accuracy increased. Learning rate tuning therefore plays an important role in fine-tuning. Another important aspect is the structure of the model. Originally, we assumed that a deeper model would give us better accuracy, but that is not the case. We compared ResNet152V2 with ResNet50 and found that ResNet50 actually gives us a higher accuracy, so we used ResNet50 as one of our final models. From this comparison, we realized that more layers do not always produce better results.
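
The effect of the learning rate can be reproduced on a toy problem. The sketch below runs plain gradient descent on f(w) = w², which is far simpler than training a network with Adam; the three learning rates are arbitrary values chosen only to show the three regimes.

```python
def gradient_descent(lr, steps=50, start=5.0):
    """Minimize f(w) = w**2 (gradient 2w) and return the final |w|."""
    w = start
    for _ in range(steps):
        w -= lr * 2.0 * w
    return abs(w)

for lr in (1.1, 0.1, 0.001):
    print(f"lr={lr}: |w| after 50 steps = {gradient_descent(lr):.4g}")
```

With the largest rate the iterate overshoots the minimum and moves further away at every step, with the middle rate it converges quickly, and with the smallest rate it has barely moved after 50 steps.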
