May 2, 2022

Generalizing Your Model: An Example With EfficientNetV2 and Cats & Dogs

Daniel Reiff

Consider this scenario. You are using a new, state-of-the-art CNN architecture, EfficientNetV2, to train an image classifier. You’ve achieved impressive training accuracy (> 95%), but the model performs much worse on evaluation samples than on training samples.

As machine learning engineers, we understand that our models are only as good as their performance on unseen data. This raises the question:

How can we increase the performance of our networks on unseen data?

When our models are overfitting, the two most common fixes are:

  1. Train our model on more samples
  2. Change the complexity of our model

Since data is expensive (time + cost) to acquire, in this blog we will focus on transforming our data with an augmentation pipeline and changing model complexity. At Forsight, a construction tech startup focused on construction safety and security, our machine learning team uses these strategies to produce better generalized models.

You can read about our work in CNN interpretation here and PPE detection here.

In this article, we will use a dataset of cats & dogs to show you how to fine-tune your model and improve performance on unseen data. You can easily extend this example and use it to improve your own model! Let’s jump in.

Dataset & Model

We will keep it simple with a classic image classification problem, cats vs. dogs, using a high-quality dataset from Kaggle. Let’s take 20,000 images and train on 16,000 of them. The remaining 4,000 images will be used for evaluation.
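As a minimal sketch of this 16,000 / 4,000 split, assume the 20,000 images are addressed by index and shuffled once with a fixed seed. The helper name `split_indices` and the seed value are our own illustrative choices, not part of the original pipeline.

```python
import random


def split_indices(num_samples, num_train, seed=42):
    """Shuffle sample indices once and split them into train / evaluation sets.

    The seed is an illustrative assumption; fixing it keeps the split
    reproducible across runs.
    """
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    return indices[:num_train], indices[num_train:]


# 20,000 images in total, 16,000 for training, the remaining 4,000 for evaluation.
train_idx, eval_idx = split_indices(num_samples=20_000, num_train=16_000)
```

Shuffling before splitting matters when the files are ordered by class, since an unshuffled split could leave one class underrepresented in the evaluation set.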

We will start with a baseline EfficientNetV2B0 model architecture. The base model will sit between input/standardization layers and a binary classification head. The binary classification head will include a global average pooling layer, a dropout layer with a 35% dropout rate, and a dense prediction layer. The model will train for 100 epochs with an Adam optimizer. For faster convergence, we’ll use a 1Cycle learning rate schedule with a maximum learning rate of 0.001. The 1Cycle schedule has two phases and is designed to achieve super-convergence. In phase 1, we gradually increase the learning rate to the maximum using cosine annealing. In phase 2, we decrease the learning rate back down to 0, again using cosine annealing. You can read more about it here.
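The two phases of the 1Cycle schedule can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the article fixes the maximum learning rate (0.001) and the final value (0), but the starting learning rate and the share of training spent in phase 1 are not given, so `initial_lr` and `warmup_fraction` below are hypothetical values.

```python
import math


def cosine_interpolate(start, end, fraction):
    """Move from `start` to `end` along a half cosine as `fraction` goes 0 -> 1."""
    return end + (start - end) * (1 + math.cos(math.pi * fraction)) / 2


def one_cycle_lr(step, total_steps, max_lr=1e-3, initial_lr=4e-5, warmup_fraction=0.3):
    """Learning rate at `step` for a two-phase 1Cycle schedule.

    max_lr comes from the article; initial_lr and warmup_fraction are
    assumed values chosen only for illustration.
    """
    if not 0 < warmup_fraction < 1:
        raise ValueError("warmup_fraction must be strictly between 0 and 1")
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        # Phase 1: cosine-anneal upwards from initial_lr to max_lr.
        return cosine_interpolate(initial_lr, max_lr, step / warmup_steps)
    # Phase 2: cosine-anneal downwards from max_lr to 0.
    decay_steps = total_steps - warmup_steps
    return cosine_interpolate(max_lr, 0.0, (step - warmup_steps) / decay_steps)


# Learning rate at the start, at the peak, and at the end of a 100-step run.
schedule = [one_cycle_lr(step, total_steps=100) for step in (0, 30, 100)]
```

In a real training run the step would usually be the global batch counter rather than the epoch, so the learning rate changes smoothly within each epoch.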

Figure: epoch loss and epoch binary accuracy for the baseline model.

After 100 epochs, the model has learned the training data extremely well, with training accuracy approaching 100%. But validation loss is increasing, which indicates that we are overfitting to the training data, and validation accuracy is stuck at around 85%, which is not good enough! Let’s start fine-tuning the model to improve performance on the validation samples with an image augmentation pipeline.

Image Augmentations

Data augmentation is a series of input transformations that preserve output labels. It is a common technique for increasing the size and diversity of a dataset. For an image dataset, common transformations include pixel-level operations like changing the color or brightness and adding noise. Image-level transformations like rotations and flips are also common. Let’s insert an image augmentation pipeline into the greater data pipeline prior to model training. In some simple binary classification problems, reducing model complexity can also help prevent overfitting. But in this example, we will use an image augmentation pipeline to increase the diversity of the training set and hopefully reduce overfitting.

For the pipeline, we will use albumentations, a fast and flexible library widely used in industry, research, competitions, and other projects. We will first implement pixel-level transforms: blur, random brightness contrast, RGB shift, and noise. Afterwards, we will implement image-level transformations: horizontal flip and random rotate 90. Finally, we will add image compression. You can explore different augmentations, sequence transformations into pipelines, and test everything with your own images here! Below is our complete pipeline with an example input & output image:

Let’s experiment with the pipeline by varying the probability that each augmentation is applied to each training sample. For now, we will keep it simple and apply the same probability, x, to each augmentation. In the future, the individual probabilities can be fine-tuned. We will increase this probability progressively from 0% to 33%. We will use the same b0 model detailed above, and all other parameters will remain unchanged.
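To build some intuition for what a shared probability means, here is a dependency-free sketch of the sampling logic. It is not the albumentations pipeline itself: images are plain nested lists, only two toy transforms are implemented, and the helper names are hypothetical. In albumentations the same idea is expressed through the `p` argument of each transform. The sketch also assumes that the seven transforms listed above are applied independently of each other.

```python
import random


def hflip(image):
    """Mirror an image (a list of rows) left to right."""
    return [row[::-1] for row in image]


def rot90(image):
    """Rotate an image (a list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment(image, transforms, p, rng):
    """Apply each transform in order, each with independent probability p."""
    for transform in transforms:
        if rng.random() < p:
            image = transform(image)
    return image


def fraction_untouched(num_transforms, p):
    """Expected share of samples that pass through with no transform applied."""
    return (1 - p) ** num_transforms


rng = random.Random(0)  # fixed seed, chosen only to make the sketch reproducible
sample = [[1, 2], [3, 4]]
augmented = augment(sample, [hflip, rot90], p=0.33, rng=rng)

# With seven independent transforms:
untouched_at_5 = fraction_untouched(7, 0.05)   # roughly 0.70
untouched_at_33 = fraction_untouched(7, 0.33)  # roughly 0.06
```

Under these assumptions, even a 5% probability per transform changes roughly 30% of the training samples in some way, and at 33% only about 6% of samples reach the model unmodified.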

The augmentation pipeline improves model performance on unseen data with just a 5% chance of each augmentation being applied! Validation loss no longer increases as we train the model, indicating that the model is no longer overfitting. Raising the probability to 10% or more further decreases the validation loss, but with diminishing returns. On the other hand, training performance suffers as we increase the augmentation probability. This is because it is more difficult for the model to learn patterns within training samples when they are constantly changing. But the model has a better grasp of the evaluation set, which is what we want! The model weights are now clearly less affected by the detail and noise in the training set. Moving forward, let’s set the augmentation probability parameter to 33%. Now that the model is not overfitting, we can further improve performance on unseen data by experimenting with model complexity.

Model Complexity

In 2021, Mingxing Tan and Quoc V. Le introduced a smaller and more efficient version of EfficientNet called EfficientNetV2. After studying the bottlenecks in EfficientNet, they designed a new parameter search space which produced an improved model architecture. You can read more about the model architecture and parameter search process here. They also introduced a new non-uniform scaling strategy where layers are gradually added to later stages and scaled up by a depth parameter (layers) and by a width parameter (channels) to create more complex models. The authors used their improved parameter search space to create a baseline model, EfficientNetV2B0, and then used their scaling strategy to create more complex models. We will experiment with the more complex models but will also look at reducing the baseline model complexity by scaling down the depth parameter. Here’s an overview of all the models we will be experimenting with:

We’re interested in the relationship between model complexity and model performance on unseen data. This leads us to ask the following questions:

Is the baseline model too complex for the classification task at hand? Is this causing the model to learn the noise in the training set?

Is the baseline model not complex enough? Is more complexity needed to differentiate cats & dogs?

We will attempt to answer these questions by training models of varied complexity which are detailed in the table above. We will use an augmentation pipeline with a 33% chance of each transformation being applied. All other model and dataset parameters will remain the same.
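To make depth and width scaling a little more concrete, here is a sketch of the kind of rounding helpers found in open-source EfficientNet implementations: the width coefficient multiplies the channel count of a stage, and the depth coefficient multiplies its number of layers. The coefficients in the usage lines are illustrative values only, not the ones used for any of the models in this article.

```python
import math


def round_filters(filters, width_coefficient, divisor=8):
    """Scale a channel count by the width coefficient, rounded to a multiple of `divisor`."""
    scaled = filters * width_coefficient
    new_filters = max(divisor, int(scaled + divisor / 2) // divisor * divisor)
    # Avoid rounding down by more than 10% of the scaled value.
    if new_filters < 0.9 * scaled:
        new_filters += divisor
    return int(new_filters)


def round_repeats(repeats, depth_coefficient):
    """Scale the number of layers in a stage by the depth coefficient, rounding up."""
    return int(math.ceil(depth_coefficient * repeats))


# Illustrative coefficients (assumptions, not values from the article):
wider_stage = round_filters(32, width_coefficient=1.2)      # more channels
deeper_stage = round_repeats(2, depth_coefficient=1.1)      # more layers
shallower_stage = round_repeats(2, depth_coefficient=0.5)   # scaled-down depth
```

Scaling the depth coefficient below 1, as in the last line, is the same kind of move as the reduced-complexity variants of the baseline mentioned above.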

Increasing model complexity leads to improved performance on unseen data! We are able to decrease validation loss by 45% when increasing model complexity from 5.92E+6 parameters in the b0 baseline model to 1.18E+8 parameters in the L model. It makes intuitive sense that differentiating cats & dogs would benefit from more model complexity. The appearances of cats & dogs can vary widely among different breeds. Some breeds of dogs look very cat-like and vice versa. Additionally, the model can’t use basic characteristics like color and size to differentiate. More complex features like facial structure and paws need to be considered. Let’s use Grad-CAM to analyze samples unseen by the models. Red regions are distinguishing features that cause the model to classify the image as either a cat or a dog. You can read more about interpreting algorithms with Grad-CAM here.
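For readers who want to see what goes into such a heatmap, here is a minimal sketch of the core Grad-CAM arithmetic in plain Python. It assumes the feature maps of the last convolutional layer and the gradients of the class score with respect to those maps have already been extracted from the model; that extraction step is framework-specific and is not shown. The function name and the tiny inputs are illustrative only.

```python
def grad_cam(activations, gradients):
    """Combine feature maps into a normalized Grad-CAM heatmap.

    activations, gradients: lists of K channel maps, each an H x W nested list.
    """
    # 1. Channel weights: spatial average of the gradients for each channel.
    weights = []
    for grad_map in gradients:
        values = [value for row in grad_map for value in row]
        weights.append(sum(values) / len(values))

    # 2. Weighted sum of the feature maps.
    height, width = len(activations[0]), len(activations[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for weight, act_map in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * act_map[i][j]

    # 3. ReLU keeps only regions that push the class score up.
    cam = [[max(0.0, value) for value in row] for row in cam]

    # 4. Normalize to [0, 1] so the map can be rendered as a color overlay.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Two 2x2 channels: the first supports the class, the second works against it.
toy_activations = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]
toy_gradients = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(toy_activations, toy_gradients)
```

In the toy example only the top-left cell ends up highlighted, because the second channel has a negative weight and its contribution is removed by the ReLU. The highest values in the heatmap correspond to the red regions described above.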

In both dog samples, the red region more precisely fits the dogs’ bodies as we increase model complexity. At low complexity, the red regions include parts of the background, indicating that these models haven’t really learned distinguishing features of dogs. It is especially striking how, for the dog on the right, the red region shifts away from the background and targets the dog’s face and nose as we progressively increase the complexity.

For the cat sample, the focus is all over the place, especially in the less complex models. The b0–3 models emphasize all parts of the image except for the cat’s face. But as we continue to add more complexity, the red region zeroes in on the cat’s face. The most complex model, L, emphasizes the cat’s whiskers, which we know are a distinguishing feature!


In this blog, we’ve hopefully provided some useful insights and tools for improving your model’s performance on unseen data. With the help of cats & dogs, we’ve explored an image augmentation pipeline to reduce overfitting. Additionally, we’ve experimented with different model complexities. In this example, we’ve witnessed how more complexity can help our models zero in on more complicated features. Most importantly, we’ve used these tools to increase model accuracy on unseen data from ~85% to ~97%! Even with the best model and augmentation pipeline, our model can still get some samples wrong. As you can see below, some samples of dogs look like cats (left image) and vice versa (right).

In machine learning, we can only improve model performance so much from experimenting with parameters in our architecture and pipeline. We hope this article provides valuable insights as to how you can get the most out of your data!

If you’re interested in this topic and you would love to work on similar problems, please reach out to us.

See more on Medium.

Wanna see more of this content? Please check the full article on our Medium channel where we have deeper content ready for you.
