Introduction to Batch Size
A dataset consists of individual data points. A group of data points processed together is called a batch, and the number of data points in it is the batch size. Batches are what the model consumes during training, and the batch size depends on the variant of gradient descent being used. Gradient descent is the optimization technique typically used to train neural networks and other machine learning models. The model learns from the training data over time, and the cost function acts as a barometer, gauging how accurate the model is after each round of parameter updates. The main variants are Batch Gradient Descent, Stochastic Gradient Descent, and Mini-batch Gradient Descent.
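The parameter update behind gradient descent can be written compactly. The notation below is an assumption introduced here for illustration and is not taken from the text: \theta denotes the model parameters, \eta the learning rate, B the batch of examples used for one update, f_\theta the model, and \ell the loss on a single example.

```latex
\theta \leftarrow \theta - \eta \, \frac{1}{|B|} \sum_{i \in B} \nabla_{\theta}\, \ell\big(f_{\theta}(x_i),\, y_i\big)
```

Under this notation, the variants of gradient descent differ only in how many examples are placed in B for each update.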
Batch Gradient Descent
In batch gradient descent, the full training dataset is passed to the model for every update, so the batch size equals the number of training samples. This raises several problems. First, it requires a lot of memory, because the whole training dataset has to be loaded into memory at once. Second, by passing the full training dataset, the resulting stable error gradient can occasionally lead the model to converge to a state that is worse than the best it is capable of. Finally, batch gradient descent can be sluggish on very large datasets: with several million examples, training can take a long time, since each iteration requires a prediction for every example in the training dataset.
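As a rough illustration, the following sketch applies batch gradient descent to a tiny linear regression problem. The toy data, the function name `batch_gradient_descent`, the mean squared error loss, and the learning rate are all assumptions chosen for the example, not something prescribed by the text.

```python
# Hypothetical toy data generated from y = 2x + 1 (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]


def batch_gradient_descent(xs, ys, lr=0.05, epochs=2000):
    """Fit y = w*x + b by minimizing mean squared error with full-batch updates."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Every iteration needs a prediction for every training example.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradient of the mean squared error, averaged over the whole dataset.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Exactly one parameter update per pass over the data.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


w, b = batch_gradient_descent(xs, ys)
print(round(w, 3), round(b, 3))
```

Note that the loop body touches all of `xs` and `ys` before it changes `w` and `b` even once, which is where the memory and time costs described above come from.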
Advantages of Batch Gradient Descent
- The steps toward the minimum of the loss function are less erratic, because the coefficients are adjusted using the average over all training examples rather than the value of a single sample.
- It benefits from vectorization, which speeds up processing by handling all training examples at once, and it produces a more stable error gradient and more stable convergence than stochastic gradient descent.
- It is computationally efficient per example, because it puts all available computing resources to work on the whole training set rather than on one sample at a time.
Disadvantages of Batch Gradient Descent
- The stable error gradient can occasionally cause the model to settle prematurely in a local minimum.
- Processing the entire training set at once can require more memory than is available.
- Depending on the available computing resources, processing every training example for a single update can take too long.
Stochastic Gradient Descent
In stochastic gradient descent, only one training data point is considered at a time, and the parameters are updated after each one. In each pass over the data, the model iterates until it has gone through all training data points. So the batch size equals one data point.
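A minimal sketch of this idea, again on a hypothetical linear regression problem, might look as follows. The function name `stochastic_gradient_descent`, the squared error loss, the per-pass shuffling, and the hyperparameter values are assumptions made for illustration.

```python
import random


def stochastic_gradient_descent(xs, ys, lr=0.01, epochs=1000, seed=0):
    """Fit y = w*x + b with one parameter update per training example."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    updates = 0
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new order on each pass
        for i in indices:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of the squared error for this single example.
            w -= lr * 2.0 * error * xs[i]
            b -= lr * 2.0 * error
            updates += 1
    return w, b, updates


# Hypothetical toy data generated from y = 2x + 1 (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b, updates = stochastic_gradient_descent(xs, ys)
print(updates)  # 5000: one update per example per pass
```

Only one example is needed in memory for each update, but the number of updates grows with the size of the dataset.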
Advantages of Stochastic Gradient Descent
- Because the network considers only one training sample at a time, it is easier to fit into memory.
- Each individual update is computationally cheap.
- It can converge more quickly on larger datasets, since it updates the parameters more frequently.
- Because of the frequent updates, the steps toward the minimum of the loss function oscillate, which can help the model escape local minima.
Disadvantages of Stochastic Gradient Descent
- Because of the frequent updates, the steps taken toward the minimum are very noisy, so the descent frequently changes direction. These noisy steps can also make it take longer to reach the minimum of the loss function.
- Updating the parameters after every single training example makes a full pass over the data computationally expensive.
- Because it works with only one sample at a time, it loses the benefit of vectorized operations.
Mini-Batch Gradient Descent
Mini-batch gradient descent splits the training data into batches of a size chosen by the user, rather than imposing a predefined batch size. For example, suppose the user picks a batch size of 32 and the whole dataset consists of 2k (2,000) data points. The system will make 63 batches: 62 full batches of 32 data points each, plus a final batch holding the remaining 16, if the leftover points are kept rather than dropped.
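The batch count in this example can be checked with a short sketch. The helper `make_batches` is hypothetical, and the code assumes that "2k" means 2,000 data points and that leftover points form a smaller final batch instead of being dropped.

```python
def make_batches(data, batch_size):
    """Split data into consecutive batches; the last one may be smaller."""
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]


dataset = list(range(2000))  # stand-in for 2,000 data points
batches = make_batches(dataset, batch_size=32)

print(len(batches))      # 63
print(len(batches[0]))   # 32
print(len(batches[-1]))  # 16
```

Since 2,000 is not a multiple of 32, the final batch holds only the 16 points left over after 62 full batches.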
Advantages of Mini-Batch Gradient Descent
- It fits conveniently in memory.
- It is computationally efficient.
- Averaging over the training samples in each batch produces stable error changes and stable convergence.
Disadvantages of Mini-Batch Gradient Descent
- It introduces an additional hyperparameter, the batch size, that has to be set for training.
- Error information is obtained batch by batch during training, so it has to be accumulated across batches, which can be cumbersome.
- It makes the training procedure more complicated to implement.
To summarize, a batch is a set of data points. Gradient descent is an optimization algorithm that updates the weights of a model. It comes in different variants: Batch, Stochastic, and Mini-batch gradient descent. Batch gradient descent considers all data points at once, Stochastic gradient descent considers only one data point at a time, and Mini-batch gradient descent trains the model on small batches of equal size.
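One way to see how closely the three variants are related is to write a single training loop in which the batch size is the only thing that changes. The sketch below is an illustration under assumptions: the function name `gradient_descent`, the toy data, the mean squared error loss, and the hyperparameter values are all hypothetical.

```python
import random


def gradient_descent(xs, ys, batch_size, lr=0.05, epochs=1000, seed=0):
    """Fit y = w*x + b; batch_size selects the gradient descent variant."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            # Average the squared-error gradient over the current batch.
            grad_w = 2.0 * sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = 2.0 * sum(errors) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Hypothetical toy data generated from y = 2x + 1 (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * x + 1.0 for x in xs]

full_batch = gradient_descent(xs, ys, batch_size=len(xs))  # batch gradient descent
stochastic = gradient_descent(xs, ys, batch_size=1)        # stochastic gradient descent
mini_batch = gradient_descent(xs, ys, batch_size=2)        # mini-batch gradient descent
```

On this noise-free toy problem all three calls should end up close to w = 2 and b = 1. They differ in how many updates they make per pass and in how many examples each update has to process.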
How to choose batch size?
Practically speaking, researchers advise experimenting with smaller batch sizes first (often 32 or 64), keeping in mind that small batch sizes call for small learning rates. To make full use of the processing capabilities of GPUs, the batch size is commonly chosen to be a power of 2.
What are the main advantages of reducing the batch size when training a model?
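Small batches move through the training process more rapidly, and the parameters are updated more often in each pass over the data, which can promote quicker learning. The cause of the increased speed is clear: because there are fewer data points in each batch, each update needs less computation and less memory. The trade-off is that each update is based on fewer data points, so the individual steps are noisier.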
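The effect of the batch size on the number of parameter updates can be made concrete with a small calculation. The helper `updates_per_epoch` and the dataset size of 2,000 are assumptions used only for illustration, and the code assumes a smaller final batch is kept rather than dropped.

```python
import math


def updates_per_epoch(num_examples, batch_size):
    """Number of parameter updates in one full pass over the data."""
    return math.ceil(num_examples / batch_size)


num_examples = 2000  # hypothetical dataset size
counts = {size: updates_per_epoch(num_examples, size) for size in (32, 64, 128, 256)}
for size, count in counts.items():
    print(size, count)
# 32 63
# 64 32
# 128 16
# 256 8
```

Halving the batch size roughly doubles the number of updates the model receives from the same data.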