Scalable algorithms that can handle modern large-scale datasets are integral to many applications of machine learning (ML). Among these, efficient optimization algorithms, as the bread and butter of many ML methods, hold a special place. Optimization methods that use only first-derivative information, i.e., first-order methods, are the most common tools for training ML models. This is despite the fact that many such methods come with inherent disadvantages, such as slow convergence, high communication costs in distributed settings, and the need for laborious hyper-parameter tuning.
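As a concrete illustration of the latter point (a generic sketch, not specific to any particular method), the prototypical first-order method is gradient descent, which at iteration $k$ relies only on the gradient of the objective $f$:
$$
\mathbf{x}_{k+1} = \mathbf{x}_{k} - \alpha_{k} \nabla f(\mathbf{x}_{k}),
$$
where the step size $\alpha_{k} > 0$ is a hyper-parameter that, in practice, must be tuned.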