Online machine learning on big data streams covers algorithms that (1) are distributed and (2) learn from data streams with only a limited ability to store past data. The first requirement mostly concerns software architectures and efficient algorithms. The second also imposes nontrivial theoretical restrictions on the modelling methods: in the data stream model, older data is no longer available once fresh data arrives, so earlier suboptimal modelling decisions cannot be revised by revisiting it. Online machine learning is a priority within the STREAMLINE project. We have extended the capabilities of Flink to run fast algorithms like XGBoost in a distributed manner. We will also work on distributed software architectures and libraries as well as machine learning models for online learning.
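To make the data stream restriction concrete, here is a minimal single-machine sketch in plain Python. It is not STREAMLINE or Flink code, and it does not cover the distributed requirement. The class `OnlineLinearRegressor`, the generator `synthetic_stream`, and the learning rate are hypothetical names and values chosen for illustration. The model sees each sample exactly once, updates its weights by one stochastic gradient step, and then discards the sample, so memory stays constant no matter how long the stream runs.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Sample = Tuple[List[float], float]


class OnlineLinearRegressor:
    """Linear model updated one sample at a time by stochastic gradient descent."""

    def __init__(self, n_features: int, learning_rate: float = 0.05) -> None:
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x: List[float]) -> float:
        return self.bias + sum(w * xi for w, xi in zip(self.weights, x))

    def learn_one(self, x: List[float], y: float) -> float:
        """Update on a single sample and return the squared error before the update."""
        error = self.predict(x) - y
        for i, xi in enumerate(x):
            self.weights[i] -= self.learning_rate * error * xi
        self.bias -= self.learning_rate * error
        return error * error


def synthetic_stream(n: int, seed: int = 0) -> Iterator[Sample]:
    """Hypothetical stream: y = 2*x0 - 3*x1 + 1 plus small Gaussian noise."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        yield x, 2.0 * x[0] - 3.0 * x[1] + 1.0 + rng.gauss(0.0, 0.01)


def run(stream: Iterable[Sample], model: OnlineLinearRegressor) -> List[float]:
    """Test-then-train loop: score each sample first, learn from it, then drop it."""
    return [model.learn_one(x, y) for x, y in stream]


model = OnlineLinearRegressor(n_features=2)
errors = run(synthetic_stream(5000), model)
early = sum(errors[:100]) / 100   # mean squared error on the first 100 samples
late = sum(errors[-100:]) / 100   # mean squared error on the last 100 samples
```

Comparing `early` with `late` shows how the error evolves as the stream progresses. Because no past sample is retained, the model has no way to go back and refit on older data, which is exactly the restriction described above.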
This publication gives a more detailed overview of the concept of Online Machine Learning.