Description:
A distinguishing feature of big data platforms is their large number of configurable parameters. Apache Spark is an open-source big data processing platform that can process real-time data; because it demands substantial CPU power and memory capacity, it exposes many configurable parameters, such as the number of cores and the driver memory, that can be tuned for each execution. Unlike previous work, this study develops a Kriging-based multi-objective optimization method. Kriging-based means that a surrogate model is executed to build a response surface that yields a set of optimal solutions. The most important advantage of the proposed method over the alternatives is that it employs three fitness functions. The method is evaluated on the MLlib library and the HiBench benchmarks. MLlib provides a variety of machine learning algorithms suited to execution on resilient distributed datasets (RDDs). The experimental results show that the proposed method outperforms the alternatives in hypervolume improvement and uncertainty reduction. Further, the results support the hypothesis that focusing on the parameters associated with data compression and memory usage improves the effectiveness of multi-objective optimization methods developed for Spark. Multi-objective optimization in Spark is inherently complex because of the dimensionality of the objective functions. Although simplifying the optimization setup and steps has proven to be the most effective way to reduce that complexity, it does little to resolve the ambiguity of the Pareto front. The proposed method achieved a 1.93x speedup in the benchmark experiments, exceeding the closest competitor by a notable margin of 0.63x.
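As a rough illustration of the Kriging step described above, the sketch below fits a Gaussian-process (Kriging) surrogate to a few measured (configuration, runtime) points and predicts runtime, with an uncertainty estimate, at unseen configurations. The configurations and runtimes are hypothetical, and a real implementation would normalize the inputs, center the outputs, and fit the kernel hyperparameters; this is a minimal sketch, not the paper's implementation.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    # Squared-exponential kernel between the row vectors of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def kriging_fit_predict(X_train, y_train, X_test, noise=1e-6, length_scale=1.0):
    # Gaussian-process (Kriging) regression with zero prior mean:
    # returns the posterior mean and variance at X_test.
    K = rbf_kernel(X_train, X_train, length_scale) + noise * np.eye(len(X_train))
    Ks = rbf_kernel(X_test, X_train, length_scale)
    alpha = np.linalg.solve(K, y_train)
    mean = Ks @ alpha
    v = np.linalg.solve(K, Ks.T)
    var = 1.0 - np.einsum("ij,ji->i", Ks, v)  # prior variance k(x,x) = 1
    return mean, var

# Hypothetical samples: (cores, driver memory in GB) -> runtime in seconds.
X = np.array([[2.0, 4.0], [4.0, 8.0], [8.0, 16.0]])
y = np.array([120.0, 80.0, 60.0])
mean, var = kriging_fit_predict(X, y, X)
```

The predictive variance is what a Kriging-based optimizer exploits: configurations where the surrogate is uncertain are candidates for the next measurement, which is how such methods reduce uncertainty over the response surface.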
Increasing the number of cores in multi-objective optimization does not contribute to speedup; rather, it wastes CPU resources. Instead, the optimal number of cores should be determined by observing how the speedup changes across varying Spark configurations.
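A minimal sketch of that selection rule, assuming hypothetical measured runtimes per core count and an arbitrary 5% relative-gain threshold (neither comes from the study):

```python
def pick_core_count(runtimes, min_gain=0.05):
    """Return the core count at which speedup stops improving meaningfully.

    runtimes: {cores: measured runtime in seconds}; min_gain is the minimum
    relative speedup improvement required to justify adding more cores.
    """
    cores = sorted(runtimes)
    base = runtimes[cores[0]]          # baseline runtime at the fewest cores
    best, prev_speedup = cores[0], 1.0
    for c in cores[1:]:
        speedup = base / runtimes[c]
        if (speedup - prev_speedup) / prev_speedup < min_gain:
            break                      # extra cores no longer pay off
        best, prev_speedup = c, speedup
    return best

# Fabricated measurements: runtime flattens beyond 8 cores.
measured = {2: 100.0, 4: 55.0, 8: 50.0, 16: 50.0}
chosen = pick_core_count(measured)  # -> 8
```

This captures the point made above: past the knee of the speedup curve, additional cores only consume resources without improving runtime.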