Comparison between Hadoop-MapReduce & Spark

  • December 16, 2017

While a lot has been written about MapReduce and Spark, I am attempting to summarize the differences crisply across five parameters, viz. real-time analysis, data, iterative algorithms, graph algorithms and speed.

Here's a small comparison between Hadoop-MapReduce (MR) & Spark (S):

Real-Time Analysis::

MR - Poorly suited to real-time analysis, because it is designed for batch processing

S - Supports distributed processing of streaming data (via Spark Streaming)

Data::

MR - a.) Saves intermediate data on disk between processing steps

b.) Disk I/O takes a lot of time

c.) Higher latency

S - a.) Keeps data in memory where possible

b.) Lower latency

Iterative Algorithm::

MR - Inconvenient, because each iteration has to read its input from disk and write its output back to disk

S - Caches intermediate results; later iterations reuse the cache, which makes them fast
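
To make the iterative case concrete, here is a minimal plain-Python sketch. It is a toy model, not Hadoop or Spark code: the function run_iterations, the JSON input file and the gradient-descent task are all my own assumptions for illustration. It only counts how often the input is read from disk (it does not model MR writing its output back each round).

```python
import json
import os
import tempfile


def run_iterations(path, iterations, use_cache):
    """Estimate the mean of the numbers stored in `path` by gradient descent.

    Returns (estimate, number_of_disk_reads).
    use_cache=False re-reads the input every iteration (MR-like behaviour);
    use_cache=True reads it once and reuses the in-memory copy (Spark-like).
    """
    disk_reads = 0
    cached = None
    estimate = 0.0
    for _ in range(iterations):
        if use_cache and cached is not None:
            data = cached  # reuse the in-memory copy
        else:
            with open(path) as f:  # go back to disk for the input
                data = json.load(f)
            disk_reads += 1
            if use_cache:
                cached = data
        # One gradient step toward the mean of the data.
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= 0.5 * gradient
    return estimate, disk_reads


def demo():
    """Run the same 10 iterations in both styles on a small temp file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "input.json")
        with open(path, "w") as f:
            json.dump([1.0, 2.0, 3.0, 4.0], f)
        mr_style = run_iterations(path, 10, use_cache=False)
        spark_style = run_iterations(path, 10, use_cache=True)
    return mr_style, spark_style
```

Both runs arrive at the same estimate; the only difference is that the uncached run goes to disk once per iteration while the cached run goes once in total. In real Spark code the equivalent step would be caching the dataset before the loop.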

Graph Algorithm::

MR - No built-in mechanism for passing messages between neighboring nodes

S - A graph algorithm library called GraphX is included.
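
To show what "messaging neighboring nodes" means in practice, here is a rough plain-Python sketch of a vertex-centric shortest-path computation in the Pregel style (GraphX offers a Pregel-style API of its own). This is not GraphX code; the function shortest_paths and the adjacency-dict format are assumptions made for the example, and it assumes edge weights are non-negative.

```python
def shortest_paths(edges, source):
    """Single-source shortest paths by passing messages between neighbors.

    edges: dict mapping vertex -> list of (neighbor, weight) pairs.
    Assumes non-negative weights. Returns dict vertex -> distance.
    """
    vertices = set(edges)
    for neighbors in edges.values():
        vertices.update(n for n, _ in neighbors)
    dist = {v: float("inf") for v in vertices}

    # Superstep 0: the source receives an initial message of distance 0.
    inbox = {source: 0.0}
    while inbox:
        outbox = {}
        for vertex, message in inbox.items():
            # A vertex only acts when the message improves its distance.
            if message < dist[vertex]:
                dist[vertex] = message
                # Tell each neighbor the distance reachable through us,
                # keeping only the smallest message per recipient.
                for neighbor, weight in edges.get(vertex, []):
                    candidate = message + weight
                    if candidate < outbox.get(neighbor, float("inf")):
                        outbox[neighbor] = candidate
        inbox = outbox  # the next superstep delivers these messages
    return dist
```

For example, shortest_paths({"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": [("d", 1)], "d": []}, "a") settles on a distance of 3 for "c" (through "b") rather than the direct edge of 4. Each pass of the while loop is one round of neighbor-to-neighbor messages, which is the part plain MR has no direct support for.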

Speed::

MR - Unable to make effective use of the memory available in a Hadoop cluster.

S - Using RDDs (Resilient Distributed Datasets), data can be kept in memory and written to disk only when necessary.
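
The "memory first, disk only when necessary" idea can be sketched in a few lines of plain Python. This is a toy model and not how Spark is implemented: the class MemoryFirstStore, its max_in_memory limit and the pickle spill files are all hypothetical names chosen for illustration. It loosely resembles what Spark's MEMORY_AND_DISK storage level does for partitions that do not fit in memory.

```python
import os
import pickle
import tempfile


class MemoryFirstStore:
    """Keeps partitions in memory until a limit is hit, then spills to disk."""

    def __init__(self, max_in_memory, spill_dir):
        self.max_in_memory = max_in_memory  # number of partitions that fit
        self.spill_dir = spill_dir
        self.memory = {}
        self.on_disk = set()

    def _path(self, partition_id):
        return os.path.join(self.spill_dir, f"part-{partition_id}.pkl")

    def put(self, partition_id, records):
        has_room = len(self.memory) < self.max_in_memory
        if partition_id in self.memory or has_room:
            self.memory[partition_id] = records  # fast path: stay in memory
        else:
            # Memory is full, so this partition is written to disk instead.
            with open(self._path(partition_id), "wb") as f:
                pickle.dump(records, f)
            self.on_disk.add(partition_id)

    def get(self, partition_id):
        if partition_id in self.memory:
            return self.memory[partition_id]
        # Slow path: read a spilled partition back from disk.
        with open(self._path(partition_id), "rb") as f:
            return pickle.load(f)


def demo_store():
    """Store three partitions with room for two; return what ended up where."""
    with tempfile.TemporaryDirectory() as tmp:
        store = MemoryFirstStore(max_in_memory=2, spill_dir=tmp)
        for pid in range(3):
            store.put(pid, [pid, pid * 10])
        in_memory = sorted(store.memory)
        spilled = sorted(store.on_disk)
        last = store.get(2)
    return in_memory, spilled, last
```

With room for two partitions, the first two stay in memory and only the third touches disk, yet all three can still be read back through the same get call.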