Comparison between Hadoop-MapReduce & Spark
While a lot has been written about MapReduce and Spark, I am attempting to summarize it crisply along a few parameters, viz. real-time analysis, data, algorithms (iterative and graph) and speed.
Here's a small comparison between Hadoop-MapReduce (MR) & Spark (S):
MR - Not well suited to real-time analysis, because it is designed for batch processing
S - Distributed processing of streaming data is supported
MR - a.) Saves data on disk
b.) Disk I/O takes a lot of time
c.) Higher latency
S - a.) Saves data in memory
b.) Lower latency
MR - Inconvenient for iterative algorithms, because each iteration has to read its input from disk and write its output back to disk
S - Caches intermediate results in memory; reusing the cache across iterations makes it fast
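To make the iterative point concrete, here is a minimal plain-Python sketch (not actual Hadoop or Spark code) that counts disk round trips for the two styles. The function names, the JSON file format and the toy update rule `v * 2 + 1` are all assumptions made up for illustration.

```python
import json
import os
import tempfile


def step(values):
    """One toy iteration step (hypothetical update rule)."""
    return [v * 2 + 1 for v in values]


def run_from_disk(path, iterations):
    """MR-style loop: every iteration reads its input from disk and
    writes its output back to disk. Expects iterations >= 1."""
    disk_ops = 0
    values = []
    for _ in range(iterations):
        with open(path) as f:       # read this iteration's input
            values = json.load(f)
        disk_ops += 1
        values = step(values)
        with open(path, "w") as f:  # write this iteration's output
            json.dump(values, f)
        disk_ops += 1
    return values, disk_ops


def run_with_cache(path, iterations):
    """Spark-style loop: read once, then keep intermediate results in memory."""
    with open(path) as f:
        values = json.load(f)
    disk_ops = 1
    for _ in range(iterations):
        values = step(values)       # no disk access inside the loop
    return values, disk_ops


def compare(initial, iterations):
    """Run both styles on separate copies of the same input file."""
    workdir = tempfile.mkdtemp(prefix="iter_demo_")
    results = []
    for name, runner in (("disk", run_from_disk), ("cache", run_with_cache)):
        path = os.path.join(workdir, name + ".json")
        with open(path, "w") as f:
            json.dump(initial, f)
        results.append(runner(path, iterations))
    return results
```

Calling `compare([1, 2, 3], 3)` gives the same values from both loops, but the disk-based loop touches the disk twice per iteration while the cached loop reads the input only once. A real cluster adds network and serialization costs on top, which this sketch ignores.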
MR - No mechanism for passing messages between neighboring nodes of a graph
S - A graph algorithm library called GraphX is included.
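The "messaging between neighboring nodes" idea is easier to see in code. Below is a rough plain-Python sketch of vertex-centric message passing, in the spirit of the Pregel-style model that GraphX exposes. It is not the GraphX API; the function name and the choice of connected components as the example are assumptions for illustration. Every vertex must appear in `vertices`.

```python
def connected_components(vertices, edges):
    """Label each vertex with the smallest vertex id reachable from it,
    by letting neighbors exchange messages in rounds (supersteps)."""
    vertices = list(vertices)
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Each vertex starts with its own id as its label.
    label = {v: v for v in vertices}

    # Initial round: every vertex sends its label to its neighbors.
    inbox = {v: [] for v in vertices}
    for v in vertices:
        for n in neighbors[v]:
            inbox[n].append(label[v])

    # Keep going while any vertex still has unread messages.
    while any(inbox.values()):
        outbox = {v: [] for v in vertices}
        for v, messages in inbox.items():
            if messages and min(messages) < label[v]:
                # A smaller label arrived: adopt it and tell the neighbors.
                label[v] = min(messages)
                for n in neighbors[v]:
                    outbox[n].append(label[v])
        inbox = outbox
    return label
```

For example, `connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])` labels vertices 1, 2 and 3 with 1, and vertices 4 and 5 with 4. In plain MapReduce, each of these rounds would typically have to be a separate job with a disk round trip in between.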
MR - Unable to make effective use of the memory of the Hadoop cluster.
S - Using RDDs (Resilient Distributed Datasets), data can be kept in memory and written to disk only when necessary.
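As a rough picture of "in memory first, disk only when necessary", here is a toy plain-Python store that keeps partitions in memory up to a limit and spills the rest to disk. It is loosely modeled on the behavior of Spark's `MEMORY_AND_DISK` storage level, but it is not Spark code: the class name `MemoryFirstStore`, the `max_in_memory` parameter and the pickle-based spill files are all hypothetical.

```python
import os
import pickle
import tempfile


class MemoryFirstStore:
    """Toy partition store: memory first, spill to disk when memory is full."""

    def __init__(self, max_in_memory):
        self.max_in_memory = max_in_memory  # how many partitions fit in memory
        self.memory = {}                    # key -> partition held in memory
        self.on_disk = {}                   # key -> path of the spilled file
        self.spill_dir = tempfile.mkdtemp(prefix="spill_")

    def put(self, key, partition):
        # Keep the partition in memory while there is room for it.
        if key in self.memory or len(self.memory) < self.max_in_memory:
            self.memory[key] = partition
            return
        # Otherwise spill it to disk.
        path = os.path.join(self.spill_dir, f"part-{key}.pkl")
        with open(path, "wb") as f:
            pickle.dump(partition, f)
        self.on_disk[key] = path

    def get(self, key):
        # Fast path: the partition is still in memory.
        if key in self.memory:
            return self.memory[key]
        # Slow path: read the spilled partition back from disk.
        with open(self.on_disk[key], "rb") as f:
            return pickle.load(f)
```

With `MemoryFirstStore(max_in_memory=2)`, the first two partitions you `put` stay in memory and a third one lands on disk, yet `get` returns all three the same way. The real thing also handles eviction, recomputation from lineage and serialization formats, none of which is modeled here.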