ITechShree-Data-Analytics-Technologies

Apache Spark Quick Start

Is Apache Spark a database?

No. Apache Spark is not a database; it is an open-source, in-memory cluster computing framework.

Spark applications are in high demand today because of Spark's high performance across multiple domains and its rich set of libraries.

 

Why is Apache Spark preferred?

 

1. Why is Apache Spark preferred over MapReduce? Spark can deliver up to 100 times faster performance than the Hadoop MapReduce processing framework, which reads and writes intermediate data to disk in HDFS, whereas Spark processes data in memory.

2. It provides built-in APIs in multiple languages (Java, Scala, Python, and R) to get the job done.

3. It includes a rich set of higher-level tools:

  • Spark SQL: provides APIs for processing SQL and structured data.
  • MLlib: includes algorithms and utilities to train and evaluate machine learning models.
  • Stream processing: Spark Streaming and Structured Streaming are used to process streams of data in near-real-time applications.
  • GraphX: used for graph processing and graph analytics.

 

[Figure: Spark components]






Is Spark compatible with Hadoop?

 

Yes, Spark is compatible with a Hadoop cluster. Spark also ships with its own cluster manager (the standalone cluster), which lets Spark jobs run independently of Hadoop.

Spark can also be installed on a Hadoop 2 cluster and run on the YARN resource manager (or on Apache Mesos), and it can even run alongside the MapReduce framework on a Hadoop 1 cluster. In this way Spark enhances Hadoop by working with all of its components and tools.

 

 

RDD (Resilient Distributed Dataset) in Spark

 

The basic building block of Spark is the RDD (Resilient Distributed Dataset). It is the core abstraction of the Spark framework.

An RDD can hold any type of data and is partitioned across the nodes of the cluster.

How is an RDD resilient? It is resilient because it is immutable in nature and highly fault tolerant.

Why is an RDD immutable in Spark? Immutable means that once an RDD is created, we cannot modify it. We can only apply transformations, and each transformation creates a new RDD while the original remains unchanged.

In a cluster, data on a node is lost if the node fails, but RDDs are fault tolerant because Spark records how each RDD was derived and knows how to recreate and recompute the lost data.
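This lineage idea can be sketched in a few lines of plain Python. This is a toy illustration, not Spark's actual implementation: the `ToyRDD` class and its methods are invented here to show how remembering the chain of transformations makes recomputation possible.

```python
# Toy sketch of RDD lineage: an "RDD" remembers its source data and the
# transformations applied to it, so the dataset can be recomputed on
# demand after a failure. Names are illustrative, not Spark's API.
class ToyRDD:
    def __init__(self, source, lineage=None):
        self.source = source          # the original input data
        self.lineage = lineage or []  # ordered list of transformations

    def map(self, fn):
        # Transformations only extend the lineage; nothing runs yet.
        return ToyRDD(self.source, self.lineage + [("map", fn)])

    def filter(self, fn):
        return ToyRDD(self.source, self.lineage + [("filter", fn)])

    def compute(self):
        # Replaying the lineage recreates the dataset from scratch,
        # which is how a lost partition can be rebuilt.
        data = list(self.source)
        for op, fn in self.lineage:
            if op == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

rdd = ToyRDD(range(5)).map(lambda x: x * 10).filter(lambda x: x >= 20)
print(rdd.compute())  # [20, 30, 40]
```

Note that calling `compute()` twice gives the same result, because the original source and the lineage are never modified, mirroring RDD immutability.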

 

RDDs support two types of operations:

1. Transformation:

When we call a transformation on an RDD, it creates a new RDD; nothing is evaluated at that point.

map, filter, flatMap, groupByKey, reduceByKey, aggregateByKey, and pipe are some examples of transformations.

2. Action:

When an action is called on an RDD, all of the pending transformations are computed at that time and the result value is returned.

reduce, collect, count, first, take, countByKey, and foreach are some examples of actions.
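The transformation/action split can be felt in plain Python, since Python's built-in `map` is also lazy. This is only an analogy, not Spark code: the lazy iterator plays the role of a transformation, and consuming it plays the role of an action.

```python
# Pure-Python analogy: map() returns a lazy iterator, much like a Spark
# transformation returns a new RDD without computing anything; consuming
# the iterator plays the role of an action.
evaluated = []

def traced_double(x):
    evaluated.append(x)   # record when work actually happens
    return x * 2

nums = [1, 2, 3, 4]
doubled = map(traced_double, nums)   # "transformation": nothing runs yet
assert evaluated == []               # no element has been processed

result = list(doubled)               # "action": forces evaluation
assert result == [2, 4, 6, 8]
assert evaluated == [1, 2, 3, 4]     # work happened only at the action
```

In real Spark code the shape is the same: `rdd.map(f)` returns instantly, and the work happens only when an action such as `collect()` or `count()` is called.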

 

 

How Spark processes data

 

What is the use of a DAG in Spark? What are the DAG execution engine and the DAG scheduler?

When a Spark job is submitted at the Spark console, Spark converts the code containing transformations and actions into a logical DAG (Directed Acyclic Graph) of stages of tasks. When an action is executed, Spark hands the DAG to the DAG scheduler, which submits the tasks for execution.
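The classic example of such a chain is word count. In Spark this would be written as `lines.flatMap(...).map(...).reduceByKey(...)`, recorded as a DAG and executed only when an action runs. The sketch below expresses the same pipeline in plain Python, purely to show what each step computes:

```python
# What a classic Spark word count expresses, sketched in plain Python.
from functools import reduce
from itertools import chain

lines = ["spark makes big data fast", "spark is fast"]

# flatMap: split each line into individual words
words = chain.from_iterable(line.split() for line in lines)

# map: pair each word with a count of 1
pairs = ((word, 1) for word in words)

# reduceByKey: sum the counts per word
def reduce_by_key(acc, pair):
    word, n = pair
    acc[word] = acc.get(word, 0) + n
    return acc

counts = reduce(reduce_by_key, pairs, {})
print(counts["spark"], counts["fast"])  # 2 2
```

Note that `words` and `pairs` are lazy generators, so, as in Spark, nothing is computed until the final `reduce` consumes the whole chain.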

 

 

Hope you are now aware of how Spark gained popularity over Hadoop and makes big data applications faster.

Happy Learning!

Debashree
