ITechShree-Data-Analytics-Technologies

Apache Spark Quick Start

Is Apache Spark a database?

No. Apache Spark is not a database; it is an open-source, in-memory cluster computing framework.

Spark applications are in high demand today because of Spark's high performance across multiple domains and its rich set of libraries.

 

Why is Apache Spark preferred?

 

1. Why is Apache Spark preferred over MapReduce? Spark can deliver up to 100 times faster performance than the Hadoop MapReduce processing framework, which reads and writes intermediate data to disk in HDFS, whereas Spark processes data in memory.

2. It provides built-in APIs in multiple languages (Java, Scala, Python, and R) to get the job done.

3. It includes a rich set of higher-level tools:

  • Spark SQL: provides APIs for processing SQL and structured data.
  • MLlib: includes algorithms and utilities to train and evaluate machine learning models.
  • Stream processing: Spark Streaming and Structured Streaming are used to process streams of data in near-real-time applications.
  • GraphX: used for graph processing and graph analytics.

 

[Figure: Spark components]






Is Spark compatible with Hadoop?

 

Yes, Spark is compatible with a Hadoop cluster. Spark also ships with its own cluster manager (the standalone cluster), which lets Spark jobs run independently of Hadoop.

Spark can also be installed on a Hadoop 2 cluster and run on the YARN resource manager (or on Apache Mesos), and it can even run alongside the MapReduce framework on a Hadoop 1 cluster. In this way Spark enhances Hadoop by working with all of its components and tools.

 

 

RDD (Resilient Distributed Dataset) in Spark

 

The basic building block of Spark is the RDD (Resilient Distributed Dataset). It is the core abstraction of the Spark framework.

An RDD can hold any type of data and is partitioned across the nodes of the cluster.

How is an RDD resilient? It is resilient because it is immutable in nature and highly fault tolerant.

Why is an RDD immutable in Spark? Immutable means that once an RDD is created, we cannot modify it. We can only apply transformations, and each transformation creates a new RDD while the original remains unchanged.

In a cluster, data on a node is lost if the node fails, but RDDs are fault tolerant because Spark records how each RDD was derived and knows how to recreate and recompute the lost data.
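This lineage idea can be sketched in a few lines of plain Python. This is a toy illustration, not Spark's actual implementation: the `ToyRDD` class and its methods are invented here to show how remembering the chain of transformations makes recomputation possible.

```python
# Toy sketch of RDD lineage: an "RDD" remembers its source data and the
# transformations applied to it, so the dataset can be recomputed on
# demand after a failure. Names are illustrative, not Spark's API.
class ToyRDD:
    def __init__(self, source, lineage=None):
        self.source = source          # the original input data
        self.lineage = lineage or []  # ordered list of transformations

    def map(self, fn):
        # Transformations only extend the lineage; nothing runs yet.
        return ToyRDD(self.source, self.lineage + [("map", fn)])

    def filter(self, fn):
        return ToyRDD(self.source, self.lineage + [("filter", fn)])

    def compute(self):
        # Replaying the lineage recreates the dataset from scratch,
        # which is how a lost partition can be rebuilt.
        data = list(self.source)
        for op, fn in self.lineage:
            if op == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

rdd = ToyRDD(range(5)).map(lambda x: x * 10).filter(lambda x: x >= 20)
print(rdd.compute())  # [20, 30, 40]
```

Note that calling `compute()` twice gives the same result, because the original source and the lineage are never modified, mirroring RDD immutability.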

 

RDDs support two types of operations:

1. Transformation:

When we call a transformation on an RDD, it creates a new RDD; nothing is evaluated at that point.

map, filter, flatMap, groupByKey, reduceByKey, aggregateByKey, and pipe are some examples of transformations.

2. Action:

When an action is called on an RDD, all of the pending transformations are computed at that time and the result value is returned.

reduce, collect, count, first, take, countByKey, and foreach are some examples of actions.
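The transformation/action split can be felt in plain Python, since Python's built-in `map` is also lazy. This is only an analogy, not Spark code: the lazy iterator plays the role of a transformation, and consuming it plays the role of an action.

```python
# Pure-Python analogy: map() returns a lazy iterator, much like a Spark
# transformation returns a new RDD without computing anything; consuming
# the iterator plays the role of an action.
evaluated = []

def traced_double(x):
    evaluated.append(x)   # record when work actually happens
    return x * 2

nums = [1, 2, 3, 4]
doubled = map(traced_double, nums)   # "transformation": nothing runs yet
assert evaluated == []               # no element has been processed

result = list(doubled)               # "action": forces evaluation
assert result == [2, 4, 6, 8]
assert evaluated == [1, 2, 3, 4]     # work happened only at the action
```

In real Spark code the shape is the same: `rdd.map(f)` returns instantly, and the work happens only when an action such as `collect()` or `count()` is called.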

 

 

How Spark processes data

 

What is the use of a DAG in Spark? What are the DAG execution engine and the DAG scheduler?

When a Spark job is submitted at the Spark console, Spark converts the code containing transformations and actions into a logical DAG (Directed Acyclic Graph) of stages of tasks. When an action is executed, Spark hands the DAG to the DAG scheduler, which submits the tasks for execution.
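The classic example of such a chain is word count. In Spark this would be written as `lines.flatMap(...).map(...).reduceByKey(...)`, recorded as a DAG and executed only when an action runs. The sketch below expresses the same pipeline in plain Python, purely to show what each step computes:

```python
# What a classic Spark word count expresses, sketched in plain Python.
from functools import reduce
from itertools import chain

lines = ["spark makes big data fast", "spark is fast"]

# flatMap: split each line into individual words
words = chain.from_iterable(line.split() for line in lines)

# map: pair each word with a count of 1
pairs = ((word, 1) for word in words)

# reduceByKey: sum the counts per word
def reduce_by_key(acc, pair):
    word, n = pair
    acc[word] = acc.get(word, 0) + n
    return acc

counts = reduce(reduce_by_key, pairs, {})
print(counts["spark"], counts["fast"])  # 2 2
```

Note that `words` and `pairs` are lazy generators, so, as in Spark, nothing is computed until the final `reduce` consumes the whole chain.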

 

 

Hope you are now aware of how Spark gained popularity over Hadoop and makes big data applications faster.

Happy Learning!

Debashree
