ITechShree-Data-Analytics-Technologies

Apache Spark interview questions Set 2

 


1. What is the difference between groupByKey() and reduceByKey() in Spark?

groupByKey() works on a dataset of key-value pairs (K, V) and groups the values by key. Every pair is sent across the network, so a lot of shuffling occurs while grouping if the dataset is not already partitioned by key.

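// Pair RDD split into 3 partitions
val dataset = sc.parallelize(Array(('a',5),('b',3),('b',4),('c',7)),3)

// Collect all values that share a key, e.g. (b,[3, 4])
val groupdataset = dataset.groupByKey().collect()

groupdataset.foreach(println)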

 

reduceByKey() is equivalent to grouping plus aggregation. It combines the values for each key within each partition (on the same machine) before shuffling, so much less data crosses the network.

val data = Array("a","b","c","d")

// Map each word to (word, 1), then sum the counts per key
val combined_data = sc.parallelize(data).map(w => (w,1)).reduceByKey((v, w) => v + w)

combined_data.collect.foreach(println)

 

 

2. What are the lineage graph and the DAG in Spark?

Every RDD that Spark derives through a transformation depends on one or more parent RDDs, and the new RDD holds a pointer to each parent. The graph of these dependencies between RDDs, which records how the data was derived rather than the data itself, is known as the lineage graph.

A DAG (directed acyclic graph) is a combination of vertices and edges, where the vertices represent RDDs and the edges represent the operations applied to those RDDs.
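As a rough illustration of the lineage idea, the sketch below builds a short chain of transformations and prints the lineage that Spark recorded for it using `toDebugString`. The object name `LineageSketch`, the method `describeLineage`, and the sample numbers are made up for this example, and it assumes a local Spark installation.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object; names and data are illustrative only.
object LineageSketch {
  // Builds a small chain of transformations and returns its lineage as text.
  def describeLineage(spark: SparkSession): String = {
    val sc = spark.sparkContext
    val numbers = sc.parallelize(1 to 10, 2)        // source RDD, no parent
    val doubled = numbers.map(_ * 2)                // depends on numbers
    val keyed   = doubled.map(n => (n % 3, n))      // depends on doubled
    val summed  = keyed.reduceByKey(_ + _)          // depends on keyed through a shuffle
    summed.toDebugString                            // text form of the recorded lineage
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lineage-sketch")
      .master("local[2]")                           // assumption: run locally
      .getOrCreate()
    try println(describeLineage(spark)) finally spark.stop()
  }
}
```

No data is computed until an action runs; `toDebugString` only reports the chain of parent RDDs that Spark would follow to build (or rebuild) the result.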

3. What is the benefit of the lineage graph?

The lineage graph is used to recompute an RDD whenever needed. If part of an RDD is lost for any reason, Spark uses the lineage information to recompute the lost part and continues processing the application.

4. What is the Catalyst optimizer?

Catalyst is the query optimizer of the Spark SQL framework. It allows Spark to automatically transform SQL queries so that they execute more efficiently, applying optimization techniques such as pushing filters down toward the data source and ensuring that data source joins are performed in the most efficient order.
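One way to watch the optimizer at work is to ask Spark for the query plans. The sketch below is a minimal example, assuming a local Spark session; the object name `CatalystSketch`, the method `booksQuery`, and the sample orders are invented for illustration. `explain(true)` prints the parsed, analyzed, and optimized logical plans together with the physical plan.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical example object; names and data are illustrative only.
object CatalystSketch {
  // A deliberately roundabout query: select everything, filter, then select one column.
  def booksQuery(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val orders = Seq((1, "books", 12.5), (2, "games", 40.0), (3, "books", 7.0))
      .toDF("id", "category", "amount")
    orders
      .select($"id", $"category", $"amount")
      .filter($"category" === "books")
      .select($"id")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("catalyst-sketch")
      .master("local[2]")                 // assumption: run locally
      .getOrCreate()
    try booksQuery(spark).explain(true)   // prints logical and physical plans
    finally spark.stop()
  }
}
```

Comparing the parsed plan with the optimized plan in the output shows what the optimizer rewrote before execution.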

5. Why is the Dataset API faster than the RDD API?

A Spark Dataset does not use standard serializers. Instead, it uses Encoders, which efficiently convert objects into Spark's internal binary format.

The RDD API, in contrast, uses the Java or Kryo serializer. Hence, it is slower for simple grouping and aggregation operations.
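The following sketch shows a typed Dataset whose Encoder is supplied implicitly by `spark.implicits._`, followed by a simple grouping and aggregation. The case class `Sale`, the object `DatasetSketch`, and the sample rows are hypothetical, and the example assumes a local Spark session.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type used only for this example.
case class Sale(region: String, amount: Double)

object DatasetSketch {
  // Sums the amounts per region using the Dataset API.
  def totals(spark: SparkSession): Map[String, Double] = {
    import spark.implicits._              // brings the Encoder for Sale into scope
    val sales: Dataset[Sale] =
      Seq(Sale("east", 10.0), Sale("west", 4.0), Sale("east", 2.5)).toDS()
    sales
      .groupBy($"region")
      .sum("amount")
      .as[(String, Double)]               // tuple columns are mapped by position
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-sketch")
      .master("local[2]")                 // assumption: run locally
      .getOrCreate()
    try totals(spark).foreach(println) finally spark.stop()
  }
}
```

The same aggregation on an RDD would be written as `sales.rdd.map(s => (s.region, s.amount)).reduceByKey(_ + _)`, which works on ordinary JVM objects instead of the encoded binary rows.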

6. Which cluster managers are available in Spark?

· Standalone mode: By default, Spark provides a simple cluster manager of its own, called standalone. It ships with the Spark distribution, is easy to set up, and is resilient in nature.

 

· Apache Mesos: Apache Mesos is an open-source, distributed cluster manager that supports two-level scheduling. The main advantage of using Mesos as the cluster manager is that it supports dynamic partitioning between Spark and other frameworks, as well as scalable partitioning between multiple instances of Spark.

 

· Hadoop YARN: The cluster resource manager of Hadoop 2. It is compatible with Spark as well.

· Kubernetes: Kubernetes is an open-source system for automating the deployment, scaling, and management of containerized applications. It is the newest cluster manager that Spark can schedule on.
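The cluster manager is selected through the master URL given to Spark. The sketch below lists the URL form for each manager described above; the host names and ports are placeholders, and the object `ClusterManagerSketch` and its members are hypothetical names for this example. Only the local entry is actually used when the example runs.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object; hosts and ports below are placeholders.
object ClusterManagerSketch {
  // Master URL form for each cluster manager.
  val masterUrls: Map[String, String] = Map(
    "standalone" -> "spark://master-host:7077",
    "mesos"      -> "mesos://mesos-host:5050",
    "yarn"       -> "yarn",                          // cluster location comes from the Hadoop config
    "kubernetes" -> "k8s://https://k8s-api-host:6443",
    "local"      -> "local[*]"                       // no cluster manager, useful for testing
  )

  // Creates (or reuses) a session that targets the chosen manager.
  def sessionFor(manager: String): SparkSession =
    SparkSession.builder()
      .appName("cluster-manager-sketch")
      .master(masterUrls(manager))
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = sessionFor("local")
    try println(spark.sparkContext.master) finally spark.stop()
  }
}
```

In practice the master is usually passed on the command line with `spark-submit --master ...` rather than hard-coded, so the same application can run on any of these managers.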

7. What is the benefit of using a broadcast variable in Spark?

Broadcast variables are read-only variables cached on each machine. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. This eliminates the need to ship a copy of the variable with each task, so data can be processed faster.
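A minimal sketch of a broadcast lookup table follows. The object `BroadcastSketch`, the method `countryNames`, and the country-code data are invented for illustration, and the example assumes a local Spark session.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object; names and data are illustrative only.
object BroadcastSketch {
  // Resolves country codes against a lookup table shared as a broadcast variable.
  def countryNames(spark: SparkSession): Seq[String] = {
    val sc = spark.sparkContext
    val lookup = Map("IN" -> "India", "US" -> "United States", "DE" -> "Germany")
    val lookupBc = sc.broadcast(lookup)             // sent to each executor once
    val codes = sc.parallelize(Seq("IN", "DE", "IN", "FR"), 2)
    codes
      .map(code => lookupBc.value.getOrElse(code, "unknown")) // read-only access in tasks
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-sketch")
      .master("local[2]")                           // assumption: run locally
      .getOrCreate()
    try countryNames(spark).foreach(println) finally spark.stop()
  }
}
```

Tasks read the table through `lookupBc.value`; they should not try to modify it, since each executor only holds a cached copy.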

8. What is the use of an accumulator in Spark?

Accumulators are variables that are only "added" to through an associative and commutative operation and can therefore be efficiently supported in parallel processing. They can be used to implement counters (as in MapReduce) or sums while a Spark program runs across the cluster.

// sc.accumulator(0) was removed in Spark 3; longAccumulator is the current API
val accum = sc.longAccumulator("sum")

sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))

9. Is there any benefit of MapReduce in comparison with Spark?

Yes. MapReduce is a programming framework used by many big data tools, and it remains relevant as data grows bigger and bigger. Most big data tools, such as Pig and Hive, convert their queries into Map and Reduce phases to optimize them better.

10. What is the use of Spark SQL over HQL and SQL?

Spark SQL is a module built on top of the Spark Core engine that runs both SQL and Hive Query Language (HQL) queries without changing their existing syntax. We can also join a SQL table and an HQL table using Spark SQL.
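The sketch below joins two tables with a plain SQL statement through `spark.sql`. To keep it self-contained, both tables are temporary views built from in-memory data; querying real Hive tables would additionally assume a session created with `enableHiveSupport()` and an available Hive metastore. The object `SparkSqlSketch`, the view names, and the rows are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object; view names and data are illustrative only.
object SparkSqlSketch {
  // Registers two temporary views and joins them with a SQL query.
  def employeeDepartments(spark: SparkSession): Seq[(String, String)] = {
    import spark.implicits._
    Seq((1, "Asha"), (2, "Ravi")).toDF("id", "name")
      .createOrReplaceTempView("employees")
    Seq((1, "Analytics"), (2, "Platform")).toDF("emp_id", "dept")
      .createOrReplaceTempView("departments")

    spark.sql(
      """SELECT e.name, d.dept
        |FROM employees e
        |JOIN departments d ON e.id = d.emp_id
        |ORDER BY e.name""".stripMargin)
      .as[(String, String)]
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-sql-sketch")
      .master("local[2]")                 // assumption: run locally
      .getOrCreate()
    try employeeDepartments(spark).foreach(println) finally spark.stop()
  }
}
```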

 

Hope you enjoy my blog!!

Happy Learning!!

Keep growing.

