ITechShree-Data-Analytics-Technologies

A brief introduction to Apache Hadoop: dive into the Big Data world and start your Big Data journey




To get started in the Big Data world you should:
 1. Brush up your knowledge of Core Java and SQL.

 2. Be familiar with basic Unix commands.


Now let's move into the Big Data world.


What is Big Data?

Nowadays, with the rise of social media, e-commerce, stock markets, and other businesses, data is growing exponentially, and traditional data systems are unable to store or process this huge volume of data. Such data is called Big Data.


Data is considered Big Data if it possesses the following characteristics.

· Volume: The data is huge in size and cannot fit into a traditional system.
· Velocity: The data grows at a very fast rate; users across the globe upload and send huge amounts of data every minute.
· Variety: Data is no longer stored only in rows and columns; data from log files, CCTV footage, etc. is considered unstructured.
· Veracity: This refers to the uncertainty and trustworthiness of the data; because the volume is so huge, it may also contain incorrect or inconsistent records.
· Value: Big Data is of no consequence unless it is analyzed and processed correctly.
Big Data is of four types:
· Structured data has a proper schema and is stored in databases.
· Unstructured data does not have any clear format or schema, for example videos and images.
· Semi-structured data does not reside in a relational database but has some schema properties that make it easier to analyze, for example log files and JSON documents (see the small sketch after this list).
· Quasi-structured data may contain inconsistencies in values and formats, for example an interrupted download.
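To make the semi-structured case concrete, here is a minimal Python sketch that parses one JSON-formatted log record; the field names are purely illustrative.

```python
import json

# One semi-structured log record: no fixed relational schema,
# but the JSON keys still give it enough structure to query easily.
record = '{"ts": "2021-05-01T10:00:00", "level": "ERROR", "host": "node-7", "msg": "disk full"}'

event = json.loads(record)
print(event["level"], event["host"], event["msg"])
```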

Hadoop: A good solution


· It stores huge volumes of data efficiently.
· It is horizontally scalable, so it keeps up as data grows.
· It is cost effective.
· Programmers as well as non-programmers find it easy to build Hadoop applications.

What is Hadoop?
Hadoop is an open-source, Java-based framework that stores and processes large data sets in a distributed computing environment. It is modeled on the Google File System. Hadoop was developed by Doug Cutting, who was working at Yahoo. Yahoo later handed Hadoop over to the Apache Software Foundation, and since then Apache Hadoop has been available as a top-level open-source Apache project.

Let's discuss the components available in the Hadoop ecosystem.
  • Hadoop Distributed File System (HDFS):

HDFS is the storage layer in Hadoop. It is a distributed file system in which data is split into blocks and distributed across the cluster before it is processed. It is highly fault tolerant: data is replicated and stored in multiple locations, so in case of a server failure it retrieves the data from a different node and continues its service.
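As a minimal sketch of how HDFS is typically used, the snippet below shells out to the standard `hdfs dfs` commands from Python; the paths and file name are hypothetical and assume the Hadoop client is installed and configured.

```python
import subprocess

# Create a directory in HDFS, upload a local file into it, then list it.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/demo"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "local_sales.csv", "/user/demo/"], check=True)
subprocess.run(["hdfs", "dfs", "-ls", "/user/demo"], check=True)
```

Behind the scenes each uploaded file is split into blocks and replicated (three copies by default), which is what gives HDFS its fault tolerance.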

  • HBase:

HBase is a distributed, column-oriented database built on top of the Hadoop file system.
It is a NoSQL (non-relational) database, mainly used when you need random, real-time read/write access to your big data.
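A minimal sketch of that random read/write access, assuming the third-party happybase Python client, a running HBase Thrift server, and a hypothetical table named users with a column family info:

```python
import happybase

# Connect to the HBase Thrift server (hypothetical host).
connection = happybase.Connection("hbase-thrift-host")
table = connection.table("users")

# Write a single row, then read it back by key.
table.put(b"user#1001", {b"info:name": b"Alice", b"info:city": b"Pune"})
print(table.row(b"user#1001"))

connection.close()
```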


  • Sqoop:

Sqoop is a tool to transfer structured data into HDFS. It is designed to import data from relational databases such as Oracle into HDFS and to export data from HDFS back to relational databases.
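Sqoop is driven from the command line; the sketch below launches a hypothetical import of an Oracle table into HDFS (the connection string, credentials, table, and target directory are all placeholders).

```python
import subprocess

# Import the CUSTOMERS table from Oracle into HDFS using 4 parallel map tasks.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:oracle:thin:@db-host:1521/ORCL",
    "--username", "etl_user", "-P",            # -P prompts for the password
    "--table", "CUSTOMERS",
    "--target-dir", "/user/demo/customers",
    "--num-mappers", "4",
], check=True)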

  • Flume:

Flume is responsible for moving streaming data from different sources such as Avro, a spooling directory, netcat, or the file system into HDFS. It is ideal for moving event data from multiple systems.

  • MapReduce:

Hadoop MapReduce is the original Hadoop processing engine, primarily Java based. Data is distributed over multiple servers and then processed with the Map and Reduce programming model. Many tools such as Hive and Pig build on the MapReduce model internally at execution time.
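To show the Map and Reduce model concretely, here is a classic word-count sketch written for Hadoop Streaming, which lets you express the map and reduce phases in Python instead of Java; the script name and HDFS paths are illustrative.

```python
#!/usr/bin/env python3
# wordcount.py - run the same script as mapper and reducer, e.g.:
#   hadoop jar hadoop-streaming.jar \
#     -input /user/demo/books -output /user/demo/wordcount \
#     -mapper "python3 wordcount.py map" -reducer "python3 wordcount.py reduce" \
#     -file wordcount.py
import sys

def map_phase():
    # Emit "word<TAB>1" for every word read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reduce_phase():
    # Hadoop sorts mapper output by key, so all counts for a word arrive together.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    map_phase() if sys.argv[1] == "map" else reduce_phase()
```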
  • Apache Spark:

Spark is an open-source, in-memory, highly expressive cluster computing framework that delivers up to 100 times faster performance than the Hadoop MapReduce processing framework. Nowadays Spark applications are in high demand because of their performance across multiple domains and the framework's vast set of useful libraries. Its built-in APIs are available in Java, Scala, Python, and R.

It includes a rich set of higher-level tools such as Spark SQL (for SQL and structured data processing), MLlib (for machine learning), stream processing (Spark Streaming and Structured Streaming), and GraphX (for graph processing).
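A minimal PySpark sketch of the pattern described above: read semi-structured JSON from HDFS, register it as a view, and aggregate it with Spark SQL (the path and column names are hypothetical).

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session.
spark = SparkSession.builder.appName("spark-sketch").getOrCreate()

# Read JSON event data from HDFS and query it with Spark SQL.
events = spark.read.json("hdfs:///user/demo/events")
events.createOrReplaceTempView("events")
spark.sql("SELECT host, COUNT(*) AS hits FROM events GROUP BY host").show()

spark.stop()
```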

Spark Components are depicted below

  • Pig:
  • It is an open-source, high-level dataflow system built on top of HDFS. Code is written as Pig Latin scripts, which are internally converted into MapReduce jobs at execution time, so the developer is relieved from writing lengthy MapReduce code. It is best suited for ad-hoc analytical queries such as filters and joins, which are challenging to perform in MapReduce but can be done efficiently with small Pig scripts.
  • Impala:
  • It is an open-source, massively parallel processing SQL engine that runs on data stored in a Hadoop cluster and processes it using a dialect of SQL (Impala SQL). It is ideal for interactive analysis because of its high performance and very low latency compared to other SQL engines in the Hadoop ecosystem.
  • Hive:
  • Similar to Impala, it provides a SQL interface, so users can write SQL-like queries in HQL (Hive Query Language) to extract data from Hadoop; Impala is preferable for ad-hoc queries that would need more processing time in Hive. Hive is suitable for processing structured data and for extract, transform, and load (ETL) operations. Hive queries are internally converted into MapReduce code, so the user does not have to write any low-level MapReduce (see the small query sketch after this list).
  • Oozie:
  • Oozie is a workflow scheduling system that manages and schedules Hadoop jobs (Hive, Pig, Sqoop, and Spark jobs, as well as system-specific jobs such as Java programs and shell scripts) and can run them in parallel.
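Here is the Hive query sketch referenced above: it connects to a HiveServer2 instance with the third-party PyHive client and runs an HQL aggregation (host, database, table, and columns are hypothetical).

```python
from pyhive import hive

# Connect to HiveServer2 and run a simple HQL aggregation.
connection = hive.Connection(host="hive-server", port=10000, database="default")
cursor = connection.cursor()
cursor.execute("SELECT product, SUM(amount) AS total FROM sales GROUP BY product")
for product, total in cursor.fetchall():
    print(product, total)

cursor.close()
connection.close()
```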

  • Hue:
  • Hue is an acronym for Hadoop User Experience. It is an open-source, web-based Hadoop GUI through which we can perform various tasks, such as:
    1. Upload and browse data
    2. Query a table in Hive and Impala
    3. Run Spark and Pig jobs
    4. Build workflows and search data
    It has the following features:
    1. HDFS file browser
    2. Job browser/designer
    3. Hive/Pig query editor
    4. Oozie app for workflows
    5. Hadoop API access
    6. Access to the shell
    7. User admin
    8. App for Solr searches

  • Cloudera Search:

  • It lets users explore and analyze data in real time to find what's relevant and gain new insights. Users do not need any technical or programming skills to search and explore data stored in or ingested into Hadoop and HBase. It makes Apache Hadoop accessible to everyone via integrated full-text search. Compared to standalone search solutions, Cloudera Search is fully integrated into the Cloudera platform, taking advantage of the flexible, scalable, and robust storage and data processing frameworks included in CDH. This eliminates the need to move large data sets across infrastructures to perform business tasks.

    How do these components work together?
    These components work together to process Big Data.
    • Sqoop and Flume are used for data ingestion: Sqoop transfers structured data into Hadoop from sources such as relational database systems or local files, while Flume transfers event data.
    • Spark and MapReduce perform data processing on information stored in HDFS and in the NoSQL distributed database HBase.
    • Pig, Hive, and Impala are used to analyze the data; Hive and Pig convert their queries into Map and Reduce jobs internally. Hive is more suitable for structured data and Impala for ad-hoc queries.
    • After the data is processed and analyzed, it is accessible to users through Hue and Cloudera Search. Hue is a web interface for exploring data.


    This covers the basics of the Hadoop components and how they work together.

    For more please visit my blogs.


    See you in my next blog!!


