To get started in the Big Data world, you must first:
1. Brush up your knowledge of Core Java and SQL.
2. Be aware of basic Unix commands.
Now let's move into the Big Data world.
What is Big Data?
Nowadays, with the rise of social media, e-commerce, stock markets, online business, etc., data is growing exponentially, and traditional data systems are unable to store or process this huge volume of data. Such data is called Big Data.
Data is considered Big Data if it possesses the following characteristics:
·Volume: The data is huge in size and cannot fit into a traditional system.
·Velocity: The data is growing at a very fast rate; users across the globe upload and send huge amounts of data every minute of every day.
·Variety: Data is no longer always stored in rows and columns. Data from log files, CCTV footage, etc. is considered unstructured.
·Veracity: This refers to uncertainty about the quality and accuracy of the data; since the data is huge in size, it may also contain wrong data.
·Value: Big Data is of no consequence unless it is analyzed and processed correctly.
Big Data is of four types:
·Structured data has a proper schema; it is saved in databases.
·Unstructured data does not have any clear format or schema, e.g., videos and images.
·Semi-structured data does not reside in a relational database but has some schema properties that make it easier to analyze, e.g., log files and JSON documents.
·Quasi-structured data may contain inconsistencies in data values and formats, such as data from an interrupted download.
Hadoop: A good solution
·It stores huge volumes of data efficiently.
·It is horizontally scalable, so it can keep up as data grows very fast.
·It is cost effective.
·Programmers as well as non-programmers find it easy to build Hadoop applications.
What is Hadoop?
Hadoop is an open-source, Java-based programming framework that stores and processes large data sets in a distributed computing environment. Its storage layer is modeled on the Google File System. Hadoop was developed by Doug Cutting while he was working at Yahoo; Yahoo later handed Hadoop over to the Apache Software Foundation, and since then Apache Hadoop has been available as a top-level open-source Apache project.
Let's discuss the components available in the Hadoop ecosystem.
- Hadoop Distributed File System (HDFS):
HDFS is the storage layer in Hadoop. It is a distributed file system in which data is split up, distributed across the cluster, and then processed. It is highly fault tolerant: data is copied and stored in multiple locations, and in case of any server failure it retrieves the data from a different node to continue its service.
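As a minimal sketch of how a program talks to HDFS, the snippet below uses the Hadoop FileSystem Java API to copy a local file into HDFS and list the target directory; the file paths are illustrative placeholders, not anything from a real cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml / hdfs-site.xml from the classpath to locate the NameNode
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS (paths are placeholders)
        fs.copyFromLocalFile(new Path("/tmp/sales.csv"), new Path("/user/demo/sales.csv"));

        // List what is stored under the target directory
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}
```

Under the hood, HDFS splits each file into large blocks and keeps several replicas of every block on different nodes, which is what provides the fault tolerance described above.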
- HBase:
HBase is a distributed columnar database built on top of the Hadoop file system. It is a NoSQL (non-relational) database, mainly used when you need random, real-time read/write access to your big data.
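To illustrate that random, real-time read/write access, here is a small sketch using the HBase Java client; the "users" table, its "info" column family and the row key are assumptions made for the example.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWriteExample {
    public static void main(String[] args) throws Exception {
        // Connection settings come from hbase-site.xml on the classpath
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) { // "users" table assumed to exist

            // Write one cell: row key, column family, qualifier, value
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Chennai"));
            table.put(put);

            // Random, real-time read of the same row
            Result row = table.get(new Get(Bytes.toBytes("user1")));
            System.out.println(Bytes.toString(row.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"))));
        }
    }
}
```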
- Sqoop:
Sqoop is a tool for transferring structured data into HDFS. It is designed to import data from relational databases, such as Oracle, into HDFS and to export data from HDFS back to relational databases.
- Flume:
Flume is responsible for moving streaming data from different sources (Avro, spooling directory, netcat, the file system, etc.) into HDFS. It is ideally suited for moving event data from multiple systems.
- Hadoop MapReduce:
Hadoop MapReduce is the original Hadoop processing engine and is primarily Java based. Data is distributed over multiple servers and processed in parallel using the Map and Reduce programming model. Many tools, such as Hive and Pig, are built on the MapReduce model internally and convert their work into MapReduce jobs at execution time.
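To make the Map and Reduce model concrete, here is the classic word-count example written against the Hadoop MapReduce Java API (the small driver class that configures and submits the job is omitted for brevity).

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: runs on every block of the input and emits (word, 1) pairs
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: receives all counts for one word and sums them
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```

A driver class then sets the input/output paths, mapper and reducer, and the packaged jar is submitted to the cluster with the hadoop jar command.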
- Apache Spark:
Spark is an open-source, in-memory cluster computing framework that provides up to 100 times faster performance than the Hadoop MapReduce processing framework. Nowadays Spark applications are in high demand because of their high performance across multiple domains and the vast set of useful libraries. Its built-in APIs are available in Java, Scala, Python and R.
It includes a rich set of higher-level tools such as Spark SQL (for SQL and structured data processing), MLlib (for machine learning), stream processing (Spark Streaming and Structured Streaming) and GraphX (for graph processing).
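As a small illustration of the Spark API (here Spark SQL through the Java bindings), the sketch below reads a JSON file from HDFS and aggregates it; the input path and the "country" column are assumptions made for the example.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkOrdersByCountry {
    public static void main(String[] args) {
        // Spark keeps intermediate data in memory, which is where most of its speed-up over MapReduce comes from
        SparkSession spark = SparkSession.builder()
                .appName("OrdersByCountry")
                .getOrCreate();

        // Input path and schema (a "country" column) are illustrative placeholders
        Dataset<Row> orders = spark.read().json("hdfs:///user/demo/orders.json");
        orders.groupBy("country").count().orderBy("count").show();

        spark.stop();
    }
}
```

Packaged into a jar, a program like this is normally launched on the cluster with spark-submit.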
The other main components of the ecosystem are Pig, Impala, Hive, Oozie, Hue and Cloudera Search; each is described in its own section below. Hue in particular bundles several tools into one web interface:
- HDFS file browser
- Job browser/designer
- Hive/Pig query editor
- Oozie app for workflows
- Hadoop API access
- Access to the shell
- User admin
- App for Solr searches
Before going into the details, here is how the pieces fit together:
- Sqoop and Flume are used for data ingestion: Sqoop transfers structured data into Hadoop from sources such as relational database systems or local files, and Flume transfers event data.
- Spark and MapReduce perform data processing on information stored in HDFS and in the NoSQL distributed database HBase.
- Pig, Hive and Impala are used to analyze data, converting the work into Map and Reduce jobs internally; Hive is more suitable for structured data, and Impala for ad-hoc queries.
- After the data is processed and analyzed, it is accessible to users through Hue and Cloudera Search. Hue is a web interface for exploring data.
- Pig:
Pig is an open-source, high-level dataflow system built on top of HDFS. Code is written as Pig Latin scripts, which Pig internally converts into MapReduce code at execution time, relieving the developer from writing lengthy MapReduce programs. It is well suited for ad-hoc analytical operations such as filter and join, which are challenging to perform in MapReduce but can be done efficiently with small Pig scripts.
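The filter-and-store example below is a sketch of Pig Latin embedded in Java through Pig's PigServer API; the input path, field names and threshold are assumptions made for the example.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigFilterExample {
    public static void main(String[] args) throws Exception {
        // Run Pig Latin from Java; Pig compiles the statements into MapReduce jobs when the result is stored
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        pig.registerQuery("orders = LOAD '/user/demo/orders.csv' USING PigStorage(',') "
                + "AS (id:int, customer:chararray, amount:double);");
        pig.registerQuery("big_orders = FILTER orders BY amount > 1000.0;");

        // Triggers the actual execution and writes the result back to HDFS
        pig.store("big_orders", "/user/demo/big_orders");
    }
}
```

The same Pig Latin statements can also be run interactively from the grunt shell or saved as a .pig script.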
- Impala:
Impala is an open-source, massively parallel processing SQL engine that runs on data stored in a Hadoop cluster and processes it using a dialect of SQL (Impala SQL). It is ideally used for interactive analysis because of its high performance and very low latency compared to other SQL engines in the Hadoop ecosystem.
- Hive:
Similar to Impala, Hive also provides a SQL interface, so users can write SQL-like queries in HQL (Hive Query Language) to extract data from Hadoop; Impala, however, is preferable for ad-hoc queries that would need more processing time in Hive. Hive is suitable for processing structured data and for extract, transform and load (ETL) operations. Hive executes queries by internally converting them into MapReduce code, so the user does not have to write any low-level MapReduce code.
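As a sketch of how a program would run an HQL query, the example below goes through the HiveServer2 JDBC driver; the URL, user, and the "orders" table are assumptions, and it assumes the Hive JDBC driver is on the classpath. Impala can be queried in the same JDBC style through its own endpoint.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL; host, database, user and the "orders" table are placeholders
        String url = "jdbc:hive2://localhost:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            // HQL looks like SQL; Hive translates it into jobs on the cluster internally
            ResultSet rs = stmt.executeQuery(
                    "SELECT country, COUNT(*) AS cnt FROM orders GROUP BY country");

            while (rs.next()) {
                System.out.println(rs.getString("country") + " : " + rs.getLong("cnt"));
            }
        }
    }
}
```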
- Oozie:
Oozie is a workflow scheduling system that manages Hadoop jobs (such as Hive, Pig, Sqoop and Spark jobs, as well as system-specific jobs like Java and shell actions), running them in parallel and on a schedule.
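Workflows themselves are defined in an XML file stored in HDFS; the sketch below uses the Oozie Java client to submit such a workflow, with the server URL, application path and cluster addresses as placeholder assumptions.

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;

public class OozieSubmitExample {
    public static void main(String[] args) throws Exception {
        // URL of the Oozie server (placeholder)
        OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

        // Workflow application path in HDFS plus the properties the workflow.xml expects (all placeholders)
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://localhost:8020/user/demo/apps/daily-etl");
        conf.setProperty("nameNode", "hdfs://localhost:8020");
        conf.setProperty("jobTracker", "localhost:8032");

        // Submit and start the workflow; Oozie then coordinates the Hive/Pig/Sqoop/Spark actions inside it
        String jobId = oozie.run(conf);
        System.out.println("Submitted workflow: " + jobId);
    }
}
```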
- Hue:
Hue is an acronym for Hadoop User Experience. It is an open-source, web-based Hadoop GUI through which we can perform various functions, such as:
1. Upload and browse data
2. Query a table in Hive and Impala
3. Run Spark and Pig jobs
4. Build workflows and search data
- Cloudera Search:
It has the following features:
·It lets users explore and analyze data in real time to find what is relevant and gain new insights.
·Users do not need any technical or programming skills to search and explore data stored in, or ingested into, Hadoop and HBase; it makes Apache Hadoop accessible to everyone via integrated full-text search.
·Compared to standalone search solutions, Cloudera Search is fully integrated into the Cloudera platform, taking advantage of the flexible, scalable and robust storage and data-processing frameworks included in CDH. This eliminates the need to move large data sets across infrastructures to perform business tasks.
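Cloudera Search is built on Apache Solr, so an application can also query an indexed collection programmatically with the SolrJ client, as in the sketch below; the Solr URL, the "logs" collection and its fields are assumptions made for the example.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SearchExample {
    public static void main(String[] args) throws Exception {
        // Solr endpoint and the "logs" collection are placeholders
        try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/logs").build()) {

            // Full-text query over the indexed documents
            SolrQuery query = new SolrQuery("level:ERROR");
            query.setRows(10);

            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("message"));
            }
        }
    }
}
```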
How do these components work together?
As outlined in the flow above, these components work together to ingest, store, process, analyze and expose Big Data.
This is all about the basic understanding of the Hadoop components and how they work together.
For more, please visit my blogs.
See you in my next blog!!