ITechShree-Data-Analytics-Technologies

A brief introduction to Apache Hadoop: dive into the Big Data world and start your Big Data journey




To get started in the Big Data world you should:
 1. Brush up your knowledge of Core Java and SQL.

 2. Be familiar with basic Unix commands.


Now let's move into the Big Data world.


What is Big Data?

Nowadays, with the rise of social media, e-commerce, stock markets, and other businesses, data is growing exponentially, and traditional data systems are unable to store or process this huge volume of data. Such data is called Big Data.


Data is considered Big Data if it possesses the following characteristics.

· Volume: The data is huge in size and cannot fit into a traditional system.
· Velocity: The data grows at a very fast rate; users across the globe upload and send huge amounts of data every minute.
· Variety: Data is no longer stored only in rows and columns; data from log files, CCTV footage, etc. is considered unstructured.
· Veracity: This refers to the uncertainty and trustworthiness of the data; because the volume is so huge, it may also contain incorrect or inconsistent records.
· Value: Big Data is of no consequence unless it is analyzed and processed correctly.
Big Data is of four types:
· Structured data has a proper schema and is stored in databases.
· Unstructured data does not have any clear format or schema, for example videos and images.
· Semi-structured data does not reside in a relational database but has some schema properties that make it easier to analyze, for example log files and JSON documents (see the small sketch after this list).
· Quasi-structured data may contain inconsistencies in values and formats, for example an interrupted download.
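To make the semi-structured case concrete, here is a minimal Python sketch that parses one JSON-formatted log record; the field names are purely illustrative.

```python
import json

# One semi-structured log record: no fixed relational schema,
# but the JSON keys still give it enough structure to query easily.
record = '{"ts": "2021-05-01T10:00:00", "level": "ERROR", "host": "node-7", "msg": "disk full"}'

event = json.loads(record)
print(event["level"], event["host"], event["msg"])
```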

Hadoop: A good solution


· It stores huge volumes of data efficiently.
· It is horizontally scalable, so it keeps up as data grows.
· It is cost effective.
· Programmers as well as non-programmers find it easy to build Hadoop applications.

What is Hadoop?
Hadoop is an open-source, Java-based framework that stores and processes large data sets in a distributed computing environment. It is modeled on the Google File System. Hadoop was developed by Doug Cutting, who was working at Yahoo. Yahoo later handed Hadoop over to the Apache Software Foundation, and since then Apache Hadoop has been available as a top-level open-source Apache project.

Let's discuss the components available in the Hadoop ecosystem.
  • Hadoop Distributed File System (HDFS):

HDFS is the storage layer in Hadoop. It is a distributed file system in which data is split into blocks and distributed across the cluster before it is processed. It is highly fault tolerant: data is replicated and stored in multiple locations, so in case of a server failure it retrieves the data from a different node and continues its service.
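As a minimal sketch of how HDFS is typically used, the snippet below shells out to the standard `hdfs dfs` commands from Python; the paths and file name are hypothetical and assume the Hadoop client is installed and configured.

```python
import subprocess

# Create a directory in HDFS, upload a local file into it, then list it.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/demo"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "local_sales.csv", "/user/demo/"], check=True)
subprocess.run(["hdfs", "dfs", "-ls", "/user/demo"], check=True)
```

Behind the scenes each uploaded file is split into blocks and replicated (three copies by default), which is what gives HDFS its fault tolerance.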

  • HBase:

HBase is a distributed, column-oriented database built on top of the Hadoop file system.
It is a NoSQL (non-relational) database, mainly used when you need random, real-time read/write access to your big data.
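A minimal sketch of that random read/write access, assuming the third-party happybase Python client, a running HBase Thrift server, and a hypothetical table named users with a column family info:

```python
import happybase

# Connect to the HBase Thrift server (hypothetical host).
connection = happybase.Connection("hbase-thrift-host")
table = connection.table("users")

# Write a single row, then read it back by key.
table.put(b"user#1001", {b"info:name": b"Alice", b"info:city": b"Pune"})
print(table.row(b"user#1001"))

connection.close()
```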


  • Sqoop:

Sqoop is a tool to transfer structured data into HDFS. It is designed to import data from relational databases such as Oracle into HDFS and to export data from HDFS back to relational databases.
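Sqoop is driven from the command line; the sketch below launches a hypothetical import of an Oracle table into HDFS (the connection string, credentials, table, and target directory are all placeholders).

```python
import subprocess

# Import the CUSTOMERS table from Oracle into HDFS using 4 parallel map tasks.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:oracle:thin:@db-host:1521/ORCL",
    "--username", "etl_user", "-P",            # -P prompts for the password
    "--table", "CUSTOMERS",
    "--target-dir", "/user/demo/customers",
    "--num-mappers", "4",
], check=True)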

  • Flume:

Flume is responsible for moving streaming data from different sources such as Avro, a spooling directory, netcat, or the file system into HDFS. It is ideal for moving event data from multiple systems.

  • MapReduce:

Hadoop MapReduce is the original Hadoop processing engine, primarily Java based. Data is distributed over multiple servers and then processed with the Map and Reduce programming model. Many tools such as Hive and Pig build on the MapReduce model internally at execution time.
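To show the Map and Reduce model concretely, here is a classic word-count sketch written for Hadoop Streaming, which lets you express the map and reduce phases in Python instead of Java; the script name and HDFS paths are illustrative.

```python
#!/usr/bin/env python3
# wordcount.py - run the same script as mapper and reducer, e.g.:
#   hadoop jar hadoop-streaming.jar \
#     -input /user/demo/books -output /user/demo/wordcount \
#     -mapper "python3 wordcount.py map" -reducer "python3 wordcount.py reduce" \
#     -file wordcount.py
import sys

def map_phase():
    # Emit "word<TAB>1" for every word read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reduce_phase():
    # Hadoop sorts mapper output by key, so all counts for a word arrive together.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    map_phase() if sys.argv[1] == "map" else reduce_phase()
```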
  • Apache Spark:

Spark is an open-source, in-memory, highly expressive cluster computing framework that delivers up to 100 times faster performance than the Hadoop MapReduce processing framework. Nowadays Spark applications are in high demand because of their performance across multiple domains and the framework's vast set of useful libraries. Its built-in APIs are available in Java, Scala, Python, and R.

It includes a rich set of higher-level tools such as Spark SQL (for SQL and structured data processing), MLlib (for machine learning), stream processing (Spark Streaming and Structured Streaming), and GraphX (for graph processing).
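A minimal PySpark sketch of the pattern described above: read semi-structured JSON from HDFS, register it as a view, and aggregate it with Spark SQL (the path and column names are hypothetical).

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session.
spark = SparkSession.builder.appName("spark-sketch").getOrCreate()

# Read JSON event data from HDFS and query it with Spark SQL.
events = spark.read.json("hdfs:///user/demo/events")
events.createOrReplaceTempView("events")
spark.sql("SELECT host, COUNT(*) AS hits FROM events GROUP BY host").show()

spark.stop()
```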

Spark Components are depicted below

  • Pig:
  • It is an open-source, high-level dataflow system built on top of HDFS. Code is written as Pig Latin scripts, which are internally converted into MapReduce jobs at execution time, so the developer is relieved from writing lengthy MapReduce code. It is best suited for ad-hoc analytical queries such as filters and joins, which are challenging to perform in MapReduce but can be done efficiently with small Pig scripts.
  • Impala:
  • It is an open-source, massively parallel processing SQL engine that runs on data stored in a Hadoop cluster and processes it using a dialect of SQL (Impala SQL). It is ideal for interactive analysis because of its high performance and very low latency compared to other SQL engines in the Hadoop ecosystem.
  • Hive:
  • Similar to Impala, it provides a SQL interface, so users can write SQL-like queries in HQL (Hive Query Language) to extract data from Hadoop; Impala is preferable for ad-hoc queries that would need more processing time in Hive. Hive is suitable for processing structured data and for extract, transform, and load (ETL) operations. Hive queries are internally converted into MapReduce code, so the user does not have to write any low-level MapReduce (see the small query sketch after this list).
  • Oozie:
  • Oozie is a workflow scheduling system that manages and schedules Hadoop jobs (Hive, Pig, Sqoop, and Spark jobs, as well as system-specific jobs such as Java programs and shell scripts) and can run them in parallel.
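Here is the Hive query sketch referenced above: it connects to a HiveServer2 instance with the third-party PyHive client and runs an HQL aggregation (host, database, table, and columns are hypothetical).

```python
from pyhive import hive

# Connect to HiveServer2 and run a simple HQL aggregation.
connection = hive.Connection(host="hive-server", port=10000, database="default")
cursor = connection.cursor()
cursor.execute("SELECT product, SUM(amount) AS total FROM sales GROUP BY product")
for product, total in cursor.fetchall():
    print(product, total)

cursor.close()
connection.close()
```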

  • Hue:
  • Hue is an acronym for Hadoop User Experience. It is an open-source, web-based Hadoop GUI through which we can perform various tasks, such as:
    1. Upload and browse data
    2. Query a table in Hive and Impala
    3. Run Spark and Pig jobs
    4. Build workflows and search data
    It has the following features:
    1. HDFS file browser
    2. Job browser/designer
    3. Hive/Pig query editor
    4. Oozie app for workflows
    5. Hadoop API access
    6. Access to the shell
    7. User admin
    8. App for Solr searches

  • Cloudera Search:

  • It lets users explore and analyze data in real time to find what's relevant and gain new insights. Users do not need any technical or programming skills to search and explore data stored in or ingested into Hadoop and HBase. It makes Apache Hadoop accessible to everyone via integrated full-text search. Compared to standalone search solutions, Cloudera Search is fully integrated into the Cloudera platform, taking advantage of the flexible, scalable, and robust storage and data processing frameworks included in CDH. This eliminates the need to move large data sets across infrastructures to perform business tasks.

    How do these components work together?
    These components work together to process Big Data.
    • Sqoop and Flume are used for data ingestion: Sqoop transfers structured data into Hadoop from sources such as relational database systems or local files, while Flume transfers event data.
    • Spark and MapReduce perform data processing on information stored in HDFS and in the NoSQL distributed database HBase.
    • Pig, Hive, and Impala are used to analyze the data; Hive and Pig convert their queries into Map and Reduce jobs internally. Hive is more suitable for structured data and Impala for ad-hoc queries.
    • After the data is processed and analyzed, it is accessible to users through Hue and Cloudera Search. Hue is a web interface for exploring data.


    This covers the basics of the Hadoop components and how they work together.

    For more please visit my blogs.


    See you in my next blog!!


