Big Data Basics

NoSQL:
  1. Data is highly unstructured.
  2. Doesn't follow the stringent structure of an RDBMS, which enables speed and agility.
  3. DBs are distributed: data can be spread across multiple nodes and servers.
  4. Allows horizontal scaling: as the data grows, add more nodes without impacting performance.
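The horizontal-scaling idea in point 4 can be sketched with a toy hash-based sharding scheme (node names are hypothetical; real NoSQL stores typically use consistent hashing so that adding a node moves only a fraction of the keys, rather than the naive modulo used here):

```python
import hashlib

def shard_for(key: str, nodes: list) -> str:
    """Pick a node for a key by hashing it -- a toy stand-in for
    the partitioning a distributed NoSQL store does internally."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

nodes = ["node1", "node2", "node3"]  # hypothetical cluster
placement = {k: shard_for(k, nodes) for k in ["user:1", "user:2", "user:3"]}

# Scaling out: adding a node changes the modulus, so some keys
# get re-placed across the now-larger cluster.
more_nodes = nodes + ["node4"]
```

Because the modulo changes when a node is added, this naive scheme reshuffles many keys; that is exactly the problem consistent hashing solves in production systems.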
Big Data: 
  1. Big Data refers to large collections of data (structured, unstructured, or semi-structured) that grow so quickly that they are difficult to manage with regular database or statistical tools.
  2. HDFS does not offer native support for security and authentication.
  3. A Hadoop cluster is made up of nodes (individual machines).

Hadoop v/s conventional DB:

Hadoop:
  1. Data is distributed across many nodes, and processing happens on the nodes where the data lives.
  2. Write once, read many: once you write the data, you can delete it but can't modify it.
  3. Archival data: e.g., telephone call records or transaction data.
  4. Core Hadoop doesn't support SQL (tools like Hive add a SQL-like layer on top).
  5. It is an ecosystem of tools, technologies, and platforms.
  6. Runs on commodity hardware and uses commodity software.
  7. Supports HBase, a distributed NoSQL DB.

Conventional DB:
  1. Conceptually, all data sits on one server/database.
  2. Data can be modified.
  3. Supports SQL.


Hadoop layers:
  1. Bottom layer/Layer 1: Commodity Cluster Hardware
  2. Middle layer/Hadoop Layer/Layer 2: MapReduce, HDFS
  3. Top layer/Tools layer/Layer 3: RHadoop, Mahout, Hive, Pig, HBase, Sqoop
  4. RHadoop: supports the statistical language R.
  5. Mahout: machine learning.
  6. Hive/Pig: high-level query and dataflow languages (SQL-like HQL and Pig Latin).
  7. Sqoop: Getting data into and out of the Hadoop file system
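The MapReduce model in Layer 2 can be illustrated with a tiny in-process word count — a single-machine sketch of the map, shuffle, and reduce phases that Hadoop runs distributed across the cluster:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit (word, 1) pairs, as a Hadoop mapper would."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Shuffle/sort: group all values by key before reducing."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big cluster", "data cluster data"]
counts = reduce_phase(shuffle(map_phase(lines)))
# counts == {"big": 2, "data": 3, "cluster": 2}
```

In Hadoop, each phase runs in parallel on many nodes and the shuffle moves data over the network; the logic per phase is the same shape as above.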

Advantages:
1. Scalable
2. Cost-effective for processing large volumes of data.

=== Hive ===

  1. It provides a SQL-like interface: users write queries in HQL (Hive Query Language) to extract data from Hadoop.
  2. These HQL queries are converted into MapReduce jobs, which in turn communicate with HDFS.
  3. A great platform for writing SQL-style queries that interact with HDFS.
  4. Not an RDBMS; not suited for OLTP or for real-time updates and queries.
  5. Nice features:
    1. Supports different file formats: SequenceFile, text, Avro, ORC, RCFile.
    2. Metadata is stored in an RDBMS (the metastore).
    3. Provides many compression techniques.
    4. SQL queries are converted into MapReduce, Tez, or Spark jobs.
    5. UDFs, including custom MapReduce scripts, can be plugged in.
    6. Specialized joins help improve query performance.
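Point 5 above (plugging custom scripts into Hive) is typically done through Hive's TRANSFORM clause, which streams rows to an external script as tab-separated text over stdin/stdout. A minimal sketch of such a script — the two-column layout is a made-up example:

```python
import sys

def transform(line: str) -> str:
    """Uppercase the second column of a tab-separated row --
    the kind of per-row logic a Hive TRANSFORM script performs."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) >= 2:
        cols[1] = cols[1].upper()
    return "\t".join(cols)

if __name__ == "__main__":
    for line in sys.stdin:  # Hive streams table rows here
        print(transform(line))
```

In HQL this would be invoked along the lines of `SELECT TRANSFORM(id, name) USING 'python script.py' AS (id, name) FROM t;` (table and column names hypothetical).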


Hive v/s RDBMS:
Hive:

  1. Enforces schema on read, not on write, so you can write any kind of data; the schema is applied only when you read it.
  2. Supports storage of around 100 PB of data.
  3. Doesn't support OLTP.
RDBMS:

  1. Schema on write: won't let you insert any data that is out of schema.
  2. Allows storage of around 10 PB of data.
  3. Supports OLTP.
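The schema-on-read vs. schema-on-write contrast above can be sketched in plain Python — a toy illustration with a made-up two-column schema; real engines enforce far richer rules:

```python
SCHEMA = {"id": int, "name": str}

def write_checked(store, row):
    """Schema on write (RDBMS-style): reject bad rows at insert time."""
    for col, typ in SCHEMA.items():
        if not isinstance(row.get(col), typ):
            raise ValueError(f"row violates schema on column {col!r}")
    store.append(row)

def read_with_schema(raw_lines):
    """Schema on read (Hive-style): anything can be written as raw text;
    the schema is applied only when reading, so malformed rows surface
    at query time, not at load time."""
    for line in raw_lines:
        ident, name = line.split(",", 1)
        yield {"id": int(ident), "name": name}

store = []
write_checked(store, {"id": 1, "name": "alice"})  # validated at insert
raw = ["2,bob", "3,carol"]                        # loaded without checks
rows = list(read_with_schema(raw))                # schema applied here
```

This is why Hive can ingest arbitrary files cheaply, while an RDBMS pays the validation cost up front and guarantees every stored row fits the schema.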
Impala:

  1. It does not use MapReduce.
  2. It is a Massively Parallel Processing (MPP) engine on top of Hadoop for querying and analyzing data sets.
  3. Utilizes the Hive metastore to store table structure.
  4. With external tables, the data resides in the Hadoop file system while the structure resides in the metastore.
  5. Popular with data scientists and analysts.


Hive v/s Pig v/s Spark
Hive:

  1. Gives non-programmers the ability to query and analyze data stored in Hadoop.
  2. Abstraction layer on top of Hadoop
  3. Batch oriented framework
  4. Useful for structured data. 
  5. Users can use a SQL-like interface to interact with the backend Hadoop platform. 
  6. Supports:
    1. Batch query processing: For huge datasets. 
    2. Interactive query processing: For real time data processing. 
  7. Hive queries get converted into MapReduce jobs. 
  8. Predefined functions or UDFs (user-defined functions) can be used to perform certain actions. 
  9. In hive: 
    1. select * creates a simple fetch task, not a MapReduce job.
    2. Aggregation functions like min, max, etc. create a MapReduce job.
Pig:
  1. Requires some programming knowledge to query and extract the data. 
  2. Abstraction layer on top of Hadoop
  3. Batch oriented framework. 
  4. Useful for structured, semi-structured, and unstructured data. 
  5. Pig has 2 parts: Pig Latin (the language) and the Pig runtime. 
    1. The Pig runtime converts Pig Latin scripts into MapReduce jobs.
  6. Popular among data engineers.
Spark: 
  1. In-memory processing; you need programming knowledge (Scala, Java, or Python) to utilize Spark. 
  2. Faster, but lower level since it requires coding knowledge.
  3. Useful for structured, semi-structured, and unstructured data. 
Decision making between Hive, Pig, Spark
  1. If you have unstructured data then go with Pig or Spark. 
  2. If you have structured data then go with Hive and load the data into Hive. 
  3. If you want faster processing go with Spark. 
  4. If you are fine with waiting a few hours, go with Pig or Hive. 
  5. If you have programming expertise, prefer Spark, then Pig, then Hive. 
Few notes:
  1. Solr: a full-text search tool (comparable to Elasticsearch); searches for words within documents. 
  2. Sqoop is used to import data into Hadoop. Pig is used to process that data. 
  3. HBase: a column-family NoSQL DB.
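HBase's column-family model in point 3 can be pictured as nested maps: row key → column family → column qualifier → value. A toy sketch (table, family, and qualifier names below are hypothetical):

```python
# row key -> column family -> qualifier -> value
users = {
    "row1": {
        "info":  {"name": "alice", "city": "pune"},
        "stats": {"logins": "42"},   # HBase stores raw bytes; strings here
    },
    "row2": {
        "info": {"name": "bob"},     # sparse: absent columns cost nothing
    },
}

def get_cell(table, row, family, qualifier):
    """Random read by (row, family, qualifier) -- HBase's basic access path."""
    return table.get(row, {}).get(family, {}).get(qualifier)
```

Rows are sparse by design: each row stores only the columns it actually has, which is what makes the column-family model a good fit for wide, irregular data.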
