Big Data Basics

NoSQL:
  1. Data is highly unstructured.
  2. Doesn't follow the stringent structure of an RDBMS, which enables speed and agility.
  3. DBs are distributed: data can be spread across multiple nodes and servers.
  4. Allows horizontal scaling: as the data grows, add more nodes without impacting performance.
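The horizontal-scaling idea in point 4 can be sketched with a toy hash-based sharding scheme (node names are hypothetical; real NoSQL stores typically use consistent hashing so that adding a node moves only a fraction of the keys, rather than the naive modulo used here):

```python
import hashlib

def shard_for(key: str, nodes: list) -> str:
    """Pick a node for a key by hashing it -- a toy stand-in for
    the partitioning a distributed NoSQL store does internally."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

nodes = ["node1", "node2", "node3"]  # hypothetical cluster
placement = {k: shard_for(k, nodes) for k in ["user:1", "user:2", "user:3"]}

# Scaling out: adding a node changes the modulus, so some keys
# get re-placed across the now-larger cluster.
more_nodes = nodes + ["node4"]
```

Because the modulo changes when a node is added, this naive scheme reshuffles many keys; that is exactly the problem consistent hashing solves in production systems.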
Big Data: 
  1. Big Data refers to large collections of data (structured, unstructured, or semi-structured) that grow so quickly that they are difficult to manage with regular database or statistical tools.
  2. HDFS does not offer native support for security and authentication.
  3. A Hadoop cluster is made up of nodes (individual machines).

Hadoop v/s conventional DB:

Hadoop:
  1. Data is distributed across many nodes, and processing happens on the nodes where the data lives.
  2. Write once, read many: once you write the data, you can delete it but can't modify it.
  3. Archival data: e.g., telephone call records or transaction data.
  4. Core Hadoop doesn't support SQL (tools like Hive add a SQL-like layer on top).
  5. It is an ecosystem of tools, technologies, and platforms.
  6. Runs on commodity hardware and uses commodity software.
  7. Supports HBase, a distributed NoSQL DB.

Conventional DB:
  1. Conceptually, all data sits on one server/database.
  2. Data can be modified.
  3. Supports SQL.


Hadoop layers:
  1. Bottom layer/Layer 1: Commodity Cluster Hardware
  2. Middle layer/Hadoop Layer/Layer 2: MapReduce, HDFS
  3. Top layer/Tools layer/Layer 3: RHadoop, Mahout, Hive, Pig, HBase, Sqoop
  4. RHadoop: supports the statistical language R.
  5. Mahout: machine learning.
  6. Hive/Pig: high-level query and dataflow languages (SQL-like HQL and Pig Latin).
  7. Sqoop: Getting data into and out of the Hadoop file system
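The MapReduce model in Layer 2 can be illustrated with a tiny in-process word count — a single-machine sketch of the map, shuffle, and reduce phases that Hadoop runs distributed across the cluster:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit (word, 1) pairs, as a Hadoop mapper would."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Shuffle/sort: group all values by key before reducing."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big cluster", "data cluster data"]
counts = reduce_phase(shuffle(map_phase(lines)))
# counts == {"big": 2, "data": 3, "cluster": 2}
```

In Hadoop, each phase runs in parallel on many nodes and the shuffle moves data over the network; the logic per phase is the same shape as above.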

Advantages:
1. Scalable
2. Cost-effective for processing large volumes of data.

=== Hive ===

  1. It provides a SQL-like interface: users write queries in HQL (Hive Query Language) to extract data from Hadoop.
  2. These HQL queries are converted into MapReduce jobs, which in turn communicate with HDFS.
  3. A great platform for writing SQL-style queries that interact with HDFS.
  4. Not an RDBMS; not suited for OLTP or for real-time updates and queries.
  5. Nice features:
    1. Supports different file formats: SequenceFile, text, Avro, ORC, RCFile.
    2. Metadata is stored in an RDBMS (the metastore).
    3. Provides many compression techniques.
    4. SQL queries are converted into MapReduce, Tez, or Spark jobs.
    5. UDFs, including custom MapReduce scripts, can be plugged in.
    6. Specialized joins help improve query performance.
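Point 5 above (plugging custom scripts into Hive) is typically done through Hive's TRANSFORM clause, which streams rows to an external script as tab-separated text over stdin/stdout. A minimal sketch of such a script — the two-column layout is a made-up example:

```python
import sys

def transform(line: str) -> str:
    """Uppercase the second column of a tab-separated row --
    the kind of per-row logic a Hive TRANSFORM script performs."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) >= 2:
        cols[1] = cols[1].upper()
    return "\t".join(cols)

if __name__ == "__main__":
    for line in sys.stdin:  # Hive streams table rows here
        print(transform(line))
```

In HQL this would be invoked along the lines of `SELECT TRANSFORM(id, name) USING 'python script.py' AS (id, name) FROM t;` (table and column names hypothetical).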


Hive v/s RDBMS:
Hive:

  1. Enforces schema on read, not on write, so you can write any kind of data; the schema is applied only when you read it.
  2. Supports storage of around 100 PB of data.
  3. Doesn't support OLTP.
RDBMS:

  1. Schema on write: won't let you insert any data that is out of schema.
  2. Allows storage of around 10 PB of data.
  3. Supports OLTP.
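The schema-on-read vs. schema-on-write contrast above can be sketched in plain Python — a toy illustration with a made-up two-column schema; real engines enforce far richer rules:

```python
SCHEMA = {"id": int, "name": str}

def write_checked(store, row):
    """Schema on write (RDBMS-style): reject bad rows at insert time."""
    for col, typ in SCHEMA.items():
        if not isinstance(row.get(col), typ):
            raise ValueError(f"row violates schema on column {col!r}")
    store.append(row)

def read_with_schema(raw_lines):
    """Schema on read (Hive-style): anything can be written as raw text;
    the schema is applied only when reading, so malformed rows surface
    at query time, not at load time."""
    for line in raw_lines:
        ident, name = line.split(",", 1)
        yield {"id": int(ident), "name": name}

store = []
write_checked(store, {"id": 1, "name": "alice"})  # validated at insert
raw = ["2,bob", "3,carol"]                        # loaded without checks
rows = list(read_with_schema(raw))                # schema applied here
```

This is why Hive can ingest arbitrary files cheaply, while an RDBMS pays the validation cost up front and guarantees every stored row fits the schema.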
Impala:

  1. It does not use MapReduce.
  2. It is a Massively Parallel Processing (MPP) engine on top of Hadoop for querying and analyzing data sets.
  3. Utilizes the Hive metastore to store table structure.
  4. With external tables, the data resides in the Hadoop file system while the structure resides in the metastore.
  5. Popular with data scientists and analysts.


Hive v/s Pig v/s Spark
Hive:

  1. Gives non-programmers the ability to query and analyze data stored in Hadoop.
  2. Abstraction layer on top of Hadoop
  3. Batch oriented framework
  4. Useful for structured data. 
  5. Users can use a SQL-like interface to interact with the backend Hadoop platform. 
  6. Supports:
    1. Batch query processing: For huge datasets. 
    2. Interactive query processing: For real time data processing. 
  7. Hive queries get converted into MapReduce jobs. 
  8. Predefined functions or UDFs (user-defined functions) can be used to perform certain actions. 
  9. In hive: 
    1. select * creates a simple fetch task, not a MapReduce job.
    2. Aggregation functions like min, max, etc. create a MapReduce job.
Pig:
  1. Requires some programming knowledge to query and extract the data. 
  2. Abstraction layer on top of Hadoop
  3. Batch oriented framework. 
  4. Useful for structured, semi-structured, and unstructured data. 
  5. Pig has 2 parts: Pig Latin (the language) and the Pig runtime. 
    1. The Pig runtime converts Pig Latin scripts into MapReduce jobs.
  6. Popular among data engineers.
Spark: 
  1. In-memory processing; you need programming knowledge (Scala, Java, or Python) to utilize Spark. 
  2. Faster, but lower level since it requires coding knowledge.
  3. Useful for structured, semi-structured, and unstructured data. 
Decision making between Hive, Pig, Spark
  1. If you have unstructured data then go with Pig or Spark. 
  2. If you have structured data then go with Hive and load the data into Hive. 
  3. If you want faster processing go with Spark. 
  4. If you are fine with waiting a few hours, go with Pig or Hive. 
  5. If you have programming expertise, prefer Spark, then Pig, then Hive. 
Few notes:
  1. Solr: a full-text search tool (comparable to Elasticsearch); searches for words within documents. 
  2. Sqoop is used to import data into Hadoop. Pig is used to process that data. 
  3. HBase: a column-family NoSQL DB.
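HBase's column-family model in point 3 can be pictured as nested maps: row key → column family → column qualifier → value. A toy sketch (table, family, and qualifier names below are hypothetical):

```python
# row key -> column family -> qualifier -> value
users = {
    "row1": {
        "info":  {"name": "alice", "city": "pune"},
        "stats": {"logins": "42"},   # HBase stores raw bytes; strings here
    },
    "row2": {
        "info": {"name": "bob"},     # sparse: absent columns cost nothing
    },
}

def get_cell(table, row, family, qualifier):
    """Random read by (row, family, qualifier) -- HBase's basic access path."""
    return table.get(row, {}).get(family, {}).get(qualifier)
```

Rows are sparse by design: each row stores only the columns it actually has, which is what makes the column-family model a good fit for wide, irregular data.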
