IOE Note | Elective II : BIG DATA TECHNOLOGIES [CT 765 07]

Notes of Elective II : BIG DATA TECHNOLOGIES [CT 765 07]

NoSQL

Structured and Unstructured Data

Structured Data

- Structured data is data that fits into a fixed record or file format.
- Such data can be stored in relational databases and spreadsheets.
- It is easily searchable with basic algorithms or basic queries.
- It is written in a format that is easy for machines to understand.
- It is simple to enter, store, query and analyze.
- It must be strictly defined in terms of field names and types.

- An example of structured data is:

   Employee    Age    Salary
   --------------------------
   Ram         34     30000
   Shyam       20     12000
   Hari        24     20000

Unstructured Data

- Unstructured data is data that cannot be classified and does not fit into a fixed record.
- Such data cannot be stored in relational database tables, as it does not possess any well-known structure.
- It generally allows keyword-based queries or sophisticated conceptual queries.
- It is written in a format that is easy for humans to understand.
- It is becoming increasingly valuable and available.
- It is difficult to extract, understand and analyze.
- It mostly refers to free-text data.

- Examples of unstructured data include personal messages, business documents, web content and so on.
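
The difference can be sketched in a few lines of Python. This is only an illustration: the `Employee` class reuses the table above, while the two free-text messages are made-up sample data. The structured records can be filtered on a typed field, whereas the free text only supports a keyword search.

```python
from dataclasses import dataclass


@dataclass
class Employee:
    # Fixed field names and types: every record has the same shape.
    name: str
    age: int
    salary: int


employees = [
    Employee("Ram", 34, 30000),
    Employee("Shyam", 20, 12000),
    Employee("Hari", 24, 20000),
]

# Structured data: a basic query on a typed field.
well_paid = [e.name for e in employees if e.salary >= 20000]

# Unstructured data (hypothetical messages): no fields, so only a
# keyword search over the raw text is possible.
messages = [
    "Ram asked for a salary review next month.",
    "Lunch at 1 pm? Hari is joining too.",
]
mentions_salary = [m for m in messages if "salary" in m.lower()]
```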

Scaling of Database

The scaling techniques used in traditional databases are as follows:
1. Vertical Scaling
- It is also known as scaling up.
- It is achieved by upgrading the hardware (CPU, memory, storage) of the existing machine to meet processing demands.
- It is generally confined to a single machine.
- It may hamper availability of resources during the upgrade.

2. Horizontal Scaling
- It is also known as scaling out.
- It is achieved by adding more machines that work in parallel with the existing hardware.
- It is generally configured across multiple machines.
- It does not hamper availability of resources.
- It involves database sharding and database replication.
- Database sharding refers to partitioning the data and storing the partitions on multiple machines to allow concurrent access to the database.
- Database replication refers to storing copies of the database on multiple machines to ensure availability and avoid a single point of failure.
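
Sharding and replication can be sketched together as follows. This is a minimal illustration, not how any particular database does it: the node names, the replication factor and the "hash modulo number of nodes" placement rule are all assumptions made for the example.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3"]  # hypothetical machine names
REPLICATION_FACTOR = 2  # assumed number of copies kept for each record


def shard_for(key: str) -> int:
    """Sharding: a stable hash of the key picks the primary node index."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % len(NODES)


def replicas_for(key: str) -> list[str]:
    """Replication: copies also go to the next nodes in the list."""
    first = shard_for(key)
    return [NODES[(first + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]
```

Calling `replicas_for("Ram")` always returns the same two distinct nodes, so any client can find the record without a central lookup, and losing one of the two nodes does not lose the data.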

CAP Theorem

- CAP stands for Consistency, Availability and Partition Tolerance.
- CAP theorem is used to describe the limitations of distributed databases.
- Consistency means that all replicas hold the same state of the data at any instant.
- Availability means that the system continues to operate successfully at any instant, even if some nodes crash.
- Partition tolerance means that the system continues to operate in the presence of a network partition.

- CAP theorem states that any distributed database with shared data can have at most two of the three desirable properties (Consistency, Availability and Partition Tolerance).

- So, CAP theorem can be summarized as: For any distributed database, only one of the following can hold:
1. If a database guarantees availability and partition tolerance, it must forfeit consistency. Eg: Cassandra, CouchDB and so on.
2. If a database guarantees consistency and partition tolerance, it must forfeit availability. Eg: HBase, MongoDB and so on.
3. If a database guarantees availability and consistency, it must assume that no network partition occurs. Eg: RDBMS like MySQL, Postgres and so on.
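
The first two cases can be illustrated with a toy model. The sketch below is a deliberately simplified assumption (two replicas, one link, a hypothetical `ToyStore` class), not the behaviour of any real database: when the link is cut, the "CP" store refuses the write, while the "AP" store accepts it and lets the replicas diverge.

```python
class Replica:
    def __init__(self):
        self.value = None


class ToyStore:
    """Two replicas joined by a link that can be cut (a network partition)."""

    def __init__(self, mode: str):
        self.mode = mode  # "CP" or "AP"
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, value) -> bool:
        """Write through replica a; return True if the write is accepted."""
        if self.partitioned:
            if self.mode == "CP":
                return False      # forfeits availability: the write is refused
            self.a.value = value  # forfeits consistency: replica b is now stale
            return True
        self.a.value = self.b.value = value
        return True


cp, ap = ToyStore("CP"), ToyStore("AP")
for store in (cp, ap):
    store.write("v1")
    store.partitioned = True  # simulate the network partition

cp_accepted = cp.write("v2")  # refused, both replicas still agree
ap_accepted = ap.write("v2")  # accepted, replicas now disagree
```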

Why is absolute consistency generally sacrificed?

- Large companies generally scale out horizontally, spreading data over thousands of nodes, so network partitions are unavoidable.
- Between consistency and availability, such companies can then choose only one, as per the CAP theorem.
- With so many nodes and network links, there is a high chance that some nodes fail or become unreachable.
- If the data is not available on time, it means a huge loss to the company.
- So, the company chooses availability for the sake of financial gain and client trust.
- In this sense, strict consistency is generally sacrificed.

- In order to still maintain consistency, the company follows an eventual consistency process instead of strict or absolute consistency.

Eventual Consistency

- A database is said to be eventually consistent if all the replicas gradually become consistent in the absence of new updates.
- It means that the system will become consistent eventually, not at the instant of the update.
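
A minimal sketch of this convergence is shown below. It assumes a "last write wins" rule based on timestamps and a background synchronisation step (named `anti_entropy` here); both are choices made for the illustration, and real systems use more elaborate mechanisms.

```python
class LwwReplica:
    """Replica that keeps the value with the newest timestamp for each key."""

    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def anti_entropy(replicas):
    """Every replica pushes its entries to every other replica."""
    for source in replicas:
        for target in replicas:
            if source is not target:
                for key, (timestamp, value) in list(source.data.items()):
                    target.write(key, value, timestamp)


r1, r2, r3 = LwwReplica(), LwwReplica(), LwwReplica()
r1.write("x", "old", timestamp=1)  # an early write reaches only r1
r2.write("x", "new", timestamp=2)  # a later write reaches only r2
before = [r.read("x") for r in (r1, r2, r3)]  # replicas disagree
anti_entropy([r1, r2, r3])                    # no new updates arrive
after = [r.read("x") for r in (r1, r2, r3)]   # replicas agree
```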

Tunable Consistency

- Some distributed databases that forfeit consistency for availability, like Cassandra, extend the concept of eventual consistency by offering tunable consistency for any given read or write operation.
- With tunable consistency, the client application decides how consistent the requested data should be.

Properties of Distributed Database

1. Basically Available (guarantees availability)
2. Soft-state (state of system may change over time)
3. Eventual consistency (system becomes consistent over time)

Together, these are known as the BASE properties.

Taxonomy of NoSQL Implementation

NoSQL

- NoSQL stands for Not only SQL.
- It is a class of database management system with the following properties:
1. It does not use SQL as query language.
2. It has distributed and fault tolerant architecture.
3. It is schema less.
4. It follows BASE properties rather than ACID properties.
5. Consistency is traded in favor of Availability.

RDBMS vs NoSQL

- RDBMS is a table-based database. NoSQL is a schema-less database.
- RDBMS stores data in rows and columns in the form of tables. NoSQL stores data in multiple collections and nodes.
- RDBMS is typically scaled up vertically. NoSQL can be scaled out horizontally.
- If the data does not fit in an RDBMS table, the table must be restructured. NoSQL can handle unstructured data in an efficient manner.

NoSQL Taxonomy

NoSQL databases are of four types. They are as follows:
1. Document Store
2. Graph Database
3. Key-Value Store
4. Columnar Database

Document Store

- It is used to store documents in some standard format such as JSON, PDF and so on.
- Documents stored in binary form are referred to as Binary Large Objects (BLOBs).
- The documents which are stored can be indexed.
- Eg: MongoDB, CouchDB

Graph Database

- The data are represented in the form of a graph (i.e. nodes and edges).
- A node represents a data item.
- An edge represents the relation between two nodes.
- Eg: VertexDB, Neo4j
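
A graph model can be sketched with plain Python structures. The node ids, relation labels and helper names below are hypothetical, chosen only to show that queries follow relations between data items instead of joining tables.

```python
# Each node holds data; each relation is a (source, label, target) triple.
nodes = {
    "ram": {"type": "Employee", "age": 34},
    "shyam": {"type": "Employee", "age": 20},
    "sales": {"type": "Department"},
}
relations = [
    ("ram", "WORKS_IN", "sales"),
    ("shyam", "WORKS_IN", "sales"),
    ("shyam", "REPORTS_TO", "ram"),
]


def outgoing(node_id: str, label: str) -> list[str]:
    """Targets reached from node_id by relations with the given label."""
    return [dst for src, lbl, dst in relations if src == node_id and lbl == label]


def incoming(node_id: str, label: str) -> list[str]:
    """Sources that point at node_id by relations with the given label."""
    return [src for src, lbl, dst in relations if dst == node_id and lbl == label]
```

For example, `outgoing("shyam", "REPORTS_TO")` answers "who does Shyam report to?" and `incoming("sales", "WORKS_IN")` answers "who works in sales?".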

Key-Value Store

- The data are represented as keys paired with values, which may be complex types such as lists.
- Keys are stored in a hash table which can be distributed easily.
- It supports regular CRUD operations.
- Eg: Amazon DynamoDB, Redis
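
The idea can be sketched as a hash table split into buckets, where each bucket stands in for one machine. The class and method names are assumptions made for this illustration and do not correspond to the API of DynamoDB or Redis.

```python
class KeyValueStore:
    """Toy key-value store: a hash table split into buckets."""

    def __init__(self, bucket_count: int = 4):
        # Each bucket stands in for one machine holding part of the table.
        self.buckets = [{} for _ in range(bucket_count)]

    def _bucket(self, key: str) -> dict:
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key: str, value) -> None:
        """Create or update the value stored under key."""
        self._bucket(key)[key] = value

    def get(self, key: str):
        """Read the value, or None if the key is absent."""
        return self._bucket(key).get(key)

    def delete(self, key: str) -> bool:
        """Delete the key; return True if it existed."""
        return self._bucket(key).pop(key, None) is not None


store = KeyValueStore()
store.put("cart:42", ["pen"])          # create
store.put("cart:42", ["pen", "book"])  # update, the value is a list
current = store.get("cart:42")         # read
removed = store.delete("cart:42")      # delete
```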

Columnar Database

- It is a hybrid of a relational database system and a key-value store.
- The values are stored in groups of zero or more columns.
- Values are queried by matching keys.
- Eg: HBase, Cassandra

Basic Architecture of HBase, Cassandra and MongoDB

Cassandra

- Cassandra is a distributed database used to manage large amounts of structured data spread across many nodes, possibly in data centers around the world.

- The reasons for choosing Cassandra are as follows:
1. Availability is valued over consistency.
2. High write throughput is required.
3. High scalability is required.
4. No single point of failure is acceptable.

- The key properties are as follows:
1. It is a column-oriented database.
2. It is scalable and fault tolerant, with tunable consistency.
3. It does not provide full ACID transactions; writes are atomic, isolated and durable at the level of a single partition, while consistency is eventual and tunable.

- The key components of Cassandra are as follows:
1. Node (the place where data is stored)
2. Data center (a collection of related nodes)
3. Cluster (a collection of one or more data centers)
4. Commit log (every write operation is first recorded here for crash recovery)
5. Memtable (an in-memory structure to which data is written after the commit log)
6. SSTable (a disk file to which data is flushed from the memtable when its contents reach a threshold value)
7. Bloom filter (a quick, probabilistic algorithm for testing whether an element is a member of a set)

- Cassandra is accessed through any of its nodes using Cassandra Query Language (CQL). CQL treats the database (keyspace) as a container of tables.
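
The way the commit log, the in-memory table and the SSTables work together on one node can be sketched as below. This is a simplified model under stated assumptions: the `ToyNode` class is hypothetical, the flush threshold is counted in keys, "disk files" are plain dictionaries, and the Bloom filter appears only as a comment.

```python
class ToyNode:
    """Simplified write path of one node: commit log, memtable, SSTables."""

    def __init__(self, flush_threshold: int = 2):
        self.commit_log = []  # append-only record used for crash recovery
        self.memtable = {}    # in-memory writes, newest value per key
        self.sstables = []    # immutable "disk files", oldest first
        self.flush_threshold = flush_threshold  # assumed: number of keys

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. record in the commit log
        self.memtable[key] = value            # 2. apply to the memtable
        if len(self.memtable) >= self.flush_threshold:
            self.flush()                      # 3. flush at the threshold

    def flush(self):
        # The memtable is written out sorted by key and then emptied.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest SSTable first
            # A real node would consult the SSTable's Bloom filter here
            # to skip files that cannot contain the key.
            if key in table:
                return table[key]
        return None


node = ToyNode(flush_threshold=2)
node.write("a", 1)
node.write("b", 2)  # threshold reached: memtable flushed to an SSTable
node.write("a", 3)  # newer value for "a" lives in the memtable
```

After these three writes, `node.read("a")` is served from the memtable and `node.read("b")` from the flushed SSTable.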

Data Modeling in Cassandra
- Keyspace is the outermost container for data. It consists of one or more column families.
- A column family is a container for a collection of rows.
- Each row contains ordered columns.
- A column is the basic data structure of Cassandra and has three parts: the key (column name), the value and a timestamp.
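
The nesting described above can be sketched with nested dictionaries. The helper names `put` and `ordered_columns` and the sample employee data are assumptions for the illustration; this models the containers only, not Cassandra's storage format.

```python
from typing import NamedTuple


class Column(NamedTuple):
    name: str       # key or column name
    value: object
    timestamp: int  # write time of this value


def put(keyspace: dict, family: str, row_key: str,
        name: str, value, timestamp: int) -> None:
    """Store one column: keyspace -> column family -> row -> column."""
    row = keyspace.setdefault(family, {}).setdefault(row_key, {})
    row[name] = Column(name, value, timestamp)


def ordered_columns(keyspace: dict, family: str, row_key: str) -> list:
    """Return the columns of a row ordered by column name."""
    row = keyspace[family][row_key]
    return [row[name] for name in sorted(row)]


company = {}  # the keyspace
put(company, "employee", "ram", "salary", 30000, timestamp=1)
put(company, "employee", "ram", "age", 34, timestamp=1)
column_names = [c.name for c in ordered_columns(company, "employee", "ram")]
```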

Tunable Consistency - Write Consistency Level
- It indicates the number of replicas on which write must succeed before returning acknowledgement to the client application.
- It consists of following levels:
1. ANY (must be written to at least one node)
2. ALL (must be written to the commit log and memtable on all replica nodes in the cluster for that row)
3. EACH_QUORUM (must be written to the commit log and memtable on a quorum of replica nodes in each data center)
4. LOCAL_ONE (must be sent to, and acknowledged by, at least one replica node in the local data center)
5. LOCAL_QUORUM (must be written to the commit log and memtable on a quorum of replica nodes in the same data center as the coordinator node)
6. LOCAL_SERIAL (must be written conditionally to the commit log and memtable on a quorum of replica nodes in the same data center)
7. ONE (must be written to the commit log and memtable of at least one replica node)

Tunable Consistency - Read Consistency Level
1. ALL (Returns the record with the most recent timestamp after all replicas have responded)
2. EACH_QUORUM (Returns the record with the most recent timestamp once a quorum of replicas in each data center of the cluster has responded)
3. LOCAL_SERIAL (Reads the current, possibly uncommitted, state of the data, confined to the local data center)
4. LOCAL_QUORUM (Returns the record with the most recent timestamp once a quorum of replicas in the same data center as the coordinator node has responded)
5. LOCAL_ONE (Returns a response from the closest replica, as determined by the snitch, but only if the replica is in the local data center)
6. ONE (Returns a response from the closest replica, as determined by the snitch)
7. QUORUM (Returns the record with the most recent timestamp after a quorum of replicas has responded regardless of data center)
8. TWO (Returns the most recent data from two of the closest replicas)
9. THREE (Returns the most recent data from three of the closest replicas)
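
How many replicas each level waits for, and when a read is guaranteed to overlap the latest write, can be worked out with a little arithmetic. The sketch below assumes a single data center with replication factor RF, uses the usual quorum size of floor(RF / 2) + 1, and applies the rule that a read overlaps the latest write when (write replicas + read replicas) > RF. The function names are hypothetical and are not part of any Cassandra driver.

```python
FIXED_LEVELS = {"ONE": 1, "TWO": 2, "THREE": 3}


def quorum(replication_factor: int) -> int:
    """Majority of the replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def replicas_required(level: str, replication_factor: int) -> int:
    """Number of replicas that must respond for the given level."""
    if level == "ALL":
        return replication_factor
    if level == "QUORUM":
        return quorum(replication_factor)
    return FIXED_LEVELS[level]


def read_sees_latest_write(write_level: str, read_level: str,
                           replication_factor: int) -> bool:
    """True when the read set and the write set must share a replica."""
    written = replicas_required(write_level, replication_factor)
    read = replicas_required(read_level, replication_factor)
    return written + read > replication_factor
```

With RF = 3, writing and reading at QUORUM touches 2 + 2 = 4 > 3 replicas, so the read overlaps the latest write; writing and reading at ONE touches only 1 + 1 = 2, so a stale read is possible.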

MongoDB

- MongoDB is a cross-platform, document-oriented database.
- It stores documents in a JSON-like format (BSON).
- It works on the concepts of collection and document.
- A collection is a group of MongoDB documents. It is equivalent to a table in an RDBMS, but it does not enforce a schema.
- A document is a set of key-value pairs. Each document has a dynamic schema.

- The basic considerations while designing a schema in MongoDB are as follows:
1. Design according to user requirements.
2. Combine objects into one document if they are used together.
3. Duplicate data (in a limited way) when it avoids a join on read.
4. Do joins on write, not on read.
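
Points 2 and 4 can be illustrated with plain Python dictionaries standing in for documents (no MongoDB server is involved, and the sample post and helper names are assumptions for the sketch). In the referenced design the comments live apart from the post and must be joined on every read; in the embedded design they are combined with the post when written, so a read is a single lookup.

```python
# Referenced design: post and comments are stored separately.
posts = {1: {"title": "MongoDB", "by": "Tom"}}
comments = [{"post_id": 1, "text": "We use MongoDB for unstructured data"}]


def load_post_referenced(post_id: int) -> dict:
    """Join on read: collect the comments each time the post is loaded."""
    post = dict(posts[post_id])
    post["comments"] = [c["text"] for c in comments if c["post_id"] == post_id]
    return post


# Embedded design: comments are stored inside the post document.
embedded_posts = {1: {"title": "MongoDB", "by": "Tom", "comments": []}}


def add_comment(post_id: int, text: str) -> None:
    """Join on write: the comment is placed into its post document."""
    embedded_posts[post_id]["comments"].append(text)


add_comment(1, "We use MongoDB for unstructured data")
same_result = load_post_referenced(1) == embedded_posts[1]
```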

- Example:
Create a collection named "posts" and insert the following record:
title: MongoDB, description: MongoDB is a NoSQL database, by: Tom, comments: We use MongoDB for unstructured data, likes: 100.
Write a query to find the title of the post written by Tom.

> // switch to the database named test (it is created on first write)
> use test
> // create the collection named posts
> db.createCollection("posts")
> // insert one document into posts
> db.posts.insertOne({
         title : "MongoDB",
         description : "MongoDB is a NoSQL database",
         by : "Tom",
         comments : "We use MongoDB for unstructured data",
         likes : 100
    })
> // title of the post written by Tom (the projection hides _id)
> db.posts.find({"by" : "Tom"}, {"title" : 1, "_id" : 0}).pretty()
> // posts with fewer than 150 likes
> db.posts.find({"likes" : {$lt : 150}}).pretty()
