Big Data, What is it?

LiYen Yoong
5 min read · Apr 18, 2020

A colleague from Business Operations texted me one morning to ask where she could get insights into databases: the terminology, the difference between SQL and NoSQL, and how to decide which type of database to use. I replied instantly, “Get it from me!” I was pretty confident that I could give her an answer, and I wanted to explain databases in an interesting way.

What is SQL?

Structured Query Language (SQL) is a computer language for managing relational databases and manipulating the data in them. SQL is used to insert, update and delete data, and it lets us access and modify what is stored. The data is stored in a relational model, in tables with rows and columns. Each row contains all of the information about one specific entry, and each column is a separate data point.
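To make insertion, update, deletion and access concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customers table, its columns and the sample values are made up for illustration; any relational database would accept similar statements.

```python
import sqlite3

# In-memory database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)

# Insertion: each row is one entry, each column is one data point.
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Alice", "Singapore"), ("Bob", "Dublin")],
)

# Update and deletion.
conn.execute("UPDATE customers SET city = ? WHERE name = ?", ("London", "Bob"))
conn.execute("DELETE FROM customers WHERE name = ?", ("Alice",))

# Access: read back whatever is left in the table.
rows = conn.execute("SELECT id, name, city FROM customers").fetchall()
print(rows)
conn.close()
```

Every row in this table has the same three columns, which is the fixed structure that the relational model relies on.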

What is NoSQL?

NoSQL encompasses a wide range of database technologies designed to cater to the demands of modern apps. These databases store a wide range of data types, each with a different data storage model. The main ones are document, graph, key-value and columnar.

The picture above illustrates this. Apps such as Facebook and Twitter, web search engines and IoT applications generate huge amounts of data, both structured and unstructured. Photos and videos are the best examples of unstructured data. Such data needs a different storage method, and NoSQL databases do not store data in a rows-and-columns (table) format.

Differences between SQL and NoSQL

Many websites explain the differences, and I referred to this website.

NoSQL databases are also known as schema-less databases. The screenshot above uses the term “dynamic schema”, which means the same thing: there is no fixed schema that locks every entry into the same set of columns (fields). NoSQL databases allow records with different numbers of columns.
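As a rough illustration of a dynamic schema, the sketch below models a document-style collection as a plain Python list of dictionaries. It does not use a real NoSQL database; the users collection and its field names are hypothetical.

```python
import json

# Hypothetical "users" collection: each document carries its own set of fields.
users = [
    {"name": "Alice", "email": "alice@example.com"},
    {"name": "Bob", "email": "bob@example.com", "interests": ["music", "travel"]},
    {"name": "Carol", "location": {"city": "Singapore"}},
]

# With no fixed schema, documents can have different numbers of fields,
# so code that reads them has to tolerate missing ones.
field_counts = [len(doc) for doc in users]
emails = [doc.get("email", "unknown") for doc in users]

print(field_counts)
print(emails)
print(json.dumps(users[2]))  # documents are often stored in a JSON-like form
```

The trade-off is that the application, not the database, becomes responsible for handling entries that lack a given field.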

Image: https://www.guru99.com/nosql-tutorial.html

Another significant difference is scalability: SQL databases scale vertically, while NoSQL databases scale horizontally. Let’s use a picture to explain scalability.

Relational databases are traditionally designed to run on a single server, which maintains the integrity of the table mappings and avoids the problems of distributed computing. To grow such a system, we typically upgrade the hardware: more RAM, more CPU and more disk. This is scaling up, or vertical scaling, and it is expensive.

NoSQL databases are non-relational, which makes it easy to scale out, or scale horizontally: the database runs on multiple servers that work together, each sharing part of the load. This can be done on inexpensive commodity hardware.
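One common way to share the load across servers is to assign each record to a server based on a hash of its key. The sketch below is a simplified illustration of that idea, not how any particular NoSQL product does it; the server names and the server_for helper are hypothetical.

```python
import hashlib

# Hypothetical pool of inexpensive servers.
SERVERS = ["server-a", "server-b", "server-c"]

def server_for(key, servers=SERVERS):
    """Pick a server from a stable hash of the record key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]

# Distribute twelve made-up user records across the pool.
shards = {name: [] for name in SERVERS}
for user_id in (f"user-{i}" for i in range(12)):
    shards[server_for(user_id)].append(user_id)

for name, keys in shards.items():
    print(name, len(keys))
```

With plain modulo hashing like this, adding a server moves most keys to a different server, so real systems tend to use more careful schemes such as consistent hashing.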

Question: SQL or NoSQL?

Let’s refer to this article: the choice between SQL and NoSQL should be based not on the differences between them but on the project requirements. If your application has a fixed structure and does not need frequent modifications, SQL is the preferable choice. Conversely, if your data changes frequently and grows rapidly, as in Big Data analytics, NoSQL is likely the better option. And remember, SQL is not dead and is unlikely to be superseded by NoSQL or any other database technology.

In short, it depends on the type of application, the project requirements and the kind of query results you need. Next, let’s see how this relates to Big Data.

Big Data

Big data refers not just to the total amount of data generated and stored electronically (volume) but also to specific datasets that are so large in both size and complexity that algorithms are required to extract useful information from them. Example sources include search engine data, healthcare data and real-time data. In my previous article, What is Big Data?, I shared that Big Data has 3 V’s:

  • Volume of data. Amount of data from myriad sources.
  • Variety of data. Types of data; structured, semi-structured and unstructured.
  • Velocity of data. The speed and time at which the Big Data is generated.

Based on all the above, we have covered two of the three V’s: volume and variety. Velocity is how fast data is generated and processed. There are more V’s out there, and some are relevant to describing Big Data. During my visit to Big Data World 2018 in Singapore, I realized that my understanding of Big Data was limited, so in this blog I am going to write more.

Storing Big Data

Unstructured data cannot easily be stored in a traditional RDBMS, and Big Data is often real-time data that requires real-time processing.

Hadoop Distributed File System (HDFS)

HDFS provides efficient and reliable storage for big data across many computers. It is one of the popular distributed file systems (DFS), and it stores both unstructured and semi-structured data for data analysis.
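The general idea behind storing data reliably across many computers is to split a file into blocks and keep copies of each block on several machines. The toy sketch below illustrates that idea only; it does not use the HDFS API, and the node names, the tiny block size and the round-robin placement are assumptions made for the example (HDFS itself uses much larger blocks and its own placement policy).

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin style."""
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement

# Hypothetical cluster of four machines.
NODES = ["node-1", "node-2", "node-3", "node-4"]

blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_replicas(len(blocks), NODES, replication=3)

for block_id, holders in placement.items():
    print(block_id, len(blocks[block_id]), holders)
```

Because every block lives on three different nodes in this sketch, losing any single machine still leaves two copies of each block.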

Big Data Analytics

There are not many tools for NoSQL analytics on the market at the moment. One popular method of dealing with Big Data is MapReduce, which divides the data into small chunks and processes each chunk individually. In other words, MapReduce spreads the required processing or queries over many computers (many processors).
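The classic way to picture MapReduce is a word count. The sketch below runs the map, shuffle and reduce steps in a single Python process to show the flow of data; in a real cluster each chunk would be handled by a different machine. The function names and the sample chunks are made up for illustration.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Shuffle: group all the values that belong to the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}

# Each chunk stands in for a piece of a much larger dataset.
chunks = ["big data is big", "data is everywhere"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

The map step never needs to see more than its own chunk, which is what allows the work to be spread over many computers.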

Big Data is not limited to search engines and healthcare. It also applies to e-commerce websites that perform targeted advertising and provide recommendation systems, such as those we see on Amazon, Spotify or Netflix.

Big Data Security

Securing a network and the data it holds is a key issue. Basic measures such as firewalls and encryption should be taken to safeguard networks against unauthorized access.

Big Data and AI

While the smart home has become a reality in recent years, the successful development of smart vehicles that can drive in auto-mode gives us hope that one day the smart city can be realized. Asian countries such as Singapore, Korea and China, and European countries such as Ireland and the UK, are planning smart cities that use the internet of things (IoT) and Big Data management techniques.

Reference:
Dawn E. Holmes (2017). Big Data: A Very Short Introduction. Oxford University Press.
