Why is big data so important?

Learn what big data is, how it is defined using the 5 Vs of big data, and how it is stored using a distributed file system such as HDFS.
Understanding big data and the 4 Vs of big data

I believe we have already established how important data has become over the past decade. However, if you thought you had understood data, here comes a new term to show you just how little you know. Big data refers to the same data that you normally work with, but in an abnormally large quantity. The graph below shows how the number of IoT devices has increased over the past years.

Number of IoT devices in the world by 2020
source

The line graph above shows that the number of internet-connected devices around the world has increased sharply. From around 20 billion devices in 2015, the number rose by roughly 150% to a whopping 50 billion in 2020. Such a sharp rise in the usage of internet devices means that more data is generated every second, so we need special architectures to deal with this amount of data. This is why the term “Big Data” was coined and is treated as a separate domain from normal data processing.

What is big data?

When Web 2.0 started off back in 2005, data was a very new concept and was not taken too seriously. However, as time progressed and people caught up with technology, the internet saw a sharp increase in the number of users. Each of these users generated a large amount of data, so conventional data processing techniques could no longer keep up. As of 2022, Facebook sees a total of 1.9 billion users every day! Just for comparison, Facebook revealed back in 2015 that it had a total of 1.44 billion users every month. That means that only 7 years later it has more users each day than it had in a month in 2015. THAT is how quickly internet users have grown.

Another interesting statistic: back in 1998, when Google was first launched, it handled a total of 10,000 searches a day. That sounds like a lot until you realize that today Google processes about 70,000 search queries per second.

source

As the above chart from statista.com shows, the world has seen an increase in internet users from approximately 1.9 billion in 2010 to 5 billion in 2022, an increase of 163% in a span of about 12 years.
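If you want to check these figures yourself, the short sketch below redoes the arithmetic. It only uses the numbers quoted above; the function and variable names are made up for illustration.

```python
def percent_increase(old: float, new: float) -> float:
    """Return the percentage increase from old to new."""
    return (new - old) / old * 100


# Internet users in billions, 2010 vs 2022 (figures quoted in the article).
users_growth = percent_increase(1.9, 5.0)

# Google searches: 10,000 per day in 1998 vs about 70,000 per second today.
SECONDS_PER_DAY = 24 * 60 * 60
searches_per_day_today = 70_000 * SECONDS_PER_DAY
growth_factor = searches_per_day_today / 10_000

print(f"Internet users grew by about {users_growth:.0f}%")
print(f"Searches per day today: {searches_per_day_today:,}")
print(f"That is {growth_factor:,.0f} times the 1998 daily volume")
```

Converting the per-second figure into a per-day figure makes the two Google numbers directly comparable.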

The whole point of discussing these facts and figures is that a normal user does not realize how quickly the internet has grown and how many queries are carried out every second. This means that there is a huge wave of data travelling across the globe every second, so special techniques need to be applied to cater to this data. For a look into general data processing techniques, read here.

5 Vs of big data

We have talked about how Big Data is fundamentally different from normal data, but how do we draw this distinction? How do we know what data is big data?

Well, for that, certain terms are defined to create boundaries for big data. Initially big data was defined by only 3 Vs (Volume, Velocity, Variety). However, as time progressed and firms realized how quickly data is growing and how important it is to analyze it, the definition was expanded to a total of 5 Vs.

The 5 Vs

  • Velocity
    • This refers to the speed at which data is being generated and hence processed. Since we have already seen how the number of IoT devices has increased, people are generating more data by the second, and processing this data as quickly as possible has become important for a competitive advantage. For example, if you search for a ‘Resume builder’ application, Google will save and process this data immediately so that it can tailor the search results and ads displayed to you to be more relevant, perhaps by showing job postings.
Velocity of Big Data Generation
  • Volume
    • It might be obvious that the term ‘Big’ in ‘Big Data’ refers to the size of this data. We have moved on from the Gigabyte world, and now Terabytes of data are being generated every day.
    • The volume of data helps us determine whether our particular use case data can be considered as big data or not.
  • Veracity
    • Veracity here directly translates to the quality of data. With data being generated in such bulk, it is impossible to ensure that all of it is correct and represents valuable information. One example of such inconsistency is people filling out forms with dummy data. It gets very difficult for companies to filter real data from such dummy entries.
Variety of data
  • Variety
    • Data that is being generated on a daily basis is often labelled as ‘chaotic’. This is because there is no single format for this data. There is no consistency across data types, and the data itself is generated in multiple formats. Data is generally classified in the following terms:
      • Structured data:
        • This is tabular data which is often stored in SQL databases in an RDBMS. It is the easiest data to make sense of, since more often than not there are mappings present which link one data entity with another.
      • Semi-Structured data:
        • This is structured data that is stored in an unstructured format such as CSV or TSV. Semi-structured data can easily be converted to a structured form; for example, we can use the commas in a CSV as delimiters to turn it into a proper table.
      • Unstructured data:
        • This form of data includes images, videos, text files and audio. All of these data items are important, but there isn't exactly any intuitive way to link one with another, or any storage methodology which groups them together.
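To make the semi-structured case concrete, here is a minimal sketch of turning CSV text into table-like rows with Python's built-in csv module. The sample data and column names are made up for illustration; in practice the text would come from a file.

```python
import csv
import io

# Hypothetical CSV content; in practice this would be read from a file.
raw_csv = """name,age,city
Alice,30,London
Bob,25,Karachi
"""

# The comma is the delimiter that separates one column from the next.
reader = csv.DictReader(io.StringIO(raw_csv), delimiter=",")
rows = [dict(row) for row in reader]

for row in rows:
    print(row)
```

Each row comes out as a dictionary keyed by the header line, which is already close to a record in a proper table. Note that every value is still a string, so numeric columns such as age would need converting before analysis.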

Storage and handling of big data

Normal data is usually stored in conventional Relational Database Management Systems (RDBMS). These RDBMS use SQL for retrieving and manipulating data. SQL is great for data processing and is widely used for ETL-related tasks as well as some Data Science projects. The problem is that SQL is not optimized for such large amounts of data, especially data which is generated at such a fast pace.

Another problem with SQL is that it only caters to structured data. As we read above, Big Data is generated from multiple sources, so it comprises data of all formats (Variety). For handling unstructured data we use NoSQL; however, for Big Data neither of these is sufficient on its own. Big Data contains structured, semi-structured as well as unstructured data, so we need bits of both SQL and NoSQL.
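The contrast between the two approaches can be sketched in a few lines. This is only an illustration under some assumptions: an in-memory SQLite database stands in for an RDBMS, a plain list of JSON documents stands in for a document-style NoSQL store, and the table, fields and values are all made up.

```python
import json
import sqlite3

# Structured data: a fixed schema, queried with SQL (in-memory SQLite database).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO users (name, country) VALUES (?, ?)",
    [("Alice", "UK"), ("Bob", "PK"), ("Carol", "UK")],
)
uk_users = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM users WHERE country = ? ORDER BY name", ("UK",)
    )
]
conn.close()

# Schema-less records, as a document-style store might hold them:
# each record is free to carry different fields.
documents = [
    json.loads('{"type": "image", "path": "cat.png", "tags": ["pet"]}'),
    json.loads('{"type": "review", "text": "Great product", "stars": 5}'),
]
fields_per_document = [sorted(doc.keys()) for doc in documents]

print(uk_users)
print(fields_per_document)
```

The SQL table forces every row into the same columns, while the two documents share only the `type` field. That flexibility is what makes the second style a better fit for data with a lot of variety.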

Distributed File System (DFS)

A distributed file system consists of a multi-node setup where each node represents a machine capable of processing a certain amount of data. When a file is passed to a distributed file system, the file itself is distributed amongst all the nodes. This way each node processes a different part of the file, and the overall processing is completed significantly faster.
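The idea can be mimicked on a single machine. The sketch below is only a toy model and not how any real DFS is implemented: threads stand in for separate machines, the "file" is a short list of lines, and the function names and the cluster size are made up. A real system typically splits files into fixed-size blocks rather than lines.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(lines, num_nodes):
    """Deal the lines out round-robin so every node gets a share of the file."""
    return [lines[i::num_nodes] for i in range(num_nodes)]


def count_words(chunk):
    """Work done by a single node: count the words in its own chunk."""
    return sum(len(line.split()) for line in chunk)


file_lines = [
    "big data is data in very large quantity",
    "each node processes a part of the file",
    "the partial results are combined at the end",
    "this finishes sooner than one machine working alone",
]

NUM_NODES = 2  # hypothetical cluster size
chunks = split_into_chunks(file_lines, NUM_NODES)

# Threads stand in for separate machines in this sketch.
with ThreadPoolExecutor(max_workers=NUM_NODES) as pool:
    partial_counts = list(pool.map(count_words, chunks))

total_words = sum(partial_counts)
print(partial_counts, total_words)
```

Each node only ever sees its own chunk, and the final answer is obtained by combining the partial results, which is the same split-then-combine pattern described above.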

Apart from quicker data processing, a DFS has other benefits as well. Some implementations of DFS, such as HDFS (Hadoop Distributed File System), allow you to create multiple copies of the segments of the data file, which are then stored across the nodes. This ensures that even if any of the nodes fail, the data is safely stored on other nodes and can be easily recovered.
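Here is a rough sketch of why keeping copies helps. It is a simplified model and not the placement policy HDFS actually uses: the round-robin placement, the node names, the block names and the replication factor of 2 are all assumptions chosen to keep the example small.

```python
def place_blocks(blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {node: set() for node in nodes}
    for index, block in enumerate(blocks):
        for offset in range(replication):
            node = nodes[(index + offset) % len(nodes)]
            placement[node].add(block)
    return placement


def recoverable_blocks(placement, failed_nodes):
    """Return the blocks still available on nodes that have not failed."""
    surviving = set()
    for node, stored in placement.items():
        if node not in failed_nodes:
            surviving |= stored
    return surviving


blocks = ["block-0", "block-1", "block-2", "block-3"]
nodes = ["node-a", "node-b", "node-c"]

placement = place_blocks(blocks, nodes, replication=2)
after_failure = recoverable_blocks(placement, failed_nodes={"node-b"})

print(sorted(after_failure))
```

Because every block lives on two different nodes, losing any single node still leaves at least one copy of each block. With only one copy per block, the same failure would have lost part of the file.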

Conclusion

That’s all for an overview of Big Data. There’s a lot more to learn about Big Data and its storage and handling, but all of that would be too much for this article. Currently, perhaps only the world’s biggest companies (such as Google, Microsoft, Facebook and Amazon) are concerned with big data, because not all organizations generate data in such bulk. However, as we have seen, the usage of the internet is growing day by day, so sooner or later everyone will have to shift to these advanced processing techniques.
