Hands-on Big Data – The Architecture

This is going to be quite a big and interesting project for me: building a Big Data server from scratch with the latest data engines. I can’t wait to see it all up and running.

I first thought about what I should implement to cover all areas of data analytics (so that I can reuse it in future projects), and I decided that the following architecture should work pretty neatly (well, it’s a pretty standard one).

Architecture Diagram

Software Requirements

Based on the diagram above, I decided to start with the following software:

– Java 8. All of this software relies on Java. I had to use OpenJDK because the Oracle JDK is no longer available as a free download.

– Hadoop 2.7.7. This is the latest stable version of Hadoop 2. I might move to Hadoop 3 in the future, but first I wanted to check that everything works on a very well-known version. I want to implement a pseudo-distributed environment where both the NameNode and the DataNode reside on the same machine.

– Hive 2.3.6. We will use Hive to avoid writing Hadoop MapReduce jobs by hand and to build our Data Warehouse on top of HDFS.

– Talend for Big Data 7.3.1. We need to consume our data and do some clean-up, so here we have Talend, a popular ETL tool for this purpose. It has components for most of the well-known Big Data sources.

– Talend for Data Integration 7.3.1. Moving data between different databases, with some transformation along the way, requires a data integration tool, so Talend will help us with that as well.

– Spark 2.4.4 with Anaconda 3. I will use Spark for the transformations and for anything else I can’t do with the other predefined tools. Anaconda 3 provides the Python 3 environment I need for this.
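To make the pseudo-distributed Hadoop setup mentioned above a bit more concrete, here is a minimal sketch in Python that renders the two small configuration files such a setup usually needs. The helper name `hadoop_site_xml` is hypothetical, and the values (`hdfs://localhost:9000` and a replication factor of 1) are assumptions borrowed from the usual single-node setup, not something decided in this post.

```python
import xml.etree.ElementTree as ET


def hadoop_site_xml(properties):
    """Render a dict of Hadoop properties as a *-site.xml document string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Assumed values for a single machine; adjust host and port to your setup.
CORE_SITE = {"fs.defaultFS": "hdfs://localhost:9000"}  # where clients find the NameNode
HDFS_SITE = {"dfs.replication": 1}  # only one DataNode, so keep one copy of each block

if __name__ == "__main__":
    print(hadoop_site_xml(CORE_SITE))  # would go into core-site.xml
    print(hadoop_site_xml(HDFS_SITE))  # would go into hdfs-site.xml
```

With a single DataNode there is nowhere to store extra replicas, which is why the replication factor is set to 1 in this sketch.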

The rest of the software will be installed depending on the project needs. For example, I will install MongoDB as soon as I need to store JSON documents or similar file-based data, and MySQL as the RDBMS when I need one. I will probably add SQL Server as the Data Warehouse and, hopefully, Tableau for dashboards and reporting. I don’t want to make these decisions yet, as they will depend on the project I choose to work on.

Windows or Linux?

Definitely Linux. I tried Windows last month and it was an absolute nightmare. Spark and Anaconda 3 were working fine (more or less), but when I tried to install Hive everything fell apart. The environment variables didn’t work properly because of the backslashes in the paths, and Hadoop didn’t work either. So I decided to jump into the Linux world, something I had wanted to do for a long time; until now I had been working quite intensively with Microsoft products and didn’t want to set up a dual-boot system. In the end I got rid of Windows 10 completely and installed Ubuntu 19.04. I love it. It has a fast and nice user interface and, finally, I have much more control over the installation.
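Since broken environment variables were what caused the trouble, a small check like the following can help spot them early. It is only a sketch: the variable names (`JAVA_HOME`, `HADOOP_HOME`, `HIVE_HOME`, `SPARK_HOME`) are the conventional ones and are assumed here, and `check_env` is a hypothetical helper, not part of any of these tools.

```python
import os

# Conventional variable names, assumed for illustration.
EXPECTED_VARS = ("JAVA_HOME", "HADOOP_HOME", "HIVE_HOME", "SPARK_HOME")


def check_env(names=EXPECTED_VARS, environ=None):
    """Report, for each variable, whether it is set and points to a directory."""
    environ = os.environ if environ is None else environ
    report = {}
    for name in names:
        value = environ.get(name)
        if not value:
            report[name] = "not set"
        elif not os.path.isdir(value):
            report[name] = "set, but not a directory: " + value
        else:
            report[name] = "ok"
    return report


if __name__ == "__main__":
    for name, status in check_env().items():
        print(f"{name}: {status}")
```

Running it after each installation step shows at a glance which variable is missing or points to a path that does not exist.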

The installation is going to be great fun. Please leave a comment if you liked this post. Happy installing!
