Major Topics and AWS Services:
Data Pipeline
Data Security
Machine Learning

Big Data Tools:

The node types in Amazon EMR are as follows:

Master node: A node that manages the cluster by running software components to coordinate the distribution of data and tasks among other nodes—collectively referred to as slave nodes—for processing. The master node tracks the status of tasks and monitors the health of the cluster.

Core node: A slave node with software components that run tasks and store data in the Hadoop Distributed File System (HDFS) on your cluster.

Task node: A slave node with software components that only run tasks. Task nodes are optional.
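The three node types above correspond to EMR instance-group roles. A minimal sketch of the instance-group layout (instance types and counts are illustrative placeholders), as it would be passed to boto3's `run_job_flow` in the `Instances` parameter:

```python
# Instance-group layout mirroring the three EMR node types.
# Instance types and counts are illustrative, not prescriptive.
instance_groups = [
    {"Name": "Master node",            # coordinates tasks, monitors cluster health
     "InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
    {"Name": "Core nodes",             # run tasks and store data in HDFS
     "InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
    {"Name": "Task nodes (optional)",  # run tasks only, no HDFS storage
     "InstanceRole": "TASK", "InstanceType": "m5.xlarge", "InstanceCount": 2},
]

# With boto3 this would be submitted as:
#   boto3.client("emr").run_job_flow(
#       ..., Instances={"InstanceGroups": instance_groups, ...})
```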


Redshift Node Types:

Each cluster has a leader node and one or more compute nodes.

The leader node receives queries from client applications, parses the queries, and develops query execution plans.

The leader node then coordinates the parallel execution of these plans with the compute nodes and aggregates the intermediate results from those nodes. Finally, it returns the results to the client applications.
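The leader/compute split can be sketched with a toy partial-aggregation example. This is a Python analogy for the execution model described above, not Redshift internals:

```python
# Analogy for Redshift query execution: each "compute node" aggregates only
# its own slice of the data, and the "leader node" combines the intermediate
# results before returning them to the client.
def compute_node(rows):
    # Partial aggregation over one node's data slice (here: a partial SUM).
    return sum(rows)

def leader_node(slices):
    # Leader coordinates the parallel steps and merges partial results.
    partials = [compute_node(s) for s in slices]
    return sum(partials)

data_slices = [[1, 2, 3], [4, 5], [6]]
print(leader_node(data_slices))  # 21
```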

Redshift Notes:

  • When you enable logging on your cluster, Amazon Redshift creates and uploads logs to Amazon S3 that capture data from the creation of the cluster to the present time.
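Audit logging is enabled per cluster and targets an S3 bucket. A sketch of the request parameters (all names are placeholders), as they would be passed to boto3's `enable_logging`:

```python
# Placeholder parameters for enabling Redshift audit logging to S3.
# With boto3: boto3.client("redshift").enable_logging(**logging_params)
logging_params = {
    "ClusterIdentifier": "example-cluster",  # placeholder cluster name
    "BucketName": "example-log-bucket",      # S3 bucket that receives the logs
    "S3KeyPrefix": "redshift-logs/",         # optional prefix within the bucket
}
```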



We(AWS) recommend that all devices that connect to AWS IoT have an entry in the registry. The registry stores information about a device and the certificates that are used by the device to secure communication with AWS IoT.
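Registering a device ("thing") and associating its certificate can be sketched as below; thing name and certificate ARN are placeholders, and the boto3 calls shown in comments (`create_thing`, `attach_thing_principal`) are how this would be done against the real service:

```python
# Sketch of an AWS IoT registry entry: a named thing plus the certificate
# it uses to secure communication. Names and ARN are placeholders.
# With boto3:
#   iot = boto3.client("iot")
#   iot.create_thing(thingName=thing_name)
#   iot.attach_thing_principal(thingName=thing_name, principal=cert_arn)
thing_name = "example-sensor-01"
cert_arn = "arn:aws:iot:us-east-1:123456789012:cert/EXAMPLE"  # placeholder ARN

registry_entry = {"thingName": thing_name, "principals": [cert_arn]}
```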



  • PutRecord API call returns the shard ID of where the data record was placed and the sequence number that was assigned to the data record.
  • Each shard can support up to 5 transactions per second for reads, up to a maximum total data read rate of 2 MB per second.
  • The following sections contain concepts and terminology necessary to understand and benefit from the Kinesis Producer Library (KPL).
    • Records
    • Batching
    • Aggregation
    • Collection
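A sketch of a `PutRecord` request (stream name and payload are illustrative); the response to the real call carries the `ShardId` and `SequenceNumber` described above:

```python
import json

# Placeholder PutRecord request. With boto3:
#   resp = boto3.client("kinesis").put_record(**put_record_params)
#   resp["ShardId"], resp["SequenceNumber"]  # where the record was placed
put_record_params = {
    "StreamName": "example-stream",                 # placeholder stream name
    "Data": json.dumps({"reading": 42}).encode(),   # payload must be bytes
    "PartitionKey": "sensor-01",                    # hashed to pick the shard
}
```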


Amazon Glacier:

Data stored in Amazon Glacier is encrypted at rest by default at all times, using AES-256 server-side encryption.


AWS Data Pipeline:

AWS Data Pipeline is a web service that you can use to automate the movement and transformation of data. With AWS Data Pipeline, you can define data-driven workflows, so that tasks can be dependent on the successful completion of previous tasks.
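The "tasks dependent on the successful completion of previous tasks" idea can be sketched as a toy dependency-driven scheduler. This is an illustration of the scheduling concept only, not the Data Pipeline service or its API; task names are made up:

```python
# Toy scheduler: run each task only after all of its prerequisites have run.
def run_pipeline(tasks, deps):
    """tasks: name -> callable; deps: name -> list of prerequisite names."""
    done, order = set(), []

    def run(name):
        if name in done:
            return
        for d in deps.get(name, []):
            run(d)             # prerequisites complete first
        tasks[name]()
        done.add(name)
        order.append(name)

    for name in tasks:
        run(name)
    return order

log = []
tasks = {
    "copy_to_s3": lambda: log.append("copy"),
    "transform":  lambda: log.append("transform"),
    "load":       lambda: log.append("load"),
}
deps = {"transform": ["copy_to_s3"], "load": ["transform"]}
print(run_pipeline(tasks, deps))  # ['copy_to_s3', 'transform', 'load']
```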

  • Supported services include Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon Redshift as data stores, with Amazon EC2 and Amazon EMR as compute resources.


AWS Athena:

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.

  • Amazon Athena uses Hive only for DDL (Data Definition Language) and for creation/modification and deletion of tables and/or partitions.
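The DDL/query split above can be illustrated with a sketch. The table schema and bucket names are placeholders; with boto3, either string would be submitted via `athena.start_query_execution(QueryString=..., ResultConfiguration={"OutputLocation": ...})`:

```python
# Hive DDL (table/partition creation) vs. standard SQL (analysis) in Athena.
# Schema and S3 locations are illustrative placeholders.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS logs (
    ts string,
    status int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://example-bucket/logs/';
"""

query = "SELECT status, COUNT(*) FROM logs GROUP BY status;"
```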

AWS Machine Learning:

  • Amazon ML can read data directly from S3 only. It can also copy data from Amazon RDS and Amazon Redshift to S3 in CSV format.
  • Amazon ML supports three types of ML models: binary classification, multiclass classification, and regression.
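A rule-of-thumb for choosing among the three model types, based on the target you want to predict. This is an illustrative helper, not part of the Amazon ML API:

```python
# Map the prediction target to one of Amazon ML's three model types:
# numeric target -> regression; two classes -> binary classification;
# more than two classes -> multiclass classification.
def suggest_model_type(labels, numeric_target=False):
    if numeric_target:
        return "regression"              # predict a continuous numeric value
    classes = len(set(labels))
    return "binary classification" if classes == 2 else "multiclass classification"

print(suggest_model_type([0, 1, 1, 0]))             # binary classification
print(suggest_model_type(["cat", "dog", "bird"]))   # multiclass classification
print(suggest_model_type([], numeric_target=True))  # regression
```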

Amazon SageMaker:

Amazon SageMaker is a fully managed platform that enables developers and data scientists to quickly and easily build, train, and deploy machine learning models.



Open Source Tools

Apache HBase:
HBase is an open source, non-relational, distributed database developed as part of the Apache Software Foundation’s Hadoop project. HBase runs on top of Hadoop Distributed File System (HDFS) to provide non-relational database capabilities for the Hadoop ecosystem.

Apache Mahout:

Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms, focused primarily on collaborative filtering, clustering, and classification.

Apache Hive:

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data summarization, query and analysis.

Apache Zeppelin:

A web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala, and more.

Apache Spark:

Apache Spark is a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.