Sunday, September 21, 2014

HADOOP on 20th Sep 11:00AM at Marthahalli (Bangalore) by Mr. Anil Kumar

Syllabus:

HADOOP DEVELOPMENT


Introduction
  • What is Cloud Computing
  • What is Grid Computing
  • What is Virtualization
  • How the above three are related to each other
  • What is Big Data
  • Introduction to Analytics and the need for big data analytics
  • Hadoop Solutions - Big Picture
  • Hadoop distributions
  • Comparing Hadoop Vs. Traditional systems
  • Volunteer Computing
  • Data Retrieval - Random Access Vs. Sequential Access
  • NoSQL Databases

The Motivation for Hadoop
  • Problems with traditional large-scale systems
  • Requirements for a new approach

Hadoop: Basic Concepts
  • What is Hadoop?
  • The Hadoop Distributed File System
  • How MapReduce Works
  • Anatomy of a Hadoop Cluster

Hadoop daemons
  • NameNode
  • DataNode
  • Secondary NameNode
  • JobTracker
  • TaskTracker

HDFS in detail
  • Blocks and Splits
  • Replication
  • Data high availability
  • Data Integrity
  • Cluster architecture and block placement

Programming Practices & Performance Tuning
  • Developing MapReduce Programs in
    • Local Mode
    • Pseudo-distributed Mode
    • Fully Distributed Mode

Writing a MapReduce Program
  • Examining a Sample MapReduce Program
  • Basic API Concepts
  • The Driver Code
  • The Mapper
  • The Reducer
  • Hadoop's Streaming API

Set up Hadoop clusters with Apache, Cloudera and Hortonworks
  • Install and configure Apache Hadoop
  • Make a fully distributed Hadoop cluster on a single laptop/desktop
  • Install and configure Cloudera Hadoop distribution in fully distributed mode
  • Install and configure Hortonworks Hadoop distribution in fully distributed mode
  • Monitoring the cluster
  • Getting used to the management consoles of Cloudera and Hortonworks

Delving Deeper Into the Hadoop API
  • Using Combiners
  • The configure and close Methods
  • SequenceFiles
  • Partitioners
  • Counters
  • Directly Accessing HDFS
  • ToolRunner
  • Using The Distributed Cache

Common MapReduce Algorithms
  • Sorting and Searching
  • Indexing
  • Classification/Machine Learning
  • Term Frequency - Inverse Document Frequency
  • Word Co-Occurrence
  • Hands-On Exercise: Creating an Inverted Index

Debugging MapReduce Programs
  • Testing with MRUnit
  • Logging
  • Other Debugging Strategies

Advanced MapReduce Programming
  • A Recap of the MapReduce Flow
  • Custom Writables and WritableComparables
  • The Secondary Sort
  • Creating InputFormats and OutputFormats
  • Pipelining Jobs With Oozie
  • Map-Side Joins
  • Reduce-Side Joins

Monitoring and debugging on a Production Cluster
  • Counters
  • Skipping Bad Records
  • Rerunning failed tasks with IsolationRunner

Tuning for Performance
  • Reducing network traffic with a combiner
  • Reducing the amount of input data
  • Using Compression
  • Running with speculative execution
  • Refactoring code and rewriting algorithms
  • Parameters affecting Performance
  • Other Performance Aspects


Hadoop Ecosystem covered as part of Hadoop Developer

Ecosystem component: Hive
  • Hive concepts
  • Install and configure Hive on the cluster
  • Create a database and access it from the console
  • Develop and run sample applications in Java/Python to access Hive

Ecosystem component: Sqoop
  • Install and configure Sqoop on the cluster
  • Import data from Oracle/MySQL to Hive

Ecosystem component: Pig
  • Install and configure Pig
  • Write sample Pig Latin scripts

Ecosystem component: HBase
  • HBase concepts
  • Install and configure HBase on the cluster
  • Create a database, then develop and run sample applications

Ecosystem component: Cassandra
  • Cassandra concepts
  • Install and configure Cassandra
  • Create a database and access it from the console
  • Develop and run sample applications in Java/Python to access Cassandra data

Ecosystem component: Oozie
  • Oozie concepts
  • Install and configure Oozie on the cluster
  • Create sample workflows and run them on the cluster

Overview of other Ecosystem components
  • Avro, Thrift, REST, Mahout, Flume, Chukwa, YARN, MR2, etc.

Analytics Basics
  • Analytics and big data analytics
  • Commonly used analytics algorithms
  • Analytics tools like R and Weka
  • Mahout
