Wednesday, August 27, 2014

Big Data Overview



Big Data:
Big data is the data which it’s growing from scarce to superabundant day by day. We all are making big buzz about big data, it’s good and big-headache to analyze. Now this is a very hot topic. It’s defined as Volume, Velocity, and Variety by IBM.As you know we all are using Facebook account to upload photos, videos and messages on daily basis.

       7TB of data are processed by Twitter every day.
       10TB of data are processed by Facebook every day.       
       2 billion internet users in the world today.
       4.6 billion mobile phones in 2011.

Rollin Ford, the CIO of Wal-Mart said “Every day I wake up and ask, ‘how can I flow data better, manage data better, analyse data better?”.

Big Data History:
As we know, we are always gathering information about documents, photos and videos in storage devices by using latest technologies and devices (with large pixels). In Olden days, people collected the papers and stored in the warehouse or library in form of records, books etc.

The fact of the matter is big data is not new. In 1940s, Libraries effected and had to adapt their storage methods to meet the quickly increasing papers of new publications and re-search and encounter the first attempts to qualify the growth rate in the volume of data as information explosion in 1941.

In 1997s, the term “Big Data” was used for first time in an article by NASA researchers M.Cox and D. Ellsworth to identify the rise of data was becoming an issue for current computer systems.

In 1999s, Doug Cutting wrote the Lucene is the key component of search index in Apache Nutch, Apache Solr and Elastic Search. Nutch faced some troubles to manage computations running on parallel computers while building an open source engine. In 2003 and 2004, Google published its GFS and Map reduces papers, the route became clear to implement Hadoop by Doug Cutting in 2006.

Today, companies in every industry is facing big data problems which generated by their daily-increased ability to collect information.

Big Data Key Challenges:
1.      Performance
2.      Network Traffic
3.      Risk Failure
4.      Analysing the data
5.      Addressing data quality
6.      Displaying meaningful results
7.      Dealing with reports by using graphical presentation

Big Data Technologies:

  •  Data Scientist
  •  In-Memory databases
  •  Distributed Databases
  •  NO-SQL
  •  Data warehouse appliances
  •  Hadoop

Tuesday, August 19, 2014

Flume implementation using SpoolDirectory Source, HDFS Sink, File Channel


       Flume: Apache Flume is a distributed, reliable, and available system for efficiently collecting,   aggregating and moving large amounts of log data from many different sources to a centralized data store. 

Steps:
1.  create Directory to copy the log files from mount location. 
For Ex: /home/user/flume/source

2. create HDFS Directory to copy the files from flume spool directory.
For Ex: /user/hdpuser/flumelogs

3. create the flume-conf.conf file in /home/user/flume/ folder.
#Single agent flume configuration

#Agent components
a1.sources = src1
a1.channels = chan1
a1.sinks = sink1

#configuring the souce
a1.sources.src1.type = spooldir
a1.sources.src1.spoolDir = /home/user/flume/source

#configuring the channel
a1.channels.chan1.type = file
#Checkpoint directory
a1.channels.chan1.checkpointDir = /home/user/flume /.flume/file-channel/checkpoint             
#log files created in Data directory while running flume-agent
a1.channels.chan1.dataDirs = /home/user/flume/flume/.flume/file-channel/data

#configuring the Sink
a1.sinks.sink1.type = hdfs
a1.sinks.sink1.hdfs.path = /user/hdpuser/flumelogs/year=20%y/month=%m
#a1.sinks.sink1.hdfs.round = true
#a1.sinks.sink1.hdfs.roundValue = 10
#a1.sinks.sink1.hdfs.roundUnit = minute
a1.sinks.sink1.hdfs.useLocalTimeStamp = true
a1.sinks.sink1.hdfs.batchSize = 10000
a1.sinks.sink1.hdfs.fileType = DataStream
#a1.sinks.sink1.hdfs.filePrefix = %d-START-
#creating log file in hdfs based on the roll size
a1.sinks.sink1.hdfs.rollCount = 0
a1.sinks.sink1.hdfs.rollSize = 10737418240
a1.sinks.sink1.hdfs.rollInterval = 216000
#a1.sinks.sink1.sink.serializer = text
#a1.sinks.sink1.sink.serializer.appendNewline = false

a1.sources.src1.channels = chan1
a1.sinks.sink1.channel = chan1

4. run the flume-agent
in command promt run the below command
$cd  /home/user/flume
$flume-ng agent --conf conf --conf-file flume-conf.conf --name a1 
--- this creates .flumespool in /home/user/flume/source contains the flume metadata information when flume-agnet stream the information.
Note: "Spooling Directory Source runner has shutdown" means there is no file to stream.

5. copy the log file in to the /home/user/flume/source.
Flume-agent take that file and stream into the HDFS (/user/hdpuser/flumelogs/year=20%y/month=%m) .
6. Open your HDFS File Browser and see the file.
7.  Checkpoint directory and Data Directory files are created by flume file channel with log data and check points.

 Note:
a) The channel transaction capacity will need to be smaller than equal to the channel capacity. 
b) Batch Size  <=  channel capacity.   
In Same way, you can use memory channel also 

Sunday, August 3, 2014

Swing Progress Bar example in JAVA

SwingWorker Progress Bar example is creating progress bar and it closes the progressbar automatically when it's done.
We need to provide the progress bar while running the application to increase the user-friendly.
In this program we are building the progress bar with Swing framework.
This program indicates how much task is completed and same thing shows in the progressbar.

Source code:
import java.awt.event.ActionEvent;
import java.awt.event.ActionListener;

import javax.swing.JButton;
import javax.swing.JFrame;
import javax.swing.JLabel;
import javax.swing.JPanel;
import javax.swing.JProgressBar;
import javax.swing.SwingWorker;

public class MyProgressBarTest extends JFrame {
    private static final long serialVersionUID = 1L;

    private static JProgressBar progressBar;
    private static JLabel waitLabel = new JLabel("Please wait...");

    public static void main(String[] args) {
        MyProgressBarTest obj = new MyProgressBarTest();
        obj.createGUI();
    }

    public void createGUI() {
        JPanel panel = new JPanel();
        JButton button = new JButton("Progress");

        progressBar = new JProgressBar();
        progressBar.setVisible(false);
        waitLabel.setVisible(false);
        panel.add(waitLabel);
        panel.add(progressBar);

        button.addActionListener(new ActionListener() {
            @Override
            public void actionPerformed(ActionEvent arg0) {

                SwingWorker<Void, Void> mySwingWorker = new SwingWorker<Void, Void>() {
                    @Override
                    protected Void doInBackground() throws Exception {

                        // mimic some long-running process here...
                        calculateResult();
                        return null;
                    }
                    //actual code is running on backend of progress bar
                    public void calculateResult() {
                        for (int i = 0; i < 300000; i++) {
                            System.out.println("Progress Bar: " + i);
                        }
                    }
                    //disable the progressbar and waitlabel when task completed
                    protected void done() {
                        progressBar.setVisible(false);
                        waitLabel.setVisible(false);
                    }
                };

                mySwingWorker.execute();

                progressBar.setIndeterminate(true);
                progressBar.setVisible(true);
                waitLabel.setVisible(true);

            }
        });

        panel.add(button);
        add(panel);

        setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        setLocationRelativeTo(null);
        pack();
        setSize(200, 300);
        setVisible(true);
    }
}


Hadoop WordCount Program with MapReduce v2 (MRv2)



Wordcount is the typical example to understand simple map-reduce functionality. This program reads the text file and counts how often each word occur in the input text file.


Simple Map Reduce Contains
Input and output directory: The input dir is text files and the output dir is text files, each line of which contains a word and the count of how often it occurred, separated by a tab.

Driver Class:  This driver class is responsible for triggering the map reduce job in Hadoop, it is in this driver class we provide the name of our job, output key value data types and the mapper and reducer classes

Mapper: Each mapper takes a line as input and breaks it into words. It then emits a key/value pair of the word and 1.

Reducer: Each reducer sums the counts for each word and emits a single key/value with the word and sum.
As an optimization, the reducer is also used as a combiner on the map outputs. This reduces the amount of data sent across the network by combining each word into a single record.  

Source Code:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
   
    //Mapper
    public static class Map extends
            Mapper<LongWritable, Text, Text, IntWritable> {
       
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
           
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
           
        }
    }

   // Reducer
public static class Reduce extends
Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable t : values) {
sum+=t.get();
}

context.write(key, new IntWritable(sum));
}
}

    //Driver class
    public static void main(String[] args) throws Exception {
           
        Configuration conf = new Configuration();
        Job job = new Job(conf, "WordCount");
       
        job.setJarByClass(WordCount.class);
       
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        //Mapper   
        job.setMapperClass(Map.class);
        //Reducer
        job.setReducerClass(Reduce.class);
       
        //Input Text Format
        job.setInputFormatClass(TextInputFormat.class);
        //Output Text Format
        job.setOutputFormatClass(TextOutputFormat.class);
        //Input directory
        FileInputFormat.addInputPath(job, new Path(args[0]));
        //Output directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
   
    }
}

Execute the Program:
Assuming HADOOP_HOME is the root of the installation and HADOOP_VERSION is the Hadoop version installed, compile WordCount.java and create a jar:
  1. create the WordCount.jar and mention Wordcount.class in the class file.
  2. Assuming that:
    •  /usr/wordcount/input - input directory in HDFS
    • /usr/wordcount/output - output directory in HDFS
  3.  $ bin/hadoop dfs -ls /usr/wordcount/input/ contains sample.txt as "Hello World Bye World Hello Hadoop Goodbye Hadoop"
  4. Run the application:
    $ bin/hadoop jar /usr/wordcount.jar WordCount /usr/wordcount/input /usr/wordcount/output
  5. Output:
    $ bin/hadoop dfs -cat /usr/wordcount/output/part-00000
    Bye 1
    Goodbye 1
    Hadoop 2
    Hello 2
    World 2