
Tuesday, August 19, 2014

Flume implementation using SpoolDirectory Source, HDFS Sink, File Channel


Flume: Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store.

Steps:
1. Create a local directory to which the log files will be copied from the mount location.
For Ex: /home/user/flume/source
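For example, from the shell:
$mkdir -p /home/user/flume/source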

2. Create an HDFS directory into which Flume will copy the files from the spool directory.
For Ex: /user/hdpuser/flumelogs
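For example, using the HDFS CLI (run as a user with write access to /user/hdpuser):
$hdfs dfs -mkdir -p /user/hdpuser/flumelogs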

3. Create the flume-conf.conf file in the /home/user/flume/ folder with the following contents:
#Single agent flume configuration

#Agent components
a1.sources = src1
a1.channels = chan1
a1.sinks = sink1

#configuring the source
a1.sources.src1.type = spooldir
a1.sources.src1.spoolDir = /home/user/flume/source

#configuring the channel
a1.channels.chan1.type = file
#Checkpoint directory
a1.channels.chan1.checkpointDir = /home/user/flume/.flume/file-channel/checkpoint
#data directory where the channel writes its log files while the flume-agent runs
a1.channels.chan1.dataDirs = /home/user/flume/.flume/file-channel/data

#configuring the Sink
a1.sinks.sink1.type = hdfs
a1.sinks.sink1.hdfs.path = /user/hdpuser/flumelogs/year=20%y/month=%m
#a1.sinks.sink1.hdfs.round = true
#a1.sinks.sink1.hdfs.roundValue = 10
#a1.sinks.sink1.hdfs.roundUnit = minute
a1.sinks.sink1.hdfs.useLocalTimeStamp = true
a1.sinks.sink1.hdfs.batchSize = 10000
a1.sinks.sink1.hdfs.fileType = DataStream
#a1.sinks.sink1.hdfs.filePrefix = %d-START-
#roll the HDFS file by size (bytes) and time (seconds); rollCount = 0 disables event-count rolling
a1.sinks.sink1.hdfs.rollCount = 0
a1.sinks.sink1.hdfs.rollSize = 10737418240
a1.sinks.sink1.hdfs.rollInterval = 216000
#a1.sinks.sink1.serializer = text
#a1.sinks.sink1.serializer.appendNewline = false

a1.sources.src1.channels = chan1
a1.sinks.sink1.channel = chan1
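With useLocalTimeStamp = true, the %y and %m escapes in hdfs.path are resolved from the event's local arrival time. For example, an event received in August 2014 lands under /user/hdpuser/flumelogs/year=2014/month=08.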

4. Run the flume-agent.
In a command prompt, run the commands below:
$cd  /home/user/flume
$flume-ng agent --conf conf --conf-file flume-conf.conf --name a1 
--- this creates a .flumespool directory inside /home/user/flume/source that holds Flume's metadata while the flume-agent streams the data.
Note: "Spooling Directory Source runner has shutdown" in the log just means there is currently no file to stream.
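While testing, it can be handy to add the standard flume-ng logging option so events and errors print to the console:
$flume-ng agent --conf conf --conf-file flume-conf.conf --name a1 -Dflume.root.logger=INFO,console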

5. Copy a log file into /home/user/flume/source.
The flume-agent picks the file up and streams it into HDFS (/user/hdpuser/flumelogs/year=20%y/month=%m).
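For example (app.log is just a placeholder file name):
$cp /var/log/app.log /home/user/flume/source/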
6. Open your HDFS File Browser and verify that the file is there.
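Alternatively, list the directory from the command line; the resolved path below assumes an event timestamp in August 2014:
$hdfs dfs -ls /user/hdpuser/flumelogs/year=2014/month=08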
7. The checkpoint directory and data directory files are created by the Flume file channel and contain the checkpoints and log data.

Note:
a) The channel's transactionCapacity must be less than or equal to the channel's capacity.
b) The sink's batchSize must be less than or equal to the channel's transactionCapacity.
In the same way, you can use a memory channel instead of the file channel; a minimal sketch follows.
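A minimal memory-channel variant of the channel section (the capacity values are illustrative; note they satisfy notes a and b above, given the sink's batchSize of 10000):

#configuring the channel (memory instead of file)
a1.channels.chan1.type = memory
#total number of events the channel can hold
a1.channels.chan1.capacity = 100000
#events per transaction; must be <= capacity and >= the sink batchSize
a1.channels.chan1.transactionCapacity = 10000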