
Tuesday, August 19, 2014

Flume implementation using SpoolDirectory Source, HDFS Sink, File Channel


Flume: Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store.

Steps:
1. Create a local directory to which the log files will be copied from the mount location.
For Ex: /home/user/flume/source
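For example, from the shell:
$mkdir -p /home/user/flume/source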

2. Create an HDFS directory into which Flume will copy the files from the spool directory.
For Ex: /user/hdpuser/flumelogs
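For example, using the HDFS CLI (run as a user with write access to /user/hdpuser):
$hdfs dfs -mkdir -p /user/hdpuser/flumelogs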

3. Create the flume-conf.conf file in the /home/user/flume/ folder with the following contents:
#Single agent flume configuration

#Agent components
a1.sources = src1
a1.channels = chan1
a1.sinks = sink1

#configuring the source
a1.sources.src1.type = spooldir
a1.sources.src1.spoolDir = /home/user/flume/source

#configuring the channel
a1.channels.chan1.type = file
#Checkpoint directory
a1.channels.chan1.checkpointDir = /home/user/flume/.flume/file-channel/checkpoint
#data directory where the channel writes its log files while the flume-agent runs
a1.channels.chan1.dataDirs = /home/user/flume/.flume/file-channel/data

#configuring the Sink
a1.sinks.sink1.type = hdfs
a1.sinks.sink1.hdfs.path = /user/hdpuser/flumelogs/year=20%y/month=%m
#a1.sinks.sink1.hdfs.round = true
#a1.sinks.sink1.hdfs.roundValue = 10
#a1.sinks.sink1.hdfs.roundUnit = minute
a1.sinks.sink1.hdfs.useLocalTimeStamp = true
a1.sinks.sink1.hdfs.batchSize = 10000
a1.sinks.sink1.hdfs.fileType = DataStream
#a1.sinks.sink1.hdfs.filePrefix = %d-START-
#roll the HDFS file by size (bytes) and time (seconds); rollCount = 0 disables event-count rolling
a1.sinks.sink1.hdfs.rollCount = 0
a1.sinks.sink1.hdfs.rollSize = 10737418240
a1.sinks.sink1.hdfs.rollInterval = 216000
#a1.sinks.sink1.serializer = text
#a1.sinks.sink1.serializer.appendNewline = false

a1.sources.src1.channels = chan1
a1.sinks.sink1.channel = chan1
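With useLocalTimeStamp = true, the %y and %m escapes in hdfs.path are resolved from the event's local arrival time. For example, an event received in August 2014 lands under /user/hdpuser/flumelogs/year=2014/month=08.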

4. Run the flume-agent.
In a command prompt, run the commands below:
$cd  /home/user/flume
$flume-ng agent --conf conf --conf-file flume-conf.conf --name a1 
--- this creates a .flumespool directory inside /home/user/flume/source that holds Flume's metadata while the flume-agent streams the data.
Note: "Spooling Directory Source runner has shutdown" in the log just means there is currently no file to stream.
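While testing, it can be handy to add the standard flume-ng logging option so events and errors print to the console:
$flume-ng agent --conf conf --conf-file flume-conf.conf --name a1 -Dflume.root.logger=INFO,console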

5. Copy a log file into /home/user/flume/source.
The flume-agent picks the file up and streams it into HDFS (/user/hdpuser/flumelogs/year=20%y/month=%m).
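For example (app.log is just a placeholder file name):
$cp /var/log/app.log /home/user/flume/source/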
6. Open your HDFS File Browser and verify that the file is there.
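Alternatively, list the directory from the command line; the resolved path below assumes an event timestamp in August 2014:
$hdfs dfs -ls /user/hdpuser/flumelogs/year=2014/month=08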
7. The checkpoint directory and data directory files are created by the Flume file channel and contain the checkpoints and log data.

Note:
a) The channel's transactionCapacity must be less than or equal to the channel's capacity.
b) The sink's batchSize must be less than or equal to the channel's transactionCapacity.
In the same way, you can use a memory channel instead of the file channel; a minimal sketch follows.
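A minimal memory-channel variant of the channel section (the capacity values are illustrative; note they satisfy notes a and b above, given the sink's batchSize of 10000):

#configuring the channel (memory instead of file)
a1.channels.chan1.type = memory
#total number of events the channel can hold
a1.channels.chan1.capacity = 100000
#events per transaction; must be <= capacity and >= the sink batchSize
a1.channels.chan1.transactionCapacity = 10000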