Flume: Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
Steps:
1. Create a local directory into which the log files will be copied from the mount location.
For Ex: /home/user/flume/source
2. Create an HDFS directory into which Flume will copy the files from the spool directory.
For Ex: /user/hdpuser/flumelogs
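For example, the two directories can be created like this (paths as above; adjust to your environment):
$mkdir -p /home/user/flume/source
$hdfs dfs -mkdir -p /user/hdpuser/flumelogs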
3. Create the flume-conf.conf file in the /home/user/flume/ folder with the following contents.
#Single agent flume configuration
#Agent components
a1.sources = src1
a1.channels = chan1
a1.sinks = sink1
#configuring the source
a1.sources.src1.type = spooldir
a1.sources.src1.spoolDir = /home/user/flume/source
#configuring the channel
a1.channels.chan1.type = file
#Checkpoint directory
a1.channels.chan1.checkpointDir = /home/user/flume/.flume/file-channel/checkpoint
#data files are written to this directory while the flume agent is running
a1.channels.chan1.dataDirs = /home/user/flume/.flume/file-channel/data
#configuring the Sink
a1.sinks.sink1.type = hdfs
a1.sinks.sink1.hdfs.path = /user/hdpuser/flumelogs/year=20%y/month=%m
#a1.sinks.sink1.hdfs.round = true
#a1.sinks.sink1.hdfs.roundValue = 10
#a1.sinks.sink1.hdfs.roundUnit = minute
a1.sinks.sink1.hdfs.useLocalTimeStamp = true
a1.sinks.sink1.hdfs.batchSize = 10000
a1.sinks.sink1.hdfs.fileType = DataStream
#a1.sinks.sink1.hdfs.filePrefix = %d-START-
#roll the HDFS file based on size and time (rollCount = 0 disables count-based rolling)
a1.sinks.sink1.hdfs.rollCount = 0
a1.sinks.sink1.hdfs.rollSize = 10737418240
a1.sinks.sink1.hdfs.rollInterval = 216000
#a1.sinks.sink1.serializer = text
#a1.sinks.sink1.serializer.appendNewline = false
#bind the source and the sink to the channel
a1.sources.src1.channels = chan1
a1.sinks.sink1.channel = chan1
4. Run the flume agent.
At the command prompt, run the below commands:
$cd /home/user/flume
$flume-ng agent --conf conf --conf-file flume-conf.conf --name a1
--- this creates a .flumespool directory in /home/user/flume/source, which holds Flume's tracking metadata while the agent streams the files.
Note: "Spooling Directory Source runner has shutdown" means there is no file to stream.
5. Copy a log file into /home/user/flume/source.
The flume agent picks up that file and streams it into HDFS (/user/hdpuser/flumelogs/year=20%y/month=%m).
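For example (the source path here is just illustrative):
$cp /mnt/logs/app.log /home/user/flume/source/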
6. Open your HDFS file browser and verify that the file has arrived.
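Alternatively, list the target directory from the command line (the year/month partition values depend on when the agent wrote the file):
$hdfs dfs -ls -R /user/hdpuser/flumelogs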
7. The checkpoint and data directories are populated by the Flume file channel with its checkpoints and event data.
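You can inspect them with a recursive listing (paths as configured above):
$ls -R /home/user/flume/.flume/file-channel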
Note:
a) The channel transaction capacity must be smaller than or equal to the channel capacity.
b) The sink batch size must be smaller than or equal to the channel transaction capacity.
In the same way, you can also use a memory channel.
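A minimal sketch (same channel name as above; the capacity values are illustrative and should be tuned to your load):
a1.channels.chan1.type = memory
#maximum number of events the channel holds
a1.channels.chan1.capacity = 100000
#events per transaction; keep this <= capacity and >= the sink batch size (see notes a and b)
a1.channels.chan1.transactionCapacity = 10000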
I'm getting an error in prod Flume with a similar setup:
2015-09-06 17:34:46,554 (pool-6-thread-1) [INFO - org.apache.flume.client.avro.ReliableSpoolingFileEventReader.readEvents(ReliableSpoolingFileEventReader.java:238)] Last read was never committed - resetting mark position.
2015-09-06 17:34:49,563 (pool-6-thread-1) [WARN - org.apache.flume.source.SpoolDirectorySource$SpoolDirectoryRunnable.run(SpoolDirectorySource.java:239)] The channel is full, and cannot write data now. The source will try again after 4000
curl localhost:5653/metrics
{"SINK.s3-sink":{"BatchCompleteCount":"0","ConnectionFailedCount":"1","EventDrainAttemptCount":"0","ConnectionCreatedCount":"0","Type":"SINK","BatchEmptyCount":"0","ConnectionClosedCount":"0","EventDrainSuccessCount":"0","StopTime":"0","StartTime":"1441560792619","BatchUnderflowCount":"0"},"SOURCE.spooling-directory":{"OpenConnectionCount":"0","Type":"SOURCE","AppendBatchAcceptedCount":"0","AppendBatchReceivedCount":"20","EventAcceptedCount":"0","AppendReceivedCount":"0","StopTime":"0","StartTime":"1441560792648","EventReceivedCount":"2000","AppendAcceptedCount":"0"},"CHANNEL.fileChannel":{"EventPutSuccessCount":"0","ChannelFillPercentage":"99.9988","Type":"CHANNEL","StopTime":"0","EventPutAttemptCount":"260","ChannelSize":"999988","StartTime":"1441560792613","EventTakeSuccessCount":"0","ChannelCapacity":"1000000","EventTakeAttemptCount":"1"}}
Any idea what I can do?
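For reference, this JSON metrics endpoint is served when the agent is started with HTTP monitoring enabled, for example:
$flume-ng agent --conf conf --conf-file flume-conf.conf --name a1 -Dflume.monitoring.type=http -Dflume.monitoring.port=5653
The ChannelFillPercentage of 99.9988 in the output above shows the file channel is completely full, and EventDrainSuccessCount of 0 shows the sink has not drained any events.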
I think your disk space is full. Could you please check the drive space where you are writing your .log data files?
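A quick check, assuming the file channel's directories live under /home/user/flume/.flume as configured above:
$df -h /home/user/flume/.flume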
Which channel are you using in flume-conf?
If I want to make the HDFS output file a certain size, like 10 MB, what should I change?
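For reference, rolling purely by size would look something like this; hdfs.rollSize is in bytes, and setting the other two triggers to 0 disables time- and count-based rolling:
a1.sinks.sink1.hdfs.rollSize = 10485760
a1.sinks.sink1.hdfs.rollCount = 0
a1.sinks.sink1.hdfs.rollInterval = 0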