What is TextInputFormat?

In TextInputFormat, each line in the text file is a record. Value is the content of the line while Key is the byte offset of the line. For instance, Key: longWritable, Value: text

In Hadoop, TextInputFormat is a class that is part of the Hadoop MapReduce framework. It is a specific input format used for reading plain text files in Hadoop MapReduce jobs.

Here’s a breakdown of what TextInputFormat does:

  1. Input Splitting: It divides the input text files into fixed-size splits (typically block-sized in HDFS) to enable parallel processing of these splits by different mapper tasks.
  2. Record Reader: It defines a record reader (LineRecordReader by default) that reads and parses each split into key-value pairs, where the key is the byte offset of the line in the file, and the value is the content of the line.
  3. Key-Value Pairs: In the context of TextInputFormat, each key represents the byte offset of the beginning of a line in the input file, and the corresponding value is the content of that line.

Using TextInputFormat is common when dealing with simple text data where each line is considered a record, and processing can be done on a per-line basis.

In a Hadoop MapReduce program, you typically specify the input format using the job.setInputFormatClass(TextInputFormat.class) method, and Hadoop takes care of the rest, splitting the input, assigning splits to mappers, and reading records within each split.