Hadoop CSV: Read with MapReduce & Hive
Hadoop has no dedicated CSV input format, but because CSV files are plain text they can be processed with Hadoop's MapReduce framework or with tools built on top of Hadoop such as Hive.
- Reading a CSV file with the MapReduce framework:
You can write a MapReduce program that reads the CSV file line by line. In the Mapper phase, each line is taken as input and split into fields; in the Reducer phase, the processed data can be aggregated and written to HDFS or another store (a Reducer sketch follows the Mapper example below).
- Reading CSV files using Hive:
Hive is a data warehouse tool built on top of Hadoop that lets users query and manipulate data with HiveQL, Hive's SQL dialect. You can create an external table over the directory containing the CSV file and then work with the data through ordinary Hive queries.
Example code for reading a CSV file using the MapReduce framework:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CSVReader {
    public static class CSVMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Process each line of the CSV file.
            String line = value.toString();
            String[] fields = line.split(",");
            // Skip malformed lines that do not have at least two fields.
            if (fields.length >= 2) {
                // Emit the first field as the key and the second as the value.
                context.write(new Text(fields[0]), new Text(fields[1]));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "CSVReader");
        job.setJarByClass(CSVReader.class);
        job.setMapperClass(CSVMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("input.csv"));
        FileOutputFormat.setOutputPath(job, new Path("output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
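As written, the job above configures no Reducer, so it runs map-only and the Mapper's output is written straight to the output directory. If you want the Reducer phase described earlier, one minimal sketch is shown below; CSVReducer is a hypothetical class that simply concatenates all values seen for each key, and it assumes the additional import org.apache.hadoop.mapreduce.Reducer plus a job.setReducerClass(CSVReducer.class) call in main:

// Requires: import org.apache.hadoop.mapreduce.Reducer;
// Nest this inside CSVReader alongside CSVMapper.
public static class CSVReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Concatenate every value emitted for this key, comma-separated.
        StringBuilder joined = new StringBuilder();
        for (Text value : values) {
            if (joined.length() > 0) {
                joined.append(",");
            }
            joined.append(value.toString());
        }
        context.write(key, new Text(joined.toString()));
    }
}

Package the class into a jar and submit it with the hadoop jar command; note that the output directory must not already exist on HDFS, or the job will fail at startup.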
Example code for reading a CSV file using Hive:
CREATE EXTERNAL TABLE my_table (
  col1 STRING,
  col2 STRING,
  col3 INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
-- LOCATION must point to the HDFS directory containing the CSV file(s),
-- not to an individual file.
LOCATION '/path/to/csv/dir';

SELECT * FROM my_table;
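The delimited format above splits on every comma and loads any header row as data. If your CSV has quoted fields containing commas or a header line, one possible variant is sketched below; it assumes your Hive version ships the built-in org.apache.hadoop.hive.serde2.OpenCSVSerde (available since Hive 0.14), and the table and path names are placeholders:

CREATE EXTERNAL TABLE my_table_csv (
  col1 STRING,
  col2 STRING,
  col3 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION '/path/to/csv/dir'
-- Skip the first line of each file (the header row).
TBLPROPERTIES ('skip.header.line.count' = '1');

Note that OpenCSVSerde exposes every column as STRING, so numeric columns such as col3 have to be cast in queries, e.g. SELECT CAST(col3 AS INT) FROM my_table_csv.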
With the two methods above, you can read CSV files on Hadoop and carry out the data processing you need: MapReduce gives you fine-grained programmatic control, while Hive lets you query the same files declaratively in SQL.