A quick record of running my first Hadoop WordCount job :)
1. Create the HDFS directory
# all intermediate directories are created automatically
hduser@hadoop-master:/usr/local/hadoop$ hadoop dfs -mkdir /home/hduser/wordcount
2. Copy the documents to analyze (a local directory) into the HDFS directory
# syntax: hadoop dfs -copyFromLocal <localsrc> <dst>
# copy the files in
hduser@hadoop-master:/usr/local/hadoop$ hadoop dfs -copyFromLocal /home/hduser/wordcount /home/hduser/wordcount
# list the copied files
hduser@hadoop-master:/usr/local/hadoop$ hadoop dfs -ls /home/hduser/wordcount
Warning: $HADOOP_HOME is deprecated.
Found 4 items
drwxr-xr-x - hduser supergroup 0 2012-08-14 16:08 /home/hduser/wordcount/output
-rw-r--r-- 1 hduser supergroup 674566 2012-08-14 16:02 /home/hduser/wordcount/pg20417.txt
-rw-r--r-- 1 hduser supergroup 1573150 2012-08-14 16:02 /home/hduser/wordcount/pg4300.txt
-rw-r--r-- 1 hduser supergroup 1423801 2012-08-14 16:02 /home/hduser/wordcount/pg5000.txt
3. Run the WordCount example
hduser@hadoop-master:/usr/local/hadoop$ hadoop jar hadoop*examples*.jar wordcount /home/hduser/wordcount /home/hduser/wordcount/output
Warning: $HADOOP_HOME is deprecated.
12/08/14 16:07:28 INFO input.FileInputFormat: Total input paths to process : 3
12/08/14 16:07:28 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/08/14 16:07:28 WARN snappy.LoadSnappy: Snappy native library not loaded
12/08/14 16:07:29 INFO mapred.JobClient: Running job: job_201208141548_0002
12/08/14 16:07:30 INFO mapred.JobClient: map 0% reduce 0%
12/08/14 16:08:08 INFO mapred.JobClient: map 49% reduce 0%
12/08/14 16:08:10 INFO mapred.JobClient: map 64% reduce 0%
12/08/14 16:08:18 INFO mapred.JobClient: map 66% reduce 0%
12/08/14 16:08:33 INFO mapred.JobClient: map 100% reduce 0%
12/08/14 16:08:39 INFO mapred.JobClient: map 100% reduce 22%
12/08/14 16:08:48 INFO mapred.JobClient: map 100% reduce 100%
12/08/14 16:08:54 INFO mapred.JobClient: Job complete: job_201208141548_0002
12/08/14 16:08:55 INFO mapred.JobClient: Counters: 29
12/08/14 16:08:55 INFO mapred.JobClient: Job Counters
12/08/14 16:08:55 INFO mapred.JobClient: Launched reduce tasks=1
12/08/14 16:08:55 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=100525
12/08/14 16:08:55 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/08/14 16:08:55 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/08/14 16:08:55 INFO mapred.JobClient: Launched map tasks=3
12/08/14 16:08:55 INFO mapred.JobClient: Data-local map tasks=3
12/08/14 16:08:55 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=30107
12/08/14 16:08:55 INFO mapred.JobClient: File Output Format Counters
12/08/14 16:08:55 INFO mapred.JobClient: Bytes Written=880838
12/08/14 16:08:55 INFO mapred.JobClient: FileSystemCounters
12/08/14 16:08:55 INFO mapred.JobClient: FILE_BYTES_READ=2214849
12/08/14 16:08:55 INFO mapred.JobClient: HDFS_BYTES_READ=3671878
12/08/14 16:08:55 INFO mapred.JobClient: FILE_BYTES_WRITTEN=3775567
12/08/14 16:08:55 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=880838
12/08/14 16:08:55 INFO mapred.JobClient: File Input Format Counters
12/08/14 16:08:55 INFO mapred.JobClient: Bytes Read=3671517
12/08/14 16:08:55 INFO mapred.JobClient: Map-Reduce Framework
12/08/14 16:08:55 INFO mapred.JobClient: Map output materialized bytes=1474341
12/08/14 16:08:55 INFO mapred.JobClient: Map input records=77932
12/08/14 16:08:55 INFO mapred.JobClient: Reduce shuffle bytes=1474341
12/08/14 16:08:55 INFO mapred.JobClient: Spilled Records=255962
12/08/14 16:08:55 INFO mapred.JobClient: Map output bytes=6076095
12/08/14 16:08:55 INFO mapred.JobClient: CPU time spent (ms)=13590
12/08/14 16:08:55 INFO mapred.JobClient: Total committed heap usage (bytes)=616169472
12/08/14 16:08:55 INFO mapred.JobClient: Combine input records=629172
12/08/14 16:08:55 INFO mapred.JobClient: SPLIT_RAW_BYTES=361
12/08/14 16:08:55 INFO mapred.JobClient: Reduce input records=102322
12/08/14 16:08:55 INFO mapred.JobClient: Reduce input groups=82335
12/08/14 16:08:55 INFO mapred.JobClient: Combine output records=102322
12/08/14 16:08:55 INFO mapred.JobClient: Physical memory (bytes) snapshot=594595840
12/08/14 16:08:55 INFO mapred.JobClient: Reduce output records=82335
12/08/14 16:08:55 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2085924864
12/08/14 16:08:55 INFO mapred.JobClient: Map output records=629172
4. cat the output
hduser@hadoop-master:/usr/local/hadoop/bin$ hadoop dfs -cat /home/hduser/wordcount/output/part-r-00000
5. Copy the output back to the local machine
# getmerge merges every part file in the HDFS directory into one local file
hduser@hadoop-master:~/wordcount/output$ hadoop dfs -getmerge /home/hduser/wordcount/output ./
12/08/14 17:16:40 INFO util.NativeCodeLoader: Loaded the native-hadoop library
hduser@hadoop-master:~/wordcount$ ls
output
# preview the merged output
hduser@hadoop-master:~/wordcount/output$ head output
"(Lo)cra" 1
"1490 1
"1498," 1
"35" 1
"40," 1
"A 2
"AS-IS". 1
"A_ 1
"Absoluti 1
"Alack! 1
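The output is plain text, one word and its count per line separated by a tab, sorted by key — which is why punctuation-prefixed tokens like `"(Lo)cra"` come first: WordCount splits on whitespace only and does not strip punctuation. Reading the merged file back is straightforward; a small Python sketch, assuming the `output` file produced by step 5:

```python
def parse_wordcount(lines):
    # each line is "word\tcount", as written by Hadoop's TextOutputFormat
    counts = {}
    for line in lines:
        word, _, count = line.rstrip("\n").rpartition("\t")
        if word:
            counts[word] = int(count)
    return counts

sample = ['"(Lo)cra"\t1', '"1490\t1', '"A\t2']
counts = parse_wordcount(sample)
print(counts['"A'])  # 2
```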
Reference:
http://wiki.apache.org/hadoop/WordCount