
A Simple Data Processing Example

Date: January 29, 2015  Author: zhjw  Source: Internet  Views: 685


1. The Pig data model

Bag: a table

Tuple: a row, i.e. a record

Field: an attribute

Pig does not require the Tuples in a Bag to have the same number of Fields or Fields of the same type.
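To make the last point concrete, here is a minimal Python sketch (plain Python, not Pig itself) that models a Bag as a list and each Tuple as a Python tuple. The field values and the helper name describe are invented for illustration only.

```python
# A Bag is modelled as a list, each Tuple as a Python tuple.
# All values below are made up for illustration.
bag = [
    (20130001, 80, 90),          # three int Fields
    (20130002, "absent"),        # two Fields, one of them a string
    (20130003, 60.5, 70, "ok"),  # four Fields of mixed types
]

def describe(bag):
    """Return (field count, field type names) for every Tuple in the Bag."""
    return [(len(t), [type(f).__name__ for f in t]) for t in bag]

for count, types in describe(bag):
    print(count, types)
```

Every Tuple here has a different number of Fields and different Field types, yet all of them sit in the same Bag.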

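2. Common Pig Latin statements

1) LOAD: specifies how to load the data

2) FOREACH: scans the data row by row and applies some processing to each row

3) FILTER: filters rows

4) DUMP: prints the result to the screen

5) STORE: saves the result to a file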
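The five statements above can be pictured as ordinary list operations. The following is a rough Python analogy, not how Pig is implemented; the function names and the small comma-separated input are assumptions chosen for this sketch.

```python
import io

def load(lines, sep):
    """LOAD: split each input line into a tuple of string fields."""
    return [tuple(line.rstrip("\n").split(sep)) for line in lines]

def foreach(bag, fn):
    """FOREACH: apply fn to every row and collect the generated rows."""
    return [fn(row) for row in bag]

def filter_rows(bag, pred):
    """FILTER: keep only the rows for which pred is true."""
    return [row for row in bag if pred(row)]

def dump(bag):
    """DUMP: print every row to the screen."""
    for row in bag:
        print(row)

def store(bag, out, sep):
    """STORE: write every row to a file-like object, one line per row."""
    for row in bag:
        out.write(sep.join(str(f) for f in row) + "\n")

# Hypothetical two-column input: a name and a quantity.
rows = load(["apple,3\n", "pear,0\n", "plum,7\n"], ",")
in_stock = filter_rows(rows, lambda r: int(r[1]) > 0)
doubled = foreach(in_stock, lambda r: (r[0], int(r[1]) * 2))
dump(doubled)

sink = io.StringIO()  # stands in for an output file
store(doubled, sink, ",")
```

One difference worth keeping in mind: this sketch runs each step immediately, whereas Pig only builds up a plan and executes it when a DUMP or STORE is reached.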

3. A simple example:

Suppose there is a score sheet containing a student number, a Chinese score, and a math score, with the fields separated by |, as follows:

20130001|80|90
20130002|85|96
20130003|60|70
20130004|74|86
20130005|65|98

1) Upload the file from the local file system to Hadoop (HDFS)

[coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -put /home/coder/score.txt in

Check that the upload succeeded:

[coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -ls /user/coder/in
Found 1 items
-rw-r--r--   2 coder supergroup         75 2013-04-20 14:33 /user/coder/in/score.txt

2) Load the raw data with LOAD

grunt> scores = LOAD 'hdfs://h1:9000/user/coder/in/score.txt' USING PigStorage('|') AS (num:int,Chinese:int,Math:int);

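The input file is 'hdfs://h1:9000/user/coder/in/score.txt'.

The table (Bag) is named scores.

Each line read from the input file becomes a Tuple whose fields are split on |.

Each Tuple has three fields: the student number (num), the Chinese score (Chinese), and the math score (Math), all of type int.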
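As a rough illustration of what this LOAD does to each line, the Python sketch below splits on | and casts every field to int. The helper parse_line is hypothetical, and mapping a missing or malformed field to None is an assumption made to mirror Pig's null; check the Pig documentation for the exact behaviour of your version.

```python
def parse_line(line, sep="|", types=(int, int, int)):
    """Split one line on sep and cast each field; None marks a failed cast."""
    fields = line.rstrip("\n").split(sep)
    row = []
    for i, cast in enumerate(types):
        try:
            row.append(cast(fields[i]))
        except (IndexError, ValueError):
            row.append(None)  # missing or malformed field (assumed null)
    return tuple(row)

# The five lines of score.txt from the example above.
raw = [
    "20130001|80|90",
    "20130002|85|96",
    "20130003|60|70",
    "20130004|74|86",
    "20130005|65|98",
]
scores = [parse_line(line) for line in raw]
for row in scores:
    print(row)
```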

3) Inspect the table structure

grunt> DESCRIBE scores;
scores: {num: int,Chinese: int,Math: int}

4) Suppose we need to filter out the record whose student number is 20130005

grunt> filter_scores = FILTER scores BY num != 20130005;

View the filtered records:

grunt> dump filter_scores;
(20130001,80,90)
(20130002,85,96)
(20130003,60,70)
(20130004,74,86)
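The filter step can be sketched in Python as follows. The sketch also models one detail that is easy to overlook: to the best of my understanding, a Pig comparison against a null value yields null rather than true, so FILTER drops such rows. The helper names and the extra row with a missing student number are made up for illustration.

```python
scores = [
    (20130001, 80, 90),
    (20130002, 85, 96),
    (20130003, 60, 70),
    (20130004, 74, 86),
    (20130005, 65, 98),
]

def not_equal(value, target):
    """Three-valued comparison: None (unknown) when the value is missing."""
    if value is None:
        return None
    return value != target

def filter_bag(bag, cond):
    """Keep a row only when the condition is exactly True."""
    return [row for row in bag if cond(row) is True]

def keep(row):
    """Condition corresponding to: num != 20130005"""
    return not_equal(row[0], 20130005)

filter_scores = filter_bag(scores, keep)
for row in filter_scores:
    print(row)

# Hypothetical extra row with a missing student number: it is dropped too.
with_missing = scores + [(None, 50, 50)]
assert filter_bag(with_missing, keep) == filter_scores
```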

5) Compute each student's total score

grunt> totalScore = FOREACH scores GENERATE num,Chinese+Math;

View the result:

grunt> dump totalScore;

(20130001,170)
(20130002,181)
(20130003,130)
(20130004,160)
(20130005,163)

　　

6) Write each student's total score to a file

grunt> store totalScore into 'hdfs://h1:9000/user/coder/out/result' using PigStorage('|');

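View the result. The trailing "cat: Source must be a file." message appears because the * glob also matches the _logs directory, which cat cannot print: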


[coder@h1 ~]$ hadoop dfs -ls /user/coder/out/result
Found 2 items
drwxr-xr-x   - coder supergroup          0 2013-04-20 15:54 /user/coder/out/result/_logs
-rw-r--r--   2 coder supergroup         65 2013-04-20 15:54 /user/coder/out/result/part-m-00000
[coder@h1 ~]$ ^C
[coder@h1 ~]$ hadoop dfs -cat /user/coder/out/result/*
20130001|170
20130002|181
20130003|130
20130004|160
20130005|163
cat: Source must be a file.
[coder@h1 ~]$
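As a cross-check of the listing above, the following Python sketch recomputes the totals and serialises them with | as the delimiter, one row per line. It is only a model of the FOREACH and STORE steps, with variable names chosen for this example, but it reproduces both the five output lines and the 65 bytes reported for part-m-00000.

```python
scores = [
    (20130001, 80, 90),
    (20130002, 85, 96),
    (20130003, 60, 70),
    (20130004, 74, 86),
    (20130005, 65, 98),
]

# FOREACH scores GENERATE num, Chinese+Math
total_score = [(num, chinese + maths) for num, chinese, maths in scores]

# STORE ... USING PigStorage('|'): one line per row, fields joined by '|'
output = "".join("|".join(str(f) for f in row) + "\n" for row in total_score)

print(output, end="")
print(len(output.encode("utf-8")), "bytes")
```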


Here is another small example:

Suppose there is a batch of files in the following format:

zhangsan#123456#zhangsan@qq.com
lisi#434dfdds#lisi@126.com
wangwu#ffere233#wangwu@163.com
zhouliu#fgrtr43#zhouliu@139.com

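Each record has three fields: account name, password, and email address, separated by #. The task is to extract the email addresses from these files.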

　　

1) Upload the file to Hadoop

[coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -put data.txt in

[coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -ls /user/coder/in
Found 1 items
-rw-r--r--   2 coder supergroup        122 2013-04-24 20:34 /user/coder/in/data.txt
[coder@h1 hadoop-0.20.2]$

2) Load the raw data file

grunt> T_A = LOAD '/user/coder/in/data.txt' using PigStorage('#') as (username:chararray,password:chararray,email:chararray);

3) Extract the email field

grunt> T_B = FOREACH T_A GENERATE email;

4) Write the result to a file

grunt> STORE T_B INTO '/user/coder/out/email';

5) View the result

[coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -cat /user/coder/out/email/*
zhangsan@qq.com
lisi@126.com
wangwu@163.com
zhouliu@139.com
cat: Source must be a file.
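The second example boils down to a projection of the third field. A minimal Python sketch of the same steps is shown below; the variable names mirror the Pig aliases but are otherwise arbitrary. Because the STORE above has no USING clause, the sketch assumes the default tab delimiter, which never shows up here since each row has a single field.

```python
raw = [
    "zhangsan#123456#zhangsan@qq.com",
    "lisi#434dfdds#lisi@126.com",
    "wangwu#ffere233#wangwu@163.com",
    "zhouliu#fgrtr43#zhouliu@139.com",
]

# LOAD ... USING PigStorage('#') AS (username, password, email)
t_a = [tuple(line.split("#")) for line in raw]

# FOREACH T_A GENERATE email: keep only the third field of each row
t_b = [(email,) for _username, _password, email in t_a]

# STORE without USING: fields joined by a tab (assumed default delimiter)
output = "".join("\t".join(row) + "\n" for row in t_b)
print(output, end="")
```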
