1. Pig data model
Bag: a table
Tuple: a row (record)
Field: an attribute (column)
Pig does not require the tuples in a bag to have the same number of fields, or fields of the same types.
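The data model can be pictured with a small Python sketch. This is only an analogy, not Pig itself: the bag is modelled as a list, each tuple as a Python tuple, and the rows are hypothetical values made up to show that tuples in one bag may differ in length and field type.

```python
# A bag modelled as a list of tuples; each tuple is one record and
# each element of a tuple is one field. All rows here are hypothetical.
bag = [
    (20130001, 80, 90),          # three int fields
    (20130002, "absent"),        # two fields, one of them a string
    (20130003, 60, 70, "note"),  # four fields
]

# Tuples in the same bag may differ in length and in field types.
field_counts = [len(t) for t in bag]
print(field_counts)
```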
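2. Common Pig Latin statements
1) LOAD: specifies how to load the data
2) FOREACH: scans the data row by row and applies some processing to each row
3) FILTER: filters rows
4) DUMP: prints the result to the screen
5) STORE: saves the result to a file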
3. A simple example
Suppose we have a score sheet containing a student number, a Chinese score, and a Math score, with the fields separated by |, as follows:
20130001|80|90
20130002|85|96
20130003|60|70
20130004|74|86
20130005|65|98
1) Upload the file from the local file system to Hadoop
[coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -put /home/coder/score.txt in
Check that the upload succeeded:
[coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -ls /user/coder/in
Found 1 items
-rw-r--r-- 2 coder supergroup 75 2013-04-20 14:33 /user/coder/in/score.txt
2) Load the raw data with LOAD
grunt> scores = LOAD 'hdfs://h1:9000/user/coder/in/score.txt' USING PigStorage('|') AS (num:int,Chinese:int,Math:int);
The input file is 'hdfs://h1:9000/user/coder/in/score.txt'.
The table (bag) is named scores.
Each record (tuple) read from the input file is split on |.
Each tuple has three fields: the student number (num), the Chinese score (Chinese), and the Math score (Math). All three fields have type int.
3) Inspect the structure of the table
grunt> DESCRIBE scores;
scores: {num: int,Chinese: int,Math: int}
4) Suppose we need to filter out the record whose student number is 20130005
grunt> filter_scores = FILTER scores BY num != 20130005;
View the records that remain after filtering:
grunt> dump filter_scores;
(20130001,80,90)
(20130002,85,96)
(20130003,60,70)
(20130004,74,86)
5) Compute each student's total score
grunt> totalScore = FOREACH scores GENERATE num,Chinese+Math;
View the result:
grunt> dump totalScore;
(20130001,170)
(20130002,181)
(20130003,130)
(20130004,160)
(20130005,163)
6) Write each student's total score to a file
grunt> store totalScore into 'hdfs://h1:9000/user/coder/out/result' using PigStorage('|');
View the result:
[coder@h1 ~]$ hadoop dfs -ls /user/coder/out/result
Found 2 items
drwxr-xr-x - coder supergroup 0 2013-04-20 15:54 /user/coder/out/result/_logs
-rw-r--r-- 2 coder supergroup 65 2013-04-20 15:54 /user/coder/out/result/part-m-00000
[coder@h1 ~]$ hadoop dfs -cat /user/coder/out/result/*
20130001|170
20130002|181
20130003|130
20130004|160
20130005|163
cat: Source must be a file.
The final "cat: Source must be a file." message appears because the wildcard also matches the _logs directory, which cat cannot print. The result itself is in part-m-00000.
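To make the data flow of steps 2) to 6) concrete, here is a rough Python analogue of the same pipeline. It is only a sketch of what each statement does to the sample rows, not how Pig executes them on Hadoop; the helper name load and the variable names are made up for illustration.

```python
# Sample rows from the score sheet, one record per line, fields separated by '|'.
raw = """20130001|80|90
20130002|85|96
20130003|60|70
20130004|74|86
20130005|65|98"""


def load(text, sep="|"):
    """Rough analogue of LOAD ... USING PigStorage('|') AS (num:int,Chinese:int,Math:int)."""
    return [tuple(int(field) for field in line.split(sep)) for line in text.splitlines()]


scores = load(raw)

# FILTER scores BY num != 20130005
filter_scores = [row for row in scores if row[0] != 20130005]

# FOREACH scores GENERATE num,Chinese+Math
total_score = [(num, chinese + maths) for num, chinese, maths in scores]

# STORE ... USING PigStorage('|'): join the fields of each tuple with '|'
stored = "\n".join("|".join(str(field) for field in row) for row in total_score)
print(stored)
```

Note that, as in the Pig session, the filter and the total are both computed from scores, so the record for 20130005 still appears in the totals.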
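Here is another small example.
We have a batch of files in the following format:
zhangsan#123456#zhangsan@qq.com
lisi#434dfdds#lisi@126.com
wangwu#ffere233#wangwu@163.com
zhouliu#fgrtr43#zhouliu@139.com
Each record has three fields: account name, password, and email address, separated by #. The task is to extract the email addresses from these files.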
1) Upload the file to Hadoop
[coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -put data.txt in
[coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -ls /user/coder/in
Found 1 items
-rw-r--r-- 2 coder supergroup 122 2013-04-24 20:34 /user/coder/in/data.txt
2) Load the raw data file
grunt> T_A = LOAD '/user/coder/in/data.txt' using PigStorage('#') as (username:chararray,password:chararray,email:chararray);
3) Extract the email field
grunt> T_B = FOREACH T_A GENERATE email;
4) Write the result to a file
grunt> STORE T_B INTO '/user/coder/out/email';
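5) View the result
[coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -cat /user/coder/out/email/*
zhangsan@qq.com
lisi@126.com
wangwu@163.com
zhouliu@139.com
cat: Source must be a file.
As before, the trailing "cat: Source must be a file." message most likely comes from the wildcard matching a subdirectory (such as _logs) in the output directory; the email addresses above are the actual result.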
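The email extraction can be sketched the same way in Python. Again this is only an analogy for what LOAD with PigStorage('#') and FOREACH ... GENERATE email do to the sample rows; the variable names t_a and t_b simply mirror T_A and T_B.

```python
# Sample records: account name, password, and email address separated by '#'.
raw = """zhangsan#123456#zhangsan@qq.com
lisi#434dfdds#lisi@126.com
wangwu#ffere233#wangwu@163.com
zhouliu#fgrtr43#zhouliu@139.com"""

# LOAD ... using PigStorage('#') as (username, password, email)
t_a = [tuple(line.split("#")) for line in raw.splitlines()]

# FOREACH T_A GENERATE email
t_b = [email for _username, _password, email in t_a]
print("\n".join(t_b))
```

One caveat of splitting on a single delimiter, in this sketch at least: a password that itself contains # would produce more than three fields, so the unpacking into three names would fail for that record.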