MapReduce: A MapJoin Example

August 17, 2020
Notes
hadoop

Use cases
Advantages
Approach: using DistributedCache
Example
Requirements analysis
Code implementation

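Use cases

Map Join suits the case where one table is very small and the other is very large.

Advantages

Question: joining too many tables on the Reduce side easily causes data skew. What can be done?
Cache the small tables on the Map side and apply the join logic there, before any data reaches the Reduce side. This moves work to the Map side, relieves the pressure on the Reduce side, and reduces data skew as much as possible.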
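
The idea can be illustrated without Hadoop at all. The following is a minimal plain-Java sketch, assuming tab-separated lines shaped like the pd.txt and order.txt records used later in this post; the class name `HashJoinSketch` and the sample rows are hypothetical. The small table is loaded into a map once, and each record of the large table is joined with a single lookup, so no shuffle or Reduce step is needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of a map-side (hash) join; no Hadoop classes are involved.
final class HashJoinSketch {

    // Build the lookup table from the small side: pid -> pname.
    static Map<String, String> buildLookup(List<String> smallTableLines) {
        Map<String, String> lookup = new HashMap<>();
        for (String line : smallTableLines) {
            String[] fields = line.split("\t");
            lookup.put(fields[0], fields[1]);
        }
        return lookup;
    }

    // Stream the large side once; each record is joined with one map lookup.
    static List<String> join(List<String> largeTableLines, Map<String, String> lookup) {
        List<String> joined = new ArrayList<>();
        for (String line : largeTableLines) {
            String[] fields = line.split("\t"); // orderId, pid, amount
            joined.add(fields[0] + "\t" + lookup.get(fields[1]) + "\t" + fields[2]);
        }
        return joined;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows, not taken from the original data files.
        Map<String, String> lookup = buildLookup(Arrays.asList("01\tphone", "02\tlaptop"));
        for (String row : join(Arrays.asList("1001\t01\t1", "1002\t02\t2"), lookup)) {
            System.out.println(row);
        }
    }
}
```

In a real job, the lookup is built once per MapTask and the large table arrives split by split, which is what the Mapper further down does.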

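Approach: using `DistributedCache`

(1) In the Mapper's setup phase, read the file into an in-memory collection.
(2) In the driver, register the file with the distributed cache.

// Cache a regular file on the nodes where the tasks run.
job.addCacheFile(new URI("file:///e:/cache/pd.txt"));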
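
When the job runs on a cluster rather than in local mode, the cached file is, as far as I know, made available in the task's working directory under its file name, or under the URI fragment (the part after `#`) when one is given. The sketch below only shows how such a local name could be derived from the URI. The helper `CacheFileNames.localName` and the `hdfs://namenode:8020/...` address are hypothetical and not part of the original code.

```java
import java.net.URI;

// Sketch: derive the name under which a cached file would be opened locally.
final class CacheFileNames {

    // Prefer the URI fragment (e.g. ".../pd.txt#products" -> "products");
    // otherwise fall back to the last segment of the path.
    static String localName(URI uri) {
        if (uri.getFragment() != null) {
            return uri.getFragment();
        }
        String path = uri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        System.out.println(localName(URI.create("file:///e:/cache/pd.txt")));
        // Hypothetical cluster address, for illustration only.
        System.out.println(localName(URI.create("hdfs://namenode:8020/cache/pd.txt#products")));
    }
}
```

A Mapper could then open `new File(localName(uri))` instead of relying on the URI being a `file://` URI. Treat this as an assumption to verify against your Hadoop version.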

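Example

Each MapTask completes the join inside map().
Notes:

Only order.txt, the data to be joined, is split and read by the MapTasks.
pd.txt is not read as input splits. Instead, each MapTask downloads the file (from HDFS on a cluster) and then reads its contents manually with an input stream.
In general, the large file is read as input splits, and the small file is read manually before map() runs.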

order.txt --> input splits (orderId, pid, amount) --> MapJoinMapper.map()
pd.txt --> distributed cache, read manually (pid, pname) --> MapJoinMapper.setup()
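
The manual read of the small file can be sketched in isolation. This is a minimal example under stated assumptions: the class `SmallTableLoader` is hypothetical, the input stands in for pd.txt, and skipping blank or malformed lines is my own choice, not something the original specifies.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

// Sketch: load tab-separated "pid<TAB>pname" lines into a lookup map.
final class SmallTableLoader {

    // Blank lines and lines with fewer than two fields are skipped
    // (an assumption about how malformed input should be handled).
    static Map<String, String> load(Reader source) throws IOException {
        Map<String, String> pidToName = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue;
                }
                String[] fields = line.split("\t");
                if (fields.length < 2) {
                    continue;
                }
                pidToName.put(fields[0], fields[1]);
            }
        }
        return pidToName;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical content standing in for pd.txt.
        Map<String, String> products = load(new StringReader("01\tphone\n\n02\tlaptop\n"));
        System.out.println(products.get("02"));
    }
}
```

Taking a `Reader` rather than a file path keeps the parsing logic testable without a file system; in a Mapper the same method could be given a `FileReader`.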

Requirements analysis

MapJoin applies when one of the tables being joined is small.

Code implementation

JoinBean.java

public class JoinBean {
	
	private String orderId;
	private String pid;
	private String pname;
	private String amount;
	
	@Override
	public String toString() {
		return  orderId + "\t" +  pname + "\t" + amount ;
	}

	public String getOrderId() {
		return orderId;
	}

	public void setOrderId(String orderId) {
		this.orderId = orderId;
	}

	public String getPid() {
		return pid;
	}

	public void setPid(String pid) {
		this.pid = pid;
	}

	public String getPname() {
		return pname;
	}

	public void setPname(String pname) {
		this.pname = pname;
	}

	public String getAmount() {
		return amount;
	}

	public void setAmount(String amount) {
		this.amount = amount;
	}


}

MapJoinMapper.java

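import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/*
 * 1. Hadoop provides a distributed cache for MapReduce:
 * 			① It caches files that a Job needs while it runs (regular files, jars, archives (har)).
 * 			② In the Job's Configuration, a URI stands for each file to be cached.
 * 			③ The distributed cache assumes the file has already been uploaded to HDFS and that
 * 			   every machine in the cluster can access the file the URI points to.
 * 			④ Before any task of the Job runs on a node, the file is copied to that node.
 * 			⑤ It is efficient because each file is copied only once per Job, and archives
 * 			   can be unarchived automatically on the worker nodes.
 */
public class MapJoinMapper extends Mapper<LongWritable, Text, JoinBean, NullWritable>{

	private JoinBean out_key=new JoinBean();
	// pid -> pname, loaded manually from pd.txt before any call to map()
	private Map<String, String> pdDatas=new HashMap<String, String>();
	
	@Override
	protected void setup(Mapper<LongWritable, Text, JoinBean, NullWritable>.Context context)
			throws IOException, InterruptedException {
		
		// Get the files registered with the distributed cache
		URI[] files = context.getCacheFiles();
		
		for (URI uri : files) {
			
			// new File(uri) requires a file:// URI, as registered by the driver in local mode;
			// try-with-resources closes the reader even if a line fails to parse
			try (BufferedReader reader = new BufferedReader(new FileReader(new File(uri)))) {
				
				String line;
				
				// Read pd.txt line by line, stopping at the first blank line or at end of file
				while ((line = reader.readLine()) != null && !line.trim().isEmpty()) {
					
					String[] words = line.split("\t");
					
					pdDatas.put(words[0], words[1]);
					
				}
				
			}
			
		}
		
	}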
	
	// Join each order.txt record in the split with the cached pd.txt data and write it out
	@Override
	protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, JoinBean, NullWritable>.Context context)
			throws IOException, InterruptedException {
		
		String[] words = value.toString().split("\t");
		
		out_key.setOrderId(words[0]);
		out_key.setPname(pdDatas.get(words[1]));
		out_key.setAmount(words[2]);
		
		context.write(out_key, NullWritable.get());
			
	}
	
}
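
One behaviour of the Mapper above is worth noting: it writes `pdDatas.get(words[1])` directly, so an order whose pid is missing from pd.txt is emitted with the literal text "null" as its pname. If inner-join semantics are wanted instead, unmatched records can be dropped. The following is a minimal plain-Java sketch of that check; the class `JoinLookupSketch` and its sample values are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Sketch: inner-join lookup that drops orders with no matching product.
final class JoinLookupSketch {

    // Returns the joined line, or Optional.empty() when the line is malformed
    // or its pid is not present in the lookup table.
    static Optional<String> joinLine(String orderLine, Map<String, String> pidToName) {
        String[] fields = orderLine.split("\t"); // orderId, pid, amount
        if (fields.length < 3) {
            return Optional.empty();
        }
        String pname = pidToName.get(fields[1]);
        if (pname == null) {
            return Optional.empty();
        }
        return Optional.of(fields[0] + "\t" + pname + "\t" + fields[2]);
    }

    public static void main(String[] args) {
        Map<String, String> pidToName = new HashMap<>();
        pidToName.put("01", "phone"); // hypothetical sample data
        System.out.println(joinLine("1001\t01\t1", pidToName).isPresent());
        System.out.println(joinLine("1003\t09\t5", pidToName).isPresent());
    }
}
```

In map(), the equivalent would be to return early, without calling context.write, when the lookup yields no product name. Whether unmatched orders should be dropped or kept depends on the requirement.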

MapJoinDriver.java

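import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapJoinDriver {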
	
	public static void main(String[] args) throws Exception {
		
		Path inputPath=new Path("e:/mrinput/mapjoin");
		Path outputPath=new Path("e:/mroutput/mapjoin");
		

		// Configuration for the whole Job
		Configuration conf = new Configuration();
		// Make sure the output directory does not exist
		FileSystem fs=FileSystem.get(conf);
		
		if (fs.exists(outputPath)) {
			
			fs.delete(outputPath, true);
			
		}
		
		// ① Create the Job
		Job job = Job.getInstance(conf);
		
		job.setJarByClass(MapJoinDriver.class);
		
		
		// Give the Job a name
		job.setJobName("mapjoin");
		
		// ② Configure the Job
		// Set the Mapper class; no Reducer is set because this job is map-only
		job.setMapperClass(MapJoinMapper.class);
		
		// Set the input and output directories
		FileInputFormat.setInputPaths(job, inputPath);
		FileOutputFormat.setOutputPath(job, outputPath);
		
		// Register pd.txt with the distributed cache
		job.addCacheFile(new URI("file:///e:/pd.txt"));
		
		// Skip the reduce phase
		job.setNumReduceTasks(0);

		// ③ Run the Job
		job.waitForCompletion(true);
		
	}

}
