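Sorting is used heavily in Spark applications, and the number of sort dimensions varies: secondary sort, tertiary sort, and so on. It also comes up often in machine learning algorithms, so it is an essential technique to master.
Secondary sort means sorting by the values of two columns: rows are ordered by the first column, and rows with equal first values are ordered by the second column. Take the following test data: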
2 3
4 1
3 2
4 3
8 7
2 1
Result after the secondary sort (ascending):
2 1
2 3
3 2
4 1
4 3
8 7
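To make the ordering rule concrete, here is a minimal plain-Java sketch that reproduces the result above without Spark. The class name SecondarySortSketch and its helper methods are hypothetical names chosen for illustration; it uses the standard Comparator chaining from java.util rather than the custom key class developed later.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Plain-Java sketch of a two-column (secondary) sort; no Spark involved.
class SecondarySortSketch {

    // The six sample rows from the text, as {first, second} pairs.
    static List<int[]> sampleData() {
        List<int[]> rows = new ArrayList<>();
        rows.add(new int[] {2, 3});
        rows.add(new int[] {4, 1});
        rows.add(new int[] {3, 2});
        rows.add(new int[] {4, 3});
        rows.add(new int[] {8, 7});
        rows.add(new int[] {2, 1});
        return rows;
    }

    // Returns a new list sorted ascending by column 0, with ties broken by column 1.
    static List<int[]> sortPairs(List<int[]> rows) {
        List<int[]> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.<int[]>comparingInt(r -> r[0]).thenComparingInt(r -> r[1]));
        return sorted;
    }

    public static void main(String[] args) {
        for (int[] row : sortPairs(sampleData())) {
            System.out.println(row[0] + " " + row[1]);
        }
    }
}
```

The comparator only consults the second column when the first columns are equal, which is exactly the rule a secondary sort key has to encode.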
Before writing the secondary sort code, here is a simple sort on a single key:
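import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName("SortByKey").setMaster("local")
val sc = new SparkContext(conf)
val lines = sc.textFile("C:\\User\\Test.txt")
// Assumed step (lost in the source): split each line on spaces and count every word
val words = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
// Swap to (count, word), sort by count in descending order, then swap back to (word, count)
val wordcount = words.map(word => (word._2, word._1)).sortByKey(false).map(word => (word._2, word._1))
wordcount.collect().foreach(println)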
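The above is a simple word count program that sorts with sortByKey. Because sortByKey orders by the key of each pair, the pairs are swapped to (count, word) before sorting and swapped back afterwards; passing false sorts in descending order.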
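As a rough analogue of that word count and its descending sort, the following plain-Java sketch does the same thing on an in-memory list, without Spark. The class name WordCountSortSketch, the method countAndSort, and the sample lines in main are all hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java analogue of "count words, then order by count descending"; no Spark involved.
class WordCountSortSketch {

    // Counts whitespace-separated words and returns (word, count) entries,
    // highest count first. Ties keep first-seen order because List.sort is stable.
    static List<Map.Entry<String, Integer>> countAndSort(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Order by the count (the entry value), descending.
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return entries;
    }

    public static void main(String[] args) {
        // Hypothetical input lines, standing in for the contents of the text file.
        List<String> lines = Arrays.asList("spark hadoop spark", "hive spark hadoop");
        for (Map.Entry<String, Integer> e : countAndSort(lines)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Here the comparator reads the count directly, so no swap is needed; in the Spark version the swap exists only because sortByKey can see nothing but the key.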
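First, we implement the secondary sort of the test data above in Java.
The most important part of sorting is preparing the key, so we start by writing the secondary sort key in Java. Reference code follows: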
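import java.io.Serializable;
import scala.math.Ordered;
// Custom key for the secondary sort: compared by first, then by second
public class SecondarySortKey implements Ordered<SecondarySortKey>, Serializable {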
    // The two columns to sort by
    private int first;
    private int second;
    // hashCode and equals are both derived from the two fields
    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + first;
        result = prime * result + second;
        return result;
    }
    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        SecondarySortKey other = (SecondarySortKey) obj;
        if (first != other.first)
            return false;
        if (second != other.second)
            return false;
        return true;
    }
    public int getFirst() {
        return first;
    }
    public void setFirst(int first) {
        this.first = first;
    }
    public int getSecond() {
        return second;
    }
    public void setSecond(int second) {
        this.second = second;
    }
    public SecondarySortKey(int first, int second) {
        this.first = first;
        this.second = second;
    }
    // Scala's ">" operator: true if this key sorts after other
    public boolean $greater(SecondarySortKey other) {
        if (this.first > other.getFirst()) {
            return true;
        } else if (this.first == other.getFirst() && this.second > other.getSecond()) {
            return true;