Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.3k views
in Technique[技术] by (71.8m points)

performance - Java Streams: How to do an efficient "distinct and sort"?

Let's assume I've got a Stream<T> and want to get only distinct elements and sorted.

The na?ve approach would be to do just the following:

Stream.of(...)
    .sorted()
    .distinct()

or, maybe the other way around:

Stream.of(...)
    .distinct()
    .sorted()

Since the implementation of both of them is not really accessible by the JDK's source code I was just wondering about possible memory consumption and performance implications.

Or would it be even more efficient to write my own filter as the following?

Stream.of(...)
    .sorted()
    .filter(noAdjacentDuplicatesFilter())

public static Predicate<Object> noAdjacentDuplicatesFilter() {
    final Object[] previousValue = {new Object()};

    return value -> {
        final boolean takeValue = !Objects.equals(previousValue[0], value);
        previousValue[0] = value;
        return takeValue;
    };
}
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

When you chain a distinct() operation after sorted(), the implementation will utilize the sorted nature of the data and avoid building an internal HashSet, which can be demonstrated by the following program

public class DistinctAndSort {
    static int COMPARE, EQUALS, HASHCODE;
    static class Tracker implements Comparable<Tracker> {
        static int SERIAL;
        int id;
        Tracker() {
            id=SERIAL++/2;
        }
        public int compareTo(Tracker o) {
            COMPARE++;
            return Integer.compare(id, o.id);
        }
        public int hashCode() {
            HASHCODE++;
            return id;
        }
        public boolean equals(Object obj) {
            EQUALS++;
            return super.equals(obj);
        }
    }
    public static void main(String[] args) {
        System.out.println("adjacent sorted() and distinct()");
        Stream.generate(Tracker::new).limit(100)
              .sorted().distinct()
              .forEachOrdered(o -> {});
        System.out.printf("compareTo: %d, EQUALS: %d, HASHCODE: %d%n",
                          COMPARE, EQUALS, HASHCODE);
        COMPARE=EQUALS=HASHCODE=0;
        System.out.println("now with intermediate operation");
        Stream.generate(Tracker::new).limit(100)
            .sorted().map(x -> x).distinct()
            .forEachOrdered(o -> {});
        System.out.printf("compareTo: %d, EQUALS: %d, HASHCODE: %d%n",
                          COMPARE, EQUALS, HASHCODE);
    }
}

which will print

adjacent sorted() and distinct()
compareTo: 99, EQUALS: 99, HASHCODE: 0
now with intermediate operation
compareTo: 99, EQUALS: 100, HASHCODE: 200

The intermediate operation, as simple as map(x -> x), can’t be recognized by the Stream implementation, hence, it must assume that the elements might not be sorted in respect to the mapping function’s result.

There is no guaranty that this kind of optimization happens, however, it is reasonable to assume that the developers of the Stream implementation will not remove that optimization and even try to add more optimizations, so rolling your own implementation will prevent your code from benefiting from future optimizations.

Further, what you have created is a “stateful predicate”, which is strongly discouraged, and, of course, will break when being used with a parallel stream.

If you don’t trust the Stream API to perform this operation efficiently enough, you might be better off implementing this particular operation without the Stream API.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...