Java 8 –在流中查找重复的元素
程序员文章站
2022-07-12 09:27:51
...
本文介绍了三种在Stream中查找重复元素的算法。
-
Set.add()
-
Collectors.groupingBy
-
Collections.frequency
在本文的最后,我们使用JMH基准测试哪个是最快的算法。
1. Filter&Set.add()
如果元素已经在集合中,则Set.add()
返回false;否则,返回false。 让我们在文章末尾查看基准。
JavaDuplicated1.java
package com.mkyong;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
public class JavaDuplicated1 {
public static void main(String[] args) {
// 3, 4, 9
List<Integer> list = Arrays.asList(5, 3, 4, 1, 3, 7, 2, 9, 9, 4);
Set<Integer> result = findDuplicateBySetAdd(list);
result.forEach(System.out::println);
}
public static <T> Set<T> findDuplicateBySetAdd(List<T> list) {
Set<T> items = new HashSet<>();
return list.stream()
.filter(n -> !items.add(n)) // Set.add() returns false if the element was already in the set.
.collect(Collectors.toSet());
}
}
输出量
3
4
9
2. Map&Collectors.groupingBy
2.1通过Collectors.groupingBy
创建Map
,并找到计数> 1的元素。
JavaDuplicated2.java
package com.mkyong;
import java.util.*;
import java.util.function.Function;
import java.util.stream.Collectors;
public class JavaDuplicated2 {
public static void main(String[] args) {
// 3, 4, 9
List<Integer> list = Arrays.asList(5, 3, 4, 1, 3, 7, 2, 9, 9, 4);
Set<Integer> result = findDuplicateByGrouping(list);
result.forEach(System.out::println);
}
public static <T> Set<T> findDuplicateByGrouping(List<T> list) {
return list.stream()
.collect(Collectors.groupingBy(Function.identity()
, Collectors.counting())) // create a map {1=1, 2=1, 3=2, 4=2, 5=1, 7=1, 9=2}
.entrySet().stream() // Map -> Stream
.filter(m -> m.getValue() > 1) // if map value > 1, duplicate element
.map(Map.Entry::getKey)
.collect(Collectors.toSet());
}
}
输出量
3
4
9
3. Collections.frequency
它将每个项目与一个列表进行比较– Collections.frequency(list, i)
。
JavaDuplicated3.java
package com.mkyong;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
public class JavaDuplicated3 {
public static void main(String[] args) {
// 3, 4, 9
List<Integer> list = Arrays.asList(5, 3, 4, 1, 3, 7, 2, 9, 9, 4);
Set<Integer> result = findDuplicateByFrequency(list);
result.forEach(System.out::println);
}
public static <T> Set<T> findDuplicateByFrequency(List<T> list) {
return list.stream().filter(i -> Collections.frequency(list, i) > 1)
.collect(Collectors.toSet());
}
}
输出量
3
4
9
4. JMH基准
一个针对上述三种算法的简单JMH基准,可从大小为1000的Stream中查找重复元素。
pom.xml
<dependency>
<groupId>org.openjdk.jmh</groupId>
<artifactId>jmh-core</artifactId>
<version>1.23</version>
</dependency>
<dependency>
<groupId>org.openjdk.jmh</groupId>
<artifactId>jmh-generator-annprocess</artifactId>
<version>1.23</version>
</dependency>
BenchmarkFindDuplicate.java
package com.mkyong;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import java.util.*;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.stream.Collectors;
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
@Fork(value = 2, jvmArgs = {"-Xms4G", "-Xmx4G"})
public class BenchmarkFindDuplicate {
private List<Integer> DATA_FOR_TESTING;
@Setup
public void init() {
// random 1000 size
DATA_FOR_TESTING = new Random().ints(1000, 1, 1000)
.boxed()
.collect(Collectors.toList());
}
public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(BenchmarkFindDuplicate.class.getSimpleName())
.forks(1)
.build();
new Runner(opt).run();
}
@Benchmark
public void setAdd(Blackhole bh) {
Set<Integer> items = new HashSet<>();
Set<Integer> collect = DATA_FOR_TESTING.stream()
.filter(n -> !items.add(n))
.collect(Collectors.toSet());
bh.consume(collect);
}
@Benchmark
public void groupingBy(Blackhole bh) {
Set<Integer> collect = DATA_FOR_TESTING.stream()
.collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
.entrySet()
.stream()
.filter(m -> m.getValue() > 1)
.map(Map.Entry::getKey)
.collect(Collectors.toSet());
bh.consume(collect);
}
@Benchmark
public void frequency(Blackhole bh) {
Set<Integer> collect = DATA_FOR_TESTING.stream()
.filter(i -> Collections.frequency(DATA_FOR_TESTING, i) > 1)
.collect(Collectors.toSet());
bh.consume(collect);
}
}
输出量
# JMH version: 1.23
# VM version: JDK 11.0.6, OpenJDK 64-Bit Server VM, 11.0.6+10
# VM invoker: /usr/lib/jvm/adoptopenjdk-11-hotspot-amd64/bin/java
# VM options: -Xms4G -Xmx4G
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: com.mkyong.BenchmarkFindDuplicate.frequency
# Run progress: 0.00% complete, ETA 00:05:00
# Fork: 1 of 1
# Warmup Iteration 1: 0.827 ms/op
# Warmup Iteration 2: 0.821 ms/op
# Warmup Iteration 3: 0.812 ms/op
# Warmup Iteration 4: 0.822 ms/op
# Warmup Iteration 5: 0.822 ms/op
Iteration 1: 0.814 ms/op
Iteration 2: 0.810 ms/op
Iteration 3: 0.779 ms/op
Iteration 4: 0.776 ms/op
Iteration 5: 0.814 ms/op
Result "com.mkyong.BenchmarkFindDuplicate.frequency":
0.799 ±(99.9%) 0.075 ms/op [Average]
(min, avg, max) = (0.776, 0.799, 0.814), stdev = 0.019
CI (99.9%): [0.724, 0.874] (assumes normal distribution)
# JMH version: 1.23
# VM version: JDK 11.0.6, OpenJDK 64-Bit Server VM, 11.0.6+10
# VM invoker: /usr/lib/jvm/adoptopenjdk-11-hotspot-amd64/bin/java
# VM options: -Xms4G -Xmx4G
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: com.mkyong.BenchmarkFindDuplicate.groupingBy
# Run progress: 33.33% complete, ETA 00:03:20
# Fork: 1 of 1
# Warmup Iteration 1: 0.040 ms/op
# Warmup Iteration 2: 0.038 ms/op
# Warmup Iteration 3: 0.037 ms/op
# Warmup Iteration 4: 0.036 ms/op
# Warmup Iteration 5: 0.039 ms/op
Iteration 1: 0.039 ms/op
Iteration 2: 0.039 ms/op
Iteration 3: 0.039 ms/op
Iteration 4: 0.039 ms/op
Iteration 5: 0.039 ms/op
Result "com.mkyong.BenchmarkFindDuplicate.groupingBy":
0.039 ±(99.9%) 0.001 ms/op [Average]
(min, avg, max) = (0.039, 0.039, 0.039), stdev = 0.001
CI (99.9%): [0.038, 0.040] (assumes normal distribution)
# JMH version: 1.23
# VM version: JDK 11.0.6, OpenJDK 64-Bit Server VM, 11.0.6+10
# VM invoker: /usr/lib/jvm/adoptopenjdk-11-hotspot-amd64/bin/java
# VM options: -Xms4G -Xmx4G
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: com.mkyong.BenchmarkFindDuplicate.setAdd
# Run progress: 66.67% complete, ETA 00:01:40
# Fork: 1 of 1
# Warmup Iteration 1: 0.027 ms/op
# Warmup Iteration 2: 0.028 ms/op
# Warmup Iteration 3: 0.026 ms/op
# Warmup Iteration 4: 0.026 ms/op
# Warmup Iteration 5: 0.027 ms/op
Iteration 1: 0.026 ms/op
Iteration 2: 0.027 ms/op
Iteration 3: 0.028 ms/op
Iteration 4: 0.028 ms/op
Iteration 5: 0.028 ms/op
Result "com.mkyong.BenchmarkFindDuplicate.setAdd":
0.027 ±(99.9%) 0.003 ms/op [Average]
(min, avg, max) = (0.026, 0.027, 0.028), stdev = 0.001
CI (99.9%): [0.024, 0.031] (assumes normal distribution)
# Run complete. Total time: 00:05:01
REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.
Benchmark Mode Cnt Score Error Units
BenchmarkFindDuplicate.frequency avgt 5 0.799 ± 0.075 ms/op
BenchmarkFindDuplicate.groupingBy avgt 5 0.039 ± 0.001 ms/op
BenchmarkFindDuplicate.setAdd avgt 5 0.027 ± 0.003 ms/op
Process finished with exit code 0
在Java 8 Stream中,使用Set.Add()
过滤是查找重复元素的最快算法,因为它仅循环一次。
Set<T> items = new HashSet<>();
return list.stream()
.filter(n -> !items.add(n))
.collect(Collectors.toSet());
Collections.frequency
最慢,因为它会将每个项目与一个列表进行比较– Collections.frequency(list, i)
。 如果我们增加列表的大小,性能将会变慢。
return list.stream().filter(i -> Collections.frequency(list, i) > 1)
.collect(Collectors.toSet());
参考文献
From: https://mkyong.com/java8/java-8-find-duplicate-elements-in-a-stream/
上一篇: [python] 查找列表中重复的元素
下一篇: 查找数组中重复的元素