Word frequency count Java 8

Question

How can I count the frequency of the words in a List in Java 8?

List<String> wordsList = Lists.newArrayList("hello", "bye", "ciao", "bye", "ciao");

The result must be:

{ciao=2, hello=1, bye=2}

Mouna · Accepted Answer

I want to share the solution I found, because at first I expected to use map-and-reduce methods, but it turned out to be a bit different.

Map<String, Long> collect = wordsList.stream().collect(groupingBy(Function.identity(), counting()));

Or for Integer values:

Map<String, Integer> collect = wordsList.stream().collect(groupingBy(Function.identity(), summingInt(e -> 1)));
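For readers who want to run the two one-liners above, here is a minimal self-contained sketch. It assumes static imports of `groupingBy`, `counting`, and `summingInt` from `java.util.stream.Collectors`, which the snippets rely on implicitly. The class name `WordFrequency`, the method names, and the use of `Arrays.asList` instead of Guava's `Lists.newArrayList` are choices made for this illustration only.

```java
import static java.util.stream.Collectors.counting;
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.summingInt;

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical class name, used only for this illustration.
class WordFrequency {

    // Groups equal words and counts them; counting() always yields Long values.
    static Map<String, Long> countAsLong(List<String> words) {
        return words.stream().collect(groupingBy(Function.identity(), counting()));
    }

    // Same grouping, but sums 1 per element so the values are Integer.
    static Map<String, Integer> countAsInt(List<String> words) {
        return words.stream().collect(groupingBy(Function.identity(), summingInt(e -> 1)));
    }

    public static void main(String[] args) {
        List<String> wordsList = Arrays.asList("hello", "bye", "ciao", "bye", "ciao");
        System.out.println(countAsLong(wordsList));
        System.out.println(countAsInt(wordsList));
    }
}
```

The iteration order of the printed maps is not guaranteed, since `groupingBy` returns a plain `Map` by default.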

EDIT

I am adding how to sort the map by value:

LinkedHashMap<String, Long> countByWordSorted = collect.entrySet()
        .stream()
        .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
        .collect(Collectors.toMap(
                Map.Entry::getKey,
                Map.Entry::getValue,
                // Keys are already unique, so a merge should never happen
                (v1, v2) -> {
                    throw new IllegalStateException();
                },
                LinkedHashMap::new
        ));
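The sort above leaves the relative order of words with equal counts unspecified when the source map is a `HashMap`. A minimal sketch of one way to make the result deterministic follows; the alphabetical tie-breaker, the class name `SortedWordCounts`, and the method name are assumptions added for this illustration and are not part of the answer.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical class name, used only for this illustration.
class SortedWordCounts {

    // Sorts by count (highest first); ties are broken alphabetically by word.
    static LinkedHashMap<String, Long> sortByCountDescending(Map<String, Long> counts) {
        Comparator<Map.Entry<String, Long>> byCountDescThenWord =
                Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                        .thenComparing(Map.Entry.comparingByKey());

        return counts.entrySet().stream()
                .sorted(byCountDescThenWord)
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        Map.Entry::getValue,
                        // Keys of a map are unique, so this merge function is never expected to run.
                        (v1, v2) -> { throw new IllegalStateException("duplicate key"); },
                        // LinkedHashMap preserves the sorted insertion order.
                        LinkedHashMap::new));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new HashMap<>();
        counts.put("hello", 1L);
        counts.put("bye", 2L);
        counts.put("ciao", 2L);
        System.out.println(sortByCountDescending(counts)); // prints {bye=2, ciao=2, hello=1}
    }
}
```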

Marco13 · Answer

(NOTE: See the edits below)

As an alternative to Mouna's answer, here is an approach that counts the words in parallel:

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ParallelWordCount {
    public static void main(String[] args) {
        List<String> list = Arrays.asList(
                "hello", "bye", "ciao", "bye", "ciao");
        // Each word maps to 1; values for equal words are merged with Integer::sum
        Map<String, Integer> counts = list.parallelStream()
                .collect(Collectors.toConcurrentMap(
                        w -> w, w -> 1, Integer::sum));
        System.out.println(counts);
    }
}

EDIT In response to the comment, I ran a small test with JMH, comparing the toConcurrentMap and groupingByConcurrent approaches with different input list sizes and random words of different lengths. This test suggested that the toConcurrentMap approach was faster. Considering how different these approaches are "under the hood", it is hard to predict something like this.
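To make the two parallel collectors compared in this test concrete, here is a minimal side-by-side sketch. The class and method names are hypothetical and chosen for this illustration; only `Collectors.toConcurrentMap` and `Collectors.groupingByConcurrent` come from the JDK. The first maps every word to 1 and merges values with `Integer::sum`; the second groups equal words and applies `counting()` as a downstream collector, so the value types differ (`Integer` vs. `Long`) while the counts should agree.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical class name, used only for this illustration.
class ConcurrentCountComparison {

    // Every word becomes the value 1; values for equal keys are merged with Integer::sum.
    static Map<String, Integer> viaToConcurrentMap(List<String> words) {
        return words.parallelStream()
                .collect(Collectors.toConcurrentMap(w -> w, w -> 1, Integer::sum));
    }

    // Equal words are grouped under one key; counting() tallies each group as a Long.
    static Map<String, Long> viaGroupingByConcurrent(List<String> words) {
        return words.parallelStream()
                .collect(Collectors.groupingByConcurrent(
                        Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("hello", "bye", "ciao", "bye", "ciao");
        System.out.println(viaToConcurrentMap(list));
        System.out.println(viaGroupingByConcurrent(list));
    }
}
```

This sketch only shows that both variants produce the same counts; it says nothing about their relative speed, which is what the JMH run measures.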

As a further extension, based on additional comments, I extended the test to cover all four combinations of toMap and groupingBy, serial and parallel.

The results still show that the toMap approach is faster, but unexpectedly (at least for me) the "concurrent" versions are slower than the serial versions in both cases:

(method)              (count)  (wordLength)  Mode  Cnt     Score      Error  Units
toConcurrentMap          1000             2  avgt   50   146,636 ±  0,880  us/op
toConcurrentMap          1000             5  avgt   50   272,762 ±  1,232  us/op
toConcurrentMap          1000            10  avgt   50   271,121 ±  1,125  us/op
toMap                    1000             2  avgt   50    44,396 ±  0,541  us/op
toMap                    1000             5  avgt   50    46,938 ±  0,872  us/op
toMap                    1000            10  avgt   50    46,180 ±  0,557  us/op
groupingBy               1000             2  avgt   50    46,797 ±  1,181  us/op
groupingBy               1000             5  avgt   50    68,992 ±  1,537  us/op
groupingBy               1000            10  avgt   50    68,636 ±  1,349  us/op
groupingByConcurrent     1000             2  avgt   50   231,458 ±  0,658  us/op
groupingByConcurrent     1000             5  avgt   50   438,975 ±  1,591  us/op
groupingByConcurrent     1000            10  avgt   50   437,765 ±  1,139  us/op
toConcurrentMap         10000             2  avgt   50   712,113 ±  6,340  us/op
toConcurrentMap         10000             5  avgt   50  1809,356 ±  9,344  us/op
toConcurrentMap         10000            10  avgt   50  1813,814 ± 16,190  us/op
toMap                   10000             2  avgt   50   341,004 ± 16,074  us/op
toMap                   10000             5  avgt   50   535,122 ± 24,674  us/op
toMap                   10000            10  avgt   50   511,186 ±  3,444  us/op
groupingBy              10000             2  avgt   50   340,984 ±  6,235  us/op
groupingBy              10000             5  avgt   50   708,553 ±  6,369  us/op
groupingBy              10000            10  avgt   50   712,858 ± 10,248  us/op
groupingByConcurrent    10000             2  avgt   50   901,842 ±  8,685  us/op
groupingByConcurrent    10000             5  avgt   50  3762,478 ± 21,408  us/op
groupingByConcurrent    10000            10  avgt   50  3795,530 ± 32,096  us/op

I am not very familiar with JMH, so I may have made a mistake here. Any suggestions or corrections are welcome:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.stream.Collectors;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
public class ParallelWordCount {

    @Param({"toConcurrentMap", "toMap", "groupingBy", "groupingByConcurrent"})
    public String method;

    @Param({"2", "5", "10"})
    public int wordLength;

    @Param({"1000", "10000"})
    public int count;

    private List<String> list;

    @Setup
    public void initList() {
        // Fixed seed so that every run sees the same input
        list = createRandomStrings(count, wordLength, new Random(0));
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    public void testMethod(Blackhole bh) {
        if (method.equals("toMap")) {
            Map<String, Integer> counts = list.stream().collect(
                    Collectors.toMap(
                            w -> w, w -> 1, Integer::sum));
            bh.consume(counts);
        } else if (method.equals("toConcurrentMap")) {
            Map<String, Integer> counts = list.parallelStream().collect(
                    Collectors.toConcurrentMap(
                            w -> w, w -> 1, Integer::sum));
            bh.consume(counts);
        } else if (method.equals("groupingBy")) {
            Map<String, Long> counts = list.stream().collect(
                    Collectors.groupingBy(
                            Function.identity(), Collectors.<String>counting()));
            bh.consume(counts);
        } else if (method.equals("groupingByConcurrent")) {
            Map<String, Long> counts = list.parallelStream().collect(
                    Collectors.groupingByConcurrent(
                            Function.identity(), Collectors.<String>counting()));
            bh.consume(counts);
        }
    }

    // Builds a random lowercase word of the given length
    private static String createRandomString(int length, Random random) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) {
            int c = random.nextInt(26);
            sb.append((char) (c + 'a'));
        }
        return sb.toString();
    }

    private static List<String> createRandomStrings(
            int count, int length, Random random) {
        List<String> list = new ArrayList<String>(count);
        for (int i = 0; i < count; i++) {
            list.add(createRandomString(length, random));
        }
        return list;
    }
}

The timings are only similar for the serial case with a list of 10000 elements and 2-letter words.

It could be worthwhile to check whether, for even larger list sizes, the concurrent versions eventually outperform the serial ones, but I currently do not have the time for another detailed benchmark run with all these configurations.

nejckorasa · Answer

Find the most frequent item in a collection, with generics:

private <V> V findMostFrequentItem(final Collection<V> items) {
    return items.stream()
            .filter(Objects::nonNull)
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
            .entrySet()
            .stream()
            .max(Comparator.comparing(Entry::getValue))
            .map(Entry::getKey)
            .orElse(null);
}

Compute the item frequencies:

private <V> Map<V, Long> findFrequencies(final Collection<V> items) {
    return items.stream()
            .filter(Objects::nonNull)
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
}
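A short usage sketch for the two helpers above may help show their edge cases. It assumes they are made `static` and placed in a hypothetical class named `FrequencyHelpers`, and it builds the most-frequent lookup on top of the frequency map with `Map.Entry.comparingByValue()`; both are choices made for this illustration rather than the answer's exact code.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Objects;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical class name, used only for this illustration.
class FrequencyHelpers {

    // Counts how often each non-null item occurs.
    static <V> Map<V, Long> findFrequencies(Collection<V> items) {
        return items.stream()
                .filter(Objects::nonNull)
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    // Returns the item with the highest count, or null if there are no non-null items.
    // With ties, which of the tied items is returned is not specified.
    static <V> V findMostFrequentItem(Collection<V> items) {
        return findFrequencies(items).entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }

    public static void main(String[] args) {
        System.out.println(findMostFrequentItem(Arrays.asList("hello", "bye", null, "bye"))); // prints bye
        System.out.println(findMostFrequentItem(Collections.<String>emptyList()));            // prints null
    }
}
```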

Donald Raab · Answer

If you use Eclipse Collections, you can simply convert the List to a Bag.

Bag<String> words = Lists.mutable.with("hello", "bye", "ciao", "bye", "ciao").toBag();
Assert.assertEquals(2, words.occurrencesOf("ciao"));
Assert.assertEquals(1, words.occurrencesOf("hello"));
Assert.assertEquals(2, words.occurrencesOf("bye"));

This code will work with Java 5 - 8.

Note: I am a committer for Eclipse Collections.

Eugene · Answer

I will just put here the solution that I came up with (the one with grouping is much better :)).

static private void test0(List<String> input) {
    // Distinct words first, then one Collections.frequency scan of the input per word
    Set<String> set = input.stream()
            .collect(Collectors.toSet());
    set.stream()
            .collect(Collectors.toMap(Function.identity(),
                    str -> Collections.frequency(input, str)));
}

Just my 0.02$

Sym-Sym · Answer

Another 2 cents of mine, given an array:

import static java.util.stream.Collectors.*;

String[] str = {"hello", "bye", "ciao", "bye", "ciao"};
Map<String, Integer> collected = Arrays.stream(str)
        .collect(groupingBy(Function.identity(),
                collectingAndThen(counting(), Long::intValue)));

Piyush · Answer

Here is a way to create a frequency map using Map functions.

List<String> words = Stream.of("hello", "bye", "ciao", "bye", "ciao").collect(toList());
Map<String, Integer> frequencyMap = new HashMap<>();

words.forEach(word ->
        frequencyMap.merge(word, 1, (v, newV) -> v + newV)
);

System.out.println(frequencyMap); // {ciao=2, hello=1, bye=2}

Or

words.forEach(word ->
        frequencyMap.compute(word, (k, v) -> v != null ? v + 1 : 1)
);
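As a runnable sketch of the two variants above: `Map.merge` stores the given value when the key is absent and otherwise combines the old and new values with the function, while `Map.compute` passes `null` to the function for an absent key. The class name `MergeVersusCompute`, the method names, and the use of `Integer::sum` in place of the lambda are assumptions made for this illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class name, used only for this illustration.
class MergeVersusCompute {

    // merge: inserts 1 for a new word, otherwise adds 1 to the existing count.
    static Map<String, Integer> countWithMerge(List<String> words) {
        Map<String, Integer> frequencyMap = new HashMap<>();
        for (String word : words) {
            frequencyMap.merge(word, 1, Integer::sum);
        }
        return frequencyMap;
    }

    // compute: the current value is null for a new word, so start at 1.
    static Map<String, Integer> countWithCompute(List<String> words) {
        Map<String, Integer> frequencyMap = new HashMap<>();
        for (String word : words) {
            frequencyMap.compute(word, (k, v) -> v == null ? 1 : v + 1);
        }
        return frequencyMap;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("hello", "bye", "ciao", "bye", "ciao");
        System.out.println(countWithMerge(words));
        System.out.println(countWithMerge(words).equals(countWithCompute(words))); // prints true
    }
}
```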

Easycoder · Answer

import java.util.ArrayList;
import java.util.Map;
import java.util.stream.Collectors;

public class Main {
    public static void main(String[] args) {
        String testString = "qqwweerrttyyaaaaaasdfasafsdfadsfadsewfywqtedywqtdfewyfdweytfdywfdyrewfdyewrefdyewdyfwhxvsahxvfwytfx";

        // Count the occurrences of a single character
        long java8Case2 = testString.codePoints().filter(ch -> ch == 'a').count();
        System.out.println(java8Case2);

        // Count the occurrences of every character
        ArrayList<Character> list = new ArrayList<Character>();
        for (char c : testString.toCharArray()) {
            list.add(c);
        }
        Map<Character, Integer> counts = list.parallelStream()
                .collect(Collectors.toConcurrentMap(
                        w -> w, w -> 1, Integer::sum));
        System.out.println(counts);
    }
}