
Word frequency count example

This example demonstrates how to count the frequency of words in a file. In short, to count word frequencies you should:

  • Create a new FileInputStream for a given String path, which opens a connection to the file.
  • Get the FileChannel object associated with the FileInputStream, using the getChannel() API method of FileInputStream.
  • Get the current size of this channel’s file, using the size() API method of FileChannel.
  • Create a MappedByteBuffer, using the map(MapMode mode, long position, long size) API method of FileChannel, which maps a region of this channel’s file directly into memory.
  • Convert the byte buffer to a character buffer. Get a Charset for a specified charset name, using the forName(String charsetName) API method of Charset, and then a new CharsetDecoder, using the newDecoder() API method of Charset. Then use the decode(ByteBuffer in) API method of CharsetDecoder to decode the remaining content of the byte buffer into a newly-allocated CharBuffer.
  • Create a line pattern and a word-break pattern, by compiling the given String regular expressions to a Pattern, using the compile(String regex) API method of Pattern.
  • Match the line pattern against the character buffer, using the matcher(CharSequence input) API method of Pattern.
  • For each line, get the line using the find() and group() API methods of the Matcher created for the line pattern, and get the array of words in the line using the split(CharSequence input) API method of the word-break Pattern.
  • Then, for each non-empty word, look up its current count in a TreeMap and store the incremented count.
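The pattern-related steps can be tried in isolation on an in-memory string, without touching a file. The following is a minimal sketch, not the article's listing: the class name WordCountSketch and the helper count() are hypothetical, and the word-break pattern is assumed to be a character class of punctuation and whitespace.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordCountSketch {

    // Line pattern: with MULTILINE, "$" matches at the end of every line
    private static final Pattern LINE = Pattern.compile(".*$", Pattern.MULTILINE);

    // Word-break pattern (assumed): one punctuation or whitespace character
    private static final Pattern WORD_BREAK = Pattern.compile("[\\p{Punct}\\s]");

    // Hypothetical helper: counts the words of any CharSequence into a sorted map
    static Map<String, Integer> count(CharSequence text) {
        Map<String, Integer> freq = new TreeMap<>();
        Matcher lines = LINE.matcher(text);
        while (lines.find()) {
            // split() yields empty strings between adjacent separators, so skip them
            for (String word : WORD_BREAK.split(lines.group())) {
                if (!word.isEmpty()) {
                    freq.merge(word, 1, Integer::sum);
                }
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(count("to be, or not\nto be"));
    }
}
```

Because a CharBuffer is also a CharSequence, the same helper would accept the decoded buffer from the earlier steps unchanged.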

Let’s take a look at the code snippet that follows:

package com.javacodegeeks.snippets.core;

import java.io.FileInputStream;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordFreq {

    public static void main(String[] args) throws Exception {

        String filePath = "C:/Users/nikos7/Desktop/file.odt";

        CharBuffer charBuf;

        // Map the file into a byte buffer; the stream and channel are closed automatically
        try (FileInputStream in = new FileInputStream(filePath);
             FileChannel filech = in.getChannel()) {

            long fileLen = filech.size();
            MappedByteBuffer buf = filech.map(FileChannel.MapMode.READ_ONLY, 0, fileLen);

            // Convert to character buffer
            Charset chars = Charset.forName("ISO-8859-1");
            CharsetDecoder dec = chars.newDecoder();
            charBuf = dec.decode(buf);
        }

        // Create line pattern: with MULTILINE, "$" matches at the end of every line
        Pattern linePatt = Pattern.compile(".*$", Pattern.MULTILINE);

        // Create word-break pattern: any single punctuation or whitespace character
        Pattern wordBrkPatt = Pattern.compile("[\\p{Punct}\\s]");

        // Match line pattern to buffer
        Matcher lineM = linePatt.matcher(charBuf);

        // TreeMap keeps the words sorted by their natural String order
        Map<String, Integer> m = new TreeMap<>();

        // For each line
        while (lineM.find()) {

            // Get line
            CharSequence lineSeq = lineM.group();

            // Get array of words on line
            String[] words = wordBrkPatt.split(lineSeq);

            // For each word
            for (String word : words) {
                // Adjacent separators produce empty strings; skip them
                if (word.length() > 0) {
                    Integer frequency = m.get(word);
                    m.put(word, frequency == null ? 1 : frequency + 1);
                }
            }
        }

        System.out.println(m);
    }
}

Output:

WordPress=2, Working=1, Your=3, You’ll=1, a=136, able=1, about=8, above=2, absolutely=1, absurd=1, accept=.....
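The capitalized words come first in this output because a TreeMap sorts String keys by their natural order, which compares character codes, and uppercase letters have lower codes than lowercase ones. The sketch below illustrates that behavior and one possible variation that ignores case; the class OrderingSketch and its count() helper are hypothetical names used only for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

class OrderingSketch {

    // Hypothetical helper: counts words using either natural or case-insensitive ordering
    static Map<String, Integer> count(String[] words, boolean ignoreCase) {
        Map<String, Integer> freq;
        if (ignoreCase) {
            // Keys that differ only in case are treated as the same key
            freq = new TreeMap<String, Integer>(String.CASE_INSENSITIVE_ORDER);
        } else {
            freq = new TreeMap<String, Integer>();
        }
        for (String word : words) {
            freq.merge(word, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        String[] words = {"a", "Your", "able", "A"};
        System.out.println(count(words, false)); // prints {A=1, Your=1, a=1, able=1}
        System.out.println(count(words, true));  // prints {a=2, able=1, Your=1}
    }
}
```

With the case-insensitive comparator, the map keeps whichever spelling it saw first as the key, so "A" is counted under "a" here.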

 
This was an example of how to count the frequency of words in a file in Java.
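One detail worth checking against your own files is the charset. ISO-8859-1 maps every byte to exactly one character, so decoding never fails, but text stored as UTF-8 will have each multi-byte character split into several characters. The following sketch shows the difference on a small byte array; DecodeSketch and its decode() helper are hypothetical, and the sample word is an arbitrary choice.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeSketch {

    // Hypothetical helper: decodes bytes with a fresh CharsetDecoder for the given charset
    static String decode(byte[] bytes, Charset charset) {
        try {
            CharBuffer chars = charset.newDecoder().decode(ByteBuffer.wrap(bytes));
            return chars.toString();
        } catch (CharacterCodingException e) {
            // Thrown when the bytes are malformed for the chosen charset
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "caf\u00e9" is four characters but five bytes in UTF-8
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(utf8, StandardCharsets.UTF_8).length());      // prints 4
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1).length()); // prints 5
    }
}
```

If the input file is known to be UTF-8, passing "UTF-8" to Charset.forName in the listing would be the corresponding change.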

Ilias Tsagklis

Ilias is a software developer turned online entrepreneur. He is co-founder and Executive Editor at Java Code Geeks.