Word frequency count example
With this example we are going to demonstrate how to count the frequency of words in a file. In short, to count the frequency of words in a file you should:
- Create a new FileInputStream for a given String path, which opens a connection to the file.
- Get the FileChannel object associated with the FileInputStream, using the getChannel() API method of FileInputStream.
- Get the current size of this channel’s file, using the size() API method of FileChannel.
- Create a MappedByteBuffer, using the map(MapMode mode, long position, long size) API method of FileChannel, which maps a region of this channel’s file directly into memory.
- Convert the byte buffer to a character buffer. Get a Charset for a specified charset name, using the forName(String charsetName) API method of Charset, and then create a new CharsetDecoder, using the newDecoder() API method of Charset. Then use the decode(ByteBuffer in) API method of CharsetDecoder to decode the remaining content of the input byte buffer into a newly allocated CharBuffer.
- Create a line pattern and a word-break pattern, by compiling the given String regular expressions to a Pattern, using the compile(String regex) and compile(String regex, int flags) API methods of Pattern.
- Match the line pattern against the buffer, using the matcher(CharSequence input) API method of Pattern.
- For each line, get the line using the find() and group() API methods of the Matcher created for the line pattern, and get the array of words in the line using the split(CharSequence input) API method of the word-break Pattern.
- Then, for each non-empty word, add it to a TreeMap, incrementing its count if the word is already present.
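Before involving a real file, it can help to see the decoding, line matching, and word splitting steps on a few bytes held in memory. The following is only a minimal sketch under stated assumptions: the class name WordFreqSketch and the helper countWords are hypothetical, the sample text is made up, and the convenience method Charset.decode(ByteBuffer) is used in place of an explicit CharsetDecoder to keep the sketch short.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordFreqSketch {

    // Hypothetical helper (not part of the original snippet):
    // counts the words found in bytes that are already in memory.
    public static Map<String, Integer> countWords(ByteBuffer bytes) {
        // Decode the bytes to characters; ISO-8859-1 maps each byte to one char
        CharBuffer chars = StandardCharsets.ISO_8859_1.decode(bytes);

        // With MULTILINE, "$" also matches just before each line terminator
        Pattern linePattern = Pattern.compile(".*$", Pattern.MULTILINE);
        // A word break is a single punctuation or whitespace character
        Pattern wordBreak = Pattern.compile("[\\p{Punct}\\s]");

        Map<String, Integer> counts = new TreeMap<>();
        Matcher lines = linePattern.matcher(chars);
        while (lines.find()) {
            for (String word : wordBreak.split(lines.group())) {
                // Adjacent separators (such as ", ") produce empty strings; skip them
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample text, assumed here purely for illustration
        byte[] sample = "the cat, the dog\nThe end."
                .getBytes(StandardCharsets.ISO_8859_1);
        // prints {The=1, cat=1, dog=1, end=1, the=2}
        System.out.println(countWords(ByteBuffer.wrap(sample)));
    }
}
```

Note that "The" and "the" are counted separately, since the words are stored exactly as they appear in the text.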
Let’s take a look at the code snippet that follows:
package com.javacodegeeks.snippets.core;

import java.io.FileInputStream;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordFreq {

    public static void main(String[] args) throws Exception {

        String filePath = "C:/Users/nikos7/Desktop/file.odt";

        // try-with-resources closes the channel and the stream when done
        try (FileInputStream in = new FileInputStream(filePath);
             FileChannel filech = in.getChannel()) {

            // Map the whole file into memory as a read-only byte buffer
            long fileLen = filech.size();
            MappedByteBuffer buf = filech.map(FileChannel.MapMode.READ_ONLY, 0, fileLen);

            // Convert to character buffer
            Charset chars = Charset.forName("ISO-8859-1");
            CharsetDecoder dec = chars.newDecoder();
            CharBuffer charBuf = dec.decode(buf);

            // Create line pattern
            Pattern linePatt = Pattern.compile(".*$", Pattern.MULTILINE);

            // Create word-break pattern: one punctuation or whitespace character
            Pattern wordBrkPatt = Pattern.compile("[\\p{Punct}\\s]");

            // Match line pattern to buffer
            Matcher lineM = linePatt.matcher(charBuf);

            // TreeMap keeps the words sorted by key
            Map<String, Integer> m = new TreeMap<>();

            // For each line
            while (lineM.find()) {
                // Get line
                CharSequence lineSeq = lineM.group();

                // Get array of words on line
                String[] words = wordBrkPatt.split(lineSeq);

                // For each non-empty word, insert 1 or add 1 to the existing count
                for (String word : words) {
                    if (word.length() > 0) {
                        m.merge(word, 1, Integer::sum);
                    }
                }
            }
            System.out.println(m);
        }
    }
}
Output:
WordPress=2, Working=1, Your=3, You’ll=1, a=136, able=1, about=8, above=2, absolutely=1, absurd=1, accept=.....
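The order of the entries in this output comes from the TreeMap, which sorts its String keys by their natural ordering. That ordering compares character codes, so words starting with an uppercase ASCII letter come before words starting with a lowercase one. A small sketch of this behavior, using a hypothetical class name and a handful of counts copied from the output above:

```java
import java.util.Map;
import java.util.TreeMap;

class TreeMapOrderSketch {

    // Hypothetical helper: a few entries taken from the output above,
    // inserted in arbitrary order to show that the TreeMap reorders them.
    public static Map<String, Integer> sample() {
        Map<String, Integer> counts = new TreeMap<>();
        counts.put("a", 136);
        counts.put("Your", 3);
        counts.put("able", 1);
        counts.put("WordPress", 2);
        return counts;
    }

    public static void main(String[] args) {
        // prints {WordPress=2, Your=3, a=136, able=1}
        System.out.println(sample());
    }
}
```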
This was an example of how to count the frequency of words in a file in Java.