Thursday, December 27, 2018

#Word Frequency Counting Program

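I grabbed an article from the web and want to count which words appear in it, building a dict() to practice working with keys and values. At the end, the program tallies how often each word occurs and prints every word that appears more than once.
Here is the article, picked at random from the web: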

The Izumo and Kaga have been carrying helicopters designed for anti-submarine warfare since entering service over the past three years. They will need to have their decks reinforced to accommodate the heavier F-35Bs, as well as the heat and force from the jets' thrusters when they land vertically.
Japan will also increase its order for F-35A jets, which take off and land on conventional runways, to 105, the government said. Forty-two of those jets are in service or were part of earlier Japanese orders. Those planes will replace the Japan Air Self-Defense Force's aging F-15J fighters.
The purchases will be spread over 10 years, with 27 of the F-35As and 18 of the F-35Bs to be acquired, as well as the two warships to be refitted, in the first five years.
Total spending over the first five years is pegged at $282.4 billion and will include creating cyber defense and naval transportation units that operate across Japan's three military branches, the Ground, Air and Maritime Self-Defense Forces.


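First, save the text as a plain-text file named sample.txt at this path:
"/Users/lido/Library/Preferences/PyCharmCE2018.3/scratches/sample.txt"
This program uses Python's regular-expression functions from the re module (see the RUNOOB regular expression tutorial for reference).
Regular expressions can find specific text in a document or input field. One example is re.match(pattern, string, flags=0), which only matches at the beginning of the string; this program uses re.sub instead, which is described further down.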
import re
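
As a quick illustration of how re.match differs from re.search and re.sub, here is a minimal sketch. The sample string is made up for the demo and is not read from sample.txt.

```python
import re

text = "Japan will increase its order"

at_start = re.match(r"Japan", text)      # matches: "Japan" is at position 0
not_at_start = re.match(r"order", text)  # None: re.match only tries position 0
anywhere = re.search(r"order", text)     # matches: re.search scans the whole string
replaced = re.sub(r"order", "ORDER", text)  # replaces every occurrence

print(at_start is not None, not_at_start is None, anywhere is not None)
print(replaced)
```

Because re.match never looks past the start of the string, re.sub is the better fit when the goal is to clean up characters anywhere in the text.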

Read the article and store it in article. Using with closes the file automatically once it has been read:
with open("/Users/lido/Library/Preferences/PyCharmCE2018.3/scratches/sample.txt","r") as fp:
    article = fp.read()
>>>type(article)
<class 'str'>

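Create a new string, new_article, by using a regular-expression substitution to strip unwanted characters from the article. The pattern is written as a raw string (r"...") so that Python does not try to interpret \s as a string escape.
new_article = re.sub(r"[^a-zA-Z\s]","",article)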

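re.sub: search and replace
Python's re module provides re.sub for replacing matches in a string.
re.sub(pattern, repl, string, count=0, flags=0)
pattern : the regular-expression pattern string.
repl : the replacement string; it can also be a function.
string : the original string to search and replace in.
count : the maximum number of replacements to make; the default, 0, replaces every match.
flags : optional flags that modify how the pattern is matched; the default is 0 (none).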

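Inside square brackets, a leading ^ negates the character class, so [^a-zA-Z\s] matches any single character that is neither a letter nor whitespace. (Outside brackets, ^ would instead match the start of the string.) a-zA-Z matches all ASCII letters, and \s matches any whitespace character, including spaces, tabs, and form feeds; for ASCII text it is equivalent to [ \f\n\r\t\v]. Since every match is replaced with an empty string, digits, punctuation, and hyphens are all removed, which is why F-35Bs turns into FBs and anti-submarine turns into antisubmarine in the result.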
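
A small check of what the character class [^a-zA-Z\s] removes, as a sketch on a made-up sentence rather than the full article:

```python
import re

sample = "The F-35Bs cost $282.4 billion, Japan's government said."

# Delete every character that is not an ASCII letter and not whitespace
cleaned = re.sub(r"[^a-zA-Z\s]", "", sample)

print(cleaned)  # The FBs cost  billion Japans government said
```

Note the two spaces left between "cost" and "billion": the number is deleted, but the spaces on either side of it remain.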

After the substitution, print the result to take a look:
>>>new_article
'The Izumo and Kaga have been carrying helicopters designed for antisubmarine warfare since entering service over the past three years They will need to have their decks reinforced to accommodate the heavier FBs as well as the heat and force from the jets thrusters when they land vertically Japan will also increase its order for FA jets which take off and land on conventional runways to  the government said Fortytwo of those jets are in service or were part of earlier Japanese orders Those planes will replace the Japan Air SelfDefense Forces aging FJ fighters The purchases will be spread over  years with  of the FAs and  of the FBs to be acquired as well as the two warships to be refitted in the first five years Total spending over the first five years is pegged at  billion and will include creating cyber defense and naval transportation units that operate across Japans three military branches the Ground Air and Maritime SelfDefense Forces '
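
The result above loses tokens such as F-35Bs, Self-Defense, and Japan's. If keeping them mattered, one possible alternative (my own sketch, not what this post uses) is to extract words with re.findall and allow hyphens and apostrophes inside a word. The pattern below is an assumption chosen for this demo, not a general-purpose tokenizer.

```python
import re

sample = "The F-35Bs cost $282.4 billion, Japan's government said."

# A word is a run of letters/digits, optionally joined by single
# hyphens or apostrophes (so "F-35Bs" and "Japan's" stay whole).
tokens = re.findall(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*", sample)

print(tokens)
# ['The', 'F-35Bs', 'cost', '282', '4', 'billion', "Japan's", 'government', 'said']
```

The decimal number still splits into 282 and 4 because the period is not treated as a joiner, so this trades one set of quirks for another.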

Next, use the split() method to break the text into individual words and store them in the list words.
words = new_article.split()

Print it to take a look:
>>>words
['The', 'Izumo', 'and', 'Kaga', 'have', 'been', 'carrying', 'helicopters', 'designed', 'for', 'antisubmarine', 'warfare', 'since', 'entering', 'service', 'over', 'the', 'past', 'three', 'years', 'They', 'will', 'need', 'to', 'have', 'their', 'decks', 'reinforced', 'to', 'accommodate', 'the', 'heavier', 'FBs', 'as', 'well', 'as', 'the', 'heat', 'and', 'force', 'from', 'the', 'jets', 'thrusters', 'when', 'they', 'land', 'vertically', 'Japan', 'will', 'also', 'increase', 'its', 'order', 'for', 'FA', 'jets', 'which', 'take', 'off', 'and', 'land', 'on', 'conventional', 'runways', 'to', 'the', 'government', 'said', 'Fortytwo', 'of', 'those', 'jets', 'are', 'in', 'service', 'or', 'were', 'part', 'of', 'earlier', 'Japanese', 'orders', 'Those', 'planes', 'will', 'replace', 'the', 'Japan', 'Air', 'SelfDefense', 'Forces', 'aging', 'FJ', 'fighters', 'The', 'purchases', 'will', 'be', 'spread', 'over', 'years', 'with', 'of', 'the', 'FAs', 'and', 'of', 'the', 'FBs', 'to', 'be', 'acquired', 'as', 'well', 'as', 'the', 'two', 'warships', 'to', 'be', 'refitted', 'in', 'the', 'first', 'five', 'years', 'Total', 'spending', 'over', 'the', 'first', 'five', 'years', 'is', 'pegged', 'at', 'billion', 'and', 'will', 'include', 'creating', 'cyber', 'defense', 'and', 'naval', 'transportation', 'units', 'that', 'operate', 'across', 'Japans', 'three', 'military', 'branches', 'the', 'Ground', 'Air', 'and', 'Maritime', 'SelfDefense', 'Forces']
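
The list contains no empty strings even though the cleaned text had double spaces where numbers were removed. That is because split() with no argument splits on runs of whitespace. A minimal sketch of the difference, using a made-up string:

```python
text = "to  the government\nsaid "

# No argument: any run of whitespace is one separator, ends are trimmed
by_whitespace = text.split()

# Explicit " ": every single space is a separator, so empty strings appear
by_single_space = text.split(" ")

print(by_whitespace)    # ['to', 'the', 'government', 'said']
print(by_single_space)  # ['to', '', 'the', 'government\nsaid', '']
```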

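Good, the work is nearly done; all that remains is the for loop that does the counting.
Use the .upper() method to convert every word to uppercase so that words differing only in case are counted together.
If a word is not yet in the word_counts dict, set its count to 1; if it is already there (that is, it has appeared before), add 1 to its count.
Finally, print the words whose count is greater than 1.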
word_counts = {}  # maps each uppercase word to the number of times it appears
for word in words:
    if word.upper() in word_counts:
        word_counts[word.upper()] = word_counts[word.upper()] + 1
    else:
        word_counts[word.upper()] = 1

# Sort the keys alphabetically so the output order is predictable
key_list = list(word_counts.keys())
key_list.sort()

for key in key_list:
    if word_counts[key] > 1:
        print("{}:{}".format(key,word_counts[key]),end=" ")
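
For comparison, the same tally can be sketched with collections.Counter from the standard library. This is an alternative to the hand-written dict loop, not a replacement for the exercise; the short words list here is made up for the demo.

```python
from collections import Counter

words = ["The", "jets", "the", "Japan", "jets", "THE"]

# Counter handles the "first time seen vs. seen before" bookkeeping
word_counts = Counter(word.upper() for word in words)

# Keep only the words seen more than once, in alphabetical order
repeated = {key: count for key, count in sorted(word_counts.items()) if count > 1}

print(repeated)  # {'JETS': 2, 'THE': 3}
```

Writing the loop by hand is still the better way to practice dict keys and values, which was the point of this program.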

Complete code:
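import re

# Read the whole article into a single string; with closes the file afterwards
with open("/Users/lido/Library/Preferences/PyCharmCE2018.3/scratches/sample.txt","r") as fp:
    article = fp.read()

# Remove every character that is neither a letter nor whitespace
new_article = re.sub(r"[^a-zA-Z\s]","",article)
words = new_article.split()

# Count each word case-insensitively, using its uppercase form as the key
word_counts={}
for word in words:
    if word.upper() in word_counts:
        word_counts[word.upper()] = word_counts[word.upper()] + 1
    else:
        word_counts[word.upper()] = 1

# Print the words that appear more than once, in alphabetical order
key_list = list(word_counts.keys())
key_list.sort()

for key in key_list:
    if word_counts[key] > 1:
        print("{}:{}".format(key,word_counts[key]),end=" ")

Output: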
AIR:2 AND:7 AS:4 BE:3 FBS:2 FIRST:2 FIVE:2 FOR:2 FORCES:2 HAVE:2 IN:2 JAPAN:2 JETS:3 LAND:2 OF:4 OVER:3 SELFDEFENSE:2 SERVICE:2 THE:14 THEY:2 THOSE:2 THREE:2 TO:5 WELL:2 WILL:5 YEARS:4 
