[1021] (파이썬)leet code_819. Most Common Word

09 Jan 2020

Reading time ~3 minutes

Table of Contents

목차
가장 흔한 단어
- 문제
풀이
풀이과정
- 1. 전처리
- 2. 최빈값 뽑기
전체 코드

가장 흔한 단어

문제

Given a string paragraph and a string array of the banned words banned, return the most frequent word that is not banned. It is guaranteed there is at least one word that is not banned, and that the answer is unique.

The words in paragraph are case-insensitive and the answer should be returned in lowercase.

Example1:

Input: paragraph = "Bob hit a ball, the hit BALL flew far after it was hit.", banned = ["hit"]
Output: "ball"
Explanation: 
"hit" occurs 3 times, but it is a banned word.
"ball" occurs twice (and no other word does), so it is the most frequent non-banned word in the paragraph. 
Note that words in the paragraph are not case sensitive,
that punctuation is ignored (even if adjacent to words, such as "ball,"), 
and that "hit" isn't the answer even though it occurs more because it is banned.

Example1:

Input: paragraph = "a.", banned = []
Output: "a"

[제한]

1 <= paragraph.length <= 1000
paragraph consists of English letters, space ‘ ‘, or one of the symbols: “!?’,;.”.
0 <= banned.length <= 100
1 <= banned[i].length <= 10
banned[i] consists of only lowercase English letters.

풀이

들어가기 전 알아 둘 개념

풀이과정

이 문제는 크게 두가지 과정을 거친다.
1️⃣ 전처리(문자만 골라서 소문자로 바꾸로 공백으로 나누기/ banned 단어 빼기)해서 새로운 딕셔너리 만들기
2️⃣ 전처리한 딕셔너리에서 최빈값 뽑기

1. 전처리

입력값에는 대소문자가 섞여 있으며 쉼표 등 구두점이 존재한다.

그러므로 이렇게 문자만 골라 보겠다

문자만을 고르는 것은 re.sub를 쓰겠다.
re.sub(r'[^\w]', ' ', paragraph).

\w: 단어 문자 Word Character
^: not 따라서 위 정규 식은 단어 문자가 아닌 모든 문자를 공백으로 치환 Substitute 하는 역할을 한다.

📜 re.sub 간단하게 보기

re.sub(뭘바꿔, 뭘로 바꿔, 바꿀 문자열)

‘apple’ 또는 ‘orange’를 찾아서 ‘fruit’로 바꿈

re.sub('apple|orange', 'fruit', 'apple box orange tree')    # apple 또는 orange를 fruit로 바꿈
'fruit box fruit tree'

골라낸것을 lower()를 통해 소문자로 바꾸고 split()를 기준으로 나눈다.
re.sub(r'[^\w]', ' ', paragraph).lower().split().

그리고 banned 단어를 포함하지 않기 위해 간단한 조건도 섞어보자
if word not in banned.

그럼 리스트 컴프리헨션으로 모든 위의 조건들을 반영한 새로운 딕셔너리를 만들어보겠다.
words = [word for word in re.sub(r'[^\w]', ' ', paragraph).lower().split() if word not in banned].

2. 최빈값 뽑기

다음과 같이 각 단어의 개수를 헤아려 보자.

차이썬의 기본 타입에서 argmax 를 지원하고 있지 않기 때문에(코딩테스트는 외부 라이브러리를 지원하지 않는다) 딕셔너리 항목의 각 개수를 세는 collections.Counter를 사용한다.
즉, 개수를 처리하는 부분은 Counter 모듈을 사용하면 좀 더 깔끔하게 처리할 수 있다.

여기서 개수를 담아두는 변수는 딕셔너리를 사용하며 defaultdict() 를 사용해 int 기본 값이 자동으로 부여되게 했다.
즉 이 코드를 사용하면 counts = collections.Counter(words) 이렇게 나온다.

# input 
words =['bob', 'a', 'ball', 'the', 'ball', 'flew', 'far', 'after', 'it', 'was']

count = collections.Counter(words)
print(count.most_common)
#<bound method Counter.most_common of Counter({'ball': 2, 'bob': 1, 'a': 1, 'the': 1, 'flew': 1, 'far': 1, 'after': 1, 'it': 1, 'was': 1})>

다음 코드는 words 에서 가장 흔하게 등장하는 단어의 첫 번째 값을 most_common(1) 으로 추출한다.
문제의 입력값에서는 [(‘ball’, 2)] 가 되며,
이 값의 [0][0] 을 추출해서 최종 적으로 첫 번째 인덱스의 키를 추출하게 된다.
이렇게 추출한 키인 ball 은 가장 흔한 단어가 되므로, 이제 이 값을 리턴한다.

전체 코드

위를 구현한 전체 코드는 다음과 같다

class Solution:
    def mostCommonWord(self, paragraph: str, banned: List[str]) -> str:
        words = [word for word in re.sub(r'[^\w]',' ',paragraph).lower().split() if word not in banned]
        print(words)
        count = collections.Counter(words)
        return count.most_common(1)[0][0]

Coding test