Overview
These notes analyze and summarize the paper Techniques for Automatically Correcting Words in Text, which is hard to believe was published back in 1992.
Introduction
- Detection and Error Correction
- Interactive vs Automatic correction
- Focus on isolated words without taking context into account (cf. context-dependent error correction)
- form, from / minute, minuet
- ater : after, later, alter, water, ate
N-gram Analysis Techniques
- n-letter subsequences of words or strings
- unigrams, bigrams, trigrams
- Input string + a precompiled table of n-gram statistics, used to ascertain either its existence or its frequency
- Better performance than OCR devices
- Nonpositional and Positional (see the sketch after this list)
- Nonpositional: each n-gram is represented as 0/1 regardless of its position, like a bag of words (BOW)
- Positional: the position of the n-gram within the word is also taken into account
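To make the distinction concrete, here is a small Python sketch of both feature styles; the exact encoding (a set of bigrams vs. a set of (position, bigram) pairs) is an illustrative assumption, not the paper's representation.

```python
def nonpositional_bigrams(word):
    """Which bigrams occur anywhere in the word (a 0/1 'bag of bigrams')."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def positional_bigrams(word):
    """Which bigram occurs at which position within the word."""
    return {(i, word[i:i + 2]) for i in range(len(word) - 1)}

print(sorted(nonpositional_bigrams("ater")))  # ['at', 'er', 'te']
print(sorted(positional_bigrams("ater")))     # [(0, 'at'), (1, 'te'), (2, 'er')]
```

Detection then amounts to checking each extracted n-gram against the precompiled table for its existence or frequency.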
Dictionary Lookup Techniques
- Look the input word up in a dictionary
- Simple?
- Depends on the size of the dictionary (25K~250K words)
- Three techniques for addressing the problems of this approach (a small sketch follows this list)
- Pattern-matching algorithms
- The dictionary words are pre-hashed, and lookups are done via the hash table
- Dictionary-partitioning schemes
- Keep frequently used words in a fast cache and use progressively slower storage for the rest
- Morphological-processing techniques
- Store words in their root form and use morphological analysis during lookup
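A minimal sketch of how hash-based lookup and a morphological fallback might be combined; the dictionary contents and suffix list are invented for illustration.

```python
# Hashed set of root forms -> O(1) average lookup; contents are illustrative.
DICTIONARY = {"correct", "word", "inform", "spell"}
SUFFIXES = ["ing", "ed", "s"]          # tiny stand-in for morphological processing

def is_known(word: str) -> bool:
    word = word.lower()
    if word in DICTIONARY:             # direct hash lookup
        return True
    for suffix in SUFFIXES:            # fall back to the stripped root form
        if word.endswith(suffix) and word[:-len(suffix)] in DICTIONARY:
            return True
    return False

print(is_known("informing"), is_known("wrods"))  # True False
```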
Dictionary Construction Issues
- Small Lexicon vs Large Lexicon
The Word Boundary Problem
- Word boundary = the space character
- Errors involving spaces where two or more of the resulting substrings are themselves valid words; for example, in understandhme, under/stand/me are all formally valid words.
- ofthe, understandhme
- Sp ent, th ebook
- In form -> inform
- Forget -> for get
- This remains one of the problems in spelling-correction research that has not been clearly solved.
Isolated-Word Error Correction
- One might think that merely detecting misspellings is already useful, but for most applications detection alone is not enough.
- Errors must be both detected and corrected.
Spelling Error Patterns
- Basic error types
- insertion, deletion, substitution and transposition
- Word-length effects
- Short words are harder to handle.
- First-Position Errors
- The first letter of a word is relatively rarely the one in error.
- Keyboard Effects
- Errors frequently involve keys that are adjacent on the keyboard
- Error Rates
- Reported spelling-error rates are not yet consistent enough to pin down clear patterns.
- They vary with corpus size, entry mode (how the text was entered), and when the study was conducted.
- Phonetic errors
- Occur when an unfamiliar word is typed by guessing a spelling that sounds right.
- Heuristic Rules And Probabilistic Tendencies
- The paper's analysis found the following characteristics of misspellings:
- 17 heuristic patterns, 12 incorrect uses of consonants/vowels
- 5 sequence (ordering) adjustments
- Common misspellings
- The paper lists the misspellings that follow these patterns.
Techniques for Isolated-Word Error Correction
Having looked at detection so far, we now turn to correction.
Isolated-word error correction can be divided into the following subproblems:
- Detection of an error (discussed above)
- Generation of candidate corrections (similarity key / rule-based techniques)
- Ranking of candidate corrections (minimum edit distance through neural nets)
Broadly speaking, correction techniques fall into two groups:
- Generation approaches
- Similarity key techniques
- Rule-based techniques
- Ranking approaches
- Minimum edit distance
- N-gram based techniques
- Probabilistic techniques
- Neural nets
Similarity Key Techniques: SOUNDEX and its variants
Phonetic Algorithms
- NYSIIS
- Metaphone
- Double Metaphone
- Metaphone3
- Caverphone
SOUNDEX algorithm
- Problem: How can we find a list of possible corrections?
- Solution: Store words in different boxes in a way that puts the similar words together.
[Example] Since punc and punk sound the same, let's see how SOUNDEX handles them (a Python sketch follows the worked example).
- Keep the first letter, e.g., punc keeps P
- Map each letter to a digit:
- a, e, h, i, o, u, w, y -> 0 (or drop)
- b, f, p, v -> 1
- c, g, j, k, q, s, x, z -> 2
- d, t -> 3
- l -> 4
- m, n -> 5
- r -> 6 (e.g., punc -> P052)
- Drop all 0s and collapse repeated digits, e.g., P052 -> P52
- Look at the actual words in the same box
- punk is also in the P52 box.
- Words that sound different end up with different codes:
- Robert : R163
- Robin : R150
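A small Python sketch of the simplified SOUNDEX procedure described above; it pads codes to four characters (as in the Robert/Robin codes), so punc/punk come out as P520 rather than the unpadded P52 worked out above.

```python
def soundex(word: str) -> str:
    """Simplified SOUNDEX: keep the first letter, map the rest to digits,
    collapse adjacent repeated digits, drop zeros, pad to four characters."""
    codes = {**dict.fromkeys("aehiouwy", "0"),
             **dict.fromkeys("bfpv", "1"),
             **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"),
             "l": "4",
             **dict.fromkeys("mn", "5"),
             "r": "6"}
    digits = [codes[ch] for ch in word.lower() if ch in codes]
    # Collapse adjacent repeated digits (e.g. the two c's in "occur" -> one "2").
    collapsed = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]
    # Keep the first letter, drop zeros from the rest, pad with zeros to length 4.
    tail = "".join(d for d in collapsed[1:] if d != "0")
    return (word[0].upper() + tail + "000")[:4]

print(soundex("punc"), soundex("punk"))      # P520 P520 -> same box
print(soundex("Robert"), soundex("Robin"))   # R163 R150 -> different boxes
```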
Rule-Based Techniques
Correct misspellings by applying predefined rules.
- Correct frequently occurring misspellings, e.g., hte -> the
- Rules
- based on inflections:
- V+C+ing -> V+CC+ing [ where V = vowel and C = consonant ]
- occuring -> occurring
- Based on common misspellings (e.g., keyboard typing errors or transpositions):
- Cie -> Cei
- Yannakoudakis et al. (1983) : Dictionary Based Technique (Knowledge-based spelling correction program based on the rules)
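A minimal sketch of the rule-based idea, assuming (as such systems do) that rules are only applied to words already flagged as non-words; the rule list and tiny dictionary are illustrative assumptions, not the actual rule base of Yannakoudakis et al. (1983).

```python
import re

DICTIONARY = {"the", "error", "is", "occurring"}   # illustrative lexicon
RULES = [
    (re.compile(r"^hte$"), "the"),                                          # common transposition
    (re.compile(r"([bcdfghjklmnpqrstvwxz])ie"), r"\1ei"),                   # Cie -> Cei
    (re.compile(r"([aeiou])([bcdfghjklmnpqrstvwxz])ing$"), r"\1\2\2ing"),   # V+C+ing -> V+CC+ing
]

def correct_word(word: str) -> str:
    if word in DICTIONARY:            # only non-words get rule treatment
        return word
    for pattern, repl in RULES:
        candidate = pattern.sub(repl, word)
        if candidate in DICTIONARY:   # accept a rule only if it yields a real word
            return candidate
    return word

print([correct_word(w) for w in "hte error is occuring".split()])
# ['the', 'error', 'is', 'occurring']
```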
Minimum Edit Distance
- How do we measure how different two words are from each other?
- Measure how many edit operations are needed to transform one word into the other.
Looking at the edit distance between INTENTION and EXECUTION: transforming INTENTION into EXECUTION takes one deletion, three substitutions, and one insertion, so the distance is 5. If a substitution is counted with cost 2, the distance is 8.
- Definition of Minimum Edit Distance
- For two strings
- X of length n
- Y of length m
- We define D(i,j) : the edit distance between X[1..i] and Y[1..j] i.e., the first i characters of X and the first j characters of Y
- The edit distance between X and Y is thus D(n,m)
Computing Minimum Edit Distance
For an explanation of the algorithm, "Minimum Edit Distance – Explained ! – Stanford University" is helpful.
The key idea: the value of each cell is the minimum of the following three quantities: the value to the left of the target cell + 1, the value below the target cell + 1, and the value diagonally below-left of the target cell + (0 if the target cell's row and column characters are the same, 2 if they differ).
For example, the cell in the DP table where N and E intersect gets the value 3, the minimum of: the value to its left, 2 (where N and # intersect), + 1 = 3; the value below it, 2 (where I and E intersect), + 1 = 3; and the value diagonally below-left, 1 (where I and # intersect), + 2 (since N and E differ) = 3.
In English (a Python implementation follows):
- Dynamic programming: A tabular computation of D(n,m)
- Solving problems by combining solutions to sub-problems.
- Bottom-up
- We compute D(i,j) for small i,j
- And compute larger D(i,j) based on previously computed smaller values i.e., compute D(i,j) for all i (0 < i < n) and j (0 < j < m)
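A bottom-up implementation of the recurrence above; with a substitution cost of 2 it reproduces the INTENTION/EXECUTION distance of 8, and with cost 1 the distance of 5.

```python
def min_edit_distance(x: str, y: str, sub_cost: int = 2) -> int:
    """Bottom-up DP for D(n, m): insertions and deletions cost 1,
    substitutions cost sub_cost (2 by default, as defined above)."""
    n, m = len(x), len(y)
    # D[i][j] = edit distance between x[:i] and y[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                      # delete all of x[:i]
    for j in range(m + 1):
        D[0][j] = j                      # insert all of y[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = x[i - 1] == y[j - 1]
            D[i][j] = min(
                D[i - 1][j] + 1,                             # deletion
                D[i][j - 1] + 1,                             # insertion
                D[i - 1][j - 1] + (0 if same else sub_cost)  # substitution / match
            )
    return D[n][m]

print(min_edit_distance("intention", "execution"))              # 8
print(min_edit_distance("intention", "execution", sub_cost=1))  # 5
```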
N-gram Based Techniques
N-gram
- An N-gram is a sequence of N adjacent letters in a word
- The more N-grams two strings share, the more similar they are.
Dice coefficient: DICE(X, Y) = 2 x |n-gram(X) ∩ n-gram(Y)| / ( |n-gram(X)| + |n-gram(Y)| )
Example
- Zantac vs Contac
- Bigrams in Zantac : Za an nt ta ac 5 bigrams
- Bigrams in Contac : Co on nt ta ac 5 bigrams
- DICE(Zantac, Contac) = (2 x 3) / ( 5+5) = 0.6
Problems:
- N-gram techniques don't perform well on short words (e.g., trigrams on words of length 3).
- The n-gram similarity measure works best for insertion and deletion errors, reasonably well for substitution errors, but very poorly for transposition errors.
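Going back to the Zantac/Contac example, here is a small sketch of bigram-based Dice similarity; it uses sets of bigrams, which matches the worked example above since neither word repeats a bigram.

```python
def bigrams(word):
    """The set of adjacent letter pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(x, y):
    """DICE(X, Y) = 2 * |bigrams(X) ∩ bigrams(Y)| / (|bigrams(X)| + |bigrams(Y)|)"""
    bx, by = bigrams(x.lower()), bigrams(y.lower())
    return 2 * len(bx & by) / (len(bx) + len(by))

print(dice("Zantac", "Contac"))  # 0.6
```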
Probabilistic Techniques
Two main probabilities are taken into account:
- Transition Probability : Markov Assumption (n-Gram based LM)
- probability (chance) of going from one letter to the next. e.g., What is the chance that a will follow p in English? That u will follow q?
- Confusion Probability : probability of one letter being mistaken (substituted) for another (can be derived from a confusion matrix) e.g., What is the chance that q is confused with p?
- Useful to combine probabilistic techniques with dictionary methods [ Bayes' Rule: Toussaint, 1978; Hull & Srihari, 1982; Viterbi Algorithm: Forney, 1973 – lexicon (Trans. P) + OCR Device Ch. (Conf. P) ]
- For the various reasons discussed above (keyboard layout, phonetic similarity, etc.) people type other letters than the ones they intended.
- It is impossible to fully investigate all possible error causes and how they interact, but we can learn from watching how often people make errors and where.
- One way of doing so is to build a confusion matrix = a table indicating how often one letter is mistyped for another (this is a substitution matrix)
Confusion matrix for spelling errors
A confusion matrix is built from http://norvig.com/ngrams/spell-errors.txt
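A rough sketch of building a substitution confusion matrix from that file, assuming its "correct: misspelling1, misspelling2, ..." line format; it only counts same-length pairs that differ in exactly one position, so insertion, deletion, and transposition errors are ignored.

```python
from collections import Counter
from urllib.request import urlopen

def build_confusion_matrix(url="http://norvig.com/ngrams/spell-errors.txt"):
    matrix = Counter()
    for line in urlopen(url).read().decode("utf-8").splitlines():
        correct, _, errors = line.partition(":")
        correct = correct.strip().lower()
        for error in errors.split(","):
            error = error.strip().lower().split("*")[0]  # drop a possible "*count" suffix
            if error and len(error) == len(correct):
                diffs = [(c, e) for c, e in zip(correct, error) if c != e]
                if len(diffs) == 1:
                    matrix[diffs[0]] += 1  # key = (intended letter, typed letter)
    return matrix

matrix = build_confusion_matrix()
print(matrix.most_common(5))  # the most frequent single-letter substitutions
```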
How do we decide if a (legal) word is an error?
- Noisy Channel Model
- Consists of a source and a decoder
- The source is the true (correct) word
- The erroneous word that comes out of the noisy channel is decoded to the word with the highest Bayesian probability
- Language generated and passed through a noisy channel
- Resulting noisy data received
- Goal is to recover original data from noisy data
- The metaphor comes from Jelinek's (1976) work on speech recognition, but the algorithm is a special case of Bayesian inference (1763)
- Channel input: w (true word)
- Output: O (spelling-error word)
- Decoding: choose the hypothesis ŵ = argmax_w P(w | O) = argmax_w P(O | w) P(w)
Applying Bayes’ Rule
We use Bayesian probabilities to predict which word the misspelling caat should be corrected to. Since cat has a higher probability than carrot, caat is corrected to cat.
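A toy noisy-channel decoder for the caat example; the prior and channel probabilities below are invented illustrative numbers, not estimates from a real corpus or confusion matrix.

```python
# P(w): word priors from a language model; P(O|w): channel/error probabilities.
# All numbers below are invented for illustration.
PRIOR = {"cat": 9e-4, "carrot": 1e-4, "coat": 3e-4}
CHANNEL = {("caat", "cat"): 1e-2, ("caat", "carrot"): 1e-4, ("caat", "coat"): 2e-3}

def correct(observed, candidates):
    # Decoding: argmax over candidates w of P(O | w) * P(w)  (Bayes' rule; P(O) is constant)
    return max(candidates, key=lambda w: CHANNEL.get((observed, w), 0.0) * PRIOR[w])

print(correct("caat", ["cat", "carrot", "coat"]))  # -> cat
```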
Neural Nets
- Application
- OCR (handwritten number recognition, e.g., zip codes, receipt reading): Matan 1992; Keeler 1992
- 2-phase hand-printed-recognizer: Blur 1987
- Name / address correction: Kukich 1988;
- Text-to-speech: Kukich 1990
- ….
Context-Dependent Word Correction
- Context-dependent word correction = correcting words based on the surrounding context
- Essentially a fancier name for a grammar checker = a mechanism which tells a user if their grammar is wrong.
- 25-40% of spelling errors are real words (Kukich, 1992)
Real-word error: Classification
- 5 Levels of processing constraints
- Lexical level: non-word errors -> isolated-word spelling correction techniques
- Syntactic level: The study was conducted be XYZ Marketing Research
- Semantic level: see you in five minuets
- Discourse structure level: Iris flowers have four parts: standards, petals, and falls.
- Pragmatic level: Has Mary registered for Intro to Communication Science yet? (where Computer was intended)
- The Other Classification
- Local syntactic errors: The study was conducted mainly be John Black.
- Global syntactic errors: I can’t pick him cup cuz he might be working late.
- Semantic errors: He is trying to fine out.
- Orthographic intrusion errors: is there a were to display your plots on a DEC workstation? (way -> were)
NLP Prototypes for Handling Ill-Formed Input
- A parser driven NLP system: Lexicon + Grammar + Parsing Procedure
- Three main approaches
- Acceptance-based approaches
- Accept input even with grammatical errors, as long as the meaning comes through
- Relaxation-based approaches
- Errors cannot simply be ignored (EPISTLE, CRITIQUE)
- Input (non-word errors) -> Parsing procedure (real-word errors: grammar, subj.-verb, punctuation)
- Expectation-based approaches
- CASPAR, DYPAR, MULTIPAR, NOMAD
- Parsing procedure: Builds a list of expected words
- If input string is NOT in the list, it assumes an error detected
- Attempts to choose one of the words from the expectation list.
- General techniques exist for lexical & syntactic error correction.
- BUT, techniques for processing semantic, pragmatic, and discourse level are impractical ( ~ 1992).
- Parser-Based Writing Aid Tools: CRITIQUE 1987 (previously IBM EPISTLE 1982)
Statistically Based Error Detection and Correction Experiments
- Statistical Language Models (SLMs): a probability distribution P(S) over strings S that attempts to reflect how frequently a string S occurs as a sentence.
- Detecting real-word errors via POS bigram probabilities: UCREL CLAWS – POS bigram transition probabilities ( Dictionary Lookup -> POS bigram transition prob.)
- Improving non-word correction accuracy via word bigram probabilities
- Detecting & correcting real-word errors via Lexical Trigram Probabilities: Mays et al. 1991 – using Bayesian prob. score computation for each sentence
- Neural Networks Language Models
- SLM : explicitly represented as conditional probabilities
- NNLM : implicitly represented as weights in the nets
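As a closing illustration of the SLM idea, here is a tiny word-bigram scorer that flags a real-word error by comparing sentence probabilities; the bigram probabilities are invented for illustration, not trained on a corpus.

```python
import math

# Invented bigram probabilities; a real system would estimate these from a corpus.
BIGRAM_P = {("conducted", "by"): 1e-3, ("conducted", "be"): 1e-7}

def sentence_log_prob(words, unseen=1e-9):
    """Sum of log word-bigram probabilities, with a floor for unseen bigrams."""
    return sum(math.log(BIGRAM_P.get(pair, unseen)) for pair in zip(words, words[1:]))

good = "the study was conducted by xyz".split()
bad = "the study was conducted be xyz".split()
print(sentence_log_prob(good) > sentence_log_prob(bad))  # True: the real-word error scores lower
```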