From: "Earl F. Glynn"
Newsgroups: sci.math,sci.math.num-analysis
Subject: Re: Frequency of English Alphabet
Date: 4 Jan 1999 06:11:09 GMT
Wilson Figueroa wrote in message <368F9323.51BB@aznet.net>...
>I was reading some books on cryptography and read about statistical
>attacks on ciphers.
>
>I was wondering if anyone has a complete list for all 26 letters of the
>English alphabet and how often these arise in normal writing (for
>example, the letter E is the most often used letter in the English
>language and one source said it has a 0.18 probability of occurring).
>What is the complete list for all 26 letters?
Here's the "One-Gram Probability Distribution" from Alan G. Konheim's
"Cryptography -- A Primer," John Wiley, 1981, p. 16:
A 0.0856
B 0.0139
C 0.0279
D 0.0378
E 0.1304
F 0.0289
G 0.0199
H 0.0528
I 0.0627
J 0.0013
K 0.0042
L 0.0339
M 0.0249
N 0.0707
O 0.0797
P 0.0199
Q 0.0012
R 0.0677
S 0.0607
T 0.1045
U 0.0249
V 0.0092
W 0.0149
X 0.0017
Y 0.0199
Z 0.0008
The Two-Gram Probability Distribution (p. 16) is also very interesting.
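As a rough illustration of how a table like this feeds a statistical
attack, here is a minimal Python sketch that guesses the shift of a Caesar
cipher by scoring each candidate decryption against the one-gram
probabilities above. The function names (shift_text, score, guess_shift)
and the sample sentence are assumptions made for the example; nothing here
is taken from Konheim's book beyond the 26 numbers.

```python
import math

# One-gram probabilities as quoted above (Konheim 1981, p. 16).
FREQ = {
    "A": 0.0856, "B": 0.0139, "C": 0.0279, "D": 0.0378, "E": 0.1304,
    "F": 0.0289, "G": 0.0199, "H": 0.0528, "I": 0.0627, "J": 0.0013,
    "K": 0.0042, "L": 0.0339, "M": 0.0249, "N": 0.0707, "O": 0.0797,
    "P": 0.0199, "Q": 0.0012, "R": 0.0677, "S": 0.0607, "T": 0.1045,
    "U": 0.0249, "V": 0.0092, "W": 0.0149, "X": 0.0017, "Y": 0.0199,
    "Z": 0.0008,
}


def shift_text(text, k):
    """Caesar-shift every letter by k places (result is upper case)."""
    out = []
    for ch in text.upper():
        if "A" <= ch <= "Z":
            out.append(chr((ord(ch) - 65 + k) % 26 + 65))
        else:
            out.append(ch)  # spaces and punctuation pass through unchanged
    return "".join(out)


def score(text):
    """Log-likelihood of the letters in text under the one-gram table."""
    return sum(math.log(FREQ[ch]) for ch in text if ch in FREQ)


def guess_shift(ciphertext):
    """Return the encryption shift whose decryption looks most like English."""
    return max(range(26), key=lambda k: score(shift_text(ciphertext, -k)))


# Hypothetical sample text, chosen only because it is ordinary English.
SAMPLE = ("It was the best of times, it was the worst of times, "
          "it was the age of wisdom, it was the age of foolishness")
cipher = shift_text(SAMPLE, 3)
print(cipher)
print(guess_shift(cipher))
```

The log-likelihood score is just one reasonable choice; a chi-squared
comparison of observed and expected counts would serve the same purpose.
Very short ciphertexts may not contain enough letters for either to work.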
efg
_________________________________
efg's Computer Lab: www.efg2.com/lab
efg's Technical Book Store:
www.efg2.com/lab/TechBooks
Earl F. Glynn E-Mail: EarlGlynn@att.net
Overland Park, KS USA
==============================================================================
Addenda [djr]
Here are the letters, now listed according to the frequency of usage in
English. Adjoined is their encoding in International Morse Code.
0.1304 E .
0.1045 T -
0.0856 A .-
0.0797 O ---
0.0707 N -.
0.0677 R .-.
0.0627 I ..
0.0607 S ...
0.0528 H ....
0.0378 D -..
0.0339 L .-..
0.0289 F ..-.
0.0279 C -.-.
0.0249 M --
0.0249 U ..-
0.0199 G --.
0.0199 Y -.--
0.0199 P .--.
0.0149 W .--
0.0139 B -...
0.0092 V ...-
0.0042 K -.-
0.0017 X -..-
0.0013 J .---
0.0012 Q --.-
0.0008 Z --..
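The listing above is just the one-gram table re-sorted, and can be
reproduced with a few lines of Python. This is a sketch under the
assumption that ties are left in alphabetical order, so the three letters
at 0.0199 come out as G, P, Y rather than G, Y, P as listed above; the
name by_frequency is made up for the example.

```python
# One-gram probabilities (Konheim 1981, p. 16) and International Morse Code.
FREQ = {
    "A": 0.0856, "B": 0.0139, "C": 0.0279, "D": 0.0378, "E": 0.1304,
    "F": 0.0289, "G": 0.0199, "H": 0.0528, "I": 0.0627, "J": 0.0013,
    "K": 0.0042, "L": 0.0339, "M": 0.0249, "N": 0.0707, "O": 0.0797,
    "P": 0.0199, "Q": 0.0012, "R": 0.0677, "S": 0.0607, "T": 0.1045,
    "U": 0.0249, "V": 0.0092, "W": 0.0149, "X": 0.0017, "Y": 0.0199,
    "Z": 0.0008,
}
MORSE = {
    "A": ".-", "B": "-...", "C": "-.-.", "D": "-..", "E": ".",
    "F": "..-.", "G": "--.", "H": "....", "I": "..", "J": ".---",
    "K": "-.-", "L": ".-..", "M": "--", "N": "-.", "O": "---",
    "P": ".--.", "Q": "--.-", "R": ".-.", "S": "...", "T": "-",
    "U": "..-", "V": "...-", "W": ".--", "X": "-..-", "Y": "-.--",
    "Z": "--..",
}


def by_frequency(freq):
    """Letters in decreasing order of probability.

    sorted() is stable, so letters with equal probability keep the
    alphabetical order in which they appear in the dict.
    """
    return sorted(freq, key=lambda letter: -freq[letter])


for letter in by_frequency(FREQ):
    print(f"{FREQ[letter]:.4f} {letter} {MORSE[letter]}")
```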
================
The Boy Scout Handbook claims that "the frequency in which we use [the
letter] in the English language" is:
ETAOINS HRDLUCM PFWVYB GJQKXZ
and provides the Morse code equivalents shown. I don't know the specific
connection intended by the inventors of the code, but the pattern is clear:
the most common letters get the shortest codes.
We can sort the letters this way: L1 < L2 if
    codelength_1 < codelength_2, OR
    codelength_1 = codelength_2 and (# dots)_1 > (# dots)_2, OR
    codelength_1 = codelength_2 and (# dots)_1 = (# dots)_2 and
        code_1 < code_2 lexicographically
(where "." < "-"); the first two rules attempt to encapsulate the idea
that the letters near the top are faster to transmit.
Here then is the ordering. It loosely follows the frequency table.
It is difficult to find a rationale that allots "O" the code "---"
and "K" the "nicer" code "-.-" when O is used 19 times as often as K!
(--- -.-? :-) )
0.1304 E .
0.1045 T -
0.0627 I ..
0.0856 A .-
0.0707 N -.
0.0249 M --
0.0607 S ...
0.0249 U ..-
0.0677 R .-.
0.0378 D -..
0.0149 W .--
0.0042 K -.-
0.0199 G --.
0.0797 O ---
0.0528 H ....
0.0092 V ...-
0.0289 F ..-.
0.0339 L .-..
0.0139 B -...
0.0199 P .--.   (Note: the unused codes ..-- and .-.- would come just before P)
0.0017 X -..-
0.0279 C -.-.
0.0008 Z --..
0.0013 J .---
0.0199 Y -.--
0.0012 Q --.-
The last two possible codes (in this ordering), namely "---." and "----",
are not used. Longer codes also exist for digits (five symbols each) and
for punctuation. I don't think there is a way to signify case.
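The three-part ordering rule above translates directly into a sort key.
The following Python sketch is one way to write it; the names code_key
and morse_key are invented for the example. One detail needs care: in
ASCII "-" sorts before ".", the opposite of the convention used here, so
the code maps "." to "0" and "-" to "1" before comparing.

```python
from itertools import product

MORSE = {
    "A": ".-", "B": "-...", "C": "-.-.", "D": "-..", "E": ".",
    "F": "..-.", "G": "--.", "H": "....", "I": "..", "J": ".---",
    "K": "-.-", "L": ".-..", "M": "--", "N": "-.", "O": "---",
    "P": ".--.", "Q": "--.-", "R": ".-.", "S": "...", "T": "-",
    "U": "..-", "V": "...-", "W": ".--", "X": "-..-", "Y": "-.--",
    "Z": "--..",
}

# Translate so that string comparison treats "." as smaller than "-".
DOT_FIRST = str.maketrans(".-", "01")


def code_key(code):
    """Shorter codes first, then more dots first, then '.' before '-'."""
    return (len(code), -code.count("."), code.translate(DOT_FIRST))


def morse_key(letter):
    return code_key(MORSE[letter])


order = sorted(MORSE, key=morse_key)
print("".join(order))

# Four-symbol codes that no letter uses, in the same ordering.
used = set(MORSE.values())
unused = sorted(
    ("".join(p) for p in product(".-", repeat=4) if "".join(p) not in used),
    key=code_key,
)
print(unused)
```

Negating the dot count in the key makes "more dots" sort earlier without
needing a custom comparison function.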
================
Braille anyone? Standard English Braille Cell List taken from
http://dots.physics.orst.edu/gs_bs_seb.html
[dot 1] a
[dot 1 2] b
[dot 1 4] c
[dot 1 4 5] d
[dot 1 5] e
[dot 1 2 4] f
[dot 1 2 4 5] g
[dot 1 2 5] h
[dot 2 4] i
[dot 2 4 5] j
[dot 1 3] k
[dot 1 2 3] l
[dot 1 3 4] m
[dot 1 3 4 5] n
[dot 1 3 5] o
[dot 1 2 3 4] p
[dot 1 2 3 4 5] q
[dot 1 2 3 5] r
[dot 2 3 4] s
[dot 2 3 4 5] t
[dot 1 3 6] u
[dot 1 2 3 6] v
[dot 2 4 5 6] w
[dot 1 3 4 6] x
[dot 1 3 4 5 6] y
[dot 1 3 5 6] z
* Punctuation marks:
[dot 2] ,
[dot 2 3] ;
[dot 2 5] :
[dot 2 5 6] .
[dot 2 3 5] !
[dot 2 3 6] ? (at end of word)
[dot 3] '
[dot 3 6] -
[dot 2 3 6] open double quote, when at beginning of word
[dot 3 5 6] close double quote
[dot 2 3 5 6] parenthesis, ( when on left, ) when on right
* Full-word signs:
[dot 1 2 3 4 6] and
[dot 1 2 3 4 5 6] for
[dot 1 2 3 5 6] of
[dot 2 3 4 6] the
[dot 2 3 4 5 6] with
* Letter combinations:
[dot 1 6] CH
[dot 1 2 6] GH
[dot 1 4 6] SH
[dot 1 4 5 6] TH
[dot 1 5 6] WH
[dot 1 2 4 6] ED
[dot 1 2 4 5 6] ER
[dot 1 2 5 6] OU
[dot 2 4 6] OW
[dot 2 6] EN
[dot 3 5] IN
[dot 3 4] ST
[dot 3 4 5] AR
[dot 3 4 6] ING
* Indicators, always appear in combination with other cells:
[dot 3 4 5 6] number indicator (also used as internal contraction
for "ble")
[dot 6] capitalization and internal contraction indicator
[dot 5 6] letter and internal contraction indicator
[dot 5] second internal contraction indicator
[dot 4 6] third internal contraction indicator
[dot 4 5 6] general contraction indicator
[dot 4 5] second general contraction indicator
[dot 4] accented letter indicator (follows letter)
Sorted by the number of dots in each cell, the letters show a slight
correlation with frequency-of-use:
[dot 1] a
[dot 1 2] b
[dot 1 3] k
[dot 1 4] c
[dot 1 5] e
[dot 2 4] i
[dot 1 2 3] l
[dot 1 2 4] f
[dot 1 2 5] h
[dot 1 3 4] m
[dot 1 3 5] o
[dot 1 3 6] u
[dot 1 4 5] d
[dot 2 3 4] s
[dot 2 4 5] j
[dot 1 2 3 4] p
[dot 1 2 3 5] r
[dot 1 2 3 6] v
[dot 1 2 4 5] g
[dot 1 3 4 5] n
[dot 1 3 4 6] x
[dot 1 3 5 6] z
[dot 2 3 4 5] t
[dot 2 4 5 6] w
[dot 1 3 4 5 6] y
[dot 1 2 3 4 5] q
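The same kind of sort works for the Braille cells. Below is a minimal
Python sketch, assuming the ordering is "fewest dots first, then by dot
numbers"; the name braille_key is made up for the example. With that
tie-break the two five-dot letters come out as q then y, whereas the
listing above has y before q; everything else agrees.

```python
# Dot numbers for each letter, copied from the cell list above.
BRAILLE = {
    "a": (1,), "b": (1, 2), "c": (1, 4), "d": (1, 4, 5), "e": (1, 5),
    "f": (1, 2, 4), "g": (1, 2, 4, 5), "h": (1, 2, 5), "i": (2, 4),
    "j": (2, 4, 5), "k": (1, 3), "l": (1, 2, 3), "m": (1, 3, 4),
    "n": (1, 3, 4, 5), "o": (1, 3, 5), "p": (1, 2, 3, 4),
    "q": (1, 2, 3, 4, 5), "r": (1, 2, 3, 5), "s": (2, 3, 4),
    "t": (2, 3, 4, 5), "u": (1, 3, 6), "v": (1, 2, 3, 6),
    "w": (2, 4, 5, 6), "x": (1, 3, 4, 6), "y": (1, 3, 4, 5, 6),
    "z": (1, 3, 5, 6),
}


def braille_key(letter):
    """Fewest dots first; ties broken by comparing the dot numbers."""
    dots = BRAILLE[letter]
    return (len(dots), dots)


braille_order = sorted(BRAILLE, key=braille_key)
for letter in braille_order:
    dots = " ".join(str(d) for d in BRAILLE[letter])
    print(f"[dot {dots}] {letter}")
```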
================
Oh, what the heck. One more comment on letter frequency: a mnemonic,
heard from d.j.e.nunn@durham.ac.uk (Douglas Nunn) :
Elephants' toenails are orange, not red, I suspect.
Helen drives Lorna's Ford Cortina.
My uncle George's yellow Peugeot went because
Vicky kept x-raying Jonathan's queer zebra.
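The mnemonic can be checked mechanically: taking the first letter of each
word should give back the frequency ordering from the sorted table earlier
in this file. A short Python sketch (the names MNEMONIC, TABLE_ORDER and
initials are made up for the example):

```python
MNEMONIC = """Elephants' toenails are orange, not red, I suspect.
Helen drives Lorna's Ford Cortina.
My uncle George's yellow Peugeot went because
Vicky kept x-raying Jonathan's queer zebra."""

# Letters in the order of the frequency-sorted table given earlier.
TABLE_ORDER = "ETAONRISHDLFCMUGYPWBVKXJQZ"


def initials(text):
    """First character of every whitespace-separated word, upper-cased."""
    return "".join(word[0].upper() for word in text.split())


print(initials(MNEMONIC))
print(initials(MNEMONIC) == TABLE_ORDER)
```

Splitting on whitespace keeps "x-raying" as a single word, so it
contributes only its X.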