diff options
Diffstat (limited to 'doc/codewalk/markov.xml')
-rw-r--r-- | doc/codewalk/markov.xml | 307 |
1 files changed, 0 insertions, 307 deletions
diff --git a/doc/codewalk/markov.xml b/doc/codewalk/markov.xml deleted file mode 100644 index 7e44840dc4..0000000000 --- a/doc/codewalk/markov.xml +++ /dev/null @@ -1,307 +0,0 @@ -<!-- -Copyright 2011 The Go Authors. All rights reserved. -Use of this source code is governed by a BSD-style -license that can be found in the LICENSE file. ---> - -<codewalk title="Generating arbitrary text: a Markov chain algorithm"> - -<step title="Introduction" src="doc/codewalk/markov.go:/Generating/,/line\./"> - This codewalk describes a program that generates random text using - a Markov chain algorithm. The package comment describes the algorithm - and the operation of the program. Please read it before continuing. -</step> - -<step title="Modeling Markov chains" src="doc/codewalk/markov.go:/ chain/"> - A chain consists of a prefix and a suffix. Each prefix is a set - number of words, while a suffix is a single word. - A prefix can have an arbitrary number of suffixes. - To model this data, we use a <code>map[string][]string</code>. - Each map key is a prefix (a <code>string</code>) and its values are - lists of suffixes (a slice of strings, <code>[]string</code>). - <br/><br/> - Here is the example table from the package comment - as modeled by this data structure: - <pre> -map[string][]string{ - " ": {"I"}, - " I": {"am"}, - "I am": {"a", "not"}, - "a free": {"man!"}, - "am a": {"free"}, - "am not": {"a"}, - "a number!": {"I"}, - "number! I": {"am"}, - "not a": {"number!"}, -}</pre> - While each prefix consists of multiple words, we - store prefixes in the map as a single <code>string</code>. - It would seem more natural to store the prefix as a - <code>[]string</code>, but we can't do this with a map because the - key type of a map must implement equality (and slices do not). - <br/><br/> - Therefore, in most of our code we will model prefixes as a - <code>[]string</code> and join the strings together with a space - to generate the map key: - <pre> -Prefix Map key - -[]string{"", ""} " " -[]string{"", "I"} " I" -[]string{"I", "am"} "I am" -</pre> -</step> - -<step title="The Chain struct" src="doc/codewalk/markov.go:/type Chain/,/}/"> - The complete state of the chain table consists of the table itself and - the word length of the prefixes. The <code>Chain</code> struct stores - this data. -</step> - -<step title="The NewChain constructor function" src="doc/codewalk/markov.go:/func New/,/\n}/"> - The <code>Chain</code> struct has two unexported fields (those that - do not begin with an upper case character), and so we write a - <code>NewChain</code> constructor function that initializes the - <code>chain</code> map with <code>make</code> and sets the - <code>prefixLen</code> field. - <br/><br/> - This is constructor function is not strictly necessary as this entire - program is within a single package (<code>main</code>) and therefore - there is little practical difference between exported and unexported - fields. We could just as easily write out the contents of this function - when we want to construct a new Chain. - But using these unexported fields is good practice; it clearly denotes - that only methods of Chain and its constructor function should access - those fields. Also, structuring <code>Chain</code> like this means we - could easily move it into its own package at some later date. -</step> - -<step title="The Prefix type" src="doc/codewalk/markov.go:/type Prefix/"> - Since we'll be working with prefixes often, we define a - <code>Prefix</code> type with the concrete type <code>[]string</code>. - Defining a named type clearly allows us to be explicit when we are - working with a prefix instead of just a <code>[]string</code>. - Also, in Go we can define methods on any named type (not just structs), - so we can add methods that operate on <code>Prefix</code> if we need to. -</step> - -<step title="The String method" src="doc/codewalk/markov.go:/func[^\n]+String/,/}/"> - The first method we define on <code>Prefix</code> is - <code>String</code>. It returns a <code>string</code> representation - of a <code>Prefix</code> by joining the slice elements together with - spaces. We will use this method to generate keys when working with - the chain map. -</step> - -<step title="Building the chain" src="doc/codewalk/markov.go:/func[^\n]+Build/,/\n}/"> - The <code>Build</code> method reads text from an <code>io.Reader</code> - and parses it into prefixes and suffixes that are stored in the - <code>Chain</code>. - <br/><br/> - The <code><a href="/pkg/io/#Reader">io.Reader</a></code> is an - interface type that is widely used by the standard library and - other Go code. Our code uses the - <code><a href="/pkg/fmt/#Fscan">fmt.Fscan</a></code> function, which - reads space-separated values from an <code>io.Reader</code>. - <br/><br/> - The <code>Build</code> method returns once the <code>Reader</code>'s - <code>Read</code> method returns <code>io.EOF</code> (end of file) - or some other read error occurs. -</step> - -<step title="Buffering the input" src="doc/codewalk/markov.go:/bufio\.NewReader/"> - This function does many small reads, which can be inefficient for some - <code>Readers</code>. For efficiency we wrap the provided - <code>io.Reader</code> with - <code><a href="/pkg/bufio/">bufio.NewReader</a></code> to create a - new <code>io.Reader</code> that provides buffering. -</step> - -<step title="The Prefix variable" src="doc/codewalk/markov.go:/make\(Prefix/"> - At the top of the function we make a <code>Prefix</code> slice - <code>p</code> using the <code>Chain</code>'s <code>prefixLen</code> - field as its length. - We'll use this variable to hold the current prefix and mutate it with - each new word we encounter. -</step> - -<step title="Scanning words" src="doc/codewalk/markov.go:/var s string/,/\n }/"> - In our loop we read words from the <code>Reader</code> into a - <code>string</code> variable <code>s</code> using - <code>fmt.Fscan</code>. Since <code>Fscan</code> uses space to - separate each input value, each call will yield just one word - (including punctuation), which is exactly what we need. - <br/><br/> - <code>Fscan</code> returns an error if it encounters a read error - (<code>io.EOF</code>, for example) or if it can't scan the requested - value (in our case, a single string). In either case we just want to - stop scanning, so we <code>break</code> out of the loop. -</step> - -<step title="Adding a prefix and suffix to the chain" src="doc/codewalk/markov.go:/ key/,/key\], s\)"> - The word stored in <code>s</code> is a new suffix. We add the new - prefix/suffix combination to the <code>chain</code> map by computing - the map key with <code>p.String</code> and appending the suffix - to the slice stored under that key. - <br/><br/> - The built-in <code>append</code> function appends elements to a slice - and allocates new storage when necessary. When the provided slice is - <code>nil</code>, <code>append</code> allocates a new slice. - This behavior conveniently ties in with the semantics of our map: - retrieving an unset key returns the zero value of the value type and - the zero value of <code>[]string</code> is <code>nil</code>. - When our program encounters a new prefix (yielding a <code>nil</code> - value in the map) <code>append</code> will allocate a new slice. - <br/><br/> - For more information about the <code>append</code> function and slices - in general see the - <a href="/doc/articles/slices_usage_and_internals.html">Slices: usage and internals</a> article. -</step> - -<step title="Pushing the suffix onto the prefix" src="doc/codewalk/markov.go:/p\.Shift/"> - Before reading the next word our algorithm requires us to drop the - first word from the prefix and push the current suffix onto the prefix. - <br/><br/> - When in this state - <pre> -p == Prefix{"I", "am"} -s == "not" </pre> - the new value for <code>p</code> would be - <pre> -p == Prefix{"am", "not"}</pre> - This operation is also required during text generation so we put - the code to perform this mutation of the slice inside a method on - <code>Prefix</code> named <code>Shift</code>. -</step> - -<step title="The Shift method" src="doc/codewalk/markov.go:/func[^\n]+Shift/,/\n}/"> - The <code>Shift</code> method uses the built-in <code>copy</code> - function to copy the last len(p)-1 elements of <code>p</code> to - the start of the slice, effectively moving the elements - one index to the left (if you consider zero as the leftmost index). - <pre> -p := Prefix{"I", "am"} -copy(p, p[1:]) -// p == Prefix{"am", "am"}</pre> - We then assign the provided <code>word</code> to the last index - of the slice: - <pre> -// suffix == "not" -p[len(p)-1] = suffix -// p == Prefix{"am", "not"}</pre> -</step> - -<step title="Generating text" src="doc/codewalk/markov.go:/func[^\n]+Generate/,/\n}/"> - The <code>Generate</code> method is similar to <code>Build</code> - except that instead of reading words from a <code>Reader</code> - and storing them in a map, it reads words from the map and - appends them to a slice (<code>words</code>). - <br/><br/> - <code>Generate</code> uses a conditional for loop to generate - up to <code>n</code> words. -</step> - -<step title="Getting potential suffixes" src="doc/codewalk/markov.go:/choices/,/}\n/"> - At each iteration of the loop we retrieve a list of potential suffixes - for the current prefix. We access the <code>chain</code> map at key - <code>p.String()</code> and assign its contents to <code>choices</code>. - <br/><br/> - If <code>len(choices)</code> is zero we break out of the loop as there - are no potential suffixes for that prefix. - This test also works if the key isn't present in the map at all: - in that case, <code>choices</code> will be <code>nil</code> and the - length of a <code>nil</code> slice is zero. -</step> - -<step title="Choosing a suffix at random" src="doc/codewalk/markov.go:/next := choices/,/Shift/"> - To choose a suffix we use the - <code><a href="/pkg/math/rand/#Intn">rand.Intn</a></code> function. - It returns a random integer up to (but not including) the provided - value. Passing in <code>len(choices)</code> gives us a random index - into the full length of the list. - <br/><br/> - We use that index to pick our new suffix, assign it to - <code>next</code> and append it to the <code>words</code> slice. - <br/><br/> - Next, we <code>Shift</code> the new suffix onto the prefix just as - we did in the <code>Build</code> method. -</step> - -<step title="Returning the generated text" src="doc/codewalk/markov.go:/Join\(words/"> - Before returning the generated text as a string, we use the - <code>strings.Join</code> function to join the elements of - the <code>words</code> slice together, separated by spaces. -</step> - -<step title="Command-line flags" src="doc/codewalk/markov.go:/Register command-line flags/,/prefixLen/"> - To make it easy to tweak the prefix and generated text lengths we - use the <code><a href="/pkg/flag/">flag</a></code> package to parse - command-line flags. - <br/><br/> - These calls to <code>flag.Int</code> register new flags with the - <code>flag</code> package. The arguments to <code>Int</code> are the - flag name, its default value, and a description. The <code>Int</code> - function returns a pointer to an integer that will contain the - user-supplied value (or the default value if the flag was omitted on - the command-line). -</step> - -<step title="Program set up" src="doc/codewalk/markov.go:/flag.Parse/,/rand.Seed/"> - The <code>main</code> function begins by parsing the command-line - flags with <code>flag.Parse</code> and seeding the <code>rand</code> - package's random number generator with the current time. - <br/><br/> - If the command-line flags provided by the user are invalid the - <code>flag.Parse</code> function will print an informative usage - message and terminate the program. -</step> - -<step title="Creating and building a new Chain" src="doc/codewalk/markov.go:/c := NewChain/,/c\.Build/"> - To create the new <code>Chain</code> we call <code>NewChain</code> - with the value of the <code>prefix</code> flag. - <br/><br/> - To build the chain we call <code>Build</code> with - <code>os.Stdin</code> (which implements <code>io.Reader</code>) so - that it will read its input from standard input. -</step> - -<step title="Generating and printing text" src="doc/codewalk/markov.go:/c\.Generate/,/fmt.Println/"> - Finally, to generate text we call <code>Generate</code> with - the value of the <code>words</code> flag and assigning the result - to the variable <code>text</code>. - <br/><br/> - Then we call <code>fmt.Println</code> to write the text to standard - output, followed by a carriage return. -</step> - -<step title="Using this program" src="doc/codewalk/markov.go"> - To use this program, first build it with the - <a href="/cmd/go/">go</a> command: - <pre> -$ go build markov.go</pre> - And then execute it while piping in some input text: - <pre> -$ echo "a man a plan a canal panama" \ - | ./markov -prefix=1 -a plan a man a plan a canal panama</pre> - Here's a transcript of generating some text using the Go distribution's - README file as source material: - <pre> -$ ./markov -words=10 < $GOROOT/README -This is the source code repository for the Go source -$ ./markov -prefix=1 -words=10 < $GOROOT/README -This is the go directory (the one containing this README). -$ ./markov -prefix=1 -words=10 < $GOROOT/README -This is the variable if you have just untarred a</pre> -</step> - -<step title="An exercise for the reader" src="doc/codewalk/markov.go"> - The <code>Generate</code> function does a lot of allocations when it - builds the <code>words</code> slice. As an exercise, modify it to - take an <code>io.Writer</code> to which it incrementally writes the - generated text with <code>Fprint</code>. - Aside from being more efficient this makes <code>Generate</code> - more symmetrical to <code>Build</code>. -</step> - -</codewalk> |