2014-03-30

On efficiency and sorting


Comparing programs.

When there are two or more programs or functions that solve the same problem equally well, it's necessary to find ways to compare their performance. The average time taken to reach the solution would likely be the first metric to compare. Unfortunately, average run time can be challenging to calculate for even moderately complex functions with multiple inputs, or where multiple characteristics of a single input affect run time.

In some cases it is sufficient to look at the worst case instead of the average case. This at least lets you determine that a program will solve any problem of a given size and type within a certain maximum time. In computer science the worst case run time is symbolized by T_P(n), where P is the name of the program and n is the size of its input.

The "big oh".

For further ease of comparison, programs can be grouped by a simple function that describes their worst case run times. This classification of time complexity based on input size is called "big oh" and is symbolized by a capital letter 𝒪, in a script font where possible. For example, a function with input size n and a worst case run time of 7n³ - n² + 8n + 9000 is said to be in 𝒪(n³) because, as the input grows extremely large, the highest order term dominates and is sufficient to describe the overall behavior.

The 𝒪 function is technically an upper bound on the worst case run time, so any program that is in 𝒪(n²) is also in 𝒪(n³) and even in 𝒪(2ⁿ).

Comparing methods of sorting.

Sorting algorithms are a good place to start looking at efficiency because they are relatively easy to explain and visualize and have been the subject of considerable historical efficiency analysis.

While there are algorithms that sort specific types of data in specific ways in linear time, taking on the order of n steps to sort a list of n items, general comparison sorting cannot do better than 𝒪(n lg n) in the worst case. In the cases we examined in CSC148 this type of run time was achieved by dividing or partitioning the items to be sorted recursively, so that sublists of the items could be sorted (a single-item list takes zero steps to sort) and merged in n steps. Any list of n items can be divided in half log₂(n) times, leading to log₂(n) rounds of merging, each taking n steps.
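As a rough sketch of that divide-and-merge idea (my own illustration, not the exact code we used in class), a merge sort in Python might look like this:

def merge_sort(lst: list) -> list:
    '''Return a new sorted list containing the items of lst.

    Splitting in half gives about log2(n) levels of recursion, and the
    merge at each level touches every item once, for roughly n lg n steps.
    '''
    if len(lst) <= 1:
        return lst[:]  # A list of zero or one items is already sorted.
    mid = len(lst) // 2
    left = merge_sort(lst[:mid])
    right = merge_sort(lst[mid:])
    # Merge the two sorted halves in a single pass over both.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

The two recursive calls keep halving the list, while the merge loop does the n-step combining work at each level.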


Card file image (CC BY 2.0) Glyn Lowe. Merge sort animation (Wikimedia Commons) Swfung8.

2014-02-24

On recursion


General recursion.

The English word recursion is derived from a Late Latin word meaning to run back. Something is recursive, in the general sense of the word, whenever a part of the thing is a copy of the entire thing. As with the curves in the image above, the same form is observed at every step of examination, whether you look more closely or more broadly.

Recursive computing algorithms.

In computer programming, a recursive function is one which, in some cases, calls itself as part of its process. To be useful, there must be at least one case where it does not call itself, and each level of recursion (each call to itself) must move closer to such a case. If either of these conditions is not met, the function may become stuck in a state of infinite recursion. The program will hang, never reaching the next step it is meant to accomplish.

The cases where recursion depth does not increase are referred to as the base cases. For example, in the following CSC148 example function for finding the sum of all numbers in a set of nested lists, the base case is when the item being examined is a number rather than a list.

def sum_list(L: list) -> float:
    '''Return sum of the numbers in L.

    L - a possibly nested list of numbers. L, and all sublists, are non-empty.

    >>> sum_list([1, 2, 3])
    6
    >>> sum_list([1, [2, 3, [4]], 5])
    15
    '''
    return sum([sum_list(x) if isinstance(x, list) else x for x in L])

A drawback of recursion.

The main problem I am aware of with the use of recursion in computer programming is memory usage. Each time a function calls itself, a new frame has to be added to the call stack, holding that call's local variables. This can reduce the performance of other programs, or even crash the program, if allowed to continue for a very large number of levels of recursion. To prevent this, Python, by default, only allows 1000 levels of recursion. This prevents the worst problems, but also places some limits on the usefulness of recursion.
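As a quick illustration (the 1000 is just CPython's default, and the exact error raised has varied between Python versions), the limit can be checked and, cautiously, raised through the sys module:

import sys

print(sys.getrecursionlimit())   # 1000 by default in CPython

def count_down(n: int) -> None:
    '''Recurse n levels deep, doing nothing else.'''
    if n > 0:
        count_down(n - 1)

count_down(990)     # Fine: stays under the default limit.
# count_down(2000)  # Would raise an error for exceeding the recursion limit.

# The limit can be raised, at the cost of more stack space (and more risk):
sys.setrecursionlimit(5000)
count_down(2000)    # Now fine.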

Tail recursion.

The problem of recursion gobbling up memory can be fixed in some cases by using tail call optimization. A tail call is when a function calls itself as the very last thing it does before returning. In this case, unlike a recursive call anywhere else in the function, any information in the stack frame that isn't being passed to the new instance of the function can be safely discarded. If a language supports tail call optimization it will identify these situations and clean up the completed function call before beginning the next one. This keeps the stack from growing at all, no matter how many times a tail recursive function calls itself.

As interesting as this is, Python does not have tail recursion optimization and is very unlikely to support it any time soon. Working in Python, very deep recursion may need to be either rewritten as iteration or manually tail call optimized using helper functions.
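As a sketch of what such a rewrite can look like (my own version, not course code), the nested-list summing function from the earlier post could be done with an explicit stack of items to visit instead of recursive calls, so the nesting depth no longer matters:

def sum_list_iterative(L: list) -> float:
    '''Return the sum of the numbers in a possibly nested list L,
    without using recursion.
    '''
    total = 0
    to_visit = [L]          # Explicit stack replacing the call stack.
    while to_visit:
        item = to_visit.pop()
        if isinstance(item, list):
            to_visit.extend(item)   # Defer the list's contents for later visits.
        else:
            total += item
    return total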

The last visible dog.

Talking about recursion reminds me of the film The Mouse and His Child (based on the excellent and bizarre book of the same name). In this story some characters ascribe a mystical significance to the "last visible dog" in a recursive logo image printed on cans of a specific brand of dog food.



Image made from CC images by Elijah Porter and Kevin Dooley.

2014-02-16

Memoization for lazier functions.


What is memoization and why is it useful?

While working on the first major assignment for CSC148 I found myself with a recursive algorithm that successfully found the value I needed but took an unacceptably long time when given anything but very small inputs. I asked about the problem on the course forum and instructor Dustin Wehr suggested I look up a technique called memoization.

The basic idea of memoization is to have a function that has predictable inputs and expensive computation remember things it has already computed and use these remembered results instead of recalculating whenever possible.

Simple Python memoization using a mutable default argument.

Undoubtedly I'm not the first to think of this, but if you only have a function or two you need to memoize, you can take advantage of the way Python handles default arguments that are mutable data structures. Ever since I first found out that changes to mutable default arguments persist from one call to a function to the next, it has bothered me. While it's still not the choice I would have made were it up to me, it's a lot less irritating now that I've found a way to make use of it in certain situations.

Since the Wikipedia entry on memoization uses a pseudocode factorial calculating function in its example, I'll use a Python factorial function in mine:

def factorial(n: 'non-negative int') -> int:
    '''Return n factorial == n! == (n * n-1 * n-2 * n-3 * ... * 1)'''
    if n == 0:  # By convention 0! = 1
        return 1
    else:
        return n * factorial(n - 1)

On a modern computer this function won't take all that long to run, even up to the maximum n allowed by Python's default recursion limit. We'll just imagine we need to optimize it for a slow, ancient machine. To do this we create an argument with a dictionary as its default value. We use this dictionary to store previously computed factorials, and then check the dictionary before computing a factorial.

def factorial(n: 'non-negative int', known_factorials: dict={0: 1}) -> int:
    '''Return n factorial == n! == (n * n-1 * n-2 * n-3 * ... * 1)'''
    if n not in known_factorials:
        # Compute the result once and remember it for all future calls.
        known_factorials[n] = n * factorial(n - 1)
    return known_factorials[n]

In this case, because we can start with the base case of 0! = 1 in the known_factorials dictionary, the memoized version doesn't even have more lines of code.

To memoize all things.

If you have more than one or two functions that need to be memoized, doing this to each of them would get wastefully repetitive. Instead, make a generalized memoizer to wrap around all these functions, or use the least-recently-used (LRU) cache memoizer functools.lru_cache from the Python standard library.
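For example, applying lru_cache to the factorial function takes one decorator line (maxsize=None tells it to keep every result rather than evicting old ones; since it's in the standard library, nothing needs to be installed):

from functools import lru_cache

@lru_cache(maxsize=None)    # Remember every result ever computed.
def factorial(n: int) -> int:
    '''Return n! using the cache for any value already computed.'''
    if n == 0:
        return 1
    return n * factorial(n - 1)

factorial(300)   # Computes and caches 0! through 300!.
factorial(295)   # Returned straight from the cache, no recursion needed.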

Image made from a CC image by Marc Wisniak.

2014-02-08

So many SLOGs. Random samples and mass searches.


Part of keeping a proper course log is reading and referencing other students' logs. To that end I browsed a random sample of the SLOGs listed on the course site. I found some nice looking and pokéfied ones and an interesting post about separation of concerns, a concept I didn't pick up on in class. But what about when I'm writing on a specific topic and want to find out what others have said about it?

Searching slogs with help from Python.

Most search engines have me covered when it comes to checking if a certain person has written about a specific topic. For example, to find out if I or a given other student have written about OOP I can search something like:
object oriented site:cscij.blogspot.ca OR site:another_slog.blog_site.something
But, with hundreds of SLOGs, typing the addresses out in the proper format by hand would take far too long. My approach to this problem was to copy all the SLOG URLs from the course page into a text file and write a short Python script to reformat them for use in a search engine.

multi_site_search.py

input_file_name = input('Enter name of .txt file with sites to search: ')
input_file = open(input_file_name)
output_text = []
output_file = open('multi_search_output.txt', 'w')
for line in input_file.readlines():
    line = line.strip('/ \n')
    if line.startswith('https://') or line.startswith('http://'):
        line = line[line.find('/') + 2:]
    if line.startswith('www.'):
        line = line[4:]
    output_text.append(line)
output_file.write('site:' + ' OR site:'.join(output_text))
# Close both files so the output is flushed to disk.
input_file.close()
output_file.close()

An imperfect solution.

Unfortunately, all the search engines I tried either have a relatively short limit on the length of search string they will accept or freeze when given the full SLOG list. Dividing the list into 25 pieces allowed me to search them all in a fairly reasonable amount of time, but needing to do this lessened my sense of accomplishment considerably. Perhaps, when I've learned some more about using Python directly on the web, I can automate those 25 searches further and present the integrated results.

Another approach.

Interestingly, while testing out these searches, I came across another student using Python to manipulate the SLOG list. His script reads the addresses directly from the site and randomly selects some number of them to visit; you should check out David's script here.

Image CC by See-ming Lee.

2014-01-30

Drawing a "tree burst" iteratively and recursively.


This week in class we examined a program that used turtles (in the Logo turtle graphics sense) to draw a simple fractal pattern we referred to as a tree burst. The goal was to get a better understanding of recursion by engaging with it in a visual way. It worked well, and I grasped intuitively how the recursion was going to work a little faster than I had with a previous example using nested list objects.

The program (tree_burst.py) is available from the course website and its visual output is roughly similar to the image below.


After looking at it for a while it struck me that, since each line is drawn by its own turtle and that turtle remains at the end of the line until it's needed in the next level of recursion, it shouldn't be too difficult to get the same result using iteration. I gave it a shot and found that I was right. My iterative solution is somewhat less elegant than the recursive example but still concise enough to post here in full.

tree_burst_iterative.py

import turtle

COLORS = ('red', 'green', 'blue')

def tree_burst_iterative(level, length, turtle):
    '''Use copies of turtle to draw a symmetrical ternary tree fractal
    to the specified level of detail starting with initial lines of
    specified length.
    '''
    turtle_list = []
    turtle_list.append(turtle)
    for _ in range(level):
        new_turtles = []
        for t in turtle_list:
            # Turn all existing turtles East and make them red.
            t.setheading(0)
            t.color(COLORS[0])
            # For each existing turtle create two recolored clones facing NW
            # and SW respectively.
            for i in range(1, 3):
                new_turtles.append(t.clone())
                new_turtles[-1].setheading(120 * i)
                new_turtles[-1].color(COLORS[i])
        # Add new clones to existing turtles and move all turtles forward.
        turtle_list.extend(new_turtles)
        for t in turtle_list:
            t.forward(length)
        # Halve the length of travel for the next round of moves.
        length = length / 2

if __name__ == '__main__':
    t = turtle.Turtle()
    t.speed('slow')
    t.color('red')
    tree_burst_iterative(4, 128, t)

Turtle image CC by jovian_dreams

2014-01-26

On object-oriented programming.


This sounds vaguely familiar.

I first encountered the concept of object-oriented programming in the late 1990s, when I somehow acquired a copy of Practical C++ Programming by Steve Oualline. It was way beyond my computing knowledge level (I had written the tiny beginnings of a choose-your-own-adventure style game in QBasic and was learning the basics of HTML), but I was enough of an enthusiast to slog through a good chunk of the book, even after realizing it wasn't anything I was actually going to use any time soon. The first thing I got out of reading about object-orientation, aside from new uses for words like class and parent, was the sense that programming was more complicated than I had thought. The second was that the architecture of programming was almost as much about making the workings of computers comprehensible to humans as it was about making computers do what we want them to.

Whence object-oriented programming?

The name object-oriented programming was coined by Alan Kay in the 1960s to describe the type of architecture he was working toward in what became the Smalltalk language. The structure of Smalltalk was influenced by LISP, Sketchpad, and particularly Simula, which is considered an object-oriented language even though it predates the classification. Kay's central concept for object-oriented programming was that programs should be divided into distinct entities, like the cells of an organism or the separate computers in a network. These entities would protect their internal processes and data and would work together strictly by sending and receiving messages. In a 2003 email, Kay defined true object-orientation as "only messaging, local retention and protection and hiding of state-process, and extreme late-binding of all things," and added that he didn't think any languages besides Smalltalk and LISP existed that could meet this specification. It's interesting that Kay doesn't explicitly include either classes or inheritance, the concepts that most struck me on my first brush with C++, as important factors.

Object-orientation in Python.

Python is quite object-oriented in that everything in Python is an object or a label for an object (a variable). However, Python doesn't fit Kay's concept of object-oriented programming in at least one important respect: Python objects are almost entirely lacking in privacy. This means that, despite conventions such as special variable names to discourage taking indiscriminate advantage of this, all of a Python object's data and methods are generally directly accessible to the rest of the program. If the object architecture in Python is a code of etiquette, then in a language like Smalltalk it's a code of law. Smalltalk objects sit in separate cells passing notes under the doors, while Python's inhabit a single room but, like polite guests at a fancy tea, refrain from eavesdropping on neighboring tables... for the most part.
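As a quick illustration with a toy class of my own (not course code): a single leading underscore is only a polite request, and even double-underscore name mangling can be worked around by anyone who knows the renaming rule.

class Teapot:
    def __init__(self) -> None:
        self._temperature = 95               # "Please treat this as internal."
        self.__secret_blend = 'earl grey'    # Name-mangled, but not truly private.

pot = Teapot()
print(pot._temperature)           # Works fine; the underscore is only a convention.
print(pot._Teapot__secret_blend)  # Even the mangled name is reachable if you ask for it.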

Image created from CC images by Ciro Cattuto and Scott Davies