Python basics: classes
This is the post version of a talk I had the opportunity to give to my colleagues at CAPSiDE-NTT. It is actually a follow-up to another talk that also became a blog post here.
In a certain sense, this is also an apology: I gave this talk through some nasty videoconference tool because we were all at home due to the confinement. In that scenario I only had my voice and a single screen at my disposal to try to transmit some information to the patient audience.
How not to give a remote talk
So I decided to share my screen, opened a terminal and started Emacs. (I couldn’t have missed that opportunity, of course.)
During the following hour, I talked and live coded, and wrote the relevant points and code in their respective Emacs buffers. I also had room for a terminal shell on the right.
The setup looked something like this:
The whole session was a little odd.
I felt like I was alone on a porch with an amateur radio station, trying to communicate with other civilizations. You know you’ve sent a message. You think it will arrive. You hope it will be decoded properly. But then, the only thing you can really do is wait and see if it is interpreted correctly and not taken as an explicit declaration of war, and hope that the aliens do not think, Hey, there’s some kind of spicy food down there, my sweet heptapodie.
But I digress.
The future goes through making web pages (again)
In the last post we saw a way to write web pages. It was an excuse to present Python functions.
It is obvious that writing web pages by hand via function calls is crazy. Or is it not? Anyway, there are other ways to write web pages. One that is particularly human friendly is Markdown.
Markdown
Markdown is intended to be readable as is, like plain text email. A typical markdown document (source.md) looks like:
And it is known to be transformed to some html like:
We just need a tool that transforms the markdown to html. We will call this tool m and this is what we will be constructing in the rest of this post.
But, in order to do so, we need a proper development setup and methodology, and this means TDD.
TDD
What is a proper development setup? A proper development setup is one which allows you to interact with the code (like a REPL), or at least see the effect of the changes ASAP.
Test Driven Development is a well known way to build trust and assert that things are going like you think they will.
The basic methodology is to start with a failing test. Then you write the smallest amount of code to make the test pass. If you do this diligently, then you build enough trust that the code is doing what it should be doing.
But I think TDD is really hard. You have to be strict and always play by the TDD rules if you want to build trust. This is something that in a real project needs support, agreement and commitment from developers, managers and even the whole company.
But if you are exploring by yourself and your main objective is to gain knowledge, you need a system you can play with. You want fast cycles not to be productive, but to see how the system works and responds.
Actually, there are languages, or environments that give you this from the start. Python comes with a REPL, but we need something better.
Enter TDD.
TDD: Tmux Driven Development
So, we want to build a system to transform our markdown source document to html. And we said we will call this utility m.
We will do it in small steps. First, create an executable python script and call it m:
The only thing this script does is read from stdin and write to stdout, look:
We can call it a day, right?
We can do better. Here tmux comes in. We split the window in two. In one of the panes m continuously transforms our source.md file to a dest.html file:
On the other, we will be watching the results:
The whole setup will look something like this
We just need a bit more plumbing:
define an initial `to_html` function inside the `paudirac.markdown` module
and change m to read like this
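The listings for this step are not reproduced here. As a rough sketch, assuming that `to_html` receives a readable and a writable file-like object (that signature, and the `main` helper, are assumptions made for illustration), the initial pass-through version and the way m could call it might look like this:

```python
import io
import sys


def to_html(reader, writer):
    # Initial version: no transformation yet, just copy the input through.
    for line in reader:
        writer.write(line)


def main(reader=sys.stdin, writer=sys.stdout):
    # Hypothetical entry point: the m script would call main() with the defaults.
    to_html(reader, writer)


# Demo with in-memory streams instead of stdin/stdout.
source = io.StringIO("# Title\n\nSome text\n")
dest = io.StringIO()
main(source, dest)
print(dest.getvalue(), end="")
```

Passing the streams in as arguments keeps the function easy to exercise from a REPL without touching real files.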
From this point on, the whole system is running and everything is plumbed: when we change the source.md, dest.html will automatically be rebuilt.
How cool is that?!
Block structure
Well, actually, right now, the system is not especially cool. I mean, the work done by to_html amounts to nothing at all. But, like in TDD (here I mean Test Driven Development), the interface has driven our (for now spartan) implementation.
Let’s get our hands dirty now.
From the CommonMark spec (a Markdown flavor) we get our first insight:
This means that parsing can proceed in two steps: first, the block structure of the document can be discerned; second, text lines inside paragraphs, headings, and other block constructs can be parsed for inline structure. The second step requires information about link reference definitions that will be available only at the end of the first step. Note that the first step requires processing lines in sequence, but the second can be parallelized, since the inline parsing of one block element does not affect the inline parsing of any other.
—CommonMark spec
And the separators between blocks are blank lines.
So, we have to make a decision: how to start processing the source files. There are lots of choices: characters, words, etc.
It seems reasonable to start at a fairly high level of abstraction: lines.
We define a lines generator that does that:
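A minimal sketch of such a generator, assuming it receives a file-like `reader` and strips the trailing newline from each line (both details are assumptions, since the original listing is not shown):

```python
import io


def lines(reader):
    # Lazily yield each line of the file-like object without its newline.
    for line in reader:
        yield line.rstrip("\n")


source = io.StringIO("# Title\n\nFirst paragraph\nstill the first paragraph\n")
print(list(lines(source)))
# → ['# Title', '', 'First paragraph', 'still the first paragraph']
```

Being a generator, `lines` produces nothing until something iterates over it, which is why it has to be turned into a list before it can be printed.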
and modify the to_html to use that one
This results in
so you can see that in this setup, tmux is also acting as a fairly nice debugger! How cool is that?!
Okay, we have to pass a string to the writer
(Note that we are forcing the lns generator to become a list in order to be printable.)
This produces something like (fragment):
Now that we have the lines, we aim for the blocks. In markdown, the newline character \n doesn’t mean anything special. Actually, two contiguous non-blank lines contain text that is from the same block.
What separates blocks from one another are blank lines. We create an auxiliary function that tells us if a line is blank or not:
and with it, we can define the blocks function:
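One possible shape for the helper and for the `blocks` generator, assuming that a line counts as blank when it holds only whitespace and that each block is a plain list of lines (a sketch, not necessarily the original code):

```python
import re

BLANK = re.compile(r"^\s*$")


def is_blank(line):
    # A line is blank when it is empty or contains only whitespace.
    return BLANK.match(line) is not None


def blocks(lines):
    # Group consecutive non-blank lines; blank lines only act as separators.
    current = []
    for line in lines:
        if is_blank(line):
            if current:
                yield current
                current = []
        else:
            current.append(line)
    if current:
        yield current


sample = ["# Title", "", "First line", "second line", "", "", "* item"]
print(list(blocks(sample)))
# → [['# Title'], ['First line', 'second line'], ['* item']]
```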
and redefine our to_html
This actually gives us all the block structure!
The classical way to give structure to structure
We no longer have a plain file. We first transformed that to a list of lines, and now we have a list of lists of lines. Also, we’ve got rid of the blank lines, which were only needed for the block structure.
As the spec said, now we only have to start working on the content of the blocks themselves.
At this point, though, I think things will be easier if we add a little bit of structure and modeling.
Actually, we have identified some kind of object in our model. But it lives in our minds. And our minds are like RAM, and are not always a shared resource.
There is a way to make models explicit, exact and out of our minds: source code.
We are talking about Blocks. Let’s introduce a model for that.
Python has nice support for object modeling: classes.
A simple class in python is defined like this:
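As an illustration, a class that simply stores the lines of a block could be as small as this. The attribute name `content` is the one the text uses later on; everything else is an assumption:

```python
class Block:
    def __init__(self, content):
        # Keep the list of lines that make up this block.
        self.content = content


block = Block(["First line", "second line"])
print(block.content)
# → ['First line', 'second line']
```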
And with that in place, we can change our blocks function to return, you’ve guessed it, Blocks:
With that, our screen now does not look so promising, though:
What’s happening here is that python is printing the Block object with its default internal representation. Fortunately, we can override that default implementation:
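A sketch of that override, assuming we want the representation to show the lines held by the block (the exact format is an arbitrary choice):

```python
class Block:
    def __init__(self, content):
        self.content = content

    def __repr__(self):
        # Used by repr(), and by print() when no __str__ is defined.
        return f"Block({self.content!r})"


print([Block(["# Title"]), Block(["a", "b"])])
# → [Block(['# Title']), Block(['a', 'b'])]
```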
which gives us a much better result, right?
Dunder methods
I think your intuition is helping you understand here. In some sense, you’ve guessed that __init__ is some kind of constructor method and __repr__ is some kind of toString method. But, depending on your mileage, your intuition could vary.
Actually both __init__ and __repr__ are special python methods.
Like most things in python, they are special by convention. There is no hard rule that forbids you from creating a class with a method named __bar__, but it would be extremely unpythonic!
The convention here is that method names with two leading underscores and two trailing underscores are used by the interpreter in some special circumstances.
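A small illustration of the difference, using a throwaway class `Foo` (a hypothetical name): the interpreter calls `__init__` and `__repr__` for us, while `__bar__` is legal but never invoked implicitly.

```python
class Foo:
    def __init__(self, name):
        # Called implicitly when we write Foo("x").
        self.name = name

    def __repr__(self):
        # Called implicitly by repr(), and by print() as no __str__ exists.
        return f"Foo({self.name!r})"

    def __bar__(self):
        # Allowed, but no language feature will ever call it for us.
        return "bar"


foo = Foo("x")
print(foo)            # → Foo('x')
print(foo.__bar__())  # → bar
```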
Because it is difficult to say underscore underscore init underscore underscore, the community agreed to call all the methods that follow that convention dunder methods, and you can talk about dunderinit or dunderrepr and everybody will understand, except your closest friends.
Divide and conquer
At this point, and following the spec advice, we can start working on the blocks themselves. Any block has all the information it needs.
In this simple source.md example we can see at least 3 types of blocks:
- headings,
- paragraphs and
- lists
Headings
We identify headings because their contents start with a # character. As before, we create a helper method to filter the headers out. Regexes to the rescue again:
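A possible helper, assuming a slightly stricter rule than "starts with #": one to six # characters followed by whitespace. That rule and the name `is_heading` are assumptions for this sketch:

```python
import re

# Assumed rule: 1 to 6 "#" characters followed by at least one whitespace.
HEADING = re.compile(r"^#{1,6}\s")


def is_heading(line):
    return HEADING.match(line) is not None


print(is_heading("# Title"))     # → True
print(is_heading("plain text"))  # → False
```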
We create a class H to hold the text of this special kind of block:
and another generator function parse that, given some blocks, will return us some representation of html, at this point a sequence of Hs and Blocks:
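Put together, a first version of `H` and `parse` might look like the following sketch. It assumes that a block is a heading when its first line starts with # and that `H` just keeps the raw text:

```python
import re


class Block:
    def __init__(self, content):
        self.content = content

    def __repr__(self):
        return f"Block({self.content!r})"


class H:
    def __init__(self, text):
        self.text = text

    def __repr__(self):
        return f"H({self.text!r})"


def parse(blocks):
    # Turn heading blocks into H elements and pass the rest through untouched.
    for block in blocks:
        if re.match(r"^#", block.content[0]):
            yield H(block.content[0])
        else:
            yield block


print(list(parse([Block(["# Title"]), Block(["Some text"])])))
# → [H('# Title'), Block(['Some text'])]
```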
We glue it together in our to_html function:
That results in this output:
Not bad.
But we can work now on the details.
Information hiding
The parse function is a little bit ugly. There’s too much logic in there. There’s too much detail. As we’ve said, we wanted the parse function to be fed with blocks and to give back html. But looking at it, it speaks about weird things like content[0]. Those trees don’t let us see the forest.
We can easily refactor those details to a block method. We can ask the block itself
and make the parse function clearer
Also, we can move the details of how to make an instance of an H to the H class itself.
Python doesn’t provide constructor method overloading, but we can use the named constructor pattern (which is a special case of factory method) and create a special constructor to create an H from a block.
The parse function now already expresses what it does without exposing how it does it
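A sketch of that refactoring, assuming the method names `is_heading` and `from_block`, and assuming that the richer `H` stores a heading level next to its text (all three are illustrative choices):

```python
import re


class Block:
    def __init__(self, content):
        self.content = content

    def __repr__(self):
        return f"Block({self.content!r})"

    def is_heading(self):
        # The block knows how to inspect its own lines.
        return self.content[0].startswith("#")


class H:
    def __init__(self, level, text):
        self.level = level
        self.text = text

    def __repr__(self):
        return f"H({self.level}, {self.text!r})"

    @classmethod
    def from_block(cls, block):
        # Named constructor: count the "#" marks and keep the remaining text.
        match = re.match(r"^(#+)\s*(.*)$", block.content[0])
        return cls(len(match.group(1)), match.group(2))


def parse(blocks):
    for block in blocks:
        yield H.from_block(block) if block.is_heading() else block


print(list(parse([Block(["## Title"]), Block(["text"])])))
# → [H(2, 'Title'), Block(['text'])]
```

Note that `parse` no longer mentions `content[0]` or any regex: those details now live next to the data they describe.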
This is super nice, because now we can work with the H class, and refine its responsibilities, without touching any other code.
Note that we’ve actually been able to change the class definition and modify even the initializer, and this had no effect on the working code other than creating a richer H object.
The extra cost of creating a named constructor is already paying off.
@classmethods
Python offers classmethods via the @classmethod decorator.
The important thing to notice is that they explicitly receive a class as their first argument (in the same vein as instance methods explicitly receive an instance as their first argument). The convention is to use the name cls for the class instead of self.
A typical use case for classmethods is named constructors. It makes total sense, because we want to create an instance of some class. It seems right to me to use the class as the receiver of the message to create an instance of itself, right? Who else could it be? Also, this is crucial when combined with inheritance.
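To see why `cls` matters with inheritance, here is a small sketch with hypothetical classes `Node`, `P` and `Item`: the named constructor is written once in the base class, yet it builds an instance of whichever subclass it is called on.

```python
class Node:
    def __init__(self, text):
        self.text = text

    def __repr__(self):
        return f"{type(self).__name__}({self.text!r})"

    @classmethod
    def from_line(cls, line):
        # cls is the class the method was called on, not necessarily Node.
        return cls(line.strip())


class P(Node):
    pass


class Item(Node):
    pass


print(P.from_line("  hello "))  # → P('hello')
print(Item.from_line("world"))  # → Item('world')
```

Had `from_line` hard-coded `Node(...)` instead of `cls(...)`, both calls would have returned plain `Node` objects.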
Lists
Lists have more structure than headings, but they are recognized because each item in the list starts with * (at least in our subset of markdown).
Again, it’s the block’s responsibility to identify if it is a list:
If we also define the corresponding element class List and put everything else in a paragraph class P, the parse function becomes:
The named constructor trick allows us to add a little bit more structure to the mix.
First, we define a Text element that will hold all the plain text. With that, we can define the new List and P elements, and redefine our H element. We also need an Item element for the list items.
We add the Html and Body elements that any HTML document has, but that don’t have any representation in markdown.
With the complete set of elements we can define a proper parse function that, given a sequence of tokens (our Blocks), returns a proper AST:
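The following sketch shows one way the whole set of elements could fit together. The class names come from the text; the constructors, the `from_block` and `is_*` methods and the `__repr__` formats are assumptions made to keep the example self-contained:

```python
class Text:
    def __init__(self, text):
        self.text = text

    def __repr__(self):
        return f"Text({self.text!r})"


class H:
    def __init__(self, level, text):
        self.level, self.text = level, text

    def __repr__(self):
        return f"H({self.level}, {self.text})"

    @classmethod
    def from_block(cls, block):
        first = block.content[0]
        level = len(first) - len(first.lstrip("#"))
        return cls(level, Text(first[level:].strip()))


class P:
    def __init__(self, text):
        self.text = text

    def __repr__(self):
        return f"P({self.text})"

    @classmethod
    def from_block(cls, block):
        # All the lines of a block belong to the same paragraph.
        return cls(Text(" ".join(block.content)))


class Item:
    def __init__(self, text):
        self.text = text

    def __repr__(self):
        return f"Item({self.text})"


class List:
    def __init__(self, items):
        self.items = items

    def __repr__(self):
        return f"List({self.items})"

    @classmethod
    def from_block(cls, block):
        # Drop the leading "*" marker of every line.
        return cls([Item(Text(line.lstrip()[1:].strip())) for line in block.content])


class Body:
    def __init__(self, children):
        self.children = children

    def __repr__(self):
        return f"Body({self.children})"


class Html:
    def __init__(self, body):
        self.body = body

    def __repr__(self):
        return f"Html({self.body})"


class Block:
    def __init__(self, content):
        self.content = content

    def is_heading(self):
        return self.content[0].startswith("#")

    def is_list(self):
        return all(line.lstrip().startswith("*") for line in self.content)


def parse(blocks):
    children = []
    for block in blocks:
        if block.is_heading():
            children.append(H.from_block(block))
        elif block.is_list():
            children.append(List.from_block(block))
        else:
            children.append(P.from_block(block))
    return Html(Body(children))


tree = parse([Block(["# Title"]), Block(["some", "text"]), Block(["* one", "* two"])])
print(tree)
```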
We adjust our glue to_html function
and m returns us a string representing the deeply nested and structured html document (indented here for readability):
At this point we’ve converted our markdown document to a deeply nested data structure. It is a tree. It mimics the DOM tree of an html document.
Ironically, an html document travels over the wire as a string that represents the document. It is the browser’s job to parse it (again!) and convert it to an actual DOM tree that it knows how to display on the screen.
We need to stringify it back!
Printing objects
Actually, we’ve already done that. We’ve converted our tree to a string representation, which I actually used in the last code snippet.
In that representation we’ve used the fact that we’ve overridden the __repr__ method of the root node of our tree (the Html node). Remember:
is calling __repr__ recursively under the covers. The f-string in the Html.__repr__ method
is asking the python interpreter to tell self.body to give it a string representation, which under the covers will call Body.__repr__ where there’s another f-string:
that is asking one of the python builtin types (list) to represent itself. The default implementation here is also recursive: it prints an initial [, then asks every object in the list to represent itself, and when there are no more items left, it prints a closing ]. Each item is asked to be represented as a string via its own __repr__ method.
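The chain of calls can be reproduced with three tiny classes. This is a reduced sketch with assumed constructors and formats, only meant to show the recursion:

```python
class P:
    def __init__(self, text):
        self.text = text

    def __repr__(self):
        return f"P({self.text!r})"


class Body:
    def __init__(self, children):
        self.children = children

    def __repr__(self):
        # Formatting a list calls repr() on each of its items.
        return f"Body({self.children})"


class Html:
    def __init__(self, body):
        self.body = body

    def __repr__(self):
        # Formatting self.body falls back to Body.__repr__ (no __str__ defined).
        return f"Html({self.body})"


print(Html(Body([P("one"), P("two")])))
# → Html(Body([P('one'), P('two')]))
```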
That is why we get all the nested tree representation. It has been very smart on our part, right?
This is very typical in languages that have an interactive REPL. They need some sort of string representation to print and read the objects and the smart thing is to at least have a representation that is the same to read the objects and to print them.
But now we can no longer rely on regexes or the interpreter’s tricks. We have to do some work and serialize the parse tree into a string that actually is a valid html string.
Luckily, the job of having a tree-like data structure and having to print it has been done so many times that people abstracted a pattern.
The visitor pattern
The visitor design pattern emerges when you have a data structure and want to walk it down and apply some algorithm on it. Here, we want to apply the algorithm that transforms it to a very specific string representation, a representation that is different from the one we obtain with the __repr__ trick.
Conventionally, the visitor pattern relies on having two categories of objects
- the Elements, which are the objects that make up the data structure
- the Visitor, which represents the special kind of algorithm you want to apply to the data
Here, our Elements are all the classes that we’ve used to represent html. And our visitor HtmlVisitor will be the object responsible for rendering this tree as an html string.
This is a pattern in which those two kinds of classes collaborate, and as such there is some coupling between them. The contract is very generic, though:
- the Element provides an accept public method that (well) accepts a visitor
- the Visitor provides a method visit_* for any element that it visits
Although it is not necessary in Python, we create an abstract Element class that makes the contract explicit
We use the abc.ABC base class for our Element and apply the abstractmethod decorator to the accept method. The metaclass of abc.ABC ensures that subclasses of Element implement the accept method. If they don’t, trying to instantiate them raises an exception.
This will not prevent us from forgetting the accept method before actually running the code. Python is not a statically typed language. But it will give us a nice and informative runtime error as early as possible:
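A sketch of that contract and of the failure mode, with two hypothetical subclasses `Broken` and `Fine`. The error appears when the incomplete class is instantiated, and its exact wording depends on the Python version:

```python
from abc import ABC, abstractmethod


class Element(ABC):
    @abstractmethod
    def accept(self, visitor):
        """Call back the visit_* method of the visitor for this element."""


class Broken(Element):
    # Forgot to implement accept.
    pass


class Fine(Element):
    def accept(self, visitor):
        return visitor


try:
    Broken()
except TypeError as error:
    print(f"TypeError: {error}")

print(Fine().accept("any visitor"))
```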
We will now enforce the contract. We need to
- make any element inherit from Element
- implement the accept method in all of them
The concrete implementations in any element will be in charge of calling back the visitor for the specific method appropriate for them.
Now we should implement the corresponding visit methods on the HtmlVisitor. Each method specializes to render the corresponding element:
We just need to create a visitor and trigger the visiting process by making the root accept it
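Here is a reduced sketch of the whole collaboration with only four elements. The `visit_*` method names follow the contract described above; the constructors and the exact markup produced are assumptions:

```python
import html
from abc import ABC, abstractmethod


class Element(ABC):
    @abstractmethod
    def accept(self, visitor):
        """Call back the visit_* method that matches this element."""


class Text(Element):
    def __init__(self, text):
        self.text = text

    def accept(self, visitor):
        return visitor.visit_text(self)


class P(Element):
    def __init__(self, text):
        self.text = text

    def accept(self, visitor):
        return visitor.visit_p(self)


class Body(Element):
    def __init__(self, children):
        self.children = children

    def accept(self, visitor):
        return visitor.visit_body(self)


class Html(Element):
    def __init__(self, body):
        self.body = body

    def accept(self, visitor):
        return visitor.visit_html(self)


class HtmlVisitor:
    def visit_text(self, text):
        # Escape characters that have a special meaning in html.
        return html.escape(text.text)

    def visit_p(self, p):
        return f"<p>{p.text.accept(self)}</p>"

    def visit_body(self, body):
        inner = "".join(child.accept(self) for child in body.children)
        return f"<body>{inner}</body>"

    def visit_html(self, node):
        return f"<html>{node.body.accept(self)}</html>"


tree = Html(Body([P(Text("one")), P(Text("a < b"))]))
print(tree.accept(HtmlVisitor()))
# → <html><body><p>one</p><p>a &lt; b</p></body></html>
```

Each element only knows which `visit_*` method to call back, and the visitor only knows how to render one kind of element at a time. The recursion over the tree emerges from the two talking to each other.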
With that in place, the source.md text is converted to (indented for clarity):
A (partial) sequence diagram shows us how this little contract is actually triggering a huge amount of collaboration between all the objects
It is a good pattern to know.
Exploiting the visitor
Recovering the AST representation is easy with another visitor
but delegating all the work to the builtin __repr__ that was already working. This seems silly, but the call site will behave the same with any of the visitors
In a context where the actual visitor can be chosen at runtime, having a common interface makes total sense.
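As a sketch of what choosing the visitor at runtime could look like, consider a single `Text` element and two interchangeable visitors. `ShoutingVisitor`, `render` and the `visitors` dictionary are hypothetical names introduced for this example:

```python
class Text:
    def __init__(self, text):
        self.text = text

    def __repr__(self):
        return f"Text({self.text!r})"

    def accept(self, visitor):
        return visitor.visit_text(self)


class ReprVisitor:
    def visit_text(self, text):
        # Delegate everything to the __repr__ that already works.
        return repr(text)


class ShoutingVisitor:
    def visit_text(self, text):
        return text.text.upper() + "!"


def render(root, visitor):
    # The call site is the same whatever visitor it receives.
    return root.accept(visitor)


visitors = {"repr": ReprVisitor(), "shout": ShoutingVisitor()}
for name in ("repr", "shout"):
    print(render(Text("hello"), visitors[name]))
# → Text('hello')
# → HELLO!
```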
But, better than that, now we’ve opened the door to many funny things. And they require minimal effort. Look:
will produce a very noisy document.
And, drawing inspiration from Acme::HOIGAN, we can implement the funny HoiganVisitor
and obtain this nice and useful web page.
Homework
This is not a complete markdown parser, but it is fun to see what 250 LOC can do. Some things left that should be easy (because they don’t need anything outside their blocks) are:
- bold text
- text in italics
- ordered lists
- code snippets
The full code for this post is here.