How Python caches compiled bytecode.

While reading an email on the python-dev mailing list about PEP-488 (which is not yet approved and is still under discussion) I wondered how bytecode files work in Python.

The purpose of this post is to take some notes for myself and share what I find.

PEP-3147

While I was reading the proposed PEP-488 (which will be explained later) there were several references to PEP-3147.

Before PEP-3147 was implemented, compiled files were saved with the format '{filename}.pyc' (or .pyo) in the same directory where the source code was stored.

PEP-3147 was created as an extension to the Python import mechanism so that different Python distributions can share the same source code while keeping their own compiled bytecode.

CPython compiles its source code into bytecode. For performance reasons Python doesn't recompile on every run, so it caches the compiled code and only recompiles when it detects that the source file has changed. To do this, Python stores two 32-bit values at the start of the cached compiled file: a magic number and the source file's timestamp. The magic number changes every time the bytecode format changes (for example, when new bytecode instructions are added to the virtual machine), which prevents problems when trying to execute bytecode compiled for a different virtual machine.
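
To make this concrete, here is a minimal sketch of the kind of check Python performs, assuming the CPython 3.x header layout described above (the first four bytes are the magic number, the next four the source timestamp); check_pyc is a hypothetical helper, not something from the standard library:

import imp
import os


def check_pyc(pyc_path, source_path):
    # Read the 8-byte header: the magic number followed by the
    # source file's modification time (stored little-endian).
    with open(pyc_path, 'rb') as f:
        magic = f.read(4)
        mtime = int.from_bytes(f.read(4), 'little')
    # The cached file is only reused when both values match.
    return (magic == imp.get_magic()
            and mtime == int(os.path.getmtime(source_path)))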

As distributions often ship several versions of Python, and users can install their own versions as well, the previous mechanism doesn't allow the compiled files to be shared between versions: each interpreter keeps overwriting the same .pyc file.

PEP-3147 extended this by creating a __pycache__ directory inside every package, which can contain the compiled files for different versions. The file names now have the format {filename}.{tag}.pyc. The tag can be obtained from the imp module:

>>> import imp
>>> imp.get_tag()
'cpython-34'

The magic number used in the .pyc files can also be obtained from this module:

>>> imp.get_magic()
b'\xee\x0c\r\n'
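
The magic number returned is essentially a small version counter stored in the first two bytes (little-endian), followed by b'\r\n'; decoding the value shown above:

>>> int.from_bytes(imp.get_magic()[:2], 'little')
3310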

As expected, when using another version of Python both the tag and the magic number change:

Python 3.5.0a0 (default:c0d25de5919e, Jan 30 2015, 22:23:54)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.56)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import imp
>>> imp.get_tag()
'cpython-35'
>>> imp.get_magic()
b'\xf8\x0c\r\n'

PEP-3147 was introduced in Python 3.2. If we try with Python 2.7.9 we can verify that get_magic() exists but get_tag() does not:

Python 2.7.9 (v2.7.9:648dcafa7e5f, Dec 10 2014, 10:10:46)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import imp
>>> imp.get_magic()
'\x03\xf3\r\n'
>>> imp.get_tag()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'get_tag'

This is why, if we use Python 2, we still see the .pyc or .pyo files alongside our code:

.
|-- __init__.py
|-- __init__.pyc
|-- api.py
|-- api.pyc
|-- module
|   |-- __init__.py
|   |-- __init__.pyc
|   |-- raul.py
|   |-- raul.pyc

Where the PEP is implemented we see something like this instead:

|-- meteora
|   |-- __init__.py
|   |-- __pycache__
|   |   |-- __init__.cpython-35.pyc
|   |   |-- __init__.cpython-34.pyc
|   |   |-- utils.cpython-35.pyc
|   |   `-- utils.cpython-34.pyc
|   `-- utils.py

To recompile or not to recompile

The following diagram has been extracted directly from PEP-3147 and explains clearly the workflow followed to load or compile the bytecode when importing:

[Diagram: PEP 3147 import workflow]

As previously explained, a .pyc file is considered a match when both the magic number and the source file's timestamp stored in it match the current interpreter and the source file exactly.

When Python is asked to import foo it searches sys.path for foo.py. If the source file is not found, it checks whether a foo.pyc file exists and loads it if it does; otherwise it raises an ImportError.

If the file foo.py exists, Python checks whether there is a __pycache__/foo.{magic}.pyc file that matches the source file. If there is a match, it loads it.

If __pycache__/foo.{magic}.pyc doesn't exist or doesn't match (the timestamp changed), Python checks whether the __pycache__ directory exists and creates it if necessary.

Finally it compiles the foo.py file and generates the __pycache__/foo.{magic}.pyc file.
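
As a rough illustration, the same decision can be sketched with the stdlib helpers importlib.util.cache_from_source() and py_compile.compile(). This is only an approximation of what the importer does (the real check compares the magic number and the timestamp embedded in the header, as described above, rather than file modification times):

import importlib.util
import os
import py_compile


def ensure_bytecode(source_path):
    # Hypothetical helper: return the path to an up-to-date cached
    # bytecode file for source_path, recompiling it if needed.
    cached = importlib.util.cache_from_source(source_path)
    stale = (not os.path.exists(cached)
             or os.path.getmtime(cached) < os.path.getmtime(source_path))
    if stale:
        # Writes __pycache__/{name}.{tag}.pyc, creating the
        # __pycache__ directory if it does not exist yet.
        py_compile.compile(source_path, doraise=True)
    return cached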

PEP-488

The purpose of this (not yet approved) PEP is to remove .pyo files, which are optimized Python bytecode files.

Current behaviour:

Currently bytecode files can be .pyc or .pyo. A .pyc file is generated when no optimization level has been applied at startup, while .pyo files are generated when an optimization level has been specified (-O or -OO).

In order to test the different levels of optimization I've created the following simple test:

.
|-- api
    |-- __init__.py

My __init__.py file consists of:

def test():
    """
    This is my test function
    """
    assert False == True

If we execute Python without any optimization we can see that the docstring of our function is there and the assertion fails, as it is executed:

raulcd@test  $ python3
Python 3.4.2 (v3.4.2:ab2c023a9432, Oct  5 2014, 20:42:22)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import api
>>> api.test.__doc__
'\n    This is my test function\n    '
>>> api.test()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/raulcd/test/api/__init__.py", line 5, in test
    assert False == True
AssertionError
>>>

We can also verify which compiled bytecode has been generated:

.
`-- api
    |-- __init__.py
    `-- __pycache__
        `-- __init__.cpython-34.pyc

When we execute with -O we can see that the assertion doesn't fail, as this optimization removes assertions from our code:

raulcd@test  $ python3 -O
Python 3.4.2 (v3.4.2:ab2c023a9432, Oct  5 2014, 20:42:22)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import api
>>> api.test.__doc__
'\n    This is my test function\n    '
>>> api.test()
>>>

We can also see the compiled files that have been generated:

.
`-- api
    |-- __init__.py
    `-- __pycache__
        |-- __init__.cpython-34.pyc
        `-- __init__.cpython-34.pyo

2 directories, 3 files
raulcd@test  $ ls -lrt api/__pycache__/
total 16
-rw-r--r--  1 raulcd  staff  280 Mar 17 11:19 __init__.cpython-34.pyc
-rw-r--r--  1 raulcd  staff  247 Mar 17 11:23 __init__.cpython-34.pyo

If we execute Python with -OO we can see that both the assertion and the docstring have disappeared. Note that I need to manually remove the .pyo file, as the Python import mechanism will not recompile it (per the workflow explained before: the source file has not changed and a .pyo file already exists):

raulcd@test  $ rm api/__pycache__/__init__.cpython-34.pyo
raulcd@test  $ python3 -OO
Python 3.4.2 (v3.4.2:ab2c023a9432, Oct  5 2014, 20:42:22)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import api
>>> api.test.__doc__
>>> api.test()
>>>

The generated .pyo file has the same name, but we can see that both its content and its size are different:

.
`-- api
    |-- __init__.py
    `-- __pycache__
        |-- __init__.cpython-34.pyc
        `-- __init__.cpython-34.pyo

2 directories, 3 files
raulcd@test  $ ls -lrt api/__pycache__/
total 16
-rw-r--r--  1 raulcd  staff  280 Mar 17 11:19 __init__.cpython-34.pyc
-rw-r--r--  1 raulcd  staff  211 Mar 17 11:28 __init__.cpython-34.pyo

Currently there is no way to know which optimization level a .pyo file was generated with. So when a different optimization level needs to be applied, all .pyo files have to be removed and regenerated.

PEP-488 Proposal

The PEP proposes to remove .pyo files by incorporating the optimization level applied into the .pyc file name.

Currently bytecode file names are created by importlib.util.cache_from_source() using the expression defined in PEP-3147:

'{name}.{cache_tag}.pyc'.format(name=module_name, cache_tag=sys.implementation.cache_tag)
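
For example, on Python 3.4 this gives something like the following. Note that the .pyo variant (obtained here through the debug_override flag) carries no indication of which optimization level produced it:

>>> import importlib.util
>>> importlib.util.cache_from_source('api/__init__.py')
'api/__pycache__/__init__.cpython-34.pyc'
>>> importlib.util.cache_from_source('api/__init__.py', debug_override=False)
'api/__pycache__/__init__.cpython-34.pyo'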

The PEP proposes to add the optimization level by modifying the name to:

'{name}.{cache_tag}.opt-{optimization}.pyc'.format(
    name=module_name, cache_tag=sys.implementation.cache_tag,
    optimization=str(sys.flags.optimize)
)

The “opt-” prefix was chosen to provide a visual separator from the cache tag.
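
For illustration, evaluating the proposed expression exactly as written above for the three optimization levels would produce names roughly like these (assuming the cpython-35 cache tag):

>>> for level in (0, 1, 2):
...     print('{name}.{cache_tag}.opt-{optimization}.pyc'.format(
...         name='foo', cache_tag='cpython-35', optimization=level))
...
foo.cpython-35.opt-0.pyc
foo.cpython-35.opt-1.pyc
foo.cpython-35.opt-2.pyc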

And that’s all for today :)

Elasticsearch, Logstash and Kibana on Docker

While doing performance testing on a project I needed to process the access logs of our web servers to define the navigation profile of our current users.

So I thought it would be a good time to play with Elasticsearch, Logstash and Kibana, as I'd heard about the stack.

ELK Stack

The first thing to notice is how easy the stack is to use. It took me only a couple of hours from deciding to use it to having a prototype working on my local host. So let's look at each component.

[Image: ELK stack with Kibana]

Elasticsearch

Elasticsearch is a search server based on Lucene. It is open source and can be found in the Elasticsearch project on GitHub.

In order to set up Elasticsearch, the only thing you need to do is download the package and execute it:

➜  wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.2.tar.gz
➜  tar -zxvf elasticsearch-1.4.2.tar.gz
➜  cd elasticsearch-1.4.2
➜  ./bin/elasticsearch
[2015-02-11 10:43:21,573][INFO ][node                     ] [Jumbo Carnation] version[1.4.2], pid[6019], build[927caff/2014-12-16T14:11:12Z]
[2015-02-11 10:43:21,574][INFO ][node                     ] [Jumbo Carnation] initializing ...
[2015-02-11 10:43:21,578][INFO ][plugins                  ] [Jumbo Carnation] loaded [], sites []
[2015-02-11 10:43:23,483][INFO ][node                     ] [Jumbo Carnation] initialized
[2015-02-11 10:43:23,483][INFO ][node                     ] [Jumbo Carnation] starting ...
[2015-02-11 10:43:23,528][INFO ][transport                ] [Jumbo Carnation] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.105.14.17:9300]}
[2015-02-11 10:43:23,540][INFO ][discovery                ] [Jumbo Carnation] elasticsearch/_EGLpT09SfCaIbfW4KCSqg
[2015-02-11 10:43:27,315][INFO ][cluster.service          ] [Jumbo Carnation] new_master [Jumbo Carnation][_EGLpT09SfCaIbfW4KCSqg][pumuki][inet[/10.105.14.17:9300]], reason: zen-disco-join (elected_as_master)
[2015-02-11 10:43:27,332][INFO ][http                     ] [Jumbo Carnation] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.105.14.17:9200]}
[2015-02-11 10:43:27,332][INFO ][node                     ] [Jumbo Carnation] started
[2015-02-11 10:43:27,783][INFO ][gateway                  ] [Jumbo Carnation] recovered [4] indices into cluster_state

This starts the Elasticsearch web server listening on port 9200 on your localhost.

At this moment you should be able to retrieve the following information:

➜  curl -XGET http://localhost:9200/
{
  "status" : 200,
  "name" : "Jumbo Carnation",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "1.4.2",
    "build_hash" : "927caff6f05403e936c20bf4529f144f0c89fd8c",
    "build_timestamp" : "2014-12-16T14:11:12Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.2"
  },
  "tagline" : "You Know, for Search"
}

You can also get the stats by doing:

➜  curl -XGET http://localhost:9200/_stats
{"_shards":{"total":0,"successful":0,"failed":0},"_all":{"primaries":{},"total":{}},"indices":{}}

While playing around I processed different logs several times. In order to wipe all the information from my Elasticsearch instance I found the following command quite useful. It removes all your existing data, so BE CAREFUL:

➜  curl -XDELETE "http://localhost:9200/*"
{"acknowledged":true}

Logstash

Logstash is a tool to manage events and logs. Basically you use it to collect, parse and store logs. When used with Elasticsearch, the processed, structured logs are sent to Elasticsearch to be queried. It's also open source, part of the Elasticsearch family, and you can find the source code in the GitHub project repo.

In order to set up Logstash you will need to download the package:

➜  wget https://download.elasticsearch.org/logstash/logstash/logstash-1.4.2.tar.gz
➜  tar -zxvf logstash-1.4.2.tar.gz
➜  cd logstash-1.4.2

To process your access logs and send them to Elasticsearch you will need to create a Logstash configuration file. My configuration file is similar to the following one:

➜  cat logstash_simple.conf 
input {
  file {
    path => "/var/log/access/*.log"
    type => "apache_access"
  }
}

filter {
  if [path] =~ "access" {
    mutate { replace => { "type" => "apache_access" } }
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
  }
  date {
    match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}


output {
  elasticsearch_http {
    host => localhost 
  } 
  stdout { 
  } 
}
➜

In the input section we define which logs Logstash needs to process. You can define different types of input, but here we are just reading them from files. To see the other input types take a look at the documentation.

The filter section defines how Logstash will process your logs. We are using grok, which is essentially a regex parser for unstructured data. We just use the %{COMBINEDAPACHELOG} pattern and set the date format.

For the output we define two destinations: our Elasticsearch instance and standard output, the latter basically to see what is going on.

In order to run logstash:

➜  bin/logstash -f logstash_simple.conf

Kibana

Kibana is a visualization tool for data stored in Elasticsearch. The source code is on the GitHub project.

In order to set it up just download it and run it:

➜  wget https://download.elasticsearch.org/kibana/kibana/kibana-4.0.0-beta3.tar.gz 
➜  tar -zxvf kibana-4.0.0-beta3.tar.gz
➜  cd kibana-4.0.0-beta3
➜  bin/kibana
The Kibana Backend is starting up... be patient
{"@timestamp":"2015-02-11T12:34:29+00:00","level":"INFO","name":"Kibana","message":"Kibana server started on tcp://0.0.0.0:5601 in production mode."}

Kibana should now be running on your localhost at port 5601.

The first page will ask you to create an index. If you don't have any data yet you will not be able to create it. Once you have created the index you can start playing with and querying the data.

Deploy

Once the stack was working locally I thought it would be good to deploy it to one of our boxes and send our access logs periodically, so the logs stay up to date.

And I thought that creating a Docker container, to be able to replicate the setup easily in the future, might be a good option.

Docker image

First Approach - One to rule them all

My first approach was to create a single container with the three services running inside it. I know that's not how you are supposed to use Docker, but I wanted to try it first.

The idea was to have everything running under supervisor inside the Docker container, and to add a data volume with the logs to the container, from which Logstash would pick up the files.

The code is available on this Github repo.

Create the image

As this is my first post about Docker I will explain briefly how to create the image and build it. The image was created from a basic Ubuntu one; basically you need to create a file called Dockerfile with the information in the link.

In order to build the image you just need to run:

➜  docker build -t elk:latest .

This will create a local image that can be run. You can list your images with:

➜  docker images
REPOSITORY                         TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
elk                                latest              28bf7af29dc1        55 seconds ago      575.7 MB

Running the image

Once the image is built you can run it just by doing:

➜  docker run -d -p 5000:5601 --name elk -v /path/access-logs:/var/log/access elk

This maps your local port 5000 to port 5601 on the container (the Kibana one) and mounts your local /path/access-logs directory into the container. That is the path where you are supposed to be writing your access logs.

TODO Images, separate containers, push the image to docker hub

Create a blog using Pelican and deploy it to GitHub Pages

This website has been created using Pelican. Pelican is a static site generator written in Python.

Basically, the requirements for the project were:

  • Easy deployment and maintenance
  • Write articles using Markdown
  • Code syntax highlighting

After some quick research to select a framework, and in order to keep things simple, Pelican turned out to have all the features needed.

Generation of website

Pelican is really easy to start with. You just need to create your project and install pelican:

$ pip install pelican

If you want to use Markdown you will also need to install it as a dependency:

$ pip install markdown

Once you have installed Pelican, the only thing you need to do is generate the skeleton of the blog:

$ pelican-quickstart

It will prompt you with several questions about your site. Pelican automatically generates some files, such as a Fabric script and a Makefile, to make your deployments even easier.

Once this is done you will need to start writing your content under the content folder. You can add subfolders to the content folder, and the names of the subfolders will be used as categories for your posts.

Once you have written your article (sample file), it is time to generate your site. There are several ways to do it:

$ pelican content
# Or you can use the generated Makefile
$ make html

The following exception was raised because my locale settings were not set:

File ".../lib/python2.7/locale.py", line 443, in _parse_localename
    raise ValueError, 'unknown locale: %s' % localename
ValueError: unknown locale: UTF-8

You can set your locale for your user (modifying your .bash_profile) or for the session:

$ export LC_ALL=en_UK.UTF-8
$ export LANG=en_UK.UTF-8

Once you have generated your content you can run a development server to see the result:

$ make serve

And you will be able to access localhost on port 8000 (by default) to see the result.

Deployment in github pages

There are two types of GitHub Pages: project pages and user pages. If you want to deploy to your project pages you can use the github target in the Makefile:

$ make github

This will push to the gh-pages branch of your repository.

But if you want to use the user pages under username.github.com you will need to do a bit more.

First of all you need to have a repository named after your username on GitHub. The repository needs to be called username.github.io; in my case, raulcd.github.io.

In order to make the process easier you can use GitHub Pages Import (ghp-import). You can install it using pip:

$ pip install ghp-import

And to deploy you will need to run the following commands:

$ make html
$ ghp-import output
$ git push git@github.com:username/username.github.io.git gh-pages:master

You can also modify your Makefile so that the github target executes the previous commands.

Your code will be deployed and after some minutes it will be available at http://username.github.com.

If you have your own domain and want it to redirect to your GitHub Pages site, you will need to create a CNAME file and deploy it with your GitHub Pages content.

Create a directory content/extra and a file named CNAME (upper case) containing the domain you want to redirect:

$ cat CNAME
yourdomain.com

Then you can use STATIC_PATHS in the pelicanconf.py file to tell Pelican to deploy the CNAME file in the root directory when generating the content:

STATIC_PATHS = ['extra/CNAME']
EXTRA_PATH_METADATA = {'extra/CNAME': {'path': 'CNAME'},}

You will need to configure an A, ALIAS or CNAME record with your DNS provider to do the DNS redirection. You can find more info in the GitHub Pages custom domain documentation.