November 2012 ~ Himap

Manual profiling of python code (especially if you use celery)

vestphone November 27, 2012 No comments


Profiling tools like cProfile produce output that is hard to read for my celery-based app. So I used a tool called manual profiling, by Pierre de Buyl from University of Toronto - Université Libre de Bruxelles (many thanks, Pierre, for your help :) ).

It's a great tool. You import it in the source code that you want to profile:

from manual_profiler import Profiler
import warnings

I import warnings too, so that the output of the print commands does not get lost in celery's internals. Then, in my code, I create an instance of the profiler and register all the functions that I want to monitor:

  # profiling
pr=Profiler()
correctErrors__=pr.register_function(correctErrors)
findSpellErrors__=pr.register_function(findSpellErrors)

As you can see, the register_function method wraps the function to add timing to it, so I call the wrapped functions (correctErrors__ and findSpellErrors__) in my code.
Then, at the end of my code, I add a command to print the results:

    pr.display()

With this configuration I obtain something like this:

[2012-11-27 22:24:35,937: WARNING/PoolWorker-1] Profiling information
--------------------------------------------------------------------------------
  Name                    # of calls    total time        time per call    
--------------------------------------------------------------------------------
[2012-11-27 22:24:35,937: WARNING/PoolWorker-1] correctErrors          41       64.102476 s        1.563475 s
[2012-11-27 22:24:35,938: WARNING/PoolWorker-1] findSpellErrors           1       61.000884 s       61.000884 s

Note that I improved the code a little bit to sort the entries by total time. I added the following line into the display function:

self.timers = sorted(self.timers, key=lambda t:t[1]._time, reverse=True)
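To make the wrapping idea concrete, here is a minimal sketch of how a register_function-style wrapper could work. It is not the code of manual_profiler: the class SimpleProfiler, its report method, and the correct_errors function are hypothetical names invented for this illustration.

```python
import time
from functools import wraps


class SimpleProfiler:
    """Hypothetical stand-in for a manual profiler (not manual_profiler's code)."""

    def __init__(self):
        self.timers = {}  # function name -> [number of calls, total seconds]

    def register_function(self, func):
        """Return a wrapper around func that records call count and time."""
        entry = self.timers.setdefault(func.__name__, [0, 0.0])

        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Count the call even if func raised an exception.
                entry[0] += 1
                entry[1] += time.perf_counter() - start

        return wrapper

    def report(self):
        """Return (name, calls, total seconds) tuples, slowest first."""
        rows = [(name, calls, total) for name, (calls, total) in self.timers.items()]
        return sorted(rows, key=lambda row: row[2], reverse=True)


def correct_errors(words):
    # Placeholder workload for the example.
    return [word.strip() for word in words]


pr = SimpleProfiler()
correct_errors__ = pr.register_function(correct_errors)
correct_errors__([" a ", "b "])
```

Calling pr.report() then gives one row per registered function, already sorted by total time, which is the same ordering as the sorted() line above.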

This profiler gives information at the function level. Now that I know that correctErrors is the function I need to improve, I create the following class to monitor events inside a function:

from datetime import datetime

class ProfilingTimers(object):
    """Define a few timers for manual profiling."""
    def __init__(self):
        """Instantiating the class prints the current time."""
        self.start = datetime.now()
        self.last = self.start
        self.elapsed = self.last - self.start
        print("Start time: " + str(self.start))

    def elapsedSinceLast(self, eventName="N/A"):
        """Print the time elapsed since the last event (or since instantiation)."""
        now = datetime.now()
        self.elapsed = now - self.last
        self.last = now
        print("End of " + eventName + " - Time elapsed since last event: "
              + str(self.elapsed.days) + " days " + str(self.elapsed.seconds) + " s "
              + str(self.elapsed.microseconds // 1000) + " ms")

In the code at the start of the section of code to be monitored, I add :

p= ProfilingTimers()

And in a few places afterwards:

   p.elapsedSinceLast('image processing')

This command prints the time elapsed since it was last invoked, or since the ProfilingTimers instance was created. The printed message includes the event name that you specified (here 'image processing'). With this method I can accurately assess which section of my function's code takes the most time.

With this great method, I could dig into my problem and, after a few improvements, going by the total times reported, correctErrors ran roughly 2,350 times faster (64.10 s down to 0.027 s)!

[2012-11-27 22:51:03,779: WARNING/PoolWorker-2] Profiling information
--------------------------------------------------------------------------------
  Name                    # of calls    total time        time per call    
--------------------------------------------------------------------------------
[2012-11-27 22:51:03,780: WARNING/PoolWorker-2] correctErrors          41        0.027236 s        0.000664 s
[2012-11-27 22:51:03,780: WARNING/PoolWorker-2] findSpellErrors           1        1.653973 s        1.653973 s

The poor man's method would be much less practical... It would involve sprinkling the following command in a few places:

print strftime("time: %H:%M:%S", gmtime())

Programming

Profiling python code with celery

vestphone November 27, 2012 No comments

In this article I explain how you can profile python code with celery, and why I find this solution disappointing. I propose a better solution in the conclusion. Have fun !


If you are in a hurry, you can jump to the conclusion about python profiling tools at the end of this post. Otherwise, you will find below the summary of the tests I performed with the most common python profiling software.

First I install celerymon:

pip install celerymon

Then to run my celery powered module, I add the -E option:

celery -A mypackage.mymodule worker --loglevel=info -E

At this point, the events monitored by celerymon are available at:

firefox http://localhost:8989/

Celerymon displays few events, so it is not suited to code profiling.

For code profiling, I try using cProfile. So, to launch celery, I now use a different command:

sudo apt-get install python-profiler
python -m cProfile -o test-`date +%Y-%m-%d-%T`.prof /home/toto/virtualenv_1/bin/celery -A mypackage.mymodule worker --loglevel=info -E

Alternatively, it's possible to modify the python code to include cProfile directives (but I have not yet managed to collect the output in a file):

import cProfile
cProfile.run('foo()','filename.prof')

The raw profiling data is hard to read and manipulate, even with pstats. So, I use visualization tools.
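For reference, here is a minimal sketch of reading profiling data with pstats directly, without a visualization tool. The functions slow_sum and profile_to_text are hypothetical names made up for this example; only the cProfile and pstats calls come from the standard library.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    # Placeholder workload for the example.
    return sum(i * i for i in range(n))


def profile_to_text(func, *args, top=5):
    """Profile func(*args) and return the `top` entries sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_to_text(slow_sum, 100000))
```

To keep the data for kcachegrind or runsnakerun, profiler.dump_stats('myfile.prof') writes the same statistics to a file.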

### KCACHEGRIND ###
kcachegrind is probably the best tool today to analyse profiling data.

sudo apt-get install kcachegrind
easy_install pyprof2calltree
pyprof2calltree -i myfile.prof -o myfile.prof.grind
kcachegrind myfile.prof.grind

### RUNSNAKERUN ###
runsnakerun is a more recent tool, with more limited functionality.

pip install SquareMap RunSnakeRun
runsnake myfile.prof

With this tool, you have a nice display of all the calls, the cumulative time spent per function, etc.
If you have the following problem, reinstall wxpython:

    from squaremap import squaremap
  File "/home/toto/virtualenv_1/lib/python2.6/site-packages/squaremap/squaremap.py", line 3, in <module>
    import wx.lib.newevent
ImportError: No module named lib.newevent
(virtualenv_1)toto:~/virtualenv_1/djangoProj_1$ pip install wxpython

### CONCLUSION ###

None of the above tools was handy for my app. They did not let me see clearly which lines of my code were taking the most time. Almost all the time seemed to be spent in the Kombu module, which celery uses for AMQP messaging. See my next post about manual profiling to see how I progressed nevertheless and managed to divide the time spent in my most time-consuming function by more than 2,000!

DevOps

Setting proxy for bash in debian

vestphone November 26, 2012 No comments


In this post, I will explain how to configure the default proxy for bash in Debian. I often see questions about this in forums, so I hope it will help. If so, I would love to see links pointing to this article: it will help others find it more easily thanks to a better ranking in Google search. Thanks!

Setting up the proxy globally

You can set the proxy globally in the files:

/etc/environment
/etc/profile

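In these files you define the proxy variable, for example http_proxy=http://10.1.3.1:8080 (in /etc/profile, prefix the line with export). You can also set the proxy for a single application by prefixing its command (see http://askubuntu.com/questions/158557/setting-proxy-from-terminal):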

http_proxy=http://10.1.3.1:8080 firefox

For bash

It's possible to specify the proxy directly on the bash command line, in front of the command to run (COMMAND is a placeholder):

sudo env http_proxy=http://10.1.3.1:8080 COMMAND

Or in your .bashrc profile (not advised as it may not be taken into account by some applications):

export http_proxy=http://username:password@proxyhost:port/ 
export ftp_proxy=http://username:password@proxyhost:port/
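Since pip and easy_install are Python programs, a quick way to check which proxy variables a Python process actually inherits from the shell is to ask the standard library. This is only a sketch: the function name proxies_seen_by_python is made up for the example, and it reports what urllib reads from the environment, not what a particular tool does with it.

```python
import urllib.request


def proxies_seen_by_python():
    """Return the proxies Python reads from the *_proxy environment variables."""
    return urllib.request.getproxies_environment()


if __name__ == "__main__":
    # Prints one line per scheme, for instance the http and ftp proxies
    # exported in .bashrc, if the current shell passed them on.
    for scheme, url in sorted(proxies_seen_by_python().items()):
        print(scheme, "->", url)
```

If the variables exported above do not show up here, the shell that launched Python did not read the file where they are defined.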

For aptitude (apt-get)

If you want to set the proxy for apt, the proper way is to edit /etc/apt/apt.conf and add:

Acquire::http::proxy "http://10.1.3.1:8080/";

For pip / easy_install

For pip, you can use the --proxy option to specify a proxy:

pip install toto --proxy "user:password@proxy.server:port"
pip install toto --proxy "example.com:1080"


But pip sometimes calls setup.py files that do not honor the proxy option. In this case the best way is to set the http_proxy environment variable, and to use easy_install as an alternative to pip.

For SSH

You can use ssh through a SOCKS proxy, but you need to install a few tools:

sudo apt-get install connect-proxy
man connect-proxy
vim ~/.ssh/config

In this configuration file, you specify which proxy you want to use for which hosts. For example, on this site, they provide the following configuration sample:

## Outside of the firewall, with HTTPS proxy
Host my-ssh-server-host.net
  ProxyCommand connect -H proxy.free.fr:3128 %h 443
## Inside the firewall (do not use proxy)
Host *
   ProxyCommand connect %h %p

Note the -S option, which is used for SOCKS proxies (-H is for HTTP proxies). For example:

ProxyCommand /usr/bin/connect-proxy -4 -S pproxy:port %h %p

The SANS Institute published a good article about everything you can do with SSH through a proxy :)

If you liked this article, please put a link to it on your Google+ profile or any other kind of web site: it will help others find it in the search engines. Thanks!


DevOps, Programming

How to replicate / export a virtualenv from a machine to another ?

vestphone November 26, 2012 No comments


If you code in Python, you probably use virtualenv to maintain a clean setup and avoid problems with different versions of Python. But sometimes, you need to replicate / export a virtualenv from one machine to another. This post explains exactly how to do this.

First, use the following commands on the origin machine:

source ./bin/activate
pip freeze > requirements.txt

Then, copy requirements.txt and your source code files from the virtualenv folder (called virtualenv_1) to the target machine. Do NOT copy the virtualenv's own files in folders such as bin, lib, etc. We will recreate these files with virtualenv on the new system:

sudo apt-get install python-virtualenv

virtualenv virtualenv_1

Then install all the required packages:

cd virtualenv_1
source bin/activate
pip install -r requirements.txt
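To check that the target machine ended up with the same packages, you can run pip freeze there too and compare the two files. Below is a minimal sketch of such a comparison; the helper names parse_requirements and diff_requirements are hypothetical, the version numbers in the usage part are made up, and only simple name==version lines are handled.

```python
def parse_requirements(text):
    """Map package name (lowercased) to pinned version for `name==version` lines."""
    pinned = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not a simple pin.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pinned[name.strip().lower()] = version.strip()
    return pinned


def diff_requirements(origin, target):
    """Return {name: (origin version, target version or None)} for mismatches."""
    problems = {}
    for name, version in origin.items():
        found = target.get(name)
        if found != version:
            problems[name] = (version, found)
    return problems


# Illustrative contents of the two `pip freeze` outputs (made-up versions).
origin = parse_requirements("Django==1.4.2\ncelery==3.0.12\n")
target = parse_requirements("Django==1.4.2\ncelery==3.0.11\n")
mismatches = diff_requirements(origin, target)
```

An empty result means every package pinned on the origin machine is installed in the same version on the target.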

That's it!

DevOps

How to install brother HL 2270DW driver on debian x64?

vestphone November 25, 2012 No comments

First, a big up to Brother: their printers are excellent, and they provide drivers for Linux! I only hate it that they display a warning before the toner is actually empty... But that's another story. So in this post, I will explain how to install the brother HL 2270DW driver on debian x64.

To install the driver on the x64 architecture you should go to Brother's driver download page: http://welcome.solutions.brother.com/bsc/public_s/id/linux/en/download_prn.html and download the .deb packages corresponding to your printer model. Then to install these packages:

sudo apt-get install ia32-libs
sudo dpkg -i --force-all hl2270dwlpr-2.1.0-1.i386.deb cupswrapperHL2270DW-2.0.4-2.i386.deb

sudo dpkg -l | grep Brother

firefox http://localhost:631/printers

I hope it helps.

DevOps

What to do if you accidentally delete the admin user in a django project?

vestphone November 22, 2012 No comments


Have you ever wondered what to do if you accidentally delete the admin user in a django project? It looks like a nasty problem, but it is not too serious in practice: Django provides a command line utility to solve that kind of problem:

manage.py createsuperuser --username=joe --email=joe@example.com

django

Playing with django-invitation module

vestphone November 21, 2012 No comments


In this post, I will explain how to use the great django-invitation module which, as its name indicates, lets you invite people to the beta of a web app, for example.

The documentation of the django-invitation module is accessible on https://bitbucket.org/david/django-invitation/wiki/Home

To install the module, I have used:

easy_install -Z django-invitation

I have followed the documentation. It says to:

Add invitation to the INSTALLED_APPS setting of your Django project.
Add the setting ACCOUNT_INVITATION_DAYS to your settings file; this should be the number of days invitation keys will remain valid after an invitation is sent. Add also the INVITATIONS_PER_USER setting.

At this step, I have the following lines in my settings.py:

ACCOUNT_ACTIVATION_DAYS = 7
ACCOUNT_INVITATION_DAYS = 7
INVITATIONS_PER_USER = 3

INVITE_MODE = True

NB: django-invitation's documentation forgets to mention the INVITE_MODE setting. It took me half a day to figure out why an invalid invitation key still led to the registration form!

Then the tutorial says to add this line to your site's root URLConf before registration urls:

(r'^accounts/', include('invitation.urls')),

Afterwards, modify your templates to link people to /accounts/invite/ so they can start inviting. I prefer to refer to this URL by its name in my template: invitation_invite (according to the source code on Bitbucket).

Now create all the required templates:

invitation/invitation_form.html displays the invitation form for users to invite contacts.
invitation/invitation_complete.html is displayed after the invitation email has been sent, to tell the user his contact has been emailed.
invitation/invitation_email_subject.txt is used for the subject of the invitation email.
invitation/invitation_email.txt is used for the body of the invitation email.
invitation/invited.html is displayed when a user attempts to register his/her account.
invitation/wrong_invitation_key.html is displayed when a user attempts to register his/her account with a wrong/expired key.

You can find some examples for these templates on: https://bitbucket.org/epicserve/django-invitation/src/062213aa6ad2/invitation/templates/invitation?at=default

I used a direct link to the registration page in the invitation email, as the invited page is pretty useless in my opinion. Here is my invitation_email.txt:

{% load i18n %}{% load url from future %}

{% trans "Hello," %}
{% blocktrans with site.name as sitename and invitation_key.from_user.username as username %}You have been invited by {{ username }} to join {{ sitename }}!{% endblocktrans %}
{% trans "Go to" %} 
http://{{site.domain}}{% url 'registration_register' %}?invitation_key={{ invitation_key.key }}
{% trans "to join!" %}

{% blocktrans with site.name as sitename %}All the best,

The {{ sitename }} Team{% endblocktrans %}

Then just set up a cron job calling the django command cleanupinvitation to delete expired activation keys from the database.

DevOps, django

NoReverseMatch error in python django... The solution

vestphone November 21, 2012 No comments


The NoReverseMatch error is a nasty one in python Django. It often happens when you add a new page. In this post I will explain how to solve this bug easily.

Typically, the famous NoReverseMatch error occurs when the quoting of the URL name in a template's url tag does not match what your template expects. For instance:

If you used:

{% url 'index' %}

Try:

{% url index %}

And conversely...

The source of this problem is that the syntax differs depending on whether or not you put {% load url from future %} in your template: with it, the URL name must be quoted; without it, the name must not be quoted.

Digital marketing, django, html

Playing with user registration in django

vestphone November 21, 2012 No comments


The django-registration module is quite handy whenever you need some users to, well, register on your website... In this post, I will explain how to install and configure it for your web app.

To install django registration module:

easy_install -Z django-registration

The documentation is in the docs folder (quickstart.rst) and is also available online: https://bitbucket.org/ubernostrum/django-registration/src/27bccd108cdef30dc0a91ed1968be17bb1e60da4/docs/quickstart.rst?at=default

django-registration's role is to permit the creation of new users, which means according to the documentation:

A user signs up for an account by supplying a username, email address and password.
From this information, a new User object is created, with its is_active field set to False. Additionally, an activation key is generated and stored, and an email is sent to the user containing a link to click to activate the account.
Upon clicking the activation link, the new account is made active (the is_active field is set to True); after this, the user can log in.

For this part, I used the following tutorial.

I add 'registration' to my INSTALLED_APPS in settings.py

I add ACCOUNT_ACTIVATION_DAYS = 7 in my settings.py

I define all the mail parameters in settings.py:

EMAIL_USE_TLS = True
EMAIL_HOST = 'smtp.gmail.com'
EMAIL_HOST_USER = 'youremail@gmail.com'
EMAIL_HOST_PASSWORD = 'yourpassword'
EMAIL_PORT = 587

I add (r'^accounts/', include('registration.urls')), to urls.py

I modify my base.html template to add links to the registration page

{% if not user.is_authenticated %}<li><a href="{% url 'login' %}">Log in</a></li>
<li><a href="{% url 'registration_register' %}">Register</a></li>
{% endif %}

{% if user.is_authenticated %}
Logged in: {{ user.username }}
(Log out |
Change password)
{% endif %}

I create the required registration templates in templates/registration/ according to the examples provided in http://lightbird.net/dbe/forum3.html

activate.html
activation_complete.html
activation_email.txt
activation_email_subject.txt
login.html
registration_complete.html
registration_form.html

Note that in the template, whenever I need to refer to django-registration urls, I need to know a name to put in my {% url %} tags. I found the appropriate names here: http://code.google.com/p/django-registration/source/browse/trunk/registration/urls.py?r=169

Finally, I check if the register link defined above works. At first I had a URL problem after filling the register form and submitting it. I solved it by editing the registration/registration_form.html file to change the form submission url.

DevOps

How to install keepass2 in debian squeeze

vestphone November 20, 2012 No comments

If you want to install Keepass2 under Debian Squeeze, you will have a problem: keepass2 is not in the official repositories. Only keepassX is available.

In this blog post I explain how to install Keepass2 nevertheless.

At first glance, I see two potential solutions:

We can find the keepass2 package on the web, download it, and install it. That's the solution I chose.
Otherwise, a perhaps cleaner solution would be to download the source package and compile it. That's cumbersome because Keepass is Windows-oriented.

Installing a .deb package is usually simpler. I cleaned up the packages on my machine a bit, installed the main dependencies, and tried installing the Ubuntu Natty package.

sudo apt-get autoremove
sudo apt-get install -f
sudo apt-get install mono-devel xdotool
wget http://launchpadlibrarian.net/71859414/keepass2_2.15%2Bdfsg-2%7Enatty1_all.deb
sudo dpkg -i keepass2_2.15+dfsg-2~natty1_all.deb
keepass2

It worked!

DevOps

Django hosting: experience with webfaction

vestphone November 19, 2012 No comments


I hesitated a lot among several options for hosting a django based web app. After reading a lot of posts, I decided to go with webfaction, because there was a lot of positive feedback from existing customers and they have a two month cost free cancellation policy (that means a two-month free trial period). So, I subscribed.

But then, I changed my mind and thought that I would rather like to try heroku, because there's some buzz about them and they have good documentation. So, here is my experience with webfaction's cancellation process: I canceled and less than 1 hour later Paypal informed me that webfaction had refunded me due to their 60-day cost free cancellation program. Then, I received an email from webfaction's support team telling me that they were sad to see me go and that they thanked me for having provided feedback about the reasons why I left.

Conclusion: I really appreciated my short experience with webfaction because they were professional and polite, and they kept the promise of their cost-free cancellation guarantee. If I am not satisfied with heroku, I will definitely come back to them.

analytics, DevOps

How to configure Django with PostgreSQL?

vestphone November 19, 2012 No comments

In this post, I will explain how to configure Django with PostgreSQL instead of MySQL.

I made several changes locally before deploying my app to heroku and Amazon S3.

I changed my configuration to use PostgreSQL instead of MySQL. I did not have any data stored in MySQL, so I initialized a brand new database. To do that, I had to go back to django's tutorial and edit settings.py:

        #'ENGINE': 'django.db.backends.mysql', # USE MYSQL
        'ENGINE': 'django.db.backends.postgresql_psycopg2', # USE postgreSQL

Of course, I had to install PostgreSQL and to create the database, the user, and the password referenced in my django settings.py according to the tutorial. Installing psycopg2 through easy_install may lead to various errors such as the ones below:

ImportError: No module named psycopg2.extensions
or
Error: pg_config executable not found.

That's why we include several packages in the first apt-get install directive. These packages solve the above errors.

sudo apt-get update

sudo apt-get install postgresql postgresql-client python-psycopg2 libpq-dev  postgresql-server-dev-all python-dev python-psycopg2 --fix-missing

easy_install psycopg2

sudo adduser MYUSER

sudo su postgres

psql
postgres=# CREATE USER MYUSER WITH PASSWORD 'MYPASSWORD';
postgres=# CREATE DATABASE thenameofmydb;
postgres=# GRANT ALL PRIVILEGES ON DATABASE thenameofmydb to MYUSER;
postgres=# \q

exit
sudo vim /etc/postgresql/8.4/main/pg_hba.conf

In this file, check the following configuration directive:

# "local" is for Unix domain socket connections only
local   all         all                               trust

Then, we reload PostgreSQL's configuration and log in with psql:

sudo /etc/init.d/postgresql reload
psql -d thenameofmydb -U MYUSER
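Putting the pieces together, the database section of settings.py might look like the sketch below. The NAME, USER and PASSWORD values are the placeholders used in the psql session above; the empty HOST and PORT are an assumption that matches the "local" (Unix domain socket) line in pg_hba.conf.

```python
# Sketch of the database section of settings.py (values are placeholders).
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql_psycopg2",
        "NAME": "thenameofmydb",
        # PostgreSQL lowercases unquoted role names, so a role created
        # with CREATE USER MYUSER is stored as "myuser".
        "USER": "myuser",
        "PASSWORD": "MYPASSWORD",
        "HOST": "",  # assumption: empty means the local Unix domain socket
        "PORT": "",  # assumption: empty means the default port
    }
}
```

After editing the file, python manage.py syncdb creates the tables in the new database.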

CDN, DevOps, Digital marketing

Which server for deploying a scalable django powered web app? How to install nginx and gunicorn for django.

vestphone November 19, 2012 No comments


This post explains how to install nginx and gunicorn for django under linux Debian.

From my readings, nginx should be used to serve static files. Nginx should proxy all other requests to gunicorn.

These two tools are easy to install on my debian squeeze:

sudo apt-get install python-software-properties -y
sudo apt-get install nginx
sudo apt-get remove apache2
sudo apt-get autoremove 
pip install gunicorn

Then we have to configure nginx. For this part, I relied on the following tutorial.

sudo vim /etc/nginx/nginx.conf

I did not change this file as default configuration options seemed reasonable. At this stage, the 'it works page' should be served on localhost:

firefox http://127.0.0.1/ &

Then we have to tell nginx where the static files are located, and what URL to use to proxy other requests to gunicorn. As I use nginx for a single site, I directly edit the default configuration file:

sudo vim /etc/nginx/sites-enabled/default

In this file, I remove all the defined "location" directives. And I add the ones related to my django app, to the static files of its admin page, and to the static files (/home/toto/myapp/app/static/*):

    location /static {
        root /home/toto/virtualenv_1/myapp/app/;
    }

    location /static/admin {
        root /home/toto/virtualenv_1/lib/python2.6/site-packages/django/contrib/admin/;
    }

    location / {
        proxy_pass http://127.0.0.1:8888;
    }

Then, I restart nginx:

sudo /etc/init.d/nginx restart

If you have an error such as:

Starting nginx: [emerg]: bind() to 0.0.0.0:80 failed (98: Address already in use)
[emerg]: bind() to [::]:80 failed (98: Address already in use)

try using the following command to see which server runs already, and kill it:

sudo lsof -i tcp:80

Finally, we configure gunicorn. It's incredibly easy.

First, we edit django's settings.py according to the documentation and add "gunicorn" to INSTALLED_APPS. Point the STATIC* variables at the URLs served by nginx, since nginx serves all the static files. Also modify urls.py to remove the lines related to the static files. Then we simply launch the server:

./manage.py run_gunicorn
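Note that the nginx configuration above forwards requests to 127.0.0.1:8888, so gunicorn has to listen on that address. One way to pin this down is a gunicorn configuration file, which is plain Python. The sketch below is an assumption about how this could be set up: the file name gunicorn_conf.py and the number of workers are made up for the example.

```python
# gunicorn_conf.py (hypothetical file name)

# Must match the proxy_pass address in the nginx configuration.
bind = "127.0.0.1:8888"

# Assumption: a small machine; tune this to the number of CPU cores.
workers = 3
```

gunicorn reads such a file when it is given through its -c (--config) option.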

analytics, startups

How to know the size (number of row/records) of a mysql database?

vestphone November 19, 2012 No comments

In this post, I will show you how to find out the size (number of rows/records) of a MySQL database.

I wanted to host an app on heroku. They offer a free database for up to 10000 rows. So, I needed to know whether my database fits within that limit...

$ mysql -u root -p 

mysql> use my_site_db
mysql> SELECT SUM(TABLE_ROWS)       FROM INFORMATION_SCHEMA.TABLES       WHERE TABLE_SCHEMA = 'my_site_db';

+-----------------+
| SUM(TABLE_ROWS) |
+-----------------+
|            9662 |
+-----------------+
1 row in set (0.03 sec)

My database contains about 9662 records (for InnoDB tables, TABLE_ROWS is an estimate rather than an exact count).

DevOps, Programming

Celery and warning messages (print) to stdout

vestphone November 18, 2012 No comments

In this post, I will explain how to use Celery and print warning messages to stdout.


If you use celery, you might be frustrated that the output of the 'print' statements you put in your code for debugging does not show up. There is a very simple way to solve this and see all the printed output. As an added benefit, it lets you display nice warnings in a pythonic way :).

The solution is very simple! Just import the warnings module in the module that you call with celery:

import warnings
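Once the module is imported, warnings.warn can replace a debugging print. Here is a minimal sketch; the function check_word is a made-up example, and how the message appears in the worker's log depends on your celery logging configuration.

```python
import warnings


def check_word(word):
    """Return the lowercased word, warning (instead of printing) on empty input."""
    if not word:
        warnings.warn("empty word received, skipping")
        return None
    return word.lower()


if __name__ == "__main__":
    check_word("")        # emits a UserWarning on stderr
    check_word("Celery")  # no warning
```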

DevOps

How to install lxml (how to force easy_install to use a given version)?

vestphone November 15, 2012 No comments


To install lxml in your virtualenv, do:

sudo apt-get install libxml2-dev libxslt-dev
easy_install lxml==2.2.8

Programming

Why "from toto import *" is a bad idea in Python programs.

vestphone November 15, 2012 No comments

In this post, I will explain why the good practice in Python programs is to carefully import only what is necessary to run your program. Let's start with an example. Sometimes, I feel lazy and use:

from toto import *

This is a bad idea, as sometimes the names of the things that you import will conflict with the names of other things in your program. And this can be a nightmare to debug. So, try not to be lazy and rather state explicitly what you want to import with:

import toto

from toto import titi, tata
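To see such a conflict in action, here is a small sketch using two standard library modules, math and cmath, which both define sqrt. The helper star_import is a made-up name: it replays "from module import *" statements, in order, in a scratch namespace so that the result can be inspected.

```python
import cmath
import math


def star_import(module_names):
    """Replay `from m import *` for each module, in order, in one namespace."""
    namespace = {}
    for name in module_names:
        exec("from %s import *" % name, namespace)
    return namespace


# math's sqrt is imported first, then silently replaced by cmath's sqrt.
ns = star_import(["math", "cmath"])
clash = ns["sqrt"] is cmath.sqrt

# Reversing the order of the two imports reverses the winner.
ns_reversed = star_import(["cmath", "math"])
```

Nothing warns you about the replacement: whichever star import comes last wins, and code that expected math.sqrt now gets complex results.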

LaTeX, Programming

A simple DeTeX function in python - LaTeX to text

vestphone November 14, 2012 No comments


I have implemented a simple DeTeX function in Python. I provide this function below, as is and without any guarantee. If you run it, it should change the example LaTeX text into "simple" text thanks to the detex() function defined in the code.

It's a quick and dirty approach: I did not try to implement the full LaTeX syntax. I just applied a few regexps to strip the commands from the text. Feedback will be appreciated in the comment form below :)

Beware of the "backslash plague", as explained in http://docs.python.org/2/howto/regex.html.
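As a small illustration of the backslash plague, the sketch below strips a single LaTeX command with a raw-string pattern. The helper strip_command is a hypothetical name and is not part of the code that follows; it only shows why raw strings keep the patterns readable.

```python
import re


def strip_command(text, command):
    """Replace \\command{content} by content (no nested braces handled)."""
    # In a raw string, r'\\' is the regexp for ONE literal backslash.
    pattern = r'\\' + command + r'\{([^{}]*)\}'
    return re.sub(pattern, r'\1', text)


# Without a raw string, the same regexp needs four backslashes.
same_pattern = ('\\\\emph' == r'\\emph')

result = strip_command(r'a \emph{b} c', 'emph')
```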

#!/usr/bin/python
# -*- coding: UTF-8 -*-
    
import re

testMode=False

def applyRegexps(text, listRegExp):
    """ Applies successively many regexps to a text"""
    if testMode:
        print('\n'.join(str(rule) for rule in listRegExp))
    # apply all the rules in the ruleset
    for element in listRegExp:
        left = element['left']
        right = element['right']
        r=re.compile(left)
        text=r.sub(right,text)
    return text

"""
     _      _             ____  
  __| | ___| |_ _____  __/ /\ \ 
 / _` |/ _ \ __/ _ \ \/ / |  | |
| (_| |  __/ ||  __/>  <| |  | |
 \__,_|\___|\__\___/_/\_\ |  | |
                         \_\/_/ 
"""

def detex(latexText):
    """Transform a latex text into a simple text"""    
    # initialization
    regexps=[]
    text=latexText
    # remove all the contents of the header, i.e. everything up to and including the first occurrence of "\begin{document}"
    text = re.sub(r"(?s).*?(\\begin\{document\})", "", text, 1)
    
    # remove comments
    regexps.append({r'left':r'([^\\])%.*', 'right':r'\1'})
    text= applyRegexps(text, regexps)
    regexps=[]
     
    # - replace some LaTeX commands by the contents inside curly brackets
    to_reduce = [r'\\emph', r'\\textbf', r'\\textit', r'\\text', r'\\IEEEauthorblockA', r'\\IEEEauthorblockN', r'\\author', r'\\caption', r'\\thanks']
    for tag in to_reduce:
      regexps.append({'left':tag+r'\{([^\}\{]*)\}', 'right':r'\1'})
    text= applyRegexps(text, regexps)
    regexps=[]
    """
     _     _       _ _       _     _   
    | |__ (_) __ _| (_) __ _| |__ | |_ 
    | '_ \| |/ _` | | |/ _` | '_ \| __|
    | | | | | (_| | | | (_| | | | | |_ 
    |_| |_|_|\__, |_|_|\__, |_| |_|\__|
             |___/     |___/           
    """
    # - replace some LaTeX commands by the contents inside curly brackets and highlight these contents
    to_highlight = [r'\\part[\*]*', r'\\chapter[\*]*', r'\\section[\*]*', r'\\subsection[\*]*', r'\\subsubsection[\*]*', r'\\paragraph[\*]*'];
    # highlight pattern: #--content--#
    for tag in to_highlight:
      regexps.append({'left':tag+r'\{([^\}\{]*)\}','right':r'\n#--\1--#\n'})
    # highlight pattern: [content]
    to_highlight = [r'\\title',r'\\author',r'\\thanks',r'\\cite', r'\\ref'];
    for tag in to_highlight:
      regexps.append({'left':tag+r'\{([^\}\{]*)\}','right':r'[\1]'})
    text= applyRegexps(text, regexps)
    regexps=[]
    
    """
     _ __ ___ _ __ ___   _____   _____ 
    | '__/ _ \ '_ ` _ \ / _ \ \ / / _ \
    | | |  __/ | | | | | (_) \ V /  __/
    |_|  \___|_| |_| |_|\___/ \_/ \___|
                                       
    """
    # remove LaTeX tags
    # - remove completely some LaTeX commands that take arguments
    to_remove = [r'\\maketitle',r'\\footnote', r'\\centering', r'\\IEEEpeerreviewmaketitle', r'\\includegraphics', r'\\IEEEauthorrefmark', r'\\label', r'\\begin', r'\\end', r'\\big', r'\\right', r'\\left', r'\\documentclass', r'\\usepackage', r'\\bibliographystyle', r'\\bibliography',  r'\\cline', r'\\multicolumn']
    
    # replace tag with options and argument by a single space
    for tag in to_remove:
      regexps.append({'left':tag+r'(\[[^\]]*\])*(\{[^\}\{]*\})*', 'right':r' '})
      #regexps.append({'left':tag+r'\{[^\}\{]*\}\[[^\]\[]*\]', 'right':r' '})
    text= applyRegexps(text, regexps)
    regexps=[]

    """
                    _                
     _ __ ___ _ __ | | __ _  ___ ___ 
    | '__/ _ \ '_ \| |/ _` |/ __/ _ \
    | | |  __/ |_) | | (_| | (_|  __/
    |_|  \___| .__/|_|\__,_|\___\___|
             |_|                     
    """
    
    # replace some symbols by their ascii equivalent
    # - common symbols
    regexps.append({'left':r'\\eg(\{\})* *','right':r'e.g., '})
    regexps.append({'left':r'\\ldots','right':r'...'})
    regexps.append({'left':r'\\Rightarrow','right':r'=>'})
    regexps.append({'left':r'\\rightarrow','right':r'->'})
    regexps.append({'left':r'\\le','right':r'<='})
    regexps.append({'left':r'\\ge','right':r'>='})
    regexps.append({'left':r'\\_','right':r'_'})
    regexps.append({'left':r'\\\\','right':r'\n'})
    regexps.append({'left':r'~','right':r' '})
    regexps.append({'left':r'\\&','right':r'&'})
    regexps.append({'left':r'\\%','right':r'%'})
    regexps.append({'left':r'([^\\])&','right':r'\1\t'})
    regexps.append({'left':r'\\item','right':r'\t- '})
    regexps.append({'left':r'\\hline[ \t]*\\hline','right':r'============================================='})
    regexps.append({'left':r'[ \t]*\\hline','right':r'_____________________________________________'})
    # - special letters
    regexps.append({'left':r'\\\'{?\{e\}}?','right':r'é'})
    regexps.append({'left':r'\\`{?\{a\}}?','right':r'à'})
    regexps.append({'left':r'\\\'{?\{o\}}?','right':r'ó'})
    regexps.append({'left':r'\\\'{?\{a\}}?','right':r'á'})
    # keep untouched the contents of the equations
    regexps.append({'left':r'\$(.)\$', 'right':r'\1'})
    regexps.append({'left':r'\$([^\$]*)\$', 'right':r'\1'})
    # remove the equation symbols ($)
    regexps.append({'left':r'([^\\])\$', 'right':r'\1'})
    # correct spacing problems
    regexps.append({'left':r' +,','right':r','})
    regexps.append({'left':r' +','right':r' '})
    # (the replacement side is not a pattern, so the characters need no escaping)
    regexps.append({'left':r' +\)','right':r')'})
    regexps.append({'left':r'\( +','right':r'('})
    regexps.append({'left':r' +\.','right':r'.'})
    # remove lonely curly brackets    
    regexps.append({'left':r'^([^\{]*)\}', 'right':r'\1'})
    regexps.append({'left':r'([^\\])\{([^\}]*)\}','right':r'\1\2'})
    regexps.append({'left':r'\\\{','right':r'{'})
    regexps.append({'left':r'\\\}','right':r'}'})
    # strip white space characters at end of line
    regexps.append({'left':r'[ \t]*\n','right':r'\n'})
    # remove consecutive blank lines
    regexps.append({'left':r'([ \t]*\n){3,}','right':r'\n'})
    # apply all those regexps
    text= applyRegexps(text, regexps)
    regexps=[]    
    # return the modified text
    return text

"""
                 _       
 _ __ ___   __ _(_)_ __  
| '_ ` _ \ / _` | | '_ \ 
| | | | | | (_| | | | | |
|_| |_| |_|\__,_|_|_| |_|
                         
"""
def main():
    """ Just for debugging"""
    #print "defining the test text\n"
    latexText=r"""
    % This paper can be formatted using the peerreviewca
    % (instead of conference) mode.
    \documentclass[twocolumn,a4paper]{article}
    %\documentclass[peerreviewca]{IEEEtran}
    % correct bad hyphenation here
    \hyphenation{op-ti-cal net-works semi-con-duc-tor IEEEtran pri-va-cy Au-tho-ri-za-tion}
    % package for printing the date and time (version)
    \usepackage{time}
    \begin{document}
    \title{Next Generation Networks}
    \author{Tot titi\thanks{Network and Security -- test company -- toto@ieee.org}}
    \maketitle
    \begin{abstract}\footnote{Version :  \today ;  \now}
    lorem ipsum(\ldots)\end{abstract}
    \emph{Keywords: IP Multimedia Subsystem, Quality of Service}
    \section{Introduction} \label{sect:introduction}
    lorem ipsum(\ldots) \% of the world population. \cite{TISPAN2006a}. \footnote{Bearer Independent Call Control protocol}. 
    \hline
    \section{Protocols used in IMS} \label{sect:protocols}
    lorem ipsum(\ldots) \cite{rfc2327, rfc3264}.
    \subsection{Authentication, Authorization, and Accounting} \label{sect:protocols_aaa}
    lorem ipsum(\ldots)
    \subsubsection{Additional protocols} \label{sect:protocols_additional}
    lorem ipsum(\ldots)
    \begin{table}
        \begin{center}
            \begin{tabular}{|c|c|c|}
            \hline
                \textbf{Capability}                                 & \textbf{UE} & \textbf{GGSN} \\ \hline
                \emph{DiffServ Edge Function}           & Optional      & Required          \\ \hline
                \emph{RSVP/IntServ}                                 & Optional      & Optional          \\ \hline
                \emph{IP Policy Enforcement Point}  & Optional      & Required          \\ \hline
            \end{tabular}
        \caption{IP Bearer Services Manager capability in the UE and GGSN}
        \label{tab_ue_ggsn}
        \end{center}
    \end{table}
     The main transport layer functions are listed below:
    \begin{my_itemize}
        \item The \emph{Resource Control Enforcement Function} (RCEF) enforces policies under the control of the A-RACF. It opens and closes unidirectional filters called \emph{gates} or \emph{pinholes}, polices traffic and marks IP packets \cite{TISPAN2006c}.
        \item  The \emph{Border Gateway Function} (BGF) performs policy enforcement and Network Address Translation (NAT) functions under the control of the S-PDF. It operates on unidirectional flows related to a particular session (micro-flows) \cite{TISPAN2006c}.
        \item  The \emph{Layer 2 Termination Point} (L2TP) terminates the Layer 2 procedures of the access network \cite{TISPAN2006c}.
    \end{my_itemize}
    Their QoS capabilities are summarized in table \ref{tab_rcef_bgf} \cite{TISPAN2006c}.
    The admission control usually follows a three step procedure:
    \begin{my_enumerate}
        \item Authorization of resources (\eg by the A-RACF)
        \item Resource reservation (\eg by the BGF)
        \item Resource commitment (\eg by the RCEF)
    \end{my_enumerate}
    \begin{figure}
    \centering
    \includegraphics[width=1.5in]{./pictures/RACS_functional_architecture}
    \caption{RACS interaction with transfer functions}
    \label{fig_RACS_functional_architecture}
    \end{figure}
    %\subsection{Example}  \label{sect:qos_example}
    % conference papers do not normally have an appendix
    % use section* for acknowledgement
    \section*{Acknowledgment}
    % optional entry into table of contents (if used)
    %\addcontentsline{toc}{section}{Acknowledgment}
    lorem ipsum(\ldots)
    \bibliographystyle{plain}
    %\bibliographystyle{alpha}
    \bibliography{./mabiblio}
    \end{document}
    """
    #print '\n'.join(diff)
    text=detex(latexText)
    print(text)


if __name__ == "__main__":
    main()
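
The listing above calls applyRegexps on lists of {'left': ..., 'right': ...} dictionaries. For readers who only want to try a few rules, here is a minimal sketch of how such a helper could work, assuming each rule is simply applied in order with re.sub. The name apply_regexps_sketch and the three sample rules are illustrative only; the real applyRegexps may behave differently.

```python
import re

def apply_regexps_sketch(text, regexps):
    """Apply each {'left': pattern, 'right': replacement} rule in order.

    Hypothetical stand-in for applyRegexps: assumes plain re.sub semantics.
    """
    for rule in regexps:
        text = re.sub(rule['left'], rule['right'], text)
    return text

# Three rules taken from the listing: reduce \emph, highlight \section,
# and replace \ldots by three dots.
rules = [
    {'left': r'\\emph\{([^\}\{]*)\}', 'right': r'\1'},
    {'left': r'\\section[\*]*\{([^\}\{]*)\}', 'right': r'\n#--\1--#\n'},
    {'left': r'\\ldots', 'right': r'...'},
]

sample = r'\section{Introduction} The \emph{gates} open (\ldots)'
result = apply_regexps_sketch(sample, rules)
print(result)
```

Because the rules run one after the other, their order matters: a rule that strips a command must come before any rule that would otherwise match what is left of it.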

Enjoy! And feel free to comment below or to put a link to this article on your blog. Thanks!

DevOps

Using regexp (regular expressions) in python

vestphone November 14, 2012 No comments

Regular expressions are really useful whenever you want to search for or replace text in a file or a string. That means all the time! They are concise, elegant and powerful. So let's have a look at an example.

I provide below a simple code example to show the usage of regular expressions in python.

import re
text= r"\begin{environment}"
left = r'\\begin'
right = r''
r=re.compile(left)
text=r.sub(right,text)
print(text)

This code returns:

{environment}

Note the 'r' before the strings: it marks them as raw strings, so Python leaves the backslashes untouched and the regular expression engine receives them exactly as written (see http://docs.python.org/2/howto/regex.html for the explanations about the "backslash plague").
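
To see what the 'r' prefix saves you, here is a small sketch that writes the same pattern with and without it. The variable names are only for illustration.

```python
import re

# Without the r prefix, each backslash meant for the regex engine
# has to be doubled once more for the Python string literal.
plain_pattern = '\\\\begin'
raw_pattern = r'\\begin'

# Both literals describe the same seven characters: \\begin
same = (plain_pattern == raw_pattern)

text = r"\begin{environment}"
stripped = re.sub(raw_pattern, '', text)
print(stripped)
```

Both patterns match a literal backslash followed by "begin"; the raw version is simply easier to read and harder to get wrong.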

If you want to master regex, have a look at the O'Reilly books, they are amazing.

Digital marketing, Google, SEO

Blogger: how to add a blog's sitemap in webmaster tools?

vestphone November 14, 2012 No comments


Sitemaps tell web search engines which pages exist on your site or blog. Thus, it is a good Search Engine Optimization (SEO) practice to provide the sitemaps of your site to the main search engines.

If you use blogger, the sitemap URLs for your blog are the following:

/atom.xml?redirect=false&start-index=1&max-results=500
/rss.xml?redirect=false&start-index=1&max-results=500

You can, for instance, declare them in Google Webmaster Tools.
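
If you would rather generate these paths than type them, here is a small sketch. The function name blogger_sitemap_paths is made up for this example, and the idea of adding further pages (start-index=501, 1001, ...) for blogs with more than 500 posts is an assumption based on the start-index and max-results parameters, not something tested here.

```python
def blogger_sitemap_paths(feed='atom', total_posts=500, page_size=500):
    """Return the feed paths to declare as sitemaps, one per page of posts.

    Hypothetical helper: paging beyond the first 500 posts is an assumption.
    """
    paths = []
    for start in range(1, max(total_posts, 1) + 1, page_size):
        paths.append('/%s.xml?redirect=false&start-index=%d&max-results=%d'
                     % (feed, start, page_size))
    return paths

# Default: the single atom path given above.
for path in blogger_sitemap_paths():
    print(path)

# A blog with 1200 posts would get three rss paths under this assumption.
for path in blogger_sitemap_paths('rss', total_posts=1200):
    print(path)
```

Append each path to your blog's address when you declare it.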

