Blog entries

  • Rss feeds aggregator based on Scikits.learn and CubicWeb

    2011/10/17 by Vincent Michel

    During EuroSciPy, the Logilab team presented an original approach to querying news using semantic information: "Rss feeds aggregator based on Scikits.learn and CubicWeb", by Vincent Michel. This work is based on two major pieces of software:

    • CubicWeb, the Pythonic semantic web framework, is used to store and query DBpedia information. CubicWeb can reconstruct links from RDF/NT files, and can easily execute complex queries on a database with more than 8 million entities and 75 million links when using a PostgreSQL backend.
    • Scikits.learn is a cutting-edge Python toolbox for machine learning. It provides algorithms that are simple and easy to use.

    Based on these tools, we built a pure Python application to query the news:

    • Named entities are extracted from the RSS articles of a few mainstream English-language news sources (New York Times, Reuters, BBC News, etc.). For each group of words in an article, we check whether a DBpedia entry has the same label. If so, we create a semantic link between the article and the DBpedia entry.
    • An occurrence matrix of "RSS articles" × "named entities" is built and can be fed to several machine learning algorithms (MeanShift, hierarchical clustering) in order to provide original and informative views of recent events.
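The two steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code presented in the talk: the label set `DBPEDIA_LABELS`, the helper names, and the sample articles are all hypothetical, and the real application looks labels up in CubicWeb rather than in an in-memory set.

```python
from collections import Counter

# Hypothetical stand-in for the DBpedia labels stored in CubicWeb.
DBPEDIA_LABELS = {"barack obama", "new york", "nasa"}


def word_groups(text, max_len=3):
    """Yield every run of 1..max_len consecutive words, lowercased."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    for size in range(1, max_len + 1):
        for start in range(len(words) - size + 1):
            yield " ".join(words[start:start + size])


def extract_entities(text, labels=DBPEDIA_LABELS):
    """Count the word groups of an article that match a known label."""
    return Counter(g for g in word_groups(text) if g in labels)


def occurrence_matrix(articles, labels=DBPEDIA_LABELS):
    """Return (columns, rows): one row per article, one column per label."""
    columns = sorted(labels)
    rows = []
    for text in articles:
        counts = extract_entities(text, labels)
        rows.append([counts[label] for label in columns])
    return columns, rows


# Made-up sample articles, for illustration only.
articles = [
    "Barack Obama visited New York on Monday.",
    "NASA delayed its launch; Barack Obama commented.",
]
columns, matrix = occurrence_matrix(articles)
print(columns)
print(matrix)
```

Each row of `matrix` describes one article, so the matrix can be handed to a clustering estimator, for instance `sklearn.cluster.MeanShift` in current scikit-learn releases, to group articles that mention the same entities.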

    Moreover, queries can be combined with semantic information from DBpedia:

    • All musical artists in the news:

      DISTINCT Any E, R WHERE E appears_in_rss R, E has_type T, T label "musical artist"
      
    • All living office holders in the news:

      DISTINCT Any E WHERE E appears_in_rss R, E has_type T, T label "office holder", E has_subject C, C label "Living people"
      
    • All news items that mention both Barack Obama and any scientist:

      DISTINCT Any R WHERE E1 label "Barack Obama", E1 appears_in_rss R, E2 appears_in_rss R, E2 has_type T, T label "scientist"
      
    • All news items that mention a drug:

      Any X, R WHERE X appears_in_rss R, X has_type T, T label "drug"
      

    Such a tool may be used for informetrics and news analysis. Feel free to download the complete slides of the presentation.


  • Python in Finance (and Derivative Analytics)

    2011/10/25 by Damien Garaud

    The Logilab team attended (and co-organized) EuroScipy 2011, at the end of August in Paris.

    We saw some interesting posters and a presentation dealing with Python in finance and derivative analytics [1].

    In order to debunk the idea that "all computation libraries dedicated to financial applications must be written in C/C++ or some other compiled programming language", I would like to introduce a more Pythonic way.

    You may know that financial applications such as risk management usually have high computational needs. For instance, it may be necessary to run a large number of Monte Carlo simulations to price an American option within a few seconds.
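To give an idea of what such a simulation looks like, here is a minimal pure-Python sketch. Note the swap: it prices a European call, not an American option, because handling early exercise needs extra machinery (for instance a regression step at each exercise date) that would not fit in a short example. The function names and all numerical parameters are illustrative assumptions, and the closed-form Black-Scholes price is included only as a sanity check.

```python
import math
import random


def mc_european_call(spot, strike, rate, vol, maturity,
                     n_paths=100_000, seed=0):
    """Monte Carlo price of a European call under geometric Brownian motion."""
    rng = random.Random(seed)
    drift = (rate - 0.5 * vol ** 2) * maturity
    diffusion = vol * math.sqrt(maturity)
    total_payoff = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)  # standard normal draw
        terminal = spot * math.exp(drift + diffusion * z)
        total_payoff += max(terminal - strike, 0.0)
    # Discount the average payoff back to today.
    return math.exp(-rate * maturity) * total_payoff / n_paths


def black_scholes_call(spot, strike, rate, vol, maturity):
    """Closed-form Black-Scholes price, used here only as a reference."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol ** 2) * maturity) / (
        vol * math.sqrt(maturity))
    d2 = d1 - vol * math.sqrt(maturity)

    def norm_cdf(x):
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    return spot * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)


# Illustrative parameters: at-the-money call, 5% rate, 20% volatility, 1 year.
print(mc_european_call(100.0, 100.0, 0.05, 0.2, 1.0))
print(black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0))
```

The inner loop is exactly the kind of code that benefits from the libraries listed below: the same computation can be vectorized with NumPy or compiled with Cython.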

    The Python community provides several reliable and efficient libraries and packages dedicated to numerical computations:

    • the well-known SciPy and NumPy libraries. They provide a complete set of tools for working with matrices, linear algebra operations, singular value decompositions, multivariate regression models, ...
    • scikits is a set of add-on toolkits for SciPy. For instance, there are statistical models in the statsmodels package, a toolkit dedicated to time series manipulation and another one dedicated to numerical optimization;
    • pandas is a recent Python package which provides "fast, flexible, and expressive data structures designed to make working with relational or labeled data both easy and intuitive". pandas uses Cython to improve its performance. Moreover, pandas has been used extensively in production in financial applications;
    • Cython is a way to write C extensions for the Python language. Since Cython code is written in much the same way as Python code, it is easy to pick up. Is it fast? Yes! I compared a simple example from Cython's official documentation, a piece of code which computes the first k prime numbers, with a pure Python version. In my own (unofficial) measurement, the Cython code was almost thirty times faster than the pure Python code. Furthermore, you can use NumPy in Cython code!
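For reference, the pure Python side of such a comparison could look like the sketch below. It is an assumption about the shape of the benchmark, not the exact code that was timed: the function name is hypothetical, and the Cython counterpart would keep the same trial-division loop while adding static C type declarations for the loop variables.

```python
def first_primes(k):
    """Return the first k prime numbers using trial division."""
    primes = []
    candidate = 2
    while len(primes) < k:
        # candidate is prime if no previously found prime divides it.
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


print(first_primes(10))
```

Timing a call such as `first_primes(1000)` with the standard `timeit` module, before and after compiling the typed version with Cython, is one simple way to reproduce this kind of measurement.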

    I believe that, thanks to these tools and libraries, Python can be used for numerical computation, even in finance (both research and production). Python code is easy to maintain, and that does not have to come at the cost of performance.

    Note that you can find some other references on Visixion webpages: