Managing isolated environments with PySpark

The Spark data processing platform is becoming more and more important for data scientists using Python. PySpark - the official Python API for Spark - makes it easy to get started, but managing applications and their dependencies in isolated environments is no easy task.

more ...
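The teaser stops before any details, so purely as an illustration of the general idea (not necessarily the approach taken in the post): one common way to run PySpark against an isolated environment is to pack that environment into an archive and point Spark at the interpreter inside it. The archive name, the "environment" alias and the spark.archives option (available from Spark 3.1) in this sketch are assumptions:

    import os
    from pyspark.sql import SparkSession

    # Assumed: the environment was packed beforehand, e.g. with
    #   conda pack -o pyspark_env.tar.gz
    # The archive name and the "environment" alias are illustrative.
    os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"

    spark = (
        SparkSession.builder
        .appName("isolated-env-demo")
        # Ship the packed environment to the executors and unpack it
        # under the alias "environment" (spark.archives needs Spark >= 3.1).
        .config("spark.archives", "pyspark_env.tar.gz#environment")
        .getOrCreate()
    )

    # Every executor now runs the interpreter from the shipped archive,
    # so the third-party packages inside it are available to UDFs and RDD code.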


Efficient UD(A)Fs with PySpark

Nowadays, Spark is surely one of the most prevalent technologies in the fields of data science and big data. Luckily, even though it is developed in Scala and runs in the Java Virtual Machine (JVM), it comes with Python bindings, also known as PySpark, whose API was heavily influenced by …

more ...
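The excerpt above is cut off, so the exact mechanism the post uses is not visible here. As a hedged sketch of what efficient UD(A)Fs can look like in current PySpark, the vectorized pandas UDFs below (Spark 3.x style; all column and function names are illustrative) operate on whole pandas Series instead of single rows, which keeps the Python/JVM serialization overhead low:

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()
    df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["x"])

    # Scalar pandas UDF: transforms a whole column batch at once.
    @pandas_udf(DoubleType())
    def times_two(x: pd.Series) -> pd.Series:
        return x * 2.0

    # Grouped aggregate pandas UDF (a simple UDAF): one value per group.
    @pandas_udf(DoubleType())
    def mean_udaf(x: pd.Series) -> float:
        return float(x.mean())

    df.withColumn("x2", times_two("x")).show()
    df.groupBy().agg(mean_udaf("x")).show()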

Hive UDFs and UDAFs with Python

Sometimes the analytical power of Hive's built-in functions is just not enough. In such cases it is possible to write hand-tailored User-Defined Functions (UDFs) for transformations, and even for aggregations, which are then called User-Defined Aggregation Functions (UDAFs). In this post we focus on how to write sophisticated UDFs and UDAFs …

more ...
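Whether the post relies on this mechanism is not clear from the teaser, but one plain-Python way to plug a hand-written transformation into Hive is the TRANSFORM/streaming interface, where Hive pipes tab-separated rows through a script on stdin and stdout. The file, table and column names in this sketch are illustrative:

    #!/usr/bin/env python
    # Minimal Hive streaming "UDF": Hive sends tab-separated rows to stdin
    # and reads the transformed tab-separated rows back from stdout.
    # Illustrative usage from Hive (file and column names are assumptions):
    #   ADD FILE upper_name.py;
    #   SELECT TRANSFORM(id, name) USING 'python upper_name.py' AS (id, name) FROM people;
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        fields[-1] = fields[-1].upper()  # transform the last column
        print("\t".join(fields))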


Interactively visualizing distributions in a Jupyter notebook with Bokeh

If you are doing probabilistic programming, you are dealing with all kinds of different distributions. That means choosing an ensemble of distributions that describes the underlying real-world process in a suitable way, but also choosing the right parameters for the prior distributions. At that point I often start visualizing the …

more ...
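As a rough impression of the kind of interactive view the teaser alludes to (the post's actual plots may look different), the sketch below redraws a normal density inside a Jupyter notebook as its parameters change; the chosen distribution, slider ranges and variable names are assumptions:

    import numpy as np
    from scipy import stats
    from ipywidgets import interact
    from bokeh.io import output_notebook, push_notebook, show
    from bokeh.plotting import figure

    output_notebook()  # render Bokeh output inline in the Jupyter notebook

    x = np.linspace(-5, 5, 200)
    fig = figure(title="Normal pdf")
    pdf_line = fig.line(x, stats.norm.pdf(x, 0, 1))
    handle = show(fig, notebook_handle=True)

    # Redraw the density whenever the sliders move (ranges are assumptions).
    def update(mu=0.0, sigma=1.0):
        pdf_line.data_source.data["y"] = stats.norm.pdf(x, mu, sigma)
        push_notebook(handle=handle)

    interact(update, mu=(-3.0, 3.0), sigma=(0.1, 3.0))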