Blog

A blog about data science, machine learning and programming.


Docker cheat sheet

Containerising your applications is a great way to ensure reproducibility of your work. Here’s a quick cheat sheet of important Docker commands. For more info, see also the Docker documentation for the run command.

Sessionising clickstream data

A common task when analysing clickstream data is to sessionise the individual clicks. This involves aggregating the clicks from a given cookie ID into groups, such that the time difference between successive clicks in a group is not greater than the session timeout value. The session timeout value is typically taken to be 20 or 30 minutes.
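
As a rough illustration of the idea, here is a minimal sketch in plain Python. It assumes the clicks arrive as (cookie ID, timestamp) pairs and uses a 30-minute timeout; the function name sessionise and the sample data are made up for this example and are not taken from the full post.

```python
from datetime import datetime, timedelta
from itertools import groupby
from operator import itemgetter

# Assumed timeout; 20 minutes is an equally common choice.
SESSION_TIMEOUT = timedelta(minutes=30)


def sessionise(clicks, timeout=SESSION_TIMEOUT):
    """Label each (cookie_id, timestamp) click with a per-cookie session number.

    Returns a list of (cookie_id, timestamp, session_number) tuples,
    sorted by cookie ID and then by timestamp.
    """
    labelled = []
    # Sort by cookie ID, then timestamp, so groupby sees each cookie's clicks in order.
    ordered = sorted(clicks, key=itemgetter(0, 1))
    for cookie_id, group in groupby(ordered, key=itemgetter(0)):
        session_number = 0
        previous = None
        for _, timestamp in group:
            # A gap strictly greater than the timeout starts a new session.
            if previous is not None and timestamp - previous > timeout:
                session_number += 1
            labelled.append((cookie_id, timestamp, session_number))
            previous = timestamp
    return labelled


# Hypothetical sample data: two cookies, one with a 50-minute gap.
sample_clicks = [
    ("a", datetime(2020, 1, 1, 10, 0)),
    ("a", datetime(2020, 1, 1, 10, 10)),
    ("a", datetime(2020, 1, 1, 11, 0)),
    ("b", datetime(2020, 1, 1, 10, 5)),
]

for cookie_id, timestamp, session_number in sessionise(sample_clicks):
    print(cookie_id, timestamp.isoformat(), session_number)
```

Session numbers restart at 0 for each cookie ID, so a session is identified by the pair of cookie ID and session number. On larger datasets the same logic is usually expressed with a dataframe library or SQL window functions rather than a Python loop.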

Programmatically generating Jupyter notebooks

Jupyter notebooks are a great tool for data science projects, and provide a level of interactivity that can’t be achieved with a normal Python script (at least not without considerable effort). Sometimes, however, you may want to re-use a Jupyter notebook across different projects whose structure is very similar, such as when using the cookiecutter data science project structure. To automate the generation of Jupyter notebook files, I have put together a Python tool, nb-templater.
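
To give a feel for what generating a notebook programmatically can look like, here is a small sketch that uses only the standard library. It is not how nb-templater works; it simply relies on the fact that a notebook file is a JSON document in the nbformat 4 layout. The function names, the template text and the project name are all hypothetical.

```python
import json
from string import Template

# Hypothetical cell templates; $project is filled in per project.
TITLE_TEMPLATE = Template("# Exploration notebook for $project")
CODE_TEMPLATE = Template("project_name = '$project'\nprint(project_name)")


def build_notebook(project):
    """Return a minimal nbformat 4 notebook as a plain dict."""
    return {
        "cells": [
            {
                "cell_type": "markdown",
                "metadata": {},
                "source": TITLE_TEMPLATE.substitute(project=project),
            },
            {
                "cell_type": "code",
                "execution_count": None,  # the cell has not been run yet
                "metadata": {},
                "outputs": [],
                "source": CODE_TEMPLATE.substitute(project=project),
            },
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


def write_notebook(notebook, path):
    """Write the notebook dict to disk as an .ipynb file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(notebook, handle, indent=1)


notebook = build_notebook("my-project")
print(json.dumps(notebook, indent=1)[:80])
```

Calling write_notebook(notebook, "exploration.ipynb") would save a file that Jupyter can open. For anything beyond a toy example, the nbformat package offers helpers and validation for building notebooks.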

Automatic environment loading for Python projects

The cookiecutter project structure is quite popular for Python-based data science projects, and it relies on virtual environments to work properly. This is because the project’s re-usable code is kept in the src/ folder, and is thus imported as a Python module named src. This is fine as long as you only have one project, but if you have several, their identically named src modules will begin to clash. Hence the need for virtual environments.