Towards a personal health-datalake

Created:
Category: Data Engineering

After a previous rant, I started toying with the idea of writing software to solve this issue [1]. The goal is to fully control and use my personal health data, and I am developing this for myself first. Luckily, there are a lot of projects and people out there to take inspiration from.

Inkling of an idea

My primary source of inspiration is Simon Willison, a developer with a long pedigree of big projects. The ones I want to focus on right now are Datasette and Dogsheep (but do check out his recent work on large language models) [2]. Datasette is a web UI for SQLite databases which makes publishing data a breeze. With a sprinkling of SQL you can dig into and analyze the data inside an SQLite file. Dogsheep is a collection of tools for personal analytics: tools that let a user import personal data into SQLite files, which can then be published.
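To give a feel for that "sprinkling of SQL", here is a minimal sketch using Python's built-in sqlite3 module. The `steps` table, its columns, and the values are made up for this example and do not come from Datasette or Dogsheep; the same kind of query could be run against a real SQLite file.

```python
import sqlite3

# Hypothetical table and values, only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE steps (day TEXT, source TEXT, step_count INTEGER)")
conn.executemany(
    "INSERT INTO steps VALUES (?, ?, ?)",
    [
        ("2023-05-01", "phone", 4000),
        ("2023-05-01", "watch", 2500),
        ("2023-05-02", "phone", 9000),
    ],
)


def steps_per_day(conn):
    """Total steps per day across all sources, oldest day first."""
    query = "SELECT day, SUM(step_count) FROM steps GROUP BY day ORDER BY day"
    return conn.execute(query).fetchall()


print(steps_per_day(conn))
```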

My plan is to extend these tools with imports for health apps on phones, smartwatches, and other devices. The first priorities are FOSS apps and easy-to-import data sources. But having an import is not enough. What is also needed is a method of transforming and combining that data so that it can be used for analytics and visualization. Finally, the analysis tools included in Datasette are a start, but preferably we can dig into the data ad hoc in a better way.
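The import-then-combine idea could look roughly like the sketch below. It is not how Lazydog or Dogsheep actually work: the CSV exports, the table names, and the `import_csv` and `build_daily_summary` helpers are all hypothetical, and real health apps will export different columns and formats.

```python
import csv
import io
import sqlite3

# Hypothetical exports; real apps will use other columns and formats.
SLEEP_CSV = "day,hours\n2023-05-01,7.5\n2023-05-02,6.0\n"
STEPS_CSV = "day,steps\n2023-05-01,6500\n2023-05-02,9000\n"


def import_csv(conn, table, text):
    """Load CSV text into a new table (table and column names must be trusted)."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join(header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f"CREATE TABLE {table} ({columns})")
    conn.executemany(f"INSERT INTO {table} VALUES ({marks})", data)


def build_daily_summary(conn):
    """Combine both sources into one view that analysis tools can query."""
    conn.execute(
        """
        CREATE VIEW daily_summary AS
        SELECT sleep.day AS day,
               CAST(sleep.hours AS REAL) AS sleep_hours,
               CAST(steps.steps AS INTEGER) AS steps
        FROM sleep JOIN steps ON steps.day = sleep.day
        """
    )


conn = sqlite3.connect(":memory:")
import_csv(conn, "sleep", SLEEP_CSV)
import_csv(conn, "steps", STEPS_CSV)
build_daily_summary(conn)
print(conn.execute("SELECT * FROM daily_summary ORDER BY day").fetchall())
```

Keeping the raw imports as they arrive and doing the cleanup in a view means the transformation can be changed later without re-importing anything.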

Lazydog

I am not really sure what to name the project yet, but for now I'm using Lazydog. I am building this for myself first. It is also an excuse for me to dig into some technologies that I do not usually use. That means that most of the apps for migrating data out of an app or service into SQLite will be written in various compiled languages. The idea is that they are easier to deploy than my typical Java, Python, or Scala projects and come packaged in fewer bytes [3]. For transforming the data inside SQLite I am testing out dbt-core. As an Apache Spark developer (with Airflow or other workflow orchestration tools), I'm interested in trying this tool. But if I ever want to ship this easily to end users, there needs to be an alternative. Hopefully some parts of my solution can help other people.

I'm still on the lookout for the right visualization and analysis tool. Apache Superset, an open-source alternative to Tableau or Power BI, is way too heavy for a personal datalake [4]. Providing finished Plotly Dash or Shiny dashboards is on my list, but those do not have the dig-around-and-experiment factor. I will probably start off with the common data-sciency Julia-Python-R notebooks (not necessarily Jupyter).

Right time

I believe that having more data about myself is becoming more valuable to me as a user. While we are going through a large-language-model hype, it is clear that these models can help as an idea-sparring partner or simply by giving examples or advice. When combined with relevant data they can find solutions for quite advanced problems. There are still plenty of problems with hallucinations and bad advice, so be very careful when using an LLM's suggestions. I do not want to downplay these issues; they are real. However, year over year our data and insights are becoming more valuable to us, but only if they truly belong to us.

Originally the Lazydog idea came to me when I wished that I could get advice on, and adjustments to, my fitness plan based on how much I slept, ate, and moved in the previous days. Now I see the first open-source apps popping up which use LLMs to generate specific fitness programs. It won't take many years for us to use our own data better.


  1. Developers like to throw more code at issues, even when it is not the most effective strategy. Having said that, personal projects are a great place to try new things and learn. We do not always need to use the most efficient method of creating a solution. 

  2. There are more "quantified self" projects that help with collecting personal health data. They try to solve a very similar problem and I draw a lot of inspiration from those apps as well. 

  3. Things like GraalVM native-image for Java, Scala Native, .NET Native AOT, and Nuitka or BeeWare Briefcase for Python are all enticing solutions, but they usually produce much larger executables.

  4. I somehow wish there was a lightweight GTK or Qt version with only the most common features implemented.