FvdHhttps://hillman.dev/2023-11-17T16:05:00+01:00Towards a personal health-datalake2023-11-17T16:05:00+01:002023-11-17T16:05:00+01:00Francis Hillmantag:hillman.dev,2023-11-17:/personal_health_data_002.html<p>A plan for controlling our own health data</p><p>After a <a href="personal_health_data_001.html">previous rant</a>, I started toying with the idea of writing software to solve this issue<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup>. The goal is to fully control and use personal health data, and I am developing this for myself first. Luckily, there are a lot of projects and people out there to take inspiration from.</p>
<h2>Inkling of an idea</h2>
<p>My primary source of inspiration is <a href="https://simonwillison.net/">Simon Willison</a>, a developer with a long pedigree of big projects, but the ones I want to focus on right now are <a href="https://datasette.io/">datasette</a> and <a href="https://dogsheep.github.io/">dogsheep</a> (but do check out his recent work on large language models)<sup id="fnref:2"><a class="footnote-ref" href="#fn:2">2</a></sup>.
<em>Datasette</em> is a web UI for <a href="https://www.sqlite.org/index.html">SQLite</a> databases which makes publishing data a breeze. With some sprinkling of SQL you can dig into and analyze the data inside an SQLite file. <em>Dogsheep</em> is a collection of tools for <em>personal analytics</em>: tools that allow a user to import personal data into SQLite files which can then be published.</p>
<p>My plan is to extend these tools to include imports for health apps on phones, smartwatches, and other devices. The first priority is FOSS apps and easy-to-import data sources. But having an import is not enough; what is also needed is a method of transforming and combining that data such that it can be used for analytics and visualization. Finally, the analysis tools included in Datasette are a start, but preferably we can dig into the data ad hoc in a better way.</p>
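<p>To make the Dogsheep approach concrete: many of those importers build on Simon Willison's <em>sqlite-utils</em> library. A minimal sketch of such an importer, assuming a hypothetical CSV export from a smart-scale app (the file name and columns are made up), could look like this:</p>
<div class="highlight"><pre><code>import csv

import sqlite_utils  # the library most Dogsheep importers build on

# Hypothetical export: weight_export.csv with columns measured_at, kg
db = sqlite_utils.Database("health.db")
with open("weight_export.csv", newline="") as f:
    rows = list(csv.DictReader(f))
# Upsert on the timestamp so re-running the import is idempotent
db["weight"].insert_all(rows, pk="measured_at", replace=True)
</code></pre></div>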
<h2>Lazydog</h2>
<p>I am not really sure what to name the project yet, but for now I'm using <a href="https://github.com/FransHeuvelmans/Lazydog">Lazydog</a>. I am building this for myself first. It is also an excuse for me to dig into some technologies that I usually do not use. That means that most of the tools for migrating data out of an app or service into SQLite will be made with different compiled languages, the idea being that they are easier to deploy than my typical Java, Python, or Scala projects and can be packaged in fewer bytes<sup id="fnref:3"><a class="footnote-ref" href="#fn:3">3</a></sup>. For transforming the data inside SQLite I am testing out <a href="https://docs.getdbt.com/">dbt-core</a>. As a developer who works with Apache Spark (and Airflow or other workflow orchestration tools), I'm interested in trying this tool. But if I ever want to ship this easily to end users, then there needs to be an alternative. Hopefully some parts of my solution can help other people.</p>
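<p>To give an idea of the transformation step: a dbt model is essentially a SQL select over the imported tables, materialized back into the database. A hedged sketch of the same idea in plain Python (the table and column names are hypothetical), combining steps and sleep data into one daily summary:</p>
<div class="highlight"><pre><code>import sqlite3

# Hypothetical source tables "steps" and "sleep", both keyed by day
con = sqlite3.connect("health.db")
con.execute("DROP TABLE IF EXISTS daily_summary")
con.execute(
    """
    CREATE TABLE daily_summary AS
    SELECT s.day, s.step_count, sl.hours_slept
    FROM steps AS s
    JOIN sleep AS sl ON sl.day = s.day
    """
)
con.commit()
con.close()
</code></pre></div>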
<p>I'm still on the lookout for the right visualization and analysis tool. Apache Superset, as an open-source alternative to Tableau or Power BI, is way too heavy for a personal datalake<sup id="fnref:4"><a class="footnote-ref" href="#fn:4">4</a></sup>. Providing finished Plotly Dash or Shiny dashboards is on my list, but those do not have the dig-around-and-experiment factor. I will probably start off with the common data-sciency Julia-Python-R notebooks (not necessarily Jupyter).</p>
<h2>Right time</h2>
<p>I believe that having more data about myself is becoming more valuable to me as a user. While we are going through a large-language-model hype, it is clear that these models can help as an idea-sparring partner or simply by giving examples or advice. When combined with relevant data, they can find solutions for quite advanced problems. There are still plenty of real problems with hallucinations and bad advice, which I do not want to downplay, so be very careful with an LLM's suggestions. However, year over year our data and the insight it provides are becoming more valuable to us, but only if they truly belong to us.</p>
<p>Originally the Lazydog idea came to me when I wished that I could get advice on and adjustments to my fitness plan based on how much I slept, ate, and moved in the previous days. Now I see the first open-source apps popping up which use LLMs to <a href="https://github.com/LiamMorrow/LiftLog">generate specific fitness programs</a>. It won't take many years for us to use our own data better.</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:1">
<p>Developers like to throw more code at issues, even when it is not the most effective strategy. Having said that, personal projects are a great place to try new things and learn. We do not always need to use the most efficient method of creating a solution. <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p>There are more <a href="https://github.com/woop/awesome-quantified-self">"quantified self"</a> projects that help with collecting personal health data. They try to solve a very similar problem and I draw a lot of inspiration from those apps as well. <a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:3">
<p>Things like <em>graalvm native-image</em> for Java, <em>Scala-native</em>, <em>dotnet native AOT</em>, and <em>nuitka</em> or <em>beeware briefcase</em> for Python are all enticing solutions, but they usually produce much larger executables. <a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text">&#8617;</a></p>
</li>
<li id="fn:4">
<p>I do wish there was a lightweight GTK or Qt version with the most common features implemented. <a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text">&#8617;</a></p>
</li>
</ol>
</div>Using Intel Arc2023-05-06T17:36:00+02:002023-05-06T17:36:00+02:00Francis Hillmantag:hillman.dev,2023-05-06:/arc01.html<p>A third player entered the game</p><h1>GPU ML on Intel Arc</h1>
<p>Intel is releasing discrete GPUs, and that's a good thing for consumers. I think a lot of the progress in the Deep Learning space is thanks to cheap and plentiful compute being available to researchers without the need for access to a supercomputer; the boom in computer vision research after AlexNet is a good example. Then I found out that Intel wanted to support deep learning on their GPUs as well. I have always disliked the layers of great open-source software we have built on top of the proprietary CUDA language; both AMD's ROCm and Intel's oneAPI are open source. Then I had to pick: Intel was creating cheaper GPUs with 16GB of memory, while AMD was still not officially supporting consumer GPUs (they seem to only support the consumer variants closest to their professional Instinct line). Long story short, I bought an Intel Arc A770 to try out their software stack and record my experiences.</p>
<h2>Installation</h2>
<p>Intel is new at this, so I am expecting some teething issues. Their own <a href="https://dgpu-docs.intel.com/">dgpu-documentation</a>'s installation steps only work for <em>Ubuntu 22.04</em>, which is fine as a first Linux OS to support, but it does mean older packages than rolling-release OSes like Arch Linux or Intel's very own Clear Linux. Moreover, the steps will tell you to install an older Linux kernel before you can install the specific <a href="https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units"><em>gpgpu</em></a> drivers, which means there is no video support until the installation is complete. I had to run all the steps in recovery mode. Compare this to starting up a new install of Clear Linux, which ships a very up-to-date kernel with pretty good normal (non-gpgpu) video support out of the box. Sadly, I couldn't install the drivers there.</p>
<p><a href="https://dgpu-docs.intel.com/installation-guides/ubuntu/ubuntu-jammy-dc.html">These were the driver installation steps I followed</a>.</p>
<p>The next step was installing the <a href="https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html#gs.vean3l">base oneAPI toolkit</a>. I added the sources to APT and ran <code>sudo apt install intel-basekit</code> successfully.</p>
<p>I had some problems with running <code>sudo apt install intel-aikit</code> for the AI Analytics toolkit next. Somehow some of the included packages could not be found on the server. Luckily there is a <em>conda</em> method for installing that toolkit as well, and I still know conda from CUDA installs some time in the past. So, with a Miniconda install in hand, I continued.</p>
<p>Another failed attempt was the <a href="https://www.intel.com/content/www/us/en/docs/oneapi/installation-guide-linux/2023-0/install-intel-ai-analytics-toolkit-via-conda.html">described</a> conda installation with pytorch. I got Intel's Python distribution with the MKL backend and pytorch with xpu support, but it couldn't find the GPU.</p>
<p>Finally what worked for me was:</p>
<div class="highlight"><pre><span></span><code><span class="n">conda</span> <span class="n">create</span> <span class="o">-</span><span class="n">n</span> <span class="n">gputest</span> <span class="o">-</span><span class="n">c</span> <span class="n">intel</span> <span class="n">intelpython3_full</span><span class="o">=</span><span class="mf">3.9</span>
<span class="n">conda</span> <span class="n">activate</span> <span class="n">gputest</span>
<span class="n">python</span> <span class="o">-</span><span class="n">m</span> <span class="n">pip</span> <span class="n">install</span> <span class="n">torch</span><span class="o">==</span><span class="mf">1.13.0</span><span class="n">a0</span> <span class="n">torchvision</span><span class="o">==</span><span class="mf">0.14.1</span><span class="n">a0</span> <span class="n">intel_extension_for_pytorch</span><span class="o">==</span><span class="mf">1.13.10</span><span class="o">+</span><span class="n">xpu</span> <span class="o">-</span><span class="n">f</span> <span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">developer</span><span class="o">.</span><span class="n">intel</span><span class="o">.</span><span class="n">com</span><span class="o">/</span><span class="n">ipex</span><span class="o">-</span><span class="n">whl</span><span class="o">-</span><span class="n">stable</span><span class="o">-</span><span class="n">xpu</span><span class="o">-</span><span class="n">idp</span>
<span class="n">source</span> <span class="o">/</span><span class="n">opt</span><span class="o">/</span><span class="n">intel</span><span class="o">/</span><span class="n">oneapi</span><span class="o">/</span><span class="n">setvars</span><span class="o">.</span><span class="n">sh</span> <span class="c1"># Doesn't work in fish :(</span>
<span class="n">python</span> <span class="o">-</span><span class="n">c</span> <span class="s2">"import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__); print(ipex.__version__); [print(f'[</span><span class="si">{i}</span><span class="s2">]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())];"</span>
</code></pre></div>
<p>The instructions for the GPU version of ipex (intel extension for pytorch) said that the setvars script needs to be run and that only version 3.9 of the Intel distribution of Python is supported.</p>
<h2>Running some samples</h2>
<p>There is a <a href="https://github.com/oneapi-src/oneAPI-samples/tree/2023.1_AIKit/AI-and-Analytics/Getting-Started-Samples/Intel_Extension_For_PyTorch_GettingStarted">sample from Intel</a> which ran fine both on the CPU and, by adding <code>.to("xpu")</code>, on the Arc A770 GPU. I noticed that the iGPU was also detected, but sending the data to <code>xpu:1</code> returns a "Double type is not supported on this platform" error. I'm curious what <strong>is</strong> supported on that platform, since there are still shader units in the iGPU which could be made to multiply matrices.</p>
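<p>For reference, the change needed to run such a sample on the GPU amounts to something like the sketch below, assuming the conda environment from above (the resnet18 model is just an arbitrary stand-in):</p>
<div class="highlight"><pre><code>import torch
import torchvision
import intel_extension_for_pytorch as ipex  # registers the "xpu" device

model = torchvision.models.resnet18().to("xpu").eval()
model = ipex.optimize(model)  # optional ipex kernel optimizations

data = torch.rand(1, 3, 224, 224).to("xpu")
with torch.no_grad():
    out = model(data)
print(out.shape)
</code></pre></div>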
<h2>Final notes</h2>
<p>I hope that we will get to a point where training neural networks does not require running special older Linux kernels or downloading proprietary drivers. Perhaps a <a href="https://github.com/geohot/tinygrad">Tinygrad</a> with <a href="https://github.com/geohot/tinygrad/blob/master/tinygrad/runtime/ops_gpu.py">OpenCL support</a> is currently our best bet for long-term support for deep learning on any hardware. Maybe we can use <a href="https://github.com/llvm/torch-mlir">MLIR</a> between the training code and the hardware. I'm still hoping that Intel shows that, by writing better software and drivers, they can become an important player for hobbyist researchers. And for the record, I'm also hoping AMD will step up and support their <a href="https://threedots.ovh/blog/2022/05/amd-rocm-a-wasted-opportunity/">consumer GPU devices</a>. This is something Nvidia does very well; there are still tons of ways to use old Nvidia GPUs if you'd like to.</p>Controlling your data2023-03-12T12:34:00+01:002023-03-12T12:34:00+01:00Francis Hillmantag:hillman.dev,2023-03-12:/personal_health_data_001.html<p>I would like to be in control of my own data</p><p>The most used software forge, GitHub, is sprawling with tools that data professionals make for their peers. And I love it. Every day I check the news to learn about new open-source goodies which we can all use. It is great to learn about tools like <a href="https://www.pola.rs/">Polars</a> and <a href="https://duckdb.org/">DuckDB</a>, ML model releases like <a href="https://stability.ai/blog/stable-diffusion-public-release">stable-diffusion</a> or <a href="https://ai.facebook.com/blog/large-language-model-llama-meta-ai/">LLaMa</a>, and so much more.</p>
<p>But when it comes to managing personal data, I feel there is a lack of options. I want to focus on personal <em>health</em> data. Whether it is fitness trackers, smart scales, smartphones tracking steps, or keeping a food diary, many of us are generating health data. Yet all of this data is kept inside silos.
I switched between an Apple and an Android phone some years ago, and there is no easy way to bring all the data together. We are being pushed towards putting more of our information in the same silo; only then can, for instance, weight tracking and the number of steps per day be combined.</p>
<p>At least when combining all this data inside a Google Fit or Apple Health silo we get some insights. But then I think back to the time, some years ago, when I was working near healthcare providers, and how much interest there was from insurance agencies in using and modelling this personal data. I know it is in Apple’s and Google’s best interest to be very careful with this sensitive data, but many parties want access. Data breaches happen every day, and large silos of information are more valuable targets. By keeping all the data in a proprietary external service I never feel like I am fully in control.</p>
<p>Another issue I have is that these apps make little use of the data they are provided. Goals or predictions are almost always based on a single factor. They do not give more than just the most generic advice. Any features they do try to push feel like native advertising.</p>
<p>There are quantified-self apps which are geared towards power users. They have more features, but they can fall into the trap of making users feel bad for not using them. A true health app for everyone does not judge and is simply a good and useful aid. Luckily, more and more FOSS alternatives are being created alongside the large proprietary offerings. Some are very bare-bones with minimal interfaces; others are more feature-full, with sought-after features like automatically reading scale data over Bluetooth.</p>
<p>These alternatives make it (relatively) easy to export their data. In broad strokes, all health apps are slowly but surely adding data-export features. The new issue: there is too little software to make use of this exported data. This is where we data coders should step in. And by supporting FOSS apps which do one thing very well, we avoid the pitfall of putting all our data in the Google or Apple silo. Letting users try different apps, without being afraid that all that data is useless if they decide to switch, would be amazing. At work people talk about data lakes and analytical processing, yet where is our personal data pool at home? Where can we crunch some very personal, small-sized data and share it directly ourselves when we want to?</p>Introduction to Probability notes2021-07-12T16:48:00+02:002021-07-12T16:48:00+02:00Francis Hillmantag:hillman.dev,2021-07-12:/probnotes.html<p>Notes on Introduction to probability</p><p>During my studies I got to see a lot of programming and a lot of mathematics.
I remember having a hard time with some probability and statistics, but after getting some help from fellow students I managed to pass the courses.</p>
<p>Over the years after my graduation, I wanted to keep up with the field and continue learning.
The main method I use to keep up is doing <a href="https://www.coursera.org/">Coursera</a>/<a href="https://www.edx.org/">Edx</a> online courses.
I also practiced a lot of exercises on <a href="https://www.datacamp.com/">Datacamp</a> and learned some technologies on <a href="https://www.udemy.com/">Udemy</a>.
Some went deeper than others, but almost all of them helped me. In these cases I was always learning with a clear goal.
I also have many books on computer science and data science
topics which I use from time to time.
But, up to now I never took the time to go through them
cover to cover and do the exercises. </p>
<p>Now some of those early maths and CS courses lie 10 years (or more) in the past, and I wonder how much stuck with me. So, I'm planning to go over some of the books I have. I want to go over them quietly, at my own pace and without the pressure of university.</p>
<p>The first book I chose is <a href="https://projects.iq.harvard.edu/stat110/home">Introduction to Probability</a>, of which I've heard many
positive things. What I also like is that there are resources available online
which can help me. Since the book uses <em>R</em>, this is a great opportunity to
do a little R programming and add the notes and explanation in R code.</p>
<p>If I have any notes worth sharing I'll put them in the <a href="https://hillman.dev/pages/study-notes.html">notes section</a>
of this website. There is no schedule for when or if I will make them,
since I want to take as long as it takes, and life often gets in the way.</p>Methods of using Avro2021-03-23T22:22:00+01:002021-03-23T22:22:00+01:00Francis Hillmantag:hillman.dev,2021-03-23:/avromethods.html<p>What could that Avro byte array mean and how can it be handled</p><p>Last week I had a run-in with Apache Avro, a data serialization method which I have used a couple
of times in the past, mostly in combination with Kafka, but I remember also being pleasantly surprised when using it on its own.</p>
<p>The thing I realized this time is that there are a lot of different ways of using Avro, and I wanted to write them down for myself. So here we are; do not expect this to be a complete overview or guide, it is just some notes. I will focus mostly on the Java side here, but Avro supports more programming languages.</p>
<h2>Methods of defining your data</h2>
<p>The primary method for defining your data is to create ".avsc" files. These are JSON files which can be used to generate classes, encoders, and decoders. Always make sure that the definition files are shared between projects, either as a dependency or by using a git submodule. You can do some other manual bookkeeping and copy the files around, but this can become difficult to track over time. To generate those classes in Java, an external tool can be used. One tool which can be easily integrated is a <a href="http://avro.apache.org/docs/current/gettingstartedjava.html">maven plugin</a> which runs automatically during your compile cycle. This way your implementation is always checked against the data definition.</p>
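<p>As an illustration, a minimal hypothetical ".avsc" file for the <code>BottleMessage</code> record used later in this post could look like this (the namespace and field names are made up):</p>
<div class="highlight"><pre><code>{
  "namespace": "dev.hillman.avro",
  "type": "record",
  "name": "BottleMessage",
  "fields": [
    {"name": "sender", "type": "string"},
    {"name": "payload", "type": "long"}
  ]
}
</code></pre></div>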
<p>Note however, if you are using IntelliJ<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup>, that this generation step is not integrated with the IDE, which means that doing an IntelliJ build will <em>fail</em>. If you do a <code>mvn compile</code>, you will notice that the ".java" files are created in the namespace you denoted in the definition. The classes you get for free contain a Builder for builder/flow-style object creation, plus a whole lot more.</p>
<p>If the interacting projects are all written on the JVM, then the reflection API is also an option for defining your data. This means defining your classes as normal Java POJOs and creating encoders and decoders using Avro's ReflectDatumReader/-Writer. The disadvantage is clear: you have to rely on Java. In this case it is best to have some common library on which all projects can depend. The upside is that there is no need to write raw JSON specifications or understand ".avsc" files.</p>
<p>There are also other languages in the Avro ecosystem, like IDL and <em>.avpr</em> files, which allow you to describe whole remote-procedure-call (RPC) schemes. A complete example can be found <a href="https://github.com/alexholmes/avro-maven">here</a>.</p>
<h2>Methods of turning objects into bytes</h2>
<p>First off, I am not going to go into JSON encoding and decoding, although that is also a possibility with Avro. This blog post concerns binary serialization and deserialization.</p>
<p>Once we have some plugin-generated class files, there is a convenient built-in <code>avroObject.toByteBuffer()</code> method. It does not say which specific Avro encoding is used for these bytes, but I think it adheres to the <a href="https://avro.apache.org/docs/current/spec.html#single_object_encoding">single-object encoding</a>, since there is a header of 10 bytes in total in front of the object itself. For decoding these objects a simple <code>AvroObject.fromByteBuffer</code> will do.</p>
<p>There is also a method which delivers just the bytes of the object and nothing more.
In this example <code>BottleMessage</code> is defined in an <em>.avsc</em> file, and I am passing
around <code>byte[]</code> but I could be using something else for the encoder input
depending on the application.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.avro.io.*</span><span class="p">;</span>
<span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.avro.specific.SpecificDatumReader</span><span class="p">;</span>
<span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.avro.specific.SpecificDatumWriter</span><span class="p">;</span>
<span class="kn">import</span><span class="w"> </span><span class="nn">java.io.ByteArrayOutputStream</span><span class="p">;</span>
<span class="kn">import</span><span class="w"> </span><span class="nn">java.io.IOException</span><span class="p">;</span>
<span class="kd">public</span><span class="w"> </span><span class="kd">class</span> <span class="nc">DatumTransformer</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">DatumWriter</span><span class="o"><</span><span class="n">BottleMessage</span><span class="o">></span><span class="w"> </span><span class="n">bottleMessageDatumWriter</span><span class="w"> </span><span class="o">=</span>
<span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">SpecificDatumWriter</span><span class="o"><></span><span class="p">(</span><span class="n">BottleMessage</span><span class="p">.</span><span class="na">class</span><span class="p">);</span>
<span class="w"> </span><span class="n">DatumReader</span><span class="o"><</span><span class="n">BottleMessage</span><span class="o">></span><span class="w"> </span><span class="n">bottleMessageDatumReader</span><span class="w"> </span><span class="o">=</span>
<span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">SpecificDatumReader</span><span class="o"><></span><span class="p">(</span><span class="n">BottleMessage</span><span class="p">.</span><span class="na">class</span><span class="p">);</span>
<span class="w"> </span><span class="n">EncoderFactory</span><span class="w"> </span><span class="n">encoderFactory</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">EncoderFactory</span><span class="p">.</span><span class="na">get</span><span class="p">();</span>
<span class="w"> </span><span class="n">DecoderFactory</span><span class="w"> </span><span class="n">decoderFactory</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">DecoderFactory</span><span class="p">.</span><span class="na">get</span><span class="p">();</span>
<span class="w"> </span><span class="n">BinaryEncoder</span><span class="w"> </span><span class="n">reuseEncoder</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">null</span><span class="p">;</span>
<span class="w"> </span><span class="n">BinaryDecoder</span><span class="w"> </span><span class="n">reuseDecoder</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">null</span><span class="p">;</span>
<span class="w"> </span><span class="kd">public</span><span class="w"> </span><span class="kt">byte</span><span class="o">[]</span><span class="w"> </span><span class="nf">encode</span><span class="p">(</span><span class="n">BottleMessage</span><span class="w"> </span><span class="n">bottleMessage</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">ByteArrayOutputStream</span><span class="w"> </span><span class="n">byteStream</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">ByteArrayOutputStream</span><span class="p">();</span>
<span class="w"> </span><span class="n">reuseEncoder</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">encoderFactory</span><span class="p">.</span><span class="na">binaryEncoder</span><span class="p">(</span><span class="n">byteStream</span><span class="p">,</span><span class="w"> </span><span class="n">reuseEncoder</span><span class="p">);</span>
<span class="w"> </span><span class="k">try</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">bottleMessageDatumWriter</span><span class="p">.</span><span class="na">write</span><span class="p">(</span><span class="n">bottleMessage</span><span class="p">,</span><span class="w"> </span><span class="n">reuseEncoder</span><span class="p">);</span>
<span class="w"> </span><span class="n">reuseEncoder</span><span class="p">.</span><span class="na">flush</span><span class="p">();</span>
<span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="k">catch</span><span class="w"> </span><span class="p">(</span><span class="n">IOException</span><span class="w"> </span><span class="n">e</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">e</span><span class="p">.</span><span class="na">printStackTrace</span><span class="p">();</span>
<span class="w"> </span><span class="k">throw</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">RuntimeException</span><span class="p">();</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">byteStream</span><span class="p">.</span><span class="na">toByteArray</span><span class="p">();</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="kd">public</span><span class="w"> </span><span class="n">BottleMessage</span><span class="w"> </span><span class="nf">decode</span><span class="p">(</span><span class="kt">byte</span><span class="o">[]</span><span class="w"> </span><span class="n">bytes</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">reuseDecoder</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">decoderFactory</span><span class="p">.</span><span class="na">binaryDecoder</span><span class="p">(</span><span class="n">bytes</span><span class="p">,</span><span class="w"> </span><span class="n">reuseDecoder</span><span class="p">);</span>
<span class="w"> </span><span class="k">try</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="c1">// Reuse variable not used in this example</span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">bottleMessageDatumReader</span><span class="p">.</span><span class="na">read</span><span class="p">(</span><span class="kc">null</span><span class="p">,</span><span class="w"> </span><span class="n">reuseDecoder</span><span class="p">);</span>
<span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="k">catch</span><span class="w"> </span><span class="p">(</span><span class="n">IOException</span><span class="w"> </span><span class="n">e</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">e</span><span class="p">.</span><span class="na">printStackTrace</span><span class="p">();</span>
<span class="w"> </span><span class="k">throw</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">RuntimeException</span><span class="p">();</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="p">}</span>
<span class="p">}</span>
</code></pre></div>
<p>Note that you can pass in reuse objects. In that case the reuse object's properties are set instead of a new object being created. This could be useful if you immediately change the common-interface object into project-specific objects anyway and you do not want to create (more) garbage-collection pressure. However, I did not do any performance tests to see if this really helps here. In general, the performance of decoding single objects is very similar for both methods. Decoding multiple objects from a stream of data is easier using the <code>EncoderFactory</code> approach, and it is also a bit faster than transforming single objects with the <code>toByteBuffer</code> method.</p>
<p>Then there is a special method for writing to files. It can be used to save many rows of data, and the result can be read without knowing the schema. For large sets of data, Parquet files or other columnar storage would probably be better; for heavily nested, record-based data, however, it works quite well. There is also the possibility to add optional compression codecs. Code-wise, encoding to a file and decoding from a file look very similar to the previous binary-encoding method, but with a <code>DataFileWriter</code> instead of an <code>EncoderFactory</code> + <code>BinaryEncoder</code>. For reading without a schema, a <code>GenericDatumReader<GenericRecord></code> is used.</p>
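<p>On the Python side, this container-file format is what the <a href="https://github.com/fastavro/fastavro">fastavro</a> library's reader and writer handle. A hedged sketch, reusing the hypothetical schema from earlier:</p>
<div class="highlight"><pre><code>from fastavro import parse_schema, reader, writer

schema = parse_schema({
    "namespace": "dev.hillman.avro",
    "type": "record",
    "name": "BottleMessage",
    "fields": [
        {"name": "sender", "type": "string"},
        {"name": "payload", "type": "long"},
    ],
})

records = [{"sender": "sea", "payload": i} for i in range(100)]
with open("messages.avro", "wb") as out:
    writer(out, schema, records, codec="deflate")  # schema travels with the file

with open("messages.avro", "rb") as fo:
    for record in reader(fo):  # no schema needed when reading
        print(record)
</code></pre></div>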
<h2>Other software</h2>
<p>The whole reason I ran into trouble to begin with was that the <em>AvroCoder of Apache Beam</em> was producing messages which <code>AvroObject.fromByteBuffer</code> could not decode. Furthermore, the Avro integration into Kafka generally only works with <em>single-object encoding</em>. So when using higher-level libraries which produce or read Avro data, it is important to inspect what kind of encoding and decoding they are doing.</p>
<p>Reading the Avro data afterwards in Python turned out to be difficult. I did try to do it again for this blog, but I had to resort to changing and cutting some bytes to make it work. This could also be because I used Base64 to copy over the raw byte array. Many of the specialized Python libraries which make reading Avro faster focus on reading Avro files (with the schemas attached), like <a href="https://github.com/fastavro/fastavro">fastavro</a>. To read single-object encoded byte data, the <em><a href="https://github.com/confluentinc/confluent-kafka-python">confluent_kafka</a></em> library is probably needed.</p>
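<p>For raw (headerless) binary payloads, fastavro's schemaless reader and writer should correspond to what the Java <code>BinaryEncoder</code>/<code>BinaryDecoder</code> produce and consume. A hedged sketch follows; note that a single-object-encoded payload would first need its 10-byte prefix (the 2-byte marker plus the 8-byte schema fingerprint) stripped before it can be read this way:</p>
<div class="highlight"><pre><code>import io

from fastavro import parse_schema, schemaless_reader, schemaless_writer

schema = parse_schema({
    "namespace": "dev.hillman.avro",
    "type": "record",
    "name": "BottleMessage",
    "fields": [
        {"name": "sender", "type": "string"},
        {"name": "payload", "type": "long"},
    ],
})

# Encode a single record without any header
buf = io.BytesIO()
schemaless_writer(buf, schema, {"sender": "sea", "payload": 42})
raw = buf.getvalue()

# Decode it again; the writer schema must be supplied out of band
record = schemaless_reader(io.BytesIO(raw), schema)
print(record)
</code></pre></div>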
<h2>Other links</h2>
<p>The code for this blog post can be found <a href="https://bitbucket.org/francishillman/avrolessons/src/main/">here</a>.</p>
<p>Finally, I want to point to some other links which might be useful:</p>
<ul>
<li><a href="https://www.baeldung.com/java-apache-avro">Baeldung tutorial</a></li>
<li><a href="https://gist.github.com/davideicardi/e8c5a69b98e2a0f18867b637069d03a9">Gist with examples</a> of using Generic-Encoder/-Decoder and a reference to Scala's <a href="https://github.com/sksamuel/avro4s">Avro4s</a> library</li>
</ul>
<div class="footnote">
<hr>
<ol>
<li id="fn:1">
<p>I think this holds for LSP implementations too, but I haven't checked. <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">&#8617;</a></p>
</li>
</ol>
</div>On Spark, MySQL, and Timezones2021-02-27T23:51:00+01:002021-02-27T23:51:00+01:00Francis Hillmantag:hillman.dev,2021-02-27:/onsparkmysqltimezones.html<p>TIL very little about Spark timezone handling</p><p>Proper time handling in data can be hard. On the surface it seems like an easy problem and
in many cases there are straightforward solutions which work most of the time. But really
"most of the time" is not enough.</p>
<p>I saw a post on <a href="https://news.ycombinator.com/item?id=26282742">timezone handling in Python on Hacker News</a> and was reminded of the different
Python libraries there are for handling timestamps with timezones. In the Java world there
are plenty of projects which still rely on the old <a href="https://www.joda.org/joda-time/">Joda Time</a>, although the newer <em>java.time</em> packages in Java 8 make that dependency unnecessary in most cases. That doesn't mean we don't have to watch out for common date, time, and timezone issues in the JVM world.</p>
<p>One such example is when using Spark SQL. Spark is older than the <em>java.time</em> API, and it also needs to integrate completely with JDBC. It is therefore important to double-check all time-handling code in Spark. There are many StackOverflow posts about time-data handling in Spark; some are very useful, others lead to more problems. In the past I have had to deal with such problems and always found a solution which worked well enough in a lot of tests for that particular project.</p>
<p>This week I was once again faced with such a problem, and I wanted to note down (today-I-learned style) some of the unexpected results I found. I did this in the hope of finding a more structured way of handling dates in Spark. I did not find such a solution, but I ended up with a reference to fall back on when working with this particular combination of technologies in the future.</p>
<h2>Technologies used</h2>
<p>First I want to go over the software used and some important links. There is Spark; I am using a relatively recent version, 3.0. Luckily, Databricks (creators of Spark) have published a blog post about <a href="https://databricks.com/blog/2020/07/22/a-comprehensive-look-at-dates-and-timestamps-in-apache-spark-3-0.html">using timestamps in Spark</a>. It already notes that a full <em>timestamp with timezone</em> type is not supported in Spark, and that <em>timestamp without timezone</em> can be handled by using a timestamp with a UTC session timezone in Spark. There is also information available for <a href="https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html">timestamps in the Databricks workspace</a>, where they go a bit more into detail about how they rely on the JVM's handling of time and restate SQL's timestamp definitions.</p>
<p>The database from which I am extracting data is a MySQL database (Docker mysql:8.0.21 to be exact). MySQL timestamps with a timezone seem to rely on the session timezone and are always stored in UTC. A tutorial can be found <a href="https://www.mysqltutorial.org/mysql-timestamp.aspx">here</a>, and documentation <a href="https://dev.mysql.com/doc/refman/8.0/en/datetime.html">here</a>.</p>
<p>Finally the JDBC driver used by Spark in this case is the <code>mysql-connector-java:8.0.23</code>.</p>
<h2>The setting</h2>
<p>I load some generated data into MySQL using a Python script (see appendix). I either leave the timezone setting at its default, set the connection specifically to <code>UTC</code>, or set it specifically to <code>UTC+01:00</code>.</p>
<p>For retrieving the data in Spark I use a</p>
<div class="highlight"><pre><span></span><code><span class="n">sparkSession</span><span class="p">.</span><span class="n">read</span>
<span class="w"> </span><span class="p">.</span><span class="n">format</span><span class="p">(</span><span class="s">"jdbc"</span><span class="p">)</span>
<span class="w"> </span><span class="p">....</span><span class="w"> </span><span class="c1">// Settings</span>
<span class="w"> </span><span class="p">.</span><span class="n">load</span><span class="p">()</span>
</code></pre></div>
<p>For filtering the selection I use one of three methods. The main method I used was this one, which I think is quite common:</p>
<div class="highlight"><pre><span></span><code><span class="kd">val</span><span class="w"> </span><span class="n">startDateTime</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"2020-02-02 00:00:00"</span>
<span class="kd">val</span><span class="w"> </span><span class="n">endDateTime</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"2020-02-03 00:00:00"</span>
<span class="n">dataFrame</span><span class="p">.</span><span class="n">where</span><span class="p">(</span>
<span class="w"> </span><span class="p">(</span><span class="n">$</span><span class="s">"event_time"</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">lit</span><span class="p">(</span><span class="n">startDateTime</span><span class="p">))</span><span class="w"> </span><span class="n">and</span><span class="w"> </span><span class="p">(</span><span class="n">$</span><span class="s">"event_time"</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="n">lit</span><span class="p">(</span>
<span class="w"> </span><span class="n">endDateTime</span><span class="p">)))</span>
</code></pre></div>
<p>One potential alternative is using <code>java.sql.Timestamp</code> values.</p>
<div class="highlight"><pre><span></span><code><span class="n">dataFrame</span><span class="p">.</span><span class="n">where</span><span class="p">(</span>
<span class="w"> </span><span class="p">(</span><span class="n">$</span><span class="s">"event_time"</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="nc">Timestamp</span>
<span class="w"> </span><span class="p">.</span><span class="n">valueOf</span><span class="p">(</span><span class="n">startDateTime</span><span class="p">))</span><span class="w"> </span><span class="n">and</span><span class="w"> </span><span class="p">(</span><span class="n">$</span><span class="s">"event_time"</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="nc">Timestamp</span><span class="p">.</span><span class="n">valueOf</span><span class="p">(</span>
<span class="w"> </span><span class="n">endDateTime</span><span class="p">)))</span>
</code></pre></div>
<p>Finally we can also change the load query.</p>
<div class="highlight"><pre><span></span><code><span class="c1">// Inside the jdbc load settings</span>
<span class="n">sparkSession</span><span class="p">.</span><span class="n">read</span>
<span class="w"> </span><span class="p">.</span><span class="n">format</span><span class="p">(</span><span class="s">"jdbc"</span><span class="p">)</span>
<span class="w"> </span><span class="p">....</span><span class="w"> </span><span class="c1">// other settings</span>
<span class="w"> </span><span class="p">.</span><span class="n">option</span><span class="p">(</span><span class="s">"query"</span><span class="p">,</span><span class="w"> </span><span class="s">"SELECT * FROM ts_table2 WHERE event_time >= '2020-02-02 00:00:00+01:00' AND event_time < '2020-02-03 00:00:00+01:00'"</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">load</span><span class="p">()</span>
</code></pre></div>
<p>In this example we only care about the <code>event_time</code> (a timestamp column) and <code>event_value</code>, an increasing unique integer. I write the data out to CSV files, but I have tested some of the results by writing to Parquet files and reading them with <a href="https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html#pandas.read_parquet">Pandas</a> + <a href="https://arrow.apache.org/docs/python/parquet.html">Pyarrow's parquet reader</a>.</p>
<h2>(Un-)surprising outcomes</h2>
<p>First off, it is better to explicitly set the session timezone if you rely on timezones in MySQL. But in case you did not, and you wanted to use MySQL's system timezone, it could be surprising that Spark will add a timezone based on the settings used by Spark.</p>
<p>If I leave everything at its default, I get as the first row <code>2020-02-02T00:00:00.000+01:00,1440</code>. Yet if I set <code>spark.sql.session.timeZone=UTC</code> and <code>-Duser.timezone=UTC</code>, I get a return value of <code>2020-02-02T00:00:00.000Z,1440</code>. These are fundamentally two different points in time. This means my Spark timezone setting influences the very data I will have in my output. Note that these are also the results if I explicitly set the session timezone in the Python script to <code>UTC</code>.</p>
<p>OK, so I set the timezone in my Python script to <code>+01:00</code> (the current offset in Germany). Now when I query the database, it matters what I set my session timezone to. If I apply the filter in SQL in the same session, I get back as the first row <code>2020-02-02 00:00:00,1440</code>. If I set my session timezone to UTC, I get the same row back by going back one hour: <code>2020-02-01 23:00:00,1440</code>.</p>
<p>Now if I use Spark to load this data without changing the settings, I get back the row <code>2020-02-02T00:00:00.000+01:00,1500</code>. The <code>1500</code> shows that the actual row is the one at <code>2020-02-02T00:00:00.00Z</code>, but that the <code>+01:00</code> timezone was added after loading in the data filtered at UTC time. I also get this same result if I set the user/session timezone to <code>Europe/Berlin</code> explicitly.</p>
<p>Loading with <code>spark.sql.session.timeZone=UTC</code> and <code>-Duser.timezone=UTC</code> showed the data correctly as <code>2020-02-02T00:00:00.000Z,1500</code> (but filtered in <code>UTC</code>, of course). In this case it does not matter which type of filter I apply, and I can even filter using the load query with a filter of <code>event_time >= '2020-02-02 00:00:00Z'</code>.</p>
<p>Curiously, if I set <code>spark.sql.session.timeZone=Europe/Berlin</code> and <code>-Duser.timezone=UTC</code>, I get the correct value in my current timezone: <code>2020-02-02T00:00:00.000+01:00,1440</code>. But I am not sure if this is behavior I can rely on across Spark database and file sources.</p>
<p>Another weird result appeared when I left my Spark and JVM settings at their defaults but tried to filter in the query using a timezone. The query filter was <code>event_time >= '2020-02-02 00:00:00+01:00'</code>. The first row in the output looked as follows: <code>2020-02-01T23:00:00.000+01:00,1440</code>. The <code>1440</code> shows that the right row was retrieved, but somehow the <code>event_time</code> is not correct anymore.</p>
<h2>What about Datetime</h2>
<p>MySQL also has support for Datetime columns. These explicitly do not have a timezone, and setting a session timezone does not influence them at all.</p>
<p>In my tests the filter was always correctly applied to these datetime objects, but the actual values were represented as timestamps in Spark. This means that if I load them with default settings a value looks like <code>2020-02-02T00:00:00.000+01:00</code>, and if I load them with the user/session timezone set to UTC, it looks like <code>2020-02-02T00:00:00.000Z</code>, which is kind of a shame.</p>
<h2>Parquet sources</h2>
<p>I also quickly tested writing some Parquet files with Pandas and Pyarrow, and then filtering those with Spark. Here everything worked as expected. When I set Spark & JVM to <code>UTC</code>, the filter was correctly applied in <code>UTC</code> time, and when set to <code>Europe/Berlin</code> it was correctly applied and represented in the data in <code>+01:00</code>.</p>
<h2>Conclusion</h2>
<p>In a roundabout way this led me to a conclusion I have read a <a href="http://wrschneider.github.io/2019/09/01/timezones-parquet-redshift.html">couple of times before</a>: it is best to let Spark work with time data in UTC. Using UTC for all dates might help in making dates comparable, but sadly <a href="https://zachholman.com/talk/utc-is-enough-for-everyone-right">it is no panacea</a>. If on the operational side it makes more sense to work with a custom (non-UTC, <a href="https://cr.yp.to/proto/utctai.html">non-unixtime</a>) way of storing timezone data, then it needs to be solved in a bespoke way during processing in Spark.</p>
<h2>Appendix</h2>
<h3>MySQL Data Loader</h3>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">random</span>
<span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span><span class="p">,</span> <span class="n">timedelta</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">mysql.connector</span>
<span class="n">start_date</span> <span class="o">=</span> <span class="s2">"2020-02-01 00:00"</span>
<span class="n">end_date</span> <span class="o">=</span> <span class="s2">"2020-02-04 00:00"</span>
<span class="n">table_name</span> <span class="o">=</span> <span class="s2">"dt_table1"</span>
<span class="n">set_connection_timezone</span> <span class="o">=</span> <span class="kc">False</span>
<span class="n">set_to_utc</span> <span class="o">=</span> <span class="kc">False</span>
<span class="nb">print</span><span class="p">(</span><span class="n">start_date</span><span class="p">,</span> <span class="n">end_date</span><span class="p">,</span> <span class="n">table_name</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="n">set_connection_timezone</span><span class="p">),</span> <span class="nb">str</span><span class="p">(</span><span class="n">set_to_utc</span><span class="p">))</span>
<span class="n">conn</span> <span class="o">=</span> <span class="n">mysql</span><span class="o">.</span><span class="n">connector</span><span class="o">.</span><span class="n">connect</span><span class="p">(</span>
<span class="n">host</span><span class="o">=</span><span class="s2">"localhost"</span><span class="p">,</span>
<span class="n">database</span><span class="o">=</span><span class="s2">"tztest"</span><span class="p">,</span>
<span class="n">port</span><span class="o">=</span><span class="mi">3306</span><span class="p">,</span>
<span class="n">user</span><span class="o">=</span><span class="s2">"<username>"</span><span class="p">,</span>
<span class="n">password</span><span class="o">=</span><span class="s2">"<password>"</span><span class="p">,</span>
<span class="p">)</span>
<span class="k">if</span> <span class="n">set_connection_timezone</span><span class="p">:</span>
<span class="k">if</span> <span class="n">set_to_utc</span><span class="p">:</span>
<span class="n">conn</span><span class="o">.</span><span class="n">time_zone</span> <span class="o">=</span> <span class="s2">"+00:00"</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">conn</span><span class="o">.</span><span class="n">time_zone</span> <span class="o">=</span> <span class="s2">"+01:00"</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Timezone: "</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">conn</span><span class="o">.</span><span class="n">time_zone</span><span class="p">))</span>
<span class="n">minute_range</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">start_date</span><span class="p">,</span> <span class="n">end_date</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="s2">"datetime64[m]"</span><span class="p">)</span>
<span class="n">vals</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">minute_range</span><span class="p">))</span>
<span class="n">query</span> <span class="o">=</span> <span class="s2">"INSERT INTO </span><span class="si">{}</span><span class="s2"> (event_time,event_count) VALUES(</span><span class="si">%s</span><span class="s2">,</span><span class="si">%s</span><span class="s2">)"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">table_name</span><span class="p">)</span>
<span class="n">cursor</span> <span class="o">=</span> <span class="n">conn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
<span class="k">if</span> <span class="n">set_connection_timezone</span><span class="p">:</span>
<span class="k">if</span> <span class="n">set_to_utc</span><span class="p">:</span>
<span class="n">init_command</span><span class="o">=</span><span class="s2">"SET SESSION time_zone='+00:00'"</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">init_command</span><span class="o">=</span><span class="s2">"SET SESSION time_zone='+01:00'"</span>
<span class="n">cursor</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="n">init_command</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Session timezone set"</span><span class="p">)</span>
<span class="n">insert_list</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">i</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">def</span> <span class="nf">execute_inserts</span><span class="p">():</span>
<span class="n">cursor</span><span class="o">.</span><span class="n">executemany</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">insert_list</span><span class="p">)</span>
<span class="n">conn</span><span class="o">.</span><span class="n">commit</span><span class="p">()</span>
<span class="k">for</span> <span class="p">(</span><span class="n">ts</span><span class="p">,</span> <span class="n">val</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">minute_range</span><span class="p">,</span> <span class="n">vals</span><span class="p">):</span>
<span class="n">dt</span> <span class="o">=</span> <span class="n">ts</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="n">datetime</span><span class="p">)</span><span class="o">.</span><span class="n">strftime</span><span class="p">(</span><span class="s2">"%Y-%m-</span><span class="si">%d</span><span class="s2"> %H:%M:%S"</span><span class="p">)</span>
<span class="n">insert_list</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">dt</span><span class="p">,</span> <span class="nb">int</span><span class="p">(</span><span class="n">val</span><span class="p">)))</span>
<span class="n">i</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">if</span> <span class="n">i</span> <span class="o">%</span> <span class="mi">30</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="n">execute_inserts</span><span class="p">()</span>
<span class="n">insert_list</span> <span class="o">=</span> <span class="p">[]</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"."</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span> <span class="n">flush</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">execute_inserts</span><span class="p">()</span>
<span class="n">cursor</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
<span class="n">conn</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
</code></pre></div>Feature Caching Redis2020-02-02T18:22:00+01:002020-02-02T18:22:00+01:00Francis Hillmantag:hillman.dev,2020-02-02:/sparkredisconnect.html<p>Some approaches for moving data from Spark to Redis</p><p>Hello there blog, it has been too long. I've been in America (for Disrupt and to visit our San Diego office) and worked on a bunch of projects in the meantime, but I want to share some useful info on putting preprocessed machine-learning features from Spark into Redis. I am still experimenting with different solutions, but here are some options I'm considering.</p>
<p>Say there is an ML pipeline that needs to go to production, and a bunch of the feature-processing data can be prepared beforehand. It would be best to put these features in some form of cache, close to where the inference will take place. An in-memory hash map is one solution, but it requires repopulating on each run. Another common solution is saving the features in a Redis cache.</p>
<h3>#1</h3>
<p>Luckily Redis has helped a bit here with the <a href="https://github.com/RedisLabs/spark-redis">spark-redis</a> connector, which enables us to upload a DataFrame directly into Redis. Given a Redis instance running on <em>localhost</em> at port 6379:</p>
<div class="highlight"><pre><span></span><code><span class="kd">val</span><span class="w"> </span><span class="n">spark</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nc">SparkSession</span><span class="p">.</span><span class="n">builder</span>
<span class="w"> </span><span class="p">.</span><span class="n">appName</span><span class="p">(</span><span class="s">"Uploader number 1"</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">master</span><span class="p">(</span><span class="s">"local"</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">config</span><span class="p">(</span><span class="s">"spark.redis.host"</span><span class="p">,</span><span class="w"> </span><span class="s">"localhost"</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">config</span><span class="p">(</span><span class="s">"spark.redis.port"</span><span class="p">,</span><span class="w"> </span><span class="s">"6379"</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">getOrCreate</span><span class="p">()</span>
<span class="k">import</span><span class="w"> </span><span class="nn">spark</span><span class="p">.</span><span class="nn">implicits</span><span class="p">.</span><span class="n">_</span>
<span class="n">dfToUpload</span><span class="p">.</span><span class="n">write</span>
<span class="w"> </span><span class="p">.</span><span class="n">format</span><span class="p">(</span><span class="s">"org.apache.spark.sql.redis"</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">option</span><span class="p">(</span><span class="s">"table"</span><span class="p">,</span><span class="w"> </span><span class="s">"tablename"</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">option</span><span class="p">(</span><span class="s">"key.column"</span><span class="p">,</span><span class="w"> </span><span class="s">"nameKeyColumn"</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">save</span><span class="p">()</span>
</code></pre></div>
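<p>As an aside (and a handy way to check what actually landed in Redis): spark-redis can also read such a table back into a DataFrame. A minimal sketch, assuming the same session and table name as above:</p>
<div class="highlight"><pre><span></span><code>// Read the table written above back from Redis
val dfBack = spark.read
  .format("org.apache.spark.sql.redis")
  .option("table", "tablename")
  .option("key.column", "nameKeyColumn") // Restore the key column
  .load()
dfBack.show()
</code></pre></div>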
<p>Writing through the DataFrame API is by far the easiest method, and because each row is stored as a Redis hash, the receiving side can fetch individual columns (hash fields) from a row. The <a href="https://redis.io/commands/hgetall">hgetall</a> command can be used to get all the features in one go.</p>
<p>(Using <a href="https://github.com/debasishg/scala-redis">Scala-redis</a>)</p>
<div class="highlight"><pre><span></span><code><span class="kd">val</span><span class="w"> </span><span class="n">keyVal</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"tablename:aKeyName"</span>
<span class="kd">val</span><span class="w"> </span><span class="n">r</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="nc">RedisClient</span><span class="p">(</span><span class="s">"localhost"</span><span class="p">,</span><span class="w"> </span><span class="mi">6379</span><span class="p">)</span>
<span class="kd">val</span><span class="w"> </span><span class="n">sredResp</span><span class="p">:</span><span class="w"> </span><span class="nc">Option</span><span class="p">[</span><span class="nc">Map</span><span class="p">[</span><span class="nc">String</span><span class="p">,</span><span class="w"> </span><span class="nc">String</span><span class="p">]]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r</span><span class="p">.</span><span class="n">hgetall</span><span class="p">[</span><span class="nc">String</span><span class="p">,</span><span class="w"> </span><span class="nc">String</span><span class="p">](</span><span class="n">keyVal</span><span class="p">)</span>
<span class="n">r</span><span class="p">.</span><span class="n">close</span><span class="p">()</span>
</code></pre></div>
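<p>Because every column is its own hash field, a single feature can also be fetched without pulling the whole row. A small sketch with scala-redis again (<em>featureColumnA</em> is a hypothetical column name):</p>
<div class="highlight"><pre><span></span><code>val r = new RedisClient("localhost", 6379)
// hget returns None when the key or the field does not exist
val oneFeature: Option[String] = r.hget("tablename:aKeyName", "featureColumnA")
r.close()
</code></pre></div>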
<p>This works well, but it means a lot of converting between strings and whatever representation the feature vector has. In many cases we want to get all the features of a row/sample at once. Something similar can be done using <em>Spark-redis</em>, through its <a href="https://github.com/RedisLabs/spark-redis/blob/master/doc/rdd.md">RDD support</a>. The idea is to serialize the whole sample and store it under a single key, so it can be retrieved in one call. The downside of this approach is that spark-redis works with strings, so the serialized bytes have to be encoded as a string.</p>
<h3>#2</h3>
<p>For this solution I use <a href="https://github.com/EsotericSoftware/kryo">Kryo</a> (the same serializer Spark uses internally) in combination with <a href="https://github.com/twitter/chill">Twitter's chill</a> (I chose this because I want to stay on Scala / JVM on the inference side). One caveat: a Kryo instance itself is not serializable, so outside of local mode you would create it inside the closure (or use chill's KryoPool) rather than on the driver as done below.</p>
<div class="highlight"><pre><span></span><code><span class="kd">val</span><span class="w"> </span><span class="n">spark</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nc">SparkSession</span><span class="p">.</span><span class="n">builder</span>
<span class="w"> </span><span class="p">.</span><span class="n">appName</span><span class="p">(</span><span class="s">"Uploader number 2"</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">master</span><span class="p">(</span><span class="s">"local"</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">config</span><span class="p">(</span><span class="s">"spark.redis.host"</span><span class="p">,</span><span class="w"> </span><span class="s">"localhost"</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">config</span><span class="p">(</span><span class="s">"spark.redis.port"</span><span class="p">,</span><span class="w"> </span><span class="s">"6379"</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">getOrCreate</span><span class="p">()</span>
<span class="k">import</span><span class="w"> </span><span class="nn">spark</span><span class="p">.</span><span class="nn">implicits</span><span class="p">.</span><span class="n">_</span>
<span class="kd">val</span><span class="w"> </span><span class="n">dfToConvert</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dfIn</span><span class="p">.</span><span class="n">as</span><span class="p">[</span><span class="nc">FeatureCaseClass</span><span class="p">]</span>
<span class="kd">val</span><span class="w"> </span><span class="n">instantiator</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="nc">ScalaKryoInstantiator</span>
<span class="n">instantiator</span><span class="p">.</span><span class="n">setRegistrationRequired</span><span class="p">(</span><span class="kc">false</span><span class="p">)</span>
<span class="kd">val</span><span class="w"> </span><span class="n">kryo</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">instantiator</span><span class="p">.</span><span class="n">newKryo</span><span class="p">()</span>
<span class="kd">val</span><span class="w"> </span><span class="n">encoder</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nc">Base64</span><span class="p">.</span><span class="n">getEncoder</span>
<span class="kd">val</span><span class="w"> </span><span class="n">keyedRDD</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dfToConvert</span><span class="p">.</span><span class="n">rdd</span><span class="p">.</span><span class="n">keyBy</span><span class="p">(</span><span class="n">_</span><span class="p">.</span><span class="n">nameKeyColumn</span><span class="p">).</span><span class="n">map</span><span class="p">(</span><span class="n">tup</span><span class="w"> </span><span class="o">=></span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="kd">val</span><span class="w"> </span><span class="n">output</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="nc">Output</span><span class="p">(</span><span class="mi">512</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="mi">1</span><span class="p">)</span><span class="w"> </span><span class="c1">// Guess the side but allows it to grow</span>
<span class="w"> </span><span class="n">kryo</span><span class="p">.</span><span class="n">writeObject</span><span class="p">(</span><span class="n">output</span><span class="p">,</span><span class="w"> </span><span class="n">tup</span><span class="p">.</span><span class="n">_2</span><span class="p">)</span>
<span class="w"> </span><span class="n">tup</span><span class="p">.</span><span class="n">_1</span><span class="w"> </span><span class="o">-></span><span class="w"> </span><span class="n">encoder</span><span class="p">.</span><span class="n">encodeToString</span><span class="p">(</span><span class="n">output</span><span class="p">.</span><span class="n">getBuffer</span><span class="p">)</span>
<span class="p">})</span>
<span class="kd">val</span><span class="w"> </span><span class="n">sc</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">spark</span><span class="p">.</span><span class="n">sparkContext</span>
<span class="n">sc</span><span class="p">.</span><span class="n">toRedisKV</span><span class="p">(</span><span class="n">keyedRDD</span><span class="p">)</span>
</code></pre></div>
<p>Now to get the data back out (this time using <a href="https://github.com/xetorthio/jedis">Jedis</a> and a shared library holding the case-class definition).</p>
<div class="highlight"><pre><span></span><code><span class="kd">val</span><span class="w"> </span><span class="n">keyVal</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"keyname"</span>
<span class="kd">val</span><span class="w"> </span><span class="n">jedis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="nc">Jedis</span><span class="p">(</span><span class="s">"localhost"</span><span class="p">)</span>
<span class="kd">val</span><span class="w"> </span><span class="n">instantiator</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="nc">ScalaKryoInstantiator</span>
<span class="n">instantiator</span><span class="p">.</span><span class="n">setRegistrationRequired</span><span class="p">(</span><span class="kc">false</span><span class="p">)</span>
<span class="kd">val</span><span class="w"> </span><span class="n">kryo</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">instantiator</span><span class="p">.</span><span class="n">newKryo</span><span class="p">()</span>
<span class="kd">val</span><span class="w"> </span><span class="n">decoder</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nc">Base64</span><span class="p">.</span><span class="n">getDecoder</span>
<span class="kd">val</span><span class="w"> </span><span class="n">jedResp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">jedis</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">keyVal</span><span class="p">)</span>
<span class="k">if</span><span class="p">(</span><span class="n">jedResp</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="kc">null</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">println</span><span class="p">(</span><span class="s">"Couldn't find key"</span><span class="p">)</span>
<span class="w"> </span><span class="c1">// etc.</span>
<span class="p">}</span>
<span class="kd">val</span><span class="w"> </span><span class="n">decodedBytes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">decoder</span><span class="p">.</span><span class="n">decode</span><span class="p">(</span><span class="n">jedResp</span><span class="p">)</span>
<span class="kd">val</span><span class="w"> </span><span class="n">input</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="nc">Input</span><span class="p">(</span><span class="n">decodedBytes</span><span class="p">)</span>
<span class="kd">val</span><span class="w"> </span><span class="n">dataBack</span><span class="w"> </span><span class="o">=</span>
<span class="w"> </span><span class="n">kryo</span><span class="p">.</span><span class="n">readObject</span><span class="p">(</span><span class="n">input</span><span class="p">,</span><span class="w"> </span><span class="k">classOf</span><span class="p">[</span><span class="n">datatype</span><span class="p">.</span><span class="nc">FeatureCaseClass</span><span class="p">])</span>
</code></pre></div>
<p>This bundles the features neatly together, but it is not particularly efficient because of the extra <em>Base64</em> encoding/decoding step (spark-redis stores values as strings, so raw bytes are out).</p>
<p>It is also possible to leave the <em>Spark-redis</em> connector aside and push the samples directly with a plain client library (like Jedis), which lets us store raw bytes.</p>
<h3>#3</h3>
<div class="highlight"><pre><span></span><code><span class="kd">val</span><span class="w"> </span><span class="n">spark</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nc">SparkSession</span><span class="p">.</span><span class="n">builder</span>
<span class="w"> </span><span class="p">.</span><span class="n">appName</span><span class="p">(</span><span class="s">"Uploader number 3"</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">master</span><span class="p">(</span><span class="s">"local"</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">config</span><span class="p">(</span><span class="s">"spark.redis.host"</span><span class="p">,</span><span class="w"> </span><span class="s">"localhost"</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">config</span><span class="p">(</span><span class="s">"spark.redis.port"</span><span class="p">,</span><span class="w"> </span><span class="s">"6379"</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">getOrCreate</span><span class="p">()</span>
<span class="k">import</span><span class="w"> </span><span class="nn">spark</span><span class="p">.</span><span class="nn">implicits</span><span class="p">.</span><span class="n">_</span>
<span class="kd">val</span><span class="w"> </span><span class="n">dfToConvert</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dfIn</span><span class="p">.</span><span class="n">as</span><span class="p">[</span><span class="nc">FeatureCaseClass</span><span class="p">]</span>
<span class="kd">val</span><span class="w"> </span><span class="n">instantiator</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="nc">ScalaKryoInstantiator</span>
<span class="n">instantiator</span><span class="p">.</span><span class="n">setRegistrationRequired</span><span class="p">(</span><span class="kc">false</span><span class="p">)</span>
<span class="kd">val</span><span class="w"> </span><span class="n">kryo</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">instantiator</span><span class="p">.</span><span class="n">newKryo</span><span class="p">()</span>
<span class="kd">val</span><span class="w"> </span><span class="n">dfToUpload</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dfToConvert</span><span class="p">.</span><span class="n">map</span><span class="p">(</span><span class="n">fcc</span><span class="w"> </span><span class="o">=></span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="kd">val</span><span class="w"> </span><span class="n">output</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="nc">Output</span><span class="p">(</span><span class="mi">512</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="w"> </span><span class="n">kryo</span><span class="p">.</span><span class="n">writeObject</span><span class="p">(</span><span class="n">output</span><span class="p">,</span><span class="w"> </span><span class="n">fcc</span><span class="p">)</span>
<span class="w"> </span><span class="n">fcc</span><span class="p">.</span><span class="n">nameKeyColumn</span><span class="p">.</span><span class="n">getBytes</span><span class="w"> </span><span class="o">-></span><span class="w"> </span><span class="n">output</span><span class="p">.</span><span class="n">getBuffer</span>
<span class="p">})</span>
<span class="kd">val</span><span class="w"> </span><span class="n">results</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dfToUpload</span><span class="p">.</span><span class="n">mapPartitions</span><span class="p">(</span><span class="n">pairList</span><span class="w"> </span><span class="o">=></span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="kd">val</span><span class="w"> </span><span class="n">jedis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="nc">Jedis</span><span class="p">(</span><span class="s">"localhost"</span><span class="p">)</span><span class="w"> </span><span class="c1">// Every spark partition its own client</span>
<span class="w"> </span><span class="n">pairList</span><span class="p">.</span><span class="n">map</span><span class="p">(</span><span class="n">pair</span><span class="w"> </span><span class="o">=></span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">jedis</span><span class="p">.</span><span class="n">set</span><span class="p">(</span><span class="n">pair</span><span class="p">.</span><span class="n">_1</span><span class="p">,</span><span class="w"> </span><span class="n">pair</span><span class="p">.</span><span class="n">_2</span><span class="p">)</span>
<span class="w"> </span><span class="p">})</span>
<span class="p">})</span>
<span class="c1">// Force evaluation through writing return info</span>
<span class="n">results</span><span class="p">.</span><span class="n">write</span><span class="p">.</span><span class="n">mode</span><span class="p">(</span><span class="nc">SaveMode</span><span class="p">.</span><span class="nc">Overwrite</span><span class="p">).</span><span class="n">csv</span><span class="p">(</span><span class="s">"upload_info"</span><span class="p">)</span>
</code></pre></div>
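<p>One possible refinement I have not shown above: issuing a separate round-trip per sample is slow for big DataFrames, and Jedis supports pipelining to batch commands. A sketch of the same partition body using a pipeline (same hypothetical data as before):</p>
<div class="highlight"><pre><span></span><code>val resultsPipelined = dfToUpload.mapPartitions(pairList => {
  val jedis = new Jedis("localhost")
  val pipeline = jedis.pipelined()
  // Queue all sets locally; nothing is sent to Redis yet
  val queued = pairList.map(pair => { pipeline.set(pair._1, pair._2); 1 }).sum
  pipeline.sync() // Flush the queued commands to the server
  jedis.close()
  Iterator(queued) // Number of keys written by this partition
})
</code></pre></div>
<p>As before, nothing happens until <code>resultsPipelined</code> is actually evaluated (for example by writing it out or counting it).</p>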
<p>And now for reading the output (with Jedis again).</p>
<div class="highlight"><pre><span></span><code><span class="kd">val</span><span class="w"> </span><span class="n">keyVal</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"keyname"</span>
<span class="kd">val</span><span class="w"> </span><span class="n">jedis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="nc">Jedis</span><span class="p">(</span><span class="s">"localhost"</span><span class="p">)</span>
<span class="kd">val</span><span class="w"> </span><span class="n">instantiator</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="nc">ScalaKryoInstantiator</span>
<span class="n">instantiator</span><span class="p">.</span><span class="n">setRegistrationRequired</span><span class="p">(</span><span class="kc">false</span><span class="p">)</span>
<span class="kd">val</span><span class="w"> </span><span class="n">kryo</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">instantiator</span><span class="p">.</span><span class="n">newKryo</span><span class="p">()</span>
<span class="kd">val</span><span class="w"> </span><span class="n">jedResp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">jedis</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">keyVal</span><span class="p">.</span><span class="n">getBytes</span><span class="p">)</span>
<span class="k">if</span><span class="p">(</span><span class="n">jedResp</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="kc">null</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">println</span><span class="p">(</span><span class="s">"Couldn't find key"</span><span class="p">)</span>
<span class="w"> </span><span class="c1">// etc.</span>
<span class="p">}</span>
<span class="kd">val</span><span class="w"> </span><span class="n">input</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="nc">Input</span><span class="p">(</span><span class="n">jedResp</span><span class="p">)</span>
<span class="kd">val</span><span class="w"> </span><span class="n">dataBack</span><span class="w"> </span><span class="o">=</span>
<span class="w"> </span><span class="n">kryo</span><span class="p">.</span><span class="n">readObject</span><span class="p">(</span><span class="n">input</span><span class="p">,</span><span class="w"> </span><span class="k">classOf</span><span class="p">[</span><span class="n">datatype</span><span class="p">.</span><span class="nc">FeatureCaseClass</span><span class="p">])</span>
</code></pre></div>Sorting csv's externally2019-09-10T20:30:00+02:002019-09-10T20:30:00+02:00Francis Hillmantag:hillman.dev,2019-09-10:/sorting.html<p>I made a tool for externally sorting csv files and I should know better.</p><p>Some time ago, I ran into the problem of having a very large comma-separated-value file that I had to sort. There are many good solutions to this simple problem, and many ways of going about it. The better ones are probably loading the file into <a href="https://www.sqlite.org">Sqlite</a> or a docker container with <a href="https://hub.docker.com/_/postgres">Postgresql</a> and using SQL, or adapting the data with a quick Python script so that <a href="https://en.wikipedia.org/wiki/Sort_(Unix)">Unix's sort</a> can handle it.</p>
<p>Instead, a friend at the time advised <a href="https://spark.apache.org/">Spark</a>. And although Spark does have methods for larger-than-memory datasets, the data needs to be well partitioned and it is easy to get wrong. It failed spectacularly<sup id="fnref:spark"><a class="footnote-ref" href="#fn:spark">1</a></sup>. I knew this shouldn't be a hard problem, and because a lot of the surrounding code was already in Python, I looked around and easily found a Python external sorter. It was not the quickest, but it got the job done.</p>
<p>But it did leave me wondering. Although <a href="https://pandas.pydata.org/">Pandas' csv parser</a> is quite optimized (and lets you toggle error reporting on faulty csv lines), there is still a Python performance penalty somewhere. I also wanted to experiment with a bit of Scala coding (and not just Spark-flavoured Scala). So I made <a href="https://github.com/FransHeuvelmans/Exort">my own external csv sorter in Scala</a>. By using Univocity's parsing library, I hope to end up with a reasonably quick external sorter that behaves predictably even when presented with bad lines. I still need to properly test it, learn the library, and learn how to work nicely with Scala; it turns out that working around <em>type erasure</em> in pattern matching while keeping things properly generic is not that easy. Hopefully I will learn a bit more about these things by working on this project.</p>
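<p>For reference, the core of such a tool is plain external merge sort. Here is a minimal sketch of the idea (my own illustration, not code from Exort): it sorts whole lines lexicographically and ignores headers and CSV quoting, which are exactly the awkward parts a real parser like Univocity deals with.</p>
<div class="highlight"><pre><span></span><code>import java.io.{File, PrintWriter}
import scala.collection.mutable
import scala.io.Source

// External merge sort: sort fixed-size chunks in memory, spill each chunk
// to a temporary file, then k-way merge the sorted chunks with a heap.
// (For brevity, Sources are not closed here.)
def externalSort(in: File, out: File, chunkSize: Int = 100000): Unit = {
  val chunkFiles = Source.fromFile(in).getLines().grouped(chunkSize).map { chunk =>
    val tmp = File.createTempFile("exsort", ".tmp")
    val spill = new PrintWriter(tmp)
    chunk.sorted.foreach(spill.println)
    spill.close()
    tmp
  }.toVector
  // PriorityQueue is a max-heap, so reverse the ordering to pop the
  // smallest head line (paired with its chunk index) first
  val readers = chunkFiles.map(f => Source.fromFile(f).getLines())
  val heap = mutable.PriorityQueue.empty[(String, Int)](Ordering.by[(String, Int), String](_._1).reverse)
  readers.zipWithIndex.foreach { case (r, i) => if (r.hasNext) heap.enqueue((r.next(), i)) }
  val writer = new PrintWriter(out)
  while (heap.nonEmpty) {
    val (line, i) = heap.dequeue()
    writer.println(line)
    if (readers(i).hasNext) heap.enqueue((readers(i).next(), i))
  }
  writer.close()
  chunkFiles.foreach(_.delete())
}
</code></pre></div>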
<div class="footnote">
<hr>
<ol>
<li id="fn:spark">
<p>Spark is well suited to many other tasks but is often misused like this <a class="footnote-backref" href="#fnref:spark" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>Using Language Models2019-08-06T23:44:00+02:002019-08-06T23:44:00+02:00Francis Hillmantag:hillman.dev,2019-08-06:/langmodels.html<p>A lot has happened in NLP land. These are a few of my observations and software recommendations.</p><p>In the last year Language Models have changed my approach to working with natural language processing.
Some (relatively) fresh results <a href="https://paperswithcode.com/paper/xlnet-generalized-autoregressive-pretraining">by XLNet</a> show that large transformer-style models work really well for many language ML tasks. On the surface it seems quite similar to training word embeddings<sup id="fnref:wemb"><a class="footnote-ref" href="#fn:wemb">1</a></sup>, but with the advantage of training far more layers & parameters (weights). This is a great boon to any NLP practitioner, as <a href="http://ruder.io/nlp-imagenet/">many</a> have written <a href="https://www.nytimes.com/2018/11/18/technology/artificial-intelligence-language.html">about</a>. And in the first half of 2019, the pace of progress hasn't slowed down.</p>
<p>There is a downside to this: easily replicating or retraining your own models from scratch is becoming increasingly hard, often requiring tricks like accumulating gradients, recomputing some weights during the backward pass, or simply leaving parts of a model frozen. Saving up for a while to buy a Nvidia 1080 Ti, only to find you still cannot train many of these models, is a bummer (luckily there is the free <a href="https://colab.research.google.com">colab</a>, but it feels bad to depend on that).
A more serious problem is gauging how much a certain architecture and particular training choices really contribute to the final result.
Anna Rogers has written <a href="https://hackingsemantics.xyz/2019/leaderboards/">a very good piece</a> about this. Another problem is that I do not really know what data has been fed to the model.</p>
<p>But leaving all these sidenotes aside: language models help enormously, and I would like to show two libraries that make working with them incredibly easy.</p>
<h2>Zalando's <a href="https://github.com/zalandoresearch/flair">Flair</a></h2>
<p>Flair is, together with ULMFit, one of the older RNN-style language models. This might be the reason why it is not SOTA anymore, but it still performs very well, and the library is incredibly easy to use, with a built-in model downloader, training support, and all kinds of tooling around their models. They also have a <a href="https://github.com/zalandoresearch/flair#tutorials">simple tutorial</a> which I can recommend if you want to get started quickly.</p>
<p>To show how easy it is, here is a short sample using their pre-trained sequence model, which predicts NER tags.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">flair.data</span> <span class="kn">import</span> <span class="n">Sentence</span>
<span class="kn">from</span> <span class="nn">flair.models</span> <span class="kn">import</span> <span class="n">SequenceTagger</span>
<span class="n">sentence</span> <span class="o">=</span> <span class="n">Sentence</span><span class="p">(</span><span class="s1">'German Chancellor Angela Merkel and British Primeminister Boris Johnson .....'</span><span class="p">)</span>
<span class="n">tagger</span> <span class="o">=</span> <span class="n">SequenceTagger</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s1">'ner'</span><span class="p">)</span>
<span class="n">tagger</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">sentence</span><span class="p">)</span>
<span class="k">for</span> <span class="n">entity</span> <span class="ow">in</span> <span class="n">sentence</span><span class="o">.</span><span class="n">get_spans</span><span class="p">(</span><span class="s1">'ner'</span><span class="p">):</span>
<span class="nb">print</span><span class="p">(</span><span class="n">entity</span><span class="p">)</span>
</code></pre></div>
<p>And here is how to train our own, stacking multiple layers of embeddings.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">flair.datasets</span>
<span class="kn">from</span> <span class="nn">flair.models</span> <span class="kn">import</span> <span class="n">SequenceTagger</span>
<span class="kn">from</span> <span class="nn">flair.trainers</span> <span class="kn">import</span> <span class="n">ModelTrainer</span>
<span class="kn">from</span> <span class="nn">flair.embeddings</span> <span class="kn">import</span> <span class="n">FlairEmbeddings</span><span class="p">,</span> <span class="n">WordEmbeddings</span><span class="p">,</span> <span class="n">StackedEmbeddings</span>
<span class="n">corpus</span> <span class="o">=</span> <span class="n">flair</span><span class="o">.</span><span class="n">datasets</span><span class="o">.</span><span class="n">WIKINER_ENGLISH</span><span class="p">()</span>
<span class="n">ner_dict</span> <span class="o">=</span> <span class="n">corpus</span><span class="o">.</span><span class="n">make_tag_dictionary</span><span class="p">(</span><span class="s1">'ner'</span><span class="p">)</span>
<span class="n">embedding_types</span> <span class="o">=</span> <span class="p">[</span>
<span class="n">WordEmbeddings</span><span class="p">(</span><span class="s1">'glove'</span><span class="p">),</span>
<span class="n">FlairEmbeddings</span><span class="p">(</span><span class="s1">'news-forward-fast'</span><span class="p">),</span>
<span class="p">]</span>
<span class="n">embeddings</span> <span class="o">=</span> <span class="n">StackedEmbeddings</span><span class="p">(</span><span class="n">embeddings</span><span class="o">=</span><span class="n">embedding_types</span><span class="p">)</span>
<span class="n">tagger</span> <span class="o">=</span> <span class="n">SequenceTagger</span><span class="p">(</span>
<span class="n">hidden_size</span><span class="o">=</span><span class="mi">256</span><span class="p">,</span>
<span class="n">embeddings</span><span class="o">=</span><span class="n">embeddings</span><span class="p">,</span>
<span class="n">tag_dictionary</span><span class="o">=</span><span class="n">ner_dict</span><span class="p">,</span>
<span class="n">tag_type</span><span class="o">=</span><span class="s2">"ner"</span><span class="p">,</span>
<span class="n">use_crf</span><span class="o">=</span><span class="kc">True</span>
<span class="p">)</span>
<span class="n">trainer</span> <span class="o">=</span> <span class="n">ModelTrainer</span><span class="p">(</span><span class="n">tagger</span><span class="p">,</span> <span class="n">corpus</span><span class="p">)</span>
<span class="n">trainer</span><span class="o">.</span><span class="n">train</span><span class="p">(</span>
<span class="s1">'resources/taggers/example-ner'</span><span class="p">,</span>
<span class="n">learning_rate</span><span class="o">=</span><span class="mf">0.1</span><span class="p">,</span>
<span class="n">mini_batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span>
<span class="n">max_epochs</span><span class="o">=</span><span class="mi">150</span>
<span class="p">)</span>
</code></pre></div>
<p>There are even built-in <a href="https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_6_CORPUS.md#reading-your-own-sequence-labeling-dataset">ColumnCorpus and CSVCorpus</a> and ClassificationCorpus dataset loaders; the last one loads <a href="https://github.com/facebookresearch/fastText/blob/master/README.md#text-classification">fasttext</a>-style inputs (fasttext itself is also quite a useful toolkit to have).</p>
<h2>Huggingface's <a href="https://github.com/huggingface/pytorch-transformers">pytorch-transformers</a></h2>
<p>This is a well-known reimplementation of modern BERT and XLNet architectures in Pytorch (the originals came from Google, in Tensorflow). Not only are the models pretty easy to work with, the pre-trained weights are also available on pytorch hub and can be downloaded and used with built-in tools.</p>
<p>Using a pretrained network as the front part of your own network can be as easy as this:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">from</span> <span class="nn">pytorch_transformers</span> <span class="kn">import</span> <span class="n">BertModel</span><span class="p">,</span> <span class="n">BertTokenizer</span>
<span class="n">pretrained_modelname</span> <span class="o">=</span> <span class="s2">"bert-base-uncased"</span>
<span class="c1"># This will download the model if it has not be found in user storage</span>
<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">BertTokenizer</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">pretrained_modelname</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">BertModel</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">pretrained_modelname</span><span class="p">)</span>
<span class="n">encoded_text</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s2">"Enter the text you need the embeddings from here"</span><span class="p">)</span>
<span class="n">input_tensor</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="n">encoded_text</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Input tensor: "</span><span class="p">,</span> <span class="n">input_tensor</span><span class="p">)</span>
<span class="k">with</span> <span class="n">torch</span><span class="o">.</span><span class="n">no_grad</span><span class="p">():</span>
<span class="n">output_tuple</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">input_tensor</span><span class="p">)</span>
<span class="n">last_hidden_states</span> <span class="o">=</span> <span class="n">output_tuple</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Last hidden states: "</span><span class="p">,</span> <span class="n">last_hidden_states</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Shape (size): "</span><span class="p">,</span> <span class="n">last_hidden_states</span><span class="o">.</span><span class="n">size</span><span class="p">())</span>
</code></pre></div>
<p>Now, this is already a great solution for the many cases where simply using these embeddings as input features gives better results (especially compared to classic word-vector embeddings). And, as the Flair example shows, it gives room to experiment with combining embeddings (simply concatenating them per word).</p>
<p>But there is more. There are built-in models, ready for <a href="https://github.com/huggingface/pytorch-transformers/blob/master/pytorch_transformers/modeling_bert.py#L1109">word classification</a> (like NER tagging) and generic <a href="https://github.com/huggingface/pytorch-transformers/blob/master/pytorch_transformers/modeling_xlnet.py#L1076">text classification</a>, plus example run-scripts in the examples folder.</p>
<p>To extend the <code>run_glue.py</code> (and <code>util_glue.py</code>) model training scripts to run generic text classification problems, I added some code to <code>util_glue.py</code>:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">ImdbProcessor</span><span class="p">(</span><span class="n">DataProcessor</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""Processor for the special IMDB dataset."""</span>
<span class="k">def</span> <span class="nf">get_train_examples</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data_dir</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""See base class."""</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">"LOOKING AT </span><span class="si">{}</span><span class="s2">"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">data_dir</span><span class="p">,</span> <span class="s2">"imdbtrain.tsv"</span><span class="p">)))</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">_create_examples</span><span class="p">(</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_read_tsv</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">data_dir</span><span class="p">,</span> <span class="s2">"imdbtrain.tsv"</span><span class="p">)),</span> <span class="s2">"train"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">get_dev_examples</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data_dir</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""See base class."""</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">_create_examples</span><span class="p">(</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_read_tsv</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">data_dir</span><span class="p">,</span> <span class="s2">"imdbtest.tsv"</span><span class="p">)),</span> <span class="s2">"dev"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">get_labels</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""Still same classes luckily."""</span>
<span class="k">return</span> <span class="p">[</span><span class="s2">"0"</span><span class="p">,</span> <span class="s2">"1"</span><span class="p">]</span>
<span class="k">def</span> <span class="nf">_create_examples</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">lines</span><span class="p">,</span> <span class="n">set_type</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""Creates examples for the training and dev sets."""</span>
<span class="n">examples</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">lines</span><span class="p">:</span>
<span class="n">guid</span> <span class="o">=</span> <span class="s2">"</span><span class="si">%s</span><span class="s2">-</span><span class="si">%s</span><span class="s2">"</span> <span class="o">%</span> <span class="p">(</span><span class="n">set_type</span><span class="p">,</span> <span class="n">line</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">text_a</span> <span class="o">=</span> <span class="n">line</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="n">text_b</span> <span class="o">=</span> <span class="kc">None</span>
<span class="n">label</span> <span class="o">=</span> <span class="n">line</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span>
<span class="n">examples</span><span class="o">.</span><span class="n">append</span><span class="p">(</span>
<span class="n">InputExample</span><span class="p">(</span><span class="n">guid</span><span class="o">=</span><span class="n">guid</span><span class="p">,</span> <span class="n">text_a</span><span class="o">=</span><span class="n">text_a</span><span class="p">,</span> <span class="n">text_b</span><span class="o">=</span><span class="n">text_b</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="n">label</span><span class="p">))</span>
<span class="k">return</span> <span class="n">examples</span>
<span class="n">processors</span> <span class="o">=</span> <span class="p">{</span>
<span class="o">...</span>
<span class="s2">"imdb"</span><span class="p">:</span> <span class="n">ImdbProcessor</span><span class="p">,</span>
<span class="p">}</span>
<span class="n">output_modes</span> <span class="o">=</span> <span class="p">{</span>
<span class="o">...</span>
<span class="s2">"imdb"</span><span class="p">:</span> <span class="s2">"classification"</span><span class="p">,</span>
<span class="p">}</span>
<span class="n">GLUE_TASKS_NUM_LABELS</span> <span class="o">=</span> <span class="p">{</span>
<span class="o">...</span>
<span class="s2">"imdb"</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div>
<p>In this case I named it IMDB to make it work with the IMDB classification dataset. Now it can work on a dataset with tab-separated values<sup id="fnref:tabexplain"><a class="footnote-ref" href="#fn:tabexplain">2</a></sup>.</p>
<h2>Other libraries</h2>
<p>Some libraries I am watching but haven't tested yet.</p>
<p><strong><a href="https://spacy.io/usage/v2-1#pretraining">Spacy embedding pretraining</a></strong>: I use Spacy quite often for fast text cleaning/mangling and for creating rules matchers based on regex in combination with NER tags. Previously I have tested and used their build-in NER training and classification modules. My bet is that this will be just as great.</p>
<p><strong><a href="https://github.com/deepset-ai/FARM">Deepset FARM</a></strong>: By the folks who also released a German trained BERT model. Looks very neatly done and is build ontop of pytorch_transformers.</p>
<p><strong><a href="https://nlp.johnsnowlabs.com/">JohnSnow NLP</a></strong>: A colleague of mine tried to do NLP on SPARK some time ago and used this library. He checked it out again this week and saw a bunch of new features. I am not completely sold on the distributed text processing just yet, but I do want to try it out.</p>
<p>There is a lot more to say, but I am going to leave it at this for now. Hopefully this will get you interested in trying out these great techniques and digging deeper into their examples and source code.</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:wemb">
<p>Word embeddings can still be used, although most models now learn embeddings for subword units. See <a href="https://github.com/google/sentencepiece">sentencepiece</a> or <a href="https://github.com/openai/gpt-2/blob/master/src/encoder.py">GPT-2's encoder</a> <a class="footnote-backref" href="#fnref:wemb" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:tabexplain">
<p>In this case col 0: id, col 1: text, col 2: class labels → 0 or 1 <a class="footnote-backref" href="#fnref:tabexplain" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
</ol>
</div>Blogpost 0012019-07-11T16:20:00+02:002019-07-11T16:20:00+02:00Francis Hillmantag:hillman.dev,2019-07-11:/post-001.html<p>About this new blog of mine</p><p>Hey, welcome to my new blog. This is the third time I've started a blog and the second time with my own domain. I haven't used this site since the launch of the <em>.dev</em> domain, which should give an indication of how active I am.</p>
<p>I plan to use this blog to write about small or big things that I've learned and would like to share. Maybe nobody will ever visit it, but even then I hope to learn to write a bit better and to remind myself of the things I've learned.</p>
<p>For this first post I do not have much to mention other than <a href="https://hillman.dev/pages/personal-projects.html">my little projects</a>, and that I dig <a href="https://www.fast.ai/">fastai</a> and <a href="https://spacy.io/usage/spacy-101">spacy</a> for making ML and NLP more approachable. I think we are still some way off from automating everything with AI, but there is a lot of low-hanging fruit in simple decisions made from written information, where quick advice would speed things up.</p>