FvdHhttps://hillman.dev/2023-11-17T16:05:00+01:00Towards a personal health-datalake2023-11-17T16:05:00+01:002023-11-17T16:05:00+01:00Francis Hillmantag:hillman.dev,2023-11-17:/personal_health_data_002.html<p>A plan for controlling our own health data</p><p>After a <a href="personal_health_data_001.html">previous rant</a>, I started toying with the idea of writing software to solve this issue<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup>. The goal is to fully control and use personal health data, and I am developing this for myself first. Luckily, there are a lot of projects and people out there to take inspiration from.</p>
<h2>Inkling of an idea</h2>
<p>My primary source of inspiration is <a href="https://simonwillison.net/">Simon Willison</a>, a developer with a long pedigree of big projects, but the ones I want to focus on right now are <a href="https://datasette.io/">datasette</a> and <a href="https://dogsheep.github.io/">dogsheep</a> (but do check out his recent work on large language models)<sup id="fnref:2"><a class="footnote-ref" href="#fn:2">2</a></sup>.
<em>Datasette</em> is a web UI for <a href="https://www.sqlite.org/index.html">SQLite</a> databases which makes publishing data a breeze. With some sprinkling of SQL you can dig into and analyze the data inside an SQLite file. <em>Dogsheep</em> is a collection of tools for <em>personal analytics</em>: tools that allow a user to import personal data into SQLite files which can then be published.</p>
<p>My plan is to extend these tools to include imports for health apps on phones, smartwatches, and other devices. The first priority is FOSS apps and easy-to-import data sources. But having an import is not enough; what is also needed is a method of transforming and combining that data such that it can be used for analytics and visualization. Finally, the analysis tools included in Datasette are a start, but preferably we can dig into the data ad hoc in a better way.</p>
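<p>To make the Dogsheep approach concrete: many of those importers build on Simon Willison's <em>sqlite-utils</em> library. A minimal sketch of such an importer, assuming a hypothetical CSV export from a smart-scale app (the file name and columns are made up), could look like this:</p>
<div class="highlight"><pre><code>import csv

import sqlite_utils  # the library most Dogsheep importers build on

# Hypothetical export: weight_export.csv with columns measured_at, kg
db = sqlite_utils.Database("health.db")
with open("weight_export.csv", newline="") as f:
    rows = list(csv.DictReader(f))
# Upsert on the timestamp so re-running the import is idempotent
db["weight"].insert_all(rows, pk="measured_at", replace=True)
</code></pre></div>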
<h2>Lazydog</h2>
<p>I am not really sure what to name the project yet, but for now I'm using <a href="https://github.com/FransHeuvelmans/Lazydog">Lazydog</a>. I am building this for myself first. It is also an excuse for me to dig into some technologies that I usually do not use. That means that most of the tools for migrating data out of an app or service into SQLite will be made with different compiled languages, the idea being that they are easier to deploy than my typical Java, Python, or Scala projects and can be packaged in fewer bytes<sup id="fnref:3"><a class="footnote-ref" href="#fn:3">3</a></sup>. For transforming the data inside SQLite I am testing out <a href="https://docs.getdbt.com/">dbt-core</a>. As a developer who works with Apache Spark (and Airflow or other workflow orchestration tools), I'm interested in trying this tool. But if I ever want to ship this easily to end users, then there needs to be an alternative. Hopefully some parts of my solution can help other people.</p>
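<p>To give an idea of the transformation step: a dbt model is essentially a SQL select over the imported tables, materialized back into the database. A hedged sketch of the same idea in plain Python (the table and column names are hypothetical), combining steps and sleep data into one daily summary:</p>
<div class="highlight"><pre><code>import sqlite3

# Hypothetical source tables "steps" and "sleep", both keyed by day
con = sqlite3.connect("health.db")
con.execute("DROP TABLE IF EXISTS daily_summary")
con.execute(
    """
    CREATE TABLE daily_summary AS
    SELECT s.day, s.step_count, sl.hours_slept
    FROM steps AS s
    JOIN sleep AS sl ON sl.day = s.day
    """
)
con.commit()
con.close()
</code></pre></div>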
<p>I'm still on the lookout for the right visualization and analysis tool. Apache Superset, as an open-source alternative to Tableau or Power BI, is way too heavy for a personal datalake<sup id="fnref:4"><a class="footnote-ref" href="#fn:4">4</a></sup>. Providing finished Plotly Dash or Shiny dashboards is on my list, but those do not have the dig-around-and-experiment factor. I will probably start off with the common data-sciency Julia-Python-R notebooks (not necessarily Jupyter).</p>
<h2>Right time</h2>
<p>I believe that having more data about myself is becoming more valuable to me as a user. While we are going through a large-language-model hype, it is clear that these models can help as an idea-sparring partner or simply by giving examples or advice. When combined with relevant data, they can find solutions for quite advanced problems. There are still plenty of real problems with hallucinations and bad advice, which I do not want to downplay, so be very careful with an LLM's suggestions. However, year over year our data and the insight it provides are becoming more valuable to us, but only if they truly belong to us.</p>
<p>Originally the Lazydog idea came to me when I wished that I could get advice on and adjustments to my fitness plan based on how much I slept, ate, and moved in the previous days. Now I see the first open-source apps popping up which use LLMs to <a href="https://github.com/LiamMorrow/LiftLog">generate specific fitness programs</a>. It won't take many years for us to use our own data better.</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:1">
<p>Developers like to throw more code at issues, even when it is not the most effective strategy. Having said that, personal projects are a great place to try new things and learn. We do not always need to use the most efficient method of creating a solution. <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p>There are more <a href="https://github.com/woop/awesome-quantified-self">"quantified self"</a> projects that help with collecting personal health data. They try to solve a very similar problem and I draw a lot of inspiration from those apps as well. <a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:3">
<p>Things like <em>graalvm native-image</em> for Java, <em>Scala-native</em>, <em>dotnet native AOT</em>, and <em>nuitka</em> or <em>beeware briefcase</em> for Python are all enticing solutions, but they usually produce much larger executables. <a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text">&#8617;</a></p>
</li>
<li id="fn:4">
<p>I do wish there was a lightweight GTK or Qt version with the most common features implemented. <a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text">&#8617;</a></p>
</li>
</ol>
</div>Using Intel Arc2023-05-06T17:36:00+02:002023-05-06T17:36:00+02:00Francis Hillmantag:hillman.dev,2023-05-06:/arc01.html<p>A third player entered the game</p><h1>GPU ML on Intel Arc</h1>
<p>Intel is releasing discrete GPUs, and that's a good thing for consumers. I think a lot of the progress in the Deep Learning space is thanks to cheap and plentiful compute being available to researchers without the need for access to a supercomputer; the boom in computer vision research after AlexNet is a good example. Then I found out that Intel wanted to support deep learning on their GPUs as well. I have always disliked the layers of great open-source software we have built on top of the proprietary CUDA language; both AMD's ROCm and Intel's oneAPI are open source. Then I had to pick: Intel was creating cheaper GPUs with 16GB of memory, while AMD was still not officially supporting consumer GPUs (they seem to only support the consumer variants closest to their professional Instinct line). Long story short, I bought an Intel Arc A770 to try out their software stack and record my experiences.</p>
<h2>Installation</h2>
<p>Intel is new at this, so I am expecting some teething issues. Their own <a href="https://dgpu-docs.intel.com/">dgpu-documentation</a>'s installation steps only work for <em>Ubuntu 22.04</em>, which is fine as a first Linux OS to support, but it does mean older packages than rolling-release OSes like Arch Linux or Intel's very own Clear Linux. Moreover, the steps will tell you to install an older Linux kernel before you can install the specific <a href="https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units"><em>gpgpu</em></a> drivers, which means there is no video support until the installation is complete. I had to run all the steps in recovery mode. Compare this to starting up a new install of Clear Linux, which ships a very up-to-date kernel with pretty good normal (non-gpgpu) video support out of the box. Sadly, I couldn't install the drivers there.</p>
<p><a href="https://dgpu-docs.intel.com/installation-guides/ubuntu/ubuntu-jammy-dc.html">These were the driver installation steps I followed</a>.</p>
<p>The next step was installing the <a href="https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html#gs.vean3l">base oneAPI toolkit</a>. I added the sources to APT and ran <code>sudo apt install intel-basekit</code> successfully.</p>
<p>I had some problems with running <code>sudo apt install intel-aikit</code> for the AI Analytics toolkit next. Somehow some of the included packages could not be found on the server. Luckily there is a <em>conda</em> method for installing that toolkit as well, and I still know conda from CUDA installs some time in the past. So, with a Miniconda install in hand, I continued.</p>
<p>Another failed attempt was the <a href="https://www.intel.com/content/www/us/en/docs/oneapi/installation-guide-linux/2023-0/install-intel-ai-analytics-toolkit-via-conda.html">described</a> conda installation with pytorch. I got Intel's Python distribution with the MKL backend and pytorch with xpu support, but it couldn't find the GPU.</p>
<p>Finally what worked for me was:</p>
<div class="highlight"><pre><span></span><code><span class="n">conda</span> <span class="n">create</span> <span class="o">-</span><span class="n">n</span> <span class="n">gputest</span> <span class="o">-</span><span class="n">c</span> <span class="n">intel</span> <span class="n">intelpython3_full</span><span class="o">=</span><span class="mf">3.9</span>
<span class="n">conda</span> <span class="n">activate</span> <span class="n">gputest</span>
<span class="n">python</span> <span class="o">-</span><span class="n">m</span> <span class="n">pip</span> <span class="n">install</span> <span class="n">torch</span><span class="o">==</span><span class="mf">1.13.0</span><span class="n">a0</span> <span class="n">torchvision</span><span class="o">==</span><span class="mf">0.14.1</span><span class="n">a0</span> <span class="n">intel_extension_for_pytorch</span><span class="o">==</span><span class="mf">1.13.10</span><span class="o">+</span><span class="n">xpu</span> <span class="o">-</span><span class="n">f</span> <span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">developer</span><span class="o">.</span><span class="n">intel</span><span class="o">.</span><span class="n">com</span><span class="o">/</span><span class="n">ipex</span><span class="o">-</span><span class="n">whl</span><span class="o">-</span><span class="n">stable</span><span class="o">-</span><span class="n">xpu</span><span class="o">-</span><span class="n">idp</span>
<span class="n">source</span> <span class="o">/</span><span class="n">opt</span><span class="o">/</span><span class="n">intel</span><span class="o">/</span><span class="n">oneapi</span><span class="o">/</span><span class="n">setvars</span><span class="o">.</span><span class="n">sh</span> <span class="c1"># Doesn't work in fish :(</span>
<span class="n">python</span> <span class="o">-</span><span class="n">c</span> <span class="s2">"import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__); print(ipex.__version__); [print(f'[</span><span class="si">{i}</span><span class="s2">]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())];"</span>
</code></pre></div>
<p>The instructions for the GPU version of ipex (intel extension for pytorch) said that the setvars script needs to be run and that only version 3.9 of the Intel distribution of Python is supported.</p>
<h2>Running some samples</h2>
<p>There is a <a href="https://github.com/oneapi-src/oneAPI-samples/tree/2023.1_AIKit/AI-and-Analytics/Getting-Started-Samples/Intel_Extension_For_PyTorch_GettingStarted">sample from Intel</a> which ran fine both on the CPU and, by adding <code>.to("xpu")</code>, on the Arc A770 GPU. I noticed that the iGPU was also detected, but sending the data to <code>xpu:1</code> returns a "Double type is not supported on this platform" error. I'm curious what <strong>is</strong> supported on that platform, since there are still shader units in the iGPU which could be made to multiply matrices.</p>
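<p>For reference, the change needed to run such a sample on the GPU amounts to something like the sketch below, assuming the conda environment from above (the resnet18 model is just an arbitrary stand-in):</p>
<div class="highlight"><pre><code>import torch
import torchvision
import intel_extension_for_pytorch as ipex  # registers the "xpu" device

model = torchvision.models.resnet18().to("xpu").eval()
model = ipex.optimize(model)  # optional ipex kernel optimizations

data = torch.rand(1, 3, 224, 224).to("xpu")
with torch.no_grad():
    out = model(data)
print(out.shape)
</code></pre></div>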
<h2>Final notes</h2>
<p>I hope that we will get to a point where training neural networks does not require running special older Linux kernels or downloading proprietary drivers. Perhaps a <a href="https://github.com/geohot/tinygrad">Tinygrad</a> with <a href="https://github.com/geohot/tinygrad/blob/master/tinygrad/runtime/ops_gpu.py">OpenCL support</a> is currently our best bet for long-term support for deep learning on any hardware. Maybe we can use <a href="https://github.com/llvm/torch-mlir">MLIR</a> between the training code and the hardware. I'm still hoping that Intel shows that, by writing better software and drivers, they can become an important player for hobbyist researchers. And for the record, I'm also hoping AMD will step up and support their <a href="https://threedots.ovh/blog/2022/05/amd-rocm-a-wasted-opportunity/">consumer GPU devices</a>. This is something Nvidia does very well; there are still tons of ways to use old Nvidia GPUs if you'd like to.</p>Controlling your data2023-03-12T12:34:00+01:002023-03-12T12:34:00+01:00Francis Hillmantag:hillman.dev,2023-03-12:/personal_health_data_001.html<p>I would like to be in control of my own data</p><p>The most used software forge, GitHub, is sprawling with tools that data professionals make for their peers. And I love it. Every day I check the news to learn about new open-source goodies which we can all use. It is great to learn about tools like <a href="https://www.pola.rs/">Polars</a> and <a href="https://duckdb.org/">DuckDB</a>, ML model releases like <a href="https://stability.ai/blog/stable-diffusion-public-release">stable-diffusion</a> or <a href="https://ai.facebook.com/blog/large-language-model-llama-meta-ai/">LLaMa</a>, and so much more.</p>
<p>But when it comes to managing personal data, I feel there is a lack of options. I want to focus on personal <em>health</em> data. Whether it is fitness trackers, smart scales, smartphones tracking steps, or keeping a food diary, many of us are generating health data. Yet all of this data is kept inside silos.
I switched between an Apple and an Android phone some years ago, and there is no easy way to bring all the data together. We are being pushed towards putting more of our information in the same silo; only then can, for instance, weight tracking and the number of steps per day be combined.</p>
<p>At least when combining all this data inside a Google Fit or Apple Health silo we get some insights. But then I think back to the time, some years ago, when I was working near healthcare providers, and how much interest there was from insurance agencies in using and modelling this personal data. I know it is in Apple’s and Google’s best interest to be very careful with this sensitive data, but many parties want access. Data breaches happen every day, and large silos of information are more valuable targets. By keeping all the data in a proprietary external service I never feel like I am fully in control.</p>
<p>Another issue I have is that these apps make little use of the data they are provided. Goals or predictions are almost always based on a single factor. They do not give more than just the most generic advice. Any features they do try to push feel like native advertising.</p>
<p>There are quantified-self apps which are geared towards power users. They have more features, but they can fall into the trap of making users feel bad for not using them. A true health app for everyone does not judge and is simply a good and useful aid. Luckily, more and more FOSS alternatives are being created alongside the large proprietary offerings. Some are very bare-bones with minimal interfaces; others are more feature-full, with sought-after features like automatically reading scale data over Bluetooth.</p>
<p>These alternatives make it (relatively) easy to export their data. In broad strokes, all health apps are slowly but surely adding data-export features. The new issue: there is too little software to make use of this exported data. This is where we data coders should step in. And by supporting FOSS apps which do one thing very well, we avoid the pitfall of putting all our data in the Google or Apple silo. Letting users try different apps, without being afraid that all that data is useless if they decide to switch, would be amazing. At work people talk about data lakes and analytical processing, yet where is our personal data pool at home? Where can we crunch some very personal, small-sized data and share it directly ourselves when we want to?</p>Introduction to Probability notes2021-07-12T16:48:00+02:002021-07-12T16:48:00+02:00Francis Hillmantag:hillman.dev,2021-07-12:/probnotes.html<p>Notes on Introduction to probability</p><p>During my studies I got to see a lot of programming and a lot of mathematics.
I remember having a hard time with some probability and statistics, but after getting some help from fellow students I managed to pass the courses.</p>
<p>Over the years after my graduation, I wanted to keep up with the field and continue learning.
The main method I use to keep up is doing <a href="https://www.coursera.org/">Coursera</a>/<a href="https://www.edx.org/">Edx</a> online courses.
I also practiced a lot of exercises on <a href="https://www.datacamp.com/">Datacamp</a> and learned some technologies on <a href="https://www.udemy.com/">Udemy</a>.
Some went deeper than others, but almost all of them helped me. In these cases I was always learning with a clear goal.
I also have many books on computer science and data science
topics which I use from time to time.
But, up to now I never took the time to go through them
cover to cover and do the exercises. </p>
<p>Now some of those early maths and CS courses lie 10 years (or more) in the past, and I wonder how much stuck with me. So, I'm planning to go over some of the books I have. I want to go over them quietly, at my own pace and without the pressure of university.</p>
<p>The first book I chose is <a href="https://projects.iq.harvard.edu/stat110/home">Introduction to Probability</a>, of which I've heard many
positive things. What I also like is that there are resources available online
which can help me. Since the book uses <em>R</em>, this is a great opportunity to
do a little R programming and add the notes and explanation in R code.</p>
<p>If I have any notes worth sharing I'll put them in the <a href="https://hillman.dev/pages/study-notes.html">notes section</a>
of this website. There is no schedule for when or if I will make them,
since I want to take as long as it takes, and life often gets in the way.</p>Methods of using Avro2021-03-23T22:22:00+01:002021-03-23T22:22:00+01:00Francis Hillmantag:hillman.dev,2021-03-23:/avromethods.html<p>What could that Avro byte array mean and how can it be handled</p><p>Last week I had a run-in with Apache Avro, a data serialization method which I have used a couple
of times in the past, mostly in combination with Kafka, but I remember also being pleasantly surprised when using it on its own.</p>
<p>The thing I realized this time is that there are a lot of different ways of using Avro, and I wanted to write them down for myself. So here we are; do not expect this to be a complete overview or guide, it is just some notes. I will focus mostly on the Java side here, but Avro supports more programming languages.</p>
<h2>Methods of defining your data</h2>
<p>The primary method for defining your data is to create ".avsc" files. These are JSON files which can be used to generate classes, encoders, and decoders. Always make sure that the definition files are shared between projects, either as a dependency or by using a git submodule. You can do some other manual bookkeeping and copy the files around, but this can become difficult to track over time. To generate those classes in Java, an external tool can be used. One tool which can be easily integrated is a <a href="http://avro.apache.org/docs/current/gettingstartedjava.html">maven plugin</a> which runs automatically during your compile cycle. This way your implementation is always checked against the data definition.</p>
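<p>As an illustration, a minimal hypothetical ".avsc" file for the <code>BottleMessage</code> record used later in this post could look like this (the namespace and field names are made up):</p>
<div class="highlight"><pre><code>{
  "namespace": "dev.hillman.avro",
  "type": "record",
  "name": "BottleMessage",
  "fields": [
    {"name": "sender", "type": "string"},
    {"name": "payload", "type": "long"}
  ]
}
</code></pre></div>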
<p>Note however, if you are using IntelliJ<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup>, that this generation step is not integrated with the IDE, which means that doing an IntelliJ build will <em>fail</em>. If you do a <code>mvn compile</code>, you will notice that the ".java" files are created in the namespace you denoted in the definition. The classes you get for free contain a Builder for builder/flow-style object creation, plus a whole lot more.</p>
<p>If the interacting projects are all written on the JVM, then the reflection API is also an option for defining your data. This means defining your classes as normal Java POJOs and creating encoders and decoders using Avro's ReflectDatumReader/-Writer. The disadvantage is clear: you have to rely on Java. In this case it is best to have some common library on which all projects can depend. The upside is that there is no need to write raw JSON specifications or understand ".avsc" files.</p>
<p>There are also other languages in the Avro ecosystem, like IDL and <em>.avpr</em> files, which allow you to describe whole remote-procedure-call (RPC) schemes. A complete example can be found <a href="https://github.com/alexholmes/avro-maven">here</a>.</p>
<h2>Methods of turning objects into bytes</h2>
<p>First off, I am not going to go into JSON encoding and decoding, although that is also a possibility with Avro. This blog post concerns binary serialization and deserialization.</p>
<p>Once we have some plugin-generated class files, there is a convenient built-in <code>avroObject.toByteBuffer()</code> method. It does not say which specific Avro encoding is used for these bytes, but I think it adheres to the <a href="https://avro.apache.org/docs/current/spec.html#single_object_encoding">single-object encoding</a>, since there is a header of 10 bytes in total in front of the object itself. For decoding these objects a simple <code>AvroObject.fromByteBuffer</code> will do.</p>
<p>There is also a method which delivers just the bytes of the object and nothing more.
In this example <code>BottleMessage</code> is defined in an <em>.avsc</em> file, and I am passing
around <code>byte[]</code> but I could be using something else for the encoder input
depending on the application.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.avro.io.*</span><span class="p">;</span>
<span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.avro.specific.SpecificDatumReader</span><span class="p">;</span>
<span class="kn">import</span><span class="w"> </span><span class="nn">org.apache.avro.specific.SpecificDatumWriter</span><span class="p">;</span>
<span class="kn">import</span><span class="w"> </span><span class="nn">java.io.ByteArrayOutputStream</span><span class="p">;</span>
<span class="kn">import</span><span class="w"> </span><span class="nn">java.io.IOException</span><span class="p">;</span>
<span class="kd">public</span><span class="w"> </span><span class="kd">class</span> <span class="nc">DatumTransformer</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">DatumWriter</span><span class="o"><</span><span class="n">BottleMessage</span><span class="o">></span><span class="w"> </span><span class="n">bottleMessageDatumWriter</span><span class="w"> </span><span class="o">=</span>
<span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">SpecificDatumWriter</span><span class="o"><></span><span class="p">(</span><span class="n">BottleMessage</span><span class="p">.</span><span class="na">class</span><span class="p">);</span>
<span class="w"> </span><span class="n">DatumReader</span><span class="o"><</span><span class="n">BottleMessage</span><span class="o">></span><span class="w"> </span><span class="n">bottleMessageDatumReader</span><span class="w"> </span><span class="o">=</span>
<span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">SpecificDatumReader</span><span class="o"><></span><span class="p">(</span><span class="n">BottleMessage</span><span class="p">.</span><span class="na">class</span><span class="p">);</span>
<span class="w"> </span><span class="n">EncoderFactory</span><span class="w"> </span><span class="n">encoderFactory</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">EncoderFactory</span><span class="p">.</span><span class="na">get</span><span class="p">();</span>
<span class="w"> </span><span class="n">DecoderFactory</span><span class="w"> </span><span class="n">decoderFactory</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">DecoderFactory</span><span class="p">.</span><span class="na">get</span><span class="p">();</span>
<span class="w"> </span><span class="n">BinaryEncoder</span><span class="w"> </span><span class="n">reuseEncoder</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">null</span><span class="p">;</span>
<span class="w"> </span><span class="n">BinaryDecoder</span><span class="w"> </span><span class="n">reuseDecoder</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">null</span><span class="p">;</span>
<span class="w"> </span><span class="kd">public</span><span class="w"> </span><span class="kt">byte</span><span class="o">[]</span><span class="w"> </span><span class="nf">encode</span><span class="p">(</span><span class="n">BottleMessage</span><span class="w"> </span><span class="n">bottleMessage</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">ByteArrayOutputStream</span><span class="w"> </span><span class="n">byteStream</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">ByteArrayOutputStream</span><span class="p">();</span>
<span class="w"> </span><span class="n">reuseEncoder</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">encoderFactory</span><span class="p">.</span><span class="na">binaryEncoder</span><span class="p">(</span><span class="n">byteStream</span><span class="p">,</span><span class="w"> </span><span class="n">reuseEncoder</span><span class="p">);</span>
<span class="w"> </span><span class="k">try</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">bottleMessageDatumWriter</span><span class="p">.</span><span class="na">write</span><span class="p">(</span><span class="n">bottleMessage</span><span class="p">,</span><span class="w"> </span><span class="n">reuseEncoder</span><span class="p">);</span>
<span class="w"> </span><span class="n">reuseEncoder</span><span class="p">.</span><span class="na">flush</span><span class="p">();</span>
<span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="k">catch</span><span class="w"> </span><span class="p">(</span><span class="n">IOException</span><span class="w"> </span><span class="n">e</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">e</span><span class="p">.</span><span class="na">printStackTrace</span><span class="p">();</span>
<span class="w"> </span><span class="k">throw</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">RuntimeException</span><span class="p">();</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">byteStream</span><span class="p">.</span><span class="na">toByteArray</span><span class="p">();</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="kd">public</span><span class="w"> </span><span class="n">BottleMessage</span><span class="w"> </span><span class="nf">decode</span><span class="p">(</span><span class="kt">byte</span><span class="o">[]</span><span class="w"> </span><span class="n">bytes</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">reuseDecoder</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">decoderFactory</span><span class="p">.</span><span class="na">binaryDecoder</span><span class="p">(</span><span class="n">bytes</span><span class="p">,</span><span class="w"> </span><span class="n">reuseDecoder</span><span class="p">);</span>
<span class="w"> </span><span class="k">try</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="c1">// Reuse variable not used in this example</span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">bottleMessageDatumReader</span><span class="p">.</span><span class="na">read</span><span class="p">(</span><span class="kc">null</span><span class="p">,</span><span class="w"> </span><span class="n">reuseDecoder</span><span class="p">);</span>
<span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="k">catch</span><span class="w"> </span><span class="p">(</span><span class="n">IOException</span><span class="w"> </span><span class="n">e</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">e</span><span class="p">.</span><span class="na">printStackTrace</span><span class="p">();</span>
<span class="w"> </span><span class="k">throw</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">RuntimeException</span><span class="p">();</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="p">}</span>
<span class="p">}</span>
</code></pre></div>
<p>Note that you can pass in reuse objects. In that case the reuse object's properties are set instead of a new object being created. This could be useful if you immediately change the common-interface object into project-specific objects anyway and you do not want to create (more) garbage-collection pressure. However, I did not do any performance tests to see if this really helps here. In general, the performance of decoding single objects is very similar for both methods. Decoding multiple objects from a stream of data is easier using the <code>EncoderFactory</code> approach, and it is also a bit faster than transforming single objects with the <code>toByteBuffer</code> method.</p>
<p>Then there is a special method for writing to files. It can be used to save many rows of data, and the result can be read without knowing the schema. For large sets of data, Parquet files or other columnar storage would probably be better; for heavily nested, record-based data, however, it works quite well. There is also the possibility to add optional compression codecs. Code-wise, encoding to a file and decoding from a file look very similar to the previous binary-encoding method, but with a <code>DataFileWriter</code> instead of an <code>EncoderFactory</code> + <code>BinaryEncoder</code>. For reading without a schema, a <code>GenericDatumReader<GenericRecord></code> is used.</p>
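<p>On the Python side, this container-file format is what the <a href="https://github.com/fastavro/fastavro">fastavro</a> library's reader and writer handle. A hedged sketch, reusing the hypothetical schema from earlier:</p>
<div class="highlight"><pre><code>from fastavro import parse_schema, reader, writer

schema = parse_schema({
    "namespace": "dev.hillman.avro",
    "type": "record",
    "name": "BottleMessage",
    "fields": [
        {"name": "sender", "type": "string"},
        {"name": "payload", "type": "long"},
    ],
})

records = [{"sender": "sea", "payload": i} for i in range(100)]
with open("messages.avro", "wb") as out:
    writer(out, schema, records, codec="deflate")  # schema travels with the file

with open("messages.avro", "rb") as fo:
    for record in reader(fo):  # no schema needed when reading
        print(record)
</code></pre></div>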
<h2>Other software</h2>
<p>The whole reason I ran into trouble to begin with was that the <em>AvroCoder of Apache Beam</em> was producing messages which <code>AvroObject.fromByteBuffer</code> could not decode. Furthermore, the Avro integration into Kafka generally only works with <em>single-object encoding</em>. So when using higher-level libraries which produce or read Avro data, it is important to inspect what kind of encoding and decoding they are doing.</p>
<p>Reading the Avro data afterwards in Python turned out to be difficult. I did try to do it again for this blog, but I had to resort to changing and cutting some bytes to make it work. This could also be because I used Base64 to copy over the raw byte array. Many of the specialized Python libraries which make reading Avro faster focus on reading Avro files (with the schemas attached), like <a href="https://github.com/fastavro/fastavro">fastavro</a>. To read single-object encoded byte data, the <em><a href="https://github.com/confluentinc/confluent-kafka-python">confluent_kafka</a></em> library is probably needed.</p>
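<p>For raw (headerless) binary payloads, fastavro's schemaless reader and writer should correspond to what the Java <code>BinaryEncoder</code>/<code>BinaryDecoder</code> produce and consume. A hedged sketch follows; note that a single-object-encoded payload would first need its 10-byte prefix (the 2-byte marker plus the 8-byte schema fingerprint) stripped before it can be read this way:</p>
<div class="highlight"><pre><code>import io

from fastavro import parse_schema, schemaless_reader, schemaless_writer

schema = parse_schema({
    "namespace": "dev.hillman.avro",
    "type": "record",
    "name": "BottleMessage",
    "fields": [
        {"name": "sender", "type": "string"},
        {"name": "payload", "type": "long"},
    ],
})

# Encode a single record without any header
buf = io.BytesIO()
schemaless_writer(buf, schema, {"sender": "sea", "payload": 42})
raw = buf.getvalue()

# Decode it again; the writer schema must be supplied out of band
record = schemaless_reader(io.BytesIO(raw), schema)
print(record)
</code></pre></div>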
<h2>Other links</h2>
<p>The code for this blog post can be found <a href="https://bitbucket.org/francishillman/avrolessons/src/main/">here</a>.</p>
<p>Finally, I want to point to some other links which might be useful:</p>
<ul>
<li><a href="https://www.baeldung.com/java-apache-avro">Baeldung tutorial</a></li>
<li><a href="https://gist.github.com/davideicardi/e8c5a69b98e2a0f18867b637069d03a9">Gist with examples</a> of using Generic-Encoder/-Decoder and a reference to Scala's <a href="https://github.com/sksamuel/avro4s">Avro4s</a> library</li>
</ul>
<div class="footnote">
<hr>
<ol>
<li id="fn:1">
<p>I think this holds for LSP implementations too, but I haven't checked. <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">&#8617;</a></p>
</li>
</ol>
</div>On Spark, MySQL, and Timezones2021-02-27T23:51:00+01:002021-02-27T23:51:00+01:00Francis Hillmantag:hillman.dev,2021-02-27:/onsparkmysqltimezones.html<p>TIL very little about Spark timezone handling</p><p>Proper time handling in data can be hard. On the surface it seems like an easy problem and
in many cases there are straightforward solutions which work most of the time. But really
"most of the time" is not enough.</p>
<p>I saw a post on <a href="https://news.ycombinator.com/item?id=26282742">timezone handling in Python on Hacker News</a> and was reminded of the different
Python libraries there are for handling timestamps with timezones. In the Java world there
are plenty of projects which still rely on the old <a href="https://www.joda.org/joda-time/">Joda Time</a>, although the newer <em>java.time</em> packages in Java 8 make that dependency unnecessary in most cases. That doesn't mean we don't have to watch out for common date, time, and timezone issues in the JVM world.</p>
<p>One such example is when using Spark SQL. Spark is older than the <em>java.time</em> API, and it also needs to integrate completely with JDBC. It is therefore important to double-check all time-handling code in Spark. There are many StackOverflow posts about time-data handling in Spark; some are very useful, others lead to more problems. In the past I have had to deal with such problems and always found a solution which worked well enough in a lot of tests for that particular project.</p>
<p>This week I was once again faced with such a problem, and I wanted to note down (today-I-learned style) some of the unexpected results I found. I did this in the hope of finding a more structured way of handling dates in Spark. I did not find such a solution, but I ended up with a reference to fall back on when working with this particular combination of technologies in the future.</p>
<h2>Technologies used</h2>
<p>First I want to go over the software used and some important links. There is Spark; I am using a relatively recent version, 3.0. Luckily, Databricks (creators of Spark) have published a blog post about <a href="https://databricks.com/blog/2020/07/22/a-comprehensive-look-at-dates-and-timestamps-in-apache-spark-3-0.html">using timestamps in Spark</a>. It already notes that a full <em>timestamp with timezone</em> type is not supported in Spark, and that <em>timestamp without timezone</em> can be handled by using a timestamp with a UTC session timezone in Spark. There is also information available for <a href="https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html">timestamps in the Databricks workspace</a>, where they go a bit more into detail about how they rely on the JVM's handling of time and restate SQL's timestamp definitions.</p>
<p>The database from which I am extracting data is a MySQL database (Docker mysql:8.0.21 to be exact). MySQL timestamps with a timezone seem to rely on the session timezone and are always stored in UTC. A tutorial can be found <a href="https://www.mysqltutorial.org/mysql-timestamp.aspx">here</a>, and documentation <a href="https://dev.mysql.com/doc/refman/8.0/en/datetime.html">here</a>.</p>
<p>Finally the JDBC driver used by Spark in this case is the <code>mysql-connector-java:8.0.23</code>.</p>
<h2>The setting</h2>
<p>I load some generated data into MySQL using a Python script (see appendix). I either leave the timezone setting at its default, set the connection specifically to <code>UTC</code>, or set it specifically to <code>UTC+01:00</code>.</p>
<p>For retrieving the data in Spark I use a</p>
<div class="highlight"><pre><span></span><code><span class="n">sparkSession</span><span class="p">.</span><span class="n">read</span>
<span class="w"> </span><span class="p">.</span><span class="n">format</span><span class="p">(</span><span class="s">"jdbc"</span><span class="p">)</span>
<span class="w"> </span><span class="p">....</span><span class="w"> </span><span class="c1">// Settings</span>
<span class="w"> </span><span class="p">.</span><span class="n">load</span><span class="p">()</span>
</code></pre></div>
<p>For filtering the selection I use one of three methods. The main method I used was this one, which I think is quite common:</p>
<div class="highlight"><pre><span></span><code><span class="kd">val</span><span class="w"> </span><span class="n">startDateTime</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"2020-02-02 00:00:00"</span>
<span class="kd">val</span><span class="w"> </span><span class="n">endDateTime</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"2020-02-03 00:00:00"</span>
<span class="n">dataFrame</span><span class="p">.</span><span class="n">where</span><span class="p">(</span>
<span class="w"> </span><span class="p">(</span><span class="n">$</span><span class="s">"event_time"</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">lit</span><span class="p">(</span><span class="n">startDateTime</span><span class="p">))</span><span class="w"> </span><span class="n">and</span><span class="w"> </span><span class="p">(</span><span class="n">$</span><span class="s">"event_time"</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="n">lit</span><span class="p">(</span>
<span class="w"> </span><span class="n">endDateTime</span><span class="p">)))</span>
</code></pre></div>
<p>One potential alternative is using <code>java.sql.Timestamp</code> values.</p>
<div class="highlight"><pre><span></span><code><span class="n">dataFrame</span><span class="p">.</span><span class="n">where</span><span class="p">(</span>
<span class="w"> </span><span class="p">(</span><span class="n">$</span><span class="s">"event_time"</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="nc">Timestamp</span>
<span class="w"> </span><span class="p">.</span><span class="n">valueOf</span><span class="p">(</span><span class="n">startDateTime</span><span class="p">))</span><span class="w"> </span><span class="n">and</span><span class="w"> </span><span class="p">(</span><span class="n">$</span><span class="s">"event_time"</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="nc">Timestamp</span><span class="p">.</span><span class="n">valueOf</span><span class="p">(</span>
<span class="w"> </span><span class="n">endDateTime</span><span class="p">)))</span>
</code></pre></div>
<p>Finally we can also change the load query.</p>
<div class="highlight"><pre><span></span><code><span class="c1">// Inside the jdbc load settings</span>
<span class="n">sparkSession</span><span class="p">.</span><span class="n">read</span>
<span class="w"> </span><span class="p">.</span><span class="n">format</span><span class="p">(</span><span class="s">"jdbc"</span><span class="p">)</span>
<span class="w"> </span><span class="p">....</span><span class="w"> </span><span class="c1">// other settings</span>
<span class="w"> </span><span class="p">.</span><span class="n">option</span><span class="p">(</span><span class="s">"query"</span><span class="p">,</span><span class="w"> </span><span class="s">"SELECT * FROM ts_table2 WHERE event_time >= '2020-02-02 00:00:00+01:00' AND event_time < '2020-02-03 00:00:00+01:00'"</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">load</span><span class="p">()</span>
</code></pre></div>
<p>In this example we only care about the <code>event_time</code> (a timestamp column) and <code>event_value</code>, an increasing unique integer. I write the data out to CSV files, but I have tested some of the results by writing to Parquet files and reading them with <a href="https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html#pandas.read_parquet">Pandas</a> + <a href="https://arrow.apache.org/docs/python/parquet.html">Pyarrow's parquet reader</a>.</p>
<h2>(Un-)surprising outcomes</h2>
<p>First off, it is better to explicitly set the session timezone if you rely on timezones in MySQL. But in case you did not, and you wanted to use MySQL's system timezone, it could be surprising that Spark will add a timezone based on the settings used by Spark.</p>
<p>If I leave everything at its default, I get as the first row <code>2020-02-02T00:00:00.000+01:00,1440</code>. Yet if I set <code>spark.sql.session.timeZone=UTC</code> and <code>-Duser.timezone=UTC</code>, I get a return value of <code>2020-02-02T00:00:00.000Z,1440</code>. These are fundamentally two different points in time. This means my Spark timezone setting influences the very data I will have in my output. Note that these are also the results if I explicitly set the session timezone in the Python script to <code>UTC</code>.</p>
<p>OK, so I set the timezone in my Python script to <code>+01:00</code> (the current offset in Germany). Now when I query the database, it matters what I set my session timezone to. If I apply the filter in SQL in the same session, I get back as the first row <code>2020-02-02 00:00:00,1440</code>. If I set my session timezone to UTC, I get the same row back by going back one hour: <code>2020-02-01 23:00:00,1440</code>.</p>
<p>Now if I use Spark to load this data without changing the settings, I get back the row <code>2020-02-02T00:00:00.000+01:00,1500</code>. The <code>1500</code> shows that the actual row is the one at <code>2020-02-02T00:00:00.00Z</code>, but that the <code>+01:00</code> timezone was added after loading in the data filtered at UTC time. I also get this same result if I set the user/session timezone to <code>Europe/Berlin</code> explicitly.</p>
<p>Loading with <code>spark.sql.session.timeZone=UTC</code> and <code>-Duser.timezone=UTC</code> showed the data correctly as <code>2020-02-02T00:00:00.000Z,1500</code> (but filtered in <code>UTC</code>, of course). In this case it does not matter which type of filter I apply, and I can even filter using the load query with a filter of <code>event_time >= '2020-02-02 00:00:00Z'</code>.</p>
<p>Curiously, if I set <code>spark.sql.session.timeZone=Europe/Berlin</code> and <code>-Duser.timezone=UTC</code>, I get the correct value in my current timezone: <code>2020-02-02T00:00:00.000+01:00,1440</code>. But I am not sure if this is behavior I can rely on across Spark database and file sources.</p>
<p>Another weird result appeared when I left my Spark and JVM settings at their defaults but tried to filter in the query using a timezone. The query filter was <code>event_time >= '2020-02-02 00:00:00+01:00'</code>. The first row in the output looked as follows: <code>2020-02-01T23:00:00.000+01:00,1440</code>. The <code>1440</code> shows that the right row was retrieved, but somehow the <code>event_time</code> is not correct anymore.</p>
<h2>What about Datetime</h2>
<p>MySQL also has support for Datetime columns. These explicitly do not have a timezone, and setting a session timezone does not influence them at all.</p>
<p>In my tests the filter was always correctly applied to these datetime objects, but the actual values were represented as timestamps in Spark. This means that if I load them with default settings a value looks like <code>2020-02-02T00:00:00.000+01:00</code>, and if I load them with the user/session timezone set to UTC, it looks like <code>2020-02-02T00:00:00.000Z</code>, which is kind of a shame.</p>
<h2>Parquet sources</h2>
<p>I also quickly tested writing some Parquet files with Pandas and Pyarrow, and then filtering those with Spark. Here everything worked as expected. When I set Spark & JVM to <code>UTC</code>, the filter was correctly applied in <code>UTC</code> time, and when set to <code>Europe/Berlin</code> it was correctly applied and represented in the data in <code>+01:00</code>.</p>
<h2>Conclusion</h2>
<p>In a roundabout way this led me to a conclusion I have read a <a href="http://wrschneider.github.io/2019/09/01/timezones-parquet-redshift.html">couple of times before</a>: it is best to let Spark work with time data in UTC. Using UTC for all dates might help in making dates comparable, but sadly <a href="https://zachholman.com/talk/utc-is-enough-for-everyone-right">it is no panacea</a>. If on the operational side it makes more sense to work with a custom (non-UTC, <a href="https://cr.yp.to/proto/utctai.html">non-unixtime</a>) way of storing timezone data, then it needs to be solved in a bespoke way during processing in Spark.</p>
<h2>Appendix</h2>
<h3>MySQL Data Loader</h3>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">random</span>
<span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span><span class="p">,</span> <span class="n">timedelta</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">mysql.connector</span>
<span class="n">start_date</span> <span class="o">=</span> <span class="s2">"2020-02-01 00:00"</span>
<span class="n">end_date</span> <span class="o">=</span> <span class="s2">"2020-02-04 00:00"</span>
<span class="n">table_name</span> <span class="o">=</span> <span class="s2">"dt_table1"</span>
<span class="n">set_connection_timezone</span> <span class="o">=</span> <span class="kc">False</span>
<span class="n">set_to_utc</span> <span class="o">=</span> <span class="kc">False</span>
<span class="nb">print</span><span class="p">(</span><span class="n">start_date</span><span class="p">,</span> <span class="n">end_date</span><span class="p">,</span> <span class="n">table_name</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="n">set_connection_timezone</span><span class="p">),</span> <span class="nb">str</span><span class="p">(</span><span class="n">set_to_utc</span><span class="p">))</span>
<span class="n">conn</span> <span class="o">=</span> <span class="n">mysql</span><span class="o">.</span><span class="n">connector</span><span class="o">.</span><span class="n">connect</span><span class="p">(</span>
<span class="n">host</span><span class="o">=</span><span class="s2">"localhost"</span><span class="p">,</span>
<span class="n">database</span><span class="o">=</span><span class="s2">"tztest"</span><span class="p">,</span>
<span class="n">port</span><span class="o">=</span><span class="mi">3306</span><span class="p">,</span>
<span class="n">user</span><span class="o">=</span><span class="s2">"<username>"</span><span class="p">,</span>
<span class="n">password</span><span class="o">=</span><span class="s2">"<password>"</span><span class="p">,</span>
<span class="p">)</span>
<span class="k">if</span> <span class="n">set_connection_timezone</span><span class="p">:</span>
<span class="k">if</span> <span class="n">set_to_utc</span><span class="p">:</span>
<span class="n">conn</span><span class="o">.</span><span class="n">time_zone</span> <span class="o">=</span> <span class="s2">"+00:00"</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">conn</span><span class="o">.</span><span class="n">time_zone</span> <span class="o">=</span> <span class="s2">"+01:00"</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Timezone: "</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">conn</span><span class="o">.</span><span class="n">time_zone</span><span class="p">))</span>
<span class="n">minute_range</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">start_date</span><span class="p">,</span> <span class="n">end_date</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="s2">"datetime64[m]"</span><span class="p">)</span>
<span class="n">vals</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">minute_range</span><span class="p">))</span>
<span class="n">query</span> <span class="o">=</span> <span class="s2">"INSERT INTO </span><span class="si">{}</span><span class="s2"> (event_time,event_count) VALUES(</span><span class="si">%s</span><span class="s2">,</span><span class="si">%s</span><span class="s2">)"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">table_name</span><span class="p">)</span>
<span class="n">cursor</span> <span class="o">=</span> <span class="n">conn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
<span class="k">if</span> <span class="n">set_connection_timezone</span><span class="p">:</span>
<span class="k">if</span> <span class="n">set_to_utc</span><span class="p">:</span>
<span class="n">init_command</span><span class="o">=</span><span class="s2">"SET SESSION time_zone='+00:00'"</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">init_command</span><span class="o">=</span><span class="s2">"SET SESSION time_zone='+01:00'"</span>
<span class="n">cursor</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="n">init_command</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Session timezone set"</span><span class="p">)</span>
<span class="n">insert_list</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">i</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">def</span> <span class="nf">execute_inserts</span><span class="p">():</span>
<span class="n">cursor</span><span class="o">.</span><span class="n">executemany</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">insert_list</span><span class="p">)</span>
<span class="n">conn</span><span class="o">.</span><span class="n">commit</span><span class="p">()</span>
<span class="k">for</span> <span class="p">(</span><span class="n">ts</span><span class="p">,</span> <span class="n">val</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">minute_range</span><span class="p">,</span> <span class="n">vals</span><span class="p">):</span>
<span class="n">dt</span> <span class="o">=</span> <span class="n">ts</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="n">datetime</span><span class="p">)</span><span class="o">.</span><span class="n">strftime</span><span class="p">(</span><span class="s2">"%Y-%m-</span><span class="si">%d</span><span class="s2"> %H:%M:%S"</span><span class="p">)</span>
<span class="n">insert_list</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">dt</span><span class="p">,</span> <span class="nb">int</span><span class="p">(</span><span class="n">val</span><span class="p">)))</span>
<span class="n">i</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">if</span> <span class="n">i</span> <span class="o">%</span> <span class="mi">30</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="n">execute_inserts</span><span class="p">()</span>
<span class="n">insert_list</span> <span class="o">=</span> <span class="p">[]</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"."</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span> <span class="n">flush</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">execute_inserts</span><span class="p">()</span>
<span class="n">cursor</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
<span class="n">conn</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
</code></pre></div>Feature Caching Redis2020-02-02T18:22:00+01:002020-02-02T18:22:00+01:00Francis Hillmantag:hillman.dev,2020-02-02:/sparkredisconnect.html<p>Some approaches for moving data from Spark to Redis</p><p>Hello there blog, it has been too long. I've been in America (for Disrupt and to visit our San Diego office) and worked on a bunch of projects in the meantime, but I want to share some useful info on putting preprocessed machine-learning features from Spark into Redis. I am still experimenting with different solutions, but here are some options I'm considering.</p>
<p>Say there is an ML pipeline that needs to go to production, and a bunch of the feature-processing data can be prepared beforehand. It would be best to put these features in some form of cache, close to where the inference will take place. An in-memory hash map is one solution, but it requires repopulating on each run. Another common solution is saving the features in a Redis cache.</p>
<h3>#1</h3>
<p>Luckily Redis has helped a bit here with the <a href="https://github.com/RedisLabs/spark-redis">spark-redis</a> connector, which enables us to upload a DataFrame directly into Redis. Given a Redis instance running on <em>localhost</em> at port 6379:</p>
<div class="highlight"><pre><span></span><code><span class="kd">val</span><span class="w"> </span><span class="n">spark</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nc">SparkSession</span><span class="p">.</span><span class="n">builder</span>
<span class="w"> </span><span class="p">.</span><span class="n">appName</span><span class="p">(</span><span class="s">"Uploader number 1"</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">master</span><span class="p">(</span><span class="s">"local"</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">config</span><span class="p">(</span><span class="s">"spark.redis.host"</span><span class="p">,</span><span class="w"> </span><span class="s">"localhost"</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">config</span><span class="p">(</span><span class="s">"spark.redis.port"</span><span class="p">,</span><span class="w"> </span><span class="s">"6379"</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">getOrCreate</span><span class="p">()</span>
<span class="k">import</span><span class="w"> </span><span class="nn">spark</span><span class="p">.</span><span class="nn">implicits</span><span class="p">.</span><span class="n">_</span>
<span class="n">dfToUpload</span><span class="p">.</span><span class="n">write</span>
<span class="w"> </span><span class="p">.</span><span class="n">format</span><span class="p">(</span><span class="s">"org.apache.spark.sql.redis"</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">option</span><span class="p">(</span><span class="s">"table"</span><span class="p">,</span><span class="w"> </span><span class="s">"tablename"</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">option</span><span class="p">(</span><span class="s">"key.column"</span><span class="p">,</span><span class="w"> </span><span class="s">"nameKeyColumn"</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">save</span><span class="p">()</span>
</code></pre></div>
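<p>As an aside (and a handy way to check what actually landed in Redis): spark-redis can also read such a table back into a DataFrame. A minimal sketch, assuming the same session and table name as above:</p>
<div class="highlight"><pre><span></span><code>// Read the table written above back from Redis
val dfBack = spark.read
  .format("org.apache.spark.sql.redis")
  .option("table", "tablename")
  .option("key.column", "nameKeyColumn") // Restore the key column
  .load()
dfBack.show()
</code></pre></div>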
<p>Writing through the DataFrame API is by far the easiest method, and because each row is stored as a Redis hash, the receiving side can fetch individual columns (hash fields) from a row. The <a href="https://redis.io/commands/hgetall">hgetall</a> command can be used to get all the features in one go.</p>
<p>(Using <a href="https://github.com/debasishg/scala-redis">Scala-redis</a>)</p>
<div class="highlight"><pre><span></span><code><span class="kd">val</span><span class="w"> </span><span class="n">keyVal</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"tablename:aKeyName"</span>
<span class="kd">val</span><span class="w"> </span><span class="n">r</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="nc">RedisClient</span><span class="p">(</span><span class="s">"localhost"</span><span class="p">,</span><span class="w"> </span><span class="mi">6379</span><span class="p">)</span>
<span class="kd">val</span><span class="w"> </span><span class="n">sredResp</span><span class="p">:</span><span class="w"> </span><span class="nc">Option</span><span class="p">[</span><span class="nc">Map</span><span class="p">[</span><span class="nc">String</span><span class="p">,</span><span class="w"> </span><span class="nc">String</span><span class="p">]]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r</span><span class="p">.</span><span class="n">hgetall</span><span class="p">[</span><span class="nc">String</span><span class="p">,</span><span class="w"> </span><span class="nc">String</span><span class="p">](</span><span class="n">keyVal</span><span class="p">)</span>
<span class="n">r</span><span class="p">.</span><span class="n">close</span><span class="p">()</span>
</code></pre></div>
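<p>Because every column is its own hash field, a single feature can also be fetched without pulling the whole row. A small sketch with scala-redis again (<em>featureColumnA</em> is a hypothetical column name):</p>
<div class="highlight"><pre><span></span><code>val r = new RedisClient("localhost", 6379)
// hget returns None when the key or the field does not exist
val oneFeature: Option[String] = r.hget("tablename:aKeyName", "featureColumnA")
r.close()
</code></pre></div>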
<p>This works well, but it means a lot of converting between strings and whatever representation the feature vector has. In many cases we want to get all the features of a row/sample at once. Something similar can be done using <em>Spark-redis</em>, through its <a href="https://github.com/RedisLabs/spark-redis/blob/master/doc/rdd.md">RDD support</a>. The idea is to serialize the whole sample and store it under a single key, so it can be retrieved in one call. The downside of this approach is that spark-redis works with strings, so the serialized bytes have to be encoded as a string.</p>
<h3>#2</h3>
<p>For this solution I use <a href="https://github.com/EsotericSoftware/kryo">Kryo</a> (the same serializer Spark uses internally) in combination with <a href="https://github.com/twitter/chill">Twitter's chill</a> (I chose this because I want to stay on Scala / JVM on the inference side). One caveat: a Kryo instance itself is not serializable, so outside of local mode you would create it inside the closure (or use chill's KryoPool) rather than on the driver as done below.</p>
<div class="highlight"><pre><span></span><code><span class="kd">val</span><span class="w"> </span><span class="n">spark</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nc">SparkSession</span><span class="p">.</span><span class="n">builder</span>
<span class="w"> </span><span class="p">.</span><span class="n">appName</span><span class="p">(</span><span class="s">"Uploader number 2"</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">master</span><span class="p">(</span><span class="s">"local"</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">config</span><span class="p">(</span><span class="s">"spark.redis.host"</span><span class="p">,</span><span class="w"> </span><span class="s">"localhost"</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">config</span><span class="p">(</span><span class="s">"spark.redis.port"</span><span class="p">,</span><span class="w"> </span><span class="s">"6379"</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">getOrCreate</span><span class="p">()</span>
<span class="k">import</span><span class="w"> </span><span class="nn">spark</span><span class="p">.</span><span class="nn">implicits</span><span class="p">.</span><span class="n">_</span>
<span class="kd">val</span><span class="w"> </span><span class="n">dfToConvert</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dfIn</span><span class="p">.</span><span class="n">as</span><span class="p">[</span><span class="nc">FeatureCaseClass</span><span class="p">]</span>
<span class="kd">val</span><span class="w"> </span><span class="n">instantiator</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="nc">ScalaKryoInstantiator</span>
<span class="n">instantiator</span><span class="p">.</span><span class="n">setRegistrationRequired</span><span class="p">(</span><span class="kc">false</span><span class="p">)</span>
<span class="kd">val</span><span class="w"> </span><span class="n">kryo</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">instantiator</span><span class="p">.</span><span class="n">newKryo</span><span class="p">()</span>
<span class="kd">val</span><span class="w"> </span><span class="n">encoder</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nc">Base64</span><span class="p">.</span><span class="n">getEncoder</span>
<span class="kd">val</span><span class="w"> </span><span class="n">keyedRDD</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dfToConvert</span><span class="p">.</span><span class="n">rdd</span><span class="p">.</span><span class="n">keyBy</span><span class="p">(</span><span class="n">_</span><span class="p">.</span><span class="n">nameKeyColumn</span><span class="p">).</span><span class="n">map</span><span class="p">(</span><span class="n">tup</span><span class="w"> </span><span class="o">=></span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="kd">val</span><span class="w"> </span><span class="n">output</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="nc">Output</span><span class="p">(</span><span class="mi">512</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="mi">1</span><span class="p">)</span><span class="w"> </span><span class="c1">// Guess the side but allows it to grow</span>
<span class="w"> </span><span class="n">kryo</span><span class="p">.</span><span class="n">writeObject</span><span class="p">(</span><span class="n">output</span><span class="p">,</span><span class="w"> </span><span class="n">tup</span><span class="p">.</span><span class="n">_2</span><span class="p">)</span>
<span class="w"> </span><span class="n">tup</span><span class="p">.</span><span class="n">_1</span><span class="w"> </span><span class="o">-></span><span class="w"> </span><span class="n">encoder</span><span class="p">.</span><span class="n">encodeToString</span><span class="p">(</span><span class="n">output</span><span class="p">.</span><span class="n">getBuffer</span><span class="p">)</span>
<span class="p">})</span>
<span class="kd">val</span><span class="w"> </span><span class="n">sc</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">spark</span><span class="p">.</span><span class="n">sparkContext</span>
<span class="n">sc</span><span class="p">.</span><span class="n">toRedisKV</span><span class="p">(</span><span class="n">keyedRDD</span><span class="p">)</span>
</code></pre></div>
<p>Now to get the data back out (this time using <a href="https://github.com/xetorthio/jedis">Jedis</a> and a shared library holding the case-class definition).</p>
<div class="highlight"><pre><span></span><code><span class="kd">val</span><span class="w"> </span><span class="n">keyVal</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"keyname"</span>
<span class="kd">val</span><span class="w"> </span><span class="n">jedis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="nc">Jedis</span><span class="p">(</span><span class="s">"localhost"</span><span class="p">)</span>
<span class="kd">val</span><span class="w"> </span><span class="n">instantiator</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="nc">ScalaKryoInstantiator</span>
<span class="n">instantiator</span><span class="p">.</span><span class="n">setRegistrationRequired</span><span class="p">(</span><span class="kc">false</span><span class="p">)</span>
<span class="kd">val</span><span class="w"> </span><span class="n">kryo</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">instantiator</span><span class="p">.</span><span class="n">newKryo</span><span class="p">()</span>
<span class="kd">val</span><span class="w"> </span><span class="n">decoder</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nc">Base64</span><span class="p">.</span><span class="n">getDecoder</span>
<span class="kd">val</span><span class="w"> </span><span class="n">jedResp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">jedis</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">keyVal</span><span class="p">)</span>
<span class="k">if</span><span class="p">(</span><span class="n">jedResp</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="kc">null</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">println</span><span class="p">(</span><span class="s">"Couldn't find key"</span><span class="p">)</span>
<span class="w"> </span><span class="c1">// etc.</span>
<span class="p">}</span>
<span class="kd">val</span><span class="w"> </span><span class="n">decodedBytes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">decoder</span><span class="p">.</span><span class="n">decode</span><span class="p">(</span><span class="n">jedResp</span><span class="p">)</span>
<span class="kd">val</span><span class="w"> </span><span class="n">input</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="nc">Input</span><span class="p">(</span><span class="n">decodedBytes</span><span class="p">)</span>
<span class="kd">val</span><span class="w"> </span><span class="n">dataBack</span><span class="w"> </span><span class="o">=</span>
<span class="w"> </span><span class="n">kryo</span><span class="p">.</span><span class="n">readObject</span><span class="p">(</span><span class="n">input</span><span class="p">,</span><span class="w"> </span><span class="k">classOf</span><span class="p">[</span><span class="n">datatype</span><span class="p">.</span><span class="nc">FeatureCaseClass</span><span class="p">])</span>
</code></pre></div>
<p>This bundles the features neatly together, but it is not particularly efficient because of the extra <em>Base64</em> encoding/decoding step (spark-redis stores values as strings, so raw bytes are out).</p>
<p>It is also possible to leave the <em>Spark-redis</em> connector aside and push the samples directly with a plain client library (like Jedis), which lets us store raw bytes.</p>
<h3>#3</h3>
<div class="highlight"><pre><span></span><code><span class="kd">val</span><span class="w"> </span><span class="n">spark</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nc">SparkSession</span><span class="p">.</span><span class="n">builder</span>
<span class="w"> </span><span class="p">.</span><span class="n">appName</span><span class="p">(</span><span class="s">"Uploader number 3"</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">master</span><span class="p">(</span><span class="s">"local"</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">config</span><span class="p">(</span><span class="s">"spark.redis.host"</span><span class="p">,</span><span class="w"> </span><span class="s">"localhost"</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">config</span><span class="p">(</span><span class="s">"spark.redis.port"</span><span class="p">,</span><span class="w"> </span><span class="s">"6379"</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">getOrCreate</span><span class="p">()</span>
<span class="k">import</span><span class="w"> </span><span class="nn">spark</span><span class="p">.</span><span class="nn">implicits</span><span class="p">.</span><span class="n">_</span>
<span class="kd">val</span><span class="w"> </span><span class="n">dfToConvert</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dfIn</span><span class="p">.</span><span class="n">as</span><span class="p">[</span><span class="nc">FeatureCaseClass</span><span class="p">]</span>
<span class="kd">val</span><span class="w"> </span><span class="n">instantiator</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="nc">ScalaKryoInstantiator</span>
<span class="n">instantiator</span><span class="p">.</span><span class="n">setRegistrationRequired</span><span class="p">(</span><span class="kc">false</span><span class="p">)</span>
<span class="kd">val</span><span class="w"> </span><span class="n">kryo</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">instantiator</span><span class="p">.</span><span class="n">newKryo</span><span class="p">()</span>
<span class="kd">val</span><span class="w"> </span><span class="n">dfToUpload</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dfToConvert</span><span class="p">.</span><span class="n">map</span><span class="p">(</span><span class="n">fcc</span><span class="w"> </span><span class="o">=></span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="kd">val</span><span class="w"> </span><span class="n">output</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="nc">Output</span><span class="p">(</span><span class="mi">512</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="w"> </span><span class="n">kryo</span><span class="p">.</span><span class="n">writeObject</span><span class="p">(</span><span class="n">output</span><span class="p">,</span><span class="w"> </span><span class="n">fcc</span><span class="p">)</span>
<span class="w"> </span><span class="n">fcc</span><span class="p">.</span><span class="n">nameKeyColumn</span><span class="p">.</span><span class="n">getBytes</span><span class="w"> </span><span class="o">-></span><span class="w"> </span><span class="n">output</span><span class="p">.</span><span class="n">getBuffer</span>
<span class="p">})</span>
<span class="kd">val</span><span class="w"> </span><span class="n">results</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dfToUpload</span><span class="p">.</span><span class="n">mapPartitions</span><span class="p">(</span><span class="n">pairList</span><span class="w"> </span><span class="o">=></span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="kd">val</span><span class="w"> </span><span class="n">jedis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="nc">Jedis</span><span class="p">(</span><span class="s">"localhost"</span><span class="p">)</span><span class="w"> </span><span class="c1">// Every spark partition its own client</span>
<span class="w"> </span><span class="n">pairList</span><span class="p">.</span><span class="n">map</span><span class="p">(</span><span class="n">pair</span><span class="w"> </span><span class="o">=></span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">jedis</span><span class="p">.</span><span class="n">set</span><span class="p">(</span><span class="n">pair</span><span class="p">.</span><span class="n">_1</span><span class="p">,</span><span class="w"> </span><span class="n">pair</span><span class="p">.</span><span class="n">_2</span><span class="p">)</span>
<span class="w"> </span><span class="p">})</span>
<span class="p">})</span>
<span class="c1">// Force evaluation through writing return info</span>
<span class="n">results</span><span class="p">.</span><span class="n">write</span><span class="p">.</span><span class="n">mode</span><span class="p">(</span><span class="nc">SaveMode</span><span class="p">.</span><span class="nc">Overwrite</span><span class="p">).</span><span class="n">csv</span><span class="p">(</span><span class="s">"upload_info"</span><span class="p">)</span>
</code></pre></div>
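<p>One possible refinement I have not shown above: issuing a separate round-trip per sample is slow for big DataFrames, and Jedis supports pipelining to batch commands. A sketch of the same partition body using a pipeline (same hypothetical data as before):</p>
<div class="highlight"><pre><span></span><code>val resultsPipelined = dfToUpload.mapPartitions(pairList => {
  val jedis = new Jedis("localhost")
  val pipeline = jedis.pipelined()
  // Queue all sets locally; nothing is sent to Redis yet
  val queued = pairList.map(pair => { pipeline.set(pair._1, pair._2); 1 }).sum
  pipeline.sync() // Flush the queued commands to the server
  jedis.close()
  Iterator(queued) // Number of keys written by this partition
})
</code></pre></div>
<p>As before, nothing happens until <code>resultsPipelined</code> is actually evaluated (for example by writing it out or counting it).</p>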
<p>And now for reading the output (with Jedis again).</p>
<div class="highlight"><pre><span></span><code><span class="kd">val</span><span class="w"> </span><span class="n">keyVal</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"keyname"</span>
<span class="kd">val</span><span class="w"> </span><span class="n">jedis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="nc">Jedis</span><span class="p">(</span><span class="s">"localhost"</span><span class="p">)</span>
<span class="kd">val</span><span class="w"> </span><span class="n">instantiator</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="nc">ScalaKryoInstantiator</span>
<span class="n">instantiator</span><span class="p">.</span><span class="n">setRegistrationRequired</span><span class="p">(</span><span class="kc">false</span><span class="p">)</span>
<span class="kd">val</span><span class="w"> </span><span class="n">kryo</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">instantiator</span><span class="p">.</span><span class="n">newKryo</span><span class="p">()</span>
<span class="kd">val</span><span class="w"> </span><span class="n">jedResp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">jedis</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">keyVal</span><span class="p">.</span><span class="n">getBytes</span><span class="p">)</span>
<span class="k">if</span><span class="p">(</span><span class="n">jedResp</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="kc">null</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">println</span><span class="p">(</span><span class="s">"Couldn't find key"</span><span class="p">)</span>
<span class="w"> </span><span class="c1">// etc.</span>
<span class="p">}</span>
<span class="kd">val</span><span class="w"> </span><span class="n">input</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="nc">Input</span><span class="p">(</span><span class="n">jedResp</span><span class="p">)</span>
<span class="kd">val</span><span class="w"> </span><span class="n">dataBack</span><span class="w"> </span><span class="o">=</span>
<span class="w"> </span><span class="n">kryo</span><span class="p">.</span><span class="n">readObject</span><span class="p">(</span><span class="n">input</span><span class="p">,</span><span class="w"> </span><span class="k">classOf</span><span class="p">[</span><span class="n">datatype</span><span class="p">.</span><span class="nc">FeatureCaseClass</span><span class="p">])</span>
</code></pre></div>Sorting csv's externally2019-09-10T20:30:00+02:002019-09-10T20:30:00+02:00Francis Hillmantag:hillman.dev,2019-09-10:/sorting.html<p>I made a tool for externally sorting csv files and I should know better.</p><p>Some time ago, I ran into the problem of having a very large comma-separated-value file that I had to sort. There are many good solutions to this simple problem, and many ways of going about it. The better ones are probably loading the file into <a href="https://www.sqlite.org">Sqlite</a> or a docker container with <a href="https://hub.docker.com/_/postgres">Postgresql</a> and using SQL, or adapting the data with a quick Python script so that <a href="https://en.wikipedia.org/wiki/Sort_(Unix)">Unix's sort</a> can handle it.</p>
<p>Instead, a friend at the time advised <a href="https://spark.apache.org/">Spark</a>. And although Spark does have methods for larger-than-memory datasets, the data needs to be well partitioned and it is easy to get wrong. It failed spectacularly<sup id="fnref:spark"><a class="footnote-ref" href="#fn:spark">1</a></sup>. I knew this shouldn't be a hard problem, and because a lot of the surrounding code was already in Python, I looked around and easily found a Python external sorter. It was not the quickest, but it got the job done.</p>
<p>But it did leave me wondering. Although <a href="https://pandas.pydata.org/">Pandas' csv parser</a> is quite optimized (and lets you toggle error reporting on faulty csv lines), there is still a Python performance penalty somewhere. I also wanted to experiment with a bit of Scala coding (and not just Spark-flavoured Scala). So I made <a href="https://github.com/FransHeuvelmans/Exort">my own external csv sorter in Scala</a>. By using Univocity's parsing library, I hope to end up with a reasonably quick external sorter that behaves predictably even when presented with bad lines. I still need to properly test it, learn the library, and learn how to work nicely with Scala; it turns out that working around <em>type erasure</em> in pattern matching while keeping things properly generic is not that easy. Hopefully I will learn a bit more about these things by working on this project.</p>
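<p>For reference, the core of such a tool is plain external merge sort. Here is a minimal sketch of the idea (my own illustration, not code from Exort): it sorts whole lines lexicographically and ignores headers and CSV quoting, which are exactly the awkward parts a real parser like Univocity deals with.</p>
<div class="highlight"><pre><span></span><code>import java.io.{File, PrintWriter}
import scala.collection.mutable
import scala.io.Source

// External merge sort: sort fixed-size chunks in memory, spill each chunk
// to a temporary file, then k-way merge the sorted chunks with a heap.
// (For brevity, Sources are not closed here.)
def externalSort(in: File, out: File, chunkSize: Int = 100000): Unit = {
  val chunkFiles = Source.fromFile(in).getLines().grouped(chunkSize).map { chunk =>
    val tmp = File.createTempFile("exsort", ".tmp")
    val spill = new PrintWriter(tmp)
    chunk.sorted.foreach(spill.println)
    spill.close()
    tmp
  }.toVector
  // PriorityQueue is a max-heap, so reverse the ordering to pop the
  // smallest head line (paired with its chunk index) first
  val readers = chunkFiles.map(f => Source.fromFile(f).getLines())
  val heap = mutable.PriorityQueue.empty[(String, Int)](Ordering.by[(String, Int), String](_._1).reverse)
  readers.zipWithIndex.foreach { case (r, i) => if (r.hasNext) heap.enqueue((r.next(), i)) }
  val writer = new PrintWriter(out)
  while (heap.nonEmpty) {
    val (line, i) = heap.dequeue()
    writer.println(line)
    if (readers(i).hasNext) heap.enqueue((readers(i).next(), i))
  }
  writer.close()
  chunkFiles.foreach(_.delete())
}
</code></pre></div>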
<div class="footnote">
<hr>
<ol>
<li id="fn:spark">
<p>Spark is well suited to many other tasks but is often misused like this <a class="footnote-backref" href="#fnref:spark" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>Using Language Models2019-08-06T23:44:00+02:002019-08-06T23:44:00+02:00Francis Hillmantag:hillman.dev,2019-08-06:/langmodels.html<p>A lot has happened in NLP land. These are a few of my observations and software recommendations.</p><p>In the last year Language Models have changed my approach to working with natural language processing.
Some (relatively) fresh results <a href="https://paperswithcode.com/paper/xlnet-generalized-autoregressive-pretraining">by XLNet</a> show that large transformer-style models work really well for many language ML tasks. On the surface it seems quite similar to training word embeddings<sup id="fnref:wemb"><a class="footnote-ref" href="#fn:wemb">1</a></sup>, but with the advantage of training far more layers & parameters (weights). This is a great boon to any NLP practitioner, as <a href="http://ruder.io/nlp-imagenet/">many</a> have written <a href="https://www.nytimes.com/2018/11/18/technology/artificial-intelligence-language.html">about</a>. And in the first half of 2019, the pace of progress hasn't slowed down.</p>
<p>There is a downside to this: easily replicating or retraining your own models from scratch is becoming increasingly hard, often requiring tricks like accumulating gradients, recomputing some weights during the backward pass, or simply leaving parts of a model frozen. Saving up for a while to buy a Nvidia 1080 Ti, only to find you still cannot train many of these models, is a bummer (luckily there is the free <a href="https://colab.research.google.com">colab</a>, but it feels bad to depend on that).
A more serious problem is gauging how much a certain architecture and particular training choices really contribute to the final result.
Anna Rogers has written <a href="https://hackingsemantics.xyz/2019/leaderboards/">a very good piece</a> about this. Another problem is that I do not really know what data has been fed to the model.</p>
<p>But leaving all these sidenotes aside: language models help enormously, and I would like to show two libraries that make working with them incredibly easy.</p>
<h2>Zalando's <a href="https://github.com/zalandoresearch/flair">Flair</a></h2>
<p>Flair is, together with ULMFit, one of the older RNN-style language models. This might be the reason why it is not SOTA anymore, but it still performs very well, and the library is incredibly easy to use, with a built-in model downloader, training support, and all kinds of tooling around their models. They also have a <a href="https://github.com/zalandoresearch/flair#tutorials">simple tutorial</a> which I can recommend if you want to get started quickly.</p>
<p>To show how easy it is, here is a short sample using their pre-trained sequence model, which predicts NER tags.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">flair.data</span> <span class="kn">import</span> <span class="n">Sentence</span>
<span class="kn">from</span> <span class="nn">flair.models</span> <span class="kn">import</span> <span class="n">SequenceTagger</span>
<span class="n">sentence</span> <span class="o">=</span> <span class="n">Sentence</span><span class="p">(</span><span class="s1">'German Chancellor Angela Merkel and British Primeminister Boris Johnson .....'</span><span class="p">)</span>
<span class="n">tagger</span> <span class="o">=</span> <span class="n">SequenceTagger</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s1">'ner'</span><span class="p">)</span>
<span class="n">tagger</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">sentence</span><span class="p">)</span>
<span class="k">for</span> <span class="n">entity</span> <span class="ow">in</span> <span class="n">sentence</span><span class="o">.</span><span class="n">get_spans</span><span class="p">(</span><span class="s1">'ner'</span><span class="p">):</span>
<span class="nb">print</span><span class="p">(</span><span class="n">entity</span><span class="p">)</span>
</code></pre></div>
<p>And here is how to train our own, stacking multiple layers of embeddings.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">flair.datasets</span>
<span class="kn">from</span> <span class="nn">flair.models</span> <span class="kn">import</span> <span class="n">SequenceTagger</span>
<span class="kn">from</span> <span class="nn">flair.trainers</span> <span class="kn">import</span> <span class="n">ModelTrainer</span>
<span class="kn">from</span> <span class="nn">flair.embeddings</span> <span class="kn">import</span> <span class="n">FlairEmbeddings</span><span class="p">,</span> <span class="n">WordEmbeddings</span><span class="p">,</span> <span class="n">StackedEmbeddings</span>
<span class="n">corpus</span> <span class="o">=</span> <span class="n">flair</span><span class="o">.</span><span class="n">datasets</span><span class="o">.</span><span class="n">WIKINER_ENGLISH</span><span class="p">()</span>
<span class="n">ner_dict</span> <span class="o">=</span> <span class="n">corpus</span><span class="o">.</span><span class="n">make_tag_dictionary</span><span class="p">(</span><span class="s1">'ner'</span><span class="p">)</span>
<span class="n">embedding_types</span> <span class="o">=</span> <span class="p">[</span>
<span class="n">WordEmbeddings</span><span class="p">(</span><span class="s1">'glove'</span><span class="p">),</span>
<span class="n">FlairEmbeddings</span><span class="p">(</span><span class="s1">'news-forward-fast'</span><span class="p">),</span>
<span class="p">]</span>
<span class="n">embeddings</span> <span class="o">=</span> <span class="n">StackedEmbeddings</span><span class="p">(</span><span class="n">embeddings</span><span class="o">=</span><span class="n">embedding_types</span><span class="p">)</span>
<span class="n">tagger</span> <span class="o">=</span> <span class="n">SequenceTagger</span><span class="p">(</span>
<span class="n">hidden_size</span><span class="o">=</span><span class="mi">256</span><span class="p">,</span>
<span class="n">embeddings</span><span class="o">=</span><span class="n">embeddings</span><span class="p">,</span>
<span class="n">tag_dictionary</span><span class="o">=</span><span class="n">ner_dict</span><span class="p">,</span>
<span class="n">tag_type</span><span class="o">=</span><span class="s2">"ner"</span><span class="p">,</span>
<span class="n">use_crf</span><span class="o">=</span><span class="kc">True</span>
<span class="p">)</span>
<span class="n">trainer</span> <span class="o">=</span> <span class="n">ModelTrainer</span><span class="p">(</span><span class="n">tagger</span><span class="p">,</span> <span class="n">corpus</span><span class="p">)</span>
<span class="n">trainer</span><span class="o">.</span><span class="n">train</span><span class="p">(</span>
<span class="s1">'resources/taggers/example-ner'</span><span class="p">,</span>
<span class="n">learning_rate</span><span class="o">=</span><span class="mf">0.1</span><span class="p">,</span>
<span class="n">mini_batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span>
<span class="n">max_epochs</span><span class="o">=</span><span class="mi">150</span>
<span class="p">)</span>
</code></pre></div>
<p>There are even built-in <a href="https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_6_CORPUS.md#reading-your-own-sequence-labeling-dataset">ColumnCorpus and CSVCorpus</a> and ClassificationCorpus dataset loaders; the last one loads <a href="https://github.com/facebookresearch/fastText/blob/master/README.md#text-classification">fasttext</a>-style inputs (fasttext itself is also quite a useful toolkit to have).</p>
<h2>Huggingface's <a href="https://github.com/huggingface/pytorch-transformers">pytorch-transformers</a></h2>
<p>This is a well-known reimplementation of modern BERT and XLNet architectures in Pytorch (the originals came from Google, in Tensorflow). Not only are the models pretty easy to work with, the pre-trained weights are also available on pytorch hub and can be downloaded and used with built-in tools.</p>
<p>Using a pretrained network as the front part of your own network can be as easy as this:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">from</span> <span class="nn">pytorch_transformers</span> <span class="kn">import</span> <span class="n">BertModel</span><span class="p">,</span> <span class="n">BertTokenizer</span>
<span class="n">pretrained_modelname</span> <span class="o">=</span> <span class="s2">"bert-base-uncased"</span>
<span class="c1"># This will download the model if it has not be found in user storage</span>
<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">BertTokenizer</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">pretrained_modelname</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">BertModel</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">pretrained_modelname</span><span class="p">)</span>
<span class="n">encoded_text</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s2">"Enter the text you need the embeddings from here"</span><span class="p">)</span>
<span class="n">input_tensor</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="n">encoded_text</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Input tensor: "</span><span class="p">,</span> <span class="n">input_tensor</span><span class="p">)</span>
<span class="k">with</span> <span class="n">torch</span><span class="o">.</span><span class="n">no_grad</span><span class="p">():</span>
<span class="n">output_tuple</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">input_tensor</span><span class="p">)</span>
<span class="n">last_hidden_states</span> <span class="o">=</span> <span class="n">output_tuple</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Last hidden states: "</span><span class="p">,</span> <span class="n">last_hidden_states</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Shape (size): "</span><span class="p">,</span> <span class="n">last_hidden_states</span><span class="o">.</span><span class="n">size</span><span class="p">())</span>
</code></pre></div>
<p>Now, this is already a great solution for the many cases where simply using these embeddings as input features gives better results (especially compared to classic word-vector embeddings). And, as the Flair example shows, it gives room to experiment with combining embeddings (simply concatenating them per word).</p>
<p>But there is more. There are built-in models, ready for <a href="https://github.com/huggingface/pytorch-transformers/blob/master/pytorch_transformers/modeling_bert.py#L1109">word classification</a> (like NER tagging) and generic <a href="https://github.com/huggingface/pytorch-transformers/blob/master/pytorch_transformers/modeling_xlnet.py#L1076">text classification</a>, plus example run-scripts in the examples folder.</p>
<p>To extend the <code>run_glue.py</code> (and <code>util_glue.py</code>) model training scripts to run generic text classification problems, I added some code to <code>util_glue.py</code>:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">ImdbProcessor</span><span class="p">(</span><span class="n">DataProcessor</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""Processor for the special IMDB dataset."""</span>
<span class="k">def</span> <span class="nf">get_train_examples</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data_dir</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""See base class."""</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">"LOOKING AT </span><span class="si">{}</span><span class="s2">"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">data_dir</span><span class="p">,</span> <span class="s2">"imdbtrain.tsv"</span><span class="p">)))</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">_create_examples</span><span class="p">(</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_read_tsv</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">data_dir</span><span class="p">,</span> <span class="s2">"imdbtrain.tsv"</span><span class="p">)),</span> <span class="s2">"train"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">get_dev_examples</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data_dir</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""See base class."""</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">_create_examples</span><span class="p">(</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_read_tsv</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">data_dir</span><span class="p">,</span> <span class="s2">"imdbtest.tsv"</span><span class="p">)),</span> <span class="s2">"dev"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">get_labels</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""Still same classes luckily."""</span>
<span class="k">return</span> <span class="p">[</span><span class="s2">"0"</span><span class="p">,</span> <span class="s2">"1"</span><span class="p">]</span>
<span class="k">def</span> <span class="nf">_create_examples</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">lines</span><span class="p">,</span> <span class="n">set_type</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""Creates examples for the training and dev sets."""</span>
<span class="n">examples</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">lines</span><span class="p">:</span>
<span class="n">guid</span> <span class="o">=</span> <span class="s2">"</span><span class="si">%s</span><span class="s2">-</span><span class="si">%s</span><span class="s2">"</span> <span class="o">%</span> <span class="p">(</span><span class="n">set_type</span><span class="p">,</span> <span class="n">line</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">text_a</span> <span class="o">=</span> <span class="n">line</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="n">text_b</span> <span class="o">=</span> <span class="kc">None</span>
<span class="n">label</span> <span class="o">=</span> <span class="n">line</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span>
<span class="n">examples</span><span class="o">.</span><span class="n">append</span><span class="p">(</span>
<span class="n">InputExample</span><span class="p">(</span><span class="n">guid</span><span class="o">=</span><span class="n">guid</span><span class="p">,</span> <span class="n">text_a</span><span class="o">=</span><span class="n">text_a</span><span class="p">,</span> <span class="n">text_b</span><span class="o">=</span><span class="n">text_b</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="n">label</span><span class="p">))</span>
<span class="k">return</span> <span class="n">examples</span>
<span class="n">processors</span> <span class="o">=</span> <span class="p">{</span>
<span class="o">...</span>
<span class="s2">"imdb"</span><span class="p">:</span> <span class="n">ImdbProcessor</span><span class="p">,</span>
<span class="p">}</span>
<span class="n">output_modes</span> <span class="o">=</span> <span class="p">{</span>
<span class="o">...</span>
<span class="s2">"imdb"</span><span class="p">:</span> <span class="s2">"classification"</span><span class="p">,</span>
<span class="p">}</span>
<span class="n">GLUE_TASKS_NUM_LABELS</span> <span class="o">=</span> <span class="p">{</span>
<span class="o">...</span>
<span class="s2">"imdb"</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div>
<p>In this case I named it IMDB to make it work with the IMDB classification dataset. Now it can work on a dataset with tab-separated values<sup id="fnref:tabexplain"><a class="footnote-ref" href="#fn:tabexplain">2</a></sup>.</p>
<h2>Other libraries</h2>
<p>Some libraries I am watching but haven't tested yet.</p>
<p><strong><a href="https://spacy.io/usage/v2-1#pretraining">Spacy embedding pretraining</a></strong>: I use Spacy quite often for fast text cleaning/mangling and for creating rules matchers based on regex in combination with NER tags. Previously I have tested and used their build-in NER training and classification modules. My bet is that this will be just as great.</p>
<p><strong><a href="https://github.com/deepset-ai/FARM">Deepset FARM</a></strong>: By the folks who also released a German trained BERT model. Looks very neatly done and is build ontop of pytorch_transformers.</p>
<p><strong><a href="https://nlp.johnsnowlabs.com/">JohnSnow NLP</a></strong>: A colleague of mine tried to do NLP on SPARK some time ago and used this library. He checked it out again this week and saw a bunch of new features. I am not completely sold on the distributed text processing just yet, but I do want to try it out.</p>
<p>There is a lot more to say, but I am going to leave it at this for now. Hopefully this will get you interested in trying out these great techniques and digging deeper into their examples and source code.</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:wemb">
<p>Word embeddings can still be used, although most models now learn embeddings for subword units. See <a href="https://github.com/google/sentencepiece">sentencepiece</a> or <a href="https://github.com/openai/gpt-2/blob/master/src/encoder.py">GPT-2's encoder</a> <a class="footnote-backref" href="#fnref:wemb" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:tabexplain">
<p>In this case col 0: id, col 1: text, col 2: class labels → 0 or 1 <a class="footnote-backref" href="#fnref:tabexplain" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
</ol>
</div>Blogpost 0012019-07-11T16:20:00+02:002019-07-11T16:20:00+02:00Francis Hillmantag:hillman.dev,2019-07-11:/post-001.html<p>About this new blog of mine</p><p>Hey, welcome to my new blog. This is the third time I've started a blog and the second time with my own domain. I haven't used this site since the launch of the <em>.dev</em> domain, which should give an indication of how active I am.</p>
<p>I plan to use this blog to write about small or big things that I've learned and would like to share. Maybe nobody will ever visit it, but even then I hope to learn to write a bit better and to remind myself of the things I've learned.</p>
<p>For this first post I do not have much to mention other than <a href="https://hillman.dev/pages/personal-projects.html">my little projects</a>, and that I dig <a href="https://www.fast.ai/">fastai</a> and <a href="https://spacy.io/usage/spacy-101">spacy</a> for making ML and NLP more approachable. I think we are still some way off from automating everything with AI, but there is a lot of low-hanging fruit in simple decisions made from written information, where quick advice would speed things up.</p>