Brett.is

Thinking about transactional email

Thu, 14 Sep 2017 00:00:00 +0000

I have recently started working on a new project, Mailchemy, for managing all of your transactional email integrations behind one easy to use API.

For a while now I have been thinking about the complications of transactional email. In the past I have found myself wanting to change email service providers, either for pricing or reliability reasons, but I’ve found myself stuck. I had integrated my application so heavily into their product that any kind of move away from their platform became difficult. You can’t fault me for making this choice: email applications require deep integration in order to truly leverage all of the benefits of these services.

When integrating with an email service provider, it would have been preferable to abstract out as much of the integration as possible, making it easier to swap out email providers. Since we had not abstracted away the API calls in our code, we found ourselves having to track down every use of that specific API and update it for the new provider. Given how tightly coupled sending transactional email is with our core application, these calls were spread throughout the code base, and checking that we had removed and tested every usage was both time-consuming and risky. What if we missed a spot and weren’t sending an important message to our customers?
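As a rough illustration of the kind of abstraction described above, here is a minimal Python sketch. It is not code from the project discussed here; the names `EmailProvider`, `RecordingProvider` and `send_welcome_email` are hypothetical and only show the shape of the idea.

```python
from abc import ABC, abstractmethod


class EmailProvider(ABC):
    """Hypothetical interface that application code depends on."""

    @abstractmethod
    def send(self, to, subject, body):
        """Send one message and return a provider-specific message id."""


class RecordingProvider(EmailProvider):
    """Stand-in provider that records messages instead of sending them."""

    def __init__(self):
        self.sent = []

    def send(self, to, subject, body):
        self.sent.append({"to": to, "subject": subject, "body": body})
        return "msg-%d" % len(self.sent)


def send_welcome_email(provider, address):
    # Application code only knows about EmailProvider, so swapping
    # providers means changing which object is passed in here.
    return provider.send(address, "Welcome!", "Thanks for signing up.")


provider = RecordingProvider()
message_id = send_welcome_email(provider, "user@example.com")
```

With this shape, a real provider would be one more subclass, and the call sites spread through the code base would not need to change.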

In addition to deep integration with an email service provider for sending emails, we were also storing a bunch of our email templates on the provider’s platform. This was a nice feature that meant we didn’t have to deploy a code change to update the copy in an email, enabling non-technical team members to make changes to our emails easily. However, this nice feature also locked us into the provider further and made it harder for us to extract and move all of those templates to another platform. To have more ownership of our email templates across platforms, we started moving some of the templates into our code base, but this introduced its own set of problems: shuffling templates around between providers and code is time-consuming and means there is another part of the email pipeline you have to manage yourself. We also lost the ability for our non-technical team members to update templates without a code change.

Since we were a small team, we couldn’t justify the loss of flexibility and increased effort that making these changes would incur; we were bootstrapped and had more important things to work on. So we were stuck with the initial email provider we picked.

This work problem got me thinking more about transactional email. It seems most people follow a similar basic process:

Generate the email body from a template
Optimize the email (inline CSS, minify HTML, etc)
Send the email
Create analytics or reports about the emails being sent
Track open/click rates for emails
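To make the steps above a little more concrete, here is a small Python sketch of the first three as a chain of functions. Everything in it is an assumption for illustration: the function names, the message dictionary layout, and the whitespace-collapsing "optimization" are made up, and the `send` step only records the recipient instead of calling a real provider.

```python
from string import Template


def render(message):
    # Step 1: generate the body from a template.
    body = Template(message["template"]).substitute(message["context"])
    return dict(message, body=body)


def minify(message):
    # Step 2: a deliberately naive "optimization" that collapses whitespace.
    return dict(message, body=" ".join(message["body"].split()))


sent_log = []


def send(message):
    # Step 3: stand-in for a real provider call; the log could feed reports.
    sent_log.append(message["to"])
    return message


def run_pipeline(message, steps):
    # Each step takes a message dict and returns a new one.
    for step in steps:
        message = step(message)
    return message


result = run_pipeline(
    {
        "to": "user@example.com",
        "template": "<p>Hello   $name</p>\n",
        "context": {"name": "Ada"},
    },
    [render, minify, send],
)
```

Because each step has the same signature, steps can be added, removed or reordered without touching the others.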

There are a lot of really good transactional email providers, most of which offer some form of all of these steps. However, they don’t all offer the same features as each other, or the exact features or configurations that someone will want. As I discussed above, sometimes you don’t want to lock yourself into a given platform because of its features, so you end up abstracting and managing the features you want yourself anyway.

In software we talk a lot about DRY: don’t repeat yourself. We use abstraction and open source to leverage the time of our community and avoid re-inventing the wheel, so why aren’t we doing this more for transactional emails?

I have started working on a new project, Mailchemy, which aims to help simplify transactional email. Configure and manage all of your favorite email integrations in one place and use them with one simple API call.

Mailchemy is the product that I wish I had when I was configuring transactional email at previous companies. By using Mailchemy to handle all of your email integrations and calls to your email provider, your initial integration and any later changes become easier to manage. Changing an email provider can happen in just a few clicks instead of the countless hours spent hunting down that one last usage.

Mailchemy allows you to define your own custom transactional email pipelines, full of your favorite integrations, and use the email provider of your choice. You will no longer have to decide which email provider to use based on the features they offer.

Your provider has no support for Pug? No problem: if Mailchemy supports it, you can use it.

Want to change email providers? Take all of your favorite integrations with you.

As of the writing of this article Mailchemy is still under active development. Please visit https://mailchemy.com/ for more information and to subscribe for product updates.

Managing Go dependencies with git-subtree

Wed, 03 Feb 2016 00:00:00 +0000

Recently I have decided to make the switch to using git-subtree for managing dependencies of my Go projects.

For a while now I have been searching for a good way to manage dependencies for my Go projects. I think I have finally found a workflow that I really like, and it uses git-subtree.

When I began investigating different ways to manage dependencies I had a few small goals or concepts I wanted to follow.

Keep it simple

I have always been drawn to the simplicity of Go and the tools that surround it. I didn’t want to add a lot of overhead or complexity to my workflow when programming in Go.

Vendor dependencies

I decided right away that I wanted to vendor my dependencies, that is, have all of my dependencies live under a top level vendor/ directory in each repository.

This also means that I wanted to set the GO15VENDOREXPERIMENT=1 environment variable, which is what tells the Go 1.5 toolchain to look in vendor/ directories.

Maintain the full source code of each dependency in each repository

The idea here is that each project will maintain the source code for each of its dependencies instead of having a dependency manifest file, like package.json or Godeps.json, to manage the dependencies.

This was more of an acceptance than a decision. It wasn’t a hard requirement that each repository maintains the full source code for each of its dependencies, but I was willing to accept that as a byproduct of a good workflow.

In comes git-subtree

When researching methods of managing dependencies with git, I came across a great article from Atlassian, The power of Git subtree, which outlined how to use git-subtree for managing repository dependencies… exactly what I was looking for!

The main idea with git-subtree is that it is able to fetch a full repository and place it inside of your repository. However, it differs from git-submodule because it does not create a link/reference to a remote repository. Instead, it fetches all the files from that remote repository, places them under a directory in your repository, and then treats them as though they are part of your repository (there is no additional .git directory).

If you pair git-subtree with its --squash option, it will squash the remote repository down to a single commit before pulling it into your repository.

As well, git-subtree has the ability to issue a pull to update a child repository.

Let’s take a look at how using git-subtree would work.

Adding a new dependency

We want to add a new dependency, github.com/miekg/dns, to our project.

git subtree add --prefix vendor/github.com/miekg/dns https://github.com/miekg/dns.git master --squash

This command will pull in the full repository for github.com/miekg/dns at master to vendor/github.com/miekg/dns.

And that is it. git-subtree will have created two commits for you: one for the squash of github.com/miekg/dns and another for adding it as a child repository.

Updating an existing dependency

If you want to then update github.com/miekg/dns you can just run the following:

git subtree pull --prefix vendor/github.com/miekg/dns https://github.com/miekg/dns.git master --squash

This command will again pull down the latest version of master from github.com/miekg/dns (assuming it has changed) and create two commits for you.

Using tags/branches/commits

git-subtree also works with tags, branches, or commit hashes.

Say we want to pull in a specific version of github.com/brettlangdon/forge which uses tags to manage versions.

git subtree add --prefix vendor/github.com/brettlangdon/forge https://github.com/brettlangdon/forge.git v0.1.5 --squash

And then, if we want to update to a later version, v0.1.7, we can just run the following:

git subtree pull --prefix vendor/github.com/brettlangdon/forge https://github.com/brettlangdon/forge.git v0.1.7 --squash

Making it all easier

I really like using git-subtree, a lot, but the syntax is a little cumbersome. The Atlassian article I mentioned earlier suggests adding git aliases to make using git-subtree easier.

I decided to take this one step further and write a git command, git-vendor, to help manage subtree dependencies.

I won’t go into much detail here since it is outlined in the repository as well as at https://brettlangdon.github.io/git-vendor/, but the project’s goal was to make working with git-subtree easier for managing Go dependencies. Mainly, I wanted to be able to add subtrees and give them a name, to list all current subtrees, and to update a subtree by name rather than by repo + prefix path.

Here is a quick preview:

$ git vendor add forge https://github.com/brettlangdon/forge v0.1.5
$ git vendor list
forge@v0.1.5:
    name:   forge
    dir:    vendor/github.com/brettlangdon/forge
    repo:   https://github.com/brettlangdon/forge
    ref:    v0.1.5
    commit: 4c620b835a2617f3af91474875fc7dc84a7ea820
$ git vendor update forge v0.1.7
$ git vendor list
forge@v0.1.7:
    name:   forge
    dir:    vendor/github.com/brettlangdon/forge
    repo:   https://github.com/brettlangdon/forge
    ref:    v0.1.7
    commit: 0b2bf8e484ce01c15b87bbb170b0a18f25b446d9

Why not…

Godep/<package manager here>

I decided early on that I did not want to “deal” with a package manager unless I had to. This is not to say that there is anything wrong with godep or any of the other currently available package managers out there. I just wanted to keep the workflow simple and as close as possible to what Go supports with respect to vendored dependencies.

git-submodule

I have been asked why not git-submodule, and I think anyone who has had to work with git-submodule will agree that it isn’t really the best option out there. It isn’t as though it cannot get the job done, but the extra workflow needed when working with submodules is a bit of a pain, mostly when working on a project with multiple contributors, or with contributors who are either not aware that the project is using submodules or who have never worked with them before.

Something else?

This isn’t the end of my search. I will always be keeping a lookout for new and different ways to manage my dependencies. However, this is by far my favorite so far. If anyone has any suggestions, please feel free to leave a comment.

Write code every day

Thu, 02 Jul 2015 00:00:00 +0000

Just like a poet or an athlete, practicing code every day will only make you better.

Lately I have been trying to get into blogging more and any article I read always says, “you need to write every day”. It doesn’t matter if what I write down gets published, but forming the habit of trying to write something every day is what counts. The more I write the easier it will become, the more natural it will feel and the better I will get at it.

This really isn’t just true of writing or blogging, it is something that can be said of anything at all. Riding a bike, playing basketball, reading, cooking or absolutely anything at all. The more you do it, the easier it will become and the better you will get.

As the title of the post alludes to, this is also true of programming. If you want to be really good at programming you have to write code every day. The more code you write the easier it’ll be to write and the better you will be at programming. Just like any other task I’ve listed in this article, trying to write code every day, even if you are used to it, can be really hard to do and a really hard habit to keep.

“What should I write?” The answer to this question is going to be different for everyone, but it is the hurdle which you must first overcome to work your way towards writing code every day. Usually people write code to solve problems that they have, but not everyone has problems to solve. There is usually a chicken and the egg problem. You need to write code to have coding problems, and you need to have coding problems to have something to write. So, where should you start?

For myself, one of the things I like doing is to rewrite things that already exist. Sometimes it can be hard to come up with a new and different idea or even a new approach to an existing idea. However, there are millions of existing projects out there to copy. The idea I go for is to try and replicate the overall goal of the project, but in my own way. That might mean writing it in a different language, or changing the API for it or just taking some wacky new approach to solving the same issue.

More times than not the above exercise leads me to a problem that I can then go off and solve. For example, a few weeks ago I sat down and decided I wanted to write a web server in Go (think nginx/apache). I knew going into the project I wanted a really nice and easy to use configuration file to define the settings. So, I did what most people do these days and used JSON, but that didn’t really feel right to me. I then tried YAML, but yet again it didn’t feel like what I wanted. I probably could have used ini format and made custom rules for the keys and values, but again, this is hacky. This spawned a new project to solve the problem I was having, which ended up being forge: a hand coded configuration file syntax and parser for Go that turned out to be a neat mix between JSON and nginx configuration file syntax.

Anywho, enough of me trying to self promote projects. The main point is that by trying to replicate something that already exists, without really trying to do anything new, I came up with an idea which spawned another project and for at least a week (and continuing now) gave me a reason to write code every day. Not only did I write something useful that I can now use in any future project of mine, I also learned something I did not know before. I learned how to hand code a syntax parser in Go.

Ultimately, try to take “coding every day” not as a challenge to write something useful every day, but to learn something new every day. Learn part of a new language, a new framework, learn how to take something apart or put it back together. Write code every day and learn something new every day. The more you do this, the more you will learn and the better you will become.

Go forth and happy coding. :)

Forge configuration parser

Sat, 27 Jun 2015 00:00:00 +0000

An overview of how I wrote a configuration file format and parser.

Recently I have finished the initial work on a project, forge, which is a configuration file syntax and parser written in Go. I was working on a project where I was trying to determine what configuration language I wanted to use, and whether I tested out YAML or JSON or ini, nothing really felt right. What I really wanted was a format similar to nginx, but I couldn’t find any existing packages for Go which supported this syntax. A-ha, I smell an opportunity.

I have always been interested in programming languages, in their design and implementation. I have always wanted to write my own programming language, but since I have never had any formal education around the subject I have always gone about it on my own. I bring it up because this project has some similarities. You have a defined syntax that gets parsed into some sort of intermediate format. The part that is missing is where the intermediate format is then translated into machine or byte code and actually executed. Since this is just a configuration language, that is not necessary.

Project overview

You can see the repository for forge for current usage and documentation.

A forge file is made up of directives. There are 3 kinds of directives:

settings: Which are in the form <KEY> = <VALUE>
sections: Which are used to group more directives <SECTION-NAME> { <DIRECTIVES> }
includes: Used to pull in settings from other forge config files include <FILENAME/GLOB>

Forge also supports various types of setting values:

string: key = "some value";
bool: key = true;
integer: key = 5;
float: key = 5.5;
null: key = null;
reference: key = some_section.key;

Most of these setting types are probably fairly self explanatory except for reference. A reference in forge is a way to have the value of one setting be a pointer to another setting. For example:

global = "value";
some_section {
  key = "some_section.value";
  global_ref = global;
  local_ref = .key;
  ref_key = ref_section.ref_key;
}
ref_section {
  ref_key = "hello";
}

In this example we see 3 examples of references. A reference value is an identifier (global), possibly multiple identifiers separated with a period (ref_section.ref_key), and it can also begin with a period (.key). Every reference which is not prefixed with a period is resolved from the global section (the outermost level). So in this example a reference to global will point to the value "value" and ref_section.ref_key will point to the value "hello". A local reference is one which is prefixed with a period; those are resolved starting from the section that the setting is defined in. So in this case, local_ref will point to the value "some_section.value".

That is a rough idea of how forge files are defined, so let’s see a quick example of how you can use it from Go.
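The resolution rules just described can be sketched in a few lines of Python over plain nested dictionaries. This is only an illustration of the rules, not how forge itself is implemented; the `resolve` function and its arguments are hypothetical.

```python
def resolve(config, reference, current_section):
    """Resolve a forge-style reference against nested dicts."""
    if reference.startswith("."):
        # Local reference: start from the section containing the setting.
        scope, path = current_section, reference[1:]
    else:
        # Everything else is resolved from the outermost (global) section.
        scope, path = config, reference
    for name in path.split("."):
        scope = scope[name]
    return scope


config = {
    "global": "value",
    "some_section": {"key": "some_section.value"},
    "ref_section": {"ref_key": "hello"},
}
section = config["some_section"]

global_ref = resolve(config, "global", section)
local_ref = resolve(config, ".key", section)
ref_key = resolve(config, "ref_section.ref_key", section)
```

A missing name simply raises `KeyError` here; a real implementation would want a friendlier error.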

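package main

import (
    "fmt"
    "log"

    "github.com/brettlangdon/forge"
)

func main() {
    // Parse the config file; stop early if it cannot be read or parsed.
    settings, err := forge.ParseFile("example.cfg")
    if err != nil {
        log.Fatal(err)
    }

    // Read an existing setting.
    if settings.Exists("global") {
        value, _ := settings.GetString("global")
        fmt.Println(value)
    }

    // Add a new setting.
    settings.SetString("new_key", "new_value")

    // Export the settings as a map or as JSON.
    settingsMap := settings.ToMap()
    fmt.Println(settingsMap["new_key"])

    jsonBytes, _ := settings.ToJSON()
    fmt.Println(string(jsonBytes))
}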

How it works

Let’s dive in and take a quick look at the parts that make forge work.

Example config file:

# Top comment
global = "value";
section {
  a_float = 50.67;
  sub_section {
    a_null = null;
    a_bool = true;
    a_reference = section.a_float;  # Gets replaced with `50.67`
  }
}

Basically what forge does is take a configuration file in a defined format and parse it into what is essentially a map[string]interface{}. The code itself is composed of two main parts, the tokenizer (or scanner) and the parser. The tokenizer turns the raw source code (like above) into a stream of tokens. If you printed the token representation of the code above, it could look like:

(COMMENT, "Top comment")
(IDENTIFIER, "global")
(EQUAL, "=")
(STRING, "value")
(SEMICOLON, ";")
(IDENTIFIER, "section")
(LBRACKET, "{")
(IDENTIFIER, "a_float")
(EQUAL, "=")
(FLOAT, "50.67")
(SEMICOLON, ";")
....
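A token stream like the one above could be produced by something along these lines. This is a Python sketch for illustration only: it leans on regular expressions for brevity, whereas forge's scanner is written by hand in Go, and the token names simply mirror the listing above. As a simplification, `true`, `false` and `null` come out as `IDENTIFIER` tokens here.

```python
import re

# Order matters: FLOAT must be tried before INTEGER.
TOKEN_SPEC = [
    ("COMMENT", r"#[^\n]*"),
    ("FLOAT", r"\d+\.\d+"),
    ("INTEGER", r"\d+"),
    ("STRING", r'"[^"]*"'),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("EQUAL", r"="),
    ("SEMICOLON", r";"),
    ("LBRACKET", r"\{"),
    ("RBRACKET", r"\}"),
    ("PERIOD", r"\."),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def tokenize(source):
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError("unexpected %r at offset %d" % (source[pos], pos))
        pos = match.end()
        kind, text = match.lastgroup, match.group()
        if kind == "SKIP":
            continue  # whitespace carries no meaning
        if kind == "COMMENT":
            text = text[1:].strip()  # drop the leading "#"
        elif kind == "STRING":
            text = text[1:-1]  # drop the surrounding quotes
        tokens.append((kind, text))
    return tokens


tokens = tokenize('# Top comment\nglobal = "value";\nsection {\n  a_float = 50.67;\n}\n')
```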

Then the parser takes in this stream of tokens and tries to parse them based on a known grammar. For example, a setting directive is in the form <IDENTIFIER> <EQUAL> <VALUE> <SEMICOLON> (where <VALUE> can be <STRING>, <BOOL>, <INTEGER>, <FLOAT>, <NULL>, <REFERENCE>). When the parser sees <IDENTIFIER> it’ll look ahead to the next token to try and match it to this rule. If it matches, then it knows to add this setting to the internal map[string]interface{} for that identifier. If it doesn’t match anything, then the input has a syntax error and the parser reports it.

The part that I think is interesting is that I opted to just write the tokenizer and parser by hand rather than using a tool that converts a language grammar into a tokenizer and parser (like flex/bison). I have done this before and was inspired to do so after learning that this is how the Go programming language itself is written; you can see this in Go’s own parser.go (not a light read at 2500 lines). The scanner.go and parser.go in forge might prove to be slightly easier reads for those who are interested.
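The look-ahead step described above can be sketched as a tiny recursive parser. This is a Python illustration under several assumptions, not forge's Go implementation: tokens are `(kind, text)` pairs, comments have already been filtered out, and only string, integer and float values are handled.

```python
CONVERT = {"STRING": str, "INTEGER": int, "FLOAT": float}


def parse(tokens):
    """Parse (kind, text) tokens into nested dicts of settings."""
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else "EOF"

    def take(kind):
        nonlocal pos
        if peek() != kind:
            raise SyntaxError("expected %s but found %s" % (kind, peek()))
        pos += 1
        return tokens[pos - 1][1]

    def directives(end):
        section = {}
        while peek() != end:
            name = take("IDENTIFIER")
            if peek() == "EQUAL":
                # <IDENTIFIER> <EQUAL> <VALUE> <SEMICOLON>
                take("EQUAL")
                kind = peek()
                if kind not in CONVERT:
                    raise SyntaxError("unexpected value token %s" % kind)
                section[name] = CONVERT[kind](take(kind))
                take("SEMICOLON")
            else:
                # <IDENTIFIER> { <DIRECTIVES> }
                take("LBRACKET")
                section[name] = directives("RBRACKET")
                take("RBRACKET")
        return section

    return directives("EOF")


tokens = [
    ("IDENTIFIER", "global"), ("EQUAL", "="), ("STRING", "value"), ("SEMICOLON", ";"),
    ("IDENTIFIER", "section"), ("LBRACKET", "{"),
    ("IDENTIFIER", "a_float"), ("EQUAL", "="), ("FLOAT", "50.67"), ("SEMICOLON", ";"),
    ("RBRACKET", "}"),
]
settings = parse(tokens)
```

Any token sequence that fits neither rule ends in a `SyntaxError`, which is this sketch's version of reporting a syntax error.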

Conclusion

That is just a brief overview of the project and a slight dip into its inner workings. I am extremely interested in continuing to learn as much as I can about programming languages and parsers/compilers. I am going to put together a series of blog posts that walk through what I have learned so far and which might help guide the reader through creating something similar to forge.

Enjoy.

What I'm up to these days

Fri, 19 Jun 2015 00:00:00 +0000

It has been a while since I have written anything in my blog. Might as well get started somewhere, like a brief summary of what I have been working on lately.

It has been far too long since I last wrote in this blog. I always have these aspirations of writing all the time about all the things I am working on. The problem generally comes back to me not feeling confident enough to write about anything I am working on. “Oh, a post like that probably already exists”, “There are smarter people than me out there writing about this, why bother”. It is an unfortunate feeling to try and get over.

So, here is where I am making an attempt. I will try to write more; it’ll be healthy for me. I always hear of people setting reminders in their calendars to block off time to write blog posts, even if they end up only writing a few sentences, which seems like a great idea that I intend to try.

Ok, enough with the “I haven’t been feeling confident” drivel, on to what I actually have been up to lately.

Since my last post I have a new job. I am now a Senior Software Engineer at underdog.io. We are a small early stage startup (4 employees, just over a year old) that is in the hiring space. For candidates our site basically acts like a common application to now over 150 venture backed startups in New York City or San Francisco. In the short time I have been working there, I have been very impressed and am glad that I took their offer. I work with some awesome and smart people and I am still learning a lot, whether it is about coding or just trying to run a business.

I originally started to end this post by talking about a programming project I have been working on, but that part ended up being 4 times longer than the text above, so I have decided to write a separate post about it instead. Apparently even though I haven’t been writing lately, I have a lot to say.

Thanks for bearing with this “I have to write something” post. I am not going to make a promise that I am going to write more, because it is something that could easily fall through, like it usually does… but I shall give it my all!

Javascript Documentation Generation

Tue, 03 Feb 2015 00:00:00 +0000

I have always been trying to find a good Javascript documentation generator and I have never really been very happy with any that I have found. So I’ve decided to just write my own, DocAST.

The problem I have always had with documentation generators is that they are either hard to theme or very strict about the way doc strings are supposed to be written, making it potentially difficult to switch between documentation generators if you had to. So for a fun exercise I’ve decided to just try writing one myself, DocAST.

What is different about DocAST? I’ve seen a few documentation parsers which use regular expressions to parse out the comment blocks, which works perfectly well, but I’ve decided to have some fun and use AST parsing to grab the code blocks from the scripts. As well, DocAST doesn’t try to force you into any specific theme or display; instead it is used simply to extract documentation from scripts. Lastly, DocAST doesn’t use any specific documentation format for signifying parameters, returns or exceptions. It will traverse the AST of the code block to find them for you, so most of the time you just need to add a simple block comment describing the function above it.

Let’s just get to an example:

// script.js

/*
 * This is my super cool function that does all sorts of cool stuff
 **/
function superCool(arg1, arg2){
    if(arg1 === arg2){
        throw new Exception("arg1 and arg2 cant be the same");
    }

    var sum = arg1 + arg2;
    return sum;
}

$ docast extract ./script.js
$ cat out.json

[
    {
        "name": "superCool",
        "params": [
            "arg1",
            "arg2"
        ],
        "returns": [
            "sum"
        ],
        "raises": [
            "Exception"
        ],
        "doc": " This is my super cool function that does all sorts of cool stuff\n"
    }
]

For more information check out the github page for DocAST.
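To show the general technique of walking an AST to find parameters, returns and raised exceptions, here is a small Python 3 sketch that uses Python's standard `ast` module on Python source. It is an analogy only: DocAST works on Javascript and its implementation is not shown here, and the `extract_docs` function and sample source are made up for this example.

```python
import ast


def extract_docs(source):
    """Collect name, params, returned names and raised names per function."""
    docs = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.FunctionDef):
            continue
        returns, raises = [], []
        for child in ast.walk(node):
            if isinstance(child, ast.Return) and isinstance(child.value, ast.Name):
                returns.append(child.value.id)
            elif isinstance(child, ast.Raise):
                exc = child.exc
                if isinstance(exc, ast.Call):
                    exc = exc.func  # raise ValueError(...) -> ValueError
                if isinstance(exc, ast.Name):
                    raises.append(exc.id)
        docs.append({
            "name": node.name,
            "params": [arg.arg for arg in node.args.args],
            "returns": returns,
            "raises": raises,
            "doc": ast.get_docstring(node),
        })
    return docs


SOURCE = '''
def super_cool(arg1, arg2):
    """This is my super cool function."""
    if arg1 == arg2:
        raise ValueError("arg1 and arg2 cant be the same")
    total = arg1 + arg2
    return total
'''
docs = extract_docs(SOURCE)
```

The result has the same shape as the JSON above: no special doc string syntax is needed because the names come from the code itself.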

The other benefit I have found with a documentation parser (something that just extracts the documentation information as opposed to trying to build it) is that you can get fun and creative with how you use the parsed information. For example, I’ve had someone suggest writing your doc strings as YAML. When you parse out the string, just parse the YAML to get an object which is then easy to pass on to Jade or some other templating engine to generate your documentation. If you want to see an example of this, check out the doc strings in DocAST’s own source (https://github.com/brettlangdon/docast/blob/master/lib/index.js#L127), the generated documentation at http://brettlangdon.github.io/docast/, and the code used to generate it at https://github.com/brettlangdon/docast/tree/master/docs.

Python Redis Queue Workers

Tue, 14 Oct 2014 00:00:00 +0000

Learn an easy, distributed approach to processing jobs from a Redis queue in Python.

Recently I started thinking about a new project. I want to write my own Continuous Integration (CI) server. I know what you are thinking… “Why?!” and yes I agree, there are a bunch of good ones out there now, I just want to do it. The first problem I came across was how to have distributed workers to process the incoming builds for the CI server. I wanted something that was easy to start up on multiple machines and that needed minimal configuration to get going.

The design is relatively simple: there is a main queue which jobs can be pulled from and a second queue that each worker process pushes jobs into to denote processing. The main queue is a list of things that have to be processed, whereas the processing queue is a list of jobs which are currently being processed by the workers. For this example we will be using Redis lists since they support the short feature list we require.

worker.py

Let’s start with the worker process. The job of the worker is simply to grab a job from the queue and process it.

import redis


def process(job_id, job_data):
    # placeholder for the real work; print() with a single argument
    # behaves the same on Python 2 and 3
    print("Processing job id(%s) with data (%r)" % (job_id, job_data))


def main(client, processing_queue, all_queue):
    while True:
        # block until a job id can be popped from `all_queue`,
        # atomically pushing it onto `processing_queue`
        job_id = client.brpoplpush(all_queue, processing_queue)
        if not job_id:
            continue
        # fetch the job data
        job_data = client.hgetall("job:%s" % (job_id, ))
        # process the job
        process(job_id, job_data)
        # cleanup the job information from redis
        client.delete("job:%s" % (job_id, ))
        client.lrem(processing_queue, 1, job_id)


if __name__ == "__main__":
    # decode_responses=True makes job ids come back as text rather
    # than bytes on Python 3, so the "job:<job_id>" keys are built correctly
    client = redis.StrictRedis(decode_responses=True)
    try:
        main(client, "processing:jobs", "all:jobs")
    except KeyboardInterrupt:
        pass

The above script does the following:

1. Try to fetch a job from the queue all:jobs, pushing it to processing:jobs
2. Fetch the job data from a hash key with the name job:<job_id>
3. Process the job information
4. Remove the hash key job:<job_id>
5. Remove the job id from the queue processing:jobs

With this design we will always be able to determine how many jobs are currently queued for processing by looking at the list all:jobs, and we will also know exactly how many jobs are being processed by looking at the list processing:jobs, which contains the job ids that all workers are working on.

Also we are not tied down to running just 1 worker on 1 machine. With this design we can run multiple worker processes on as many nodes as we want, as long as they all have access to the same Redis server. There are a few limitations, all of which are rooted in Redis’ limits on lists, but this should be good enough to get started.

There are a few other approaches that can be taken here as well. Instead of using a single processing queue we could use a separate queue for each worker. Then we can look at which jobs are currently being processed by each individual worker. This approach would also give us the opportunity to have the workers try to fetch from the worker specific queue first before looking at all:jobs, so we can either assign jobs to specific workers or let a worker recover from failed processing by starting with the last job it was working on before failing.
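The per-worker queue idea could look roughly like this. It is a sketch under assumptions, not the worker above or any library's behaviour: `next_job` and the queue name `processing:worker-1` are hypothetical, and `FakeClient` is an in-memory stand-in that mimics only the two list calls used (`lindex` and `brpoplpush`, both of which exist on redis-py clients) so the example runs without a Redis server.

```python
def next_job(client, worker_queue, all_queue):
    """Return the next job id, resuming an unfinished job first."""
    # A job id still sitting in this worker's own queue means the worker
    # stopped before cleaning up, so pick that job up again.
    unfinished = client.lindex(worker_queue, -1)
    if unfinished is not None:
        return unfinished
    # Otherwise move a job over from the shared queue.
    return client.brpoplpush(all_queue, worker_queue)


class FakeClient(object):
    """In-memory stand-in for the two Redis list calls used above."""

    def __init__(self, lists):
        self.lists = lists

    def lindex(self, name, index):
        try:
            return self.lists.get(name, [])[index]
        except IndexError:
            return None

    def brpoplpush(self, src, dst, timeout=0):
        # Unlike Redis, this does not block when the source list is empty.
        if not self.lists.get(src):
            return None
        job_id = self.lists[src].pop()  # pop from the tail of src
        self.lists.setdefault(dst, []).insert(0, job_id)  # push to head of dst
        return job_id


client = FakeClient({"all:jobs": ["2", "1"], "processing:worker-1": ["7"]})
resumed = next_job(client, "processing:worker-1", "all:jobs")
```

Here the worker resumes job "7" before touching all:jobs; once that id is removed from its own queue, the next call pulls "1" from the shared list.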

qw

I have developed the library qw (or QueueWorker) to implement a pattern similar to this, so if you are interested in playing around with it or seeing a more developed implementation, please check out the project’s github page for more information.

Let’s Make a Metrics Beacon

Sun, 22 Jun 2014 00:00:00 +0000

Recently I wrote a simple javascript metrics beacon library. Let me show you what I came up with and how it works.

So, what do I mean by “javascript metrics beacon library”? Think RUM (Real User Monitoring) or Google Analytics: it is a javascript library used to capture/aggregate metrics/data on the client side and send that data to a server, either in one big batch or in small increments.

For those who do not like reading articles and just want the code you can find the current state of my library on github: https://github.com/brettlangdon/sleuth

Before we get into anything technical, let’s just take a quick look at an example usage:

<script type="text/javascript" src="//raw.githubusercontent.com/brettlangdon/sleuth/master/sleuth.min.js"></script>
<script type="text/javascript">
Sleuth.init({
  url: "/track",
});

// static tags to identify the browser/user
// these are sent with each call to `url`
Sleuth.tag('uid', userId);
Sleuth.tag('productId', productId);
Sleuth.tag('lang', navigator.language);

// set some metrics to be sent with the next sync
Sleuth.track('clicks', buttonClicks);
Sleuth.track('images', imagesLoaded);

// manually sync all data
Sleuth.sendAllData();
</script>

Alright, so let’s cover a few concepts from above: tags, metrics and syncing.

Metrics

Metrics are simply data points to track for a given request. Good metrics to record are things like load times, elements loaded on the page, time spent on the page, number of times buttons are clicked or other user interactions with the page.

Syncing

Syncing refers to sending the data from the client to the server. I refer to it as “syncing” since we want to aggregate as much data as possible on the client side and send fewer, but larger, requests rather than having to make a request to the server for each metric we mean to track. We do not want to overload the client if we mean to track a lot of user interactions on the site.

How To Do It

Alright, enough of the simple examples/explanations, let’s dig into the source a bit to find out how to aggregate the data on the client side and how to sync that data to the server.

Aggregating Data

Collecting the data we want to send to the server isn’t too bad. We are just going to take any specific calls to Sleuth.track(key, value) and store them either in LocalStorage or in an object until we need to sync. For example this is the track method of Sleuth:

Sleuth.prototype.track = function(key, value){
  if(this.config.useLocalStorage && window.localStorage !== undefined){
    window.localStorage.setItem('Sleuth:' + key, value);
  } else {
    this.data[key] = value;
  }
};

The only things of note above are that it will fall back to storing in this.data if LocalStorage is not available, and that we namespace all data stored in LocalStorage with the prefix “Sleuth:” to ensure there is no name collision with anyone else using LocalStorage.

Also Sleuth will be kind enough to capture data from window.performance if it is available and enabled (it is by default). And it simply grabs everything it can to sync up to the server:

Sleuth.prototype.captureWindowPerformance = function(){
  if(this.config.performance && window.performance !== undefined){
    if(window.performance.timing !== undefined){
      this.data.timing = window.performance.timing;
    }
    if(window.performance.navigation !== undefined){
      this.data.navigation = {
        redirectCount: window.performance.navigation.redirectCount,
        type: window.performance.navigation.type,
      };
    }
  }
};

For an idea of what is stored in window.performance.timing check out Navigation Timing.

Syncing Data

Ok, so this is really the important part of this library. Collecting the data isn’t hard. In fact, no one probably really needs a library to do that for them, when you can just as easily store a global object to aggregate the data. But why am I making a “big deal” about syncing the data either? It really isn’t too hard when you can just make a simple AJAX call using jQuery $.ajax(...) to ship up a JSON string to some server side listener.

The approach I wanted to take was a little different, yes, by default Sleuth will try to send the data using AJAX to a server side url “/track”, but what about when the server which collects the data lives on another hostname? CORS can be less than fun to deal with, and rather than worrying about any domain security I just wanted a method that can send the data from anywhere I want back to whatever server I want regardless of where it lives. So, how? Simple, javascript pixels.

A javascript pixel is simply a script tag which is written to the page with document.write whose src attribute points to the url that you want to make the call to. The browser will then call that url without using AJAX just like it would with a normal script tag loading javascript. For a more in-depth look at tracking pixels you can read a previous article of mine: Third Party Tracking Pixels.

The point of going with this method is that we get CORS-free GET requests from any client to any server. But some people are probably thinking, “wait, a GET request doesn’t help us send data from the client to server”? This is why we will encode our JSON string of data for the url and simply send in the url as a query string parameter. Enough talk, lets see what this looks like:

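var encodeObject = function(data){
  var query = [];
  for(var key in data){
    // skip properties inherited from the prototype chain
    if(Object.prototype.hasOwnProperty.call(data, key)){
      query.push(encodeURIComponent(key) + '=' + encodeURIComponent(data[key]));
    }
  }

  return query.join('&');
};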

var drop = function(url, data, tags){
  // base64 encode( stringify(data) )
  tags.d = window.btoa(JSON.stringify(data));

  // these parameters are used for cache busting
  tags.n = new Date().getTime();
  tags.r = Math.random() * 99999999;

  // make sure we url encode all parameters
  url += '?' + encodeObject(tags);
  document.write('<sc' + 'ript type="text/javascript" src="' + url + '"></scri' + 'pt>');
};

That is basically it. We simply base64 encode a JSON string version of the data and send it as a query string parameter. There might be a few odd things that stand out above, mainly the url length limitations of a base64 encoded JSON string, the “cache busting” and the weird breaking up of the tag “script”. A safe url length limit to live under is around 2000 characters to accommodate Internet Explorer, which from some very crude testing means each request can hold around 50 or so separate metrics each containing a string value. Cache busting can be read about more in-depth in my article about tracking pixels (http://brett.is/writing/about/third-party-tracking-pixels/#cache-busting), but the short version is, we add random numbers and the current timestamp to the query string to ensure that the browser or cdn or anyone in between doesn’t cache the request being made to the server; this way you will not get any missed metrics calls. Lastly, breaking up the script tag into “sc + ript” and “scri + pt” makes it harder for anyone blocking scripts from writing script tags to detect that a script tag is being written to the DOM (also an img or iframe tag could be used instead of a script tag).
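For the receiving end, here is a minimal sketch in Python of how a server might unpack one of these pixel requests. It is not part of Sleuth; the function names encode_pixel_url and decode_pixel_request are made up for illustration, and it assumes the parameter names d, n and r used by drop() above.

```python
import base64
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def encode_pixel_url(base_url, data, tags):
    """Python mirror of the client-side drop() helper, handy for round-trip tests."""
    params = dict(tags)
    # same scheme as the client: base64( JSON(data) ) in the "d" parameter
    params["d"] = base64.b64encode(json.dumps(data).encode("utf-8")).decode("ascii")
    return base_url + "?" + urlencode(params)


def decode_pixel_request(url):
    """Split a pixel URL into (tags, metrics)."""
    params = parse_qs(urlsplit(url).query)
    # parse_qs returns a list per key; keep the first value of each
    tags = {key: values[0] for key, values in params.items()}
    # "d" carries the metrics; "n" and "r" exist only for cache busting
    encoded = tags.pop("d")
    tags.pop("n", None)
    tags.pop("r", None)
    metrics = json.loads(base64.b64decode(encoded))
    return tags, metrics
```

Whatever is left in tags after the three reserved parameters are removed are the static tags (uid, lang and so on), and metrics is the decoded object of tracked values.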

Unload

How do we know when to send the data? If someone is trying to time and see how much time someone is spending on each page or wants to make sure they are collecting as much data as they want on the client side then you want to wait until the last second before syncing the data to the server. By using LocalStorage to store the data you can ensure that you will be able to access that data the next time you see that user, but who wants to wait? And what if the user never comes back? I want my data now dammit!

Simple, let's bind an event to window.onunload! Woot, done… wait… why isn’t my data being sent to me? Initially I was trying to use window.onunload to sync data back, but found that it didn’t always work with pixel dropping, although AJAX requests worked most of the time. After some digging I found that with window.onunload I was hitting a race condition on whether or not the DOM was still available, meaning I couldn’t use document.write or even query the DOM for more metrics to sync on window.onunload.

In comes window.onbeforeunload to the rescue! For those who don’t know about it (I didn’t before this project), window.onbeforeunload is exactly what it sounds like: an event that gets called before window.onunload, which also means before the DOM gets unloaded. So you can reliably use it to write to the DOM (like the pixels) or to query the DOM for any extra information you want to sync up.

Conclusion

So what do you think? There really isn’t too much to it is there? Especially since we only covered the client side of the piece and haven’t touched on how to collect and interpret this data on the server (maybe that’ll be a follow up post). Again this is mostly a simple implementation of a RUM library, but hopefully it sparks an interest to build one yourself or even just to give you some insight into how Google Analytics or other RUM libraries collect/send data from the client.

I think this project that I undertook was neat because I do not always do client side javascript and every time I do I tend to learn something pretty cool. In this case it was learning the differences between window.onunload and window.onbeforeunload as well as some of the cool things that are tracked by default in window.performance. I definitely urge people to check out the documentation on window.performance.

TODO

What is next for Sleuth? I am not sure yet. I am thinking of implementing more ways of tracking data, like adding counter support, rate limiting, and automatic incremental data syncs. I am open to ideas of how other people would use a library like this, so please leave a comment here or open an issue on the project's GitHub page with any thoughts you have.

Goodbye Grunt, Hello Tend

Mon, 09 Jun 2014 00:00:00 +0000

Recently decided to give Grunt a try, which caused me to write my own node.js build system.

For the longest time I had refused to move away from Makefiles for Grunt or some other node.js build system. But I finally gave in and decided to take an afternoon to give Grunt a go. Initially it seemed promising: Grunt had a plugin for everything and ultimately it supported watching files and directories (the one feature I really wanted for my make build setup).

I tried to move over a fairly simplistic Makefile that I already had written into a Gruntfile. However, after about an hour (or more) of trying to get grunt set up with grunt-cli and all the other plugins installed and configured to do the right thing I realized that Grunt wasn’t for me. I took a simple 10 (ish) line Makefile and turned it into a 40+ line Gruntfile and it still didn’t seem to do exactly what I wanted. What I had to reflect on was why should I spend all this time trying to learn how to configure some convoluted plugins when I already know the correct commands to execute? Then I realized what I really wanted wasn’t a new build system but simply watch for a Makefile.

I have attempted to get some form of watch working with a Makefile in the past, but it usually involves using inotify and I’ve never gotten it working exactly how I wanted. So, I decided to start writing my own system, because, why not spend more time on perfecting my build system? My requirements were fairly simple. I wanted a way to watch a directory/files for changes and, when they change, simply run a single command (ultimately make <target>). I wanted the ability to also run long running processes like node server.js and restart them if certain files have changed. And lastly, unlike other watch based systems I have seen, I wanted a way to run a command as soon as I start up the watch program (so you don't have to start the watching, then go open/save a newline change to a file to get it to build for the first time).

What I came up with was tend, which solves mostly all of my needs, which were simply “watch for make”. So how do you use it?

Installation

npm install -g tend

Usage

Usage:
  tend
  tend <action>
  tend [--restart] [--start] [--ignoreHidden] [--filter <filter>] [<dir> <command>]
  tend (--help | --version)

Options:
  -h --help             Show this help text
  -v --version          Show tend version information
  -r --restart          If <command> is still running when there is a change, stop and re-run it
  -i --ignoreHidden     Ignore changes to files which start with "."
  -f --filter <filter>  Use <filter> regular expression to filter which files trigger the command
  -s --start            Run <command> as soon as tend executes

Example CLI Usage

The following will watch for changes to any js files in the directory ./src/; when any of them change or are added it will run uglifyjs to combine them into a single file.

tend --ignoreHidden --filter "*.js" ./src "uglifyjs -o ./public/main.min.js ./src/*.js"

The following will run a long running process, starting it as soon as tend starts and restarting the program whenever files in ./routes/ have changed.

tend --restart --start --filter "*.js" ./routes "node server.js"

Config File

Instead of running tend commands singly from the command line you can provide tend with a .tendrc file of multiple directories/files to watch with commands to run.

The following .tendrc file is set up to run the same commands as shown above.

; global settings
ignoreHidden=true

[js]
filter=*.js
directory=./src
command=uglifyjs -o ./public/main.min.js ./src/*.js

[app]
filter=*.js
directory=./routes
command=node server.js
restart=true
start=true

You can then simply run tend without any arguments to have tend watch for all changes configured in your .tendrc file.

Running:

tend

Will basically execute:

tend --ignoreHidden --filter "*.js" ./src "uglifyjs -o ./public/main.min.js ./src/*.js" \
  & tend --restart --start --filter "*.js" ./routes "node server.js"
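To make that mapping a little more concrete, here is a rough Python sketch that reads a config like the one above and prints an equivalent command line per target. This is not how tend is implemented; tendrc_to_commands is a made-up name, and it assumes that settings above the first [section] apply to every target (which is why --ignoreHidden shows up on both).

```python
import configparser
import shlex

TENDRC = """\
; global settings
ignoreHidden=true

[js]
filter=*.js
directory=./src
command=uglifyjs -o ./public/main.min.js ./src/*.js

[app]
filter=*.js
directory=./routes
command=node server.js
restart=true
start=true
"""


def tendrc_to_commands(text):
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep camelCase keys such as ignoreHidden
    # assumption: settings above the first [section] are defaults for every target
    parser.read_string("[DEFAULT]\n" + text)
    commands = {}
    for target in parser.sections():
        section = parser[target]
        argv = ["tend"]
        for flag in ("restart", "start", "ignoreHidden"):
            if section.getboolean(flag, fallback=False):
                argv.append("--" + flag)
        if "filter" in section:
            argv += ["--filter", section["filter"]]
        argv += [section["directory"], section["command"]]
        # quote each argument so the result is safe to paste into a shell
        commands[target] = " ".join(shlex.quote(arg) for arg in argv)
    return commands


for target, command in tendrc_to_commands(TENDRC).items():
    print(target, "->", command)
```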

Along with running multiple targets at once, you can run specific targets from a .tendrc file as well, tend <target>.

tend js

Will run the js target once.

tend --ignoreHidden --filter "*.js" ./src "uglifyjs -o ./public/main.min.js ./src/*.js"

With Make

If I haven’t beaten a dead horse enough, I am a Makefile kind of person and that is exactly what I wanted to use this new tool with. So below is an example of a Makefile and its corresponding .tendrc file.

js:
    uglifyjs -o ./public/main.min.js ./src/*.js

app:
    node server.js

.PHONY: js app

ignoreHidden=true

[js]
filter=*.js
directory=./src
command=make js

[app]
filter=*.js
directory=./routes
command=make app
restart=true
start=true

Conclusion

So that is mostly it. Nothing overly exciting and nothing really new here, just another watch/build system written in node to add to the list. For the most part this tool does exactly what I want for now, but if anyone has any ideas on how to make this better or even any other better/easier tools which do similar things please let me know, I am more than willing to continue maintaining this tool.

Sharing Data from PHP to JavaScript

Sun, 16 Mar 2014 00:00:00 +0000

A quick example of how I decided to share dynamic content from PHP with my JavaScript.

So the other day I was refactoring some of the client side code I was working on and came across something like the following:

page.php

<html>
...

<script type="text/javascript">
var modelTitle = "<?=$myModel->getTitle()?>";

// do something with modelTitle
</script>
</html>

There isn’t really anything wrong here, in fact this seems to be a fairly common practice (from the little research I did). So… whats the big deal? Why write an article about it?

My issue with the above is, what if the JavaScript gets fairly large (as mine was)? The ideal thing to do is to move the js into its own file, minify/compress it and serve it from a CDN so it doesn’t affect page load time. But now we have content that needs to be added dynamically from the PHP script in order for the js to run. How do we solve it? The approach that I took, which probably isn’t original at all but I think is neat enough to share, was to let PHP make the data available to the script through window.data.

page.php

<html>
...
<?php
$pageData = array(
    'modelTitle' => $myModel->getTitle(),
);
?>
<script type="text/javascript">
window.data = <?=json_encode($pageData)?>;
</script>
<script type="text/javascript" src="//my-cdn.com/scripts/page-script.min.js"></script>
</html>

page-script.js

// window.data.modelTitle is available for me to use
console.log("My Model Title: " + window.data.modelTitle);

Nothing really fancy, shocking, new or different here, just passing data from PHP to js. Something to note is that we have to have our PHP code set window.data before we load our external script so that window.data will be available when the script loads. This shouldn’t be too much of an issue since most web developers are used to putting all of their script tags at the end of the page.

Some might wonder why I decided to use window.data, why not just set var modelTitle = "<?=$myModel->getTitle()?>";? I think it is better to try and have a convention for where the data from the page will come from. Having to rely on a bunch of global variables being set isn’t really a safe way to write this. What if you overwrite an existing variable or if some other script overwrites your data from the PHP script? This is still a cause for concern with window.data, but at least you only have to keep track of a single variable. As well, I think organizationally it is easier and more concise to have window.data = <?=json_encode($pageData)?>; as opposed to:

var modelTitle = "<?=$myModel->getTitle()?>";
var modelId = "<?=$myModel->getId()?>";
var username = "<?=getCurrentUser()?>";
...

I am sure there are other ways to do this sort of thing, like with AJAX or having an initialization function that PHP calls with the correct variables it needs to pass, etc. This was just what I came up with and the approach I decided to take.

If anyone has other methods of sharing dynamic content between PHP and js, please leave a comment and let me know, I am curious as to what most other devs are doing to handle this.

Cookieless User Tracking

Sat, 30 Nov 2013 00:00:00 +0000

A look into various methods of online user tracking without cookies.

Over the past few months, in my free time, I have been researching various methods for cookieless user tracking. I have a previous article that talks about how to write a tracking server which uses cookies to follow people between requests. However, browsers have recently begun to disallow third party cookies by default, which means developers have to come up with other ways of tracking users.

Browser Fingerprinting

You can use client side javascript to generate a browser fingerprint, that is, a unique identifier for a specific user's browser (since that is what cookies are actually tracking). Once you have the browser’s fingerprint you can then send that id along with any other requests you make.

var user_id = generateBrowserFingerprint();
document.write(
    '<script type="text/javascript" src="/track/user/' + user_id + '"></sc' + 'ript>'
);

Local Storage

Newer browsers come equipped with a feature called local storage , which is used as a simple key-value store accessible through javascript. So instead of relying on cookies as your persistent storage, you can store the user id in local storage instead.

var user_id = localStorage.getItem("user_id");
if(user_id == null){
    user_id = generateNewId();
    localStorage.setItem("user_id", user_id);
}
document.write(
    '<script type="text/javascript" src="/track/user/' + user_id + '"></sc' + 'ript>'
);

This can also be combined with a browser fingerprinting library for generating the new id.

ETag Header

There is a feature of HTTP requests called an ETag Header which can be exploited for the sake of user tracking. The way an ETag works is that when a request is made the server will respond with an ETag header with a given value (usually it is an id for the requested document, or maybe a hash of it). Whenever the browser then makes another request for that document it will send an If-None-Match header with the value of the ETag provided by the server last time. The server can then make a decision as to whether or not new content needs to be served based on the id/hash provided by the browser.

As you may have figured out, instead we can assign a unique user id as the ETag header for a response, then when the browser makes a request for that page again it will send us the user id.

This is useful, except for the fact that we can only provide a single id per user per endpoint. For example, if I use the urls /track/user and /collect/data there is no way for me to get the browser to send the same If-None-Match header for both urls.

Example Server

from uuid import uuid4
from wsgiref.simple_server import make_server


def tracking_server(environ, start_response):
    user_id = environ.get("HTTP_IF_NONE_MATCH")
    if not user_id:
        user_id = uuid4().hex

    start_response("200 Ok", [
        ("ETag", user_id),
    ])
    return [user_id]


if __name__ == "__main__":
    try:
        httpd = make_server("", 8000, tracking_server)
        print "Tracking Server Listening on Port 8000..."
        httpd.serve_forever()
    except KeyboardInterrupt:
        print "Exiting..."
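To see the round trip without starting a server or a browser, here is a small Python 3 sketch that calls a port of the same WSGI app in-process and plays the browser's part by echoing the ETag back as If-None-Match. The request helper is made up for this illustration; real browsers only send If-None-Match when they still hold the response in their cache.

```python
from uuid import uuid4
from wsgiref.util import setup_testing_defaults


def tracking_app(environ, start_response):
    # reuse the id the browser echoes back, otherwise mint a new one
    user_id = environ.get("HTTP_IF_NONE_MATCH") or uuid4().hex
    start_response("200 OK", [("ETag", user_id), ("Content-Type", "text/plain")])
    return [user_id.encode("ascii")]


def request(app, if_none_match=None):
    """Call the WSGI app directly and return (response headers, body)."""
    environ = {}
    setup_testing_defaults(environ)  # fill in the required WSGI keys
    if if_none_match is not None:
        environ["HTTP_IF_NONE_MATCH"] = if_none_match
    captured = {}

    def start_response(status, headers):
        captured["headers"] = dict(headers)

    body = b"".join(app(environ, start_response))
    return captured["headers"], body


# first visit: no If-None-Match, so the server assigns a fresh id
first_headers, _ = request(tracking_app)
# repeat visit: the "browser" replays the ETag and gets the same id back
second_headers, _ = request(tracking_app, if_none_match=first_headers["ETag"])
print(first_headers["ETag"] == second_headers["ETag"])
```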

Redirect Caching

Redirect caching is similar in concept to the ETag tracking method in that we rely on the browser cache to store the user id for us. With redirect caching we have our tracking url /track; when someone goes there we perform a 301 redirect to /<user_id>/track. The user's browser will then cache that 301 redirect and the next time the user goes to /track it will just go to /<user_id>/track instead.

Just like the ETag method we run into an issue where this method really only works for a single endpoint url. We cannot use it for an end all be all for tracking users across a site or multiple sites.

Example Server

from uuid import uuid4
from wsgiref.simple_server import make_server


def tracking_server(environ, start_response):
    if environ["PATH_INFO"] == "/track":
        start_response("301 Moved Permanently", [
            ("Location", "/%s/track" % uuid4().hex),
        ])
    else:
        start_response("200 Ok", [])
    return [""]


if __name__ == "__main__":
    try:
        httpd = make_server("", 8000, tracking_server)
        print "Tracking Server Listening on Port 8000..."
        httpd.serve_forever()
    except KeyboardInterrupt:
        print "Exiting..."
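The server above only makes sense together with the browser's behavior, so here is a toy Python 3 model of both sides. The Browser class is entirely hypothetical and only models the one thing we care about: remembering a 301 so that later visits to /track never reach the server as /track again.

```python
from uuid import uuid4


def tracking_server(path):
    """Return (status, location) the way the example server would."""
    if path == "/track":
        return 301, "/%s/track" % uuid4().hex
    return 200, None


class Browser:
    """Toy browser that remembers permanent redirects."""

    def __init__(self):
        self.redirect_cache = {}
        self.requests_seen_by_server = []

    def visit(self, path):
        # a cached 301 is followed without asking the server again
        path = self.redirect_cache.get(path, path)
        self.requests_seen_by_server.append(path)
        status, location = tracking_server(path)
        if status == 301:
            self.redirect_cache[path] = location
            return self.visit(location)
        return path


browser = Browser()
print(browser.visit("/track"))  # server picks an id and redirects
print(browser.visit("/track"))  # same url again, served from the redirect cache
```

Both visits end up on the same /<user_id>/track url, and the server only ever sees the bare /track request once per browser.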

A project worth noting is Samy Kamkar’s Evercookie which uses standard cookies, flash objects, silverlight isolated storage, web history, etags, web cache, local storage, global storage… and more all at the same time to track users. This library exercises every possible method for storing a user id which makes it a reliable method for ensuring that the id is stored, but at the cost of being very intrusive and persistent.

Other Methods

I am sure there are other methods out there, these are just the few that I decided to focus on. If anyone has any other methods or ideas please leave a comment.

My New Website

Sat, 16 Nov 2013 00:00:00 +0000

Why did I redo my website? What makes it any better? Why are there old posts that are missing?

I just wanted to write a quick post about my new site. Some of you who are not familiar with my site might not notice the difference, but trust me… it is different and for the better.

So what has changed? For starters, I think the new design is a little simpler than the previous, but more importantly it is no longer in Wordpress. It is now maintained with Wintersmith, which is a static site generator built in node.js that uses Jade templates and markdown.

Why is this better? Well for starters I think writing in markdown is a lot easier than using Wordpress. It means I can use whatever text editor I want (emacs in this case) to write my articles. As well, I no longer need to have PHP and MySQL set up in order to just serve up silly static content like blog posts and a few images. This also means I can keep my blog entirely in GitHub.

So far I am fairly happy with the move to Wintersmith, except having to move all my current blog posts over to markdown, but I will slowly keep porting some over until I have them all in markdown. So, please bear with me during the time of transition as there may be a few posts missing when I initially publish this new site.

Check out my blog in GitHub, brett.is.

The Fastest Python JSON Library

Sun, 22 Sep 2013 00:00:00 +0000

My results from benchmarking a handful of Python JSON parsing libraries.

Most who know me well know that I am usually not one for benchmarks. Especially blindly posted benchmark results in blog posts (like how this one is going to be). So, I am not going to try to say that “this library is better than that library” or to convince you that you are going to end up with the same results as me. Instead, remember to take these results with a grain of salt. You might end up with different results than me. Take these results as interesting findings which help supplement your own experiments.

Ok, now that that diatribe is over with LETS GET TO THE COOL STUFF! We use JSON for a bunch of stuff at work, whether it is a third party system that uses JSON to communicate or storing JSON blobs in the database. We have done some naive benchmarking in the past and came to the conclusion that jsonlib2 is the library for us. Well, I started a personal project that also uses JSON and I decided to revisit benchmarking Python JSON libraries to see if there are any “better” ones out there.

I ended up with the following libraries to test: standard lib json, jsonlib2, simplejson, yajl (yet another json library) and lastly ujson (ultrajson). For the test, I wanted to test parsing and serializing a large json blob, in this case, I simply took a snapshot of data from the Twitter API Console. Ok, enough with this context b.s. lets see some code and some results.

import json
import timeit

# json data as a str
json_data = open("./fixture.json").read()
# json data as a list
data = json.loads(json_data)

number = 500
repeat = 4
print "Average run time over %s executions repeated %s times" % (number, repeat)

# we will store the fastest run times here
fastest_dumps = (None, -1)
fastest_loads = (None, -1)

for library in ("ujson", "simplejson", "jsonlib2", "json", "yajl"):
    print "-" * 20
    # thanks yajl for not setting __version__
    exec("""
try:
    from %s import __version__
except Exception:
    __version__ = None
         """ % library)
    print "Library: %s" % library
    # for jsonlib2 this is a tuple... thanks guys
    print "Version: %s" % (__version__, )

    # time to time json.dumps
    timer = timeit.Timer(
        "json.dumps(data)",
        setup="""
import %s as json
data = %r
              """ % (library, data)
    )

    total = sum(timer.repeat(repeat=repeat, number=number))
    per_call = total / (number * repeat)
    print "%s.dumps(data): %s (total) %s (per call)" % (library, total, per_call)
    if fastest_dumps[1] == -1 or total < fastest_dumps[1]:
        fastest_dumps = (library, total)

    # time to time json.loads
    timer = timeit.Timer(
        "json.loads(data)",
        setup="""
import %s as json
data = %r
              """ % (library, json_data)
    )
    total = sum(timer.repeat(repeat=repeat, number=number))
    per_call = total / (number * repeat)
    print "%s.loads(data): %s (total) %s (per call)" % (library, total, per_call)
    if fastest_loads[1] == -1 or total < fastest_loads[1]:
        fastest_loads = (library, total)

# report the overall winners once, after every library has been timed
print "-" * 20
print "Fastest dumps: %s %s (total)" % fastest_dumps
print "Fastest loads: %s %s (total)" % fastest_loads

Ok, we need to talk about this code for a second. It really is not the cleanest code I have ever written. We start off by loading the fixture json data as both the raw json text and parse it into a python list of objects. Then for each of the libraries we want to test, we try to get their version information and finally we use timeit to test how long it takes to serialize the parsed fixture data into a JSON string and then we test parsing the JSON string of the fixture data into a list of objects. And lastly, we store the name of the library with the fastest total run time for either “dumps” or “loads” and then at the end we print which was fastest.

Here are the results I got when running on my macbook pro:

Average run time over 500 executions repeated 4 times
--------------------
Library: ujson
Version: 1.33
ujson.dumps(data): 1.97361302376 (total) 0.000986806511879 (per call)
ujson.loads(data): 2.05873394012 (total) 0.00102936697006 (per call)
--------------------
Library: simplejson
Version: 3.3.0
simplejson.dumps(data): 3.24183320999 (total) 0.001620916605 (per call)
simplejson.loads(data): 2.20791387558 (total) 0.00110395693779 (per call)
--------------------
Library: jsonlib2
Version: (1, 3, 10)
jsonlib2.dumps(data): 2.211810112 (total) 0.001105905056 (per call)
jsonlib2.loads(data): 2.55381131172 (total) 0.00127690565586 (per call)
--------------------
Library: json
Version: 2.0.9
json.dumps(data): 2.35674309731 (total) 0.00117837154865 (per call)
json.loads(data): 5.23104810715 (total) 0.00261552405357 (per call)
--------------------
Library: yajl
Version: None
yajl.dumps(data): 2.85826969147 (total) 0.00142913484573 (per call)
yajl.loads(data): 3.03867292404 (total) 0.00151933646202 (per call)
--------------------
Fastest dumps: ujson 1.97361302376 (total)
Fastest loads: ujson 2.05873394012 (total)

So there we have it. My tests show that ujson is the fastest python json library (when running on my mbp and when parsing or serializing a “large” json dataset).

I have added the test scripts, fixture data and results in this gist if anyone wants to run locally and post their results in the comments below. I would be curious to see the results of others.
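For anyone on Python 3 who wants to try something similar, here is a pared down sketch of the same idea. It is not the script from the gist: the benchmark function is a made-up name, it takes already imported modules instead of using exec, and it only assumes each module exposes dumps and loads like the standard library json does.

```python
import json
import timeit


def benchmark(libraries, data, number=50, repeat=3):
    """Time dumps/loads for each module; return (results, fastest per operation)."""
    json_text = json.dumps(data)
    results = {}
    for name, module in libraries.items():
        # timeit.repeat returns one total per repetition; sum them like the original
        dumps_total = sum(
            timeit.repeat(lambda: module.dumps(data), number=number, repeat=repeat)
        )
        loads_total = sum(
            timeit.repeat(lambda: module.loads(json_text), number=number, repeat=repeat)
        )
        results[name] = {"dumps": dumps_total, "loads": loads_total}
    # the fastest library is the one with the smallest total time
    fastest = {
        op: min(results, key=lambda name: results[name][op])
        for op in ("dumps", "loads")
    }
    return results, fastest


# tiny stand-in for the Twitter fixture; swap in your own data
sample = [{"id": i, "text": "hello world " * 5} for i in range(100)]
results, fastest = benchmark({"json": json}, sample)
print(results, fastest)
```

To compare against other libraries, import them and add them to the dictionary, for example {"json": json, "ujson": ujson} if ujson is installed.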

The Battle of the Caches

Thu, 01 Aug 2013 00:00:00 +0000

A co-worker and I set out to each build our own http proxy cache. One of them was written in Go and the other as a C++ plugin for Kyoto Tycoon.

So, I know what most people are thinking: “Not another cache benchmark post, with skewed or biased results.” But luckily that is not what this post is about; there are no opinionated graphs showing that my favorite caching system happens to be better than all the other ones. Instead, this post is about why at work we decided to write our own API caching system rather than use Varnish (a tested, tried and true HTTP caching system).

Let us discuss the problem we have to solve. The system we have is a simple request/response HTTP server that needs to have very low latency (a few milliseconds, usually 2-3 on average) and we are adding a third-party HTTP API call to almost every request that we see. I am sure some people see the issue right away, any network call is going to add at least a half to a whole millisecond to your processing time and that is if the two servers are in the same datacenter, more if they are not. That is just network traffic, now we must rely on the performance of the third-party API, hoping that they are able to maintain a consistent response time under heavy load. If, in total, this third-party API call is adding more than 2 milliseconds response time to each request that our system is processing then that greatly reduces the capacity of our system.

THE SOLUTION! Let's use Varnish. This is the logical solution, let's put a caching system in front of the API. The content we are requesting isn’t changing very often (every few days, if that) and a cache can help cut down the added latency from the API call. So, we tried this but had very little luck; no matter what we tried we could not get Varnish to respond in under 2 milliseconds per request (which is a main requirement of the solution we were looking for). That means Varnish is out, the next solution is to write our own caching system.

Now, before people start flooding the comments calling me a troll or yelling at me for not trying this or that or some other thing, let me try to explain really why we decided to write our own cache rather than spend extra days investing time into Varnish or some other known HTTP cache. We have a fairly specific requirement from our cache, very low and consistent latency. “Consistent” is the key word that really matters to us. We decided fairly early on that getting no response on a cache miss is better for our application than blocking and waiting for a response from the proxy call. This is a very odd requirement and most HTTP caching systems do not support it since it almost defeats their purpose (be “slow” 1-2 times so you can be fast all the other times). As well, HTTP is not a requirement for us, that is, from the cache to the API server HTTP must be used, but it is not a requirement that our application calls to the cache using HTTP. Headers add extra bandwidth and processing that are not required for our application.

So we decided that our ideal cache would have 3 main requirements:

1. Must have a consistent response time, returning nothing early over waiting for a proper response
2. Support the Memcached Protocol
3. Support TTLs on the cached data

This behavior works basically like so: Call to cache, if it is a cache miss, return an empty response and queue the request to a background process to make the call to the API server, every identical request coming in (until the proxy call returns a result) will receive an empty response but not add the request to the queue. As soon as the proxy call returns, update the cache and every identical call coming in will yield the proper response. After a given TTL consider the data in the cache to be old and re-fetch.
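The behavior described above can be sketched in a few lines of Python. To be clear, this is neither CacheOrBust nor Ferrite, just an illustration of the idea: MissFastCache and its fetch callable are hypothetical names, a thread per in-flight key stands in for the background queue, and an expired entry is assumed to be treated exactly like a miss.

```python
import threading
import time


class MissFastCache:
    """Answer misses immediately with None and fill the entry in the background."""

    def __init__(self, fetch, ttl_seconds):
        self._fetch = fetch       # hypothetical slow call: key -> value
        self._ttl = ttl_seconds
        self._values = {}         # key -> (value, expires_at)
        self._in_flight = {}      # key -> thread currently fetching it
        self._lock = threading.Lock()

    def get(self, key):
        now = time.monotonic()
        with self._lock:
            entry = self._values.get(key)
            if entry is not None and entry[1] > now:
                return entry[0]   # fresh hit
            if key not in self._in_flight:
                # first miss (or expired entry): start exactly one background fetch
                worker = threading.Thread(target=self._fill, args=(key,), daemon=True)
                self._in_flight[key] = worker
                worker.start()
        return None               # miss: respond right away with nothing

    def _fill(self, key):
        try:
            value = self._fetch(key)  # the slow API call, off the request path
            with self._lock:
                self._values[key] = (value, time.monotonic() + self._ttl)
        finally:
            with self._lock:
                self._in_flight.pop(key, None)

    def wait(self, key):
        """Helper for tests: block until any in-flight fetch for key is done."""
        with self._lock:
            worker = self._in_flight.get(key)
        if worker is not None:
            worker.join()
```

The important property is that get never waits on the fetch: callers either get a fresh value or nothing, and identical requests that arrive while a fetch is running do not trigger a second one.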

This was then seen as a challenge between a co-worker, Dan Crosta, and myself to see who can write the better/faster caching system with these requirements. His solution, entitled “CacheOrBust”, was a Kyoto Tycoon plugin written in C++ which simply used a subset of the memcached protocol as well as some background workers and a request queue to perform the fetching. My solution, Ferrite, is a custom server written in Go (originally written in C) that has the same functionality (except using goroutines rather than background workers and a queue). Both servers used Kyoto Cabinet as the underlying caching data structure.

So… results already! As with most fairly competitive competitions, it is always a sad day when there is a tie. That's right, two similar solutions written in two different programming languages yielded similar results (we probably have Kyoto Cabinet to thank). Both of our caching systems were able to give us the results we wanted: consistent sub-millisecond response times, averaging about 0.5-0.6 milliseconds (different physical servers, but same datacenter), regardless of whether the response was a cache hit or a cache miss. Usually the moral of the story is: “do not re-invent the wheel, use something that already exists that does what you want,” but realistically sometimes this isn’t an option. Sometimes you have to bend the rules a little to get exactly what your application needs, especially when dealing with low latency systems, where every millisecond counts. Just be smart about the decisions you make and make sure you have sound justification for them, especially if you decide to build it yourself.

Browser Fingerprinting

Wed, 05 Jun 2013 00:00:00 +0000

Ever want to know what browser fingerprinting is or how it is done?

What is Browser Fingerprinting?

A browser or device fingerprint is an identifier generated from information retrieved from a single device that can be used to identify that device only. For example, as you will see below, browser fingerprinting can be used to generate an identifier for the browser you are currently viewing this website with. Even if you clear your cookies (which is how most third party companies track your browser), the identifier should be the same every time it is generated for your specific device/browser. A browser fingerprint is usually generated from the browser's user agent, timezone offset, list of installed plugins, available fonts, screen resolution, language and more. The EFF did a study on how unique a browser fingerprint for a given client can be and which browser information provides the most entropy. To see how unique your browser is, please check out their demo application Panopticlick.

What can it be used for?

Ok, so great, but who cares? How can browser fingerprinting be used? Right now the majority of user tracking is done with cookies. For example, when you go to a website that has tracking pixels (which are “invisible” scripts or images loaded in the background of the web page), the third party company receiving these tracking calls will set a cookie in your browser containing a unique, usually randomly generated, identifier. That identifier is used to associate stored data with you, like collected site or search retargeting data. This way, when you visit them again with the same cookie, they can look up the data previously associated with you.

So, if this is how it is usually done, why do we care about browser fingerprints? Well, the main problem with cookies is that they can be volatile. If you manually delete your cookies, then the company that put that cookie there loses all association with you, and any data they have on you is no longer useful. As well, if a client does not allow third party cookies (or any cookies) in their browser, then the company will be unable to track the client at all.

A browser fingerprint, on the other hand, is a more constant way to identify a given client. As long as they have JavaScript enabled (which most websites seem unable to function properly without), the client can be identified even if they do not allow cookies in their browser.

How do we do it?

Like I mentioned before, generating a browser fingerprint requires JavaScript to be enabled, as it is the easiest way to gather the most information about a browser. JavaScript gives us access to things like your screen size, language, installed plugins, user agent, timezone offset, and other points of interest. This information is basically smooshed together into a string and then hashed to generate the identifier. The more information you can gather about a single browser, the more unique a fingerprint you can generate and the fewer collisions you will have.
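The "smoosh together and hash" step can be sketched in a few lines. This is a Python illustration of the idea only, not how any particular library does it: in a real page the values would be collected by JavaScript in the browser, the attribute names and the `###` separator below are made up, and SHA-256 from the standard library stands in for whatever hash a production library uses.

```python
import hashlib


def fingerprint(attributes):
    """Join browser attributes in a stable order and hash the result.

    `attributes` is a hypothetical dict of values collected from the
    browser, e.g. user agent, language, timezone offset, plugin names.
    """
    parts = []
    for name in sorted(attributes):  # stable order, whatever the insertion order
        value = attributes[name]
        if isinstance(value, (list, tuple)):
            value = ",".join(str(item) for item in value)
        parts.append(f"{name}={value}")
    joined = "###".join(parts)  # the "smooshed together" string
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Made-up values for illustration only.
laptop_a = {
    "user_agent": "ExampleBrowser/1.0 (ExampleOS 10.8)",
    "language": "en-US",
    "timezone_offset": 240,
    "screen": "1440x900x24",
    "plugins": ["PDF Viewer", "Flash"],
}
laptop_b = dict(laptop_a)  # an identically configured second machine

# Identical inputs always hash to the same identifier.
print(fingerprint(laptop_a) == fingerprint(laptop_b))  # True
```

Two devices that report exactly the same values end up with the same identifier. The fewer attributes that go into the string, the more often that happens.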

Collision? Yes, if you end up with two laptops of the same make, model, year, OS version and browser version, with the exact same features and plugins enabled, then the hashes will be exactly the same and anyone relying on the fingerprint will treat both of those devices as the same. But if you read the white paper by the EFF mentioned above, you will see that their method for generating browser fingerprints is usually unique for almost 3 million different devices. For some companies that much uniqueness is more than enough to use and rely on fingerprints to identify devices, while others have more than 3 million users.

Where does this really come into play? Most websites have their users create an account and log in before allowing them access to portions of the site or letting them look up stored information, maybe their credit card payment information, home address, e-mail address, etc. Where browser fingerprints are useful is in trying to identify anonymous visitors to a web application, for example by third party trackers who are collecting search or other kinds of data.

Some Code

There is a project on GitHub by user Valentin Vasilyev (Valve) called fingerprintjs, which is a client side JavaScript library for generating browser fingerprints. If you are interested in seeing some production worthy code for generating browser fingerprints, please take a look at that project. It uses information like user agent, language, color depth, timezone offset, whether session or local storage is available, and a listing of all installed plugins, and it hashes everything using murmurhash3.

Resources:

* panopticlick.eff.org - find out how rare your browser fingerprint is.
* github.com/Valve/fingerprintjs - client side browser fingerprinting library.
