The Atlantic’s Cities section picks their ten favorite open data releases from 2012. An interesting twist: this list comprises a set of tools and interactive visualizations, not just the raw data. Of note: beware of SF’s “high injury corridors,” shown in the image above!

(Unrelated addendum: I successfully defended my thesis this week! Very much looking forward to the holidays, an overdue vacation, and then getting much more involved in the world of data. More details to come, soon.)

"It is unfortunate that in our generation, the brightest minds are being used to try to get people to click on ads. I hope in this decade - in the 2010s - we have a shift from consumerism back to mission … national security, financial risk, and health care." 

If you’re looking for a direct and quick intro to R’s ggplot2 graphics package, these summer ‘12 slides from Wickham himself seem like a good start.

Drew Conway (of the now-infamous Data Science Venn Diagram) gives a great talk on the value of asking good questions over having good tools. (Monktoberfest 2012)

Very rich data visualization from the Hewlett Foundation on the size, location and year distribution of their grants.

fastcodesign:

One mathematician’s ingenious solution to armoring heavy bombers inspired Facebook’s research manager to look at Facebook’s ocean of data from a new perspective.

(via dailydatascience)

Two Finds: SQLZoo & ‘Data Literacy’ Short Course

I came across two interesting finds this week:

  • First, Y Combinator (a great twitter feed to keep tabs on) shared a link to SQLZoo.net, a source of tutorials, tests, and references for learning SQL. I don’t yet have any experience with relational databases, but it sounds like they’re used often in traditional business arenas. Having a basic familiarity with the syntax and use of this special querying language seems wise (see the short sketch after this list). 
  • Second, Quora sent me an email with an update to a ‘Data Science’ thread that I had previously checked out. I’m actually having trouble finding the original thread now, but the original question was something along the lines of “What skills does a data scientist need to start out with?” The update was from one of the co-instructors of this MIT short course (Independent Activities Period, IAP) in ‘Data Literacy.’ It’s a lab-heavy, six-day walkthrough of the basics in accessing data, extracting information, using statistical tools, and making visualizations. It appears that they’re primarily using Python for this course, and it looks great. Very excited to start working through this one.
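
For a first taste of what that querying syntax looks like, here’s a minimal sketch using Python’s built-in sqlite3 module. The table and column names are invented purely for illustration; the point is just the shape of a basic SQL query.

```python
import sqlite3

# In-memory database purely for illustration; the "sales" table is made up.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE sales (city TEXT, year INTEGER, amount REAL)")
cur.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("Boston", 2011, 50.0), ("Chicago", 2012, 75.0), ("Boston", 2012, 30.0)],
)

# A basic SQL query: total amount per city, restricted to 2012.
for city, total in cur.execute(
    "SELECT city, SUM(amount) FROM sales WHERE year = 2012 GROUP BY city"
):
    print(city, total)

conn.close()
```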

On an unrelated note, I’m sad to say that I’ve elected to abandon the programming exercises in the Coursera Machine Learning course. Though it is fun and incredibly educational, I need to devote much more time to finishing my graduate work and thesis. I look forward to watching the rest of the videos, and at least obtaining some familiarity with the remaining topics. Since this was a repeat of the course, I hope it’ll be offered again and I can pick up where I left off in the programming exercises!

Learn Git: tryGit (Code School)

Among the items on my ad-hoc list of ‘to-dos’ for getting into data science: get an introduction to distributed version control systems (VCS). Just recently on the Kaggle blog, there was a great article on Engineering Practices in Data Science and the need for source control. I admit that the idea was a new one to me, but it makes complete sense. Instead of cleverly-named (or so you thought…) files stored away in random hard drive locations, having a good pipeline helps “get the goo of software out of the way so they can focus on valuable data problems.”

Sounds great. So, for a rookie to the field, where do I begin? Turns out, Git has the answer. Many of them, in fact. From free online textbooks to uber-intro (i.e., for me) walkthroughs like tryGit, you can find a way into VCS.

I had a hunch that Git was going to be useful to me eventually; I set up an account and used Gist to embed my ‘reference cards’ on Python and bash. Having run through the tryGit tutorial, I feel… interested. Without previous experience, it’s all still a bit mysterious. But I can see the utility, so I’m looking forward to experimenting more with it, soon.

Multi-class Classification & Neural Networks (Coursera ML class)

The third programming exercise in Coursera’s Machine Learning class deals with one-vs-all logistic regression (aka multi-class classification) and an introduction to the use of neural networks to recognize hand-written digits. This is - by far! - the most interesting assignment yet.

Getting my head wrapped around the setup for this assignment took almost as much time as actually implementing the solution. In short, the data we’re looking at is a subset of the MNIST handwritten digit dataset. We have 5000 training examples, each of which is an “unrolled” version of a 20 x 20 pixel grayscale image (i.e., a 400-dimensional vector). Each pixel is encoded as a grayscale intensity, and the “0” digit is labeled “10” for convenience with Octave vector indexing. We get a sense of what we’re dealing with by running some provided code that displays a random 10 x 10 array of training examples:

You can see that the clarity or messiness of the digits is all over the map, so to speak. This is very clearly a relevant opportunity for a learning algorithm! 
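
For the curious, here’s roughly what that loading-and-display step looks like translated into Python (the course itself uses Octave, so treat this as a sketch; the ex3data1.mat file name and its X/y variable names are how I recall the exercise handout, so consider them assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import loadmat

# Load the 5000-example MNIST subset; file and variable names assumed from the handout.
data = loadmat("ex3data1.mat")
X, y = data["X"], data["y"].ravel()   # X: (5000, 400), y: labels 1..10 ("10" means digit 0)

# Display a random 10 x 10 grid of examples, each reshaped back to 20 x 20 pixels.
idx = np.random.choice(X.shape[0], 100, replace=False)
fig, axes = plt.subplots(10, 10, figsize=(6, 6))
for ax, i in zip(axes.ravel(), idx):
    # Octave stores the unrolled images column-major, hence order="F" when reshaping.
    ax.imshow(X[i].reshape(20, 20, order="F"), cmap="gray")
    ax.axis("off")
plt.show()
```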

We’re asked to vectorize previous code, but if you were thinking that way the first time around, your code from previous exercises will already be vectorized. This saves a bunch of steps in this assignment. Next we add in regularization, but once more we leave the choice of lambda to Prof. Ng in the code that tests our solutions. I imagine (and hope!) that in exercises in the not-too-distant future, we’ll be learning how to choose our own values of the regularization parameter.
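
For concreteness, here’s roughly what the vectorized, regularized cost and gradient look like when translated from Octave into numpy. This is just a sketch of the idea, not my graded code; note the convention of not regularizing the bias term.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lr_cost_grad(theta, X, y, lam):
    """Regularized logistic regression cost and gradient, fully vectorized.

    X is (m, n+1) with a leading column of ones, y is a length-m vector of 0/1
    labels, and the bias term theta[0] is conventionally not regularized.
    """
    m = y.size
    h = sigmoid(X @ theta)
    reg = (lam / (2 * m)) * np.sum(theta[1:] ** 2)
    cost = (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m + reg

    grad = X.T @ (h - y) / m
    grad[1:] += (lam / m) * theta[1:]
    return cost, grad
```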

Then we get to the business of implementing one-vs-all classification by training one regularized logistic classifier for each of the K classes in the dataset (here, K=10). We also make use of some new techniques in this exercise (logical arrays, and the fmincg advanced optimization function). In the end, our trained algorithm is loosed upon the same dataset it was trained on (not ideal or realistic, but ok…) and, for each training example, outputs its prediction of the correct digit. On the given set of data, our algorithm correctly classifies about 95% of the training examples. Pretty good!
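
The one-vs-all loop itself is short. A rough Python sketch is below; I’m substituting scipy.optimize.minimize for Octave’s fmincg and reusing the lr_cost_grad sketch from above, so none of this is the exercise’s actual code.

```python
import numpy as np
from scipy.optimize import minimize

def one_vs_all(X, y, num_labels, lam):
    """Train one regularized logistic classifier per class (sketch).

    Uses scipy.optimize.minimize in place of Octave's fmincg, and the
    lr_cost_grad sketch from earlier. Returns a (num_labels, n+1) matrix
    of fitted parameters, one row per class.
    """
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])          # add the bias column
    all_theta = np.zeros((num_labels, n + 1))
    for k in range(1, num_labels + 1):            # labels run 1..10 in this dataset
        yk = (y == k).astype(float)               # "logical array": 1 for class k, else 0
        res = minimize(lr_cost_grad, np.zeros(n + 1), args=(Xb, yk, lam),
                       jac=True, method="L-BFGS-B", options={"maxiter": 100})
        all_theta[k - 1] = res.x
    return all_theta

def predict_one_vs_all(all_theta, X):
    """For each example, pick the class whose classifier is most confident."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(Xb @ all_theta.T, axis=1) + 1   # map back to 1..10 labels
```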

The second part of the programming exercise deals with neural networks (NNs). Turns out NNs are pretty complicated; I watched the video lectures on this a few times and still barely caught it. Since the algorithm is complex, it’s split across assignments; in this one, we’re only implementing feedforward propagation using a previously-trained set of parameters/weights (theta matrices). When we complete the forward propagation, the test code randomly chooses an entry from the MNIST 5000-example subset we have and displays its guess at the value along with the actual image. Impressively, its training accuracy is about 97%. Here are some examples of the test output (you may have to “View Image” in a new tab to read the prediction. Spoiler: they’re all correct.): 
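
Feedforward propagation itself boils down to a couple of matrix multiplications. Here’s a numpy sketch, assuming the pre-trained weight matrices (which I’m calling Theta1 and Theta2, for a single hidden layer) have already been loaded from the provided file:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_nn(Theta1, Theta2, X):
    """Feedforward pass through a one-hidden-layer network (sketch).

    Theta1 maps the 400 input pixels (plus a bias unit) to the hidden layer;
    Theta2 maps the hidden layer (plus a bias unit) to the 10 output classes.
    """
    m = X.shape[0]
    a1 = np.hstack([np.ones((m, 1)), X])          # input layer with bias unit
    a2 = sigmoid(a1 @ Theta1.T)                   # hidden layer activations
    a2 = np.hstack([np.ones((m, 1)), a2])         # add bias unit to hidden layer
    a3 = sigmoid(a2 @ Theta2.T)                   # output layer: one score per class
    return np.argmax(a3, axis=1) + 1              # labels 1..10
```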

This exercise was great. Though there are still some core details that are being hidden from us (e.g. the training of this NN), this is starting to look like a legitimate machine learning exercise. The details of the code are pretty interesting, too, and I’m working on getting those files available. Stay tuned on that front!

Logistic Regression in Octave (Coursera ML class)

In programming exercise two of Prof. Ng’s Machine Learning class, we implemented logistic regression on two different datasets. The first dataset was a distribution of exam score pairs corresponding to students who were either admitted to a fictitious program or not. We analyzed these data by making use of the Octave function fminunc - a built-in optimization solver. We implemented the sigmoid hypothesis function, then the cost function and its gradient. Passing fminunc these functions leads us to the following decision boundary:

Once we have this decision boundary, we can use it to predict the likelihood of admission based on a new pair of exam scores.
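
Concretely, once the optimizer hands back a fitted theta, the prediction step is just the sigmoid of a dot product. A tiny sketch in numpy rather than Octave (predict_admission is my own made-up helper name):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_admission(theta, exam1, exam2):
    """theta is the fitted parameter vector returned by the optimizer."""
    x = np.array([1.0, exam1, exam2])        # prepend the intercept term
    prob = sigmoid(x @ theta)                # estimated probability of admission
    return prob, prob >= 0.5                 # probability and the yes/no decision
```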

The second dataset was more interesting. In this case, we get the quality control results (yes, no) as a function of two test scores on microchips. These data, however, were not separable by a straight line through the plot. To accommodate this situation, we used a technique called feature mapping to extend the existing features (two test scores) into additional features built from higher-order polynomial terms of the first two. We extended our features up to sixth-order polynomials in the original two features, which allows our decision boundary to be non-linear in the 2D data plot.
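
The feature mapping step is easy to write down explicitly: from the two original features x1 and x2 we generate every term x1^i * x2^j with i + j up to 6, which gives 28 features including the intercept. A numpy sketch of the idea (the exercise ships its own Octave version of this; mine is just for illustration):

```python
import numpy as np

def map_feature(x1, x2, degree=6):
    """Expand two features into all polynomial terms up to the given degree.

    For degree=6 this returns an (m, 28) array: 1, x1, x2, x1^2, x1*x2, x2^2, ...
    """
    x1, x2 = np.asarray(x1).ravel(), np.asarray(x2).ravel()
    cols = [np.ones_like(x1)]                     # intercept term
    for total in range(1, degree + 1):
        for j in range(total + 1):
            cols.append((x1 ** (total - j)) * (x2 ** j))
    return np.column_stack(cols)
```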

For this dataset we also implemented regularization. Since we don’t explicitly know ab initio which features (if any) will be more significant, we include a parameter that penalizes overfitting. We again passed fminunc the same cost function and its gradient (side note: if these two functions are properly vectorized, no changes need to be made between the two datasets). Now we also need to fix the regularization parameter. The initial value was chosen to be 1, which results in a good-looking fit to the data:

By experimenting with the value of this parameter, we can observe the varying final results. For large values (e.g. 100), we find a simpler decision boundary that ultimately underfits the data:

In contrast, by removing the regularization parameter (i.e., setting it to 0), we see that the regression leads to a complicated decision boundary and ultimately overfits the data. We shouldn’t be very confident in any predictions that come from this result:

So that’s the first implementation of logistic regression. It’s a very neat continuation of the principles of linear regression; it mostly just requires thinking about a different hypothesis function, since the mechanism is largely the same.