Chase, what matters?
23 Sep 2009, 18:36 PM
Last week I paid my bills. As part of the regular bill-paying process, I take any funds left over that are not required as cash over the coming weeks, and pay down my home equity facility. I pay my credit cards in full. The only rate I concern myself with is the 3.8% rate on my loan. There is nothing particularly unusual about this process.
A few days after I paid my bills, my bank, JPM Chase, emailed me to tell me that my account was overdrawn. I logged into the web site and saw that they had put a bunch of payments through twice. Most importantly, they put my loan principal payment through twice. In their wisdom, they helped me correct this by withdrawing from my credit card to pay down my loan. My credit card has a purchase rate of 12% and a cash rate of 19%.
Follow me for a moment: They withdrew from an account charging 19% to pay down a loan at 3.8%. And along the way they charged an overdraft fee for the 'service.'
I have three simple requests for Chase:
- Reverse the double payment, restoring my checking account to its intended balance.
- Reverse the overdraft fee.
- Return the interest they are earning at 19% on the credit card account.
I was first routed to the online banking group as it was clear that the error originated within their domain. The timestamps on the transactions are identical, the transaction numbers are near sequential; there is a clear indication that my intent was to only pay once, but they processed the transactions twice. From there I was transferred to the credit card fiefdom who told me that they would correct the overdraft issues - but it would take four days.
Four days later I called to discover that no work order was placed and there was no note of my original call. I went through the same call center waltz, but instead was routed to the home equity group. The same group that asked me to forge a signature during the application process - but that is another story for another day. During this call I was told that the home equity group would take care of everything as they were the final destination of the funds. I made sure that the work order included the three key points listed above. I also took note of the work order number and the names and times of all the people I spoke with.
Meanwhile, mind you, I still have a zero checking balance and am unable to make other payments. I am loath to draw down the debt facilities at my disposal, as it would just make it even more complex to reverse these transactions. Friends have been helpful and luckily I have enough cash on hand.
Today I called the home equity group directly to check on the status of the work order. They gave me the same estimate as the last time I spoke with them: four to five days. At which point the interest would have accumulated to $11. Not a great deal in the scheme of things, but it is my goddamn $11. I'd be happy if they simply paid me the $11 and moved on - they have earned a good deal of revenue from me over the years, and this is clearly their error. But of course, they can't return me the $11 as they have no mechanism for doing so.
"You cannot dispute this transaction for this reason: 5102 - Bank Releated[sic] Fees / Charges - Not Eligible"
As of now it appears my only option to force the return of my overdraft fee and for me to receive any accrued interest is to take action in the New York small claims court. There is no way Chase will defend this - that would cost hundreds for legal approval alone. Of course, they make an order of magnitude more than that from my family in fees, charges, interchange and net interest margin each year. As soon as I get my $11 back, they will no longer hold any of my accounts.
Customer experience matters.
Bayesian Methods + MCMC
11 Jul 2009, 00:44 AM
Last night we had our fourth NY R Statistical Programming meetup. The topic was Bayesian Methods + MCMC. We had two presenters, Jake Hofman and Suresh Velagapundi, both of whom did an admirable job of presenting a very broad topic to an audience with diverse backgrounds. I want to use this post to bridge the gap between the background material and day to day use. It is aimed at readers who may have some experience with R but aren't very familiar with the Bayesian Way. While it is a simple example, the steps involved extend to the issues faced in real world applications.
The source for this example can be downloaded using the internet!
We are going to step through Jake's coin flip example to get a sense of what is involved in doing Bayesian inference. There are a number of packages on the CRAN Bayesian Inference view that do all of what you will see below. I decided against using them for two reasons. First off, the coin flip example is a little too trivial for using many of the techniques that rely on multivariate parameter estimation to see any utility. But more importantly, I want to use the opportunity provided by a nice simple example to step through the underlying mechanics. My hope is that after reading through this you can have a look at the available packages and be a better judge of what they are used for and where one package may stand out over another. In the course of doing this write up I went through the MCMCpack package and it is a good exercise to compare how they implement the MCbinomialbeta() against the first half of this walk through. For the curious, the MCMCmetrop1R() function is far more advanced than the simple implementation of Metropolis-Hastings shown below, and it is a good exercise to understand their tuning parameters.
As a quick recap, the point of the exercise is to go from prior belief in a distribution (in this case we believe that the coin is fair) and use observed data to arrive at a posterior distribution using both the prior and the data. There are three things that we need to know to calculate the posterior distribution:
- The likelihood of seeing the new data given our estimate of the bias
- Our prior distribution
- The 'evidence': the integral of the likelihood times the prior over all possible values of the bias
I won't step through the derivation of the likelihood, as this should be easy enough to derive from the binomial probability distribution function. In this case our likelihood, with N trials and h heads is:
likelihood <- function (N, h, theta) theta^h * (1 - theta)^(N-h)
Check that the likelihood function makes sense:
t <- (0:100) / 100
png ("figure1.png", width=800, height=600)
par (mfrow=c(2,2))
par (bty='n')
par (col='red')
plot (t, likelihood(100, 50, t), type='l', xlab='Theta Hat', ylab="Likelihood", main='Likelihood (t=0.5)')
Great, the maximum likelihood for 50 heads from 100 flips is a theta of 0.5. (See chart below).
Jake uses the Beta distribution as his prior as it has some neat analytic properties; namely that the posterior will be of the same distribution family as the prior. We call these types of priors conjugate priors.
prior <- function (theta, a, b) dbeta (theta, a, b)
a <- 2
b <- 2
plot (t, prior(t, a, b), type ='l', xlab='Theta Hat', ylab='Pr(theta|a,b)', main='Prior')
If we do the integration, we can arrive at the analytic form of the evidence and thus the posterior:
evidence <- function (N, h, a, b) beta(h + a, N - h + b) / beta (a, b)
posterior <- function (theta, N, h, a, b) likelihood (N, h, theta) * prior(theta, a, b) / evidence (N, h, a, b)
plot (t, posterior(t, 100, 50, a, b), type ='l', xlab='Theta Hat', ylab='Pr(theta|Observations,a,b)', main='Posterior (t=0.5)')
plot (t, posterior(t, 100, 70, a, b), type ='l', xlab='Theta Hat', ylab='Pr(theta|Observations,a,b)', main='Posterior (t=0.7)')
dev.off()
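Because the Beta prior is conjugate to this likelihood, the analytic posterior should itself be a Beta density with parameters a + h and b + N - h. The following is a minimal sketch that checks this numerically. It restates the functions from this walkthrough so that it runs on its own; the names theta_grid and closed_form are mine, not Jake's.

```r
# Restated from the walkthrough so the block runs on its own.
likelihood <- function (N, h, theta) theta^h * (1 - theta)^(N - h)
prior      <- function (theta, a, b) dbeta(theta, a, b)
evidence   <- function (N, h, a, b) beta(h + a, N - h + b) / beta(a, b)
posterior  <- function (theta, N, h, a, b)
    likelihood(N, h, theta) * prior(theta, a, b) / evidence(N, h, a, b)

a <- 2; b <- 2       # prior parameters used in this post
N <- 100; h <- 70    # trials and heads

# Illustrative grid of theta values (names are hypothetical)
theta_grid  <- seq(0.01, 0.99, by = 0.01)
closed_form <- dbeta(theta_grid, a + h, b + N - h)

# Compare the hand-built posterior with the Beta(a + h, b + N - h) density
isTRUE(all.equal(posterior(theta_grid, N, h, a, b), closed_form))
```

If the two agree, the hand-rolled evidence term is doing its job as the normalizing constant.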
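We can also find the evidence by numerical integration and compare the result with the analytic posterior:
evidenceN <- function (N, h, a, b) integrate (function(t) likelihood (N,h,t) * prior (t,a,b), 0, 1)$value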
posteriorN <- function (theta, N, h, a, b) likelihood (N, h, theta) * prior(theta, a, b) / evidenceN (N, h, a, b)
N <- 100 # Trials
h <- 70 # Heads
png ("figure2.png", width=800, height=600)
par (mfrow=c(1,1))
analytic <- posterior (t, N, h, a, b)
estimated <- posteriorN(t, N, h, a, b)
plot (t, analytic, type ='l', col='red', xlab='Theta Hat', ylab='Pr(theta|Observations,a,b)', main='Posterior (t=0.7)')
lines (t, estimated, col='blue', lty=2)
err <- (analytic - estimated)^2
lines (t, (err - min(err)) / diff(range(err)) * max(analytic), lty=3, col='black')
legend (0,2, c('Analytic','Estimated','Error^2 (scaled)'), col=c('red','blue','black'), lty=c(1,2,3), bty='n', text.col='black')
dev.off()

In general there are two places where integration is required:
- Integrating across theta to find the evidence (denominator)
- Once you have the posterior, integrating it to calculate summary statistics (mean, variance, etc.)
In the above example we used the integrate() function to apply adaptive quadrature to find the evidence. We could use this method for the second integration too, but let's not. Instead, let us use MCMC - which is, at its core, a way to draw samples from a distribution that is otherwise hard to sample from.
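For comparison, here is roughly what the quadrature route would look like for the summary statistics: a sketch that uses integrate() to get the posterior mean and variance and checks them against the known moments of a Beta(a + h, b + N - h) distribution. The helper names (post, post_mean, post_var) are illustrative and do not come from Jake's notes.

```r
a <- 2; b <- 2; N <- 100; h <- 70

# Posterior density; by conjugacy it is Beta(a + h, b + N - h)
post <- function (theta) dbeta(theta, a + h, b + N - h)

# Mean and variance by adaptive quadrature
post_mean <- integrate(function (th) th * post(th), 0, 1)$value
post_var  <- integrate(function (th) (th - post_mean)^2 * post(th), 0, 1)$value

# Closed-form moments of a Beta(alpha, beta_) distribution, for comparison
alpha <- a + h
beta_ <- b + N - h
exact_mean <- alpha / (alpha + beta_)
exact_var  <- alpha * beta_ / ((alpha + beta_)^2 * (alpha + beta_ + 1))

c(quadrature = post_mean, exact = exact_mean)
c(quadrature = post_var,  exact = exact_var)
```

With a single parameter this is cheap and accurate. The appeal of MCMC only shows up once the integral runs over many parameters.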
Given that this example is rather trivial, with just one parameter in question (theta), I won't step through the implementations of vanilla Monte Carlo methods (uniform, importance & rejection sampling). These implementations are pretty much straightforward from Jake's presentation.
I will however, implement a simple Metropolis-Hastings MCMC sampler using a simple and symmetric Gaussian proposal density (q in Jake's notes).
MHstep <- function (pdf, prevCandidate)
{
    # Effectively we are taking a random walk.
    newCandidate <- prevCandidate + rnorm (1, mean=0, sd=0.1)
    # NB: Because we are using the normal distribution
    # as our proposal density, which is symmetrical,
    # we cancel out the q terms on the numerator and
    # denominator, as q(x|y) = q(y|x)
    a <- pdf(newCandidate) / pdf(prevCandidate)
    # Draw a uniform random number from 0 to 1
    u <- runif(1)
    if (a > u)
    {
        # This candidate is likely to be a better sample
        return (newCandidate)
    }
    # Else, stick with our previous candidate
    return (prevCandidate)
}
Let's use our numerical approximation to the actual posterior function as the PDF we want to draw samples from:
posteriorPDF <- function (t) posteriorN (t, N, h, a, b)
Time to go on a random walk down coin flip street.
steps <- 1000
samples <- matrix(NA, steps)
samples[1] <- 0.5 # initial guess
for (i in 2:steps)
{
    samples[i] <- MHstep (posteriorPDF, samples[i-1])
}
And how did we do?
png ("figure3.png", width=800, height=600)
par(bty='n')
par(col='red')
plot(cumsum(samples)/1:steps, type='l', xlab='Step', ylab='Estimated Mean', main='Drawing samples by Metropolis-Hastings')
dev.off()
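One way to check the sampler is to compare the sample mean with the exact posterior mean, (a + h) / (a + b + N). The sketch below is a self-contained variation on the code above, not the original implementation. It works with the log of the unnormalized posterior (the evidence cancels in the acceptance ratio, so it is never needed), fixes a seed, and discards an arbitrary burn-in. The names logPost, mhStep and burnin are assumptions for illustration.

```r
set.seed(42)                       # arbitrary seed, for reproducibility only
a <- 2; b <- 2; N <- 100; h <- 70

# Log of the unnormalized posterior: likelihood times Beta(a, b) prior,
# dropping the constants that cancel in the acceptance ratio.
logPost <- function (theta)
{
    if (theta <= 0 || theta >= 1) return (-Inf)   # outside the support
    (h + a - 1) * log(theta) + (N - h + b - 1) * log(1 - theta)
}

# One Metropolis step with a symmetric Gaussian random-walk proposal
mhStep <- function (prev, sd = 0.1)
{
    cand <- prev + rnorm(1, mean = 0, sd = sd)
    if (log(runif(1)) < logPost(cand) - logPost(prev)) cand else prev
}

steps  <- 20000
burnin <- 1000                     # arbitrary choice
draws  <- numeric(steps)
draws[1] <- 0.5                    # initial guess, as in the post
for (i in 2:steps) draws[i] <- mhStep(draws[i - 1])

kept <- draws[-(1:burnin)]
c(mcmc_mean = mean(kept), exact_mean = (a + h) / (a + b + N))
```

Working in logs avoids underflow when N grows, which the ratio of raw densities would eventually hit.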
predict.i2pi
19 Jun 2009, 06:57 AM

the basics
On Monday I released predict.i2pi.com, a statistical learning web service. Designed to deal with common classification and regression problems, it takes input data in the form of a CSV file and returns to the end user a set of predictive models. For example, if you have a list of store locations, local weather data, and store revenue, you could use the service to see if location and weather impact store revenue. predict.i2pi tries to determine whether predictions are possible by running your data against a growing number of user contributed statistical learning algorithms and finding the ones that work best with your data.
In planning this I went through a range of features, bells and whistles but have decided to strip it all back. This is the simplest thing I could build to support what I wanted. It takes a file, runs predictive algorithms against the file, and returns performance measures. Data and predictions.
data
The data provided is expected to be in the form of a number of observations, with one observation per row. Each column contains measurements for these observations. One or more of these measurements are the ones we are interested in predicting. For example:
|<------ Explanatory Variables ------>|   /----- Response Variables (denoted by *)
X1,   X2,   X3,   Name,  Date,       *Y
12.3, 13.4, 8.32, Terry, 2008-10-12, 736.0
9.3,  34.1, 1.21, Josh,  2008-10-12, NA      <-- NB: NA response variables will have
...   ...   ...   ...    ...         ...         predictions available for
8.7,  38.7, 8.17, Jess,  2009-01-07, 1823.1      subsequent download.
Data may include observations for which we do not know the response. These observations can be included, with the response left blank. Once satisfactory models are found, end users can download spreadsheets containing our best predictions for that data. On my todo list is adding confidence intervals to these values.
Once uploaded we try to best detect the following data types:
- Numeric (floating point numbers)
- Integers
- Dates (YYYY-MM-DD works best)
- String Factors (e.g., State or letter scores)
- Text (longer text than factors, with analytic interpretation as language text instead of as factors)
learning
Internally, predict.i2pi performs a standard test / training protocol. Data is loaded and a random half of that data is used to train the learning algorithm. The remaining half is used to test how well the learned algorithm works against previously unseen data. Robust algorithms will do almost as well on the test as during training, while less robust approaches will lead to far poorer performance during testing. The system continues this process of picking a training sample, training and then testing as many times as possible in an allotted time. During each of these cycles, predictions are tested against the actual responses in the corresponding observation. Performance is then measured using the R-squared metric for regressions and simple classification accuracy for classification problems. The system supports user defined performance measures with the goal being to let those who supply data decide on which performance measure is best for their application. However, at the moment I'm concentrating on opening up the ability for users to upload their own learning algorithms.
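The test / training protocol described above can be sketched in a few lines of R. This is not the code behind predict.i2pi: it uses the built-in mtcars data, a plain linear model, and a fixed number of cycles in place of a time budget. The names rSquared, oneCycle and cycles are mine.

```r
set.seed(1)                               # arbitrary, for reproducibility

# R-squared of predictions against held-out actuals
rSquared <- function (actual, predicted)
    1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)

# One cycle: train on a random half, score on the other half
oneCycle <- function (data, formula, response)
{
    trainIdx <- sample(nrow(data), floor(nrow(data) / 2))
    model    <- lm(formula, data = data[trainIdx, ])
    test     <- data[-trainIdx, ]
    rSquared(test[[response]], predict(model, newdata = test))
}

cycles <- 50                              # stand-in for the time budget
scores <- replicate(cycles, oneCycle(mtcars, mpg ~ wt + hp, "mpg"))

# Quartiles of the out-of-sample measure across cycles
quantile(scores)
```

Looking at the spread of the scores across cycles, and not just a single number, gives a rough sense of how sensitive a learner is to the particular split it was handed.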
Currently learning algorithms are specified in small snippets of R code that can be dynamically loaded into the main R subsystem that is responsible for coordinating training cycles. See, for example, rpart.R which links in a recursive partitioning algorithm from the rpart library.
#requires(rpart)
myModel <- function (formula, data) rpart(formula, data, na.action=na.exclude)
myPredict <- function (model, data)
{
    p <- predict (model, data)
    as.numeric(apply(p, 1, function(r) order(-r)[1]))
}
All learning algorithms must contain two function definitions: myModel and myPredict. myModel takes a model formula and data, returning a model object that can be used to make predictions against new data. myPredict takes two parameters, the model object returned by myModel and a set of data that may not have been seen during training. For training, we call myModel with one randomly ordered half of the data. For testing, we provide myPredict with the model object generated from the training set, but provide it with the as yet unseen testing portion of the data.
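As a further illustration of the two-function contract, a learner wrapping ordinary least squares might look like the sketch below. The function names myModel and myPredict come from the description above. The choice of lm(), the mtcars data and the manual split are assumptions made only so that the example runs on its own.

```r
# Learner definition: the two functions the service expects
myModel   <- function (formula, data) lm(formula, data = data, na.action = na.exclude)
myPredict <- function (model, data) as.numeric(predict(model, newdata = data))

# Hypothetical stand-in for what the service does with a learner
set.seed(7)                                    # arbitrary seed
shuffled <- mtcars[sample(nrow(mtcars)), ]     # randomly ordered rows
half     <- floor(nrow(shuffled) / 2)
train    <- shuffled[1:half, ]
test     <- shuffled[(half + 1):nrow(shuffled), ]

model <- myModel(mpg ~ wt + hp, train)         # fit on the training half
preds <- myPredict(model, test)                # predict the unseen half

head(data.frame(actual = test$mpg, predicted = preds))
```

Keeping the learner down to these two functions means it can be developed and tested locally in this way before being sent in.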
Users are also able to define transforms that take a matrix of explanatory variables and return a new matrix with the same number of observation rows but with one or more of the explanatory columns transformed into a new space. For example, one could take a 100 column matrix and apply some form of dimensionality reduction that returns a new matrix, with the same number of rows, but only 10 columns. The transform function is not shown the response variable to ensure that no funny business occurs whereby the response is somehow embedded in the explanatory variables. These same transform functions can then be applied to response variables alone, allowing the system, for example, to construct a model log(Y) ~ PCA(X1, X2, ... , Xn).
The following example shows a transform function that replaces any columns that are at least 50% NA with an indicator variable:
myTransform <- function (x)
{
    if (is.null(dim(x))) return (x);
    if (ncol(x) == 1) return (x);
    bad_idx <- apply(x,2,function(c) sum(is.na(c)) / length(c)) >= 0.5
    if (any(bad_idx))
    {
        y <- x
        y[,bad_idx] <- is.na(y[,bad_idx])*1 # replace NA's with an indicator variable
        return (y)
    } else
    {
        return (x)
    }
}
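The dimensionality reduction case mentioned earlier could be written as a transform along the following lines. This is a sketch only: it assumes the transform receives a numeric matrix with no missing values, and the number of components kept (k) is an arbitrary parameter I have added. The service's actual calling convention may differ.

```r
# Project the explanatory columns onto their first k principal components.
myTransform <- function (x, k = 2)
{
    if (is.null(dim(x))) return (x)          # not a matrix: leave as is
    if (ncol(x) <= k) return (x)             # nothing to reduce
    pca <- prcomp(x, center = TRUE, scale. = TRUE)
    pca$x[, 1:k, drop = FALSE]               # same rows, k columns
}

# Usage on made-up data: 100 observations, 10 columns in, 2 columns out
set.seed(3)
x <- matrix(rnorm(100 * 10), nrow = 100, ncol = 10)
dim(myTransform(x))
```

Note that the transform only ever sees the explanatory columns, in keeping with the rule that the response is hidden from it.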
coming soon
As for uploading code, the best way to do this right now is via email. I hand rolled my own sandbox environment to prevent 3rd party code from hijacking my system - but as with any security code that I write myself, I loathe testing it in the real world until I've had a good chance to be as close to 100% sure that it is safe. In reality, I'll probably stop trying to reinvent the wheel, and use a pre-existing solution.
Given long term plans, and issues around data privacy, I didn't want to set up a system whereby data leaves the system for testing on external machines. While it works well for very large datasets, e.g., the Netflix Prize, the potential of overfitting is higher for smaller datasets when random portions of that data are often reused in validation cycles. That said, developing new learning algorithms (or plugging in ones from existing CRAN libraries) is fairly straightforward so you should be able to develop locally and upload.
There already is an API, but it is not at all documented. This is my next priority. Currently I'm running into some issues with using RCurl to interact with my API - issues which would not exist in any other language - but I really would like to get the R API out of the door before I open up wider access. In short, there are 3 methods which are currently used by the web site (inspect my horrid JavaScript code to see them.) These allow you to upload data, make edits to meta data and receive predictions. Each prediction includes links to the R source that was involved in performing the learning + any transforms used. The prediction meta data also includes the quartiles for the measure after a number of test/train cycles, plus a sample of 250 predictions vs. actuals.
It has been suggested that I also include a small downloadable example snippet for each file to allow developers to get a better flavor of what they are working with. For larger files, I think this is a perfectly swell idea. In fact, I really do want to hear more of your suggestions. I took a knife to a slew of functionality before I released this, but I have code ready to go. But I want to wait for real life suggestions to see what I should be working on next.
The original plans for this project included complex routines for doing unsupervised schema detection and meta modelling to help identify which algorithms might work best with particular shapes of data. Also I had built a framework for combining multiple learning algorithms in a boosting type environment. All of these features remain possible and will hopefully be released in the not too distant future.
One of the big issues I struggled with in deciding to release this is the nature of my target audience. At the moment there is an impedance mismatch between the sophistication required to understand what the system does and the utility of the system to sophisticated users. To those with any experience in predictive analytics, everything here should be your bread and butter - and most likely far simpler than what you do on a daily basis. However there is a large audience of people in the information business who currently make do with the 'Add Trendline' option in MS Excel. To this audience, this service would be greatly valued, but in its current form is probably a bit too much. This deeply embarrasses me, but I'm not going to let that stop me from publicizing what I'm up to. There is a plan, and it exists in increments.
For the lay information worker, there are hurdles both in providing understandable explanations of how the learning algorithms work and were applied, and in adapting my format to the natural shape of the data that they often work with - not to mention data cleaning. As an example, time series models pose an interesting problem. They do break the model of one independent observation per row, but it is difficult to come up with a way of training and testing that is consistent with my current implementation. Even if I were to develop special case handling for time series data, it can be difficult for a computer to find appropriate periods over which to lag variables. At this point I think the simplest route is to let people include previous observations that they deem important, at lags that they think might be interesting, with each row. That way each row can be treated independently from the others and I don't have to build a lot of machinery to guess appropriate treatment of temporal dependency.
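Putting lagged observations on each row, as suggested above, is easy to do before uploading. A minimal sketch, assuming a single numeric series that is already ordered in time; the helper addLags and the column names are hypothetical.

```r
# Build a data frame where each row carries the current value of a series
# plus its values at the requested lags, so rows can be treated independently.
addLags <- function (y, lags)
{
    n   <- length(y)
    out <- data.frame(y = y)
    for (k in lags)
    {
        # Value k steps earlier; the first k rows have no history, so NA
        out[[paste0("y_lag", k)]] <- c(rep(NA, k), y[seq_len(n - k)])
    }
    out
}

revenue <- c(10, 12, 15, 11, 18, 20, 17)    # made-up series
lagged  <- addLags(revenue, lags = c(1, 2))
lagged
```

The leading rows end up with NA in the lag columns. Whether to drop them or leave them in is a choice for whoever prepares the file.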
Likewise there are other problems whose natural representations don't map neatly to the one row = one observation representation - think of collaborative filtering or graph based problems. I am quite keen on keeping the one row representation as it affords me some nice system scaling properties without becoming too domain specific. That said, there is nothing stopping me from building front-ends that take data from these problem domains in their natural representation and map them to one that works better for my system.
When it comes to explaining the models, well. That is another story.
Engineering vs. Architecture
16 Jun 2009, 17:18 PM

A few months back I caught up with a fellow Aussie in New York, who I first met once ten years ago. It is amazing how social network dynamics change as an expat. He is currently teaching Architecture at Columbia while completing his doctorate in the nature of representation in architecture. It was the sort of long conversation that lingers for a few months before finding a resting place in your mind. At first glance we spent quite some time discussing the work at the Spatial Information Design Lab as this most closely bridged the gap between our worlds. The deeper conversation was that of representation.
Engineers build things. They use sciences to make sure that the things they build don't fall over. Architects design things - they take ideas of the world and represent them. Their audience is both the client and the engineering and construction teams. Different representations serve different purposes. Engineering: Representation to World. Architecture: World to Representation.
Financial engineers take what they know about how companies work and build new things to serve other companies. Economists take the real world and make model representations of reality. However there is a void in economics, between the macro and the micro, in the domain of the company. Likewise, there is a void in financial engineering. Financial engineering is currently dominated by time-series analysis. I posit a weak form of the Black Swan theorem - namely that we currently don't know enough about the past to even pretend to predict the future. We have financial historians, in the form of data providers, but we don't have the architects to take this repository of past knowledge and build representations of how companies operate. Accountants across the globe set and implement the rules of this complex system, but we don't understand its dynamics.
Can financial engineers shed the instrumentation of time series analysis and take on this role? Or will it come from a new group - the type of people who build Googles?
Or will their buildings leak?
Image by ken mccown on Flickr.
Predicting social network features from profile picture features
13 May 2009, 14:16 PM

The interwebs has made it really easy for those who are looking for data to find it. Or at least a close approximation to it. Those who have the tools to scrape the web and reverse out interesting data are typically part software developers, statisticians and hackers. Mix these three together and one is genetically predisposed to collect as much data as possible. But there must come a point when the collection stops and the inference begins. Inference is difficult, in that it requires making statements about the mushy world, whereas coding systems to collect data deals with deterministic computers. It is easy to fall into the trap of simply collecting data to avoid dealing with mush.
In my previous job I was overseeing a project which involved scraping a large publicly traded e-commerce site to find interesting information to support investment decisions. The problem was that everyone on the street was also scraping the same site. Our code was top notch and having done this before we were able to avoid common pitfalls and our system was gathering oodles of potentially useful data. Faced with all this data one of my developers had a tough time working out where to start the inference process. Sure, there are obvious places, like predicting revenue from site activity, but they are obvious. So obvious that even the sell-side researchers were doing it. Faced with the task of finding something less obvious, my advice to my colleague was to pick a pair of columns at random and come up with a model for the dynamics of the relationship. In such an exercise one picks the boring stuff, like transaction numbers instead of transaction values - and begins decoding from there. The goal is to look at the data sideways and see if anything interesting pops up.
Most often this fails. But it is a good way of breaking out of the data rut. When I do this exercise, I typically find myself desperate for one other set of observations to help explain what I am seeing. While the result may be boring, it is not failure as it gives you a direction with which to approach the data.
Currently I have a few projects that involve social networks in one shape or another. While my clients are generally looking for the standard orthogonal projection of the data, I can't resist the urge to look at things sideways.
A client was walking through the important data that they collect about social network activity, but when I talked with their developers they mentioned in passing that they also collected profile pictures. Not for analysis, but for another part of their suite. Pictures. Cool. Thems be data. Excitedly I professed, as if I actually knew what I was talking about, that there is probably heaps of juicy stuff inside profile picture data. Intrigued at my own confidence I decided to tackle this by scraping 250,000 profile pictures from MySpace and grabbing a few key stats about each account. The first thing I wanted to examine was whether profile pictures in any way informed the number of friends that a user had. They do.
As I have other plans for this data, I didn't scrape MySpace with the complete intention of doing this project, thus only 19,214 of the images have associated friend lists. But this was enough to get started.
First off I wrote a short C program to calculate 32 features from each image. These features are pretty typical image processing functions, like size, average color levels, number of colors, smoothness, symmetry and a few key points from the luminosity histogram. MySpace pictures tend to be a mix of faces, icons and general photos - to roughly help identify faces (without committing to facial specific measures) I also calculate a subset of these values for the central portion of the image, including recording the location of the vertical axis of greatest symmetry. Most of the values have been normalized to some image specific reference to increase variation and limit covariance, for example the average R,G,B values are expressed as a % of the image's luminosity.
To get a sense of the variation of these features, I constructed an image based on the first two principal components of the feature space. At this point some kind folks on Twitter (starting with Mark Reid) pointed me towards t-distributed stochastic neighbor embedding. Someone mentioned that I could simply forgo my feature calculation code and simply use tSNE on the pixel data, which sounded exciting, but after reading the paper I decided against it. In the paper the authors do demonstrate this technique, but their first step after reading in the pixel data is to perform PCA to reduce the dimensionality of the problem. And their image set was much more well behaved than the images I was working with. Maybe I'll take another look at tSNE in this context when I next have some free time.

Visualization aside, the next step was fairly simple. I divided my data into a 75% training set and reserved 25% for testing, attempting to predict log(# of friends) by my image features. Using a linear model was pretty poor, but not terrible. In sample I got an R^2 of 0.17, out of sample it was far worse. Using an SVM, I limited the training to classification rather than regression - trying to classify in groups by quantiles of log(# of friends). For a simple binary classification (more or less friends than the median sampled MySpace user), the accuracy was 70% - with errors evenly distributed across the two classes. I also tried 3 and 4 classes, and the lift was similar.
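For readers who want to try the same protocol, here is a sketch of the 75/25 split, the linear model, and the median-split classification. It runs on simulated data, not the MySpace data, so the numbers it prints say nothing about the results above. Logistic regression stands in for the SVM to avoid a package dependency, and all variable names are mine.

```r
set.seed(11)                                      # arbitrary seed
n <- 2000
# Simulated stand-ins: two 'image features' and a noisy log(friends)
d <- data.frame(f1 = rnorm(n), f2 = rnorm(n))
d$logFriends <- 3 + 0.5 * d$f1 - 0.3 * d$f2 + rnorm(n)

trainIdx <- sample(n, size = round(0.75 * n))     # 75% train, 25% test
train <- d[trainIdx, ]
test  <- d[-trainIdx, ]

# Regression: in-sample vs out-of-sample R^2
fit   <- lm(logFriends ~ f1 + f2, data = train)
r2in  <- summary(fit)$r.squared
pred  <- predict(fit, newdata = test)
r2out <- 1 - sum((test$logFriends - pred)^2) /
             sum((test$logFriends - mean(test$logFriends))^2)

# Classification: more friends than the training median, or not
medianCut  <- median(train$logFriends)
train$many <- as.integer(train$logFriends > medianCut)
clf   <- glm(many ~ f1 + f2, data = train, family = binomial)
guess <- predict(clf, newdata = test, type = "response") > 0.5
acc   <- mean(guess == (test$logFriends > medianCut))

c(r2_in = r2in, r2_out = r2out, accuracy = acc)
```

The median is taken from the training set only, so that nothing about the test rows leaks into the definition of the classes.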
To visualize this I performed the regression using an SVM and, as expected by the results of the classification, got a decent R^2 (0.25) on the out of sample test set. To get a better sense of the outliers, I produced the following visualization. Note in this visualization I have sacrificed some positional accuracy by enforcing a constraint that no images may overlap.

I also used a similar approach to test whether I could predict other, more interesting, network features like measures of centrality, and my initial results are positive.
At this point, if I were to run with it, I'd like to make some assessments as to the underlying process that relates social network features to image features. Until I have more time, my current hypothesis is Boobs.
For those of you with more time on your hands, I have packaged up some datasets. Grab them here.
Ratings Rant
12 Mar 2009, 03:56 AM
(I guess my previous comment got lost in the system. Maybe someone can bail it out)
First up, congrats Toby & Jesper on tackling this issue. It is a shame that the Money:Tech conference was cancelled this year as that would be the perfect venue to address a good mix of quants and techs and spark some serious discussion. Barring that, I wanted to chime in with my own 2c as a techy/quant guy.
[disclaimer]In my previous life I did a fair bit of equity research on the ratings agencies, and the views here are mine and probably not shared by others. I have no positions in any CRAs right now[/disclaimer]
While the talk focussed on corporate bond ratings, the largest growth area for these agencies was in structured finance. And much of this was mortgage backed securities and similar derivatives. So to understand the mess we are in now we need to look at the history of these instruments.
MBS's were born for two key reasons. First off was the realization that in 'normal' times the dominant risks were idiosyncratic in nature and as such could be minimized through the application of portfolio theory and diversification, leading to pooled entities with smoothed cash flows and tranches providing for the needs of various risk profiles. In my opinion this story is primarily the sizzle. The real steak was the fact that by aggregating together whole loans new tradeable instruments could be formed.
The problem with whole loans was that their pricing was highly dependent on a large vector of unstandardised parameters whose diversity precluded the formation of any depth necessary to support liquidity in traditional market designs. By eliminating the idiosyncratic risk components these pooled instruments could theoretically be summarized by a small set of parameters and relatively simple models for prepayment risk meant that traders could respond to bids and asks against them.
Faced with these simple models quants took off in a great fantastical leap and applied ever more complex techniques to model out pricing. For one take, see Felix Salmon's recent piece in Wired. Or look to Paul Wilmott's take on the ever escalating departure into a mathematical wonderland that ignored the realities of the underlying loans and their associated risks.
Somewhere along the line practitioners forgot that the technology underpinning the frothy new market was based on 1970's financial and computational technology. Back then a bank of associates armed with HP-12C's could price out MBS's using a small set of descriptive parameters.
Over the next 20 years more computing power was thrown at the problem, but the basic data was still confined in scope. Sure, some funds were taking apart these pools and doing a deep analysis of the components, but there wasn't much reward in doing so as the market was moving at such an upward clip.
Even worse, if you look at the papers from Frank Partnoy, the credit ratings agencies - who were supposed to be taking a deeper look at these securities, without the demands of second by second trading - were using plainly silly assumptions. There was a huge amount of mathematical and financial stupidity going on. Not even going to mention the conflicts of interest and the regulatory arbitrage at play...
Anyway... To address Falafulu's point about MPT - I agree, MPT is great stuff; a very powerful framework by which to understand finance. But just look at the assumptions. Sure, these assumptions make the math tractable, but modern computing power enables us to take a more nuanced view of the world. We no longer have to rely on single parameters of 'default risk' to price these instruments. The market would be far better served if traders were given all available data for the underlying components and could use their own information about their own risk profile to come up with better measures of value. Just compare David Einhorn's spreadsheet with a report put out by Moody's. It is night and day. Give me the data, not some puff piece of pseudoscientific nonsense passing itself off as high finance.
The original problem with trading whole loans, namely that there were too many parameters to support liquid markets, is no longer an issue. Look at WeatherBill. Look at Robin Hanson's work on combinatorial market mechanism design. Falafulu, sure some smart people were recognized for their groundbreaking work of decades ago. But the most recent John Bates Clark medal in Economics went to Susan Athey, who is doing some fantastic work in mechanism design.
Computational power is such that we no longer need to pretend that all financial instruments have to be priced with a slide rule. We have new marketplaces that can effectively support trade in financial instruments with high dimensionality. We have the computational power to let traders value these instruments. What we don't have is the data.
Give us the data and we will trade.
For a less ranty take on my world view, check out my blog post Data trades inversely to liquidity
Comments (5)
Data trades inversely to liquidity
20 Feb 2009, 04:36 AM
I recently voluntarily left my job running a 'renegade' equity research group to start an independent 'big data' consultancy business and in this economic environment this fact regularly gets me odd looks. Who in their right mind leaves a great position at a successful fund when half of Wall Street is battling to keep their jobs? This is not the first time I've done this, having run a similar consulting business right after the burst of the dot-com bubble. I don't have a great deal of life-wisdom, but if I did my only credo would be that when life hands you strawberries, it is time to go hunting for lemons. Don't rest on your laurels and the best time to be risk seeking is when everyone else is risk averse.
Having had to explain my rationale for launching i2pi as a consultancy frequently, I've come to rely on the phrase 'data trades inversely to liquidity.' This notion holds true in both my prior world of investment management and especially now in the data collection and analysis business. In finance, when markets are liquid price discovery is cheap. With all the talk right now of mark to market accounting treatments, it is clear that the converse is also true. Holders of illiquid securities can no longer rely on quoted prices to manage their portfolio risk. As the current crisis began to unfold earlier last year there was a very visible Mexican standoff while shops with CDO/etc. exposure refused to trade as the act of trading would force everyone else to reprice their own portfolios. Doing so could only last so long and the inevitable write downs began to occur as margins were being called. And thus the house fell.
The premise that led us to this mess was that with only a modicum of data and some threadbare models trading would be the final arbiter of value and the collective intelligence of efficient markets would result in fundamentally sound pricing. Now that liquidity has gone from the markets, traders of these illiquid instruments are bulking up their data and models to try and better their understanding of fundamental value. And so it is that when markets are liquid the market relies on trading to assimilate the information of individual agents. Without this method of price discovery these agents need to gather their own data as the market no longer performs the role of grand aggregator. Data trades inversely to liquidity.
While my work at the fund was phenomenally diverse and deeply intellectually stimulating, there was no fire. I've never had a real job. I've only ever worked at start ups where there is no time for a 'job.' In a constant state of conflagration, everything at a start up requires immediate attention. Early on in my career I worked as a back-end system engineer and 'fire' usually involved dealing with scalability and general growing pains. Late nights implementing features that were sold to paying clients well before the development team was consulted. Later I spent more time selling these features and there was a constant fire to come up with new and interesting things to attract clients and revenue. At the fund our financial stability was near certain and while there was a drive for deeper insight, the fire was luke-warm at best.
The current financial crisis is, at its core, rooted in the debt markets and this dislocation has clearly negative consequences for start up financing. Contemporaneously, new technologies and operational methods allow technology start ups to scale efficiently. Cloud Computing, as distinct from the similar buzz about Web Services just a few years ago, provides a platform upon which small companies can grow their operations in proportion to their needs without large capital investments in hardware or expensive, unwashed and hirsute systems administrators. Hadoop, Memcache and their ilk let developers build applications that operate on huge data sets without investing in the expensive vertical scaling solutions of Oracle & Co. And social networking results in network driven growth patterns that can be much steeper than products or services that live on an island. However, the skills required for scaling analysis systems are quite different to those needed to scale operations; part statistician, part database administrator, part computational micro-economist, and then an understanding of business to tie together a narrative that tells stories with numbers rather than purely stories about numbers.
The environment I see around me for technology start ups is one whereby funding is hard to come by and series B's are even more painful to founders. These companies need to be smart. They need to focus some of their attention on what to do with all the data that they gather as part of daily operations. Development teams, while facing the world with more appropriate tools than those available a few years ago still focus on operations. Someone needs to focus on research. Data driven research. In times when funding was easier and valuations were higher, companies could focus on operations and hope that that operational scale would lead to revenue. I firmly believe that in this environment data rather than scale alone is of immense value.
Beyond the empirical sciences of revenue optimization, behavioral targeting, customer segmentation and the discovery of on-line arbitrage there lies a need for basic research that develops fundamental models for understanding the environment in which a firm operates and how to question the unknown. This type of science is closer to applied sociology or experimental physics than the tradition of ponderous economics in that it is driven by and dependent on the immense mass of data that new businesses are generating. But as with all science the role is to simplify; to construct a narrative upon which new questions can be found and companies can learn to change how they think about their operations rather than simply what they know.
i2pi seeks to bring this science to companies.
Comments (3)
When to cheat
3 Jan 2009, 10:31 AM

Sometimes, when trying to optimize a computer system, you get to a point whereby on one hand the next x% of optimization will take orders of magnitude more time than the previous x%. And on the other hand, you can completely change your marginal optimization cost if you take a different approach by approximation. Good systems can have hardware thrown at the problem to scale. As has been mentioned elsewhere, hardware is often cheaper than programmers, so we tend to go as far as we can by taking this approach - until it no longer works. Perhaps a more efficient algorithm can be implemented to lower the marginal cost of scaling. Unfortunately there comes a point whereby traditional algorithmic optimizations fail to change the equation enough to manage your costs. At this point you may need to cheat to scale.
Knowing when to approximate requires an understanding of the costs of doing so. If your system is responsible for proprioception in an automated x-ray system, then getting it wrong is to be avoided at all possible costs. Everyone would like to be as accurate as possible, but this is not always cost efficient. If you are running an analytics system, being wrong costs less than you might think. Any system that produces statistics based on sampled observations has room for approximate solutions. Programmers tend to forget that such statistics contain error. Programmers tend to think that if their code is free of logical flaws, then the output must be error free. If you observe 12 users clicking on an ad that was displayed 1,000 times then the click through rate must be 1.2%. Click through rate, in this case, is a statistic - a way of summarizing raw data. However, click through rate is also a measure of some innate clickability that is driven by the ad, its context and its viewers. Note that we can't possibly measure these abstract quantities, so we use our observed behavior of 12 clicks from 1,000 ads as a guide to navigating the underlying complex abstract system of interactions that determine the clickability of an ad.
As soon as we produce a report that states that click through rates are 1.2% on all Fridays then we are implicitly giving the caveat of 'based on all Fridays we have seen.' But we must admit some error as soon as anyone tries to generalize about all future Fridays from that one statistic. A statistic used as an estimate must come with an acknowledgement that without seeing the entire population there will always be the chance of being wrong. A correct software system coupled with perfect data capture ensures that the sample statistic will be correct but there will always be a chance for error in its estimates. In evaluating the cost of deviating to algorithms that are only probabilistically correct or involve some form of approximation, we must concentrate on the increased cost caused by a possible increase in the rate or nature of errors.
There are a number of things to consider when determining the amount of error introduced when producing estimates, but all methods involve two numbers, a confidence interval and a confidence level. These will depend on what you are estimating and how you go about it. Estimating a population maximum behaves differently in the presence of approximation than does the estimation of the mean or median. Approximating by taking every second sample may have a different impact than only looking at the first half of a sample. To begin, one needs to look at the baseline error in your estimate as produced by an algorithm that uses no approximation. We may find that our estimate of Friday click through rate of 1.2% has a confidence interval of +/- 0.2% at a confidence level of 95%. This tells us that the underlying click through rate plausibly lies between 1.0% and 1.4%, and that intervals built this way will capture it 95% of the time. We then examine how changing the method of measurement by introducing approximations will change the confidence interval for a given confidence level.
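To make the two numbers concrete, here is a minimal Python sketch. It assumes the simple normal approximation for a proportion, which is only one of several ways to build such an interval, and the click and impression counts are hypothetical, chosen so the baseline lands near the +/- 0.2% figure used above. The function name `proportion_interval` is likewise just an illustration.

```python
import math


def proportion_interval(clicks, impressions, z=1.96):
    """Normal-approximation interval for a click through rate.

    z = 1.96 corresponds to a 95% confidence level.
    Returns (estimate, half_width).
    """
    p = clicks / impressions
    standard_error = math.sqrt(p * (1 - p) / impressions)
    return p, z * standard_error


# Hypothetical counts: a 1.2% rate observed over 12,000 impressions.
full_rate, full_half = proportion_interval(144, 12_000)

# "Cheating" by keeping every second observation halves the sample
# (assuming the kept half shows the same 1.2% rate).
half_rate, half_half = proportion_interval(72, 6_000)

print(f"full sample: {full_rate:.2%} +/- {full_half:.2%}")
print(f"half sample: {half_rate:.2%} +/- {half_half:.2%}")
```

Under these assumptions, halving the sample widens the interval by a factor of the square root of two, from roughly +/- 0.2% to roughly +/- 0.3%.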
From this point we compare the marginal cost of error against the difference in cost of optimization by approximation and the cost of scaling using the current cost to scale. Approximation may change the confidence interval from +/- 0.2% to +/- 0.3%. How much this costs you is a question of risk management and depends on the economics of the business that you are in. Click through rates might be used to tweak an advertising budget. If you are buying ads on a cost per click basis, then the cost of the increased uncertainty in click through rate will be determined by the cost of each click. The cost of implementing an approximation versus traditional scaling will be determined by how analytically optimal your current system is and the costs associated with scaling be it by hardware, software or approximation. If the increase in cost caused by approximation is lower than the cost of buying more hardware or implementing more complex algorithms, then cheat.
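The decision rule in that last sentence can be written down as a small Python sketch. The function name `should_cheat` and every dollar figure are hypothetical; how you actually price the extra error depends on your own business.

```python
def should_cheat(extra_error_cost, approximation_cost, exact_scaling_cost):
    """Return True when approximating is cheaper overall than scaling exactly.

    All three costs must cover the same period and use the same currency.
    """
    return extra_error_cost + approximation_cost < exact_scaling_cost


# Hypothetical monthly figures in dollars, made up for illustration.
decision = should_cheat(
    extra_error_cost=1_500,      # budget wasted because the interval is wider
    approximation_cost=2_000,    # engineering time to build the approximation
    exact_scaling_cost=10_000,   # extra hardware or a cleverer exact algorithm
)
print("cheat" if decision else "scale exactly")
```

The hard part in practice is estimating the first argument, which is where the confidence interval comparison comes in.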
Comments (1)
What is i2pi?
2 Jan 2009, 19:58 PM
Apologies about the formatting here, LaTeX doesn't look so pretty as HTML. For a pdf version, click on the words 'pdf version'.
People often ask me what i2π means. To me, i2π represents the pleasure of stringing together simple principles to arrive at a
beautiful understanding of the nature of something. Here I am going to share with you the way I think about i2π in a
mathematical context. This doesn’t represent the full history or complete derivation, but it is how I like to think about
things. If you want to learn more, Wikipedia is a good place to start. If you want to see things through my eyes, for just a
moment, stick around. If you scroll ahead it may look a little scary to the mathematically uninitiated, but we won’t be
using anything more complicated than mid-range high school math to get there. And I think the journey is
worthwhile.
To understand what i2π is all about we really have to start with Euler’s number, e. Euler’s number is a special
number in mathematics and it appears in many equations across mathematics, physics, statistics and economics
because it has a number of unique properties. My favourite property is a variation on what is known as Euler’s
identity:
$$e^{i2\pi} = 1$$
And that is why we are starting our i2π journey with e. My next favourite property of e is the relationship between $e^x$ and
its derivative $\frac{d}{dx} e^x$, namely
$$\frac{d}{dx} e^x = e^x$$
Those who have taken high-school calculus take this for granted. Those who didn’t or who have forgotten it are probably
scratching their heads. For their sake, and my own, let’s quickly review derivatives.
Derivatives are a convenient way of describing the slope of a line. Take the equation for the line $y = 2x$, then for each unit
increase in x, we get 2 units of increase in y. The slope of a line is the ratio of the change in the output of the function
that describes it to the change in the input. In this case, x is the input and y is the output, and increasing x by
1 increases y by 2, so the slope is 2:1 or simply 2. The notation $\frac{d}{dx} f(x)$ describes the slope of some function f(x) as x
changes. The equation $y = 2x$ has the same slope for all values of x, so we say the slope is a constant. More
complicated functions, like $f(x) = x^2$ (in the figure below) are curved, so the slope changes as we change
x.

Without going too deep into calculus, it is known that $\frac{d}{dx} x^2 = 2x$. In other words, for any point x, the slope is 2x. If you look
in the figure above, when x = 0, the slope is 0. As x increases to the right of zero, the slope gets steeper and steeper. To the
left, as x becomes increasingly smaller than zero, the slope becomes increasingly negative. In fact, for any k,
$$\frac{d}{dx} x^k = k x^{k-1}$$
For example, $\frac{d}{dx} x^7 = 7x^6$.
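If you would rather see this than take it on faith, the slope rule can be checked numerically. This is a minimal Python sketch, not part of the derivation itself; the helper name `slope` and the step size `h` are illustrative assumptions. It estimates the slope as a small change in output divided by a small change in input.

```python
def slope(f, x, h=1e-6):
    """Estimate the slope of f at x as (change in output) / (change in input)."""
    return (f(x + h) - f(x - h)) / (2 * h)


x = 1.5
print(slope(lambda t: 2 * t, x))           # the line y = 2x: slope close to 2
print(slope(lambda t: t**2, x), 2 * x)     # compare with the rule 2x
print(slope(lambda t: t**7, x), 7 * x**6)  # compare with the rule 7x^6
```

The two numbers on each of the last two lines should agree to several decimal places, with the small gap coming from the finite step size and floating-point rounding.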
When you take the derivative (find the slope) of most functions, the answer is usually some modified form of the function you
started with. However, the exponential function, $e^x$, is the simplest example of the case where the derivative is equal to the
function. Up to this point, we haven’t even worked out what the numerical value of e is, but let us try to
define e by starting with the fact that $\frac{d}{dx} e^x = e^x$. To do this, we will need to take a detour into the world of
factorials.
Imagine that we have 5 books that we want to place on a shelf. How many different ways can we arrange them? Working
methodically, from left to right, there are 5 possible books we could put in the leftmost spot on the shelf. Once we choose
the first book to place there, we are left with 4 possible books to place alongside it. And once we choose
that book, 3 books remain. After placing the third book, only 2 more remain, and so on. This means there
must be 5 × 4 × 3 × 2 × 1 = 120 ways of arranging those 5 books. If we had 100 books, it would take up
too much paper to write down 100 × 99 × 98 ×… × 1, so we use the shorthand 100!, which we pronounce
’100 factorial’. One hundred factorial is a big number. If we built a book arranging robot that could do one
billion arrangements per second, and had one billion of them running in parallel since the big bang, they still
wouldn’t be finished trying all the possible ways to arrange the books. Thank god for the Dewey Decimal system,
eh?
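For the sceptical, the book-arranging claims can be checked with Python's arbitrary-precision integers. This is a rough sketch: the age of the universe (about 13.8 billion years) is an assumed round figure, and the constant names are made up for the example.

```python
import math

# Ways to arrange 5 books on a shelf.
print(math.factorial(5))  # 120

# Rough, assumed figures for the robot thought experiment.
SECONDS_SINCE_BIG_BANG = 13.8e9 * 365.25 * 24 * 3600
ARRANGEMENTS_PER_SECOND = 1e9  # per robot
ROBOTS = 1e9

tried = SECONDS_SINCE_BIG_BANG * ARRANGEMENTS_PER_SECOND * ROBOTS
total = math.factorial(100)
fraction_done = tried / total

print(f"100! has {len(str(total))} digits")
print(f"fraction of arrangements tried so far: {fraction_done:.1e}")
```

Even with a billion billion arrangements per second, the fraction completed is so small that the robots have barely started.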
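Ok. So factorials are just a shorthand way of writing down a special type of successive multiplication. We can use the formula
for the slope of $x^k$ to find the slope of
$$\frac{x^k}{k!}$$
We can re-arrange this to be
$$\frac{1}{k!} x^k$$
If you multiply some function by a constant, C, then the slope is also multiplied by C. From the rule we saw 2 paragraphs ago, we know that
$$\frac{d}{dx} \frac{1}{k!} x^k = \frac{1}{k!} k x^{k-1} = \frac{k x^{k-1}}{k!}$$
We also know that $k! = k \times (k - 1)!$. For example, 6! = 6 × 5! = 6 × 5 × 4 ×… × 1. So
$$\frac{d}{dx} \frac{x^k}{k!} = \frac{k x^{k-1}}{k \times (k-1)!}$$
Cancel out the k in the numerator and denominator, and we get
$$\frac{d}{dx} \frac{x^k}{k!} = \frac{x^{k-1}}{(k-1)!}$$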
Now let’s tackle a slightly more complicated function,
$$g(x) = \frac{x^0}{0!} + \frac{x^1}{1!} + \frac{x^2}{2!} + \frac{x^3}{3!} + \dots$$
The first 2 terms of this function are pretty simple. Recall that factorials count the number of ways of arranging objects. There is only one way to arrange zero books, so 0! = 1. Also recall from math class that $x^0 = 1$, so
$$\frac{x^0}{0!} = \frac{1}{1} = 1$$
There is only one way of arranging one book and $x^1 = x$, so
$$\frac{x^1}{1!} = \frac{x}{1} = x$$
Therefore
$$g(x) = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \dots$$
I know this comes out of nowhere, but just go with the flow. What is the derivative of this function? Derivatives are additive so we can just do each bit individually and add them together. The derivative of 1 is 0, as it is a flat line - hence no slope. The derivative of x is 1, as it is a line with a 45 degree slope. And we just worked out the rest in the previous paragraph, so
$$\frac{d}{dx} g(x) = 0 + 1 + \frac{x}{1!} + \frac{x^2}{2!} + \dots$$
Hey, wait up. If we drop the leading 0, which we can, then that’s just g(x) again.
$$\frac{d}{dx} g(x) = g(x)$$
To see this requires some majorly deep and majorly simple insight. When things are deep AND simple, they
are beautiful. What just happened was that even though the first term of the sum disappeared by turning
into zero, the rest of the sum remained. Because we defined g(x) to go on from $\frac{x^3}{3!} + \dots$ to infinity, for each
term that drops off the front, there are an infinite number of terms to make up for it. Infinity less one is still
infinity.
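The shifting of terms can be watched directly on a truncated version of the sum. The sketch below is illustrative only: it stores the coefficients of 1, x, x^2 and so on as exact fractions, and the function names `g_coefficients` and `differentiate` are made up for the example.

```python
from fractions import Fraction
from math import factorial


def g_coefficients(n_terms):
    """Coefficients of 1, x, x^2, ... in g(x), truncated to n_terms terms."""
    return [Fraction(1, factorial(k)) for k in range(n_terms)]


def differentiate(coefficients):
    """Apply the rule d/dx x^k = k x^(k-1) to every term of a polynomial."""
    # The constant term is multiplied by 0 and dropped by the slice.
    return [k * c for k, c in enumerate(coefficients)][1:]


g = g_coefficients(8)
dg = differentiate(g)

print(g)
print(dg)
print(dg == g[:-1])
```

With only 8 terms the derivative comes out one term short, because a finite sum has nothing to replace the term that fell off the front. The infinite sum never runs out.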
Recall that we defined $e^x$ to have the property that
$$\frac{d}{dx} e^x = e^x$$
and we now have a function where
$$\frac{d}{dx} g(x) = g(x)$$
These properties are the same. The function has itself as its derivative. And it just so happens that g(x) is $e^x$. To
understand why, we need to look at Taylor’s Theorem. So point your browser to Wikipedia if you are so inclined and join us
in the next paragraph to continue.
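For those who skipped the Wikipedia trip, a numerical check is a reasonable substitute for the proof. This sketch sums the first 20 terms of g(x), an arbitrary cut-off, and compares the result with Python's built-in exponential.

```python
import math


def g(x, n_terms=20):
    """Partial sum of x^k / k! for k = 0 .. n_terms - 1."""
    return sum(x**k / math.factorial(k) for k in range(n_terms))


for x in (0.0, 1.0, 2.5):
    print(x, g(x), math.exp(x))
```

Setting x = 1 also answers the question left open earlier about the numerical value of e: it is g(1), roughly 2.71828.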
Great. Welcome back. The point of this post is to explain i2π and so far we have only covered e, so let’s get a move on and
have a look at i and π. I assume everyone is cool with 2.
What is i? i is the square root of negative one.
$$i = \sqrt{-1}$$
So, $i^2 = -1$. When I first encountered i I asked a family friend / math professor to explain it to me. All the books I had read just talked about ’complex numbers’ and I wanted to understand what made them ’complex.’ She explained to me that they aren’t complex, in the sense that complex means difficult. They are just different to the normal numbers we usually encounter. In school you would have learned that the square root of negative numbers is undefined. But they turn up so frequently that man invented a new class of numbers to allow us to define them. Numbers are just symbols for the abstract concept of quantity. And i is just a symbol for the square root of negative one. While it is not possible to have i books or bananas, we can still do mathematics with i and end up with real world numbers. For example,
$$i \times i = i^2 = -1$$
And $i^4 = i^2 \times i^2 = -1 \times -1 = 1$. So while we can’t buy i bananas, we can buy $i^4$ bananas, because $i^4$ is 1. As you keep on raising i to higher and higher powers, a pattern emerges. $i^1 = i$, $i^2 = -1$, $i^3 = i^2 \times i = -i$ and $i^4 = 1$. When we look at $i^5$ we find $i^5 = i^4 \times i = 1 \times i = i$, and the pattern repeats. For no apparent reason, let’s sum up all the powers of i:
$$i^0 + i^1 + i^2 + i^3 + i^4 + i^5 + i^6 + i^7 + \dots$$
$$= 1 + i - 1 - i + 1 + i - 1 - i + \dots$$
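To see the pattern more clearly, let’s split up the odd and even terms
$$i^0 + i^2 + i^4 + i^6 + \dots = 1 - 1 + 1 - 1 + \dots$$
$$i^1 + i^3 + i^5 + i^7 + \dots = i - i + i - i + \dots = i \times (1 - 1 + 1 - 1 + \dots)$$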
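Python has complex numbers built in, so the repeating pattern is easy to see for yourself. In this small sketch the only thing to know is that Python spells i as `1j`; the variable names are otherwise arbitrary.

```python
i = 1j  # Python's spelling of the imaginary unit

power = 1 + 0j  # i to the power 0
cycle = []
for k in range(8):
    cycle.append(power)
    power *= i  # multiply by i to get the next power

for k, value in enumerate(cycle):
    print(f"i^{k} = {value}")

even_powers = cycle[0::2]  # i^0, i^2, i^4, i^6
odd_powers = cycle[1::2]   # i^1, i^3, i^5, i^7
```

The even powers alternate between 1 and -1, and the odd powers alternate between i and -i, repeating every four steps.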
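Out of sheer curiosity, let’s find out what pattern we would get if we expanded $e^{ix}$ using the formula we found before.
$$e^{ix} = \frac{(ix)^0}{0!} + \frac{(ix)^1}{1!} + \frac{(ix)^2}{2!} + \frac{(ix)^3}{3!} + \frac{(ix)^4}{4!} + \frac{(ix)^5}{5!} + \dots$$
We know that the pattern in the i’s comes out nicely if we split out the evens and odds, so let’s call the even part of the right hand side of the equation a(ix) and the odd part b(ix):
$$a(ix) = \frac{(ix)^0}{0!} + \frac{(ix)^2}{2!} + \frac{(ix)^4}{4!} + \dots \qquad b(ix) = \frac{(ix)^1}{1!} + \frac{(ix)^3}{3!} + \frac{(ix)^5}{5!} + \dots$$
Now using the -1 + 1 - 1 + 1 -… pattern, we get
$$a(ix) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \dots$$
Likewise for the odd terms,
$$b(ix) = i \times \left( x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \dots \right)$$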
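Now when you went across to Wikipedia to check out Taylor’s Theorem, you will have seen that
$$\cos(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \dots$$
Which is totally the same as our a(ix). I may not be hip and fresh with the rock'n'roll and skateboarding, but I know cool when I see it, and that is cool. We also know that
$$\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \dots$$
Which means that b(ix) = i × sin(x). Putting a and b together, we find that
$$e^{ix} = \cos(x) + i \times \sin(x)$$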
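The even and odd split can be checked numerically as well. This is a sketch under a couple of assumptions: the sum is cut off after 30 terms, the angle 0.75 is arbitrary, and the name `even_odd_sums` is invented for the example.

```python
import cmath
import math


def even_odd_sums(x, n_terms=30):
    """Split the series for e^(ix) into its even part a and its odd part b."""
    terms = [(1j * x) ** k / math.factorial(k) for k in range(n_terms)]
    a = sum(terms[0::2])  # even powers of ix
    b = sum(terms[1::2])  # odd powers of ix
    return a, b


x = 0.75  # any angle in radians
a, b = even_odd_sums(x)

print(a, math.cos(x))             # a should match cos(x)
print(b, 1j * math.sin(x))        # b should match i * sin(x)
print(a + b, cmath.exp(1j * x))   # together they rebuild e^(ix)
```

Each pair printed should agree up to floating-point rounding.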
Now that we have gone from summing polynomials to trigonometry, it may be becoming clear where the π fits
in. π is a special number that defines the ratio between a circle’s circumference and its diameter. If a circle
has a diameter of one furlong, then its circumference will be π furlongs. π is also used to measure angles, the
same way as degrees are. In a circle there are 360 degrees, but mathematicians like to say that a circle has 2π
radians. That is, if you have a circle with a radius of 1 foot, then the circumference will be 2π feet. If you
were to walk around this circle through 1/8th of its circumference, you will have moved 45 degrees, or $\frac{\pi}{4}$
radians.
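The cosine and sine take an angle in radians and tell you the x and y coordinates of that point on a circle of radius 1.
If you walk 45 degrees anti-clockwise around a circle starting from the point (1,0), then you will end up at
position
$$\left( \cos\frac{\pi}{4}, \sin\frac{\pi}{4} \right) = \left( \frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}} \right)$$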
If you walk 2π radians around the circle, you will have gone 360 degrees and end up where you started. So
$(\cos 2\pi, \sin 2\pi) = (1,0)$
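The walk around the circle is easy to reproduce with Python's math module, which works in radians. The function name `position_after_walking` is just an illustration.

```python
import math


def position_after_walking(radians):
    """Coordinates reached walking anti-clockwise from (1, 0) on a unit circle."""
    return math.cos(radians), math.sin(radians)


eighth = position_after_walking(math.pi / 4)  # 45 degrees
full = position_after_walking(2 * math.pi)    # 360 degrees

print(math.degrees(math.pi / 4), eighth)
print(math.degrees(2 * math.pi), full)
```

The full lap prints a y coordinate that is a tiny number rather than exactly 0. That is floating-point rounding, not a flaw in the geometry.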
So the function $e^{i\omega}$ tells us the coordinates of where one ends up after walking ω radians around a circle of radius 1.
We have been writing down our coordinates as (x,y), where x = cos(ω) and y = sin(ω), but as we found out
earlier,
$$e^{i\omega} = \cos(\omega) + i \times \sin(\omega)$$
If we think of i as being a symbol to represent our distance in the y direction, then we can convert from x + iy to (x,y). And if we walk around a full circle and end up at the beginning point of (1,0), we can convert back to 1 + i × 0 = 1. Therefore
$$e^{i2\pi} = 1$$
I like i2π as this term comes up across a range of mathematical equations and you can go a long way in learning about
mathematics by looking into the origins of this famous identity. Simplicity, depth & beauty.
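As a final sanity check, Python's cmath module can evaluate the identity directly. A minimal sketch, where the tolerance passed to `cmath.isclose` is an arbitrary choice:

```python
import cmath
import math

z = cmath.exp(1j * 2 * math.pi)

print(z)  # very close to 1; any tiny imaginary part is floating-point rounding
print(cmath.isclose(z, 1, abs_tol=1e-12))
```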
Comments (4)
Twit or Love
30 Dec 2008, 20:44 PM
I recently rejoined twitter and was completely blown away by the amount of discussion on twitter that is about twitter. I am reminded of a friend who once told me that the only thing you learn by taking drugs is how to take drugs. Conscious of this, I have refrained from mentioning twitter on twitter, but given that I have talked about it before on this blog, I feel OK bringing it up here.
This morning I saw a post from my friend Aaron:
neonarcade I wish Twitter was more about what people are doing again, and less a generally boring conversation about Twitter and social media.
I was compelled to act. My initial thought was to reskin twitter, or make a firefox plugin that removes all mentions of twitter from the feed. However, by the time I reached the office I settled on a single service site, as is the style of the time. So I now present Twit Or Love. Right now, the word 'twitter' has been mentioned 1.54x more frequently than the word 'love.' I use a highly accurate Markov-Chain Monte-Carlo technique to arrive at this unbiased estimate. My Gibbs sampler (unrelated to melting snow with salt) is coded in Erlang with the front end dynamically generated by a Scala program that writes out Ruby on Rails code. It is hosted on EC2. And uses map reduce. Enjoy
Comments (2)