not knowing is half the battle

3 Feb 2010, 11:12 AM

A decade ago I was pitching web analytics software to a number of retail banks back in Australia. The pitching process was intense. Not only did we have to explain what web analytics was all about to marketing executives who had just discovered the internet, but we also had to get past operations teams who were held to a five nines standard. Five nines, or 99.999% uptime, means six seconds of downtime per week. This attitude pervades the development of banking products, both in ways that are beneficial to the user and in ways that make what should be simple experiences painful.
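
The arithmetic behind that six second figure is easy to check. A quick sketch in R (the function name downtimePerWeek is made up for the example):

```r
# Allowed downtime per week for a given number of "nines" of availability.
secondsPerWeek <- 7 * 24 * 60 * 60            # 604800

downtimePerWeek <- function (nines) {
	# 5 nines means an unavailable fraction of 10^-5
	10^(-nines) * secondsPerWeek
}

nines <- 3:5
data.frame (nines=nines, seconds=round (downtimePerWeek (nines), 1))
```

Three nines allows roughly ten minutes a week, four nines about a minute, and five nines about six seconds.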

Running a large teller machine network requires a serious investment in uptime. With thousands of transactions running through the American card networks each second, the fallout of losing even a fraction of a percent is too serious a risk to face light-heartedly. As a whole, the network is amazingly reliable. But point of sale machines don't adhere to the same reliability levels as the network as a whole. The machines that read your card swipes at stores are subject to risks from power failure, phone network outage or even running out of paper before coming close to hitting a system-wide network failure. In the event of a single failure at your local bodega, affected customers have the option of walking across the street to use an ATM or make some other minor adjustment to cope with a local problem. If the network as a whole were to go down, even for a short period of time, there would be chaos.

An individual POS machine meets local demand at 99.9% availability. This is how the system is meant to work. Likewise, the internet. The internet is designed to be fault tolerant, with the realization that it is cheaper (and often plain better) to tolerate failure instead of going to great lengths to avoid it. Your POS terminal could reach 4 nines if it were housed in a fancy co-location center, but that would hardly be convenient. Yet when we were pitching marketing analytics to the banks we got the impression that if banking technologists had their way, all bodegas would have multi-homed candy aisles and drip coffee machines with backup generator power.

While this attitude has shifted since the formation of independent internet banking groups as distinct from core banking operations, tight coupling between legacy systems results in a development process that is just completely wrong. Banking is complex. Retail operations are tightly regulated. Applying this mentality to every aspect of banking leads to unnecessary inflexibility. Yet successful online experiences are defined by development processes that rapidly iterate and evaluate. The idioms for online interactions are rapidly changing – and at this point in our short history of internet usage it is difficult to see many points of convergence. This is the zeroth hour and iteration begets discovery of new interactions and continuing evolution of a common language for working with the web.

As many of you know, I'm working with a great group reinventing retail banking. A big part of what we want to get right is our user experience. In researching this project, I've signed up with a bunch of different banks across America. The typical process for doing so heavily reflects the forms that you fill and experience that you would have opening an account at a branch. While many of these form fields are required by law, the end user experience is heavily informed by legacy development processes. Branches were satellite offices, connecting to mainframes via expensive WANs. The process feels unwieldy to new customers as their applications move in lock step through a system that was designed as a series of incremental improvements over a pen and paper driven process of just two decades ago.

Most banks operate their core transactional processing systems on a batch cycle. This stems from the way the Fed works with banks for overnight lending. And as the internet groups budded off from core banking operations, the process of institutional mitosis resulted in a bunch of useless DNA being carried over. One of the banks I have accounts with (a top 5 US bank) regularly closes their internet banking site on Sunday nights for scheduled maintenance. When did Facebook last shut down for maintenance, scheduled or otherwise?

One thing we are working very hard towards is the distinction between parts of our banking service that must be highly reliable/available and those where we can iterate quickly. For example, take call center operations. A good phone system, simply due to variability in customer demand, will need staffing slack to reach a given quality of service – in terms of expected time on hold, and time to resolution. You can spend a tonne on reaching a certain level of technological availability, but have little impact on key measures of customer satisfaction due to the different costs involved in staffing. With a five nines phone system you can still deliver a shitty service if your call center capacity is capped out by 100% staff utilization.
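
To put some rough numbers on the staffing slack argument, here is a sketch using the textbook Erlang C formula for an M/M/c queue. The modelling assumptions (Poisson arrivals, exponentially distributed call lengths, ten agents, five minute calls) are purely illustrative and are not drawn from any real call center:

```r
# Erlang C: chance a caller has to wait, and the mean time on hold,
# for an M/M/c queue with `agents` servers.
erlangC <- function (lambda, mu, agents) {
	a   <- lambda / mu                  # offered load, in agents' worth of work
	rho <- a / agents                   # staff utilization, must stay below 1
	stopifnot (rho < 1)
	k      <- 0:(agents - 1)
	queued <- a^agents / factorial (agents) / (1 - rho)
	pWait  <- queued / (sum (a^k / factorial (k)) + queued)
	list (utilization=rho, pWait=pWait, meanHold=pWait / (agents * mu - lambda))
}

# 10 agents, calls average 5 minutes (mu = 0.2 calls per minute per agent)
rates <- c(1.0, 1.6, 1.9, 1.98)         # arriving calls per minute
for (lambda in rates) {
	r <- erlangC (lambda, mu=0.2, agents=10)
	cat (sprintf ("utilization %3.0f%%  mean hold %6.2f min\n", 100 * r$utilization, r$meanHold))
}
```

The shape matters more than the exact figures: time on hold blows up as utilization approaches 100%, regardless of how many nines the phone switch delivers.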

When making decisions about technology behind a retail bank, such as the call center or web site, we choose to trade early wins in the cost of scaling for iterability. Large, complex and thus rigid systems make it difficult to evaluate competing operating procedures. Short sighted metrics for success lead to short sighted incremental improvements. Free from the constraints of public markets we are able to take a risk and try something different – even if we don't know, a priori, how different it has to be. We believe it is critical to be able to try new things quickly, learn from our customers and improve their experience based on data.

I'm always scared to have our audience ask "What about feature X?" during our early technology demos. More often than not "feature X" is already in our feature tracking system, but marked as "Will not fix." Even very simple features that would take only minutes of development time to implement have far reaching consequences. If we added the ability to filter transactions by date, for example, a number of quick decisions would be made about how to implement the user interface - with no possible implementation resulting in an interface less complex than simply leaving the feature out.

Additional complexity, without a clear use case, is bad. The flexibility to add new features as they are justifiably demanded is good. Complex systems work best when they are adaptive. Designing new features in bulk and dumping them on users after a 12 month development cycle is just cruel. Especially so in banking, where mistrust is rampant and fear of making a mistake is justified. Better to iterate quickly and support an adaptive complexity landscape.

People often ask us what it is that makes us better than other banks. Glibly, I respond that we are just a plain old retail bank – but we don't suck. Not sucking is our killer app.

I don't know what that means in terms of fine grained details for future features. Sure, we have prior beliefs as to what experiences suck more than others under the current banking model, and where we should appropriately spend our valuable time. But we are also fine with being wrong - hell, we expect it. The only thing that we believe is that by setting ourselves up to respect and learn from our customers' experience, we win. Other banks just can't do that.


Photo by: riv / flickr

Connecting with your money

17 Dec 2009, 2:12 PM

According to Mr. Meara, 90 percent of all transactions with bank tellers involve checks. If everyone had an iPhone deposit app, people wouldn't come into the branch as often. That would be fine had banks not invested so much time and energy in training branch workers to persuade checking account customers to move into more profitable products.

"On the one hand, fewer deposit transactions could mean a headcount reduction," he said. "But it invites the erosion of store profitability. The banks are struggling with the enormity of what it means."

Hurry up & credit my account New York Times, September 18, 2009

Brand apathy is rampant in retail financial services. The number one sales channel is the branch. That expensive, increasingly empty, retail space is how banks sell new products to customers. "Is there anything else that I can help you with today" is the alpha and omega of retail bank marketing, with a small epsilon for banners of smiling families plastered inside branches.

Below is a screenshot of what I see when I log into my American Express account. A small portion of the screen is stuff I care about. A small, but significant, portion of my emotional well being rides on those numbers. The bulk of the page is dedicated to selling me stuff. How about this? Make understanding and working with my money easier - make me happier, and then I'll be far more receptive to upsells. But if you can't even get basic information and interaction right, then I'm too busy worrying about my current state of affairs to consider new fangled products and the incremental complexity they entail.



Selling customer data

16 Dec 2009, 3:40 PM

As someone who has spent most of their working career selling data to advertisers, I'm surprised by the number of businesses that are predicated on the model of selling data to advertisers. If you have a great widget, it is easy to get it in the hands of millions of people, especially if you are giving it away for free. As programmers and scientists we deify data. What we don't do is understand advertising. Sure we understand that advertising is about selling stuff, but we don't seem to get that the advertising industry exists to sell ways of selling stuff.

Whether an external agency or an internal group, advertising professionals have to convince others that they are adding value. And if your model is simply to sell data to advertisers, you have to convince them that your data is worth at least as much as what you are selling it for. Your data needs to be useful. Smart techniques are popping up for doing interesting things with large amounts of data. But interesting isn't always useful. If you think your customer data is useful, you should use your data to make your product better. Data that is valued by your users for the richer experience it provides is likely to be valued by others. If you can't improve your customers' experience with their own data, your data is worth nothing.



blockquote

12 Dec 2009, 12:12 AM

While I'm ranting, let me ask you something, Randall. At the risk of sounding like Glenn Beck Jr. - what the fuck has gone wrong with our country? Used to be, we were innovators. We were leaders. We were builders. We were engineers. We were the best and brightest. We were the kind of guys who, if they were running the biggest mobile network in the U.S., would say it's not enough to be the biggest, we also want to be the best, and once they got to be the best, they'd say, How can we get even better? What can we do to be the best in the whole fucking world? What can we do that would blow people's fucking minds? They wouldn't have sat around wondering about ways to fuck over people who loved their product. But then something happened. Guys like you took over the phone company and all you cared about was milking profit and paying off assholes in Congress to fuck over anyone who came along with a better idea, because even though it might be great for consumers it would mean you and your lazy pals would have to get off your asses and start working again in order to keep up.

And not just you. Look at Big Three automakers. Same deal. Lazy, fat, slow, stupid, from the top to the bottom - everyone focused on just getting what they can in the short run and who cares what kind of piece of shit product we're putting out. Then somehow along the way the evil motherfuckers on Wall Street got involved and became everyone's enabler, devoting all their energy and brainpower to breaking things up and parceling them out and selling them off in pieces and then putting them back together again, and it was all about taking all this great shit that our predecessors had built and "unlocking value" which really meant finding ways to leech out whatever bit of money they could get in the short run and let the future be damned. It was all just one big swindle, and the only kind of engineering that matters anymore is financial engineering.

http://www.fakesteve.net/2009/12/a-not-so-brief-chat-with-randall-stephenson-of-att.html

When lenders compete, you win.

2 Nov 2009, 1:33 AM

The message is pretty clear - competition, the pillar of capitalism, results in better products and services for consumers. When lenders compete, you win. While this is the slogan for one major mortgage lead generator, the methodology is common to the industry as a whole. And people believe that magic technology fosters competition, with the net benefit of better lending rates.

The reality is a little different. When I oversaw the operations of a mortgage marketplace, the competition was not in terms of the products offered, but rather, the price paid for getting a person's attention. Lenders would bid for leads and the lenders who paid the highest price received the most leads. Thus the incentives were counter to people's rational goals. The lenders with the highest margins were able to spend the most on customer acquisition, while lenders with more affordable products were unable to reach the same audience.

Google recently publicized their direct entry into this space. Prior to their entry they captured only a portion of the marketing dollars - with lead generators buying keyword ads on Google, funneling the traffic to their site, collecting lead information and selling to the highest paying mortgage providers. When the lead generators' spend on Google was lower than their revenue from the mortgage companies, they profited. A mercenary and highly unregulated bunch, the lead generators would go to great lengths to screw the consumer.

Google's product appears better in that rather than selling out the consumer for the highest price, they display a targeted list of options - clearly outlining the competing offers - letting the consumer decide which companies to contact for a quote. As is always the case, transparency leads to a better outcome for people.

Despite the numerous and simultaneous failures in the mortgage marketplace that have so deeply scarred the American economy, one upside that is often forgotten is the benefit of standardization of lending products. Prior to the development of the mortgage backed security market, mortgage contracts varied greatly in their structure and terms. And while they remain complex financial contracts, standardization means that a consumer is able to properly evaluate the bulk of the financial impact of their mortgage choice by simply examining a handful of parameters.

I have a home equity loan. I also have credit cards, savings accounts and brokerage accounts. The simplest account that I have, and the one that sees the most action, is the humble checking account. My checking account has a 36 page introductory preamble that outlines the terms and conditions. These terms are fully documented on a corner of my bank's web site, and change on a semi-monthly basis. No one reads these terms.

I spent my summer reading not only the terms of my checking account, but of all of my accounts and the accounts at other major banks in America. You'd be terrified to know what they actually contain. That is, if you could find them. The GAO found that 65% of banks do not make these documents available on the web, and 35% fail to produce them if you visit a branch.

And these terms matter.

Unlike most people's mental model of retail banking operations, banks do not make most of their money on the difference between the rates at which they lend versus the rate they offer for savings. American banks, quite distinctly from banks elsewhere in the world, make the bulk of their money from fees and charges. Invisible and often unavoidable consequences of little clauses in contracts that no one ever reads.

This stands in stark contrast to the message that we hear in bank marketing. Retail bank marketing is dominated by APR: Best rate savings! Lowest rate on credit cards! Yet the largest financial impact to the consumer is fees and charges.

Fees and charges that consumers have no hope of simply understanding.

Lead generation is rife across the financial product landscape. Some companies try to offer Google-like services for better helping consumers choose financial products - but these services fall into the trap of not taking into account the obscure and non-standardized terms that most impact financial well being. And as such no one believes the offers they see in sites like Mint.com. If people honestly believed they could "Save $2,000 by switching to Bank XYZ's credit card" then the conversion rates on these offers would be vastly better than the prevailing rates.

And so with all the technology that we have at our disposal, people are no better off. Banks have no incentive to increase transparency, lead generators have no incentive to provide real offers and immense brand apathy prevails resulting in short sighted decisions further driving down customer experience. The cycle continues.

Until it stops.



Chase, what matters?

23 Sep 2009, 2:36 PM

Last week I paid my bills. As part of the regular bill-paying process, I take any funds left over that are not required as cash over the coming weeks, and pay down my home equity facility. I pay my credit cards in full. The only rate I concern myself with is the 3.8% rate on my loan. There is nothing particularly unusual about this process.

A few days after I paid my bills my bank, JPM Chase, emailed me to tell me that my account was overdrawn. I logged into the web site and saw that they had put a bunch of payments through twice. Most importantly, they put my loan principal payment through twice. In their wisdom, they helped me correct this by withdrawing from my credit card to pay down my loan. My credit card has a purchase rate of 12% and a cash rate of 19%.

Follow me for a moment: They withdrew from an account charging 19% to pay down a loan at 3.8%. And along the way they charged an overdraft fee for the 'service.'

I have three simple requests for Chase:

  1. Reverse the double payment, restoring my checking account to its intended balance.
  2. Reverse the overdraft fee.
  3. Return the interest they are earning at 19% on the credit card account.

I was first routed to the online banking group as it was clear that the error originated within their domain. The timestamps on the transactions are identical, the transaction numbers are near sequential; there is a clear indication that my intent was to only pay once, but they processed the transactions twice. From there I was transferred to the credit card fiefdom who told me that they would correct the overdraft issues - but it would take four days.

Four days later I called to discover that no work order was placed and there was no note of my original call. I went through the same call center waltz, but instead was routed to the home equity group. The same group that asked me to forge a signature during the application process - but that is another story for another day. During this call I was told that the home equity group would take care of everything as they were the final destination of the funds. I made sure that the work order included the three key points listed above. I also took note of the work order number and the names and times of all the people I spoke with.

Meanwhile, mind you, I still have a zero checking balance and am unable to make other payments. I am loath to draw down the debt facilities at my disposal, as it would just make it even more complex to reverse these transactions. Friends have been helpful and luckily I have enough cash on hand.

Today I called the home equity group directly to check on the status of the work order. They gave me the same estimate as the last time I spoke with them: four to five days. At which point the interest would have accumulated to $11. Not a great deal in the scheme of things, but it is my goddamn $11. I'd be happy if they simply paid me the $11 and moved on - they have earned a good deal of revenue from me over the years, and this is clearly their error. But of course, they can't return me the $11 as they have no mechanism for doing so.

"You cannot dispute this transaction for this reason: 5102 - Bank Releated[sic] Fees / Charges - Not Eligible"

As of now it appears my only option to force the return of my overdraft fee and for me to receive any accrued interest is to take action in the New York small claims court. There is no way Chase will defend this - that would cost hundreds for legal approval alone. Of course, they make an order of magnitude more than that from my family in fees, charges, interchange and net interest margin each year. As soon as I get my $11 back, they will no longer hold any of my accounts.

Customer experience matters.



Bayesian Methods + MCMC

10 Jul 2009, 8:44 PM

Last night we had our fourth NY R Statistical Programming meetup. The topic was Bayesian Methods + MCMC. We had two presenters, Jake Hofman and Suresh Velagapundi, both of whom did an admirable job of presenting a very broad topic to an audience with diverse backgrounds. I want to use this post to bridge a gap between the background material and day to day utilization. It is aimed at readers who may have some experience with R, but aren't very familiar with the Bayesian Way. While it is a simple example, the steps involved extend to the issues that are faced in real world applications.

The source for this example can be downloaded using the internet!

We are going to step through Jake's coin flip example to get a sense of what is involved in doing Bayesian inference. There are a number of packages on the CRAN Bayesian Inference view that do all of what you will see below. I decided against using them for two reasons. First off, the coin flip example is a little too trivial for using many of the techniques that rely on multivariate parameter estimation to see any utility. But more importantly, I want to use the opportunity provided by a nice simple example to step through the underlying mechanics. My hope is that after reading through this you can have a look at the available packages and be a better judge of what they are used for and where one package may stand out over another. In the course of doing this write up I went through the MCMCpack package and it is a good exercise to compare how they implement the MCbinomialbeta() against the first half of this walk through. For the curious, the MCMCmetrop1R() function is far more advanced than the simple implementation of Metropolis-Hastings shown below, and it is a good exercise to understand their tuning parameters.

As a quick recap, the point of the exercise is to go from prior belief in a distribution (in this case we believe that the coin is fair) and use observed data to arrive at a posterior distribution using both the prior and the data. There are three things that we need to know to calculate the posterior distribution:

  1. The likelihood of seeing the new data given our estimate of the bias
  2. Our prior distribution
  3. The 'evidence', or the integral of the product of the likelihood and prior over all possible values of the bias

I won't step through the derivation of the likelihood, as this should be easy enough to derive from the binomial probability distribution function. In this case our likelihood, with N trials and h heads is:

likelihood <- function (N, h, theta) theta^h * (1 - theta)^(N-h)

Check that the likelihood function makes sense:

t <- (0:100) / 100

png ("figure1.png", width=800, height=600)
par (mfrow=c(2,2))
par (bty='n')
par (col='red')
plot (t, likelihood(100, 50, t), type='l', xlab='Theta Hat', ylab="Likelihood", main='Likelihood (t=0.5)')

Great, the maximum likelihood for 50 heads from 100 flips is a theta of 0.5. (See chart below).

Jake uses the Beta distribution as his prior as it has some neat analytic properties; namely that the posterior will be of the same distribution family as the prior. We call these types of priors conjugate priors.
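
Conjugacy is easy to check numerically before trusting it. The sketch below multiplies the likelihood by a Beta(2, 2) prior on a fine grid, normalizes by brute force, and compares the result with a Beta(a + h, b + N - h) density. The grid size and variable names are arbitrary choices for the illustration:

```r
a <- 2; b <- 2          # Beta prior parameters
N <- 100; h <- 70       # observed flips and heads

theta  <- seq (0, 1, length.out=10001)
unnorm <- theta^h * (1 - theta)^(N - h) * dbeta (theta, a, b)   # likelihood * prior

# Normalize numerically on the grid (trapezoid rule)
dt   <- theta[2] - theta[1]
area <- sum ((unnorm[-1] + unnorm[-length (unnorm)]) / 2) * dt
gridPosterior <- unnorm / area

# Conjugacy says this should simply be Beta(a + h, b + N - h)
maxDiff <- max (abs (gridPosterior - dbeta (theta, a + h, b + N - h)))
maxDiff     # should be very close to zero
```

The posterior stays in the Beta family; only its two parameters are updated by the counts of heads and tails.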

prior <- function (theta, a, b) dbeta (theta, a, b)		

a <- 2
b <- 2
plot (t, prior(t, a, b), type ='l', xlab='Theta Hat', ylab='Pr(theta|a,b)', main='Prior')

If we do the integration, we can arrive at the analytic form of the evidence and thus the posterior:

evidence  <- function (N, h, a, b) beta(h + a, N - h + b) / beta (a, b)
posterior <- function (theta, N, h, a, b) likelihood (N, h, theta) * prior(theta, a, b) / evidence (N, h, a, b)

plot (t, posterior(t, 100, 50, a, b), type ='l', xlab='Theta Hat', ylab='Pr(theta|Observations,a,b)', main='Posterior (t=0.5)')
plot (t, posterior(t, 100, 70, a, b), type ='l', xlab='Theta Hat', ylab='Pr(theta|Observations,a,b)', main='Posterior (t=0.7)')

dev.off()

Let's say we don't know the analytic form for the evidence (the denominator in Bayes' rule), and replace it with a numerical integration over all possible values of theta from 0 to 1:

evidenceN  <- function (N, h, a, b) integrate (function(t) likelihood (N,h,t) * prior (t,a,b), 0, 1)$value
posteriorN <- function (theta, N, h, a, b) likelihood (N, h, theta) * prior(theta, a, b) / evidenceN (N, h, a, b)

N <- 100 	# Trials
h <- 70		# Heads

png ("figure2.png", width=800, height=600)
par (mfrow=c(1,1))
analytic  <- posterior (t, N, h, a, b)
estimated <- posteriorN(t, N, h, a, b)
plot (t, analytic, type ='l', xlab='Theta Hat', ylab='Pr(theta|Observations,a,b)', main='Posterior (t=0.7)')
lines (t, estimated, type ='l', xlab='Theta Hat', ylab='Pr(theta|Observations,a,b)', col='blue', lty=2)

err <- (analytic - estimated)^2
lines (t, (err - min(err)) / diff(range(err)) * max(analytic), lty=3, col='black')
legend (0,2, c('Analytic','Estimated','Error^2 (scaled)'), col=c('red','blue','black'), lty=c(1,2,3), bty='n', text.col='black')
dev.off()

While things are pretty simple with this toy example, Jake made the point that the real difficulty with Bayesian inference is twofold:
  1. Integrating across theta to find the evidence (denominator)
  2. Once you have the posterior, integrating it to calculate summary statistics (mean, variance, etc.)

In the above example we used the integrate() function to apply adaptive quadrature to find the evidence. We could use this method for the second problem too, but let's not. Instead, let us use MCMC - which is, at its core, a way to draw samples from a distribution that is otherwise hard to sample from.
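
For reference, this is roughly what the quadrature route to summary statistics could look like. The sketch leans on the conjugate form of the posterior, Beta(a + h, b + N - h), so that there is a closed form to compare against. The tighter rel.tol is my own choice, made because the default tolerance is loose relative to a variance this small:

```r
a <- 2; b <- 2; N <- 100; h <- 70
aPost <- a + h            # posterior Beta parameters
bPost <- b + N - h

post <- function (theta) dbeta (theta, aPost, bPost)

# Posterior mean and variance by adaptive quadrature
postMean <- integrate (function (th) th * post (th), 0, 1, rel.tol=1e-8)$value
postVar  <- integrate (function (th) (th - postMean)^2 * post (th), 0, 1, rel.tol=1e-8)$value

# Closed forms for a Beta distribution, for comparison
c (quadrature=postMean, analytic=aPost / (aPost + bPost))
c (quadrature=postVar,  analytic=aPost * bPost / ((aPost + bPost)^2 * (aPost + bPost + 1)))
```

With one parameter this is painless. The trouble starts when theta has many dimensions and the integrals no longer fit on a grid.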

Given that this example is rather trivial, with just one parameter in question (theta), I won't step through the implementations of vanilla Monte Carlo methods (uniform, importance & rejection sampling). These implementations are pretty much straightforward from Jake's presentation.

I will, however, implement a simple Metropolis-Hastings MCMC sampler using a simple and symmetric Gaussian proposal density (q in Jake's notes).
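
For anyone who wants to see one of them anyway, here is a minimal rejection sampling sketch. It is not taken from Jake's slides. It assumes a uniform proposal on [0, 1] with the envelope scaled to the posterior's peak, and works on the log scale to avoid underflow:

```r
set.seed (1)
a <- 2; b <- 2; N <- 100; h <- 70

# Unnormalized log posterior: log likelihood + log prior, no evidence needed
logPost <- function (theta) h * log (theta) + (N - h) * log (1 - theta) + dbeta (theta, a, b, log=TRUE)

# Location of the peak of a Beta(a + h, b + N - h) density
peak <- (a + h - 1) / (a + b + N - 2)

n         <- 20000
proposals <- runif (n)
# Keep a proposal with probability posterior(proposal) / posterior(peak)
accepted  <- proposals[log (runif (n)) < logPost (proposals) - logPost (peak)]

length (accepted) / n                                            # acceptance rate
c (estimated=mean (accepted), analytic=(a + h) / (a + b + N))    # posterior mean
```

Most proposals are thrown away, which is exactly why this approach becomes hopeless as the number of parameters grows.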

MHstep <- function (pdf, prevCandidate)
{
	# Effectively we are taking a random walk.
	newCandidate <- prevCandidate + rnorm (1, mean=0, sd=0.1)

	# NB: Because we are using the normal distribution
	# as our proposal density, which is symmetrical,
	# we cancel out the q terms on the numerator and 
	# denominator, as q(x|y) = q(y|x)

	a <- pdf(newCandidate) / pdf(prevCandidate)		

	# Draw a uniform random number from 0 to 1	
	u <- runif(1)									

	if (a > u)
	{
		# Accept the candidate with probability min(1, a)
		return (newCandidate)
	}

	# Else, stick with our previous candidate
	return (prevCandidate)
}

Let's use our numerical approximation to the actual posterior function as the PDF we want to draw samples from:

posteriorPDF <- function (t) posteriorN (t, N, h, a, b)

Time to go on a random walk down coin flip street.

steps <- 1000
samples <- matrix(NA, steps)

samples[1] <- 0.5		# initial guess
for (i in 2:steps)
{
	samples[i] <- MHstep (posteriorPDF, samples[i-1])
}

And how did we do?

png ("figure3.png", width=800, height=600)
par(bty='n')
par(col='red')
plot(cumsum(samples)/1:steps, type='l', xlab='Step', ylab='Estimated Mean', main='Drawing samples by Metropolis-Hastings')
dev.off()

Nice.
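
One thing worth noticing: Metropolis-Hastings only ever looks at the ratio of two densities, so the evidence cancels and the numerical integration hiding inside posteriorPDF was never strictly needed. Below is a variation that works directly with the unnormalized log posterior. The extra steps, the burn-in period and the name MHstepLog are my own additions for illustration:

```r
set.seed (42)
a <- 2; b <- 2; N <- 100; h <- 70

# Unnormalized log posterior; -Inf outside the support of theta
logPost <- function (theta) {
	if (theta <= 0 || theta >= 1) return (-Inf)
	h * log (theta) + (N - h) * log (1 - theta) + dbeta (theta, a, b, log=TRUE)
}

MHstepLog <- function (logPdf, prev, sd=0.1) {
	candidate <- prev + rnorm (1, mean=0, sd=sd)
	# Accept with probability min(1, ratio), compared on the log scale
	if (log (runif (1)) < logPdf (candidate) - logPdf (prev)) candidate else prev
}

steps   <- 5000
burnIn  <- 500
samples <- numeric (steps)
samples[1] <- 0.5		# initial guess
for (i in 2:steps) samples[i] <- MHstepLog (logPost, samples[i-1])

kept <- samples[-(1:burnIn)]
c (estimated=mean (kept), analytic=(a + h) / (a + b + N))
mean (diff (samples) != 0)	# rough acceptance rate
```

Working on the log scale also sidesteps the tiny numbers that theta^h * (1 - theta)^(N-h) produces once N gets large.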


predict.i2pi

19 Jun 2009, 2:57 AM

the basics

"If you are not embarrassed by the first version of your product, you've launched too late."

On Monday I released predict.i2pi.com, a statistical learning web service. Designed to deal with common classification and regression problems, it takes input data in the form of a CSV file and returns to the end user a set of predictive models. For example, if you have a list of store locations, local weather data, and store revenue, you could use the service to see if location and weather impact store revenue. predict.i2pi tries to determine whether predictions are possible by running your data against a growing number of user contributed statistical learning algorithms and finding the ones that work best with your data.

In planning this I went through a range of features, bells and whistles but have decided to strip it all back. This is the simplest thing I could build to support what I wanted. It takes a file, runs predictive algorithms against the file, and returns performance measures. Data and predictions.

data

The data provided is expected to be in the form of a number of observations, with one observation per row. Each column contains measurements for these observations. We are interested in predicting one or more of these measurements. For example:

|<------ Explanatory Variables ------>|   /----- Response Variables (denoted by *)
X1,    X2,    X3,    Name,  Date,        *Y
12.3,  13.4,  8.32,  Terry, 2008-10-12,  736.0
 9.3,  34.1,  1.21,   Josh, 2008-10-12,  NA     <-- NB: NA response variables
...    ...    ...    ...    ...          ...            will have predictions available
 8.7,  38.7,  8.17,   Jess, 2009-01-07,  1823.1         for subsequent download.

Data may include observations for which we do not know the response. These observations can be included, with the response left blank. Once satisfactory models are found, end users can download spreadsheets containing our best predictions for that data. On my todo list is adding confidence intervals to these values.

Once uploaded we try to best detect the following data types:

  • Numeric (floating point numbers)
  • Integers
  • Dates (YYYY-MM-DD works best)
  • String Factors (e.g., State or letter scores)
  • Text (longer text than factors, with analytic interpretation as language text instead of as factors)
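
A rough sketch of what this kind of detection could look like in R. It is illustrative only and not the code the service runs. The function name guessType, the Notes column and the 40 character cutoff separating text from factors are all invented for the example:

```r
csv <- "X1,X2,Name,Date,Notes,*Y
12.3,13,Terry,2008-10-12,Paid in full and asked about opening a second account,736.0
9.3,34,Josh,2008-10-12,Called twice about a duplicate payment on the account,NA
8.7,38,Jess,2009-01-07,Requested paper statements be switched off this month,1823.1"

# check.names=FALSE keeps the leading * on the response column name
d <- read.csv (text=csv, check.names=FALSE, stringsAsFactors=FALSE)

guessType <- function (x) {
	if (is.integer (x)) return ("integer")
	if (is.numeric (x)) return ("numeric")
	x <- x[!is.na (x)]
	if (all (grepl ("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x))) return ("date")
	if (mean (nchar (x)) > 40) return ("text")	# long strings read as language
	"factor"
}

types    <- sapply (d, guessType)
response <- names (d)[grepl ("^\\*", names (d))]

types
response
```

Anything this simple will guess wrong on messy data, so a real system would want a way for the uploader to override the detected types.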

learning

Internally, predict.i2pi performs a standard test / training protocol. Data is loaded and a random half of that data is used to train the learning algorithm. The remaining half is used to test how well the learned algorithm works against previously unseen data. Robust algorithms will do almost as well on the test as during training, while less robust approaches will lead to far poorer performance during testing. The system continues this process of picking a training sample, training and then testing as many times as possible in an allotted time. During each of these cycles, predictions are tested against the actual responses in the corresponding observation. Performance is then measured using the R-squared metric for regressions and simple classification accuracy for classification problems. The system supports user defined performance measures with the goal being to let those who supply data decide on which performance measure is best for their application. However, at the moment I'm concentrating on opening up the ability for users to upload their own learning algorithms.
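
The protocol itself fits in a few lines. The sketch below is a stand-in rather than the service's code: it uses lm on the built-in mtcars data, and a fixed number of cycles in place of a time budget. The names rSquared and oneCycle are hypothetical:

```r
set.seed (1)

# R-squared on held-out data: 1 - SSE / SST
rSquared <- function (actual, predicted) 1 - sum ((actual - predicted)^2) / sum ((actual - mean (actual))^2)

# One cycle: train on a random half, score on the other half
oneCycle <- function (formula, data, response) {
	trainIdx <- sample (nrow (data), floor (nrow (data) / 2))
	model    <- lm (formula, data=data[trainIdx, ])
	test     <- data[-trainIdx, ]
	rSquared (test[[response]], predict (model, test))
}

# A fixed number of cycles stands in for "as many as fit in the allotted time"
scores <- replicate (50, oneCycle (mpg ~ wt + hp, mtcars, "mpg"))
summary (scores)
```

The spread of the scores across cycles is as informative as their average, since a model that swings wildly between splits is not one to trust.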

Currently learning algorithms are specified in small snippets of R code that can be dynamically loaded into the main R subsystem that is responsible for coordinating training cycles. See, for example, rpart.R which links in a recursive partitioning algorithm from the rpart library.

#requires(rpart)

myModel <- function (formula, data) rpart(formula, data, na.action=na.exclude)
myPredict <- function (model, data)
{
	p <- predict (model, data)
	as.numeric(apply(p, 1, function(r) order(-r)[1]))
}

All learning algorithms must contain two function definitions: myModel and myPredict. myModel takes a model formula and data, returning a model object that can be used to make predictions against new data. myPredict takes two parameters, the model object returned by myModel and a set of data that may not have been seen during training. We call myModel with one randomly chosen half of the data for training. For testing, we provide myPredict with the model object generated from the training set, but provide it with the as yet unseen testing portion of the data.
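To make the two-function contract concrete, here is a hypothetical second snippet in the same shape as rpart.R, this time wrapping lm() from base R for a regression problem. The calling code at the bottom only imitates what the coordinating system might do, using the built-in mtcars data as a stand-in.

```r
# Hypothetical snippet following the myModel / myPredict contract.
myModel <- function (formula, data) lm(formula, data = data, na.action = na.exclude)
myPredict <- function (model, data)
{
	as.numeric(predict(model, newdata = data))
}

# Imitation of the calling pattern: fit on one random half, predict on the other.
set.seed(42)
idx   <- sample(nrow(mtcars), nrow(mtcars) / 2)
model <- myModel(mpg ~ wt + hp, mtcars[idx, ])
preds <- myPredict(model, mtcars[-idx, ])
```

Because the caller only ever touches myModel and myPredict, swapping one learning algorithm for another means swapping one small file.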

Users are also able to define transforms that take a matrix of explanatory variables and return a new matrix with the same number of observation rows but with one or more of the explanatory columns transformed into a new space. For example, one could take a 100 column matrix and apply some form of dimensionality reduction that returns a new matrix, with the same number of rows, but only 10 columns. The transform function is not shown the response variable to ensure that no funny business occurs whereby the response is somehow embedded in the explanatory variables. These same transform functions can then be applied to response variables alone, allowing the system, for example, to construct a model log(Y) ~ PCA(X1, X2, ... , Xn).

The following example shows a transform function that replaces any columns that are at least 50% NA with an indicator variable:
myTransform <- function (x)
{
	if (is.null(dim(x))) return (x);
	if (ncol(x) == 1) return (x);

	bad_idx <-  apply(x,2,function(c) sum(is.na(c)) / length(c)) >= 0.5
	if (any(bad_idx))
	{
		y <- x
		y[,bad_idx] <- is.na(y[,bad_idx])*1	# replace NA's with an indicator variable
		return (y)
	} else
	{
		return (x)
	}
}
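
The dimensionality reduction case mentioned earlier (100 columns in, 10 columns out) might look something like the sketch below. It is hypothetical: it assumes a numeric matrix with no missing values, the parameter k is my own addition, and it says nothing about how the real system keeps a projection consistent between training and testing data.

```r
# Hypothetical transform: project numeric columns onto the leading principal components.
# Assumes x is a numeric matrix with no missing values.
myTransform <- function (x, k = 10)
{
	if (is.null(dim(x))) return (x)
	if (ncol(x) <= k) return (x)

	pc <- prcomp(x, center = TRUE, scale. = FALSE)
	k  <- min(k, ncol(pc$x))		# cannot keep more components than exist
	pc$x[, seq_len(k), drop = FALSE]	# same rows, only k columns
}

set.seed(1)
x <- matrix(rnorm(200 * 100), nrow = 200, ncol = 100)	# synthetic input
z <- myTransform(x)
dim(z)
```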

coming soon

As for uploading code, the best way to do this right now is via email. I hand rolled my own sandbox environment to prevent 3rd party code from hijacking my system - but as with any security code that I write myself, I am loath to test it in the real world until I've had a good chance to be as close to 100% sure as possible that it is safe. In reality, I'll probably stop trying to reinvent the wheel, and use a pre-existing solution.

Given long term plans, and issues around data privacy, I didn't want to set up a system whereby data leaves the system for testing on external machines. While it works well for very large datasets, e.g., the Netflix Prize, the potential of overfitting is higher for smaller datasets when random portions of that data are often reused in validation cycles. That said, developing new learning algorithms (or plugging in ones from existing CRAN libraries) is fairly straightforward so you should be able to develop locally and upload.

There already is an API, but it is not at all documented. This is my next priority. Currently I'm running into some issues with using RCurl to interact with my API - issues which would not exist in any other language - but I really would like to get the R API out of the door before I open up wider access. In short, there are 3 methods which are currently used by the web site (inspect my horrid JavaScript code to see them). These allow you to upload data, make edits to meta data and receive predictions. Each prediction includes links to the R source that was involved in performing the learning + any transforms used. The prediction meta data also includes the quartiles for the measure after a number of test/train cycles, plus a sample of 250 predictions vs. actuals.

It has been suggested that I also include a small downloadable example snippet for each file to allow developers to get a better flavor of what they are working with. For larger files, I think this is a perfectly swell idea. In fact, I really do want to hear more of your suggestions. I took a knife to a slew of functionality before I released this, but I have code ready to go. But I want to wait for real life suggestions to see what I should be working on next.

The original plans for this project included complex routines for doing unsupervised schema detection and meta modelling to help identify which algorithms might work best with particular shapes of data. Also I had built a framework for combining multiple learning algorithms in a boosting type environment. All of these features remain possible and will hopefully be released in the not too distant future.

One of the big issues I struggled with in deciding to release this is the nature of my target audience. At the moment there is an impedance mismatch between the sophistication required to understand what the system does and the utility of the system to sophisticated users. To those with any experience in predictive analytics, everything here should be your bread and butter - and most likely far simpler than what you do on a daily basis. However there is a large audience of people in the information business who currently make do with the 'Add Trendline' option in MS Excel. To this audience, this service would be greatly valued, but in its current form is probably a bit too much. This deeply embarrasses me, but I'm not going to let that stop me from publicizing what I'm up to. There is a plan, and it exists in increments.

For the lay information worker, there are hurdles both in providing understandable explanations of how the learning algorithms work and were applied, and in adapting my format to the natural shape of the data that they often work with - not to mention data cleaning. As an example, time series models pose an interesting problem. They break the model of one independent observation per row, and it is difficult to come up with a way of training and testing them that is consistent with my current implementation. Even if I were to develop special case handling for time series data, it can be difficult for a computer to find appropriate periods over which to lag variables. At this point I think the simplest route is to let people include previous observations that they deem important, at lags that they think might be interesting, with each row. That way each row can be treated independently from the others and I don't have to build a lot of machinery to guess appropriate treatment of temporal dependency.
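Preparing such rows before upload could be as simple as the sketch below. The helper name addLags and the sales figures are made up for illustration; the only point is that each row ends up carrying its own history as ordinary columns.

```r
# Hypothetical helper: add lagged copies of a column so each row carries its own history.
addLags <- function (data, column, lags)
{
	n <- nrow(data)
	stopifnot(all(lags >= 1), all(lags < n))
	for (k in lags)
	{
		# Shift the series down by k rows; the first k rows have no history.
		data[[paste0(column, "_lag", k)]] <- c(rep(NA, k), head(data[[column]], n - k))
	}
	data
}

sales  <- data.frame(month = 1:6, units = c(10, 12, 15, 11, 18, 20))	# made-up values
lagged <- addLags(sales, "units", c(1, 3))
lagged
```

The leading rows of each lag column are NA because there is no earlier observation to copy, which fits the convention of leaving unknown values blank.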

Likewise there are other problems whose natural representations don't map neatly to the one row = one observation representation - think of collaborative filtering or graph based problems. I am quite keen on keeping the one row representation as it affords me some nice system scaling properties without becoming too domain specific. That said, there is nothing stopping me from building front-ends that take data from these problem domains in their natural representation and map them to one that works better for my system.

When it comes to explaining the models, well. That is another story.



Engineering vs. Architecture

16 Jun 2009, 1:18 PM

A few months back I caught up with a fellow Aussie in New York, whom I first met ten years ago. It is amazing how social network dynamics change as an expat. He is currently teaching Architecture at Columbia while completing his doctorate in the nature of representation in architecture. It was the sort of long conversation that lingers for a few months before finding a resting place in your mind. On the surface, we spent quite some time discussing the work at the Spatial Information Design Lab, as this most closely bridged the gap between our worlds. The deeper conversation was that of representation.

Engineers build things. They use sciences to make sure that the things they build don't fall over. Architects design things - they take ideas of the world and represent them. Their audience is both the client and the engineering and construction teams. Different representations serve different purposes. Engineering: Representation to World. Architecture: World to Representation.

Financial engineers take what they know about how companies work and build new things to serve other companies. Economists take the real world and make model representations of reality. However there is a void in economics, between the macro and the micro, in the domain of the company. Likewise, there is a void in financial engineering. Financial engineering is currently dominated by time-series analysis. I posit a weak form of the Black Swan theorem - namely that we currently don't know enough about the past to even pretend to predict the future. We have financial historians, in the form of data providers, but we don't have the architects to take this repository of past knowledge and build representations of how companies operate. Accountants across the globe set and implement the rules of this complex system, but we don't understand its dynamics.

Can financial engineers shed the instrumentation of time series analysis and take on this role? Or will it come from a new group - the type of people who build Googles?

Or will their buildings leak?


Image by ken mccown on Flickr.

Predicting social network features from profile picture features

13 May 2009, 10:16 AM

The interwebs has made it really easy for those who are looking for data to find it. Or at least a close approximation to it. Those who have the tools to scrape the web and reverse out interesting data are typically part software developers, statisticians and hackers. Mix these three together and one is genetically predisposed to collect as much data as possible. But there must come a point when the collection stops and the inference begins. Inference is difficult, in that it requires making statements about the mushy world, whereas coding systems to collect data deals with deterministic computers. It is easy to fall into the trap of simply collecting data to avoid dealing with mush.

In my previous job I was overseeing a project which involved scraping a large publicly traded e-commerce site to find interesting information to support investment decisions. The problem was that everyone on the street was also scraping the same site. Our code was top notch and having done this before we were able to avoid common pitfalls and our system was gathering oodles of potentially useful data. Faced with all this data one of my developers had a tough time working out where to start the inference process. Sure, there are obvious places, like predicting revenue from site activity, but they are obvious. So obvious that even the sell-side researchers were doing it. Faced with the task of finding something less obvious, my advice to my colleague was to pick a pair of columns at random and come up with a model for the dynamics of the relationship. In such an exercise one picks the boring stuff, like transaction numbers instead of transaction values - and begins decoding from there. The goal is to look at the data sideways and see if anything interesting pops up.

Most often this fails. But it is a good way of breaking out of the data rut. When I do this exercise, I typically find myself desperate for one other set of observations to help explain what I am seeing. While the result may be boring, it is not failure as it gives you a direction with which to approach the data.

Currently I have a few projects that involve social networks in one shape or another. While my clients are generally looking for the standard orthogonal projection of the data, I can't resist the urge to look at things sideways.

A client was walking through the important data that they collect about social network activity, but when I talked with their developers they mentioned in passing that they also collected profile pictures. Not for analysis, but for another part of their suite. Pictures. Cool. Thems be data. Excitedly I professed, as if I actually knew what I was talking about, that there is probably heaps of juicy stuff inside profile picture data. Intrigued at my own confidence I decided to tackle this by scraping 250,000 profile pictures from MySpace and grabbing a few key stats about each account. The first thing I wanted to examine was whether profile pictures in any way informed the number of friends that a user had. They do.

As I have other plans for this data, I didn't scrape MySpace with the complete intention of doing this project, thus only 19,214 of the images have associated friend lists. But this was enough to get started.

First off I wrote a short C program to calculate 32 features from each image. These features are pretty typical image processing functions, like size, average color levels, number of colors, smoothness, symmetry and a few key points from the luminosity histogram. MySpace pictures tend to be a mix of faces, icons and general photos - to roughly help identify faces (without committing to facial specific measures) I also calculate a subset of these values for the central portion of the image, including recording the location of the vertical axis of greatest symmetry. Most of the values have been normalized to some image specific reference to increase variation and limit covariance, for example the average R,G,B values are expressed as a % of the image's luminosity.
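As an illustration of that last normalization only, here is a rough sketch in R rather than C. The original feature code is not shown in this post, so everything here is an assumption: the function name colorFeatures, the height x width x 3 array layout with values in [0, 1], and the Rec. 601 luma weights used for luminosity.

```r
# Hypothetical sketch; the original features were computed by a C program not shown here.
# img is assumed to be a height x width x 3 array of R, G, B values in [0, 1].
colorFeatures <- function (img)
{
	r <- mean(img[, , 1])
	g <- mean(img[, , 2])
	b <- mean(img[, , 3])
	lum <- 0.299 * r + 0.587 * g + 0.114 * b	# Rec. 601 luma weights (an assumption)
	if (lum == 0) return (c(r = 0, g = 0, b = 0))	# all-black image: nothing to normalize
	c(r = r, g = g, b = b) / lum			# channel averages relative to luminosity
}

gray    <- array(0.5, dim = c(4, 4, 3))				# flat mid-gray test image
reddish <- array(rep(c(0.8, 0.2, 0.2), each = 16), dim = c(4, 4, 3))
colorFeatures(gray)
colorFeatures(reddish)
```

Dividing by luminosity means a bright photo and a dark photo with the same color balance produce similar values, which is the point of normalizing against an image specific reference.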

To get a sense of the variation of these features, I constructed an image based on the first two principal components of the feature space. At this point some kind folks on Twitter (starting with Mark Reid) pointed me towards t-distributed stochastic neighbor embedding. Someone mentioned that I could simply forgo my feature calculation code and use tSNE on the pixel data, which sounded exciting, but after reading the paper I decided against it. In the paper the authors do demonstrate this technique, but their first step after reading in the pixel data is to perform PCA to reduce the dimensionality of the problem. And their image set was much more well behaved than the images I was working with. Maybe I'll take another look at tSNE in this context when I next have some free time.
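For reference, getting the first two principal components of a feature matrix in R takes only a couple of lines. The matrix below is synthetic noise standing in for the 32 image features, and scaling each feature to unit variance is an assumption on my part rather than a record of what was actually done.

```r
# Synthetic stand-in for the 32 image features; not the real data.
set.seed(1)
features <- matrix(rnorm(500 * 32), nrow = 500, ncol = 32)

pc     <- prcomp(features, center = TRUE, scale. = TRUE)
coords <- pc$x[, 1:2]	# position of each image on the first two components
dim(coords)
```

Each row of coords gives an x, y position at which the corresponding picture could be drawn.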

Visualization aside, the next step was fairly simple. I divided my data into a 75% training set and reserved 25% for testing, attempting to predict log(# of friends) by my image features. The linear model was pretty poor, but not terrible. In sample I got an R^2 of 0.17, out of sample it was far worse. Using an SVM, I limited the training to classification rather than regression - trying to classify in groups by quantiles of log(# of friends). For a simple binary classification (more or fewer friends than the median sampled MySpace user), the accuracy was 70% - with errors evenly distributed across the two classes. I also tried 3 and 4 classes, and the lift was similar.
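The shape of that experiment can be sketched with base R alone. Note the swap: this uses a plain linear model throughout and thresholds its predictions at the median, instead of an SVM, so that the example needs no extra packages. The data is synthetic, with two made-up feature columns f1 and f2, so the numbers it produces say nothing about the MySpace results.

```r
set.seed(1)
n <- 1000
d <- data.frame(f1 = rnorm(n), f2 = rnorm(n))			# synthetic "image features"
d$friends <- round(exp(3 + 0.5 * d$f1 + rnorm(n))) + 1		# synthetic friend counts, always >= 1

train <- sample(n, floor(0.75 * n))	# 75% for training, 25% held out
fit   <- lm(log(friends) ~ f1 + f2, data = d[train, ])

r2 <- function (actual, predicted)
{
	1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}
inSample  <- r2(log(d$friends[train]),  predict(fit, d[train, ]))
outSample <- r2(log(d$friends[-train]), predict(fit, d[-train, ]))

# Binary classes: more or fewer friends than the training median.
cutoff         <- median(log(d$friends[train]))
actualClass    <- log(d$friends[-train]) > cutoff
predictedClass <- predict(fit, d[-train, ]) > cutoff
accuracy       <- mean(actualClass == predictedClass)

c(inSample = inSample, outSample = outSample, accuracy = accuracy)
```

Taking the median from the training rows only keeps information about the held-out rows from leaking into the class boundary.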

To visualize this I performed the regression using an SVM and, as expected by the results of the classification, got a decent R^2 (0.25) on the out of sample test set. To get a better sense of the outliers, I produced the following visualization. Note in this visualization I have sacrificed some positional accuracy by enforcing a constraint that no images may overlap.

I also used a similar approach to test whether I could predict other, more interesting, network features like measures of centrality, and my initial results are positive.

At this point, if I were to run with it, I'd like to make some assessments as to the underlying process that relates social network features to image features. Until I have more time, my current hypothesis is Boobs.


For those of you with more time on your hands, I have packaged up some datasets. Grab them here.

