Selling customer data

16 Dec 2009, 20:40 PM

As someone who has spent most of their working career selling data to advertisers, I'm surprised by the number of businesses that are predicated on the model of selling data to advertisers. If you have a great widget, it is easy to get it in the hands of millions of people, especially if you are giving it away for free. As programmers and scientists we deify data. What we don't do is understand advertising. Sure we understand that advertising is about selling stuff, but we don't seem to get that the advertising industry exists to sell ways of selling stuff.

Whether an external agency or an internal group, advertising professionals have to convince others that they are adding value. And if your model is simply to sell data to advertisers, you have to convince them that your data is worth at least as much as what you are selling it for. Your data needs to be useful. Smart techniques are popping up for doing interesting things with large amounts of data. But interesting isn't always useful. If you think your customer data is useful, you should use your data to make your product better. Data that is valued by your users for the richer experience it provides is likely to be valued by others. If you can't improve your customers' experience with their own data, your data is worth nothing.


Comments (2)

12 Dec 2009, 05:12 AM

While I'm ranting, let me ask you something, Randall. At the risk of sounding like Glenn Beck Jr. - what the fuck has gone wrong with our country? Used to be, we were innovators. We were leaders. We were builders. We were engineers. We were the best and brightest. We were the kind of guys who, if they were running the biggest mobile network in the U.S., would say it's not enough to be the biggest, we also want to be the best, and once they got to be the best, they'd say, How can we get even better? What can we do to be the best in the whole fucking world? What can we do that would blow people's fucking minds? They wouldn't have sat around wondering about ways to fuck over people who loved their product. But then something happened. Guys like you took over the phone company and all you cared about was milking profit and paying off assholes in Congress to fuck over anyone who came along with a better idea, because even though it might be great for consumers it would mean you and your lazy pals would have to get off your asses and start working again in order to keep up.

And not just you. Look at Big Three automakers. Same deal. Lazy, fat, slow, stupid, from the top to the bottom - everyone focused on just getting what they can in the short run and who cares what kind of piece of shit product we're putting out. Then somehow along the way the evil motherfuckers on Wall Street got involved and became everyone's enabler, devoting all their energy and brainpower to breaking things up and parceling them out and selling them off in pieces and then putting them back together again, and it was all about taking all this great shit that our predecessors had built and "unlocking value" which really meant finding ways to leech out whatever bit of money they could get in the short run and let the future be damned. It was all just one big swindle, and the only kind of engineering that matters anymore is financial engineering.

http://www.fakesteve.net/2009/12/a-not-so-brief-chat-with-randall-stephenson-of-att.html
Comments (2)

When lenders compete, you win.

2 Nov 2009, 06:33 AM

The message is pretty clear - competition, the pillar of capitalism, results in better products and services for consumers. When lenders compete, you win. While this is the slogan for one major mortgage lead generator, the methodology is common to the industry as a whole. And people believe that magic technology fosters competition, with the net benefit of better lending rates.

The reality is a little different. When I oversaw the operations of a mortgage marketplace, the competition was not in terms of the products offered, but rather, the price paid for getting a person's attention. Lenders would bid for leads and the lenders who paid the highest price received the most leads. Thus the incentives were counter to people's rational goals. The lenders with the highest margins were able to spend the most on customer acquisition, while lenders with more affordable products were unable to reach the same audience.

Google recently publicized their direct entry into this space. Prior to their entry they captured only a portion of the marketing dollars - with lead generators buying keyword ads on Google, funneling the traffic to their site, collecting lead information and selling to the highest paying mortgage providers. When the lead generators' spend on Google was lower than their revenue from the mortgage companies, they profited. A mercenary and highly unregulated bunch, the lead generators would go to great lengths to screw the consumer.

Google's product appears better in that rather than selling out the consumer for the highest price, they display a targeted list of options - clearly outlining the competing offers - letting the consumer decide which companies to contact for a quote. As is always the case, transparency leads to a better outcome for people.

Despite the numerous and simultaneous failures in the mortgage marketplace that have so deeply scarred the American economy, one upside that is often forgotten is the benefit of standardization of lending products. Prior to the development of the mortgage backed security market, mortgage contracts varied greatly in their structure and terms. And while they remain complex financial contracts, standardization means that a consumer is able to properly evaluate the bulk of the financial impact of their mortgage choice by simply examining a handful of parameters.

I have a home equity loan. I also have credit cards, savings accounts and brokerage accounts. The simplest account that I have, and the one that sees the most action, is the humble checking account. My checking account has a 36 page introductory preamble that outlines the terms and conditions. These terms are fully documented on a corner of my bank's web site, and change on a semi-monthly basis. No one reads these terms.

I spent my summer reading not only the terms of my checking account, but of all of my accounts and the accounts at other major banks in America. You'd be terrified to know what they actually contain. That is, if you could find them. The GAO found that 65% of banks do not make these documents available on the web, and 35% fail to produce them if you visit a branch.

And these terms matter.

Unlike most people's mental model of retail banking operations, banks do not make most of their money on the difference between the rates at which they lend versus the rate they offer for savings. American banks, quite distinctly from banks elsewhere in the world, make the bulk of their money from fees and charges. Invisible and often unavoidable consequences of little clauses in contracts that no one ever reads.

This stands in stark contrast to the message that we hear in bank marketing. Retail bank marketing is dominated by APR: Best rate savings! Lowest rate on credit cards! Yet the largest financial impact to the consumer is fees and charges.

Fees and charges that consumers have no hope of simply understanding.

Lead generation is rife across the financial product landscape. Some companies try to offer Google-like services for better helping consumers choose financial products - but these services fall into the trap of not taking into account the obscure and non-standardized terms that most impact financial well-being. And as such no one believes the offers they see on sites like Mint.com. If people honestly believed they could "Save $2,000 by switching to Bank XYZ's credit card" then the conversion rates on these offers would be vastly better than the prevailing rates.

And so with all the technology that we have at our disposal, people are no better off. Banks have no incentive to increase transparency, lead generators have no incentive to provide real offers, and immense brand apathy prevails, resulting in short-sighted decisions that further drive down customer experience. The cycle continues.

Until it stops.


Comments (3)

Chase, what matters?

23 Sep 2009, 18:36 PM

Last week I paid my bills. As part of the regular bill-paying process, I take any funds left over that are not required as cash over the coming weeks, and pay down my home equity facility. I pay my credit cards in full. The only rate I concern myself with is the 3.8% rate on my loan. There is nothing particularly unusual about this process.

A few days after I paid my bills my bank, JPM Chase, emailed me to tell me that my account was overdrawn. I logged into the web site and saw that they had put a bunch of payments through twice. Most importantly, they put my loan principal payment through twice. In their wisdom, they helped me correct this by withdrawing from my credit card to pay down my loan. My credit card has a purchase rate of 12% and a cash rate of 19%.

Follow me for a moment: They withdrew from an account charging 19% to pay down a loan at 3.8%. And along the way they charged an overdraft fee for the 'service.'

I have three simple requests for Chase:

  1. Reverse the double payment, restoring my checking account to its intended balance.
  2. Reverse the overdraft fee.
  3. Return the interest they are earning at 19% on the credit card account.

I was first routed to the online banking group as it was clear that the error originated within their domain. The timestamps on the transactions are identical, the transaction numbers are near sequential; there is a clear indication that my intent was to only pay once, but they processed the transactions twice. From there I was transferred to the credit card fiefdom who told me that they would correct the overdraft issues - but it would take four days.

Four days later I called to discover that no work order was placed and there was no note of my original call. I went through the same call center waltz, but instead was routed to the home equity group. The same group that asked me to forge a signature during the application process - but that is another story for another day. During this call I was told that the home equity group would take care of everything as they were the final destination of the funds. I made sure that the work order included the three key points listed above. I also took note of the work order number and the names and times of all the people I spoke with.

Meanwhile, mind you, I still have a zero checking balance and am unable to make other payments. I am loath to draw down the debt facilities at my disposal, as it would just make it even more complex to reverse these transactions. Friends have been helpful and luckily I have enough cash on hand.

Today I called the home equity group directly to check on the status of the work order. They gave me the same estimate as the last time I spoke with them: four to five days. At which point the interest would have accumulated to $11. Not a great deal in the scheme of things, but it is my goddamn $11. I'd be happy if they simply paid me the $11 and moved on - they have earned a good deal of revenue from me over the years, and this is clearly their error. But of course, they can't return me the $11 as they have no mechanism for doing so.

"You cannot dispute this transaction for this reason: 5102 - Bank Releated[sic] Fees / Charges - Not Eligible"

As of now it appears my only option to force the return of my overdraft fee and for me to receive any accrued interest is to take action in the New York small claims court. There is no way Chase will defend this - that would cost hundreds for legal approval alone. Of course, they make an order of magnitude more than that from my family in fees, charges, interchange and net interest margin each year. As soon as I get my $11 back, they will no longer hold any of my accounts.

Customer experience matters.


Comments (4)

Bayesian Methods + MCMC

11 Jul 2009, 00:44 AM

Last night we had our fourth NY R Statistical Programming meetup. The topic was Bayesian Methods + MCMC. We had two presenters, Jake Hofman and Suresh Velagapundi, both of whom did an admirable job of presenting a very broad topic to an audience with diverse backgrounds. I want to use this post to bridge a gap between the background material and day to day utilization. This is catered towards the audience who may have some experience with R, but aren't very familiar with the Bayesian Way. While it is a simple example, the steps involved extend on to the issues that are faced in real world applications.

The source for this example can be downloaded using the internet!

We are going to step through Jake's coin flip example to get a sense of what is involved in doing Bayesian inference. There are a number of packages on the CRAN Bayesian Inference view that do all of what you will see below. I decided against using them for two reasons. First off, the coin flip example is a little too trivial for using many of the techniques that rely on multivariate parameter estimation to see any utility. But more importantly, I want to use the opportunity provided by a nice simple example to step through the underlying mechanics. My hope is that after reading through this you can have a look at the available packages and be a better judge of what they are used for and where one package may stand out over another. In the course of doing this write up I went through the MCMCpack package and it is a good exercise to compare how they implement the MCbinomialbeta() against the first half of this walk through. For the curious, the MCMCmetrop1R() function is far more advanced than the simple implementation of Metropolis-Hastings shown below, and it is a good exercise to understand their tuning parameters.

As a quick recap, the point of the exercise is to go from prior belief in a distribution (in this case we believe that the coin is fair) and use observed data to arrive at a posterior distribution using both the prior and the data. There are three things that we need to know to calculate the posterior distribution:

  1. The likelihood of seeing the new data given our estimate of the bias
  2. Our prior distribution
  3. The 'evidence', or the integral of the likelihood times the prior over all possible estimates

I won't step through the derivation of the likelihood, as this should be easy enough to derive from the binomial probability mass function. In this case our likelihood, with N trials and h heads (dropping the constant binomial coefficient, which cancels in the posterior), is:

likelihood <- function (N, h, theta) theta^h * (1 - theta)^(N-h)

Check that the likelihood function makes sense:

t <- (0:100) / 100

png ("figure1.png", width=800, height=600)
par (mfrow=c(2,2))
par (bty='n')
par (col='red')
plot (t, likelihood(100, 50, t), type='l', xlab='Theta Hat', ylab="Likelihood", main='Likelihood (t=0.5)')

Great, the maximum likelihood for 50 heads from 100 flips is a theta of 0.5. (See chart below).
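
The same check can be done numerically rather than by eye. The following is a minimal sketch, not part of the original walk through: it assumes a log-scale version of the likelihood (the names logLikelihood and mle are made up for this illustration) and uses optimize() to find the maximizing theta, which should land close to h / N.

```r
# Log-likelihood of h heads in N flips for a coin with bias theta.
# Working on the log scale avoids very small numbers.
logLikelihood <- function (N, h, theta) h * log(theta) + (N - h) * log(1 - theta)

# Numerically maximise over theta in (0, 1); the result should be near h / N.
mle <- function (N, h)
{
    optimize(function (th) logLikelihood(N, h, th),
             interval = c(0, 1), maximum = TRUE)$maximum
}

mle50 <- mle(100, 50)
mle70 <- mle(100, 70)
print(c(mle50, mle70))
```

The log transform does not move the maximum, so the answer applies to the likelihood defined above as well.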

Jake uses the Beta distribution as his prior as it has some neat analytic properties; namely that the posterior will be of the same distribution family as the prior. We call these types of priors conjugate priors.
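
To make the conjugacy claim concrete: with a Beta(a, b) prior and h heads in N flips, the posterior should be a Beta(a + h, b + N - h) density. The sketch below is an illustration under assumed values (70 heads in 100 flips, a = b = 2) and checks the claim on a grid, normalizing likelihood times prior with a simple Riemann sum instead of the analytic evidence.

```r
N <- 100; h <- 70      # assumed data: 70 heads in 100 flips
a <- 2;   b <- 2       # Beta(2, 2) prior

dtheta <- 0.001
theta  <- seq(0, 1, by = dtheta)

# Unnormalised posterior: likelihood times prior.
unnormalised <- theta^h * (1 - theta)^(N - h) * dbeta(theta, a, b)

# Normalise with a simple Riemann sum over the grid.
gridPosterior <- unnormalised / (sum(unnormalised) * dtheta)

# Conjugacy says this should match a Beta(a + h, b + N - h) density.
conjugatePosterior <- dbeta(theta, a + h, b + N - h)

maxDiff <- max(abs(gridPosterior - conjugatePosterior))
print(maxDiff)
```

If the two curves agree up to grid error, the posterior is in the same family as the prior, which is all that "conjugate" means here.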

prior <- function (theta, a, b) dbeta (theta, a, b)		

a <- 2
b <- 2
plot (t, prior(t, a, b), type ='l', xlab='Theta Hat', ylab='Pr(theta|a,b)', main='Prior')

If we do the integration, we can arrive at the analytic form of the evidence and thus the posterior:

evidence  <- function (N, h, a, b) beta(h + a, N - h + b) / beta (a, b)
posterior <- function (theta, N, h, a, b) likelihood (N, h, theta) * prior(theta, a, b) / evidence (N, h, a, b)

plot (t, posterior(t, 100, 50, a, b), type ='l', xlab='Theta Hat', ylab='Pr(theta|Observations,a,b)', main='Posterior (t=0.5)')
plot (t, posterior(t, 100, 70, a, b), type ='l', xlab='Theta Hat', ylab='Pr(theta|Observations,a,b)', main='Posterior (t=0.7)')

dev.off()

Let's say we don't know what the analytic form for the evidence (denominator in Bayes' rule) is, and replace it by a numerical integration over all possible values of theta from 0 to 1:

evidenceN  <- function (N, h, a, b) integrate (function(t) likelihood (N,h,t) * prior (t,a,b), 0, 1)$value
posteriorN <- function (theta, N, h, a, b) likelihood (N, h, theta) * prior(theta, a, b) / evidenceN (N, h, a, b)

N <- 100 	# Trials
h <- 70		# Heads

png ("figure2.png", width=800, height=600)
par (mfrow=c(1,1))
analytic  <- posterior (t, N, h, a, b)
estimated <- posteriorN(t, N, h, a, b)
plot (t, analytic, type ='l', xlab='Theta Hat', ylab='Pr(theta|Observations,a,b)', main='Posterior (t=0.7)')
lines (t, estimated, type ='l', xlab='Theta Hat', ylab='Pr(theta|Observations,a,b)', col='blue', lty=2)

err <- (analytic - estimated)^2
lines (t, (err - min(err)) / diff(range(err)) * max(analytic), lty=3, col='black')
legend (0,2, c('Analytic','Estimated','Error^2 (scaled)'), col=c('red','blue','black'), lty=c(1,2,3), bty='n', text.col='black')
dev.off()
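
One thing worth checking when integrating numerically: integrate() has an absolute tolerance (abs.tol) as well as a relative one, and the integrand here is tiny (on the order of 1e-27 at its peak), so the default settings may accept a fairly coarse estimate. The sketch below shows one possible workaround, which is an assumption of this example and not something from the talk: rescale the integrand on the log scale so that it peaks at 1, integrate, and then undo the scaling. The names logIntegrand, thetaMode and evidenceR are made up for this illustration.

```r
N <- 100; h <- 70; a <- 2; b <- 2

# Log of likelihood * prior, so the rescaling below is numerically safe.
logIntegrand <- function (t) h * log(t) + (N - h) * log(1 - t) + dbeta(t, a, b, log = TRUE)

# The integrand peaks at the mode of the Beta(h + a, N - h + b) posterior.
thetaMode <- (h + a - 1) / (N + a + b - 2)
logPeak   <- logIntegrand(thetaMode)

# Integrate the rescaled function (peak value 1), then undo the scaling.
scaled    <- integrate(function (t) exp(logIntegrand(t) - logPeak), 0, 1, rel.tol = 1e-8)$value
evidenceR <- scaled * exp(logPeak)

# Analytic evidence for comparison.
evidenceA <- beta(h + a, N - h + b) / beta(a, b)
relErr    <- abs(evidenceR / evidenceA - 1)
print(relErr)
```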

While things are pretty simple with this toy example, Jake made the point that the real difficulty with Bayesian inference is twofold:
  1. Integrating across theta to find the evidence (denominator)
  2. Once you have the posterior, integrating it to calculate summary statistics (mean, variance, etc.)

In the above example we used the integrate() function to apply adaptive quadrature to find the evidence. We could use this method for 2, but let's not. Instead, let us use MCMC - which is, at its core, a way to draw samples from a distribution that is otherwise hard to sample from.
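
For reference, the quadrature route for the summary statistics could look something like the sketch below. It is an illustration only: it assumes the 70-heads-in-100 case with a = b = 2, uses the known Beta form of the posterior as the density, and compares the numerical mean and variance with the closed-form Beta values.

```r
N <- 100; h <- 70; a <- 2; b <- 2

# The analytic posterior is Beta(h + a, N - h + b).
posteriorDensity <- function (t) dbeta(t, h + a, N - h + b)

# Posterior mean and variance by adaptive quadrature.
postMean <- integrate(function (t) t * posteriorDensity(t), 0, 1, rel.tol = 1e-8)$value
postVar  <- integrate(function (t) (t - postMean)^2 * posteriorDensity(t), 0, 1, rel.tol = 1e-8)$value

# Closed-form Beta mean and variance for comparison.
A <- h + a
B <- N - h + b
exactMean <- A / (A + B)
exactVar  <- A * B / ((A + B)^2 * (A + B + 1))

print(c(postMean, exactMean, postVar, exactVar))
```

This works nicely for one parameter; the case for sampling gets stronger as the number of parameters grows and quadrature becomes impractical.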

Given that this example is rather trivial, with just one parameter in question (theta), I won't step through the implementations of vanilla Monte Carlo methods (uniform, importance & rejection sampling). These implementations are pretty much straightforward from Jake's presentation.

I will, however, implement a simple Metropolis-Hastings MCMC sampler using a simple and symmetric Gaussian proposal density (q in Jake's notes).

MHstep <- function (pdf, prevCandidate)
{
	# Effectively we are taking a random walk.
	newCandidate <- prevCandidate + rnorm (1, mean=0, sd=0.1)

	# NB: Because we are using the normal distribution
	# as our proposal density, which is symmetrical,
	# we cancel out the q terms on the numerator and 
	# denominator, as q(x|y) = q(y|x)

	a <- pdf(newCandidate) / pdf(prevCandidate)		

	# Draw a uniform random number from 0 to 1	
	u <- runif(1)									

	if (a > u)
	{
		# Accept the new candidate (this happens with probability min(1, a))
		return (newCandidate)
	}

	# Else, stick with our previous candidate
	return (prevCandidate)
}

Let's use our numerical approximation to the actual posterior function as the PDF we want to draw samples from:

posteriorPDF <- function (t) posteriorN (t, N, h, a, b)

Time to go on a random walk down coin flip street.

steps <- 1000
samples <- matrix(NA, steps)

samples[1] <- 0.5		# initial guess
for (i in 2:steps)
{
	samples[i] <- MHstep (posteriorPDF, samples[i-1])
}

And how did we do?

png ("figure3.png", width=800, height=600)
par(bty='n')
par(col='red')
plot(cumsum(samples)/1:steps, type='l', xlab='Step', ylab='Estimated Mean', main='Drawing samples by Metropolis-Hastings')
dev.off()

Nice.
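
A property of Metropolis-Hastings worth spelling out: only the ratio of densities enters the acceptance test, so the evidence cancels and the sampler can run on the unnormalized posterior. The sketch below is a variation under stated assumptions, not the implementation above: it works on the log scale, discards an arbitrary burn-in of 1000 steps, fixes the random seed, and compares the sample mean with the analytic Beta mean. The names logTarget, chain and burnIn are made up for this illustration.

```r
set.seed(42)            # fixed seed so the sketch is repeatable
N <- 100; h <- 70; a <- 2; b <- 2

# Log of the unnormalised posterior (likelihood * prior); -Inf outside (0, 1).
logTarget <- function (theta)
{
    if (theta <= 0 || theta >= 1) return (-Inf)
    h * log(theta) + (N - h) * log(1 - theta) + dbeta(theta, a, b, log = TRUE)
}

steps    <- 20000
burnIn   <- 1000        # arbitrary choice for this example
chain    <- numeric(steps)
chain[1] <- 0.5         # initial guess
accepted <- 0

for (i in 2:steps)
{
    # Symmetric Gaussian random-walk proposal, so the q terms cancel.
    proposal <- chain[i - 1] + rnorm(1, mean = 0, sd = 0.1)

    # Accept with probability min(1, target(proposal) / target(previous)).
    if (log(runif(1)) < logTarget(proposal) - logTarget(chain[i - 1]))
    {
        chain[i] <- proposal
        accepted <- accepted + 1
    } else
    {
        chain[i] <- chain[i - 1]
    }
}

mhMean     <- mean(chain[(burnIn + 1):steps])
exactMean  <- (h + a) / (N + a + b)
acceptRate <- accepted / (steps - 1)
print(c(mhMean, exactMean, acceptRate))
```

The acceptance rate is a rough guide for tuning the proposal standard deviation: very high or very low rates both tend to mean the chain explores slowly.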

Comments (1)

predict.i2pi

19 Jun 2009, 06:57 AM

the basics

"If you are not embarrassed by the first version of your product, you've launched too late."

On Monday I released predict.i2pi.com, a statistical learning web service. Designed to deal with common classification and regression problems, it takes input data in the form of a CSV file and returns to the end user a set of predictive models. For example, if you have a list of store locations, local weather data, and store revenue, you could use the service to see if location and weather impact store revenue. predict.i2pi tries to determine whether predictions are possible by running your data against a growing number of user contributed statistical learning algorithms and finding the ones that work best with your data.

In planning this I went through a range of features, bells and whistles but have decided to strip it all back. This is the simplest thing I could build to support what I wanted. It takes a file, runs predictive algorithms against the file, and returns performance measures. Data and predictions.

data

The data provided is expected to be in the form of a number of observations, with one observation per row. Each column contains measurements for these observations. One or more of the measurements we are interested in predicting. For example:

|<------ Explanatory Variables ------>|   /----- Response Variables (denoted by *)
X1,    X2,    X3,    Name,  Date,        *Y
12.3,  13.4,  8.32,  Terry, 2008-10-12,  736.0
 9.3,  34.1,  1.21,   Josh, 2008-10-12,  NA     <-- NB: NA response variables
...    ...    ...    ...    ...          ...            will have predictions available
 8.7,  38.7,  8.17,   Jess, 2009-01-07,  1823.1         for subsequent download.

Data may include observations for which we do not know the response. These observations can be included, with the response left blank. Once satisfactory models are found, end users can download spreadsheets containing our best predictions for that data. On my todo list is adding confidence intervals to these values.

Once uploaded we try to best detect the following data types:

  • Numeric (floating point numbers)
  • Integers
  • Dates (YYYY-MM-DD works best)
  • String Factors (e.g., State or letter scores)
  • Text (longer text than factors, with analytic interpretation as language text instead of as factors)

learning

Internally, predict.i2pi performs a standard test / training protocol. Data is loaded and a random half of that data is used to train the learning algorithm. The remaining half is used to test how well the learned algorithm works against previously unseen data. Robust algorithms will do almost as well on the test as during training, while less robust approaches will lead to far poorer performance during testing. The system continues this process of picking a training sample, training and then testing as many times as possible in an allotted time. During each of these cycles, predictions are tested against the actual responses in the corresponding observation. Performance is then measured using the R-squared metric for regressions and simple classification accuracy for classification problems. The system supports user defined performance measures with the goal being to let those who supply data decide on which performance measure is best for their application. However, at the moment I'm concentrating on opening up the ability for users to upload their own learning algorithms.
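
The protocol can be sketched in a few lines of R. This is an illustration only and not the service's actual code: the data is synthetic, the learner is a plain lm(), and a fixed count of 50 cycles stands in for the time allotment.

```r
set.seed(1)

# Synthetic data, purely for illustration: y depends linearly on x.
n <- 200
d <- data.frame(x = rnorm(n))
d$y <- 2 * d$x + rnorm(n, sd = 0.5)

# R-squared of predictions against held-out actuals.
rSquared <- function (actual, predicted)
{
    1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

cycles <- 50
scores <- numeric(cycles)
for (i in 1:cycles)
{
    trainIdx  <- sample(n, n / 2)                 # random half for training
    model     <- lm(y ~ x, data = d[trainIdx, ])
    testSet   <- d[-trainIdx, ]                   # unseen half for testing
    scores[i] <- rSquared(testSet$y, predict(model, testSet))
}

print(quantile(scores))   # quartiles of the test-set R-squared
```

Reporting the spread of the scores across cycles, rather than a single number, gives a feel for how sensitive a learner is to the particular split.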

Currently learning algorithms are specified in small snippets of R code that can be dynamically loaded into the main R subsystem that is responsible for coordinating training cycles. See, for example, rpart.R which links in a recursive partitioning algorithm from the rpart library.

#requires(rpart)

myModel <- function (formula, data) rpart(formula, data, na.action=na.exclude)
myPredict <- function (model, data)
{
	p <- predict (model, data)
	as.numeric(apply(p, 1, function(r) order(-r)[1]))
}

All learning algorithms must contain two function definitions: myModel and myPredict. myModel takes a model formula and data, returning a model object that can be used to make predictions against new data. myPredict takes two parameters, the model object returned by myModel and a set of data that may not have been seen during training. We call the model function, myModel, with one randomly ordered half of the data for training. For testing, we provide myPredict with the model object generated from the training set, but provide it with the as yet unseen testing portion of the data.
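
As a second illustration of the contract, here is a hypothetical regression snippet built on lm(), followed by a small local harness that mimics one training cycle. The harness, the column names and the synthetic data are assumptions made for this example; only the myModel / myPredict signatures come from the description above.

```r
# Hypothetical regression plug-in following the myModel / myPredict contract.
myModel   <- function (formula, data) lm(formula, data, na.action = na.exclude)
myPredict <- function (model, data) as.numeric(predict(model, data))

# Local harness standing in for one training cycle (not the service's code).
set.seed(2)
d <- data.frame(X1 = runif(100), X2 = runif(100))
d$Y <- 3 * d$X1 - d$X2 + rnorm(100, sd = 0.1)

trainIdx    <- sample(nrow(d), nrow(d) / 2)
model       <- myModel(Y ~ X1 + X2, d[trainIdx, ])
predictions <- myPredict(model, d[-trainIdx, ])

print(cor(predictions, d$Y[-trainIdx]))
```

Running a snippet through a harness like this locally is a cheap way to catch shape errors, such as a prediction vector of the wrong length, before submitting it.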

Users are also able to define transforms that take a matrix of explanatory variables and return a new matrix with the same number of observation rows but with one or more of the explanatory columns transformed into a new space. For example one could take a 100 column matrix and apply some form of dimensionality reduction that returns a new matrix, with the same number of rows, but only 10 columns. The transform function is not shown the response variable to ensure that no funny business occurs whereby the response is somehow embedded in the explanatory variables. These same transform functions can then be applied to response variables alone, allowing the system, for example, to construct a model log(Y) ~ PCA(X1, X2, ... , Xn).

The following example shows a transform function that replaces any columns that are at least 50% NA with an indicator variable:

myTransform <- function (x)
{
	if (is.null(dim(x))) return (x);
	if (ncol(x) == 1) return (x);

	bad_idx <-  apply(x,2,function(c) sum(is.na(c)) / length(c)) >= 0.5
	if (any(bad_idx))
	{
		y <- x
		y[,bad_idx] <- is.na(y[,bad_idx])*1	# replace NA's with an indicator variable
		return (y)
	} else
	{
		return (x)
	}
}
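
The dimensionality reduction case mentioned earlier (100 columns in, 10 columns out) could be sketched along the same lines. This is a hypothetical transform, not one shipped with the service: it assumes a complete numeric matrix with no NAs and no constant columns, and the extra argument k with a default of 10 is an arbitrary choice for the example.

```r
# Hypothetical transform: project the explanatory columns onto their
# first k principal components (k = 10 is an arbitrary choice here).
myTransform <- function (x, k = 10)
{
    if (is.null(dim(x))) return (x)
    if (ncol(x) <= k) return (x)

    # Assumes x is numeric, has no NAs and no constant columns.
    pca <- prcomp(x, center = TRUE, scale. = TRUE)
    pca$x[, 1:k, drop = FALSE]
}

# Synthetic 200 x 100 matrix, purely for illustration.
set.seed(3)
x <- matrix(rnorm(200 * 100), nrow = 200, ncol = 100)
z <- myTransform(x)
print(dim(z))
```

The number of rows is unchanged, so each observation still lines up with its response.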

coming soon

As for uploading code, the best way to do this right now is via email. I hand rolled my own sandbox environment to prevent 3rd party code from hijacking my system - but as with any security code that I write myself, I loathe testing it in the real world until I've had a good chance to be as close to 100% sure that it is safe. In reality, I'll probably stop trying to reinvent the wheel, and use a pre-existing solution.

Given long term plans, and issues around data privacy, I didn't want to set up a system whereby data leaves the system for testing on external machines. While it works well for very large datasets, e.g., the Netflix Prize, the potential of overfitting is higher for smaller datasets when random portions of that data are often reused in validation cycles. That said, developing new learning algorithms (or plugging in ones from existing CRAN libraries) is fairly straightforward so you should be able to develop locally and upload.

There already is an API, but it is not at all documented. This is my next priority. Currently I'm running into some issues with using RCurl to interact with my API - issues which would not exist in any other language - but I really would like to get the R API out of the door before I open up wider access. In short, there are 3 methods which are currently used by the web site (inspect my horrid JavaScript code to see them). These allow you to upload data, make edits to meta data and receive predictions. Each prediction includes links to the R source that was involved in performing the learning + any transforms used. The prediction meta data also includes the quartiles for the measure after a number of test/train cycles, plus a sample of 250 predictions vs. actuals.

It has been suggested that I also include a small downloadable example snippet for each file to allow developers to get a better flavor of what they are working with. For larger files, I think this is a perfectly swell idea. In fact, I really do want to hear more of your suggestions. I took a knife to a slew of functionality before I released this, but I have code ready to go. But I want to wait for real life suggestions to see what I should be working on next.

The original plans for this project included complex routines for doing unsupervised schema detection and meta modelling to help identify which algorithms might work best with particular shapes of data. Also I had built a framework for combining multiple learning algorithms in a boosting type environment. All of these features remain possible and will hopefully be released in the not too distant future.

One of the big issues I struggled with in deciding to release this is the nature of my target audience. At the moment there is an impedance mismatch between the sophistication required to understand what the system does and the utility of the system to sophisticated users. To those with any experience in predictive analytics, everything here should be your bread and butter - and most likely far simpler than what you do on a daily basis. However there is a large audience of people in the information business who currently make do with the 'Add Trendline' option in MS Excel. To this audience, this service would be greatly valued, but in its current form is probably a bit too much. This deeply embarrasses me, but I'm not going to let that stop me from publicizing what I'm up to. There is a plan, and it exists in increments.

For the lay information worker, there are hurdles both in providing understandable explanations of how the learning algorithms work and were applied, and in adapting my format to the natural shape of the data that they often work with - not to mention data cleaning. As an example, time series models pose an interesting problem. They break the model of one independent observation per row, and it is difficult to come up with a way of training and testing them that is consistent with my current implementation. Even if I were to develop special case handling for time series data, it can be difficult for a computer to find appropriate periods over which to lag variables. At this point I think the simplest route is to let people include previous observations that they deem important, at lags that they think might be interesting, with each row. That way each row can be treated independently from the others and I don't have to build a lot of machinery to guess appropriate treatment of temporal dependency.
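
Preparing such a file by hand is not much work. Below is a minimal sketch of adding lagged copies of a column before upload; the helper name addLags, the column names and the sample values are all made up for this illustration, and it assumes every lag is shorter than the number of rows.

```r
# Add lagged copies of one column so each row carries its own history.
# Assumes every lag k is smaller than the number of rows.
addLags <- function (data, column, lags)
{
    n <- nrow(data)
    for (k in lags)
    {
        lagged <- c(rep(NA, k), data[[column]][seq_len(n - k)])
        data[[paste0(column, "_lag", k)]] <- lagged
    }
    data
}

# Made-up daily values, purely for illustration.
sales  <- data.frame(Date = as.Date("2009-01-01") + 0:4, Y = c(10, 12, 11, 15, 14))
lagged <- addLags(sales, "Y", c(1, 2))
print(lagged)
```

The first few rows end up with NA in the lag columns, since there is no earlier observation to copy from.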

Likewise there are other problems whose natural representations don't map neatly to the one row = one observation representation - think of collaborative filtering or graph based problems. I am quite keen on keeping the one row representation as it affords me some nice system scaling properties without becoming too domain specific. That said, there is nothing stopping me from building front-ends that take data from these problem domains in their natural representation and map them to one that works better for my system.

When it comes to explaining the models, well. That is another story.


Comments (6)

Engineering vs. Architecture

16 Jun 2009, 17:18 PM

A few months back I caught up with a fellow Aussie in New York, whom I first met ten years ago. It is amazing how social network dynamics change as an expat. He is currently teaching Architecture at Columbia while completing his doctorate in the nature of representation in architecture. It was the sort of long conversation that lingers for a few months before finding a resting place in your mind. At first glance we spent quite some time discussing the work at the Spatial Information Design Lab as this most closely bridged the gap between our worlds. The deeper conversation was that of representation.

Engineers build things. They use sciences to make sure that the things they build don't fall over. Architects design things - they take ideas of the world and represent them. Their audience is both the client and the engineering and construction teams. Different representations serve different purposes. Engineering: Representation to World. Architecture: World to Representation.

Financial engineers take what they know about how companies work and build new things to serve other companies. Economists take the real world and make model representations of reality. However there is a void in economics, between the macro and the micro, in the domain of the company. Likewise, there is a void in financial engineering. Financial engineering is currently dominated by time-series analysis. I posit a weak form of the Black Swan theorem - namely that we currently don't know enough about the past to even pretend to predict the future. We have financial historians, in the form of data providers, but we don't have the architects to take this repository of past knowledge and build representations of how companies operate. Accountants across the globe set and implement the rules of this complex system, but we don't understand its dynamics.

Can financial engineers shed the instrumentation of time series analysis and take on this role? Or will it come from a new group - the type of people who build Googles?

Or will their buildings leak?


Image by ken mccown on Flickr.
Comments (0)

Predicting social network features from profile picture features

13 May 2009, 14:16 PM

The interwebs has made it really easy for those who are looking for data to find it. Or at least a close approximation to it. Those who have the tools to scrape the web and reverse out interesting data are typically part software developers, statisticians and hackers. Mix these three together and one is genetically predisposed to collect as much data as possible. But there must come a point when the collection stops and the inference begins. Inference is difficult, in that it requires making statements about the mushy world, whereas coding systems to collect data deals with deterministic computers. It is easy to fall into the trap of simply collecting data to avoid dealing with mush.

In my previous job I was overseeing a project which involved scraping a large publicly traded e-commerce site to find interesting information to support investment decisions. The problem was that everyone on the street was also scraping the same site. Our code was top notch and having done this before we were able to avoid common pitfalls and our system was gathering oodles of potentially useful data. Faced with all this data one of my developers had a tough time working out where to start the inference process. Sure, there are obvious places, like predicting revenue from site activity, but they are obvious. So obvious that even the sell-side researchers were doing it. Faced with the task of finding something less obvious, my advice to my colleague was to pick a pair of columns at random and come up with a model for the dynamics of the relationship. In such an exercise one picks the boring stuff, like transaction numbers instead of transaction values - and begins decoding from there. The goal is to look at the data sideways and see if anything interesting pops up.
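The "pick a pair of columns at random" exercise can be sketched in a few lines. The column names and numbers below are made up for illustration, and a Pearson correlation stands in for whatever first look at the relationship one prefers; the real work of modelling the dynamics starts after this.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no linear signal
    return sxy / sqrt(sxx * syy)


def random_pair(table, rng):
    """Pick two distinct columns at random and report their correlation."""
    a, b = rng.sample(sorted(table), 2)
    return a, b, pearson(table[a], table[b])


# Hypothetical scraped columns, one list per column.
table = {
    "transactions": [10, 12, 15, 11, 18],
    "listings": [100, 130, 160, 115, 190],
    "page_views": [900, 850, 990, 870, 1010],
}

a, b, r = random_pair(table, random.Random(0))
print(a, b, round(r, 3))
```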

Most often this fails. But it is a good way of breaking out of the data rut. When I do this exercise, I typically find myself desperate for one other set of observations to help explain what I am seeing. While the result may be boring, it is not failure as it gives you a direction with which to approach the data.

Currently I have a few projects that involve social networks in one shape or another. While my clients are generally looking for the standard orthogonal projection of the data, I can't resist the urge to look at things sideways.

A client was walking through the important data that they collect about social network activity, but when I talked with their developers they mentioned in passing that they also collected profile pictures. Not for analysis, but for another part of their suite. Pictures. Cool. Thems be data. Excitedly I professed, as if I actually knew what I was talking about, that there is probably heaps of juicy stuff inside profile picture data. Intrigued at my own confidence I decided to tackle this by scraping 250,000 profile pictures from MySpace and grabbing a few key stats about each account. The first thing I wanted to examine was whether profile pictures in any way informed the number of friends that a user had. They do.

As I have other plans for this data, I didn't scrape MySpace with the complete intention of doing this project, thus only 19,214 of the images have associated friend lists. But this was enough to get started.

First off I wrote a short C program to calculate 32 features from each image. These features are pretty typical image processing functions, like size, average color levels, number of colors, smoothness, symmetry and a few key points from the luminosity histogram. MySpace pictures tend to be a mix of faces, icons and general photos - to roughly help identify faces (without committing to face-specific measures) I also calculate a subset of these values for the central portion of the image, including recording the location of the vertical axis of greatest symmetry. Most of the values have been normalized to some image-specific reference to increase variation and limit covariance; for example, the average R,G,B values are expressed as a % of the image's luminosity.
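Two of these features can be sketched as follows. This is not the original C program: it is a Python illustration that assumes an image is a list of rows of (R, G, B) tuples, uses the common Rec. 601 luma weights as a stand-in for "luminosity", and scores symmetry by mean absolute luminance difference across a candidate axis. All function names are hypothetical.

```python
def luminance(pixel):
    """Rec. 601 luma; an assumed stand-in for the post's 'luminosity'."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


def color_shares(image):
    """Average R, G, B, each expressed as a % of the image's mean luminance."""
    pixels = [p for row in image for p in row]
    n = len(pixels)
    mean_lum = sum(luminance(p) for p in pixels) / n
    if mean_lum == 0:
        return (0.0, 0.0, 0.0)  # all-black image
    return tuple(100.0 * sum(p[i] for p in pixels) / n / mean_lum
                 for i in range(3))


def best_symmetry_axis(image, min_half=1):
    """Return (axis, score) for the vertical axis with the lowest mismatch.

    An axis value c means the boundary between columns c-1 and c.
    min_half is the minimum number of columns compared on each side.
    """
    lum = [[luminance(p) for p in row] for row in image]
    width = len(lum[0])
    best_axis, best_score = None, None
    for axis in range(min_half, width - min_half + 1):
        half = min(axis, width - axis)
        diff = sum(abs(row[axis - 1 - k] - row[axis + k])
                   for row in lum for k in range(half))
        score = diff / (half * len(lum))
        if best_score is None or score < best_score:
            best_axis, best_score = axis, score
    return best_axis, best_score


# Tiny synthetic image, mirror-symmetric about its centre line.
A, B = (200, 0, 0), (0, 0, 200)
image = [[A, B, B, A], [A, B, B, A]]
axis, score = best_symmetry_axis(image)
shares = color_shares(image)
print(axis, score, shares)
```

Normalizing the channel averages by luminance means a dark red image and a bright red image land close together, which is the point of expressing the values relative to an image-specific reference.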

To get a sense of the variation of these features, I constructed an image based on the first two principal components of the feature space. At this point some kind folks on Twitter (starting with Mark Reid) pointed me towards t-distributed stochastic neighbor embedding. Someone mentioned that I could forgo my feature calculation code and simply use tSNE on the pixel data, which sounded exciting, but after reading the paper I decided against it. In the paper the authors do demonstrate this technique, but their first step after reading in the pixel data is to perform PCA to reduce the dimensionality of the problem. And their image set was much more well behaved than the images I was working with. Maybe I'll take another look at tSNE in this context when I next have some free time.

Visualization aside, the next step was fairly simple. I divided my data into a 75% training set and reserved 25% for testing, attempting to predict log(# of friends) from my image features. A linear model was pretty poor, but not terrible. In sample I got an R^2 of 0.17; out of sample it was far worse. Using an SVM, I limited the training to classification rather than regression - trying to classify in groups by quantiles of log(# of friends). For a simple binary classification (more or fewer friends than the median sampled MySpace user), the accuracy was 70% - with errors evenly distributed across the two classes. I also tried 3 and 4 classes, and the lift was similar.
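The mechanics of this step, reduced to a sketch: a 75/25 split, a least-squares fit, R^2 measured in and out of sample, and median-split labels for the binary classification. It uses a single made-up feature and synthetic data rather than the 32 image features, leaves out the SVM entirely, and assumes the feature is not constant. None of the numbers it prints relate to the results reported above.

```python
import random
from statistics import median


def split(rows, train_frac=0.75, seed=0):
    """Shuffle and split rows into (train, test)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]


def fit_line(pairs):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(pairs, slope, intercept):
    """1 - SS_res / SS_tot; can go negative on held-out data."""
    my = sum(y for _, y in pairs) / len(pairs)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in pairs)
    ss_tot = sum((y - my) ** 2 for _, y in pairs)
    return 1 - ss_res / ss_tot


def median_labels(values):
    """1 if the value is above the sample median, else 0."""
    m = median(values)
    return [1 if v > m else 0 for v in values]


# Synthetic (feature, log_friends) pairs, for illustration only.
rng = random.Random(42)
rows = []
for _ in range(200):
    x = rng.random()
    rows.append((x, 2.0 * x + 1.0 + rng.gauss(0, 0.5)))

train, test = split(rows)
slope, intercept = fit_line(train)
r2_in = r_squared(train, slope, intercept)
r2_out = r_squared(test, slope, intercept)
labels = median_labels([y for _, y in rows])
print(round(r2_in, 3), round(r2_out, 3), sum(labels))
```

The model is fitted on the training rows only; scoring the held-out rows with the same slope and intercept is what makes the out-of-sample number an honest check.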

To visualize this I performed the regression using an SVM and, as expected from the results of the classification, got a decent R^2 (0.25) on the out of sample test set. To get a better sense of the outliers, I produced the following visualization. Note in this visualization I have sacrificed some positional accuracy by enforcing a constraint that no images may overlap.

I also used a similar approach to test whether I could predict other, more interesting, network features like measures of centrality, and my initial results are positive.

At this point, if I were to run with it, I'd like to make some assessments as to the underlying process that relates social network features to image features. Until I have more time, my current hypothesis is Boobs.


For those of you with more time on your hands, I have packaged up some datasets. Grab them here.


Comments (4)

Ratings Rant

12 Mar 2009, 03:56 AM

This is a comment on Jesper & Toby's recent E-Tech presentation. For some reason I posted it twice to the blog but it never made it through. My current theory is that the gubberment is out to get me.. Or I suck at technology.

(I guess my previous comment got lost in the system. Maybe someone can bail it out)

First up, congrats Toby & Jesper on tackling this issue. It is a shame that the Money:Tech conference was cancelled this year as that would be the perfect venue to address a good mix of quants and techs and spark some serious discussion. Barring that, I wanted to chime in with my own 2c as a techy/quant guy.

[disclaimer]In my previous life I did a fair bit of equity research on the ratings agencies, and the views here are mine and probably not shared by others. I have no positions in any CRAs right now[/disclaimer]

While the talk focussed on corporate bond ratings, the largest growth area for these agencies was in structured finance. And much of this was mortgage backed securities and similar derivatives. So to understand the mess we are in now we need to look at the history of these instruments.

MBS's were born for two key reasons. First off was the realization that in 'normal' times the dominant risks were idiosyncratic in nature and as such could be minimized through the application of portfolio theory and diversification, leading to pooled entities with smoothed cash flows and tranches providing for the needs of various risk profiles. In my opinion this story is primarily the sizzle. The real steak was the fact that by aggregating together whole loans new tradeable instruments could be formed.

The problem with whole loans was that their pricing was highly dependent on a large vector of unstandardised parameters whose diversity precluded the formation of any depth necessary to support liquidity in traditional market designs. By eliminating the idiosyncratic risk components these pooled instruments could theoretically be summarized by a small set of parameters and relatively simple models for prepayment risk meant that traders could respond to bids and asks against them.

Faced with these simple models quants took off in a great fantastical leap and applied ever more complex techniques to model out pricing. For one take, see Felix Salmon's recent piece in Wired. Or look to Paul Wilmott's take on the ever escalating departure into a mathematical wonderland that ignored the realities of the underlying loans and their associated risks.

Somewhere along the line practitioners forgot that the technology underpinning the frothy new market was based on 1970's financial and computational technology. Back then a bank of associates armed with HP-12C's could price out MBS's using a small set of descriptive parameters.

Over the next 20 years more computing power was thrown at the problem, but the basic data was still confined in scope. Sure, some funds were taking apart these pools and doing a deep analysis of the components, but there wasn't much reward in doing so as the market was moving at such an upward clip.

Even worse, if you look at the papers from Frank Partnoy, the credit ratings agencies - who were supposed to be taking a deeper look at these securities, without the demands of second by second trading - were using plainly silly assumptions. There was a huge amount of mathematical and financial stupidity going on. Not even going to mention the conflicts of interest and the regulatory arbitrage at play...

Anyway... To address Falafulu's point about MPT - I agree, MPT is great stuff; a very powerful framework by which to understand finance. But just look at the assumptions. Sure, these assumptions make the math tractable, but modern computing power enables us to take a more nuanced view of the world. We no longer have to rely on single parameters of 'default risk' to price these instruments. The market would be far better served if all available data for the underlying components were released, so that participants could use their own information about their own risk profile to come up with better measures of value. Just compare David Einhorn's spreadsheet with a report put out by Moody's. It is night and day. Give me the data, not some puff piece of pseudoscientific nonsense passing itself off as high finance.

The original problem with trading whole loans, namely that there were too many parameters to support liquid markets, is no longer an issue. Look at WeatherBill. Look at Robin Hanson's work on combinatorial market mechanism design. Falafulu, sure some smart people were recognized for their ground breaking work of decades ago. But the most recent John Bates Clark medal in Economics went to Susan Athey, who is doing some fantastic work in mechanism design.

Computational power is such that we no longer need to pretend that all financial instruments have to be priced with a slide rule. We have new marketplaces that can effectively support trade in financial instruments with high dimensionality. We have the computational power to let traders value these instruments. What we don't have is the data.

Give us the data and we will trade.

For a less ranty take on my world view, check out my blog post Data trades inversely to liquidity
Comments (5)

Data trades inversely to liquidity

20 Feb 2009, 04:36 AM

I recently voluntarily left my job running a 'renegade' equity research group to start an independent 'big data' consultancy business and in this economic environment this fact regularly gets me odd looks. Who in their right mind leaves a great position at a successful fund when half of Wall Street is battling to keep their jobs? This is not the first time I've done this, having run a similar consulting business right after the burst of the dot-com bubble. I don't have a great deal of life-wisdom, but if I did my only credo would be that when life hands you strawberries, it is time to go hunting for lemons. Don't rest on your laurels and the best time to be risk seeking is when everyone else is risk averse.

Having had to explain my rationale for launching i2pi as a consultancy frequently, I've come to rely on the phrase 'data trades inversely to liquidity.' This notion holds true in both my prior world of investment management and especially now in the data collection and analysis business. In finance, when markets are liquid price discovery is cheap. With all the talk right now of mark to market accounting treatments, it is clear that the converse is also true. Holders of illiquid securities can no longer rely on quoted prices to manage their portfolio risk. As the current crisis began to unfold earlier last year there was a very visible Mexican standoff while shops with CDO/etc. exposure refused to trade as the act of trading would force everyone else to reprice their own portfolios. Doing so could only last so long and the inevitable write downs began to occur as margins were being called. And thus the house fell.

The premise that led us to this mess was that with only a modicum of data and some threadbare models trading would be the final arbiter of value and the collective intelligence of efficient markets would result in fundamentally sound pricing. Now that liquidity has gone from the markets, traders of these illiquid instruments are bulking up their data and models to try and better their understanding of fundamental value. And so it is that when markets are liquid the market relies on trading to assimilate the information of individual agents. Without this method of price discovery these agents need to gather their own data as the market no longer performs the role of grand aggregator. Data trades inversely to liquidity.

While my work at the fund was phenomenally diverse and deeply intellectually stimulating, there was no fire. I've never had a real job. I've only ever worked at start ups where there is no time for a 'job.' In a constant state of conflagration, everything at a start up requires immediate attention. Early on in my career I worked as a back-end system engineer and 'fire' usually involved dealing with scalability and general growing pains. Late nights implementing features that were sold to paying clients well before the development team was consulted. Later I spent more time selling these features and there was a constant fire to come up with new and interesting things to attract clients and revenue. At the fund our financial stability was near certain and while there was a drive for deeper insight, the fire was lukewarm at best.

The current financial crisis is, at its core, rooted in the debt markets and this dislocation has clearly negative consequences for start up financing. Contemporaneously, new technologies and operational methods allow technology start ups to scale efficiently. Cloud Computing, as distinct from the similar buzz about Web Services just a few years ago, provides a platform upon which small companies can grow their operations in proportion to their needs without large capital investments in hardware or expensive, unwashed and hirsute systems administrators. Hadoop, Memcache and their ilk let developers build applications that operate on huge data sets without investing in the expensive vertical scaling solutions of Oracle & Co. And social networking results in network driven growth patterns that can be much steeper than products or services that live on an island. However, the skills required for scaling analysis systems are quite different to those needed to scale operations: part statistician, part database administrator, part computational micro-economist, and then an understanding of business to tie together a narrative that tells stories with numbers rather than purely stories about numbers.

The environment I see around me for technology start ups is one whereby funding is hard to come by and series B's are even more painful to founders. These companies need to be smart. They need to focus some of their attention on what to do with all the data that they gather as part of daily operations. Development teams, while facing the world with more appropriate tools than those available a few years ago, still focus on operations. Someone needs to focus on research. Data driven research. In times when funding was easier and valuations were higher, companies could focus on operations and hope that operational scale would lead to revenue. I firmly believe that in this environment data, rather than scale alone, is of immense value.

Beyond the empirical sciences of revenue optimization, behavioral targeting, customer segmentation and the discovery of on-line arbitrage there lies a need for basic research that develops fundamental models for understanding the environment in which a firm operates and how to question the unknown. This type of science is closer to applied sociology or experimental physics than the tradition of ponderous economics in that it is driven by and dependent on the immense mass of data that new businesses are generating. But as with all science the role is to simplify; to construct a narrative upon which new questions can be found and companies can learn to change how they think about their operations rather than simply what they know.

i2pi seeks to bring this science to companies.


Comments (3)