Messing around with Strava & R. This is a map of my speed riding around different parts of the greater portland area.

Josh’s PostgreSQL Database Conventions

Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowchart; it’ll be obvious.

We’ve chosen to use relational databases, specifically PostgreSQL, for storing some of our data. We like ACID, we like the ease of ad-hoc queryability, and we like the fact that databases add an additional layer of security and data-quality control. To make the most of this we should adopt some conventions, so that when we are accessing PG from an ORM, we don’t bring too many ORM-isms into our data model. This ensures that other staff, who might be using other ORMs, can still work with our data, and it also prevents us from relying too much on the ORM or application layer to do work that an RDBMS can already do.

These are my preferred conventions; we will evolve them over time. I’ll try to provide a justification for each one, but this is a discussion.

  1. All names (table, column, sequence, index, constraint, role, etc.) should be lowercase with underscores. Postgres does support AnYSortOF casing that you’d like, but it makes manual querying painful.
  2. Table names should be a singular noun that describes one row: “account”, not “accounts”. Some people prefer plural; we just need a standard. My vote is for singular, as it makes SQL a little more natural to read,
    e.g., “SELECT * FROM account WHERE account.balance > 5000;”.
  3. We’re using a relational database. Have relations. Very few tables should be islands.
  4. Foreign keys should be named “<table>_id”, e.g., if the “account” table links to the “person” table, there should be a column in “account” called “person_id”. In the case where there are multiple foreign keys to the same table, prefix the ids, e.g., “from_person_id” and “to_person_id”.
  5. Foreign keys must have foreign key constraints. It makes the schema more readable, both by humans and introspection tools. It also prevents mistakes at the application layer.
  6. Serial columns should have the sequence as the default value for that column. E.g., if the “account” table has a primary key of “id”, it should be defined (in SQL) as “id SERIAL PRIMARY KEY”, which is a shortcut for “INTEGER NOT NULL DEFAULT nextval(‘account_id_seq’)”.
  7. Never expose serial columns outside of the model layer. If any table is going to be exposed in any way via an API, it should have a UUID column that will be exposed instead of using the “id”.
  8. Index, constraint and sequence names should take the form of table_column_[idx | uidx | seq | ck] for indexes, unique indexes, sequences and constraints.
  9. Unique indexes should encompass all the rules for uniqueness. If the “user” table can only have one copy of each user, consider a unique constraint on first_name, last_name, address and zip, also on SSN, or whatever. There is nothing wrong with having the front end, back end and database all check this.
  10. Constraints should reflect business rules. Just because your application does sanity checking, it doesn’t mean that some bozo at the terminal will do it. FYI Josh has access to most of our machines and is a bozo.
  11. Postgres has a rich selection of native types (IP Addresses, UUIDs, Time intervals, Polygons). Use them where appropriate. If your data is an IP address, stick it in an INET. If it is a UUID, there’s a type for that.
  12. Postgres also supports enumerated types. If we have a relatively immutable small list of possible values for a column, use an enum.
  13. If the SQL type is not descriptive enough of the data that is stored in a column, include the units of measurement in the column name. E.g., “height_meters” if we are storing a height, in meters. God knows why we’d do that, but you get the idea.
  14. Don’t be afraid of TEXT. If you want to store free-form text, VARCHAR(2048) isn’t what you want. Postgres is smart enough to move large chunks of text outside of the table and into a blob, so VARCHARs end up taking more space than TEXT. If there are strict length constraints, don’t use TEXT.
  15. Don’t be afraid of NUMERIC. We are dealing with money. Bigints are fine, but we then need to rely on the application layer to do the right thing: each application needs to know what 12345 means in dollars. When we start having interest-bearing accounts, 4 decimal places may not be enough. Postgres supports arbitrary-precision numbers. We should standardize on NUMERIC(18,6) for money. And please be sure that your application doesn’t silently translate arbitrary-precision numbers into IEEE-754 floats or similar. We all saw Superman III.
  16. Set reasonable DEFAULTs. If you have a column called “created” which records when a row was created, a reasonable default would be now(). I saw this as a default on one of our tables: “NOT NULL DEFAULT ''::character varying”. Not reasonable. If it’s not supposed to be null, setting the default to '' is silly. At the very least, decide whether each column should be NULL.
  17. Don’t be afraid of schemas. Postgres supports multiple object namespaces within the same database. If you’re unaware of schemas, you are probably creating objects in the “public” schema. If we ever get to a point where any database has dozens of tables, schemas are a good way to clarify the roles of each table. Look into it.
  18. By default, don’t denormalize. At our scale, it’s bad form to have the same column in two tables that are joined by a 1:1 relationship. Avoiding this means less logic at the application layer to enforce consistency.
  19. Many-to-many tables should be named with the names of the two tables they join.
  20. Log modifications. If tables have mutable columns, provide a _history table that keeps track of changes. If you’re so inclined, you can do this with triggers.
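To make these conventions concrete, here is a sketch of what a schema following them might look like. The tables and columns are invented for illustration, and uuid_generate_v4() assumes the uuid-ossp extension is installed:

```sql
-- Hypothetical schema illustrating the naming and typing conventions above.
CREATE TYPE account_status AS ENUM ('open', 'frozen', 'closed'); -- rule 12

CREATE TABLE person (
    id         SERIAL PRIMARY KEY,                        -- rules 1, 6
    uuid       UUID NOT NULL DEFAULT uuid_generate_v4(),  -- rule 7
    first_name TEXT NOT NULL,                             -- rule 14
    last_name  TEXT NOT NULL,
    created    TIMESTAMPTZ NOT NULL DEFAULT now()         -- rule 16
);

CREATE TABLE account (
    id        SERIAL PRIMARY KEY,
    person_id INTEGER NOT NULL REFERENCES person (id),    -- rules 4, 5
    status    account_status NOT NULL DEFAULT 'open',
    balance   NUMERIC(18,6) NOT NULL DEFAULT 0,           -- rule 15
    created   TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Rule 8: explicit names for indexes and constraints.
CREATE UNIQUE INDEX person_uuid_uidx ON person (uuid);
ALTER TABLE account
    ADD CONSTRAINT account_balance_ck CHECK (balance >= 0); -- rule 10
```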

Principles of Big Data / Data Science

I threw this slide up as part of my talk yesterday at the IA Ventures Big Data conference. The talk was titled "Big Data with Small Data" and was my attempt at describing how we at BankSimple apply big data techniques to relatively small data sets.

It was a short talk and while this was my key slide, I didn’t have as much time to discuss it as I would have liked. What do you think?


PayPal horror stories

I’ve spent the last 37 minutes trying to send $104 to a friend for tickets to a party. He specifically asked me to send him the money via PayPal. I have a PayPal account, so it shouldn’t have been a problem.

The problems began when I first attempted to send him the money. PayPal complained that they weren’t able to confirm that I owned the account. I’m currently on a business trip and using the hotel’s internet connection, so I figure the odd IP address is confusing PayPal. None of the suggestions provided looked like it would fix the problem, so I decided to verify my account by linking PayPal directly to my checking account.

My checking account is with USAA, and linking the two required providing PayPal not only with the username and password to my bank account, but also with the answers to three security questions and my card’s PIN. It took me a good 10 minutes of switching back and forth between USAA and PayPal to sort that out. (It’s more difficult than it should be. Why don’t banks support OAuth?)

By linking the two accounts I assumed that I would have provided PayPal with enough evidence that I owned my account. This wasn’t the case.

I also re-verified my email and set up and verified my phone number with PayPal. It took more than 5 minutes for their verification SMS to reach my phone, which currently flutters between one bar and ‘Searching…’ while I’m out here in the Midwest. Not PayPal’s fault, but clunky.

None of this helped convince PayPal that I was who I said I was. If I gave anyone else all the information I had just shared with them, they could walk away with my money. But I couldn’t send $104 to pay for party tickets.

It’s two thousand and ten. Money is electronic. Sending money between American banks, while clunky, is cheap(*) and easy. Doing it via PayPal is hard because PayPal supports international transfers and thus, rightfully, expends more effort fighting fraud than they do sending money. Horror stories like mine are common. But while PayPal could certainly improve their web interface, the majority of the experience failures are due to their interminable vigilance against fraud.

(*) Cheap for the banks. While it may cost them a tiny fraction of a penny, they will readily charge customers far more.

Pythonic financial simulation

Although I’m really a C programmer, I’ve been doing more work over the past three years in Python. Today, for the first time, I decided to write a digital signal processing program in Python. C is usually my go-to language for these types of tasks, and I felt like a fish out of water.

You can check out the code to my spectrum analyzer, if you are so inclined, at gist. There are a few sections where the code reads more like what a C programmer would write, rather than that of a native Python programmer. I was having trouble clearly expressing the following line of code:

[ord(s[2*i]) | ord(s[2*i+1])<<8 for i in range(0,len(s)/2)]

Essentially, that code did exactly what I wanted, but I feel that there was probably a simpler way of expressing the same intent in Python. I asked my followers on Twitter for some help with the above segment, and got some useful answers. I then re-phrased the exact same question, exposing my intent, as:

Do you know of an inbuilt way to convert a byte stream containing unsigned 16 bit integers into an array of python ints?

“convert a byte stream containing unsigned 16 bit integers into an array of python ints”. Seventy-two characters to type. The code I was using consumed fifty-three characters. In some way, the code was 25% more efficient at expressing the underlying intent. And that brings me to the moral of this blog post.
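For what it’s worth, one of the approaches my followers pointed to, the struct module, expresses that intent almost directly. A sketch (using a Python 3 bytes object; my original code assumed a Python 2 byte string):

```python
import array
import struct

# Four little-endian unsigned 16-bit integers packed into bytes.
s = struct.pack('<4H', 1, 2, 515, 65535)

# struct states the intent directly: '<' means little-endian,
# 'H' means unsigned 16-bit integer.
values = list(struct.unpack('<%dH' % (len(s) // 2), s))
print(values)  # [1, 2, 515, 65535]

# array.array does the same in one call; call a.byteswap()
# first if the host machine is big-endian.
a = array.array('H', s)
print(list(a))
```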

I woke up this morning to a flurry of news stories about an SEC proposal to include Python code with asset-backed securities (ABS) filings. The idea is that while ABS documents are chock-full of legalese, a computer program can provide a very concise way of understanding how a financial instrument is supposed to operate. I really liked the idea, but needed to know more.

This recommendation comes in a 667-page PDF. I just finished scanning through it, trying to find more details about the proposed implementation. You see, I spent a good chunk of last week writing a retail banking simulator in Python, and I have some questions about how they intend to do it. Of course, completely missing the point of their very sensible recommendation, nowhere in the document is there any Python code. Rather than making me go through hundreds of pages of text, I would have really appreciated a link to hundreds of lines of code.

Oh well. Their heart is in the right place.

So, why am I writing a retail banking simulator in Python? Well, at BankSimple we have an ever-growing Excel spreadsheet. Given the limitations of Excel, we make lots of broad assumptions about the distributions of things like account balances and daily spending. Given the non-linearities of both our business rules and human behavior, I want to get a sense of the sensitivity of our model to various risks. And the best way I know to do that is via simulation.

Rather than hiding code throughout gnarly cell references, I can clearly express business rules and customer responses in code, and from there I can tweak inputs and assess the impact of distribution assumptions on our revenue model. Essentially, I build a universe with millions of bank customers, let them do the things that people do with money for a few years, and see what happens. This is a very different approach from modelling in Excel. It is much better at capturing non-linearities.
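To give a flavour of the approach, and nothing more, here is a toy sketch. Every distribution, rate and rule below is invented for illustration; none of it is our actual model:

```python
import random

random.seed(42)

def simulate_revenue(n_customers, log_mean_balance, overdraft_fee):
    """Toy annual revenue from a population of simulated customers."""
    revenue = 0.0
    for _ in range(n_customers):
        # Balances are heavy-tailed: many small, a few large.
        balance = random.lognormvariate(log_mean_balance, 1.0)
        # Assumed 1.5% net interest margin on deposits.
        revenue += balance * 0.015
        # A non-linear business rule: small balances overdraft more often.
        if balance < 500 and random.random() < 0.2:
            revenue += overdraft_fee
    return revenue

# Tweak one input distribution and observe the sensitivity of revenue.
base = simulate_revenue(10000, log_mean_balance=7.0, overdraft_fee=30.0)
down = simulate_revenue(10000, log_mean_balance=6.5, overdraft_fee=30.0)
print(round(base), round(down))
```

The point is that the business rule (the overdraft clause) lives in one legible place, rather than being smeared across cell references, and the input distributions can be swapped out wholesale.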

I wanted to know whether the SEC proposal was for including full simulations of securities that can be composed of thousands of other instruments, or whether it was more like an Excel model, just written in Python. The answer is probably somewhere in those 667 pages, but I can’t find it.

Sorry I haven’t been blogging here recently. We are quite busy getting things ready with BankSimple. You can follow along over at our blog at

not knowing is half the battle

A decade ago I was pitching web analytics software to a number of retail banks back in Australia. The pitching process was intense. Not only did we have to explain what web analytics was all about to marketing executives who had just discovered the internet, but we also had to get past operations teams who were held to a five-nines standard. Five nines, or 99.999% uptime, means six seconds of downtime per week. This attitude pervades the development of banking products, both in ways that are beneficial to the user and in ways that make what should be simple experiences painful.
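That six-second figure is just arithmetic on a week:

```python
# Downtime allowed per week at "five nines" (99.999%) availability.
seconds_per_week = 7 * 24 * 60 * 60
downtime = seconds_per_week * (1 - 0.99999)
print(round(downtime, 1))  # 6.0
```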

Running a large teller machine network requires a serious investment in uptime. With thousands of transactions running through the American card networks each second, the fallout of losing even a fraction of a percent is too serious a risk to face light-heartedly. As a whole, the network is amazingly reliable. But point-of-sale machines don’t adhere to the same reliability levels as the network as a whole. The machines that read your card swipes at stores are subject to risks from power failure, phone network outage or even running out of paper, long before coming close to a system-wide network failure. In the event of a single failure at your local bodega, affected customers have the option of walking across the street to use an ATM or making some other minor adjustment to cope with a local problem. If the network as a whole were to go down, even for a short period of time, there would be chaos.

An individual POS machine meets local demand at 99.9% availability. This is how the system is meant to work. Likewise, the internet. The internet is designed to be fault tolerant, with the realization that it is cheaper (and often plain better) to tolerate failure instead of going to great lengths to avoid it. Your POS terminal could reach four nines if it were housed in a fancy co-location center, but that would hardly be convenient. Yet when we were pitching marketing analytics to the banks, we got the impression that if banking technologists had their way, all bodegas would have multi-homed candy aisles and drip coffee machines with backup generator power.

While this attitude has shifted since the formation of independent internet banking groups as distinct from core banking operations, tight coupling between legacy systems results in a development process that is just completely wrong. Banking is complex. Retail operations are tightly regulated. Applying this mentality to every aspect of banking leads to unnecessary inflexibility. Yet successful online experiences are defined by development processes that rapidly iterate and evaluate. The idioms for online interactions are rapidly changing – and at this point in our short history of internet usage it is difficult to see many points of convergence. This is the zeroth hour, and iteration begets discovery of new interactions and continuing evolution of a common language for working with the web.

As many of you know, I’m working with a great group reinventing retail banking. A big part of what we want to get right is our user experience. In researching this project, I’ve signed up with a bunch of different banks across America. The typical process for doing so heavily reflects the forms that you would fill out and the experience that you would have opening an account at a branch. While many of these form fields are required by law, the end user experience is heavily informed by legacy development processes. Branches were satellite offices, connecting to mainframes via expensive WANs. The process feels unwieldy to new customers as their applications move in lock step through a system that was designed as a series of incremental improvements over the pen-and-paper-driven process of just two decades ago.

Most banks operate their core transactional processing systems on a batch cycle. This stems from the way the Fed works with banks for overnight lending. And as the internet groups budded off from core banking operations, the process of institutional mitosis resulted in a bunch of useless DNA being carried over. One of the banks I have accounts with (a top 5 US bank) regularly closes their internet banking site on Sunday nights for scheduled maintenance. When did Facebook last shut down for maintenance, scheduled or otherwise?

One thing we are working very hard towards is the distinction between parts of our banking service that must be highly reliable/available and those where we can iterate quickly. For example, take call center operations. A good phone system, simply due to variability in customer demand, will need staffing slack to reach a given quality of service – in terms of expected time on hold, and time to resolution. You can spend a tonne on reaching a certain level of technological availability, but have little impact on key measures of customer satisfaction due to the different costs involved in staffing. With a five-nines phone system you can still deliver a shitty service if your call center capacity is capped out by 100% staff utilization.

When making decisions about the technology behind a retail bank, such as the call center or web site, we choose to trade early wins in the cost of scaling for iterability. Large, complex and thus rigid systems make it difficult to evaluate competing operating procedures. Short-sighted metrics for success lead to short-sighted incremental improvements. Free from the constraints of public markets, we are able to take a risk and try something different – even if we don’t know, a priori, how different it has to be. We believe it is critical to be able to try new things quickly, learn from our customers and improve their experience based on data.

I’m always scared to have our audience ask “What about feature X?” during our early technology demos. More often than not, “feature X” is already in our feature tracking system, but marked as “Will not fix.” Even very simple features that would take only minutes of development time to implement have far-reaching consequences. If we added the ability to filter transactions by date, for example, a number of quick decisions would be made about how to implement the user interface - with no possible implementation resulting in an interface less complex than simply leaving the feature out.

Additional complexity, without a clear use case, is bad. The flexibility to add new features as they are justifiably demanded is good. Complex systems work best when they are adaptive. Designing new features in bulk and dumping them on users after a 12-month development cycle is just cruel. Especially so in banking, where mistrust is rampant and fear of making a mistake is justified. Better to iterate quickly and support an adaptive complexity landscape.

People often ask us what it is that makes us better than other banks. Glibly, I respond that we are just a plain old retail bank – but we don’t suck. Not sucking is our killer app.

I don’t know what that means in terms of fine-grained details for future features. Sure, we have prior beliefs as to what experiences suck more than others under the current banking model, and where we should appropriately spend our valuable time. But we are also fine with being wrong - hell, we expect it. The only thing that we believe is that by setting ourselves up to respect and learn from our customers’ experience, we win. Other banks just can’t do that.

Photo by: riv / flickr

Connecting with your money

According to Mr. Meara, 90 percent of all transactions with bank tellers involve checks. If everyone had an iPhone deposit app, people wouldn’t come into the branch as often. That would be fine had banks not invested so much time and energy in training branch workers to persuade checking account customers to move into more profitable products.

"On the one hand, fewer deposit transactions could mean a headcount reduction," he said. "But it invites the erosion of store profitability. The banks are struggling with the enormity of what it means."

“Hurry up & credit my account”, New York Times, September 18, 2009

Brand apathy is rampant in retail financial services. The number one sales channel is the branch. That expensive, increasingly empty, retail space is how banks sell new products to customers. “Is there anything else that I can help you with today?” is the alpha and omega of retail bank marketing, with a small epsilon for banners of smiling families plastered inside branches.

Below is a screenshot of what I see when I log into my American Express account. A small portion of the screen is stuff I care about. A small, but significant, portion of my emotional well-being rides on those numbers. The bulk of the page is dedicated to selling me stuff. How about this? Make understanding and working with my money easier - make me happier, and then I’ll be far more receptive to upsells. But if you can’t even get basic information and interaction right, then I’m too busy worrying about my current state of affairs to consider newfangled products and the incremental complexity they entail.

Selling customer data

As someone who has spent most of their working career selling data to advertisers, I’m surprised by the number of businesses that are predicated on the model of selling data to advertisers. If you have a great widget, it is easy to get it in the hands of millions of people, especially if you are giving it away for free. As programmers and scientists we deify data. What we don’t do is understand advertising. Sure, we understand that advertising is about selling stuff, but we don’t seem to get that the advertising industry exists to sell ways of selling stuff.

Whether an external agency or an internal group, advertising professionals have to convince others that they are adding value. And if your model is simply to sell data to advertisers, you have to convince them that your data is worth at least as much as what you are selling it for. Your data needs to be useful. Smart techniques are popping up for doing interesting things with large amounts of data. But interesting isn’t always useful. If you think your customer data is useful, you should use your data to make your product better. Data that is valued by your users for the richer experience it provides is likely to be valued by others. If you can’t improve your customers’ experience with their own data, your data is worth nothing.


While I’m ranting, let me ask you something, Randall. At the risk of sounding like Glenn Beck Jr. - what the fuck has gone wrong with our country? Used to be, we were innovators. We were leaders. We were builders. We were engineers. We were the best and brightest. We were the kind of guys who, if they were running the biggest mobile network in the U.S., would say it’s not enough to be the biggest, we also want to be the best, and once they got to be the best, they’d say, How can we get even better? What can we do to be the best in the whole fucking world? What can we do that would blow people’s fucking minds? They wouldn’t have sat around wondering about ways to fuck over people who loved their product. But then something happened. Guys like you took over the phone company and all you cared about was milking profit and paying off assholes in Congress to fuck over anyone who came along with a better idea, because even though it might be great for consumers it would mean you and your lazy pals would have to get off your asses and start working again in order to keep up.

And not just you. Look at Big Three automakers. Same deal. Lazy, fat, slow, stupid, from the top to the bottom - everyone focused on just getting what they can in the short run and who cares what kind of piece of shit product we’re putting out. Then somehow along the way the evil motherfuckers on Wall Street got involved and became everyone’s enabler, devoting all their energy and brainpower to breaking things up and parceling them out and selling them off in pieces and then putting them back together again, and it was all about taking all this great shit that our predecessors had built and “unlocking value” which really meant finding ways to leech out whatever bit of money they could get in the short run and let the future be damned. It was all just one big swindle, and the only kind of engineering that matters anymore is financial engineering.

When lenders compete, you win.

The message is pretty clear - competition, the pillar of capitalism, results in better products and services for consumers. When lenders compete, you win. While this is the slogan for one major mortgage lead generator, the methodology is common to the industry as a whole. And people believe that magic technology fosters competition, with the net benefit of better lending rates.

The reality is a little different. When I oversaw the operations of a mortgage marketplace, the competition was not in terms of the products offered, but rather, the price paid for getting a person’s attention. Lenders would bid for leads and the lenders who paid the highest price received the most leads. Thus the incentives were counter to people’s rational goals. The lenders with the highest margins were able to spend the most on customer acquisition, while lenders with more affordable products were unable to reach the same audience.

Google recently publicized their direct entry into this space. Prior to their entry, they captured only a portion of the marketing dollars - with lead generators buying keyword ads on Google, funneling the traffic to their site, collecting lead information and selling it to the highest-paying mortgage providers. When the lead generators’ spend on Google was lower than their revenue from the mortgage companies, they profited. A mercenary and highly unregulated bunch, the lead generators would go to great lengths to screw the consumer.

Google’s product appears better in that rather than selling out the consumer for the highest price, they display a targeted list of options - clearly outlining the competing offers - letting the consumer decide which companies to contact for a quote. As is always the case, transparency leads to a better outcome for people.

Despite the numerous and simultaneous failures in the mortgage marketplace that have so deeply scarred the American economy, one upside that is often forgotten is the benefit of standardization of lending products. Prior to the development of the mortgage-backed security market, mortgage contracts varied greatly in their structure and terms. And while they remain complex financial contracts, standardization means that a consumer is able to properly evaluate the bulk of the financial impact of their mortgage choice by simply examining a handful of parameters.

I have a home equity loan. I also have credit cards, savings accounts and brokerage accounts. The simplest account that I have, and the one that sees the most action, is the humble checking account. My checking account has a 36 page introductory preamble that outlines the terms and conditions. These terms are fully documented on a corner of my bank’s web site, and change on a semi-monthly basis. No one reads these terms.

I spent my summer reading not only the terms of my checking account, but of all of my accounts and the accounts at other major banks in America. You’d be terrified to know what they actually contain. That is, if you could find them. The GAO found that 65% of banks do not make these documents available on the web, and 35% fail to produce them if you visit a branch.

And these terms matter.

Unlike most people’s mental model of retail banking operations, banks do not make most of their money on the difference between the rates at which they lend versus the rate they offer for savings. American banks, quite distinctly from banks elsewhere in the world, make the bulk of their money from fees and charges. Invisible and often unavoidable consequences of little clauses in contracts that no one ever reads.

This stands in stark contrast to the message that we hear in bank marketing. Retail bank marketing is dominated by APR: Best rate savings! Lowest rate on credit cards! Yet the largest financial impact to the consumer is fees and charges.

Fees and charges that consumers have no hope of simply understanding.

Lead generation is rife across the financial product landscape. Some companies try to offer Google-like services for better helping consumers choose financial products - but these services fall into the trap of not taking into account the obscure and non-standardized terms that most impact financial well-being. And as such, no one believes the offers they see on these sites. If people honestly believed they could “Save $2,000 by switching to Bank XYZ’s credit card”, then the conversion rates on these offers would be vastly better than the prevailing rates.

And so with all the technology that we have at our disposal, people are no better off. Banks have no incentive to increase transparency, lead generators have no incentive to provide real offers and immense brand apathy prevails resulting in short sighted decisions further driving down customer experience. The cycle continues.

Until it stops.

Chase, what matters?

Last week I paid my bills. As part of the regular bill-paying process, I take any funds left over that are not required as cash over the coming weeks and pay down my home equity facility. I pay my credit cards in full. The only rate I concern myself with is the 3.8% rate on my loan. There is nothing particularly unusual about this process.

A few days after I paid my bills, my bank, JPM Chase, emailed me to tell me that my account was overdrawn. I logged into the web site and saw that they had put a bunch of payments through twice. Most importantly, they put my loan principal payment through twice. In their wisdom, they helped me correct this by withdrawing from my credit card to pay down my loan. My credit card has a purchase rate of 12% and a cash rate of 19%.

Follow me for a moment: they withdrew from an account charging 19% to pay down a loan at 3.8%. And along the way they charged an overdraft fee for the ‘service.’

I have three simple requests for Chase:

  1. Reverse the double payment, restoring my checking account to its intended balance.
  2. Reverse the overdraft fee.
  3. Return the interest they are earning at 19% on the credit card account.

I was first routed to the online banking group as it was clear that the error originated within their domain. The timestamps on the transactions are identical, the transaction numbers are near sequential; there is a clear indication that my intent was to only pay once, but they processed the transactions twice. From there I was transferred to the credit card fiefdom who told me that they would correct the overdraft issues - but it would take four days.

Four days later I called to discover that no work order was placed and there was no note of my original call. I went through the same call center waltz, but instead was routed to the home equity group. The same group that asked me to forge a signature during the application process - but that is another story for another day. During this call I was told that the home equity group would take care of everything as they were the final destination of the funds. I made sure that the work order included the three key points listed above. I also took note of the work order number and the names and times of all the people I spoke with.

Meanwhile, mind you, I still have a zero checking balance and am unable to make other payments. I am loath to draw down the debt facilities at my disposal, as it would just make it even more complex to reverse these transactions. Friends have been helpful and luckily I have enough cash on hand.

Today I called the home equity group directly to check on the status of the work order. They gave me the same estimate as the last time I spoke with them: four to five days. At which point the interest would have accumulated to $11. Not a great deal in the scheme of things, but it is my goddamn $11. I’d be happy if they simply paid me the $11 and moved on - they have earned a good deal of revenue from me over the years, and this is clearly their error. But of course, they can’t return me the $11 as they have no mechanism for doing so.

"You cannot dispute this transaction for this reason: 5102 - Bank Releated[sic] Fees / Charges - Not Eligible”

As of now it appears my only option to force the return of my overdraft fee and for me to receive any accrued interest is to take action in the New York small claims court. There is no way Chase will defend this - that would cost hundreds for legal approval alone. Of course, they make an order of magnitude more than that from my family in fees, charges, interchange and net interest margin each year. As soon as I get my $11 back, they will no longer hold any of my accounts.

Customer experience matters.

Bayesian Methods + MCMC

Last night we had our fourth NY R Statistical Programming meetup. The topic was Bayesian Methods + MCMC. We had two presenters, Jake Hofman and Suresh Velagapundi, both of whom did an admirable job of presenting a very broad topic to an audience with diverse backgrounds. I want to use this post to bridge a gap between the background material and day to day utilization. This is catered to an audience who may have some experience with R, but aren’t very familiar with the Bayesian Way. While it is a simple example, the steps involved extend to the issues faced in real world applications.

The source for this example can be downloaded using the internet!

We are going to step through Jake’s coin flip example to get a sense of what is involved in doing Bayesian inference. There are a number of packages on the CRAN Bayesian Inference view that do all of what you will see below. I decided against using them for two reasons. First off, the coin flip example is a little too trivial for using many of the techniques that rely on multivariate parameter estimation to see any utility. But more importantly, I want to use the opportunity provided by a nice simple example to step through the underlying mechanics. My hope is that after reading through this you can have a look at the available packages and be a better judge of what they are used for and where one package may stand out over another. In the course of doing this write up I went through the MCMCpack package and it is a good exercise to compare how they implement the MCbinomialbeta() against the first half of this walk through. For the curious, the MCMCmetrop1R() function is far more advanced than the simple implementation of Metropolis-Hastings shown below, and it is a good exercise to understand their tuning parameters.

As a quick recap, the point of the exercise is to go from prior belief in a distribution (in this case we believe that the coin is fair) and use observed data to arrive at a posterior distribution using both the prior and the data. There are three things that we need to know to calculate the posterior distribution:

  1. The likelihood of seeing the new data given our estimate of the bias
  2. Our prior distribution
  3. The ‘evidence’ or the integral of the likelihood and prior for each possible estimate
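In symbols, Bayes’ rule ties the three together:

p(theta | D) = p(D | theta) * p(theta) / p(D)

where the denominator p(D), the ‘evidence’ of item 3, is the integral of p(D | theta) * p(theta) over all possible theta.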

I won’t step through the derivation of the likelihood, as this should be easy enough to derive from the binomial probability distribution function. In this case our likelihood, with N trials and h heads, is:

p(h | N, theta) = choose(N, h) * theta^h * (1 - theta)^(N - h)
Check that the likelihood function makes sense:
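A minimal R sketch of that check (the function and grid names here are my own, not necessarily the original snippet’s):

```r
# Binomial likelihood for h heads in N flips (names are my assumptions)
likelihood <- function(theta, N = 100, h = 50) {
	choose(N, h) * theta^h * (1 - theta)^(N - h)
}

theta <- seq(0, 1, by = 0.01)
theta[which.max(likelihood(theta))]  # 0.5, matching h / N
```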

Great, the maximum likelihood for 50 heads from 100 flips is a theta of 0.5. (See chart below).

Jake uses the Beta distribution as his prior as it has some neat analytic properties; namely that the posterior will be of the same distribution family as the prior. We call these types of priors conjugate priors.

If we do the integration, we can arrive at the analytic form of the evidence and thus the posterior:
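For the record, with a Beta(a, b) prior and the binomial likelihood above, the standard conjugate result is:

posterior = Beta(a + h, b + N - h)

so observing h heads and N - h tails simply shifts the Beta parameters by those counts.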
Let's say we don't know what the analytic form for the evidence (denominator in Bayes' rule) is, and replace it by a numerical integration over all possible theta's from 0 to 1:
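A sketch of that numerical route; the Beta(10, 10) prior standing in for the “probably fair” belief is my assumption, as are the function names:

```r
# Binomial likelihood for h heads in N flips
likelihood <- function(theta, N = 100, h = 50) {
	choose(N, h) * theta^h * (1 - theta)^(N - h)
}
# Prior belief that the coin is roughly fair (an assumed Beta(10, 10))
prior <- function(theta) dbeta(theta, 10, 10)

# Evidence: numerically integrate likelihood * prior over theta in [0, 1]
evidence <- integrate(function(theta) likelihood(theta) * prior(theta), 0, 1)$value

posterior <- function(theta) likelihood(theta) * prior(theta) / evidence

# Sanity check: a proper posterior integrates to 1
integrate(posterior, 0, 1)$value
```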
While things are pretty simple with this toy example, Jake made the point that real difficulty with Bayesian inference is twofold:
  1. Integrating across theta to find the evidence (denominator)
  2. Once you have the posterior, integrating it to calculate summary statistics (mean, variance, etc.)

In the above example we used the integrate() function to apply adaptive quadrature to find the evidence. We could use the same method for 2, but let’s not. Instead, let us use MCMC - which is, at its core, a way to draw samples from a distribution that is otherwise hard to sample from.

Given that this example is rather trivial, with just one parameter in question (theta), I won’t step through the implementations of vanilla Monte Carlo methods (uniform, importance & rejection sampling). These implementations follow straightforwardly from Jake’s presentation.

I will however, implement a simple Metropolis-Hastings MCMC sampler using a simple and symmetric Gaussian proposal density (q in Jake’s notes).

MHstep <- function(prevCandidate, pdf, sigma = 0.1) {
	# Propose a new candidate from a symmetric Gaussian (q in Jake's notes)
	newCandidate <- rnorm(1, mean = prevCandidate, sd = sigma)
	# Acceptance ratio; with a symmetric proposal the q terms cancel out
	alpha <- pdf(newCandidate) / pdf(prevCandidate)
	u <- runif(1)
	if (alpha > u)
		# This candidate is likely to be a better sample
		return (newCandidate)
	# Else, stick with our previous candidate
	return (prevCandidate)
}

Let’s use our numerical approximation to the actual posterior function as the PDF we want to draw samples from:
Time to go on a random walk down coin flip street.
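As a self-contained sketch of the walk (the analytic posterior Beta(60, 60) - a Beta(10, 10) prior updated with 50 heads in 100 flips - stands in here for the numerical pdf so the snippet runs on its own; the samples and steps names match the plotting code that follows):

```r
set.seed(42)

# Target distribution: analytic posterior, used as a black-box pdf
pdf <- function(theta) dbeta(theta, 60, 60)

# One Metropolis-Hastings step with a symmetric Gaussian proposal
MHstep <- function(prev, pdf, sigma = 0.1) {
	cand <- rnorm(1, mean = prev, sd = sigma)
	if (pdf(cand) / pdf(prev) > runif(1)) cand else prev
}

steps <- 10000
samples <- numeric(steps)
samples[1] <- 0.5
for (i in 2:steps)
	samples[i] <- MHstep(samples[i - 1], pdf)

mean(samples)  # should settle near the true posterior mean of 0.5
```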
And how did we do?
png("figure3.png", width=800, height=600)
plot(cumsum(samples)/1:steps, type='l', xlab='Step', ylab='Estimated Mean', main='Drawing samples by Metropolis-Hastings')
dev.off()


the basics

"If you are not embarrassed by the first version of your product, you’ve launched too late."

On Monday I released predict.i2pi, a statistical learning web service. Designed to deal with common classification and regression problems, it takes input data in the form of a CSV file and returns to the end user a set of predictive models. For example, if you have a list of store locations, local weather data, and store revenue, you could use the service to see if location and weather impact store revenue. predict.i2pi tries to determine whether predictions are possible by running your data against a growing number of user contributed statistical learning algorithms and finding the ones that work best with your data.

In planning this I went through a range of features, bells and whistles but have decided to strip it all back. This is the simplest thing I could build to support what I wanted. It takes a file, runs predictive algorithms against the file, and returns performance measures. Data and predictions.


The data provided is expected to be in the form of a number of observations, with one observation per row. Each column contains a measurement for these observations. One or more of these measurements are the ones we are interested in predicting. For example:

||   /----- Response Variables (denoted by *)
X1,    X2,    X3,    Name,  Date,        *Y
12.3,  13.4,  8.32,  Terry, 2008-10-12,  736.0
 9.3,  34.1,  1.21,   Josh, 2008-10-12,  NA     <-- NB: NA response variables will have
...    ...    ...    ...    ...          ...          predictions available for
 8.7,  38.7,  8.17,   Jess, 2009-01-07,  1823.1       subsequent download.

Data may include observations for which we do not know the response. These observations can be included, with the response left blank. Once satisfactory models are found, end users can download spreadsheets containing our best predictions for that data. On my todo list is adding confidence intervals to these values.

Once uploaded we try to best detect the following data types:

  • Numeric (floating point numbers)
  • Integers
  • Dates (YYYY-MM-DD works best)
  • String Factors (e.g., State or letter scores)
  • Text (longer text than factors, with analytic interpretation as language text instead of as factors)
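A hypothetical per-column type-guessing routine in this spirit (this is my own sketch, not the site’s actual detection code):

```r
# Hypothetical sketch of per-column type guessing; predict.i2pi's real
# detection logic is not shown here
guess_type <- function(col) {
	num <- suppressWarnings(as.numeric(col))
	if (!any(is.na(num))) {
		if (all(num == round(num))) "integer" else "numeric"
	} else if (!any(is.na(as.Date(col, format = "%Y-%m-%d")))) {
		"date"
	} else if (length(unique(col)) < length(col) / 2) {
		"factor"  # few distinct values relative to the number of rows
	} else {
		"text"
	}
}

guess_type(c("12.3", "9.3", "8.7"))        # "numeric"
guess_type(c("2008-10-12", "2009-01-07"))  # "date"
```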


Internally, predict.i2pi performs a standard test / training protocol. Data is loaded and a random half of that data is used to train the learning algorithm. The remaining half is used to test how well the learned algorithm works against previously unseen data. Robust algorithms will do almost as well on the test as during training, while less robust approaches will lead to far poorer performance during testing. The system continues this process of picking a training sample, training, and then testing as many times as possible in an allotted time. During each of these cycles, predictions are tested against the actual responses in the corresponding observation. Performance is then measured using the R-squared metric for regressions and simple classification accuracy for classification problems. The system supports user defined performance measures with the goal being to let those who supply data decide on which performance measure is best for their application. However, at the moment I’m concentrating on opening up the ability for users to upload their own learning algorithms.
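The protocol above can be sketched as a single cycle in R (the function and variable names are mine; the real system repeats this for as many cycles as fit in the allotted time):

```r
# One test/train cycle: random half trains, the held-out half is scored
run_cycle <- function(data, formula, train_fn, predict_fn) {
	idx <- sample(nrow(data), nrow(data) %/% 2)
	model <- train_fn(formula, data[idx, ])
	preds <- predict_fn(model, data[-idx, ])
	actual <- data[-idx, all.vars(formula)[1]]
	# R-squared on the held-out observations
	1 - sum((actual - preds)^2) / sum((actual - mean(actual))^2)
}

set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- 3 * d$x + rnorm(100, sd = 0.1)
run_cycle(d, y ~ x, lm, predict)  # close to 1 for this nearly linear data
```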

Currently learning algorithms are specified in small snippets of R code that can be dynamically loaded into the main R subsystem that is responsible for coordinating training cycles. See, for example, rpart.R which links in a recursive partitioning algorithm from the rpart library.



All learning algorithms must contain two function definitions: myModel and myPredict. myModel takes a model formula and data, returning a model object that can be used to make predictions against new data. myPredict takes two parameters, the model object returned by myModel and a set of data that may not have been seen during training. We call myModel with one randomly selected half of the data for training. For testing, we provide myPredict with the model object generated from the training set, along with the as yet unseen testing portion of the data.
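For instance, a minimal snippet in that two-function shape might look like this, using plain lm() as the learner (the rpart.R file mentioned above would use rpart calls instead):

```r
# Hypothetical learning-algorithm snippet in the myModel / myPredict shape
myModel <- function(formula, data) {
	lm(formula, data = data)  # plain linear regression as the learner
}

myPredict <- function(model, data) {
	predict(model, newdata = data)  # works on previously unseen rows
}

m <- myModel(dist ~ speed, cars)  # cars is a dataset that ships with R
length(myPredict(m, cars))        # one prediction per observation
```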

Users are also able to define transforms that take a matrix of explanatory variables and return a new matrix with the same number of observation rows but with one or more of the explanatory columns transformed into a new space. For example one could take a 100 column matrix and apply some form of dimensionality reduction that returns a new matrix, with the same number of rows, but only 10 columns. The transform function is not shown the response variable to ensure that no funny business occurs whereby the response is somehow embedded in the explanatory variables. These same transform functions can then be applied to response variables alone, allowing the system, for example, to construct a model log(Y) ~ PCA(X1, X2, … , Xn).

The following example shows a transform function that replaces any columns that are more than 50% NA with an indicator variable:
myTransform <- function(x) {
	# Flag columns that are more than 50% NA
	bad_idx <- sapply(x, function(col) mean(is.na(col))) >= 0.5
	if (any(bad_idx))
		# Replace each flagged column with a 0/1 indicator of missingness
		for (j in which(bad_idx)) x[[j]] <- as.integer(is.na(x[[j]]))
	x
}

coming soon

As for uploading code, the best way to do this right now is via email. I hand rolled my own sandbox environment to prevent 3rd party code from hijacking my system - but as with any security code that I write myself, I loathe testing it in the real world until I've had a good chance to be as close to 100% sure as possible that it is safe. In reality, I'll probably stop trying to reinvent the wheel and use a pre-existing solution.

Given long term plans, and issues around data privacy, I didn't want to set up a system whereby data leaves the system for testing on external machines. While that works well for very large datasets, e.g., the Netflix Prize, the potential for overfitting is higher for smaller datasets when random portions of that data are often reused in validation cycles. That said, developing new learning algorithms (or plugging in ones from existing CRAN libraries) is fairly straightforward, so you should be able to develop locally and upload.

There already is an API, but it is not at all documented. This is my next priority. Currently I'm running into some issues with using RCurl to interact with my API - issues which would not exist in any other language - but I really would like to get the R API out of the door before I open up wider access. In short, there are 3 methods which are currently used by the web site (inspect my horrid JavaScript code to see them.) These allow you to upload data, make edits to meta data and receive predictions. Each prediction includes links to the R source that was involved in performing the learning + any transforms used. The prediction meta data also includes the quartiles for the measure after a number of test/train cycles, plus a sample of 250 predictions vs. actuals.

It has been suggested that I also include a small downloadable example snippet for each file to allow developers to get a better flavor of what they are working with. For larger files, I think this is a perfectly swell idea. In fact, I really do want to hear more of your suggestions. I took a knife to a slew of functionality before I released this, but I have code ready to go. But I want to wait for real life suggestions to see what I should be working on next.

The original plans for this project included complex routines for doing unsupervised schema detection and meta modelling to help identify which algorithms might work best with particular shapes of data. Also I had built a framework for combining multiple learning algorithms in a boosting type environment. All of these features remain possible and will hopefully be released in the not too distant future.

One of the big issues I struggled with in deciding to release this is the nature of my target audience. At the moment there is an impedance mismatch between the sophistication required to understand what the system does and the utility of the system to sophisticated users. To those with any experience in predictive analytics, everything here should be your bread and butter - and most likely far simpler than what you do on a daily basis. However there is a large audience of people in the information business who currently make do with the 'Add Trendline' option in MS Excel. To this audience, this service would be greatly valued, but in its current form is probably a bit too much. This deeply embarrasses me, but I'm not going to let that stop me from publicizing what I'm up to. There is a plan, and it exists in increments.

For the lay information worker, there are hurdles both in providing understandable explanations of how the learning algorithms work and were applied, and in adapting my format to the natural shape of the data that they often work with - not to mention data cleaning. As an example, time series models pose an interesting problem. They break the model of one independent observation per row, and it is difficult to come up with a way of training and testing that is consistent with my current implementation. Even if I were to develop special case handling for time series data, it can be difficult for a computer to find appropriate periods over which to lag variables. At this point I think the simplest route is to let people include previous observations that they deem important, at lags that they think might be interesting, with each row. That way each row can be treated independently from the others and I don't have to build a lot of machinery to guess appropriate treatment of temporal dependency.

Likewise there are other problems whose natural representations don't map neatly to the one row = one observation representation - think of collaborative filtering or graph based problems. I am quite keen on keeping the one row representation as it affords me some nice system scaling properties without becoming too domain specific. That said, there is nothing stopping me from building front-ends that take data from these problem domains in their natural representation and map them to one that works better for my system.

When it comes to explaining the models, well. That is another story.

Engineering vs. Architecture

A few months back I caught up with a fellow Aussie in New York, whom I first met once, ten years ago. It is amazing how social network dynamics change as an expat. He is currently teaching Architecture at Columbia while completing his doctorate in the nature of representation in architecture. It was the sort of long conversation that lingers for a few months before finding a resting place in your mind. At first glance we spent quite some time discussing the work at the Spatial Information Design Lab as this most closely bridged the gap between our worlds. The deeper conversation was that of representation.

Engineers build things. They use sciences to make sure that the things they build don’t fall over. Architects design things - they take ideas of the world and represent them. Their audience is both the client and the engineering and construction teams. Different representations serve different purposes. Engineering: Representation to World. Architecture: World to Representation.

Financial engineers take what they know about how companies work and build new things to serve other companies. Economists take the real world and make model representations of reality. However there is a void in economics, between the macro and the micro, in the domain of the company. Likewise, there is a void in financial engineering. Financial engineering is currently dominated by time-series analysis. I posit a weak form of the Black Swan theorem - namely that we currently don’t know enough about the past to even pretend to predict the future. We have financial historians, in the form of data providers, but we don’t have the architects to take this repository of past knowledge and build representations of how companies operate. Accountants across the globe set and implement the rules of this complex system, but we don’t understand its dynamics.

Can financial engineers shed the instrumentation of time series analysis and take on this role? Or will it come from a new group - the type of people who build Googles?

Or will their buildings leak?

Image by ken mccown on Flickr.