Josh's PostgreSQL Database Conventions

Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowchart; it'll be obvious.

We’ve chosen to use relational databases, specifically PostgreSQL, for storing some of our data. We like ACID, we like the ease of ad-hoc query-ability and we like the fact that databases add an additional layer of security and data quality control. To make the most of this we should adopt some conventions, so that when we are accessing PG from an ORM, we don’t bring too many ORM-isms into our data model. This ensures that other staff, who might be using other ORMs, can still work with our data. It also prevents us from relying too much on the ORM or application layer for work that an RDBMS can already do.

These are my preferred conventions; we will evolve them over time. I’ll try to provide a justification for each one, but this is a discussion.

1. All names (table, column, sequence, index, constraint, role, etc.) should be lowercase with underscores. Postgres does support AnYSortOF casing that you’d like, but it makes manual querying painful.
2. Table names should be a singular noun that describes one row: “account”, not “accounts”. Some people prefer plural; we just need a standard. My vote is for singular, as it makes SQL a little more natural to read, e.g., “SELECT * FROM account WHERE account.balance > 5000;”.
3. We’re using a relational database. Have relations. Very few tables should be islands.
4. Foreign keys should be named “<table>_id”, e.g., if the “account” table links to the “person” table, there should be a column in “account” called “person_id”. Where there are multiple foreign keys to the same table, prefix the ids, e.g., “from_person_id” and “to_person_id”.
5. Foreign keys must have foreign key constraints. It makes the schema more readable, both by humans and introspection tools. It also prevents mistakes at the application layer.
6. Serial columns should have the sequence as the default value for that column. E.g., if the “account” table has a primary key of “id”, it should be defined (in SQL) as “id SERIAL PRIMARY KEY”, where SERIAL is shorthand for “INTEGER NOT NULL DEFAULT nextval('account_id_seq')” backed by an automatically created sequence.
7. Never expose serial columns outside of the model layer. If any table is going to be exposed in any way via an API, it should have a UUID column that will be exposed instead of using the "id".
8. Index, constraint and sequence names should take the form of table_column_[idx | uidx | seq | ck] for indexes, unique indexes, sequences and constraints.
9. Unique indexes should encompass all the rules for uniqueness. If the “user” table can only have one copy of each user, consider a unique constraint on first_name, last_name, address and zip, also on SSN, or whatever. There is nothing wrong with having the front-end, back end and database all check this.
10. Constraints should reflect business rules. Just because your application does sanity checking, it doesn’t mean that some bozo at the terminal will do it. FYI Josh has access to most of our machines and is a bozo.
11. Postgres has a rich selection of native types (IP Addresses, UUIDs, Time intervals, Polygons). Use them where appropriate. If your data is an IP address, stick it in an INET. If it is a UUID, there’s a type for that.
12. Postgres also supports enumerated types. If we have a relatively immutable small list of possible values for a column, use an enum.
13. If the SQL type is not descriptive enough of the type of data that is stored in a column, use the units of measurement in the column. E.g., "height_meters" if we are storing a height, in meters. God knows why we'd do that, but you get the idea.
14. Don’t be afraid of TEXT. If you want to store free-form text, VARCHAR(2048) isn’t what you want. Postgres is smart enough to move large chunks of text out of the main table and into separate storage, so a VARCHAR saves no space over TEXT. If there are strict length constraints, don’t use TEXT.
15. Don’t be afraid of NUMERIC. We are dealing with money. Bigints are fine, but then we need to rely on the application layer to do the right thing. Each application needs to know what 12345 means in dollars. When we start having interest-bearing accounts, 4 decimal places may not be enough. Postgres supports arbitrary precision numbers. We should standardize on NUMERIC(18,6) for money. And please be sure that your application doesn’t silently translate arbitrary precision numbers into IEEE-754 floats or similar. We all saw Superman III.
16. Set reasonable DEFAULTs. If you have a column called “created” which records when a row was created, a reasonable default would be now(). I saw this as a default on one of our tables: “not null default ''::character varying”. Not reasonable. If it’s not supposed to be null, setting the default to '' is silly. At the very least, decide whether each column should be nullable.
17. Don’t be afraid of schemas. Postgres supports multiple object namespaces within the same database. If you’re unaware of schemas, you are probably creating objects in the “public” schema. If we ever get to a point where any database has dozens of tables, schemas are a good way to clarify the roles of each table. Look into it.
18. By default, don’t denormalize. At our scale, it’s bad form to have the same column in two tables that are joined by a 1:1 relationship. Staying normalized means less logic at the application layer to enforce consistency.
19. Many to many tables should be named with the name of the two tables they join.
20. Log modifications. If tables have mutable columns, provide a _history table that keeps track of changes. If you’re so inclined, you can do this with triggers.
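
To make convention 15 concrete on the application side, here is a minimal Python sketch of keeping money in arbitrary precision instead of floats. The helper names (to_money, daily_interest) and the sample rate are made up for illustration; the only thing taken from the conventions above is the six decimal places of NUMERIC(18,6).

```python
from decimal import Decimal, ROUND_HALF_EVEN
from typing import Union

# Six decimal places, mirroring the proposed NUMERIC(18,6) column type.
MONEY_QUANTUM = Decimal("0.000001")


def to_money(value: Union[str, Decimal]) -> Decimal:
    """Parse a string or Decimal into a Decimal rounded to six places (hypothetical helper)."""
    return Decimal(value).quantize(MONEY_QUANTUM, rounding=ROUND_HALF_EVEN)


def daily_interest(balance: Decimal, annual_rate: Decimal) -> Decimal:
    """One day's simple interest, kept in Decimal the whole way (hypothetical helper)."""
    return to_money(balance * annual_rate / Decimal(365))


# Binary floats cannot represent 0.1 or 0.2 exactly, so the sum drifts.
float_total = 0.1 + 0.2

# Decimals built from strings stay exact.
decimal_total = to_money("0.10") + to_money("0.20")

print(float_total == 0.3)               # False
print(decimal_total == Decimal("0.3"))  # True
```

The point of building Decimals from strings is that the value never passes through a float on its way from the database driver to the business logic. Whether a given driver or ORM hands NUMERIC columns back as Decimal by default is worth checking rather than assuming.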

Principles of Big Data / Data Science

I threw this slide up as part of my talk yesterday at the IA Ventures Big Data conference. The talk was titled "Big Data with Small Data" and was my attempt at describing how we at BankSimple apply big data techniques to relatively small data sets.

It was a short talk and while this was my key slide, I didn't have as much time to discuss it as I would have liked. What do you think?

Discuss.

PayPal horror stories

I've spent the last 37 minutes trying to send $104 to a friend for tickets to a party. He specifically asked me to send him the money via PayPal. I have a PayPal account, so it shouldn't have been a problem. The problems began when I first attempted to send him the money. PayPal complained that they weren't able to confirm that I owned the account. I'm currently on a business trip and using the hotel's internet connection, so I figure the odd IP address is confusing PayPal. None of the suggestions PayPal offered looked likely to fix the problem, so I decided to verify my account by linking PayPal directly to my checking account. My checking account is with USAA, and linking the two required providing PayPal not only with the username and password to my bank account, but also answers to three security questions and my card's PIN. It took me a good 10 minutes of switching back and forth between USAA and PayPal to sort that out. (It's more difficult than it should be. Why don't banks support OAuth?) By linking the two accounts I assumed that I would have provided PayPal with enough evidence that I owned my account. This wasn't the case. I also re-verified my email and set up and verified my phone number with PayPal. It took more than 5 minutes for their verification SMS to reach my phone, which currently flutters between one bar and 'Searching...' while I'm out here in the Midwest. Not PayPal's fault, but clunky. None of this helped convince PayPal that I was who I said I was. If I gave anyone else all the information I had just shared with them, they could walk away with my money. But I couldn't send $104 to pay for party tickets.

It's two thousand and ten. Money is electronic. Sending money between American banks, while clunky, is cheap(*) and easy. Doing it via PayPal is hard because PayPal supports international transfers and thus, rightfully, expends more effort fighting fraud than it does sending money. Horror stories, like mine, are common. But while PayPal could certainly improve their web interface, the majority of the experience failures are due to their interminable vigilance against fraud.

(*) Cheap for the banks. While it may cost them a tiny fraction of a penny, they will readily charge customers far more.

Pythonic financial simulation

Although I'm really a C programmer, I've been doing more work over the past three years in Python. Today, for the first time, I decided to write a digital signal processing program in Python. C is usually my go-to language for these types of tasks and I felt like a fish out of water.

You can check out the code for my spectrum analyzer, if you are so inclined, at gist. There are a few sections where the code reads more like what a C programmer would write than what a native Python programmer would. I was having trouble clearly expressing the following line of code:

[ord(s[2*i]) | ord(s[2*i+1])<<8 for i in range(0,len(s)/2)]

Essentially, that code did exactly what I wanted, but I feel that there was probably a simpler way of expressing the same intent in Python. I asked my followers on Twitter for some help with the above segment, and got some useful answers. I then re-phrased the exact same question, exposing my intent, as:

Do you know of an inbuilt way to convert a byte stream containing unsigned 16 bit integers into an array of python ints?
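
One possible answer to that question, sketched in Python 3 with the standard-library struct module. It assumes the samples are little-endian (low byte first), which is what the comprehension above implies, and like that comprehension it ignores a trailing odd byte. The function name unpack_u16_le is made up for illustration.

```python
import struct
from typing import List


def unpack_u16_le(data: bytes) -> List[int]:
    """Decode little-endian unsigned 16-bit integers from a byte string."""
    count = len(data) // 2  # whole 16-bit samples only; drop any odd trailing byte
    # "<" selects little-endian byte order, "H" is an unsigned 16-bit integer.
    return list(struct.unpack(f"<{count}H", data[: count * 2]))


samples = unpack_u16_le(b"\x01\x00\xff\xff\x34\x12")
print(samples)  # [1, 65535, 4660]
```

The format string carries the intent (byte order, width, signedness) that the bit-shifting version leaves for the reader to reconstruct.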

"convert a byte stream containing unsigned 16 bit integers into an array of python ints". Seventy-two characters to type. The code I was using consumed fifty-three characters. In some way, the code was 25% more efficient at expressing the underlying intent. And that brings me to the moral of this blog post.

I woke up this morning to a flurry of news stories about an SEC proposal to include Python code with asset backed securities (ABS) filings. The idea is that while ABS documents are chock full of legalese, a computer program can provide a very concise way of understanding how a financial instrument is supposed to operate. I really liked the idea, but needed to know more.

This recommendation comes in a 667 page PDF. I just finished scanning through it, trying to find more details about the proposed implementation. You see, I spent a good chunk of last week writing a retail banking simulator in Python, and I have some questions about how they intend to do it. Of course, completely missing the point of their very sensible recommendation, nowhere in the document is there any Python code. Rather than making me go through hundreds of pages of text, I would have really appreciated a link to hundreds of lines of code.

Oh well. Their heart is in the right place.

So, why am I writing a retail banking simulator in Python? Well, at BankSimple we have an ever-growing Excel spreadsheet. Given the limitations of Excel, we make lots of broad assumptions about the distributions of things like account balances and daily spending. Given the non-linearities of both our business rules and human behavior, I want to get a sense of the sensitivity of our model to various risks. And the best way I know to do that is via simulation.

Rather than hiding code throughout gnarly cell references, I can clearly express business rules and customer responses in code, and from there I can tweak inputs and assess the impact of distribution assumptions on our revenue model. Essentially, I build a universe with millions of bank customers and let them do the things that people do with money for a few years and see what happens. This is a very different approach to modelling in Excel. It is much better at capturing non-linearities.
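
To give a flavor of what that looks like, here is a toy sketch in Python. It is not the actual simulator: every distribution, rate and function name below is a made-up assumption chosen only to show the shape of the approach, with one non-linear rule (customers cannot spend more than they hold).

```python
import random

# All of these numbers are hypothetical, for illustration only.
INTERCHANGE_RATE = 0.01        # share of each card purchase earned as revenue
ANNUAL_INTEREST_MARGIN = 0.02  # margin earned on deposits held


def simulate_customer(rng: random.Random, days: int) -> float:
    """Let one made-up customer earn and spend for `days`; return revenue earned."""
    balance = rng.lognormvariate(7.0, 1.0)  # heavy-tailed opening balance
    revenue = 0.0
    for day in range(days):
        if day % 14 == 0:
            # Fortnightly paycheck, never negative.
            balance += max(rng.gauss(1500.0, 300.0), 0.0)
        # Daily spending with a mean of $90, capped at the available balance.
        # The cap is the kind of non-linearity a spreadsheet average hides.
        spend = min(rng.expovariate(1 / 90.0), balance)
        balance -= spend
        revenue += spend * INTERCHANGE_RATE
        revenue += balance * ANNUAL_INTEREST_MARGIN / 365
    return revenue


def simulate_revenue(customers: int, days: int, seed: int = 0) -> float:
    """Total revenue across a population of independent simulated customers."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    return sum(simulate_customer(rng, days) for _ in range(customers))


print(simulate_revenue(customers=1000, days=365, seed=42))
```

Swapping the lognormal for a different balance distribution, or changing the spending mean, and re-running is the "tweak inputs" step: the sensitivity of the total to each assumption falls out of comparing runs.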

I wanted to know whether the SEC proposal was for including full simulations of securities that can be composed of thousands of other instruments, or whether it was more like an Excel model, just written in Python. The answer is probably somewhere in those 667 pages, but I can't find it.

Sorry I haven't been blogging here recently. We are quite busy getting things ready with BankSimple. You can follow along over at our blog at http://banksimple.net/blog

not knowing is half the battle

A decade ago I was pitching web analytics software to a number of retail banks back in Australia. The pitching process was intense. Not only did we have to explain what web analytics was all about to marketing executives who had just discovered the internet, but we also had to get past operations teams who were held to a five nines standard. Five nines, or 99.999% uptime, means six seconds of downtime per week. This attitude pervades the development of banking products, both in ways that are beneficial to the user and in ways that make what should be simple experiences painful.
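
The arithmetic behind that figure, as a small Python sketch. The function name is made up for illustration; the only inputs are the length of a week and the availability percentage.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800 seconds


def weekly_downtime_seconds(availability_percent: float) -> float:
    """Seconds of downtime per week allowed at a given availability level."""
    return SECONDS_PER_WEEK * (1 - availability_percent / 100)


for label, pct in [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]:
    print(f"{label}: {weekly_downtime_seconds(pct):.1f} seconds per week")
```

Five nines works out to roughly six seconds a week, while three nines allows about ten minutes.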

Running a large teller machine network requires a serious investment in uptime. With thousands of transactions running through the American card networks each second, the fallout of losing even a fraction of a percent is too serious a risk to take lightly. As a whole, the network is amazingly reliable. But point-of-sale machines don't adhere to the same reliability levels as the network as a whole. The machines that read your card swipes at stores are subject to risks from power failure, phone network outage or even running out of paper long before they come close to hitting a system-wide network failure. In the event of a single failure at your local bodega, affected customers have the option of walking across the street to use an ATM or making some other minor adjustment to cope with a local problem. If the network as a whole were to go down, even for a short period of time, there would be chaos.

An individual POS machine meets local demand at 99.9% availability. This is how the system is meant to work. Likewise, the internet. The internet is designed to be fault tolerant, with the realization that it is cheaper (and often plain better) to tolerate failure instead of going to great lengths to avoid it. Your POS terminal could reach four nines if it were housed in a fancy co-location center, but that would hardly be convenient. Yet when we were pitching marketing analytics to the banks we got the impression that if banking technologists had their way, all bodegas would have multi-homed candy aisles and drip coffee machines with backup generator power.

While this attitude has shifted since the formation of independent internet banking groups as distinct from core banking operations, tight coupling between legacy systems results in a development process that is just completely wrong. Banking is complex. Retail operations are tightly regulated. Applying this mentality to every aspect of banking leads to unnecessary inflexibility. Yet successful online experiences are defined by development processes that rapidly iterate and evaluate. The idioms for online interactions are rapidly changing – and at this point in our short history of internet usage it is difficult to see many points of convergence. This is the zeroth hour, and iteration begets discovery of new interactions and continuing evolution of a common language for working with the web.

As many of you know, I'm working with a great group reinventing retail banking. A big part of what we want to get right is our user experience. In researching this project, I've signed up with a bunch of different banks across America. The typical process for doing so heavily reflects the forms that you would fill out and the experience that you would have opening an account at a branch. While many of these form fields are required by law, the end user experience is heavily informed by legacy development processes. Branches were satellite offices, connecting to mainframes via expensive WANs. The process feels unwieldy to new customers as their applications move in lock step through a system that was designed as a series of incremental improvements over a pen and paper driven process of just two decades ago.

Most banks operate their core transactional processing systems on a batch cycle. This stems from the way the Fed works with banks for overnight lending. And as the internet groups budded off from core banking operations, the process of institutional mitosis resulted in a bunch of useless DNA being carried over. One of the banks I have accounts with (a top 5 US bank) regularly closes their internet banking site on Sunday nights for scheduled maintenance. When did Facebook last shut down for maintenance, scheduled or otherwise?

One thing we are working very hard towards is the distinction between parts of our banking service that must be highly reliable/available and those where we can iterate quickly. For example, take call center operations. A good phone system, simply due to variability in customer demand, will need staffing slack to reach a given quality of service – in terms of expected time on hold, and time to resolution. You can spend a tonne on reaching a certain level of technological availability, but have little impact on key measures of customer satisfaction due to the different costs involved in staffing. With a five nines phone system you can still deliver a shitty service if your call center capacity is capped out by 100% staff utilization.

When making decisions about technology behind a retail bank, such as the call center or web site, we choose iterability over early wins in the cost of scaling. Large, complex and thus rigid systems make it difficult to evaluate competing operating procedures. Short-sighted metrics for success lead to short-sighted incremental improvements. Free from the constraints of public markets we are able to take a risk and try something different – even if we don't know, a priori, how different it has to be. We believe it is critical to be able to try new things quickly, learn from our customers and improve their experience based on data.

I'm always scared to have our audience ask "What about feature X?" during our early technology demos. More often than not "feature X" is already in our feature tracking system, but marked as "Will not fix." Even very simple features that would take only minutes of development time to implement have far-reaching consequences. If we added the ability to filter transactions by date, for example, a number of quick decisions would be made about how to implement the user interface - with no possible implementation resulting in an interface less complex than simply leaving the feature out.

Additional complexity, without a clear use case, is bad. The flexibility to add new features as they are justifiably demanded is good. Complex systems work best when they are adaptive. Designing new features in bulk and dumping them on users after a 12 month development cycle is just cruel. Especially so in banking, where mistrust is rampant and fear of making a mistake is justified. Better to iterate quickly and support an adaptive complexity landscape.

People often ask us what it is that makes us better than other banks. Glibly, I respond that we are just a plain old retail bank – but we don't suck. Not sucking is our killer app.

I don't know what that means in terms of fine-grained details for future features. Sure, we have prior beliefs as to what experiences suck more than others under the current banking model, and where we should appropriately spend our valuable time. But we are also fine with being wrong - hell, we expect it. The only thing that we believe is that by setting ourselves up to respect and learn from our customers' experience, we win. Other banks just can't do that.

Photo by: riv / flickr

According to Mr. Meara, 90 percent of all transactions with bank tellers involve checks. If everyone had an iPhone deposit app, people wouldn't come into the branch as often. That would be fine had banks not invested so much time and energy in training branch workers to persuade checking account customers to move into more profitable products.

"On the one hand, fewer deposit transactions could mean a headcount reduction," he said. "But it invites the erosion of store profitability. The banks are struggling with the enormity of what it means."

Hurry up & credit my account, New York Times, September 18, 2009

Brand apathy is rampant in retail financial services. The number one sales channel is the branch. That expensive, increasingly empty, retail space is how banks sell new products to customers. "Is there anything else that I can help you with today" is the alpha and omega of retail bank marketing, with a small epsilon for banners of smiling families plastered inside branches.

Below is a screenshot of what I see when I log into my American Express account. A small portion of the screen is stuff I care about. A small, but significant, portion of my emotional well-being rides on those numbers. The bulk of the page is dedicated to selling me stuff. How about this? Make understanding and working with my money easier - make me happier, and then I'll be far more receptive to upsells. But if you can't even get basic information and interaction right, then I'm too busy worrying about my current state of affairs to consider newfangled products and the incremental complexity they entail.

Selling customer data

## 16 Dec 2009, 8:40 PM

As someone who has spent most of their working career selling data to advertisers, I'm surprised by the number of businesses that are predicated on the model of selling data to advertisers. If you have a great widget, it is easy to get it in the hands of millions of people, especially if you are giving it away for free. As programmers and scientists we deify data. What we don't do is understand advertising. Sure we understand that advertising is about selling stuff, but we don't seem to get that the advertising industry exists to sell ways of selling stuff.

While I'm ranting, let me ask you something, Randall. At the risk of sounding like Glenn Beck Jr. - what the fuck has gone wrong with our country? Used to be, we were innovators. We were leaders. We were builders. We were engineers. We were the best and brightest. We were the kind of guys who, if they were running the biggest mobile network in the U.S., would say it's not enough to be the biggest, we also want to be the best, and once they got to be the best, they'd say, How can we get even better? What can we do to be the best in the whole fucking world? What can we do that would blow people's fucking minds? They wouldn't have sat around wondering about ways to fuck over people who loved their product. But then something happened. Guys like you took over the phone company and all you cared about was milking profit and paying off assholes in Congress to fuck over anyone who came along with a better idea, because even though it might be great for consumers it would mean you and your lazy pals would have to get off your asses and start working again in order to keep up.

And not just you. Look at Big Three automakers. Same deal. Lazy, fat, slow, stupid, from the top to the bottom - everyone focused on just getting what they can in the short run and who cares what kind of piece of shit product we're putting out. Then somehow along the way the evil motherfuckers on Wall Street got involved and became everyone's enabler, devoting all their energy and brainpower to breaking things up and parceling them out and selling them off in pieces and then putting them back together again, and it was all about taking all this great shit that our predecessors had built and "unlocking value" which really meant finding ways to leech out whatever bit of money they could get in the short run and let the future be damned. It was all just one big swindle, and the only kind of engineering that matters anymore is financial engineering.

http://www.fakesteve.net/2009/12/a-not-so-brief-chat-with-randall-stephenson-of-att.html

When lenders compete, you win.

The message is pretty clear - competition, the pillar of capitalism, results in better products and services for consumers. When lenders compete, you win. While this is the slogan for one major mortgage lead generator, the methodology is common to the industry as a whole. And people believe that magic technology fosters competition, with the net benefit of better lending rates.

The reality is a little different. When I oversaw the operations of a mortgage marketplace, the competition was not in terms of the products offered, but rather, the price paid for getting a person's attention. Lenders would bid for leads and the lenders who paid the highest price received the most leads. Thus the incentives were counter to people's rational goals. The lenders with the highest margins were able to spend the most on customer acquisition, while lenders with more affordable products were unable to reach the same audience.

Google recently publicized their direct entry into this space. Prior to their entry they captured only a portion of the marketing dollars - with lead generators buying keyword ads on Google, funneling the traffic to their site, collecting lead information and selling to the highest-paying mortgage providers. When the lead generators' spend on Google was lower than their revenue from the mortgage companies, they profited. A mercenary and highly unregulated bunch, the lead generators would go to great lengths to screw the consumer.

Google's product appears better in that rather than selling out the consumer for the highest price, they display a targeted list of options - clearly outlining the competing offers - letting the consumer decide which companies to contact for a quote. As is always the case, transparency leads to a better outcome for people.

Despite the numerous and simultaneous failures in the mortgage marketplace that have so deeply scarred the American economy, one upside that is often forgotten is the benefit of standardization of lending products. Prior to the development of the mortgage backed security market, mortgage contracts varied greatly in their structure and terms. And while they remain complex financial contracts, standardization means that a consumer is able to properly evaluate the bulk of the financial impact of their mortgage choice by simply examining a handful of parameters.

I have a home equity loan. I also have credit cards, savings accounts and brokerage accounts. The simplest account that I have, and the one that sees the most action, is the humble checking account. My checking account has a 36 page introductory preamble that outlines the terms and conditions. These terms are fully documented on a corner of my bank's web site, and change on a semi-monthly basis. No one reads these terms.

I spent my summer reading not only the terms of my checking account, but of all of my accounts and the accounts at other major banks in America. You'd be terrified to know what they actually contain. That is, if you could find them. The GAO found that 65% of banks do not make these documents available on the web, and 35% fail to produce them if you visit a branch.

And these terms matter.

Unlike most people's mental model of retail banking operations, banks do not make most of their money on the difference between the rates at which they lend versus the rate they offer for savings. American banks, quite distinctly from banks elsewhere in the world, make the bulk of their money from fees and charges. Invisible and often unavoidable consequences of little clauses in contracts that no one ever reads.

This stands in stark contrast to the message that we hear in bank marketing. Retail bank marketing is dominated by APR: Best rate savings! Lowest rate on credit cards! Yet the largest financial impact to the consumer is fees and charges.

Fees and charges that consumers have no hope of simply understanding.

Lead generation is rife across the financial product landscape. Some companies try to offer Google-like services to help consumers choose financial products - but these services fall into the trap of not taking into account the obscure and non-standardized terms that most impact financial well-being. And as such no one believes the offers they see on sites like Mint.com. If people honestly believed they could "Save $2,000 by switching to Bank XYZ's credit card" then the conversion rates on these offers would be vastly better than the prevailing rates. And so with all the technology that we have at our disposal, people are no better off. Banks have no incentive to increase transparency, lead generators have no incentive to provide real offers, and immense brand apathy prevails, resulting in short-sighted decisions that further drive down customer experience. The cycle continues. Until it stops.

Chase, what matters?

## 23 Sep 2009, 6:36 PM

Last week I paid my bills. As part of the regular bill-paying process, I take any funds left over that are not required as cash over the coming weeks, and pay down my home equity facility. I pay my credit cards in full. The only rate I concern myself with is the 3.8% rate on my loan. There is nothing particularly unusual about this process.

A few days after I paid my bills my bank, JPM Chase, emailed me to tell me that my account was overdrawn. I logged into the web site and saw that they had put a bunch of payments through twice. Most importantly, they put my loan principal payment through twice. In their wisdom, they helped me correct this by withdrawing from my credit card to pay down my loan. My credit card has a purchase rate of 12% and a cash rate of 19%. Follow me for a moment: they withdrew from an account charging 19% to pay down a loan at 3.8%. And along the way they charged an overdraft fee for the 'service.'

I have three simple requests for Chase:

1. Reverse the double payment, restoring my checking account to its intended balance.
2. Reverse the overdraft fee.
3. Return the interest they are earning at 19% on the credit card account.

I was first routed to the online banking group as it was clear that the error originated within their domain. The timestamps on the transactions are identical, the transaction numbers are near sequential; there is a clear indication that my intent was to only pay once, but they processed the transactions twice. From there I was transferred to the credit card fiefdom who told me that they would correct the overdraft issues - but it would take four days.

Four days later I called to discover that no work order was placed and there was no note of my original call. I went through the same call center waltz, but instead was routed to the home equity group. The same group that asked me to forge a signature during the application process - but that is another story for another day. During this call I was told that the home equity group would take care of everything as they were the final destination of the funds. I made sure that the work order included the three key points listed above. I also took note of the work order number and the names and times of all the people I spoke with.

Meanwhile, mind you, I still have a zero checking balance and am unable to make other payments. I am loath to draw down the debt facilities at my disposal, as it would just make it even more complex to reverse these transactions. Friends have been helpful and luckily I have enough cash on hand.

Today I called the home equity group directly to check on the status of the work order. They gave me the same estimate as the last time I spoke with them: four to five days. At which point the interest would have accumulated to $11. Not a great deal in the scheme of things, but it is my goddamn $11. I'd be happy if they simply paid me the $11 and moved on - they have earned a good deal of revenue from me over the years, and this is clearly their error. But of course, they can't return me the $11 as they have no mechanism for doing so.

"You cannot dispute this transaction for this reason: 5102 - Bank Releated[sic] Fees / Charges - Not Eligible"

As of now it appears my only option to force the return of my overdraft fee and to receive any accrued interest is to take action in the New York small claims court. There is no way Chase will defend this - that would cost hundreds for legal approval alone. Of course, they make an order of magnitude more than that from my family in fees, charges, interchange and net interest margin each year. As soon as I get my $11 back, they will no longer hold any of my accounts.

Customer experience matters.

## 11 Jul 2009, 12:44 AM

Last night we had our fourth NY R Statistical Programming meetup. The topic was Bayesian Methods + MCMC. We had two presenters, Jake Hofman and Suresh Velagapundi, both of whom did an admirable job of presenting a very broad topic to an audience with diverse backgrounds. I want to use this post to bridge a gap between the background material and day to day utilization. This is catered towards the audience who may have some experience with R, but aren't very familiar with the Bayesian Way. While it is a simple example, the steps involved extend on to the issues that are faced in real world applications.

We are going to step through Jake's coin flip example to get a sense of what is involved in doing Bayesian inference. There are a number of packages on the CRAN Bayesian Inference view that do all of what you will see below. I decided against using them for two reasons. First off, the coin flip example is a little too trivial for using many of the techniques that rely on multivariate parameter estimation to see any utility. But more importantly, I want to use the opportunity provided by a nice simple example to step through the underlying mechanics. My hope is that after reading through this you can have a look at the available packages and be a better judge of what they are used for and where one package may stand out over another. In the course of doing this write up I went through the MCMCpack package and it is a good exercise to compare how they implement the MCbinomialbeta() against the first half of this walk through. For the curious, the MCMCmetrop1R() function is far more advanced than the simple implementation of Metropolis-Hastings shown below, and it is a good exercise to understand their tuning parameters.

As a quick recap, the point of the exercise is to go from prior belief in a distribution (in this case we believe that the coin is fair) and use observed data to arrive at a posterior distribution using both the prior and the data. There are three things that we need to know to calculate the posterior distribution:

1. The likelihood of seeing the new data given our estimate of the bias
2. Our prior distribution
3. The 'evidence', or the integral of the product of the likelihood and the prior over every possible value of the bias
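
Put together, these three pieces are the terms of Bayes' rule. Writing theta for the bias and D for the observed flips (notation chosen here for illustration, not taken from Jake's slides), the posterior can be written as:

```latex
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int_0^1 p(D \mid \theta')\, p(\theta')\, d\theta'}
```

The numerator is the likelihood times the prior, and the denominator is the evidence.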

I won't step through the derivation of the likelihood, as this should be easy enough to derive from the binomial probability distribution function. In this case our likelihood, with N trials and h heads (dropping the binomial coefficient, which does not depend on theta), is:

likelihood <- function (N, h, theta) theta^h * (1 - theta)^(N-h)

Check that the likelihood function makes sense:
t <- (0:100) / 100

png ("figure1.png", width=800, height=600)
par (mfrow=c(2,2))
par (bty='n')
par (col='red')
plot (t, likelihood(100, 50, t), type='l', xlab='Theta Hat', ylab="Likelihood", main='Likelihood (t=0.5)')


Great, the maximum likelihood for 50 heads from 100 flips is a theta of 0.5. (See chart below).
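
For readers who would rather check the maximum numerically than read it off the chart, here is a minimal sketch. It assumes we work with the log of the likelihood (the helper name logLikelihood is mine, not from the talk), since the raw likelihood is vanishingly small for 100 flips, and it uses base R's optimize() to search the interval from 0 to 1.

```r
# Log-likelihood for h heads in N flips. Taking logs avoids the very
# small magnitudes that the raw likelihood reaches when N = 100.
logLikelihood <- function (N, h, theta) {
    h * log (theta) + (N - h) * log (1 - theta)
}

# optimize() searches the interval (0, 1) for the theta that
# maximizes the log-likelihood of 50 heads from 100 flips.
mle <- optimize (function (theta) logLikelihood (100, 50, theta),
                 interval = c(0, 1), maximum = TRUE)$maximum
print (mle)
```

The value printed should sit very close to 0.5, matching the peak of the plotted curve.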

Jake uses the Beta distribution as his prior as it has some neat analytic properties; namely that the posterior will be of the same distribution family as the prior. We call these types of priors conjugate priors.

prior <- function (theta, a, b) dbeta (theta, a, b)

a <- 2
b <- 2
plot (t, prior(t, a, b), type ='l', xlab='Theta Hat', ylab='Pr(theta|a,b)', main='Prior')

If we do the integration, we can arrive at the analytic form of the evidence and thus the posterior:
evidence  <- function (N, h, a, b) beta(h + a, N - h + b) / beta (a, b)
posterior <- function (theta, N, h, a, b) likelihood (N, h, theta) * prior(theta, a, b) / evidence (N, h, a, b)

plot (t, posterior(t, 100, 50, a, b), type ='l', xlab='Theta Hat', ylab='Pr(theta|Observations,a,b)', main='Posterior (t=0.5)')
plot (t, posterior(t, 100, 70, a, b), type ='l', xlab='Theta Hat', ylab='Pr(theta|Observations,a,b)', main='Posterior (t=0.7)')

dev.off()
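
Because the Beta prior is conjugate, the analytic posterior above should be exactly a Beta density with parameters a + h and b + N - h. The sketch below repeats the function definitions so that it runs on its own, and compares the two using the same N = 100, h = 70, a = 2, b = 2 as the plots. The variable names conjugate and same are just for this illustration.

```r
likelihood <- function (N, h, theta) theta^h * (1 - theta)^(N - h)
prior      <- function (theta, a, b) dbeta (theta, a, b)
evidence   <- function (N, h, a, b) beta (h + a, N - h + b) / beta (a, b)
posterior  <- function (theta, N, h, a, b) {
    likelihood (N, h, theta) * prior (theta, a, b) / evidence (N, h, a, b)
}

t <- (0:100) / 100

# Conjugacy: a Beta(a, b) prior and h heads in N flips give a
# Beta(a + h, b + N - h) posterior.
conjugate <- dbeta (t, 2 + 70, 2 + 100 - 70)

# all.equal() compares the two curves up to floating point tolerance.
same <- isTRUE (all.equal (posterior (t, 100, 70, 2, 2), conjugate))
print (same)
```

If the algebra for the evidence is right, the two curves agree to within rounding error.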

Let's say we don't know what the analytic form for the evidence (denominator in Bayes' rule) is, and replace it by a numerical integration over all possible thetas from 0 to 1:
evidenceN  <- function (N, h, a, b) integrate (function(t) likelihood (N,h,t) * prior (t,a,b), 0, 1)$value
posteriorN <- function (theta, N, h, a, b) likelihood (N, h, theta) * prior(theta, a, b) / evidenceN (N, h, a, b)

N <- 100 # Trials
h <- 70  # Heads

png ("figure2.png", width=800, height=600)
par (mfrow=c(1,1))

analytic  <- posterior (t, N, h, a, b)
estimated <- posteriorN(t, N, h, a, b)

plot (t, analytic, type ='l', xlab='Theta Hat', ylab='Pr(theta|Observations,a,b)', main='Posterior (t=0.7)')
lines (t, estimated, type ='l', xlab='Theta Hat', ylab='Pr(theta|Observations,a,b)', col='blue', lty=2)

err <- (analytic - estimated)^2
lines (t, (err - min(err)) / diff(range(err)) * max(analytic), lty=3, col='black')
legend (0,2, c('Analytic','Estimated','Error^2 (scaled)'), col=c('red','blue','black'), lty=c(1,2,3), bty='n', text.col='black')

dev.off()

While things are pretty simple with this toy example, Jake made the point that the real difficulty with Bayesian inference is twofold:

1. Integrating across theta to find the evidence (denominator)
2. Once you have the posterior, integrating it to calculate summary statistics (mean, variance, etc.)

In the above example we used the integrate() function to apply adaptive quadrature to find the evidence. We could use this method for 2, but let's not. Instead, let us use MCMC, which is at its core a way to draw samples from a distribution that is otherwise hard to sample from.

Given that this example is rather trivial, with just one parameter in question (theta), I won't step through the implementations of vanilla Monte Carlo methods (uniform, importance & rejection sampling). These implementations are pretty much straightforward from Jake's presentation. I will, however, implement a simple Metropolis-Hastings MCMC sampler using a simple and symmetric Gaussian proposal density (q in Jake's notes).

MHstep <- function (pdf, prevCandidate) {
    # Effectively we are taking a random walk.
    newCandidate <- prevCandidate + rnorm (1, mean=0, sd=0.1)

    # NB: Because we are using the normal distribution
    # as our proposal density, which is symmetrical,
    # we cancel out the q terms on the numerator and
    # denominator, as q(x|y) = q(y|x)
    a <- pdf(newCandidate) / pdf(prevCandidate)

    # Draw a uniform random number from 0 to 1
    u <- runif(1)

    if (a > u) {
        # Accept the move: this happens with probability min(1, a),
        # so more probable candidates are accepted more often
        return (newCandidate)
    }

    # Else, stick with our previous candidate
    return (prevCandidate)
}

Let's use our numerical approximation to the actual posterior function as the PDF we want to draw samples from:

posteriorPDF <- function (t) posteriorN (t, N, h, a, b)

Time to go on a random walk down coin flip street.

steps <- 1000
samples <- matrix(NA, steps)
samples[1] <- 0.5 # initial guess
for (i in 2:steps) {
    samples[i] <- MHstep (posteriorPDF, samples[i-1])
}

And how did we do?

png ("figure3.png", width=800, height=600)
par(bty='n')
par(col='red')
plot(cumsum(samples)/1:steps, type='l', xlab='Step', ylab='Estimated Mean', main='Drawing samples by Metropolis-Hastings')
dev.off()

Nice.

# predict.i2pi

## 19 Jun 2009, 6:57 AM

## the basics

"If you are not embarrassed by the first version of your product, you've launched too late."

On Monday I released predict.i2pi.com, a statistical learning web service. Designed to deal with common classification and regression problems, it takes input data in the form of a CSV file and returns to the end user a set of predictive models. For example, if you have a list of store locations, local weather data, and store revenue, you could use the service to see if location and weather impact store revenue. predict.i2pi tries to determine whether predictions are possible by running your data against a growing number of user contributed statistical learning algorithms and finding the ones that work best with your data.

In planning this I went through a range of features, bells and whistles but have decided to strip it all back.
This is the simplest thing I could build to support what I wanted. It takes a file, runs predictive algorithms against the file, and returns performance measures. Data and predictions.

## data

The data provided is expected to be in the form of a number of observations, with one observation per row. Each column contains measurements for these observations. We are interested in predicting one or more of these measurements. For example:

|<------ Explanatory Variables ------>|  /----- Response Variables (denoted by *)
X1,   X2,   X3,   Name,  Date,       *Y
12.3, 13.4, 8.32, Terry, 2008-10-12, 736.0
9.3,  34.1, 1.21, Josh,  2008-10-12, NA      <-- NB: NA response variables
...   ...   ...   ...    ...         ...         will have predictions available
8.7,  38.7, 8.17, Jess,  2009-01-07, 1823.1      for subsequent download.

Data may include observations for which we do not know the response. These observations can be included, with the response left blank. Once satisfactory models are found, end users can download spreadsheets containing our best predictions for that data. On my todo list is adding confidence intervals to these values.

Once uploaded we try to best detect the following data types:

• Numeric (floating point numbers)
• Integers
• Dates (YYYY-MM-DD works best)
• String Factors (e.g., State or letter scores)
• Text (longer text than factors, with analytic interpretation as language text instead of as factors)

## learning

Internally, predict.i2pi performs a standard test / training protocol. Data is loaded and a random half of that data is used to train the learning algorithm. The remaining half is used to test how well the learned algorithm works against previously unseen data. Robust algorithms will do almost as well on the test as during training, while less robust approaches will lead to far poorer performance during testing. The system continues this process of picking a training sample, training and then testing as many times as possible in an allotted time.
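
The train / test cycle described above can be sketched in a few lines of R. This is only an illustration under stated assumptions, not the service's actual code: the data are synthetic, lm() stands in for a contributed learning algorithm, oneCycle is a hypothetical helper name, and out-of-sample R-squared is used as the score.

```r
set.seed (1)  # fixed seed so the sketch is repeatable

# Synthetic stand-in for an uploaded CSV: two explanatory columns
# and one response column.
n <- 200
d <- data.frame (X1 = rnorm (n), X2 = rnorm (n))
d$Y <- 3 * d$X1 - 2 * d$X2 + rnorm (n)

# One cycle: train on a random half, score on the unseen half.
oneCycle <- function (data) {
    trainIdx <- sample (nrow (data), nrow (data) %/% 2)
    model    <- lm (Y ~ X1 + X2, data = data[trainIdx, ])
    test     <- data[-trainIdx, ]
    pred     <- predict (model, newdata = test)
    # Out-of-sample R-squared: 1 - residual SS / total SS on the test half
    1 - sum ((test$Y - pred)^2) / sum ((test$Y - mean (test$Y))^2)
}

# Repeat the cycle a fixed number of times (the service instead
# runs as many cycles as fit in its time budget).
scores <- replicate (20, oneCycle (d))
print (quantile (scores))
```

Looking at the spread of scores across cycles, rather than a single split, gives a rough sense of how stable a learner is on this data.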
During each of these cycles, predictions are tested against the actual responses in the corresponding observation. Performance is then measured using the R-squared metric for regressions and simple classification accuracy for classification problems. The system supports user defined performance measures with the goal being to let those who supply data decide on which performance measure is best for their application. However, at the moment I'm concentrating on opening up the ability for users to upload their own learning algorithms.

Currently learning algorithms are specified in small snippets of R code that can be dynamically loaded into the main R subsystem that is responsible for coordinating training cycles. See, for example, rpart.R which links in a recursive partitioning algorithm from the rpart library.

#requires(rpart)

myModel <- function (formula, data) rpart(formula, data, na.action=na.exclude)

myPredict <- function (model, data) {
    p <- predict (model, data)
    # For each row, pick the index of the class with the highest predicted probability
    as.numeric(apply(p, 1, function(r) order(-r)[1]))
}

All learning algorithms must contain two function definitions: myModel and myPredict. myModel takes a model formula and data, returning a model object that can be used to make predictions against new data. myPredict takes two parameters, the model object returned by myModel and a set of data that may not have been seen during training. We call myModel with one randomly ordered half of the data for training. For testing, we provide myPredict with the model object generated from the training set, but provide it with the as yet unseen testing portion of the data.

Users are also able to define transforms that take a matrix of explanatory variables and return a new matrix with the same number of observation rows but with one or more of the explanatory columns transformed into a new space.
For example, one could take a 100 column matrix and apply some form of dimensionality reduction that returns a new matrix, with the same number of rows, but only 10 columns. The transform function is not shown the response variable to ensure that no funny business occurs whereby the response is somehow embedded in the explanatory variables. These same transform functions can then be applied to response variables alone, allowing the system, for example, to construct a model log(Y) ~ PCA(X1, X2, ... , Xn). The following example shows a transform function that replaces any columns that are at least 50% NA with an indicator variable:

myTransform <- function (x) {
    if (is.null(dim(x))) return (x);
    if (ncol(x) == 1) return (x);
    bad_idx <- apply(x,2,function(c) sum(is.na(c)) / length(c)) >= 0.5
    if (any(bad_idx)) {
        y <- x
        y[,bad_idx] <- is.na(y[,bad_idx])*1 # replace NA's with an indicator variable
        return (y)
    } else {
        return (x)
    }
}

## coming soon

As for uploading code, the best way to do this right now is via email. I hand rolled my own sandbox environment to prevent 3rd party code from hijacking my system - but as with any security code that I write myself, I loathe testing it in the real world until I've had a good chance to be as close to 100% sure that it is safe. In reality, I'll probably stop trying to reinvent the wheel, and use a pre-existing solution.

Given long term plans, and issues around data privacy, I didn't want to set up a system whereby data leaves the system for testing on external machines. While it works well for very large datasets, e.g., the Netflix Prize, the potential of overfitting is higher for smaller datasets when random portions of that data are often reused in validation cycles. That said, developing new learning algorithms (or plugging in ones from existing CRAN libraries) is fairly straightforward so you should be able to develop locally and upload.

There already is an API, but it is not at all documented. This is my next priority.
Currently I'm running into some issues with using RCurl to interact with my API - issues which would not exist in any other language - but I really would like to get the R API out of the door before I open up wider access. In short, there are 3 methods which are currently used by the web site (inspect my horrid JavaScript code to see them.) These allow you to upload data, make edits to meta data and receive predictions. Each prediction includes links to the R source that was involved in performing the learning + any transforms used. The prediction meta data also includes the quartiles for the measure after a number of test/train cycles, plus a sample of 250 predictions vs. actuals.

It has been suggested that I also include a small downloadable example snippet for each file to allow developers to get a better flavor of what they are working with. For larger files, I think this is a perfectly swell idea. In fact, I really do want to hear more of your suggestions. I took a knife to a slew of functionality before I released this, but I have code ready to go. But I want to wait for real life suggestions to see what I should be working on next.

The original plans for this project included complex routines for doing unsupervised schema detection and meta modelling to help identify which algorithms might work best with particular shapes of data. Also I had built a framework for combining multiple learning algorithms in a boosting type environment. All of these features remain possible and will hopefully be released in the not too distant future.

One of the big issues I struggled with in deciding to release this is the nature of my target audience. At the moment there is an impedance mismatch between the sophistication required to understand what the system does and the utility of the system to sophisticated users.
To those with any experience in predictive analytics, everything here should be your bread and butter - and most likely far simpler than what you do on a daily basis. However there is a large audience of people in the information business who currently make do with the 'Add Trendline' option in MS Excel. To this audience, this service would be greatly valued, but in its current form is probably a bit too much. This deeply embarrasses me, but I'm not going to let that stop me from publicizing what I'm up to. There is a plan, and it exists in increments. For the lay information worker, there are hurdles both in providing understandable explanations of how the learning algorithms work and were applied but yet also difficulties in adapting my format to the natural shape of the data that they often work with - not to mention data cleaning. As an example time series models pose an interesting problem. They do break the model of one independent observation per row, but it is difficult to come up with a way of training and testing that is consistent with my current implementation. Even if I were to develop special case handling for time series data, it can be difficult for a computer to find appropriate periods over which to lag variables. At this point I think the simplest route is to let people include previous observations that they deem important, at lags that they think might be interesting, with each row. That way each row can be treated independently from the others and I don't have to build a lot of machinery to guess appropriate treatment of temporal dependency. Likewise there are other problems whose natural representations don't map neatly to the one row = one observation representation - think of collaborative filtering or graph based problems. I am quite keen on keeping the one row representation as it affords me some nice system scaling properties without becoming too domain specific. 
That said, there is nothing stopping me from building front-ends that take data from these problem domains in their natural representation and map them to one that works better for my system. When it comes to explaining the models, well. That is another story.

# Engineering vs. Architecture

## 16 Jun 2009, 5:18 PM

A few months back I caught up with a fellow Aussie in New York, who I first met once ten years ago. It is amazing how social network dynamics change as an expat. He is currently teaching Architecture at Columbia while completing his doctorate in the nature of representation in architecture. It was the sort of long conversation that lingers for a few months before finding a resting place in your mind. At first glance we spent quite some time discussing the work at the Spatial Information Design Lab as this most closely bridged the gap between our worlds. The deeper conversation was that of representation.

Engineers build things. They use sciences to make sure that the things they build don't fall over. Architects design things - they take ideas of the world and represent them. Their audience is both the client and the engineering and construction teams. Different representations serve different purposes. Can financial engineers shed the instrumentation of time series analysis and take on this role? Or will it come from a new group - the type of people who build Googles? Or will their buildings leak?

Image by ken mccown on Flickr.

# Predicting social network features from profile picture features

## 13 May 2009, 2:16 PM

The interwebs has made it really easy for those who are looking for data to find it. Or at least a close approximation to it. Those who have the tools to scrape the web and reverse out interesting data are typically part software developers, statisticians and hackers. Mix these three together and one is genetically predisposed to collect as much data as possible.
But there must come a point when the collection stops and the inference begins. Inference is difficult, in that it requires making statements about the mushy world, whereas coding systems to collect data deals with deterministic computers. It is easy to fall into the trap of simply collecting data to avoid dealing with mush. In my previous job I was overseeing a project which involved scraping a large publicly traded e-commerce site to find interesting information to support investment decisions. The problem was that everyone on the street was also scraping the same site. Our code was top notch and having done this before we were able to avoid common pitfalls and our system was gathering oodles of potentially useful data. Faced with all this data one of my developers had a tough time working out where to start the inference process. Sure, there are obvious places, like predicting revenue from site activity, but they are obvious. So obvious that even the sell-side researchers were doing it. Faced with the task of finding something less obvious, my advice to my colleague was to pick a pair of columns at random and come up with a model for the dynamics of the relationship. In such an exercise one picks the boring stuff, like transaction numbers instead of transaction values - and begins decoding from there. The goal is to look at the data sideways and see if anything interesting pops up. Most often this fails. But it is a good way of breaking out of the data rut. When I do this exercise, I typically find myself desperate for one other set of observations to help explain what I am seeing. While the result may be boring, it is not failure as it gives you a direction with which to approach the data. Currently I have a few projects that involve social networks in one shape or another. While my clients are generally looking for the standard orthogonal projection of the data, I can't resist the urge to look at things sideways. 
A client was walking through the important data that they collect about social network activity, but when I talked with their developers they also mentioned in passing that they also collected profile pictures. Not for analysis, but for another part of their suite. Pictures. Cool. Thems be data. Excitedly I professed, as if I actually knew what I was talking about, that there is probably heaps of juicy stuff inside profile picture data. Intrigued at my own confidence I decided to tackle this by scraping 250,000 profile pictures from MySpace and grabbing a few key stats about each account. The first thing I wanted to examine was whether profile pictures in any way informed the number of friends that a user had. They do.

As I have other plans for this data, I didn't scrape MySpace with the complete intention of doing this project, thus only 19,214 of the images have associated friend lists. But this was enough to get started. First off I wrote a short C program to calculate 32 features from each image. These features are pretty typical image processing functions, like size, average color levels, number of colors, smoothness, symmetry and a few key points from the luminosity histogram. MySpace pictures tend to be a mix of faces, icons and general photos - to roughly help identify faces (without committing to facial specific measures) I also calculate a subset of these values for the central portion of the image, including recording the location of the vertical axis of greatest symmetry. Most of the values have been normalized to some image specific reference to increase variation and limit covariance, for example the average R,G,B values are expressed as a % of the image's luminosity.

To get a sense of the variation of these features, I constructed an image based on the first two principal components of the feature space. At this point some kind folks on Twitter (starting with Mark Reid) pointed me towards t-distributed stochastic neighbor embedding.
Someone mentioned that I could simply forgo my feature calculation code and use tSNE on the pixel data, which sounded exciting, but after reading the paper I decided against it. In the paper the authors do demonstrate this technique, but their first step after reading in the pixel data is to perform PCA to reduce the dimensionality of the problem. And their image set was much better behaved than the images I was working with. Maybe I'll take another look at tSNE in this context when I next have some free time.

Visualization aside, the next step was fairly simple. I divided my data into a 75% training set and reserved 25% for testing, attempting to predict log(# of friends) by my image features. Using a linear model was pretty poor, but not terrible. In sample I got an R^2 of 0.17, out of sample it was far worse. Using an SVM, I limited the training to classification rather than regression - trying to classify in groups by quantiles of log(# of friends). For a simple binary classification (more or less friends than the median sampled MySpace user), the accuracy was 70% - with errors evenly distributed across the two classes. I also tried 3 and 4 classes, and the lift was similar. To visualize this I performed the regression using an SVM and, as expected by the results of the classification, got a decent R^2 (0.25) on the out of sample test set.

To get a better sense of the outliers, I produced the following visualization. Note in this visualization I have sacrificed some positional accuracy by enforcing a constraint that no images may overlap. I also used a similar approach to test whether I could predict other, more interesting, network features like measures of centrality, and my initial results are positive. At this point, if I were to run with it, I'd like to make some assessments as to the underlying process that relates social network features to image features. Until I have more time, my current hypothesis is Boobs.
For those of you with more time on your hands, I have packaged up some datasets. Grab them here.

I have no positions in any CRAs right now[/disclaimer]

While the talk focussed on corporate bond ratings, the largest growth area for these agencies was in structured finance. And much of this was mortgage backed securities and similar derivatives. So to understand the mess we are in now we need to look at the history of these instruments.

MBS's were born for two key reasons. First off was the realization that in 'normal' times the dominant risks were idiosyncratic in nature and as such could be minimized through the application of portfolio theory and diversification, leading to pooled entities with smoothed cash flows and tranches providing for the needs of various risk profiles. In my opinion this story is primarily the sizzle. The real steak was the fact that by aggregating together whole loans new tradeable instruments could be formed. The problem with whole loans was that their pricing was highly dependent on a large vector of unstandardised parameters whose diversity precluded the formation of any depth necessary to support liquidity in traditional market designs. By eliminating the idiosyncratic risk components these pooled instruments could theoretically be summarized by a small set of parameters and relatively simple models for prepayment risk meant that traders could respond to bids and asks against them.

Faced with these simple models quants took off in a great fantastical leap and applied ever more complex techniques to model out pricing. For one take, see Felix Salmon's recent piece in Wired. Or look to Paul Wilmott's take on the ever escalating departure into a mathematical wonderland that ignored the realities of the underlying loans and their associated risks. Somewhere along the line practitioners forgot that the technology underpinning the frothy new market was based on 1970's financial and computational technology.
Back then a bank of associates armed with HP-12C's could price out MBS's using a small set of descriptive parameters. Over the next 20 years more computing power was thrown at the problem, but the basic data was still confined in scope. Sure, some funds were taking apart these pools and doing a deep analysis of the components, but there wasn't much reward in doing so as the market was moving at such an upward clip. Even worse, if you look at the papers from Frank Partnoy, the credit ratings agencies - who were supposed to be taking a deeper look at these securities, without the demands of second by second trading - were using plainly silly assumptions. There was a huge amount of mathematical and financial stupidity going on. Not even going to mention the conflicts of interest and the regulatory arbitrage at play... Anyway...

To address Falafulu's point about MPT - I agree, MPT is great stuff; a very powerful framework by which to understand finance. But just look at the assumptions. Sure, these assumptions make the math tractable, but modern computing power enables us to take a more nuanced view of the world. We no longer have to rely on single parameters of 'default risk' to price these instruments. The market would be far better served if traders had all available data for the underlying components and could use their own information about their own risk profile to come up with better measures of value. Just compare David Einhorn's spreadsheet with a report put out by Moody's. It is night and day. Give me the data, not some puff piece of pseudoscientific nonsense passing itself off as high finance.

The original problem with trading whole loans, namely that there were too many parameters to support liquid markets, is no longer an issue. Look at WeatherBill. Look at Robin Hanson's work on combinatorial market mechanism design. Falafulu, sure some smart people were recognized for their ground breaking work of decades ago.
But the most recent winner of the John Bates Clark medal in Economics went to Susan Athey, who is doing some fantastic work in mechanism design. Computational power is such that we no longer need to pretend that all financial instruments have to be priced with a slide rule. We have new marketplaces that can effectively support trade in financial instruments with high dimensionality. We have the computational power to let traders value these instruments. What we don't have is the data. Give us the data and we will trade.

Having had to explain my rationale for launching i2pi as a consultancy frequently, I've come to rely on the phrase 'data trades inversely to liquidity.' This notion holds true in both my prior world of investment management and especially now in the data collection and analysis business.

In finance, when markets are liquid price discovery is cheap. With all the talk right now of mark to market accounting treatments, it is clear that the converse is also true. Holders of illiquid securities can no longer rely on quoted prices to manage their portfolio risk. As the current crisis began to unfold earlier last year there was a very visible Mexican standoff while shops with CDO/etc. exposure refused to trade as the act of trading would force everyone else to reprice their own portfolios. Doing so could only last so long and the inevitable write downs began to occur as margins were being called. And thus the house fell.

The premise that led us to this mess was that with only a modicum of data and some threadbare models trading would be the final arbiter of value and the collective intelligence of efficient markets would result in fundamentally sound pricing. Now that liquidity has gone from the markets, traders of these illiquid instruments are bulking up their data and models to try and better their understanding of fundamental value.
And so it is that when markets are liquid the market relies on trading to assimilate the information of individual agents. Without this method of price discovery these agents need to gather their own data as the market no longer performs the role of grand aggregator. Data trades inversely to liquidity. While my work at the fund was phenomenally diverse and deeply intellectually stimulating, there was no fire. I've never had a real job. I've only ever worked at start ups where there is no time for a 'job.' In a constant state of conflagration, everything at a start up requires immediate attention. Early on in my career I worked as a back-end system engineer and 'fire' usually involved dealing with scalability and general growing pains. Late nights implementing features that were sold to paying clients well before the development team was consulted. Later I spent more time selling these features and there was a constant fire to come up with new and interesting things to attract clients and revenue. At the fund our financial stability was near certain and while there was a drive for deeper insight, the fire was luke-warm at best. The current financial crisis is, at its core, rooted in the debt markets and this dislocation has clearly negative consequences for start up financing. Contemporaneously, new technologies and operational methods allow technology start ups to scale efficiently. Cloud Computing, as distinct from the similar buzz about Web Services just a few years ago, provides a platform upon which small companies can grow their operations in proportion to their needs without large capital investments in hardware or expensive, unwashed and hirsute systems administrators. Hadoop, Memcache and their ilk let developers build applications that operate on huge data sets without investing in the expensive vertical scaling solutions of Oracle & Co. 
And social networking results in network driven growth patterns that can be much steeper than those of products or services that live on an island. However, the skills required for scaling analysis systems are quite different to those needed to scale operations; part statistician, part database administrator, part computational micro-economist, and then an understanding of business to tie together a narrative that tells stories with numbers rather than purely stories about numbers. The environment I see around me for technology start ups is one whereby funding is hard to come by and series B's are even more painful to founders. These companies need to be smart. They need to focus some of their attention on what to do with all the data that they gather as part of daily operations. Development teams, while facing the world with more appropriate tools than those available a few years ago, still focus on operations. Someone needs to focus on research. Data driven research. In times when funding was easier and valuations were higher, companies could focus on operations and hope that operational scale would lead to revenue. I firmly believe that in this environment data, rather than scale alone, is of immense value.

# When to cheat

## 3 Jan 2009, 10:31 AM

Sometimes, when trying to optimize a computer system, you get to a point whereby on one hand the next x% of optimization will take orders of magnitude more time than the previous x%. And on the other hand, you can completely change your marginal optimization cost if you take a different approach by approximation. Good systems can have hardware thrown at the problem to scale. As has been mentioned elsewhere, hardware is often cheaper than programmers, so we tend to go as far as we can by taking this approach - until it no longer works. Perhaps a more efficient algorithm can be implemented to lower the marginal cost of scaling.
Unfortunately there comes a point whereby traditional algorithmic optimizations fail to change the equation enough to manage your costs. At this point you may need to cheat to scale. Knowing when to approximate requires an understanding of the costs of doing so. If your system is responsible for proprioception in an automated x-ray system, then getting it wrong is to be avoided at all possible costs. Everyone would like to be as accurate as possible, but this is not always cost efficient. If you are running an analytics system, being wrong costs less than you might think. Any system that produces statistics based on sampled observations has room for approximate solutions. Programmers tend to forget that such statistics contain error. Programmers tend to think that if their code is free of logical flaws, then the output must be error free. If you observe 12 users clicking on an ad that was displayed 1,000 times then the click through rate must be 1.2%. Click through rate, in this case, is a statistic - a way of summarizing raw data. However, click through rate is also a measure of some innate clickability that is driven by the ad, its context and its viewers. Note that we can't possibly measure these abstract quantities, so we use our observed behavior of 12 clicks from 1,000 ads as a guide to navigating the underlying complex abstract system of interactions that determine the clickability of an ad. As soon as we produce a report that states that click through rates are 1.2% on all Fridays then we are implicitly giving the caveat of 'based on all Fridays we have seen.' But we must admit some error as soon as anyone tries to generalize about all future Fridays from that one statistic. A statistic used as an estimate must come with an acknowledgement that without seeing the entire population there will always be the chance of being wrong.
A correct software system coupled with perfect data capture ensures that the sample statistic will be correct, but there will always be a chance for error in its estimates. In evaluating the cost of deviating to algorithms that are only probabilistically correct or involve some form of approximation, we must concentrate on the increased cost caused by a possible increase in the rate or nature of errors. There are a number of things to consider when determining the amount of error introduced when producing estimates, but all methods involve two numbers, a confidence interval and a confidence level. These will depend on what you are estimating and how you go about it. Estimating a population maximum behaves differently in the presence of approximation than does the estimation of the mean or median. Approximating by taking every second sample may have a different impact than only looking at the first half of a sample. To begin, one needs to look at the baseline error in the estimate as produced by an algorithm that uses no approximation. We may find that our estimate of Friday click through rate of 1.2% has a confidence interval of +/- 0.2% at a confidence level of 95%. This tells us that 95% of the time, when we try to estimate click through rate, it will fall between 1.0% and 1.4%. We then examine how changing the method of measurement by introducing approximations will change the confidence interval for a given confidence level. From this point we compare the marginal cost of error against the difference between the cost of optimization by approximation and the cost of scaling with the current approach. Approximation may change the confidence interval from +/- 0.2% to +/- 0.3%. How much this costs you is a question of risk management and depends on the economics of the business that you are in. Click through rates might be used to tweak an advertising budget.
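To make the baseline-error step concrete, here is a minimal sketch of a 95% confidence interval for a click through rate. It assumes the normal (Wald) approximation to the binomial, and the function name and the "every second impression" outcome of 6 clicks from 500 are made up for illustration; only the 12 clicks from 1,000 impressions come from the example above.

```python
import math

def ctr_confidence_interval(clicks, impressions, z=1.96):
    """Return (rate, half_width) for a click through rate.

    Uses the normal (Wald) approximation to the binomial, which is an
    assumption of this sketch; z=1.96 corresponds to a 95% confidence level.
    """
    rate = clicks / impressions
    standard_error = math.sqrt(rate * (1.0 - rate) / impressions)
    return rate, z * standard_error

# Full data: 12 clicks from 1,000 impressions.
full_rate, full_half_width = ctr_confidence_interval(12, 1000)

# Approximation by keeping every second impression (hypothetical outcome:
# 6 clicks from 500 impressions). Same rate, wider interval.
half_rate, half_half_width = ctr_confidence_interval(6, 500)

print(f"full:    {full_rate:.2%} +/- {full_half_width:.2%}")
print(f"sampled: {half_rate:.2%} +/- {half_half_width:.2%}")
```

Under these assumptions, halving the sample at the same observed rate widens the interval by a factor of the square root of two. With only 1,000 impressions the interval also comes out wider than the +/- 0.2% used in the illustration, which would take considerably more observations to reach.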
If you are buying ads on a cost per click basis, then the cost of the increased uncertainty in click through rate will be determined by the cost of each click.

If you want to see things through my eyes, for just a moment, stick around. If you scroll ahead it may look a little scary to the mathematically uninitiated, but we won’t be using anything more complicated than mid-range high school math to get there. And I think the journey is worthwhile. To understand what i2π is all about we really have to start with Euler’s number, e. Euler’s number is a special number in mathematics and it appears in many equations across mathematics, physics, statistics and economics because it has a number of unique properties. My favourite property is a variation on what is known as Euler’s identity:

$$e^{i2\pi} = 1$$

And that is why we are starting our i2π journey with e. My next favourite property of e is the relationship between $e^x$ and its derivative, namely

$$\frac{d}{dx} e^x = e^x$$

Those who have taken high-school calculus take this for granted. Those who didn’t or who have forgotten it are probably scratching their heads. For their sake, and my own, let’s quickly review derivatives. Derivatives are a convenient way of describing the slope of a line. Take the equation for the line y = 2x, then for each unit increase in x, we get 2 units of increase in y. The slope of a line is the ratio of the change in the output of the function that describes it to the change in the input. In this case, x is the input and y is the output, and increasing x by 1 increases y by 2, so the slope is 2:1 or simply 2. The notation $\frac{d}{dx} f(x)$ describes the slope of some function f(x) as x changes. The equation y = 2x has the same slope for all values of x, so we say the slope is a constant. More complicated functions, like $f(x) = x^2$ (in the figure below) are curved, so the slope changes as we change x. Without going too deep into calculus, it is known that $\frac{d}{dx} x^2 = 2x$. In other words, for any point x, the slope is 2x.
If you look in the figure above, when x = 0, the slope is 0. As x increases to the right of zero, the slope gets steeper and steeper. To the left, as x becomes increasingly smaller than zero, the slope becomes increasingly negative. In fact, for any k

$$\frac{d}{dx} x^k = k x^{k-1}$$

For example, $\frac{d}{dx} x^7 = 7x^6$. When you take the derivative (find the slope) of most functions, the answer is usually some modified form of the function you started with. However, the exponential function, $e^x$, is the simplest example of the case where the derivative is equal to the function. Up to this point, we haven’t even worked out what the numerical value of e is, but let us try to define e by starting with the fact that $\frac{d}{dx} e^x = e^x$. To do this, we will need to take a detour into the world of factorials.

Imagine that we have 5 books that we want to place on a shelf. How many different ways can we arrange them? Working methodically, from left to right, there are 5 possible books we could put in the leftmost spot on the shelf. Once we choose the first book to place there, we are left with 4 possible books to place alongside it. And once we choose that book, 3 books remain. After placing the third book, only 2 more remain, and so on. This means there must be 5 × 4 × 3 × 2 × 1 = 120 ways of arranging those 5 books. If we had 100 books, it would take up too much paper to write down 100 × 99 × 98 × ⋯ × 1, so we use the shorthand 100!, which we pronounce ’100 factorial’. One hundred factorial is a big number. If we built a book arranging robot that could do one billion arrangements per second, and had one billion of them running in parallel since the big bang, they still wouldn’t be finished trying all the possible ways to arrange the books. Thank god for the Dewey Decimal system, eh? Ok. So factorials are just a shorthand way of writing down a special type of successive multiplication.
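For the skeptical, the book counting and the robot claim can be checked in a few lines. This is only a back-of-the-envelope sketch: the age of the universe (about 13.8 billion years) and the number of seconds in a year are my assumptions, not figures from the post.

```python
import math

# Number of ways to arrange 5 books on a shelf: 5 x 4 x 3 x 2 x 1.
print(math.factorial(5))

# Rough check of the robot claim. Assumptions of this sketch: the universe
# is about 13.8 billion years old and a year has about 3.156e7 seconds.
seconds_since_big_bang = 13.8e9 * 3.156e7
arrangements_tried = 1e9 * 1e9 * seconds_since_big_bang  # robots x rate x time
fraction_done = arrangements_tried / math.factorial(100)

print(f"{fraction_done:.1e} of all arrangements tried")
```

Under those assumptions the robots would have covered a vanishingly small fraction of the 100! arrangements.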
We can use the formula for the slope of $x^k$ to find the slope of

$$\frac{x^k}{k!}$$

We can re-arrange this to be

$$\frac{1}{k!} \times x^k$$

If you multiply some function by a constant, C, then the slope is also multiplied by C. From the rule we saw 2 paragraphs ago, we know that

$$\frac{d}{dx} \frac{x^k}{k!} = \frac{1}{k!} \times k x^{k-1} = \frac{k x^{k-1}}{k!}$$

We also know that k! = k × (k - 1)!. For example, 6! = 6 × 5! = 6 × 5 × 4 × ⋯ × 1. So

$$\frac{d}{dx} \frac{x^k}{k!} = \frac{k x^{k-1}}{k \times (k-1)!}$$

Cancel out the k in the numerator and denominator, and we get

$$\frac{d}{dx} \frac{x^k}{k!} = \frac{x^{k-1}}{(k-1)!}$$

Now let’s tackle a slightly more complicated function,

$$g(x) = \sum_{k=0}^{\infty} \frac{x^k}{k!} = \frac{x^0}{0!} + \frac{x^1}{1!} + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots$$

The first 2 terms of this function are pretty simple. Recall that factorials count the number of ways of arranging objects. There is only one way to arrange zero books, so 0! = 1. Also recall from math class that $x^0 = 1$, so

$$\frac{x^0}{0!} = 1$$

There is only one way of arranging one book and $x^1 = x$, so

$$\frac{x^1}{1!} = x$$

Therefore

$$g(x) = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots$$

I know this comes out of nowhere, but just go with the flow. What is the derivative of this function? Derivatives are additive so we can just do each bit individually and add them together. The derivative of 1 is 0, as it is a flat line - hence no slope. The derivative of x is 1, as it is a line with a 45 degree slope. And we just worked out the rest in the previous paragraph, so

$$\frac{d}{dx} g(x) = 0 + 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots$$

Hey, wait up. If we drop the leading 0, which we can, then that’s just g(x) again. To see this requires some majorly deep and majorly simple insight. When things are deep AND simple, they are beautiful. What just happened was that even though the first term of the sum disappeared by turning into zero, the rest of the sum remained. Because we defined g(x) to go on to infinity, for each term that drops off the front, there are an infinite number of terms to make up for it. Infinity less one is still infinity. Recall that we defined $e^x$ to have the property that

$$\frac{d}{dx} e^x = e^x$$

and we now have a function where

$$\frac{d}{dx} g(x) = g(x)$$

These properties are the same. The function has itself as its derivative. And it just so happens that g(x) is $e^x$. To understand why, we need to look at Taylor’s Theorem. So point your browser to Wikipedia if you are so inclined and join us in the next paragraph to continue. Great.
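While you are away, the claim that g(x) is its own slope and matches e to the power x can also be checked numerically. This is a minimal sketch that reads g(x) as the sum of x^k / k! over k = 0, 1, 2, ...; cutting the sum off at 30 terms and the central-difference step size are my choices, not part of the argument.

```python
import math

def g(x, terms=30):
    """Partial sum of x**k / k! for k = 0 .. terms-1."""
    return sum(x**k / math.factorial(k) for k in range(terms))

def slope(f, x, h=1e-6):
    """Numerical slope of f at x via a central difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Columns: x, g(x), numerical slope of g at x, math.exp(x).
for x in (0.0, 1.0, 2.5):
    print(x, g(x), slope(g, x), math.exp(x))
```

The three right-hand columns should agree to several decimal places, which is the numerical version of "the function has itself as its derivative."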
Welcome back. The point of this post is to explain i2π and so far we have only covered e, so let’s get a move on and have a look at i and π. I assume everyone is cool with 2.

What is i? i is the square root of negative one. So, $i^2 = -1$. When I first encountered i I asked a family friend / math professor to explain it to me. All the books I had read just talked about ’complex numbers’ and I wanted to understand what made them ’complex.’ She explained to me that they aren’t complex, in the sense that complex means difficult. They are just different to the normal numbers we usually encounter. In school you would have learned that the square root of negative numbers is undefined. But they turn up so frequently that man invented a new class of numbers to allow us to define them. Numbers are just symbols for the abstract concept of quantity. And i is just a symbol for the square root of negative one. While it is not possible to have i books or bananas, we can still do mathematics with i and end up with real world numbers. For example,

$$i \times i = i^2 = -1$$

And $i^4 = i^2 \times i^2 = -1 \times -1 = 1$. So while we can’t buy i bananas, we can buy $i^4$ bananas, because $i^4$ is 1. As you keep on raising i to higher and higher powers, a pattern emerges. $i^1 = i$, $i^2 = -1$, $i^3 = i^2 \times i = -i$ and $i^4 = 1$. When we look at $i^5$ we find $i^5 = i^4 \times i = 1 \times i = i$, and the pattern repeats. For no apparent reason, let’s sum up all the powers of i:

$$i^0 + i^1 + i^2 + i^3 + i^4 + i^5 + \cdots = 1 + i - 1 - i + 1 + i - \cdots$$

To see the pattern more clearly, let’s split up the odd and even terms

$$(i^0 + i^2 + i^4 + \cdots) + (i^1 + i^3 + i^5 + \cdots) = (1 - 1 + 1 - \cdots) + i \times (1 - 1 + 1 - \cdots)$$

Out of sheer curiosity, let’s find out what pattern we would get if we expanded $e^{ix}$ using the formula we found before.

$$e^{ix} = \sum_{k=0}^{\infty} \frac{(ix)^k}{k!} = 1 + ix + \frac{(ix)^2}{2!} + \frac{(ix)^3}{3!} + \frac{(ix)^4}{4!} + \cdots$$

We know that the pattern in the i’s comes out nicely if we split out the evens and odds, so let’s call the even part of the right hand side of the equation a(ix) and b(ix) for the odds:

$$a(ix) = 1 + \frac{(ix)^2}{2!} + \frac{(ix)^4}{4!} + \cdots \qquad b(ix) = ix + \frac{(ix)^3}{3!} + \frac{(ix)^5}{5!} + \cdots$$

Now using the -1 + 1 - 1 + 1 - ⋯ pattern, we get

$$a(ix) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \cdots$$

Likewise for the odd terms,

$$b(ix) = i \times \left( x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots \right)$$

Now when you went across to Wikipedia to check out Taylor’s Theorem, you will have seen that

$$\cos(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \cdots$$

Which is totally the same as our a(ix).
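If you would rather see that than take Wikipedia's word for it, here is a small numerical sketch of the even/odd split. The function names, the cut-off of 20 terms per part and the test point x = 0.75 are arbitrary choices of mine.

```python
import cmath
import math

def even_terms(x, terms=20):
    """a(ix): sum of (ix)**k / k! over even k."""
    return sum((1j * x) ** k / math.factorial(k) for k in range(0, 2 * terms, 2))

def odd_terms(x, terms=20):
    """b(ix): sum of (ix)**k / k! over odd k."""
    return sum((1j * x) ** k / math.factorial(k) for k in range(1, 2 * terms, 2))

x = 0.75
# The even part should match cos(x), with no imaginary component left over.
print(even_terms(x), math.cos(x))
# Even and odd parts together should rebuild e**(ix).
print(even_terms(x) + odd_terms(x), cmath.exp(1j * x))
```

Each line prints the series value next to the library value so the two can be compared by eye.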
I may not be hip and fresh with the rock'n'roll and skateboarding, but I know cool when I see it, and that is cool. We also know that

$$\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots$$

Which means that b(ix) = i × sin(x). Putting a and b together, we find that

$$e^{ix} = \cos(x) + i \sin(x)$$

Now that we have gone from summing polynomials to trigonometry, it may be coming clear where the π fits in. π is a special number that defines the ratio of a circle’s circumference to its diameter. If a circle has a diameter of one furlong, then its circumference will be π furlongs. π is also used to measure angles, the same way as degrees are. In a circle there are 360 degrees, but mathematicians like to say that a circle has 2π radians. That is, if you have a circle with a radius of 1 foot, then the circumference will be 2π feet. If you were to walk around this circle through one eighth of its circumference, you will have moved 45 degrees, or $\frac{\pi}{4}$ radians. The cosine and sine take an angle in radians and tell you the x and y coordinates of that point on a circle. If you walk 45 degrees anti-clockwise around a circle starting from the point (1,0), then you will end up at position

$$\left( \cos\frac{\pi}{4}, \sin\frac{\pi}{4} \right) = \left( \frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}} \right)$$

If you walk 2π radians around the circle, you will have gone 360 degrees and end up where you started. So $(\cos 2\pi, \sin 2\pi) = (1,0)$. So the function $e^{i\omega}$ tells us the coordinates of where one ends up after walking ω radians around a circle of radius 1. We have been writing down our coordinates as (x,y), where x = cos(ω) and y = sin(ω), but as we found out earlier,

$$e^{i\omega} = \cos(\omega) + i \sin(\omega)$$

If we think of i as being a symbol to represent our distance in the y direction, then we can convert from x + iy to (x,y). And if we walk around a full circle and end up at the beginning point of (1,0), we can convert back to 1 + i × 0 = 1. Therefore

$$e^{i2\pi} = 1$$

I like i2π as this term comes up across a range of mathematical equations and you can go a long way in learning about mathematics by looking into the origins of this famous identity. Simplicity, depth & beauty.

Right now, the word 'twitter' has been mentioned 1.54x more frequently than the word 'love.'
I use a highly accurate Markov-Chain Monte-Carlo technique to arrive at this unbiased estimate. My Gibbs sampler (unrelated to melting snow with salt) is coded in Erlang with the front end dynamically generated by a Scala program that writes out Ruby on Rails code. It is hosted on EC2. And uses map reduce. Enjoy

# Gibbs Free Economics

## 22 Dec 2008, 4:56 PM

This weekend marked the first time I shoveled snow. To cope with the odd sensation of being very cold while being very hot & sweaty, I thought about the mechanism by which salt melts ice. It has been a while since I studied physics and chemistry, but I had a basic notion of what was going on. I don't want to give it away early on in this post, so I'll share with you the chronology of my process of trying to confirm my understanding of why salt melts ice. I started off with a simple Google search for "salt snow." The top link was to a page on About.com (as often is the case.) And the article was totally useless (as often is the case.) The page gave me some information, specifically it told me:

Freezing point depression is a colligative property of water. A colligative property is one which depends on the number of particles in a substance. All liquid solvents with dissolved particles (solutes) demonstrate colligative properties. Other colligative properties include boiling point elevation, vapor pressure lowering, and osmotic pressure.

All true. But no knowledge is contained within that paragraph. I followed some links and came across another page trying to explain the mechanism. Coupled with the animation, the crux of the argument is presented in the paragraph:

Adding salt to the system will also disrupt the equilibrium. Consider replacing some of the water molecules with molecules of some other substance. The foreign molecules dissolve in the water, but do not pack easily into the array of molecules in the solid. Try hitting the "Add Solute" button in the animation above.
Notice that there are fewer water molecules on the liquid side because some of the water has been replaced by salt. The total number of waters captured by the ice per second goes down, so the rate of freezing goes down. The rate of melting is unchanged by the presence of the foreign material, so melting occurs faster than freezing.

Here we get a better sense of why - namely it has something to do with equilibrium between the solid and liquid phases of water. They don't mention it in the paragraph, nor does the animation give any indication, but the solid phase of water (ice) is a crystal. Water molecules share electrons between oxygen and hydrogen, forming H2O. In crystalline form, these molecules align to form an organized structure. Wikipedia starts heading us along an interesting route, as a full third of the discussion of Ice Ih (common ice) is dedicated to Proton Disorder:

The protons (hydrogen atoms) in the crystal lattice lie very nearly along the hydrogen bonds, and in such a way that each water molecule is preserved. This means that each oxygen atom in the lattice has two protons adjacent to it, and about 101 pm along the 275 pm length of the bond. The crystal lattice allows a substantial amount of disorder in the positions of the protons frozen into the structure as it cools to absolute zero. As a result, the crystal structure contains some residual entropy inherent to the lattice and determined by the number of possible configurations of proton positions which can be formed while still maintaining the requirement for each oxygen atom to have only two protons in closest proximity, and each H-bond joining two oxygen atoms having only one proton. This residual entropy $S_0$ is equal to 3.5 J mol$^{-1}$ K$^{-1}$. There are various ways of approximating this number from first principles. Assuming a given N water molecules each has 6 possible arrangements, this yields $6^N$ possible combinations.
Given random orientations of molecules, a given bond will have only a ½ chance that it has exactly one proton, or in other words, each molecule has a ¼ chance that its protons lie on bonds containing exactly one proton, leaving a total number of $(3/2)^N$ possible valid combinations. Using Boltzmann's principle, we find that $S_0 = Nk\ln(3/2)$, where k is Boltzmann's Constant, which yields a value of 3.37 J mol$^{-1}$ K$^{-1}$, a value very close to the measured value. More complex methods can be employed to better approximate the exact number of possible configurations, and achieve results closer to measured values.

Hmmm. Entropy.. Equilibrium.. Sounds like Thermodynamics. Some more clicking gets us to an explanation in terms of Gibbs Free energy. The answer given on the page basically states that by the first and second laws of thermodynamics, salt melts ice. Why? Well, by the second law, entropy wants to increase and salty water has greater entropy than ice so it is a lower energy state, and for energy to be conserved (the first law), ice will turn into water to maintain the total energy of the system. This is certainly a more satisfactory answer than 'because water has colligative properties.' But why does salty water have greater entropy than pure water? What is entropy? To understand this, we have to think at the level of statistical mechanics. Entropy is a convenient probabilistic measure of complicated stochastic processes. Entropy (in the thermodynamic sense) is a measure of disorder in a system. As we saw from the description of proton disorder in ice, entropy measures the number of different ways the components of a system can be arranged to form different structures. Without diverging too far into a discussion of urns filled with different numbers of red and blue balls, it is easy to see that for ice to be ice there are only a relatively small number of ways by which the hydrogen and oxygen atoms of water can be arranged with respect to each other.
However, in liquid water, there are many many, but not an uncountably infinite number of, ways by which you can arrange the water molecules and still have liquid water. The key thing about the role of entropy in melting snow is that each water molecule is indistinguishable from another. Thus, when considering the number of states (or arrangements) that would result in liquid water, swapping two molecules of water would not count as a new state. As you add a solute such as salt to the liquid, the number of distinguishable states increases even as the total number of molecules in the system remains constant. We began this search with a proof by definition, namely that salt melts ice because the fact that salt melts ice is a colligative property of water. We then found that salt melts ice because it minimizes the Gibbs free energy of the system. We then found out that this was due to increasing entropy. Along the way, I may have inadvertently introduced a proof by obfuscation. Of course, nothing I have said during my hand-waving explains why salt melts ice. But we know that it does. The thermodynamic explanation relies on a model of the underlying statistical mechanics - entropy being a descriptive statistic of an underlying complex system. And the underlying statistical mechanics is just a model of the statistical quantum dynamics, and so forth until turtles. Salt doesn't always melt ice, it is just very likely to do so. Much of microeconomics follows a similar path of evolving explanations, but the predictive power of the results depends on the quality of the model of the individual agents. Water 'wants' to maximize entropy and conserve energy, but not as much as people want to maximize a simple model of utility. It would be remiss of me to end this Wikipedia link fest without mentioning Shannon entropy from information theory.
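As a footnote to the proton disorder passage quoted from Wikipedia, the 3.37 figure is easy to reproduce. This is a minimal sketch of that estimate for one mole of ice, where N times Boltzmann's constant is the gas constant R; the constant's value is the standard one, and the function name is mine.

```python
import math

# Gas constant R = N_A * k in J mol^-1 K^-1 (standard value, rounded).
GAS_CONSTANT = 8.314462618

def residual_entropy_per_mole(configurations_per_molecule):
    """Boltzmann's principle for one mole: S0 = R * ln(W per molecule)."""
    return GAS_CONSTANT * math.log(configurations_per_molecule)

# 6 orientations per molecule, each valid with probability 1/4, gives 3/2.
pauling_estimate = residual_entropy_per_mole(6 * 0.25)
print(f"{pauling_estimate:.2f} J/(mol K)")
```

The result lands on the 3.37 quoted above, a little short of the measured 3.5.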
# Proof

## 22 Dec 2008, 3:10 PM

Via God Plays Dice I came across a great list of mathematical proof techniques, of which my favorite is probably Proof by Induction:

Proof by Induction

Proof by Induction claims that

$$\mathcal{E} = -\frac{d\Phi}{dt}$$

where $\Phi$ is the number of pages used to contain the proof and $t$ is the time required to prove something, relative to the trivial case. For the common, but special case of generalising the proof,

$$\mathcal{E} = -N\frac{d\Phi}{dt}$$

where $\Phi$ is the number of pages used to contain the proof, $N$ is the number of things which are being proved and $t$ is the time required to prove something, relative to the trivial case. The actual method of constructing the proof is irrelevant.

Great jokes don't need explanation, but I just can't help myself. The equation listed above is actually Faraday's law of induction, which describes how a changing magnetic field induces an electrical field. Cf. mathematical induction, which is a method of mathematical proof typically used to establish that a given statement is true of all natural numbers. It is done by proving that the first statement in the infinite sequence of statements is true, and then proving that if any one statement in the infinite sequence of statements is true, then so is the next one. Mathematical induction is often misunderstood by those trying to apply it, often resulting in what Uncyclopedia calls Engineer's Induction:

Suppose P(n) is a statement.

1. Prove true for P(1).
2. Prove true for P(2).
3. Prove true for P(3).
4. Therefore P(n) is true for all $n$.

And to snuff out any life remaining in the joke, the technique above only proves P(n) for n = 1, 2 & 3. To make it a real proof you would need to show that if P(n) is true for some n then it must be true for n+1, which coupled with the proof for P(1) would show that it was then true for all natural n.

# Hold your horses

## 17 Dec 2008, 2:09 PM

I like Felix Salmon.
He's a pretty awesome finance and economics blogger and on the few occasions I've spoken with him, he seems like a genuinely nice guy. But a recent post left me scratching my head:

John Gapper worries that taxing non-diet sodas would be regressive. I worry that it wouldn't even do what it is designed to do, which is reduce obesity: all it would do is increase the amount of diet sodas consumed, and there's a strong link between diet-soda consumption and increased obesity. Unless and until there's some empirical evidence that switching from non-diet to diet sodas helps people lose weight, this should be filed under "very bad ideas".

If you follow that "strong link" link, you'll see:

Fowler is quick to note that a study of this kind does not prove that diet soda causes obesity. More likely, she says, it shows that something linked to diet soda drinking is also linked to obesity. "One possible part of the explanation is that people who see they are beginning to gain weight may be more likely to switch from regular to diet soda," Fowler suggests. "But despite their switching, their weight may continue to grow for other reasons. So diet soft-drink use is a marker for overweight and obesity."

Right. Correlation is not causation. In fact, if the elasticity of demand for soda is not correlated with propensity for obesity, then a Pigovian tax on soda would make for a great instrumental variable to see whether diet soda has any causal relationship with obesity. Today's xkcd comic is fantastic. Check it out.

# Wesabe + Perl + PostgreSQL

## 16 Dec 2008, 10:56 PM

I was looking for some code that would allow me to automagically populate my database with transactions from Wesabe. It needed to work with Postgres, and Perl would be a decent language for the job. I found it here. Enjoy.
# Morons

## 16 Dec 2008, 8:41 PM

It is good to know that $1,793 power cables are useful even without replacing the miles of standard copper between the power station and your house.

George,

Shortly after receiving two 5ft Quadlink power cables, I dragged them to my office. I am employed in a 24/7 engineering support environment and thought my "always on" computer would be a good starting point to burn in the cables. Working 12 hour night shifts I have grown accustomed to eye fatigue while working with a computer. Several busy days later (currently on my 9th straight 12 hour night) other priorities and a busy work schedule had caused me to forget about the cables burning-in beneath my desk. Yesterday I noted how uncharacteristically good my eyes felt. Today, just an hour ago, a coincidental look at my feet put me face to face with the Quadlink.

I have tried to identify another cause for this remarkable change but am unable to attribute the effect to anything but the Quadlink power cable. I trust you'll find that satisfying.

Needless to say I intend to purchase a replacement cable for my home audio use as soon as possible - I don't think this one will be leaving my computer monitor anytime soon. I am anxious to loan my cables to others in the office who have complained of similar eye fatigue problems.

See here. And then here.

WTF: Where does a night-shift support desk worker get the $ to spend on such crap?

# Benford's Law

## 14 Dec 2008, 7:28 PM

A few weeks ago at lunch we were discussing forensic accounting and Benford's law came up. I've always been a little uncomfortable with the law, which stipulates that the distribution of leading digits in 'natural' measurements follows a non-uniform distribution. I believed the law but never had a good grasp of why. In undergrad I recall proving the existence of the law by starting with the assumption that if such a law exists, then it must be scale invariant. That is, if there is some law that describes the way leading digits of measurements are distributed, then it shouldn't matter whether you measure the underlying quantity in Celsius or Fahrenheit; that is, it should remain after any linear transformation. From that point it's pretty simple to derive the distribution. But it was a fairly weaselly way to answer the question, as I started with the assumption that the law must exist. Mathematical proofs frequently start with the assumption that the converse is true and then show that if the converse were true it would be nonsensical, hence the converse of the converse must be true - this is called proof by contradiction. Proof by assumption of truth is not really proof. Either way, my 'proof' was accepted and my discomfort with the law remained until today. Last night I was browsing through Hacker's Delight, a compendium of neat coding tricks that would get you fired at most workplaces with the charge of being too clever and abstruse. And in my reading I came across Benford's law and decided that today I would finally take the time to grok it. As has become my default browsing behavior, I started at Wikipedia and came across a description and picture that finally let the pieces click.
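For reference, the distribution in question gives a leading digit d a probability of log10(1 + 1/d). The sketch below compares that with synthetic data and then rescales the data to check scale invariance. The data here is made up, drawn so that its logarithm is uniform over six decades (which is an assumption that guarantees the result), and the metres-to-feet factor is just an example of multiplication by a constant.

```python
import math
import random
from collections import Counter

def benford_probability(digit):
    """Probability that the leading digit is `digit` under Benford's law."""
    return math.log10(1 + 1 / digit)

def leading_digit(value):
    """First significant digit of a positive number, read off its
    scientific notation (which rounds to six decimal places)."""
    return int(f"{value:e}"[0])

def leading_digit_frequencies(values):
    """Observed share of each leading digit 1..9."""
    counts = Counter(leading_digit(v) for v in values)
    return {d: counts[d] / len(values) for d in range(1, 10)}

# Synthetic data whose logarithm is uniformly distributed over six decades.
rng = random.Random(0)
values = [10 ** rng.uniform(0, 6) for _ in range(100_000)]

original = leading_digit_frequencies(values)
rescaled = leading_digit_frequencies([v * 3.28084 for v in values])  # e.g. metres to feet

# Columns: digit, Benford prediction, observed, observed after rescaling.
for d in range(1, 10):
    print(d, f"{benford_probability(d):.3f}", f"{original[d]:.3f}", f"{rescaled[d]:.3f}")
```

The three columns should track each other closely, with a leading 1 turning up about 30% of the time before and after the change of units.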
I also got to feel a little bit better about the shame that I have carried from my undergrad 'proof', as they include my same argument as evidence of the law:

"The law can alternatively be explained by the fact that, if it is indeed true that the first digits have a particular distribution, it must be independent of the measuring units used. For example, this means that if one converts from e.g. feet to yards (multiplication by a constant), the distribution must be unchanged -- it is scale invariant, and the only distribution that fits this is one whose logarithm is uniformly distributed."

To neatly close off the issue, I logged into a database at the office and grabbed 123 million financial values taken from around 65 thousand companies from around the world. It is nice having access to this kinda data. And lo and behold the distribution of leading digits follows the law perfectly.

However, the sizzle on the steak is that they take advantage of network effects in tagging your transactions. For example, when one user tags a cryptic line item such as '238927 12/04SOU EQUIN' as 'Gym Membership', all future transactions across all user accounts will benefit. That said, I still find most of my time using these sites (I have accounts on both) spent tweaking their automatic suggestions for tags. However I understand that me spending 5 minutes correcting their 10% classification error rate is better than spending an hour hand-classifying every transaction [1].

I also take issue with some of the interface design choices, especially on Mint, whereby quick searching and sorting is prevented by their excessive use of 'pretty' transitions and AJAX style effects. I usually keep my mouth shut about UI issues, but given that I've spent the last week designing a visualization app, I feel suitably qualified.

But my biggest gripe is one I raised both with the founder of Wesabe and on this blog at one time. I can't seem to find my original post on the topic, as it was lost when I went through the Blogger to WordPress switch - I'll try and extract it later. But anyway, my point was that I was disheartened by the lack of double entry book-keeping. This problem manifested itself on Wesabe when it congratulated me for earning $5k when in reality I had actually just paid off a $5k balance on my credit card - net economic impact of far less than $5k. This annoys me.

This morning I was doing my semi-regular perusal of my accounts and came across a balance advance from a line of credit and I was curious as to where the money went. It is a royal pain to do arbitrary searches with either of the sites, so I opened up my SQL clone of their site that I scrape on a regular basis. From there it was easy to find all transactions within a time window that matched the dollar amount of the balance advance. This is a non-GAAP technique, but even for the 8 accounts my wife and I have, joining by transaction timestamp, amount and debit/credit is a pretty accurate way to build a financial journal - in the accounting 101 sense of the word 'journal.'
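The matching described above can be sketched as a self-join. This is a rough illustration only: the `txn` table, its columns, the sample rows and the 3-day window are all invented here, and SQLite is used purely so the example is self-contained. 'debit' is taken to mean money leaving an account and 'credit' money arriving.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE txn (
    txn_id       INTEGER PRIMARY KEY,
    account      TEXT NOT NULL,
    posted_on    TEXT NOT NULL,      -- ISO date, e.g. 2008-11-03
    amount_cents INTEGER NOT NULL,   -- always positive
    direction    TEXT NOT NULL CHECK (direction IN ('debit', 'credit'))
);
-- Hypothetical sample data.
INSERT INTO txn (account, posted_on, amount_cents, direction) VALUES
    ('line_of_credit', '2008-11-03', 500000, 'debit'),
    ('checking',       '2008-11-04', 500000, 'credit'),
    ('checking',       '2008-11-10',   4250, 'debit'),
    ('checking',       '2008-11-11', 500000, 'debit'),
    ('credit_card',    '2008-11-12', 500000, 'credit');
""")

def find_transfers(conn, window_days=3):
    """Pair each debit with a same-amount credit in another account
    posted within window_days, giving rough journal entries."""
    sql = """
        SELECT d.account, c.account, d.amount_cents
        FROM txn AS d
        JOIN txn AS c
          ON c.amount_cents = d.amount_cents
         AND d.direction = 'debit'
         AND c.direction = 'credit'
         AND c.account <> d.account
         AND ABS(julianday(c.posted_on) - julianday(d.posted_on)) <= :window_days
        ORDER BY d.posted_on, c.posted_on
    """
    return conn.execute(sql, {"window_days": window_days}).fetchall()

for from_account, to_account, cents in find_transfers(conn):
    print(f"{from_account} -> {to_account}: ${cents / 100:,.2f}")
```

On real data the heuristic would presumably produce some false pairs (two unrelated $20 transactions on the same day), so a tighter window or a manual review step may be needed.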

I don't know why these services don't offer this feature. It is easy to see useful extensions that bring much more color to our financial activity, in the same way that session management gives you oodles more insight into online marketing transactions than simply looking at individual requests in a web server log file. Now maybe I'm the only one nerdy enough to specifically request double entry accounting, but in these times everyone wants to know more about their financial data. Couple this data with logs from my cell phone bill, cable service and Netflix, and we'd soon be able to fairly accurately recognize the rationale behind each transaction with our banks. If only there was some kind of online Vault service that could own this data for me...

## Mortgage Risks

Two main risks concern those bankers who originate mortgages: prepayment risk and default risk. Prepayment risk is the chance that during a decline in interest rates, borrowers will pre-pay their mortgages as they refinance at a lower rate. This hurts lenders as they originated the mortgage expecting 30 years of juicy interest payments. If a borrower pre-pays then the lender receives the principal balance, and possibly some fee income, but loses out on the opportunity for future interest income.

Default risk is the chance that a borrower will just stop paying altogether. Lenders would compensate themselves for this risk by looking at the credit rating of the borrower and charging a higher rate for those who show quantitative signs of default risk – as determined by their credit score, income and debt level and the size of the loan relative to the value of the property.

Or at least that is how the models work. At a higher level all risks can be placed on a spectrum from the idiosyncratic to the systemic. Idiosyncratic risks are those that arise from situations that only impact the individual borrower, such as falling ill and having medical bills that take priority over mortgage payments. Systemic risks impact the mortgage market as a whole, such as volatility in interest rates. The whole idea of pooling together mortgages is that idiosyncratic risks are uncorrelated. Joe in Montana getting hit by a car is a statistically independent event from June in Miami getting fired. By pooling together these uncorrelated risks, the portfolio of loans has a lower idiosyncratic risk than any individual loan.
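To put a rough number on the pooling argument: if each loan defaults independently with probability p, the fraction of a pool of n loans that defaults has standard deviation sqrt(p(1 - p) / n). The sketch below checks that against a small simulation. The 5% default probability and the pool size are made-up figures, and independence is an assumption that does all the work here.

```python
import math
import random
import statistics

def pool_default_sd(p, n):
    """Std dev of the fraction of loans defaulting, for n independent loans."""
    return math.sqrt(p * (1 - p) / n)

def simulate_pool(p, n, trials, rng):
    """Simulated default fractions for `trials` pools of n independent loans."""
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(trials)]

P_DEFAULT = 0.05   # hypothetical per-loan default probability
POOL_SIZE = 1000   # hypothetical number of loans in the pool

rng = random.Random(0)  # seeded so the run is repeatable
fractions = simulate_pool(P_DEFAULT, POOL_SIZE, trials=500, rng=rng)
simulated_sd = statistics.pstdev(fractions)

print("single loan sd:   ", round(pool_default_sd(P_DEFAULT, 1), 4))
print("pool sd (formula):", round(pool_default_sd(P_DEFAULT, POOL_SIZE), 4))
print("pool sd (sim):    ", round(simulated_sd, 4))
```

The spread shrinks with the square root of the pool size, but only while the defaults really are uncorrelated.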

The cash flows generated by these pools of mortgages could then be segregated into tranches with different guarantees as to their payment priority. If there were, say, 1,000 loans in the portfolio, each paying $1,000 per month, then the total monthly income would be $1m. The highest priority tranche would be set up so that each month $800k of the $1m would be paid out to the holders of the top tranche. Thus it would require more than 20% of the individual loans to default before the regular flow of $800k per month would be disrupted. Subsequent tranches would have higher risks of defaulting.

This structure meant that the top tranches would end up with a far higher credit rating than any of the individual mortgages, or even a simple pool of mortgages. Because the top tranches were highly rated, institutions that traditionally were unable to buy assets as risky as mortgages could now enjoy the benefits of higher interest rates on their money whilst maintaining a portfolio of highly rated debt. Consequently these products were in high demand. Couple this with the desire to allow more Americans to fulfill the dream of home ownership and an environment was created that allowed more and more people to attain mortgages.

But somewhere along the line the risk-reward relationship that sits at the cornerstone of capitalism became dislodged.

## Brokers + Regulatory Arbitrage

First there were the mortgage brokers. As poorly trained and lightly regulated commission sales agents, their role was to pre-qualify potential borrowers for mortgages. But as their income stream was tied to the number of applicants that passed the qualifications and ended up as borrowers, they had every incentive to make sure that those who applied for loans got them. And as the mortgage brokers bore no personal liability, apart from the possibility of being charged with criminal mortgage fraud, many bent the rules and flat out lied along with applicants to get these deals done.
This is what happens when people with no fiduciary duty are executing large financial transactions.

However, on-paper fiddling of applicant details wouldn't solve the problem of those applicants who simply couldn’t afford the monthly payments. For these borrowers complex ‘option ARM’ mortgages were offered, where initial payments were low but jumped after a certain number of years. These were issued to borrowers on the assumption that house prices would go up, giving borrowers the opportunity to refinance before their payment rates rose.

The brokers bore no liability for qualifying unqualified applicants, as they were not the ultimate source of cash that funded the loan. This cash was sourced by the banks that were happy as long as they could earn higher interest than what was available from other debt instruments with the same credit rating.

Even better, from the banks' point of view, was that these mortgage backed securities and collateralized debt obligations were built in the form of special purpose vehicles and, as such, were not consolidated on the banks' balance sheet. If the accounting rules that govern consolidation of such vehicles sound familiar, you probably are remembering a thing or two from the Enron scandal.

As these new financial instruments weren’t listed on the balance sheet, important leverage ratios were not impacted. Regulation, such as the Basel accord, limits how much debt can be used to finance operations. Off balance sheet financing gave a way to game the system, a game known as regulatory arbitrage. Leverage is great for financial institutions when their bets are working for them. They can juice up their returns, borrowing at X% and earning money at 20 x (Y – X), which is all well and good when Y is greater than X.

It was this strict interpretation of the letter of the law, but not the spirit, that drove the demand for mortgages to be repackaged.
And as long as the mortgage backed securities received high credit ratings, this was a great way for bankers and brokers to make money whilst giving affordable housing to those who would not usually qualify for loans.

## Credit Ratings

But the models that were used, both by the banks and by the credit rating agencies (often working together from the same bed), underestimated the impact of systemic risk. If house prices are artificially elevated by an environment where everyone is suddenly able to qualify for a mortgage due to diminished underwriting standards, then as soon as the music stops home prices should return to their natural, unsubsidized level.

And that they did. Across America home prices began to collapse in 2006 as lenders began to see the impact of the declining quality of borrowers. And anything that has an effect across America is no longer an idiosyncratic risk. Add to that the fact that many borrowers, who had borrowed more than their house was worth under the assumption that home prices would continue to rise, suddenly found themselves owing far more than their house would be worth at any point in the near future. This brought to an end the domestic spending fueled by home equity withdrawals, a significant and growing part of GDP over the past 10 years. At its peak between 2004 and 2006, mortgage equity withdrawals were about 9% of GDP, nearly five times larger than they were just ten years prior.

While corners of the press/punditsphere were predicting a severe decline in house prices, the prognosticators at the credit agencies seemed to lack the same foresight. Even worse, the agencies had no problem with slapping AAA credit ratings on vehicles filled with dodgy loans.

Here again we see the consequences of a disconnect between risk and reward. In a situation reminiscent of Arthur Andersen, the consulting / accounting firm that audited Enron's books, the credit rating agencies also provided consulting services.
If you were a bank looking to get a AAA rating, the rating agencies were more than happy to provide consulting services to ensure that the rating arm of the firm would get your desired rating.

As an aside, I don't really understand why ratings agencies even exist. Investors are more than happy to do their own leg-work and purchase stock in companies without any seal of approval from third parties.

## Credit Default Swaps

There is another piece of the puzzle that also leaves me somewhat confused: credit default swaps. When you decide to loan money to someone, you are at risk of not receiving all of your money back if the other party somehow defaults on the loan. To get around this, you will often charge a higher interest rate to make up for the credit risk. But if you lend them money today, and they go bankrupt tomorrow, then you won't even get your first interest payment.

Enter credit default swaps (CDSs). In a CDS, you essentially buy insurance from a third party that will pay out if your counter-party (the person you loaned money to) defaults. But CDSs are not regulated as insurance. As a homeowner it is fraudulent to take out multiple policies against your single home. If it weren't, I could insure my house 20 times, burn it down tomorrow and buy 20 more the following day. This isn't the case for credit default swaps. I can buy as many as I like, and insure as much money as I want against the chance of some company going bankrupt.

Although not regulated as insurance policies, many insurers got into this game – including AIG. It seemed like a no-brainer: here were companies looking to buy insurance against the possible default of AAA rated companies and bonds. With AAA ratings the purported likelihood of default was de minimis so the insurers were happy to oblige. And so the CDS market grew to cover about 50 trillion dollars' worth of insurance - about the same size as the total sum of the entire US corporate bond market.
This worked great for the insurers of mortgage bonds until the bonds started to default. But that's the risk you take as an insurer. However, the dollar value of the insurance can be much larger than the dollar value of the loss when more people are insured against a default than there are counter-parties exposed to that default.

This becomes particularly interesting when insuring against corporate bond default. As more and more people buy insurance against the risk of a company going bankrupt, the cost of that insurance goes up – simple supply and demand. And as the cost of insurance goes up, the market perceives a greater risk of default and demands higher interest rates on bonds issued by the target company. Thus the cost of financing goes up and it can get to the point whereby the company's cost of capital is greater than their return on equity, and the company can no longer operate profitably.

The more reliant a company is on debt to fund its operations, the more risk they are exposed to from their cost of financing. And companies that rely on high levels of leverage, by definition, are highly exposed to such risks.

## And then...

This is what was playing out over the past few weeks. Banks, using their low cost of capital to trade in highly-(mis)rated mortgage instruments, earned highly levered returns to fuel their profits. Seeing this as a problem, people took the opportunity to buy credit default swaps against these banks, and at the same time short sold their equity to doubly profit from the demise of the bank.

Depending on your point of view, this is either capitalism at its finest, preying on the weak to allow the strong to survive, or it is rampant manipulation, using credit default swaps to force banks into bankruptcy. In reality, it is probably a bit of both.

Short-selling, while still not widely understood by the folk outside of finance, has been blamed by many for the market activity of late.
This is like a poor student failing a test and blaming the teacher for making the test too hard. As best as I can tell, there is hardly any impact on a firm's ability to operate in the face of declining stock prices. So unlike speculative purchases of default swaps, which negatively impact a firm's ability to borrow, short-selling stock should not decrease the ability of a firm to continue operating. [See here]

Nonetheless, regulators in the USA, the UK and Pakistan(!) have intervened in the market to limit short-selling. But they have done nothing to address the underlying disconnects that led to our current predicament. Credit default swap markets remain lightly regulated and highly opaque.

The Treasury wants to buy up 'toxic' mortgage debt to the tune of half a trillion dollars and while this will probably solve the problem quickly, it may not be the best way of doing so. The way I see it, doing so is simply an extension of the stimulus check program, but with benefits going to the 5 million odd borrowers from the past decade who already spent that money. It is like putting a helicopter in a time machine and handing out free money to those who responded positively to the requests of fraudulent mortgage brokers to lie on their loan applications.

I'm not sure what the right solution is. There probably isn't one. Certainly nothing that won't be painful.

In the last 22 years, about 233 such abductions have occurred in the United States. About 4 million babies are born each year, which means that a baby has a 1-in-375,000 chance of being abducted. Compare this with the infant mortality rate in the U.S. -- one in 145 -- and it becomes clear where the real risks are.

And the 1-in-375,000 chance is not today's risk. Infant abduction rates have plummeted in recent years, mostly due to education programs at hospitals.

So why are hospitals bothering with RFID bracelets? I think they're primarily to reassure the mothers.
Many times during my friends' stay at the hospital the doctors had to take the baby away for this or that test. Millions of years of evolution have forged a strong bond between new parents and new baby; the RFID bracelets are a low-cost way to ensure that the parents are more relaxed when their baby was out of their sight. Security is both a reality and a feeling. The reality of security is mathematical, based on the probability of different risks and the effectiveness of different countermeasures. We know the infant abduction rates and how well the bracelets reduce those rates. We also know the cost of the bracelets, and can thus calculate whether they're a cost-effective security measure or not. But security is also a feeling, based on individual psychological reactions to both the risks and the countermeasures. And the two things are different: You can be secure even though you don't feel secure, and you can feel secure even though you're not really secure. The RFID bracelets are what I've come to call security theater: security primarily designed to make you feel more secure. I've regularly maligned security theater as a waste, but it's not always, and not entirely, so. It's only a waste if you consider the reality of security exclusively. There are times when people feel less secure than they actually are. In those cases -- like with mothers and the threat of baby abduction -- a palliative countermeasure that primarily increases the feeling of security is just what the doctor ordered. Tamper-resistant packaging for over-the-counter drugs started to appear in the 1980s, in response to some highly publicized poisonings. As a countermeasure, it's largely security theater. It's easy to poison many foods and over-the-counter medicines right through the seal -- with a syringe, for example -- or to open and replace the seal well enough that an unwary consumer won't detect it. 
But in the 1980s, there was a widespread fear of random poisonings in over-the-counter medicines, and tamper-resistant packaging brought people's perceptions of the risk more in line with the actual risk: minimal. Much of the post-9/11 security can be explained by this as well. I've often talked about the National Guard troops in airports right after the terrorist attacks, and the fact that they had no bullets in their guns. As a security countermeasure, it made little sense for them to be there. They didn't have the training necessary to improve security at the checkpoints, or even to be another useful pair of eyes. But to reassure a jittery public that it's OK to fly, it was probably the right thing to do. Security theater also addresses the ancillary risk of lawsuits. Lawsuits are ultimately decided by juries, or settled because of the threat of jury trial, and juries are going to decide cases based on their feelings as well as the facts. It's not enough for a hospital to point to infant abduction rates and rightly claim that RFID bracelets aren't worth it; the other side is going to put a weeping mother on the stand and make an emotional argument. In these cases, security theater provides real security against the legal threat. Like real security, security theater has a cost. It can cost money, time, concentration, freedoms and so on. It can come at the cost of reducing the things we can do. Most of the time security theater is a bad trade-off, because the costs far outweigh the benefits. But there are instances when a little bit of security theater makes sense. We make smart security trade-offs -- and by this I mean trade-offs for genuine security -- when our feeling of security closely matches the reality. When the two are out of alignment, we get security wrong. Security theater is no substitute for security reality, but, used correctly, security theater can be a way of raising our feeling of security so that it more closely matches the reality of security. 
It makes us feel more secure handing our babies off to doctors and nurses, buying over-the-counter medicines, and flying on airplanes -- closer to how secure we should feel if we had all the facts and did the math correctly.

I don't understand why we are getting into such a fuss attempting to stop the decline in stock prices. If a company defaults on its debt then the equity holders get nothing. That's just how it works. The likelihood of default is reflected in corporate bond rates and CDS spreads. As those spreads widen, we see that the market believes in a greater chance of default, therefore the value of the equity declines. Whether they are right or wrong is another thing, but markets are supposed to reflect beliefs. Facts are for the future to reveal.

As far as I can tell there are no great operational reasons why a company would care about the current value of its equity. Yes, if they are trying to make purchases with equity, it could be a problem, but that's not at play right now. Yes, companies have a duty to their shareholders, both internal and external. But if the external market believes something that isn't shared by insiders you shouldn't be taking aim at changing the mechanism for reflecting beliefs. That's nothing more than shooting the messenger.

If you make the argument that insiders care about stock prices because of options and the effect on morale, then you have a bigger problem. If insiders honestly worry about what management considers a temporary misplaced belief, then clearly the insiders don't hold those same beliefs. And maybe then facts are actually closer to what the external market is reflecting. This would mean that the mechanism is working, and shooting the messenger would be trying to shoot the future.
:wq

# Arthur

## 17 Sep 2008, 7:55 PM

Now I don't know how many of you have watched Bloomberg TV in the evening, but without the distraction of an active trading session the channel feels much more like a low-rent community college broadcast. All that was missing was a potted plant and an American flag.

It follows naturally from the fact that exchange rates are asset prices that embody expectations of future movements in macroeconomic fundamentals, specifically ones that will directly affect the exchange rates. For commodity currencies, global commodity prices matter to their exchange rate values.

Inspired, I bounded out of bed and decided to pretend to be an economist this morning. Without reading the paper, I grabbed what data I could and put together a simple vector autoregressive model. I couldn't find data for the Chilean Peso, or at least my data set suggests that it was, up until recently, pegged to the USD, so I worked with only the AUD, CAD, NZD and ZAR. Even so, as my pretty picture above shows - the model is pretty spiffy. The chart shows the out-of-sample 1 month ahead prediction. Overall the model gets me an R-squared of 0.85. Neat. Now onto real work for the rest of the day.

:wq

# Gait Analysis - Squared

## 5 Sep 2008, 6:16 PM

Hey, you, meet you

Nearly seven years after Osama Bin Laden disappeared, US intelligence agencies are still chasing his shadow. And shadows are precisely what they should be looking for, says NASA's Jet Propulsion Laboratory in Pasadena, California. By analysing the movements of human shadows in aerial and satellite footage, JPL engineer Adrian Stoica says it should be possible to identify people from the way they walk - a technique called gait analysis, whose power lies in the fact that a person's walking style is very hard to disguise.

...
The results showed that the appropriately trained sexologists were able to correctly infer vaginal orgasm through watching the way the women walked over 80 percent of the time. Further analysis revealed that the sum of stride length and vertebral rotation was greater for the vaginally orgasmic women. "This could reflect the free, unblocked energetic flow from the legs through the pelvis to the spine," the authors note.

There are several plausible explanations for the results shown by this study. One possibility is that a woman's anatomical features may predispose her to greater or lesser tendency to experience vaginal orgasm. According to Brody, "Blocked pelvic muscles, which might be associated with psychosexual impairments, could both impair vaginal orgasmic response and gait." In addition, vaginally orgasmic women may feel more confident about their sexuality, which might be reflected in their gait. "Such confidence might also be related to the relationship(s) that a woman has had, given the finding that specifically penile-vaginal orgasm is associated with indices of better relationship quality," the authors state. Research has linked vaginal orgasm to better mental health.

...

Extending the idea to satellites could prove trickier, though. Space imaging expert Bhupendra Jasani at King's College London says geostationary satellites simply don't have the resolution to provide useful detail. "I find it hard to believe they could apply this technique from space," he says.

:wq

# Michael Palin for VP

## 4 Sep 2008, 12:26 AM

See here.
# The best + shortest paper I have read this week

## 28 Aug 2008, 1:41 AM

## Parachute use to prevent death and major trauma related to gravitational challenge: systematic review of randomised controlled trials

Gordon C S Smith, professor, Department of Obstetrics and Gynaecology, Cambridge University, Cambridge CB2 2QQ; Jill P Pell, consultant, Department of Public Health, Greater Glasgow NHS Board, Glasgow G3 8YU

## Abstract

Objectives: To determine whether parachutes are effective in preventing major trauma related to gravitational challenge.

Design: Systematic review of randomised controlled trials.

Hat tip to Overcoming Bias

# Madagascar Photos

## 13 Aug 2008, 6:08 AM

Just got them back from the 1-hour photo place this afternoon. Took the rest of the afternoon off work, scanned them, and then uploaded them to the Internet website Flickr.

# Back from honeymoon. Straight into work.

## 7 Aug 2008, 10:57 PM

One of the good things about not being able to blog about my work is that I can upload random charts and not have to exert the keyboard bashing required to explain my thoughts.[1]

X axis would be current account balance.[2] Y axis is loan to deposit ratios for random banks

[1] Which are largely derived from another happily married Aussie bloke.

[2] Data thanks to another Aussie bloke.

# Going Phising

## 15 Jul 2008, 8:39 PM

Just so folks know, I'm heading on honeymoon for almost two weeks starting Friday. My posting frequency will be undisturbed by this event. Should be fun.

# when i am president

## 15 Jul 2008, 4:47 AM

# married

## 30 Jun 2008, 8:28 AM

despite getting married outside in pouring rain, a good time was had by all.
# lemma

## 18 Jun 2008, 3:32 AM

18 / 06 = 3 (i)

18 - 08 = 10 (ii)

3 x 10 = 30 (i x ii; QED)

4eyes

# Micro-brands for bands

## 9 Jun 2008, 8:05 PM

I spent Saturday afternoon at a birthday picnic for one of S.'s good friends who works in the film biz. The great thing about that crowd is that I get to live vicariously through their stories and I am also reminded of the brief period when I worked in film post production. No offense to my current coworkers, but there is something special about working with creative creative people, rather than technically or financially creative people.

In a conversation with a guy who does indy film distribution, after fawning over his current project with Werner Herzog, we got into a discussion about the future of the industry with all the YouTube and whatnot. I made some throw-away line about how branding isn't yet well established online and there may be some possibilities there.

But what really got me thinking was the lack of brand identities in the re-democratization of arts and entertainment. Directors, record labels, musicians and producers succeed when they become brands. As much as I wish it weren't so, I would have never sat through all of Gerry if I didn't know it was a Gus Van Sant film. Countless songs would have missed out on the critical 5th replay required for them to catch on if I didn't know that they were written by X or released by Y. Sure, it is not a hard and fast rule and I discover new and unknown stuff all the time. But we all go the extra mile in accepting familiar brands. I am loyal.

If it weren't for the coin slot, people would have no qualms in stealing their morning newspaper, or so would be the case if we were all Homo Economicus. Plenty-o-folk download music without worrying about paying their dues to the artists. But would we do so if the band was watching us at the time?
Radiohead's great experiment in behavioral economics showed us that when given the option of free, many people still elected to pay. I posit that it was merely the act of making the consumer aware of the option at the time of download that resulted in this. Not a really stunning observation, but I think it is key. And it wouldn't work for bands who weren't brands. If you don't know what you are getting, you probably won't pay. However, if you were aware of the artist's situation, you might come back and pay.

Online, this might be as simple as putting up a download link with a big message alongside: "The band remains poor and starving. Please help." But being a technologist, I think we can do better.

Imagine a system whereby independent artists collected payments, voluntary or not, through a central clearinghouse. That way each artist could accurately display how much income they are making from their music, and various stats about downloads. Starving artists can use their prandially challenged status to help convince listeners to share a few dollars. And popular artists will get all the benefits of popularity.

This could easily be another feature of last.fm or similar. But they are busy building a walled garden of listener behavior data. However there is no reason why this needs to be a centralized service. Distribute this sucker. That way record labels can do whatever it is that online record labels do, and can collect the cash. They can then distribute the cash to the artists, which I hear they do on rare occasion.

Production costs are decreasing. Anyone with talent & a computer can make a great film or album. Companies yearn for strong brands as consumers can then align their personal images with these brand identities. Art and entertainment is a visceral identifier of personal characteristics. I like the music that I like because it says something about me.
I strongly believe that given repeated exposure anyone can at least enjoy, if not love, any form of art. At some stage I chose the music, film and art that I wanted to be identified with, and only later did the deep connection and enjoyment grow. As individuals the clothes we buy, the food we eat and the entertainment that we enjoy often provide a shortcut to our own deeper identity.

So, why do entertainer incomes follow a power distribution? Why does Coldplay have the benefit of being able to turn down multi-million dollar advertising deals when countless other musicians have to rely on government subsidies to feed themselves? These poorer artists are not always lesser in their art. I can understand that in a world where search costs are high, attention is focussed on a smaller set of well marketed artists. However with the diminished cost of search, consumers should be more selective in their choice of art brands. And if the audience is free to sample unrestricted digital works, but is aware of the status, needs and micro-brand of the artists, I would like to think that we can arrive at a more even distribution of wealth across the arts.

And then I'd find a way of sticking a fiver in this guy's cap.

# Bitter: My 2 Cents

## 31 May 2008, 12:33 PM

After some email goading from Jerry[1] I have decided to share a few of my thoughts on the problems with Twitter. First off, let me say that I don't know the people behind Twitter and I have no access to the sort of information I'd want to see before I could even come close to forming a cogent opinion. So everything from here on is most definitely wrong and bone-headed.

In my reply to Jerry I made some claims based on my initial rumination about Twitter's problems. I decided to do some further digging and found that Jeff Atwood's[1] opinion is pretty close to mine.
Namely, that the first order of business is to stop blaming Ruby or Rails (neither of which I have used), and instead think about the underlying platform. For me, this means thinking long and hard about their database.

Whatever I'm actually paid to do in my career, I always seem to end up working with sizeable datasets. And I think databases are totally swell. Bees knees. I'm one transaction shy of putting a poster of Codd on my wall. But sometimes, not due to lack of effort, I give up on databases and end up rolling my own data crunching application.

Databases are so swimmingly handsome because they let you approach them and make general queries on data. Sure, there is an art and science to optimizing a database for specific queries, but 90% of the time you don't have to think about that.

c.f. The Twitter API specification[1]

When Twitter was just an Obvious twinkle in someone's eye, a database made perfect sense. Who knew what Twitter would become, and the flexibility of a database is a huge advantage. But now Twitter is Twitter: A messaging service. And save the discovery of a business model, the announced coming features don't dramatically impact architectural requirements for the service. Oh, and since the original implementation of the prototype service, Twitter has experienced massive growth and downtime is crushing goodwill.

I say, ditch the general purpose database, implement a custom solution.

The great thing about a custom database is that it really doesn't take much time to build if you know your usage patterns. Whereas an RDBMS has to make guesstimates as to how to execute a query, and optimize that execution against a set of generic index types, a custom solution can use custom index algorithms. For example, as part of a recent bit of hacking I got a few-thousand-fold speed improvement from moving an app from a well managed commercial database to some hand rolled C.
For a key join, the RDBMS was performing its best, but we were able to perform better by making a merge-bsearch-linear scan algorithm that made perfect sense for this application. In short, if you are database bound, and your database is well managed, and your application has known query patterns - ditch the database.

People seem to forget the computational bandwidth at their fingertips. Grab the nearest napkin, scribble down a rough estimate of the bandwidth requirements of your app and then compare that to your computer. If A << B and you are suffering performance issues, then you are in a happy place - it should be easy and rewarding to solve. Step away from your computer and just think about what it means to have 2 GILLION CYCLES PER SECOND. Thems be a lot. Use them.

My napkin calculations based on over-heard numbers suggest that a platform change would suffice and make Twitter as happy as Larry. But let's say that someone goes along and implements what I'm talking about and that's still not enough. Well, then it gets to be _real_ fun.

I took a wee bit of graph theory at school, but I also have the pleasure of having friends who took an unhealthy amount of graph theory. It is a well studied domain. If platform level optimization isn't enough there is a wealth of knowledge in this space on optimally placing resources to maximize bandwidth. Twitter is probably pretty close to a fully connected graph. But where are the cut points? How lumpable is the distance matrix? How does the matrix evolve over time? (Pretty slowly, I guess) What is the optimal time period for relocating clusters of users across your horizontally scaled system? And why am I not hearing any discussion online about these questions? (Probably because I haven't looked.)

OK. It is 4:30AM and I should be packing for the move tomorrow. So I'll cut it short. Edgar[2] and Edsger[3]. People have been thinking about these types of problems for a long time.
I know how daunting it can be when you are running a system and everything goes to shit. The immediate reaction is to patch stuff. Sometimes that just makes it worse. Read the literature, mine your operational data and find a better way. And whatever you do, don't listen to some wacky Australian on the blog-o-sphere.

[1]: I'll come back tomorrow and add links to this blog after I become a Brooklynite.

While I understand that variable font sizing on the Kindle erases the notion of a 'page', the device does provide a progress bar to indicate how far along one is in the scope of the entire book. I want the same thing, but on the chapter level. Or at least a way of flicking to the next chapter and looking at the 'page number' and comparing that against where I currently am in the book.

So I read the manual. At first, I started reading it on the Kindle itself, but after reading the first 3 chapters (within which I was instructed on how to change the font size on 4 separate occasions), I gave up and went to my desktop. I quickly realized that the manual made no mention of my chapter-progress-bar feature. So I called Amazon's Kindle customer support (1-866-321-8851).

Me: Hi
AmaDroid: Email address?
Me: Huh, um, xxxx@i2pi.com
AmaDroid: Billing address?
Me: Is this the Kindle help desk?
AmaDroid: Yes. Billing address?
Me: XX Horatio St, New York
AmaDroid: ZIP?
Me: 10014
AmaDroid: How can I help you today?
Me: Well, first you can explain why you asked for my personal details without saying 'hello' to me.
AmaDroid: Um...
Me: Or maybe you can tell me if there is any way to work out how far along you are in a given chapter when you are reading a book on the Kindle?
AmaDroid: There is the progress bar that tells you how far along you are in your book.
Me: But can you tell how far along within a chapter?
AmaDroid: No.

At this point two things come to my mind:

1. Seth Godin's piece about in-bound calls
2. And instructions for getting root access on the Kindle.
FedEx tells me that my parts will arrive at the office tomorrow. Yay.

Me: [HANGUP]

:wq

[1] As a prematurely grumpy old man, I'll leave compliments to others.

# I <heart> David Byrne

## 9 May 2008, 5:46 PM

Thanks to a recent message on twitter I learned that David Byrne is blogging. Not only do I really dig Talking Heads, but I love his work with Ryuichi Sakamoto, and it is great to peek into the mind of a musician through writing that is as well composed as his music and shares an unsurprisingly concordant aesthetic and ethic with it. Go read his blog: http://journal.davidbyrne.com/. And if you do go read, you will find out about this:

Playing the Building, a 9,000-square-foot, interactive, site-specific installation by David Byrne, will transform the interior of the landmark Battery Maritime Building in Lower Manhattan into a massive sound sculpture that all visitors are invited to sit and “play.” Byrne’s project will consist of a retrofitted antique organ placed in the center of the building’s cavernous second-floor gallery that will control a series of devices attached to its structural features—metal beams, plumbing, electrical conduits, and heating and water pipes. These machines will vibrate, strike, and blow across the building elements, triggering unique harmonics and producing finely tuned sounds. As Byrne explains, it is an elaborate system for “activating the sound-producing qualities that are inherent in all materials.”

Playing the Building marks the first time in decades that the second floor of the Battery Maritime Building will be accessible to the public. The space will be open and free to all visitors on Friday, Saturday, and Sunday throughout the summer of 2008. Everyone will be invited to sit at the organ, tap on the keys, and create a unique array of sounds that travel through the space. In addition, David Byrne and Creative Time will invite guest musicians to challenge his creation through a series of performances and jam sessions.
Now, I don't really know much about art, but I really like it when I see a simple idea executed well and I end up walking away saying "Wow. I could have done that. If only I thought of it first." I guess I'll be heading down to Battery Maritime this summer to check it out.

# Scatter Plots

## 8 May 2008, 3:54 AM

Via Junk Charts:

(Thanks to reader Josh R. for the tip.) The "plucky statisticians" at Urbanspoon decided to tackle the political hot potato: is Barack Obama an elitist? Scratch that -- what they actually did was to determine if Obama supporters were elitists (of course, Obama would then be, due to guilt by association.) Scratch that -- what they actually analyzed was if there tended to be more Starbucks per capita in those states in which Obama won Democratic primaries.

Hey, that Josh R. character sounds like a mighty fine bloke.

Mr. Urbanspoon, the statistics professor is here and he disapproves. As discussed before (and here), plotting two series of data on the same chart and applying two different scales is a recipe for disaster. Not reaching immediately for the scatter plot when one has two data series is another serious misstep. (Indeed, Josh sent the link in with a note wondering why "people dislike scatter plots so much".)

Wow. I'm famous.

[1] I have no idea what that plot is of other than being the first image returned from my home directory search for scatter*

# Fort Greene Tannery Works closing!

## 24 Apr 2008, 5:37 PM

After over a year of following the uber-bear-bloggers, I've decided to jump the shark and have learned to love Lawrence Yun. After doing careful research I discovered that now is the perfect time to buy or sell a house. So I did. After much paper shuffling, I am now the proud owner of a townhouse in Fort Greene. The problem is, that during the closing I discovered that an 1890 law prevents me from running a tannery on my block. I'm kinda bummed out about that.
The hurly burly of financial markets has been taking its toll and I was looking forward to a simpler and more rewarding life tanning me some hide. I guess I'll have to think of something else fast.

You see, I have child-bearing hips. I can't help it, I just have wide and prominent hip bones. Whenever it comes up in conversation, usually constrained to clothes buying situations, but also the odd well lubricated party, I refer to my 'child-bearing hips.' Always good for a laugh. I definitely mentioned my concern at my initial meeting at Duncan Quinn. Perhaps not with the vociferousness concomitant with my level of concern.

"Make sure the back of the jacket doesn't flare out too much."
"Keep the pants low on my waist so that pockets don't flare."
"I like slim pants and sleeves, but my shoulders need to be broad -- I have child bearing hips, you see! tee hee hee."

I wore my Costume National off-the-rack suit in to the fitting today. The idea was to take a suit that I like, and ensure that the significantly more expensive 'bespoke' one would be better. It's not. On my way there I had my conversation all mapped out. When it came down to it, all I could manage was a moping mumble. The first thing that stood out to me was the shoulder breadth. It turns out that they can't let them out any farther as there is not enough fabric inside to allow it. I didn't mention the height of the pants - I was too concerned with the jacket.

Of course, I'm now sitting at home stewing over this. Not so much the cut of the suit, but more that I didn't kick up a bigger fuss. One day I'll learn.

Dolphin Hotel

Here. Have some free music*. Thanks.

Craque:"Steve Reich meets Thomas Brinkmann in an alley with John Adams"
Twerk:"Farben meets Zamfir meets Kenny G"

*NB: Not mastered by Audible Oddities. Mileage may vary.

More random blogging

Care for the Fed funds rate probability curve?

For the most part about enabling rapid growth.. The nerd has the first to do it without lining someone else's pocket pc or some other controller and the voice recognition software jobs, great people who have been a feature of the financial landscape for a few weeks ago, and the first, as liquidity pours into a lot of time at the Hudson hotel, Maritime, and that the balance between opportunity and risk of the world's largest banks to lend to be the first institutional money. The same holds for auto companies in the subprime mortgage lead.

[click for 66mb quicktime video]

Ugh. Thanks for submitting to WhiteWhine.com. I'm so excited to have yet another email to read. Like I don't read enough email at the office every day. Whatever happened to 'me time'?

Markov me blogs, plz.

Ok. Taking some time off from eyeing the markets and whatnot, I decided to see what markov chains would make of my daily blog reading. This is what I get:
A New York Fed-Princeton University Liquidity Conference today. The conditions that led to this problem, but what about supply? Stalin, who was looking for their openness, or thier APIs like Craigslist, Facebook, del.icious, or Google have always garnered their fare share of their total budget. It makes perfect sense that the strategy in a discussion. The drive to excel is cultural, I think you get into the rib place in what is the poor performance of the surface world, there may be available, or we can't get the right word. Version 2.0 of the deal.
Amen.

Bonus round:
Robin Hanson over at the national economy in the event exploring the possibility of timing sector rotation highly dubious, the fact is that it corresponds to picking the best modes of production, e.g. their explanations for the first two hurdles? Passing the PhD qualifying exams and obtaining a research project in Shared Capitalism at the same holds true for other reasons.

Ok. I can't stop!
A new film, about Bob Dylan, takes an abstract look at the MacWorld keynote. Steve offered up an email account on my blog (or both)? Should I hire him even though this piece represents only a subtle shift away from the context of imperfect international financial markets, they certainly shed a different sort of confusion that I will discuss goals and principles that, if this is happening now, the strengths and weaknesses of the path and experimenting for the next step occurs, behind the felixsalmon.com curve!
This is depressing. The algorithm writes better than me.
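
For anyone curious how little machinery this takes, here is a minimal sketch of a word-level Markov chain generator in Python. It is not the script used for the output above (that one was never posted); the function names `build_chain` and `generate` and the tiny stand-in corpus are assumptions made purely for illustration.

```python
import random
from collections import defaultdict


def build_chain(text, order=1):
    """Map each run of `order` words to the list of words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return chain


def generate(chain, length=30, seed=None):
    """Walk the chain from a random starting state, emitting up to `length` words."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))  # sorted so a given seed is reproducible
    output = list(state)
    while len(output) < length:
        followers = chain.get(state)
        if not followers:  # dead end: this state was never followed by anything
            break
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return " ".join(output)


# Hypothetical stand-in corpus; in practice you would feed in your blog reading.
corpus = (
    "the fed cut rates and the market rallied and "
    "the fed watched the market and the market watched the fed"
)
chain = build_chain(corpus, order=1)
print(generate(chain, length=12, seed=1))
```

Raising `order` to 2 or 3 makes the output more grammatical but also more likely to parrot whole phrases from the source, which is roughly the trade-off visible in the samples above.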

Adding to my market microstructure library....

Items not yet shipped:
Delivery estimate: January 11, 2008
1 of: Stock Market Liquidity: Implications for Market Microstructure and Asset Pricing (Wiley Finance)
Sold by: Amazon.com, LLC

1 of: Empirical Market Microstructure: The Institutions, Economics, and Econometrics of Securities Trading
Sold by: Amazon.com, LLC

1 of: Parimutuel Applications In Finance: New Markets for New Risks (Finance and Capital Markets)
Sold by: Amazon.com, LLC

# How not to store a tree in a database table

## 3 Jan 2008, 12:30 AM

Reading delicious/popular has become quite a habit for me. It is often a good place to find interesting and obscure programming tricks. The down side is that there is a lot of web-dev stuff in there that really doesn't interest me too much. Even so, the latest and greatest color picker is nowhere near as evil as this article by a Rock Star coder about storing tree structures in relational databases. Rather than going on a long rant, I'll try to keep this simple:

STORE DATA IN YOUR DATABASE IN A FORMAT THAT YOUR DATABASE UNDERSTANDS.
Done.

OK. Maybe I'll flesh that out a lil' more.

People much smarter than you or me write databases. And amongst those people lurk a hardcore group who write query optimizers. Their job is to understand your SQL and find the fastest execution plan given available indices and related statistics. And to do this they tend to rely on the fact that an INTEGER column is used to store INTEGERS and that all the various rules of mathematics apply. Now, unless your database supports the INTEGERS-SEPARATED-BY-PERIODS-REPRESENTING-AN-ORDERED-PATH data type, then you are SOL if you want any help from the query optimizer.

(Of course, if you are using PostgreSQL, then you will find an integer array datatype which supports ordered lists of integers. And the query optimizer knows how to use properly constructed indices to find stuff.)

According to Rock Star, the standard way of doing things fails because "It involves recursion, which is slow and complex." Hmm. Maybe people who find recursion complex shouldn't be re-inventing the wheel? And slow, maybe if you told us more about your application, we could help you work out why a perfectly normal solution doesn't work for you. Rocky then names two more problems with using MySQL's suggested Nested Set Model. OK. Fine, that model does suck. Even so, the only complaint he has listed against the standard 'Adjacency List Model' is that it uses recursion.

I guess I knew his proposal would suck when one of his requirements was that "Selection of items needs to be possible with SQL.". Dude. You are using a database, so SQL _should_ work. You are just entering a world of pain by totally ignoring the role of the query optimizer, which will have no idea what to do with your janky path representation. And even if that's not a concern, you have made it super difficult for any DBA to work with this schema as you'll end up with bastardized SQL to do anything apart from your original purpose. If you are 100% sure that your requirements will never change, fine. But what happens when one of your PHP scripts screws up and you have to manually correct the data? What about referential integrity? What do you do when your LIKEs are not of the form '1.2.3%' - e.g., find me all Speakers? Using your format, that turns into a SELECT * FROM junk WHERE parent LIKE '%.2.%'. Ugh. Throw out any misconceptions you may have about LIKE being fast.
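
To make the contrast concrete, here is a rough sketch of the plain adjacency list approach: an integer `parent_id` with a real foreign key, an index the optimizer can use, and a recursive query for whole subtrees. It runs against SQLite through Python's `sqlite3` module only so that it is self-contained; `WITH RECURSIVE` is also valid PostgreSQL syntax. The `category` table, its columns, the sample rows and the `descendants` helper are all made up for illustration and do not come from the article being criticised.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
CREATE TABLE category (
    category_id INTEGER PRIMARY KEY,
    parent_id   INTEGER REFERENCES category (category_id),
    name        TEXT NOT NULL
);
CREATE INDEX category_parent_id_idx ON category (parent_id);

INSERT INTO category (category_id, parent_id, name) VALUES
    (1, NULL, 'Electronics'),
    (2, 1,    'Speakers'),
    (3, 2,    'Bookshelf Speakers'),
    (4, 1,    'Televisions'),
    (5, 4,    'Plasma');
""")


def descendants(conn, root_id):
    """Return (category_id, name, depth) for root_id and everything beneath it."""
    return conn.execute(
        """
        WITH RECURSIVE subtree (category_id, name, depth) AS (
            SELECT category_id, name, 0
            FROM category
            WHERE category_id = ?
            UNION ALL
            SELECT c.category_id, c.name, s.depth + 1
            FROM category AS c
            JOIN subtree AS s ON c.parent_id = s.category_id
        )
        SELECT category_id, name, depth
        FROM subtree
        ORDER BY depth, category_id
        """,
        (root_id,),
    ).fetchall()


# Everything under Speakers: integer comparisons only, no LIKE '%.2.%'.
for row in descendants(conn, 2):
    print(row)

# Referential integrity for free: a row pointing at a missing parent is refused.
try:
    conn.execute(
        "INSERT INTO category (category_id, parent_id, name) VALUES (99, 42, 'Orphan')"
    )
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

The point of the sketch is only that every comparison is between integers in an indexed column, and that a broken script cannot leave a node dangling from a parent that does not exist.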

In short, if you have exhausted the benefits that you get from using a database as a database you have 2 options:

1. Stop using a relational database if you don't want any of the benefits of a relational database.
2. Still use a relational database, but store your base level data properly. And then create materialized views (or similar) with optimized views for particular use cases.
In your case, I recommend #1. Stick it all in a text file. Or maybe a PDF that you print out and stick on a nice wooden table.

Doppelganger Update

Known Joshua Reichs:

Term Auction Facility results

Release Date: December 19, 2007
For release at 10:00 a.m. EST

On December 17, 2007, the Federal Reserve conducted an auction of $20 billion in 28-day credit through its Term Auction Facility. Following are the results of the auction:

Stop-out rate: 4.65 percent [Josh: Only 10bp less than discount]
Total propositions submitted: $61.553 billion [Mo Money Please!]
Total propositions accepted: $20.000 billion
Bid/cover ratio: 3.08
Number of bidders: 93

Bids at the stop-out rate were prorated at 1.96% and resulting awards were rounded to the nearest $10,000 (except that all awards below $10,000 are rounded up to $10,000).

The awarded loans will settle on December 20, 2007, and will mature on January 17, 2008. The stop-out rate shown above will apply to all awarded loans.

Institutions that submitted winning bids will be contacted by their respective Reserve Banks by Noon EST on December 19, 2007. Participants have until 3:00 p.m. EST on December 19, 2007 to inform their local Reserve Bank of any error.

Let me go out on a limb and say that the markets will be down on this. A rate of 4.65% basically confirms that institutions haven’t been willing to go to the discount window because any such action becomes public knowledge. The TAF is a silent auction and thus avoids this stigma. Interesting stuff.

# How it went wrong.

The BBC at least went to some effort in trying to describe the mechanics of the current credit crunch, but they start off on the entirely wrong foot by conflating the ‘sub-prime model’ with securitization. What they describe in the image above is securitization. Portfolio Magazine has a much better graphic describing the details of recent innovations in securitization.

Arguably, securitization lowered rates for all borrowers. It also re-distributed the fees from local lenders to larger investment banks who could earn commissions for writing and selling CDOs. But along the way information asymmetry increased and everyone had an incentive to keep filling the pipes with new whole loans. Mortgage brokers got a quick buck from selling a refinance to an originator. The originator made a quick buck flipping the loan to the secondary market. Banks and funds made a quick buck flipping MBSs into CDOs. And investors (usually via other banks and funds) made some money holding onto paper that was paying out a rate marginally higher than similarly rated debt.

The only problem with making so much money is that enough is never enough. And so more brokers needed to originate more loans to be repackaged into more CDOs. However, nearly everyone who could already afford a mortgage had already refinanced into a lower rate and so the only people left to be refinanced were those who couldn’t afford a mortgage. Brilliant! Mortgages for people who can’t afford them!

But they can. You see, a few years ago adjustable rate mortgages (ARMs) were invented to allow borrowers who were both financially savvy and creditworthy to structure their cash flows so that the bulk of their payments would occur a few years in the future. Perfect for up and coming hedge fund analysts who knew their bonus 2 years from now would be considerably bigger than their entire earnings to date. But as of 3 years ago, the same product became perfect for Joe Trailerpark, who never learned that nothing comes for free. And that a 3% rate this year means a 10% rate 2 years from now.

Now, I don’t fully blame Messrs Trailerpark, I’m quite sure a large number of brokers glossed over the latter detail. (Did I ever tell you about the time I was sitting at John Wayne Airport, in the heart of mortgage origination land, and overheard 2 brokers talking about the finer points of faking W2s? Maybe another time…) Either way, everyone was happy as Trailerpark & Co. provided the much needed raw materials for new CDOs.

Of course, this all goes to shit when people stop paying their mortgages. Statistical models suck when faced with a small number of observations and flawed assumptions. And humans are all too happy to ignore the fundamental problems with their risk models when they are just making so much gosh darned money.

And that’s how it all went wrong.

(PS: Did you get your bid in today? I may post some thoughts on that in 2 days when the results are released).

Writers strike

I, for one, whole heartedly support the writers in their strike. In fact, I hope they remain on strike. Forever. The internet has never before had such a great range of funny videos to watch. However after a while it gets a little boring to continually watch videos about the strike. So, writers, we get your point but please move on to some other topics. You have so much talent not to waste it all as one trick ponies. Or LoLcats.

Books

Books I have read in the past few weeks:

1. Twinkie Deconstructed. Really boring. Last time I trust Tyler Cowen for a book recommendation.
2. The Electric Kool Aid Acid Test - Great. The closest thing to fiction that I have read in a while, without actually being fiction. It’s the story of Ken Kesey and his merry pranksters who suffered heavily from the burden of being some of the earliest LSD addicts. And as such weren’t enlightened by the now commonplace knowledge that _everything_ you do on acid seems really interesting and important.
3. Moneyball. From the author of Liar’s Poker comes the story of Billy Beane and how he used refined statistical analysis to lead the Oakland A’s to a World Series victory. Or at least that’s what I thought until I got to the end chapter. Turns out that they don’t win, but they get close. Either way, it’s a good read. I walked away even more convinced of the fact that most established knowledge is right, but every now and then it’s very wrong.
4. The Informant. My dad left this book with me when he was over in New York last week. On the 600-odd pages of this book we learn the story of international price fixing in the lysine market and the development of a huge white collar crime case by the FBI and DoJ, centering on the Archer Daniels Midland company. I had dinner on the weekend with some FBI agents [[Why is it that a statistically significant portion of my friends work in law enforcement in America, yet most of my Australian friends are criminals?]] and they had never heard of the case. Supposedly they are making a film of this book, which doesn’t surprise me as it reads like a script. Should be a great film, very exciting story.

1. Judgement under Uncertainty. After my disappointment with Behavioural Finance, by James Montier, I decided to go straight to the horse’s mouth and finally read the works of Kahneman and Tversky. So far the book is confirming my prior belief that behavioral finance/economics is simply plain old economics with more complex utility functions than the analytically pleasing forms that are often assumed to describe rationality.

Books I probably won’t ever read:

1. The Tipping Point, by Malcolm Gladwell
2. Anything by Richard Dawkins

Machines I am currently building:

1. Soup! - I’m not quite sure what it does yet. But it’s fun to be building machines again. I really suck at drilling holes in a straight line. I want a drill press for Christmas. And a lab in which to use it.

Bernbach's Law

Jerry just wrote an interesting post about the impact of behavioral targeting on media prices. Half of our ads work, we just don’t know which half, so the adage goes. Jerry asks:

If advertisers could put their ad only in front of the fifty people who are likely to buy the product rather than the hundred that might or might not, those other fifty ad slots go begging. Half the advertising inventory will not be bought. This 50% drop in demand should drastically cut prices.

On the other hand, if an advertiser can target better, she should be willing to spend more to get in front of that targeted audience. But how much more? Twice as much? Does the advertiser end up paying less, resulting in lower overall media revenues? Or does media benefit by being able to charge premium prices for all of their inventory?

Ceteris paribus, net revenues to the publisher should remain the same. They are selling half the units at twice the price. And the advertiser ends up paying the same amount for the same number of customers. Of course, if this were the case, behavioral targeting companies would not exist - the industry would be indifferent as to whether to target or not. Behavioral targeting companies market their technology with the premise that there is no need to pay $50 CPMs for premium placements when the same eyeballs also see remnant inventory at $0.50 CPM. The targeting platform makes sure that the right eyeballs see the right ad at the right time, regardless of placement. That said, as Jerry recognizes, if everyone switches to behavioral targeting, then it all works out in the wash, and pricing per placement becomes a thing of the past. But net, the cash flows remain the same.
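
The ceteris paribus arithmetic fits on a napkin, but here it is as a few lines of Python. Every number below is invented for illustration (100,000 untargeted impressions at a $10 CPM versus 50,000 targeted impressions at a $20 CPM, both reaching the same 500 customers), and the `campaign` helper is just a name picked for this sketch.

```python
def campaign(impressions, cpm, customers):
    """Return (total spend, cost per customer) for a single ad buy."""
    spend = impressions / 1000 * cpm  # CPM is the price per thousand impressions
    return spend, spend / customers


# Hypothetical untargeted buy: everyone sees the ad, half of them are wasted.
untargeted = campaign(impressions=100_000, cpm=10.0, customers=500)

# Hypothetical targeted buy: half the impressions, at twice the price.
targeted = campaign(impressions=50_000, cpm=20.0, customers=500)

print("untargeted:", untargeted)
print("targeted:  ", targeted)
```

Both buys cost the advertiser the same in total and per customer, and the publisher banks the same revenue, which is the indifference result described above. The interesting cases are the ones where the targeted CPM does not exactly double.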

“Nobody counts the number of ads you run; they just remember the impression you make.”

William Bernbach

Where things get interesting is if we allow for the possibility that premium placements really deserve higher CPM’s on the basis of placement alone. In which case, we can no longer lump together the n units of premium inventory with the N units of remnant/run of site, into a fungible (n+N) size pool. If there is anything fundamentally different about premium inventory then ceteris stops looking paribus. There are plenty of smart cookies, with more PhD’s than you can poke a University of Phoenix marketing diploma at, who continue to pay a premium for front page ad runs. I reckon they might be onto something, behavioral targeting be damned.

Word Study

I’ve been following the 20×200 blog / store since inception. I love the idea and many of the works. A few weeks ago Shianling and I decided to buy a print (above). Unfortunately I’m not terribly happy with the print quality. I don’t know much about printing, but am familiar with the problems of viewing art online, where the gamut of a monitor is greatly limited compared with what can be printed. However, the print we received lacks much of the vibrancy of the online picture. Oh welp. We supported an artist and have a shiny new thing to hang on our rapidly crowding walls.

Complaints about the print quality aside, I am very tempted to buy a new work that became available today (below) as I seem to have a deep emotional connection to custom bound books.

Pinot? No, I, R.

## 19 Nov 2007, 2:49 PM

One of the best ways to improve your Page Rank is to slavishly follow other blogs and provide witty responses to the posts of more credible bloggers. In that spirit, let me one-up Felix Salmon and raise his correlation analysis with a p-value. It turns out that Felix holds annual Pinot parties. Sorry, not parties, but scientific evaluations of Pinot Noir. Eschewing issues arising from normality, rank scoring, sample size, and underrepresentation of Australian wines, I present the following confirmation of Felix’s guesstimate that Pinot rankings and price are uncorrelated. Rather than rely on the unwieldy crutch of an R-square, I prefer to look at the p-value on the regression of score against price:

    > score=c(115,196,137,146,175,212,193,184,180,167,154,143);
    > price=c(18,23,29,52,33,12,26,18,10,50,13,12);
    > l<-lm(score ~ price);
    > summary(l)

    Call:
    lm(formula = score ~ price)

    Residuals:
        Min      1Q  Median      3Q     Max
    -53.982 -19.423   8.385  17.913  41.085

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 174.7818    17.4850   9.996 1.60e-06 ***
    price        -0.3222     0.6200  -0.520    0.615
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 29.36 on 10 degrees of freedom
    Multiple R-Squared: 0.0263,     Adjusted R-squared: -0.07107
    F-statistic: 0.2701 on 1 and 10 DF,  p-value: 0.6146

The above analysis, done in my BFF language - R, demonstrates that price is not a significant predictor of wine score. With a p-value of 0.615, the null hypothesis is looking a lot like a Pinot Party. A place where we can all be comfortable.
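
For readers without R to hand, the headline numbers can be reproduced from first principles. The sketch below is plain Python using the textbook simple-regression formulas on the same twelve scores and prices; the `simple_ols` helper is a name invented for this example. It stops at the t statistic because the standard library has no t distribution, so the p-value itself is left to R.

```python
from math import sqrt

score = [115, 196, 137, 146, 175, 212, 193, 184, 180, 167, 154, 143]
price = [18, 23, 29, 52, 33, 12, 26, 18, 10, 50, 13, 12]


def simple_ols(x, y):
    """Ordinary least squares fit of y = intercept + slope * x, with basic diagnostics."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = syy - slope * sxy                # residual sum of squares
    se_slope = sqrt(rss / (n - 2) / sxx)   # standard error of the slope
    return {
        "intercept": intercept,
        "slope": slope,
        "r_squared": 1 - rss / syy,
        "t_slope": slope / se_slope,
    }


fit = simple_ols(price, score)
for name, value in fit.items():
    print(f"{name:>10}: {value:.4f}")
```

The slope, intercept, R-squared and t value should line up with the `summary(l)` output above to the precision R prints.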

For bonus points, I leave it to my good readers to perform a contingency table analysis.

Romance King

One of the best things about being crowned a romance buff by the prestigious New York Post is that I now can get away with pretty much anything at home.

“Honey, can you do the dishes?”
“But I’m busy knitting you a scarf”
“Yes, but doing the dishes would be romantic. I declare it so.”

I now have a Frank Bruni-esque[1] power to declare anything romantic and it must be so. Anything else would be denying the Post’s role as an arbiter of New York cultural icons. Of which I am one.

[1]: Or whomever the NY Post’s equivalent may be.

Update: According to one of Shianling’s patients, we also made it to The Gothamist. Now I’m truly a cultural icon.

Why MySQL is the spawn of the devil.

Postgres:
    think=# select 'hello' = 'hello';
     ?column?
    ----------
     t
    (1 row)

    think=# select 'hello' = 'goodbye';
     ?column?
    ----------
     f
    (1 row)

    think=# select 'hello' = 'HELLO';
     ?column?
    ----------
     f
    (1 row)

MySQL:
    mysql> select 'hello' = 'hello';
    +-------------------+
    | 'hello' = 'hello' |
    +-------------------+
    |                 1 |
    +-------------------+
    1 row in set (0.00 sec)

    mysql> select 'hello' = 'goodbye';
    +---------------------+
    | 'hello' = 'goodbye' |
    +---------------------+
    |                   0 |
    +---------------------+
    1 row in set (0.00 sec)

    mysql> select 'hello' = 'HELLO';
    +-------------------+
    | 'hello' = 'HELLO' |
    +-------------------+
    |                 1 |
    +-------------------+
    1 row in set (0.00 sec)

This is not kindergarten people, this is UNIX. Do what I say, not what you think I mean.

Since appearing on the popular section of del.icio.us, a number of people have asked me to explain how it is, in fact, possible to replicate what we see in the following video:

What's going on here is really quite simple. All you need to do is hold the video camera with one hand and wrap the foil etc. using your free hand. It definitely helps if you have a small, hand-held video camera. If you have one of these it might be good to have a friend around while attempting the above.

Mike called my bluff. The book I refer to in this post was actually from 1988, not the 1960’s. The book is Dynamic Graphics for Statistics and can be had for a princely sum of 149 from Amazon. I’m certain that I paid approximately 1/100th of that for it.

In a semi-unrelated meme, I was chatting with my friend Aaron a few weeks back about why I disliked 3D graphics, especially in games. To his credit, I don’t play computer games so I miss out on much of the emotional nuance that comes from interacting with simulated gore and I probably shroud my argument with rationales purely to hide my irrational aesthetic bias. But that said, I don’t like 3D graphics for two interrelated reasons.

Firstly, I assume that the designers of these graphics curse themselves at the lack of rendering power at their fingertips. Sure, they think their blood spatter looks better than any game that came before it, but if only they had 64 GPU pipelines instead of 32, it would look so much better. Georges Seurat never complained about the number of lines per inch in his Gillot paper. The paper was chosen for its capabilities and its characteristics were exploited to promote the aesthetic that Seurat was seeking. I have long felt that art and constraints go hand in hand and successful creators express themselves within boundaries.

For the most part, game graphics are not timeless. Today people look back on games from 2003 and chuckle at the lack of realism. In 2009 the creators of Halo will secretly wish that they could go back and remake their games with 2009 era technology. Who knows if the makers of Galaga would prefer to recreate their game using today’s technology; my guess is not. Fundamentally, I find it a little offensive when people make graphics and aren’t happy with the outcome. Sure, game creators (and players) probably really enjoy their work.
But whenever they admit that they need a better computer to ‘get the full experience’, they are stating that they are not happy with what they have.

Computers are immensely powerful. But I fear that much of that power is being dedicated to doing things very uncomputer-like. If people adapted to and adopted the medium, we may have graphics that look very different to reality, but in the abstraction away from reality I think we can find fluid, informative, and medium-appropriate representations. People don’t have an innate understanding of what data looks like, and in the world of statistical analysis and visualization we can afford to break the mold. Perhaps not in many games, but I have seen some games that take this approach and invent new visual realities that are (to my untrained eyes) as emotionally engaging as poorly rendered trees and bump-mapped blood spatter.

I won’t dwell too much on my second concern with contemporary computer graphics as it is well covered by Tufte. Namely, the core concern is that too many pixels and graphic elements are used for non-informative purposes. Shading a chart, or purposelessly extruding a bar chart in 3D, adds no new information to a graphic. This is why on any new install of Excel I spend a good hour or so setting up new chart defaults to get rid of all the junk it tends to decorate my charts with. (Yes, at times, I use Excel.)

Going beyond the world of images, the same principle can be extended to interface design. Just as one needs fewer distractions from the information, interfaces need fewer mouse clicks and menu selections to play with the data. In this age of computer power, people often neglect the value of exploratory data analysis. Now more than ever, EDA is important. I have terabytes of data at my fingertips, and most visualization tools just don’t keep up. I want to be able to slice, dice, plot and histogram data with ease. Data retrieval isn’t the bottleneck, but rather poor interface design. 
What this world needs is a nice, simple, stand alone, does-one-thing-and-does-it-well exploratory data analysis tool. Maybe I should make one.

Although the movie is focussed largely on America and its historical imperial capitalist value system, it is equally relevant for Australia, since as the famous old adage goes, “When America sneezes the rest of the world gets a cold”, which seems particularly true in light of the information presented. Particularly regarding some of the more notable “Black flag” operations, such as 9/11, 7/7, and the Bali bombings, the date of which sadly escapes my current frame of mental reference; and the strikingly similar circumstances in which those so called “terrorist” attacks occurred. I have always had my doubts on the validity of the information the media presented on these generic operations. This movie has served only to assist me in strengthening my resolve to think for myself and question authority.

I must be getting old.

# Getting Married

## 5 Nov 2007, 3:18 PM

This weekend was my girlfriend’s birthday and my parents happened to be passing through New York. I figured the aligning of the stars was a sign that it was about time for me to propose.

Being unabashed nerds, our first date involved spending a decent chunk of the day browsing through the math book section at Strand. In a very xkcd spirit, we hunted down erotically suggestive math titles. Our favourite was “Tight and Taut Sub-manifolds.” Topology always seemed a little kinky.

Being particularly unimaginative I decided to propose at the site of our first date by constructing some books that spelled out the proposal along their spines. The ruse was completed by convincing her that I was unable to get her present delivered to the house on time, and thus we would have to drop by the store to pick it up. 
Rather than setting the books up in the humid and always busy basement of Strand, I arranged to set things up in the Rare Book collection on the third floor, away from most traffic. The store was very keen to help out as the New York Post had recently been looking for love stories to tie into an article celebrating the store’s 80th birthday. Given that my now-fiancée’s interests are more aligned around the natural world rather than mathematics, I planned to set the books up next to the rare natural history collection. It’s a lovely nook in the store and seemed like the perfect place.

Having nonchalantly covered my tracks for the previous month or so, it all broke down when we entered the book store. My plan was to casually peruse books and gently lead her to the natural history collection. Instead, it was more a case of yanking her towards the books and me childishly pointing to my proposal books and saying “Look!” She was stunned and I went on one knee to ask the big question. Despite being a relatively capable public speaker, I had considered the possibility that my mouth might stop working at that moment, so my goal was to get out at least four words: “Love”, “Light”, “Sweetness” and “Joy”. I got as far as “Love…. Marry me”. She said yes.

After some crying and hugging I realized that the books were set up, not in the natural history section, but next to it - in the medical section. In the photos, the book closest to my proposal is a title on gynecology. None of that mattered, of course.

It took a few moments to compose ourselves and then we wandered through Union Square to the W hotel where I had booked a suite. Thanks to some friends who work for Starwood we got comped up to their largest suite. Champagne, Strawberries, Bubble Bath. Nap. Then we walked down to Casimir in the East Village for a surprise dinner with 25 close friends and family. A group of us returned to the suite for some more drinks, dancing and fried food. By midnight we were exhausted. 
It was a fantastic day. My only regret was not seeing my friends from Australia and around the world. Hopefully I can rectify this shortly.

And now for some R code:

```r
# Get big.dat at http://www.state.nj.us/lottery/data/big.dat
big = read.table("big.dat", sep = "%", fill = TRUE)

# Build the draw date from the first three columns
big$date = as.Date(apply(big[, 1:3], 1, paste, collapse = "-"))

# Classify each draw into one of four periods by date
big$type = ifelse(big$date > "1999-1-13",
           ifelse(big$date > "2002-3-15",
           ifelse(big$date > "2005-06-22", 4, 3), 2), 1)
big$maxnorm = c(50, 50, 52, 56)[big$type]
big$maxspecial = c(25, 35, 52, 46)[big$type]

# Weight each number 1..56 by the picks it was eligible for:
# five per draw up to maxnorm, plus one per draw up to maxspecial
maxnorms = table(big$maxnorm)
p = rep(0, 56)
for (i in 1:nrow(maxnorms)) {
  top = as.numeric(names(maxnorms)[i])
  p[1:top] = p[1:top] + maxnorms[i] * 5
}
maxspecial = table(big$maxspecial)
for (i in 1:nrow(maxspecial)) {
  top = as.numeric(names(maxspecial)[i])
  p[1:top] = p[1:top] + maxspecial[i]
}
p = prop.table(p)

# Compare the observed counts of every drawn number with those weights
allnum = unlist(big[, 5:10])
t = table(allnum)
chisq.test(t, p = p)
plot(t / p)
```

As a quick hint: read.csv can accept a URL in addition to local files.
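A minimal sketch of that hint follows. To keep it runnable offline, it writes a throwaway CSV and reads it back through a `file://` URL; the data and the `local_url` name are made up for illustration, and the path handling assumes a Unix-like system. An `http://` or `https://` address goes in the same argument, and `read.table` should behave the same way.

```r
# Write a small throwaway CSV (illustrative data, not from any real source)
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3, y = c("a", "b", "c")), tmp, row.names = FALSE)

# Build a file:// URL for it (assumes a Unix-like absolute path)
local_url <- paste0("file://", normalizePath(tmp))

# read.csv() takes the URL in place of a file name
dat <- read.csv(local_url)
print(dat)
```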

TED quotes a very memorable statement from Derman’s “My Life as a Quant”:

Quantitative finance “superficially resembles physics,” he says, “but the efficacy is very different. In physics, you can do things to 10 significant figures and get the right answer. In finance, you’re lucky if you can tell up from down.”

Derman likes to bring up the analogy often:

Now I think that maybe “financial engineering” suggests too much precision, like using too many decimal places in specifying your height as 6′ 1.2345″. Maybe FE is a politically correct word for something more fuzzy.

Computer scientists should be called computer engineers. Nutritional scientists should be called nutritionists, without the science. Maybe people shouldn’t be allowed to self-describe and name their occupation.

What is the most accurate job name to describe what financial engineers do?

As someone who never made it beyond the national level in the Physics Olympiad due to my poor mathematics, I can’t wax lyrical about heat diffusion equations and their place in finance. However, I can guesstimate that a decent financial engineer spends 90% of their time trying to profit from other people’s mathematical approximations of human behavior. The best financial engineers spend 90% of their time convincing others that their models better correspond to reality.

TED’s post reminds me of an anecdote about the early mortgage market. At that time there were no models for prepayment or default risk. A prominent player in the market (with no formal mathematical education) hired a PhD student to build a rudimentary model and then applied the model to find a basket of underpriced loans. After amassing the portfolio he circulated the model to the street, creating demand for his loans. Now, that’s arbitrage.

# Is there room for visualization in art?

I just happened to come across this article while working on some code for analyzing financial data:

In short, the picture of Wall Street designers that comes across is revealing. The designers are smart, able, savvy. But they make up a distinct community of practice, one with lower status, limited financial knowledge and one that does not seem to fully communicate with the traders and bankers. In terms of innovation, they also seem to be paralyzed by the needs of their users. As we know from Christensen’s “The Innovator’s Dilemma,” users are a conservative group.

I’m quite busy right now, but let me add some random bullet points to the discussion:

• My current codebase has ~300 lines of R code dedicated to analysis
• A further 150 lines are for visualization
• That’s a 2:1 ratio for the mathematically inclined
• My charts have black backgrounds, no extraneous borders or labels.
• Every single pixel is important (See Tufte)
• No offense people, but 99% of the designers I have worked with in the past have none of the technical caliber of John Maeda
• My favourite book on statistical visualization was written in the 1960’s and I have yet to come across a visualization package as powerful as the one they describe.

In addition to poor underwriting standards, Ranieri says the mortgage market has brought in too many “middlemen with no axe to grind other than to collect money.” By this, he means that the traditional relationship between a local thrift or savings-and-loan and a depositor, who might have borrowed from the thrift, had changed.

From an article in Investment Dealers’ Digest about my former gig.

I really dislike youtube’s compression and random dropping of frames.

For full slashion [25MB]

Yes, I know it’s been a while, but after a short vacation from blogging I want to kick things off with a rather detailed reply from one of the successful Stock or Not players. JP writes the following:

Someone else said: “If distinguishing stock market data from random data is a skill that takes study and or talent, no amount of testing the untalented or unstudied would give you accurate results in determining if stock market data is distinguishable from random data.”

I don’t think so. Assuming your method is right, novices will learn the game over time. Furthermore, if you give people with absolutely no stock experience a minute of coaching, you can dramatically improve their score. My girlfriend could care less about the stock market, but after I told her the trick about high volatility at high prices and low volatility at low prices she scored a decent 68%, or 34/50.
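As a rough check on whether 34 out of 50 is better than guessing, one could run an exact binomial test. This is only a sketch: it assumes the 50 charts are independent and that an unskilled player picks correctly half the time, neither of which is stated in JP's note, and the variable names are invented here.

```r
correct <- 34   # charts called correctly, from the score quoted above
played  <- 50   # charts played

# One-sided exact binomial test against coin-flip guessing (assumed p = 0.5)
res <- binom.test(correct, played, p = 0.5, alternative = "greater")

res$estimate   # observed hit rate
res$p.value    # chance of scoring this well or better by guessing alone
```

Under those assumptions the p-value comes out well below 5%, which fits JP's point that a little coaching makes the fake charts detectable.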

If you want to win at Stock Or Not, download this great document which describes all my failings at generating fake stock prices. Well done JP, and sorry that it took me a while to get around to publishing this.

A number of peeps have now played with my phone - including an ex-Apple, reverse engineering expert. He was pretty useless, as he had far too many beers in him and required my girlfriend to do the typing for him. Everyone seems pretty impressed. My girlfriend noticed a few UI bugs. Nothing major.

I want to lay down a $599 bet that, at some point in the phuture, we will have a low cost iPhone Shuffle. One button - call a random friend. It’ll sell like hotcakes!

# i pone

## 29 Jun 2007, 11:47 PM

This is coming from my phone. I couldnt work out how to edit my old post . Text area has no scrollbar so i cant append as the keyboard has no arrows . Predictive text doesnt work on webpages as far as i can tell. I am currently at dinner being very rude. As i walked here we saw lpts of iphones. I hope this gets less frustrating.

# I Phone

## 29 Jun 2007, 11:03 PM

Not quite sure how it happened but I now have 2 iPhones. A coworker (let’s call him Luke) had paid someone to wait in line all day, but then that guy sold his spot - leaving my colleague down some cash, an umbrella and one lunch. Oh well, this is how free markets work, right?

I went along with Luke to observe the feeding frenzy and somehow got sucked in myself. We ended up paying a guy $50 to take his spot (which was considerably less than what Luke had paid his first schlub). At about 5 to 6, the line started moving. It took about 20 minutes to go through around 200 people. As we approached the cube of the 5th St. store, Apple hipster staff were waiting outside applauding people as they came in. Inside the store it was a swift process to part with $1.2k and we were out within minutes of being inside.

After activating my first phone (which required a download of the latest iTunes), I was up and running in no time. However I have _zero_ reception in my apartment. This wasn’t completely unexpected as I had no reception on my old T-Mobile GSM phone. I tried getting on to my WiFi connection, but in my buzzed state I seem incapable of remembering my password. Surprisingly, unlike on my BlackBerry 8800, it was very easy to enter in my password. As Mossberg says, the keyboard is a complete non-issue. I can type just as fast on the iPhone, using 2 thumbs, as on the berry. 
As for the WiFi reception, I live in a small apartment, and I am 6ft away from my airport and I only have 2 out of 4 bars.

The screen is great. I took my keys to it and tried to scratch it in vain. My original plan was to take it apart and work out what was inside, but I decided against that course and will just wait for someone else’s dissection later tonight.

I had to take my phone outside to get any reception and EDGE is god awful. It took me over 1 minute just to load the eBay homepage. Same for Slashdot. YouTube - YouGottaBeKidding.

I tried making some calls and without the headset, using speaker mode, I found it very difficult to hear the other end of the line. I will try again shortly.

I took some shots with the camera, both inside and out, and it’s lackluster. It looks like every other crappy phone camera. Yes, the UI is beautiful, I can pinch and flick to see how noisy the image is with ease.

Despite my mixed review another co-worker pounced on the chance to buy my second phone. That’s what the hype is like. Luke and I went to grab a coffee after we bought the phone to reflect on our participation in frenzied consumerism and I have never seen wait staff behave like they did. It was a fairly high end coffee joint (no Starbucks) and every waiter came to see what it was. Then the maître d’, and managers. No qualms in taking the phones out of our hands to touch and feel them. Mind you, at that point they were yet to be activated and so there wasn’t much playing to be had with them.

General score: meh. UI is everything and more I could want from a phone. Technically, it’s not that great. Keyboard - great. Phone - meh. Camera - pfft. Internet (EDGE) - argh. But then again, it just could be buyer’s remorse. I wasn’t planning on getting an iPhone, and I could be actively looking for flaws. I’ll use the phone more over the next few hours and provide updates if my feelings change.

# iPhone 74GB model??

## 26 Jun 2007, 1:03 PM

Ok. 
I’m totally sucked into the iPhone. And today I officially cross over into the world of fanboyism (or is that fanboyizm?) by posting this screenshot from Apple’s recently released activation video. Who knew a video about setting up a mobile phone account could be so exciting? Far be it from me to cast the first stone, but could it possibly be that a first version of an Apple device/software contains bugs!?

# Hot or Not

## 24 Jun 2007, 12:24 AM

| Wired | Tired |
| --- | --- |
| Pulp Fiction | Pulp Fiction on Bravo. Translated from English to Bravlish. And about 2 hours shorter than the original |

# Post #125

## 22 Jun 2007, 11:04 AM

One of the upsides of not smoking is that rather than using a cigarette to snap into crystal lucidity as soon as I wake, I often find myself mumbling around the house while I get ready in the morning.

“The ironical thing about Funky Town is that, as a song, it’s not so funky.”

“I don’t think ‘ironical’ is a word. Irregardless of that, you are misusing the concept of irony.”

“Yeah. Whenever I say irony, I can’t help thinking of all the times I’ve heard other people comment about its misuse.”

“I’ll tell you what irony is: people misusing the word ‘irony’.”

“Maybe that’s the joke.”

“Maybe.”

“Hm.”

“What?”

“Did you hear that Queen Margret died?”

One of the great things about having a slightly stocky Arizona preacher as my digital doppelganger (who is more highly ranked on Google than I) is that when people meet me in real life, they are somewhat pleasantly surprised. [1]

[1] My fact checkers inform me that, as of writing time, I have now clawed my way above the ‘other’ Joshua Reich in Google rankings. But Joshua Reich, the Jewish swarm intelligence researcher from Columbia University, still outranks me, and that is the way it should be.

PostScript: Other posts have been similarly fueled by lack o’ nicotine. See if you can work out which! 
# Blog Search WTF

## 19 Jun 2007, 6:26 PM

I’m IMing an entrepreneur and something he says reminds me of a recent post by Fred Wilson. I want to paste a link in the conversation, so I go to my always open Google Reader window and find that Google doesn’t have a _search_ feature. Google no search. WTF?

So I go to Fred’s site, which happens to be notoriously slow given the large number of widgets he uses. Knowing this, I figured he’d have a search widget. And he does - provided by Lijit (founded by Wandering Stan - one of my original colleagues from Root). Unfortunately, it doesn’t find what I’m looking for. Off to Technorati I head, and I have to wade through their interface to find the post. 10 minutes wasted.

# The Goon Show

## 19 Jun 2007, 11:42 AM

At Root we (I) saw a huge need for introducing structured futures contracts to the lead market. With the then impending meltdown of the mortgage market, expectations for future price volatility were much higher than historical volatility. Mortgage brokers operate primarily fixed cost businesses and in my conversations with our buyers they fundamentally understood the purpose of futures contracts and were eager to sign up for them. Brokers tend to manage their business on a cost per funded loan basis, and leave revenue management to their portfolio people. Likewise, most lead generators of note have a significant portion of their traffic driven by negotiated fixed-price advertising deals. Before the advent of lead ‘exchanges’, most advertisers sold their leads under negotiated deals. The exchanges introduced spot pricing.

I’ll cut this long story short, as I have written about this before. While we did launch something along the lines of what I was thinking, we never ended up going live with a fully fleshed out product. This was a real shame as the opportunities for this market extended beyond natural hedging to people with a speculative interest in mortgage market dynamics. 
Before Greg’s ignominious departure, we spent quite some time grappling with what would happen if the derivative market grew larger than that of the underlying. The unique nature of the lead market made this particularly concerning as, if you squint, a gold-chain wearing hairy-chested mortgage goon from Long Island looks mighty similar to a devious Goldman trader looking to exploit loopholes in whichever market they are in. On the Root Exchange, participants were always trying the weirdest tricks. Couple that with fungibility issues and the inherent inability to store leads for any significant period of time and the gaming opportunities could be immense if certain mechanisms weren’t properly designed.

The blogger that I-most-wish-I-had-a-crush-on sums these concerns up well with her thesis regarding the eventual unwinding of the sub-prime market. I’ll say no more save pointing you in her direction.

# Sell out

## 12 Jun 2007, 10:31 PM

Starting a new job always means filling out lots of forms and letting loose with personal information. I can only assume that some of this information has leaked, resulting in me, in addition to my usual junk mail, receiving ‘personalized’ offers from Amnesty International and the ACLU. Which database marketer worked out that new employees at hedge funds make excellent prospects for both of these organizations? What if marketing dollars were efficiently allocated… What’s up with that?

# What I Do

## 29 May 2007, 10:52 PM

Having recently changed jobs, it is not surprising that quite a few people are asking me what I now do. It took me a good two years to get my answer to that question down pat for my previous position, so I am not sure why I expect to be able to answer this after only a few weeks at the new firm. Perhaps it is because my current employer, who (purely for the sake of maintaining some level of intrigue) will remain nameless, has been in existence for a good 40 years or so. 
My role at the firm is to take over management of an internal skunkworks group that traditionally provided collateral equity research to supplement the work of our portfolio managers.

Let me back up a bit. The firm is an investment manager, for the most part in equities, both locally and abroad. Our investment thesis takes a purely growth oriented, fundamental analysis approach. And to a fly on the wall, the conversations around the office would be indistinct from the stream of pitches heard at a VC, save for the fact that we deal in publicly listed companies. I like this. Despite the fact that I am very much a numbers guy, the most enjoyable and challenging aspects of my roles in the past have been understanding the stories behind the companies where I have worked. In the future I would not be upset at all if I were involved in venture capital as I think I have a good sense of story. Unfortunately, as a numbers guy, it is difficult to distill the elements of a good story into a spreadsheet.

And thus the quandary with respect to describing my new role as Director of Technical Research. Overarchingly the position entails bringing quantitative techniques to fundamental analysis; numbers to stories. Mind you, the portfolio managers and research analysts have no fear of spreadsheet modeling, so the question I have to answer is where, above and beyond valuation 101, new techniques can be applied.

Day to day this involves managing a small team of technologists who have traditionally flexed an artful knack for discovering unique and proprietary data sources, somewhat a la Majestic Research. Recently we have geared down such activities for a number of reasons. But being developers, rather than waste cycles spinning, attention was turned to improving internal work-flow and a number of tools were developed to aid in performance monitoring and crunching through SEC filings. 
As useful as these tools are, the firm already has a dedicated development team, and dammit, tools are just not glamorous. And who works in finance, if not for the glamor?

The other major difficulty in defining my role is the pesky problem of efficient markets. For the past 6 years I have worked in companies that rallied behind the cry of efficient markets. At Root Markets we sought to stamp out inefficiencies in the lead generation business. Previously, at Traffion (which was eventually subsumed into Nielsen//NetRatings, where I hear my code is still alive and kickin’) our goal was to design an exchange for efficient trading of online advertising inventory (much like Google & Right Media do today). So whenever I think about trying to out-smart the system I find myself lamenting that all of my ideas must, by definition, be for naught due to the power of markets efficiently disseminating information. Yet whenever I do, I simply have to remind myself that efficient markets don’t pay for vanity tail numbers on private jets.

My first week on the job was fairly typical; remembering and forgetting names, working out how much to tip the shoe shine guy and where to get lunch. In week two, I was full of big ideas as to how to make a difference. Week three I worked out why none of those ideas had worked in the past. Now I am coming to terms with the fact that while my directive is easy to comprehend ("help our clients make more money") the direction I need to take is more fuzzy. And the pace is certainly very different to working at a startup - I have much more time to think about things and I hope that with time, the ideas will come.

And happy to discover that just as Dark Side of the Moon syncs perfectly with Alice in Wonderland, Jean Michel Jarre’s Zoolook album syncs perfectly with a random walk around the West Village.

# twittering

## 30 Apr 2007, 12:00 PM

After much watching from the sidelines, I signed up for twitter today. 
But I have no friends on twitter, so excuse me whilst I twitter here. As I shuffle towards the end of my Melbourne tour, I had a pretty interesting day today. In twitter style I’ll keep it short’n’sweet.

1. Coffee with private investor
2. Sandwich with advertising executive
3. Phone with fund startup guy
4. Dinner with TV post-production whiz
5. #4 coins the phrase “invespionage”

Two more days back home and then I return to the U.S. of A.

# Not a drop to drink

## 20 Apr 2007, 10:38 PM

I’ve been in Australia for 24 hours now and the biggest news here is the drought engulfing much of South-East Australia. This region, which includes my home town, not only supports the bulk of the Australian population but also provides for the majority of fresh food for local consumption and export. Both of these are supported by the fact that the rest of Australia, for the most part, is arid.

Just prior to one of my first meetings with the then to be chairman of Root Markets (my former employer), Lew Ranieri had returned from a series of meetings in Washington with Julian Robertson (ex Tiger). While these gentlemen are probably not the first names that come to mind when thinking of environmentalists, their agenda in Washington was to rally behind environmental initiatives, arguably to support their interest in various trading opportunities. Lew relayed to me Julian’s belief that Australia was selling itself short with its export food industry as his opinion was that rather than selling food to the rest of the world, Australia was really just selling water - one commodity that Australia does not have in ample supply.

From an article from today’s The Age newspaper:

“Do you like steak?” one farmer asks The Age. “City people are all worried about whether to press the half or full-flush button on their toilet, or how to save 20 litres of water in the shower. It’s pointless. It takes 55,000 litres of water to make a kilo of beef. 
That’s where your water goes, to make food.”

One mechanism currently in place to prioritize the allocation of water to farmers in the region is a system of water trading:

A water right is an entitlement to irrigate using a certain amount from the river system. Some bought their rights when purchasing their farms, others inherited them. […] Water traded permanently has fetched more than $2000 a megalitre, while temporary water has sold for as much as $950 a megalitre. For example, Mr Lee says that if he sold his yearly allocation of water he would make $160,000 — and without growing a single grape. It would be more than double what he made last year, slogging his guts out.

As a freemarketeer I find the essence of this plan strongly appealing. However the majority of farmers hold a different opinion. It appears that the concept that their livelihood has been supported through an effective subsidy on our most precious resource is entirely foreign. If Mr Lee is unable to produce grapes with sufficient profit to pay for the water he uses, then we must learn to live without Mr Lee’s grapes. For a long time Japan has been criticized for subsidizing local rice production, given the vast tracts of land required to grow the grain. I suppose Japan supports local rice growers given their institutionalized nationalism and would never deign to admit that rice grown from the expansive paddies of China is of the same quality. Likewise, Australian farmers feel deep shame now that their inefficiency has been efficiently priced. I agree that the effects on a region can be immense when the bulk of farmers can no longer afford water, but we must understand that the environment, and the economy at large, will benefit from efficient allocation of a scarce resource.

“They’re arrogant, they’re selfish, it’s totally immoral,” shouts one farmer, John, at this week’s meeting. His anger is met with thunderous applause.

I, too, applaud this statement. Two out of three is not bad. Yes, I am arrogant. And yes, we are selfish. But free market pricing of water is not immoral, as the selfish actions of individuals in a free market result in fair prices and benefit the greater good.

Update: The numbers in the article don't seem to make sense. If it takes 55k liters to produce 1kg of beef, and annual water rights sell for 'as much as' $950, then 1kg of beef would use $52 of water. According to my mum (Hi Mum!), 1kg of beef costs around $25. Either the water markets are (dare I say) illiquid, or the quoted numbers are wrong. I will do some more research into this.
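The arithmetic behind that update can be sketched in a few lines of R. The figures come straight from the article quotes and the update above; the variable names and the break-even calculation are additions for illustration, and the comparison ignores every cost of raising beef other than water.

```r
litres_per_kg_beef   <- 55000   # quoted by the farmer in The Age
litres_per_megalitre <- 1e6
price_per_megalitre  <- 950     # "as much as" price for temporary water
retail_beef_per_kg   <- 25      # per Mum

# Water cost embedded in one kilo of beef at the top temporary price
megalitres_per_kg <- litres_per_kg_beef / litres_per_megalitre
water_cost_per_kg <- megalitres_per_kg * price_per_megalitre
water_cost_per_kg   # 52.25

# Water price at which water alone would equal the retail price of beef
break_even_price <- retail_beef_per_kg / megalitres_per_kg
break_even_price
```

So at the quoted top price, the water alone would cost roughly twice what the beef sells for, which is the inconsistency the update points at.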

Ooh! I spot a free power outlet in the janky bar. Time to order a scotch, plug in and tune out.

Only 27 hours until I’m in Australia.

Update: Man, I've had faster internet access over an acoustic coupler.

# John Waters - News Director CNN

## 17 Apr 2007, 3:12 PM

In my unemployed state, I am using my time to catch up on news. CNN International is my source of choice. I have also been catching up on reading too. When not watching TV I am currently reading John Waters’ Crackpot. The book is a series of short stories giving a glimpse into the author’s world. The opening chapter is “John Waters’ Tour of L.A. (1985)” and includes the following paragraph:

Much more elusive was Annette Funicello’s garbageman. If you hang out all Wednesday night, the night she puts out her garbage (16102 Sandy Lane, Encino), you might spot him. His boss graciously declined to give out information, falsely assuming I wanted to look through Annette’s cans.

CNN just aired an informative interview with Cho Seung-Hui’s parents’ mailman. Paraphrasing:

Q: What can you tell me about the family?
M: Not much. They usually are not home when I deliver the mail. I guess that means they work.

I’m enriched and I’m sure Mr. Waters is enthralled.

Update: Now CNN is introducing the mailman as "Someone who has gotten to know the family quite well".

I am obliged to note Google’s recent acquisition.

# What is stock or not testing?

So it seems to me you’ve proved that stock market data is not random. Your procedure for generating “random” data generates a dataset that is far from random. Have you even looked at how many datasets from the universe of truly random data cannot be generated by your “random data” procedure?

Despite the fact that you keep trying to generate data that is more and more “like” actual stock market data (and hence less and less random), skilled players are still able to spot the differences.

The comment above highlights a common question people are asking both over this blog and email.

I’m not sure as to how someone would construct a test to see whether one dataset was more random than another set. However I am familiar with the efficient market hypothesis and I believe the method I use to generate random stock prices is in accordance with at least weak form of efficiency. Namely, the current price is the best predictor of the future price and you can’t beat the market by looking at patterns.

If I wanted to go out and test whether technical analysis can predict future values, I could simply do a test whereby I show a chart and ask people whether they would buy, sell or hold on the basis of the chart alone. We could then look at returns over certain periods and see whether some people are able to generate statistically significant returns above the market rate.

Now, if we assume that people can generate above market returns simply by looking at a chart (with no additional information about fundamentals), then we must believe that the chart itself has predictive power. In other words, a skilled examination of historical prices and trends can be used to predict future prices. The weak form of EMH states that the expected value of the price tomorrow is equal to the price today. And my method of generating random stock data observes this. So, by testing whether people can tell the difference between randomly generated data and actual data I am testing a super-set of EMH. If people can tell the difference, there are 3 possible reasons:

1. My model is imperfect. It is quite possible that there are better models that can be used to generate random stock prices that still obey the requirement that future prices are best predicted by the current market price.
2. My model is poorly calibrated. Even with a ‘correct’ model formulation, the model may not be calibrated to realistic market parameters.
3. It is impossible to generate ‘realistic’ stock price data under weak form EMH.
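To make the weak-form requirement concrete, here is a minimal Python sketch of a zero-drift random walk in which tomorrow's expected price equals today's price. This is not the generator used on the site (that one is described elsewhere as GARCH+Jumps+Rounding); the function names, the fixed 1% step and the seed are assumptions chosen purely for illustration.

```python
import random


def expected_next_price(price: float, step: float) -> float:
    """Exact expectation of the next price under the toy model below.

    The price moves up or down by the same fraction with equal probability,
    so the two outcomes average back to the current price.
    """
    up = price * (1 + step)
    down = price * (1 - step)
    return 0.5 * up + 0.5 * down


def random_walk(start: float, n: int, step: float = 0.01, seed: int = 0) -> list:
    """Generate n steps of a zero-drift multiplicative random walk."""
    rng = random.Random(seed)  # seeded so the path is reproducible
    prices = [start]
    for _ in range(n):
        move = step if rng.random() < 0.5 else -step
        prices.append(prices[-1] * (1 + move))
    return prices


path = random_walk(100.0, 250)
```

Any model with this property satisfies the "best predictor is the current price" condition, which is why reasons 1 and 2 above leave so much room: volatility clustering, jumps and tick rounding can all be layered on without adding drift.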

I believe all three factors are at play here, with many problems arising for highly illiquid stocks or stocks that trade in small-penny ranges. An easy way to test this would be, as many people have suggested, to remove such stocks from the site. I will probably do this some time this week - at the moment I’m having a few technical difficulties related to the server move from the US to Europe (see my previous post).

I’d like to thank everyone for their feedback, and for continuing to play the game. I did not expect it to go this far and am enjoying all of your suggestions and critiques.

I just moved this site to a server in Europe. If there were a faster way to transfer ~100GB of data to some random datacenter in Haarlem then the rest of i2pi would be on the new server too.

See here.

Let’s draw a line from (0,0) to (100,1):

Still with me? OK! This line takes y values along the real line from 0 to 1. It’s nice and continuous. Let’s pretend that instead of using all values from 0 to 1, we can only use 0 or 1. For a first try, let’s round each of the values below 0.5 to 0 and everything else to 1:

Nothing surprising here. But we have clearly introduced error into our time series. This is what the error looks like:

What if we kept track of our cumulative error and, whenever it became larger than 1, we rounded up instead of down? Let’s call this function ns(). At first ns(4.1) => 4. After 9 more times ns(4.1) => 5. Over these 10 iterations, the average value is now (4×9 + 5)/10 = 4.1. Much better than rounding 4.1 ten times with the simple rounding operation and ending up with an average value of 4.

Do you want to see a picture?

Doesn’t look much like a line from 0 to 1, but at any point along the chart the local average is much closer to the underlying continuous line than the simple rounding version. Reminds me of noise shaping.

Here is how I do it in R:

    ns <- function(x) {
      rx <- round(x)
      ns_err <<- ns_err + (x - rx)
      if (ns_err > 1) {
        rx <- rx + 1
        ns_err <<- ns_err - 1
      } else if (ns_err < -1) {
        rx <- rx - 1
        ns_err <<- ns_err + 1
      }
      rx
    }

    rounds <- function(x) {
      n <- x
      ns_err <<- 0
      for (i in 1:length(x)) n[i] <- ns(x[i])
      n
    }
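For anyone without R handy, the same error-feedback idea can be sketched in Python. This is a port for illustration only: `shaped_round` is a made-up name, and the running error lives in a local variable rather than a global. It keeps the same strict `> 1` / `< -1` thresholds, so the carried error never strays outside the range -1 to 1.

```python
def shaped_round(values):
    """Round each value, carrying the accumulated rounding error forward.

    When the running error exceeds 1 in either direction, the current
    output is nudged by one to pay the error back.
    """
    err = 0.0
    out = []
    for x in values:
        rx = round(x)
        err += x - rx
        if err > 1:
            rx += 1
            err -= 1
        elif err < -1:
            rx -= 1
            err += 1
        out.append(rx)
    return out


# The line from (0,0) to (100,1), sampled at 101 points.
ramp = [i / 100 for i in range(101)]
shaped = shaped_round(ramp)

# A long run of 4.1s: the long-run average should sit close to 4.1.
flat = shaped_round([4.1] * 1000)
print(sum(flat) / len(flat))
```

Because the carried error stays bounded, the running total of the rounded series never drifts more than about 1 away from the running total of the original series, which is the "local average" property described above.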

I think there is some market microstructure in there somewhere. Or I had one too many at the bar. Perhaps both?

I’ve had 175 people play stock or not and the results have me scratching my balding-like-Fama head. Of the 3,515 games, people were right 1,832 times (52%). Now I may have a case of Friday stats trauma, but:

Alternatively, if we assume that there is a 50/50 chance of picking the right chart, the observation of successes so far is 2.51 standard deviations away from the expected mean of 50%. This leads me to think one or more of the following is occurring:

1. People have worked out that by hitting ‘refresh’ they can take a pass on making a choice until a really ‘obvious’ set of alternatives is presented
2. There is some factor I am not successfully capturing in my GARCH+Jumps+Rounding random stock generator
3. OR, there really is something funky going on
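The 2.51 standard deviation figure can be reproduced with a normal approximation to the binomial. A small Python sketch, using only the counts quoted above (the function name and the one-sided tail calculation are additions for illustration):

```python
import math


def binomial_z(successes: int, trials: int, p: float = 0.5) -> float:
    """Standard deviations between the observed count and the null mean."""
    mean = trials * p
    sd = math.sqrt(trials * p * (1 - p))
    return (successes - mean) / sd


z = binomial_z(1832, 3515)

# One-sided tail probability under the normal approximation.
p_value = 0.5 * math.erfc(z / math.sqrt(2))

print(round(z, 2), p_value)
```

Note that this treats all 3,515 games as independent coin flips, which the refresh trick in point 1 would violate.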

If you are one of the people who is doing amazingly well at the game, please get in touch with me. I’d love to know your strategy.

Update: OK. If I use the refresh trick, I can get a high score. There are some patterns that I see that are dead giveaways. However, the person currently leading the scores did not use this trick. Curiouser & curiouser…

Update 2: My port-a-conomist is not being very helpful in identifying the flaws in my model. However, he has shared with me some wisdom:

When faced with a seemingly insurmountable problem, I tend to make clicking noises or chew my pen. I’ve decided that it might just be more effective to do this instead:

Session cookies, if we are feeling anal

## 29 Mar 2007, 2:48 AM

I have taken a short break from working on my calculator to vent my frustration at technical analysis. After the fact, sure, I can see the heads and shoulders and Elliott waves and Bollinger bands and resistance giving way to support. A priori, nope. If technical analysis made sense, then that would imply that there is more to asset price evolution than random innovations. I took some time out today to build a little model to generate ‘realistic’ stock prices based on pretty much standard models that are EMH-Kosher. If you are a great chartist, then you should be able to tell the difference between random prices and real data. Wander over to stockornot.i2pi.com and test your mad skillz.

If I feel so inclined, I’ll add scores and such later tonight so you can see if you can beat the magical Technical Turing Score of 50% accuracy!

Update: I felt so inclined and added some scoring and a shiny new logo! To date I’ve had 60 different people play, each playing an average of 19 turns. Across the board the average score is 51.09%. It looks like Fama is winning.

My personal best score is 57 / 100 (or 57% for the statistically inclined). My next challenge is to get this registered as an official sport and get it on Betfair. With my track record (highlighting the value of knowing exactly how the random data is generated) I could make a killing!

One of the great things about living so close to the Magnolia Bakery is that I can avoid eating their cupcakes and instead browse books across the road at the Biography Bookstore. However, whenever I do, I always end up buying a book or two.

This weekend I read Confessions of an Economic Hitman. I didn’t know anything about the book before I picked it up, but after reading the first chapter I felt kinda dirty. Don’t get me wrong, I love post-imperialist conspiracy theories, but this book is terribly limp. The basic premise of Perkins’ tale is that he was employed at a major consulting company which allowed him to work as an Economic Hit Man (EHM). According to Perkins, EHMs are employed around the globe with the sole goal of selling debt to small nations. The debt serves two purposes: firstly, it finances the purchase of American goods and services; secondly, it presents such a burden on these small nations that they become trapped as allies of America.

While this may be true - one only has to listen to Eisenhower’s final speech to get hints of this - my gut feeling is that rather than a covert conspiracy we have a situation whereby incentives are aligned to produce these outcomes. I know that statement is terribly vague, but it’s not what I am blogging about today. One particular detail that Perkins reveals in his book is that his first international consulting gig was to the island of Java in Indonesia. His role was to inflate electricity consumption forecasts so as to justify the need for spending on infrastructure which in turn would be financed by foreign debt. Perkins talks of the gut wrenching he experienced by ignoring his superior, who booked an estimate of 5% year on year growth, and forecasting 15% for 25 years!

One of the great things about forecasts is that 25 years later we can get the historical stats and have a look-see at exactly how far off Perkins was. If you take a gander at the chart below hopefully you, like me, will fail to understand how Perkins lost sleep by corrupting the hapless Indonesians. My eyeballing gives a compound growth rate of 13%. I’d be pretty happy if I could pull off estimates that stood the test of time so well.
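To put the three growth rates side by side, here is a short Python sketch of what each compounds to over 25 years, plus the inverse calculation for backing a rate out of a start and end value. The rates are the ones quoted above; the function names are just for illustration, and no actual Indonesian consumption data is used.

```python
def growth_multiple(rate: float, years: int) -> float:
    """How many times larger a quantity is after compounding at `rate`."""
    return (1 + rate) ** years


def implied_cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate implied by a start and end value."""
    return (end / start) ** (1 / years) - 1


for label, rate in [("superior's 5%", 0.05), ("eyeballed 13%", 0.13), ("Perkins' 15%", 0.15)]:
    print(f"{label}: x{growth_multiple(rate, 25):.1f} over 25 years")
```

The 13% and 15% paths end up within a factor of about 1.5 of each other, while the 5% path finishes roughly an order of magnitude lower, which is the point of the comparison.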

Nothing pains me more than poorly-formed data. In a showcase of fantastic Web 2.0 technology, this site demonstrates how easy it is to let form corrupt function. With only three clicks, I was able to determine that there were 4 winners of the Nobel Prize in Physics who were male and female. And in an odd coincidence, all four of these people are John Bardeen.

I would copy-and-paste the self-aggrandizing text from the offending web site, but this is not possible. Because the site is so damn craptacular. Grrr.

How is it that this page gets highly ranked on various aggregators, but is so clearly broken?

About 2 years ago, I was introduced to the blog-o-sphere. Seth and Greg convinced me to install a feed reader and since then keeping up to date on posts has become part of my morning routine. A few days ago Rands briefly mentioned the question “What if you had two minutes to live and could only send text messages?”. It is somewhat morbidly perverse to frame one’s life within the bounds of a technology platform; nonetheless, I’ll follow the same vein and list my top five blogs:

Perhaps with the exception of the latter, there is a clear theme here. I like finance & economics. I joined Root to bridge the gap between my background in marketing analytics & technology and finance. Two weeks ago I resigned.

As the first employee at Root it was a difficult decision to make. I first became involved in the business over 2 years ago and saw it grow from a small idea, to a number of different ideas, and back down to a core idea. Over that time we acquired two companies and launched a number of innovative changes in the mortgage lead business. Today the company clearly offers the best exchange technology and with our new leadership we have an unrivaled pedigree in the mortgage business. As we raised capital, I met with some of America’s greatest financiers and learned a good deal about building marketplaces, companies and the mortgage business in general. More so, I helped put together a fantastic team and Root is now in a position to succeed with or without me.

I came to America in 2004 to pursue my MBA, with the intention of moving into a sales & trading type role. Working in campaign analytics gave me a good sense of the value of my quantitative skills, however working with traditional advertising types was somewhat demoralizing. While I had quite a few savvy clients, for the most part the concept of ‘ROI’ didn’t mean much to your typical campaign manager. Much has changed since then, but I felt that I would be better off applying my skills in an industry where P&L is everything.

As the mortgage market experiences a downturn, it will be the more sophisticated participants in the industry and on the exchange who stand to benefit. Even today the majority of the mortgage lead business is catered towards small brokers who look to profit by turning quick refinances. I strongly believe that this will change and more top-tier lenders will enter the market. At least for the sake of the consumers, I hope that this is the case. And while I would like to be involved in this transition, no matter what lengths we go to bringing financial innovations to this space, it is essentially a marketing services opportunity. And my passions lie elsewhere.

I am planning to take April off to concentrate on some side projects and to visit friends & family in Australia. Come May, I’ll be doing something new.

The tables at the diner were almost stacked on top of each other. With my burger, thanks to the vocal lawyer or architect perched dangerously close to my lap, I learned that the only real way to understand the ins and outs of litigation was to find yourself in a situation whereby you were lost in the depths of suing yourself.

The gradual erosion of inter-patron distance has led to a distinct change in the rhythm of dining out. Waiters are now bipedal alternators, gyrating as they maneuver their way from table to table. Their constant swiveling achieves two goals. Firstly it imbues the venue with a cadence that keeps one from getting too antsy about how long it takes to get a meal in this town - clearly things are happening and it’s only a matter of time before your meal is served. But I feel the true purpose of sashaying whilst channeling Chubby Checker is to avoid patron-head to staff-crotch/butt contact. This is a big no-no and in these litigious times could clearly land an otherwise upright waiter in court. These are truly devolutionary times.

In my calculator infused haze, I almost forgot that yesterday was Number 6’s 79th birthday! Pop goes the weasel.

Update: Whoops! Number 6's birthday is actually the 19th of March, which also happens to be Patrick McGoohan's birthday. Talk about a creepy coincidence.

Update: I’ve overlaid the HP-42S keyboard on this. The function keys don’t 100% map to my layout, but it gives you an idea of what it does.

The calculator keyboard is now working and hooked up. Nothing too technical going on here. Yet it took me a while to get it to work right. I have to learn to go to bed at a reasonable hour and not keep hammering away at projects when I’m not at my best. The rational Josh knows that he should stop long before spending 2 hours writing a software workaround to avoid what would have otherwise been achievable with a 15 minute soldering job. The annoying fact is that I always know when I get into such states, but unfortunately there is little I can do to get myself to walk away and get some rest. During the work day it’s easy to take a 15 minute break. However when it’s 2am, taking a 15 minute break means 15 minutes less sleep. And I always seem to believe that the problem is solvable in 10 minutes. Thus the cycle continues, and before I know it the sun is coming up.

Oh welp. One day I will learn.

In the past week I have:

• Ditched the Yagarto/Eclipse/Windows development environment. It was taking me 10x longer to do anything compared to my usual dev setup. I think this old dog is totally incapable of learning new tricks. I am very comfortable using gcc/gdb/openocd/vi/linux.
• Ported my calculator code to the ARM7. I’m pretty happy that my malloc routines worked out of the box. I had to write some minimal code to replace curses with an LCD driver. I considered using all of newlib for sprintf/floating point. In the end I wrote my own sprintf as I only needed a few small functions. I did steal some newlib floating point code.
• Placed a big order on mouser.com for random 4000 series CMOS chips, resistors, capacitors, and importantly 40 push button switches. I plan to assemble the keypad in the next few days.
• Learned that although 64k of RAM was cool for me until circa 1992, I need more RAM. The original HP-42S had 7k (IIRC), but I did spend quite some time writing SQL like functions for my language and would love to have 32MB+ of SDRAM at my disposal. The problem is that the ARM7 chip I am using has no SDRAM interface, so I’m currently deciding between using an LPC3180 or a SAM7SE. The latter is very similar to my current chip, but with the added bonus of both NAND flash and SDRAM controllers built in. The LPC3180 is an ARM9 architecture chip which boasts memory management in addition to a vector floating point unit. Ideally I would use the LPC3180, as hardware floating point would give a massive speed boost to a calculator. However, over the past week I have become fairly comfortable with the ARM7. If I were to go with the ARM9, I would have to buy a new evaluation kit and re-do all the work I have done over the past week. I need to do some soul searching to find out whether my drive for parsimony (laziness) is greater than my desire for hardware floating point.

As you can see, I have the screen working.

If someone uses a quote from me in their signature on a Final Fantasy Legend forum, does that make me a nerd by reference?

After way too many hours putzing around with a JTAG wiggler I bought from diygadget I gave up and built my own. The DIY Gadget one doesn’t come with a full schematic so it was a pain to work out exactly why it wasn’t working. My wiggler is based on the schematic from Dominic Rath (of OpenOCD fame) but without any buffering. I’m running it at a slower speed, but it seems to work a charm.

I worked through the Yagarto tutorial and got a simple blinking LED program working!

This makes me very happy. Now that my development environment is all set up and working, I can start wiring up my screen and keypad. After that I’ll work on porting the calculator OS that I have been working on.

Here is the dead simple wiggler schematic that I am using:

    DB-25    Function    JTAG
    3        TMS         7
    4        TCK         9
    5        TDI         5
    11       TDO         13
    18-25    GND         4-20

On pins 3,4,5 & 11 I have a 100 ohm resistor. NB: This is working with the Olimex SAM7-P256.

At risk of “forfeiture of [my] entire interest“, I’d love to share some “CONFIDENTIAL INFORMATION” regarding the latest news in the Skye Ventures / Bandagro bond case. The news is that Skye has gone to extraordinary lengths to implement silly security on their website. While I will not share with you the full details (you should be able to work it out yourself), Skye has chosen to embed the password to their ’secure’ website within the source of their page. Of course, they have also ‘removed’ the ability to view the source of their page by disabling the right click mouse action.

Let me explain…

1. Skye Ventures has documents that they only want to share with select parties. One must assume that these parties have been given a password upon signing some agreement.
2. Rather than using standard security practices, they have implemented their own security system.
3. This system checks the supplied password against a password stored within their web page (un-encrypted)
4. If the passwords match, it redirects the browser to a document. The location of this document is also stored within the web page completely un-encrypted.
6. In an extraordinarily feeble attempt to prevent humans from accessing this source, the web designer has used a trick to disable the right click button on your mouse (which is usually used to access the ‘View Source’ option)
8. Rather than jumping through so many hoops in attempt to build a secure web site, Skye should have used any number of standard methods for web security…
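For contrast, here is a rough Python sketch of one standard approach: the check runs on the server, and only a salted, slow hash of the password is ever stored, so nothing useful is sent to the browser. It uses only the standard library. The function names and the iteration count are illustrative assumptions, not anything taken from Skye's site, and a real deployment would more likely lean on the web server's or framework's built-in authentication.

```python
import hashlib
import hmac
import secrets
from typing import Optional, Tuple

ITERATIONS = 200_000  # assumed work factor, for illustration only


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest); store these server-side instead of the password."""
    if salt is None:
        salt = secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-hash the supplied password and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)


# Hypothetical usage: the document is only served after this returns True.
salt, stored = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, stored))
print(verify_password("guess", salt, stored))
```

The document location also stays on the server and is only revealed after a successful check, so viewing the page source tells a visitor nothing.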

The first mistake in writing your own security system is the belief that you should be writing your own security system.

As paralleled in the world of calculating stock indices averages, yesterday I increased in age by about 1.7 years in a matter of seconds. Well maybe parallel is not the right term - my age increased by 6%, but the DJIA dropped around 4%. But both arose thanks to the power of modern computers. (NB: I don’t think the market drop was caused by a computing snafu)

As I ponder my new found years, I find myself paying more attention to the rising commentary on mortality bonds. My question to the financial glitterati is that if I get a lump sum payment on retirement, will I be able to obtain an annuity that pays me at an increased rate if I promise to take up smoking again? If not, where is all this market efficiency I’ve been hearing about?

I was just on the phone and the operator wanted to know how old I was. I said 27. I also happened to have a SQL window open, so I did a quick check.
    select extract(days from now() - '1978-06-18') / 365;
         ?column?
    ------------------
     28.7150684931507
It turns out that I am actually 28.
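The same check in Python, as a sketch. The "today" date is an assumption: the query output corresponds to 10,481 elapsed days, which lands on 27 February 2007, so that date is used here in place of `now()`. Dividing by 365 ignores leap days, so a calendar-based age is included alongside it.

```python
from datetime import date


def days_over_365(born: date, today: date) -> float:
    """Mirror of the SQL above: elapsed days divided by 365."""
    return (today - born).days / 365


def calendar_age(born: date, today: date) -> int:
    """Age in whole years, counting actual birthdays."""
    before_birthday = (today.month, today.day) < (born.month, born.day)
    return today.year - born.year - before_birthday


born = date(1978, 6, 18)
today = date(2007, 2, 27)  # assumed date, inferred from the query output

print(days_over_365(born, today), calendar_age(born, today))
```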

In other news, it turns out that I am actually a dork.

I spent far too much time on the weekend looking into various microcontrollers for my calculator project. Rather than make a rational choice, I decided to just buy one from an online store that sold pretty much every part I needed. Its not the ideal evaluation board, but its a start.

Here’s what you got:

• 1 x Development Board Atmel SAM7-256 (SKU#: SAM7-P256) = $69.95
• 1 x Graphic LCD 128×64 STN LED Backlight (SKU#: LCD-G12864) = $19.95
• 1 x USB Cable A to B - 6 Foot (SKU#: Cable-USB-AtoB-6) = $2.95
• 1 x Wall Adapter Power Supply - 6VDC 300mA (SKU#: Tools-PS-6V) = $4.95
• 1 x LiPoly Charger - Single Cell 3.7-7V Input (SKU#: Batt-CG-MAX1555) = $14.95
• 1 x Polymer Lithium Ion Batteries - 2000mAh (SKU#: Batt-LIP2000) = $12.95

If I have learned anything about promoting software, it is that a screencast makes or breaks the deal! So here is a little movie I call

How to count down from 5,000 in under one minute: A Tale of Tail Recursion.

A comment on my previous post pointed me to the Joy Programming Language, which appears to be similar to what I am developing. However, my language is designed to be used on a calculator, which means that it must be usable with a small screen (I’m shooting for 30 characters x 4 lines). Importantly, it is designed to be used with a limited keyboard and thus with a minimal number of keystrokes.

The program above is ((32 # 1 - 32 @) (3 #) i), which can be read as
1. Store the following in register 3: If (register 32 less 1 (store back in register 32)) is not zero then execute register 3
2. Store 5000 in register 32
3. Execute register 3
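The three steps can be mimicked in Python to show why the tail call matters. This is only a loose model based on the reading above, not the calculator's actual interpreter: the `registers` dictionary and the `program` function are invented for the sketch, and because Python does not eliminate tail calls, "execute register 3 again" is modelled as a loop that re-runs the stored program instead of a nested call.

```python
def count_down(start: int) -> int:
    """Run the countdown and return how many times register 3 was executed."""
    registers = {32: start}  # step 2: store the starting value in register 32

    def program() -> bool:
        # Register 32 less 1, stored back in register 32.
        registers[32] -= 1
        # True means "not zero, so execute register 3 again".
        return registers[32] != 0

    registers[3] = program  # step 1: store the program in register 3

    executions = 0
    while True:  # step 3: execute register 3, reusing the same frame each time
        executions += 1
        if not registers[3]():
            break
    return executions


print(count_down(5000))
```

A naive recursive version of the same thing would need 5,000 nested frames, more than Python's default recursion limit allows, which is exactly the cost a tail-recursive interpreter avoids.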

I love long weekends!

Since the death of my HP-42S, I’ve been fantasizing about what my ideal calculator would be like. Last week I started coding up a proof of concept for a calculator language that combines what I think is the best of RPN and Lisp. I have a deep rationale behind how and why I want to build this and the importance of designing a language specifically for a calculator and the HID considerations of such. I won’t go into any of that now. But here is a quick snippet of a test session (my comments are in italics):

*push 1 onto the stack*

    > 1
    > 2
    > 3

*pop all numbers off the stack and turn them into a vector*

    > v
    -> [ 1.000000 2.000000 3.000000 ]

*turn on Programming Mode*

    > ‘
    Program Mode: ON
    > 10

*In programming mode operations are pushed onto the stack, rather than executed*

    > *
    -> ‘*
    > ‘
    Program Mode: OFF

*show the stack, top down*

    > show
    (000) ‘*
    (001) 10.000000
    (002) [ 1.000000 2.000000 3.000000 ]

*The > operator turns the contents of the stack into a list*

    > >
    -> ( [ 1.000000 2.000000 3.000000 ] 10.000000 ‘* )

*Execute the top of the stack*

    > x
    -> [ 10.000000 20.000000 30.000000 ]

Colin Raney wrote:
> http://www.midasoracle.org/2007/02/04/blog-of-the-day-joshua-reich/
>
> hows things man? How have you been?
>
Things be good. Living life with all the benefits that come from being blog-of-the-day-joshua-reich.

Chris, my theme is a bastardization of fun-times-inspired-by-mistylook-1. I rejigged some of the CSS and removed all of the hidden SEO tricks that the author had embedded to promote his/her own site.

What’s this? I just discovered Technorati and found out that most of my readers came to my old blog to learn & ogle at my camera hack project (original, preblogging page is here). Here is one post that refers to me:

I have no idea what made Josh Reich decide to take apart his VX2000 and replace the lens, but I am certainly glad he did. It is very well documented minus a real purpose. I can’t see anyone else doing this, but it does produce some interesting images. Take a peak inside the mind of an optical engineer madman. Oh, and I realized the purpose of this; just because he can. I’m still amazed.

From this I learn a few (N=2) things:

1. People like my hacks (wow, I wish I knew this back in Australia when most of my spare time was spent on being a ‘useless scientist‘, to quote a dear friend)
2. People like my drunken posts.

The good news is that some books arrived from Amazon the other day, including a few on optics. My favorite would have to be Optical System Design and I have plenty of ideas. In fact last night was spent taking apart a number of 35mm lens systems to get at the glassy goodness inside. In addition to the tactile enjoyment of working with physical objects, rather than abstract concepts, I have always enjoyed browsing through engineering texts. Much of engineering leaves off from the idealism of physics and has to deal with solving real world optimization problems. Any generic text on optimization always misses out on the subtle art that evolves in particular realms of engineering. I’m quickly learning that optics is no different. There are just so many ways that one can construct a lens system and the math is just too hard to tackle analytically. I also learned that there are often competitions for experts in the art to design lenses that fit certain specifications and so many varied designs can result. When I have some cash, my next project is to make a small optics workbench to play with some designs of my own. I’ll let y’all know how it goes.

I came across this story thanks to Marginal Revolution:

A few years back, Toronto-based gold mining company Goldcorp (GG) was in trouble. Besieged by strikes, lingering debts, and an exceedingly high cost of production, the company had terminated mining operations….Chief Executive Officer Rob McEwen needed a miracle. Frustrated that his in-house geologists couldn’t reliably estimate the value and location of the gold on his property, McEwen did something unheard of in his industry: He published his geological data on the Web for all to see and challenged the world to do the prospecting. The “Goldcorp Challenge” made a total of $575,000 in prize money available to participants who submitted the best methods and estimates… Within weeks, submissions from around the world were flooding into Goldcorp headquarters. There were entries from graduate students, management consultants, mathematicians, military officers, and a virtual army of geologists…. The contestants identified 110 targets on the Red Lake property, more than 80% of which yielded substantial quantities of gold. In fact, since the challenge was initiated, an astounding 8 million ounces of gold have been found—worth well over $3 billion. Not a bad return on a half million dollar investment.

$24 M /year. More like a 9X revenue valuation. Otherwise, why would they take VC money at all!? It's so little cash for Yahoo, it's just gotta be an interesting experiment from their point of view. I.e. "What to do with this RMS (sic) Network thing? - Should we throw some stuff their way?" In other words, the Yahoo investment terms don't necessarily tell you a lot...

# Right Media valuation

## 18 Oct 2006, 6:37 PM

Totally back of the envelope here, but why not add to the speculation:

• RMX Direct has grown over the past few months to 65m impressions per day
• Assume growth in RMX Direct over the next 12 months will be minimal relative to the total number of impressions served across their complete publisher package
• Figure they are doing around 10x the volume outside of RMX Direct. Call it 650m impressions per day, or 230bn per year.
• Assume average CPM of $0.75, and they take 10% of that. Call it $170m gross.
• Give them a healthy 40% margin, $70m / year
• So a valuation of $45m on 20%, or $225m all up puts them at 3X
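The chain of bullets can be laid out as a small Python calculation so each assumption is easy to vary. All inputs are the figures quoted in the list; the function and key names are made up for this sketch. One thing the arithmetic suggests, if the bullets are read literally: the $170m figure matches total ad spend at a $0.75 CPM before the 10% take is applied, so the sketch reports the multiple both ways.

```python
def back_of_envelope(
    impressions_per_year: float = 230e9,  # "230bn per year"
    cpm: float = 0.75,                    # dollars per thousand impressions
    take_rate: float = 0.10,              # share of spend kept by the exchange
    margin: float = 0.40,                 # "a healthy 40% margin"
    stake_price: float = 45e6,            # "$45m on 20%"
    stake_fraction: float = 0.20,
) -> dict:
    gross_spend = impressions_per_year / 1000 * cpm
    valuation = stake_price / stake_fraction
    margin_on_gross = gross_spend * margin  # the post's "$70m / year" step
    take = gross_spend * take_rate          # revenue if only the 10% is kept
    return {
        "gross_spend": gross_spend,
        "valuation": valuation,
        "multiple_on_margin": valuation / margin_on_gross,
        "take": take,
        "multiple_on_take": valuation / take,
    }


for name, value in back_of_envelope().items():
    print(f"{name}: {value:,.2f}")
```

As the post says, the result is wildly sensitive to the volume assumption, so the interesting exercise is changing `impressions_per_year` and `take_rate` and watching the multiple move.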

This is obviously wildly sensitive to growth and the huge volume they do outside of RMX direct - which can be estimated elsewhere more accurately. But this multiple clearly is in the range of similar deals in the internet/auction/b2b infrastructure space.

Even so, this seems very cheap.

Update: I simply must quote my favourite blog de jour Long or Short Capital:

As noted above, these estimates are empirically proven using math and advanced probability techniques.

# Null Hypothesis

1. Registered
3. ETL'd into Postgres
4. Created indices
5. Waited
6. Free'd up more disk space
7. Calculated marginals
8. RMSE = 1.0560
9. TODO: Think

I was reading a fascinating article from the Dallas Fed on the changing nature of long term rate movements (globalization Effect on Interest Rates) but am confused by the author's decomposition of the inflationary component of bond rates into an expected and risk component. Rather than emailing someone who might be able to explain it to me, I hope that a reader will comment and educate me.

If I understand things correctly, the risk-free real rate component is the market wide return required to make a risk-neutral agent indifferent between taking money today versus some time in the future. The real rate risk premium is the spread between the riskless rate and the risky rate appropriate for the entity doing the payments. Then the inflationary expectation component takes these real rates and makes them nominal by adding in the market's expected rate of inflation between today and some time in the future. This then leaves the 'inflation risk premium' -- What is this?

It clearly can't be an analog of the risk spread as inflation expectations are the same regardless of who wrote the bond. The text in the box says:

This part of R compensates lenders for the risk that inflation will be higher than expected, in which case the principal and interest returned will have less purchasing power than anticipated.

That makes no sense to me. If, at the time of pricing, there is expected to be some volatility in future inflation then this will be taken into account when calculating the expected inflation component, using risk-neutral probabilities, right? If so, then $\lambda_\pi$ will be zero.
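Written out, the four-way split described above looks something like the following. Only $R$ and $\lambda_\pi$ appear in the source; the other symbols ($r$ for the risk-free real rate, $\lambda_r$ for the real rate risk premium, $\pi^e$ for expected inflation) are assumed names for this sketch and may not match the article's notation.

```latex
% Nominal yield as the sum of two real components and two inflation components.
R = \underbrace{r}_{\text{risk-free real rate}}
  + \underbrace{\lambda_r}_{\text{real rate risk premium}}
  + \underbrace{\pi^e}_{\text{expected inflation}}
  + \underbrace{\lambda_\pi}_{\text{inflation risk premium}}
```

The question then comes down to which probabilities are used to compute $\pi^e$: the argument above only forces $\lambda_\pi = 0$ if the expectation is taken under risk-neutral probabilities.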

What am I missing?

Update: Duh!

The econo-blog-o-sphere is quite certain that we aren't in for a soft landing when it comes to the current housing price bubble. The fine folks over at Autodogmatic put not too fine a point with:
If you don't agree with me, try this little quiz: (1) Do you believe the mortgage interest rate risk in the next few years is to the (A) upside or (B) downside? (2) Do you believe median incomes are trending, in real terms, (A) up, or (B) down.

If you answered "A, B", you can give yourself a pat on the back for honesty. But you've also just implied that the housing bubble ain't comin' back any time soon.

If you answered anything else, your views are either incoherent or totally disconnected from reality.

I thought I'd ask the loan officers what they think. Well, I didn't do this, but the Fed was clearly able to get a coherent sentence or two from the fine men and women at the front lines of our credit crisis. So, I thought I'd put together a little chart to highlight one small point, namely that loan officers want to relax credit standards when housing prices drop. Now if we are to believe that housing prices are falling due to A & B above, then I don't quite see how that environment is conducive to relaxed lending standards.

About 6 years ago (aka, long before the days of RSS), I set up an email account on my server to gather various news feeds with the aim of correlating news against market data. I also scraped a few pertinent sites and threw together some latent semantic analysis code to correlate against real time stock quotes. While my code stopped running about 6 months ago when I moved from Solaris to Linux and never bothered recompiling, I continue to amass the news feeds. And what have I learnt from this exercise? That markets are a great predictor of news. Not the other way around.

We now find ourselves in the blog-o-sphere[1] and this idea is being revisited:

Edward Hadas over at breakingviews.com (another paywall, but really good analysis site founded by Hugo Dixon) likens it to using the ‘wisdom of crowds’ to trade. I’m sorry Edward but actually I think you’ve got this one wrong. ‘Wisdom of crowd’ - mining would be things like Marketocracy and SocialPicks. Monitor110 in my opinion is all about finding the needle in the haystack; finding the individual voice or nugget that escapes crowd amplification. Finding the kernel before it becomes a snowball. Where I do agree with him however is the paradox of diminishing returns: the more people find the needle the more difficult it will be to monetise. Or paraphrasing Dash - ‘if everybody is special, it really just means that nobody is…’

Do I think there is anything in Marketocracy? Sure! If/when these ideas take off they will evolve to efficient markets that are almost as good as using real markets to predict the future. Which, to paraphrase Barry Ritholtz, is only really good for trend following. And despite our best wishes, looking at historical market data provides us with damn near nothing informative about the future.

One could argue that the problem with my attempt to correlate news with market activity is that I wasn't looking for the needles in the haystack, and in fact my haystack was being generated and distributed by journalists and PR staffers well after the event. I agree that this is a problem, but I do not see mining the blog-o-sphere as a profitable long-term strategy. One just has to look at the pump'n'dump schemes that fill my inbox or the lead-gen junk that fills SERPs. When a voice no longer needs to be amplified by agreement, à la prediction markets, it is very easy for a lone voice to manipulate others.

Prediction markets are so hot right now, but I get the feeling that people are missing out on some basic economics in terms of understanding what is a prediction market and what isn't. If you can buy something now and hold it for the future (possibly at some cost), then prediction markets don't really have a reason to exist. Compare the situation with stocks versus weather derivatives. Anyone can buy a stock today and can hold it. Or they can use options to replicate that strategy. At an arguably lower cost they can even use the nascent single stock futures market to do this. This replication strategy is available to anyone. If the market has an informed view about the future evolution of returns then this will be reflected in the spot price, because such replication strategies exist and there will be no arbitrage between spot and future prices. This argument holds for all commodities that can be stored.

Hence any prediction market for this class of commodities is irrelevant. Existing market mechanisms provide an environment where future predictions are embedded within the current spot price.
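To make the replication argument concrete, here is a minimal Python sketch of the textbook cost-of-carry relationship for a storable asset. It is my own illustration, not something from the original analysis: the function names, the 5% interest rate, the 1% storage cost and the six-month horizon are all assumed, and continuous compounding is a simplifying assumption.

```python
import math


def forward_price(spot, rate, storage_cost, years):
    """No-arbitrage forward price for a storable asset.

    Assumes continuous compounding; `rate` is the financing rate and
    `storage_cost` is the cost of holding the asset, both per year.
    """
    return spot * math.exp((rate + storage_cost) * years)


def cash_and_carry_profit(spot, quoted_forward, rate, storage_cost, years):
    """Profit per unit at delivery from buying spot, storing, and selling forward.

    A positive number means the quoted forward is too high relative to spot;
    a negative number means the reverse trade is the profitable one.
    """
    return quoted_forward - forward_price(spot, rate, storage_cost, years)


# Hypothetical numbers: spot at 100, 5% financing, 1% storage, 6 months.
fair = forward_price(100.0, rate=0.05, storage_cost=0.01, years=0.5)
print(round(fair, 2))  # 103.05

# A forward quoted at 105 would leave money on the table for a replicator.
print(cash_and_carry_profit(100.0, 105.0, 0.05, 0.01, 0.5) > 0)  # True
```

Any forward quote away from the fair value invites a trade that pushes spot and forward back into line, which is why the market's view of the future ends up embedded in today's spot price.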

No one can fill their pockets with summer cooling days. I cannot keep a cloud in my lounge room and unleash it on Florida during orange picking season. The weather today is not particularly informative of the weather 3 months from now, and as such prediction markets are valuable for future events or things that can't be stored, as there is no way to replicate strategies on the spot markets to embed information in current prices.

Yesterday my friend Jon dropped by the office and we talked about his research project as he starts his thesis under Robert Engle. One of the ideas we discussed was what I see as the three causes of information asymmetry in markets:

1. Timing: I know something before you do
2. Accuracy: I know something about something, but I'm not sure what
3. Interpretation: I know that X will happen which I think means Y, whereas you think it means Z

Of course, information is latent in financial microstructure. We can't readily see it or measure it, even after the fact. But let's assume that blog-o-mining has solved this. I still see nothing that gives any agent a distinct advantage in the war for information asymmetry.

--
[1] For some reason whenever I hear 'blogosphere', I think Biosphere

# jdigittl on BASH

It's late at night and the brain isn't feeling as sharp as it was 10 hours ago, so the only thing left to do before stepping away from the computer is to check out bloglines. John Battelle links to a site that calculates the worth of my blog. My blog is worth $6k. Neat. Totally wrong, but neat. Actually, it's not neat. Silly numbers piss me off. I left the marketing analytics world because people didn't pay attention to ROI metrics in 2000 like they paid attention to numbers with dollar signs in front of them. Things have changed somewhat and now the entire world of lead gen and direct marketing lives and dies by campaign metrics. But that's not the point of this post.

After toying with the calculator (aka sticking in jdigittl.blogspot.com and then leaving), I searched for jdigittl on technorati. This was the first time I did a blog-o-sphere ego search. I found only 2 links to my name -- one of which is a direct quote of a direct quote of mine that appears on bash.org (an IRC funny-quotes site). I'm not going to link to the blog that quotes this, because whilst it is in German, it looks like a spam blog designed to get organic mortgage traffic. And in all their wisdom, the algorithms behind the site chose that quote to garner $$traffic. At this point, I'd love to rant about the comparative uselessness of algorithms that 'price' things that are neither bought nor sold, but the blog--meter says that the spam mortgage blog is worth 0.00. It's worth something to someone.

postscript

Ok. I just revisited the German site. They are not a mortgage spam site (as far as I can tell), but rather than editing my post I'll let my point stand. I do this for two main reasons:

1. That was a pretty cool segue, wasn't it?
2. Blogger causes Firefox to run really really slow on my Linux box

But rather than falling prey to the sin of silly numbers, let me reinforce my point with this blog. The same blog--meter also says that it is worth 0.00.
This strikes me as odd given that I found this spam site linked directly from the Google organic search results. I guess that only counts as 1 link using the 'Tristian Louis' method. Actually, it counts as 0 links, because Google isn't a blog.

Now, I'll be the first person in line to state that valuation is hard. People way smarter than me may even agree with me on this point. But picking one measure and extrapolating from that can be quite dangerous. Let me suggest four criteria for choosing a good measure for valuation:

1. Measurable - Can you actually measure this for the thing you are trying to value?
2. Testable - Can you measure it for enough other things that have been valued by some 'market'?
3. Accurate - If you use your measure(s) to come up with a valuation, does it make accurate predictions?
4. Sane - Even if it is accurate, does it make sense? I.e., can you rationalize why it may actually work for observations outside of your sample?

Clearly the blog-o-bling meter falls foul of a few of these.

post-post script

*giggle* Blogger's spell check doesn't know the words 'Google' or 'Blog'.

# CBOT

## 15 Sep 2006, 5:34 AM

A few months ago I was given the opportunity to present my thoughts on the intersection of lead generation and financial markets at the Chicago Board of Trade. As readers of this blog are aware, I am convinced that there are numerous signals indicating the desire for more advanced financial instruments to service the needs of lead generators and buyers. However, my experiences at Root (and previously at Traffion) continually remind me that adjusting market behavior is far more difficult than the economics textbooks would suggest. So while I was optimistic that change would occur, I was uncertain as to the timing.

Today I am very pleased to welcome the formal public announcement of an alliance between the CBOT and Root Markets. [Press Release] My recent work has focused on pricing nascent, and as yet untradable, swaps.
Today's announcement heralds the coming days when similar contracts will be readily traded. If anyone out there is interested in buying or selling leads at a fixed price I am happy to say that Root is open for business!

# Cubes suck

## 1 Sep 2006, 2:00 AM

Until recently I had been stationed in a cube at my office. The only good thing about this was that I could peer over and chat with my neighbors if I had something to say. The bad thing (as Joel Spolsky talks about) is that thinking for long periods of time is difficult. Rands talks about his den and how he can lock himself in there and either work or play. I posit that sometimes both happen at once.

Just before I left on vacation, I moved into an office. I have never worked in an office alone before, always either being in a cube or sharing an office with a coworker. My primary concern about moving into an office alone was that I would get sucked into a whole lot of nothing, involving refreshing bloglines, reading news and twiddling thumbs. Over the past two days, to the casual observer, this is exactly what I was doing. In the back of my mind I knew I should be working on finding an efficient way to find the distance between a point and the surface of a cube in non-Euclidean N-space. But that sounded scary.

I'd come into work, clear out my emails, catch up with the news, then draw a cube on my glass wall and stare for a bit. Staring wouldn't last longer than a few minutes. If I was feeling particularly enthusiastic I'd erase my cube and draw a new one, from a different angle - hoping to grok the mysteries of 5 dimensional cubes. Then, back to slashdot. Or YouTube. Or some partially functioning Java applet written in 98 that enumerates measure polytopes while crashing Firefox. Every now and then, out of nowhere, would come some piece of insight. I'd scurry across to my old cube neighbor and lay my new knowledge on him.
He'd point out why I was wrong, and then I'd head back to the office for more YouTube.

But then today, for no particular reason, it all made sense and was blindingly simple. By doing nothing I managed to get what I estimated would have been a few weeks of work done in about an hour. It turns out (as usual) I was making the problem far more complex than it really was. The sad thing is, if I were still in a cube, with the constant self-applied pressure to not look like a slacker, it probably would have taken me 3 weeks to see the light and the solution would have been far more complex than it needed to be.

# sony vx-2000e/canon ae-1: a dirty tryst

## 30 Aug 2006, 8:10 AM

For those who can't be bothered reading my drunken ramblings: About 8 months ago I decided to try and mount an 'old-school' lens system from a classic Canon 35mm still camera on a fairly swank Sony digital video camera. It worked.

'twas the night before the night before christmas, and josh had too much rum. he wanted to test a hypothesis: that the sony vx2000 has an easily replaceable lens, contrary to what the manual says.

after about 3 hours of careful screwdrivering, my lens accidentally fell off in a kinda un-accidental way.... oh yeah, and i was kinda surprised when it still worked.

first test movie:

and here is the interesting part

the ccd fits! it's a sonon!

the next step was to go visit mike at eyebeam, who kindly let me use their workspace and laser cutter. I used this to make some templates from mat board, then plexiglass.

using laser cutter:

testing various shapes as lens mounts:

this is what my workbench looked like... just before I used the hacksaw to void my warranty by cutting up the old lens internals to make the backing of the FD mount.

so, after i made the matching plate (the white plastic thing) i attached the FD mount to the hacked CCD mount.
i also hacksawed off the front part of the metal side of the case (where the original lens used to be), this is so i can actually reach the mount point so i can easily change lenses and reach the aperture ring.

here is the view of the assembly from the front.

and the grand finale - fully assembled. you can see here i threw in another white plate to increase the lens-ccd distance as focussing was a bit tricky as above. this was the hardest part -- it really should be millimeter accurate - but it's not. oh well, it seems to do the trick!

today i just opened up the ccd mount, and made some shims from aluminium sheet. the lens-ccd distance is now within 1mm -- perfect. now it is time to buy some fancy lenses. (i have my eye on a full frame 8mm....)

Ok. I think I need to make a relay system. There is a slight problem - 35mm lenses are for 35mm film, whereas my CCD is much smaller. Hence, the image formed at the focal plane is too large. I once studied optics! I can fix this! Here is my initial sketch of how it should work:

of course, this will make the image upside-down, but i think i can tweak the LCD display to always display upside down. It already flips the image when you turn it around, so I just have to find the sensor that does that, and reverse the sense. Note, I'm also planning to switch to Nikon F-mounts as they seem to be easier to find (at least at my local camera store). As for the prime, I'm looking at the Peleng 8mm - it's pretty cheap and looks like it will do the trick.

yep. that was easy - image is now the right side up in the viewfinder (but left and right are now swapped... lesser of two evils?)

ok, after a trip to B&H (and some very strange looks) I got all the bits that I needed. But they forgot to put them in my bag, so I'm going to have to go back and get the tube. As you can see below, the system lets in a little too much light on the sides :P.
After I pick up an extension tube & make a coupler, I'll have to make an extension arm from the tripod to support the extra weight. And then, I'll be rocking a truly tricked out camera.

ok. I couldn't find a full set of Nikon K rings, or a BR-3. So I made the missing K rings by buying 7 split lens filters and popping out the glass to make an empty tube. I made a BR-3 by taking apart a K1 and glueing it to a filter. Luckily the inside of the K1 has a 52mm flange that was a tight fit on a male threaded filter. Anyway, she is done. Well, done in the sense that all I need now is the prime lens. Ebay, here I come.

tricked out!

Update: So, I got the Peleng 8mm lens and it's bloody marvelous. I ended up taking it to Australia in January, where I managed to drop it off a 4th floor balcony onto the road and under a truck. The lens now has a hairline crack in the prime glass, but generally works OK. Below is a quick movie I edited that chronicles building the camera and the patience of my sweet one whilst putting up with me :)

Some useful links: Peleng 8mm; Mounting 8mm on Nikon Digitals; Camera mounts & registers; Extension tubes.

And a big thank you to the people in the second hand department at B&H camera who put up with me asking for rare parts to be used for bizarre purposes.

# Ethics & AOL

## 8 Aug 2006, 6:13 AM

How many people have access to this database? (I do)
Who is concerned by the breach of privacy? (I am)
Who, despite their concerns about privacy, spent a good portion of tonight browsing other people's searches? (I did)
How many ways can you use this data to make money? (I can think of a few)
Is it legal to use this data to make money through SEM? (Probably)
Is it legal to use this data to make money through identity theft? (No)
Is it ethical?

# Tsk tsk AOL..

## 7 Aug 2006, 8:35 PM

For Postgres users (not the best way of doing things, but it works):

cat user-ct-test-collection-*.txt | grep -v "AnonID" | grep -v "\\\." > silly.txt
createdb aol
psql aol
aol=# create table tmp (anonid varchar(16), query varchar(1024), querytime varchar(32), itemrank varchar(5), clickurl varchar(1024));
aol=# copy tmp from '/Users/josh/AOL-data/silly.txt';
aol=# create table aol (anonid integer, query varchar(1024), querytime timestamp, itemrank integer, clickurl varchar(1024));
aol=# insert into aol select anonid::integer, query, querytime::timestamp, case when itemrank='' then NULL else itemrank::integer end, clickurl from tmp;
aol=# create index aol_id on aol (anonid);

# Funky stitching: part II

## 7 Aug 2006, 5:36 PM

I really should be posting about the privacy nightmare / SEM dream of AOL releasing silly amounts of data last night. But precisely at the time that was happening, I was walking through the West Village trying to find St. Vincent's hospital.

It would be far less embarrassing if it happened 15 years ago, but given that it didn't, it was bound to happen sooner or later, especially after my girlfriend told me not to use the paring knife as a screwdriver. Last week she went overseas for a holiday, so I had a chance to catch up on dorking out and fixed my computer and sliced my finger open with a paring knife.

My first official event at Carnegie Mellon was a 'what to do in an emergency' lecture, with extra emphasis on how expensive ER visits are. It went in one ear and out the other. Likewise, when I transferred over to my workplace medical plan I was told what I needed to do before making a claim. I didn't do any of it, because I had no plans to actually use it. You know you are in America when the first thing you think of when a medical emergency is upon you is 'Where did I put my insurance card?'. It is surprisingly difficult to remove a card that is stuck to a piece of paper when one hand isn't quite working right. After sorting that out I had a flash of my medical training and found some sterile gauze and a bandage and wrapped myself up.
This was quickly followed by a flashback to an old Bill Cosby comedy routine where he joked about his mother always harassing him to wear clean underwear, in case of an emergency. For some reason this felt like important advice, so I got dressed. Not to say that my underwear weren't clean, but I wasn't looking my best, so I got dressed to impress. That wasn't particularly rational, but I must admit that I wasn't at my finest at that point. It is also very difficult to tie shoelaces with one hand.

I found the hospital and was impressed that I was on the 'fast-track'. I wasn't impressed by the lengthy interview (which was only lengthy because the registration admin kept on pausing to continue gossiping with her friend) and requests to sign documents before I had read them. The ER ward was fairly empty; I later found out that they were closing that section, and all the hot action was on the other side of the building. A range of nurses and doctors came by, none of whom introduced themselves (contrary to the Patients Bill of Rights document that I signed and read). And none of them brought me a glass of water, even though they all said they would.

When I finally worked out who my doctor was I told her that I used to be in her position and that seemed to do the trick. Trying to remember words and phrases from med school, but without sounding like someone who picked stuff up from ER, I managed to get her to actually talk to me, which made me feel much better. As did the nerve block. The six stitches required to get me back together didn't take too long, and I made it out by midnight. On the walk home I felt the lignocaine wear off and figured that it was about to hurt like hell, so I self medicated with some scotch. This morning I feel worse off from the scotch than the finger, although typing is a bitch.
# i2pi

## 6 Aug 2006, 9:15 AM

Probably quite irrelevant to my current readers, but I finally came to terms with the fact that I needed a new power supply and motherboard, and now i2pi.com is back. This means that when I come to terms with not really wanting to be hosted on blogspot, I'll move this blog over. I really want to take control of image aliasing again; blogspot does a terrible job at it, or I just don't know how to use it. Either way, I prefer to control my web presence.

# Funky stitching

## 5 Aug 2006, 12:11 AM

Either New York has some really funky architecture, or Google maps has some funky image stitching technology.

# Liquidity premium = Insurance?

## 3 Aug 2006, 4:55 AM

Greg raises the point that some large lead buyers actually get discounts, contrary to my previous post. I still haven't really thought through his statement that the mortgage vertical has plenty of liquidity, but I do have a possible explanation for the volume discounts.

One way to paraphrase my initial wordy post is that the premium is a form of insurance: insurance against the cost of rebalancing the relationships to maintain an orderly market. If a large buyer is coming into your exchange, and his accommodation will require the outlay of expense not only to develop his relationship, but that of the suppliers to fill his order, then you need to protect that investment with insurance. If the buyer is large and reputable then the charge will fall away. And if they provide depth to your buy side that actually encourages further supply, then they may get advantageous pricing compared to smaller buyers.

This argument becomes clearer if we make the (false) assumption that exchanges will buy all supply upfront, and then take fulfillment and counterparty risk whilst trying to sell held inventory. For the most part this does not happen, but if we replace the concept of 'lead inventory' with 'relationship inventory' then it all follows through.
# Spot lead pricing II: The fishy distribution

## 2 Aug 2006, 7:05 PM

When I worked in media analytics / campaign management, the standing assumption whenever a statistic was to be reported was that 'things' were drawn from the Normal distribution. The general arm-wavy argument was along the lines of "... mumble mumble law of large numbers mumble burp ...". Of course, what they really intended to invoke was the central limit theorem. But hey, I too went to business school and understand that MBA level probability & statistics is dull and arm-wavy and hence was a great time to catch up on sleep, so I usually let that slide.

In lay terms the argument is that we don't really need to know the underlying distribution because with enough samples, things look normal. In the world of media analytics, where we had billions of ad impressions and millions of clicks, the 'enough samples' part usually held. But in the world of lead-gen, where a single supplier may only provide 5 to 25 leads per day, this doesn't hold.

What distribution do I use to model leads arriving into an exchange? The Poisson distribution. If anyone remembers their probability classes from school, they will remember countless examples which invariably involved people arriving to a queue at a bank teller. If you happened to take computer science, the example might be expressed as jobs arriving at a CPU, or something like that. Either way, one of the key measures to describe these processes is the average time between successive arrivals. If you can make the assumption that the process is memoryless, i.e., the time of arrival of the next person does not depend on the time of arrival of previous persons, then you can model the time between arrivals as an Exponential distribution. And if you do this, the total number of people arriving over a time interval T follows the Poisson distribution.

In the chart above we see the distribution of the number of expected arrivals in one day, when the average time between arrivals is 4.8 hours (one fifth of a day). We can see that we expect about 5 arrivals in the day, which should be blindingly obvious.

I'm very fond of the R statistical programming environment. It managed to get me through my statistical arbitrage course, while I profited from the arbitrage between S-Plus ($$$) and R (FREE!). To my untrained eyes, they are pretty similar.

To produce the chart above in R:
plot(dpois(0:50, 5), type='s')
If you make arrival times more frequent, we end up with a distribution that looks like a discrete version of the Normal distribution:
The big difference is that unlike a normal distribution, a Poisson will have P(X < 0) = 0. In other words the chance of having a negative number of arrivals is zero. So, if you permit my own arm-waviness, the Poisson distribution is somewhat like the discrete analog of the LogNormal.
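As a sanity check on the link between memoryless arrivals and Poisson counts, here is a small simulation sketch in Python using only the standard library. The function name, the seed and the number of simulated days are my own choices for illustration; the 0.2-day mean gap matches the 4.8-hour example.

```python
import random


def arrivals_in_one_day(mean_gap_days, rng):
    """Count arrivals in one day when gaps between arrivals are exponential."""
    count = 0
    # expovariate takes the rate, i.e. 1 / (mean gap).
    t = rng.expovariate(1.0 / mean_gap_days)
    while t <= 1.0:
        count += 1
        t += rng.expovariate(1.0 / mean_gap_days)
    return count


rng = random.Random(42)  # fixed seed so the run is repeatable
counts = [arrivals_in_one_day(0.2, rng) for _ in range(20000)]

mean_count = sum(counts) / len(counts)
variance = sum((c - mean_count) ** 2 for c in counts) / len(counts)

# For a Poisson count both of these should sit close to 5,
# and no simulated day can have a negative number of arrivals.
print(mean_count, variance, min(counts))
```

The mean and the variance landing on roughly the same number is the usual fingerprint of a Poisson count, and the hard floor at zero is exactly the way it differs from a Normal.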

In my previous post I made the statement that a supplier providing a large number of leads was unlikely to supply a dramatically smaller number in the future. Let's examine the chance of a supplier providing exactly zero leads if they previously provided X leads per day. In R, we express this as
plot(dpois(0, 0:50), type='s')
I won't spoil the surprise ending by including the chart, but needless to say if you provide fewer leads you are more likely to provide zero leads. Mathematics is wonderful for stating the obvious, but humor me here.

The Poisson probability mass function is $P(x) = \lambda^x e^{-\lambda} / x!$, where $\lambda$ is the mean number of arrivals and $x$ is the number of arrivals that we want to know the probability of. If we set $x = 0$ we get $P(0) = \lambda^0 e^{-\lambda} / 0! = e^{-\lambda}$. So the chance of getting zero leads falls off exponentially in the mean number of leads, and we are right back where we started this detour into the fishy distribution.
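For anyone without R to hand, the zero-lead calculation can be sketched in a few lines of Python. This is my own hand-rolled version of the Poisson probability (the function name `poisson_pmf` is made up for illustration), written from the standard formula: mean to the power x, times exp(-mean), divided by x factorial.

```python
import math


def poisson_pmf(x, mean):
    """Probability of exactly x arrivals when the mean number of arrivals is `mean`."""
    return mean ** x * math.exp(-mean) / math.factorial(x)


# Chance of a supplier providing zero leads, for average daily volumes 0..50.
zero_lead_probability = [poisson_pmf(0, mean) for mean in range(51)]

# With x = 0 the formula collapses to exp(-mean).
print(math.isclose(poisson_pmf(0, 5), math.exp(-5)))  # True

# The probabilities over all possible counts should add up to (almost exactly) one.
print(math.isclose(sum(poisson_pmf(x, 5) for x in range(60)), 1.0))  # True
```

The list `zero_lead_probability` holds the same numbers that R's `dpois(0, 0:50)` would plot: it starts at 1 for a supplier who averages nothing and shrinks by a factor of e for every extra lead per day in the average.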

And as an exercise to the reader (don't you hate it when people do this..), what does the following mean in the context of lead gen?

a=2:100
plot(ppois(a,a),type='l')

From the minutes of the 1990 DataServe Annual General Meeting:

Joshua Reich's Report

Joshua's report was unavailable at the time of printing the Annual Report, however it was read out and distributed to the attendees of the Annual General Meeting. Those who didn't attend and wish to obtain a copy may contact Joshua Reich on xxx xxxx. The following points were outlined in the report.

Joshua said that he had sold an inventory program to Video Classrooms Australia for \$4.95. He also indicated he had written Graphar 9.5, an update to his simple, high-level mathematicians' language. An unidentified attendee noted that he had 'been sucked into chaos [theory]'. Luke pointed out that Joshua therefore must be the suckee. Joshua also told the meeting that he had produced null-modem cables and plans to make a robot.

When prices look wrong there are two possible explanations, either the market is wrong or you are wrong. Bully for you if you can pick which!

Examining the supplier side we can easily see how these three factors may come into play. Lead aggregators (or exchanges, for that matter) have certain fixed costs associated with each participant. On the supplier side, there are the costs of technical integration, support and initial marketing expense, which all have fixed components. Thus, on a per lead basis, leads from a small supplier are more costly than those from larger suppliers. Additionally, large suppliers have more to lose if they were to provide poor quality leads in the exchange, and it is common to see a higher return rate with smaller suppliers than with larger suppliers. These returns are costly for the aggregator and are thus reflected in the price paid to smaller suppliers. Both cost allocation and quality are factors that are readily understood in this industry; however, the risk premium is not often voiced.

Dealers take a profit for providing two key services, price discovery and liquidity provision. The risk that they face is one of spread risk. In other words they make money, but are exposed to the risk of holding an unbalanced book. Before the modern era of financial engineering, where dealers (and others) can structure positions to offset spread risk, dealers would just bump their prices to cover this risk. So if a large market order came in, which would take liquidity from the market, the buyer would have to pay a 'liquidity premium' to cover this. This still happens today in the world of large block trades, but not to the same extent as in the past.

A quick point that I would like to make before returning to the lead risk premium is the difference between liquidity and volume, which I notice are often used interchangeably, especially in lead markets. Volume is simply the number of things bought or sold on a market. Liquidity, however, speaks to how easy it is to buy or sell on a market. Imagine a market with sellers providing exactly one million widgets per day, and buyers wanting exactly one million widgets. The volume of this market is one million widgets, which for the sake of argument, we will call 'large'. However, if the sellers represent all the producers of widgets and there are no more widgets to be made, a new buyer wanting to purchase 50 widgets would not be able to do so at the current market price. Those 50 widgets would have to come from the requested quantity of an existing buyer and this should only happen if the new buyer is willing to pay more. In this market we have large volume but little liquidity.

In general, market orders or offers, which demand an instantaneous purchase or sale at whatever price the market can bear, take liquidity away from the market. Limit orders, which set the price at which they are willing to trade, provide liquidity. A market with a large number of limit orders is said to be deep. Limit orders can have two types of fill policy: all or nothing, or partial fill. If the orders for the million widgets were 100 all-or-nothing orders for 10,000 widgets each, then the market order for 50 widgets would have to pay at least 200x the per widget price of the larger all-or-nothing limit orders.
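The 200x figure is just the ratio of the block size to the small order: to pull 50 widgets away from an all-or-nothing order for 10,000, the new buyer has to make the seller at least as well off as filling the whole block. A tiny Python sketch of that arithmetic, with a made-up function name and an assumed price of 1.00 per widget:

```python
def displacement_price_per_unit(block_units, block_price_per_unit, wanted_units):
    """Lowest per-unit price a small buyer must pay to outbid one all-or-nothing block.

    Assumes the seller only switches if the small order pays at least the
    total value of the block it displaces.
    """
    block_value = block_units * block_price_per_unit
    return block_value / wanted_units


# One all-or-nothing order for 10,000 widgets at 1.00 each, versus a buyer who wants 50.
price = displacement_price_per_unit(10_000, 1.00, 50)
print(price)  # 200.0
```

With partial-fill orders the same 50 widgets could be shaved off an existing order at only slightly more than the going price, which is the sense in which fill policy changes how liquid the market is even when volume is identical.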

What does this mean for a small volume lead seller? Recall that small sellers are paid less for their leads than large sellers. The best way to think of this is in terms of spread risk, or the cost to an aggregator of keeping an unbalanced book. Leads are provided by suppliers in real time, whereas orders are pre-existing limit orders. If a single small seller leaves the market there is little impact on the balance between supply and demand. If a large seller were to leave the market, a large number of orders may go unfilled. And although buy orders are mostly of the partial fill variety, if you start supplying only 100 leads to an order for 1,000 it is likely that either the order will become smaller or the buyer will leave the market. Both situations require a chunk of sales capital to bring the volume back. To prevent this from happening it is wise to pay your larger suppliers more to minimize the chance of them selling their leads elsewhere. If this indeed is the case, a portion of the premium paid to large suppliers on the spot market represents an incentive for future behavior.

Likewise, when a buyer buys 100 leads today they are making a statement that they are likely to buy 100 more tomorrow, and they are paying for the probability of extracting liquidity from the dealer's book in the future.

This is interesting. Without the mechanisms afforded by futures contracts, participants face differential pricing in the spot market contingent on beliefs about future events.

At ROOT Exchange I am working on developing contracts and other futures products to help better serve the needs of natural participants and speculators. Hopefully when I get some time away from working on that, I will be able to share some more analytical detail about my thoughts on these premia embedded in spot prices.

On the 18th of June I woke up, turned 28 and found my room full of balloons and books. My girlfriend had bought me a great stack of presents. After a lovely day doing birthday type things, I opened my presents to find a pile of books, many of which I already own. I think that it's terribly super that my girlfriend knows me so well to have bought me books that were perfectly up my alley. One of the books is a collection of letters written by my all time favourite geek and childhood hero, Richard Feynman.

As a kid I read 'Surely You're Joking...' and while involved in the Australian Physics Olympiad, I read through his infamous 'Lectures on Physics'. I felt terribly proud of myself when one day I derived the shape of water coming out of a tap and learned that he did the same exercise in his teens.

Jump forward 10+ years, and I was in yet-another-strategy meeting that wasn't holding my attention, but had a notepad and a pen and needed to come up with a way to look busy and keep my mind active. Casting my mind back to the days of APO, I drew a triangle and a box, put on some arrows for forces and started to answer a question that had been plaguing me for all of 30 seconds: how fast do you have to raise an incline so that a box starting at the top of the incline will reach the bottom in time T?

Some equations here, some there and bang! solved! And it felt really good, because even though the problem was enormously trivial, I was quite curious to see if I could remember how to solve such things. While I would be totally lost if asked "What's Newton's 3rd Law?", I can clearly remember the seven D's of solving physics problems:
1. Diagram
• Not just any diagram, but a BLOODY BIG DIAGRAM (you must remember that this was the Australian Physics Olympiad)
2. Dimensions
• Mark out all the useful lengths, angles, times
3. Directions
• Which way is 'up' ?
4. Data
• Are there any known numbers? Write them down, but don't use them yet!
5. Define
• What are you trying to solve?
6. Derive
• Solve the sucker
7. Dimension Check
• If you are looking for an answer in seconds, make sure that the units of your answer aren't in cubic seconds per coulomb.
8. subDitute
• Only at this point do you put the numbers from (step 4) into your formula to get a final number. This isn't even a real 'D', because numbers don't really matter.
You might think that it's odd that there is no gut-check step listed. I think the point is that you should be gut checking along the way, and even if you don't have a good gut feeling for what you are doing, then dimensional analysis and a diagram go a long way.
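As a rough illustration of the dimension check in step 7, here is a minimal Python sketch. It is not part of the original list: the helper names (`dim`, `mul`, `div`, `power`) are made up for this example, and the formula checked is the textbook slide time for a frictionless incline held at a fixed angle, not the raised-incline problem from the meeting. Dimensions are tracked as exponents of length, mass, time and current, so a check is just a comparison of exponents.

```python
from fractions import Fraction

# A dimension is a tuple of exponents for (length, mass, time, current).
# All helper names here are hypothetical, chosen only for this sketch.
def dim(length=0, mass=0, time=0, current=0):
    return tuple(Fraction(x) for x in (length, mass, time, current))

def mul(a, b):
    # Multiplying quantities adds their exponents.
    return tuple(x + y for x, y in zip(a, b))

def div(a, b):
    # Dividing quantities subtracts their exponents.
    return tuple(x - y for x, y in zip(a, b))

def power(a, exponent):
    # Raising a quantity to a power scales its exponents.
    return tuple(x * Fraction(exponent) for x in a)

LENGTH = dim(length=1)
SECONDS = dim(time=1)
ACCELERATION = dim(length=1, time=-2)
COULOMB = dim(time=1, current=1)  # ampere-seconds

# Slide time on a fixed frictionless incline: t = sqrt(2 L / (g sin(theta))).
# The 2 and sin(theta) are pure numbers, so only L and g carry dimensions.
slide_time = power(div(LENGTH, ACCELERATION), Fraction(1, 2))

# The cautionary example from step 7: cubic seconds per coulomb.
bad_answer = div(power(SECONDS, 3), COULOMB)

print("slide time is in seconds:", slide_time == SECONDS)
print("bad answer is in seconds:", bad_answer == SECONDS)
```

No numbers are substituted anywhere, which is the spirit of step 8: the check works on the formula alone.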

I quite possibly have forgotten the exact list, but I think the spirit is there.

In his autobiography, Emanuel Derman quickly talks about how bamboozled he was when he first opened a modern finance textbook. As a particle physicist he had no idea how economists could possibly need such complex/erudite/obtuse analytical mathematics to solve problems. Even the notion of 'solving' a problem in economics was bizarre in that physicists could apply relatively simple math and predict fundamental universal constants to 11 significant places, whereas economists worked in the realm of 1 or 2 significant places.

Having recently gone through business school, and one that is renowned as (while simultaneously ashamed of) being highly quantitative, I am very familiar with the inane detail involved in making extraordinarily subjective financial predictions. I recall hours of lectures on the best way to calculate WACC, knowing that in the real world (at least for my industry) people discount at 20%. Given what we know about a company's cost of equity, industry comparisons, etc., the discount rate is invariably 20%.
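To make the contrast concrete, here is a small Python sketch of the two approaches. The WACC function uses the standard textbook formula; every input number is invented purely for illustration and none of it comes from a real company. The function names `wacc` and `present_value` are my own for this example.

```python
def wacc(equity, debt, cost_of_equity, cost_of_debt, tax_rate):
    """Textbook weighted average cost of capital.

    Weights the cost of equity and the after-tax cost of debt by
    their share of total capital.
    """
    total = equity + debt
    return (equity / total) * cost_of_equity + (
        debt / total
    ) * cost_of_debt * (1 - tax_rate)

def present_value(cash_flows, rate):
    """Discount year-end cash flows (year 1, 2, ...) back to today."""
    return sum(cf / (1 + rate) ** year
               for year, cf in enumerate(cash_flows, start=1))

# Hypothetical inputs, chosen only to show the mechanics.
careful_rate = wacc(equity=60, debt=40, cost_of_equity=0.25,
                    cost_of_debt=0.10, tax_rate=0.30)
rule_of_thumb = 0.20

cash_flows = [100, 100, 100, 100, 100]
print(f"WACC estimate: {careful_rate:.3f}")
print(f"PV at WACC:    {present_value(cash_flows, careful_rate):.2f}")
print(f"PV at 20%:     {present_value(cash_flows, rule_of_thumb):.2f}")
```

With inputs like these the two valuations land close together, and the uncertainty in the cash flow forecasts themselves is usually far larger than the gap between the rates.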

This is not a bad thing. This is not pooh-poohing the value of quantitative analysis. In fact, it is a reinforcement of the seven D's. You need to know what is important and have a system whereby gut-checking is part of the process.

So, now I find myself with the task of coming up with a structure to open up a lead exchange that is currently hidden within a public company. The question is vague, I'm still not sure what the dimensions are, but first thing tomorrow I am going to draw myself a bloody big diagram.