Proof

22 Dec 2008, 15:10 PM

Via God Plays Dice I came across a great list of mathematical proof techniques, of which my favorite is probably Proof by Induction:

Proof by Induction

Proof by Induction claims that

math

where math is the number of pages used to contain the proof and math is the time required to prove something, relative to the trivial case.

For the common, but special case of generalising the proof,

math

where math is the number of pages used to contain the proof, math is the number of things which are being proved and math is the time required to prove something, relative to the trivial case.

The actual method of constructing the proof is irrelevant.

Great jokes don't need explanation, but I just can't help myself. The equation listed above is actually Faraday's law of induction, which describes how a changing magnetic field induces an electrical field. C.f. mathematical induction, which is a method of mathematical proof typically used to establish that a given statement is true of all natural numbers. It is done by proving that the first statement in the infinite sequence of statements is true, and then proving that if any one statement in the infinite sequence of statements is true, then so is the next one.

Mathematical induction is often misunderstood by those trying to apply it, often resulting in what Uncyclopedia calls Engineer's Induction:

Suppose P(n) is a statement.

  1. Prove true for P(1).
  2. Prove true for P(2).
  3. Prove true for P(3).
  4. Therefore P(n) is true for all math.

And to snuff out any life remaining in the joke, the technique above only proves P(n) for n = 1, 2 & 3. To make it a real proof you would need to show that if P(n) is true for some n then it must be true for n+1, which coupled with the proof for P(1) would show that it was then true for all natural n.


Comments (0)

Hold your horses

17 Dec 2008, 14:09 PM

I like Felix Salmon. He's a pretty awesome finance and economics blogger and on the few occasions I've spoken with him, he seems like a genuinely nice guy. But a recent post left me scratching my head:

John Gapper worries that taxing non-diet sodas would be regressive. I worry that it wouldn't even do what it is designed to do, which is reduce obesity: all it would do is increase the amount of diet sodas consumed, and there's a strong link between diet-soda consumption and increased obesity.

Unless and until there's some empirical evidence that switching from non-diet to diet sodas helps people lose weight, this should be filed under "very bad ideas".

If you follow that strong link link, you'll see:
Fowler is quick to note that a study of this kind does not prove that diet soda causes obesity. More likely, she says, it shows that something linked to diet soda drinking is also linked to obesity.

"One possible part of the explanation is that people who see they are beginning to gain weight may be more likely to switch from regular to diet soda," Fowler suggests. "But despite their switching, their weight may continue to grow for other reasons. So diet soft-drink use is a marker for overweight and obesity."

Right. Correlation is not causation. In fact, if the elasticity of demand for soda is not correlated with propensity for obesity, then a Pigovian tax on soda would make for a great instrumental variable to see whether diet soda has any causal relationship with obesity.


Today's xkcd comic is fantastic. Check it out.
Comments (4)

Wesabe + Perl + PostgreSQL

16 Dec 2008, 22:56 PM

I was looking for some code that would allow me to automagically populate my database with transactions from Wesabe. It needed to work with Postgres, and Perl would be a decent language for the job. I found it here. Enjoy.


Comments (0)

Morons

16 Dec 2008, 20:41 PM

It is good to know that $1,793 power cables are useful even without replacing the miles of standard copper between the power station and your house.

George,

Shortly after receiving two 5ft Quadlink power cables, I dragged them to my office. I am employed in a 24/7 engineering support environment and thought my "always on" computer would be a good starting point to burn in the cables. Working 12 hour night shifts I have grown accustomed to eye fatigue while working with a computer. Several busy days later (currently on my 9th straight 12 hour night) other priorities and a busy work schedule had caused me to forget about the cables burning-in beneath my desk. Yesterday I noted how uncharacteristically good my eyes felt. Today, just an hour ago, a coincidental look at my feet put me face to face with the Quadlink.

I have tried to identify another cause for this remarkable change but am unable to attribute the effect to anything but the Quadlink power cable. I trust you'll find that satisfying.

Needless to say I intend to purchase a replacement cable for my home audio use as soon as possible - I don't think this one will be leaving my computer monitor anytime soon. I am anxious to loan my cables to others in the office who have complained of similar eye fatigue problems.

Best Regards, Wade
See here. And then here.


WTF: Where does a night-shift support desk worker get the $ to spend on such crap?


Comments (0)

Benford's Law

14 Dec 2008, 19:28 PM

A few weeks ago at lunch we were discussing forensic accounting and Benford's law came up. I've always been a little uncomfortable with the law, which stipulates that the distribution of leading digits in 'natural' measurements follow a non-uniform distribution. I believed the law but never had a good grasp of why. In undergrad I recall proving the existence of the law by starting with the assumption that if such a law exists, then it must be scale invariant. That is, if there is some law that describes the way leading digits of measurements are distributed, then it shouldn't matter whether you measure the underlying quantity in Celsius or Farenheit; that is, it should remain after any linear transformation. From that point its pretty simple to derive the distribution. But it was a fairly weasely way to answer the question as I started with the assumption that the law must exist. Mathematical proofs frequently start with the assumption that the converse is true and then show that if the converse were true it would be non-sensical, hence the converse of the converse must be true - this is called proof by contradiction. Proof by assumption of truth is not really proof. Either way, my 'proof' was accepted and my discomfort with the law remained until today.

Last night I was browsing through Hacker's Delight, a compendium of neat coding tricks that would get you fired at most workplaces with the charge of being too clever and abstruse. And in my reading I came across Bedford's law and decided that today I would finally take the time to grok it. As has become my default browsing behavior, I started at Wikipedia and came across a description and picture that finally let the pieces click. I also got to feel a little bit better about the shame that I have carried from my undergrad 'proof', as they include my same argument as evidence of the law:

The law can alternatively be explained by the fact that, if it is indeed true that the first digits have a particular distribution, it must be independent of the measuring units used. For example, this means that if one converts from e.g. feet to yards (multiplication by a constant), the distribution must be unchanged -- it is scale invariant, and the only distribution that fits this is one whose logarithm is uniformly distributed.

To neatly close off the issue, I logged into a database at the office and grabbed 123 million financial values taken from around 65 thousand companies from around the world. It is nice having access to this kinda data. And lo and behold the distribution of leading digits follows the law perfectly. The black line in the picture at the top of this post is the observed frequency and the red dotted line is what the law would have one expect. Cool.

 digit   observed   expected 
     1 0.30200305 0.30103000
     2 0.17712325 0.17609126
     3 0.12545546 0.12493874
     4 0.09689768 0.09691001
     5 0.07977275 0.07918125
     6 0.06637933 0.06694679
     7 0.05717501 0.05799195
     8 0.05021505 0.05115252
     9 0.04497842 0.04575749

Comments (1)

Financial session tracking

6 Dec 2008, 17:35 PM

I have a love/hate relationship with Mint and Wesabe, two Web 2.0 online financial tracking applications. They do a few things well and have great potential but still manage to leave a bad taste in my mouth. The strength of both services is that they consolidate all of your transactions from your various banks and credit providers and present the information in a unified format. This alone is huge for me. Even within Chase's own website, I can't get a single consolidated view of all my accounts. However, the sizzle on the steak is that they take advantage of network effects in tagging your transactions. For example, when one user tags a cryptic line item such as '238927 12/04SOU EQUIN' as 'Gym Membership', all future transactions across all user accounts will benefit. That said, I still find most of my time using these sites (I have accounts on both) spent tweaking their automatic suggestions for tags. However I understand that me spending 5 minutes correcting their 10% classification error rate is better than spending an hour hand classifying every transaction [1]. I also take issue with some of the interface design choices, especially on Mint, whereby quick searching and sorting is prevented by their excessive use of 'pretty' transitions and AJAX style effects. I usually keep my mouth shut about UI issues, but given that I've spent the last week designing a visualization app, I feel suitably qualified.

But my biggest grief is one I raised both with the founder of Wesabe and on this blog at one time. I can't seem to find my original post on the topic, as it was lost when I went through the Blogger to Wordpress switch - I'll try and extract it later. But anyway, my point was that I was disheartened by the lack of double entry book-keeping. This problem manifested itself on Wesabe when it congratulated me for earning $5k when in reality I had actually just paid off a $5k balance on my credit card - net economic impact of far less than $5k. This annoys me.

This morning I was doing my semi-regular perusal of my accounts and came across a balance advance from a line of credit and I was curious as to where the money went. It is a royal pain to do arbitrary searches with either of the sites, so I opened up my SQL clone of their site that I scrape on a regular basis. From there it was easy to find all transactions within a time window that matched the dollar amount of the balance advance. This is a non-GAAP technique, but even for the 8 accounts my wife and I have, joining by transaction timestamp, amount and debit/credit is a pretty accurate way to build a financial journal - in the accounting 101 sense of the word 'journal.'

I don't know why these services don't offer this feature. It is easy to see useful extensions that bring much more color to our financial activity. Just in the same way that session management gives you oodles more insight into online marketing transactions than simply looking at individual requests in a web server log file. Now maybe I'm the only one nerdy enough to specifically request double entry accounting, but in these times everyone wants to know more about their financial data. Couple this data with logs from my cell phone bill, cable service, Netflix and we'd soon be able to fairly accurately recognize the rationale being each transaction with our banks. If only there was some kind of online Vault service that could own this data for me...


[1]: It is also really difficult to determine their false positive rate. It's easy enough to work out when they can't work out a tag, but sometimes it takes quite a while to realize a $20 purchase tagged as 'Pet Food' was really 'Movie Tickets'.

PS: Note to Mint.com, stop calling your inane pie-charts 'Trends', you are starting to piss me off.


Comments (1)

Teaser

5 Dec 2008, 17:04 PM

The real take away from my last post is that I am a man of mystery. To that effect:


Comments (0)

Spatial Indexing

2 Dec 2008, 23:14 PM

Sometimes I forget that Knuth is not the be all and end all of algorithms. So when I couldn't find what I was looking for last night, I slept on it hoping that something would come to me in the morning. And as hoped, on the train I found what I was looking for. It is a great feeling when your mind is jumbling around so many concepts and then everything clears away to reveal a solution. Of course, as soon as I got into the office and did some googling, I found out that this was all worked out years ago. Nonetheless, it is fun to think about.

I was looking for a spatial index for high dimensional data. I know a bit about various spatial indexes and the best approach that I knew of was an R-tree or one of their variants. For the uninitiated, R-trees are multi-branching trees that extend from the B-tree concept, but instead of indexing linear regions, R-trees index multi-dimensional rectangles. Postgres uses R-trees and previously I had made some minor contributions to that codebase. From my possibly flawed understanding of their implementation, the tree is arranged only with consideration of the marginal distributions of the dimensions. In cases where you have correlated variables, you can do better than that.

Lets say we have a 2-dimensional data set, where y = f(x) + e, where e iid. Normal. In this case, let f(x) be a simple linear transform, but everything that follows should work for any monotonic function:

If we want to find all points that are within a query box {(x_min,y_min), (x_max,y_max)} we can use our knowledge of the correlation to get us there.
The x-range of our selection is [x_min, x_max]. This range is shown as the solid green lines. We can use our estimated inverse of the function y = f(x) + e to turn find our y'-range - the dashed green lines representing [f'(y_min), f'(y_max)]. (NB: I'm using f' to denote the inverse of f, not the first derivative.)
Now instead of storing these points in a 2-dimensional tree, like a quadtree, we can use our estimate of f(x) and store our points in a binary tree. To find all the points inside our query box we can take the range that encloses where we expect the points to be, given our estimate of f(x), and use that to search the space. Naively, this range is [max(x_min, f'(y_min)), min(x_max, f'(y_max))], assuming f(x) is increasing. Call this range the selection range - it is shown in orange below:
This gets us part of the way there. However, we want a selectivity and specificity of 1 - that is, we want to find every point that is inside the box and reject any points that happen to fall out of the box. We need to improve on our selection range to take into account the error term, e. In the figure below I have included the prediction interval as two dotted red lines above and below the best fit. I have arbitrarily set the prediction level to 80%. This means that 80% of the y values fall within these lines for any given x. Knowing this, we can take 80% of our points and index them only using the x value. We adjust our search range using the prediction interval and our selectivity increases.

We can now apply this approach to design a tree structure that utilizes our knowledge of the relationship between the variables. Start by dividing the dataset by the median of x so that half the points are to the left, and half to the right:

Note: we could have divided by y, but I just chose x by convention.

Most of the points to the left of the orange line have y values less than f(median(x)), and conversely for the points to the right. We can quantify 'most', by looking at where the 80% prediction interval intersects the y=f(median(x)) line. To the right of the right-most dashed line, 90% (= (100% + 80%)/2) of the points are above the y ordinate where the prediction interval intersects the line, shown as a dashed purple line. We call the points below outliers and highlight them with red circles.

Using this we can now construct a binary tree much like what you would do if you were only keeping track of x values. The only difference is that at each node you keep a list of the outliers in the y direction.


NB: the R source code for this page is here
Comments (2)

That idiocy will not occur here.

26 Nov 2008, 05:23 AM


I finally got around to reading Memos from the Chairman today after having it languish on my ever growing book inbox for the past few months. Perhaps my purchase was prescient of Bear Stearns' demise, but nothing more so than the first memo in the book, written by Alan Greenberg on Oct 5 1978:

Bear Stearns is moving forward at an accelerated rate and everybody is contributing. It is absolutely essential for us to be able to talk to our partners at all times. All of us are entitled to eat lunch, play golf and go on vacation. But you must leave word with your secretary or associates where you can be reached at all times. Decisions have to be made and your input can be important!
I conducted a study of the 200 firms that have disappeared from Wall Street over the last few years, and I discovered that 62.349% went out of business because the important people did not leave word where they went when they left their desk if even for 10 minutes.
That idiocy will not occur here.

Flash forward 30 years and we have stories like Where in the World is Bear's Jimmy Cayne? Playing Bridge.

Last year, when he was still chief executive of Bear Stearns Cos., James Cayne took heat for hitting the bridge circuit during troubled times for his firm. Will the same rules apply to Cayne now that he's chairman?
We'll soon find out. Thursday and today, as Bear fought off a pending cash crisis that threatened to ruin its business, Mr. Cayne who relinquished his CEO title in January and become the firm's non-executive chairman - has been in Detroit, playing in the North American Bridge Championship.

Three days later, Bear is acquired by JP Morgan.


Comments (0)

#4116 Youngblood, T.

19 Nov 2008, 19:38 PM

(click for more)
Comments (0)
All previous posts

consulting

personal projects

etc