Spatial Indexing

2 Dec 2008, 23:14 PM

Sometimes I forget that Knuth is not the be all and end all of algorithms. So when I couldn't find what I was looking for last night, I slept on it hoping that something would come to me in the morning. And as hoped, on the train I found what I was looking for. It is a great feeling when your mind is jumbling around so many concepts and then everything clears away to reveal a solution. Of course, as soon as I got into the office and did some googling, I found out that this was all worked out years ago. Nonetheless, it is fun to think about.

I was looking for a spatial index for high dimensional data. I know a bit about various spatial indexes and the best approach that I knew of was an R-tree or one of their variants. For the uninitiated, R-trees are multi-branching trees that extend from the B-tree concept, but instead of indexing linear regions, R-trees index multi-dimensional rectangles. Postgres uses R-trees and previously I had made some minor contributions to that codebase. From my possibly flawed understanding of their implementation, the tree is arranged only with consideration of the marginal distributions of the dimensions. In cases where you have correlated variables, you can do better than that.

Lets say we have a 2-dimensional data set, where y = f(x) + e, where e iid. Normal. In this case, let f(x) be a simple linear transform, but everything that follows should work for any monotonic function:

If we want to find all points that are within a query box {(x_min,y_min), (x_max,y_max)} we can use our knowledge of the correlation to get us there.
The x-range of our selection is [x_min, x_max]. This range is shown as the solid green lines. We can use our estimated inverse of the function y = f(x) + e to turn find our y'-range - the dashed green lines representing [f'(y_min), f'(y_max)]. (NB: I'm using f' to denote the inverse of f, not the first derivative.)
Now instead of storing these points in a 2-dimensional tree, like a quadtree, we can use our estimate of f(x) and store our points in a binary tree. To find all the points inside our query box we can take the range that encloses where we expect the points to be, given our estimate of f(x), and use that to search the space. Naively, this range is [max(x_min, f'(y_min)), min(x_max, f'(y_max))], assuming f(x) is increasing. Call this range the selection range - it is shown in orange below:
This gets us part of the way there. However, we want a selectivity and specificity of 1 - that is, we want to find every point that is inside the box and reject any points that happen to fall out of the box. We need to improve on our selection range to take into account the error term, e. In the figure below I have included the prediction interval as two dotted red lines above and below the best fit. I have arbitrarily set the prediction level to 80%. This means that 80% of the y values fall within these lines for any given x. Knowing this, we can take 80% of our points and index them only using the x value. We adjust our search range using the prediction interval and our selectivity increases.

We can now apply this approach to design a tree structure that utilizes our knowledge of the relationship between the variables. Start by dividing the dataset by the median of x so that half the points are to the left, and half to the right:

Note: we could have divided by y, but I just chose x by convention.

Most of the points to the left of the orange line have y values less than f(median(x)), and conversely for the points to the right. We can quantify 'most', by looking at where the 80% prediction interval intersects the y=f(median(x)) line. To the right of the right-most dashed line, 90% (= (100% + 80%)/2) of the points are above the y ordinate where the prediction interval intersects the line, shown as a dashed purple line. We call the points below outliers and highlight them with red circles.

Using this we can now construct a binary tree much like what you would do if you were only keeping track of x values. The only difference is that at each node you keep a list of the outliers in the y direction.


NB: the R source code for this page is here
Comments (2)

That idiocy will not occur here.

26 Nov 2008, 05:23 AM


I finally got around to reading Memos from the Chairman today after having it languish on my ever growing book inbox for the past few months. Perhaps my purchase was prescient of Bear Stearns' demise, but nothing more so than the first memo in the book, written by Alan Greenberg on Oct 5 1978:

Bear Stearns is moving forward at an accelerated rate and everybody is contributing. It is absolutely essential for us to be able to talk to our partners at all times. All of us are entitled to eat lunch, play golf and go on vacation. But you must leave word with your secretary or associates where you can be reached at all times. Decisions have to be made and your input can be important!
I conducted a study of the 200 firms that have disappeared from Wall Street over the last few years, and I discovered that 62.349% went out of business because the important people did not leave word where they went when they left their desk if even for 10 minutes.
That idiocy will not occur here.

Flash forward 30 years and we have stories like Where in the World is Bear's Jimmy Cayne? Playing Bridge.

Last year, when he was still chief executive of Bear Stearns Cos., James Cayne took heat for hitting the bridge circuit during troubled times for his firm. Will the same rules apply to Cayne now that he's chairman?
We'll soon find out. Thursday and today, as Bear fought off a pending cash crisis that threatened to ruin its business, Mr. Cayne who relinquished his CEO title in January and become the firm's non-executive chairman - has been in Detroit, playing in the North American Bridge Championship.

Three days later, Bear is acquired by JP Morgan.


Comments (0)

#4116 Youngblood, T.

19 Nov 2008, 19:38 PM

(click for more)
Comments (0)

Oh my, I think I'm getting the vapors

13 Nov 2008, 17:31 PM

A few years ago I started specing out a project to build my own customizable video camera rig. After writing up some code to simulate various sensor types I decided to get my hands dirty with this project, which was a great success. In the following year or so I got busy with a new job and didn't have the same resources as a certain eyewear mogul and didn't follow up. While it was a fun project for me to work on, in the back of my mind I was thinking about the possibilities of commercializing my idea. Since then, RED was launched and is a smashing success.

Maybe I should dust off the old calculator project and get cracking before Dov Charney decides that it is a hot opportunity.


Comments (0)

Red Usher

9 Nov 2008, 16:47 PM


Los Angeles was the Australia of the art world.

Comments (0)

Job applicant advice

6 Nov 2008, 19:07 PM

The web is full of bullet-point lists of handy tips for job applicants. Some of them are OK, some are crap. I'm not sure which direction my post is going to go, but it's certainly what I am looking for in applicants.
  1. Read the job ad. If the ad asks you for a particular piece of information, give it. If you don't you will end up looking like one of the RAND_MAX people who apply for everything that appears on Craigslist.
  2. Write a cover letter. Especially if the ad asks for good communication skills. Don't send your cover letter as an attachment, just treat the email body as your cover letter. If you aren't feeling particularly creative a standard 3 paragraph letter works well:
    1. Who are you / what attracts you to the job
    2. What skills do you have? A small anecdote, e.g., "In my last job I used my awesome C# skills to fix the blah blah blah system..."
    3. Request to be contacted
  3. You must custom write the cover letter to the job you are applying to. You want to spend the next few years working with me, then spend the next few minutes putting some effort in to getting the job.
  4. This probably doesn't apply if you are dealing with some company with a dedicated team of HR drones who stick your resume in a database, but I'm not a drone.. Therefore, send your resume as a PDF or as plain text. I like people who pay attention to detail and MS Word resumes always look crap on my computer because it is not your computer. All of your funny spelling and odd grammar shows up with squiggly lines. A PDF lets you format the resume exactly how you want me to see it. And you get bonus points if your resume was generated in LaTeX.
  5. Keep your resume short. I don't care what you were doing 10 jobs ago, especially if it is pretty much identical to what you are doing now. I don't care about the details of the various internal systems you have worked on. Give me your last 2 or 3 jobs, and a short summary of previous work if you want.
  6. Spell correctly and use a close approximation to correct grammar. And don't misspell a computer related term. Have someone read your resume before you send it out.
  7. God save me from those who include a list of 100+ technologies they know. If you are truly proficient in 35 languages, then say "Proficient in 35 languages, ranging from Ada to Fortran." And unless the job is specifically asking for expert knowledge in some dead technology like dBaseII, don't bother listing it. I don't care.
  8. This probably means that you should tweak your resume to each job you are applying to.
  9. Please include a little bit on your resume showing something about who you are rather than simply what you do. I'd like to know if you ballroom dance, play the cello or play Go.
  10. Most resume tips warn you against anything that might result in your application being ignored. I don't ignore applications - while I may not read them all completely, I do want to find the right employee so everything gets at least a glance. And if the application is egregiously bad I'll certainly pay attention; the laughs alone are worth it.
OK. That's 10 and while I could go on, I feel better already.
Comments (4)

Tweaking

4 Nov 2008, 22:31 PM

I've tweaked my blog 'system' a little bit this afternoon. Instead of being a pile of bash scripts, now it is a pile of bash and perl scripts. Unfortunately, all my blog feeds will be reposted, but on the plus side I now have a copy of all my old posts from various other platorms. Oh, and RSS feeds should now work again. They have been broken since the last time I messed around with the system. I spent a little bit of time writing some javascript to scrape my old blog posts from inside Google Reader, which had the only copy of my old posts from wordpress and blogger.com. There is probably an easier way to extract the posts, but I can always do with some practice on the reverse engineering front.

Ok. That's enough housekeeping. Back to my regular blogging schedule


Comments (5)

What is Obama's Erdos Number?

8 Oct 2008, 15:44 PM

It has to be lower than McCains, which as far as I know is infinite.


Comments (0)

Related Links

24 Sep 2008, 15:28 PM

Let me just reiterate how much I like Amazon's recommendation service. Collaborative filtering at it's finest, if you ask me. Collaborative filtering is the process of using what a system knows about others to make recommendations for an individual. If most people who like A and B also like C, then if you like A and B, you will probably like C. It is a simple concept and it works well in many applications. Netflix implements a similar system, which also looks at people's ratings of the movies that they have rented. If it turns out that I like A and B, but really hated C, the algorithm will find other people like me and perhaps recommend E instead of D.

Most of this is widely known. And then the human interest stories started popping up, the most famous of which covering the 'My Tivo thinks I'm gay' phenomenon:

Mr. Iwanyk, 32 years old, first suspected that his TiVo thought he was gay, since it inexplicably kept recording programs with gay themes. A film studio executive in Los Angeles and the self-described "straightest guy on earth," he tried to tame TiVo's gay fixation by recording war movies and other "guy stuff."

"The problem was, I overcompensated," he says. "It started giving me documentaries on Joseph Goebbels and Adolf Eichmann. It stopped thinking I was gay and decided I was a crazy guy reminiscing about the Third Reich."

While these stories are cute, they highlight the pitfalls of making recommendations that just don't fit. I've been noticing more news stories and blogs including related links at the bottom of their posts. They do this to increase the stickiness of their site, by redirecting traffic back to the original web site. It is a nice idea, but with the prevalence of good recommendation technologies, the slightest hint of bad is enough to train users to ignore such related links sections.

Computer generated related links tend to suck because, unlike collaborative filtering, the technology tries to use high-power semantic analysis to understand the text and come up with related articles. And text analysis technology just isn't that good right now. At my firm we use some fairly run of the mill text processing to identify similar companies based on SEC filing text. This tends to work well as we have a large corpus of highly idiosyncratic text to work from. Short blog posts wouldn't work well with simple techniques, so more complex techniques are used. And they suck. Even worse, none of these technologies attempt to use any user feedback to help train the algorithms. I don't want to see explicit, click-here-if-you-think-this-story-is-related buttons, but at least track which links are clicked on to help optimize your algorithms.

My gut feel is that for the developers to prove the value of their technology they are compelled to include related links in as many stories as possible. This is a wrong-headed move. They should only include links when relevance is well established; otherwise users (like myself) will quickly learn to ignore that junk text that appears at the bottom of blog posts and articles. It took me a few days to subconsciously ignore the RSS feed 'flair' at the bottom of most posts. And now that my ignoring neurons are well trained, it took me all of nano-seconds to realize that 'Related Links' are junk.



Related Links: Email this   Digg this   Stumble it!   Share on Facebook   Share on MySpace   Link to me on LinkedIn  

Comments (0)

Interesting paper - The Promise & Perils of Credit Derivatives

23 Sep 2008, 18:21 PM

Via Marginal Revolution:

In this Article, we begin what we believe will be a fruitful area of scholarly inquiry: an in-depth analysis of credit derivatives. We survey the benefits and risks of credit derivatives, particularly as the use of these instruments affect the role of banks and other creditors in corporate governance. We also hope to create a framework for a more general scholarly discussion of credit derivatives.

We define credit derivatives as financial instruments whose payoffs are linked in some way to a change in credit quality of an issuer or issuers. Our research suggests that there are two major categories of credit derivative. First, a credit default swap is a private contract in which private parties bet on a debt issuer's bankruptcy, default, or restructuring. For example, a bank that has loaned $10 million to a company might enter into a $10 million credit default swap with a third party for hedging purposes. If the company defaults on its debt, the bank will lose money on the loan, but make money on the swap; conversely, if the company does not default, the bank will make a payment to the third party, reducing its profits on the loan.

Second, a collateralized debt obligation (CDO) is a pool of debt contracts housed within a special purpose entity (SPE) whose capital structure is sliced and resold based on differences in credit quality. In a cash flow CDO, the SPE purchases a portfolio of outstanding debt issued by a range of companies, and finances its purchase by issuing its own financial instruments, including primarily debt but also equity. In a synthetic CDO, the SPE does not purchase actual bonds, but instead enters into several credit default swaps with a third party, to create synthetic exposure to the outstanding debt issued by a range of companies. The SPE finances its purchase by issuing financial instruments to investors, but these instruments are backed by credit default swaps rather than any actual bonds.

In the Article's first substantive part, we discuss the benefits associated with both types of credit derivatives, which include increased opportunities for hedging, increased liquidity, reduced transaction costs, and a deeper and potentially more efficient market for trading credit risk. We then discuss the risks associated with credit derivatives, such as moral hazard and other incentive problems, limited disclosure, potential systemic risk, high transaction costs, and the mispricing of credit. After considering the benefits and risks, we discuss some of the implications of our findings, and make some preliminary recommendations. In particular, we focus on the issues of disclosure, regulatory licenses associated with credit ratings, and the special treatment of derivatives in bankruptcy.

You can download the paper here. It raises a few of the points I raised in my recent blog post about the lead up to the current mess[1], but does so in greater depth and eloquence.


[1] I think that post got lost in many RSS feeds due to me tweaking with my blog system. It's probably worth a read, and I'd appreciate your feedback.

Comments (0)
All previous posts

consulting

personal projects

etc