Thursday, March 3, 2011

It is time to usher in the computer era

Maybe it is time for baseball to hire Watson?

One of the most exciting things to me about baseball is the ability to measure just about every single thing that happens during the course of the game and the ability to analyze that data later. More than any other sport, baseball is one of endless measurements and the collection of data. With the advances in technology and the spread of sabermetrics, our understanding of statistics has grown.

Even with our great leaps in understanding, if the data that underpins the research is faulty, it calls into question some of the conclusions and analysis built on it. Several recent examinations of the accuracy of our data give reason for concern.

Baseball Info Solutions (BIS) has a wealth of baseball information and has done a marvelous job of making things that were previously unknown now available, but some of the things they provide, such as pitch location or fly ball vs. line drive, may have biases introduced inadvertently by the human stringers that provide this data.

With the new BIS iPhone application that provides pitch location data, Baseball Prospectus had their first chance to judge the accuracy of the human stringers and found reasons for concern.

To the extent that these two samples that were advertised by BIS are representative of the overall data quality, BIS plate locations are so inaccurate as to be useless. The stringers seem to have a profound tendency to mark strikeout pitches near the edges of the zone, regardless of the actual pitch location.

That is pretty concerning, and while it is based on a sample of only two batters, it supports the earlier suspicion that there could be systematic bias in the data as a result of camera angles or whether or not a player swings at a pitch.

The problem isn't just in saying where a pitch ends up; it also rears its head in assessing whether a batted ball is a fly ball or a line drive. That call is itself a bit subjective, and there is a gray area between the two. It's incredibly difficult to differentiate them, and adding the biases into the equation makes things even worse.

This problem clouds advanced fielding stats, which are still in their infancy and depend heavily on accurate inputs to produce accurate and useful judgments. It can also affect how we evaluate pitchers' effectiveness, since many advanced statistics rely on batted ball data, such as ground ball rate or home runs per fly ball.

A study by Colin Wyers, published in The Hardball Times, shows that how batted balls are scored is related to press box height.

The placement of the observer has an effect on how that observer determines the trajectory of a batted ball. Let's focus on air balls—fly balls, line drives and pop-ups. Based upon what we know, we should expect that the higher the observer, the flatter a batted ball looks and the more likely it is to be scored a line drive...

Running a regression analysis, we see that a change in observer height of one foot is worth nearly .002 points of line drive percentage. That's a significant effect, for my money....

The implication of this is that we could see an effect where fielders are over- or underrated by defensive metrics based upon that scoring data, even over a period of years, because of an error introduced by a persistent bias. What I can't tell you—at least not without a lot more study—is which players, by how much or even the magnitude of the potential effect.

This isn't a repudiation of current defensive metrics, mind you. But people get the impression that they are based on a cold, calculating computer. But all current means we have of measuring defensive impact are based on human observation.
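
To get a feel for the size of that regression coefficient, here is a rough back-of-the-envelope sketch in Python. It only multiplies out the slope quoted above (roughly .002 of line drive rate per foot of observer height). The 20-foot height difference and the 1,000 batted balls are hypothetical numbers picked for illustration, not figures from the study, and I am assuming the slope is expressed as a rate (so .002 means 0.2 percentage points).

```python
# Slope quoted from the study: "nearly .002" of line drive rate per foot
# of observer height. Treated here as exactly 0.002 for simplicity.
LD_RATE_PER_FOOT = 0.002


def line_drive_rate_shift(height_diff_ft, slope=LD_RATE_PER_FOOT):
    """Expected change in scored line drive rate for a change in observer height."""
    return slope * height_diff_ft


def extra_line_drives(height_diff_ft, batted_balls, slope=LD_RATE_PER_FOOT):
    """Expected number of batted balls scored differently over a sample."""
    return line_drive_rate_shift(height_diff_ft, slope) * batted_balls


# Hypothetical scenario: one press box sits 20 feet higher than another,
# and we look at 1,000 batted balls scored from each.
shift = line_drive_rate_shift(20)
extra = extra_line_drives(20, 1000)
print(f"Line drive rate shift: {shift:.3f}")
print(f"Extra balls scored as line drives per 1,000: {extra:.0f}")
```

Under those assumptions, the same set of batted balls would pick up about four percentage points of line drive rate just from where the observer sits, which is why a persistent bias like this could matter for any metric built on that scoring.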

All of this makes me long for the day when we CAN have the cold, calculating computer doing the measurements for us. Eliminating bias and improving the accuracy of the measurements can only help our understanding of the game.

The Pitch f/x information has opened up lines of research that were previously untouchable, and further advances in tracking batted balls could deepen our understanding even more.

So I say: bring on the computers.

h/t to @paapfly for sharing the first link about the BIS data compared to pitch f/x.

