2008

Jasper Burch

"Fiscal Policy and Growth: The Case of Puerto Rico"

Advisor: Duncan Melville (dmelville@stlawu.edu)

We statistically examine the impact that US federal transfer payments
(Medicare, welfare, Social Security, etc.) have on the economic development of Puerto Rico.
The economy of Puerto Rico is substantially different from that of the mainland United States.
As a commonwealth, however, Puerto Rico is subject to US federal laws and policies.
This leads us to ask how severe the inefficiencies created by US federal laws and policies are in Puerto Rico.

Meaghan Cahill

"Testing the Assumptions: A Simulation Study Monitoring the Effects of Manipulating Regression
and Statistical Inference Assumptions"

Advisor: Travis Atkinson (tatkinson@stlawu.edu)

Most statistical procedures require certain assumptions to hold in order
for the results to be valid. What changes if these assumptions are not met?
We manipulate the assumptions involved in simple linear regression models,
confidence intervals of population means, and confidence intervals of population proportions.
We use Fathom to simulate data sets that do and do not satisfy the assumptions for these
statistical procedures, and display the changes and problems that occur.
For simple linear regression, we examine the change in SSE resulting from manipulating
the model assumptions associated with this procedure.
We also demonstrate that the coverage percentages for confidence intervals are not accurate
if the assumptions are not met. For example, the coverage percentage
for confidence intervals of a population proportion is much lower than expected when the probability
of success in the population is either close to zero or close to one.
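The coverage problem described above can be checked directly by simulation. The sketch below, in Python rather than the Fathom package used in the project, repeatedly builds the standard large-sample interval for a proportion and counts how often it captures the true value; the sample size and proportions are illustrative choices, not the study's.

```python
import random

def wald_ci_covers(p, n, z=1.96):
    """Draw one sample of size n and check whether the standard (Wald)
    confidence interval for a proportion captures the true p."""
    x = sum(1 for _ in range(n) if random.random() < p)
    phat = x / n
    half = z * (phat * (1 - phat) / n) ** 0.5
    return phat - half <= p <= phat + half

def coverage(p, n, reps=10000):
    """Estimate the actual coverage percentage by repeated simulation."""
    random.seed(1)
    return sum(wald_ci_covers(p, n) for _ in range(reps)) / reps

print(coverage(0.5, 30))   # near the nominal 0.95
print(coverage(0.02, 30))  # far below 0.95 when p is close to zero
```

With p = 0.02 and n = 30, the sample often contains no successes at all, giving a zero-width interval that cannot cover the true proportion.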

Joe Cleary

"Using Stein Estimation to Predict Performance"

Advisor: Robin Lock (rlock@stlawu.edu)

Stein estimators can be used to predict future performance of individuals
when we have small samples of previous performance for many individuals.
For example, suppose we have data on save percentages for a number of NHL goalies
through ten games of the season. Because we have small sample sizes, the estimates
for save percentages will be quite variable. Goalies with the highest percentages
are likely to move towards a more typical save percentage as the season progresses;
the same will tend to happen for goalies with low initial percentages.
Stein estimators provide a method for adjusting estimates based on the distribution
of the collection of goalies. The underlying method is a two stage hierarchical model
in which binomial and beta distributions are used. We examine properties and
implementation of Stein estimators theoretically and with application to real data.
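The two-stage shrinkage idea can be sketched in a few lines. This is a minimal Python illustration with hypothetical save counts and a rough method-of-moments beta prior, not the project's actual implementation; note the moment fit can fail (negative prior size) if the raw percentages are more dispersed than a beta prior allows, which does not happen with data like the example below.

```python
def shrunk_save_pcts(saves, shots):
    """Stein-style estimates from a two-stage model: p_i ~ Beta(a, b),
    saves_i ~ Binomial(shots_i, p_i). Each raw save percentage is pulled
    toward the group mean, with the amount of shrinkage set by a rough
    method-of-moments beta prior fit to the whole collection."""
    raw = [s / n for s, n in zip(saves, shots)]
    mu = sum(raw) / len(raw)
    var = sum((r - mu) ** 2 for r in raw) / (len(raw) - 1)
    k = mu * (1 - mu) / var - 1          # beta "prior sample size" a + b
    a, b = mu * k, (1 - mu) * k
    # Posterior mean for each goalie: a convex mix of prior mean and raw data
    return [(a + s) / (a + b + n) for s, n in zip(saves, shots)]

# Hypothetical save counts for four goalies after ten games
saves = [285, 245, 279, 262]
shots = [300, 280, 310, 295]
print(shrunk_save_pcts(saves, shots))
```

Each adjusted estimate lies between that goalie's raw percentage and the group mean, which is exactly the regression toward typical performance described above.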

Laura Daley

"Chance vs. Skill: Assessing Shootouts in the NHL"

Advisor: Michael Schuckers (schuckers@stlawu.edu)

The shootout was adopted in the 2005-06 season by the National Hockey League
(NHL), an ice hockey league consisting of 30 teams across the United States and
Canada. The shootout eliminates ties in NHL competition.
This investigation looks at the shootout in the 2005-06, 2006-07, and 2007-08
NHL seasons. We analyze results for shooters and goalies to determine the probability
of scoring and whether it differs significantly from player to player. The data
will be evaluated in an attempt to determine if chance or skill is the prevailing
characteristic that influences the outcome of the shootout.

Brent Davis

"Analyzing Queueing Models and an Issue of Independence"

Advisor: Robin Lock (rlock@stlawu.edu)

Queueing Theory is the study of waiting in line to get service,
for example, in a checkout line at a grocery store, a toll booth or a ticket counter.
We discuss standard queueing models and a method for simulating customer service
times according to these models. A key question is the relationship between queue length
and expected waiting time. Based on simulations we find an interesting scenario
where the common statistical assumption of independence may not be valid and
examine methods for dealing with this situation.
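One of the standard models, the single-server queue with Poisson arrivals and exponential service times (M/M/1), can be simulated in a few lines. A minimal Python sketch with illustrative arrival and service rates:

```python
import random

def mm1_waits(arrival_rate, service_rate, n_customers, seed=0):
    """Simulate an M/M/1 queue: Poisson arrivals, exponential service,
    one server, first-come first-served. Returns each customer's time
    spent waiting in line before service begins."""
    rng = random.Random(seed)
    t = 0.0            # arrival clock
    free_at = 0.0      # time the server next becomes free
    waits = []
    for _ in range(n_customers):
        t += rng.expovariate(arrival_rate)     # next arrival
        start = max(t, free_at)                # wait if the server is busy
        waits.append(start - t)
        free_at = start + rng.expovariate(service_rate)
    return waits

# Theory: mean wait in queue is Wq = rho / (mu - lam), with rho = lam / mu
lam, mu = 0.8, 1.0
w = mm1_waits(lam, mu, 200000)
print(sum(w) / len(w))   # should be near 0.8 / (1.0 - 0.8) = 4.0
```

Consecutive waits in such a simulation are strongly correlated, which is the kind of dependence issue the project examines.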

Kristen Ehringer

"The Price is Right: Using Monte Carlo Simulation to Price Stock Options"

Advisor: Robin Lock (rlock@stlawu.edu)

Simulations are used to imitate real-life situations when other analyses
are too mathematically complex or too difficult to compute. The Monte Carlo method
of simulation generates values for uncertain variables over and over again to simulate a model.
The Monte Carlo simulation can be applied to many financial applications,
such as the pricing of options. In the case of European options, there exists a formula,
known as Black-Scholes, to price these options based on several assumptions.
We develop a Monte Carlo simulation for pricing European options and compare the results
to those of the Black-Scholes equation. Other more complex options do not have convenient equations
to price their value. We show how the Monte Carlo simulation can be adapted to obtain
price estimates for these more exotic options.
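A minimal version of the European-option comparison can be sketched in Python; the parameter values below are illustrative, not from the project.

```python
import math, random

def bs_call(S, K, r, sigma, T):
    """Black-Scholes price of a European call option."""
    d1 = (math.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))
    return S * Phi(d1) - K * math.exp(-r * T) * Phi(d2)

def mc_call(S, K, r, sigma, T, n=200000, seed=1):
    """Monte Carlo price: simulate terminal stock prices under geometric
    Brownian motion and discount the average payoff."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        z = rng.gauss(0, 1)
        ST = S * math.exp((r - 0.5 * sigma**2) * T + sigma * math.sqrt(T) * z)
        total += max(ST - K, 0.0)
    return math.exp(-r * T) * total / n

print(bs_call(100, 100, 0.05, 0.2, 1.0))   # about 10.45
print(mc_call(100, 100, 0.05, 0.2, 1.0))   # should agree to within ~0.1
```

For exotic options without a closed form, only the payoff line inside the loop changes, which is what makes the simulation approach adaptable.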

Andrew Ford

"Estimating Atypical Baseball Career Trajectories"

Advisor: Robin Lock (rlock@stlawu.edu)

It is common in any sport for a player to continue improving throughout their career
up until a certain point, at which their production begins to fall.
For example, the number of home runs a player hits generally rises until around age thirty,
after which their number of home runs begins to fall. However, there are times when we notice
that the production rate of a player does not follow this common pattern.
For example, in baseball the number of home runs a player hits may not follow the ordinary pattern
because of a variety of factors, ranging from being traded to a team that plays in a park
with different field dimensions, to intense training in the off-season, medical enhancements,
or even random chance. In this investigation we use the statistics and graphics software
package R to apply the loess curve fitting method to career statistics for individual baseball players.
In addition, we create several functions in R to allow us to compare the production rates of
individual players to other players to determine when a player's career trajectory appears atypical.
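The heart of the loess method is a weighted local fit at each point. The project uses R's loess function; the pure-Python sketch below shows the core idea at a single evaluation point, with the tricube weight function, an illustrative span, and a hypothetical career trajectory.

```python
def loess_fit(xs, ys, x0, span=0.5):
    """Tricube-weighted local linear fit at x0: the core idea of loess."""
    n = len(xs)
    k = max(2, int(span * n))                     # points in the local window
    d = sorted(abs(x - x0) for x in xs)[k - 1] or 1.0
    w = [(1 - min(abs(x - x0) / d, 1.0) ** 3) ** 3 for x in xs]
    # Weighted least squares for y = b0 + b1 * x
    sw = sum(w)
    sx = sum(wi * x for wi, x in zip(w, xs))
    sy = sum(wi * y for wi, y in zip(w, ys))
    sxx = sum(wi * x * x for wi, x in zip(w, xs))
    sxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    b1 = (sw * sxy - sx * sy) / (sw * sxx - sx * sx)
    b0 = (sy - b1 * sx) / sw
    return b0 + b1 * x0

# Hypothetical career: home runs rise to about age 30, then decline
ages = list(range(22, 39))
hrs = [18, 21, 24, 26, 29, 31, 33, 35, 36, 35, 33, 31, 28, 25, 22, 19, 16]
smooth = [loess_fit(ages, hrs, a) for a in ages]
```

Comparing a player's fitted curve to the smoothed curves of comparable players is then what flags an atypical trajectory.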

Jared Fostveit

"Madness of March: More Predictable Than Golf?"

Advisor: Robin Lock (rlock@stlawu.edu)

Scott Berry wrote an article in the magazine Chance in 2000
in which he used NCAA Men's Basketball Tournament data from 1986-2000 to determine
the likelihood of winning based on seeding, probabilities of each seed reaching each round
and expected number of upsets per round. We extend this approach to golf.
For example, we use historical data to create a model to predict winners based on the seedings
in the World Golf Match Play Championships. The golf tournament uses a 64-golfer bracket
similar to NCAA basketball. We create a relatively simple way to achieve results
that are comparable to Berry's computationally intense model. We contrast our basketball predictions
with those of Berry, while also comparing our basketball results to the golf output.
We discover whether golf is more or less predictable than the "Madness" of March.

Eloise Hilarides

"Using Weighting Schemes to Account for Coverage Bias in Internet Surveys"

Advisor: Michael Sheard (msheard@stlawu.edu)

Over the past decade, Internet surveys have become a popular method for
collecting data about the general population. In 2005, the Harris Poll published
findings which claimed that 74% of the United States population had access to
the Internet somewhere. While this number has steadily risen over recent years,
bias still may be introduced if the population without Internet access differs
from the Internet population with regard to the variables of interest. In this
research we studied whether Internet users that only have access to the Internet
outside their home can be useful in reducing bias by assuming that they are
more similar to those without Internet access than the Internet population as
a whole. We outline several weighting adjustment schemes aimed at reducing coverage
bias. Data for this study was taken from the Computer and Internet Use Supplement
of October 2003 administered by the Current Population Survey. We evaluate the
schemes based on overall accuracy by considering the reduction in bias for ten
variables of interest and the variability of estimates from the schemes. We
find that several of the proposed schemes are successful in improving accuracy.
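The simplest of such weighting adjustments, post-stratification, reweights the sample so each stratum matches a known population share. A minimal Python sketch; the strata and shares below are hypothetical, not the study's actual scheme.

```python
def poststratify(sample_counts, population_shares):
    """Compute post-stratification weights so the weighted sample matches
    known population shares for each stratum (e.g. access groups)."""
    n = sum(sample_counts.values())
    return {g: population_shares[g] / (sample_counts[g] / n)
            for g in sample_counts}

# Hypothetical strata: home access vs. outside-only access, where the
# outside-only group stands in for the no-access population under the
# study's similarity assumption, and shares come from an external
# source such as the CPS.
sample = {"home": 700, "outside_only": 300}
pop = {"home": 0.60, "outside_only": 0.40}
weights = poststratify(sample, pop)
```

Weighted estimates are then computed with each respondent carrying his or her stratum's weight, so over-represented groups count for less.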

Jeboah Joerg

"Modeling Volatility in Today’s Marketplace"

Advisor: Duncan Melville (dmelville@stlawu.edu)

In just 7 trading days from January 10th to January 22nd 2008, the S&P 100
dropped almost 8% from 665.64 to 612.82. Eight trading days later the index
had regained half of what it lost. This variability found in financial markets
is known as the "Volatility of the market". So how do we measure
this volatility? We will discuss the different methods of measuring volatility
first, followed by the conditions in which each method theoretically works
best, and then look into the financial markets to characterize volatility in
the real world. We compare the volatility of the S&P 100 to the volatility
of a created index consisting of small companies from the Russell 2000. We
then compare and contrast the different methods of measuring volatility across
the two indices.
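The most common of these measures, historical volatility, is the standard deviation of log returns scaled to an annual figure. A Python sketch with hypothetical prices:

```python
import math

def annualized_vol(prices, periods_per_year=252):
    """Historical volatility: the sample standard deviation of log
    returns, scaled by the square root of periods per year."""
    rets = [math.log(b / a) for a, b in zip(prices, prices[1:])]
    mean = sum(rets) / len(rets)
    var = sum((r - mean) ** 2 for r in rets) / (len(rets) - 1)
    return math.sqrt(var) * math.sqrt(periods_per_year)

# Hypothetical daily closing prices
prices = [100.0, 102.0, 99.0, 103.0, 98.0, 104.0]
print(annualized_vol(prices))
```

Computing this over rolling windows for two indices gives a direct way to compare their volatility over time.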

Victor Kai-Rogers

"Measuring the Effects of 9/11: Intervention Models in Time Series"

Advisor: Robin Lock (rlock@stlawu.edu)

In this project we focus on the study of Intervention Analysis, a special
case of transfer functions in Time Series. Intervention Analysis has been applied
to a variety of topics to observe the impact of various policy implementations,
regime changes and, more recently, effects of terrorism. Specifically, we are
interested in knowing if and when an intervention analysis is appropriate by
examining how goodness-of-fit measures such as AIC and BIC improve over the
univariate analysis alone. In this project we investigate various forms of intervention
functions such as pure jump, impulse, gradually changing and prolonged impulse,
and observe their efficiency compared to regular Box-Jenkins ARIMA models through
the use of theoretical analysis, simulations and actual data. The actual data
used in this paper is the time series data of stock prices around the 9/11 event,
namely American Airlines (AMR) and the S&P 500 market index.
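A stripped-down version of the pure-jump case shows how a goodness-of-fit criterion rewards an intervention term. The Python sketch below uses white noise with a level shift rather than the full ARIMA transfer-function machinery, and compares AIC with and without a step at a known time; all parameters are illustrative.

```python
import math, random

def simulate_jump(n=200, t0=100, omega=3.0, seed=2):
    """White-noise series with a pure-jump (step) intervention at t0."""
    rng = random.Random(seed)
    return [rng.gauss(0, 1) + (omega if t >= t0 else 0.0) for t in range(n)]

def aic_gain(y, t0):
    """Compare a constant-mean model to one with a step at t0 via AIC
    (n*log(SSE/n) + 2k); a large drop favors the intervention model."""
    n = len(y)
    mean = sum(y) / n
    sse0 = sum((v - mean) ** 2 for v in y)
    m1, m2 = sum(y[:t0]) / t0, sum(y[t0:]) / (n - t0)
    sse1 = sum((v - m1) ** 2 for v in y[:t0]) + sum((v - m2) ** 2 for v in y[t0:])
    aic0 = n * math.log(sse0 / n) + 2 * 1
    aic1 = n * math.log(sse1 / n) + 2 * 2
    return aic0 - aic1

y = simulate_jump()
print(aic_gain(y, 100))   # large positive: the step term is worth its parameter
```

When the series contains no intervention, the gain hovers near zero or below, which is the diagnostic behavior the project exploits.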

Ryan Kimber

"Who's Really Better? Comparing American and Latin American Baseball Players"

Advisor: Travis Atkinson (tatkinson@stlawu.edu)

In certain experienced baseball circles, there is a common belief
that Latin American players are naturally better than American players.
I would imagine that this can be quite frustrating for American players,
especially at the professional level. How can we determine whether or not
this notion is actually true? I have decided to shed some light on the question at hand
by using a multiple binary regression analysis to predict the ethnicity of players
(0 for Latin Americans and 1 for Americans) based on numerous career offensive statistics
such as career batting average, on base percentage, slugging percentage, and fielding percentage.
As a result of my analysis, I found a model that correctly predicted the
ethnicity of 73.6% of the Americans and 68.7% of the Latin Americans.
Also, I performed 2-sample t-tests on all of the statistics I used and
found that all of the significant tests gave evidence of American superiority, at least
in terms of offensive statistics.

Rob LaMere

"Estimating Win Probabilities Based on Betting Lines for Sporting Events"

Advisor: Robin Lock (rlock@stlawu.edu)

We use betting lines and point spread systems for sports to determine probabilities
of a team winning. For example, we investigate football or basketball schemes of betting
to predict winners based on their specific betting line. The study is based on data collected
from 1998 through 2007 NFL and NBA seasons. We look at the distribution of the difference
between the favored team's score and the underdog's score and compare to the predicted point spread
for each game. For both sports, these differences are approximately normally distributed. The distribution
for the NFL data is assumed normal with a mean of 0.05 and a standard deviation of 13.19.
The distribution of the NBA data is assumed normal with a mean of -0.5 and a standard deviation of 11.40.
The probability that a team wins is then estimated from the point spread and these
normal distributions using simple linear approximations.
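Given a fitted normal distribution for the margin relative to the spread, a win probability follows directly from the point spread. The Python sketch below uses the exact normal CDF rather than a linear approximation, with the NFL parameters quoted above:

```python
import math

def win_prob(spread, mean=0.05, sd=13.19):
    """P(favorite wins) when (actual margin - spread) is N(mean, sd):
    the margin is N(spread + mean, sd), so
    P(margin > 0) = Phi((spread + mean) / sd)."""
    z = (spread + mean) / sd
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(round(win_prob(0), 3))   # pick'em game: essentially a coin flip
print(round(win_prob(7), 3))   # a 7-point NFL favorite
```

Larger spreads translate smoothly into higher win probabilities, with the curve flattening for heavy favorites.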

Royce Lawrence

"Bowling and the Hot Hand"

Advisor: Michael Schuckers (schuckers@stlawu.edu)

Hot hand, or just luck? Ever wonder why streaks come as they do?
In this paper we investigate the theory of the hot hand:
the tendency to perform at a higher level for a period of time.
For example, a bowler may be more likely to continue to throw strikes after previous strikes.
Using frame by frame bowling data and statistical methods,
we determine if the hot hand actually exists in the sport of bowling.

Dennis Lock

"Beyond +/-: A Rating System to Compare NHL Players"

Advisor: Michael Schuckers (schuckers@stlawu.edu)

While there are many statistics, such as plus/minus, to evaluate a hockey player's
performance, few of these statistics effectively compare the worth of different
kinds of players. We propose a new, more comprehensive rating method, extending
the concept of plus/minus, which aims to take most aspects of a player's
game into account. Each player will be rated on the same scale, where the value
of each play is determined by how that play increases or decreases the likelihood
of victory. This creates a method of rating where one can compare the value
of a pure goal scorer to a defensive specialist, and everyone in between.

Sarah MacFarland

"The Effects of Marriage on Spousal Happiness & Well-Being"

Advisor: Travis Atkinson (tatkinson@stlawu.edu)

With a long-standing appreciation for marriage and for the extensive dynamics
that surround family life, a research project encompassing both topics was
of great interest. Using raw data found on the Inter-University Consortium
for Political and Social Research (ICPSR), the effects of marriage on the happiness and well-being
of Michigan couples were investigated. Three hundred seventy-three newlywed couples
(who were all intra-racial and in their first marriage) were interviewed in Wayne County, Michigan
during April-June, 1986, following their first four years of marriage. All couples,
174 White and 199 Black, were between the ages of 25 and 37. Variables
such as marital status, total marriage satisfaction, level of happiness, and hopefulness about the future
were chosen for the final data set. General findings indicated that couples in their early years
of marriage were very satisfied and felt that their lives were full as a result of married life.

Peter McGoldrick

"Analysis of Basketball Statistics and Midseason NBA Trades"

Advisor: Robin Lock (rlock@stlawu.edu)

The *Journal of Quantitative Analysis in Sports* published a 2007
article by Jason Kubatko, Dean Oliver, Kevin Pelton, and Dan T. Rosenbaum entitled
"A Starting Point for Analyzing Basketball Statistics." This study resulted
in the formulation of a number of statistics that aim to give a clearer understanding
of the quality of both individual and team play in the National Basketball Association
(NBA). Based on a central theme of per possession statistics, we discuss the
efficiency of team statistics and methods to evaluate player performance using
per minute statistics. With these variations on our classical basketball box
score, we are also able to compare basketball statistics from league to league.
We define important statistical terms and themes for basketball, and provide
examples of the effectiveness of these statistics with recent NBA data. By application
of statistical procedures we analyze midseason NBA trades and their resulting
impact.
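The central per-possession idea rests on estimating possessions from the ordinary box score. A Python sketch of the standard formula from Kubatko et al., applied to a hypothetical team line:

```python
def possessions(fga, fta, orb, tov):
    """Estimated possessions (Kubatko et al.): field goal attempts,
    minus offensive rebounds, plus turnovers, plus 0.44 of free throw
    attempts (the share that ends a possession)."""
    return fga - orb + tov + 0.44 * fta

def off_rating(points, fga, fta, orb, tov):
    """Offensive rating: points scored per 100 possessions."""
    return 100 * points / possessions(fga, fta, orb, tov)

# Hypothetical single-game team line
print(off_rating(points=102, fga=85, fta=25, orb=12, tov=14))
```

Because ratings are per possession rather than per game, they allow the cross-league and pre/post-trade comparisons described above without pace distorting the picture.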

Nikolay Naletov

"Investment Risk Measurement: Approaches to Risk Measurement"

Advisor: Duncan Melville (dmelville@stlawu.edu)

This research presents different ways to measure investment risk.
In particular, it shows a comparison between VaRs (Value-at-Risk) based on methods such
as variance-covariance, historical and Monte Carlo simulations, and GARCH approaches.
The paper compares and explains the different techniques and methodologies
by using the historical database of the equity portfolio of Crown Royalties
Investments, which is the investment club at Saint Lawrence University.
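The historical-simulation approach is the simplest of these: sort past losses and read off a quantile. A minimal Python sketch with illustrative returns and a simple quantile convention (not the club's actual data):

```python
import math

def historical_var(returns, level=0.95):
    """Historical-simulation VaR: the loss exceeded on only a (1 - level)
    share of past periods, reported as a positive number."""
    losses = sorted(-r for r in returns)
    # Small epsilon guards against floating-point error in the index
    idx = math.ceil(level * len(losses) - 1e-9) - 1
    return losses[idx]

# Hypothetical periodic returns for a portfolio
rets = [0.012, -0.034, 0.008, -0.051, 0.021, -0.012, 0.005, -0.027, 0.016, -0.044]
print(historical_var(rets, 0.9))   # → 0.044
```

The variance-covariance and Monte Carlo versions replace the empirical loss distribution with a parametric or simulated one, but read off the quantile the same way.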

Jamie Wolff

"Performance vs. Pick: A Study of the NBA Draft"

Advisor: Michael Schuckers (schuckers@stlawu.edu)

Millions of dollars are invested in the top draft picks of the
National Basketball Association (NBA). A significant amount of deliberation and
analysis is put into determining which athlete to select. Often teams make trades in order to better
their position in the draft, or they "trade down," meaning they trade away an early draft pick
for more draft picks later on. By giving another team money, current players, or draft picks,
a team that trades up hopes the earlier, lower-numbered pick will be more productive than a later one.
This investigation will explore any significant differences among draft picks and what
would be the advantage, if any, in drafting at one pick over another. We will use NBA
career statistics (points per game, minutes per game, games played, all-star appearances)
to assess draft decisions based on player productivity. Data from the 1994 through the 2007 seasons
was compiled from Basketball-Reference.com. This study will divide the draft picks into eight zones
(four in the first round and four in the second) and compare zones to find any significant differences.
Results suggest that lottery picks are valuable and trading up or down may not be productive.

2007

Nick Alena

"Linear Discriminant Analysis"

Advisor: Robin Lock (rlock@stlawu.edu)

Can we use numerical characteristics of a painting to trace The Mona Lisa to Leonardo DaVinci,
measurements of an insect's wing span to determine whether it is a deadly midge,
or the words in an email's title, such as "Viagra", to separate spam from legitimate messages?
Linear Discriminant Analysis (LDA) is a statistical method used to identify group
membership using a linear combination of features. The techniques extend concepts
from ANOVA (Analysis of Variance) and Regression Analysis. We examine methods for
producing a linear combination that best distinguishes between the groups using
univariate and multivariate sets of predictors.
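For two groups and two predictors, Fisher's discriminant direction can be computed by hand. A self-contained Python sketch with hypothetical, well-separated measurements (not data from the project):

```python
def lda_train(class0, class1):
    """Fisher's linear discriminant for two groups with two features:
    direction w = S_w^-1 (m1 - m0), with the 2x2 inverse done by hand."""
    def mean(pts):
        n = len(pts)
        return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)
    m0, m1 = mean(class0), mean(class1)
    # Pooled within-class scatter matrix S_w
    s00 = s01 = s11 = 0.0
    for pts, m in ((class0, m0), (class1, m1)):
        for x, y in pts:
            dx, dy = x - m[0], y - m[1]
            s00 += dx * dx
            s01 += dx * dy
            s11 += dy * dy
    det = s00 * s11 - s01 * s01
    dmx, dmy = m1[0] - m0[0], m1[1] - m0[1]
    w = ((s11 * dmx - s01 * dmy) / det, (s00 * dmy - s01 * dmx) / det)
    # Decision threshold: project the midpoint of the two means onto w
    mid = w[0] * (m0[0] + m1[0]) / 2 + w[1] * (m0[1] + m1[1]) / 2
    return w, mid

def lda_classify(point, w, mid):
    """Assign to group 1 if the projection along w exceeds the midpoint."""
    return 1 if point[0] * w[0] + point[1] * w[1] > mid else 0

# Hypothetical two-feature measurements for two species
species_a = [(0, 0), (1, 0), (0, 1), (1, 1)]
species_b = [(3, 3), (4, 3), (3, 4), (4, 4)]
w, mid = lda_train(species_a, species_b)
```

With more features the same formula applies with a larger matrix inverse, and with more than two groups the method generalizes to multiple discriminant directions.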

Dustin Cidorwich

"Analysis of NFL 'Draft Pick Value Chart'"

Advisor: Michael Schuckers (schuckers@stlawu.edu)

The goal of this project is to evaluate the 'Draft Pick Value Chart',
which is used by most NFL teams as a way of allocating value to individual draft picks.
I have used games played and games started as a proxy for a drafted player's value,
and use regression methodology to link the explanatory variables to that value.
The variables used in the analysis include selection number, year, team,
and position. The predicted values will be scaled and compared with the 'Draft Pick Value Chart'
values. Would it be in a team's best interest to trade a second and fourth round draft
pick for a first round pick?

Yordan Minev

"ROC Confidence Region Using Radial Sweep"

Advisor: Michael Schuckers (schuckers@stlawu.edu)

A biometric authentication system matches physiological characteristics
to a database of such characteristics. In biometric authentication,
genuine users are generally those that the system should accept and
imposters are those that the system should reject. One methodology for
evaluating the matching performance of biometric authentication systems
is the receiver operating characteristic (ROC) curve. The ROC curve
graphically illustrates the relationship between type I and type II
statistical classification errors as a threshold is varied across the
genuine and imposter match score distributions.

The goal of this paper is to estimate the performance of each biometric
system via a confidence region for a ROC curve of that system's performance.
In this project ROC confidence regions will be created using the radial sweep method.
Radial sweep is based on converting the type I and type II errors to polar coordinates.
The technique of bootstrapping will be utilized to estimate the variability of each
point on an individual ROC curve. Simulations will be performed using real
biometric match score data. Next, a radial sweep method for comparing two ROC
curves will be discussed.
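A minimal version of the ROC construction and the bootstrap step can be sketched in Python; the scores below are hypothetical stand-ins for real match score data, and only a single threshold's false accept rate is bootstrapped here rather than a full region.

```python
import random

def roc_points(genuine, imposter, thresholds):
    """Type I/II trade-off: for each threshold t, FAR is the share of
    imposter scores at or above t, FRR the share of genuine scores below t."""
    pts = []
    for t in thresholds:
        far = sum(s >= t for s in imposter) / len(imposter)
        frr = sum(s < t for s in genuine) / len(genuine)
        pts.append((far, frr))
    return pts

def bootstrap_far(imposter, t, reps=1000, seed=3):
    """Bootstrap interval for the FAR at one threshold: resample the
    imposter scores with replacement and recompute the error rate."""
    rng = random.Random(seed)
    n = len(imposter)
    fars = sorted(sum(s >= t for s in
                      [imposter[rng.randrange(n)] for _ in range(n)]) / n
                  for _ in range(reps))
    return fars[int(0.025 * reps)], fars[int(0.975 * reps)]

# Hypothetical match scores: genuine users score high, imposters low
genuine = [0.70, 0.80, 0.90, 0.85, 0.75]
imposter = [0.10, 0.20, 0.30, 0.15, 0.25]
curve = roc_points(genuine, imposter, [i / 20 for i in range(21)])
```

Sweeping such bootstrap variability radially around the curve, in polar coordinates, is what builds the confidence region described above.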

Julie Muetterties

"Methods for Comparing Two Survival Curves"

Advisor: Robin Lock (rlock@stlawu.edu)

A survival curve shows the proportion of a population at risk which survives
up to a certain time. Such curves can be described by theoretical parametric
models (such as exponential, lognormal, or Weibull) as well as nonparametric
methods (such as Kaplan Meier). We investigate methods for determining if survival
curves from two populations or treatments are significantly different with applications
to real data.
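The Kaplan-Meier estimator itself is short enough to write out. A Python sketch that handles censored observations, with hypothetical event times:

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate. events[i] is True for an observed
    event and False for a censored observation. Returns (time, S(t))
    pairs at each observed event time."""
    data = sorted(zip(times, events))
    n = len(data)
    s = 1.0
    at_risk = n
    curve = []
    i = 0
    while i < n:
        t = data[i][0]
        d = sum(1 for tt, e in data if tt == t and e)    # events at t
        removed = sum(1 for tt, _ in data if tt == t)    # leave the risk set
        if d:
            s *= (1 - d / at_risk)
            curve.append((t, s))
        at_risk -= removed
        i += removed
    return curve

# Hypothetical data: time until an event, with one censored subject
print(kaplan_meier([1, 2, 3, 4, 5], [True, True, False, True, True]))
```

Comparing two such curves, for instance with a log-rank style test, is the kind of significance question the project addresses.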

Julia Palmateer

"Assessing Ratings Methods for College Hockey Teams"

Advisor: Robin Lock (rlock@stlawu.edu)

Ranking college sports teams that play in leagues of different strengths can
be challenging. A number of methods have been proposed to calibrate teams while
accounting for performance and strength of schedule. We use Monte Carlo simulations
of hundreds of seasons to compare several existing methods currently used to
rate college hockey teams. These ranking methods include: raw winning percentage,
Ratings Percentage Index (RPI), Bradley-Terry (KRACH) and Poisson scoring rates
(CHODR). By using the results of the simulated seasons, we can examine many
common questions, including how the strength of a team’s league affects
their rating.

2006

Jeff DiGeronimo

"A Hitter's Salary Through Statistical Analysis of Past Performance"

Advisor: Robin Lock (rlock@stlawu.edu)

What is a current Major League Baseball (MLB) player worth?
Today, controversy over a player's salary is debated every season.
Teams are forced to decide whether a player is worth
keeping or should be traded or released. Many players also face contract arbitration
with the organization (team) they are playing for. This process involves a
decision about the worth of a player. We collect a large data set made
up of MLB hitters' statistics from the previous three seasons, three-year average
statistics, and career statistics. The statistics include measures such
as batting average, runs batted in, on base percentage, home runs, etc.
Other variables that could play a role in the salary of a player,
such as age, years of experience, position, and free agent status, are also used.
We then use regression techniques to analyze a player's salary and how it relates
to the performance of the individual player, formulating a multiple regression
equation to predict the salary of a hitter in the Major Leagues. By means of
the prediction equation, some of the players who filed for arbitration will
have their salaries predicted, and an arbitration ruling will be made based
on the predicted value.

Raluca Dragusanu

"Autoregressive Conditional Heteroscedastic Models"

Advisor: Robin Lock (rlock@stlawu.edu)

Traditional time-series models such as Autoregressive (AR) and Moving Average
(MA) models are based on the homoscedasticity assumption, which translates into
a constant variance for the errors of a model. This assumption has been shown
to be inappropriate when dealing with some economic and financial market data.
A new class of models - conditional heteroscedastic models - was developed to
deal with data that does not exhibit constant variance of the errors. The most
well-known models in this class are the Autoregressive Conditional Heteroscedastic
model (ARCH) and its generalized version (GARCH). Stock market volatility, the
square root of the variance of stock returns, presents a very good application
of this type of model. In finance, volatility is the expression of risk. Since
we must take risks to achieve rewards, finding appropriate methods to forecast
volatility is necessary in order to optimize our behavior and, in particular,
our portfolio. I will present the general properties of the ARCH and GARCH models
and use both Monte Carlo simulations and known financial time series data to
test their performance.
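An ARCH(1) process is simple to simulate, and the simulation exhibits the volatility clustering that motivates the model. A Python sketch with illustrative parameters:

```python
import math, random

def simulate_arch1(n, a0=0.2, a1=0.5, seed=4):
    """ARCH(1): r_t = sigma_t * z_t with sigma_t^2 = a0 + a1 * r_{t-1}^2.
    A large shock today raises tomorrow's variance, so quiet and
    turbulent stretches cluster together."""
    rng = random.Random(seed)
    r, out = 0.0, []
    for _ in range(n):
        sigma2 = a0 + a1 * r * r
        r = math.sqrt(sigma2) * rng.gauss(0, 1)
        out.append(r)
    return out

# Unconditional variance is a0 / (1 - a1) = 0.4 for these parameters
r = simulate_arch1(100000)
print(sum(x * x for x in r) / len(r))   # should be near 0.4
```

In such a series the returns themselves are nearly uncorrelated while the squared returns are positively autocorrelated, which is the signature the ARCH model captures and the constant-variance AR/MA models miss.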

Travis Gingras

"Applications of a Graphical Information System to Ice Hockey"

Advisor: Robin Lock (rlock@stlawu.edu)

Statistics and sports have been related for many years, and recently
the art of using statistics to observe players' tendencies has become more
and more common among coaches. This project investigates patterns
of shots taken by the St. Lawrence Men’s hockey team using a geographic information system.
ArcGIS is a mapping program generally designed for geographical data,
but in this project we have defined a database to store information about
individual shots in multiple hockey games while placing them on a map of
the offensive zone of a hockey rink. We can then study patterns and look
for the trends that might benefit individual players or the team as a whole.

Jim Hall

"Investigating the Effectiveness of Bootstrap Confidence Intervals"

Advisor: Robin Lock (rlock@stlawu.edu)

The statistical procedure known as “bootstrapping” is used to approximate
a sampling distribution for any statistic by resampling from an original sample
with replacement in order to draw conclusions about the shape, center and variability
of the sample statistic. These methods avoid traditional assumptions such as
assuming a certain population is normally distributed. We will give a brief
description of bootstrapping techniques and discuss several methods for constructing
confidence intervals based on the bootstrap distribution. We demonstrate via
computer simulation (using the statistical software packages R and Fathom) the
effectiveness, in terms of coverage and average width, of bootstrap confidence
intervals compared to traditional confidence intervals in standard situations
and in cases where standard assumptions fail.
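The percentile method, the simplest of these bootstrap intervals, can be sketched in a few lines of Python (the project itself used R and Fathom); the sample below is hypothetical.

```python
import random

def bootstrap_ci(data, stat, level=0.95, reps=5000, seed=5):
    """Percentile bootstrap CI: resample with replacement, recompute the
    statistic each time, and take the middle `level` share of the results."""
    rng = random.Random(seed)
    n = len(data)
    stats = sorted(stat([data[rng.randrange(n)] for _ in range(n)])
                   for _ in range(reps))
    alpha = (1 - level) / 2
    return stats[int(alpha * reps)], stats[int((1 - alpha) * reps) - 1]

# Hypothetical sample; the statistic here is the mean, but any
# function of the data (median, trimmed mean, ...) works the same way
sample = [3.1, 2.4, 4.8, 3.7, 2.9, 5.2, 3.3, 4.1, 2.6, 3.9]
lo, hi = bootstrap_ci(sample, lambda d: sum(d) / len(d))
print(lo, hi)
```

Measuring how often such intervals capture a known parameter, and how wide they are, across many simulated samples is exactly the coverage comparison described above.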

Robin Hanson

"Statistical Analysis of CHIME Data: Relationships Involving Heart Rate Variability"

Advisor: Michael Schuckers (schuckers@stlawu.edu)

Life-threatening events including bradycardia and apnea in infants are a major
health concern for families and physicians. Events further described as "extreme
events" occurred at least once in 20.6% of asymptomatic infants and 33%
of infants born prematurely. The motivation for this study is the Collaborative
Home Infant Monitoring Evaluation (CHIME), a study formed and funded by the
National Institutes of Health (NIH). The CHIME database is the largest infant
longitudinal physiologic dataset. The work proposed here is part of a larger
study to use heart rate variability to study the prediction of life-threatening
events in infants and the classification of infant sleep state. This work seeks
to develop methods to examine the effect of age on heart rate variability measures
in each sleep state. These methods include both numerical summaries and graphical
displays.

Emily Sheldon

"Sequential Analysis for the Beta-Binomial"

Advisor: Michael Schuckers (schuckers@stlawu.edu)

In this paper we attempt to derive an equation from the Beta-binomial distribution
that can be used to apply sequential probability ratio testing to biometric
devices. We first examine sequential analysis testing methods and then apply
them to examples of multiple independent Bernoulli trials. We use these examples
to illustrate the decision of when to stop testing. Lastly, we examine the Beta-binomial
distribution and derive an equation that can be used in sequential analysis
methodology.

Joshua White

"Predicting a Pitcher's Salary Using Statistical Analysis"

Advisor: Robin Lock (rlock@stlawu.edu)

When looking at a baseball player's performance compared to his
salary, there should be some kind of correlation: the better the player, the
better the salary. However, a player's worth does not always reflect his
salary. Throughout the season owners and general managers are faced with decisions
dealing with a player's salary and/or whether or not they should keep
them. This may create conflict with a player and his team forcing the player
to deal with arbitration, a process where an external person chooses a salary
figure based on arguments presented by the team and player. The main goal of
this project is to examine models for predicting a pitcher's salary based on
past performance. We start by compiling a database of Major League Baseball
pitcher's statistics from various websites on the internet (mlb.com, espn.com,
baseball-reference.com). This includes the obvious variables of a pitcher (ERA,
walks, strike outs, etc) and will also include other ones such as age, years
in the league, free agent status, and starting pitcher vs. relief. Once the
database is established, we perform statistical analysis, using techniques such
as multiple regression, to create and assess models for predicting pitchers'
salaries. Once all of the statistical analysis is complete, individual case
studies such as actual arbitration cases for current pitchers are examined and
tested using the models developed.

2005

Hillary Hartson

"Sequential Testing"

Advisor: Michael Schuckers (schuckers@stlawu.edu)

In this paper we investigate Wald's *Sequential Analysis* to develop
a basic understanding of sequential testing. Sequential testing differs from
traditional hypothesis testing in that it allows for three possible conclusions
to be made after a subset of observations have been drawn. These three conclusions
are: reject the null hypothesis, accept the null hypothesis, and continue testing.
This is in contrast to traditional hypothesis testing, where the only choices
are to reject or to accept.
introducing a sequential test called the sequential probability ratio test and
then use it to derive decision rules for comparing a proportion against a one-sided
alternative. We consider application of this test to a variety of possible rates
for the hypothesized value. Lastly, we demonstrate the effects of changing the
values of our Type I and Type II errors on these decision rules.
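The sequential probability ratio test reduces to tracking a running log-likelihood ratio against two fixed boundaries determined by the Type I and Type II error rates. A Python sketch for a proportion against a one-sided alternative, with illustrative hypothesized values:

```python
import math

def sprt(observations, p0, p1, alpha=0.05, beta=0.05):
    """Wald's sequential probability ratio test of H0: p = p0 against
    H1: p = p1 (with p1 > p0) on a stream of 0/1 observations.
    Returns 'reject H0', 'accept H0', or 'continue'."""
    upper = math.log((1 - beta) / alpha)   # cross above: reject H0
    lower = math.log(beta / (1 - alpha))   # cross below: accept H0
    llr = 0.0
    for x in observations:
        llr += math.log(p1 / p0) if x else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "reject H0"
        if llr <= lower:
            return "accept H0"
    return "continue"

print(sprt([1, 1, 1, 1, 1, 1, 1, 1], 0.5, 0.8))   # many successes: reject H0
```

Changing alpha and beta moves the two boundaries apart or together, which is precisely how the error rates control when the test can stop.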

Jeff Homer

"Rating Systems for College Football with Turnovers"

Advisor: Michael Schuckers (schuckers@stlawu.edu)

In this project, I create a rating system for Division III college football
based on relative strength and on homefield advantage. This is similar to Stern’s
Division I college football model (Stern, 1995, Chance Magazine). We use a linear
regression model for this. An additional factor that we take into account is
the turnover (fumbles and interceptions) differential. We further study turnovers
and turnover differential to determine whether or not turnovers can be treated
as a random effect for predicting outcomes from college football.

Nikki Lopez

"Analysis of Biometric Authentication Match Scores"

Advisor: Michael Schuckers (schuckers@stlawu.edu)

For bio-authentication devices the topic of template aging is an important one.
Mansfield and Wayman have defined template aging as "the increase in error
rates caused by time related changes in the biometric pattern, its presentation,
and the sensor" (Mansfield and Wayman, 2002). Understanding such an effect is important both for
policy and for system design. To date, the work that has been done in this area
has been descriptive of the template aging effect rather than inferential. Several
authors have presented descriptive evidence of template aging. Here we focus
on studying error rates (both FAR and FRR) over time and determining whether
or not the impact of time is statistically significant. We apply our methodology
to the National Institute for Standards and Technology (NIST) Biometric Score
Set Release 1 (BSSR1), which was recently made public. The results of our regression
analysis of the match scores and our logistic regression of the decision data
indicate that there is a significant effect due to the changes in time for some
matching algorithms at some thresholds. However, the template aging effect is
not consistent across all modalities and all thresholds.

Petya Madzharova

"Time Series Analysis of Music"

Advisor: Robin Lock (rlock@stlawu.edu)

Mathematics and music have always been very closely related to each other. Here,
we are going to discuss some of the mathematical and statistical characteristics
of music using techniques from time series analysis. First, we will see how
music can be transcribed by applying time series analysis to it – we will
look at and discuss the time series and autocorrelation plots of different songs.
Then, we will observe how we can use the analysis in order to obtain the musical
score from an audio recording. Here, we will work on note identification through
a Fourier series representation of the wave function of the melody. For this
purpose we will use the statistical program R and its add-on musical package tuneR.
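The note-identification step rests on finding the dominant frequency in the signal's Fourier representation. A toy sketch of that idea (plain Python rather than R/tuneR, with a synthetic tone standing in for a real recording):

```python
import cmath
import math

def dominant_frequency(samples, sample_rate):
    """Return the frequency (in Hz) with the largest DFT magnitude.

    A naive O(n^2) discrete Fourier transform, scanning every bin
    between DC and the Nyquist frequency.
    """
    n = len(samples)
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2):
        coeff = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    return best_bin * sample_rate / n

# A synthetic A4 tone (440 Hz) sampled at 8000 Hz stands in for a recording
rate = 8000
tone = [math.sin(2 * math.pi * 440 * t / rate) for t in range(800)]
```

Running `dominant_frequency(tone, rate)` recovers 440 Hz; tuneR applies the same idea, via the fast Fourier transform, to real audio.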

Matt Norton

"Wilson Approximate Confidence Intervals"

Advisor: Michael Schuckers (schuckers@stlawu.edu)

There are a variety of ways in which confidence intervals (CIs) can be created.
However, the goal of all CIs is the same: each attempts to capture a
parameter a certain percentage of the time. For example, in making a 95% confidence
interval, one hopes that the parameter will be captured in the interval 95%
of the time. This paper investigates a method of confidence interval creation
first introduced by Edwin B. Wilson in his 1927 article, "Probable Inference,
the Law of Succession, and Statistical Inference," which was published
in the *Journal of the American Statistical Association*. In Wilson's
paper, he creates a confidence interval for the binomial distribution using
a weighted variance. This paper extends his methodology, essentially creating
a "generalized Wilson" confidence interval that can be applied to
any distribution whose variance is a quadratic function of its mean. This includes
many common distributions, such as the uniform, Poisson, gamma, exponential,
and negative binomial. These are examined throughout this paper, considering
each individually as well as comparing them to the traditional large sample
estimator. We further develop the work presented in Atkinson and Schuckers (2004).
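For reference, the Wilson score interval for a binomial proportion can be computed as follows (a standard textbook form, not code from the paper):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% when z = 1.96)."""
    phat = successes / n
    denom = 1 + z ** 2 / n
    center = (phat + z ** 2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(phat * (1 - phat) / n + z ** 2 / (4 * n ** 2))
    return (center - half, center + half)
```

Unlike the traditional large-sample (Wald) interval, the Wilson interval never escapes [0, 1] and does not collapse to zero width when the sample proportion is 0 or 1, which is why it behaves better near the extremes.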

2004

Travis Atkinson

"Approximate Confidence Interval Estimation for Beta-binomial Proportions of
Biometric Identification Devices"

Advisor: Michael Schuckers (schuckers@stlawu.edu)

The goal of this project was to determine the performance of confidence
interval estimation using the methodology of Agresti and Coull (1998) on proportions
from a Beta-binomial distribution. Agresti and Coull applied their approach
to the binomial distribution, specifically for estimating proportions. In this
research we tried to determine the appropriate manner in which to extend that work
to proportions from a Beta-binomial distribution. In particular we were interested
in applying such a methodology to data from biometric identification devices
such as fingerprint scanners or iris recognition systems. The approach we took
was to simulate data from a variety of Beta-binomial distributions and compare
the performance of the augmented approach of Agresti and Coull to more traditional
approaches to interval estimation. Specifically we considered four ways to extend
this work. For the binomial, n independent binary observations are augmented
with 2 successes and 2 failures. For the Beta-binomial we have k individuals
with m binary observations each. To augment the Beta-binomial data, we first
considered adding 2 successes and 2 failures to a single individual. Second
we considered adding a success and a failure to the counts for 2 individuals.
Third we considered adding 1 success to the data from each of 2 individuals
and 1 failure to the data for 2 different individuals. Last we considered adding
a new individual n+1 who was given 2 successes and 2 failures. We used a Monte
Carlo approach for evaluating the performance of these four augmented approaches.
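The Monte Carlo evaluation strategy can be illustrated on the simpler binomial case that Agresti and Coull originally treated (an illustrative sketch, not the project's Beta-binomial code):

```python
import math
import random

def agresti_coull(successes, n, z=1.96):
    """Agresti-Coull interval: add 2 successes and 2 failures, then
    apply the usual Wald formula to the augmented counts."""
    n_adj = n + 4
    p_adj = (successes + 2) / n_adj
    half = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return (p_adj - half, p_adj + half)

def coverage(p, n, reps=2000, seed=1):
    """Monte Carlo estimate of how often the interval captures the true p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = sum(rng.random() < p for _ in range(n))  # one binomial draw
        lo, hi = agresti_coull(x, n)
        hits += lo <= p <= hi
    return hits / reps
```

Repeating this for many (p, n) combinations, and for each augmentation scheme, gives the coverage comparisons the project reports.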

Katie Livingstone

"Modelling Disease: Mathematics in Epidemiology and Applications to the SARS
Virus"

Advisor: Patti Frazer Lock (plock@stlawu.edu)

Epidemiology is a field of science that has made many significant advances in
studies of the spread of disease. Studies of epidemics and disease spread have
vast mathematical components. Epidemiological models are based on differential
equations that provide the foundations for modeling change over time. These
are useful in modeling rates of infection, rates of recovery, contact rates,
birth and death rates, etc. Such models can predict the impact of a disease
on a population and can suggest valuable strategies for its control. This study
presented the most basic epidemiological model and examined the common uses
and implications of certain features. It looked at several more complex models
and explained how to modify a disease model according to the characteristics
of a disease. Lastly, this study examined how researchers have modeled SARS,
one of the most recent disease outbreaks, in three published articles.
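The most basic model the study presents is the SIR system, which can be integrated numerically in a few lines (an illustrative Euler sketch with made-up parameter values):

```python
def sir(beta, gamma, s0, i0, r0=0.0, days=160, dt=0.1):
    """Euler integration of the basic SIR epidemic model:
        dS/dt = -beta*S*I,  dI/dt = beta*S*I - gamma*I,  dR/dt = gamma*I
    where S, I, R are the susceptible, infected, and recovered fractions."""
    s, i, r = s0, i0, r0
    peak_i = i
    for _ in range(int(days / dt)):
        ds = -beta * s * i          # infections remove susceptibles
        di = beta * s * i - gamma * i  # infections add, recoveries remove
        dr = gamma * i              # recoveries accumulate
        s, i, r = s + ds * dt, i + di * dt, r + dr * dt
        peak_i = max(peak_i, i)
    return s, i, r, peak_i
```

With beta = 0.3 and gamma = 0.1 (so the basic reproduction number beta/gamma = 3), a tiny seed infection produces a large outbreak whose peak prevalence is near 30% of the population.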

Whitney Browne

"The Price Sensitivity Exhibited by Admitted St. Lawrence
Students"

Advisor: Brian Chezum (chezum@stlawu.edu)

This paper examines the
sensitivity of admitted students at St. Lawrence University to changes in cost
of attendance. More specifically, the study will investigate the
effects of an increase in tuition on the probability of a student
enrolling. In order to perform this test, econometric models and
procedures are used to estimate the probability that a given student will attend
St. Lawrence University. Next a tuition increase is introduced by
adjusting the actual cost of attendance at St. Lawrence, relative to each
student’s alternative institution cost. The probabilities found prior
to the tuition increase are compared to those found post-increase, and the
elasticity of the demand is calculated. The results show that there is a
response by admitted students to increases in tuition. Before the Newell
center was built, the tuition increase caused a drop in the proportion of
students who enroll, and admitted students are elastic in response to actual
out-of-pocket cost increases. As a result, prior to the new facilities on
campus, by raising tuition St. Lawrence University failed to reap increased
revenues. However, the elasticity of demand exhibited by students is
decreasing (in absolute value) over time, thus there is an indication that the
capital, including buildings and athletic facilities built over the last five
years, is effectively increasing the value of St. Lawrence in the eyes of
admitted students.
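The elasticity calculation at the heart of the study can be sketched with a toy model (the logit form and all coefficients below are hypothetical, for illustration only):

```python
import math

def enroll_prob(net_cost, b0=5.0, b1=-0.25):
    """Hypothetical logit model: probability an admitted student enrolls,
    given net cost of attendance (in $1000s). Coefficients are made up."""
    return 1 / (1 + math.exp(-(b0 + b1 * net_cost)))

def demand_elasticity(cost, pct_increase=0.05):
    """Arc elasticity: percent change in enrollment probability divided
    by the percent change in cost."""
    p0 = enroll_prob(cost)
    p1 = enroll_prob(cost * (1 + pct_increase))
    return ((p1 - p0) / p0) / pct_increase
```

An elasticity below -1 (elastic demand) means a tuition increase lowers enrollment by a larger percentage than the price rise, so revenue falls, which is the pattern the study finds before the new facilities were built.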

2003

Noelle Laing

"The Mathematical Theory Behind the Capital
Asset Pricing Model"

Advisor: Robin Lock (rlock@stlawu.edu)

In an
attempt to assign values to assets and portfolios, William Sharpe developed the
Capital Asset Pricing Model (CAPM) in 1964. He based the valuations on the
expected return of the asset or portfolio and the risk associated with the
investment. The CAPM was an extension of Harry Markowitz’s financial
model, the Efficient Frontier of optimal portfolios, developed in 1954.
Although the CAPM relies on some seemingly unfeasible assumptions, its creation
dramatically improved modern finance theory. This paper explores the
development and mathematical ideas behind the CAPM, including the Efficient
Frontier, the Capital Market Line and the Securities Market Line.
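The Securities Market Line mentioned above has a simple computational form, E[R_i] = r_f + beta_i * (E[R_m] - r_f), with beta_i estimated from historical returns (an illustrative sketch with toy data, not from the paper):

```python
from statistics import mean

def capm_expected_return(asset_returns, market_returns, risk_free):
    """Securities Market Line: E[R_i] = r_f + beta_i * (E[R_m] - r_f),
    where beta_i = Cov(R_i, R_m) / Var(R_m) is estimated from history."""
    ma, mm = mean(asset_returns), mean(market_returns)
    cov = mean((a - ma) * (m - mm)
               for a, m in zip(asset_returns, market_returns))
    var = mean((m - mm) ** 2 for m in market_returns)
    beta = cov / var
    return risk_free + beta * (mm - risk_free), beta
```

An asset whose returns always move twice as far as the market has beta = 2, and the SML prices in twice the market's risk premium accordingly.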

Getting
started reference: Harrington, Diana. Modern
Portfolio Theory and the Capital Asset Pricing Model: A User's
Guide. Englewood Cliffs, New Jersey: Prentice Hall,
1983.

Dennis Leahy

"Random
Graphs"

Advisor: Robin Lock (rlock@stlawu.edu)

A graph
is simply a collection of vertices and edges. A random graph is one in which
the edges occur by some random process. Because the edges arise randomly,
much of the theory surrounding these graphs draws on probability theory.
In this paper we will use probability theory to determine the likelihood
of certain graphical properties. The ideas of
“almost all graphs” and threshold functions will be introduced and explored.

We will discuss the concept of graphical evolution and many ideas concerning
graphical connectivity. Such probabilities will be calculated and we will
show how these properties depend on the relationships between graph size and the
probability of an edge. A summary of important results will be given and
computer simulations will be used to compare some of our calculations involving
connectivity.
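The connectivity simulations described above can be sketched as follows (an illustrative Monte Carlo estimate for the Erdős–Rényi model G(n, p), not the paper's code):

```python
import random
from collections import deque

def is_connected(n, edges):
    """Breadth-first search check that all n vertices are reachable from 0."""
    adj = {v: [] for v in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, queue = {0}, deque([0])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) == n

def prob_connected(n, p, reps=300, seed=7):
    """Monte Carlo estimate that a G(n, p) random graph is connected."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # Each of the C(n, 2) possible edges appears independently with prob p
        edges = [(u, v) for u in range(n) for v in range(u + 1, n)
                 if rng.random() < p]
        hits += is_connected(n, edges)
    return hits / reps
```

Connectivity has threshold ln(n)/n; for n = 50 that is about 0.078, and the estimated probability jumps from near 0 at p = 0.02 to near 1 at p = 0.2, illustrating the sharp dependence on edge probability discussed above.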

Getting started reference: Palmer, Edgar M. Graphical Evolution: An Introduction to the Theory of
Random Graphs. New York: John Wiley & Sons, 1985.

2002

Justin Roth

"Survival
Analysis: Finding and Fitting a Model"

Advisor: Robin Lock (rlock@stlawu.edu)

This
paper examines survival analysis with a focus on estimation techniques for
fitting lifetime data. Some examples of lifetime data include the life of automobile
brakes, lifetimes of electronics, lengths of marriages, and length of life after
medical treatment. The parametric models examined include: (1) exponential, (2)
lognormal, and (3) Weibull. The (4) Kaplan-Meier estimator is an alternative
nonparametric approach to modeling survival data. When trying to fit a lifetime
distribution, censoring may become an issue. Complete data sets include all the
information for all subjects; censoring takes place when there is missing
information, for example, when subjects leave the study early or outlast the
length of the study.

For each of the four models, these estimation techniques
have been applied to simulated data for both complete and censored data
sets. Once the data has been fitted with a distribution, the Anderson-Darling
test and probability plots are used to determine the goodness of fit. A case
study examining the contraction of sexually transmitted diseases provides a
concrete example of the estimation techniques.
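The Kaplan-Meier estimator mentioned above has a compact form: at each observed failure time, the survival estimate is multiplied by (at risk - deaths) / at risk. A minimal sketch (not the paper's code):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the survival function.

    times: observed times; events: 1 = failure observed, 0 = right-censored.
    Returns the survival curve as (time, S(t)) steps at each failure time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    s = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        # Tied observations at time t are handled together
        ties = sum(1 for tt, _ in data[i:] if tt == t)
        deaths = sum(e for tt, e in data[i:] if tt == t)
        if deaths:
            s *= (n_at_risk - deaths) / n_at_risk
            curve.append((t, s))
        n_at_risk -= ties  # censored subjects leave the risk set silently
        i += ties
    return curve
```

Censored subjects reduce the number at risk without dropping the survival estimate, which is exactly how the estimator handles the incomplete data sets described above.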

Getting started reference: Lee, Elisa T. Statistical Methods for Survival Analysis. 2nd ed. New York: John Wiley & Sons 1992.

2000

Sarah Auer

"Using Environmental Data and Fathom Software to Teach Statistical Concepts"

Advisor: Robin Lock (rlock@stlawu.edu)

As an independent project for the spring semester, I chose to explore how Fathom, a statistics software package, could be used to teach statistical concepts within the context of a high school statistics course. The project has involved creating a series of activities to be used to guide students through various components of the software program while at the same time introducing them to statistical concepts. In addition, the data used are focused on environmental issues. Information regarding the topics that would be appropriate to cover at the high school level was obtained from the curriculum of the Advanced Placement Statistics course provided by the College Board (see Appendix).

Since Fathom cannot be utilized in the instruction of every statistical concept within the curriculum, I chose a set of topics where I felt Fathom could most effectively be used as an instructional tool. The activities lead students through an exploration of bivariate data. This includes concepts such as analyzing patterns in scatterplots, linearity, least squares regression, residual plots, outliers, influential points, and correlation.

1999

Sharon Rohloff

"Markov Chains"

Advisor: Robin Lock (rlock@stlawu.edu)

A Markov chain describes how a process moves from state to state over time. Real world applications can be found in many fields including biology, physics, economics, sociology and games of chance. The structure of a Markov chain is best represented by a matrix of transition probabilities. A variety of examples will serve to illustrate the basic properties of the chains and the usefulness of the transition matrices. Real world examples will include applications of Markov chains to gambling, the game of tennis, and an original example about two-person random walks.
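The transition-matrix representation makes computation straightforward: the distribution over states after one step is the current distribution multiplied by the transition matrix. A small sketch with a hypothetical two-state chain:

```python
def step_distribution(dist, P, steps=1):
    """Evolve a probability distribution over states: dist <- dist * P."""
    for _ in range(steps):
        dist = [sum(dist[i] * P[i][j] for i in range(len(dist)))
                for j in range(len(P[0]))]
    return dist

# A hypothetical two-state weather chain: state 0 = sunny, state 1 = rainy
P = [[0.9, 0.1],
     [0.5, 0.5]]
```

Iterating many steps converges to the stationary distribution (here 5/6 sunny, 1/6 rainy) regardless of the starting state, one of the basic properties such examples illustrate.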

Getting started reference: Kemeny, John and Snell, J. Laurie, Finite Markov Chains. New York: Springer-Verlag, 1976.

1996

Peter Fellows

"Classification of the 1995 Blowdown in the Adirondack Park Using PCA and MIA with Landsat TM Data"

Advisor: Robin Lock (rlock@stlawu.edu) and William Elberty

Landsat Thematic Mapper (TM) satellite images were used to analyze the damage of a July 15, 1995 windstorm to the western portion of the forest cover of the Adirondack Park, New York and vicinity. Multivariate image analysis (MIA), incorporating principal components analysis (PCA) and associated feature-space scatterplots, was used to identify areas of blowdown, assess severity of impact, and identify the forest types affected. The strategy used in MIA offers a reverse approach to image classification compared to conventional image-processing methodology. MIA proved both sensitive and discriminating, and the procedure has potential usefulness in a variety of image analysis applications.

1995

Karen Fischer

"Statistical Models of Horse Racing"

Advisor: Robin Lock (rlock@stlawu.edu)

In this project, I attempted to predict the results of horse races using multiple regression and discriminant analysis. I explain, with supporting data including a sample prediction, that multiple regression does not appear to be a precise way of predicting the results. A large portion of the paper explains the procedure of discriminant analysis, showing step by step how the model was improved to arrive at the best discriminant analysis model for horse racing. I conclude with a sample prediction for a randomly selected race that exactly predicts both the first- and second-place horses of that race using discriminant analysis.

Getting started reference: Beyer, Andrew. Picking Winners, A Horse Player's Guide. Boston: Houghton Mifflin Company, 1985.

1994

Sumeet Wadhera

"Mathematical Modelling: A Study of the Transmission of HIV Infection and AIDS"

Advisor: Chanchal Singh (csingh@stlawu.edu)

The aim of this paper is to review mathematical modeling of the transmission of HIV and to offer some new models describing a specific situation. I will survey models using various techniques and then present two different models for the transmission of HIV through prostitution. At present, I am carrying out this modeling even though much of the numerical information needed is not available, as is true for other researchers engaging in this activity. Hence, these models cannot yet be used to reliably predict the future incidence of AIDS. The modeling is further complicated by the fact that we know little about the epidemiology of HIV and the transmission dynamics of the infection. Questions about the proportion of HIV-infected persons who will develop AIDS, the progression rate over time from infection to AIDS, the role of co-factors in facilitating HIV transmission or progression to AIDS, and the degree of and variability in individual infectiousness remain unanswered; on such issues, models reflect the personal view of the modeler, using different variables and taking various forms. The number and diversity of models of the AIDS epidemic are accumulating at a rapid rate.

Results from these models have varied greatly, with each model purporting a decisive advantage over others. Descriptions of models of HIV transmission that appear in the literature are not sufficiently detailed for other researchers to evaluate the models and comprehend why they produce such diverse conclusions. However, it is clear that one reason for the diversity of results is that, lacking solid data, modelers make different assumptions about the initial values of the parameters driving these models, different assumptions about their future time trends, and different estimations of distribution functions. Different projections also arise because different populations are modeled and parameters differ. To be able to compare all the models, all modelers would need to work with a standard set of initial parameter values and distributions in order to project future trends. In effect, all would be modeling the same population. All differences would then result from differences in the description of the process of the epidemic. However, modelers have personal interests in different populations, and this adds to the increase in the number of models. Currently, no criterion exists to facilitate a choice among the models.

1992

Matthew Griffin

"Logistic Regression"

Advisor: Robin Lock (rlock@stlawu.edu)

When one mentions the word regression, most people picture a model that relates a dependent variable to a set of independent random variables.

One of the main reasons one builds a model is to predict the value of a dependent variable based on these random variables. It is the dependent variable that we will want to look at for the moment. Say the dependent variable is allowed to take any possible value. When looking at a case such as this, one could use what is known as ordinary linear regression.

The distinguishing characteristic, then, between a logistic regression model and a linear regression model is that the dependent variable in a logistic regression model is binary or dichotomous. That is, the dependent variable has only two possible outcomes. One can think of an endless number of situations where this is true. As a quick example, let's look at test scores. Suppose we wanted to predict whether or not a person will pass an exam. The dependent variable is clearly dichotomous: either the person passes or not. We could then look at variables such as past test scores, past quiz scores, the sex of the student, whether the student studies, and even his or her GPA. This is in contrast to what one would see in the multiple regression case, where one might be trying to predict the final course average based on the same factors. One can see that in this case the dependent variable is continuous and not dichotomous.

We worked through the book entitled Applied Logistic Regression by David W. Hosmer and Stanley Lemeshow. They described the theory of this type of regression and illustrated the technique with a study on heart disease. Basically, they were trying to predict whether or not certain individuals were more prone to heart disease and heart attacks based on a number of factors. We can see that their dependent variable is indeed dichotomous by nature: the people either have heart disease or not. The factors we mentioned are the independent variables of their study, and they included such things as age, sex, race, and history of heart disease among other family members. For this paper, we will be doing a similar sort of thing. We will be looking at the theory behind this topic and extending what we have learned to a study of our own. We will be using admissions data from St. Lawrence University to try to predict whether or not admitted students will come as freshmen in the fall. Thus the dependent variable has only two possible values: either the student will attend or they will not. Some of the independent variables we will be using are SAT scores, sex, high school rank, high school GPA, and the region of the country in which the student lives.
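The exam example above can be made concrete with a small sketch (hypothetical data, and plain gradient ascent rather than the maximum-likelihood routines Hosmer and Lemeshow describe):

```python
import math

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit P(y = 1) = 1 / (1 + exp(-(b0 + b1*x))) by gradient ascent on the
    log-likelihood (one predictor; real software uses Newton-type methods)."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(b0 + b1 * x)))
            g0 += y - p          # gradient for the intercept
            g1 += (y - p) * x    # gradient for the slope
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

def predict(b0, b1, x):
    """Predicted probability of 'success' (here, passing the exam)."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

# Hypothetical data: hours studied versus pass (1) / fail (0)
hours = [1, 2, 3, 4, 5, 6, 7, 8]
passed = [0, 0, 0, 0, 1, 1, 1, 1]
```

The fitted slope is positive, so the model assigns a high passing probability to heavy studiers and a low one to light studiers, exactly the dichotomous prediction task described above.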

Lisa King

"Efficient Methods of Permutation Testing with an Application to a Test for Correlation"

Advisor: Robin Lock (rlock@stlawu.edu)

By permuting the data, one breaks up the order of the original data into different orders, eliminating any relationship between the variables that might be present in the original, unshuffled data. The data are permuted many times, with a test statistic calculated each time for the newly permuted data. The test statistic obtained after each permutation is compared to the test statistic calculated for the original data. If the variables are related in the original data, before permuting, then the value of the test statistic for the original unshuffled data should be unusual relative to the values of the test statistic obtained for most permutations. "Unusual" means that the proportion of test statistics for the permuted data that are more extreme than the test statistic for the original data should be less than or equal to some pre-defined significance level. This proportion is what is referred to as the p-value of the observed test statistic.
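The procedure just described can be sketched directly (an illustrative version for a correlation test, not the paper's implementation):

```python
import random
from statistics import mean

def corr(a, b):
    """Pearson correlation coefficient."""
    ma, mb = mean(a), mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a)
           * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den

def permutation_pvalue(x, y, reps=5000, seed=3):
    """Permutation test for correlation: shuffle y, recompute r, and report
    the proportion of permutations at least as extreme as the observed r."""
    rng = random.Random(seed)
    observed = abs(corr(x, y))
    yy = list(y)
    extreme = 0
    for _ in range(reps):
        rng.shuffle(yy)  # break any x-y relationship
        extreme += abs(corr(x, yy)) >= observed
    return extreme / reps
```

For strongly correlated data almost no shuffled ordering matches the observed statistic, so the p-value is tiny; for unrelated data the observed statistic looks like a typical permutation and the p-value is large.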
