Media Arts and Technology

Posted: **Fri Sep 16, 2022 7:59 am**

11.15.22 Outliers

For this assignment, we are exploring the process of identifying outliers. All databases have outliers: data that somehow did not fit into any category, or else was incorrectly classified. Your task, which I hope you will find creative, is to identify what may be outliers in the Seattle Library database.

Karl Yerkes, MAT lecturer, made a chart some years ago about errors in the itemNumber sequencing: https://www.mat.ucsb.edu/~g.legrady/aca ... onCode.png

He states: itemNumber and bibNumber are auto-incrementing database keys in the SPL LIS. Whenever an item is added to the library, the item is assigned a new itemNumber by adding 1 to the last, largest known itemNumber (same with bibNumber for brand new titles). We can analyse these keys to get information about that system. In particular, we can estimate the rate of acquisition of new materials by determining the slope of the plot of checkout time versus itemNumber (or bibNumber). We can estimate when big events happened by investigating the gaps in the data on this plot.
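
A minimal sketch of the slope estimate described above, assuming we treat the day an itemNumber is first seen in the checkout log as a rough proxy for when it was acquired. The function name `acquisition_rate` and the sample values are made up for illustration and do not come from the SPL data.

```python
def acquisition_rate(first_checkout_days, item_numbers):
    """Least-squares slope of itemNumber against first-checkout day.

    Returns the estimated number of new itemNumbers assigned per day.
    """
    n = len(first_checkout_days)
    mean_t = sum(first_checkout_days) / n
    mean_k = sum(item_numbers) / n
    cov = sum((t - mean_t) * (k - mean_k)
              for t, k in zip(first_checkout_days, item_numbers))
    var = sum((t - mean_t) ** 2 for t in first_checkout_days)
    return cov / var


# Made-up data: days since the start of the log, and the itemNumber
# first seen on that day.
days = [0, 10, 20, 30, 40]
keys = [1000, 1500, 2000, 2500, 3000]
rate = acquisition_rate(days, keys)
print(rate)  # 50.0 new itemNumbers per day for this toy data
```

On real data the first checkout lags the acquisition date, so the slope is only an approximation, and a sudden change in slope or a gap in the plot is what would point to a "big event".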

Question: What proportion of items have never been checked out? (i.e., which are the loneliest items?) Because itemNumber is an auto-incrementing key at the SPL LIS, we only see certain keys (the ones that get checked out) and not others (the ones that never get checked out) in our database, but we can estimate the percentage of “lonely” items.
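
One hedged way to sketch that estimate: if every key between the smallest and largest observed itemNumber was really assigned to an item, the share of keys we never see approximates the share of "lonely" items. The function name `lonely_fraction` and the sample keys are hypothetical, and the assumption ignores withdrawn items and keys assigned after the largest one observed.

```python
def lonely_fraction(seen_item_numbers):
    """Fraction of keys in the observed range that never show up in checkouts."""
    seen = set(seen_item_numbers)
    span = max(seen) - min(seen) + 1  # number of keys that could exist in range
    return 1 - len(seen) / span


# Made-up keys: 6 distinct itemNumbers observed in a range of 10.
sample = [100, 101, 103, 104, 107, 109]
estimate = lonely_fraction(sample)
print(f"{estimate:.0%} of keys in range were never checked out")
```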

--

Look online for tips on how to best explore data that are outliers. For instance:
https://dataschool.com/how-to-teach-peo ... -with-sql/

--
Post your results here. I am traveling to a conference, so we will need to set up individual meeting times. I will be 9 hours ahead, so the earlier your time, the better for me.

Posted: **Tue Nov 15, 2022 11:23 am**

For this assignment, I queried for outliers in two different groups of data from the SPL database: checkouts of rock CDs and horror DVDs. It is valuable to detect outliers within a data set because these observations differ significantly from the majority. They have a heavy impact on statistics like the average and the standard deviation, which we commonly rely on to explain large sets of data. Detecting outliers can also reveal anomalies or problems within the database that are important to catch. In my analysis, I found outliers by assuming normality of the data and looking for observations more than three standard deviations from the mean in either direction. This led to interesting results and conclusions.
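
The three-standard-deviation rule can be sketched as follows. This is only an illustration under the normality assumption stated above, not the queries from the attached PDF; the function name `three_sigma_outliers` and the checkout counts are made up.

```python
import statistics


def three_sigma_outliers(counts, k=3.0):
    """Return the values lying more than k sample standard deviations from the mean."""
    mean = statistics.mean(counts)
    sd = statistics.stdev(counts)
    return [c for c in counts if abs(c - mean) > k * sd]


# Made-up checkout counts: twenty ordinary titles and one very popular one.
checkouts = [8, 9, 10, 11, 12] * 4 + [100]
outliers = three_sigma_outliers(checkouts)
print(outliers)
```

Checkout counts are often skewed rather than normal, and a single extreme value inflates both the mean and the standard deviation, so the cutoff `k` is a judgment call.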

Here is the assignment PDF which includes my queries, analysis, and conclusion:

Week 8_ Outliers.pdf: (2.89 MiB) Downloaded 88 times

Here are the output CSV files:

rock_CD_outliers.csv: (3.54 KiB) Downloaded 88 times

outliers_horror_movies.csv: (1.06 KiB) Downloaded 77 times

Posted: **Tue Nov 15, 2022 3:03 pm**

In this report, I focus on entries with incorrectly classified check-in times (earlier than check-out times). I explore overall yearly trends in those anomalies, use cross tabs to classify them by both check-in and check-out, identify the most extreme cases with the largest discrepancies, and investigate cases with both check-in time and check-out time classified incorrectly.
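
A minimal sketch of how such entries could be flagged, assuming each record carries a check-out and a check-in timestamp as text. The tuple layout, the timestamp format, and the sample rows are assumptions for illustration, not the schema or data used in the report.

```python
from collections import Counter
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format


def reversed_loans(rows):
    """Return (item_id, checkout_year, discrepancy) for rows where check-in precedes check-out."""
    flagged = []
    for item_id, checked_out, checked_in in rows:
        out_t = datetime.strptime(checked_out, FMT)
        in_t = datetime.strptime(checked_in, FMT)
        if in_t < out_t:
            flagged.append((item_id, out_t.year, out_t - in_t))
    return flagged


# Made-up rows: (item id, check-out time, check-in time).
rows = [
    ("A", "2010-03-01 10:00:00", "2010-03-15 09:00:00"),  # normal loan
    ("B", "2011-06-02 12:00:00", "2011-06-01 12:00:00"),  # check-in one day early
    ("C", "2011-09-10 08:00:00", "1970-01-01 00:00:00"),  # extreme discrepancy
]
flagged = reversed_loans(rows)
per_year = Counter(year for _, year, _ in flagged)  # yearly trend of anomalies
worst = max(flagged, key=lambda f: f[2])            # largest discrepancy
print(per_year, worst[0])
```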

Posted: **Wed Nov 16, 2022 11:54 pm**

For this week's assignment, I tried to find outliers of several kinds.

* Using the standard deviation of the number of checkouts to find the most and least popular items within the CD category.
* Using both the purchase count and the number of checkouts as indicators of popularity, and applying algorithms to find the outliers.
* Since itemNumber is auto-incremented when an item enters the library, this attribute should be consecutive. I want to find out whether the data follows this pattern. If not, what does the distribution look like? What proportion of items never appear in the database?
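
The consecutiveness check in the last point can be sketched by measuring the runs of missing keys between neighbouring observed itemNumbers. This is an illustration only, not the code in the attached notebook; `gap_lengths` and the sample keys are hypothetical.

```python
from collections import Counter


def gap_lengths(seen_item_numbers):
    """Count how often each run length of missing keys occurs between observed keys."""
    keys = sorted(set(seen_item_numbers))
    return Counter(b - a - 1 for a, b in zip(keys, keys[1:]) if b - a > 1)


# Made-up keys with two short gaps and one longer gap.
sample = [5001, 5002, 5003, 5006, 5007, 5010, 5020]
gaps = gap_lengths(sample)
print(dict(gaps))
```

An empty result would mean the observed keys are fully consecutive; a few very long runs would suggest batches of items that never circulated or keys that were never assigned.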

Some visualizations:
https://tva1.sinaimg.cn/large/008vxvgGg ... 0moacm.jpg
https://tva1.sinaimg.cn/large/008vxvgGg ... 0u00yc.jpg

The Python files have been zipped for upload.

Week 08 Outliers.pdf: (602.85 KiB) Downloaded 88 times

analysis.ipynb.zip: (282.85 KiB) Downloaded 85 times

CD_popularity.csv: (2.86 MiB) Downloaded 83 times

CD_popularity_2D.csv: (3.02 MiB) Downloaded 81 times

bibNumber_itemNumber_dist.csv: (2.11 MiB) Downloaded 82 times

Posted: **Fri Nov 18, 2022 12:03 pm**

Abstract
This week’s assignment calls for us to explore outliers in the Seattle Library database. For this project, I decided to conduct a statistical experiment that would allow me to search for outliers within a database and to show statistically whether or not a given outlier has a negative influence on the data as a whole and on the regression model. My research relies heavily on statistical approaches that go beyond calculating the standard deviation of the dataset, but they are explained in simple terms throughout this paper to make the analysis behind these methods easier to follow.

All queries and CSV files are attached within the document.
