wk8 - 11.15.22 Outliers
Posted: Fri Sep 16, 2022 7:59 am
11.15.22 Outliers
For this assignment, we are exploring the process of identifying outliers. All databases have outliers, data that somehow did not fit into any category, or else was incorretly classified. Your task, which hopefully will be creative for you, is to identity what may be outliers in the Seattle Library database.
Karl Yerkes, MAT lecturer did a chart some years ago about errors in the ItemNumber sequencing: https://www.mat.ucsb.edu/~g.legrady/aca ... onCode.png
He states: itemNumber and bibNumber are auto-incrementing database keys in the SPL LIS. Whenever an item is added to the library, the item is asigned a new itemNumber by adding 1 to the last, largest known itemNumber (same with bibNumber for brand new titles). We can analyse these keys to get information about that system. In particular, we can estimate the rate of acquisition of new materials by determining the slope of the plot of check out time versus itemNumber (or bibNumber ). We can estimate when big events happened by investigating the gaps in the data on this plot.
Question: What proportion of items have never been checked out? (i.e., Which are the loneliest items?) Because itemNumber is an auto-incrementing key at the SPL LIS, we only see certain keys (the ones that get checked out) and not others (the ones that never get checked out) in our database but we can estimate the percentage of “lonely” items.
--
Look online for tips on how to best explore data that are outliers. For instance:
https://dataschool.com/how-to-teach-peo ... -with-sql/
--
Post your results here. I am traveling to a conference, so we will need to set up individual meetings times. I will be 9 hours ahead so the earlier your time, the better for me.
For this assignment, we are exploring the process of identifying outliers. All databases have outliers, data that somehow did not fit into any category, or else was incorretly classified. Your task, which hopefully will be creative for you, is to identity what may be outliers in the Seattle Library database.
Karl Yerkes, MAT lecturer did a chart some years ago about errors in the ItemNumber sequencing: https://www.mat.ucsb.edu/~g.legrady/aca ... onCode.png
He states: itemNumber and bibNumber are auto-incrementing database keys in the SPL LIS. Whenever an item is added to the library, the item is asigned a new itemNumber by adding 1 to the last, largest known itemNumber (same with bibNumber for brand new titles). We can analyse these keys to get information about that system. In particular, we can estimate the rate of acquisition of new materials by determining the slope of the plot of check out time versus itemNumber (or bibNumber ). We can estimate when big events happened by investigating the gaps in the data on this plot.
Question: What proportion of items have never been checked out? (i.e., Which are the loneliest items?) Because itemNumber is an auto-incrementing key at the SPL LIS, we only see certain keys (the ones that get checked out) and not others (the ones that never get checked out) in our database but we can estimate the percentage of “lonely” items.
--
Look online for tips on how to best explore data that are outliers. For instance:
https://dataschool.com/how-to-teach-peo ... -with-sql/
--
Post your results here. I am traveling to a conference, so we will need to set up individual meetings times. I will be 9 hours ahead so the earlier your time, the better for me.