Project: philippbayer/Goodreads_visualization
Repository: https://github.com/philippbayer/Goodreads_visualization
Language: Makefile 100.0%

Goodreads visualization

A Jupyter notebook to play around with Goodreads data and make some seaborn visualizations, learn more about scikit-learn - my own playground! You can use it with your own data - go here and press "Export your library" to get your own csv.

The text you're reading is generated from a Jupyter notebook by the Makefile. If you want to run it yourself, clone the repository and then run
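(The command itself didn't survive this export; presumably, as an assumption, simply launching Jupyter from the cloned repository:)

```
jupyter notebook
```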
to get the interactive version. In there, replace the path to my Goodreads export file with yours in the ipynb file, and then click on Cell -> Run All.

** WARNING ** It seems that there's currently a bug on Goodreads' end with the export of data: many recently 'read' books have a read-date that is shown on the web page but doesn't show up in the CSV.

Dependencies
Python packages
To install all:
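(The install command is missing from this export; a minimal sketch, assuming the repository ships a requirements.txt:)

```
pip install -r requirements.txt
```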
Under Windows and anaconda you instead need to run
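(Judging by the comment in the imports cell below, this is the conda command:)

```
conda install -c r rpy2
```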
instead of using pip to install rpy2.

Licenses

License for reviews: CC-BY-SA 4.0
Code: MIT

OK, let's start!

Setting up the notebook

%pylab inline
# for most plots
import numpy as np
import pandas as pd
import seaborn as sns
from collections import defaultdict, Counter, OrderedDict
# for stats
import scipy.stats
# for time-related plots
import datetime
import calendar
# for word cloud
import re
import string
from nltk.corpus import stopwords
from wordcloud import WordCloud
# for Markov chain
from pymarkovchain import MarkovChain
import pickle
import networkx as nx
# for shelf clustering
import distance
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
sns.set_palette("coolwarm")
# for plotting images
from IPython.display import Image
import gender_guesser.detector as gender
# for R
import pandas
from rpy2 import robjects
# conda install -c r rpy2 on Windows
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 5]
Loading the data

df = pd.read_csv('./goodreads_library_export.csv')
# keep only books that have a rating (unrated books have a rating of 0, we don't need that)
cleaned_df = df[df["My Rating"] != 0]
# get rid of noise in 2012
cleaned_df = cleaned_df[(cleaned_df['Date Added'] > '2013-01-01')]

Score distribution

With a score scale of 1-5, you'd expect that the average score would be somewhere around the middle of the scale.

g = sns.distplot(cleaned_df["My Rating"], kde=False)
"Average: %.2f"%cleaned_df["My Rating"].mean(), "Median: %s"%cleaned_df["My Rating"].median()
That doesn't look normally distributed to me - let's ask Shapiro-Wilk (null hypothesis: data is drawn from normal distribution):

W, p_value = scipy.stats.shapiro(cleaned_df["My Rating"])
if p_value < 0.05:
print("Rejecting null hypothesis - data does not come from a normal distribution (p=%s)"%p_value)
else:
print("Cannot reject null hypothesis (p=%s)"%p_value)
In my case, the data is not normally distributed (in other words, the book scores are not evenly distributed around the middle). If you think about it, this makes sense: most readers don't pick books at random - I avoid books I believe I'd dislike and choose books I expect to like. I rate those books higher than average, so my curve of scores is slanted towards the right.

[plot]

Pages vs Ratings

Do I give longer books better scores? A minor tendency, but nothing special (it's confounded by having just 5 possible rating values).

g = sns.jointplot("Number of Pages", "My Rating", data=cleaned_df, kind="reg", height=7, ylim=[0.5,5.5])
g.annotate(scipy.stats.pearsonr)
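(JointGrid.annotate was removed in newer seaborn versions; if the call above fails, the correlation can be computed directly - a sketch, assuming 'Number of Pages' may contain missing values:)

```python
# compute Pearson's r and its p-value directly instead of annotating the plot
pages_vs_rating = cleaned_df[["Number of Pages", "My Rating"]].dropna()
r, p = scipy.stats.pearsonr(pages_vs_rating["Number of Pages"], pages_vs_rating["My Rating"])
print("r = %.3f, p = %.3g" % (r, p))
```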
I seem to mostly read books of around 200 to 300 pages, so it's hard to tell whether I give longer books better ratings. It's also a nice example that, with linear regression, a p-value as tiny as this one doesn't mean much - the r-value is still bad.

[plot]

Ratings vs Bookshelves

Let's parse the ratings for the books and make a violin plot for the 7 categories with the most rated books!

CATEGORIES = 7 # number of most crowded categories to plot
# we have to fiddle a bit - we have to count the ratings by category,
# since each book can have several comma-delimited categories
# TODO: find a pandas-like way to do this (a sketch follows after this code cell)
shelves_ratings = defaultdict(list) # key: shelf-name, value: list of ratings
shelves_counter = Counter() # counts how many books on each shelf
shelves_to_names = defaultdict(list) # key: shelf-name, value: list of book names
for index, row in cleaned_df.iterrows():
my_rating = row["My Rating"]
if my_rating == 0:
continue
if pd.isnull(row["Bookshelves"]):
continue
shelves = row["Bookshelves"].split(",")
for s in shelves:
# empty shelf?
if not s: continue
s = s.strip() # I had "non-fiction" and " non-fiction"
shelves_ratings[s].append(my_rating)
shelves_counter[s] += 1
shelves_to_names[s].append(row.Title)
names = []
ratings = []
for name, _ in shelves_counter.most_common(CATEGORIES):
for number in shelves_ratings[name]:
names.append(name)
ratings.append(number)
full_table = pd.DataFrame({"Category":names, "Rating":ratings})
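# (Regarding the TODO above: a possible pandas-native version of the same counting -
#  just a sketch, not the notebook's original approach.)
# Split the comma-delimited shelves, explode into one row per (book, shelf),
# strip whitespace, then count books per shelf.
exploded = (cleaned_df
            .dropna(subset=["Bookshelves"])
            .assign(Shelf=lambda d: d["Bookshelves"].str.split(","))
            .explode("Shelf"))
exploded["Shelf"] = exploded["Shelf"].str.strip()
exploded = exploded[exploded["Shelf"] != ""]
print(exploded["Shelf"].value_counts().head(CATEGORIES))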
# if we don't use scale=count here then each violin has the same area
sns.violinplot(x = "Category", y = "Rating", data=full_table, scale='count')
There is some bad SF out there. At this point I wonder - since we can assign multiple 'shelves' (tags) to each book, do I have some tags that appear more often together than not? Let's use R!

%load_ext rpy2.ipython

all_shelves = shelves_counter.keys()
names_dict = {} # key: shelf name, value: robjects.StrVector of names
for c in all_shelves:
names_dict[c] = robjects.StrVector(shelves_to_names[c])
names_dict = robjects.ListVector(names_dict)

%%R -i names_dict -r 150 -w 900 -h 600
library(UpSetR)
names_dict <- fromList(names_dict)
# by default, only 5 sets are considered, so change nsets
upset(names_dict, nsets = 9)

Most shelves are 'alone', but 'essays + non-fiction', 'sci-fi + sf' (should clean that up...), and 'biography + non-fiction' show the biggest overlap.

I may have messed up the categories - let's cluster them! Typos should cluster together.

# get the Levenshtein distance between all shelf titles, normalise the distance by string length
X = np.array([[float(distance.levenshtein(shelf_1,shelf_2))/max(len(shelf_1), len(shelf_2)) \
for shelf_1 in all_shelves] for shelf_2 in all_shelves])
# scale for clustering
X = StandardScaler().fit_transform(X)
# after careful fiddling I'm settling on eps=10
clusters = DBSCAN(eps=10, min_samples=1).fit_predict(X)
print('DBSCAN made %s clusters for %s shelves/tags.'%(len(set(clusters)), len(all_shelves)))
cluster_dict = defaultdict(list)
assert len(clusters) == len(all_shelves)
for cluster_label, element in zip(clusters, all_shelves):
cluster_dict[cluster_label].append(element)
print('Clusters with more than one member:')
for k in sorted(cluster_dict):
if len(cluster_dict[k]) > 1:
print(k, cluster_dict[k])
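(For intuition about the length normalisation used above, a quick check with the same distance package; the shelf names are just for illustration:)

```python
# raw vs. length-normalised Levenshtein distance for two example shelf names
a, b = "horror", "body-horror"
raw = distance.levenshtein(a, b)
print("raw = %d, normalised = %.2f" % (raw, raw / max(len(a), len(b))))
```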
Some clusters are problematic due to too-short label names (arab/iraq), but other clusters are good and show me that I made some mistakes in labeling! French and France should be together, Greece and Greek too. Neat! (Without normalising the distance by string length, clusters like horror/body-horror don't appear.)

[plot]

HistogramDistanceRead.py

Let's check the "date read" for each book and plot the distance in days between books read - it shows how quickly you hop from book to book. I didn't use Goodreads much in 2012, so let's see how it looks without 2012:

# first, transform to datetype and get rid of all invalid dates
#dates = pd.to_datetime(cleaned_df["Date Read"])
dates = pd.to_datetime(cleaned_df["Date Added"])
dates = dates.dropna()
sorted_dates = sorted(dates)
last_date = None
differences = []
all_days = []
all_days_without_2012 = [] # not much goodreads usage in 2012 - remove that year
for date in sorted_dates:
if not last_date:
last_date = date
if date.year != 2012:
last_date_not_2012 = date
difference = date - last_date
days = difference.days
all_days.append(days)
if date.year != 2012:
all_days_without_2012.append(days)
last_date = date
sns.distplot(all_days_without_2012, axlabel="Distance in days between books read")
pylab.show()

[plot]

Heatmap of dates read

Parses the "date read" for each book, bins them by month, and makes a heatmap to show in which months I read more than in others. Also makes a lineplot of books read, split up by year.

NOTE: There has been a very strange bug in Goodreads for about a year now: the exported CSV does not correctly track the date read.

# we need a dataframe in this format:
# year months books_read
# I am sure there's some magic pandas function for this
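# (A sketch of one pandas-native shortcut - an aside, not used below:
#  count books per (year, month) directly from the dates Series.)
pandas_counts = dates.groupby([dates.dt.year, dates.dt.month]).size()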
read_dict = defaultdict(int) # key: (year, month), value: count of books read
for date in sorted_dates:
this_year = date.year
this_month = date.month
read_dict[ (this_year, this_month) ] += 1
first_date = sorted_dates[0]
first_year = first_date.year
first_month = first_date.month
todays_date = datetime.datetime.today()
todays_year = todays_date.year
todays_month = todays_date.month
all_years = []
all_months = []
all_counts = []
for year in range(first_year, todays_year+1):
for month in range(1, 13):
if (year == todays_year) and month > todays_month:
# don't count future months
break
this_count = read_dict[ (year, month) ]
all_years.append(year)
all_months.append(month)
all_counts.append(this_count)
# now get it in the format heatmap() wants
df = pd.DataFrame( { "month":all_months, "year":all_years, "books_read":all_counts } )
dfp = df.pivot("month", "year", "books_read")
fig, ax = plt.subplots(figsize=(10,10))
# now make the heatmap
ax = sns.heatmap(dfp, annot=True, ax=ax, square=True)

What happened in May 2014?

Update in 2018 - currently the 'date_read' column doesn't accurately track which books were actually read; this is a bug on Goodreads' end, see for example https://help.goodreads.com/s/question/0D51H00004ADr7o/i-have-exported-my-library-and-some-books-do-not-have-any-information-listed-for-date-read

Plot books read by year

g = sns.FacetGrid(df, col="year", sharey=True, sharex=True, col_wrap=4)
g.map(plt.scatter, "month", "books_read")
g.set_ylabels("Books read")
g.set_xlabels("Month")
pylab.xlim(1, 12)
pylab.show()

It's nice how reading behaviour (Goodreads usage) connects over the months - it slowly picks up in 2013, stays constant in 2014/2015, and now goes down again. You can see when my first son was born! (Solution: 2016-8-25) (All other >2018 books are still missing their date_read dates...)

Guessing authors' genders

Let's check whether I read mostly male or female authors using the gender-guesser package!

first_names = cleaned_df['Author'].str.split(' ',expand=True)[0]
d = gender.Detector(case_sensitive=False)
genders = [d.get_gender(name) for name in first_names]
print(list(zip(genders[:5], first_names[:5])))
# let's also add those few 'mostly_female' and 'mostly_male' into the main groups
genders = pd.Series([x.replace('mostly_female','female').replace('mostly_male','male') for x in genders])
gender_ratios = genders.value_counts()
print(gender_ratios)
_ = gender_ratios.plot(kind='bar')
Now THAT'S gender bias. Do I rate the genders differently?

cleaned_df['Gender'] = genders
male_scores = cleaned_df[cleaned_df['Gender'] == 'male']['My Rating'].values
female_scores = cleaned_df[cleaned_df['Gender'] == 'female']['My Rating'].values
_ = plt.hist([male_scores, female_scores], color=['r','b'], alpha=0.5)

Hard to tell any difference since there are so many fewer women authors here - let's split them up into different plots.

fig, axes = plt.subplots(2,1)
axes[0].hist(male_scores, color='r', alpha=0.5, bins=10)
axes[0].set_xlabel('Scores')
# Make the y-axis label, ticks and tick labels match the line color.
axes[0].set_ylabel('male scores')
axes[1].hist(female_scores, color='b', alpha=0.5, bins=10)
axes[1].set_ylabel('female scores')
fig.tight_layout()

Are these two samples from the same distribution? Hard to tell since their sizes are so different, but let's ask Kolmogorov-Smirnov (null hypothesis: they are from the same distribution):

scipy.stats.ks_2samp(male_scores, female_scores)
We cannot reject the null hypothesis as the p-value is very, very high. (But again, there are so few female scores...)

Compare with Goodreads 10k

A helpful soul has uploaded ratings and stats for the 10,000 books with the most ratings on Goodreads (https://github.com/zygmuntz/goodbooks-10k). Let's compare those with my ratings! (You may have to run
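(The exact command is missing from this export; presumably, as an assumption, the standard submodule setup:)

```
git submodule update --init
```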
to get the 10k submodule.)