philippbayer/Goodreads_visualization: A Jupyter notebook where I play with my Go ...


Project name:

philippbayer/Goodreads_visualization

Repository URL:

https://github.com/philippbayer/Goodreads_visualization

Repository language:

Makefile 100.0%

Project README:

[Binder badge]

Goodreads visualization

A Jupyter notebook to play around with my Goodreads data, make some seaborn visualizations, and learn more about scikit-learn - my own playground!

You can use it with your own data - go to your Goodreads import/export page and press "Export your library" to get your own csv.

The text you're reading is generated from a Jupyter notebook by the Makefile. If you want to run it yourself, clone the repository, then run

jupyter notebook your_file.ipynb

to get the interactive version. In there, replace the path to my exported Goodreads file with the path to yours, then click on Cell -> Run All.

** WARNING ** There currently seems to be a bug on Goodreads' end with the data export: many recently 'read' books have a read-date that is shown on the web page but doesn't show up in the CSV.
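If you want to check how badly your own export is affected, a minimal sketch (not part of the repository; it assumes the export's standard 'Exclusive Shelf' and 'Date Read' columns) would be:

import pandas as pd

df = pd.read_csv('./goodreads_library_export.csv')
# books marked as 'read' in the export
read_books = df[df['Exclusive Shelf'] == 'read']
# how many of them have no read-date in the CSV?
missing = read_books['Date Read'].isnull().sum()
print('%d of %d read books have no read-date in the export' % (missing, len(read_books)))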

Dependencies

  • Python 3 (rpy2 no longer works under Python 2)
  • Jupyter
  • R (for rpy2)

Python packages

  • seaborn
  • pandas
  • wordcloud
  • nltk
  • networkx
  • pymarkovchain
  • scikit-learn
  • distance
  • image (PIL inside python for some weird reason)
  • gender_guesser
  • rpy2

To install all:

pip install seaborn pandas wordcloud nltk networkx pymarkovchain image scikit-learn distance gender_guesser rpy2

Under Windows with Anaconda you need to run

conda install rpy2

instead of installing rpy2 via pip.
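To quickly verify that rpy2 can actually find your R installation before running the whole notebook, a small check (my own sketch, not from the repository):

from rpy2 import robjects

# if this prints an R version string, rpy2 and R are wired up correctly
print(robjects.r('R.version.string')[0])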

Licenses

License for reviews: CC-BY-SA 4.0. License for code: MIT.

OK, let's start!

Setting up the notebook

%pylab inline

# for most plots
import numpy as np
import pandas as pd
import seaborn as sns
from collections import defaultdict, Counter, OrderedDict

# for stats
import scipy.stats

# for time-related plots
import datetime
import calendar

# for word cloud
import re
import string
from nltk.corpus import stopwords
from wordcloud import WordCloud

# for Markov chain
from pymarkovchain import MarkovChain
import pickle
import networkx as nx

# for shelf clustering
import distance
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

sns.set_palette("coolwarm")

# for plotting images
from IPython.display import Image

import gender_guesser.detector as gender

# for R
import pandas
from rpy2 import robjects 
# conda install -c r rpy2 on Windows

import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 5]
Populating the interactive namespace from numpy and matplotlib

Loading the data

df = pd.read_csv('./goodreads_library_export.csv')
# keep only books that have a rating (unrated books have a rating of 0, we don't need that)
cleaned_df = df[df["My Rating"] != 0]

# get rid of noise in 2012
cleaned_df = cleaned_df[(cleaned_df['Date Added'] > '2013-01-01')]

Score distribution

With a score scale of 1-5, you'd expect the average score to be around 3 (since 0 is not counted) after a few hundred books - in other words, is it a normal distribution?

g = sns.distplot(cleaned_df["My Rating"], kde=False)
"Average: %.2f"%cleaned_df["My Rating"].mean(), "Median: %s"%cleaned_df["My Rating"].median()
('Average: 3.54', 'Median: 4.0')

[figure: histogram of my ratings]

That doesn't look normally distributed to me - let's ask Shapiro-Wilk (null hypothesis: data is drawn from normal distribution):

W, p_value = scipy.stats.shapiro(cleaned_df["My Rating"])
if p_value < 0.05:
    print("Rejecting null hypothesis - data does not come from a normal distribution (p=%s)"%p_value)
else:
    print("Cannot reject null hypothesis (p=%s)"%p_value)
Rejecting null hypothesis - data does not come from a normal distribution (p=8.048559751530179e-22)

In my case, the data is not normally distributed (in other words, the book scores are not evenly distributed around the middle). If you think about it, this makes sense: most readers don't choose books at random - I avoid books I expect to dislike and pick books I expect to like. I rate those books higher than average, so my curve of scores is slanted towards the right.
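To put a number on that slant (my own addition, not in the original notebook), scipy can compute the skewness of the ratings directly:

skewness = scipy.stats.skew(cleaned_df["My Rating"])
# a negative value means the tail points towards low scores (most of the mass sits at 4 and 5)
print("Skewness: %.2f" % skewness)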

plot Pages vs Ratings

Do I give longer books better scores? A minor tendency, but nothing special (it's confounded by there being only 5 possible rating values).

g = sns.jointplot("Number of Pages", "My Rating", data=cleaned_df, kind="reg", height=7, ylim=[0.5,5.5])
g.annotate(scipy.stats.pearsonr)
C:\Users\00089503\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\axisgrid.py:1847: UserWarning: JointGrid annotation is deprecated and will be removed in a future release.
  warnings.warn(UserWarning(msg))





<seaborn.axisgrid.JointGrid at 0x1f90b514080>

[figure: jointplot of Number of Pages vs My Rating with regression line]

I seem to mostly read books of around 200 to 300 pages, so it's hard to tell whether I give longer books better ratings. It's a nice example that, for linear regression, a p-value as tiny as this one doesn't mean much on its own - the r-value is still bad.
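To see the numbers behind that point without the plot (my own sketch), you can compute the correlation directly after dropping books without a page count:

pages_and_ratings = cleaned_df[["Number of Pages", "My Rating"]].dropna()
r, p = scipy.stats.pearsonr(pages_and_ratings["Number of Pages"],
                            pages_and_ratings["My Rating"])
# the r-value, not the p-value, tells you how strong the relationship actually is
print("r = %.3f, p = %.3g" % (r, p))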


plot Ratings vs Bookshelves

Let's parse ratings for books and make a violin plot for the 7 categories with the most rated books!

CATEGORIES = 7 # number of most crowded categories to plot

# we have to fiddle a bit - we have to count the ratings by category, 
# since each book can have several comma-delimited categories
# TODO: find a pandas-like way to do this

shelves_ratings = defaultdict(list) # key: shelf-name, value: list of ratings
shelves_counter = Counter() # counts how many books on each shelf
shelves_to_names = defaultdict(list) # key: shelf-name, value: list of book names
for index, row in cleaned_df.iterrows():
    my_rating = row["My Rating"]
    if my_rating == 0:
        continue
    if pd.isnull(row["Bookshelves"]):
        continue

    shelves = row["Bookshelves"].split(",")

    for s in shelves:
        # empty shelf?
        if not s: continue
        s = s.strip() # I had "non-fiction" and " non-fiction"
        shelves_ratings[s].append(my_rating)
        shelves_counter[s] += 1
        shelves_to_names[s].append(row.Title)

names = []
ratings = []
for name, _ in shelves_counter.most_common(CATEGORIES):
    for number in shelves_ratings[name]:
        names.append(name)
        ratings.append(number)

full_table = pd.DataFrame({"Category":names, "Rating":ratings})

# if we don't use scale=count here then each violin has the same area
sns.violinplot(x = "Category", y = "Rating", data=full_table, scale='count')
<matplotlib.axes._subplots.AxesSubplot at 0x1f90b6e8278>

[figure: violin plot of ratings for the 7 most-used shelves]

There is some bad SF out there.

At this point I wonder - since we can assign multiple 'shelves' (tags) to each book, do I have some tags that appear more often together than not? Let's use R!

%load_ext rpy2.ipython
all_shelves = shelves_counter.keys()

names_dict = {} # key: shelf name, value: robjects.StrVector of names
for c in all_shelves:
    names_dict[c] = robjects.StrVector(shelves_to_names[c])

names_dict = robjects.ListVector(names_dict)

%%R -i names_dict -r 150 -w 900 -h 600
library(UpSetR)
names_dict <- fromList(names_dict)
# by default, only 5 sets are considered, so change nsets
upset(names_dict, nsets = 9)

[figure: UpSet plot of shelf overlaps]

Most shelves are 'alone', but 'essays + non-fiction', 'sci-fi + sf' (should clean that up...), 'biography + non-fiction' show the biggest overlap.

I may have messed up the categories - let's cluster them! Typos should cluster together.

# get the Levenshtein distance between all shelf titles, normalise the distance by string length
X = np.array([[float(distance.levenshtein(shelf_1,shelf_2))/max(len(shelf_1), len(shelf_2)) \
               for shelf_1 in all_shelves] for shelf_2 in all_shelves])
# scale for clustering
X = StandardScaler().fit_transform(X)

# after careful fiddling I'm settling on eps=10
clusters = DBSCAN(eps=10, min_samples=1).fit_predict(X)
print('DBSCAN made %s clusters for %s shelves/tags.'%(len(set(clusters)), len(all_shelves)))

cluster_dict = defaultdict(list)
assert len(clusters) == len(all_shelves)
for cluster_label, element in zip(clusters, all_shelves):
    cluster_dict[cluster_label].append(element)
    
print('Clusters with more than one member:')
for k in sorted(cluster_dict):
    if len(cluster_dict[k]) > 1:
        print(k, cluster_dict[k])
DBSCAN made 166 clusters for 184 shelves/tags.
Clusters with more than one member:
1 ['fiction', 'action']
2 ['russia', 'russian']
12 ['latin-america', 'native-american']
24 ['ww1', 'ww2']
32 ['humble-bundle2', 'humble-bundle-jpsf']
47 ['essays', 'essay']
49 ['on-living', 'on-writing', 'on-thinking']
50 ['history-of-biology', 'history-of-maths', 'history-of-cs', 'history-of-philosophy']
53 ['greek', 'greece']
66 ['iceland', 'ireland']
88 ['mythology', 'psychology', 'sociology', 'theology']
116 ['philosophy', 'pop-philosophy']
126 ['letters', 'lectures']

Some clusters are problematic due to too-short label names (arab/iraq); other clusters are good and show me that I made some mistakes in labeling! French and France should be together, Greece and Greek too. Neat!

(Without normalising the distance by string length, clusters like horror/body-horror don't appear.)
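A quick illustration of why that normalisation matters, using the same distance package (a hypothetical example, not from the notebook):

a, b = "horror", "body-horror"
raw = distance.levenshtein(a, b)
normalised = float(raw) / max(len(a), len(b))
# an absolute distance of 5 looks like two unrelated shelf names,
# but relative to the string lengths the two labels are much closer
print("raw: %d, normalised: %.2f" % (raw, normalised))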

plot Histogram of distance between books read

Let's check the "dates read" for each book and plot the distance between books read in days - this shows how quickly you hop from book to book.

I didn't use Goodreads much in 2012, so let's see how it looks without 2012:

# first, transform to datetype and get rid of all invalid dates
#dates = pd.to_datetime(cleaned_df["Date Read"])
dates = pd.to_datetime(cleaned_df["Date Added"])

dates = dates.dropna()
sorted_dates = sorted(dates)

last_date = None
differences = []
all_days = []
all_days_without_2012 = [] # not much goodreads usage in 2012 - remove that year
for date in sorted_dates:
    if not last_date:
        last_date = date
        if date.year != 2012:
            last_date_not_2012 = date
    difference = date - last_date
    
    days = difference.days
    all_days.append(days)
    if date.year != 2012:
        all_days_without_2012.append(days)
    last_date = date

sns.distplot(all_days_without_2012, axlabel="Distance in days between books read")
pylab.show()

[figure: distribution of days between books read]


plot Heatmap of dates read

Parses the "dates read" for each book read, bins them by month, and makes a heatmap to show in which months I read more than in others. Also makes a lineplot for books read, split up by year.

NOTE: There has been a very strange bug in Goodreads for about a year now: the exported CSV does not correctly track the date read.

# we need a dataframe in this format:
# year months books_read
# I am sure there's some magic pandas function for this

read_dict = defaultdict(int) # key: (year, month), value: count of books read
for date in sorted_dates:
    this_year = date.year
    this_month = date.month
    read_dict[ (this_year, this_month) ] += 1

first_date = sorted_dates[0]

first_year = first_date.year
first_month = first_date.month

todays_date = datetime.datetime.today()
todays_year = todays_date.year
todays_month = todays_date.month

all_years = []
all_months = []
all_counts = []
for year in range(first_year, todays_year+1):
    for month in range(1, 13):
        if (year == todays_year) and month > todays_month:
            # don't count future months
            break
        this_count = read_dict[ (year, month) ]
        all_years.append(year)
        all_months.append(month)
        all_counts.append(this_count)

# now get it in the format heatmap() wants
df = pd.DataFrame( { "month":all_months, "year":all_years, "books_read":all_counts } )
dfp = df.pivot(index="month", columns="year", values="books_read")

fig, ax = plt.subplots(figsize=(10,10))
# now make the heatmap
ax = sns.heatmap(dfp, annot=True, ax=ax, square= True)

[figure: heatmap of books read per month and year]

What happened in May 2014?

Update in 2018: currently the 'date_read' column doesn't accurately track which books were actually read. This is a bug on Goodreads' end; see for example https://help.goodreads.com/s/question/0D51H00004ADr7o/i-have-exported-my-library-and-some-books-do-not-have-any-information-listed-for-date-read


Plot books read by year

g = sns.FacetGrid(df, col="year", sharey=True, sharex=True, col_wrap=4)
g.map(plt.scatter, "month", "books_read")
g.set_ylabels("Books read")
g.set_xlabels("Month")
pylab.xlim(1, 12)
pylab.show()

[figure: books read per month, one panel per year]

It's nice how reading behaviour (Goodreads usage) changes over the months - it slowly picks up in 2013, stays constant in 2014/2015, and now goes down again. You can see when my first son was born!

(Solution: 2016-8-25)

(all other >2018 books are still missing their date_read dates...)


Guessing authors' genders

Let's check whether I read mostly male or female authors using the gender-guesser package!

first_names = cleaned_df['Author'].str.split(' ',expand=True)[0]
d = gender.Detector(case_sensitive=False)

genders = [d.get_gender(name) for name in first_names]
print(list(zip(genders[:5], first_names[:5])))
# let's also add the few 'mostly_female' and 'mostly_male' into the main groups
genders = pd.Series([x.replace('mostly_female','female').replace('mostly_male','male') for x in genders])
[('male', 'Don'), ('male', 'Daniil'), ('male', 'William'), ('unknown', 'E.T.A.'), ('male', 'John')]
gender_ratios = genders.value_counts()
print(gender_ratios)
_ = gender_ratios.plot(kind='bar')
male       423
unknown     67
female      56
andy         3
dtype: int64

[figure: bar chart of author gender counts]

Now THAT'S gender bias. Do I rate the genders differently?

# use .values so the genders line up by position rather than by the filtered index
cleaned_df['Gender'] = genders.values

male_scores = cleaned_df[cleaned_df['Gender'] == 'male']['My Rating'].values
female_scores = cleaned_df[cleaned_df['Gender'] == 'female']['My Rating'].values

_ = plt.hist([male_scores, female_scores], color=['r','b'], alpha=0.5)

[figure: overlaid histograms of male and female scores]

Hard to tell any difference since there are far fewer women authors here - let's split them up into separate plots.

fig, axes = plt.subplots(2,1)

axes[0].hist(male_scores, color='r', alpha=0.5, bins=10)
axes[0].set_xlabel('Scores')
# Make the y-axis label, ticks and tick labels match the line color.
axes[0].set_ylabel('male scores')

axes[1].hist(female_scores, color='b', alpha=0.5, bins=10)
axes[1].set_ylabel('female scores')

fig.tight_layout()

[figure: separate histograms of male and female scores]

Are these two samples from the same distribution? Hard to tell since their sizes are so different, but let's ask Kolmogorov-Smirnov (null hypothesis: they are from the same distribution):

scipy.stats.ks_2samp(male_scores, female_scores)
Ks_2sampResult(statistic=0.22018779342723005, pvalue=0.13257156821934568)

We cannot reject the null hypothesis, as the p-value (0.13) is well above 0.05. (But again, there are so few female scores...)
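One way to get a feeling for how much the unequal sample sizes matter (my own addition, not in the original notebook) is to repeatedly subsample the male scores down to the number of female scores and re-run the test:

rng = np.random.RandomState(42)
p_values = []
for _ in range(1000):
    # draw as many male scores as there are female scores, without replacement
    subsample = rng.choice(male_scores, size=len(female_scores), replace=False)
    p_values.append(scipy.stats.ks_2samp(subsample, female_scores).pvalue)
print("Median p-value over 1000 subsamples: %.3f" % np.median(p_values))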


Compare with Goodreads 10k

A helpful soul has uploaded ratings and stats for the 10,000 books with the most ratings on Goodreads (https://github.com/zygmuntz/goodbooks-10k). Let's compare those with my ratings!

(You may have to run

git submodule update --init

to get the 10k submodule)
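The actual comparison follows in the repository's notebook. As a rough sketch of the idea (the goodbooks-10k column names 'title' and 'average_rating' are assumptions here - check books.csv for the exact headers), you could join on the book title and compare your rating with the crowd's average:

# path assumes the goodbooks-10k submodule sits next to the notebook
books_10k = pd.read_csv("./goodbooks-10k/books.csv")

merged = cleaned_df.merge(books_10k[["title", "average_rating"]],
                          left_on="Title", right_on="title", how="inner")
print("%d of my rated books are in the 10k set" % len(merged))

# positive values: I rate the book higher than the Goodreads average
print((merged["My Rating"] - merged["average_rating"]).describe())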

