kaggle notebook: Introduction to Book Depository Dataset

Book Depository Dataset EDA

Through this notebook we will try to become familiar Book Depository Dataset and extract some usefull insights. The goal of this notebook is to become an introductory step for the dataset.

import pandas as pd
import os
import json
from glob import glob
import matplotlib.pyplot as plt
import seaborn as sns
% matplotlib
inline

Dataset Structure

Files:

categories.csv
dataset.csv
formats.csv
places.csv

The dataset consists of 5 file, the main dataset.csv file and some extra files. Extra files works as lookup tables for category, author, format and publication place. The reason behind this decision was to prevent data redundancy.

Fields:

authors: Book’s author(s) (list of str)
bestsellers-rank: Bestsellers ranking (int)
categories: Book’s categories. Check authors.csv for mapping (list of int)
description: Book description (str)
dimension_x: Book’s dimension X (float cm)
dimension_y: Book’s dimension Y (float cm)
dimension_z: Book’s dimension Z (float mm)
edition: Edition (str)
edition-statement: Edition statement (str)
for-ages: Range of ages (str)
format: Book’s format. Check formats.csv for mapping (int)
id: Book’s unique id (int)
illustrations-note:
imprint:
index-date: Book’s crawling date (date)
isbn10: Book’s ISBN-10 (str)
isbn13: Book’s ISBN-13 (str)
lang: List of book’ language(s)
publication-date: Publication date (date)
publication-place: Publication place (id)
publisher: Publisher (str)
rating-avg: Rating average [0-5] (float)
rating-count: Number of ratings
title: Book’s title (str)
url: Book relative url (https://bookdepository.com + url)
weight: Book’s weight (float gr)

So, lets assign each file to a different dataframe

if os.path.exists('../input/book-depository-dataset'):
    path_prefix = '../input/book-depository-dataset/{}.csv'
else:
    path_prefix = '../export/kaggle/{}.csv'

df, df_f, df_a, df_c, df_p = [
    pd.read_csv(path_prefix.format(_)) for _ in ('dataset', 'formats', 'authors', 'categories', 'places')
]

# df = df.sample(n=500)
df.head()

	authors	bestsellers-rank	categories	description	dimension-x	dimension-y	dimension-z	edition	edition-statement	for-ages	...	isbn10	isbn13	lang	publication-date	publication-place	rating-avg	rating-count	title	url	weight
0	[1]	57858	[220, 233, 237, 2644, 2679, 2689]	They were American and British air force offic...	142.0	211.0	20.0	NaN	Reissue	NaN	...	393325792	9.780393e+12	en	2004-08-17	1.0	4.24	6688.0	The Great Escape	/Great-Escape-Paul-Brickhill/9780393325799	243.00
1	[2, 3]	114465	[235, 3386]	John Moran and Carl Williams were the two bigg...	127.0	203.2	25.4	NaN	NaN	NaN	...	184454737X	9.781845e+12	en	2009-03-13	2.0	3.59	291.0	Underbelly : The Gangland War	/Underbelly-Andrew-Rule/9781844547371	285.76
2	[4]	61,471	[241, 245, 247, 249, 378]	Plain English is the art of writing clearly, c...	136.0	195.0	16.0	Revised	4th Revised edition	NaN	...	199669171	9.780200e+12	en	2013-09-15	3.0	4.18	128.0	Oxford Guide to Plain English	/Oxford-Guide-Plain-English-Martin-Cutts/97801...	338.00
3	[5]	1,347,994	[245, 253, 263, 273, 274, 276, 279, 280, 281, ...	When travelling, do you want to journey off th...	136.0	190.0	33.0	Unabridged	Unabridged edition	NaN	...	1444185497	9.781444e+12	en	2014-12-03	2.0	NaN	NaN	Get Talking and Keep Talking Portuguese Total ...	/Get-Talking-Keep-Talking-Portuguese-Total-Aud...	156.00
4	[6]	58154	[1938, 1941, 1995]	No matter what your actual job title, you are-...	179.0	229.0	18.0	NaN	NaN	NaN	...	321934075	9.780322e+12	en	2016-02-28	4.0	4.30	212.0	The Truthful Art : Data, Charts, and Maps for ...	/Truthful-Art-Alberto-Cairo/9780321934079	732.00

5 rows × 25 columns

Basic Stats

Firtly, lets display some basic statistics:

df.describe()

	dimension-x	dimension-y	dimension-z	id	isbn13	publication-place	rating-avg	rating-count	weight
count	742112.000000	713278.000000	742112.000000	7.790050e+05	7.658780e+05	556846.000000	502381.000000	5.023810e+05	714289.000000
mean	160.560373	222.289753	25.609538	9.781553e+12	9.781559e+12	247.989972	3.932002	1.187949e+04	444.768939
std	37.487785	43.145377	44.218401	1.563374e+09	1.565216e+09	643.253808	0.530740	1.174093e+05	610.212039
min	0.250000	1.000000	0.130000	9.771131e+12	9.780000e+12	1.000000	1.000000	1.000000e+00	15.000000
25%	135.000000	198.000000	9.140000	9.780764e+12	9.780772e+12	2.000000	3.690000	6.000000e+00	172.370000
50%	152.000000	229.000000	16.000000	9.781473e+12	9.781475e+12	8.000000	4.000000	5.200000e+01	299.000000
75%	183.000000	240.000000	25.000000	9.781723e+12	9.781724e+12	178.000000	4.220000	6.880000e+02	521.630000
max	1905.000000	1980.000000	1750.000000	9.798485e+12	9.798389e+12	5501.000000	5.000000	5.870281e+06	90717.530000

Publication Date Distribution: Most books are published in t

df["publication-date"] = df["publication-date"].astype("datetime64")
df.groupby(df["publication-date"].dt.year).id.count().plot(title='Publication date distribution')

<matplotlib.axes._subplots.AxesSubplot at 0x7f7827af68d0>

png

df["index-date"] = df["index-date"].astype("datetime64")
df.groupby(df["index-date"].dt.month).id.count().plot(title='Crawling date distribution')

<matplotlib.axes._subplots.AxesSubplot at 0x7f7827af61d0>

png

df.groupby(['lang']).id.count().sort_values(ascending=False)[:5].plot(kind='pie', title="Most common languages")

<matplotlib.axes._subplots.AxesSubplot at 0x7f78279aca58>

png

import math

sns.lineplot(data=df.groupby(df['rating-avg'].dropna().apply(int)).id.count().reset_index(), x='rating-avg', y='id')

<matplotlib.axes._subplots.AxesSubplot at 0x7f7827970dd8>

png

dims = pd.DataFrame({
    'dims': df['dimension-x'].fillna('0').astype(int).astype(str).str.cat(
        df['dimension-y'].fillna('0').astype(int).astype(str), sep=" x ").replace('0 x 0', 'Unknown').values,
    'id': df['id'].values
})
dims.groupby(['dims']).id.count().sort_values(ascending=False)[:8].plot(kind='pie', title="Most common dimensions")

<matplotlib.axes._subplots.AxesSubplot at 0x7f77ee8a2b38>

png

pd.merge(
    df[['id', 'publication-place']], df_p, left_on='publication-place', right_on='place_id'
).groupby(['place_name']).id.count().sort_values(ascending=False)[:8].plot(kind='pie',
                                                                           title="Most common publication places")

<matplotlib.axes._subplots.AxesSubplot at 0x7f77ee96a208>

png

Book Depository Dataset EDA#

Dataset Structure#

Basic Stats#

Book Depository Dataset EDA

Dataset Structure

Basic Stats