kaggle dataset: Book Depository Dataset

kaggle notebook: Introduction to Book Depository Dataset

github repo: book-depository-dataset

Book Depository Dataset EDA

In this notebook we get familiar with the Book Depository Dataset and extract some useful insights. The goal is to provide an introductory step for working with the dataset.

import pandas as pd
import os
import json
from glob import glob
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Dataset Structure

Files:

  • authors.csv
  • categories.csv
  • dataset.csv
  • formats.csv
  • places.csv

The dataset consists of 5 files: the main dataset.csv file and four extra files. The extra files work as lookup tables for category, author, format and publication place. They were split out to avoid data redundancy.
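To make the lookup-table idea concrete, here is a minimal sketch that resolves category ids to names. It runs on toy data rather than the real files: the column names category_id and category_name and all the values are assumptions for illustration, as is the premise that list-valued columns arrive from the CSV as strings.

```python
import json
import pandas as pd

# Toy stand-ins for dataset.csv and categories.csv; all values are made up.
# The column names category_id / category_name are assumptions.
books = pd.DataFrame({
    "id": [1, 2],
    "categories": ["[10, 20]", "[20]"],  # list columns assumed to be stored as strings
})
categories = pd.DataFrame({
    "category_id": [10, 20],
    "category_name": ["History", "Crime"],
})

# Parse the stringified lists, then give each (book, category) pair its own row.
books["categories"] = books["categories"].apply(json.loads)
exploded = books.explode("categories").rename(columns={"categories": "category_id"})
exploded["category_id"] = exploded["category_id"].astype(int)

# Resolve the ids through the lookup table.
named = exploded.merge(categories, on="category_id", how="left")
print(named)
```

The same pattern should carry over to the other lookup files once their actual column names are checked.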

Fields:

  • authors: Book’s author(s). Check authors.csv for mapping (list of int)
  • bestsellers-rank: Bestsellers ranking (int)
  • categories: Book’s categories. Check categories.csv for mapping (list of int)
  • description: Book description (str)
  • dimension-x: Book’s dimension X (float mm)
  • dimension-y: Book’s dimension Y (float mm)
  • dimension-z: Book’s dimension Z (float mm)
  • edition: Edition (str)
  • edition-statement: Edition statement (str)
  • for-ages: Range of ages (str)
  • format: Book’s format. Check formats.csv for mapping (int)
  • id: Book’s unique id (int)
  • illustrations-note:
  • imprint:
  • index-date: Book’s crawling date (date)
  • isbn10: Book’s ISBN-10 (str)
  • isbn13: Book’s ISBN-13 (str)
  • lang: Book’s language(s)
  • publication-date: Publication date (date)
  • publication-place: Publication place. Check places.csv for mapping (int)
  • publisher: Publisher (str)
  • rating-avg: Rating average [0-5] (float)
  • rating-count: Number of ratings
  • title: Book’s title (str)
  • url: Book relative url (https://bookdepository.com + url)
  • weight: Book’s weight (float gr)
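As the url field stores only a relative path, the full link has to be rebuilt by prepending the site prefix. A small sketch, using a relative url taken from the dataset preview; the variable names are illustrative only.

```python
import pandas as pd

BASE_URL = "https://bookdepository.com"  # prefix given in the field description

# One relative url as it appears in the url column.
urls = pd.Series(["/Great-Escape-Paul-Brickhill/9780393325799"])

# String + Series concatenates element-wise.
absolute = BASE_URL + urls
print(absolute[0])
```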

So, let's load each file into its own dataframe:

if os.path.exists('../input/book-depository-dataset'):
    path_prefix = '../input/book-depository-dataset/{}.csv'
else:
    path_prefix = '../export/kaggle/{}.csv'

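# Load the main table and the four lookup tables into separate dataframes.
df, df_f, df_a, df_c, df_p = [
    pd.read_csv(path_prefix.format(name))
    for name in ('dataset', 'formats', 'authors', 'categories', 'places')
]
# df = df.sample(n=500)  # uncomment to work on a small random sample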
df.head()

|   | authors | bestsellers-rank | categories | description | dimension-x | dimension-y | dimension-z | edition | edition-statement | for-ages | ... | isbn10 | isbn13 | lang | publication-date | publication-place | rating-avg | rating-count | title | url | weight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [1] | 57858 | [220, 233, 237, 2644, 2679, 2689] | They were American and British air force offic... | 142.0 | 211.0 | 20.0 | NaN | Reissue | NaN | ... | 393325792 | 9.780393e+12 | en | 2004-08-17 | 1.0 | 4.24 | 6688.0 | The Great Escape | /Great-Escape-Paul-Brickhill/9780393325799 | 243.00 |
| 1 | [2, 3] | 114465 | [235, 3386] | John Moran and Carl Williams were the two bigg... | 127.0 | 203.2 | 25.4 | NaN | NaN | NaN | ... | 184454737X | 9.781845e+12 | en | 2009-03-13 | 2.0 | 3.59 | 291.0 | Underbelly : The Gangland War | /Underbelly-Andrew-Rule/9781844547371 | 285.76 |
| 2 | [4] | 61,471 | [241, 245, 247, 249, 378] | Plain English is the art of writing clearly, c... | 136.0 | 195.0 | 16.0 | Revised | 4th Revised edition | NaN | ... | 199669171 | 9.780200e+12 | en | 2013-09-15 | 3.0 | 4.18 | 128.0 | Oxford Guide to Plain English | /Oxford-Guide-Plain-English-Martin-Cutts/97801... | 338.00 |
| 3 | [5] | 1,347,994 | [245, 253, 263, 273, 274, 276, 279, 280, 281, ... | When travelling, do you want to journey off th... | 136.0 | 190.0 | 33.0 | Unabridged | Unabridged edition | NaN | ... | 1444185497 | 9.781444e+12 | en | 2014-12-03 | 2.0 | NaN | NaN | Get Talking and Keep Talking Portuguese Total ... | /Get-Talking-Keep-Talking-Portuguese-Total-Aud... | 156.00 |
| 4 | [6] | 58154 | [1938, 1941, 1995] | No matter what your actual job title, you are-... | 179.0 | 229.0 | 18.0 | NaN | NaN | NaN | ... | 321934075 | 9.780322e+12 | en | 2016-02-28 | 4.0 | 4.30 | 212.0 | The Truthful Art : Data, Charts, and Maps for ... | /Truthful-Art-Alberto-Cairo/9780321934079 | 732.00 |

5 rows × 25 columns
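The preview hints at a few parsing quirks: isbn13 shows up as a float, isbn10 appears to lose its leading zero, and some bestsellers-rank values carry thousands separators. The sketch below shows one possible way to handle this at load time. It reads a tiny inline CSV instead of the real file, and the two sample rows are assumptions that only mimic the preview.

```python
import io
import pandas as pd

# Tiny inline CSV standing in for dataset.csv (illustrative values only).
raw = io.StringIO(
    'id,isbn10,isbn13,bestsellers-rank\n'
    '1,0393325792,9780393325799,57858\n'
    '2,0199669171,9780199669172,"61,471"\n'
)

# Read the ISBN columns as strings so leading zeros and all digits survive.
sample = pd.read_csv(raw, dtype={"isbn10": str, "isbn13": str})

# Strip thousands separators, then convert the rank to a numeric column.
sample["bestsellers-rank"] = pd.to_numeric(
    sample["bestsellers-rank"].astype(str).str.replace(",", "", regex=False),
    errors="coerce",
)
print(sample.dtypes)
```

Passing the same dtype mapping to the pd.read_csv call for dataset.csv would be the natural place to apply this, assuming the real file behaves like the preview suggests.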

Basic Stats

Firstly, let's display some basic statistics:

df.describe()

|   | dimension-x | dimension-y | dimension-z | id | isbn13 | publication-place | rating-avg | rating-count | weight |
|---|---|---|---|---|---|---|---|---|---|
| count | 742112.000000 | 713278.000000 | 742112.000000 | 7.790050e+05 | 7.658780e+05 | 556846.000000 | 502381.000000 | 5.023810e+05 | 714289.000000 |
| mean | 160.560373 | 222.289753 | 25.609538 | 9.781553e+12 | 9.781559e+12 | 247.989972 | 3.932002 | 1.187949e+04 | 444.768939 |
| std | 37.487785 | 43.145377 | 44.218401 | 1.563374e+09 | 1.565216e+09 | 643.253808 | 0.530740 | 1.174093e+05 | 610.212039 |
| min | 0.250000 | 1.000000 | 0.130000 | 9.771131e+12 | 9.780000e+12 | 1.000000 | 1.000000 | 1.000000e+00 | 15.000000 |
| 25% | 135.000000 | 198.000000 | 9.140000 | 9.780764e+12 | 9.780772e+12 | 2.000000 | 3.690000 | 6.000000e+00 | 172.370000 |
| 50% | 152.000000 | 229.000000 | 16.000000 | 9.781473e+12 | 9.781475e+12 | 8.000000 | 4.000000 | 5.200000e+01 | 299.000000 |
| 75% | 183.000000 | 240.000000 | 25.000000 | 9.781723e+12 | 9.781724e+12 | 178.000000 | 4.220000 | 6.880000e+02 | 521.630000 |
| max | 1905.000000 | 1980.000000 | 1750.000000 | 9.798485e+12 | 9.798389e+12 | 5501.000000 | 5.000000 | 5.870281e+06 | 90717.530000 |
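The count row differs between columns (for example 502381 values for rating-avg against 779005 for id), which points to missing values. A quick way to quantify this is the share of missing entries per column. The sketch below uses a small made-up frame so it runs on its own; on the real data the same expression would be applied to df.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few gaps, standing in for the real dataframe.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "rating-avg": [4.2, np.nan, 3.6, np.nan],
    "weight": [243.0, 285.76, np.nan, 156.0],
})

# isna() gives a boolean frame; its column means are the missing shares.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```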

Publication Date Distribution: Most books are published in t

# pd.to_datetime is used because recent pandas versions reject a unit-less astype("datetime64").
df["publication-date"] = pd.to_datetime(df["publication-date"])
df.groupby(df["publication-date"].dt.year).id.count().plot(title='Publication date distribution')
[Figure: Publication date distribution]

# Parse the crawling timestamps, then count books per crawling month.
df["index-date"] = pd.to_datetime(df["index-date"])
df.groupby(df["index-date"].dt.month).id.count().plot(title='Crawling date distribution')
[Figure: Crawling date distribution]

df.groupby(['lang']).id.count().sort_values(ascending=False)[:5].plot(kind='pie', title="Most common languages")
[Figure: Most common languages]

# Count books per rating, with the average rating truncated to an integer.
rating_counts = df.groupby(df['rating-avg'].dropna().apply(int)).id.count().reset_index()
sns.lineplot(data=rating_counts, x='rating-avg', y='id')
[Figure: number of books per integer rating]
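The rating plot groups books by the integer part of rating-avg, so a 3.59 lands in bucket 3. Rounding is an alternative binning that can shift the picture noticeably. The sketch below compares both on a handful of sample values; it is only an illustration of the binning choice, not a result computed on the dataset.

```python
import pandas as pd

# A few sample ratings plus a missing one.
ratings = pd.Series([1.0, 3.59, 4.18, 4.24, 4.30, 5.0, None])

# Truncation, as in the plot above: 3.59 -> 3.
truncated = ratings.dropna().astype(int).value_counts().sort_index()

# Rounding to the nearest integer: 3.59 -> 4.
rounded = ratings.dropna().round().astype(int).value_counts().sort_index()

print(truncated)
print(rounded)
```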

# Build an "x by y" label per book; books with both values missing become 'Unknown'.
dims = pd.DataFrame({
    'dims': df['dimension-x'].fillna(0).astype(int).astype(str).str.cat(
        df['dimension-y'].fillna(0).astype(int).astype(str), sep=" x ").replace('0 x 0', 'Unknown').values,
    'id': df['id'].values
})
dims.groupby(['dims']).id.count().sort_values(ascending=False)[:8].plot(kind='pie', title="Most common dimensions")
[Figure: Most common dimensions]

pd.merge(
    df[['id', 'publication-place']], df_p, left_on='publication-place', right_on='place_id'
).groupby(['place_name']).id.count().sort_values(ascending=False)[:8].plot(
    kind='pie', title="Most common publication places")
[Figure: Most common publication places]
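Note that pd.merge defaults to an inner join, so books without a publication-place drop out of the chart. If those books should stay visible, a left join with an explicit 'Unknown' label is one option. The sketch below demonstrates the difference on toy data; the place names are made up, and only the place_id / place_name column names come from the merge above.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the main table and places.csv (values are made up).
books = pd.DataFrame({"id": [1, 2, 3], "publication-place": [1.0, 2.0, np.nan]})
places = pd.DataFrame({"place_id": [1, 2], "place_name": ["New York", "London"]})

# Inner join (the default): the book with a missing place disappears.
inner = pd.merge(books, places, left_on="publication-place", right_on="place_id")

# Left join: every book is kept, unmatched ones are labelled explicitly.
left = pd.merge(books, places, left_on="publication-place", right_on="place_id", how="left")
left["place_name"] = left["place_name"].fillna("Unknown")

print(len(inner), len(left))
print(left.groupby("place_name").id.count())
```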