
Separate app dependencies using Multi-stage builds

Intro Deploying and testing a web app using Docker has become the standard nowadays. Quite often, we don’t pay attention to building more than one Docker image of a service in order to use them for different scenarios. Two of the most common use cases are the following: use the Docker image to run the application, and use the Docker image to run the tests. Usually, the second scenario comes with additional dependencies, such as testing frameworks, mocking tools and others. In this case, we don’t want to include all these extra dependencies in the production Docker image. This post demonstrates how multi-stage builds can be used for this purpose. ...
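The separation described above can be sketched with a multi-stage Dockerfile. This is a minimal illustration, not the post's actual Dockerfile; the requirements file names and commands are assumptions for a hypothetical Python web app:

```dockerfile
# base stage: runtime dependencies shared by all targets
FROM python:3.11-slim AS base
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .

# test stage: builds on base and adds test-only dependencies
FROM base AS test
COPY requirements-test.txt .
RUN pip install --no-cache-dir -r requirements-test.txt
CMD ["pytest"]

# production stage: inherits only from base, so nothing
# from the test stage ends up in the shipped image
FROM base AS production
CMD ["python", "app.py"]
```

With this layout, `docker build --target test -t app:test .` produces the image for running tests, while `docker build --target production -t app .` produces the lean production image.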


Store images efficiently in scrapy using folder structure

Introduction NOTE: This article was the reason to implement the scrapy-folder-tree Scrapy extension. What is the problem and how to deal with it When it comes to image storing, a common pitfall is to save all the images in a single folder. If the number of images is less than a few thousand, then stop reading this post, because you will not face any issue. On the other hand, if you are planning to store numerous images, then consider splitting them into different folders. Listing a directory will become faster and more efficient, and at the end of the day, your kernel will be happier. A common pattern is to create a folder structure based on the name of each file. For example, let’s say that path/to/image/dir will be the main directory, and you want to store imagefile.jpg. Create a folder structure based on the file’s characters and save the file inside the leaf folder: ...
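The character-based pattern above can be sketched in a few lines of Python. This is an illustration of the idea, not the code of the scrapy-folder-tree extension itself; the function name and `depth` parameter are made up:

```python
from pathlib import PurePosixPath

def nested_path(base: str, filename: str, depth: int = 2) -> str:
    """Nest a file under one folder per leading character of its name,
    so images spread across many small directories instead of one."""
    parts = list(filename[:depth])  # e.g. "imagefile.jpg" -> ["i", "m"]
    return str(PurePosixPath(base, *parts, filename))

print(nested_path("path/to/image/dir", "imagefile.jpg"))
# -> path/to/image/dir/i/m/imagefile.jpg
```

With `depth=2` and an alphanumeric character set, files spread over up to 36 × 36 leaf folders, so each directory stays small enough to list quickly.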


Book Depository Dataset EDA

kaggle dataset: Book Depository Dataset kaggle notebook: Introduction to Book Depository Dataset github repo: book-depository-dataset Book Depository Dataset EDA Through this notebook we will try to become familiar with the Book Depository Dataset and extract some useful insights. The goal of this notebook is to serve as an introduction to the dataset.

```python
import pandas as pd
import os
import json
from glob import glob
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
```

Dataset Structure Files: categories.csv dataset.csv formats.csv places.csv The dataset consists of 5 files: the main dataset.csv file and some extra files. The extra files work as lookup tables for category, author, format and publication place. The reason behind this decision was to prevent data redundancy. ...
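The lookup-table layout described above can be sketched in plain Python (the notebook itself uses pandas; the column names and rows below are made up for illustration, not taken from the real CSV files):

```python
# Hypothetical lookup tables, standing in for categories.csv and formats.csv:
# the main table stores compact ids, and each id resolves to a single string here.
categories = {1: "Art", 2: "Science"}
formats = {11: "Paperback", 12: "Hardback"}

# A hypothetical row of the main dataset.csv, holding ids instead of
# repeating the full category/format strings on every row.
book = {"title": "Some Book", "category_id": 2, "format_id": 11}

# Joining against the lookup tables reconstructs the denormalized record.
resolved = {
    "title": book["title"],
    "category": categories[book["category_id"]],
    "format": formats[book["format_id"]],
}
print(resolved)
# -> {'title': 'Some Book', 'category': 'Science', 'format': 'Paperback'}
```

Storing ids and resolving them on demand is what prevents the redundancy mentioned above: a category name lives in one file, no matter how many books reference it.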