Overview

The overarching goal of the project is to build a comprehensive index of software, tools, and databases published in several major bioinformatics and computational biology journals. This data will become an integral part of Datasets2Tools.

The results of the analysis (a recommendation system and data visualizations) will also be featured on a small website, BioToolBay, which is currently under development.

BTB

Methods

Step 1: Web Scraping

Use the Scrapy-Splash framework to extract selected information from research journals.

BMC Spider, Oxford Spider

Step 2: Data cleaning and normalization

BMC Cleaner, Oxford Cleaner

Step 3: Tool Name Extraction (Regex)

Tool Extraction
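As a rough illustration of regex-based name extraction, the sketch below assumes that tool articles follow the common "Name: description" title convention. The pattern, the three-word limit on names, and the function name `extract_tool_name` are assumptions made for this example and are not taken from the Tool Extraction notebook.

```python
import re

# Assumed convention: the tool name is a short phrase (at most three words)
# at the start of the title, followed by a colon and a description.
TITLE_PATTERN = re.compile(
    r"^\s*(?P<name>[^\s:]+(?:\s[^\s:]+){0,2})\s*:\s+(?P<desc>\S.*)$"
)


def extract_tool_name(title):
    """Return the candidate tool name from an article title, or None."""
    match = TITLE_PATTERN.match(title)
    return match.group("name") if match else None


if __name__ == "__main__":
    # Hypothetical titles, used only to exercise the pattern.
    for title in (
        "ExampleTool: a web server for sequence annotation",
        "A survey of sequence alignment methods",
    ):
        print(repr(title), "->", extract_tool_name(title))
```

Titles without a colon, or with a long phrase before the colon, are treated as not naming a tool. A real pipeline would likely need more patterns than this one.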

Step 4: Active links to homepages

Link Checker
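A link check of this kind can be sketched with the standard library alone. The version below is an assumption about how such a check might look, not the Link Checker notebook itself: it sends a HEAD request and treats any 2xx or 3xx status as active. The `fetch` parameter is a hypothetical hook that lets the check run without network access.

```python
import urllib.error
import urllib.request


def default_fetch(url, timeout=10):
    """Return the HTTP status code for url, or None if the request fails."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "link-checker-sketch"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 404.
        return error.code
    except (urllib.error.URLError, ValueError, OSError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return None


def is_active(url, fetch=default_fetch):
    """A link counts as active when the status is in the 2xx or 3xx range."""
    status = fetch(url)
    return status is not None and 200 <= status < 400


def filter_active(tools, fetch=default_fetch):
    """Keep only the tool records whose 'url' field points to an active page."""
    return [tool for tool in tools if is_active(tool["url"], fetch)]
```

Some servers reject HEAD requests, so a fuller implementation might fall back to GET before declaring a link dead.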

Step 5: Unique Names, NLP Similarity, t-SNE

NLP
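The two ideas in this step, deduplicating tool names and scoring textual similarity, can be sketched as follows. The NLP notebook's actual method is not described here, so this example substitutes a plain bag-of-words cosine similarity as a stand-in. The record fields and function names are hypothetical, and the t-SNE projection is not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-count vectors of two texts."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[term] * counts_b[term] for term in shared)
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def unique_by_name(tools):
    """Keep the first record for each case-insensitive tool name."""
    seen = {}
    for tool in tools:
        seen.setdefault(tool["name"].strip().lower(), tool)
    return list(seen.values())
```

Term counts ignore word order and synonyms, so TF-IDF weighting or embeddings would be a natural refinement before any dimensionality reduction.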

Step 6: Build a Database

Create database, Dataframe2SQL
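A minimal sketch of the database step, assuming SQLite and a single `tools` table. The database engine, the schema, and the column names are assumptions made for illustration, and plain dictionaries stand in for DataFrame rows so that the example needs only the standard library.

```python
import sqlite3


def build_database(records, path=":memory:"):
    """Create the tools table if needed and insert records, skipping duplicate names."""
    connection = sqlite3.connect(path)
    connection.execute(
        """
        CREATE TABLE IF NOT EXISTS tools (
            name    TEXT PRIMARY KEY,
            journal TEXT NOT NULL,
            url     TEXT NOT NULL
        )
        """
    )
    # Named placeholders map directly onto the keys of each record.
    connection.executemany(
        "INSERT OR IGNORE INTO tools (name, journal, url) "
        "VALUES (:name, :journal, :url)",
        records,
    )
    connection.commit()
    return connection


if __name__ == "__main__":
    # Hypothetical records, used only to exercise the function.
    sample = [
        {"name": "ExampleTool", "journal": "Bioinformatics", "url": "http://example.org/a"},
        {"name": "ExampleTool", "journal": "NAR", "url": "http://example.org/b"},
    ]
    db = build_database(sample)
    print(db.execute("SELECT COUNT(*) FROM tools").fetchone()[0])
```

Making `name` the primary key enforces uniqueness across journals at the database level, at the cost of keeping only the first journal recorded for each tool.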

Short Summary

| Journal | Extracted Articles | Tool-Related Articles | Unique Tools with Active Link |
|---|---|---|---|
| BMC Bioinformatics | 8,358 | 2,099 | 1,408 |
| NAR Oxford | 20,954 | 3,627 | 2,029 |
| Database Oxford | 742 | 391 | 333 |
| Bioinformatics Oxford | 10,681 | 3,627 | 2,482 |
| Total | 40,735 | 9,744 | 5,915* |

* unique tools across all the journals

Tools' Similarity

Requires Firefox. The color scale represents the relative number of citations. The size of each ball corresponds to the relative number of views.