maayanlab_bioinformatics.harmonization package

Submodules

maayanlab_bioinformatics.harmonization.homologs module

maayanlab_bioinformatics.harmonization.homologs.human_expression_to_mouse(human_expression, strategy='sum', uppercase=False)[source]

Given a human expression matrix, produce a mouse-compatible expression matrix by mapping homologs.

@param human_expression: pd.DataFrame(columns=samples, index=human_genes, values=counts)
@param strategy: 'sum' – the strategy to use when aggregating duplicates
@returns pd.DataFrame(columns=samples, index=mouse_genes, values=counts)
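
The mapping-then-aggregating idea described above can be sketched with plain pandas. This is a minimal illustration under stated assumptions, not the library's implementation: the `homologs` table, the gene names `GENE_A`, `GENE_B`, `Gene_ab` and `NO_HOMOLOG`, and the counts are all hypothetical, and the real function obtains its mapping from MGI rather than from a hand-written table.

```python
import pandas as pd

# Hypothetical human -> mouse homolog table (illustrative rows only).
homologs = pd.DataFrame({
    "human": ["SP140", "TP53", "GENE_A", "GENE_B"],
    "mouse": ["Sp140", "Trp53", "Gene_ab", "Gene_ab"],
})

# Toy expression matrix: human genes on the rows, samples on the columns.
human_expression = pd.DataFrame(
    {"sample_1": [5, 10, 1, 2, 7], "sample_2": [0, 3, 4, 6, 9]},
    index=["SP140", "TP53", "GENE_A", "GENE_B", "NO_HOMOLOG"],
)

# Keep only genes that have a homolog, then relabel each row by its mouse gene.
mapping = homologs.set_index("human")["mouse"]
mapped = human_expression.loc[human_expression.index.isin(mapping.index)]
mouse_labels = mapping.loc[mapped.index].to_numpy()

# The 'sum' strategy: rows that end up with the same mouse gene are added together.
mouse_expression = mapped.groupby(mouse_labels).sum()
print(mouse_expression)
```

Here `GENE_A` and `GENE_B` both map to `Gene_ab`, so their counts are summed into one row, while `NO_HOMOLOG` is dropped because it has no entry in the mapping.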

maayanlab_bioinformatics.harmonization.homologs.mouse_expression_to_human(mouse_expression, strategy='sum', uppercase=False)[source]

Given a mouse expression matrix, produce a human-compatible expression matrix by mapping homologs.

@param mouse_expression: pd.DataFrame(columns=samples, index=mouse_genes, values=counts)
@param strategy: 'sum' – the strategy to use when aggregating duplicates
@returns pd.DataFrame(columns=samples, index=human_genes, values=counts)

maayanlab_bioinformatics.harmonization.homologs.mouse_human_homologs(uppercase=False)[source]

Returns a dataframe with mouse/human gene mappings based on MGI. See: http://www.informatics.jax.org/homology.shtml

@param uppercase: bool – whether the mappings should be uppercased (e.g. for case-insensitive mapping)
@returns pd.DataFrame

|mouse|human|
|-----|-----|
|sp140|SP140|

maayanlab_bioinformatics.harmonization.id_mapper module

class maayanlab_bioinformatics.harmonization.id_mapper.IDMapper[source]

Bases: object

Stores id mappings and makes it easy to use many of them in tandem.

mapper = IDMapper()

mapper.update({ 'a': {'A', 'C'} }, namespace='source_1')
mapper.update({ 'b': {'A', 'B'} }, namespace='source_2')
mapper.get('C', namespace='source_2') == 'b'

Because of the overlap in synonyms, it is inferred that source_1's 'a' and source_2's 'b' correspond to the same
  id, so any of the synonyms can be used to retrieve the id in a given namespace.
Since this can be problematic when synonyms are malformed, mapper.conflicts_summary() and mapper.top_conflicts()
  provide ways of debugging excess synonym applications.
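
To make the inference above concrete, here is a small sketch of how overlapping synonym sets could be merged into one group. `ToyIDMapper` is a hypothetical stand-in written for this illustration; it is not the library's `IDMapper` and its internals are assumptions. Only the `update` and `get` call shapes follow the example above.

```python
from collections import defaultdict


class ToyIDMapper:
    """Illustrative stand-in, NOT the library's IDMapper."""

    def __init__(self):
        # Each group holds a set of synonyms and, per namespace, the ids seen.
        self._groups = []

    def update(self, mappings, namespace=None):
        for identifier, synonyms in mappings.items():
            merged = {"synonyms": set(synonyms), "ids": defaultdict(set)}
            merged["ids"][namespace].add(identifier)
            kept = []
            for group in self._groups:
                if group["synonyms"] & set(synonyms):
                    # Shared synonym: fold the existing group into the new one.
                    merged["synonyms"] |= group["synonyms"]
                    for ns, ids in group["ids"].items():
                        merged["ids"][ns] |= ids
                else:
                    kept.append(group)
            kept.append(merged)
            self._groups = kept

    def get(self, term, namespace=None):
        for group in self._groups:
            if term in group["synonyms"]:
                ids = group["ids"].get(namespace, set())
                # Return None when the namespace has no id or an ambiguous one.
                return next(iter(ids)) if len(ids) == 1 else None
        return None


toy = ToyIDMapper()
toy.update({"a": {"A", "C"}}, namespace="source_1")
toy.update({"b": {"A", "B"}}, namespace="source_2")
print(toy.get("C", namespace="source_2"))
```

Because 'a' and 'b' share the synonym 'A', they land in the same group, so 'C' resolves to 'b' in source_2 and 'B' resolves to 'a' in source_1. A group that collects more than one id for the same namespace is the kind of conflict a malformed synonym would produce; this sketch simply returns None in that case.
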
conflicts_summary()[source]

Return counts of conflicts in each namespace

find(term)[source]
get(term, namespace=None)[source]
get_id(id, namespace=None)[source]
summary()[source]

Return counts of overlapping namespaces (like a Venn diagram)

top_conflicts()[source]

Return conflicting synonym counts

update(mappings, namespace=None)[source]

Add mappings of the form: { identifier: { synonyms } }

maayanlab_bioinformatics.harmonization.ncbi_genes module

maayanlab_bioinformatics.harmonization.ncbi_genes.ncbi_genes_fetch(organism='Mammalia/Homo_sapiens', filters=<function <lambda>>)[source]

Fetch the current NCBI Human Gene Info database. See ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/ for the directory/file of the organism of interest.

maayanlab_bioinformatics.harmonization.ncbi_genes.ncbi_genes_lookup(organism='Mammalia/Homo_sapiens', filters=<function <lambda>>)[source]

Return a lookup dictionary with synonyms as the keys and official symbols as the values. Usage:

ncbi_lookup = ncbi_genes_lookup('Mammalia/Homo_sapiens')
print(ncbi_lookup('STAT3')) # any alias will get converted into the official symbol
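
For intuition, a synonym-to-symbol lookup of this kind can be sketched from a small table. The sketch assumes the NCBI gene_info layout of a `Symbol` column and a pipe-separated `Synonyms` column (with `-` meaning none); the rows are a hand-written illustrative subset, and `toy_lookup` is a hypothetical name. The usage above calls the result like a function, so the sketch exposes `dict.get`; that is an assumption about the return type, not a statement of how the library builds it.

```python
import pandas as pd

# Illustrative rows in the assumed gene_info layout (not fetched from NCBI).
gene_info = pd.DataFrame({
    "Symbol": ["STAT3", "TP53"],
    "Synonyms": ["APRF|HIES", "p53|LFS1"],
})

lookup = {}
for symbol, synonyms in zip(gene_info["Symbol"], gene_info["Synonyms"]):
    # An official symbol always maps to itself, even if it is also someone's alias.
    lookup[symbol] = symbol
    for synonym in synonyms.split("|"):
        if synonym != "-":
            # Do not let an alias overwrite an entry that already exists.
            lookup.setdefault(synonym, symbol)

# Expose the dictionary as a callable; unknown terms give None.
toy_lookup = lookup.get
print(toy_lookup("APRF"))
```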

maayanlab_bioinformatics.harmonization.transcripts module

maayanlab_bioinformatics.harmonization.transcripts.transcripts_to_genes(df_expression: DataFrame, df_features: DataFrame | None = None, strategy='var', uppercasegenes=False, lookup_dict: Dict[str, str] | None = None, organism='Mammalia/Homo_sapiens')[source]

Map alternative gene ids/transcripts to gene symbols using ncbi_genes_lookup. We take a matrix with genes/transcripts on the rows and samples on the columns. When multiple genes/transcripts map to the same symbol, we apply the specified collision strategy. If df_features is provided, we use its 'symbol' column as the transcript names; otherwise we use the df_expression index. The resulting matrix will naturally have fewer rows, corresponding to gene symbols in the lookup_dict, which defaults to official NCBI gene symbols for Homo sapiens.

Parameters:

strategy – ('var'|'sum') collision strategy (select the one with the highest variance, or sum counts)
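
The two collision strategies can be illustrated on a toy matrix. This is a sketch of the idea only, assuming that 'var' keeps the single row with the highest variance across samples for each symbol; the transcript names, symbols, counts and the hand-written `lookup` dictionary are hypothetical, and the library's own code may differ in detail.

```python
import pandas as pd

# Toy matrix: transcripts on the rows, samples on the columns.
df_expression = pd.DataFrame(
    {"s1": [1, 10, 4], "s2": [1, 30, 6], "s3": [1, 20, 5]},
    index=["tx_a1", "tx_a2", "tx_b1"],
)

# Hypothetical transcript -> symbol lookup; two transcripts collide on GENE_A.
lookup = {"tx_a1": "GENE_A", "tx_a2": "GENE_A", "tx_b1": "GENE_B"}
symbols = df_expression.index.map(lookup).to_numpy()

# 'sum': add the counts of all rows that share a symbol.
summed = df_expression.groupby(symbols).sum()

# 'var': for each symbol, keep the row whose variance across samples is highest.
variances = df_expression.var(axis=1)
best_rows = variances.groupby(symbols).idxmax()
by_var = df_expression.loc[best_rows]
by_var.index = best_rows.index

print(summed)
print(by_var)
```

With 'sum', GENE_A becomes the element-wise total of tx_a1 and tx_a2; with 'var', GENE_A keeps only tx_a2, since tx_a1 is constant across samples. Either way the three input rows collapse to two output rows.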

Module contents

This module contains functions relating to data harmonization.