maayanlab_bioinformatics.harmonization package
Submodules
maayanlab_bioinformatics.harmonization.homologs module
- maayanlab_bioinformatics.harmonization.homologs.human_expression_to_mouse(human_expression, strategy='sum', uppercase=False)
Given a human expression matrix, produce a mouse-compatible expression matrix by mapping homologs.
@param human_expression: pd.DataFrame(columns=samples, index=human_genes, values=counts)
@param strategy: 'sum' – the strategy to use when aggregating duplicates
@returns pd.DataFrame(columns=samples, index=mouse_genes, values=counts)
- maayanlab_bioinformatics.harmonization.homologs.mouse_expression_to_human(mouse_expression, strategy='sum', uppercase=False)
Given a mouse expression matrix, produce a human-compatible expression matrix by mapping homologs.
@param mouse_expression: pd.DataFrame(columns=samples, index=mouse_genes, values=counts)
@param strategy: 'sum' – the strategy to use when aggregating duplicates
@returns pd.DataFrame(columns=samples, index=human_genes, values=counts)
- maayanlab_bioinformatics.harmonization.homologs.mouse_human_homologs(uppercase=False)
Returns a dataframe with mouse/human gene mappings based on MGI. See: http://www.informatics.jax.org/homology.shtml
@param uppercase: bool – whether mappings should be uppercased (e.g. for case-insensitive mapping)
@returns pd.DataFrame
|mouse|human|
|-----|-----|
|sp140|SP140|
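The conversion functions above can be pictured as a join against the homolog table followed by an aggregation of duplicates. The following is a minimal sketch of that idea, not the library's implementation: the helper `map_expression`, the toy gene names, and the toy homolog table are all hypothetical, and only the 'sum' strategy is shown.

```python
import pandas as pd

# Toy homolog table in the documented shape (columns: mouse, human).
# The rows are made up for illustration; the real table comes from MGI.
toy_homologs = pd.DataFrame({
    "mouse": ["Sp140", "Gene_a", "Gene_a"],
    "human": ["SP140", "GENE_A1", "GENE_A2"],
})

# Toy human expression matrix: genes on the rows, samples on the columns.
human_expression = pd.DataFrame(
    {"s1": [10, 1, 2, 7], "s2": [20, 3, 4, 9]},
    index=["SP140", "GENE_A1", "GENE_A2", "NO_HOMOLOG"],
)


def map_expression(expression, mapping, source, target, strategy="sum"):
    """Hypothetical helper: re-index `expression` from source to target genes."""
    if strategy != "sum":
        raise ValueError("this sketch only implements the 'sum' strategy")
    # The inner join keeps only genes that have a homolog in the table.
    merged = expression.merge(mapping, left_index=True, right_on=source, how="inner")
    # Rows that land on the same target gene are summed together.
    return merged.drop(columns=[source]).groupby(target).sum()


mouse_expression = map_expression(
    human_expression, toy_homologs, source="human", target="mouse"
)
```

In this toy input, GENE_A1 and GENE_A2 both map to Gene_a, so their counts are summed, and NO_HOMOLOG is dropped because it has no entry in the table.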
maayanlab_bioinformatics.harmonization.id_mapper module
- class maayanlab_bioinformatics.harmonization.id_mapper.IDMapper
Bases: object
Stores id mappings and makes it easy to use many of them in tandem.
    mapper = IDMapper()
    mapper.update({ 'a': {'A', 'C'} }, namespace='source_1')
    mapper.update({ 'b': {'A', 'B'} }, namespace='source_2')
    mapper.get('C', namespace='source_2') == 'b'
Because of the overlap in synonyms, it is inferred that source_1's 'a' and source_2's 'b' correspond to the same id, so any of the synonyms can be used to retrieve the id in a given namespace. Since this can be problematic when synonyms are malformed, mapper.conflicts_summary() and mapper.conflicts_counts() provide ways of debugging excess synonym applications.
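To make the inference concrete, here is a small pure-Python sketch of merging ids whenever their synonym sets overlap. It is an illustration under stated assumptions rather than the IDMapper implementation: the class `TinyIDMapper`, its internal layout, and its `conflicts()` method are hypothetical, and returning None for a conflicted namespace is a choice made only for this sketch.

```python
class TinyIDMapper:
    """Toy illustration of synonym-overlap merging; not the library's code."""

    def __init__(self):
        # Each group holds a set of synonyms and, per namespace, a set of ids.
        self._groups = []

    def update(self, mappings, namespace):
        for id_, synonyms in mappings.items():
            merged = {"synonyms": set(synonyms), "ids": {namespace: {id_}}}
            untouched = []
            for group in self._groups:
                if group["synonyms"] & merged["synonyms"]:
                    # Shared synonym: fold the existing group into the new one.
                    merged["synonyms"] |= group["synonyms"]
                    for ns, ids in group["ids"].items():
                        merged["ids"].setdefault(ns, set()).update(ids)
                else:
                    untouched.append(group)
            self._groups = untouched + [merged]

    def get(self, synonym, namespace):
        for group in self._groups:
            if synonym in group["synonyms"]:
                ids = group["ids"].get(namespace, set())
                # Assumption of this sketch: ambiguous or missing ids yield None.
                return next(iter(ids)) if len(ids) == 1 else None
        return None

    def conflicts(self):
        # Groups in which one namespace ended up with more than one id.
        return [
            group for group in self._groups
            if any(len(ids) > 1 for ids in group["ids"].values())
        ]


toy_mapper = TinyIDMapper()
toy_mapper.update({"a": {"A", "C"}}, namespace="source_1")
toy_mapper.update({"b": {"A", "B"}}, namespace="source_2")
found_in_source_2 = toy_mapper.get("C", namespace="source_2")
found_in_source_1 = toy_mapper.get("B", namespace="source_1")

# A stray synonym can glue two ids of the same namespace together.
toy_mapper.update({"c": {"B"}}, namespace="source_1")
conflicted_groups = toy_mapper.conflicts()
```

The last update shows why conflict reporting matters: a single shared synonym ('B') is enough to merge 'a' and 'c' within source_1, which is the kind of excess merging the conflict methods are meant to surface.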
maayanlab_bioinformatics.harmonization.ncbi_genes module
- maayanlab_bioinformatics.harmonization.ncbi_genes.ncbi_genes_fetch(organism='Mammalia/Homo_sapiens', filters=<function <lambda>>)
Fetch the current NCBI Gene Info database for the given organism (human by default). See ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/ for the directory/file of the organism of interest.
- maayanlab_bioinformatics.harmonization.ncbi_genes.ncbi_genes_lookup(organism='Mammalia/Homo_sapiens', filters=<function <lambda>>)
Return a lookup dictionary with synonyms as the keys and official symbols as the values.
Usage:
    ncbi_lookup = ncbi_genes_lookup('Mammalia/Homo_sapiens')
    print(ncbi_lookup('STAT3')) # any alias will get converted into the official symbol
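The shape of such a lookup can be sketched without downloading anything. The example below assumes records with a `Symbol` field and a pipe-separated `Synonyms` field where "-" means no synonyms, as in NCBI gene_info files; the helper `build_symbol_lookup` and the toy records are hypothetical, and letting the first gene win an ambiguous alias is a simplification that may differ from what the library does.

```python
# Toy records for illustration only; GENE_X is a made-up gene.
toy_records = [
    {"Symbol": "STAT3", "Synonyms": "APRF|HIES"},
    {"Symbol": "TP53", "Synonyms": "P53|LFS1|TRP53"},
    {"Symbol": "GENE_X", "Synonyms": "-"},
]


def build_symbol_lookup(records):
    """Hypothetical helper: map every synonym and symbol to its official symbol."""
    lookup = {}
    for record in records:
        symbol = record["Symbol"]
        # An official symbol always maps to itself, overriding any alias.
        lookup[symbol] = symbol
        if record["Synonyms"] == "-":
            continue
        for synonym in record["Synonyms"].split("|"):
            # Simplification: the first gene seen keeps an ambiguous alias.
            lookup.setdefault(synonym, symbol)
    return lookup


# Binding .get gives a callable, mirroring the call style of the usage above.
toy_lookup = build_symbol_lookup(toy_records).get
alias_result = toy_lookup("APRF")
symbol_result = toy_lookup("STAT3")
missing_result = toy_lookup("NOT_A_GENE")
```

Unknown names return None here, so a caller can decide whether to drop or keep unmapped identifiers.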
maayanlab_bioinformatics.harmonization.transcripts module
- maayanlab_bioinformatics.harmonization.transcripts.transcripts_to_genes(df_expression: DataFrame, df_features: DataFrame | None = None, strategy='var', uppercasegenes=False, lookup_dict: Dict[str, str] | None = None, organism='Mammalia/Homo_sapiens')
Map gene alternative ids/transcripts to gene symbols using ncbi_genes_lookup.
We take a matrix with genes/transcripts on the rows and samples on the columns. In the case of multiple gene/transcript to symbol mappings, we adopt the collision strategy specified. If df_features is provided, we will use its 'symbol' column as the transcript names; otherwise we will use the df_expression index. The resulting matrix will naturally have fewer rows, corresponding to the gene symbols in lookup_dict, which defaults to official ncbi_gene symbols for homo sapiens.
Parameters:
strategy – ('var'|'sum') collision strategy (select the one with highest variance, or sum counts)
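The two collision strategies can be compared on a toy matrix. This is a hedged sketch of the idea only: `collapse_to_symbols`, the transcript ids, and the lookup dictionary are hypothetical stand-ins, and the real function resolves symbols through ncbi_genes_lookup unless lookup_dict is supplied.

```python
import pandas as pd

# Toy matrix: transcripts on the rows, samples on the columns.
toy_expression = pd.DataFrame(
    {"s1": [1, 0, 2, 9], "s2": [1, 5, 4, 9], "s3": [1, 10, 6, 9]},
    index=["TX_A1", "TX_A2", "TX_B1", "TX_UNMAPPED"],
)
# Hypothetical transcript-to-symbol lookup; two transcripts collide on GENE_A.
toy_lookup_dict = {"TX_A1": "GENE_A", "TX_A2": "GENE_A", "TX_B1": "GENE_B"}


def collapse_to_symbols(df_expression, lookup, strategy="var"):
    """Hypothetical helper: collapse rows that map to the same gene symbol."""
    symbols = pd.Series(
        [lookup.get(name) for name in df_expression.index],
        index=df_expression.index,
        name="symbol",
    ).dropna()  # rows without a symbol are discarded
    mapped = df_expression.loc[symbols.index]
    if strategy == "sum":
        # Add up the counts of all rows sharing a symbol.
        return mapped.groupby(symbols).sum()
    if strategy == "var":
        # Keep, per symbol, the single row with the highest variance.
        variances = mapped.var(axis=1)
        winners = variances.groupby(symbols).idxmax()
        result = mapped.loc[winners.values].copy()
        result.index = winners.index
        return result
    raise ValueError("strategy must be 'var' or 'sum'")


by_var = collapse_to_symbols(toy_expression, toy_lookup_dict, strategy="var")
by_sum = collapse_to_symbols(toy_expression, toy_lookup_dict, strategy="sum")
```

With 'var', GENE_A takes the values of TX_A2 because TX_A1 is constant across samples; with 'sum', the two transcripts are added. In both cases the output has one row per symbol and the unmapped transcript is gone.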
Module contents
This module contains functions relating to data harmonization.