Caryn K. Johansen

Description

Caryn Johansen is a computational biologist and data scientist.
As a computational biologist, she has more than four years of experience designing and implementing bioinformatics pipelines for genomic analysis and applying a variety of statistical and machine learning tools to genetic analyses. As a data scientist, she works with data from a range of fields both in and outside of biology, and is an expert at wrangling data in R, creating statistical models, and visualizing compelling results.

About Page

Current Affiliation

Data Scientist • HER Lesbian Dating App • May 2018 - present

Cabinet of Curiosity • Data Science Blog with Ciera Martinez • March 2018 - present

Education

M.S. Bioinformatics and Biology • New York University, NY • 2016
- Adviser: Dr. Michael Purugganan
B.S. Biology • Humboldt State University, CA • 2012
Plant Biology Graduate Group • Ph.D. student, U.C. Davis • 2016 - 2018
- Advisers: Dr. Jeffrey Ross-Ibarra & Dr. Daniel Runcie
Center for Population Biology • graduate student affiliate • 2017 - 2018

Technical Skills

Machine learning and statistical methods: supervised learning (classification/regression, k-NN, support vector machines), unsupervised learning (clustering), hypothesis testing, regression, hierarchical models, confidence intervals, dimension reduction (principal component analysis)

Software and programming languages: R, Linux, Python, AWS, SQL, JavaScript, D3.js, JAGS

Genomic methods: genome-wide association studies, quantitative trait loci analysis, RNA differential gene expression

Selected coursework: Applied Statistical Modeling for the Environmental Sciences, MCMC for Genetics, Quantitative and Population Genetics, Advanced Quantitative Genetics, Mixed Models in Quantitative Genetics

Achievements

Built a relational database and interactive user interface in Python, JavaScript, and SQL for a research laboratory at NYU, letting researchers quickly query and visualize the results of five years' worth of data collection.

Built maize genetics workflows in R and Linux during my Ph.D. work to analyze the genomics and traits of maize, and to identify differentially expressed genes.

Certified as a Data Carpentry Instructor, to teach computational tools for data analysis to academics and non-academics.

Built a personal application in R to quantify productivity and assess focus level while working on the computer.

Work Experience and Projects

Data Scientist • May 2018 - present

HER Dating App; San Francisco, CA
Built the data analysis pipeline for the app to assess app health and user behavior and to conduct A/B product tests
Coordinated with the engineering, product, marketing and sales, and leadership teams to deliver key metrics
Give weekly company-wide presentations on the state of the app

Data Scientist Contractor • February 2018 - May 2018

Minotaur Inc.; San Francisco, CA
Built a bioinformatics pipeline in R, Linux, and Python to make Minotaur's gene discovery scalable and fast
Built predictive models to infer plant metabolite presence or absence
Mined public metabolic, proteomic, and genomic databases to identify possible new discoveries for plant metabolic products
Worked closely with a small team of computational biologists and analytical chemists to produce a written report on results and proof-of-concept computational infrastructure

Ph.D. graduate student • September 2016 - December 2017

U.C. Davis; Davis, CA
Advisers: Jeffrey Ross-Ibarra & Daniel Runcie
Focus: statistical genetics and population genetics
Constructed statistical models in R to understand the gene expression differences between populations of corn and corn’s wild relative, teosinte
Used Bayesian statistical models in R to generate informative gene clusters in corn to better understand how environment affects gene expression
Designed field experiments based on simulations in R to test the hypothesis that hybrids of diverged populations of corn in Mexico had dysregulated genes that may contribute to speciation
Conducted genome-wide association mapping in Linux and R on 503 corn hybrids using mixed linear models to identify genetic loci associated with agronomic phenotypes

Bioinformatics Contractor • January 2016

Hampton Creek Foods, Inc.; San Francisco, CA
Worked with the R&D team to construct an in-house plant protein discovery pipeline
Built a tool in R that let the team feed in raw data output from a machine directly, bypassing the available point-and-click software

Teaching Assistant • August 2015 - December 2015

New York University Tandon School of Engineering; New York, NY
Course: Next Generation Sequence Analysis
Provided technical and course-related assistance, helping students process large genomic data sets with current biological software tools

Research Assistant • June 2015 - August 2015

New York University; New York, NY
Built a SQL database in Python to host five years of data on protein function predictions and transcriptional regulatory interactions in the Oryza genus
Built the front end in JavaScript and network visualizations with D3.js for the relational database, called The Rice Data Center
This tool made it possible to search five years' worth of genomic, proteomic, and analysis data on different rice populations, in order to answer questions, reduce redundant use of research funds, inform new hypotheses, and inspire future research

Research Assistant • June 2012 - July 2014

Carnegie Institution for Science, Department of Plant Biology; Stanford, CA
Adviser: Dr. Seung Rhee
Conducted research on the impact of salt stress on reproductive success and root architecture in Arabidopsis thaliana. Developed lab protocols to analyze sodium ion distribution in plant tissues.

Community

Data Carpentry Instructor
- Planned: Ecology/R at Chan Zuckerberg Biohub, April 2018

Project Details

Role of gene expression changes in the adaptation of corn and wild relative teosinte hybrids

Examining the effects of hybridization on gene expression patterns in the leaf tissue of corn-teosinte crosses. We noticed that many genes were dysregulated: expression in the offspring was either much greater or much less than in both parents.

Use of a Bayesian sparse factor Gibbs sampler to identify gene clusters

Calculating the population branch statistic between cultivars of corn and teosinte

The Rice Data Center: a multi-feature data analysis tool for Oryza sativa genetic, proteomic, and regulatory interaction network data

Built a custom full-stack database and interactive data visualization for searching five years' worth of genomic, proteomic, and analysis data on different rice populations from the Purugganan lab at NYU.

Genome-wide association study (GWAS) on 503 corn cultivars

Identification of environmental quantitative trait loci in 503 corn cultivars

Seminars and Training

Summer Institute for Statistical Genetics - June 2017
- Hosted by: University of Washington Department of Biostatistics
Data Carpentry Instructor Training - July 2017

Grants and Awards

Henry A. Jastro Research Scholarship Award
- Amount awarded: $2,940
- Awarded by: UC Davis, for the 2017-2018 academic year
Institute for Statistical Genetics Travel Scholarship
- Amount awarded: $1,740
- Awarded by: University of Washington, for July 2017

Scientific Outreach

Science Says, contributing author - 2016 - 2018
Cool Kids Code - December 2016
- After-school program at Riverside Elementary in Sacramento, CA, designed to teach elementary-age kids the basics of computer programming using the Scratch language

Peer Reviewed Publications

K.M. Hazzouri, J.M. Flowers, J. Visser, H.S.M. Khierallah, U. Rosas, G.M. Pham, R.S. Meyer, C.K. Johansen, G.S. Markhand, K. Masmoudi, N. Haider, N. Kadri, Y. Idahgdhour, J. Malek, D. Thirkhill, R. Kruegger, A.W. Zayed and M.D. Purugganan. Whole genome re-sequencing of date palms yields insights into diversification of a fruit tree crop. Nature Communications. 2015 Nov 9;6. doi:10.1038/ncomms9824
