Compilation of Completed Data Science Projects through the University of Texas, the University of Chicago, and MIT
MIT
(All code in Python)
Content | Description | Skills |
---|---|---|
Capstone | Build computer models to detect malaria in images of red blood cells (a CNN sketch follows this table). | • Exploratory Data Analysis • Data Visualization • Statistics • Convolutional Neural Networks |
Foundations for Data Science | Analyze marketing campaigns to help the CMO devise the next strategy. | • Exploratory Data Analysis • Data Visualization • Statistics |
Data Analysis & Visualization | Apply dimensionality reduction to the Auto MPG dataset and segment bank customers. | • PCA • t-SNE • Clustering |
Machine Learning | Predict house prices in Boston and predict hotel booking cancellations. | • Linear Regression • Logistic Regression • kNN |
Practical Data Science | Predict the conversion of leads to customers and forecast stock prices. | • Decision Trees • Random Forest • Time Series |
Deep Learning | Recognize house-number digits from street-view images using neural networks. | • Artificial Neural Networks • Convolutional Neural Networks |
Recommendation Systems | Build a recommendation system to recommend the best Amazon products to users (a rank-based sketch also follows this table). | • Rank Based Recommendation Systems • Similarity Based Recommendation Systems • Matrix Factorization Based Recommendation Systems |
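
For illustration, below is a minimal Keras sketch of the kind of convolutional network used in the capstone above. The layer sizes, input shape, and commented-out training call are assumptions for the example, not the course solution; data loading is not shown.

```python
# Minimal CNN sketch for binary cell-image classification (Keras).
# Assumes images are preprocessed to 64x64 RGB float arrays; the
# architecture below is illustrative, not the course solution.
from tensorflow import keras
from tensorflow.keras import layers

def build_model(input_shape=(64, 64, 3)):
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # parasitized vs. uninfected
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# X: (n, 64, 64, 3) float32 image array, y: (n,) 0/1 labels -- loading not shown.
# model = build_model()
# model.fit(X, y, validation_split=0.2, epochs=10, batch_size=32)
```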
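
A rank-based recommender, the simplest of the three recommendation approaches listed above, can likewise be sketched in a few lines of pandas. The column names here (user_id, product_id, rating) are hypothetical, not the course dataset's.

```python
# Rank-based recommendation sketch: recommend the highest-rated products
# that have at least a minimum number of ratings.
import pandas as pd

def top_n_products(ratings: pd.DataFrame, n: int = 5, min_ratings: int = 50) -> pd.DataFrame:
    # Aggregate each product's mean rating and rating count.
    stats = (ratings.groupby("product_id")["rating"]
                    .agg(avg_rating="mean", num_ratings="count")
                    .reset_index())
    # Keep only products with enough ratings to be trustworthy.
    popular = stats[stats["num_ratings"] >= min_ratings]
    return popular.sort_values("avg_rating", ascending=False).head(n)

# Tiny illustrative ratings table.
ratings = pd.DataFrame({
    "user_id":    [1, 2, 3, 1, 2, 3, 1, 2],
    "product_id": ["A", "A", "A", "B", "B", "B", "C", "C"],
    "rating":     [5, 4, 5, 2, 3, 2, 4, 4],
})
print(top_n_products(ratings, n=2, min_ratings=2))
```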
University of Chicago
(All code in R)
Content | Description | Skills |
---|---|---|
Capstone | 1. What are the top factors affecting life expectancy? 2. With what accuracy can we predict a country’s life expectancy based upon these factors? 3. Is it possible to predict whether a population will have a life expectancy >= 65? | • Exploratory Data Analysis • Unsupervised Modeling: Clustering and Dimension Reduction • Supervised Modeling: Multivariate Linear Regression, Predictor Correlations, Linear Models, Multicollinearity Detection, Predictor Selection, Model Training and Testing |
Clustering | Use k-means clustering to segment athletes into meaningful sub-groups based upon performance criteria. 1. Investigate the concept of exploratory analysis and recognize its role in the machine learning process. 2. Divide observations into meaningful and representative groups using k-means clustering; assess the outputs from several models to select the best iteration (an elbow-method sketch follows this table). 3. Combine similar observations using hierarchical clustering. | • Probability Space • Random Variables • Distributions • Method of Moments • Hypothesis Testing • K-means • K-modes • Agglomerative Hierarchical Clustering |
Principal Components Analysis | Perform principal component analysis on the exercise performance variables in an athlete dataset. 1. Produce a correlation matrix of the variables. 2. Perform PCA and analyze the variance accounted for (VAF) by each factor; include an elbow plot, and interpret the first loading using a loadings plot. 3. Use a biplot of the first two factors and loadings and/or other visualizations to identify unique individuals. | • PCA |
Bivariate Linear Regression | Compare and contrast the effect of salary on jewelry spend in two different populations. 1. Utilize the method of moments to develop a thorough understanding of the slope and intercept parameters of a bivariate linear model. 2. Evaluate bivariate linear model parameters and goodness of fit using t-statistics, confidence intervals, and the R-squared calculation. 3. Define the assumptions of linear regression, explore how potential violations can be identified and subsequently corrected, and perform hypothesis testing on linear model results. | • Interpret Bivariate Linear Regression • Evaluate Bivariate Linear Regression • Assumptions of the Linear Model • Correlation vs. Causation |
Multivariate Linear Regression | Expand upon the bivariate linear model exercise by including additional predictors. 1. Apply variable transformations and interaction terms to improve the quality of a model (connection to decision trees). 2. Assess issues with multicollinearity using VIF (a VIF sketch follows this table). 3. Engineer features to simplify interpretation and improve model quality. | • Multivariate Linear Regression • Assess Multicollinearity and Omitted-Variable Bias • Evaluate Models and Select Predictors • Perform Feature Transformations and Engineering |
Logistic Regression | Use student GRE scores and GPA to determine whether a student was admitted (binary) to a particular university. | • Introduction to Classification Models • Linear vs. Logistic Regression • Binary Logistic Regression • Loss Optimization |
ANOVA | Use categorical (gender and species) and continuous variables to predict the body depth of crabs. | • Incorporate categorical predictors • One-way ANOVA • F-test • Extension of one-way ANOVA to two-way ANOVA • Random Effects |
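
The Chicago coursework above is in R; for consistency with the rest of this compilation, here is a Python/scikit-learn sketch of the elbow method used to pick k in the clustering project. The data is synthetic.

```python
# Elbow-method sketch for choosing k in k-means: fit models for a range of
# k values and look for the "bend" in within-cluster sum of squares.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Three synthetic "athlete performance" clusters in 2D.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [4, 4], [0, 5])])

ks = range(1, 9)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("Within-cluster sum of squares")
plt.title("Elbow plot: look for the bend")
plt.show()
```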
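
Likewise, a sketch of the VIF multicollinearity check from the multivariate regression project, using statsmodels in place of R. The predictors are simulated, with x2 deliberately collinear with x1.

```python
# Variance inflation factor (VIF) sketch for detecting multicollinearity.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=200)  # deliberately collinear with x1
x3 = rng.normal(size=200)
X = add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# Rule of thumb: VIF above roughly 5-10 signals problematic multicollinearity.
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vifs)
```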
University of Texas
Content | Description | Skills |
---|---|---|
Capstone | Analyze data related to the Covid-19 pandemic, with the aim of producing a machine learning model that predicts the number of cases/fatalities that a given country can expect in the future. | • Tools: Python, Pandas, NumPy, TensorFlow, Jupyter Notebook • Database: PostgreSQL, pgAdmin • Machine Learning: Scikit-learn, TensorFlow, Keras, FBProphet • Visualization & Analysis: Matplotlib, Plotly, Google Slides, Tableau |
Kickstarting with Excel | Using the Kickstarter dataset, visualize campaign outcomes based on their launch dates and their funding goals. | Excel analysis with visualizations |
VB of Wall Street | Identify and analyze the stock performance of selected green energy stocks. | Visual Basic |
PyPoll with Python | Develop an automated auditing system to certify the results of a recent local congressional election. | Python basics |
PyCitySchools with Pandas | Prepare standardized test data from each high school district-wide for analysis, reporting, and presentation. The results of this analysis will be used to inform the district of performance trends and patterns. This study will highlight possible causation variables to provide insight and understanding for resource allocation and policy administration to maximize district performance in the areas of math and reading. | 1. Open Jupyter Notebook files from local directories using a development environment. 2. Read an external CSV file into a DataFrame. 3. Format a DataFrame column. 4. Determine data types of row values in a DataFrame. 5. Retrieve data from specific columns of a DataFrame. 6. Merge, filter, slice, and sort a DataFrame. 7. Apply the groupby() function to a DataFrame (a groupby sketch follows this table). 8. Use multiple methods to perform a function on a DataFrame. 9. Perform mathematical calculations on columns of a DataFrame or Series. |
PyBer with Matplotlib | PyBer is a ride-sharing company that has collected considerable data regarding its service over time. The purpose of this project is to provide an exploratory analysis of the captured data in order to determine trends and correlations. | 1. Create line, bar, scatter, bubble, pie, and box-and-whisker plots using Matplotlib. 2. Add and modify features of Matplotlib charts. 3. Add error bars to line and bar charts. 4. Pandas, NumPy, and SciPy statistics. |
WeatherPy with Python APIs | PlanMyTrip is a top travel technology company that specializes in internet-related services in the hotel and lodging industry. Collect and present data for customers via the search page, which can be filtered by preferred travel criteria to find an ideal hotel anywhere in the world. | 1. Perform tasks using new Python libraries and modules. 2. Retrieve and use data from an API “get” request to a server (a request sketch follows this table). 3. Retrieve and store values from a JSON array. 4. Use try and except blocks to resolve errors. 5. Write Python functions. 6. Create scatter plots using the Matplotlib library, and apply styles and features to a plot. 7. Perform linear regression, and add regression lines to scatter plots. 8. Create heatmaps, and add markers using the Google Maps API. |
Employee Database with SQL | Using corporate personnel information, determine the number of imminent retirements and the number of positions that will need to be filled. | 1. Design an Entity Relations Diagram that will apply to the data. 2. Create and use a SQL database. 3. Import and export large CSV datasets into pgAdmin. 4. Use different joins to create new tables in pgAdmin. 5. Write basic- to intermediate-level SQL statements. |
ETL - Extract, Transform, Load | Develop an algorithm to predict, in advance, which low-budget movies will become popular. | 1. Create an ETL pipeline from raw data to a SQL database. 2. Extract data from disparate sources using Python. 3. Clean and transform data using Pandas. 4. Use regular expressions to parse data and to transform text into numbers. 5. Load data with PostgreSQL. |
Surfs Up with Advanced Data Storage and Retrieval | Using Python, Pandas functions and methods, and SQLAlchemy, filter a database table to retrieve specified data (a SQLAlchemy sketch follows this table). Convert the data to a list, create a DataFrame from the list, and generate summary statistics. | 1. Differentiate between SQLite and PostgreSQL databases. 2. Use SQLAlchemy to connect to and query a SQLite database. 3. Design a Flask application using the data. |
Mission to Mars - Web Scraping with HTML/CSS | Automate a web browser (app.py) to visit a variety of websites and extract data about the mission to Mars. Store the data locally in a NoSQL database. Render the data in a web application created with Flask. | 1. Gain familiarity with and use HTML elements, as well as class and id attributes, to identify content for web scraping. 2. Use BeautifulSoup and Splinter to automate a web browser and perform a web scrape. 3. Create a MongoDB database to store data from the web scrape. 4. Create a web application with Flask to display the data from the web scrape. |
UFO Sightings with JavaScript | Create a dynamic web page to clearly present historic UFO sighting data. | 1. Build and deploy JavaScript functions, including built-in functions. 2. Convert JavaScript functions to arrow functions. 3. Create, populate, and dynamically filter a table using JavaScript and HTML. |
Plotly & Belly Button Biodiversity | Use Plotly, a JavaScript data visualization library, to create an interactive data visualization for the web. | 1. Create basic plots with Plotly, including bar charts, line charts, and pie charts. 2. Use D3.json() to fetch external data, such as CSV files and web APIs. 3. Parse data in JSON format. 4. Use functional programming in JavaScript to manipulate data. 5. Use JavaScript's Math library to manipulate numbers. 6. Use event handlers in JavaScript to add interactivity to a data visualization. 7. Deploy an interactive chart to GitHub Pages. |
Mapping Earthquakes with JS & APIs | Use JavaScript and the D3.js library to retrieve coordinates and magnitudes of recent earthquakes from a GeoJSON data source. | 1. Retrieve data from a GeoJSON file. 2. Make API requests to a server to host geographical maps. 3. Populate geographical maps with GeoJSON data using JavaScript and the Data-Driven Documents (D3) library. 4. Add multiple map layers to geographical maps using Leaflet control plugins to add user interface controls. 5. Use JavaScript ES6 functions to add GeoJSON data, features, and interactivity to maps. 6. Render maps on a local server. |
NY Citibike with Tableau | Study the NYC bike-sharing service to determine the feasibility of a similar program in Des Moines. | 1. Import data into Tableau. 2. Create and style worksheets, dashboards, and stories in Tableau. 3. Use Tableau worksheets to display data in a professional way. 4. Portray data accurately using Tableau dashboards. |
Statistics and R | Perform retrospective analysis on historical data, analytical verification of current automotive specifications, and study design of future auto testing. | 1. Load, clean up, and reshape datasets using tidyverse in R. 2. Visualize datasets with basic plots such as line, bar, and scatter plots, boxplots, and heatmaps using ggplot2. 3. Plot and identify distribution characteristics of a given dataset. 4. Formulate null and alternative hypotheses for a given data problem. 5. Implement and evaluate simple linear regression and multiple linear regression models for a given dataset. 6. Implement and evaluate one-sample t-tests, two-sample t-tests, and analysis of variance (ANOVA) models for a given dataset. |
Big Data | Use PySpark to perform the ETL process to extract a dataset, transform the data, connect to an AWS RDS instance, and load the transformed data into pgAdmin. Use PySpark, Pandas, or SQL to determine whether there is any bias toward favorable reviews from Vine members in the dataset. | 1. Import data from an Amazon AWS S3 data repository into a user-specific AWS S3 account. 2. Create a pgAdmin database and tie it to the AWS RDS server. 3. Use PySpark and Pandas to filter, parse, and coalesce data. |
Supervised Machine Learning and Credit Risk | Apply machine learning models to a large dataset in order to build and evaluate models of loan risk. Employ multiple models and sampling techniques to train and evaluate unbalanced class data. | 1. Create training and test groups from a given data set. 2. Implement the logistic regression, decision tree, random forest, and support vector machine algorithms. 3. Interpret the results of the logistic regression, decision tree, random forest, and support vector machine algorithms. 4. Determine which supervised learning algorithm is best used for a given data set or scenario. 5. Use ensemble and resampling techniques to improve model performance. |
Unsupervised Machine Learning and Cryptocurrencies | Create a report that includes which cryptocurrencies are on the trading market and how they could be grouped to create a classification system for this new investment. | 1. Preprocess data for unsupervised learning. 2. Cluster data using the K-means algorithm. 3. Determine the best number of centroids for K-means using the elbow curve. 4. Use PCA to limit features and speed up the model. |
Neural Networks and Deep Learning Models | Implement a neural network, using the TensorFlow platform in Python, to model (train and test) a dataset containing 34,000 organizations that have previously received funding. The model output is a binary classifier capable of predicting successful donation outcomes. | 1. Implement neural network models using TensorFlow. 2. Preprocess and construct datasets for neural network models. 3. Compare the differences between neural network models and deep neural networks. 4. Implement deep neural network models using TensorFlow. |
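
A few Python sketches of techniques from the table above follow. First, the groupby() aggregation pattern from PyCitySchools; the column and school names are illustrative, not the course dataset's.

```python
# Pandas groupby sketch: per-school summary statistics from student test data.
import pandas as pd

students = pd.DataFrame({
    "school":        ["Huang HS", "Huang HS", "Figueroa HS", "Figueroa HS"],
    "math_score":    [79, 61, 76, 58],
    "reading_score": [66, 94, 94, 90],
})

# One aggregation pass per school: average scores and student counts.
summary = students.groupby("school").agg(
    avg_math=("math_score", "mean"),
    avg_reading=("reading_score", "mean"),
    students=("math_score", "count"),
)
print(summary)
```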
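
Next, the API “get” request pattern from WeatherPy, with try/except handling. The endpoint and response fields assume the OpenWeatherMap current-weather API; substitute your own service and key.

```python
# API request sketch: fetch a city's current temperature, returning None
# on network errors or malformed responses.
from typing import Optional
import requests

def city_temperature(city: str, api_key: str) -> Optional[float]:
    url = "https://api.openweathermap.org/data/2.5/weather"  # assumed endpoint
    try:
        resp = requests.get(
            url,
            params={"q": city, "appid": api_key, "units": "metric"},
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json()["main"]["temp"]  # value pulled from the JSON payload
    except (requests.RequestException, KeyError):
        return None  # skip cities that error out or return malformed data

# Usage (hypothetical key):
# print(city_temperature("Lisbon", api_key="YOUR_KEY"))
```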
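
Finally, the Surfs Up pattern of filtering a SQLite table through SQLAlchemy and loading the result into a DataFrame for summary statistics. The database path and table/column names (measurement: date, prcp) are assumptions for the example.

```python
# SQLAlchemy sketch: run a filtered query against a SQLite database and
# summarize the results with pandas.
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///hawaii.sqlite")  # path is illustrative

# Parameterized filter: only rows on or after the given start date.
query = text("SELECT date, prcp FROM measurement WHERE date >= :start")
with engine.connect() as conn:
    df = pd.read_sql(query, conn, params={"start": "2016-08-23"})

print(df.describe())  # summary statistics of the retrieved precipitation data
```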