The maddening adventures of extracting millions of tweets.

Connectivity Map of Twitter Activity. Source: Justin Cocco

Twitter’s API is free to use and is, overall, a very useful resource for data analysis. Extracting tweets for one or several users takes little more than an R package. But if you are interested in harvesting millions of tweets from tens of thousands of users, you will have to sacrifice some additional tears. Here, the general strategy for doing just that is outlined so that you don’t lose your mind as well.

The goal is to outline the progressive steps taken to produce a robust, functioning script that extracts millions of tweets from the timelines of thousands of users. …


In the previous series, Philadelphia was examined in terms of poverty, with an emphasis on the relationship between demographics and poverty. Now, let’s look at how the level of poverty characterizes poverty within Philadelphia. Once the level of poverty has been characterized, an attempt will be made to balance the number of people in poverty against the severity of poverty within a given zip code. The hope is to build a model that captures the areas of the city that may require additional resources and assistance.

Poverty levels are expressed as a percentage of the federal poverty line, which is currently set at $12,760 for one person. A poverty percentage gives yearly income relative to that line, such that 50% is approximately six thousand dollars a year, and 500% is approximately sixty-four thousand dollars a year. …
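The percentages above can be checked with a little arithmetic. What follows is a minimal sketch, written in Python purely for illustration (the series itself works in R). It assumes that a poverty percentage is simply that share of the $12,760 single-person line; the function name `income_at_poverty_percent` is hypothetical and not taken from the article.

```python
# Federal poverty line for a one-person household, as quoted in the text.
POVERTY_LINE_SINGLE = 12_760


def income_at_poverty_percent(percent, poverty_line=POVERTY_LINE_SINGLE):
    """Return the yearly income (in dollars) at `percent` of the poverty line.

    Assumption: 100% means an income exactly at the poverty line.
    """
    return poverty_line * percent / 100


# Print the income that corresponds to a few poverty percentages.
for pct in (50, 100, 500):
    income = income_at_poverty_percent(pct)
    print(f"{pct:>3}% of the poverty line = ${income:,.0f}")
```

Under that assumption, 50% works out to $6,380 and 500% to $63,800, in line with the rounded figures quoted above.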


How Poverty, Education, and Work-force can help us understand the health of a Great City: a series.

Philadelphia is a diverse city with 1.59 million inhabitants covering 142.7 square miles. Of these 1.59 million inhabitants, approximately 500,000 live in poverty. Today, a quick look is taken at the dynamics of poverty within Philadelphia using R and maps. Poverty will be explored at the zip-code level as per the 2010 Census. Several issues will be investigated, including poverty, education, and work-force throughout Philadelphia, to gain a snapshot of the city’s health. Every investigation will use zip-code-centric, color-coded maps.

Here, raw counts of poverty, per-capita counts of poverty, disparity counts of poverty, and a comparison of the disparity between the Black and Hispanic communities are examined. …


In Depth Analysis

A look at COVID-19's effects on the Probability of Death given one’s Age and Residence.

Summary: In the following Part One of Three, a time series is analyzed using Bayesian probabilities in an effort to build models that quantify the probability of death in the nursing home and general populations. The probabilities were found not to be independent, and they were calculated accordingly. Furthermore, several distinct differences were noted between the nursing home and general populations in terms of the probability of death and the probability of belonging to a certain age group. Once the final probability of a condition given one’s location and age had been calculated, a direct comparison at the yearly and age levels could finally be performed to describe the relative safety or harm associated with one’s residential location; that comparison demonstrates distinct probability differences at both the mean and time-series levels. …


Organizations love PDFs, especially governmental bodies. To the masses, they are easy to read, with nice and clean formatting that is easy on the eyes. To the data scientist, they can be nightmares to upload. For example, take a look at this PDF:

Source: Pennsylvania Department of Health, Demographics of Nursing Home Residents.

What a 105-page nightmare that would be! R reads PDFs as 1-line imports, but clearly this PDF is not designed with data scientists in mind.

Extracting this data for analysis and manipulation is going to be a maze of extractions, re-arrangements, and ultimately many extra-curricular relaxation techniques.

The good news is, I like doing this! So here I’m going to walk you through both an example and my thought process to help you with your own adventures. This technique uses R and several R packages, namely the tidyverse and pdftools packages. For this example we are going to practice on a slightly easier PDF. This PDF contains 17 pages, with the top of every page looking like so (top), and the bottom looking like so (bottom). …

About

Justin Cocco

Hi! I’m a molecular biologist turned Paramedic turned aspiring data scientist. I like all things science, history, data, math, and medicine!
