Introduction

Data science enables us to take data and transform it into meaningful information that can help us make decisions. Data science is interdisciplinary and combines other well-known fields such as probability, statistics, analytics, and computer science.

Statistics

Statistics is the practice of applying mathematical calculations to sets of data to derive meaning. Statistics can give us a quick summary of a dataset, such as its average value or how consistent its values are.
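As a rough illustration of such a summary, the sketch below uses Python's standard `statistics` module. The `daily_sales` list is made-up data chosen only for this example; the mean stands in for the "average" and the standard deviation for how consistent the values are.

```python
import statistics

# Hypothetical daily sales figures, invented purely for illustration.
daily_sales = [12, 15, 11, 14, 13, 40, 12]

mean_sales = statistics.mean(daily_sales)      # the "average amount"
median_sales = statistics.median(daily_sales)  # middle value, less sensitive to outliers
spread = statistics.stdev(daily_sales)         # sample standard deviation

print("Mean:", round(mean_sales, 2))
print("Median:", median_sales)
print("Standard deviation:", round(spread, 2))
```

The single large value (40) pulls the mean above the median and inflates the standard deviation, which is one way a summary can hint that a dataset is not very consistent.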

There are two types of statistics: descriptive statistics and inferential statistics. Descriptive statistics describe a dataset using mathematically calculated values, such as the mean and standard deviation. For instance, the graph below from FiveThirtyEight charts the wage gap between American men and women in 2014. An example of a descriptive statistic would be that at the 90th income percentile, women make 80.6% of what men make on average.

Inferential statistics, on the other hand, are statistical calculations that enable us to draw conclusions about the larger population. For instance, from looking at the graph we can infer that at the 99th income percentile, women make less than 78% of what men make on average. We might also infer that the wage gap is smallest at the 10th income percentile because the minimum wage is the same for men and women.
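The difference between the two types can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the population is synthetic, the sample size of 200 is arbitrary, and the interval uses the common normal approximation (multiplier 1.96) rather than any method taken from the FiveThirtyEight analysis.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 100,000 hourly wages (synthetic, not real data).
population = [random.gauss(25, 6) for _ in range(100_000)]

# In practice we usually see only a sample, not the whole population.
sample = random.sample(population, 200)

# Descriptive: summarize the sample itself.
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)

# Inferential: estimate a plausible range for the population mean.
# 1.96 is the normal-approximation multiplier for a 95% interval.
margin = 1.96 * sample_sd / len(sample) ** 0.5
low, high = sample_mean - margin, sample_mean + margin

print("Sample mean: {:.2f}".format(sample_mean))
print("Approximate 95% interval for the population mean: {:.2f} to {:.2f}".format(low, high))
```

The sample mean and standard deviation only describe the 200 values we drew; the interval is a statement about the population we did not fully observe.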

Probability

Probability is the mathematical study of what could happen. In data science, probability calculations are used to build models. Models help us understand data that does not yet exist: either data we have not yet collected or data that has yet to be created. Data scientists create models to calculate the probability of a certain event, and then use that probability to make informed decisions.
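As a concrete case, the birthday script in this section computes the probability that at least two of n people share a birthday. Assuming 365 equally likely birthdays (an assumption of the calculation, which ignores February 29), the formula it evaluates can be written as:

```latex
P(\text{shared birthday}) = 1 - \frac{365 \times 364 \times \cdots \times (365 - n + 1)}{365^{n}}
```

The fraction is the probability that all n birthdays are different, so subtracting it from 1 gives the probability of at least one match.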

import random


num_people_in_room = 40  #Change this number

#Simulate a room with a certain number of people
def simulate(num_people):
  birthdays = []
  print("Here's what our room looks like:\n")
  months = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]
  #Assign a random birthday to each person
  for i in range(0, num_people):
    #Choose a random month
    month_choice = random.choice(months)
    #Choose a random day based on month
    if month_choice == "February":
      day = random.randint(1, 29)
    elif month_choice in ("April", "June", "September", "November"):
      day = random.randint(1, 30)
    else:
      day = random.randint(1, 31)
    birthday = month_choice + " " + str(day)
    #Store the birthday
    birthdays.append(birthday)
    print("Person {0}'s birthday: {1}".format(i + 1, birthday))
  calculate_probability(num_people)
  match = False
  #Check for matching birthdays
  for i in range(len(birthdays)):
    if find_duplicates(birthdays, birthdays[i], i):
      match = True
      break
  if not match:
    print("\n\nIn our simulation, no two people have the same birthday")

#Calculate the probability of there being 2 people with the same birthday
def calculate_probability(num_people):
  #Check that there are at least 2 people in the room
  if num_people < 2:
    print("\n\nNot enough people in the room!")
    return
  else:
    #Calculate the probability that no two people share a birthday,
    #multiplying one fraction at a time so large rooms do not overflow a float
    no_match = 1.0
    for i in range(num_people):
      no_match *= (365 - i) / 365.0
    probability = 1 - no_match
    #Change probability to percentage
    rounded = round(probability*100, 2)
    print("\n\nThe probability that two people in a room of {0} people have the same birthday is nearly {1}%".format(num_people, rounded))

    
#Find the same birthday within our list of birthdays
def find_duplicates(birthdays_list, birthday, index):
  people = []
  for i in range(len(birthdays_list)):
    if birthdays_list[i] == birthday and i != index:
      people.append(i + 1)
  if people:
    people.append(index + 1)
    print("\n\nIn our simulation, the following people have the same birthdays: ")
    for person in people:
      print("Person {0}".format(person))
    return True
  else:
    return False

simulate(num_people_in_room)

Sample output with num_people_in_room set to 5:

Here's what our room looks like:

Person 1's birthday: June 9
Person 2's birthday: June 21
Person 3's birthday: May 4
Person 4's birthday: December 2
Person 5's birthday: April 14


The probability that two people in a room of 5 people have the same birthday is nearly 2.71%


In our simulation, no two people have the same birthday
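A single simulated room only tells us whether a match happened once. One way to connect the simulation to the calculated probability is to simulate many rooms and count how often a match occurs. The sketch below is an illustration, not part of the original script: the function names, the trial count, and the seed are assumptions, and it draws birthdays uniformly from 365 days so that it matches the assumption behind the calculation.

```python
import random

def exact_probability(num_people):
    """Probability of at least one shared birthday, assuming 365 equally likely days."""
    no_match = 1.0
    for k in range(num_people):
        no_match *= (365 - k) / 365
    return 1 - no_match

def estimated_probability(num_people, trials=10_000, seed=1):
    """Estimate the same probability by simulating many rooms."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        days = [rng.randrange(365) for _ in range(num_people)]
        # A set drops duplicates, so a shorter set means a shared birthday.
        if len(set(days)) < num_people:
            hits += 1
    return hits / trials

for n in (5, 23, 40):
    exact = round(exact_probability(n) * 100, 2)
    estimate = round(estimated_probability(n) * 100, 2)
    print("{0} people: exact {1}%, simulated {2}%".format(n, exact, estimate))
```

With enough trials the simulated percentage should land close to the exact one, which is the sense in which a probability model predicts data that has not been collected yet.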

Probability has many applications across different fields and companies around the world.

One of the most common applications of data science is advertising. Using probability, companies such as YouTube and Facebook determine which kinds of ads are most relevant to a user.

Similarly, e-commerce companies like Amazon use probability to determine which items to present to a user based on their purchase history.
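To give a feel for how purchase history can feed a probability estimate, here is a toy sketch. It is not how Amazon or any real recommender works; the `histories` data, the function names, and the simple counting approach are all assumptions for illustration. It estimates the conditional probability that a customer who bought one item also bought another.

```python
from collections import Counter
from itertools import permutations

# Hypothetical purchase histories, one set of items per customer (made-up data).
histories = [
    {"tent", "sleeping bag", "lantern"},
    {"tent", "sleeping bag"},
    {"tent", "lantern"},
    {"novel", "lantern"},
    {"tent", "sleeping bag", "novel"},
]

item_counts = Counter()  # how many customers bought each item
pair_counts = Counter()  # how many customers bought both items of an ordered pair
for basket in histories:
    item_counts.update(basket)
    pair_counts.update(permutations(basket, 2))

def prob_buys(other, given):
    """Estimate P(customer bought `other` | customer bought `given`)."""
    return pair_counts[(given, other)] / item_counts[given]

def recommend(given):
    """Rank the other items by estimated conditional probability, highest first."""
    others = [item for item in item_counts if item != given]
    return sorted(others, key=lambda item: prob_buys(item, given), reverse=True)

print(recommend("tent"))
print(prob_buys("sleeping bag", "tent"))
```

In this made-up data, three of the four customers who bought a tent also bought a sleeping bag, so the sleeping bag ranks first for tent buyers.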

The video-streaming site Netflix uses probability to suggest titles that you might watch, and also changes the artwork for shows to attract more clicks.

Probability is also fundamental to areas such as machine learning, which powers applications such as Google DeepMind's AlphaGo and self-driving cars.
