# I Introduction to Data Engineering.py
""""******************************************************************************
Explore the differences between a data engineer and a data scientist, get an
overview of the various tools data engineers use, and expand your understanding
of how cloud technology plays a role in data engineering.
**************************************************************************************"""
#///WHAT IS DATA ENGINEERING///
#Tasks of the data engineer
"""Find the best fit.
Possible answers:
1 Apply a statistical model to a large dataset to find outliers.
ok 2 Set up scheduled ingestion of data from the application databases to an analytical database.
3 Come up with a database schema for an application. """
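# A minimal sketch of the ingestion task in answer 2, assuming both databases
# are SQLite and share a hypothetical `orders` table. The function name, table
# layout, and in-memory databases are illustrative assumptions, not part of the
# course material. A scheduler would call `ingest_new_orders` on a timer.

```python
import sqlite3


def ingest_new_orders(app_db, analytics_db, last_id):
    """Copy orders with id > last_id into the analytical database.

    Returns the highest id copied so far, so the next run can resume from it.
    """
    rows = app_db.execute(
        "SELECT id, customer, amount FROM orders WHERE id > ? ORDER BY id",
        (last_id,),
    ).fetchall()
    analytics_db.executemany(
        "INSERT INTO orders (id, customer, amount) VALUES (?, ?, ?)", rows
    )
    analytics_db.commit()
    return rows[-1][0] if rows else last_id


# Hypothetical in-memory databases standing in for the real ones.
app_db = sqlite3.connect(":memory:")
analytics_db = sqlite3.connect(":memory:")
for db in (app_db, analytics_db):
    db.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
app_db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)", [(1, "ann", 9.5), (2, "bob", 20.0)]
)
app_db.commit()

last_id = ingest_new_orders(app_db, analytics_db, last_id=0)
```

# Tracking `last_id` keeps each run incremental: a second call with the
# returned value copies nothing unless the application wrote new rows.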
#---
#Data engineer or data scientist?
"""Drag the items into the correct bucket.
{Data engineer}
Clean corrupt data
Develop scalable data architecture
Set up processes to bring data together
Streamline data acquisition
Cloud technology """
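# One way "clean corrupt data" might look in practice, as a rough sketch. The
# record layout (`name`, `price`) and the validity rules are assumptions made
# for illustration only.

```python
import math


def clean_records(records):
    """Keep records with a non-empty name and a finite, non-negative price."""
    cleaned = []
    for rec in records:
        name = (rec.get("name") or "").strip()
        try:
            price = float(rec.get("price"))
        except (TypeError, ValueError):
            continue  # price is missing or not a number
        if not name or not math.isfinite(price) or price < 0:
            continue
        cleaned.append({"name": name, "price": price})
    return cleaned


# Hypothetical raw input with several kinds of corruption.
raw = [
    {"name": " Widget ", "price": "9.99"},
    {"name": "", "price": "5"},
    {"name": "Gadget", "price": "n/a"},
    {"name": "Gizmo", "price": None},
]
clean = clean_records(raw)
```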
#---
#Data engineering problems
"""Decide where a data engineer is best suited to help.
ok 1 Data scientists are querying the online store databases directly and slowing down the application, since it uses the same database.
2 Harmful product recommendations are affecting the sales numbers of the online store.
3 The online store is slow because the application's database server doesn't have enough memory.
(A data engineer should make sure there's a separate database for analytics.) """
#---
#///TOOLS///
#Kinds of databases
"""Identify the databases in the schematic.
1 All database nodes are on the left.
2 All nodes on the left and the analytics node on the right are databases.
ok 3 Accounting, Online Store, Product Catalog, and Analytics are databases. """
#---
#Processing tasks
"""Select the most correct statement.
1 Data processing is often done on a single, very powerful machine.
ok 2 Data processing is distributed over clusters of virtual machines.
3 Data processing is often very complicated because you have to manually distribute workload over several computers.
(Joining, cleaning, and organizing data are all done during data processing.) """
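# The idea behind answer 2 can be sketched on one machine: split the data,
# process the pieces in parallel, and combine the partial results. Threads
# stand in for cluster nodes here, which is an assumption for illustration; a
# real cluster framework handles the distribution for you. All function names
# are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, n_chunks):
    """Split a list into at most n_chunks roughly equal parts."""
    if not data:
        return []
    size = -(-len(data) // n_chunks)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def process_chunk(chunk):
    """The work done on one 'node'; here simply a sum."""
    return sum(chunk)


def distributed_sum(data, n_workers=4):
    """Sum the data by combining partial sums computed in parallel."""
    chunks = split(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    return sum(partials)


total = distributed_sum(list(range(1, 101)))
```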
#---
#Scheduling tools
"""Which one is not a responsibility of the scheduler?
1 Make sure jobs run in a specific order and all dependencies are resolved correctly.
2 Make sure the jobs run at midnight UTC each day.
ok 3 Scale up the number of nodes when there's lots of data to be processed. """
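# A rough sketch of the two scheduler responsibilities listed above: resolving
# job dependencies and finding the next midnight UTC. The job names are
# hypothetical, and real scheduling tools offer far more than this.
# `graphlib` requires Python 3.9 or later.

```python
from datetime import datetime, timedelta, timezone
from graphlib import TopologicalSorter

# Each hypothetical job maps to the set of jobs it depends on.
jobs = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_tables": {"extract_orders", "extract_customers"},
    "load_analytics": {"join_tables"},
}


def run_order(dependencies):
    """Return the jobs in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


def next_midnight_utc(now):
    """Return the first midnight UTC strictly after the given aware datetime."""
    now = now.astimezone(timezone.utc)
    tomorrow = now + timedelta(days=1)
    return tomorrow.replace(hour=0, minute=0, second=0, microsecond=0)


order = run_order(jobs)
next_run = next_midnight_utc(datetime(2024, 1, 31, 23, 59, tzinfo=timezone.utc))
```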
#---
#///CLOUD PROVIDERS///
#Why cloud computing?
"""Consider the benefits of using cloud computing as opposed to self-hosting data centers. Can you select the most correct statement about cloud computing?
1 Cloud computing is always cheaper.
ok 2 The cloud can provide you with the resources you need, when you need them.
3 On-premises machines give me full control over the situation when things break.
(cloud elasticity) """
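# Cloud elasticity can be illustrated with a toy sizing rule: request as many
# nodes as the pending workload needs, within fixed bounds. The function name,
# parameters, and limits are assumptions for illustration and do not reflect
# any provider's autoscaling API.

```python
import math


def nodes_needed(pending_gb, gb_per_node, min_nodes=1, max_nodes=20):
    """Return the node count for a workload, clamped to [min_nodes, max_nodes]."""
    wanted = math.ceil(pending_gb / gb_per_node)
    return max(min_nodes, min(max_nodes, wanted))


quiet_night = nodes_needed(0, gb_per_node=50)
normal_day = nodes_needed(120, gb_per_node=50)
sales_peak = nodes_needed(5000, gb_per_node=50)
```

# With self-hosted machines the node count is fixed at purchase time, so the
# peak case would either be under-served or paid for all year.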
#---
#Big players in cloud computing
"""Can you order the big three correctly?
1 Amazon Web Services (AWS)
2 Microsoft Azure
3 Google Cloud """
#---
#Cloud services
"""Classify the services into the correct bucket.
- Storage
    Azure Blob Storage
    Amazon S3
    Google Cloud Storage
- Compute
    Azure Virtual Machines
    Amazon EC2
    Google Compute Engine
- Database
    Amazon RDS
    Azure SQL Database
    Google Cloud SQL """
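# The classification above can be kept as a small lookup table, sketched here.
# The dictionary and function names are hypothetical; the service names are
# written out in full.

```python
CLOUD_SERVICES = {
    "storage": {"Azure Blob Storage", "Amazon S3", "Google Cloud Storage"},
    "compute": {"Azure Virtual Machines", "Amazon EC2", "Google Compute Engine"},
    "database": {"Amazon RDS", "Azure SQL Database", "Google Cloud SQL"},
}


def category_of(service):
    """Return the bucket a service belongs to, or None if it is not listed."""
    for category, services in CLOUD_SERVICES.items():
        if service in services:
            return category
    return None
```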