Author: Mike Levin, HitTail Inventor & SEO in New York City
- Are you already drinking the Google Koolaid with Analytics, Search Console?
- Are you still paying for Google Sheet automation with expensive plug-ins?
- Do you feel there's some sort of missing "glue" to hold it all together?
- Would you like that glue to be career-advantageous future-proofing coolness?
If you've answered yes to one or more of these questions, then Pipulate is for you— or in a broader sense, Python and the SQL-alternative, Pandas. Go install the Python 3.6 version of Anaconda from https://www.anaconda.com/download/ and then run Jupyter Notebook. It'll pop up in a tab in your browser. Then get used to the following magical incantations:
import pipulate as gs # Install from console with: pip install pipulate
import pandas as pd
import numpy as np
Then hold onto your hat, because you've just jumped onto the bandwagon of a lifetime. Here is my love letter to Python.
There's a reason spreadsheets remain as popular as they do in the face of more capable databases and reporting systems. Spreadsheets are designed for humans— you don't need a developer for every little thing. You can use ledgers in many places where a database would be overkill. However, as anyone who's tried to build durable long-term automations around spreadsheets using VBA, AppScript (or any of the other obvious choices) can tell you, it ain't so easy. I fix that here.
First, I plan to open the mysterious world of Python-driven data-automation to newbs simply by having you import a library that auto-prompts you for Google's OAuth web-login, getting you right over the first show-stopping hurdle. After that, I advocate a simple convention of selecting ranges out of a spreadsheet, modifying them, and pushing the results back in. And that's mostly done with Pandas, the one data manipulation tool every SEO should be mastering right now.
Congratulations! You're about to learn to program Python (easily) for SEO:
# _____ _ _ _ ____ _ _ _
# |_ _| |__ (_)___ (_)___ | _ \(_)_ __ _ _| | __ _| |_ ___
# | | | '_ \| / __| | / __| | |_) | | '_ \| | | | |/ _` | __/ _ \
# | | | | | | \__ \ | \__ \ | __/| | |_) | |_| | | (_| | || __/
# |_| |_| |_|_|___/ |_|___/ |_| |_| .__/ \__,_|_|\__,_|\__\___|
# There are many frameworks like it, |_| but this one is mine. --MikeL
Contents
- 1 Background & Philosophy
- 1.1 A better way
- 1.2 Of Stacked Decks & Killer Card Combinations
- 1.3 It's Pandas all the way down
- 1.4 From SEO to Data Tech to Datamaster
- 1.5 That's where Pipulate comes in
- 1.6 there's nothing to see here. please move along.
- 1.7 Be a Lightweight Pandas-speaking API-Monkey in the Middle
- 1.8 All The Bands Whose Wagons You're Hitching Your Star To Are Belong To Us
- 1.9 You Would Not, Could Not Learn to Code?
- 1.10 The Million-Follower YouTube Challenge... Accepted
- 1.11 Did I Mention You're Either Old's Kool or Fresh Meat?
- 2 Installing Pipulate
- 3 For the impatient
- 4 A tour through Pipulate
- 4.1 Pipulate naming conventions
- 4.2 Of Pandas & dependencies
- 4.3 Configuring a job
- 4.4 Connecting to sheet
- 4.5 And now a word about Jupyter Notebook and REPL environments
- 4.6 Your first pipulation
- 4.7 Working with numbers
- 4.8 Appending strings
- 4.9 Alternatives to Pipulate
- 4.10 Cheerleading for Anaconda, Jupyter Notebook and Pandas
- 4.11 Applying a Python function to every row of a Google Sheet
- 4.12 Giving Python function entire row of Google Sheet as input
- 4.13 CAVEAT! Row-aware Pipulate functions grab values from row by index
- 4.14 FYI: Python execution-context choices
- 4.15 Giving your function "extra" argument data per row (a tale of 2 splats)
- 4.15.1 Splat type #2: **kwargs by label (easier to explain first)
- 4.15.2 Splat type #1: *args by label
- 4.15.3 Splat types 1 & 2 together
- 4.15.4 One bourbon one scotch one beer
- 4.15.5 Automatic unpacking of tuples and dicts in *args and **kwargs
- 4.15.6 Putting it all together
- 4.15.7 Testing functions without DataFrame.apply()
- 4.15.8 How parts like to snap together
- 5 Working with very large lists
- 6 SQL Data Sources
- 7 Developing Pipulate functions
- 8 Scraping Google Search Engine Result Pages (SERPs)
- 9 Scheduling
- 9.1 Get yourself a Linux in the Cloud
- 9.2 Logging into your server
- 9.3 Making it a pleasure to log into your server
- 9.4 A scheduler with a better API is a better scheduler
- 9.5 /etc/systemd/system/servicename.service
- 9.6 Shoving the work around to where it belongs
- 9.7 Scheduling issues are more complex than you think.
- 9.8 In the beginning...
- 9.9 And then there was ASCII Art
- 9.10 The first scheduled event
- 9.11 GNU screen, your portal to your scheduled world.
- 9.12 Get used to using Ctrl+A (let go) then d to detach
- 9.13 Making timestamps and caching pervasive
- 10 First Scheduled Script
- 11 Getting over the hump
- 12 To git clone or pip install?
- 13 I prefer to pip install... plus nbviewer.
- 14 Reminders & To-Do's
- 15 Experimental: The Zen of Pipulate
You gotta pay the Python toll
If you want out of the vendor hole.
Congratulations to Conductor for the acquisition by WeWork. This should bring a normally inaccessible tier of SEO-product into the reach of exactly the sorts of people who can use it. For everyone else cobbling it together, I'm here.
I've done ad hoc and scheduled data-churning work for a long time, where investigation becomes automated system becomes generalized tool. I've invented my own systems for this sort of stuff that have gone obsolete in the face of a rapidly changing world so often that it hurts. No really, it hurts. Having now experienced almost-framework packages like Flask that merely help you connect the dots begging to be connected (Requests & Werkzeug in their case / GData & GSpread in mine), I'm like Yeah! Pipulate should be like that— almost not there. Focus on what's important: Pandas! Think light-touch framework mentality as pretty much the opposite of Django, which is what most of my framework creations of the past more-or-less resembled. This, I think, is a better way.
So now, instead of inventing my own systems, I leverage the strength of the cards I'm dealt. This is familiar to anyone who plays the Magic: The Gathering card battle-game, where certain card-combinations can't be played because they are too powerful and throw off the balance (and fairness) of the game. Thankfully for us SEOs, in real life such killer combinations are not so easy to ban, such as that of Google Sheets, Jupyter Notebook and Pandas. It's like resurrecting elves from your graveyard for unlimited fireballs: the gift that keeps on giving.
An Ode to Pandas:
I'm a #Python #pandas person
'Cause #SEO will worsen
With each day the #AI's
Waking uuuup!
Solutions?
I've a' plenty!
But none are
Muscle-friendly;
Because
from #vim
I must reach
Juuuupyter!
I used to be an amateur (very amateur) magician. You know, it's all the same feeling for me; magic, art and coding... 3 sides of an eternal golden braid... and a bit of performance art, the constant hunt for a good paying gig, and getting down a bunch of cool new moves that will serve you for a few years until they stop working. Well, this pandas trick has got quite a few good years ahead of it as it replaces SQL as the ad hoc data lingua franca du jour— and that ain't no fad. Legit bandwagon here that's perfect for SEOs, so ya had better get a-jumpin'. If you weren't a data-tech before, you are one now.
Once you can manipulate Pandas DataFrames, you have super-powers and are no longer an SEO. Your bullshit will smell more like statistically convincing correlations, though you will always attach the standard disclaimer, that does not mean causation. Therefore, our field remains Kung Fufu. Anyone who doesn't want to be called out on bad-science better be able to walk the walk. Show process. KNOW process. Invent process. Make process your own secret sauce in such a way where you can EVEN STILL jump-on and benefit from some pretty grown-up and destined to succeed bandwagons.
Here's the basic API-invocation for Pipulate:
cl, df = gs.pipulate(tab, rows, cols) # pulls range from GSheet
# Do stuff to df object using pandas,
# the new "must know" SQL-alternative.
# Just keep the row x col "shape" intact, then...
gs.populate(tab, cl, df) # pushes range to GSheet
That's it. That's Pipulate— it's just the rectangular data-range pull & push stuff (nothing in-between). So, go get Anaconda 3.6 and help me reposition SEO somewhere between the gut-feeling Intuiters of yesteryear and the Data Scientists of tomorrow— the Datamaster with a Python/Pandas industry-standard skill-set has vast room for creative expression and career-path flexibility (hedging your bet on SEO).
If you're reading this, you're among the first who knows what's going on with me here with my latest version of Pipulate— which amounts to yielding to the staggeringly cool process put into motion by Fernando Perez (who invented IPython, which became Jupyter Notebook) and Continuum Analytics, who put together Anaconda, the FOSS equivalent to Mathematica, MATLAB, SAS or whathaveyou. It's the data analysis stack scientists adopted to solve the crisis of reproducibility that hit their field a few years back, which really shook 'em up and gave "the other side" a lot of ammo.
If you want to fight data with data, you better have some seriously good Kung Fu. I'm the type of SEO who survives just fine at places like 360i with clients like Apple and JCPenney (I helped pull them out of the fire), and these are my tools now in their modern form. I prefer to work in-house where I can slip in work like this and not have to bill it evenly against each account I work on (yuck!). I can get into my groove, walking the walk as I talk the talk... in-house SEO-style... yeah. Good stuff.
And so, here is my little gift to you, world of SEO. It's not DeepCrawl or Conductor or ahrefs. No, but it can connect to any of them. It could draw exactly the dashboard you want from out of all 3 systems (if you've got 'em) and give you just the dashboard you need under a simple GMail security context, and not some per-seat license login nonsense. Innovate new reports in Jupyter Notebook. Refine them. Copypasta them over to Linux scheduling where you fit it together like Tetris, squeezing every dollar's worth out of your EC2 instance.
This project is just the medicine that a highly stressed SEO field needs. I got your cure right here, and it's really just the minimum baseline technical capabilities you should have today so that you may interact... with... anything.
Track your SEO rankings, Social Media views and counts, whatever! Start in the shallow end with Jupyter Notebook and "graduate" your work to generic scheduled Linux jobs that you can run almost anywhere. If you're a successful YouTuber worried about the gravy-train running dry, diversify your skills with something you'll probably love and can definitely self-learn. This is your chance to jump with me onto the following surprisingly fun bandwagons:
- Python - programming for humans
- Jupyter Notebook - programming for even more humans
- Pandas - data manipulation for humans
- GSheets - data manipulation for even more humans
...but mostly Python. Go read my love letter to Python even though I have to update it for 2018: http://mikelev.in/2011/01/python-programming-language-advantages/
I'm talking to YOU, Animation Gang! Sure, it's fun to animate today, but you will always be at the mercy of people who can express automation concepts as naturally as you speak English. Coding is not what you think. You already do it. You express your creative dynamic right-brained selves in an overwhelmingly technical and meticulous left-brained medium. I'm not sure if you understand what's giving you super-powers on YouTube, but when that foolish dancing gets tired and your energy runs out, come back here, sit down and learn to code Python. You'll green eggs and ham.
- Automate Gimp
- Automate Inkscape
- Automate Blender
- Automate ImageMagick
- Automate Robots
I could go on, but I think I've made my point. My daughter challenged me to a million followers on YouTube by VidCon. I'm like, haha, not likely. No one's really interested in your dad's opinion that you're either doing it old'skool or you're fresh meat. Your dad can reach a million followers by VidCon 2019 if I really focus and work on it. First step is to just catch Jaiden's or James' or Becca's or Tony's attention and mention my channel in one of their videos— a 7 year old YouTuber's super-nerd Dad in New York trying to teach kids Python before their brains are poisoned with JavaScript. Oh... he was the first to unbox a Raspberry Pi on YouTube and got a million-view video out of it... then dropped the ball, because kid. Now the kid loves YouTube and is challenging me to go from a million views to a million followers.
Challenge accepted, Adi. https://www.youtube.com/mikelevin
And while it's definitely not required, I'll also teach you the timeless badass tools of tech: Linux, vim and git— or at least the minimum you need to know about each to project that "I'm Tony Stark" feeling.
Pipulate is about repositioning careers as SEO continues to change. Some tools are like shiny new pennies (Jupyter Notebook & pandas), while others have ascended to be the very fabric of our modern infotech-world— the true Samurai Kung Fu vorpal light saber weapons of tech where developing muscle-memory makes you dangerous... to your competitors.
Step #1: Install Anaconda from https://www.anaconda.com/download/ and start Jupyter Notebook. Make a new Python 3 notebook and type:
!pip install pipulate --upgrade
This command executes as an operating system command (because of the exclamation mark) and not Python code. It may take a while, but you will eventually see the pip program vomit out a bunch of messages about installing this-and-that dependency. Most Pipulate requirements are already met by Anaconda, but if you're in some other Python environment, you can install all the rest of the requirements for Pipulate with pip install pandas (a VERY BIG install).
The quickest way to pipulate is as follows, but I suggest you read through this documentation where I cover creating Pipulate functions. The pandas concept of df.apply() is very important to grok. If you don't grok the word grok, go read Stranger in a Strange Land and return:
import pipulate as gs
import pandas as pd
tab = gs.name('Your Sheet Name').sheet1
cl, df = gs.pipulate(tab, rows=(1, 20), cols=('a', 'b'))
df['B'] = 'foo'
gs.populate(tab, cl, df)
Or the slightly longer form, which is probably easier to maintain:
import pipulate as gs
import pandas as pd
key = 'gobbledygookdockeyhere'
name = 'Sheet1'
rows = (1, 20)
cols = ('a', 'b')
sheet = gs.key(key)
tab = sheet.worksheet(name)
cl, df = gs.pipulate(tab, rows, cols)
df['B'] = 'foo'
gs.populate(tab, cl, df)
- You must be on Python >= 3.x.
- You must be using a Google Spreadsheet in online mode.
- You must exactly match 'Your Sheet Name' with your actual sheet name (or switch to keys).
- You must be able to recognize row and column indexes when you see them.
- When you import pipulate as gs, Google will (1-time) throw up a giant blue link that you must click and login with the same Google account you used to make the Google Sheet with.
- The meaning of the word grok.
Upon first-run, Pipulate asks you once for access and then not again until something goes wrong and you have to login again. You should be aware that a file named oauth.dat is dropped in your working directory (where your Jupyter Notebook .ipynb files save) which allows full access to your Google stuff. There is a refresh token there that grants new rapidly-expiring access tokens, but which itself doesn't expire. It is used to frequently re-log you in invisibly in the background. If you're interested in seeing these tokens, you can open oauth.dat in a text editor. It really helps to start to demystify OAuth2. Occasionally, Google WILL make even the refresh token have to be recreated with a new Web login, so just be aware of that especially if you build real automations (non-Jupyter Notebook) around Pipulate.
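If you build scheduled scripts around this, it can help to check from code whether that cached credential file is present before importing; here's a minimal sketch (the oauth.dat filename is the one described above):

import os

# oauth.dat lives in the working directory, as described above; if it's
# missing, expect the big blue Google login link on import.
if os.path.exists('oauth.dat'):
    print('Found cached OAuth credentials; import should be silent.')
else:
    print('No oauth.dat yet; be ready to click the Google login link.')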
After you've installed Anaconda, run Jupyter Notebook by either looking for an icon named Jupyter Notebook in your Start Menu (Windows) after an Anaconda install, or select Anaconda-Navigator from your Applications folder (Mac) and Launch Jupyter Notebook from there. There's a few ways to get Jupyter Notebook running, but so long as something pops up in your web browser where you can choose New / Notebook: Python 3, then you found it. After you have a new Notebook, if you haven't done step #1 already, then execute:
!pip install pipulate --upgrade
Once pipulate is installed, you don't have to do that again, so after it's done (it can take a while), you can delete the command and its ugly output and get onto the real action. There are alternative ways to do pip installs in "Anaconda Prompts" (with more control outside the browser) but details vary on Macs vs. Windows vs. Linux, so I prefer to tell people how to do pip installs from within Anaconda. The --upgrade parameter ensures you always have the latest because I will be updating it often. Step #2, execute:
import pipulate as gs
This will cause an enormous Google Web login-link to appear in your Jupyter Notebook that you must click, which will open another tab in your browser presenting a Google login prompt. Choose which Google account you want to use to access Sheets. It must have permission to the sheet you're manipulating. It also asks for various other Google Service permissions while it has the chance, in case you plan on using Pipulate to track your YouTube view-counts and such, which you should totally do.
In case you're wondering why I recommend the convention of importing pipulate as gs, it's because my other Github module GoodSheet got fully wrapped in here, and I like reminding everyone Pipulate is in fact GoodSheet. I also got very fond of how gs.pipulate() looks, and I think it helps that gs also stands for Google Sheet. It also avoids the verbosity of pipulate.pipulate() or abbreviation-confusion of pi.pipulate() or pip.pipulate() or any of the other choices not nearly as beautiful as gs.pipulate().
For those familiar with the Flask web microframework, it might help to think of Pipulate as something lightly sprinkled in to connect GSpread and Pandas, and not really trying to do all that much itself except a few API innovations to help. The act of pipulating is just picking up an Excel-style rectangular spreadsheet range as both a GSpread cell_list and a Pandas DataFrame, altering the df completely with Pandas, and then using the symmetrical act of POPULATING to push the changes back into Google Sheet.
Pipulate is designed to let you do all your challenging data-manipulation work in Pandas. Pandas is not part of Python "core", but then neither is Google Sheets or GSpread, so don't complain. You're drinking deep of both the Google and Python Koolaid with Pipulate. You could do a lot worse. Any disenfranchised SQL-users out there, Python Pandas is where you should be going. Not to put too fine a point on it, but SQL has let you down. You need a more universal lightweight "general case" data manipulation tool, and Pandas is it whether you realize it yet or not. It's not like Oracle's going to buy Python too. So just go ahead and import Pandas:
import pandas as pd
In that same Jupyter Notebook that you imported pipulate and pandas into, you can now set the values that will allow you to connect to our spreadsheet (file) and worksheet (tab), along with the cell range defined as a set of row and column indexes, using row-numbers and column-letters that display in spreadsheet user interfaces:
key = '[Your GSheet key]'
name = 'Sheet1'
rows = (1, 20)
cols = ('a', 'b')
It's good to switch from using GSheet file-names to their unique "keys" for the sake of avoiding future confusion about which document you're actually working on. It's far too easy to have 2 files with the same name. Be sure to use the long string of characters copied out of a Google Sheet URL for the key. That's the long string of alphanumeric gobbledygook not broken up by slashes. The name is always "Sheet1" on a freshly-made sheet. If you rename it or want to manipulate a different tab, be sure to make it match this. The rows and cols tuple defines the rectangular region you will want to manipulate.
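If copying the key out of the URL by eye feels error-prone, a quick sketch like this pulls out that gobbledygook segment for you (the URL below is a made-up example):

# A Google Sheet URL looks like:
# https://docs.google.com/spreadsheets/d/<key>/edit#gid=0
url = 'https://docs.google.com/spreadsheets/d/1aBcD3fGhIjKlMnOpQrStUvWxYz/edit#gid=0'

# The key is the path segment right after '/d/'.
key = url.split('/d/')[1].split('/')[0]
print(key)  # 1aBcD3fGhIjKlMnOpQrStUvWxYz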
Okay, let's generate some text to manipulate with Pipulate. Enter and execute:
import this
...and you will now have 20 nice new lines about the Zen of Python to copy/paste from Jupyter Notebook to a newly-made Google Sheet you can use for the below exercise. In other words, create a new Google Sheet and paste the 20 Zen of Python lines into cells A1:A20. You are now ready to pipulate.
Open the connection to the Google Sheet (as if it were a database) and copy a rectangular range in both the GSpread cell_list format and as a Pandas DataFrame. This is setting the stage to pipulate, by creating two identical shapes, but of different types (one from GSpread and the other from Pandas):
sheet = gs.key(key)
tab = sheet.worksheet(name)
cl, df = gs.pipulate(tab, rows, cols)
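By the way, if you'd rather skip the manual copy/paste of the Zen lines entirely, here's a hedged sketch that decodes the text straight out of the this module and pushes it into column A with gs.populate() (explained properly just below); it assumes the (1, 20) by ('a', 'b') range from above is the one you opened:

import codecs
import this  # importing this also prints the Zen (once per session)

# this.s holds the Zen rot13-encoded; decode it and drop the blank line.
zen = [line for line in codecs.decode(this.s, 'rot13').splitlines() if line]

df['A'] = zen[:20]        # the 20 non-blank lines fill rows 1-20
gs.populate(tab, cl, df)  # push them into the sheet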
Even though the cl is a cell_list from GSpread, it is also very similar to Python's core datatype called list. Jupyter Notebook lets you inspect the contents of cl or df simply by running them on their own line. Type this and hit Enter:
cl
As you can see, GSpread cell_lists are just what one might call a one-dimensional array in other languages, which is the same as a normal Python list datatype. However, a few extra attributes have been layered onto each cell, such as cl[0]._row to see what row a cell belongs to and cl[0]._col for its column. In this way, GSpread avoids more complex shapes like a list of lists or a list of tuples, but it does make manipulating it directly as if it were a spreadsheet a challenge, which is pretty frustrating because that's the entire reason you use a library like GSpread.
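For example, here's a quick peek at the first cell's coordinates and contents, using the _row and _col attributes just described plus GSpread's standard value attribute:

first = cl[0]
print(first._row, first._col, first.value)
# e.g. 1 1 The Zen of Python, by Tim Peters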
Have no fear; Pandas to the rescue! It's not the cl we're going to manipulate. It's the df, which is a Pandas DataFrame and has a lot of powerful database-like tricks built-in. All we have to do is NOT TOUCH the cl until such time as we push our changes back to the spreadsheet. You can also inspect the df with Jupyter Notebook:
df
You can inspect objects like cl and df this way because you are in a REPL (read, eval, print, loop) for Python code execution where the contents of a cl or df is just sort of "hanging around" frozen in memory MID-EXECUTION for your casual perusal. This is both a small miracle, and makes Jupyter Notebook the ideal place for scientists and marketers to "feel their way around" data before building resilient automations.
I'm also helping you jump on the same bandwagon that's helping scientists solve the crisis of reproducibility that hit their field a few years back when they realized that 70% of published scientific research was unreproducible. While much credit goes to Jupyter Notebook, it's really Anaconda that gets it all installed and erases that pesky multi-platform issues that usually become very major stumbling blocks—even for scientists.
Say you wanted to just plug the value "foo" into column B:
df['B'] = 'foo'
You can now "push" your changed dataframe object back into the still compatibly-shaped cell_list object, but peek at it first "in memory" by just typing df all by itself:
df
Now push the changes you see in memory back out to the spreadsheet. Watch the browser as you populate to see the changes occur!:
gs.populate(tab, cl, df)
Congratulations. You've just pipulated.
Plugging data dynamically into Google Sheets is nothing new. Pipulate just simplifies it. To do something slightly more interesting, you can simply copy the contents of column A to B:
df['B'] = df['A']
gs.populate(tab, cl, df)
Say there were numbers in column A and you wanted column B to be that number times 2. Notice I have to convert column A to integers even if they look like numbers in the spreadsheet, because GSpread converts all numbers to strings:
df['B'] = df['A'].astype(int) * 2
If you try this example on the Zen of Python data, it will throw an error: ValueError: invalid literal for int() with base 10: 'The Zen of Python, by Tim Peters'. But you can put numbers in column A and execute this to see a simple *2 operation and acquaint yourself with how automatable things start to become when you replace tedious manual Excel processes with automation.
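If your column is a mix of numbers and stray text and you'd rather not blow up on the first bad cell, a hedged alternative is pandas' to_numeric with errors='coerce', which turns anything unparseable into NaN instead of raising:

# Non-numeric cells become NaN instead of raising ValueError;
# fillna(0) gives those rows a placeholder before doubling.
df['B'] = pd.to_numeric(df['A'], errors='coerce').fillna(0).astype(int) * 2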
If you wanted to append foo to column A and put the result in column B (like above, but appending strings to an already string-type column):
df['B'] = df['A'] + 'foo'
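If column A might hold numbers rather than text, cast it to string first so the + means append instead of add; a minimal sketch:

# astype(str) makes every cell a string, so '+' concatenates.
df['B'] = df['A'].astype(str) + 'foo'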
Embedded application languages like Microsoft's VBA or Google's AppScript can achieve similar results, but if I need to explain to you why these are not as good as using Python on the back-end, you're in the wrong place. The same goes for the ever-increasing selection of paid-for Excel and GDocs plug-ins and other proprietary vendor products which probably don't quite do what you need.
Pipulate is mostly about Python and Pandas. You could replace gs.pipulate() and gs.populate() with pd.read_csv() and pd.to_csv() and take Google Sheets out of the equation entirely, or use Excel instead of GSheets by swapping PyExcel for GSpread. There's also a new library, pygsheets, that does a few more things like named ranges and limited formatting, which I may switch to. Doesn't matter, because Pipulate is a lightweight wrapper to provide a lightweight spreadsheet manipulation API.
My thinking is that if you have to learn and master one tool for this sort of data manipulation, it might as well be Python/Pandas. Shove all the complexity onto pandas... it's going to be around for awhile... and not cost you anything.
The above example with .astype() also shows that even if you know Python, there's some new learning to do here for things like casting datatypes, which is actually different from pure Python. Pandas sits on NumPy which is a popular C-optimized Python library that provides N-dimensional arrays for the same kind of work that IBM dinosaurs still do in Fortran for science and stuff. Pandas is a FRAMEWORK on top of NumPy for such work, but which turns out to be perfectly designed for what I used to use Pipulate for when it was a Flask-based Web app.
Now say you wanted to apply a function to every line of the DataFrame to do something like retrieve a title tag from a web address, and you had a function that looked like:
def status_code(url):
import requests
rv = 'failed'
try:
rv = requests.get(url).status_code
except:
pass
return rv
Now you can get the status code of every URL in column A with:
df['B'] = df['A'].apply(status_code)
This is where the "framework" known as Pandas steps in with its own conventions. Pandas knows to take the function named in the apply method and for every row of the dataframe, plug the value found in column A into the function called status_code and plug the resulting value into column B. Look carefully at what's going on here, because it's about to get a lot more complicated.
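Since the stated goal a moment ago was grabbing title tags, here is a hedged sketch of a title-fetching function in exactly the same one-value-in, one-value-out shape as status_code; the regex is deliberately crude and only meant to illustrate the pattern:

import re
import requests

def title_tag(url):
    # Same shape as status_code above: one URL in, one value out.
    rv = 'failed'
    try:
        html = requests.get(url, timeout=10).text
        match = re.search(r'<title[^>]*>(.*?)</title>', html,
                          re.IGNORECASE | re.DOTALL)
        if match:
            rv = match.group(1).strip()
    except:
        pass
    return rv

df['B'] = df['A'].apply(title_tag)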
While the above example is powerful, it's not nearly as powerful as feeding TWO arguments to the function using values from out of each row of the dataframe. To do that, we simply call the .apply() method of the ENTIRE DATAFRAME and not just a single column:
df['B'] = df.apply(func, axis=1)
There's a few things to note here. First, we HAVE TO include the axis=1 argument or else each COLUMN will be fed to the function by default as it iterates through the dataframe. When you use the df.apply() method, you can step through the entire dataframe row-by-row or column-by-column, and we simply have to include axis=1 to get the row-by-row behavior we had before (where it was implicit, because we were applying to a single column). Now, we can select a 3-column range:
key = '[Your GSheet key]'
name = 'Sheet1'
rows = (1, 4)
cols = ('a', 'c') # <--Switched "b" to "c"
sheet = gs.key(key)
tab = sheet.worksheet(name)
cl, df = gs.pipulate(tab, rows, cols)
Now we plan on putting a URL in column A and some text that we're going to look for on the page in column B. Instead of just returning a response code, we will return how many times the text was found in the retrieved HTML of the page. So, we want to apply this command:
df['C'] = df.apply(count_times, axis=1)
However now the count_times function has more responsibility than the status_code function. Specifically, it needs to know to get the URL from column A and the keyword from column B, so we rewrite status_code as follows:
def count_times(row):
    import requests
    url = row[0]
    keyword = row[1]
    rv = '--'
    ro = None
    try:
        ro = requests.get(url)
    except:
        pass
    if ro and ro.status_code == 200:
        rv = ro.text.count(keyword)
    return rv
With the above example, you put the URL you want to examine in column A and the text whose occurrences you want to count on the page in column B. The results appear in column C. This is where it starts getting more complex, and there are ALWAYS costs to complexity. Mapping has to go somewhere, and I currently choose to put it INSIDE Pipulate functions, which is not necessarily the best long-term decision, but complex as it may be, you're going to be able to follow everything that's going on right there in front of you without maintaining some awful set of per-project externalized mapping tables... ugh! You'll suffer through that sort of thing soon enough. For here, for now; MAGIC NUMBERS!
Remember that the Python code is running under your control so you are not limited as you would be using Google's own built-in Apps Script (Google's answer to VBA) for the same purposes. Your Python code is running on your local machine (via Jupyter Notebook) and can easily be moved to the cloud or on cheap hardware like Raspberry Pi's. Truth be told, Jupyter Notebook is optional.
All your data manipulation or "creative work" is taking place in Pandas DataFrames which you are "painting" onto in memory. Aside from copying the initial range out of a spreadsheet and then pasting the identically-shaped but altered rectangular spreadsheet range back in, this entire system is just becoming adept at Pandas using GSheets instead of CSVs.
When stepping row-by-row through a Python Pandas DataFrame, it is often desirable to insert "meta" attributes that can be used in the function WITHOUT putting those values wastefully on every row of the spreadsheet you're manipulating. Say the data we wanted to add is a pair of start and end dates, and they're the same for every row.
one | com | 2018-10-01 | 2018-10-31 |
two | net | 2018-10-01 | 2018-10-31 |
three | org | 2018-10-01 | 2018-10-31 |
Since the date would be the same all the way down, using a whole column in a Google Sheet for it would be a waste. In fact, GSheets has some limit to how many cells you can have, so an extra column with nothing but repeated data is very "expensive" quota-wise and slows your sheet down. Instead, only keep the unique data per-row in the sheet. The Pandas API (and Python API in a broader sense) provides for passing in both fixed-position arguments and labeled arguments by sort of "side-loading" them in as follows:
df['C'] = df.apply(func, axis=1, start='2018-01-01', end='2018-01-31')
APIs are weird. They work different ways in different languages, and this is how Python works. It's weird, but wonderful. There are subtle rules you have to get down here that just come with experience. It's called learning to think Pythonically. If you're in Jupyter Notebook, take a moment to run this:
import this
The double-splatted argument (**kwargs) scoops up labeled arguments, whether you pass them the common command-line-ish way as name=value, name2=value2... or unpack a whole Python dictionary object (called a dict, which looks a lot like JSON) into the call with **. I had a lot of difficulty grokking this, but it's one of the reasons Python is used to create user-loved "API-wrappers" to every non-Python API out there. Look at how you're going to have to ACCESS those values from inside a function:
df['C'] = df.apply(func, axis=1, start='2018-01-01', end='2018-01-31')
def func(row, **kwargs):
number = row[0]
tld = row[1]
kwarg1 = kwargs['start']
kwarg2 = kwargs['end']
# Do stuff here
return stuff
That was an example where you have multiple labeled arguments like start and end dates. But if it's a tuple of positional values being side-loaded in a fashion similar to the row, then you use the other type of splat that only uses a single asterisk in the function argument definition:
df['c'] = df.apply(func, axis=1, args=('two', 'peas'))
def func(row, *args):
number = row[0]
tld = row[1]
arg1 = args[0]
arg2 = args[1]
# do stuff here
return stuff
And then as you would imagine, you can mix positional *splatting with labeled **splatting. You just have to use positional first and labeled second (or last, actually), because if you think about it, that's how it must be:
df['c'] = df.apply(func, axis=1, args=('two', 'peas'),
start='2018-01-01', end='2018-01-31')
def func(row, *args, **kwargs):
number = row[0]
tld = row[1]
kwarg1 = kwargs['start']
kwarg2 = kwargs['end']
arg1 = args[0]
arg2 = args[1]
# do stuff here
return stuff
Just to put a fine point on it, because it's really that important, the very common way to define a pipulate function and its arguments is:
def func(row, *args, **kwargs):
...which gets invoked stand-alone like this:
func(one_row, *one_tuple, **one_dict)
...or via Pandas like this:
df.apply(func, axis=1, args=one_tuple, **one_dict)
...or possibly like this:
df.apply(func, axis=1, args=('two', 'peas'), foo='bar', spam='eggs', ping='pong')
...is the same as saying:
- Define a function named "func".
- Require something in position 1.
- Optionally scoop any extra positional values into a tuple next.
- Optionally scoop any labeled values (or an unpacked dictionary) into a dict as the last thing(s).
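To see exactly where everything lands, here's a tiny standalone demonstration (no DataFrame needed) where the function just reports what it received; the function and variable names are made up for illustration:

def show_splats(row, *args, **kwargs):
    # Just echo back what arrived in each bucket.
    print('row:   ', row)
    print('args:  ', args)
    print('kwargs:', kwargs)

one_row = ['three', 'org']
one_tuple = ('two', 'peas')
one_dict = {'start': '2018-01-01', 'end': '2018-01-31'}

# Splat the tuple and dict so they land in *args and **kwargs.
show_splats(one_row, *one_tuple, **one_dict)
# row:    ['three', 'org']
# args:   ('two', 'peas')
# kwargs: {'start': '2018-01-01', 'end': '2018-01-31'}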
If passing all these lists and name/value pairs starts to get ugly, remember Python will happily unpack tuples and dicts for you as you splat. So this long form of the above API-call:
df['C'] = df.apply(func, axis=1, args=('two', 'peas'),
start='2018-01-01', end='2018-01-31')
...can be re-written in Python as:
pod = ('two', 'peas')
dates = {'start' : '2018-01-01', 'end': '2018-01-31'}
df['C'] = df.apply(func, axis=1, args=pod, **dates)
So the common pattern for a Pipulate function which you plan to apply to every row of a Pandas DataFrame using the .apply() method is:
my_val = func(a_list, *a_tuple, **a_dict)
So say you were starting out with this data, but you needed to use start and end dates with it, along with 2 more pieces of standard information per row.
one | com |
two | net |
three | org |
The Pipulate function could look like:
def func(row, *pod, **dates):
    position = row[0]
    tld = row[1]
    pea1 = pod[0]
    pea2 = pod[1]
    start = dates['start']
    end = dates['end']
    # Do stuff here; for illustration, just glue everything back together.
    return ' '.join([position, tld, pea1, pea2, start, end])
...and calling it from Pandas, again, like this:
df['C'] = df.apply(func, axis=1,
                   args=('two', 'peas'),
                   **{'start': '2018-01-01',
                      'end': '2018-01-31'})
Aren't you glad Python doesn't HAVE TO look like JavaScript?
If you don't really want to connect to Google Sheets and you just want to test your Pipulate function with dummy data to simulate the DataFrame.apply() call, you can use the function directly like this:
my_val = func(['three', 'org'],
              *('two', 'peas'),
              start='2018-01-01',
              end='2018-01-31')
But when the time comes to use it with Pandas' DataFrame.apply(), it would look like this. Just a reminder, the word "func" is actually the name of the function that you've defined (with def) and axis=1 is what makes ROWS get fed in on each step through the DataFrame:
df['C'] = df.apply(func, axis=1,
                   args=('two', 'peas'),
                   start='2018-01-01',
                   end='2018-01-31')
When going through DataFrame.apply(), the tuple always rides in through the args= keyword so it can be told apart from the labeled values; when calling the function directly, splatting it into position is enough.
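You can also simulate the whole DataFrame.apply() call without touching Google Sheets at all by building a throwaway DataFrame in memory; a minimal sketch with made-up data, assuming the func defined just above:

# A throwaway two-column DataFrame standing in for the GSheet range.
fake_df = pd.DataFrame([['one', 'com'],
                        ['two', 'net'],
                        ['three', 'org']], columns=['A', 'B'])

fake_df['C'] = fake_df.apply(func, axis=1,
                             args=('two', 'peas'),
                             start='2018-01-01',
                             end='2018-01-31')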
Some pretty cool bundling and unbundling of attributes between Python objects and the more common command-line API style is going on here. You don't have to spell the argument parameters out inline in the call; you can bundle them up into objects yourself ahead of time:
pod = ('two', 'peas')
dates = {'start' : '2018-01-01', 'end': '2018-01-31'}
...which leads to the simplest form to look at:
df['C'] = df.apply(func, axis=1, args=pod, **dates)
And there you have it. That's pretty much the basic use of Pipulate for completely open-ended semi-automated Python Kung Fu in Google Sheets. If you're anything like me, you're feeling chills running down your back at the possibilities. If jumping onto the SCIENCE bandwagon that's occurring (to fix their "crisis of reproducibility") isn't also the future of SEO, then I don't know what is. All Pipulate does is let you get data in and out of GSheets easily, so you can focus on the hard parts. Let the crazy ad hoc SEO investigations of your dreams begin!
Google Sheets is not always the best place to process very large lists, but the alternative is often worse, so the trick is to just decide what size chunks you should process at a time. This concept is sometimes called step-by-stride. To use step-by-stride with Pipulate we take a basic example, simply add a "stride" variable, and comment out the last 2 lines that set and push the values:
import pandas as pd
import pipulate as gs
stride = 100
key = '[Your GSheet key]'
name = 'Sheet1'
rows = (1, 10000)
cols = ('a', 'b')
sheet = gs.key(key)
tab = sheet.worksheet(name)
cl, df = gs.pipulate(tab, rows, cols)
#df['B'] = 'foo'
#gs.populate(tab, cl, df)
In the above example, we only added a "stride" variable and commented out the last 2 lines that update the sheet. Say the sheet were 10,000 rows long. Updating A LOT of data with one of these AJAX-y data-calls is never a good idea. The bigger the attempted one-pass update of a GSheet, the more mysterious things are going on while you wait, and the more likely the entire update fails because of a single bad row. The solution is to travel the 10,000 rows in 100-row strides (or smaller), pushing one chunk at a time. We replace the last 2 lines with the following step-by-stride code:
steps = rows[1] - rows[0] + 1
for i in range(steps):
row = i % stride
if not row:
r1 = rows[0] + i
r2 = r1 + stride - 1
rtup = (r1, r2)
print('Cells %s to %s:' % rtup)
cl, df = gs.pipulate(tab, rtup, cols)
df['B'] = 'foo'
gs.populate(tab, cl, df)
And that's pretty much it. Put all together, here is the code to process 10,000 rows in 100-row strides directly in Google Sheets; to accomplish almost anything, just replace the 'foo' line with one of the fancier Pandas .apply() calls described above:
import pandas as pd
import pipulate as gs
stride = 100
key = '[Your GSheet key]'
name = 'Sheet1'
rows = (1, 10000)
cols = ('a', 'b')
sheet = gs.key(key)
tab = sheet.worksheet(name)
cl, df = gs.pipulate(tab, rows, cols)
steps = rows[1] - rows[0] + 1
for i in range(steps):
row = i % stride
if not row:
r1 = rows[0] + i
r2 = r1 + stride - 1
rtup = (r1, r2)
print('Cells %s to %s:' % rtup)
cl, df = gs.pipulate(tab, rtup, cols)
df['B'] = 'foo'
gs.populate(tab, cl, df)
It's easiest to pipulate when you only have to apply one quick function to every line of a list because it takes advantage of the Pandas framework conventions; how the .apply() method works in particular. HOWEVER, if your per-row query is a slow and expensive SQL query INSIDE a pipulate function like this (the WRONG way):
def hits(row, **kwargs):
import psycopg2
import apis
url = row[1]
start = kwargs['start']
end = kwargs['end']
a = apis.constr
atup = tuple(a[x] for x in a.keys())
user, password, host, port, dbname = atup
constr = "user='%s' password='%s' host='%s' port='%s' dbname='%s'" % atup
conn = psycopg2.connect(constr)
sql = """SELECT
url,
sum(hits) as hits
FROM
table_name
WHERE
url = '%s'
AND date >= '%s'
AND date <= '%s'
GROUP BY
url
""" % (url, start, end)
df = pd.read_sql(sql, con=conn)
return df['hits'].iloc[0]
We now want to move the SQL query OUTSIDE the function intended to be called from .apply(). Instead, you get all the records in one go and plop them onto your drive as a CSV file and hit THAT later in the function from .apply(). Getting psycopg2 installed is usually easiest through Anaconda's conda repo system (not covered here). First we connect to SQL:
a = apis.constr
atup = tuple(a[x] for x in a.keys())
user, password, host, port, dbname = atup
constr = "user='%s' password='%s' host='%s' port='%s' dbname='%s'" % atup
conn = psycopg2.connect(constr)
Next, we're going to need to build a string fragment for use in the SQL query that calls out every single URL that we want to get data back on. One of the worst parts about SQL is "in list" manipulations. The only way to be sure is a pattern like this:
WHERE
url = 'example1'
OR url = 'example2'
OR url = 'example3'
OR url = 'example4'
...and so on for as many URLs as you have to check. They're probably in your Google sheet already, so let's grab them into a list in a way that creates almost the exact above pattern (yay, Python!):
urls = df['A'].tolist()
urls = "url = '%s'" % "' OR url = '".join(urls)
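To see exactly what that join produces, here's a quick standalone check with a couple of hypothetical URLs:

# Hypothetical URLs standing in for whatever is in column A.
sample_urls = ['https://example.com/', 'https://example.com/about']
fragment = "url = '%s'" % "' OR url = '".join(sample_urls)
print(fragment)
# url = 'https://example.com/' OR url = 'https://example.com/about'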
Those two join lines convert a Pandas DataFrame column into a standard Python list and then into a fragment of a SQL statement. When people talk about being expressive AND brief in Python, this is what they mean. Being able to read and write statements like those above is a pure joy. You can look at the urls value in Jupyter Notebook to confirm it's good (if a bit wordy) valid SQL that will slip right into a query. Now, we unify the SQL fragment above with the rest of the SQL statement using the endlessly beautiful possibilities of the Python API:
def sql_stmt(urls, start, end):
return """SELECT
url,
sum(hits) as hits
FROM
table_name
WHERE
%s
AND date >= '%s'
AND date <= '%s'
GROUP BY
url
""" % (urls, start, end)
You can now use the above function, which really only returns the not-yet-executed multi-line text string, to populate a Pandas DataFrame and cache the results locally, so that if you come back during a separate Jupyter Notebook session you won't have to re-execute the query (unless you want the freshest data):
df_sql = pd.read_sql(sql_stmt(urls, start='2018-01-01', end='2018-01-31'), con=conn)
df_sql.to_csv('df_sql.csv') #In case you need it later
df_sql = pd.read_csv('df_sql.csv', index_col=0) #Optional / already in memory
We will now use this data source, which contains the "result" list of URLs with the accompanying number of hits each got in that time-window, as your own Pipulate data source (or service). The GROUP BY in the query and sum(hits) aggregate all the hit counters into one entry per URL. The correlation here is similar to an Excel VLOOKUP. We make a pipulate function for the DataFrame.apply() method to use THIS local data:
def hits(row, **kwargs):
url = row[1]
df_obj = kwargs['df_obj']
retval = 'Not found'
try:
retval = df_obj.loc[df_obj['url'] == url]
retval = retval['hits'].iloc[0]
except:
pass
return retval
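As with any Pipulate function, you can sanity-check hits() without a spreadsheet or database in sight by handing it a simulated row and a tiny made-up df_obj:

# A tiny stand-in for df_sql with made-up numbers.
fake_sql = pd.DataFrame({'url': ['https://example.com/'], 'hits': [42]})

hits(['one', 'https://example.com/'], df_obj=fake_sql)      # returns 42
hits(['two', 'https://example.com/nope'], df_obj=fake_sql)  # returns 'Not found'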
Now instead of hitting the remote, slow, expensive SQL database every time, we execute the SQL once at the beginning and can use the local data to pipulate:
key = '[Your GSheet key]'
name = 'Sheet1'
rows = (1, 1000)
cols = ('a', 'b')
sheet = gs.key(key)
tab = sheet.worksheet(name)
cl, df = gs.pipulate(tab, rows, cols)
df['B'] = df.apply(hits, axis=1, df_obj=df_sql)
gs.populate(tab, cl, df)
Or if it's over a huge list or is error-prone and will need rows entirely skipped because of bad data or whatever, we can step by stride by replacing the above 3 lines with:
stride = 10
steps = rows[1] - rows[0] + 1
for i in range(steps):
row = i % stride
if not row:
r1 = rows[0] + i
r2 = r1 + stride - 1
rtup = (r1, r2)
print('Cells %s to %s:' % rtup)
cl, df = gs.pipulate(tab, rtup, cols)
try:
df['B'] = df.apply(hits, axis=1, df_obj=df_sql)
gs.populate(tab, cl, df)
except:
pass
Because Pipulate functions are really just Python functions (generally being called through the Pandas DataFrame.apply() method), you can develop Pipulate functions just as you would any other Python function.
The only unusual concern is that when you feed an entire "row" of a dataframe to a Python function, it takes the form of an arbitrary variable name (usually row) containing a numerically indexed list of values (the values from the row, of course). This only means that a wee bit of "mapping" needs to be done inside the function. So say you needed to apply an arbitrary function to column C using the data from both columns A and B in this form:
df['C'] = df.apply(arbitrary_function, axis=1)
...then you would need to write the arbitrary function like this:
def arbitrary_function(row):
value_from_A = row[0]
value_from_B = row[1]
# Do something here to
# populate return_value.
return return_value
...so when you're developing functions, the idea is to simulate a Pandas DataFrame row in default Python list syntax to feed into the function for testing... which is this easy:
simulated_row = ['foo', 'bar']
So in Jupyter Notebook, feeding the simulated row to the arbitrary function to run and test it OUTSIDE the Pipulate framework looks like this:
arbitrary_function(simulated_row)
...so developing functions for Pipulate is easy-peasy. Just design your functions to always just take in the first argument as a list whose values have meaning because of their fixed positions — which naturally represent the cell values from rows you'll be pulling in from a spreadsheet.
By the way, namedtuples are the superior way of doing this when not bound by a pre-existing framework, but whatever. Pandas is worth it.
Well, you knew it was coming. Let's scrape some SERPs. It's sooo easy. But I suggest you get yourself an anonymous proxy server or twenty. Put them in a file named proxies.txt, 1-per-line. If ports are used, include them after the IP like this:
152.190.44.178:8080
53.117.213.95
250.227.39.116:8000
20.15.5.222
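For reference, here's roughly how a file in that format gets consumed from Python; the scraping notebook presumably does its own handling, so this is just a sketch of the idea (requests takes proxies as a dict mapping scheme to proxy URL):

# Read proxies.txt (one ip[:port] per line) into a plain Python list.
with open('proxies.txt') as f:
    proxies = [line.strip() for line in f if line.strip()]

# Pick one and hand it to requests in the dict form it expects.
proxy = {'http': 'http://%s' % proxies[0],
         'https': 'http://%s' % proxies[0]}
# requests.get(some_url, proxies=proxy)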
Now load the file called get_search_results.ipynb. If you cloned the github repo and are working in Jupyter Notebook, you can work directly in your cloned pipulate folder. I would suggest making a copy of files such as get_search_results.ipynb and keeping the originals around as a sort of template.
Anonymous web proxies go bad fast, so before you start a session, you should do a one-time refreshing of your proxy servers. Do that by running this block of code with update_proxies set to True. It will create a file in your repo folder called goodproxies.txt:
update_proxies = True
if update_proxies:
import pipulate.update_proxies as up
up.Main()
After you have a good new list of proxies, set update_proxies back to False so you don't refresh them again on every run.
Everything so far has been in Jupyter Notebook, and that's great for ad hoc work, but when it comes to "promoting" a good report to daily use, you need scheduling. And that's never pleasant, because you need a machine running somewhere with as much reliability as you can get while paying as little as possible. That's just sort of a life lesson there. No matter how powerful you feel in Jupyter Notebook, you're not all that if you can't automate. The answer?
Cloud... EC2 or whatever. Pick your poison. Whatever it is, being server-like (as it should be), you're going to need to get into it... and for that you're likely to receive a key from Amazon or your devops Dept. Figure out how to login to that machine. It should be TOTALLY YOURS. This is your EC2 instance. There are others like it, but this one is yours. Learn how to get in and out of it fast, from almost anywhere. You can do this on a Raspberry Pi too if you don't even want Amazon and a key in the picture.
Figure out the ssh command to log in to your server and do it manually a few times. This follows the model of putting the key file in a usually-hidden directory on your system called .ssh, which is usually in your home directory:
ssh -i ~/.ssh/id_rsa_yourname ubuntu@55.25.123.156
Once this works for you, create a text file and name it something like go.sh and put it in your sbin. What's an sbin? It's a place you put little text-files that work a lot like commands, but you write them. They're really useful. This is your first Linux lesson from the Pipulate project. Linux (and Unix) won; get used to it. It'll be your next stop after Jupyter Notebook. Scheduling something you set-up in Jupyter Notebook is your natural "bridge" project. So by this point, you struggled through that ssh command; congratulations. Everything else is easy. Find your sbin by looking at your path:
echo $PATH
Find your sbin in that gobbledygook, then put something that looks like this text (with your info) in a file called go.sh (or whatever) there. Do the chmod +x trick to make it executable, and then whenever you need to reach your server, just type go. It's really nice to open a shell and be in your scheduling environment just like that. We want to do everything immediately reasonable to make the text-based Linux shell environment as totally cool as Jupyter Notebook is:
#!/bin/bash
ssh -i ~/.ssh/id_rsa_yourname ubuntu@55.25.123.156
We are not using crontab as our next step to achieve scheduling as some googling about how to do this on a stock Linux server may indicate. We DON'T like APIs where you have to drive nails through your head here at Pipulate. No, we side with RedHat and others on the matter of default Linux system service management and encourage you to use systemd. It's not the principles of the least moving parts but rather the principle of not having to learn advanced BASH script that's at play here. Thankfully, crontab's replacement systemd is considered a highly supported mainstream alternative. https://en.wikipedia.org/wiki/Systemd
You need a file in /etc/systemd/system whose name is the name of your service dot service, like mysched.service. To create it, you may have to sudo vim or whatever command because it's a protected system location. The contents of your file to kick off a Pipulate (or any other) Python scheduling job should look like this:
[Unit]
Description=Run Python script to handle scheduling
[Service]
Type=forking
Restart=always
RestartSec=5
User=ubuntu
Group=ubuntu
WorkingDirectory=/home/ubuntu/mysched/
ExecStart=/usr/bin/screen -dmS mysched /home/ubuntu/py35/bin/python /home/ubuntu/mysched/mysched.py
StandardOutput=syslog
StandardError=syslog
[Install]
WantedBy=multi-user.target
You've just dropped this file into location, but now it needs to be enabled. This is a one-time thing (unless you want it off for debugging or whatever):
sudo systemctl enable mysched.service
Once you start playing around with the invisible background system services (named daemons in Linux), the temptation is to keep rebooting your server to make sure your changes "took" (similar to Apache/IIS webserver issues). Whenever you're unsure and want to avoid a reboot, you can type:
sudo systemctl daemon-reload
If you want to just restart YOUR scheduling service and not all daemons, you can optionally do:
sudo systemctl restart mysched.service
Who wants to type a longer command when you can type a shorter command? Since the repo is a location we're typically cd'd into, we don't need the sbin trick we did on your local machine. In fact, I included r.sh in the repo, so just cd into the repo directory, make sure the service names I'm using match the ones you're using, and type:
r[Enter]
...and it should reboot the service keeping mysched.py running. For your curiosity, this is what it's doing:
#!/bin/bash
# This belongs in your sbin
sudo systemctl daemon-reload
sudo systemctl restart mysched.service
This r.sh file comes into play again later, because in order to ensure the health of your scheduling server, we're going to give it a "clean slate" every morning by rebooting it, and we're going to schedule the running of this BASH script FROM PYTHON to do it. This is an example of doing each thing in the place where it best belongs. Reboot from a bash script, respawn from systemd, and actually SCHEDULE from within a single master Python script.
After such a reboot (and on any boot, really), we hand all scheduling responsibility immediately over to Python (even though systemd could do more) because as much of an improvement as it is over crontab, Python APIs are better still. We actually are only using systemd as a pedantic task respawner. Think of it as someone watching for your python-script to exit that can 100% reliably re-start it. That's systemd in our scenario. After mysched.py is running, control is immediately handed over to the 3rd party "Schedule" package from PyPI/Github because its API is better than the default sched module built into Python. Such things on my mind are:
- Period vs. Exact scheduling (every-x minutes vs every-day at y-o'clock)
- Concurrency when I need it and crystal clarity when I don't
- Minimal new "framework" language. If it feels like Django, turn and run.
- Optional ability to "lock" long-running jobs. General collision handling.
- Calls for little-enough code that if I make a mistake, I can easily recode.
- Crystal clear clarity of what's going on, no matter where I may be.
For now, "pip install schedule" seems to do the job.
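To give you a taste before the full script below, here is a minimal sketch of the Schedule package's API the way I use it; the jobs here are throwaway lambdas purely for illustration:

import time
import schedule

# A periodic job every 10 minutes and an exact-time job once a day.
schedule.every(10).minutes.do(lambda: print('heartbeat'))
schedule.every().day.at('06:00').do(lambda: print('daily reboot goes here'))

# The loop that actually makes it tick.
while True:
    schedule.run_pending()
    time.sleep(1)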
When restarting a scheduling-script, you need to know that when it springs back to life it may be in the middle or even towards the end of the daily time-cycle you're probably used to thinking in, so "today's" reports may never get a chance to run. You need to accommodate for this. You also need to be very realistic about how many reports you're going to be able to run on a given server on a given day. It can be like playing a giant game of Tetris, so it would be nice to have concurrency.
Concurrency, you say? Are you suuuuure? There might be order-dependencies and race-conditions in your script runs that you haven't thought about. I find that it's always a good idea to avoid concurrency and to keep it simple (much good karma) if the situation doesn't really call for concurrency. As hardware and hosting gets cheaper, you can always slam out more EC2 instances and put less work per server. Everything scales if you just size your work to fit one unit of generic Linux server.
Staying conservative with your estimates and modest with your promises is always a good idea, especially given how flaky all those APIs you're pulling from can be. Size out the job, then halve it. Maybe even quarter it. You won't be sorry. All that extra capacity in the server could be used for temp tables or other unexpected resource hogs you'll run into that you don't see today.
The idea here with Pipulate is to make a very generic and almost organic (with a heart-beat) place to start plugging your scheduled extractions from Jupyter Notebook into, with the least muss and fuss... but also, the most power. Pipulate only exists to make GSpread easier, a 3rd party package from Github/PyPI which itself only exists to make GData easier, a cryptic but GOOGLE-PROVIDED API to Google Sheets. GData is most definitely NOT made for humans. GSpread is made for those slightly more human. And for those entirely human, there's Pipulate. In fact, there's something other than GSpread that I just discovered which may have either tremendous impact or no impact at all here at Pipulate. I'll let you know, but go take a look in either case. Real kindred spirits over at https://github.com/nithinmurali/pygsheets
Oh yeah, so in the beginning:
#Do whatever virtualenv stuff you do here
pip install schedule
pip install logzero
pip install pyfiglet
pip install colorama
cd ~/
mkdir mysched
cd mysched
vim mysched.py
You can create your mysched.py however you like. I use vim, and it's spiritual and life-changing. It also solves how to be really productive on pretty much any machine you sit down at when doing tasks like this. Anyway, I just added that file to the github repo, but for ease-of-use, I'll show the development of our scheduling script here. First:
from pyfiglet import figlet_format
from colorama import Fore
from logzero import logger, setup_logger
font = 'standard'
subfont = 'cybermedium'
green = Fore.GREEN
white = Fore.WHITE
blue = Fore.BLUE
ascii_art1 = figlet_format('Congratulations!', font=font)
ascii_art2 = figlet_format('Welcome to Wonderland.!', font=subfont)
print('%s%s%s' % (green, ascii_art1, white))
print('%s%s%s' % (blue, ascii_art2, white))
logger = setup_logger(logfile='mysched.log', maxBytes=1000000, backupCount=3)
logger.info('This is some logger info.')
This should give you a good starting point for scheduling... none! By stripping out everything that actually does scheduling, you can see how flashy color-coded ASCII art can color your day, your view, and your perception of time, rightness, and generally keep you on-track. Don't down-play the titles. It should only ever be visible when you're restarting the script a lot for testing, or at around 1:00 AM, or whenever your daily reboot occurs.
Keep in mind that later-on, we're going to "seize" the command-line output stream (your view of the log-file) from anywhere you have a terminal program and ssh program. That could very well be (and often is in my case) your mobile phone. It's easier than you think; you don't even have to look at the actual log files; it just sort of streams down the screen like the Matrix. That's the effect I was going for (thank you LogZero).
You can ALSO see the same output being written into mysched.log, which you can look at if, say, the script stopped running and the real-time output went away:
[I 180222 19:36:56 mysched:20] This is some logger info.
[I 180222 19:47:07 mysched:20] This is some logger info.
Going from this "blank" scheduling file to the next step really highlights a lot of the default power of the scheduling module:
import schedule as sched
from pyfiglet import figlet_format
from colorama import Fore
from logzero import logger, setup_logger
from datetime import date, datetime, timedelta
import time

UTCRebootTime = '06:00'  # Generally, 1-AM for me
beat_count = 0

font = 'standard'
subfont = 'cybermedium'
green = Fore.GREEN
white = Fore.WHITE
blue = Fore.BLUE
ascii_art1 = figlet_format('Congratulations!', font=font)
ascii_art2 = figlet_format('Welcome to Wonderland.!', font=subfont)
print('%s%s%s' % (green, ascii_art1, white))
print('%s%s%s' % (blue, ascii_art2, white))

the_time = str(datetime.now().time())[:5]
logger = setup_logger(logfile='mysched.log', maxBytes=1000000, backupCount=3)
logger.info("We're not in Jupyter Notebook anymore. The time is %s." % the_time)


def main():
    # Wire up the schedule, then poll it forever.
    sched.every(10).minutes.do(heartbeat)
    next_min = minute_adder(1).strftime('%H:%M')
    logger.info("When the clock strikes %s, down the rabbit hole with you!" % next_min)
    sched.every().day.at(next_min).do(the_queue)
    sched.every().day.at(UTCRebootTime).do(reboot)
    while True:
        sched.run_pending()
        time.sleep(1)


def heartbeat():
    # Proof-of-life log entry every 10 minutes.
    global beat_count
    beat_count += 1
    logger.info("Heartbeat %s at %s" % (beat_count, datetime.now()))


def the_queue():
    # This is where your scheduled pipulations will plug in later.
    logger.info("This is a scheduled event. Jump! Down the rabbit hole...")


def reboot():
    # Hands off to r.sh, a little shell script that handles the actual reboot.
    logger.info("Rebooting system.")
    import subprocess
    p = subprocess.Popen(['sh', 'r.sh'], cwd='/home/ubuntu/pipulate/')


def minute_adder(minutes):
    # Returns a datetime that's `minutes` from now.
    the_time = datetime.now().time()
    today = date.today()
    beat = timedelta(minutes=minutes)
    return_value = datetime.combine(today, the_time) + beat
    return return_value


if __name__ == '__main__':
    main()
You can, and I encourage you to, run this directly with the standard Python command-line way of running things. If you haven't been doing it already, cd into that directory and run:
python mysched.py
Webmasters are dead. Long live the Datamaster! This is kind of like the LAMP stack, but for scheduling in the modern world, with (what I consider) the fewest moving parts pushing around the "responsibilities": SQL-ish stuff goes to Pandas, logic stuff to Python, task-respawning to Linux, data-UI to Google Sheets... but what about WATCHING your scripts being run? What about getting that at-one Zen feeling with all those invisible plates you have spinning?
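One way to wire up the spinning part (and the way the screen command below assumes) is a systemd service file that launches mysched.py inside a named GNU screen session at boot. Consider the following a minimal sketch rather than the one true unit file; it assumes the ubuntu user, the py35 virtualenv and the ~/mysched folder from this walkthrough, so adjust every path to your own setup:

[Unit]
Description=Pipulate scheduling loop
After=network.target

[Service]
User=ubuntu
WorkingDirectory=/home/ubuntu/mysched
# -D -m keeps screen in the foreground so systemd can track it; -S names the session "mysched"
ExecStart=/usr/bin/screen -DmS mysched /home/ubuntu/py35/bin/python /home/ubuntu/mysched/mysched.py
Restart=always

[Install]
WantedBy=multi-user.target

Save it as /etc/systemd/system/mysched.service, then sudo systemctl enable mysched and either sudo systemctl start mysched or just reboot.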
If you've put the service file in place at /etc/systemd/system/mysched.service and rebooted, then this script is running in the background right now. Wanna see it? It may be up to a bunch of those 10-minute heartbeats already. Type:
screen -d -r mysched
That's the GNU screen program. It's a lot like tmux, but if you don't know what that is either, think of it as a terminal session server: like remote desktop software such as RDP or VNC, but instead of serving graphical desktops like Windows or Mac, it serves those type-in command-line terminals. That command is a lot to type and remember, so drop it into a little script in your /usr/local/sbin, or maybe your ~/ home folder if sbin gives you trouble (that's the do.sh coming up below), so that all together, to see what's going on (starting from a terminal on Mac, Windows, whatever), I type:
go[Enter]
do[Enter]
And then if I want to just immediately exit out, I type:
Ctrl+A, D [Enter]
If you want to activate the "do.sh" just make this file by that name, chmod +x it and drop it in your sbin or home:
#!/bin/bash
# Put this in your sbin or ~/ to be useful.
screen -d -r mysched
This means that a secret, invisible command-line task starts whenever the machine reboots, and you can "connect to" it with that command. The -d parameter means force it to detach from whatever other device it's showing on (the "seizing" of the display I mentioned earlier) and the -r parameter means reconnect. Together, you can pretty much pull up your scheduling output anywhere, anytime.
But once you do, you are inside a terminal window session created by GNU screen and NOT by the original login-session you had. This gives you quite a bit of power to just scroll up and down the log-output (without having to load a single file) using GNU screen's buffer-scroll:
do[Enter] (to seize screen)
Ctrl+A [Esc] (to switch to "Copy mode" with a scroll-back history)
Once you're in Copy Mode, you can use Page Up & Page Down. You can also use Ctrl+B for back and Ctrl+F for forward. When you're done, hit the [Esc] key again. When you want to release the screen session, it's still Ctrl+A, D [Enter] to detach.
And finally, it can feel a little "out of control" to have a script running insistently in the background with no obvious off switch. If you ever really do need to stop it, reattach to the screen session and Ctrl+C the Python process, or stop the service at the systemd level.
The Unix/Linux-style type-in "terminal" interface that ships with Macs and can be installed on Windows using Cygwin or the new Windows 10 BASH shell is your new portal into Wonderland. Jupyter Notebook gave you a taste of the power of Python, but you're not really realizing it until you're running reports during all that other delicious time when you're NOT sitting in front of a browser hitting a button and waiting for something to finish on your local machine.
You can't store everything locally, so don't try. You will run out of space, and there's nothing worse than having to do file-maintenance on cloud hardware that's supposed to be sparing you from that nonsense. Nor should you write anything that keeps filling your drive with past data forever. Hardware is hardware and resources are actually finite; or rather, they're as finite as you're willing to pay for. So if we want to store the data long-term, it's got to be off-server, probably using a service such as Amazon S3. Using a "data bucket" NoSQL hash-table (call it what you will) is a good idea in situations like this, because "deconstructing" everything into rows & columns for SQL-like RDBMS storage isn't worth it, and although field-stuffing into a Text or XML field in an RDBMS would work, it feels a lot like shoving a round peg into a square hole; why do it when a round hole is sitting right there? This is mostly a place-holder for me to eventually incorporate a decorator-based cache system back-ended by Amazon S3. That will solve a lot of ongoing server maintenance issues.
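Nothing in Pipulate does this yet, so treat what follows as a rough sketch of the shape I have in mind rather than anything official: a decorator that pickles a function's return value up to S3 with boto3 and pulls it back down on later calls. The bucket name is made up, and everything here assumes you've already got AWS credentials configured for boto3:

import pickle
from functools import wraps

import boto3  # pip install boto3; assumes your AWS credentials are already configured

s3 = boto3.client('s3')
BUCKET = 'my-pipulate-cache'  # made-up bucket name; create and use your own


def s3_cache(func):
    """Cache a function's return value in S3, keyed on its name and arguments."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        key = '%s/%s-%s.pkl' % (func.__name__, args, sorted(kwargs.items()))
        try:
            obj = s3.get_object(Bucket=BUCKET, Key=key)
            return pickle.loads(obj['Body'].read())       # cache hit: skip the work
        except s3.exceptions.NoSuchKey:
            result = func(*args, **kwargs)                # cache miss: do the work once
            s3.put_object(Bucket=BUCKET, Key=key, Body=pickle.dumps(result))
            return result
    return wrapper


@s3_cache
def expensive_api_pull(site):
    # ...the slow, flaky API work you only ever want to do once per key...
    return {'site': site, 'rows': []}

A real version would want an expiration policy and friendlier key-naming, but that's the gist.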
Okay, so the above sets out the framework for scheduling. We have:
- A daily reboot
- An every-10-minutes heartbeat
- Something beginning 1-minute after script runs
So the idea now is to build out from that 3rd point. We just start putting references to different external Python filename.py's there, and they'll just run. But there's one more trick. I'm adding this function to mysched.py, along with importing the importlib library, to do the trick:
import importlib

def do_main(name):
    mod = importlib.import_module(name)  # name is a module name, e.g. 'track' for track.py
    mod.main()
This way, if you follow the Python if __name__ == '__main__' convention, you can use this to invoke the method (function) named "main" with its standard (non-parameterized) call, just by putting the file in the same folder as mysched.py and referring to it by module name (no .py extension) from within mysched.py like this:
do_main(filename)
...and that's all you need to do to schedule something that uses that convention. Conversely, if the code in the external file is of the straight-copy-pasted-from-Jupyter-Notebook variety, which is likely NOT to use a function called main (or even functions at all), then you can use the alternative version that just does it:
def do_it(name):
    mod = importlib.import_module(name)  # merely importing the module runs its top-level code
...invoked with the very straightforward:
do_it(filename)
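To make that concrete, here is roughly how the_queue() in mysched.py might end up looking once a couple of external scripts are plugged in. This is a sketch that assumes track.py follows the main() convention and hello_world.py is straight copy-pasted Notebook code that runs at import time:

def the_queue():
    logger.info("This is a scheduled event. Jump! Down the rabbit hole...")
    do_main('track')        # track.py defines main(), so we call it the polite way
    do_it('hello_world')    # hello_world.py just runs top-to-bottom when imported

One gotcha: Python caches imports, so do_it() only actually runs the file the first time it's imported within a process. The daily reboot conveniently sidesteps that; if you ever need to re-run it within the same process lifetime, reach for importlib.reload().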
You will be using .py files and not the .ipynb files of Jupyter Notebook for scheduling. There are various ways to go about it, but I suggest just copy-pasting your separate text-blocks from JN over to your text-editor or whatever, and re-building your script up from parts over on the server side. The reason for this is that it makes you think through your work again. The way you work in Jupyter Notebook is going to be very different from the way you work on a Linux scheduling system. Your considerations are about 100x more complex, and now is the time to start thinking about them. So make hello_world.py in the same repo directory:
vim hello_world.py
Don't worry. I already put it in the repo for you. It is basically just a template for the ASCII art that I like to do. I think I've gone overboard with colors. I usually only use green for OK, with other colors sprinkled in sparingly to capture my attention (warnings & stuff). Thanking you for taking the red pill gets an exception. But now that we've proven we can schedule an external script, it's time to add a SECOND external script and get serious about SEO:
vim track.py
The hardest part for me, in trying to grab the reins and gain control of all the required parts of datamastering, is always staying centered. You always have to know where things are relative to either root "/" or home "~/". Even that is a Unix geek joke about symbolic links. Anyhoo, working directories usually tend to end up relative to ~/, 'cause you don't edit much in root. And that folder-name is likely going to be whatever your git repo is called. And if you're cloning from me with:
git clone git@github.com:miklevin/pipulate
...then you have a directory probably something like this:
~/pipulate
...which you're always going to want to be in, especially if the machine you're working on is a cloud instance set up specifically for you and specifically for this purpose. Then, you can get rid of a lot of typing by creating a file like this (given you're on Ubuntu / adjust to OS) as your .bash_profile. These are invisible configuration files (because of the dot at the beginning of their filename). .bash_profile is executed whenever your user (in this case, the default EC2 Ubuntu "ubuntu" user) logs in:
# Activate the same Python virtualenv that the scheduler uses
source /home/ubuntu/py35/bin/activate /home/ubuntu/py35
# Put little helper scripts kept in ~/ onto the PATH
PATH="$HOME/:$PATH"
clear
# Snazzy ASCII-art login message
python ~/hi.py
# Land in the repo directory on every login
cd /home/ubuntu/pipulate
This answers such pressing questions as:
- How do I always make sure I'm in the exact same Python virtualenv as the one I use for scheduling?
- How can I add little bash helper scripts to my repo and have them in my path so I can easily use them?
- How can I give myself a snazzy ASCII art login message using Python and the Figlet library?
- How can I avoid typing cd pipulate every time I log into my scheduling server?
So now, whenever from your host machine you type:
go
...you will be logged automatically onto your cloud server and greeted with a warm welcome that will impress your friends. What does hi.py look like, you ask?:
from pyfiglet import figlet_format
from colorama import Fore

def out(print_me, color=Fore.GREEN, font='standard'):
    ascii_art = figlet_format(print_me, font=font)
    print('%s%s%s' % (color, ascii_art, Fore.WHITE))

out('Welcome.')
out("Get pipulating!", font='cybermedium', color=Fore.WHITE)
Getting started with any new investigation into data residing somewhere, data you're supposed to fetch, transform, and extract some insights from, begins with the "where does the data come from?" and "where do I put it?" questions. In Google Sheets, you don't really have the option of warehousing all the raw data pulls forever, unless you want your spreadsheets to get slower and slower, then go corrupt. Instead, you have to either use services like Amazon S3, a large local hard drive, or simply not warehouse as much data as you'd like.
Fortunately, you can do a lot in just the limited data-space of a single spreadsheet, which supports (at last check) about 100,000 cells in any Spreadsheet document before reaching a quota (and performance) limit. The 100K cells can be broken up any way over rows & columns.
Because every new spreadsheet gives you an automatic 1000 rows and 26 columns, it looks like 26,000 of those cells are spoken for right away. Whether those blank default cells really count against you is ambiguous, so as a matter of style and preference, I like to delete all the rows and columns I'm not using. As in the field of SEO, we just don't want to send mixed signals to Google.
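Pipulate rides on top of GSpread, so if you'd rather do that trimming programmatically than by hand, a gspread Worksheet handle has a resize() method. Here's a small sketch using the service-account route to auth (different from Pipulate's OAuth web-login prompt); the JSON key filename and spreadsheet name are placeholders:

import gspread
from oauth2client.service_account import ServiceAccountCredentials

# Placeholder credentials file and scopes; adapt to however you authorize gspread.
scope = ['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive']
creds = ServiceAccountCredentials.from_json_keyfile_name('client_secret.json', scope)
gc = gspread.authorize(creds)

worksheet = gc.open('My Pipulate Sheet').sheet1  # placeholder spreadsheet name
worksheet.resize(rows=200, cols=10)              # keep only the rows & columns you actually use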
To be continued...
If you "git clone", then you will get all the standard .ipynb example files in a nice accessible place in ~/ (home) which you'll be able to reach with Jupyter Notebook, but you can't make that your own home directory for a github repo because you can't (easily) mix repos in the same folder. Yet, you're going to want to immediately turn this location into your own repo. The answer is to:
git clone git@github.com:miklevin/pipulate
...in one location in order to have access to the full repo and maybe participate in Pipulate development. In another location, you make the folder destined to become your own repo: you get my files, then delete the invisible .git folder inside and do a new "git init". So get started making the mysched folder:
git clone git@github.com:miklevin/pipulate mysched
cd mysched
rm -rf .git
rm -rf pipulate
git init
From here, it's up to you how much you want to purge from the original repo files before you start adding files to your own git repo, but the heart of the thing is mysched.py, the file that the Linux systemd daemon points to via your /etc/systemd/system/mysched.service file. If you pip installed Pipulate on your system, none of the other files are really necessary. You may also wish to do this clone-and-purge process on your local machine for the sake of keeping your Jupyter Notebook work in a git repo as well. Remember, the cloud server you're working on is where you "promote" ad hoc Jupyter Notebook work to live in a more persistent scheduled form, so you want very similar environments on both. It's like forking, but none of this is really important enough to fork. It's just your own scheduling instance of Pipulate.
If you're a purist about it and want to build it up from scratch for a fuller understanding (than clone-and-purge can provide), then just make a new folder directly from Jupyter Notebook and copy any raw source files you may need here from github, or go through nbviewer for even easier saving of Jupyter Notebook-ready .ipynb files. Just paste any templates you're interested in from:
https://github.com/miklevin/Pipulate
into:
https://nbviewer.jupyter.org/
...and save the resulting .ipynb file to where your local copy of Jupyter Notebook can reach it.
Reminder to self: add logic to system to always address columns by Excel-style letter-index:
for i, col in enumerate(cols):
    letter = gs.cc(i+1)
    eval(tab)[col] = letter
    eval('%s2' % tab)[col] = '%s.%s' % (tab, letter)
Reminder to self: Pipulate Wisdom.
Much of the complexity is in the data-transform. Separate as much of the "raw data" work as you can from the derivative output-data formats. Those outputs can be re-generated in different ways as better insights and interactivity reveal themselves. But you may never be able to re-get that raw data. So focus!
- Everything you need has already been done many, many times.
- However, this does not spare you from SOME of the work.
- Processes resist automation until sufficient TLC is applied.
- And even then, it's all going to fall apart for 10 unexpected reasons.
- Your job then is to build with that sort of resilience in mind
- Even if it means scrapping-and-rebuilding quickly.
- Got a dataframe that was tough to get? Maybe it should be a .csv file.
- Got a .csv file on your drive blocking you? Was it generated today? (See the sketch after this list.)
- It's better to pip install pipulate than git clone pipulate.
- Directories you create to pipulate out of shouldn't all be repos.
- Little secret-sauce pipulations belong in private Github gists.
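To put a little meat on those .csv bullets (as promised above), here's roughly what I mean; the path and the pull function are stand-ins for your own:

import os
from datetime import date

import pandas as pd


def fresh_or_repull(csv_path, pull_func):
    """Reuse today's .csv if it exists; otherwise re-pull the data and save it."""
    if os.path.exists(csv_path):
        made_on = date.fromtimestamp(os.path.getmtime(csv_path))
        if made_on == date.today():
            return pd.read_csv(csv_path)  # today's file: don't hit the flaky API again
    df = pull_func()                      # your hard-won dataframe pull goes here
    df.to_csv(csv_path, index=False)
    return df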