Lab 7 - Files
The purpose of this lab is to get some more practice with reading and writing files with Python. You’ll be using Pandas for this lab.
The secondary purpose of this lab is to give you a refresher on using GitHub. Remember that your final project will be turned in as a github repository; this lab will be turned in the same way.
This is an individual lab so you can each get practice with github on your own machines; however, you’re welcome to chat with each other if you’re encountering problems.
Part 1: Retrieving the Data
This part of the lab we will walk through together.
Forking
First, visit this GitHub Repository.
The first thing you’ll need to do is fork this repository, so that you have a copy that you have complete control over. To do this, click the Fork button in the top right corner.
If you aren’t in lab physically to walk through this with us, click here for a guide on forking a repository.
Cloning
Next, you’ll want to download this repository to your own computer so that you can work with it. This is called cloning.
The command to do so is
git clone <REPOSITORY URL HERE> <FOLDER NAME HERE>
without any of the brackets. For example, if I wanted to clone my repository into a folder called lab7
, I would type into my terminal:
git clone https://github.com/alackles/cmsc140-f22-lab7.git lab7
Important: Make certain that you are cloning YOUR fork, not my repository. Otherwise, you won’t be able to push to it (save info back to github.)
Make sure you are at the place in your terminal that you want the file to be saved when you perform this cloning step. It will save the folder containing the repository to wherever you have saved However, you can always move the folder later.
Pushing
To check that everything is set up correctly, and to practice adding/committing/pushing, you should change something in the README.md file. You could add your name to this file, for example.
Then if you type
git status
in your terminal, you should see that changes are made. Now you can go through the three-step process of syncing these changes to github.
Add
First, add your changes to the “staging ground”:
git add .
Commit
Next, add a message that says what you’ve done.
git commit -m "Add name to README.md"
Push
Finally, sync the changes back to GitHub.
git push
Now, if you go to your repository, you should see the changes to your README.md.
Unzipping the Data
The last thing you’ll have to do before actually starting on your lab is unzip the data in the zip file so you can work with it. You can do this on the terminal if you like, or you can do it however you normally extract zip files. If you have problems with this step, let me know.
gitignore
Now that your data is unzipped, you should create a file called .gitignore
that has the name of your file in it. That is, the only content in the .gitignore
file should be
city_temperature.csv
Add, commit, and push your .gitignore
file to github.
Part 2: Explore your Data
This data holds average temperatures in various cities around the world every single day across multiple years.
You can glance at the data in VSCode; it’s quite long!
Each row is in the following format:
Region,Country,State,Month,Day,Year,AvgTemperature
Part 3: Process your Data
Your task for this lab is to find the hottest day in each region, and record that information in a new csv file.
You’ll use Pandas for this, and you’re going to get some practice reading documentation to do so. Next week we’ll talk more about finding documentation on your own, and asking good questions to the internet when documentation isn’t enough. We’re getting some practice with that today.
Part of this lab is deciding how to keep track of the hottest day in each region without knowing exactly what the regions are ahead of time, but to help you out, I can tell you there are 7 regions, so your CSV file should have at least 7 lines at the end.
See below for some hints to guide you through this process.
Here’s the general outline of steps you should take:
- First, read in your file with
read_csv()
. - Next, take a look at the Pandas
groupby()
method here. If you use groupby, any functions you apply to the dataframe will happen within groups. - Next, figure out how to find the rows with maximum values in the Temperature column and return their entire row. Look back to the baseball data example if needed.
- (Hint for the above if you’re struggling: How would you do this if you didn’t need to do it for each region? Write this code first. Then, change the dataframe into a grouped dataframe with groupby and try again.)
- Finally, write out your data to a file called
city_maxtemp.csv
.
Part 4: Re-Upload to GitHub
Upload your data and your code to github with the add-commit-push steps, and submit the link to your repository to canvas.