Skip to content
Snippets Groups Projects
Commit ac38cdbf authored by swamina9's avatar swamina9
Browse files

minor changes

parent f59783cb
No related branches found
No related tags found
No related merge requests found
%% Cell type:markdown id: tags:
# YouTube Scraper
## Campaign Finance Group
%% Cell type:markdown id: tags:
### 1. Getting Project Dependecies
# if needed#!pip install pandas#!pip install pytchat#!pip install matplotlib#!pip install time
This installs pandas, numpy and [pytchat](https://pypi.org/project/pytchat/)
%% Cell type:code id: tags:
``` python
# uncomment and run to install packages
# !pip install pytchat
# !pip install pandas
# !pip install matplotlib
# !pip install pytchat
```
%% Cell type:code id: tags:
``` python
import pandas as pd
import pytchat
import matplotlib.pyplot as plt
import time
import re
import nltk
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
nltk.download("stopwords")
stop_words = set(stopwords.words('english'))
from nltk.stem.snowball import SnowballStemmer
st = SnowballStemmer('english')
```
%% Cell type:code id: tags:
``` python
def get_yt_data(chat,run_time=10, show_chat=True):
"""
Takes in a chat instance and runtime and returns
a dataframe with all chat data in that time
:param chat: pychat instance
:param run_time: int with total runtime
:param show_chat: boolean telling chat to print or not
:return: pandas dataframe
"""
start_time = time.time()
send_time = []
name = []
message = []
while chat.is_alive():
for c in chat.get().sync_items():
send_time.append(c.datetime)
name.append(c.author.name)
message.append(c.message)
if show_chat:
print(f"{c.datetime} [{c.author.name}]- {c.message}")
# TODO: Runtime is too small then assume an error?
if time.time() - start_time >= run_time:
return pd.DataFrame({'time':send_time,'name':name,'message':message})
def clean_data(df, col, clean_col):
"""
removes stop words and tokenizes the words in each chat
:param df: data frame
:param col: name of column to parse
:param clean_col: name of cleaned column
:return: none
"""
# change to lower and remove spaces on either side
df[clean_col] = df[col].apply(lambda x: x.lower().strip())
# remove extra spaces in between
df[clean_col] = df[clean_col].apply(lambda x: re.sub(' +', ' ', x))
# remove punctuation
df[clean_col] = df[clean_col].apply(lambda x: re.sub('[^a-zA-Z]', ' ', x))
for i in range(len(df)):
line = df[clean_col][i].split()
line = [word for word in line if word not in stop_words]
df[clean_col][i]=line
def word_counts(df,col):
"""
visualizes most frequent words in dictionary
:param df: data frame
:param col: column of data frame
:return: sorted dictionary of most frequent words
"""
c = {}
for i in range(len(df)):
for word in df[col][i]:
if word not in c:
c[word]=1
else:
c[word]+=1
return sorted(c.items(), key=lambda kv: kv[1])
def visualize_top_words(c,k):
"""
visualizes most frequent words in dictionary
:param c: dictionary with counts of words
:param k: top k most frequent words to visualize
:return: none
"""
c = c[::-1]
x= []
y= []
for i in range(k):
x.append(c[i][0])
y.append(c[i][1])
plt.bar(x,y)
```
%% Cell type:markdown id: tags:
### 2. Scraping your YouTube data
- Copy the live YouTube link
- Make it a string and set it equal to the **yt_link** variable
- Run cell and get data
%% Cell type:code id: tags:
``` python
yt_link = "https://www.youtube.com/watch?v=LodADaKUWp8"
chat = pytchat.create(video_id=yt_link)
chat_data = get_yt_data(chat,360,False)
chat_data.head()
```
%% Cell type:code id: tags:
``` python
clean_data(chat_data,"message","clean message")
chat_data
```
%% Cell type:markdown id: tags:
### 3. How can this be used?
- This could be used to scrape data from youtube channels that talk over relevant topics to see what this are being said the most in reference to a project.
- Seeing what users are most active
- Seeing what times are most active
%% Cell type:code id: tags:
``` python
visualize_top_words(word_counts(chat_data,"clean message"),10)
```
%% Cell type:markdown id: tags:
### 4. References to libraries used
- [pytchat](https://pypi.org/project/pytchat/)
- [pandas](https://pandas.pydata.org/)
- [matplotlib](https://matplotlib.org/)
- [nltk](https://www.nltk.org/)
%% Cell type:code id: tags:
``` python
```
......
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment