vuppalaa / DataTools_Tutorial_Demo
Commit ac38cdbf, authored 1 year ago by swamina9
minor changes
parent f59783cb
socail_media_scrapper/yt_scraper.ipynb: 0 additions, 10 deletions
@@ -9,16 +9,6 @@
"## Campaign Finance Group"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Getting Project Dependecies\n",
"\n",
"# if needed#!pip install pandas#!pip install pytchat#!pip install matplotlib#!pip install time\n",
"This installs pandas, numpy and [pytchat](https://pypi.org/project/pytchat/)"
]
},
{
"cell_type": "code",
"execution_count": null,
...
...
%% Cell type:markdown id: tags:
# YouTube Scraper
## Campaign Finance Group
%% Cell type:markdown id: tags:
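### 1. Getting Project Dependencies
If needed, uncomment the `!pip install` lines in the next cell and run it. `time` and `re` are part of the Python standard library and need no installation.
This installs the third-party packages the notebook relies on, including pandas, matplotlib and [pytchat](https://pypi.org/project/pytchat/).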
%% Cell type:code id: tags:
```python
# uncomment and run to install packages
# !pip install pytchat
# !pip install pandas
# !pip install matplotlib
# !pip install nltk
```
%% Cell type:code id: tags:
```python
import pandas as pd
import pytchat
import matplotlib.pyplot as plt
import time
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

nltk.download("stopwords")
stop_words = set(stopwords.words('english'))
st = SnowballStemmer('english')
```
%% Cell type:code id: tags:
```python
def get_yt_data(chat, run_time=10, show_chat=True):
    """
    Takes in a chat instance and runtime and returns
    a dataframe with all chat data in that time
    :param chat: pytchat instance
    :param run_time: int with total runtime in seconds
    :param show_chat: boolean telling chat to print or not
    :return: pandas dataframe
    """
    start_time = time.time()
    send_time = []
    name = []
    message = []
    while chat.is_alive():
        for c in chat.get().sync_items():
            send_time.append(c.datetime)
            name.append(c.author.name)
            message.append(c.message)
            if show_chat:
                print(f"{c.datetime} [{c.author.name}]- {c.message}")
        # TODO: Runtime is too small then assume an error?
        # stop collecting once run_time seconds have elapsed
        if time.time() - start_time >= run_time:
            break
    # built after the loop so a dataframe is returned even if the chat ends early
    return pd.DataFrame({'time': send_time, 'name': name, 'message': message})


def clean_data(df, col, clean_col):
    """
    removes stop words and tokenizes the words in each chat
    :param df: data frame (modified in place)
    :param col: name of column to parse
    :param clean_col: name of cleaned column
    :return: none
    """
    # change to lower and remove spaces on either side
    df[clean_col] = df[col].apply(lambda x: x.lower().strip())
    # remove extra spaces in between
    df[clean_col] = df[clean_col].apply(lambda x: re.sub(' +', ' ', x))
    # remove punctuation
    df[clean_col] = df[clean_col].apply(lambda x: re.sub('[^a-zA-Z]', ' ', x))
    # split into words and drop stop words; assigning the whole column
    # avoids the chained assignment df[clean_col][i] = line
    df[clean_col] = df[clean_col].apply(
        lambda x: [word for word in x.split() if word not in stop_words]
    )


def word_counts(df, col):
    """
    counts how often each word occurs in a column of tokenized messages
    :param df: data frame
    :param col: column of data frame
    :return: list of (word, count) pairs sorted by ascending count
    """
    c = {}
    for words in df[col]:
        for word in words:
            c[word] = c.get(word, 0) + 1
    return sorted(c.items(), key=lambda kv: kv[1])


def visualize_top_words(c, k):
    """
    visualizes most frequent words
    :param c: list of (word, count) pairs sorted by ascending count
    :param k: top k most frequent words to visualize
    :return: none
    """
    # reverse so the most frequent words come first, then keep at most k
    top = c[::-1][:k]
    x = [word for word, _ in top]
    y = [count for _, count in top]
    plt.bar(x, y)
```
%% Cell type:markdown id: tags:
### 2. Scraping your YouTube data
- Copy the live YouTube link
- Make it a string and set it equal to the **yt_link** variable
- Run cell and get data
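If a bare video ID is wanted instead of the full link, the ID can be pulled out with the standard library. This is only a sketch: `extract_video_id` is a hypothetical helper that is not part of the notebook or of pytchat, and it assumes the link is either a standard `watch?v=` link or a `youtu.be` short link.

```python
from urllib.parse import urlparse, parse_qs

YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def extract_video_id(link):
    """Hypothetical helper: return the video ID from a YouTube link, or None."""
    parsed = urlparse(link)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # short links carry the ID as the path
        return parsed.path.lstrip("/") or None
    if host in YOUTUBE_HOSTS:
        # standard watch links carry the ID in the "v" query parameter
        return parse_qs(parsed.query).get("v", [None])[0]
    return None


yt_link = "https://www.youtube.com/watch?v=LodADaKUWp8"
video_id = extract_video_id(yt_link)
print(video_id)
```

Returning `None` for unrecognized links lets the caller decide whether to stop or to ask for another link.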
%% Cell type:code id: tags:
```python
yt_link = "https://www.youtube.com/watch?v=LodADaKUWp8"
chat = pytchat.create(video_id=yt_link)
chat_data = get_yt_data(chat, 360, False)
chat_data.head()
```
%% Cell type:code id: tags:
```python
clean_data(chat_data, "message", "clean message")
chat_data
```
%% Cell type:markdown id: tags:
### 3. How can this be used?
- This could be used to scrape data from YouTube channels that discuss relevant topics, to see what is being said the most in reference to a project.
- Seeing which users are most active
- Seeing what times are most active
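The last two ideas can be sketched with pandas alone. The rows below are made up for illustration and only mimic the `time`, `name` and `message` columns that `get_yt_data` returns. The sketch assumes the `time` values are strings that `pd.to_datetime` can parse.

```python
import pandas as pd

# Made-up sample rows in the same shape as the scraped chat data
sample = pd.DataFrame({
    "time": [
        "2021-03-01 18:00:05",
        "2021-03-01 18:00:40",
        "2021-03-01 18:01:10",
        "2021-03-01 18:01:15",
    ],
    "name": ["alice", "bob", "alice", "alice"],
    "message": ["hello", "hi all", "donations?", "who funds this"],
})

# Most active users: number of messages per author, largest first
messages_per_user = sample["name"].value_counts()

# Most active times: parse the timestamps, then count messages per minute
sample["time"] = pd.to_datetime(sample["time"])
messages_per_minute = sample.set_index("time").resample("1min").size()

print(messages_per_user.head())
print(messages_per_minute)
```

The one-minute bin is an arbitrary choice; a longer stream may read better with a wider bin such as `"5min"`.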
%% Cell type:code id: tags:
```python
visualize_top_words(word_counts(chat_data, "clean message"), 10)
```
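As an alternative sketch, not the notebook's own method, the same top-k ranking can be obtained with `collections.Counter` from the standard library. The token lists below are hypothetical stand-ins for the values in the "clean message" column.

```python
from collections import Counter

# Hypothetical tokenized messages, shaped like the "clean message" column
tokenized = [
    ["campaign", "finance"],
    ["finance", "report"],
    ["finance", "campaign", "donor"],
]

# Count every word across all messages
counts = Counter(word for tokens in tokenized for word in tokens)

# most_common(k) returns the k most frequent (word, count) pairs, largest first
top_words = counts.most_common(2)
print(top_words)
```

`most_common` already returns the pairs in descending order, so no reversing is needed before plotting.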
%% Cell type:markdown id: tags:
### 4. References to libraries used
- [pytchat](https://pypi.org/project/pytchat/)
- [pandas](https://pandas.pydata.org/)
- [matplotlib](https://matplotlib.org/)
- [nltk](https://www.nltk.org/)
%% Cell type:code id: tags:
```python
```